[jira] [Updated] (SPARK-21268) Move center calculations to a distributed map in KMeans

2017-06-30 Thread Guillaume Dardelet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Dardelet updated SPARK-21268:
---
Description: 
As I was monitoring the performance of my algorithm with SparkUI, I noticed that 
there was a "collectAsMap" operation that was performed hundreds of times at every 
iteration of KMeans:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well by performing the following "foreach" on the RDD, 
and would slightly improve performance.

Edit:

Per Sean Owen's recommendations, the scal() call and the VectorWithNorm creation should be 
computed in a distributed map before the collectAsMap.
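
For illustration, a minimal sketch of the shape of that change, assuming "contribs" 
is the RDD of (centerIndex, (sumVector, count)) pairs and "mergeContribs" is the 
existing merge function (the names here are illustrative, not necessarily the exact 
ones in KMeans.scala):

{code}
// Before: collect the raw (sum, count) pairs, then finish every center on the driver.
// val totalContribs = contribs.reduceByKey(mergeContribs).collectAsMap()
// totalContribs.foreach { case (j, (sum, count)) =>
//   scal(1.0 / count, sum)                // average the contribution sum...
//   centers(j) = new VectorWithNorm(sum)  // ...and recompute the norm, on the driver
// }

// After: run scal() and the VectorWithNorm construction in a distributed
// mapValues, so the driver only collects the finished centers.
val newCenters = contribs
  .reduceByKey(mergeContribs)
  .mapValues { case (sum, count) =>
    scal(1.0 / count, sum)
    new VectorWithNorm(sum)
  }
  .collectAsMap()

newCenters.foreach { case (j, center) => centers(j) = center }
{code}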

  was:
As I was monitoring the performance of my algorithm with SparkUI, I noticed that 
there was a "collectAsMap" operation that was performed hundreds of times at every 
iteration of KMeans:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well by performing the following "foreach" on the RDD, 
and would slightly improve performance.

Edit:




> Move center calculations to a distributed map in KMeans
> ---
>
> Key: SPARK-21268
> URL: https://issues.apache.org/jira/browse/SPARK-21268
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: Guillaume Dardelet
>Priority: Trivial
>  Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As I was monitoring the performance of my algorithm with SparkUI, I noticed 
> that there was a "collectAsMap" operation that was performed hundreds of times at 
> every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well by performing the following "foreach" on the RDD, 
> and would slightly improve performance.
> Edit:
> Per Sean Owen's recommendations, the scal() call and the VectorWithNorm creation should be 
> computed in a distributed map before the collectAsMap.






[jira] [Updated] (SPARK-21268) Move center calculations to a distributed map in KMeans

2017-06-30 Thread Guillaume Dardelet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Dardelet updated SPARK-21268:
---
Summary: Move center calculations to a distributed map in KMeans  (was: 
Move some calculations to a distributed map in KMeans)

> Move center calculations to a distributed map in KMeans
> ---
>
> Key: SPARK-21268
> URL: https://issues.apache.org/jira/browse/SPARK-21268
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: Guillaume Dardelet
>Priority: Trivial
>  Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As I was monitoring the performance of my algorithm with SparkUI, I noticed 
> that there was a "collectAsMap" operation that was performed hundreds of times at 
> every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well by performing the following "foreach" on the RDD, 
> and would slightly improve performance.
> Edit:






[jira] [Updated] (SPARK-21268) Move some calculations to a distributed map in KMeans

2017-06-30 Thread Guillaume Dardelet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Dardelet updated SPARK-21268:
---
Description: 
As I was monitoring the performance of my algorithm with SparkUI, I noticed that 
there was a "collectAsMap" operation that was performed hundreds of times at every 
iteration of KMeans:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well by performing the following "foreach" on the RDD, 
and would slightly improve performance.

Edit:



  was:
As I was monitoring the performance of my algorithm with SparkUI, I noticed that 
there was a "collectAsMap" operation that was performed hundreds of times at every 
iteration of KMeans:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well by performing the following "foreach" on the RDD, 
and would slightly improve performance.

Thoughts?


> Move some calculations to a distributed map in KMeans
> -
>
> Key: SPARK-21268
> URL: https://issues.apache.org/jira/browse/SPARK-21268
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: Guillaume Dardelet
>Priority: Trivial
>  Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As I was monitoring the performance of my algorithm with SparkUI, I noticed 
> that there was a "collectAsMap" operation that was performed hundreds of times at 
> every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well by performing the following "foreach" on the RDD, 
> and would slightly improve performance.
> Edit:






[jira] [Updated] (SPARK-21268) Move some calculations to a distributed map in KMeans

2017-06-30 Thread Guillaume Dardelet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guillaume Dardelet updated SPARK-21268:
---
Summary: Move some calculations to a distributed map in KMeans  (was: 
Redundant collectAsMap in KMeans)

> Move some calculations to a distributed map in KMeans
> -
>
> Key: SPARK-21268
> URL: https://issues.apache.org/jira/browse/SPARK-21268
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: Guillaume Dardelet
>Priority: Trivial
>  Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As I was monitoring the performance of my algorithm with SparkUI, I noticed 
> that there was a "collectAsMap" operation that was performed hundreds of times at 
> every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well by performing the following "foreach" on the RDD, 
> and would slightly improve performance.
> Thoughts?






[jira] [Commented] (SPARK-21268) Redundant collectAsMap in KMeans

2017-06-30 Thread Guillaume Dardelet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070177#comment-16070177
 ] 

Guillaume Dardelet commented on SPARK-21268:


OK, I understand why it is useful now, thank you.

As for performing the scal() and newCenter creation, should I create a new issue 
or reference this one in the pull request?

> Redundant collectAsMap in KMeans
> 
>
> Key: SPARK-21268
> URL: https://issues.apache.org/jira/browse/SPARK-21268
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: Guillaume Dardelet
>Priority: Trivial
>  Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> As I was monitoring the performance of my algorithm with SparkUI, I noticed 
> that there was a "collectAsMap" operation that was performed hundreds of times at 
> every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well by performing the following "foreach" on the RDD, 
> and would slightly improve performance.
> Thoughts?






[jira] [Created] (SPARK-21268) Redundant collectAsMap in KMeans

2017-06-30 Thread Guillaume Dardelet (JIRA)
Guillaume Dardelet created SPARK-21268:
--

 Summary: Redundant collectAsMap in KMeans
 Key: SPARK-21268
 URL: https://issues.apache.org/jira/browse/SPARK-21268
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Guillaume Dardelet
Priority: Trivial


As I was monitoring the performance of my algorithm with SparkUI, I noticed that 
there was a "collectAsMap" operation that was performed hundreds of times at every 
iteration of KMeans:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well by performing the following "foreach" on the RDD, 
and would slightly improve performance.

Thoughts?






[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2017-06-01 Thread Guillaume Dardelet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191
 ] 

Guillaume Dardelet edited comment on SPARK-12606 at 6/1/17 9:55 AM:


I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}
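
With the auxiliary constructor in place, the uid exists before any Param is created, 
so a plain "new Lemmatizer()" works as expected. A brief, hypothetical usage sketch 
(the column names are illustrative):

{code}
val lemmatizer = new Lemmatizer()  // uid assigned before inputCol/outputCol are registered
  .setInputCol("raw")
  .setOutputCol("lemmas")
{code}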


was (Author: panoramix):
I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala 
> UnaryTransformer ?
> ---
>
> Key: SPARK-12606
> URL: https://issues.apache.org/jira/browse/SPARK-12606
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
> Environment: Java 8, Mac OS, Spark-1.5.2
>Reporter: Andrew Davidson
>  Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue, I can also 
> reproduce this issue, so could you file a JIRA to track this issue?
> Yanbo
> 2016-01-02 3:38 GMT+08:00 Andy Davidson :
> I am trying to write a trivial transformer I use in my pipeline. I am 
> using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala 
> class as an example. This should be very easy; however, I do not understand 
> Scala and am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
> does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at 
> org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>   at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
> public class StemmerTest extends AbstractSparkTest {
> @Test
> public void test() {
> Stemmer stemmer = new Stemmer()
> .setInputCol("raw”) //line 30
> .setOutputCol("filtered");
> }
> }
> /**
>  * @ see 
> spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @ see 
> https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @ see 
> http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  * 
>  * @author andrewdavidson
>  *
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, 
> Stemmer> implements Serializable {
> static Logger logger = LoggerFactory.getLogger(Stemmer.class);
> private static final long serialVersionUID = 1L;
> private static final  ArrayType inputType = 
> DataTypes.createArrayType(DataTypes.StringType, true);
> private final String uid = Stemmer.class.getSimpleName() + "_" + 
> UUID.randomUUID().toString();
> @Override
> public

[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2017-04-04 Thread Guillaume Dardelet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191
 ] 

Guillaume Dardelet edited comment on SPARK-12606 at 4/4/17 2:27 PM:


I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

{code:scala}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}


was (Author: panoramix):
I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}

Do this:

class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala 
> UnaryTransformer ?
> ---
>
> Key: SPARK-12606
> URL: https://issues.apache.org/jira/browse/SPARK-12606
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
> Environment: Java 8, Mac OS, Spark-1.5.2
>Reporter: Andrew Davidson
>  Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue, I can also 
> reproduce this issue, so could you file a JIRA to track this issue?
> Yanbo
> 2016-01-02 3:38 GMT+08:00 Andy Davidson :
> I am trying to write a trivial transformer I use in my pipeline. I am 
> using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala 
> class as an example. This should be very easy; however, I do not understand 
> Scala and am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
> does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at 
> org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>   at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
> public class StemmerTest extends AbstractSparkTest {
> @Test
> public void test() {
> Stemmer stemmer = new Stemmer()
> .setInputCol("raw”) //line 30
> .setOutputCol("filtered");
> }
> }
> /**
>  * @ see 
> spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @ see 
> https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @ see 
> http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  * 
>  * @author andrewdavidson
>  *
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, 
> Stemmer> implements Serializable {
> static Logger logger = LoggerFactory.getLogger(Stemmer.class);
> private static final long serialVersionUID = 1L;
> private static final  ArrayType inputType = 
> DataTypes.createArrayType(DataTypes.StringType, true);
> private final String uid = Stemmer.class.getSimpleName() + "_" + 
> UUID.randomUUID().toString();
> @Override
> public String uid() {
> return u

[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2017-04-04 Thread Guillaume Dardelet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191
 ] 

Guillaume Dardelet edited comment on SPARK-12606 at 4/4/17 2:27 PM:


I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}


was (Author: panoramix):
I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

{code:scala}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala 
> UnaryTransformer ?
> ---
>
> Key: SPARK-12606
> URL: https://issues.apache.org/jira/browse/SPARK-12606
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
> Environment: Java 8, Mac OS, Spark-1.5.2
>Reporter: Andrew Davidson
>  Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue, I can also 
> reproduce this issue, so could you file a JIRA to track this issue?
> Yanbo
> 2016-01-02 3:38 GMT+08:00 Andy Davidson :
> I am trying to write a trivial transformer I use in my pipeline. I am 
> using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala 
> class as an example. This should be very easy; however, I do not understand 
> Scala and am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
> does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at 
> org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>   at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
> public class StemmerTest extends AbstractSparkTest {
> @Test
> public void test() {
> Stemmer stemmer = new Stemmer()
> .setInputCol("raw”) //line 30
> .setOutputCol("filtered");
> }
> }
> /**
>  * @ see 
> spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @ see 
> https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @ see 
> http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  * 
>  * @author andrewdavidson
>  *
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, 
> Stemmer> implements Serializable {
> static Logger logger = LoggerFactory.getLogger(Stemmer.class);
> private static final long serialVersionUID = 1L;
> private static final  ArrayType inputType = 
> DataTypes.createArrayType(DataTypes.StringType, true);
> private final String uid = Stemmer.class.getSimpleName() + "_" + 
> UUID.randomUUID().toString();
> @Override
> public Strin

[jira] [Commented] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2017-04-04 Thread Guillaume Dardelet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191
 ] 

Guillaume Dardelet commented on SPARK-12606:


I had the same issue in Scala and I solved it by overloading the constructor so 
that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol".
You get "null__inputCol" because when the parameter was initialised, your class 
didn't have a uid.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, 
String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala 
> UnaryTransformer ?
> ---
>
> Key: SPARK-12606
> URL: https://issues.apache.org/jira/browse/SPARK-12606
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
> Environment: Java 8, Mac OS, Spark-1.5.2
>Reporter: Andrew Davidson
>  Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue, I can also 
> reproduce this issue, so could you file a JIRA to track this issue?
> Yanbo
> 2016-01-02 3:38 GMT+08:00 Andy Davidson :
> I am trying to write a trivial transformer I use in my pipeline. I am 
> using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala 
> class as an example. This should be very easy; however, I do not understand 
> Scala and am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
> does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at 
> org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>   at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
> public class StemmerTest extends AbstractSparkTest {
> @Test
> public void test() {
> Stemmer stemmer = new Stemmer()
> .setInputCol("raw”) //line 30
> .setOutputCol("filtered");
> }
> }
> /**
>  * @ see 
> spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @ see 
> https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @ see 
> http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  * 
>  * @author andrewdavidson
>  *
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, 
> Stemmer> implements Serializable {
> static Logger logger = LoggerFactory.getLogger(Stemmer.class);
> private static final long serialVersionUID = 1L;
> private static final  ArrayType inputType = 
> DataTypes.createArrayType(DataTypes.StringType, true);
> private final String uid = Stemmer.class.getSimpleName() + "_" + 
> UUID.randomUUID().toString();
> @Override
> public String uid() {
> return uid;
> }
> /*
>override protected def validateInputType(inputType: DataType): Unit = {
> require(inputType == StringType, s"Input type must be string type but got 
> $inputType.")
>   }
>  */
> @Override
> public void validateInputType(DataType inputTypeArg) {
> String msg = "inputType must be " + inputType.simpleString() + " but 
> got " + inputTypeArg.simpleString();
> assert (inputType.equals(inputTypeArg)) : msg; 
> }
> 
> @Override
> public Function1<List<String>, List<String>> createTransformFunc() {
> // 
> http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
> Function1<List<String>, List<String>> f = new 
> AbstractFunction1<List<String>, List<String>>() {
> public List<String> apply(List<String> words) {
> for(String word : words) {
> logger.error("AEDWIP input word: {}", word);
> }
> return wor