[ https://issues.apache.org/jira/browse/SPARK-18213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15635662#comment-15635662 ]
Wojciech Szymanski commented on SPARK-18213:
--------------------------------------------

Thanks for your opinion.

Initially I was thinking about a varargs-based constructor, since the stages array is the only attribute a pipeline supports.
{code}
// only Scala
val pipeline = new Pipeline(tokenizer, stopWordsRemover, countVectorizer)
{code}

Unfortunately, the current Scala compiler does not support generating pure Java varargs constructors with the @varargs annotation.

Another option is a companion object, but again, it wouldn't be convenient from the Java perspective.
{code}
// Scala
val pipeline = Pipeline(tokenizer, stopWordsRemover, countVectorizer)

// Java - ugly approach
Pipeline pipeline = Pipeline.apply(tokenizer, stopWordsRemover, countVectorizer);
{code}

The last thing that comes to mind is an array-based constructor, but it does not simplify much.
{code}
// Scala
val pipeline = new Pipeline(Array(tokenizer, stopWordsRemover, countVectorizer))

// Java
Pipeline pipeline = new Pipeline(new PipelineStage[] {tokenizer, stopWordsRemover, countVectorizer});
{code}

> Syntactic sugar over Pipeline API
> ---------------------------------
>
>                 Key: SPARK-18213
>                 URL: https://issues.apache.org/jira/browse/SPARK-18213
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.1
>            Reporter: Wojciech Szymanski
>            Priority: Minor
>
> Currently, creating ML Pipeline is based on very verbose setStages method as below:
> {code}
> val tokenizer = new RegexTokenizer()
> val stopWordsRemover = new StopWordsRemover()
> val countVectorizer = new CountVectorizer()
> val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, countVectorizer))
> {code}
> What about a bit of syntactic sugar over Pipeline API?
> {code}
> val tokenizer = new RegexTokenizer()
> val stopWordsRemover = new StopWordsRemover()
> val countVectorizer = new CountVectorizer()
> val pipeline = tokenizer + stopWordsRemover + countVectorizer
> {code}
> Production code changes in mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala:
> https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-5226e84dea43423760dc6300ddafb01b
> Scala example:
> https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-798e85dd9107565fabab1126f57e3d6e
> Java example:
> https://github.com/apache/spark/commit/181df64bf50081f3af5a84b567b677178c88524f#diff-69ac857220f21b5e1684444d80d6dffe
> Thanks in advance for your feedback.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
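
To illustrate the varargs option from the Java side, here is a minimal, self-contained sketch of a varargs static factory sitting next to an array-based setter. It does not use Spark and is not Spark's API: Stage, NamedStage, SketchPipeline and the of method are hypothetical names introduced only for this illustration, and the sketch says nothing about how the linked commit implements the + operator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a pipeline stage; this is NOT Spark's PipelineStage.
interface Stage {
    String name();
}

// Trivial stage that only carries a name, for illustration.
final class NamedStage implements Stage {
    private final String name;

    NamedStage(String name) {
        this.name = name;
    }

    @Override
    public String name() {
        return name;
    }
}

// Hypothetical pipeline offering both the verbose setter and a varargs factory.
final class SketchPipeline {
    private Stage[] stages = new Stage[0];

    // Verbose style: new SketchPipeline().setStages(new Stage[] {a, b, c})
    SketchPipeline setStages(Stage[] stages) {
        this.stages = stages.clone(); // defensive copy of the caller's array
        return this;
    }

    // Varargs factory: callers list the stages directly, without building an array.
    static SketchPipeline of(Stage... stages) {
        return new SketchPipeline().setStages(stages);
    }

    // Read-only view of the stages in the order they were supplied.
    List<Stage> getStages() {
        return Collections.unmodifiableList(Arrays.asList(stages.clone()));
    }
}
```

With these assumed types, a Java caller could write SketchPipeline.of(tokenizer, stopWordsRemover, countVectorizer), which is the call shape the varargs idea is after. The difficulty described in the comment is getting that same shape out of a Scala-defined constructor.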