[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471366#comment-15471366
 ] 

yuhao yang commented on SPARK-17094:
------------------------------------

Thanks for the comment, Sean. The two questions were great. 
1. For the configuration, it might be something like 
{code}
pipeline("tokenizer").asInstanceOf[Tokenizer].set...
pipeline(2).asInstanceOf[Tokenizer].set...
{code}
It would be great if there were a way to avoid the cast. 
Eventually, I think it would be great to have configuration support for ML 
transformers, so that we can do:
{code}
sc.set("ml.tokenizer.toLowercase", "false") 
{code}
and configuration file support, which would avoid hard-coding parameters and 
make tuning on a cluster easier. (Does anyone like the idea? cc [~josephkb] [~mengxr])
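
On avoiding the cast: one possible direction is a typed lookup that takes the 
expected stage type as a type parameter. The sketch below is plain Scala and 
does not use Spark; Stage, Tokenizer, HashingTF, SimplePipeline, stage and 
stageAt are all hypothetical stand-ins, not existing Spark classes or methods.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-ins for ML stages; these are NOT the Spark classes.
trait Stage { def name: String }

final class Tokenizer(val name: String = "tokenizer") extends Stage {
  private var lowercase = true
  def setToLowercase(value: Boolean): this.type = { lowercase = value; this }
  def toLowercase: Boolean = lowercase
}

final class HashingTF(val name: String = "hashingTF") extends Stage

// Hypothetical pipeline wrapper offering typed lookup by name or position.
final class SimplePipeline(stages: Stage*) {
  // Some(stage) only if a stage with this name exists and has type T.
  def stage[T <: Stage](name: String)(implicit tag: ClassTag[T]): Option[T] =
    stages.find(_.name == name).flatMap(s => tag.unapply(s))

  // Some(stage) only if the index is in range and the stage has type T.
  def stageAt[T <: Stage](index: Int)(implicit tag: ClassTag[T]): Option[T] =
    stages.lift(index).flatMap(s => tag.unapply(s))
}

val pipeline = new SimplePipeline(new Tokenizer(), new HashingTF())

// No asInstanceOf at the call site; a wrong type or name yields None.
pipeline.stage[Tokenizer]("tokenizer").foreach(_.setToLowercase(false))
```

Returning an Option means a mismatched name or type is reported as None 
rather than as a ClassCastException at runtime.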

2. I think most users would only use linear pipelines. Could you please 
provide an example of a non-linear pipeline, so that we can have a specific 
discussion?

I tried your code, but I cannot find a Pipeline constructor like that. Is it 
something under development? And do we need to set the input and output 
columns for each stage?

Overall, the feature would: 
1. Allow people (especially beginners) to create an ML application in one 
simple line of code. 
2. Be handy for users, as they don't have to set the input and output 
columns.
3. Thinking further, we may no longer need code to build a Spark ML 
application, as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}
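
As a rough illustration of the configuration-driven idea in point 3, the 
sketch below turns flat keys into an ordered list of stage descriptions. It is 
plain Scala without Spark. The key layout ("ml.pipeline" as a comma-separated 
list, "ml.<stage>.<param>" for parameters) is an assumption modeled on the 
example above, and StageSpec and parsePipeline are hypothetical names.

```scala
// Hypothetical description of one stage: its name plus string-valued params.
final case class StageSpec(name: String, params: Map[String, String])

// Builds the ordered stage list from flat "ml.*" keys.
// Assumed layout (not an existing Spark setting):
//   "ml.pipeline"        -> comma-separated stage names, in order
//   "ml.<stage>.<param>" -> a parameter value for that stage
def parsePipeline(conf: Map[String, String]): Seq[StageSpec] = {
  val names = conf.getOrElse("ml.pipeline", "")
    .split(",").map(_.trim).filter(_.nonEmpty).toSeq
  names.map { name =>
    val prefix = s"ml.$name."
    // Keep only this stage's keys and strip the prefix from each.
    val params = conf.collect {
      case (key, value) if key.startsWith(prefix) =>
        key.stripPrefix(prefix) -> value
    }
    StageSpec(name, params)
  }
}

val conf = Map(
  "ml.pipeline" -> "tokenizer, hashingTF, lda",
  "ml.tokenizer.toLowercase" -> "false"
)
val specs = parsePipeline(conf)
```

A real implementation would still need to map each name to a concrete stage 
class and wire the input and output columns between consecutive stages; that 
part is left out here.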




> provide simplified API for ML pipeline
> --------------------------------------
>
>                 Key: SPARK-17094
>                 URL: https://issues.apache.org/jira/browse/SPARK-17094
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
