[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506649#comment-15506649
 ] 

Sean Owen commented on SPARK-17094:
---

Sure, consider a pipeline that needs to convert several subsets of columns to 
categorical variables and then reassemble them. Each subset is transformed 
separately from the source DataFrame, and the results are then combined with 
VectorAssembler. It's not the case that each stage uses the previous stage's 
output column as its input column. I don't even think that's common, given any 
non-trivial ETL upfront.
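
To make the shape concrete, here is a minimal sketch of such a pipeline with the current API. The column names are assumptions chosen only for illustration; the stages are only constructed here, not fitted.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object NonLinearPipelineSketch {
  // Hypothetical column names, used only for illustration.
  val categoricalCols = Seq("country", "device")
  val numericCols = Seq("age", "income")

  // Each indexer reads a column of the source DataFrame directly,
  // not the output column of the stage before it.
  val indexers: Seq[PipelineStage] = categoricalCols.map { c =>
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
  }

  // The assembler fans in: several input columns, one output column.
  val assembler = new VectorAssembler()
    .setInputCols((categoricalCols.map(c => s"${c}_idx") ++ numericCols).toArray)
    .setOutputCol("features")

  val pipeline = new Pipeline().setStages((indexers :+ assembler).toArray)
}
```

The stage list is still a flat array, but the column wiring is a fan-in rather than a chain, so it cannot be inferred from stage order alone.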

Consider a pipeline that builds several models off one set of input.

The case that you have a truly linear pipeline (output of one always is input 
to next) with no other configuration at all is rare, I think. It's also already 
about as easy with the current API.

> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> {code}
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data)
> {code}
> Overall, the feature would:
> 1. Allow people (especially beginners) to create an ML application in one 
> simple line of code. 
> 2. Be handy for users, as they don't have to set the input and output 
> columns.
> 3. Thinking further, we may no longer need code to build a Spark ML 
> application, as it can be done by configuration:
> {code}
> "ml.pipeline.input": "hdfs://path.svm"
> "ml.pipeline": "tokenizer", "hashingTF", "lda"
> "ml.tokenizer.toLowercase": "false"
> ...
> {code}
> This can be quite efficient for tuning on a cluster.
> Feedback and suggestions are appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471563#comment-15471563
 ] 

Nick Pentreath commented on SPARK-17094:


It's true that the constructor doesn't exist. It could be 
{{new Pipeline().setStages(Array(new Tokenizer(), new CountVectorizer(), ...))}}
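
Spelled out a little further, a sketch of the linear case with the existing API might look like the following. The column names ("text", "words", "features") and the value of `k` are assumptions for illustration only.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

object LinearPipelineSketch {
  // Column names are assumptions; each stage's input is the previous stage's output.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val vectorizer = new CountVectorizer().setInputCol("words").setOutputCol("features")
  // k = 10 is an arbitrary illustrative choice.
  val lda = new LDA().setK(10).setFeaturesCol("features")

  val pipeline = new Pipeline().setStages(Array(tokenizer, vectorizer, lda))
  // Fitting would need a DataFrame with a string column named "text":
  // val model = pipeline.fit(data)
}
```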




[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471366#comment-15471366
 ] 

yuhao yang commented on SPARK-17094:


Thanks for the comment, Sean. The two questions were great. 
1. For the configuration, it might be something like 
{code}
pipeline("tokenizer").asInstanceOf[Tokenizer].set...
pipeline(2).asInstanceOf[Tokenizer].set...
{code}
It would be great if there were a way to avoid the cast. 
Eventually, I think it would be great to have configuration support for ML 
transformers, so that we can do:
{code}
sc.set("ml.tokenizer.toLowercase", "false") 
{code}
along with configuration file support, which would avoid hard-coding and make 
tuning on a cluster easier. (Does anyone like the idea? cc [~josephkb] [~mengxr])

2. I think most users would only use linear pipelines. Could you please 
provide an example of a non-linear pipeline, so that we can have a specific 
discussion?

I tried your code, but I cannot find a constructor for Pipeline like that. Is it 
something under development? And do we need to set the input column and output 
column for each stage?

Overall, the feature would:
1. Allow people (especially beginners) to create an ML application in one simple 
line of code. 
2. Be handy for users, as they don't have to set the input and output 
columns.
3. Thinking further, we may no longer need code to build a Spark ML 
application, as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}
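
As a rough sketch of how the configuration idea could be layered on the existing API: the `registry` and `fromConfig` names below are hypothetical and not part of Spark, and the sketch assumes the "ml.pipeline" value is a single comma-separated string.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object ConfigPipelineSketch {
  // Hypothetical registry from short names to stage factories; not part of Spark.
  val registry: Map[String, () => PipelineStage] = Map[String, () => PipelineStage](
    "tokenizer" -> (() => new Tokenizer()),
    "hashingTF" -> (() => new HashingTF()),
    "lda" -> (() => new LDA())
  )

  // Builds a Pipeline from a comma-separated "ml.pipeline" entry (assumed format).
  // Input and output columns are left unset, so they would still have to be
  // configured before the pipeline could be fitted.
  def fromConfig(conf: Map[String, String]): Pipeline = {
    val names = conf("ml.pipeline").split(",").map(_.trim).filter(_.nonEmpty)
    val stages: Array[PipelineStage] = names.map { name =>
      val factory = registry.getOrElse(
        name, throw new IllegalArgumentException(s"Unknown stage: $name"))
      factory()
    }
    new Pipeline().setStages(stages)
  }
}
```

Usage would be along the lines of `ConfigPipelineSketch.fromConfig(Map("ml.pipeline" -> "tokenizer, hashingTF, lda"))`.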







[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15469765#comment-15469765
 ] 

Sean Owen commented on SPARK-17094:
---

This is already pretty much possible as:
{code}
val model = new Pipeline(new Tokenizer(), new CountVectorizer(),...).fit(data)
{code}

How would you configure the elements of the pipeline?
How would you configure non-linear pipelines?

You're suggesting adding a third type of API. I just don't think this is worth 
it, given that once you answer the points here it will amount to the current 
API in a different form.




[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15469609#comment-15469609
 ] 

yuhao yang commented on SPARK-17094:


Something like the Stanford CoreNLP pipeline: 

props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner,parse,mention,coref");




[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-08-21 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429797#comment-15429797
 ] 

Jacek Laskowski commented on SPARK-17094:
-

Are there any other pipelines that have such an API? Please explain if you don't mind. Thanks.




[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-08-19 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428038#comment-15428038
 ] 

Nick Pentreath commented on SPARK-17094:


What about input/output columns? We could set the input column for each stage 
automatically to the output column for the previous stage - but we would still 
need to set the inputCol for the first stage. I think this will only work for 
linear pipelines?
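
A toy sketch of that wiring rule, in plain Scala rather than Spark classes: `StageSpec` and `chain` are hypothetical names, and the `<name>_out` column naming is an assumption made only for illustration.

```scala
object ColumnChainingSketch {
  // Toy stand-in for a pipeline stage; not a Spark class.
  final case class StageSpec(name: String, inputCol: String, outputCol: String)

  // Wires each stage's inputCol to the previous stage's outputCol.
  // The first stage's input column must still be supplied by the caller.
  def chain(firstInputCol: String, names: Seq[String]): Seq[StageSpec] =
    names.foldLeft((firstInputCol, Vector.empty[StageSpec])) {
      case ((in, acc), name) =>
        val out = s"${name}_out"
        (out, acc :+ StageSpec(name, in, out))
    }._2
}
```

For example, `ColumnChainingSketch.chain("text", Seq("tokenizer", "countvectorizer", "lda"))` yields three specs whose columns form a single chain starting at "text". A stage that needs several input columns has no place in this scheme, which is the linear-only limitation.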
