[jira] [Commented] (SPARK-4587) Model export/import

2015-03-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382284#comment-14382284
 ] 

Joseph K. Bradley commented on SPARK-4587:
--

Can you not check out that repo?

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> UPDATE: As in the design doc, we plan to support:
> * Our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding the complexity of 
> PMML).
> * PMML
> ** This is needed since it is the most commonly used format in industry.
> This JIRA will be for the internal Spark-specific format described in the 
> design doc. Parallel JIRAs will cover PMML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4587) Model export/import

2015-03-26 Thread zhangyouhua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381544#comment-14381544
 ] 

zhangyouhua commented on SPARK-4587:


“Sorry, this file is invalid so it cannot be displayed.”

Could you send it to me?




[jira] [Commented] (SPARK-4587) Model export/import

2015-03-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381387#comment-14381387
 ] 

Joseph K. Bradley commented on SPARK-4587:
--

I checked, and it should be publicly accessible.  In case it's a firewall 
issue, I put a PDF of it here on this branch of Spark in my repo: 
[https://github.com/jkbradley/spark/blob/137ea4649c0f24f2679e874da127b546bea7a774/MLModelImportExportDesignDoc.pdf]




[jira] [Commented] (SPARK-4587) Model export/import

2015-03-25 Thread zhangyouhua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381313#comment-14381313
 ] 

zhangyouhua commented on SPARK-4587:


I cannot access the page "Design doc for model import/export", so could you 
send it to me? Thank you.




[jira] [Commented] (SPARK-4587) Model export/import

2015-02-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340539#comment-14340539
 ] 

Apache Spark commented on SPARK-4587:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4816

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> UPDATE: As in the design doc, we plan to support:
> * Our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding the complexity of 
> PMML).
> * PMML
> ** This is needed since it is the most commonly used format in industry.
> This JIRA will be for the internal Spark-specific format described in the 
> design doc. Parallel JIRAs will cover PMML.






[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306147#comment-14306147
 ] 

Joseph K. Bradley commented on SPARK-4587:
--

It sounds like we're converging!  I'll work on splitting the JIRAs between 
internal model save/load and PMML export since they are fairly separate 
efforts.  (By the way, some of the infrastructure required for the internal 
format will be useful for porting complex models to Python.  E.g., for 
DecisionTree, it could go: DecisionTreeModel --> DataFrame --> Python DataFrame 
--> Python model.  So this internal format will have other uses too.)
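
The DecisionTreeModel --> DataFrame --> Python model path above can be sketched 
in plain Python. This is only an illustration, not the Spark implementation: a 
list of dicts stands in for the DataFrame, and the names Node, tree_to_rows, 
rows_to_tree, and predict are hypothetical. It assumes a binary tree in which 
every internal node has both children.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; leaves have no feature, threshold, or children."""
    prediction: float
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_to_rows(root: Node) -> List[Dict]:
    """Flatten a tree into one row per node (pre-order ids, root has id 0)."""
    rows: List[Dict] = []

    def visit(node: Node) -> int:
        node_id = len(rows)
        row = {"id": node_id, "prediction": node.prediction,
               "feature": node.feature, "threshold": node.threshold,
               "left": None, "right": None}
        rows.append(row)
        if node.left is not None:  # internal node: record child ids
            row["left"] = visit(node.left)
            row["right"] = visit(node.right)
        return node_id

    visit(root)
    return rows


def rows_to_tree(rows: List[Dict]) -> Node:
    """Rebuild the tree from the flat rows produced by tree_to_rows."""
    by_id = {r["id"]: r for r in rows}

    def build(node_id: int) -> Node:
        r = by_id[node_id]
        if r["left"] is None:
            return Node(prediction=r["prediction"])
        return Node(r["prediction"], r["feature"], r["threshold"],
                    build(r["left"]), build(r["right"]))

    return build(0)


def predict(node: Node, features: List[float]) -> float:
    """Walk from the root to a leaf; go left when feature <= threshold."""
    while node.left is not None:
        node = node.left if features[node.feature] <= node.threshold else node.right
    return node.prediction
```

Because each row is a flat record of primitive values, the same rows could be 
read by any language that can read the table, which is the point of routing the 
model through a tabular form.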

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.






[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306115#comment-14306115
 ] 

Sean Owen commented on SPARK-4587:
--

OK, an internal-only format makes sense. So the idea is that simple POJO 
serialization is insufficient for distributed models, so, make a new 
representation that can store the model and distributed data too? Yes, most of 
the questions I have go away if it's not intended for consumption by other 
tools - not for export.

I agree that people want serialization/deserialization for use within Spark 
more than they want PMML serialization for use with another tool. PMML is still 
necessary for export to other tools though. It's really a separate use case, 
isn't it? How internal serialization works vs. how export works.

In the name of saving development effort, I suppose I was suggesting to use 
PMML for whatever it can be used for rather than reinvent a representation. But 
given that PMML is only useful in the export use case, it may be more clunky 
than it's worth to try to leverage PMML for just a part of an internal 
serialization format.

Yes, I'm referring to PMML import at the end. Almost all models that PMML can 
describe won't be supportable in MLlib, so most import would need to fail. It 
could be confusing if the "PMML import" feature doesn't import much. Still, it 
doesn't mean it shouldn't exist at all. It seems coherent to import what you 
can export and just draw the line there.




[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306088#comment-14306088
 ] 

Xiangrui Meng commented on SPARK-4587:
--

[~srowen] Parquet is just an implementation detail. We expect users to load 
models through `Model.load` rather than read the model files directly, and we 
do not expect `Model.load` to read models produced by other ML libraries. The 
format could be plain text, Avro, or ORC, as long as we promise that we can 
load it back in future releases. We chose Parquet just because it is built-in 
and it is a general format, which means we can inspect the content and provide 
loaders easily in Python as well.

PMML is definitely important. But you have seen how many users on the user list 
asked about how to save and load models (especially ALS), compared to how many 
users asked for PMML. IMHO, we do need to provide both save and load to every 
model we support, regardless of the format. I hope that we all agree.

Since we already know the limitation of PMML, it is relatively easy to estimate 
how far we can go. It doesn't handle distributed models like ALS, LDA, frequent 
itemsets, spectral clustering, etc. What would be the solution? Even if we use 
XML and leave a pointer in the XML file, we still need to consider how to store 
the data and it takes us back to the work here.

I don't quite get your last question. It seems that it only applies to PMML but 
not to our internal format. Did I understand it correctly?
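
The promise that saved models can be loaded back in future releases could be 
kept with a small metadata record next to the model data. The sketch below is 
an assumption for illustration only, not the format in the design doc: JSON 
files stand in for Parquet, and save_model, load_model, FORMAT_VERSION, and 
LOADERS are hypothetical names.

```python
import json
import os

FORMAT_VERSION = "1.0"  # hypothetical version tag written with every model

# One loader per known format version, so older files stay readable
# after the on-disk layout changes in a later release.
LOADERS = {
    "1.0": lambda data: (data["weights"], data["intercept"]),
}


def save_model(path, class_name, weights, intercept):
    """Write metadata and model data as two separate files under path."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump({"class": class_name, "version": FORMAT_VERSION}, f)
    with open(os.path.join(path, "data.json"), "w") as f:
        json.dump({"weights": weights, "intercept": intercept}, f)


def load_model(path, expected_class):
    """Check the metadata first, then dispatch on the stored format version."""
    with open(os.path.join(path, "metadata.json")) as f:
        meta = json.load(f)
    if meta["class"] != expected_class:
        raise ValueError("expected %s, found %s" % (expected_class, meta["class"]))
    if meta["version"] not in LOADERS:
        raise ValueError("unknown format version: %s" % meta["version"])
    with open(os.path.join(path, "data.json")) as f:
        data = json.load(f)
    return LOADERS[meta["version"]](data)
```

Keeping the class name and version outside the data file lets the loader reject 
a mismatched model before it tries to parse anything.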




[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306032#comment-14306032
 ] 

Sean Owen commented on SPARK-4587:
--

True; you could also store N separate PMML models! At least, the question of 
how to divvy up one huge model data set is different from how to represent it. 
One way or the other you have to keep track of distributed model data. The only 
thing I'd really suggest is trying to get the PMML machinery working first, and 
seeing how far that can be leveraged before building new mechanisms.

Importing what you export seems like a good, bright line to draw. The big 
question will be what to do with features or settings you don't support: if the 
model says use a certain distance metric that isn't supported in k-means, do 
you ignore it or fail?
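
The ignore-or-fail choice can be made explicit as an import policy. The sketch 
below is hypothetical and does not use any PMML library: import_kmeans, 
SUPPORTED_METRICS, and the metric names are illustrative assumptions, and the 
model is a plain dict.

```python
SUPPORTED_METRICS = {"squaredEuclidean"}  # assumed set of supported metrics


class UnsupportedFeatureError(ValueError):
    """Raised in strict mode when the model asks for something unsupported."""


def import_kmeans(spec, strict=True):
    """Import a k-means description; return (model, warnings).

    strict=True  fails on an unsupported distance metric.
    strict=False falls back to the supported metric and records a warning.
    """
    metric = spec.get("metric", "squaredEuclidean")
    warnings = []
    if metric not in SUPPORTED_METRICS:
        message = "unsupported distance metric: %s" % metric
        if strict:
            raise UnsupportedFeatureError(message)
        warnings.append(message + "; falling back to squaredEuclidean")
        metric = "squaredEuclidean"
    return {"centers": spec["centers"], "metric": metric}, warnings
```

Failing by default is the safer reading of "import what you export": a model 
that silently drops its metric would load but could give different cluster 
assignments than the tool that produced it.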





[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306022#comment-14306022
 ] 

Joseph K. Bradley commented on SPARK-4587:
--

You may be right about compression; I'm not really sure.  One benefit of using 
Parquet for big forests is that the forest can be stored in a distributed 
fashion, and the work for model import and export can be distributed.  (I guess 
this could be done with PMML, with a central file containing pointers to the 
distributed files, but that seems a bit more awkward.)
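
The "central file containing pointers to the distributed files" layout might 
look roughly like the sketch below. It is an illustration under assumptions, 
not Spark code: shards are written sequentially as JSON files, whereas a real 
job would have each worker write its own partition, and save_factors, 
load_factors, and the file names are hypothetical.

```python
import json
import os


def save_factors(path, factors, num_shards):
    """Split {non-negative int id: vector} across shard files plus a manifest."""
    os.makedirs(path, exist_ok=True)
    shards = [dict() for _ in range(num_shards)]
    for key, vector in factors.items():
        shards[key % num_shards][str(key)] = vector  # JSON keys must be strings
    names = []
    for i, shard in enumerate(shards):
        name = "part-%05d.json" % i
        with open(os.path.join(path, name), "w") as f:
            json.dump(shard, f)
        names.append(name)
    # The central file holds only pointers to the shards, not the data itself.
    with open(os.path.join(path, "manifest.json"), "w") as f:
        json.dump({"shards": names}, f)


def load_factors(path):
    """Follow the manifest's pointers and merge every shard back together."""
    with open(os.path.join(path, "manifest.json")) as f:
        manifest = json.load(f)
    factors = {}
    for name in manifest["shards"]:
        with open(os.path.join(path, name)) as f:
            for key, vector in json.load(f).items():
                factors[int(key)] = vector
    return factors
```

The manifest stays small no matter how large the factors are, and each shard 
can be read independently, so loading could be distributed as well.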

I agree having 2 model formats is annoying.  If we do want to be able to import 
any models we export, what are your thoughts about having partial import 
support for PMML, where we give no guarantees for any models created outside of 
Spark?




[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306002#comment-14306002
 ] 

Sean Owen commented on SPARK-4587:
--

You can get hundreds of megabytes of XML, yeah. I had thought to myself that, 
after compression, any representation of a huge forest is probably going to be 
the same order of magnitude.

There's definitely a need to define how to serialize the distributed data since 
it seems like it must live outside any model file format (?). For example I was 
using a simple text-based format, which isn't exactly ideal compared to Parquet 
perhaps. I suppose what I was questioning was serializing the model itself with 
a new format. For example, a newly-invented clustering model format doesn't 
seem to add much.




[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305991#comment-14305991
 ] 

Joseph K. Bradley commented on SPARK-4587:
--

Thanks for the correction about Zementis and PMML; I updated the doc.

I do think verbosity may be an issue for forests.  I've heard of use cases with 
100s or 1000s of trees, with millions of nodes in total, which makes a columnar 
format seem pretty nice.

+1 for emphasizing PMML export

W.r.t. the value of Parquet-based formats, I agree that distributed models and 
the difficulty with PMML import are the biggest issues.  We are getting more 
distributed models (ALS, LDA, and likely more before long).  For import, while 
exporting to model serving tools is important, it will be helpful for people to 
be able to import back into Spark, especially for training or evaluating models 
on new data.  We could provide partial support for PMML import early on by 
supporting PMML exported from Spark but not from other tools, but I agree with 
you that partial PMML import support could cause a lot of trouble.




[jira] [Commented] (SPARK-4587) Model export/import

2015-02-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305973#comment-14305973
 ] 

Sean Owen commented on SPARK-4587:
--

Coming late to the discussion, with a few comments on the design doc:

- PMML is not a Zementis-led format BTW
- I don't find the verbosity of XML to be a big problem, with the possible 
exception of decision forests. It compresses well.
- Export to PMML is much easier than import, to the extent that I am not sure I 
would even bother with import in the medium term. So much would be unsupported 
that it would probably cause more confusion than help.
- jpmml-evaluator is not needed for model import/export; the part that's needed 
is BSD 3-clause

This Parquet-based format is internal to Spark? Then why not just use the 
model's serialized form (modulo the issue of distributed models, of course). If 
it's not internal, it looks like yet another format that is a subset of PMML, 
written differently, that nothing else will read. What does it add? Is its role 
really for serializing pipelines?

The unsupported model types are really an issue though. You can make up your 
own serialization in an Extension element though, which is perhaps better than 
conceiving a wholly separate format. I still imagine a huge factored matrix 
model can't reasonably be contained in any file format; I have resorted to just 
recording pointers to the location of the distributed data in a model file. 

I suppose I'm worried about ending up with a bunch of half-finished 
imports/exports rather than focusing. IMHO top priority should be PMML export 
and then see how it goes.

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are couple things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.






[jira] [Commented] (SPARK-4587) Model export/import

2015-02-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301896#comment-14301896
 ] 

Xiangrui Meng commented on SPARK-4587:
--

[~selvinsource] Thanks for joining the discussion! I think our goal is to 
provide users with way(s) to save/load models across the languages that Spark currently 
supports. The current hacky solution is via Java serialization and it only 
works for Java. The work described in this JIRA is in parallel with the PMML 
work. For usability, it is important to provide save and load functions 
together.
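
As a rough illustration of providing save and load together, here is a plain-Java sketch. It is not the API from the PR or the design doc: the class LinearModelSketch, the text layout, and the version string are all hypothetical. It shows a writer and a reader defined side by side, with the class name and a format version stored first so that a reader can reject files it does not understand.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model with paired save/load; the file layout is illustrative only. */
final class LinearModelSketch {
    static final String FORMAT_VERSION = "1.0"; // assumed version tag, not Spark's

    final double intercept;
    final double[] weights;

    LinearModelSketch(double intercept, double[] weights) {
        this.intercept = intercept;
        this.weights = weights.clone();
    }

    /** Writes two metadata lines, then the intercept, then one weight per line. */
    void save(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("class=LinearModelSketch");
        lines.add("version=" + FORMAT_VERSION);
        lines.add(Double.toString(intercept));
        for (double w : weights) {
            lines.add(Double.toString(w));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    /** Reads the file back, refusing anything with an unexpected class or version. */
    static LinearModelSketch load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        if (lines.size() < 3
                || !lines.get(0).equals("class=LinearModelSketch")
                || !lines.get(1).equals("version=" + FORMAT_VERSION)) {
            throw new IOException("unsupported model file: " + file);
        }
        double intercept = Double.parseDouble(lines.get(2));
        double[] weights = new double[lines.size() - 3];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = Double.parseDouble(lines.get(i + 3));
        }
        return new LinearModelSketch(intercept, weights);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("linear-model", ".txt");
        new LinearModelSketch(0.5, new double[] {1.0, -2.25}).save(file);
        LinearModelSketch loaded = load(file);
        System.out.println(loaded.intercept + " " + loaded.weights.length);
    }
}
```

A plain-text layout is used here only because it is easy to read from any language; a real format would also have to cover the distributed models mentioned above.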

For PMML, it is easy to export small models. But it is hard to handle 
distributed models, e.g., ALS/LDA, and it is harder to parse them back. At this 
stage, exporting to PMML provides an exit point for people who train 
models in MLlib but use them somewhere else. This is also an important use 
case, but I would view it as parallel work.

Sorry for the delay on the code review! I will make a pass on your PR soon.







[jira] [Commented] (SPARK-4587) Model export/import

2015-01-31 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299929#comment-14299929
 ] 

Vincenzo Selvaggio commented on SPARK-4587:
---

[~josephkb]

I have read your detailed document and here are some comments.

1. You mentioned in the development plan that PMML export will be supported in 
1.4. As I completed many of the models that could be mapped to the PMML format, 
why not add those in 1.3?

2. As discussed in https://issues.apache.org/jira/browse/SPARK-1406, I think 
the import from PMML is not that useful, as Spark is primarily used as a model 
producer rather than a consumer. However, it won't be difficult to implement the 
complementary code to import the same PMML models that the ModelExporter class 
is able to export.

3. Regarding the JPMML library, we only need jpmml-model for both export 
and, eventually, import, and its licence (BSD 3-Clause) is compatible with the 
Apache licence. The evaluation library is not needed.

4. I had a look at your PR https://github.com/apache/spark/pull/4233 and 
noticed that the save/load methods exist in each model. That is probably OK for 
an internal format, but I wouldn't tie the PMML export/import to the models; 
models should be kept clean and without dependencies on file formats.

5. You made a valid point on checking the security aspect when importing PMML 
(and therefore XML) models. I already knew that Java XML parsers are, in general, 
vulnerable to 
https://www.owasp.org/index.php/XML_External_Entity_%28XXE%29_Processing. I 
double-checked the jpmml-model library and it does use the JAXB API, which is 
vulnerable by default.
Here is the file that will need to be changed:
https://github.com/jpmml/jpmml-model/blob/master/pmml-model/src/main/java/org/jpmml/model/JAXBUtil.java
with this solution:
http://stackoverflow.com/questions/12977299/prevent-xxe-attack-with-jaxb.
I can contact the author (I talked to him before) and see what he thinks.
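
For point 5, the general shape of an XXE mitigation can be sketched with the JDK's StAX API alone. This is only an assumption about what a fix could look like, not the actual change to JAXBUtil.java: the class HardenedXml and its methods are hypothetical. The idea is to create the XMLStreamReader from a factory with DTD support and external entities switched off, and then hand that reader to the parser consumer (JAXB's Unmarshaller has an unmarshal overload that accepts an XMLStreamReader).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper; shows StAX hardening only and does not depend on JAXB. */
final class HardenedXml {

    /** Opens a reader with DTD processing and external entities disabled. */
    static XMLStreamReader open(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        return factory.createXMLStreamReader(new StringReader(xml));
    }

    /** Concatenates the character data of a document read through the hardened reader. */
    static String text(String xml) throws XMLStreamException {
        XMLStreamReader reader = open(xml);
        StringBuilder sb = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    sb.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // A benign document parses normally through the hardened reader.
        System.out.println(text("<PMML><Header>ok</Header></PMML>"));
    }
}
```

With these settings, a document that declares an external entity in a DOCTYPE and then references it should either fail to parse or leave the entity unexpanded, so the referenced file is not read into the model.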







[jira] [Commented] (SPARK-4587) Model export/import

2015-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294312#comment-14294312
 ] 

Apache Spark commented on SPARK-4587:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4233







[jira] [Commented] (SPARK-4587) Model export/import

2015-01-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294035#comment-14294035
 ] 

Joseph K. Bradley commented on SPARK-4587:
--

Thanks for reading the doc!

The Parquet format is really an internal format, meant to be read back in by 
Spark.  For passing models to some other tool, PMML will probably be the best 
option.

It's certainly worth discussing whether another format would be more compatible 
with other tools (in case we open up the format as a public API in the future), 
if that other format will support the same requirements (store local and 
distributed models, be compatible across versions, etc.).  I know we've run 
into issues with protobuf's API changing across versions; I'm not sure about 
thrift.








[jira] [Commented] (SPARK-4587) Model export/import

2015-01-27 Thread Peter Prettenhofer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293892#comment-14293892
 ] 

Peter Prettenhofer commented on SPARK-4587:
---

I read the design document for model import/export.

My concern about using Parquet as a file format is that it is not widely 
supported. E.g., it will be hard to deploy a trained MLlib pipeline in a Python 
or Go environment. 
To me, a format like Thrift or protobuf looks more versatile. E.g., this project 
by Steven Noble https://github.com/snoble/honeybee contains Thrift definitions 
of decision trees and logistic regression models that I find interesting. My 
knowledge of Parquet is quite limited; it seems that one cannot compare it to 
Thrift serialization and protocol buffers per se, but I'm arguing mostly about 
tooling (i.e., the lack of a Parquet reader for Python/Go).







[jira] [Commented] (SPARK-4587) Model export/import

2014-11-26 Thread Martin Liesenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225997#comment-14225997
 ] 

Martin Liesenberg commented on SPARK-4587:
--

There has been some discussion in SPARK-1406.

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are couple things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.


