[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306088#comment-14306088 ]

Xiangrui Meng commented on SPARK-4587:
--------------------------------------

[~srowen] Parquet is just an implementation detail. We don't expect users to 
read the model files directly; they should go through `Model.load`. Nor do we 
expect `Model.load` to work with models produced by other ML libraries. The 
format could be plain text, Avro, or ORC, as long as we promise that we can 
load it back in future releases. We chose Parquet simply because it is built 
in and is a general format, which means we can inspect the content and easily 
provide loaders in Python as well.
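
A rough sketch of that contract (the method signatures below follow the 
design-doc proposal and are assumptions, not a finalized API): users only ever 
call save/load, and whatever sits under `path` stays an implementation detail.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel

// Sketch only: save/load as the sole user-facing entry points; the on-disk
// format (Parquet or anything else) is never touched directly by the user.
def roundTrip(sc: SparkContext,
              model: LogisticRegressionModel,
              path: String): LogisticRegressionModel = {
  model.save(sc, path)                    // writes small metadata plus the model data under `path`
  LogisticRegressionModel.load(sc, path)  // reads it back without inspecting the files
}
```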

PMML is definitely important. But you have seen how many users on the user 
list asked how to save and load models (especially ALS), compared to how many 
asked for PMML. IMHO, we need to provide both save and load for every model we 
support, regardless of the format. I hope we all agree on that.

Since we already know the limitations of PMML, it is relatively easy to 
estimate how far it can take us. It doesn't handle distributed models such as 
ALS, LDA, frequent itemsets, spectral clustering, etc. What would the solution 
be? Even if we use XML and leave a pointer in the XML file, we still need to 
decide how to store the data, which brings us back to the work here; see the 
sketch below.
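
To make the ALS case concrete, here is a hedged sketch (the helper name 
`saveFactors`, the layout under `path`, and the DataFrame/Parquet calls from 
later Spark releases are all illustrative assumptions, not MLlib internals): 
the factor matrices are large RDDs that have to be written out as partitioned 
files, with only a small metadata record on the side.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Illustrative row type for one latent-factor vector.
case class Factor(id: Int, features: Array[Double])

// Hypothetical helper: write the two factor RDDs of an ALS model as
// partitioned Parquet datasets under `path`.
def saveFactors(sc: SparkContext,
                userFeatures: RDD[(Int, Array[Double])],
                productFeatures: RDD[(Int, Array[Double])],
                path: String): Unit = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  userFeatures.map { case (id, f) => Factor(id, f) }.toDF()
    .write.parquet(s"$path/data/userFeatures")
  productFeatures.map { case (id, f) => Factor(id, f) }.toDF()
    .write.parquet(s"$path/data/productFeatures")
  // A small metadata file (version, rank, class name) would sit next to the
  // data; partitioning is preserved by the file layout, not by the schema.
}
```

Whatever the top-level format, a distributed model ends up needing exactly 
this kind of partitioned data store, which is the question this JIRA is about.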

I don't quite get your last question. It seems it applies only to PMML, not to 
our internal format. Did I understand it correctly?

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization, but 
> that doesn't work for models stored in a distributed fashion, e.g., ALS and 
> LDA. Ideally, we should provide save/load methods for every model. PMML is an 
> option but it has its limitations. There are a couple of things we need to 
> discuss: 1) data format, 2) how to preserve partitioning, 3) data 
> compatibility between versions and language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.


