[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301896#comment-14301896
 ] 

Xiangrui Meng commented on SPARK-4587:
--------------------------------------

[~selvinsource] Thanks for joining the discussion! I think our goal is to 
provide users with way(s) to save/load models across the languages that Spark 
currently supports. The current hacky solution is Java serialization, which 
only works for Java. The work described in this JIRA runs in parallel with the 
PMML work. For usability, it is important to provide save and load functions 
together.
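
To illustrate the Java serialization route mentioned above, here is a minimal 
sketch of a save/load round trip. LinearModel, save, and load are hypothetical 
names made up for this example; they are not MLlib classes or the API proposed 
in this JIRA. The sketch assumes a small model that fits in local memory, which 
is the case where this approach works at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ModelSerializationSketch {

    /** Hypothetical stand-in for a small, local model; not an MLlib class. */
    static final class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double intercept;

        LinearModel(double[] weights, double intercept) {
            this.weights = weights.clone();
            this.intercept = intercept;
        }

        /** Dot product of weights and features, plus the intercept. */
        double predict(double[] features) {
            double sum = intercept;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * features[i];
            }
            return sum;
        }
    }

    /** "Save": write the model to a byte array with Java serialization. */
    static byte[] save(LinearModel model) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** "Load": read the model back; only a JVM with this class can do so. */
    static LinearModel load(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (LinearModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LinearModel model = new LinearModel(new double[] {0.5, -1.25}, 0.1);
        LinearModel restored = load(save(model));
        double[] x = {2.0, 4.0};
        // The restored model should give the same prediction as the original.
        System.out.println(model.predict(x) == restored.predict(x));
    }
}
```

The bytes produced this way are tied to the JVM class definition, so they 
cannot be read from another language, and the approach does not cover models 
whose state is stored in a distributed fashion.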

For PMML, it is easy to export small models. But it is hard to handle 
distributed models, e.g., ALS/LDA, and it is even harder to parse them back. At 
this stage, PMML export serves as an exit point for people who train models in 
MLlib but use them somewhere else. This is also an important use case, but I 
would treat it as parallel work.

Sorry for the delay on the code review! I will make a pass on your PR soon.

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored in a distributed fashion, e.g., ALS and LDA. 
> Ideally, we should provide save/load methods for every model. PMML is an 
> option but it has its limitations. There are a couple of things we need to 
> discuss: 1) data format, 2) how to preserve partitioning, 3) data 
> compatibility between versions and language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML.
> ** This is needed since it is the only thing approaching an industry standard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
