[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305991#comment-14305991
 ] 

Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

Thanks for the correction about Zementis and PMML; I updated the doc.

I do think verbosity may be an issue for forests.  I've heard of use cases with 
100s or 1000s of trees, with millions of nodes in total, which makes a columnar 
format seem pretty nice.
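
To make the columnar point concrete, here is a rough sketch in plain Python 
(not Spark code, and not the format from the design doc).  The Node class, the 
flatten_forest helper, and the column names are all hypothetical; the idea is 
just that every node in every tree becomes one row, so a forest turns into a 
handful of long, homogeneous columns that a format like Parquet could store 
compactly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical binary tree node: internal nodes split, leaves predict."""
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None


def flatten_forest(trees: List[Node]) -> Dict[str, list]:
    """Flatten a forest into parallel columns, one row per node.

    Assumes each node has either two children or none. Child links are
    stored as node ids within the same tree; -1 marks a missing child.
    """
    cols: Dict[str, list] = {
        "tree_id": [], "node_id": [], "feature": [], "threshold": [],
        "left": [], "right": [], "prediction": [],
    }
    for tree_id, root in enumerate(trees):
        stack = [(root, 0)]  # (node, id assigned to that node)
        next_id = 1
        while stack:
            node, node_id = stack.pop()
            if node.left is None or node.right is None:
                left_id = right_id = -1
            else:
                left_id, right_id = next_id, next_id + 1
                next_id += 2
                # Push right first so the left child is emitted first.
                stack.append((node.right, right_id))
                stack.append((node.left, left_id))
            cols["tree_id"].append(tree_id)
            cols["node_id"].append(node_id)
            cols["feature"].append(node.feature)
            cols["threshold"].append(node.threshold)
            cols["left"].append(left_id)
            cols["right"].append(right_id)
            cols["prediction"].append(node.prediction)
    return cols


# A tiny forest: one depth-1 tree and one single-leaf tree.
forest = [
    Node(feature=0, threshold=0.5,
         left=Node(prediction=0.0), right=Node(prediction=1.0)),
    Node(prediction=2.0),
]
columns = flatten_forest(forest)
```

With millions of nodes, each of these columns is a long run of values of a 
single type, which is the case where a columnar layout tends to pay off 
compared with a verbose nested per-node encoding.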

+1 for emphasizing PMML export

W.r.t. the value of Parquet-based formats, I agree that distributed models and 
the difficulty with PMML import are the biggest issues.  We are getting more 
distributed models (ALS, LDA, and likely more before long).  For import, while 
exporting to model serving tools is important, it will be helpful for people to 
be able to import back into Spark, especially for training or evaluating models 
on new data.  We could provide partial support for PMML import early on by 
supporting PMML exported from Spark but not from other tools, but I agree with 
you that partial PMML import support could cause a lot of trouble.

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

