[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301896#comment-14301896 ]
Xiangrui Meng commented on SPARK-4587:
--------------------------------------

[~selvinsource] Thanks for joining the discussion! I think our goal is to provide users with way(s) to save/load models across the languages that Spark currently supports. The current hacky solution is Java serialization, which only works for Java. The work described in this JIRA runs in parallel with the PMML work.

For usability, it is important to provide save and load functions together. With PMML, it is easy to export small models, but it is hard to handle distributed models, e.g., ALS/LDA, and it is even harder to parse them back. At this stage, exporting to PMML provides an exit point for people who train models in MLlib but use them somewhere else. This is also an important use case, but I would view it as parallel work.

Sorry for the delay on the code review! I will make a pass on your PR soon.

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user
> mailing list. Model export/import can be done via Java serialization, but that
> doesn't work for models stored in a distributed fashion, e.g., ALS and LDA.
> Ideally, we should provide save/load methods for every model. PMML is an
> option, but it has its limitations. There are a couple of things we need to
> discuss: 1) data format, 2) how to preserve partitioning, 3) data
> compatibility between versions and language APIs, etc.
> UPDATE: [Design doc for model import/export |
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org