[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305991#comment-14305991 ]
Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

Thanks for the correction about Zementis and PMML; I updated the doc.

I do think verbosity may be an issue for forests. I've heard of use cases with 100s or 1000s of trees, with millions of nodes in total, which makes a columnar format seem pretty nice.

+1 for emphasizing PMML export.

W.r.t. the value of Parquet-based formats, I agree that distributed models and the difficulty with PMML import are the biggest issues. We are getting more distributed models (ALS, LDA, and likely more before long). For import, while exporting to model serving tools is important, it will be helpful for people to be able to import back into Spark, especially for training or evaluating models on new data. We could provide partial support for PMML import early on by supporting PMML exported from Spark but not from other tools, but I agree with you that partial PMML import support could cause a lot of trouble.

> Model export/import
> -------------------
>
>                 Key: SPARK-4587
>                 URL: https://issues.apache.org/jira/browse/SPARK-4587
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user
> mailing list. Model export/import can be done via Java serialization. But it
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we
> should provide save/load methods for every model. PMML is an option but it has
> its limitations. There are a couple of things we need to discuss: 1) data format,
> 2) how to preserve partitioning, 3) data compatibility between versions and
> language APIs, etc.
> UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.
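
To make the comment's point about forests concrete, here is a minimal Python sketch of what a columnar layout for tree nodes could look like. It is only an illustration under stated assumptions, not the format proposed in the design doc or anything Spark implements: the nested-dict tree representation, the column names, and the helper functions `forest_to_columns` and `columns_to_forest` are all hypothetical. The idea is that each node becomes one row with a fixed set of fields, so a forest with millions of nodes turns into a handful of long, homogeneous columns rather than deeply nested markup.

```python
# Hypothetical nested representation (an assumption for this sketch):
#   internal node: {"feature": int, "threshold": float, "left": node, "right": node}
#   leaf node:     {"prediction": float}

COLUMNS = ("tree_id", "node_id", "feature", "threshold", "left", "right", "prediction")


def forest_to_columns(forest):
    """Flatten a list of nested trees into one dict of equal-length columns."""
    cols = {name: [] for name in COLUMNS}
    for tree_id, root in enumerate(forest):
        nodes = [root]  # nodes[i] receives node_id i; grows as children are discovered
        i = 0
        while i < len(nodes):
            node = nodes[i]
            left = right = None
            if "prediction" not in node:  # internal node: reserve ids for both children
                left = len(nodes)
                nodes.append(node["left"])
                right = len(nodes)
                nodes.append(node["right"])
            cols["tree_id"].append(tree_id)
            cols["node_id"].append(i)
            cols["feature"].append(node.get("feature"))
            cols["threshold"].append(node.get("threshold"))
            cols["left"].append(left)
            cols["right"].append(right)
            cols["prediction"].append(node.get("prediction"))
            i += 1
    return cols


def columns_to_forest(cols):
    """Rebuild the nested trees from the columns (recursive, so fine for shallow trees only)."""
    row_of = {
        (cols["tree_id"][r], cols["node_id"][r]): r
        for r in range(len(cols["tree_id"]))
    }

    def build(tree_id, node_id):
        r = row_of[(tree_id, node_id)]
        if cols["left"][r] is None:  # leaf row
            return {"prediction": cols["prediction"][r]}
        return {
            "feature": cols["feature"][r],
            "threshold": cols["threshold"][r],
            "left": build(tree_id, cols["left"][r]),
            "right": build(tree_id, cols["right"][r]),
        }

    return [build(t, 0) for t in sorted(set(cols["tree_id"]))]


example_forest = [
    {
        "feature": 0,
        "threshold": 0.5,
        "left": {"prediction": 0.0},
        "right": {"prediction": 1.0},
    },
    {"prediction": 1.0},
]
example_columns = forest_to_columns(example_forest)
```

In a layout like this, every column holds values of a single type, which is the kind of data that columnar storage formats such as Parquet are designed to encode compactly; the round trip through `columns_to_forest` shows that no structure is lost by flattening.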