[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382284#comment-14382284 ]

Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

Can you not check out that repo?

> Model export/import
> -------------------
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
> Issue Type: Umbrella
> Components: ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Joseph K. Bradley
> Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user
> mailing list. Model export/import can be done via Java serialization. But it
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we
> should provide save/load methods to every model. PMML is an option but it has
> its limitations. There are couple things we need to discuss: 1) data format,
> 2) how to preserve partitioning, 3) data compatibility between versions and
> language APIs, etc.
> UPDATE: [Design doc for model import/export |
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including
> goals, an API, and development plans.
> UPDATE: As in the design doc, we plan to support:
> * Our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model
> import/export support into Spark quickly (while avoiding the complexity of
> PMML).
> * PMML
> ** This is needed since it is the most commonly used format in industry.
> This JIRA will be for the internal Spark-specific format described in the
> design doc. Parallel JIRAs will cover PMML.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381544#comment-14381544 ]

zhangyouhua commented on SPARK-4587:
------------------------------------

I get the message "Sorry, this file is invalid so it cannot be displayed." Could you send it to me?
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381387#comment-14381387 ]

Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

I checked, and it should be publicly accessible. In case it's a firewall issue, I put a PDF of it here on this branch of Spark in my repo: [https://github.com/jkbradley/spark/blob/137ea4649c0f24f2679e874da127b546bea7a774/MLModelImportExportDesignDoc.pdf]
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381313#comment-14381313 ]

zhangyouhua commented on SPARK-4587:
------------------------------------

I cannot access the page "Design doc for model import/export". Could you send it to me? Thank you.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340539#comment-14340539 ]

Apache Spark commented on SPARK-4587:
-------------------------------------

User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4816
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306147#comment-14306147 ]

Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

It sounds like we're converging! I'll work on splitting the JIRAs between internal model save/load and PMML export since they are fairly separate efforts.

(By the way, some of the infrastructure required for the internal format will be useful for porting complex models to Python. E.g., for DecisionTree, it could go: DecisionTreeModel --> DataFrame --> Python DataFrame --> Python model. So this internal format will have other uses too.)

> Model export/import
> -------------------
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Reporter: Xiangrui Meng
> Assignee: Joseph K. Bradley
> Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user
> mailing list. Model export/import can be done via Java serialization. But it
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we
> should provide save/load methods to every model. PMML is an option but it has
> its limitations. There are couple things we need to discuss: 1) data format,
> 2) how to preserve partitioning, 3) data compatibility between versions and
> language APIs, etc.
> UPDATE: [Design doc for model import/export |
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including
> goals, an API, and development plans.
> The design doc proposes:
> * Support our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model
> import/export support into Spark quickly (while avoiding new dependencies).
> * Also support PMML
> ** This is needed since it is the only thing approaching an industry standard.
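The DecisionTreeModel --> DataFrame step mentioned in the comment above amounts to flattening a tree into one row per node and rebuilding it from those rows. A minimal sketch in plain Python of that idea; the `Node` class, the column names, and both helper functions are hypothetical and are not MLlib's classes or its actual schema:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical tree node for illustration; not MLlib's class."""
    node_id: int
    prediction: float
    feature: Optional[int] = None      # None for leaves
    threshold: Optional[float] = None  # None for leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_to_rows(root: Node) -> List[Dict[str, Any]]:
    """Flatten a tree into one row per node; children become id columns."""
    rows, stack = [], [root]
    while stack:
        n = stack.pop()
        rows.append({
            "id": n.node_id,
            "prediction": n.prediction,
            "feature": n.feature,
            "threshold": n.threshold,
            "left": n.left.node_id if n.left is not None else None,
            "right": n.right.node_id if n.right is not None else None,
        })
        stack.extend(c for c in (n.left, n.right) if c is not None)
    return rows


def rows_to_tree(rows: List[Dict[str, Any]], root_id: int = 0) -> Node:
    """Rebuild the tree from its rows; row order does not matter."""
    by_id = {r["id"]: r for r in rows}

    def build(i: int) -> Node:
        r = by_id[i]
        return Node(
            node_id=i,
            prediction=r["prediction"],
            feature=r["feature"],
            threshold=r["threshold"],
            left=build(r["left"]) if r["left"] is not None else None,
            right=build(r["right"]) if r["right"] is not None else None,
        )

    return build(root_id)
```

Because each row is self-describing, the rows can be stored in any tabular format and read back from any language that can rebuild the parent/child links.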
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306115#comment-14306115 ]

Sean Owen commented on SPARK-4587:
----------------------------------

OK, an internal-only format makes sense. So the idea is that simple POJO serialization is insufficient for distributed models, so the plan is to make a new representation that can store the model and the distributed data too? Yes, most of the questions I have go away if it's not intended for consumption by other tools, that is, not for export.

I agree that people want serialization/deserialization for use within Spark more than they want PMML serialization for use with another tool. PMML is still necessary for export to other tools though. It's really a separate use case, isn't it: how internal serialization works vs. how export works.

In the name of saving development effort, I suppose I was suggesting to use PMML for whatever it can be used for rather than reinvent a representation. But given that PMML is only useful in the export use case, it may be more clunky than it's worth to try to leverage PMML for just a part of an internal serialization format.

Yes, I'm referring to PMML import at the end. Almost all models that PMML can describe won't be supportable in MLlib, so most imports would need to fail. It could be confusing if the "PMML import" feature doesn't import much. Still, it doesn't mean it shouldn't exist at all. It seems coherent to import what you can export and just draw the line there.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306088#comment-14306088 ]

Xiangrui Meng commented on SPARK-4587:
--------------------------------------

[~srowen] Parquet is just an implementation detail. We expect users to load model files through `Model.load` rather than read them directly, and we don't expect `Model.load` to be used with models produced by other ML libraries. The format could be plain text, Avro, or ORC, as long as we promise that we can load it back in future releases. We chose Parquet just because it is built-in and it is a general format, which means we can inspect the content and provide loaders easily in Python as well.

PMML is definitely important. But you have seen how many users on the user list asked about how to save and load models (especially ALS), compared to how many users asked for PMML. IMHO, we do need to provide both save and load to every model we support, regardless of the format. I hope that we all agree.

Since we already know the limitations of PMML, it is relatively easy to estimate how far we can go. It doesn't handle distributed models like ALS, LDA, frequent itemsets, spectral clustering, etc. What would be the solution? Even if we use XML and leave a pointer in the XML file, we still need to consider how to store the data, and that takes us back to the work here.

I don't quite get your last question. It seems that it only applies to PMML but not to our internal format. Did I understand it correctly?
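The promise that a saved model can be loaded back in future releases, whatever the storage format, is commonly kept by writing a small metadata record next to the data and checking it on load. A sketch under stated assumptions: JSON files stand in for Parquet here, and `save_model`, `load_model`, the `formatVersion` field, and the directory layout are all hypothetical, not Spark's actual format or API.

```python
import json
import os

FORMAT_VERSION = "1.0"           # hypothetical version tag
SUPPORTED_VERSIONS = {"1.0"}     # versions this loader knows how to read


def save_model(path, class_name, weights):
    """Write metadata and data as separate files under `path`."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump({"class": class_name, "formatVersion": FORMAT_VERSION}, f)
    with open(os.path.join(path, "data.json"), "w") as f:
        json.dump({"weights": weights}, f)


def load_model(path, expected_class):
    """Check class and format version before touching the data file."""
    with open(os.path.join(path, "metadata.json")) as f:
        meta = json.load(f)
    if meta["class"] != expected_class:
        raise ValueError(f"expected {expected_class}, found {meta['class']}")
    if meta["formatVersion"] not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported format version {meta['formatVersion']}")
    with open(os.path.join(path, "data.json")) as f:
        return json.load(f)["weights"]
```

A later release that changes the data layout would add a new version string and keep a reader for the old one, so users never need to inspect the files themselves.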
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306032#comment-14306032 ]

Sean Owen commented on SPARK-4587:
----------------------------------

True; you could also store N separate PMML models! At least, the question of how to divvy up one huge model data set is different from how to represent it. One way or the other you have to keep track of distributed model data. The only thing I'd really suggest is trying to get the PMML machinery working first, and seeing how far that can be leveraged before building new mechanisms.

Importing what you export seems like a good, bright line to draw. The big question will be what to do with features or settings you don't support: if the model says to use a certain distance metric that isn't supported in k-means, do you ignore it or fail?
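The "ignore it or fail" question for unsupported settings can be made concrete as a strict versus lenient import policy. A small sketch for illustration only: the function name, the `strict` flag, and the set of supported metrics are assumptions, not anything decided in this thread or present in MLlib.

```python
import warnings

# Assumption for illustration: the importer supports only one metric.
SUPPORTED_DISTANCES = {"squaredEuclidean"}
DEFAULT_DISTANCE = "squaredEuclidean"


def resolve_distance(metric, strict=True):
    """Return the metric to use for an imported clustering model.

    strict=True fails on an unsupported metric; strict=False warns and
    falls back to the default, which silently changes model behavior.
    """
    if metric in SUPPORTED_DISTANCES:
        return metric
    if strict:
        raise ValueError(f"unsupported distance metric: {metric}")
    warnings.warn(f"ignoring unsupported distance metric: {metric}")
    return DEFAULT_DISTANCE
```

Failing is the safer default, since a model imported with a different metric may produce different cluster assignments than the tool that exported it.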
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306022#comment-14306022 ]

Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

You may be right about compression; I'm not really sure. One benefit of using Parquet for big forests is that the forest can be stored in a distributed fashion, and the work for model import and export can be distributed. (I guess this could be done with PMML, with a central file containing pointers to the distributed files, but that seems a bit more awkward.)

I agree having 2 model formats is annoying. If we do want to be able to import any models we export, what are your thoughts about having partial import support for PMML, where we give no guarantees for any models created outside of Spark?
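The layout described above, a central file containing pointers to the distributed files, can be sketched with a manifest plus part files. This is an illustration under assumptions: `write_sharded`, `read_sharded`, the `manifest.json` name, and the part-file naming are hypothetical, and in a real cluster each part would be written and read by a different worker rather than in a local loop.

```python
import json
import os


def write_sharded(path, rows, num_shards):
    """Split rows round-robin into part files and record them in a manifest."""
    os.makedirs(path, exist_ok=True)
    shard_names = []
    for i in range(num_shards):
        name = f"part-{i:05d}.json"
        with open(os.path.join(path, name), "w") as f:
            json.dump(rows[i::num_shards], f)
        shard_names.append(name)
    # The manifest is the only file a loader needs to know about up front.
    with open(os.path.join(path, "manifest.json"), "w") as f:
        json.dump({"shards": shard_names}, f)


def read_sharded(path):
    """Follow the manifest's pointers and concatenate all part files."""
    with open(os.path.join(path, "manifest.json")) as f:
        manifest = json.load(f)
    rows = []
    for name in manifest["shards"]:
        with open(os.path.join(path, name)) as f:
            rows.extend(json.load(f))
    return rows
```

Rows come back grouped by shard rather than in their original order, so a caller that needs ordering should carry an explicit id in each row.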
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306002#comment-14306002 ]

Sean Owen commented on SPARK-4587:
----------------------------------

You can get hundreds of megabytes of XML, yeah. I had thought to myself that, after compression, any representation of a huge forest is probably going to be the same order of magnitude.

There's definitely a need to define how to serialize the distributed data, since it seems like it must live outside any model file format (?). For example, I was using a simple text-based format, which isn't exactly ideal compared to Parquet perhaps.

I suppose what I was questioning was serializing the model itself with a new format. For example, a newly-invented clustering model format doesn't seem to add much.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305991#comment-14305991 ]

Joseph K. Bradley commented on SPARK-4587:
------------------------------------------

Thanks for the correction about Zementis and PMML; I updated the doc.

I do think verbosity may be an issue for forests. I've heard of use cases with 100s or 1000s of trees, with millions of nodes in total, which makes a columnar format seem pretty nice.

+1 for emphasizing PMML export.

W.r.t. the value of Parquet-based formats, I agree that distributed models and the difficulty with PMML import are the biggest issues. We are getting more distributed models (ALS, LDA, and likely more before long). For import, while exporting to model serving tools is important, it will be helpful for people to be able to import back into Spark, especially for training or evaluating models on new data. We could provide partial support for PMML import early on by supporting PMML exported from Spark but not from other tools, but I agree with you that partial PMML import support could cause a lot of trouble.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305973#comment-14305973 ]

Sean Owen commented on SPARK-4587:
----------------------------------

Coming late to the discussion, with a few comments on the design doc:

- PMML is not a Zementis-led format BTW
- I don't find the verbosity of XML to be a big problem, with the possible exception of decision forests. It compresses well.
- Export to PMML is much easier than import, to the extent that I am not sure I would even bother with import in the medium term. So much would be unsupported that it would probably cause more confusion than help.
- jpmml-evaluator is not needed for model import/export; the part that's needed is BSD 3-clause

This Parquet-based format is internal to Spark? Then why not just use the model's serialized form (modulo the issue of distributed models, of course)? If it's not internal, it looks like yet another format that is a subset of PMML, written differently, that nothing else will read. What does it add? Is its role really for serializing pipelines?

The unsupported model types are really an issue though. You can make up your own serialization in an Extension element though, which is perhaps better than conceiving a wholly separate format. I still imagine a huge factored matrix model can't reasonably be contained in any file format; I have resorted to just recording pointers to the location of the distributed data in a model file.

I suppose I'm worried about ending up with a bunch of half-finished imports/exports rather than focusing. IMHO top priority should be PMML export, and then see how it goes.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301896#comment-14301896 ] Xiangrui Meng commented on SPARK-4587: -- [~selvinsource] Thanks for joining the discussion! I think our goal is to provide users ways(s) to save/load models across languages that Spark currently supports. The current hacky solution is via Java serialization and it only works for Java. The work described in this JIRA is in parallel with the PMML work. For usability, it is important to provide save and load functions together. For PMML, it is easy to export small models. But it is hard to handle distributed models, e.g., ALS/LDA, and it is harder to parse them back. At this stage, exporting to PMML is to provide an exit point for people who train models in MLlib but use them somewhere else. This is also an important use case, but I would view it as a parallel work. Sorry for the delay on the code review! I will make a pass on your PR soon. > Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we > should provide save/load methods to every model. PMML is an option but it has > its limitations. There are couple things we need to discuss: 1) data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc. 
> UPDATE: [Design doc for model import/export | > https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] > This document sketches machine learning model import/export plans, including > goals, an API, and development plans. > The design doc proposes: > * Support our own Spark-specific format. > ** This is needed to (a) support distributed models and (b) get model > import/export support into Spark quickly (while avoiding new dependencies). > * Also support PMML > ** This is needed since it is the only thing approaching an industry standard.
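To make the Java serialization point in the comment above concrete, here is a minimal sketch of a round trip for a small local model. The LinearModel class and the save/load helpers are hypothetical names made up for illustration; they are not MLlib classes or the API proposed in the design doc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class LocalModelSerialization {

    /** Hypothetical small local model (not an MLlib class): weights plus intercept. */
    static final class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double intercept;

        LinearModel(double[] weights, double intercept) {
            this.weights = weights.clone();
            this.intercept = intercept;
        }

        /** Computes intercept + sum(weights[i] * features[i]). */
        double predict(double[] features) {
            double sum = intercept;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * features[i];
            }
            return sum;
        }
    }

    /** Serializes the model with plain Java serialization. */
    static byte[] save(LinearModel model) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
            out.flush(); // push buffered object data into the byte array
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the model back; only a JVM with the same class on its classpath can do this. */
    static LinearModel load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (LinearModel) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The bytes produced this way are readable only by a JVM that has a compatible version of the class, and the whole model has to fit in one object graph. That is the limitation raised in this thread for other language APIs and for distributed models such as ALS and LDA.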
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299929#comment-14299929 ] Vincenzo Selvaggio commented on SPARK-4587: --- [~josephkb] I have read your detailed document and here are some comments.
1. You mentioned in the development plan that PMML export will be supported in 1.4. As I have completed many of the models that can be mapped to the PMML format, why not add those in 1.3?
2. As discussed in https://issues.apache.org/jira/browse/SPARK-1406, I think import from PMML is not that useful, as Spark is primarily used as a model producer rather than a consumer. However, it won't be difficult to implement the complementary code to import the same PMML models that the ModelExporter class is able to export.
3. Regarding the JPMML library, we only need jpmml-model for export and, eventually, import, and its licence (BSD 3-Clause) is compatible with Apache's. The evaluation library is not needed.
4. I had a look at your PR https://github.com/apache/spark/pull/4233 and noticed that the save/load methods exist in each model. That is probably OK for an internal format, but I wouldn't tie the PMML export/import to the models; models should be kept clean and without dependencies on file formats.
5. You made a valid point on checking the security aspect when importing PMML (and therefore XML) models. I already knew that Java XML parsers are, in general, vulnerable to XXE: https://www.owasp.org/index.php/XML_External_Entity_%28XXE%29_Processing. I double checked the jpmml-model library and it does use the JAXB API, which is vulnerable by default. The file that would need to change is https://github.com/jpmml/jpmml-model/blob/master/pmml-model/src/main/java/org/jpmml/model/JAXBUtil.java, using this solution: http://stackoverflow.com/questions/12977299/prevent-xxe-attack-with-jaxb. I can contact the author (I talked to him before) and see what he thinks. 
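As a rough illustration of the XXE concern in point 5, the sketch below hardens a JDK SAX parser so that documents carrying a DOCTYPE, and therefore any external entity, are rejected. This is one common mitigation, not necessarily the change that would be made in jpmml-model; the HardenedXml class and its method names are hypothetical, and only JDK classes are used so the block stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HardenedXml {

    /** Builds an XMLReader that refuses DOCTYPE declarations and external entities. */
    static XMLReader newHardenedReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject any document that contains a DOCTYPE at all.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Defence in depth: never resolve external general or parameter entities.
            factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("could not configure a hardened XML parser", e);
        }
    }

    /** Returns true if the XML parses under the hardened settings, false if it is rejected. */
    static boolean parses(String xml) {
        XMLReader reader = newHardenedReader();
        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler); // DefaultHandler rethrows fatal errors
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // With JAXB on the classpath, a reader configured like this could be wrapped in a
    // javax.xml.transform.sax.SAXSource and handed to Unmarshaller.unmarshal(Source).
    // That step is left out here so the example needs nothing beyond the JDK.
}
```

With these settings a plain PMML document still parses, while a document that declares an external entity in a DOCTYPE is rejected before the entity is ever resolved.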
> Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we > should provide save/load methods to every model. PMML is an option but it has > its limitations. There are couple things we need to discuss: 1) data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc. > UPDATE: [Design doc for model import/export | > https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] > This document sketches machine learning model import/export plans, including > goals, an API, and development plans. > The design doc proposes: > * Support our own Spark-specific format. > ** This is needed to (a) support distributed models and (b) get model > import/export support into Spark quickly (while avoiding new dependencies). > * Also support PMML > ** This is needed since it is the only thing approaching an industry standard.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294312#comment-14294312 ] Apache Spark commented on SPARK-4587: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4233 > Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we > should provide save/load methods to every model. PMML is an option but it has > its limitations. There are couple things we need to discuss: 1) data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc. > UPDATE: [Design doc for model import/export | > https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] > This document sketches machine learning model import/export plans, including > goals, an API, and development plans. > The design doc proposes: > * Support our own Spark-specific format. > ** This is needed to (a) support distributed models and (b) get model > import/export support into Spark quickly (while avoiding new dependencies). > * Also support PMML > ** This is needed since it is the only thing approaching an industry standard.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294035#comment-14294035 ] Joseph K. Bradley commented on SPARK-4587: -- Thanks for reading the doc! The Parquet format is really an internal format, meant to be read back in by Spark. For passing models to some other tool, PMML will probably be the best option. It's certainly worth discussing whether another format would be more compatible with other tools (in case we open up the format as a public API in the future), if that other format will support the same requirements (store local and distributed models, be compatible across versions, etc.). I know we've run into issues with protobuf's API changing across versions; I'm not sure about thrift. > Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we > should provide save/load methods to every model. PMML is an option but it has > its limitations. There are couple things we need to discuss: 1) data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc. > UPDATE: [Design doc for model import/export | > https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] > This document sketches machine learning model import/export plans, including > goals, an API, and development plans. > The design doc proposes: > * Support our own Spark-specific format. 
> ** This is needed to (a) support distributed models and (b) get model > import/export support into Spark quickly (while avoiding new dependencies). > * Also support PMML > ** This is needed since it is the only thing approaching an industry standard.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293892#comment-14293892 ] Peter Prettenhofer commented on SPARK-4587: --- I read the design document for model import/export. My concern about using Parquet as a file format is that it is not widely supported. E.g., it will be hard to deploy a trained MLlib pipeline in a Python or Go environment. To me, a format like Thrift or protobuf looks more versatile. E.g., this project by Steven Noble, https://github.com/snoble/honeybee, contains Thrift definitions of decision trees and logistic regression models that I find interesting. My knowledge of Parquet is quite limited. It seems that one cannot compare it to Thrift serialization and protocol buffers per se, but I'm arguing mostly about tooling (i.e., the lack of a Parquet reader for Python/Go). > Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we > should provide save/load methods to every model. PMML is an option but it has > its limitations. There are couple things we need to discuss: 1) data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc. > UPDATE: [Design doc for model import/export | > https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] > This document sketches machine learning model import/export plans, including > goals, an API, and development plans. > The design doc proposes: > * Support our own Spark-specific format. 
> ** This is needed to (a) support distributed models and (b) get model > import/export support into Spark quickly (while avoiding new dependencies). > * Also support PMML > ** This is needed since it is the only thing approaching an industry standard.
[jira] [Commented] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225997#comment-14225997 ] Martin Liesenberg commented on SPARK-4587: -- There has been some discussion in SPARK-1406 > Model export/import > --- > > Key: SPARK-4587 > URL: https://issues.apache.org/jira/browse/SPARK-4587 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xiangrui Meng >Priority: Critical > > This is an umbrella JIRA for one of the most requested features on the user > mailing list. Model export/import can be done via Java serialization. But it > doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we > should provide save/load methods to every model. PMML is an option but it has > its limitations. There are couple things we need to discuss: 1) data format, > 2) how to preserve partitioning, 3) data compatibility between versions and > language APIs, etc.