[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219686#comment-16219686 ]
Joseph K. Bradley commented on SPARK-22346:
-------------------------------------------

My thoughts on the options. Summary: I'm ambivalent between Options 2 & 3, since I think either could be done without breaking changes.

* Option 1 (VectorAssembler as an Estimator): too drastic
** This would break almost every MLlib workflow I've seen.
* Option 2 (drop metadata when unavailable): good if we're careful
** I think we can avoid breaking changes here.
** We can drop metadata only when (a) part of the metadata is unavailable and (b) the DataFrame is a streaming DataFrame. That way, we won't change existing MLlib workflows, and we will enable new ones using streaming. We can also log a warning about metadata being dropped.
** Long-term, we can improve these streaming workflows by maintaining partial metadata.
* Option 3 (fixed-length vectors): good if we're careful
** I think we can avoid breaking changes here.
** Note that fixed-length vectors are sort of required already, since ML attributes for Vector columns assume fixed lengths. I've also never heard of a need for variable lengths.
** We can provide a method (or better, a Transformer) which adds metadata to a column. (It could just be for specifying vector length for now, not a general metadata utility.)
** Current MLlib workflows should not require this; they are either batch or they already have metadata.
** New MLlib workflows using Streaming without metadata will be enabled when users add this Transformer to their workflow.

Long-term, it'd be great to support a sparse metadata format. Assuming we want to keep metadata around (and I think we should, because it's really useful, e.g., for providing parity with R models by tracking feature names), this seems like the best option for fixing these several issues around metadata.

What do you think?
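The decision rule proposed for Option 2 can be sketched in plain Python (this is illustrative logic, not the Spark API; the function name and placeholder batch fallback are assumptions):

```python
# Hypothetical sketch of Option 2's metadata rule: keep the assembled
# AttributeGroup's size when every input size is known from metadata, and
# drop metadata (with a warning) only for streaming DataFrames.
import warnings
from typing import List, Optional

def assembled_size(input_sizes: List[Optional[int]],
                   is_streaming: bool) -> Optional[int]:
    """Return the output vector size for the assembled column, or None
    if metadata must be dropped (streaming input with unknown sizes)."""
    if all(s is not None for s in input_sizes):
        # All metadata present: batch and streaming behave identically.
        return sum(input_sizes)
    if is_streaming:
        # We can't scan a streaming DataFrame to infer sizes, so drop
        # metadata instead of failing, and warn the user.
        warnings.warn("Dropping ML attribute metadata: input vector sizes "
                      "are unknown on a streaming DataFrame.")
        return None
    # Batch fallback (today's behavior): sizes are inferred by scanning
    # rows; modeled here only as a placeholder.
    raise NotImplementedError("batch inference from data not modeled here")
```

For example, `assembled_size([2, 3], is_streaming=True)` still yields full metadata (size 5), while `assembled_size([2, None], is_streaming=True)` drops it, matching the "no change to existing workflows" goal above.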
> Update VectorAssembler to work with Structured Streaming
> --------------------------------------------------------
>
>                 Key: SPARK-22346
>                 URL: https://issues.apache.org/jira/browse/SPARK-22346
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and assemble a new output column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder ([SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]). The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.
> Pros:
> * Possibly the simplest of the potential fixes
> Cons:
> * We'll need to deprecate the current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow VectorAssembler to drop metadata for streaming DataFrames. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially an easy short-term fix for VectorAssembler (drop metadata for vector columns in streaming).
> * The current Attributes implementation is also causing other issues, e.g. [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
> * A partial removal of ML Attributes (e.g. ensuring ML Attributes are not used during transform, only during fit) might be tricky. This would require testing or some other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would deprecate APIs that allow creating a vector column of unknown length and replace them with equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed-size; for example, VectorAssembler assumes the input & output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit", we would be codifying something we already assume.
> * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.
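Option 3's proposed Transformer can be sketched in plain Python as well (all names here — `Column`, `with_size_hint`, the `ml_attr_size` key — are illustrative stand-ins, not Spark APIs): a size hint stamps a fixed length onto a column's metadata so the assembler can build its output metadata purely from declared sizes, without scanning any rows.

```python
# Plain-Python sketch of Option 3: declare each input column's vector length
# up front, then compute the assembled column's length from metadata alone,
# failing fast if any input length is undeclared.
from dataclasses import dataclass, field
from typing import Dict, Iterable

@dataclass
class Column:
    name: str
    metadata: Dict[str, int] = field(default_factory=dict)

def with_size_hint(col: Column, size: int) -> Column:
    """Attach a fixed vector length to a copy of the column's metadata
    (the role the proposed size-hint Transformer would play)."""
    return Column(col.name, {**col.metadata, "ml_attr_size": size})

def assemble_metadata(cols: Iterable[Column]) -> int:
    """Compute the assembled column's length purely from metadata."""
    total = 0
    for c in cols:
        size = c.metadata.get("ml_attr_size")
        if size is None:
            raise ValueError(f"column {c.name!r} has no declared vector size")
        total += size
    return total
```

With hints on every input — e.g. `assemble_metadata([with_size_hint(Column("a"), 3), with_size_hint(Column("b"), 2)])` — the output length (5) is known before any data arrives, which is exactly what a streaming sink needs; a column without a hint fails fast rather than silently producing metadata-free output.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)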