[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248049#comment-16248049 ]
Bago Amirbekian commented on SPARK-22346: ----------------------------------------- I think [~josephkb]'s version of Option 3 makes the most sense. A transformer that adds size data to a vector column would allow patching pipelines pretty easily and it could be implemented without breaking any APIs. I'm currently working on a PR based on this approach. > Update VectorAssembler to work with Structured Streaming > -------------------------------------------------------- > > Key: SPARK-22346 > URL: https://issues.apache.org/jira/browse/SPARK-22346 > Project: Spark > Issue Type: Improvement > Components: ML, Structured Streaming > Affects Versions: 2.2.0 > Reporter: Bago Amirbekian > Priority: Critical > > The issue > In batch mode, VectorAssembler can take multiple columns of VectorType and > assemble a output a new column of VectorType containing the concatenated > vectors. In streaming mode, this transformation can fail because > VectorAssembler does not have enough information to produce metadata > (AttributeGroup) for the new column. Because VectorAssembler is such a > ubiquitous part of mllib pipelines, this issue effectively means spark > structured streaming does not support prediction using mllib pipelines. > I've created this ticket so we can discuss ways to potentially improve > VectorAssembler. Please let me know if there are any issues I have not > considered or potential fixes I haven't outlined. I'm happy to submit a patch > once I know which strategy is the best approach. > Potential fixes > 1) Replace VectorAssembler with an estimator/model pair like was recently > done with OneHotEncoder, > [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The > Estimator can "learn" the size of the inputs vectors during training and save > it to use during prediction. > Pros: > * Possibly simplest of the potential fixes > Cons: > * We'll need to deprecate current VectorAssembler > 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty > major change, but it could be done in stages. We could first ensure that > metadata is not used during prediction and allow the VectorAssembler to drop > metadata for streaming dataframes. Going forward, it would be important to > not use any metadata on Vector columns for any prediction tasks. > Pros: > * Potentially, easy short term fix for VectorAssembler > (drop metadata for vector columns in streaming). > * Current Attributes implementation is also causing other issues, eg > [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. > Cons: > * To fully remove ML Attributes would be a major refactor of MLlib and would > most likely require breaking changings. > * A partial removal of ML attributes (eg: ensure ML attributes are not used > during transform, only during fit) might be tricky. This would require > testing or other enforcement mechanism to prevent regressions. > 3) Require Vector columns to have fixed length vectors. Most mllib > transformers that produce vectors already include the size of the vector in > the column metadata. This change would be to deprecate APIs that allow > creating a vector column of unknown length and replace those APIs with > equivalents that enforce a fixed size. > Pros: > * We already treat vectors as fixed size, for example VectorAssembler assumes > the inputs * output col are fixed size vectors and creates metadata > accordingly. In the spirit of explicit is better than implicit, we would be > codifying something we already assume. > * This could potentially enable performance optimizations that are only > possible if the Vector size of a column is fixed & known. > Cons: > * This would require breaking changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org