[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219686#comment-16219686 ]

Joseph K. Bradley commented on SPARK-22346:
-------------------------------------------

My thoughts on the options.  Summary: I'd be happy with either Option 2 or 3, 
since I think either could be done without breaking changes.

* Option 1 (VectorAssembler as an Estimator): too drastic
** This would break almost every MLlib workflow I've seen.

* Option 2 (drop metadata when unavailable): good if we're careful
** I think we can avoid breaking changes here.
** We can drop metadata only when (a) part of the metadata is unavailable and 
(b) the DataFrame is a streaming DataFrame.  That way, we won't change existing 
MLlib workflows, and we will enable new ones using streaming.  We can also log 
a warning when metadata is dropped.
** Long-term, we can improve these streaming workflows by maintaining partial 
metadata.
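The drop-only-when-necessary rule above can be sketched in plain Python. This models the decision logic only; it is not Spark API, and the function and field names are invented for illustration:

```python
import warnings

def resolve_output_metadata(input_metadata, is_streaming):
    """Decide what the assembled output column's metadata should be.

    input_metadata: one entry per input column -- a dict of ML attributes,
                    or None when that column's metadata is unavailable.
    is_streaming:   whether the DataFrame is a streaming DataFrame.
    Returns a (decision, metadata) pair.
    """
    if all(m is not None for m in input_metadata):
        # Metadata is complete: keep it, batch or streaming alike,
        # so existing workflows are unchanged.
        merged = {}
        for m in input_metadata:
            merged.update(m)
        return ("keep", merged)
    if is_streaming:
        # (a) metadata partly unavailable and (b) streaming:
        # drop the metadata with a warning instead of failing the query.
        warnings.warn("Dropping ML attribute metadata for streaming DataFrame")
        return ("drop", None)
    # Batch with missing metadata: today VectorAssembler can inspect the
    # data to infer vector sizes; a streaming DataFrame cannot do that.
    return ("infer_from_data", None)
```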

* Option 3 (fixed-length vectors): good if we're careful
** I think we can avoid breaking changes here.
** Note that fixed-length vectors are sort-of required already since ML 
attributes for Vector columns assume fixed lengths.  I've also never heard of 
a need for variable-length vectors.
** We can provide a method (or better, a Transformer) which adds metadata to a 
column.  (It could just be for specifying vector length for now, not a general 
metadata utility.)
** Current MLlib workflows should not require this; they are either batch or 
they already have metadata.
** New MLlib workflows using Streaming without metadata will be enabled when 
users add this Transformer to their workflow.
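As a toy model of the proposed metadata-adding Transformer, here is a plain-Python sketch. It is not Spark API; the class name and the dict-based "schema" are invented for illustration:

```python
class VectorSizeHintSketch:
    """Hypothetical metadata-adding Transformer: attaches an expected
    vector length to one column's metadata and touches nothing else."""

    def __init__(self, input_col, size):
        self.input_col = input_col
        self.size = size

    def transform_schema(self, schema):
        # schema is modeled as {column_name: metadata dict or None}.
        out = dict(schema)
        meta = dict(out.get(self.input_col) or {})
        meta["num_attrs"] = self.size  # roughly what an AttributeGroup records
        out[self.input_col] = meta
        return out
```

Placed before VectorAssembler in a streaming pipeline, a stage like this would supply the vector sizes the assembler cannot learn by scanning streaming data.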

Long-term, it'd be great to support a sparse metadata format. Assuming we want 
to keep metadata around (and I think we should, since it's really useful, e.g., 
for providing parity with R models by tracking feature names), this seems like 
the best option for fixing the several outstanding issues around metadata.

What do you think?

> Update VectorAssembler to work with Structured Streaming
> --------------------------------------------------------
>
>                 Key: SPARK-22346
>                 URL: https://issues.apache.org/jira/browse/SPARK-22346
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of MLlib pipelines, this issue effectively means Spark 
> Structured Streaming does not support prediction using MLlib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair, as was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the input vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly the simplest of the potential fixes
> Cons:
> * We'll need to deprecate the current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow VectorAssembler to drop 
> metadata for streaming DataFrames. Going forward, it would be important not 
> to use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially an easy short-term fix for VectorAssembler 
> (drop metadata for vector columns in streaming).
> * The current Attributes implementation is also causing other issues, e.g., 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * Fully removing ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML Attributes (e.g., ensuring they are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or another enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed-length vectors. Most MLlib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would deprecate APIs that allow creating a 
> vector column of unknown length and replace them with equivalents that 
> enforce a fixed size.
> Pros:
> * We already treat vectors as fixed-size; for example, VectorAssembler 
> assumes the input & output columns are fixed-size vectors and creates 
> metadata accordingly. In the spirit of "explicit is better than implicit," we 
> would be codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.
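The failure described in the quoted issue boils down to size bookkeeping: the output column's metadata needs the total length of the inputs. A plain-Python sketch (not Spark API; dense vectors modeled as lists, sizes as ints or None when unknown):

```python
def assemble(vectors):
    """Concatenate vectors the way VectorAssembler does (dense case only)."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def output_size(input_sizes):
    """Total length for the output column's AttributeGroup.

    If any input size is unknown (None), the total -- and hence the
    output metadata -- cannot be produced without scanning the data,
    which a streaming DataFrame does not allow.
    """
    if any(s is None for s in input_sizes):
        return None
    return sum(input_sizes)
```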



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
