[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22346:
------------------------------------
    Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and assemble them into a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines.
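
To make the failure mode concrete, here is a minimal sketch (Scala, Spark 2.2-era APIs; the paths and column names are hypothetical) showing the same assembler succeeding in batch but breaking down on a streaming DataFrame:

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("va-streaming-sketch").getOrCreate()

val assembler = new VectorAssembler()
  .setInputCols(Array("features1", "features2"))  // hypothetical vector columns
  .setOutputCol("features")

// Batch: the input columns typically carry size metadata (AttributeGroup),
// so the assembler can build metadata for the output column.
val batchDf = spark.read.parquet("/tmp/batch-input")  // hypothetical path
assembler.transform(batchDf)

// Streaming: if the input columns arrive without size metadata, the assembler
// has no way to determine the assembled size up front, and falling back to
// inspecting rows is not allowed on a streaming Dataset.
val streamDf = spark.readStream.schema(batchDf.schema).parquet("/tmp/stream-input")
assembler.transform(streamDf)  // can fail here
{code}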

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder ([#13030]). The estimator can "learn" the sizes of the input vectors during training and save them for use during prediction; a rough sketch follows the cons below.

Pros:
* Possibly the simplest of the potential fixes

Cons:
* We'd need to deprecate the current VectorAssembler
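
For illustration only, a rough sketch of the core of option 1 (not a full Estimator implementation; the helper name is hypothetical): fit would scan the training data once to learn each input column's vector size, and the resulting model could then build the output AttributeGroup without ever touching the prediction data:

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

// Fit-time: learn each input column's vector size from a single training row
// (sufficient if we assume the vectors in a column are fixed-size).
def learnInputSizes(training: DataFrame, inputCols: Array[String]): Map[String, Int] = {
  val row = training.select(inputCols.head, inputCols.tail: _*).first()
  inputCols.zipWithIndex.map { case (c, i) => c -> row.getAs[Vector](i).size }.toMap
}

// Transform-time: the model would build the output metadata from the saved
// sizes, e.g. new AttributeGroup(outputCol, sizes.values.sum).toMetadata(),
// with no need to inspect the (possibly streaming) prediction data.
{code}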

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow VectorAssembler to drop metadata for streaming DataFrames; a sketch of what that might look like follows the cons below. Going forward, it would be important not to use any metadata on Vector columns for any prediction tasks.

Pros:
* Potentially an easy short-term fix for VectorAssembler

Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are not used during transform, only during fit) might be tricky. This would require testing or some other enforcement mechanism to prevent regressions.
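
As a sketch of what the first stage might look like, here is a hand-rolled stand-in for VectorAssembler (hypothetical column and helper names, not the real implementation) whose output column simply carries no ML attribute metadata, so nothing downstream can come to rely on it during prediction:

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Concatenate two vector columns; densifying loses sparsity, which is
// acceptable for a sketch.
val concatVectors = udf { (a: Vector, b: Vector) =>
  Vectors.dense(a.toArray ++ b.toArray)
}

// Note there is no `.as(name, metadata)` on the output column: the result
// carries no AttributeGroup, in batch or streaming.
def assembleWithoutMetadata(df: DataFrame): DataFrame =
  df.withColumn("features", concatVectors(col("features1"), col("features2")))
{code}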

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would deprecate APIs that allow creating a vector column of unknown length and replace them with equivalents that enforce a fixed size; a sketch of the existing size lookup follows the cons below.

Pros:
* We already treat vectors as fixed-size; for example, VectorAssembler assumes the input and output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit", we would be codifying something we already assume.
* This could potentially enable performance optimizations that are only possible if the vector size of a column is fixed and known.

Cons:
* This would require breaking changes.
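
For reference, the size information option 3 would make mandatory is already recoverable today through AttributeGroup; a small sketch (hypothetical column name):

{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame

// Most MLlib stages already record the vector size in column metadata;
// option 3 would turn this best-effort lookup into a hard contract.
def vectorSize(df: DataFrame, colName: String): Int =
  AttributeGroup.fromStructField(df.schema(colName)).size  // -1 means unknown
{code}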





> Update VectorAssembler to work with StreamingDataframes
> -------------------------------------------------------
>
>                 Key: SPARK-22346
>                 URL: https://issues.apache.org/jira/browse/SPARK-22346
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
>


