[ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Everett Rush updated SPARK-27249:
---------------------------------
    Description: 
It would be nice to have a developers' API for dataset transformers that need 
more than one column from a row (i.e., more than UnaryTransformer supports) or 
that contain objects too expensive to initialize repeatedly in a UDF, such as a 
database connection. 

 

Design:

The abstract class PartitionTransformer extends Transformer and defines the 
partition transformation function as Iterator[Row] => Iterator[Row].

NB: This parallels the createTransformFunc method of UnaryTransformer.

 

When developers subclass this transformer, they can provide their own schema 
for the output Row, in which case the PartitionTransformer creates a row encoder 
and executes the transformation. Alternatively, the developer can set the output 
DataType and output column name, in which case the PartitionTransformer class 
creates a new schema and a row encoder, and then executes the transformation.

  was:
It would be nice to have a developers' API for dataset transformers that need 
more than one column from a row (i.e., more than UnaryTransformer supports) or 
that contain objects too expensive to initialize repeatedly in a UDF, such as a 
database connection. 

 

Design:

Abstract class PartitionTransformer extends Transformer and defines the 
partition transformation function as Iterator[Row] => Iterator[Row]

NB: This parallels the UnaryTransformer createTransformFunc method

 

When developers subclass this transformer, they can either provide their own 
schema for the output Row or set output Datatype and output col name. Then the 
PartitionTransformer class will create a new schema, a row encoder, and execute 
the transformation.


> Developers API for Transformers beyond UnaryTransformer
> -------------------------------------------------------
>
>                 Key: SPARK-27249
>                 URL: https://issues.apache.org/jira/browse/SPARK-27249
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.5.0
>            Reporter: Everett Rush
>            Priority: Minor
>              Labels: starter
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformers that need 
> more than one column from a row (i.e., more than UnaryTransformer supports) 
> or that contain objects too expensive to initialize repeatedly in a UDF, such 
> as a database connection. 
>  
> Design:
> The abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the createTransformFunc method of UnaryTransformer.
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row, in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively, the developer can set 
> the output DataType and output column name, in which case the 
> PartitionTransformer class creates a new schema and a row encoder, and then 
> executes the transformation.
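
For illustration only, the design described above could be sketched roughly as 
follows. This is not code from the issue or from Spark: the member names 
createPartitionFunc, outputSchema, outputCol and outputDataType, and the example 
subclass WeightedSum, are hypothetical. The sketch also assumes a Spark 2.x/3.x 
classpath where RowEncoder(schema) from the catalyst package is available; other 
versions may need a different way to obtain a Row encoder.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DataType, DoubleType, StructField, StructType}

// Hypothetical base class sketching the proposal; it is not part of Spark.
abstract class PartitionTransformer extends Transformer {

  // Called once per partition, so costly setup can run before rows are mapped.
  protected def createPartitionFunc: Iterator[Row] => Iterator[Row]

  // Option 1: override to supply the complete output schema.
  protected def outputSchema(input: StructType): Option[StructType] = None

  // Option 2: name and type of a single column appended to the input schema.
  protected def outputCol: String
  protected def outputDataType: DataType

  override def transformSchema(schema: StructType): StructType =
    outputSchema(schema).getOrElse {
      require(!schema.fieldNames.contains(outputCol),
        s"Output column $outputCol already exists.")
      StructType(schema.fields :+ StructField(outputCol, outputDataType, nullable = true))
    }

  override def transform(dataset: Dataset[_]): DataFrame = {
    val schema = transformSchema(dataset.schema)
    val func = createPartitionFunc // build the function value on the driver
    // Pass the row encoder explicitly so the result is a DataFrame with `schema`.
    dataset.toDF().mapPartitions(func)(RowEncoder(schema))
  }
}

// Hypothetical subclass: reads two columns and appends a "score" column.
class WeightedSum(override val uid: String) extends PartitionTransformer {
  def this() = this(Identifiable.randomUID("weightedSum"))

  override protected val outputCol: String = "score"
  override protected val outputDataType: DataType = DoubleType

  override protected def createPartitionFunc: Iterator[Row] => Iterator[Row] = { rows =>
    // Stand-in for an expensive object (e.g. a database connection),
    // created once per partition rather than once per row.
    val weights = Array(0.3, 0.7)
    rows.map { r =>
      val score = weights(0) * r.getAs[Double]("a") + weights(1) * r.getAs[Double]("b")
      Row.fromSeq(r.toSeq :+ score)
    }
  }

  override def copy(extra: ParamMap): WeightedSum = defaultCopy(extra)
}
```

With a DataFrame df that has Double columns "a" and "b", 
new WeightedSum().transform(df) would return the same rows with a "score" column 
appended. Keeping the expensive object inside the partition function, rather 
than in a field of the transformer, avoids having to serialize it to executors.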


