[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Everett Rush updated SPARK-27249:
---------------------------------
    Description:
It would be nice to have a developers' API for dataset transformers that need more than one column from a row (i.e., beyond UnaryTransformer) or that contain objects too expensive to initialize repeatedly in a UDF, such as a database connection.

Design:
Abstract class PartitionTransformer extends Transformer and defines the partition transformation function as Iterator[Row] => Iterator[Row].
NB: This parallels the UnaryTransformer createTransformFunc method.

When developers subclass this transformer, they can provide their own schema for the output Row, in which case the PartitionTransformer creates a row encoder and executes the transformation. Alternatively, the developer can set the output DataType and output column name; the PartitionTransformer class will then create a new schema and a row encoder, and execute the transformation.

  was:
It would be nice to have a developers' API for dataset transformers that need more than one column from a row (i.e., beyond UnaryTransformer) or that contain objects too expensive to initialize repeatedly in a UDF, such as a database connection.

Design:
Abstract class PartitionTransformer extends Transformer and defines the partition transformation function as Iterator[Row] => Iterator[Row].
NB: This parallels the UnaryTransformer createTransformFunc method.

When developers subclass this transformer, they can either provide their own schema for the output Row or set the output DataType and output column name. Then the PartitionTransformer class will create a new schema, a row encoder, and execute the transformation.
> Developers API for Transformers beyond UnaryTransformer
> -------------------------------------------------------
>
>                 Key: SPARK-27249
>                 URL: https://issues.apache.org/jira/browse/SPARK-27249
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.5.0
>            Reporter: Everett Rush
>            Priority: Minor
>              Labels: starter
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformers that need
> more than one column from a row (i.e., beyond UnaryTransformer) or that contain
> objects too expensive to initialize repeatedly in a UDF, such as a database
> connection.
>
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the UnaryTransformer createTransformFunc method.
>
> When developers subclass this transformer, they can provide their own schema
> for the output Row, in which case the PartitionTransformer creates a row
> encoder and executes the transformation. Alternatively, the developer can set
> the output DataType and output column name; the PartitionTransformer class will
> then create a new schema and a row encoder, and execute the transformation.
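
For illustration only, the design described in this issue might be sketched roughly as follows. This is not code from Spark or from the reporter: the member names `outputCol`, `outputDataType`, and `createPartitionFunc`, the example subclass `JoinColumns`, and its column names `first`, `second`, and `joined` are all assumptions made for the sketch. It also assumes a Spark 2.x classpath, where `RowEncoder(schema)` is available; later Spark versions obtain a row encoder differently.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

// Hypothetical sketch of the proposed abstract class; it does not exist in Spark.
abstract class PartitionTransformer extends Transformer {

  // Name and type of the appended column (assumed members, used by the default schema).
  def outputCol: String
  def outputDataType: DataType

  // Per-partition function, analogous in spirit to UnaryTransformer.createTransformFunc.
  // Anything built at the top of the returned function is created once per partition.
  protected def createPartitionFunc: Iterator[Row] => Iterator[Row]

  // Default schema: input columns plus one output column.
  // A subclass that wants full control can override this with its own schema.
  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField(outputCol, outputDataType, nullable = true))

  // Build a row encoder for the output schema and run the partition function.
  override def transform(dataset: Dataset[_]): DataFrame = {
    val outSchema = transformSchema(dataset.schema)
    val func = createPartitionFunc
    dataset.toDF().mapPartitions(func)(RowEncoder(outSchema))
  }
}

// Hypothetical subclass that reads two columns of each row, which
// UnaryTransformer cannot do, and appends their concatenation.
class JoinColumns(override val uid: String) extends PartitionTransformer {
  def this() = this(Identifiable.randomUID("joinColumns"))

  override val outputCol: String = "joined"
  override val outputDataType: DataType = StringType

  override protected def createPartitionFunc: Iterator[Row] => Iterator[Row] = { rows =>
    // Stand-in for an expensive resource such as a database connection:
    // it is initialized once here, not once per row.
    val separator = "-"
    rows.map { row =>
      val joined = row.getAs[String]("first") + separator + row.getAs[String]("second")
      Row.fromSeq(row.toSeq :+ joined)
    }
  }

  override def copy(extra: ParamMap): JoinColumns = defaultCopy(extra)
}
```

Under these assumptions, `new JoinColumns().transform(df)` on a DataFrame with string columns `first` and `second` would return the same rows with a third column, `joined`. Declaring the output column through plain abstract members keeps the sketch short; an actual implementation would more likely reuse Spark's `Param`-based column traits, as UnaryTransformer does.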