[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Everett Rush updated SPARK-27249:
---------------------------------
    Description:
It would be nice to have a developers' API for dataset transformations that need more than one column from a row (UnaryTransformer, by contrast, takes one input column and produces one output column) or that contain objects too expensive to initialize repeatedly in a UDF, such as a database connection.

Design:
Abstract class PartitionTransformer extends Transformer and defines the partition transformation function as Iterator[Row] => Iterator[Row].
NB: This parallels the UnaryTransformer createTransformFunc method.

When developers subclass this transformer, they can provide their own schema for the output Row, in which case the PartitionTransformer creates a row encoder and executes the transformation. Alternatively, the developer can set the output DataType and output column name. The PartitionTransformer class will then create a new schema and a row encoder, and execute the transformation.

  was:
It would be nice to have a developers' API for dataset transformers that need more than one column from a row(ie UnaryTransformer) or that contain objects too expensive to initialize repeatedly in a UDF such as a database connection.

Design:
Abstract class PartitionTransformer extends Transformer and defines the partition transformation function as Iterator[Row] => Iterator[Row]
NB: This parallels the UnaryTransformer createTransformFunc method

When developers subclass this transformer, they can provide their own schema for the output Row in which case the PartitionTransformer creates a row encoder and executes the transformation. Alternatively the developer can set output Datatype and output col name. Then the PartitionTransformer class will create a new schema, a row encoder, and execute the transformation.

> Developers API for Transformers beyond UnaryTransformer
> -------------------------------------------------------
>
>                 Key: SPARK-27249
>                 URL: https://issues.apache.org/jira/browse/SPARK-27249
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.5.0
>            Reporter: Everett Rush
>            Priority: Minor
>              Labels: starter
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that
> need more than one column from a row (UnaryTransformer, by contrast, takes
> one input column and produces one output column) or that contain objects too
> expensive to initialize repeatedly in a UDF, such as a database connection.
>
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the UnaryTransformer createTransformFunc method.
>
> When developers subclass this transformer, they can provide their own schema
> for the output Row, in which case the PartitionTransformer creates a row
> encoder and executes the transformation. Alternatively, the developer can set
> the output DataType and output column name. The PartitionTransformer class
> will then create a new schema and a row encoder, and execute the
> transformation.
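
For illustration only, the proposed design might look roughly like the sketch below. PartitionTransformer is not an existing Spark class; the method names outputSchema and createPartitionFunc, the SumColumns example, and its column names "a", "b" and "sum" are all assumptions made up for this sketch. It covers only the first variant (the subclass supplies the output schema). It also swaps the row encoder mentioned in the description for RDD.mapPartitions plus SparkSession.createDataFrame, so that it relies only on stable public calls.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical base class sketching the proposal; not part of Spark.
abstract class PartitionTransformer extends Transformer {

  // Schema of the rows emitted by the partition function (assumed hook).
  protected def outputSchema(inputSchema: StructType): StructType

  // Partition-level analogue of UnaryTransformer.createTransformFunc.
  protected def createPartitionFunc: Iterator[Row] => Iterator[Row]

  override def transformSchema(schema: StructType): StructType =
    outputSchema(schema)

  override def transform(dataset: Dataset[_]): DataFrame = {
    val outSchema = transformSchema(dataset.schema)
    // The function is built inside the closure, so any setup it performs
    // runs once per partition on the executor. The closure captures the
    // transformer itself, which is Serializable via Params.
    val rows = dataset.toDF().rdd.mapPartitions { part =>
      val f = createPartitionFunc
      f(part)
    }
    dataset.sparkSession.createDataFrame(rows, outSchema)
  }
}

// Hypothetical subclass that reads two columns and appends their sum.
class SumColumns(override val uid: String) extends PartitionTransformer {
  def this() = this(Identifiable.randomUID("sumColumns"))

  override protected def outputSchema(in: StructType): StructType =
    in.add(StructField("sum", DoubleType, nullable = false))

  override protected def createPartitionFunc: Iterator[Row] => Iterator[Row] = { rows =>
    // An expensive object (e.g. a database connection) would be opened
    // here, once per partition, rather than once per row as in a UDF.
    rows.map { r =>
      val s = r.getAs[Double]("a") + r.getAs[Double]("b")
      Row.fromSeq(r.toSeq :+ s)
    }
  }

  override def copy(extra: ParamMap): SumColumns = defaultCopy(extra)
}
```

Under these assumptions, `new SumColumns().transform(df)` on a DataFrame with Double columns "a" and "b" would return the same rows with an extra "sum" column. The second variant in the description (setting only an output DataType and column name) could be layered on top by deriving outputSchema from two Params.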