[
https://issues.apache.org/jira/browse/BEAM-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550095#comment-17550095
]
Danny McCormick commented on BEAM-14540:
----------------------------------------
This issue has been migrated to https://github.com/apache/beam/issues/21657
> Native implementation for serialized Rows to/from Arrow
> -------------------------------------------------------
>
> Key: BEAM-14540
> URL: https://issues.apache.org/jira/browse/BEAM-14540
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Brian Hulette
> Priority: P2
>
> With https://s.apache.org/batched-dofns (BEAM-14213), we want to encourage
> users to develop pipelines that process arrow data within the Python SDK, but
> communicating batches of data across SDKs or from SDK to Runner is left as
> future work. This means that when Arrow data is processed in the SDK, it must
> be converted to/from Rows for transmission over the Fn API. The current ideal
> Python execution looks like:
> 1. Read row-oriented data over the Fn API, deserialize with SchemaCoder
> 2. Buffer rows and construct an Arrow RecordBatch/Table object
> 3. Perform user computation(s)
> 4. Explode output RecordBatch/Table into rows
> 5. Serialize rows with SchemaCoder and write out over the Fn API
> Note that (1,2) and (4,5) will exist in every stage of the user's pipeline,
> and they'll also exist when Python transforms (e.g. dataframe read_csv) are
> used in other SDKs. We should improve performance for this hot path by making
> a native (cythonized) implementation for (1,2) and (4,5).
--
This message was sent by Atlassian Jira
(v8.20.7#820007)