[ https://issues.apache.org/jira/browse/BEAM-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030946#comment-17030946 ]
Steve Cosenza commented on BEAM-4379:
-------------------------------------

I'm currently evaluating the scalability and performance of Google Dataflow, and we need the ability to read splittable Parquet files. I created an initial POC based on "Splittable DoFn", but I have just learned that Dataflow "Does not yet support autotuning features of the Source API." [1] Additionally, the Beam docs state, "In some cases, implementing a {{Source}} might be necessary or result in better performance" [2].

Questions:
* Should I be targeting the BoundedSource API, and will I be able to submit a PR that changes the existing ParquetIO to use a BoundedSource?

Thanks,
Steve

[1] https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-what
[2] https://beam.apache.org/documentation/io/developing-io-overview/

> Make ParquetIO Read splittable
> ------------------------------
>
>                 Key: BEAM-4379
>                 URL: https://issues.apache.org/jira/browse/BEAM-4379
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-ideas, io-java-parquet
>            Reporter: Lukasz Gajowy
>            Priority: Major
>
> As the title stands - currently it is not splittable which is not optimal for
> runners that support splitting.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)