[ https://issues.apache.org/jira/browse/BEAM-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899959#comment-16899959 ]
Ryan Skraba commented on BEAM-4379:
-----------------------------------

I took a deep look at the reader furnished by the Parquet community -- it appears that the Hadoop jars are tightly integrated into the parts of the code that permit splitting. Specifically, I was unable to use Parquet 1.10.1 to list row groups in a parquet file (the unit for parallelizable splits) or to read records from a specific row group without including the Hadoop configuration classes and Hadoop filesystem.

It looks like there are a couple of possibilities to implement splittable parquet files:

1) Include the hadoop ecosystem in the ParquetIO component,
2) Rewrite the reader for an in-house version that doesn't require hadoop, or
3) Implement and use PARQUET-1126 from a future version of parquet.

> Make ParquetIO Read splittable
> ------------------------------
>
>                 Key: BEAM-4379
>                 URL: https://issues.apache.org/jira/browse/BEAM-4379
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-ideas, io-java-parquet
>            Reporter: Lukasz Gajowy
>            Priority: Major
>
> As the title stands - currently it is not splittable which is not optimal for
> runners that support splitting.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
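As a rough illustration of why option 2 in the comment is at least conceivable: the Parquet file layout itself does not depend on Hadoop. A file ends with the Thrift-encoded file metadata, a 4-byte little-endian metadata length, and the magic bytes `PAR1`. The sketch below uses only the Python standard library to locate those raw metadata bytes. It is not Beam or parquet-mr code, the function name `read_footer_bytes` is hypothetical, and it deliberately stops short of decoding the metadata. Actually listing row groups would still require a Thrift decoder, and encrypted-footer files are not handled.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes at both the start and end of a Parquet file


def read_footer_bytes(path):
    """Return the raw, still Thrift-encoded, file metadata of a Parquet file.

    Hypothetical helper for illustration only; it does not parse row groups.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Smallest possible file: leading magic + 4-byte length + trailing magic.
        if size < 12:
            raise ValueError("file too small to be Parquet")

        # The last 8 bytes hold the metadata length followed by the magic.
        f.seek(size - 8)
        tail = f.read(8)
        if tail[4:] != PARQUET_MAGIC:
            raise ValueError("missing trailing PAR1 magic")
        footer_len = struct.unpack("<I", tail[:4])[0]

        # The metadata sits immediately before the 8-byte tail.
        footer_start = size - 8 - footer_len
        if footer_start < 4:
            raise ValueError("footer length exceeds file size")
        f.seek(footer_start)
        return f.read(footer_len)
```

The row groups mentioned in the comment are described inside these metadata bytes, so a Hadoop-free reader would need to decode them before it could plan any splits.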