[ 
https://issues.apache.org/jira/browse/BEAM-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899959#comment-16899959
 ] 

Ryan Skraba commented on BEAM-4379:
-----------------------------------

I took a close look at the reader provided by the Parquet community. The 
Hadoop jars appear to be tightly integrated into the parts of the code that 
permit splitting.

Specifically, with Parquet 1.10.1 I was unable to list the row groups in a 
Parquet file (the unit for parallelizable splits) or to read records from a 
specific row group without including the Hadoop configuration classes and 
the Hadoop filesystem.
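
To make the row-group point concrete, here is a minimal sketch of how a 
splittable reader might pack row groups into splits once their offsets and 
sizes are known. It is not the Parquet or Beam API: RowGroup, RowGroupSplits 
and planSplits are hypothetical names, and the offsets and sizes are made-up 
values. Obtaining the real metadata is exactly the step that currently pulls 
in Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of row-group metadata; not a Parquet or Beam class.
final class RowGroup {
    final long startOffset; // byte offset of the row group within the file
    final long byteSize;    // size of the row group in bytes

    RowGroup(long startOffset, long byteSize) {
        this.startOffset = startOffset;
        this.byteSize = byteSize;
    }
}

final class RowGroupSplits {
    // Packs consecutive row groups into splits of at most targetBytes each.
    // A row group is never divided; an oversized one gets a split of its own.
    static List<List<RowGroup>> planSplits(List<RowGroup> rowGroups, long targetBytes) {
        List<List<RowGroup>> splits = new ArrayList<>();
        List<RowGroup> current = new ArrayList<>();
        long currentBytes = 0;
        for (RowGroup rg : rowGroups) {
            // Close the current split if adding this row group would exceed the target.
            if (!current.isEmpty() && currentBytes + rg.byteSize > targetBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(rg);
            currentBytes += rg.byteSize;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up metadata for three row groups.
        List<RowGroup> groups = Arrays.asList(
            new RowGroup(4, 100), new RowGroup(104, 100), new RowGroup(204, 150));
        List<List<RowGroup>> splits = planSplits(groups, 200);
        System.out.println(splits.size() + " splits"); // prints "2 splits"
    }
}
```

Each split could then be handed to a separate worker, which would read only 
the row groups assigned to it.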

It looks like there are three possible ways to implement splittable 
Parquet files:

1) Include the Hadoop ecosystem in the ParquetIO component,

2) Rewrite the reader as an in-house version that doesn't require Hadoop, or

3) Implement and use PARQUET-1126 from a future version of Parquet.

> Make ParquetIO Read splittable
> ------------------------------
>
>                 Key: BEAM-4379
>                 URL: https://issues.apache.org/jira/browse/BEAM-4379
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-ideas, io-java-parquet
>            Reporter: Lukasz Gajowy
>            Priority: Major
>
> As the title says: ParquetIO read is currently not splittable, which is not 
> optimal for runners that support splitting.



