[
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481552#comment-14481552
]
Mingyu Kim commented on PARQUET-241:
------------------------------------
I'm using ParquetInputFormat for my Spark job, and being able to control the
precise order is crucial for keeping the order information across persistence.
For example, if you sort your data in Spark and write it out to HDFS,
part-00001 will have the largest rows. However, if ParquetInputFormat reads the
part files in an arbitrary order, the sort order is not guaranteed when I read
those files back in.
Another thing is consistency. Because of the footer cache, if I call
.getSplits() with the same input multiple times, only the first call fails to
preserve the order, while the subsequent calls do preserve it (assuming all
footers fit in the cache). Since this is fairly easy to fix, I thought it was
worth fixing.
My preliminary commit is at
https://github.com/mingyukim/incubator-parquet-mr/commit/eb2b5fc3a0509a6df00cd0d75ad4f5a3ddd3589d.
I'll write a unit test, clean things up (e.g. the bug number is currently
wrong), and submit a PR soon.
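To illustrate the idea, here is a minimal sketch of reordering footers so they
follow the listing order. It is not the code in the commit above and it does
not use the real parquet-mr classes: the Footer class, the inListingOrder
method, and the plain String paths are hypothetical stand-ins chosen so the
example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FooterOrder {
    /** Hypothetical stand-in for a Parquet footer; only the file path matters here. */
    static final class Footer {
        final String path;

        Footer(String path) {
            this.path = path;
        }
    }

    /**
     * Returns the footers rearranged to match the order of listedPaths,
     * regardless of the order in which they were read or served from a cache.
     */
    static List<Footer> inListingOrder(List<String> listedPaths, List<Footer> footers) {
        // Index the footers by path so each listed path can be looked up directly.
        Map<String, Footer> byPath = new HashMap<>();
        for (Footer footer : footers) {
            byPath.put(footer.path, footer);
        }

        // Walk the listing in its original order and pick the matching footer.
        List<Footer> ordered = new ArrayList<>(listedPaths.size());
        for (String path : listedPaths) {
            Footer footer = byPath.get(path);
            if (footer == null) {
                throw new IllegalStateException("No footer found for " + path);
            }
            ordered.add(footer);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList(
                "part-00001.parquet", "part-00002.parquet", "part-00003.parquet");
        // Simulate footers coming back in an arbitrary order.
        List<Footer> loaded = Arrays.asList(
                new Footer("part-00003.parquet"),
                new Footer("part-00001.parquet"),
                new Footer("part-00002.parquet"));

        for (Footer footer : inListingOrder(listed, loaded)) {
            System.out.println(footer.path);
        }
    }
}
```

Because the result depends only on the listing order, repeated calls with the
same input would return the same order whether or not the footers were already
cached.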
> ParquetInputFormat.getFooters() should return in the same order as what
> listStatus() returns
> --------------------------------------------------------------------------------------------
>
> Key: PARQUET-241
> URL: https://issues.apache.org/jira/browse/PARQUET-241
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Mingyu Kim
>
> Because of how the footer cache is implemented, getFooters() returns the
> footers in a different order than what listStatus() returns.
> When I provided the URL
> "hdfs://.../part-00001.parquet,hdfs://.../part-00002.parquet,hdfs://.../part-00003.parquet",
> ParquetInputFormat.getSplits(), which internally calls getFooters(),
> returned the splits in the wrong order.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)