[
https://issues.apache.org/jira/browse/PARQUET-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhishek Dixit updated PARQUET-2464:
------------------------------------
Description:
The toParquetMetadata method in ParquetMetadataConverter converts
org.apache.parquet.hadoop.metadata.ParquetMetadata to
org.apache.parquet.format.FileMetaData, but it does not set the dictionary
page offset bit in FileMetaData.
When a FileMetaData object is serialized while writing the footer and then
deserialized, the dictionary page offset is lost because its bit was never
set.
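To illustrate why an unset bit loses the value, here is a minimal sketch of how an optional field with a "set" flag behaves across a round trip. It assumes the generated metadata class writes an optional field only when its flag is on. The class IssetSketch, the nested ColumnMeta, and the map-based serialize/deserialize are hypothetical stand-ins, not the real Thrift-generated code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IssetSketch {
    /** Hypothetical stand-in for a generated struct with one optional field. */
    static final class ColumnMeta {
        private long dictionaryPageOffset;
        private boolean dictionaryPageOffsetSet; // models the "set" bit

        void setDictionaryPageOffset(long offset) {
            this.dictionaryPageOffset = offset;
            this.dictionaryPageOffsetSet = true;
        }

        boolean isSetDictionaryPageOffset() {
            return dictionaryPageOffsetSet;
        }

        long getDictionaryPageOffset() {
            return dictionaryPageOffset;
        }

        /** Writes the optional field only when its bit is on. */
        Map<String, Long> serialize() {
            Map<String, Long> out = new LinkedHashMap<>();
            if (dictionaryPageOffsetSet) {
                out.put("dictionary_page_offset", dictionaryPageOffset);
            }
            return out;
        }

        /** Restores the field only if it was present in the serialized form. */
        static ColumnMeta deserialize(Map<String, Long> in) {
            ColumnMeta meta = new ColumnMeta();
            Long offset = in.get("dictionary_page_offset");
            if (offset != null) {
                meta.setDictionaryPageOffset(offset);
            }
            return meta;
        }
    }

    public static void main(String[] args) {
        // The setter was never called, so nothing is written and nothing comes back.
        ColumnMeta neverSet = new ColumnMeta();
        System.out.println(ColumnMeta.deserialize(neverSet.serialize()).isSetDictionaryPageOffset()); // false

        // The setter was called, so the offset survives the round trip.
        ColumnMeta set = new ColumnMeta();
        set.setDictionaryPageOffset(4L);
        System.out.println(ColumnMeta.deserialize(set.serialize()).getDictionaryPageOffset()); // 4
    }
}
```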
PARQUET-1850 tried to fix this, but the fix was only partial.
It calls setDictionary_page_offset only if EncodingStats are present:
{code:java}
if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
However, setDictionary_page_offset should also be called when EncodingStats
are not present but encodings are.
The check should use the hasDictionaryPage implementation in
ColumnChunkMetaData below:
{code:java}
public boolean hasDictionaryPage() {
  EncodingStats stats = getEncodingStats();
  if (stats != null) {
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
  }
  Set<Encoding> encodings = getEncodings();
  return (encodings.contains(PLAIN_DICTIONARY) ||
      encodings.contains(RLE_DICTIONARY));
}
{code}
So the new change in ParquetMetadataConverter should look like:
{code:java}
if (columnMetaData.hasDictionaryPage()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
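The difference between the current and the proposed condition can be sketched without any Parquet dependency. In the sketch below, DictionaryOffsetCheck, Stats, Column, and the reduced Encoding enum are hypothetical stand-ins that only mirror the method names quoted above. currentCheck models the PARQUET-1850 condition and proposedCheck models the hasDictionaryPage() fallback.

```java
import java.util.EnumSet;
import java.util.Set;

public class DictionaryOffsetCheck {
    // Hypothetical stand-in for the Encoding enum (subset only).
    enum Encoding { PLAIN, RLE, PLAIN_DICTIONARY, RLE_DICTIONARY }

    // Hypothetical stand-in for EncodingStats; only the two queries used here.
    static final class Stats {
        private final boolean dictionaryPages;
        private final boolean dictionaryEncodedPages;

        Stats(boolean dictionaryPages, boolean dictionaryEncodedPages) {
            this.dictionaryPages = dictionaryPages;
            this.dictionaryEncodedPages = dictionaryEncodedPages;
        }

        boolean hasDictionaryPages() { return dictionaryPages; }
        boolean hasDictionaryEncodedPages() { return dictionaryEncodedPages; }
    }

    // Hypothetical stand-in for the column chunk metadata; stats may be null.
    static final class Column {
        private final Stats stats;
        private final Set<Encoding> encodings;

        Column(Stats stats, Set<Encoding> encodings) {
            this.stats = stats;
            this.encodings = encodings;
        }

        Stats getEncodingStats() { return stats; }

        // Same logic as the hasDictionaryPage() method quoted in the description.
        boolean hasDictionaryPage() {
            if (stats != null) {
                return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
            }
            return encodings.contains(Encoding.PLAIN_DICTIONARY)
                || encodings.contains(Encoding.RLE_DICTIONARY);
        }
    }

    // Condition described as the partial fix: requires stats to be present.
    static boolean currentCheck(Column c) {
        return c.getEncodingStats() != null && c.getEncodingStats().hasDictionaryPages();
    }

    // Proposed condition: falls back to the encodings when stats are absent.
    static boolean proposedCheck(Column c) {
        return c.hasDictionaryPage();
    }

    public static void main(String[] args) {
        Column noStats = new Column(null, EnumSet.of(Encoding.RLE, Encoding.RLE_DICTIONARY));
        System.out.println(currentCheck(noStats));  // false: offset would not be set
        System.out.println(proposedCheck(noStats)); // true: offset would be set
    }
}
```

One side effect visible in this sketch: when stats are present, the proposed condition also requires hasDictionaryEncodedPages(), so it is slightly stricter than the current one in that case.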
was:
The toParquetMetadata method in ParquetMetadataConverter converts
org.apache.parquet.hadoop.metadata.ParquetMetadata to
org.apache.parquet.format.FileMetaData, but it does not set the dictionary
page offset bit in FileMetaData.
When a FileMetaData object is serialized while writing the footer and then
deserialized, the dictionary page offset is lost because its bit was never
set.
PARQUET-1850 tried to fix this, but the fix was only partial.
It calls setDictionary_page_offset only if EncodingStats are present:
{code:java}
if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
However, setDictionary_page_offset should also be called when EncodingStats
are not present but encodings are.
The check should use the hasDictionaryPage implementation in
ColumnChunkMetaData below:
{code:java}
public boolean hasDictionaryPage() {
  EncodingStats stats = getEncodingStats();
  if (stats != null) {
    // ensure there is a dictionary page and that it is used to encode data pages
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
  }
  Set<Encoding> encodings = getEncodings();
  return (encodings.contains(PLAIN_DICTIONARY) ||
      encodings.contains(RLE_DICTIONARY));
}
{code}
So the new change in ParquetMetadataConverter should look like:
{code:java}
if (columnMetaData.hasDictionaryPage()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
{code}
> toParquetMetadata method in ParquetMetadataConverter does not set dictionary
> page offset bit
> --------------------------------------------------------------------------------------------
>
> Key: PARQUET-2464
> URL: https://issues.apache.org/jira/browse/PARQUET-2464
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: Abhishek Dixit
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)