[
https://issues.apache.org/jira/browse/PARQUET-71?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114671#comment-14114671
]
Dmitriy V. Ryaboy commented on PARQUET-71:
------------------------------------------
Here is the "writeDictionaryPage" code from ParquetFileWriter:
{code}
  /**
   * writes a dictionary page page
   * @param dictionaryPage the dictionary page
   */
  public void writeDictionaryPage(DictionaryPage dictionaryPage) throws IOException {
    state = state.write();
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page: " + dictionaryPage.getDictionarySize() + " values");
    currentChunkDictionaryPageOffset = out.getPos();
    int uncompressedSize = dictionaryPage.getUncompressedSize();
    int compressedPageSize = (int)dictionaryPage.getBytes().size(); // TODO: fix casts
    metadataConverter.writeDictionaryPageHeader(
        uncompressedSize,
        compressedPageSize,
        dictionaryPage.getDictionarySize(),
        dictionaryPage.getEncoding(),
        out);
    long headerSize = out.getPos() - currentChunkDictionaryPageOffset;
    this.uncompressedLength += uncompressedSize + headerSize;
    this.compressedLength += compressedPageSize + headerSize;
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page content " + compressedPageSize);
    dictionaryPage.getBytes().writeAllTo(out);
    currentEncodings.add(dictionaryPage.getEncoding());
  }
{code}
So compressedLength grows by compressedPageSize + headerSize. The header is
only a few bytes. compressedPageSize is just dictionaryPage.getBytes().size().
DictionaryPage.getUncompressedSize() returns bytes.size(), which is the same
value as compressedPageSize.
So there is really almost no difference between the compressed and
uncompressed size of the dictionary, because no compression is applied to it.
You are right that there's a typo, though: we should print out
dictionaryPage.getUncompressedSize() instead!
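
To make the accounting concrete, here is a minimal sketch of the same
bookkeeping outside of parquet-mr. The class name DictionaryPageAccounting and
its writePage method are hypothetical stand-ins, not part of ParquetFileWriter;
the header is an arbitrary byte array rather than a real Thrift page header.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for the writer's byte accounting; not a parquet-mr class.
class DictionaryPageAccounting {
    long uncompressedLength = 0;
    long compressedLength = 0;

    // Writes the header, then the page bytes, updating both counters the same
    // way the quoted writeDictionaryPage method does.
    void writePage(ByteArrayOutputStream out, byte[] header, byte[] pageBytes, int uncompressedSize) {
        long pageOffset = out.size();
        out.write(header, 0, header.length);
        long headerSize = out.size() - pageOffset;

        int compressedPageSize = pageBytes.length;
        uncompressedLength += uncompressedSize + headerSize;
        compressedLength += compressedPageSize + headerSize;

        out.write(pageBytes, 0, pageBytes.length);
    }

    public static void main(String[] args) {
        DictionaryPageAccounting acc = new DictionaryPageAccounting();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] header = new byte[5];      // assumed header size, for illustration only
        byte[] pageBytes = new byte[20];  // dictionary bytes, written as-is
        // The page is not compressed, so uncompressedSize == pageBytes.length.
        acc.writePage(out, header, pageBytes, pageBytes.length);
        System.out.println("uncompressedLength=" + acc.uncompressedLength
            + " compressedLength=" + acc.compressedLength);
    }
}
```

Because the page bytes are written as-is, both counters advance by the same
amount for the dictionary page.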
Now, this does raise the question of why we aren't feeding the dictionary page
through a compressor like Snappy or LZO, or whatever codec the user set as the
compression for this particular Parquet file.
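
As a rough illustration of what that could buy us, here is a sketch that
compresses dictionary-like bytes before measuring them. It uses
java.util.zip.Deflater purely as a stand-in because it ships with the JDK;
it is not Snappy or LZO, and the class DictionaryCompressionSketch is
hypothetical, not how parquet-mr wires in its codecs.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: Deflater stands in for whatever codec the file uses.
class DictionaryCompressionSketch {

    // Compresses the raw dictionary bytes and returns the compressed bytes.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the original bytes; uncompressedSize is what the page header would record.
    static byte[] decompress(byte[] compressed, int uncompressedSize) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[uncompressedSize];
        int off = 0;
        try {
            while (off < uncompressedSize && !inflater.finished()) {
                int n = inflater.inflate(result, off, uncompressedSize - off);
                if (n == 0) break; // no progress: input is truncated or malformed
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed dictionary", e);
        } finally {
            inflater.end();
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up dictionary content with a lot of shared prefixes.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) sb.append("user_agent_value_").append(i % 10);
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = compress(raw);
        System.out.println("uncompressedSize=" + raw.length
            + " compressedPageSize=" + compressed.length);
        System.out.println("round trip ok: "
            + Arrays.equals(raw, decompress(compressed, raw.length)));
    }
}
```

With a step like this in place, compressedPageSize and uncompressedSize would
actually differ, and the log message would have two distinct numbers to report.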
> column chunk page write store log message displays incorrect information
> ------------------------------------------------------------------------
>
> Key: PARQUET-71
> URL: https://issues.apache.org/jira/browse/PARQUET-71
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Ian Barfield
> Priority: Minor
>
> It is printing the size of the dictionary (in terms of the number of keys)
> twice and calling the second time the 'compressed byte count'. An accurate
> account of that number would be very helpful for accounting for disk space
> usage. The actual 'compressed byte count' is indeed calculated at a point
> near there so I am guessing this is a simple mistake.
> see:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152