[ 
https://issues.apache.org/jira/browse/PARQUET-71?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114671#comment-14114671
 ] 

Dmitriy V. Ryaboy commented on PARQUET-71:
------------------------------------------

Here is the "writeDictionaryPage" code from ParquetFileWriter:

{code}
  /**
   * writes a dictionary page
   * @param dictionaryPage the dictionary page
   */
  public void writeDictionaryPage(DictionaryPage dictionaryPage) throws IOException {
    state = state.write();
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page: " + dictionaryPage.getDictionarySize() + " values");
    currentChunkDictionaryPageOffset = out.getPos();
    int uncompressedSize = dictionaryPage.getUncompressedSize();
    int compressedPageSize = (int)dictionaryPage.getBytes().size(); // TODO: fix casts
    metadataConverter.writeDictionaryPageHeader(
        uncompressedSize,
        compressedPageSize,
        dictionaryPage.getDictionarySize(),
        dictionaryPage.getEncoding(),
        out);
    long headerSize = out.getPos() - currentChunkDictionaryPageOffset;
    this.uncompressedLength += uncompressedSize + headerSize;
    this.compressedLength += compressedPageSize + headerSize;
    if (DEBUG) LOG.debug(out.getPos() + ": write dictionary page content " + compressedPageSize);
    dictionaryPage.getBytes().writeAllTo(out);
    currentEncodings.add(dictionaryPage.getEncoding());
  }
{code}

So compressedLength grows by compressedPageSize + headerSize. The header is a 
few bytes, and compressedPageSize is just dictionaryPage.getBytes().size().

DictionaryPage.getUncompressedSize returns bytes.size(), which is the same 
thing as compressedPageSize. 
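
To make the accounting concrete, here is a minimal sketch of the two running 
totals. The class name, the 4096-byte dictionary and the 12-byte header are 
made-up values for illustration; only the arithmetic mirrors the method above.

```java
public class DictionaryPageAccounting {

    /**
     * Mimics the length bookkeeping in writeDictionaryPage.
     * Returns {uncompressedLength, compressedLength} for a single page.
     * pageBytes stands in for dictionaryPage.getBytes().size(), which feeds
     * both the "uncompressed" and the "compressed" size.
     */
    public static long[] account(int pageBytes, long headerSize) {
        int uncompressedSize = pageBytes;   // getUncompressedSize() -> bytes.size()
        int compressedPageSize = pageBytes; // (int) getBytes().size()

        long uncompressedLength = 0;
        long compressedLength = 0;
        uncompressedLength += uncompressedSize + headerSize;
        compressedLength += compressedPageSize + headerSize;
        return new long[] { uncompressedLength, compressedLength };
    }

    public static void main(String[] args) {
        // Hypothetical sizes: 4096 bytes of dictionary, 12 bytes of page header.
        long[] totals = account(4096, 12);
        System.out.println("uncompressed=" + totals[0] + " compressed=" + totals[1]);
        // prints "uncompressed=4108 compressed=4108"
    }
}
```

Both totals come out identical because they are derived from the same byte 
count plus the same header size.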

So really there's almost no difference between the compressed and uncompressed 
size for the dictionary, because no special compression is applied to it. You 
are right that there's a typo in the log message, though: we should print out 
dictionaryPage.getUncompressedSize() instead!

Now, this does raise the question of why we aren't feeding the dictionary page 
through a compressor like Snappy or LZO or whatever the user set as compression 
for this particular Parquet file. 
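
As a rough illustration of what that would change, the sketch below compresses 
a dictionary-like byte array so that the two sizes actually differ. It uses 
java.util.zip.Deflater from the JDK purely as a stand-in for the file's 
configured codec (it is not Snappy or LZO, and this is not how parquet-mr wires 
its compressors); the class name and the sample data are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class DictionaryCompressionSketch {

    /** Compresses the given bytes with the JDK's DEFLATE implementation. */
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Builds a repetitive byte array standing in for dictionary page bytes. */
    public static byte[] sampleDictionaryBytes() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("country=US;");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleDictionaryBytes();
        byte[] compressed = compress(raw);
        // With a compressor in the path, these two numbers are no longer equal.
        System.out.println("uncompressedSize=" + raw.length
                + " compressedPageSize=" + compressed.length);
    }
}
```

With something like this in the write path, the "compressed byte count" in the 
log would carry information that the uncompressed size does not.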

> column chunk page write store log message displays incorrect information
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-71
>                 URL: https://issues.apache.org/jira/browse/PARQUET-71
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Ian Barfield
>            Priority: Minor
>
> It is printing the size of the dictionary (in terms of the number of keys) 
> twice and calling the second time the 'compressed byte count'. An accurate 
> account of that number would be very helpful for accounting for disk space 
> usage. The actual 'compressed byte count' is indeed calculated at a point 
> near there so I am guessing this is a simple mistake.
> see:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java#L152



--
This message was sent by Atlassian JIRA
(v6.2#6252)
