[jira] [Commented] (ORC-305) Add column statistics for the size on disk

Owen O'Malley (JIRA) Wed, 14 Mar 2018 08:56:23 -0700

    [ https://issues.apache.org/jira/browse/ORC-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398796#comment-16398796 ]


Owen O'Malley commented on ORC-305:
-----------------------------------

[~smore] this is harder than I realized.

There are two big problems.

The first is that you need to update the byteCount in the stripeColStatistics 
before the stripe statistics are saved in the middle of 
TreeWriterBase.writeStripe (around line 250). Unfortunately, that code runs 
before the TreeWriters have flushed their streams, so the stream sizes are not 
final yet. To fix that, I’d suggest that we split TreeWriter.writeStripe into 
two parts:

{code}
void flushStreams() throws IOException;
void writeStripe(int requiredIndexEntries) throws IOException;
{code}

So then WriterImpl.flushStripe() will call:

{code}
treeWriter.flushStreams();
treeWriter.writeStripe(requiredIndexEntries);
{code}

For TreeWriterBase, flushStreams() will contain the front part of writeStripe:

{code}
if (isPresent != null) {
  isPresent.flush();

  // if no nulls are found in a stream, then suppress the stream
  if (!foundNulls) {
    isPresentOutStream.suppress();
    // since isPresent bitstream is suppressed, update the index to
    // remove the positions of the isPresent stream
    if (rowIndex != null) {
      removeIsPresentPositions();
    }
  }
}
{code}

For IntegerTreeWriter, flushStreams() will contain:

{code}
super.flushStreams();
writer.flush();
{code}

The compound types will also need to flush their children in flushStreams().

With that change, all of the streams should be flushed before we reach the 
problematic part that saves the stripe statistics.
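
To make the ordering concrete, here is a rough sketch of the two-phase call 
with a compound type. All of the Sketch* classes are made up for illustration 
and are not the real ORC classes; they only record the order of the calls.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for TreeWriter; not the real ORC API.
interface SketchTreeWriter {
  void flushStreams() throws IOException;
  void writeStripe(int requiredIndexEntries) throws IOException;
}

// A leaf column (think IntegerTreeWriter) that only logs what it is asked to do.
class SketchLeafWriter implements SketchTreeWriter {
  private final String name;
  private final List<String> log;

  SketchLeafWriter(String name, List<String> log) {
    this.name = name;
    this.log = log;
  }

  @Override
  public void flushStreams() throws IOException {
    log.add("flush " + name);
  }

  @Override
  public void writeStripe(int requiredIndexEntries) throws IOException {
    log.add("write " + name);
  }
}

// A compound column (think struct) that flushes itself and then its children.
class SketchStructWriter implements SketchTreeWriter {
  private final String name;
  private final List<String> log;
  private final List<SketchTreeWriter> children = new ArrayList<>();

  SketchStructWriter(String name, List<String> log) {
    this.name = name;
    this.log = log;
  }

  void addChild(SketchTreeWriter child) {
    children.add(child);
  }

  @Override
  public void flushStreams() throws IOException {
    log.add("flush " + name);
    for (SketchTreeWriter child : children) {
      child.flushStreams();
    }
  }

  @Override
  public void writeStripe(int requiredIndexEntries) throws IOException {
    // In the real writer, this is where the stripe statistics would be saved.
    log.add("write " + name);
    for (SketchTreeWriter child : children) {
      child.writeStripe(requiredIndexEntries);
    }
  }
}

// Stand-in for the relevant part of WriterImpl.flushStripe().
class FlushOrderSketch {
  static void flushStripe(SketchTreeWriter root, int requiredIndexEntries)
      throws IOException {
    root.flushStreams();                     // phase 1: every stream is flushed
    root.writeStripe(requiredIndexEntries);  // phase 2: statistics are saved
  }
}
```

With a struct that has two children, every "flush" entry in the log comes 
before the first "write" entry, which is the property we need.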

The second problem is how to actually get the number of bytes in the streams 
for a column. Unfortunately, the TreeWriters don’t have the stream lengths. 
You’ll need to add a method to PhysicalWriter that returns the number of bytes 
in the streams for a given column:

{code}
long getFileBytes(int column);
{code}

The PhysicalFsWriter will need to add an implementation of getFileBytes that 
finds all of the streams for a given column number, ignores the streams that 
are suppressed, and returns the sum of the sizes.
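
Roughly, the summing logic could look like the sketch below. SketchStream and 
SketchPhysicalWriter are hypothetical stand-ins and not the real 
PhysicalFsWriter internals; the actual bookkeeping of streams per column will 
differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of one flushed stream; not a real ORC class.
class SketchStream {
  final int column;         // column id that owns the stream
  final long size;          // bytes the stream holds after flushing
  final boolean suppressed; // suppressed streams are not written to the file

  SketchStream(int column, long size, boolean suppressed) {
    this.column = column;
    this.size = size;
    this.suppressed = suppressed;
  }
}

// Hypothetical stand-in for the part of PhysicalFsWriter that tracks streams.
class SketchPhysicalWriter {
  private final List<SketchStream> streams = new ArrayList<>();

  void addStream(int column, long size, boolean suppressed) {
    streams.add(new SketchStream(column, size, suppressed));
  }

  // Sum the sizes of all non-suppressed streams that belong to the column.
  long getFileBytes(int column) {
    long total = 0;
    for (SketchStream stream : streams) {
      if (stream.column == column && !stream.suppressed) {
        total += stream.size;
      }
    }
    return total;
  }
}
```

For example, a column with a 100 byte data stream, a 40 byte dictionary stream, 
and a suppressed 8 byte present stream would report 140 bytes.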

Then, just before TreeWriterBase.writeStripe saves the stripe statistics, use 
context.getPhysicalWriter().getFileBytes(id) to get the number of bytes for 
this column in this stripe.
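
Putting the two pieces together, the sketch below shows why the byte count has 
to be read after the flush. BufferedSketchStream and ColumnWriterSketch are 
invented for this example (one stream per column, no compression) and only 
model the idea that buffered bytes are not counted until flush() is called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stream: written bytes stay pending until flush() is called.
class BufferedSketchStream {
  private long pending;
  private long flushed;

  void write(long bytes) {
    pending += bytes;
  }

  void flush() {
    flushed += pending;
    pending = 0;
  }

  long flushedBytes() {
    return flushed;
  }

  // Start over for the next stripe.
  void reset() {
    pending = 0;
    flushed = 0;
  }
}

// Hypothetical single-stream column writer following the two-phase protocol.
class ColumnWriterSketch {
  private final BufferedSketchStream data = new BufferedSketchStream();
  private final List<Long> savedByteCounts = new ArrayList<>();

  void write(long bytes) {
    data.write(bytes);
  }

  void flushStreams() {
    data.flush();
  }

  // Saves the per-stripe byte count; only correct if flushStreams() ran first.
  void writeStripe() {
    savedByteCounts.add(data.flushedBytes());
    data.reset();
  }

  List<Long> savedByteCounts() {
    return savedByteCounts;
  }
}
```

If writeStripe() runs without a preceding flushStreams(), the saved count is 0 
even though bytes were written, which is the bug the split is meant to avoid.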


> Add column statistics for the size on disk
> ------------------------------------------
>
>                 Key: ORC-305
>                 URL: https://issues.apache.org/jira/browse/ORC-305
>             Project: ORC
>          Issue Type: Test
>            Reporter: Owen O'Malley
>            Assignee: Sandeep More
>            Priority: Major
>
> It would be great to have the size on disk of each column.
> You can generate this by adding up the sizes of the dictionary and data 
> streams.
> It is only relevant at the stripe and file level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
