[ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267118#comment-15267118
 ] 

Owen O'Malley commented on HIVE-9660:
-------------------------------------

{quote}
Note that the run length blocks finish before CBs (ie RL first, then CB 
containing the RL), so the callbacks are actually reversed.
{quote}

They can happen in *either* order, but the length must be computed when the 
compression block finishes AFTER the RLE block has finished.

{quote}
For uncompressed, the main concern is that for exact boundaries, there will be 
too many calls.
{quote}

I don't understand this sentence. There will be one call per stream per row 
group, which is hardly a problem.

{quote}
You'd need to pass a callback per RG down to the RL writer (and in some cases 
there isn't even an RL writer, like double), but RL writer won't know when a RG 
ends. 
{quote}

The run length encoder doesn't perform the callback itself. Instead, when its 
RLE block is finished, it passes the same callback to the OutStream, to be 
invoked when the OutStream finishes the next compression block. Thus it is easy 
to guarantee that you only get called back when a compression block finishes 
after the RLE block finishes, which is the required condition. For cases where 
there isn't an RLE, the callback is put directly on the OutStream and it works 
exactly the same way.
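
To make the handoff concrete, here is a minimal sketch of the idea, not the 
actual ORC writer code. The names SketchOutStream, SketchRleWriter, 
notifyOnNextBlockEnd, requestEndOffset, finishRleBlock and 
finishCompressionBlock are all hypothetical, and the block lengths are made up 
for illustration. The only point is the ordering: the callback is parked on the 
RLE writer, moves to the stream when the RLE block ends, and fires when the 
next compression block ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

final class CallbackHandoffSketch {

    /** Hypothetical stand-in for a compressed stream; NOT the real ORC OutStream. */
    static final class SketchOutStream {
        private final List<LongConsumer> pending = new ArrayList<>();
        private long compressedBytes = 0;

        /** Register a callback to fire when the next compression block finishes. */
        void notifyOnNextBlockEnd(LongConsumer callback) {
            pending.add(callback);
        }

        /** Simulate finishing one compression block of the given (made-up) length. */
        void finishCompressionBlock(int blockLength) {
            compressedBytes += blockLength;
            List<LongConsumer> ready = new ArrayList<>(pending);
            pending.clear();
            for (LongConsumer callback : ready) {
                callback.accept(compressedBytes); // report the end offset so far
            }
        }
    }

    /** Hypothetical stand-in for a run length encoder sitting on top of the stream. */
    static final class SketchRleWriter {
        private final SketchOutStream out;
        private LongConsumer waiting;

        SketchRleWriter(SketchOutStream out) {
            this.out = out;
        }

        /** Called at the end of a row group; the callback is only parked here. */
        void requestEndOffset(LongConsumer callback) {
            waiting = callback;
        }

        /** When the RLE block ends, hand the parked callback to the stream. */
        void finishRleBlock() {
            if (waiting != null) {
                out.notifyOnNextBlockEnd(waiting);
                waiting = null;
            }
        }
    }

    /** Runs both cases and returns the end offsets that were reported. */
    static List<Long> demo() {
        List<Long> offsets = new ArrayList<>();
        SketchOutStream out = new SketchOutStream();
        SketchRleWriter rle = new SketchRleWriter(out);

        // Case 1: stream with an RLE writer.
        rle.requestEndOffset(offset -> offsets.add(offset));
        out.finishCompressionBlock(10); // RLE block still open: nothing fires
        rle.finishRleBlock();           // callback moves to the stream
        out.finishCompressionBlock(20); // first block end after RLE end: fires

        // Case 2: no RLE writer; the callback goes straight onto the stream.
        out.notifyOnNextBlockEnd(offset -> offsets.add(offset));
        out.finishCompressionBlock(5);

        return offsets;
    }
}
```

In this sketch the compression block that finishes while the RLE block is still 
open does not trigger the callback, so the order in which the two kinds of 
blocks finish does not matter to the caller.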


> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of 
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores number of 
> compressed buffers for each RG, or end offset, or something, to remove this 
> estimation magic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
