[ https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257127#comment-15257127 ]
Sergey Shelukhin commented on HIVE-9660:
----------------------------------------

That is pretty much it. There are some more detailed descriptions in the comments.

There are two complex bits. The first is the integer writers, which have their own separate caches: when accounting for a CB, one needs to be aware that, even though some RGs might be fully written, their values could still be in the integer writer's literals array (or a similar place), and not in this CB. The second is the string writer, which is logically simple (we save index entries as before, only this time we have to make sure, when writing stuff out, that we maintain a correct set of active RGs for those CB callbacks), but a little bit involved code-wise.

I'll look at the test failures. I think the last patch was supposed to pass all the tests before the rebase, so it is probably some stupid error.

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch,
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch,
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch,
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch,
> HIVE-9660.10.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number of
> compressed buffers for each RG, or the end offset, or something, to remove this
> estimation magic.
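
To illustrate the issue description, here is a rough sketch (not taken from any of the attached patches) of why an estimated end offset can read far more data than a stored one. The class name, method names, the 3-byte header constant, and the "two worst-case buffers" slop are all assumptions made for the example; the real reader logic and the eventual RowIndex layout may differ.

```java
// Hypothetical sketch: estimated vs. stored end offset of a row group (RG)
// inside one compressed stream. Not the actual Hive/ORC implementation.
final class RgEndOffsetSketch {
    // Assumed size of the per-buffer header in a compressed stream.
    static final int ASSUMED_HEADER_SIZE = 3;

    // Estimate: start of the next RG plus a worst-case slop of two full
    // compression buffers, capped at the stream length. The last RG has no
    // successor, so the whole remainder of the stream is read.
    static long estimatedEnd(long[] rgStarts, int rg, long streamLength, int bufferSize) {
        if (rg == rgStarts.length - 1) {
            return streamLength;
        }
        long slop = 2L * (ASSUMED_HEADER_SIZE + bufferSize);
        return Math.min(streamLength, rgStarts[rg + 1] + slop);
    }

    // Exact: the writer recorded the end offset per RG (for example in an
    // extra, hypothetical array next to the existing positions), so the
    // reader just looks it up.
    static long exactEnd(long[] rgEnds, int rg) {
        return rgEnds[rg];
    }

    public static void main(String[] args) {
        long[] starts = {0L, 1000L, 5000L};    // made-up RG start offsets
        long[] ends = {1200L, 5100L, 300000L}; // made-up recorded end offsets
        long streamLength = 300000L;
        int bufferSize = 256 * 1024;

        long estimated = estimatedEnd(starts, 0, streamLength, bufferSize);
        long exact = exactEnd(ends, 0);
        System.out.println("RG 0 estimated end: " + estimated);
        System.out.println("RG 0 exact end:     " + exact);
    }
}
```

With these made-up numbers the estimate for the first RG is capped at the full stream length, while the recorded end offset covers only a small prefix of the stream.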