> On Mar 30, 2018, at 8:37 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> 
> Ok, so what I'm trying is:
> * Move the dictionaries (the string contents and lengths) between the
> indexes and the data.

If we’re talking about moving stuff around, ideally, the index would be at the 
end of the stripe so you can execute a single IO to get the footer and indexes.

> * Remove the positions from the row indexes (we don't need them if we flush
> at the row group level)
> * Close the rle and compression after each row group

Are you talking about the list of “positions/offsets” that allow for 
resuming a stream in the middle?  If so, I think this change could be made 
today in a completely backwards-compatible way.  At each row-group boundary, 
simply force a stream flush; then the first position will contain a value (e.g., 
start reading the stream at byte x), and all the rest will be zero.
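To illustrate (this is a toy model of the row-index position entries, not the real ORC writer API): each entry is roughly (compressed-block offset, offset into the decompressed block, offset into the RLE run), and a forced flush at every row-group boundary resets the codec and RLE state, leaving only the byte offset non-zero.

```python
def row_group_positions(chunk_offsets, flush_at_boundary=True):
    """Toy model of ORC row-index position entries (illustrative names).

    Each entry: (byte offset, decompression offset, RLE-run offset).
    A forced flush at each row-group boundary resets the compression
    codec and RLE state, so the last two components are always zero.
    """
    assert flush_at_boundary  # the mid-block resume case is elided here
    return [(off, 0, 0) for off in chunk_offsets]
```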

> * Write the data streams for each of the column
>   - the streams are ordered as data, length, secondary, present

I thought this is what happens already.  If this is a change from what happens 
now, can you explain the win?

> So this has a few impacts:
> * We can read and process any row group by reading just the bytes for that
> row group.
>  - That enables a much better async io reader.
>  - We reduce the memory required to read a stripe to just the dictionaries
> and row group.
> * It also means that we could flush the row group to the file as we write.
>  - Less memory consumed by the writer
>  - We could use async io for writing.

If I understand this correctly, I think this might be the equivalent of making 
the stripe smaller.  Generally, I think of stripe-level layout as an IO 
optimization (i.e., skipping reads for sections) and row groups as a decoder 
optimization (i.e., skipping the decoding of non-useful data).  Sometimes the 
predicate pushdown is so precise that we only need to read a few streams, and 
row-group pruning turns into an IO win, but normally there are enough streams 
selected that the IO optimizer ends up reading the full streams anyway (e.g., a 
seek on disk costs about as much as reading ~1MiB of data, so you coalesce 
reads with a gap of less than ~1MiB to avoid the extra seek).
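That coalescing heuristic can be sketched as follows (the ~1MiB threshold is the rule of thumb from above; the function is illustrative, not any particular reader's IO planner):

```python
MAX_GAP = 1 << 20  # ~1MiB: roughly the cost of one disk seek (assumption)

def coalesce(ranges, max_gap=MAX_GAP):
    """Merge (offset, length) read requests whose gap is under max_gap.

    Sketch of the IO-optimizer behavior described above: two reads
    separated by less than a seek's worth of bytes become one read.
    """
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) < max_gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged
```

So two streams 100KiB apart are fetched in one IO, while a stream 5MiB away still gets its own read.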


-dain
