> On Mar 30, 2018, at 8:37 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> Ok, so what I'm trying is:
> * Move the dictionaries (the string contents and lengths) between the
> indexes and the data.
If we’re talking about moving stuff around, ideally the index would be at
the end of the stripe so you can execute a single IO to get the footer and
indexes.

> * Remove the positions from the row indexes (we don't need them if we flush
> at the row group level)
> * Close the rle and compression after each row group

Are you talking about the list of "positions/offsets" that allows for
resuming a stream in the middle? If so, I think this change could be made
today in a completely backwards compatible way. At each row group boundary,
simply force a stream flush; then the first index will contain a value
(e.g., start reading the stream at byte x), and all the rest will be zero.

> * Write the data streams for each of the columns
> - the streams are ordered as data, length, secondary, present

I thought this is what happens already. If this is a change from what
happens now, can you explain the win?

> So this has a few impacts:
> * We can read and process any row group by reading just the bytes for that
> row group.
> - That enables a much better async io reader.
> - We reduce the memory required to read a stripe to just the dictionaries
> and row group.
> * It also means that we could flush the row group to the file as we write.
> - Less memory consumed by the writer
> - We could use async io for writing.

If I understand this correctly, I think this might be the equivalent of
making the stripe smaller. Generally, I think about stripe-level layout as
IO optimizations (i.e., skipping reads for sections) and row groups as
decoder optimizations (i.e., skipping decoding of non-useful data).
Sometimes the predicate pushdown is so precise that we only need to read a
few streams, and row group pruning turns into an IO win, but normally there
are enough streams that the IO optimizer ends up reading the full streams
anyway (e.g., a seek on a disk is about as expensive as reading ~1MiB of
data, so you coalesce reads with a gap of less than ~1MiB to avoid the
extra seek).

-dain
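To make the coalescing point above concrete, here is a minimal sketch (not
any actual ORC or Presto reader code; the function name, range representation,
and tunable threshold are all illustrative): given the byte ranges of the
streams a query needs, merge any two ranges whose gap is smaller than ~1MiB,
trading a few wasted bytes for one fewer seek.

```python
# Illustrative sketch of read coalescing: a seek costs roughly as much as
# reading ~1MiB, so ranges closer together than that are merged into one IO.
GAP_THRESHOLD = 1 << 20  # ~1 MiB; assumed tunable in a real reader


def coalesce_ranges(ranges, gap=GAP_THRESHOLD):
    """Merge (offset, length) ranges whose gap is below `gap`.

    Returns a sorted list of merged (offset, length) ranges, each of
    which would be issued as a single read.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end < gap:
                # Gap is cheaper to read through than to seek over.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Two 64KiB streams ~100KiB apart merge into one read; a stream several
# MiB away stays a separate read.
print(coalesce_ranges([(0, 65536), (165836, 65536), (4 << 20, 65536)]))
```

With enough streams per stripe, most gaps fall under the threshold and the
reader ends up reading the full stripe region anyway, which is why row group
pruning is usually a decode win rather than an IO win.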