Hi Jan,

> However, this whole discussion is moot if we in the end decide we do not want extensibility in the first place. So, do we want it? And if so, what extension points do we want? Should we create a separate discussion thread for this as well?
Great questions. I started a new thread under "[DISCUSS] Extensibility of Parquet". In my mind, as you mention below, the discussion seemed mostly moot, since PMC members appeared to want to limit extensibility for core functionality, and I think the other aspects of extensibility are reasonably covered. But let's move the conversation to that thread.

Thanks,
Micah

On Tue, May 28, 2024 at 1:33 AM Jan Finis <jpfi...@gmail.com> wrote:

> Thanks Micah for driving the effort!
>
> One thing that I didn't see in your 3 points is a discussion about extensibility and, depending on how extensible we decide to be, a way to specify which features are used by a Parquet file.
>
> Somewhere down the discussion, it was mentioned that we can just use a bit set of all possible features and types, so any reader can do a quick comparison to check whether a file might use features it does not support.
>
> This whole approach would break down if we enabled extensibility, as the list of all possible features would then not be known statically. Therefore, we would need to introduce a mechanism by which new extended features can be marked without clashing with each other (e.g. a UUID per feature).
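>
> To make this concrete, here is a rough sketch of the kind of check I have in mind (all names are illustrative; none of this is actual Parquet metadata):
>
>     // Hypothetical feature-set metadata: a fixed bit set for core
>     // features plus a list of UUIDs for extension features.
>     use std::collections::HashSet;
>
>     // Bit positions for core features would be fixed by the spec.
>     const FEATURE_DELTA_ENCODING: u64 = 1 << 0;
>     const FEATURE_BLOOM_FILTERS: u64 = 1 << 1;
>
>     struct FileFeatures {
>         core: u64,                 // bit set of core features used
>         extensions: Vec<[u8; 16]>, // one UUID per extension feature used
>     }
>
>     // A reader can safely read the file iff every core bit and every
>     // extension UUID used by the file is one the reader supports.
>     fn can_read(file: &FileFeatures, reader_core: u64,
>                 reader_exts: &HashSet<[u8; 16]>) -> bool {
>         (file.core & !reader_core) == 0
>             && file.extensions.iter().all(|u| reader_exts.contains(u))
>     }
>
> The nice property of the plain bit set (a single O(1) mask check) is kept for core features, while extensions only cost a set lookup per UUID.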
>
> However, this whole discussion is moot if we in the end decide we do not want extensibility in the first place. So, do we want it? And if so, what extension points do we want? Should we create a separate discussion thread for this as well?
>
> Cheers,
> Jan
>
> On Tue, May 28, 2024 at 07:32, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > Just to follow up, conversations on the summary doc [1] have largely slowed down. In my mind I think of roughly three different tracks, and I'll start threads to get a sense of who is interested (please be on the lookout for discussion threads). As those conversations branch naturally, I think the different tracks can figure out a way to schedule in-person syncs as necessary.
> >
> > 1. Infrastructure/release issues. Goals of this track:
> >    - Define how we can get newer features adopted (hopefully some variant of what is proposed in the doc). One main blocker here seems to be having volunteers to do releases at the right cadence.
> >    - Improved testing infrastructure, both cross-implementation and for downstream consumers. I think Apache Arrow has done a lot of quality work in this space, so hopefully that can provide some inspiration.
> >    - Improved documentation on the feature matrix.
> >
> > 2. Improved footer metadata for wide schemas and many row groups. There have already been a few proposals (linked in the doc) and I will post them in the new thread, but won't do so here, to try to unify the discussion in one place in the new thread.
> >
> > 3. Improved encodings. I think there are a few areas to think about here:
> >    - Improved heuristics for selecting encodings (most people still only use V1 encodings).
> >    - Improved encodings themselves (an evaluation framework is necessary).
> >    - Better support for indexed/point lookups.
> >
> > Thanks everyone for a great conversation so far. Hopefully, we can continue to make progress.
> >
> > -Micah
> >
> > [1] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> >
> > On Wed, May 22, 2024 at 2:32 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > > *A row group can consist of one or many column chunks per column* (while before it could consist of only one).
> > >
> > > Yes, that would work. It's not a breaking change and makes good sense as an evolution.
> > >
> > > I was talking with Jacques Nadeau earlier and he mentioned another solution, which is to have support for a 2-pass writer. First you write the data with row groups. Then, in the second pass, you read the data one column at a time and write a file with only 1 row group. This works too, but I prefer your solution, as a two-pass write might be an inconvenience for some.
> > >
> > > > Multiple concurrent decoders can share the same page
> > >
> > > Yes, this is pretty much required. Coming from Arrow I'm pretty used to this kind of reference counting concept :). That being said, your overall point (a Lance reader requires a lot of complexity to get parallelism) is very true, but I don't know that this is a bad thing.
> > >
> > > > Page sharing assumes that you have a fast (hopefully O(1), at worst I'd say O(log n)) way to only read parts of a page. Whether this is possible depends on the encoding. E.g., Parquet has encodings where this is possible (PLAIN) and ones where this is not possible (RLE, DELTA_*, etc.). With these encodings, two parallel read threads would need to do duplicate work to read only parts of a page.
> > >
> > > For plain / bit-packed / ... pages this is easy, as you mention.
> > >
> > > RLE is a bit trickier (disclaimer: I have not implemented it yet) but should be doable. Lance has a decode loop that goes through each column and figures out which pages (or parts of pages) are needed for the next batch (more about this in [1]). The page decoders are stateful (they are created "full" by the scheduler and then slowly drained), so they keep a pointer into the page (e.g. a pointer into the runs, so we know which run we are currently decoding and how many values we've already taken from that run), and so it should still be O(1) in the RLE case.
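> > >
> > > A minimal sketch of that stateful decoder idea (illustrative only, not the actual Lance code):
> > >
> > >     // Stateful RLE page decoder that is drained incrementally.
> > >     // Resuming is O(1) because we remember which run we are in and
> > >     // how many values of it we have already emitted.
> > >     struct RleRun { value: u64, length: usize }
> > >
> > >     struct RlePageDecoder {
> > >         runs: Vec<RleRun>,   // decoded run headers of this page
> > >         run_idx: usize,      // run we are currently draining
> > >         taken_in_run: usize, // values already taken from that run
> > >     }
> > >
> > >     impl RlePageDecoder {
> > >         fn take(&mut self, mut n: usize, out: &mut Vec<u64>) {
> > >             while n > 0 && self.run_idx < self.runs.len() {
> > >                 let run = &self.runs[self.run_idx];
> > >                 let take = (run.length - self.taken_in_run).min(n);
> > >                 out.extend(std::iter::repeat(run.value).take(take));
> > >                 n -= take;
> > >                 self.taken_in_run += take;
> > >                 if self.taken_in_run == run.length {
> > >                     self.run_idx += 1;
> > >                     self.taken_in_run = 0;
> > >                 }
> > >             }
> > >         }
> > >     }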
> > >
> > > General compression / block compression, on the other hand, is pretty annoying :). I can't easily split the decode of a compressed buffer into two different tasks (shout-out to niyue, who's been looking into compression and helped me realize this). My current thinking for compression (and encryption) is to make them a layer in between the I/O and the decoder. So basically, I have a "decompression / decrypting" thread pool, and the flow is `schedule -> I/O -> decrypt/decompress -> decode`. One benefit here is that we effectively parallelize decompression / decryption across columns. One downside is that the thread that does the decryption / decompression is probably not the same thread that is going to do the decoding, and so we can't fuse decompression, decryption, decode, and first-query-pipeline into a single pipeline (we would have decrypt/decompress as one pipeline and decode + first-query-pipeline as the second).
> > >
> > > [1] https://blog.lancedb.com/splitting-scheduling-from-decoding/
> > >
> > > On Wed, May 22, 2024 at 12:42 PM Jan Finis <jpfi...@gmail.com> wrote:
> > >
> > > > Thanks Weston for the answers, very insightful! I appreciate your input very much.
> > > >
> > > > Some follow-ups :):
> > > >
> > > > > My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern.
> > > >
> > > > The fact that Lance V2 can still have multiple grains is interesting and so far was not clear in our discussion, thanks for the clarification! As you describe, Lance V2 allows for further "sub-pages" inside a page. Parquet does not have that. Thus, if we now just drop the restriction that Parquet pages (which - at least as of today - do not allow nesting of smaller sub-pages) need to be contiguous, we would end up with something that is less powerful than Lance V2. I'm wondering how we can best arrive at something that is as powerful as the concept in Lance V2 without throwing all Parquet concepts out of the window. Here is my first draft of how this could look:
> > > >
> > > > When you think it through in detail, a Lance V2 page is actually the equivalent of a Parquet column chunk, *not* of a Parquet page: as you describe, a Lance V2 page is the (preferred) unit of I/O, not the unit of encoding, and in Parquet the column chunk is the unit of I/O while the page is the unit of encoding. That is an important point to keep in mind in our discussion, as it might otherwise lead to a lot of misunderstandings.
> > > >
> > > > Thus, to make Parquet column chunks as powerful as Lance V2 pages, we must allow *multiple column chunks* per column inside a Parquet row group. This should give us the desired equivalent of the advantages of Lance V2. Thus, if we want this (and I think we do), we should make the following change for Parquet V3:
> > > >
> > > > *A row group can consist of one or many column chunks per column* (while before it could consist of only one).
> > > >
> > > > We should not say that Parquet pages do not need to be contiguous, as Parquet pages are *not* the grain of I/O but the grain of encoding.
> > > >
> > > > Do you agree with this, or did I get something wrong?
> > > >
> > > > (I see that Lance V2 is even more flexible than that. But I guess this is a good in-between solution that lets us leverage the advantages of "non-contiguity" without needing to make all of Parquet's concepts pluggable, which could be too big of a change.)
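> > > >
> > > > To make the proposed change concrete, a rough sketch of the metadata shapes (illustrative Rust structs, not the actual Parquet Thrift definitions):
> > > >
> > > >     // Illustrative only; the real metadata is defined in Thrift
> > > >     // and carries many more fields.
> > > >     struct ColumnChunkMeta {
> > > >         file_offset: u64, // where this chunk's pages start
> > > >         total_bytes: u64,
> > > >         num_values: u64,
> > > >     }
> > > >
> > > >     // Today: exactly one chunk per column per row group.
> > > >     struct RowGroupToday {
> > > >         columns: Vec<ColumnChunkMeta>, // columns[i] = chunk of column i
> > > >     }
> > > >
> > > >     // Proposed: one or many chunks per column per row group, so a
> > > >     // column can be split into several independent units of I/O.
> > > >     struct RowGroupProposed {
> > > >         columns: Vec<Vec<ColumnChunkMeta>>, // columns[i] = chunks of column i
> > > >     }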
> > > >
> > > > > We do parallelize inside a file.
> > > >
> > > > Interesting, but this comes with challenges. How do you solve these?
> > > >
> > > > As page boundaries are not necessarily aligned between columns, mini batches can start and end in the middle of pages. This comes with the following challenges:
> > > >
> > > > - Multiple concurrent decoders can share the same page (e.g., one decodes the first part of it and the other the last part). Thus, either we request it two times from storage, wasting I/O bandwidth, or we make readers share I/O requests. So we would need some kind of reference counting or a similar technique (see the sketch after this list) to check how many decoders still need a page that has been fetched from I/O before we can free the read memory. This is possible, but it increases the complexity of the I/O + decode stack considerably. Does the Lance V2 format presuppose a reader that is this clever? (This is not a real downside, as I am more than happy to implement complex I/O sharing code if it leads to a more efficient format; I just wanted to mention this trade-off.)
> > > > - Page sharing assumes that you have a fast (hopefully O(1), at worst I'd say O(log n)) way to read only parts of a page. Whether this is possible depends on the encoding. E.g., Parquet has encodings where this is possible (PLAIN) and ones where this is not possible (RLE, DELTA_*, etc.). With these encodings, two parallel read threads would need to do duplicate work to read only parts of a page. Also, black-box compression (LZ4, etc.) can only decompress a page as a whole. How do you solve this challenge in LanceDB?
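> > > >
> > > > Here is the sketch promised above, just to illustrate the kind of sharing I mean (in Rust, where Arc gives us the reference counting for free; the page memory is released when the last decoder drops its handle):
> > > >
> > > >     use std::sync::Arc;
> > > >
> > > >     struct FetchedPage { bytes: Vec<u8> }
> > > >
> > > >     // Each decoder gets a handle to the shared page plus the
> > > >     // half-open byte range it is responsible for decoding.
> > > >     struct DecoderTask { page: Arc<FetchedPage>, from: usize, to: usize }
> > > >
> > > >     // One I/O request, two consumers: split a fetched page between
> > > >     // two decoders at byte offset `mid`.
> > > >     fn split_page(page: FetchedPage, mid: usize) -> (DecoderTask, DecoderTask) {
> > > >         let len = page.bytes.len();
> > > >         let shared = Arc::new(page);
> > > >         (
> > > >             DecoderTask { page: Arc::clone(&shared), from: 0, to: mid },
> > > >             DecoderTask { page: shared, from: mid, to: len },
> > > >         )
> > > >     }
> > > >
> > > > The harder part is a cache of in-flight I/O requests, so that two schedulers asking for overlapping ranges wait on the same request instead of issuing it twice.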
> > > >
> > > > @Steve
> > > >
> > > > > Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail to Represent the Real World", can I get a preprint copy? As it's relevant here. My ACM library access is gone until I upload another 10 Banksy mural photos to wikimedia.
> > > >
> > > > This link should work without any library access: https://dl.acm.org/doi/pdf/10.1145/3209950.3209952 (I tried from my phone, which definitely doesn't have any ACM access). You might need to press the "PDF" button. Let me know if it doesn't work, then I'll try to find a pre-print.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Wed, May 22, 2024 at 16:10, Steve Loughran <ste...@cloudera.com.invalid> wrote:
> > > >
> > > > > On Tue, 21 May 2024 at 22:40, Jan Finis <jpfi...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Weston for posting here!
> > > > > >
> > > > > > I appreciate this a lot, as it gives us the opportunity to discuss modern formats in depth with the authors themselves, who probably know best the design trade-offs they took and can thus give us a deeper understanding of what certain features would mean for Parquet.
> > > > >
> > > > > This is becoming an interesting question.
> > > > >
> > > > > > I read both your linked posts. I read them with the mindset as if they were the documentation for a file format that I myself would need to add to our engine, so I always double-checked whether I would agree with your reasoning and where I would see problems in the implementation.
> > > > > >
> > > > > > I ended up with some points where I cannot yet follow your reasoning, or where I feel clarification would be good. It would be nice if you could go a bit into detail here:
> > > > > >
> > > > > > Regarding your "parallelism without row groups" post [2]:
> > > > > >
> > > > > > 1. Do I understand correctly that you basically replace row groups with files? Thus, the task of reading row groups in parallel boils down to reading files in parallel. Your post does *not* claim that the new format would be able to parallelize *inside* a row group/file, correct?
> > > > > >
> > > > > > 2. I do not fully understand what the proposed parallelism has to do with the file format. As you mention yourself, files and row groups are basically the same thing. As such, couldn't you do the same "Decode Based Parallelism" with Parquet as it is today? E.g., the file reader in our engine looks basically exactly like what you propose, employing what you call Mini Batches and not reading a row group as a whole (which could lead to out-of-memory in case a row group contains an insane number of rows, so it is a big no-no anyway for us). It seems that the shortcomings of the code listed in "Our First Parallel File Reader" are solely shortcomings of that code, not of the underlying format.
> > > > > >
> > > > > > Regarding [1]:
> > > > > >
> > > > > > 3. This one is mostly about understanding your rationales:
> > > > > >
> > > > > > As one main argument for abolishing row groups, you mention that sizing them well is hard (I fully agree!). But since you replace row groups with files, don't you have the same problem for the file again? Small row groups/files are bad due to small I/O requests and metadata explosion, agreed! So let's use bigger ones. Here you argue that Parquet readers will load the whole row group into memory and therefore suffer memory issues. This is a strawman IMHO, as it is just a shortcoming of the reader, not of the format. Nothing in the Parquet spec forces a reader to read a row group at once (and in fact, our implementation doesn't, for exactly the reasons you mentioned). Just like in Lance V2, Parquet readers can opt to read only a few pages ahead of the decoding.
> > > > >
> > > > > Parquet-mr now supports parallel GET requests on different ranges on S3; it'd be even better if object stores supported multiple range requests in a single GET. Doing all this reading in a single file is more efficient client-side than having different files open.
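> > > > >
> > > > > As a toy illustration of the pattern (parallel reads of distinct byte ranges of one open file; against S3 these would be concurrent ranged GETs on the same key):
> > > > >
> > > > >     use std::fs::File;
> > > > >     use std::os::unix::fs::FileExt; // read_exact_at, i.e. pread
> > > > >     use std::sync::Arc;
> > > > >     use std::thread;
> > > > >
> > > > >     // Fetch several byte ranges of one file in parallel.
> > > > >     fn read_ranges(path: &str, ranges: &[(u64, usize)])
> > > > >                    -> std::io::Result<Vec<Vec<u8>>> {
> > > > >         let file = Arc::new(File::open(path)?);
> > > > >         let handles: Vec<_> = ranges.iter().map(|&(offset, len)| {
> > > > >             let f = Arc::clone(&file);
> > > > >             thread::spawn(move || {
> > > > >                 let mut buf = vec![0u8; len];
> > > > >                 f.read_exact_at(&mut buf, offset)?; // Unix-only sketch
> > > > >                 Ok::<_, std::io::Error>(buf)
> > > > >             })
> > > > >         }).collect();
> > > > >         handles.into_iter().map(|h| h.join().unwrap()).collect()
> > > > >     }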
> > > > >
> > > > > > On the writing side, I see your point that a Lance V2 writer never has to buffer more than a page, and this is great! However, this seems to be just a result of allowing pages to not be contiguous, not of the fact that row groups were abolished. You could still support multiple row groups with non-contiguous pages and reap all the benefits you mention. Your post intermingles the two design choices "contiguous pages yes/no" and "row groups as horizontal partitions within a file yes/no". I would argue that the two features are basically fully orthogonal: you can have one without the other and vice versa.
> > > > > >
> > > > > > So, all in all, do I see correctly that your main argument here basically is "don't force pages to be contiguous!"? Doing away with row groups is just an added bonus for easier maintenance, as you can then just use files instead of row groups.
> > > > > >
> > > > > > 4. Considering contiguous pages and I/O granularity:
> > > > > >
> > > > > > The format basically proposes to have pages as the only granularity below a file (+ metadata & footer), while Parquet has two granularities: the row group, or rather the column chunk, and the page. You argue that a page in Lance V2 should basically be as big as is necessary for good I/O performance (say, 8 MiB for Amazon S3).
> > > > >
> > > > > Way bigger, please. Small files are so expensive on S3 whenever you start doing any directory operations, bulk copies, etc. Same for the other stores.
> > > > >
> > > > > Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail to Represent the Real World", can I get a preprint copy? As it's relevant here. My ACM library access is gone until I upload another 10 Banksy mural photos to wikimedia.
> > > > >
> > > > > Steve