Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-02 Thread Gang Wu
I have filed a JIRA: https://issues.apache.org/jira/browse/PARQUET-2253
Best,
Gang
On Thu, Mar 2, 2023 at 5:39 PM Patrick Hansert wrote:
> > This is by design. I guess it benefits sequential scan where the dictionary page is read first and then followed by its encoded indices in the data

Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-02 Thread Patrick Hansert
> This is by design. I guess it benefits sequential scan where the dictionary page is read first and then followed by its encoded indices in the data pages. Otherwise we need to seek anyway.
Good, then it shouldn't cause problems when putting the dictionary after all-null pages I think that

Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-01 Thread Gang Wu
> What are the reasons for forcing the dictionary to be the first page?
This is by design. I guess it benefits sequential scan where the dictionary page is read first and then followed by its encoded indices in the data pages. Otherwise we need to seek anyway.
> can this be changed to allow for

Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-03-01 Thread Patrick Hansert
Hi Gang,
thanks for your reply.
On 01.03.23 03:09, Gang Wu wrote:
> If at least one record in the beginning 2 rows is not null, then the encoded size will be much better.
That is the workaround I have been using for the past weeks, although my tests show that at least two values are

Re: Fallback Encoding for Very Sparse or Sorted Datasets

2023-02-28 Thread Gang Wu
Hi Patrick,
Thanks for reporting the issue! Let me try to answer your question in short.
1. In your case, the good data is dictionary-encoded [1] and the size of the dictionary is 1.
2. The RLE encoding [2] you have observed from the good data applies to the only indices after dictionary
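To illustrate the two points above, here is a minimal Python sketch of dictionary encoding followed by run-length encoding of the resulting indices. It is a simplified illustration only: the function names `dictionary_encode` and `run_length_encode` and the sample column are hypothetical, and Parquet's actual implementation uses an RLE/bit-packing hybrid and tracks nulls through definition levels rather than skipping them as done here.

```python
from itertools import groupby


def dictionary_encode(values):
    """Map each non-null value to an index into a list of distinct values.

    Hypothetical helper: nulls are simply skipped here, whereas Parquet
    records them separately in definition levels.
    """
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index in `dictionary`
    indices = []      # one dictionary index per non-null value
    for value in values:
        if value is None:
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive equal indices into (index, run_length) pairs."""
    return [(index, sum(1 for _ in run)) for index, run in groupby(indices)]


# A sparse column with a single distinct non-null value (assumed sample data).
column = [None, None, "x", "x", "x", None, "x"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
print(dictionary, indices, runs)
```

With a dictionary of size 1, every index is the same, so the index stream collapses into a single run. That is why such a column can encode very compactly once dictionary encoding is in effect.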

Fallback Encoding for Very Sparse or Sorted Datasets

2023-02-28 Thread Patrick Hansert
Hello everyone! First of all, I hope I'm in the right place. The contribution guidelines directed me here after I discovered the registration in the Jira tracker is closed. I'm a Ph.D. student at RPTU Kaiserslautern-Landau, and my current research revolves around sorting-based improvements to