siddharthteotia opened a new pull request #5256: Derive num docs per chunk from max column value length for varbyte raw index creator
URL: https://github.com/apache/incubator-pinot/pull/5256

As part of internal testing for text search, we found cases where a text column value is several hundred thousand characters long. This happens for a very small percentage of rows in the overall dataset.

The VarByteChunkWriter uses a fixed, hard-coded value of 1000 for the number of docs per chunk. It is better to derive this from metadata (the length in bytes of the longest value, available from column stats). For unusually large values of lengthOfLongestEntry (around 1 million), we were seeing int overflow, since the chunk size was computed as: 1000 * (lengthOfLongestEntry + 4-byte header offset). Furthermore, the compression buffer is allocated at twice this size to account for negative compression, so the computed capacity of the compression buffer became negative.

This PR derives the number of docs per chunk from lengthOfLongestEntry using a fixed target max chunk size of 1MB. The change is backward compatible since the number of docs per chunk is written in the file header.

**There is a tentative follow-up:** use long for the chunk offset array in the file header. Currently we use int. If most of the text column values are blob-like data, the total size of text data across all rows could exceed 2GB, so we need long to track the chunk offsets. This would be a backward-incompatible change requiring a new version of the chunk writer and reader.
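The overflow and the fix can be sketched as follows. This is a minimal illustration, not Pinot's actual code: the names `legacyChunkSize`, `deriveNumDocsPerChunk`, and the constants are hypothetical stand-ins for the logic the PR description outlines (hard-coded 1000 docs per chunk, 4-byte per-doc header offset, doubled compression buffer, 1MB target chunk size).

```java
public class ChunkSizing {
  // Previously hard-coded in the var-byte chunk writer.
  static final int LEGACY_NUM_DOCS_PER_CHUNK = 1000;
  // Per-doc offset entry in the chunk header (4 bytes, per the PR description).
  static final int DOC_HEADER_OFFSET_BYTES = 4;
  // Assumed target from the PR: cap each chunk at roughly 1MB.
  static final int TARGET_MAX_CHUNK_SIZE = 1024 * 1024;

  // Old computation: chunk size grows linearly with the longest entry, so a
  // multi-hundred-KB text value pushes it (or the 2x compression buffer) past
  // Integer.MAX_VALUE.
  static int legacyChunkSize(int lengthOfLongestEntry) {
    return LEGACY_NUM_DOCS_PER_CHUNK * (lengthOfLongestEntry + DOC_HEADER_OFFSET_BYTES);
  }

  // Fix: derive docs-per-chunk from the longest entry so the chunk size stays
  // bounded by the target, degrading to 1 doc per chunk for huge values.
  static int deriveNumDocsPerChunk(int lengthOfLongestEntry) {
    int bytesPerDoc = lengthOfLongestEntry + DOC_HEADER_OFFSET_BYTES;
    return Math.max(1, TARGET_MAX_CHUNK_SIZE / bytesPerDoc);
  }

  public static void main(String[] args) {
    int longest = 1_100_000; // e.g. a ~1MB text value

    // The doubled compression buffer capacity overflows int and goes negative.
    int compressionBufferCapacity = 2 * legacyChunkSize(longest);
    System.out.println("legacy 2x buffer capacity: " + compressionBufferCapacity);

    // With the derived value, chunk size is bounded by the target.
    int numDocs = deriveNumDocsPerChunk(longest);
    long chunkSize = (long) numDocs * (longest + DOC_HEADER_OFFSET_BYTES);
    System.out.println("derived docs per chunk: " + numDocs + ", chunk size: " + chunkSize);
  }
}
```

Because the number of docs per chunk is persisted in the file header, readers pick up whatever value the writer derived, which is why the change is backward compatible.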
