siddharthteotia opened a new pull request #5256: Derive num docs per chunk from max column value length for varbyte raw index creator
URL: https://github.com/apache/incubator-pinot/pull/5256

As part of internal testing for text search, we found cases where a text column value is several hundred thousand characters long. This happens for a very small percentage of rows in the overall dataset.

The VarByteChunkWriter uses a fixed, hard-coded value of 1000 for the number of docs per chunk. It is better to derive this from metadata (the length in bytes of the longest value, available from column stats). For unusually large values of lengthOfLongestEntry (around 1 million), we were seeing int overflow, since the chunk size was computed as: 1000 * (lengthOfLongestEntry + 4-byte header offset). Furthermore, the compression buffer is allocated at twice this size to account for negative compression, so the computed capacity of the compression buffer became negative.

This PR derives the number of docs per chunk from lengthOfLongestEntry using a fixed target max chunk size of 1MB. The change is backward compatible since the number of docs per chunk is written in the file header.

**There is a tentative follow-up:** use long for the chunk offset array in the file header. Currently we use int. If most of the text column values are blob-like data, the total size of text data across all rows could exceed 2GB, so we need long to track the chunk offsets. This would be a backward-incompatible change requiring a new version of the chunk writer and reader.
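The overflow and the fix can be sketched as follows. This is a minimal illustration, not Pinot's actual code: the names `legacyChunkSize`, `deriveNumDocsPerChunk`, and the constants are hypothetical stand-ins for the logic the PR description outlines (hard-coded 1000 docs per chunk, 4-byte per-doc header offset, doubled compression buffer, 1MB target chunk size).

```java
public class ChunkSizing {
  // Previously hard-coded in the var-byte chunk writer.
  static final int LEGACY_NUM_DOCS_PER_CHUNK = 1000;
  // Per-doc offset entry in the chunk header (4 bytes, per the PR description).
  static final int DOC_HEADER_OFFSET_BYTES = 4;
  // Assumed target from the PR: cap each chunk at roughly 1MB.
  static final int TARGET_MAX_CHUNK_SIZE = 1024 * 1024;

  // Old computation: chunk size grows linearly with the longest entry, so a
  // multi-hundred-KB text value pushes it (or the 2x compression buffer) past
  // Integer.MAX_VALUE.
  static int legacyChunkSize(int lengthOfLongestEntry) {
    return LEGACY_NUM_DOCS_PER_CHUNK * (lengthOfLongestEntry + DOC_HEADER_OFFSET_BYTES);
  }

  // Fix: derive docs-per-chunk from the longest entry so the chunk size stays
  // bounded by the target, degrading to 1 doc per chunk for huge values.
  static int deriveNumDocsPerChunk(int lengthOfLongestEntry) {
    int bytesPerDoc = lengthOfLongestEntry + DOC_HEADER_OFFSET_BYTES;
    return Math.max(1, TARGET_MAX_CHUNK_SIZE / bytesPerDoc);
  }

  public static void main(String[] args) {
    int longest = 1_100_000; // e.g. a ~1MB text value

    // The doubled compression buffer capacity overflows int and goes negative.
    int compressionBufferCapacity = 2 * legacyChunkSize(longest);
    System.out.println("legacy 2x buffer capacity: " + compressionBufferCapacity);

    // With the derived value, chunk size is bounded by the target.
    int numDocs = deriveNumDocsPerChunk(longest);
    long chunkSize = (long) numDocs * (longest + DOC_HEADER_OFFSET_BYTES);
    System.out.println("derived docs per chunk: " + numDocs + ", chunk size: " + chunkSize);
  }
}
```

Because the number of docs per chunk is persisted in the file header, readers pick up whatever value the writer derived, which is why the change is backward compatible.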
