yordan-pavlov opened a new pull request #1130:
URL: https://github.com/apache/arrow-rs/pull/1130


   # Which issue does this PR close?
   
   Closes #1111.
   
   # Rationale for this change
   As explained in #1111, two problems combine when `ArrowArrayReader` reads a 
dictionary-encoded page that contains NULLs. First, `RleDecoder`, as used in 
`VariableLenDictionaryDecoder`, incorrectly returns more keys than are actually 
available. Second, `VariableLenDictionaryDecoder` itself requests more keys than 
are available, because `num_values` includes NULLs. Together these cause the 
page to be decoded incorrectly, returning more values than it actually 
contains.
   
   # What changes are included in this PR?
   This PR contains:
   * a fix: the actual number of values (excluding NULLs) is calculated from 
the def levels (if present) and is used, instead of `num_values` from the data 
page, when creating the value decoder, so that the decoder knows how many 
values are actually available. Existing code in `VariableLenDictionaryDecoder` 
then uses this count to limit how many keys are requested from the nested 
`RleDecoder`.
   * a new test `test_arrow_array_reader_dict_enc_string` for `ArrowArrayReader`
   * a new test `test_complex_array_reader_dict_enc_string` for `ArrayReader`
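
The idea behind the fix can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `count_non_null_values` and `keys_to_request` are hypothetical, and it assumes a flat (non-repeated) column in which a slot holds a value only when its def level equals the column's maximum def level.

```rust
/// Hypothetical helper (not the PR's code): counts the values that are
/// physically present in a page, i.e. the slots that are not NULL.
/// Assumes a slot is non-null exactly when its def level equals
/// `max_def_level`. Without def levels, every slot holds a value, so the
/// page's `num_values` is returned unchanged.
fn count_non_null_values(
    def_levels: Option<&[i16]>,
    max_def_level: i16,
    num_values: usize,
) -> usize {
    match def_levels {
        Some(levels) => levels
            .iter()
            .filter(|&&level| level == max_def_level)
            .count(),
        None => num_values,
    }
}

/// Hypothetical helper: caps the number of dictionary keys requested from
/// the nested key decoder at the number of values still available.
fn keys_to_request(requested: usize, values_remaining: usize) -> usize {
    requested.min(values_remaining)
}
```

For example, a page with def levels `[1, 0, 1, 1, 0, 1]` and a maximum def level of 1 reports `num_values = 6` but holds only 4 encoded keys, so a request for 6 keys would be capped at 4.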
   
   # Are there any user-facing changes?
   No
   
   @alamb @tustvold 



