wankunde opened a new pull request, #2371:
URL: https://github.com/apache/orc/pull/2371

   
   
   
   ### What changes were proposed in this pull request?
   
   For large input rows, a stripe may become excessively large, requiring more 
memory for both reading and writing a single stripe. We can check the tree writer size 
in bytes and flush the stripe even when the input row count is less than 5000.
   
   ```log
   Stripes:
     Stripe: offset: 3 data: 347494188 rows: 5120 tail: 244 index: 2304
       Stream: column 0 section ROW_INDEX start: 3 length 12
       Stream: column 1 section ROW_INDEX start: 15 length 110
       Stream: column 2 section ROW_INDEX start: 125 length 893
       Stream: column 3 section ROW_INDEX start: 1018 length 31
       Stream: column 4 section ROW_INDEX start: 1049 length 65
       Stream: column 5 section ROW_INDEX start: 1114 length 923
       Stream: column 6 section ROW_INDEX start: 2037 length 25
       Stream: column 7 section ROW_INDEX start: 2062 length 155
       Stream: column 8 section ROW_INDEX start: 2217 length 28
       Stream: column 9 section ROW_INDEX start: 2245 length 31
       Stream: column 10 section ROW_INDEX start: 2276 length 31
       Stream: column 1 section DATA start: 2307 length 81853
       Stream: column 1 section LENGTH start: 84160 length 2191
       Stream: column 2 section DATA start: 86351 length 345862763
       Stream: column 2 section LENGTH start: 345949114 length 13736
       Stream: column 3 section DATA start: 345962850 length 22
       Stream: column 3 section LENGTH start: 345962872 length 6
       Stream: column 3 section DICTIONARY_DATA start: 345962878 length 5
       Stream: column 4 section PRESENT start: 345962883 length 200
       Stream: column 4 section DATA start: 345963083 length 6322
       Stream: column 4 section LENGTH start: 345969405 length 495
       Stream: column 4 section DICTIONARY_DATA start: 345969900 length 2919
       Stream: column 5 section DATA start: 345972819 length 1507883
       Stream: column 5 section LENGTH start: 347480702 length 7346
       Stream: column 6 section DATA start: 347488048 length 22
       Stream: column 6 section LENGTH start: 347488070 length 6
       Stream: column 6 section DICTIONARY_DATA start: 347488076 length 0
       Stream: column 7 section DATA start: 347488076 length 5795
       Stream: column 7 section LENGTH start: 347493871 length 301
       Stream: column 7 section DICTIONARY_DATA start: 347494172 length 2187
       Stream: column 8 section DATA start: 347496359 length 22
       Stream: column 8 section LENGTH start: 347496381 length 6
       Stream: column 8 section DICTIONARY_DATA start: 347496387 length 4
       Stream: column 9 section DATA start: 347496391 length 58
       Stream: column 9 section LENGTH start: 347496449 length 6
       Stream: column 9 section DICTIONARY_DATA start: 347496455 length 7
       Stream: column 10 section DATA start: 347496462 length 22
       Stream: column 10 section LENGTH start: 347496484 length 6
       Stream: column 10 section DICTIONARY_DATA start: 347496490 length 5
       Encoding column 0: DIRECT
       Encoding column 1: DIRECT_V2
       Encoding column 2: DIRECT_V2
       Encoding column 3: DICTIONARY_V2[1]
       Encoding column 4: DICTIONARY_V2[661]
       Encoding column 5: DIRECT_V2
       Encoding column 6: DICTIONARY_V2[1]
       Encoding column 7: DICTIONARY_V2[682]
       Encoding column 8: DICTIONARY_V2[1]
       Encoding column 9: DICTIONARY_V2[2]
       Encoding column 10: DICTIONARY_V2[1]
   ```
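
The idea can be illustrated with a minimal, self-contained sketch. It is not the code in this patch and does not use ORC's writer classes: `StripeFlushSketch`, its fields, the 64 MiB budget, and the per-row size estimate are all hypothetical names and values chosen for illustration. Setting `rowsPerCheck` to 5000 models a size check that only runs once enough rows have accumulated, and setting it to 1 models checking the buffered bytes on every row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; class and field names are illustrative, not ORC's API.
class StripeFlushSketch {
    private final long maxStripeBytes; // assumed byte budget for one stripe
    private final int rowsPerCheck;    // how many rows pass between size checks
    private long bufferedBytes = 0;
    private int bufferedRows = 0;
    private int rowsSinceCheck = 0;

    // Row count of every flushed stripe, recorded so the effect can be inspected.
    final List<Integer> stripeRowCounts = new ArrayList<>();

    StripeFlushSketch(long maxStripeBytes, int rowsPerCheck) {
        this.maxStripeBytes = maxStripeBytes;
        this.rowsPerCheck = rowsPerCheck;
    }

    // Buffer one row; flush when a size check finds the budget exceeded.
    void addRow(long estimatedRowBytes) {
        bufferedBytes += estimatedRowBytes;
        bufferedRows++;
        rowsSinceCheck++;
        if (rowsSinceCheck >= rowsPerCheck) {
            rowsSinceCheck = 0;
            if (bufferedBytes >= maxStripeBytes) {
                flushStripe();
            }
        }
    }

    // Flush whatever is still buffered when the writer is closed.
    void close() {
        if (bufferedRows > 0) {
            flushStripe();
        }
    }

    private void flushStripe() {
        stripeRowCounts.add(bufferedRows);
        bufferedBytes = 0;
        bufferedRows = 0;
    }

    public static void main(String[] args) {
        // Average row size taken from the stripe dump above (data bytes / rows).
        long rowBytes = 347_494_188L / 5120;
        long budget = 64L * 1024 * 1024; // assumed 64 MiB stripe budget

        StripeFlushSketch gated = new StripeFlushSketch(budget, 5000);
        StripeFlushSketch perRow = new StripeFlushSketch(budget, 1);
        for (int i = 0; i < 5120; i++) {
            gated.addRow(rowBytes);
            perRow.addRow(rowBytes);
        }
        gated.close();
        perRow.close();

        System.out.println("size checked every 5000 rows: " + gated.stripeRowCounts);
        System.out.println("size checked on every row:    " + perRow.stripeRowCounts);
    }
}
```

With the row-gated check, the buffered bytes can grow far past the budget before the first check runs. Checking the byte size directly bounds each stripe to roughly the budget plus one row.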
   
   ### Why are the changes needed?
   
   To reduce the memory needed to read and write stripes that contain large rows.
   
   ### How was this patch tested?
   
   Tested locally.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
