Jefffrey commented on issue #7186:
URL: https://github.com/apache/arrow-rs/issues/7186#issuecomment-4843376782

   I'm thinking along these lines:
   
   ```rust
   // default = all true
   struct MinifyOptions {
       // general
       minimize_all_buffers: bool,
       recursive: bool,
       // specialized
       zero_size_nulls: bool,
       compact_views: bool,
       propagate_null_mask: bool,
       deduplicate_dictionary: bool,
       compact_runs: bool,
       compact_dense_union: bool,
   }
   ```
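   
   To make the "default = all true" comment concrete, here is a minimal sketch 
of how that default could be expressed. The `Default` impl and the derives are 
my assumptions for illustration, not part of any existing arrow-rs API.
   
   ```rust
   /// Hypothetical options struct mirroring the proposal above.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct MinifyOptions {
       // general
       pub minimize_all_buffers: bool,
       pub recursive: bool,
       // specialized
       pub zero_size_nulls: bool,
       pub compact_views: bool,
       pub propagate_null_mask: bool,
       pub deduplicate_dictionary: bool,
       pub compact_runs: bool,
       pub compact_dense_union: bool,
   }
   
   impl Default for MinifyOptions {
       /// Every option is enabled unless the caller opts out.
       fn default() -> Self {
           Self {
               minimize_all_buffers: true,
               recursive: true,
               zero_size_nulls: true,
               compact_views: true,
               propagate_null_mask: true,
               deduplicate_dictionary: true,
               compact_runs: true,
               compact_dense_union: true,
           }
       }
   }
   ```
   
   A caller who wants to opt out of one behaviour could then write 
`MinifyOptions { recursive: false, ..Default::default() }`.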
   
   ### minimize_all_buffers
   
   If any buffer that the array holds (values, null buffer) is sliced, copy the 
sliced portion into a new buffer to get rid of unreferenced data. We can also do 
a minor optimization and change the null buffer from `Some` to `None` if the 
null count is 0 (which could happen after slicing).
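   
   A rough sketch of both ideas using plain std types rather than real Arrow 
buffers. `SlicedValues` and `minimize_validity` are hypothetical names, and the 
validity mask is modelled as `&[bool]` purely for readability.
   
   ```rust
   use std::sync::Arc;
   
   /// Toy stand-in for a sliced buffer: a shared allocation plus a window.
   #[derive(Debug, Clone)]
   pub struct SlicedValues {
       pub data: Arc<Vec<i32>>,
       pub offset: usize,
       pub len: usize,
   }
   
   impl SlicedValues {
       /// The portion of the allocation this array can actually see.
       pub fn as_slice(&self) -> &[i32] {
           &self.data[self.offset..self.offset + self.len]
       }
   
       /// Copy only the visible window into a fresh allocation.
       pub fn minimized(&self) -> Self {
           Self {
               data: Arc::new(self.as_slice().to_vec()),
               offset: 0,
               len: self.len,
           }
       }
   }
   
   /// Drop the validity mask entirely when no slot is null.
   pub fn minimize_validity(validity: Option<&[bool]>) -> Option<Vec<bool>> {
       match validity {
           Some(bits) if bits.iter().any(|valid| !valid) => Some(bits.to_vec()),
           _ => None,
       }
   }
   ```
   
   Note that `minimized` allocates a new buffer; the old allocation is only 
freed once every other reference to it is dropped.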
   
   ### recursive
   
   If the array is nested, apply the same minify options to all children. This 
could get a bit tricky, but I'll get to that below.
   
   ### zero_size_nulls
   
   For byte/list/map types, pretty much what 
https://github.com/apache/arrow-rs/pull/9970 is doing.
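   
   As an illustration only (I'm assuming "zero size" means a null slot should 
not occupy any bytes in the values buffer, and this is not taken from the 
linked PR), a toy variable-length byte array could be rebuilt like this. 
`ToyBytes` is a hypothetical model, not an arrow-rs type.
   
   ```rust
   /// Toy variable-length byte array: `offsets.len() == validity.len() + 1`.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct ToyBytes {
       pub offsets: Vec<usize>,
       pub data: Vec<u8>,
       pub validity: Vec<bool>,
   }
   
   /// Rebuild the array so that every null slot has length zero.
   pub fn zero_size_nulls(arr: &ToyBytes) -> ToyBytes {
       let mut offsets = Vec::with_capacity(arr.offsets.len());
       let mut data = Vec::new();
       offsets.push(0);
       for (i, &valid) in arr.validity.iter().enumerate() {
           if valid {
               // Only valid slots contribute bytes to the new buffer.
               data.extend_from_slice(&arr.data[arr.offsets[i]..arr.offsets[i + 1]]);
           }
           // For a null slot the end offset equals the start offset.
           offsets.push(data.len());
       }
       ToyBytes { offsets, data, validity: arr.validity.clone() }
   }
   ```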
   
   ### compact_views
   
   Perform gc on list/byte view types.
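   
   A simplified sketch of what gc means here: copy only the bytes that some 
view still references into one fresh buffer and rewrite the views to point at 
it. The `View` struct is a hypothetical toy model; it ignores details of the 
real view layout such as inlined short values.
   
   ```rust
   /// Toy view: a window into one of several shared data buffers.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct View {
       pub buffer: usize,
       pub offset: usize,
       pub len: usize,
   }
   
   /// Copy referenced bytes into a single new buffer and repoint the views.
   pub fn compact_views(views: &[View], buffers: &[Vec<u8>]) -> (Vec<View>, Vec<u8>) {
       let mut data = Vec::new();
       let new_views: Vec<View> = views
           .iter()
           .map(|v| {
               let offset = data.len();
               data.extend_from_slice(&buffers[v.buffer][v.offset..v.offset + v.len]);
               // Every rewritten view points into the single new buffer.
               View { buffer: 0, offset, len: v.len }
           })
           .collect();
       (new_views, data)
   }
   ```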
   
   ### propagate_null_mask
   
   For struct & fixed size list: for the null slots in these parent types we 
can try to rebuild the children to have nulls there too, ensuring the minimum 
possible memory footprint. For example, if the child is a string array, we can 
ensure it's null (or at least zero sized) where the parent has null slots.
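   
   The mask part of this could look roughly like the following, with validity 
modelled as `&[bool]`. The function name and the `list_size` parameter are 
assumptions for illustration: a struct child corresponds to `list_size == 1`, 
and a fixed size list child has `list_size` slots per parent slot.
   
   ```rust
   /// A child slot stays valid only if its parent slot is valid too.
   pub fn propagate_null_mask(parent: &[bool], child: &[bool], list_size: usize) -> Vec<bool> {
       assert!(list_size > 0, "list_size must be at least 1");
       assert_eq!(child.len(), parent.len() * list_size);
       child
           .iter()
           .enumerate()
           .map(|(i, &child_valid)| child_valid && parent[i / list_size])
           .collect()
   }
   ```
   
   Once the child carries the combined mask, rebuilding it with 
`zero_size_nulls` semantics would drop the bytes behind the parent's null slots.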
   
   ### deduplicate_dictionary
   
   For dictionaries, deduplicate the values and also remove any nulls from the 
values array (representing them at the key level instead).
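   
   A small sketch of that rewrite on a toy dictionary, where keys and values are 
plain vectors of `Option`s (hypothetical model, not the arrow-rs dictionary 
types). As a side effect of how it is written, values that no key references 
are dropped as well.
   
   ```rust
   use std::collections::HashMap;
   
   /// Returns (new keys, deduplicated non-null values).
   pub fn deduplicate_dictionary(
       keys: &[Option<usize>],
       values: &[Option<String>],
   ) -> (Vec<Option<usize>>, Vec<String>) {
       let mut new_values: Vec<String> = Vec::new();
       let mut index_of: HashMap<&str, usize> = HashMap::new();
       let new_keys: Vec<Option<usize>> = keys
           .iter()
           .map(|key| {
               // A null key stays null; a key pointing at a null value
               // becomes a null key.
               let value = values[(*key)?].as_deref()?;
               let idx = *index_of.entry(value).or_insert_with(|| {
                   new_values.push(value.to_string());
                   new_values.len() - 1
               });
               Some(idx)
           })
           .collect();
       (new_keys, new_values)
   }
   ```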
   
   ### compact_runs
   
   For run arrays, ensure we have minimum possible runs (in case we have an 
array with two runs of the same value in succession).
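   
   For instance, merging adjacent runs with equal values could be sketched like 
this on a toy run-end encoding (`run_ends[i]` is the exclusive logical end of 
run `i`). The function name and signature are made up for illustration.
   
   ```rust
   /// Merge neighbouring runs that carry the same value.
   pub fn compact_runs<T: PartialEq + Clone>(
       run_ends: &[i32],
       values: &[T],
   ) -> (Vec<i32>, Vec<T>) {
       assert_eq!(run_ends.len(), values.len());
       let mut new_ends: Vec<i32> = Vec::new();
       let mut new_values: Vec<T> = Vec::new();
       for (end, value) in run_ends.iter().zip(values) {
           if new_values.last() == Some(value) {
               // Same value as the previous run: extend that run instead.
               *new_ends.last_mut().expect("ends and values stay in sync") = *end;
           } else {
               new_ends.push(*end);
               new_values.push(value.clone());
           }
       }
       (new_ends, new_values)
   }
   ```
   
   Only adjacent duplicates are merged; a value that reappears after a 
different run still needs its own run.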
   
   ### compact_dense_union
   
   Ensure child arrays & offsets are as small as possible, similar to views.
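   
   One way to picture this, on a toy dense union where `type_ids[i]` selects a 
child and `offsets[i]` indexes into it: rebuild each child with only the 
entries that are actually referenced and renumber the offsets. This is a 
sketch with hypothetical names, with children fixed to `Vec<i64>` for brevity.
   
   ```rust
   /// Returns (new offsets, compacted children); `type_ids` are unchanged.
   pub fn compact_dense_union(
       type_ids: &[usize],
       offsets: &[usize],
       children: &[Vec<i64>],
   ) -> (Vec<usize>, Vec<Vec<i64>>) {
       assert_eq!(type_ids.len(), offsets.len());
       let mut new_children: Vec<Vec<i64>> = vec![Vec::new(); children.len()];
       let mut new_offsets = Vec::with_capacity(offsets.len());
       for (&child, &offset) in type_ids.iter().zip(offsets) {
           // The value lands at the current end of the rebuilt child.
           new_offsets.push(new_children[child].len());
           new_children[child].push(children[child][offset]);
       }
       (new_offsets, new_children)
   }
   ```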
   
   ## Interaction of options
   
   Each of the specialized options applies to a different set of array types, 
so they don't interact with each other (without considering recursion yet). The 
only option that affects them all is `minimize_all_buffers`. It would be 
possible to have `minimize_all_buffers: false` but still specify 
`zero_size_nulls: true`. This would still have to rebuild some buffers of the 
variable-length types, but not necessarily all of them (e.g. it can copy the 
null buffer as is), which is why the general option specifies **all** buffers.
   
   The recursive option is a bit tricky since some of the specialized options 
need to rebuild the children anyway, so we need to ensure the rebuilds happen 
in the proper order. E.g. if zero sizing nulls, we could zero size the nulls 
first and then minify the children again (rebuilding the children twice), or 
minify the children first and rely on that minification to automatically zero 
size the nulls, etc.
   
   ## Overall
   
   Personally I'm not sure how concerned we should be about the granularity of 
the controls on this minification behaviour; I assume most use cases would be 
fine with defaults 🤔
   
   In terms of overall memory footprint, the docstring should properly explain 
that whilst this can create an equivalent array with a smaller memory 
footprint, it does not necessarily reduce overall memory usage since it copies 
data into new buffers.


