[GitHub] [arrow-datafusion] mingmwang commented on issue #1570: Memory Limited GroupBy (Externalized / Spill)

via GitHub Wed, 26 Apr 2023 02:28:52 -0700


mingmwang commented on issue #1570:
URL: 
https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1523090560


   > > Can we make the `GroupState` and the Accumulator states serializable? 
With this approach, we do not need to sort anything when spilling data to disk. 
When we read the data back, we can quickly reconstruct the raw hash table from 
the hash values and indexes, because our hashmap is very lightweight. The hash 
value can be recalculated from the grouping rows, or we can cache it inside the 
`GroupState` to avoid recalculating it.
   > 
   > You still need disk spilling, no? Or where do you store the serialized 
state? Also, I guess that serialization may become a major bottleneck for some 
of the accumulators.
   
   Yes, we still need disk spilling. It can be managed and tracked by the 
`disk_manager` in the `RuntimeEnv`. Either way, this approach avoids sorting 
the entire group data or hash table before spilling.
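
   To make the idea concrete, here is a minimal sketch of it, not DataFusion's 
actual code. The `GroupState` below is a simplified, hypothetical stand-in (raw 
key bytes, a cached hash, and one SUM accumulator value), the byte layout is 
invented for illustration, and a `HashMap<u64, Vec<usize>>` stands in for the 
raw hash table of (hash, index) entries. A real spill would write the buffer to 
a file obtained from the disk manager instead of keeping it in memory.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified group entry: grouping key bytes, the cached
/// hash of that key, and a single SUM accumulator state.
#[derive(Debug, Clone, PartialEq)]
pub struct GroupState {
    pub group_key: Vec<u8>,
    pub hash: u64,
    pub sum: i64,
}

/// Hash a grouping key (assumption: any deterministic hasher would do).
pub fn hash_key(key: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Serialize the group states in their current order. No sort is needed.
/// Illustrative layout per group: hash (8) | sum (8) | key len (4) | key.
pub fn serialize(groups: &[GroupState]) -> Vec<u8> {
    let mut buf = Vec::new();
    for g in groups {
        buf.extend_from_slice(&g.hash.to_le_bytes());
        buf.extend_from_slice(&g.sum.to_le_bytes());
        buf.extend_from_slice(&(g.group_key.len() as u32).to_le_bytes());
        buf.extend_from_slice(&g.group_key);
    }
    buf
}

/// Copy `N` bytes starting at `pos` into a fixed-size array.
fn read_array<const N: usize>(buf: &[u8], pos: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[pos..pos + N]);
    out
}

/// Read the groups back and rebuild the (hash -> indexes) table from the
/// cached hashes, without rehashing or sorting the keys.
pub fn deserialize(buf: &[u8]) -> (Vec<GroupState>, HashMap<u64, Vec<usize>>) {
    let mut groups = Vec::new();
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
    let mut pos = 0;
    while pos < buf.len() {
        let hash = u64::from_le_bytes(read_array::<8>(buf, pos));
        let sum = i64::from_le_bytes(read_array::<8>(buf, pos + 8));
        let len = u32::from_le_bytes(read_array::<4>(buf, pos + 16)) as usize;
        let group_key = buf[pos + 20..pos + 20 + len].to_vec();
        pos += 20 + len;
        table.entry(hash).or_default().push(groups.len());
        groups.push(GroupState { group_key, hash, sum });
    }
    (groups, table)
}

/// Find the index of the group with the given key, if present.
pub fn lookup(
    groups: &[GroupState],
    table: &HashMap<u64, Vec<usize>>,
    key: &[u8],
) -> Option<usize> {
    table
        .get(&hash_key(key))?
        .iter()
        .copied()
        .find(|&i| groups[i].group_key.as_slice() == key)
}
```

   The groups go out in whatever order they sit in memory, and the table is 
rebuilt on read from the cached hashes alone. Whether serialization becomes a 
bottleneck for heavier accumulators is a separate question that this sketch 
does not address.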
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
