mingmwang commented on issue #1570:
URL: 
https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1523090560

   > > Can we make the `GroupState` and the Accumulator states serializable? 
With this approach, we do not need to sort anything when spilling data to disk. 
When we read the data back, we can quickly reconstruct our raw hash table from 
the hash values and indexes, because our hashmap is very lightweight. The hash 
value can be recalculated from the grouping rows, or we can cache the hash value 
inside the `GroupState` to avoid recalculating it.
   > 
   > You still need disk spilling, no? Or where do you store the serialized 
state? Also, I guess that serialization may become a major bottleneck for some 
of the accumulators.
   
   Yes, we still need disk spilling. The spilling can be managed and 
tracked by the `disk_manager` in the `RuntimeEnv`, but either way this approach 
avoids sorting the entire group data or hash table before spilling.
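
A rough sketch of the idea, to make the proposal concrete. Everything here is an assumption for illustration: this `GroupState` is a simplified stand-in (one SUM/COUNT accumulator, raw key bytes), not DataFusion's actual struct, and `serialize`, `deserialize`, `rebuild_index` and `lookup` are hypothetical helpers. In a real implementation the byte buffer would be written to a spill file obtained through the `disk_manager`; here it stays in memory.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified group state: the grouping key as raw bytes,
/// the cached hash of that key, and a single SUM/COUNT accumulator state.
#[derive(Debug, Clone, PartialEq)]
struct GroupState {
    hash: u64,
    group_key: Vec<u8>,
    sum: i64,
    count: u64,
}

impl GroupState {
    fn new(key: &[u8], sum: i64, count: u64) -> Self {
        GroupState { hash: hash_key(key), group_key: key.to_vec(), sum, count }
    }
}

/// Hash a grouping key. Only needed when the hash is not cached.
fn hash_key(key: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Serialize the states in their existing order: no sort is required.
fn serialize(states: &[GroupState]) -> Vec<u8> {
    let mut buf = Vec::new();
    for s in states {
        buf.extend_from_slice(&s.hash.to_le_bytes());
        buf.extend_from_slice(&(s.group_key.len() as u32).to_le_bytes());
        buf.extend_from_slice(&s.group_key);
        buf.extend_from_slice(&s.sum.to_le_bytes());
        buf.extend_from_slice(&s.count.to_le_bytes());
    }
    buf
}

/// Read a little-endian u64 at `pos` and advance `pos`.
fn read_u64(buf: &[u8], pos: &mut usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[*pos..*pos + 8]);
    *pos += 8;
    u64::from_le_bytes(bytes)
}

/// Read a little-endian u32 at `pos` and advance `pos`.
fn read_u32(buf: &[u8], pos: &mut usize) -> u32 {
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&buf[*pos..*pos + 4]);
    *pos += 4;
    u32::from_le_bytes(bytes)
}

/// Read the spilled states back, reusing the cached hash values.
fn deserialize(buf: &[u8]) -> Vec<GroupState> {
    let mut states = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        let hash = read_u64(buf, &mut pos);
        let key_len = read_u32(buf, &mut pos) as usize;
        let group_key = buf[pos..pos + key_len].to_vec();
        pos += key_len;
        let sum = read_u64(buf, &mut pos) as i64;
        let count = read_u64(buf, &mut pos);
        states.push(GroupState { hash, group_key, sum, count });
    }
    states
}

/// Rebuild the lightweight table: cached hash -> indexes into `states`.
fn rebuild_index(states: &[GroupState]) -> HashMap<u64, Vec<usize>> {
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();
    for (i, s) in states.iter().enumerate() {
        index.entry(s.hash).or_default().push(i);
    }
    index
}

/// Find the state for a grouping key, comparing keys to resolve collisions.
fn lookup<'a>(
    states: &'a [GroupState],
    index: &HashMap<u64, Vec<usize>>,
    key: &[u8],
) -> Option<&'a GroupState> {
    index
        .get(&hash_key(key))?
        .iter()
        .map(|&i| &states[i])
        .find(|s| s.group_key.as_slice() == key)
}
```

The point of the sketch is that the spilled bytes keep insertion order, and the table is rebuilt in one pass over the cached hashes. It says nothing about how expensive serialization is for more complex accumulators, which is the concern raised above.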
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org