mingmwang commented on issue #1570: URL: https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1523090560
> > Can we make the `GroupState` and the Accumulator states serializable? With this approach, we do not need to do any sorting when spilling data to disk. When we read the data back, we can quickly reconstruct the raw hash table from the hash values and indexes, because our hash map is very lightweight. The hash value can be recalculated from the grouping rows, or we can cache it inside the `GroupState` to avoid the recalculation.
>
> You still need disk spilling, no? Or where do you store the serialized state? Also, I guess that serialization may become a major bottleneck for some of the accumulators.

Yes, we still need disk spilling. It can be managed and tracked by the `disk_manager` in the `RuntimeEnv`. Either way, this approach avoids sorting the entire group data or hash table before spilling.
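To make the idea concrete, here is a minimal sketch of spilling serializable group states without sorting and rebuilding the hash table from cached hash values. Everything in it is an assumption for illustration: this `GroupState` struct is a simplified stand-in and not DataFusion's actual type, the accumulator is assumed to be a plain running sum, and the byte layout and the `spill` and `restore` helpers are hypothetical.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Write};

/// Hypothetical, simplified group state; NOT DataFusion's real `GroupState`.
#[derive(Debug, Clone, PartialEq)]
pub struct GroupState {
    /// Hash of the grouping row, cached so it need not be recomputed on reload.
    pub hash: u64,
    /// Encoded grouping row (the group key).
    pub group_row: Vec<u8>,
    /// Accumulator state; a running SUM is assumed here for simplicity.
    pub sum: i64,
}

/// Write every group state in its current order. No sort is performed.
pub fn spill<W: Write>(states: &[GroupState], out: &mut W) -> io::Result<()> {
    out.write_all(&(states.len() as u64).to_le_bytes())?;
    for s in states {
        out.write_all(&s.hash.to_le_bytes())?;
        out.write_all(&(s.group_row.len() as u32).to_le_bytes())?;
        out.write_all(&s.group_row)?;
        out.write_all(&s.sum.to_le_bytes())?;
    }
    Ok(())
}

/// Read the states back and rebuild a lightweight `hash -> indexes` table
/// directly from the cached hash values, without rehashing any grouping row.
pub fn restore<R: Read>(
    input: &mut R,
) -> io::Result<(Vec<GroupState>, HashMap<u64, Vec<usize>>)> {
    let mut b8 = [0u8; 8];
    let mut b4 = [0u8; 4];

    input.read_exact(&mut b8)?;
    let count = u64::from_le_bytes(b8) as usize;

    let mut states = Vec::new();
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new();

    for idx in 0..count {
        input.read_exact(&mut b8)?;
        let hash = u64::from_le_bytes(b8);

        input.read_exact(&mut b4)?;
        let len = u32::from_le_bytes(b4) as usize;
        let mut group_row = vec![0u8; len];
        input.read_exact(&mut group_row)?;

        input.read_exact(&mut b8)?;
        let sum = i64::from_le_bytes(b8);

        // Groups whose hashes collide share one bucket of indexes.
        table.entry(hash).or_default().push(idx);
        states.push(GroupState { hash, group_row, sum });
    }
    Ok((states, table))
}
```

In this sketch any `Write` or `Read` implementation can stand in for the spill file, for example a `Vec<u8>` in a test. In DataFusion the file itself would come from the `disk_manager` in the `RuntimeEnv`, as described above. The serialization cost raised in the quoted reply still applies: accumulators with larger state than a single sum would need a heavier encoding.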