stevenzwu edited a comment on issue #1383: URL: https://github.com/apache/iceberg/issues/1383#issuecomment-707225772
I am mainly talking about this in the context of the FLIP-27 source. Regardless of how we implement the enumeration, there are two main pieces of state that the enumerator needs to track and checkpoint:

1. the last snapshot for which enumeration/planning has been done
2. pending/unprocessed splits from previous discoveries/plannings

I was mainly concerned about the state size of the latter. That is where I was referring to throttling the eagerness of split planning. I was thinking about using `TableScan.useSnapshot(long snapshotId)` so that we can control how many snapshots' worth of splits we plan into state.

@openinx note that this is not keyed state (like after a `keyBy` on user_id). An 8 GB operator state can be problematic. I vaguely remember that RocksDB can't handle a list larger than 1 GB, and the bigger the list, the slower it gets. Also, if we do `planTasks` (vs `planFiles`), the number of splits can be a few times bigger. But I can definitely buy the point of starting with something simple and optimizing later. It would be an internal change to the enumerator, so it has no user impact.

@JingsongLi yeah, the key thing is how the coordinator/enumerator controls how the splits are generated. I was saying that we may need some control/throttling there to avoid eagerly enumerating all pending snapshots, so that the checkpointed split list stays manageable/capped. I thought the idea behind `TableScan.appendsBetween` was to run `planFiles` or `planTasks` between the last planned snapshot and the latest table snapshot; that is what I was referring to earlier as eager discovery/planning of all unconsumed splits. How are we scanning one snapshot at a time?
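
To make the throttling idea concrete, here is a minimal sketch (not actual enumerator code) of planning one snapshot per discovery cycle with `appendsBetween`, so the pending-split list the enumerator checkpoints stays roughly one snapshot's worth of splits. `IncrementalSplitPlanner` and `planNextSnapshot` are made-up names for illustration; only the `Table`/`TableScan` calls (`newScan`, `appendsBetween`, `planTasks`, `currentSnapshot`, `snapshot`, `parentId`) are real Iceberg APIs, and the sketch assumes the last planned snapshot is an ancestor of the current table snapshot.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

/**
 * Hypothetical helper: instead of planning everything between the last planned snapshot
 * and the current table snapshot in one shot, plan only the next unprocessed snapshot so
 * the pending-split list kept in enumerator state stays small.
 */
public class IncrementalSplitPlanner {

  private final Table table;

  public IncrementalSplitPlanner(Table table) {
    this.table = table;
  }

  /** Plans splits only for the snapshot that directly follows {@code lastPlannedSnapshotId}. */
  public List<CombinedScanTask> planNextSnapshot(long lastPlannedSnapshotId) {
    Long next = nextSnapshotAfter(lastPlannedSnapshotId);
    if (next == null) {
      return List.of(); // nothing new to plan
    }

    List<CombinedScanTask> tasks = new ArrayList<>();
    try (CloseableIterable<CombinedScanTask> planned =
        table.newScan()
            .appendsBetween(lastPlannedSnapshotId, next) // incremental scan over one snapshot
            .planTasks()) {
      planned.forEach(tasks::add);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tasks;
  }

  /** Walks back from the current snapshot to find the direct child of the last planned one. */
  private Long nextSnapshotAfter(long lastPlannedSnapshotId) {
    Snapshot snapshot = table.currentSnapshot();
    Long candidate = null;
    while (snapshot != null && snapshot.snapshotId() != lastPlannedSnapshotId) {
      candidate = snapshot.snapshotId();
      Long parentId = snapshot.parentId();
      snapshot = parentId == null ? null : table.snapshot(parentId);
    }
    return candidate;
  }
}
```

With something like this, the enumerator would checkpoint only the last planned snapshot id plus whatever splits from the current batch are still unassigned, which is what keeps the operator state capped.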
