stevenzwu edited a comment on issue #1383:
URL: https://github.com/apache/iceberg/issues/1383#issuecomment-707225772


   I was mainly discussing this in the context of the FLIP-27 source. Regardless 
of how we implement the enumeration, there are two pieces of info that the 
enumerator needs to track and checkpoint.
   
   1. last snapshot where enumeration/planning is done
   2. pending/unprocessed splits from previous discoveries/plannings
   
   I was mainly concerned about the state size for the latter. That is why I 
suggested throttling how eagerly splits are planned. I was thinking about 
using `TableScan.useSnapshot(long snapshotId)` so that we can control how many 
snapshots' worth of planned splits we keep in state. 
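
   As a minimal sketch of that throttling idea: the enumerator below keeps the 
two pieces of state listed above and plans one snapshot at a time until a cap 
is reached. All names here (`ThrottledEnumeratorSketch`, `SnapshotPlanner`, 
`maxPendingSnapshots`) are hypothetical and are not Iceberg or Flink APIs; 
`SnapshotPlanner` only stands in for a scan pinned to one snapshot, and the 
list of snapshot IDs in commit order is assumed to be supplied by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: none of these types are Iceberg or Flink classes.
final class ThrottledEnumeratorSketch {

    /** Splits planned for one snapshot; strings stand in for real split objects. */
    static final class PlannedSnapshot {
        final long snapshotId;
        final List<String> splits;

        PlannedSnapshot(long snapshotId, List<String> splits) {
            this.snapshotId = snapshotId;
            this.splits = splits;
        }
    }

    /** Stand-in for planning a single snapshot, e.g. a scan pinned to that snapshot. */
    interface SnapshotPlanner {
        List<String> planSplits(long snapshotId);
    }

    // The two pieces of state that would be checkpointed.
    private Long lastPlannedSnapshotId; // null until the first snapshot is planned
    private final Deque<PlannedSnapshot> pending = new ArrayDeque<>();

    private final int maxPendingSnapshots; // assumed throttle knob

    ThrottledEnumeratorSketch(int maxPendingSnapshots) {
        this.maxPendingSnapshots = maxPendingSnapshots;
    }

    /**
     * Plans snapshots that follow the last planned one, in commit order, and stops
     * once the cap is reached. Snapshot IDs are not assumed to be ordered, so the
     * position in the list is what defines "after". Returns how many were planned.
     */
    int discover(List<Long> snapshotIdsInCommitOrder, SnapshotPlanner planner) {
        boolean pastLastPlanned = (lastPlannedSnapshotId == null);
        int planned = 0;
        for (long id : snapshotIdsInCommitOrder) {
            if (!pastLastPlanned) {
                // Skip everything up to and including the last planned snapshot.
                pastLastPlanned = (id == lastPlannedSnapshotId.longValue());
                continue;
            }
            if (pending.size() >= maxPendingSnapshots) {
                break; // throttle: leave the remaining snapshots for a later call
            }
            pending.addLast(new PlannedSnapshot(id, planner.planSplits(id)));
            lastPlannedSnapshotId = id;
            planned++;
        }
        return planned;
    }

    /** Drops the oldest pending snapshot once its splits are done; null if none. */
    Long completeOldest() {
        PlannedSnapshot done = pending.pollFirst();
        if (done == null) {
            return null;
        }
        return done.snapshotId;
    }

    /** Number of splits currently held in (checkpointed) state. */
    int pendingSplitCount() {
        int count = 0;
        for (PlannedSnapshot snapshot : pending) {
            count += snapshot.splits.size();
        }
        return count;
    }

    Long lastPlannedSnapshotId() {
        return lastPlannedSnapshotId;
    }
}
```

   With a cap of 2, a second `discover` call plans nothing until 
`completeOldest` frees a slot, so the checkpointed split list stays bounded by 
the cap times the number of splits per snapshot.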
   
   Here are some additional benefits of enumerating splits snapshot by snapshot.
   * We can track and assign splits snapshot by snapshot in the same order as 
they were committed
   * We can publish metrics like the number of pending snapshots, lag (current 
time - oldest timestamp from uncompleted snapshot), etc.
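
   The two metrics above could be derived from the pending queue roughly as 
follows. This is only a sketch under assumptions: `EnumeratorMetricsSketch` 
and its methods are hypothetical names, and the commit timestamp is assumed to 
be recorded when a snapshot is planned (Iceberg's `Snapshot.timestampMillis()` 
would be one possible source).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: not an Iceberg or Flink class.
final class EnumeratorMetricsSketch {

    /** A planned snapshot whose splits have not all completed yet. */
    static final class PendingSnapshot {
        final long snapshotId;
        final long commitTimestampMillis;

        PendingSnapshot(long snapshotId, long commitTimestampMillis) {
            this.snapshotId = snapshotId;
            this.commitTimestampMillis = commitTimestampMillis;
        }
    }

    // Kept in commit order, so the head is always the oldest uncompleted snapshot.
    private final Deque<PendingSnapshot> pending = new ArrayDeque<>();

    void addPlanned(long snapshotId, long commitTimestampMillis) {
        pending.addLast(new PendingSnapshot(snapshotId, commitTimestampMillis));
    }

    void markOldestCompleted() {
        pending.pollFirst();
    }

    /** Metric 1: number of pending snapshots. */
    int pendingSnapshotCount() {
        return pending.size();
    }

    /** Metric 2: lag = current time - commit time of the oldest uncompleted snapshot. */
    long lagMillis(long nowMillis) {
        PendingSnapshot oldest = pending.peekFirst();
        if (oldest == null) {
            return 0L; // nothing pending, so no lag to report
        }
        return Math.max(0L, nowMillis - oldest.commitTimestampMillis);
    }
}
```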
   
   @openinx note that this is not keyed state, where state is distributed among 
parallel tasks. Here, 8 GB of operator state held by the enumerator can be 
problematic. I vaguely remember that RocksDB can't handle a list larger than 
1 GB, and the bigger the list, the slower it gets. Also, if we do `planTasks` 
(vs `planFiles`), the number of splits can be a few times bigger. I can 
definitely buy the point of starting with something simple and optimizing it 
later. It will be an internal change to the enumerator, so it has no user impact. 
   
   @JingsongLi  Yeah, the key thing is how the coordinator/enumerator controls 
how the splits are generated. I was saying that we may need some 
control/throttling there to avoid eagerly enumerating all pending snapshots, so 
that the checkpointed split list stays manageable/capped. I thought the idea of 
`TableScan.appendsBetween` was to run `planFiles` or `planTasks` between the 
last planned snapshot and the latest table snapshot. That is what I referred to 
earlier as eager discovery/planning of all unconsumed splits. 
   
   

