stevenzwu edited a comment on issue #1383:
URL: https://github.com/apache/iceberg/issues/1383#issuecomment-707225772


   I am mainly talking about this in the context of the FLIP-27 source. 
Regardless of how we implement the enumeration, there are two main pieces of 
state that the enumerator needs to track and checkpoint.
   
   1. last snapshot where enumeration/planning is done
   2. pending/unprocessed splits from previous discoveries/plannings
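
   A minimal sketch of these two pieces of state, assuming splits can be 
represented as plain strings. `EnumeratorState` and its members are 
hypothetical names for illustration, not Iceberg or Flink API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the two pieces of state the enumerator checkpoints.
final class EnumeratorState {
    // 1. ID of the last snapshot for which planning completed
    //    (-1 is used here to mean "nothing planned yet").
    private final long lastPlannedSnapshotId;
    // 2. Splits that were discovered but not yet handed to readers.
    private final List<String> pendingSplits;

    EnumeratorState(long lastPlannedSnapshotId, List<String> pendingSplits) {
        this.lastPlannedSnapshotId = lastPlannedSnapshotId;
        // Defensive copy so the checkpointed view cannot change afterwards.
        this.pendingSplits =
            Collections.unmodifiableList(new ArrayList<>(pendingSplits));
    }

    long lastPlannedSnapshotId() {
        return lastPlannedSnapshotId;
    }

    List<String> pendingSplits() {
        return pendingSplits;
    }
}
```

   The snapshot ID is a single long, so only the pending split list can grow 
without bound.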
   
   I was mainly concerned about the state size of the latter. That is what I 
meant by throttling the eagerness of split planning. I was thinking about 
using `TableScan.useSnapshot(long snapshotId)` so that we can control how many 
snapshots' worth of splits we plan into state. 
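
   The throttling idea could look roughly like the sketch below. It is not 
tied to the Iceberg API: `ThrottledPlanner`, `maxPendingSplits`, and the 
`planSnapshot` callback are hypothetical, and the callback stands in for 
whatever per-snapshot scan is actually used. Splits are plain strings to keep 
the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.LongFunction;

// Hypothetical sketch: plan one snapshot at a time, and only while the
// pending split list is below a cap.
final class ThrottledPlanner {
    private final Deque<Long> unplannedSnapshots;   // oldest first
    private final Deque<String> pendingSplits = new ArrayDeque<>();
    private final LongFunction<List<String>> planSnapshot;
    private final int maxPendingSplits;
    private long lastPlannedSnapshotId = -1L;       // -1 = nothing planned yet

    ThrottledPlanner(List<Long> snapshotIds,
                     LongFunction<List<String>> planSnapshot,
                     int maxPendingSplits) {
        this.unplannedSnapshots = new ArrayDeque<>(snapshotIds);
        this.planSnapshot = planSnapshot;
        this.maxPendingSplits = maxPendingSplits;
    }

    // The cap is soft: the last snapshot planned may push the list over it,
    // but no further snapshot is planned until readers drain the list.
    void discover() {
        while (!unplannedSnapshots.isEmpty()
                && pendingSplits.size() < maxPendingSplits) {
            long snapshotId = unplannedSnapshots.pollFirst();
            pendingSplits.addAll(planSnapshot.apply(snapshotId));
            lastPlannedSnapshotId = snapshotId;
        }
    }

    // Hands the next split to a reader, or returns null if none is pending.
    String nextSplit() {
        return pendingSplits.pollFirst();
    }

    int pendingCount() {
        return pendingSplits.size();
    }

    long lastPlannedSnapshotId() {
        return lastPlannedSnapshotId;
    }
}
```

   With this shape, the checkpointed list is bounded by roughly the cap plus 
the splits of one snapshot, regardless of how far the table has advanced.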
   
   @openinx note that this is not keyed state (like after keyBy user_id etc.). 
8 GB of operator state can be problematic. I vaguely remember that RocksDB 
can't handle a list larger than 1 GB. The bigger the list, the slower it gets. 
Also, if we use `planTasks` (vs `planFiles`), the number of splits can be a few 
times bigger. But I can definitely buy the point of starting with something 
simple and optimizing it later. It will be an internal change to the 
enumerator, so it has no user impact. 
   
   @JingsongLi  Yeah, the key thing is how the coordinator/enumerator controls 
how the splits are generated. I was saying that we may need some 
control/throttling there to avoid eagerly enumerating all pending snapshots, so 
that the checkpointed split list stays manageable/capped. I thought the idea of 
`TableScan.appendsBetween` was to run `planFiles` or `planTasks` between the 
last planned snapshot and the latest table snapshot. That is what I referred to 
earlier as eager discovery/planning of all unconsumed splits. How are we 
scanning one snapshot at a time? 
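
   To make the distinction concrete, here is a small sketch contrasting the 
two discovery modes. `DiscoveryModes`, `planAllPending`, `planNext`, and the 
`plan` callback are hypothetical names and do not call the Iceberg API. 
Snapshots are identified by their position in an ordered list, and splits are 
plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.LongFunction;

// Hypothetical sketch of eager vs. one-snapshot-at-a-time discovery.
final class DiscoveryModes {
    private DiscoveryModes() {}

    // Eager: plan every snapshot after lastPlannedIndex in one pass, so all
    // unconsumed splits land in enumerator state at once.
    static List<String> planAllPending(List<Long> snapshotIds,
                                       int lastPlannedIndex,
                                       LongFunction<List<String>> plan) {
        List<String> splits = new ArrayList<>();
        for (int i = lastPlannedIndex + 1; i < snapshotIds.size(); i++) {
            splits.addAll(plan.apply(snapshotIds.get(i)));
        }
        return splits;
    }

    // Incremental: plan only the next unplanned snapshot, if there is one.
    static List<String> planNext(List<Long> snapshotIds,
                                 int lastPlannedIndex,
                                 LongFunction<List<String>> plan) {
        int next = lastPlannedIndex + 1;
        if (next >= snapshotIds.size()) {
            return Collections.emptyList();
        }
        return plan.apply(snapshotIds.get(next));
    }
}
```

   In the eager mode the state size scales with the backlog of snapshots; in 
the incremental mode it scales with the size of a single snapshot.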
   
   

