[DISCUSS] Faster Hive incremental pull queries

Bhavani Sudha Saktheeswaran Sun, 19 May 2019 17:13:55 -0700

Hello all,

Hive incremental queries on Hoodie currently have a limitation: when a
datestr is not present, they list all partitions (.hoodie and the
partitions) and end up throwing away most of the files, since the
`_hoodie_commit_time` column filter excludes them. This can be very
expensive: it increases query planning time and sometimes causes timeouts
when the table is large. https://issues.apache.org/jira/browse/HUDI-25
tracks the issue.


If we can leverage the timeline and partitions touched by the commits
involved in incremental pull, then we can avoid listing all partitions and
hence reduce the query planning time. I am planning to send a HIP to
discuss this further. Please share your thoughts.
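
To make the idea concrete, here is a rough sketch in Python. It is not
Hoodie code: the timeline is modelled as a plain dict from commit time to
the partitions that commit wrote, and the function name and parameters
are made up for illustration. The point is only that the partitions to
list can be derived from the commits in the pull window instead of from
a full listing of the table.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the timeline: commit time -> partitions written
# by that commit. The real source would be the commit metadata under
# .hoodie; this dict shape is an assumption for illustration only.
Timeline = Dict[str, List[str]]


def partitions_for_incremental_pull(
    timeline: Timeline, start_commit_time: str, max_commits: int
) -> Set[str]:
    """Return only the partitions touched by the commits being pulled."""
    # Commit times are fixed-width timestamp strings, so lexicographic
    # order matches chronological order.
    commits = sorted(t for t in timeline if t > start_commit_time)[:max_commits]
    partitions: Set[str] = set()
    for commit in commits:
        partitions.update(timeline[commit])
    return partitions


timeline = {
    "20190517101500": ["2019/05/17"],
    "20190518101500": ["2019/05/17", "2019/05/18"],
    "20190519101500": ["2019/05/19"],
}

# Pull at most one commit after 20190517101500: only the two partitions
# written by 20190518101500 need to be listed.
print(sorted(partitions_for_incremental_pull(timeline, "20190517101500", 1)))
```

With a window of one commit, only the partitions written by that commit
are listed, and the rest of the table is never touched during planning.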

Thanks,
Sudha
