On 08/10/2019 18:33, Brian Candler wrote:
select * from events where data like '%foo%'
        and publish_time between '2019-01-01T12:00:00' and '2019-01-01T13:00:00';

Does Presto/Pulsar/BookKeeper only touch the segments where publish_time is within those boundaries?  Is there an index somewhere which records, for each segment, the lowest and highest publish_time it contains?

Ah, I found this listed under Phase 2 features at https://github.com/apache/pulsar/wiki/PIP-19:-Pulsar-SQL

4. Time boxed queries
   - When doing a query over a subset of the data, based on publish time, we should be able to only scan the relevant data instead of everything stored in the topic

So I guess this is an "upcoming feature".

(Aside: it occurs to me that if every closed segment published its minimum and maximum publish time on a meta-topic, that would be an efficient way to locate the segments of interest.  Or it could include min/max/bloom filter on user data too, like ORC <https://orc.apache.org/docs/indexes.html> does)
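
To make the aside concrete, here is a minimal Python sketch of the pruning step I have in mind. Nothing in it is an existing Pulsar or BookKeeper API: SegmentStats, segments_to_scan and the sample segments are all made up for illustration, and I am assuming each closed segment has published one min/max record that the query planner can read.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    """Hypothetical summary record published once per closed segment."""
    segment_id: int
    min_publish_time: datetime
    max_publish_time: datetime


def segments_to_scan(stats: List[SegmentStats],
                     start: datetime,
                     end: datetime) -> List[SegmentStats]:
    """Keep only segments whose [min, max] range overlaps [start, end]."""
    return [
        s for s in stats
        if s.min_publish_time <= end and s.max_publish_time >= start
    ]


# Made-up summaries, as they might be read back from the meta-topic.
stats = [
    SegmentStats(1, datetime(2019, 1, 1, 10, 0), datetime(2019, 1, 1, 11, 30)),
    SegmentStats(2, datetime(2019, 1, 1, 11, 30), datetime(2019, 1, 1, 12, 45)),
    SegmentStats(3, datetime(2019, 1, 1, 12, 45), datetime(2019, 1, 1, 14, 0)),
    SegmentStats(4, datetime(2019, 1, 1, 14, 0), datetime(2019, 1, 1, 15, 0)),
]

# Same time window as the query at the top of this message.
selected = segments_to_scan(
    stats,
    start=datetime(2019, 1, 1, 12, 0),
    end=datetime(2019, 1, 1, 13, 0),
)
print([s.segment_id for s in selected])  # prints [2, 3]
```

Only segments 2 and 3 would need to be read; the "like" filter on data would then run over just those. A bloom filter on user data could presumably be added to the same record and checked in the same pass.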
