On 08/10/2019 18:33, Brian Candler wrote:
select * from events where data like '%foo%'
and publish_time between '2019-01-01T12:00:00' and
'2019-01-01T13:00:00';
Does Presto/Pulsar/Bookkeeper only touch the segments where
publish_time is within those boundaries? Is there an index somewhere
which says for each segment what is the lowest and highest
publish_time it contains?
Ah, I found this listed under Phase 2 features at
https://github.com/apache/pulsar/wiki/PIP-19:-Pulsar-SQL
4. Time boxed queries
   When doing a query over a subset of the data, based on publish time,
   we should be able to only scan the relevant data instead of everything
   stored in the topic
So I guess this is an "upcoming feature".
(Aside: it occurs to me that if every closed segment published its
minimum and maximum publish time on a meta-topic, that would be an
efficient way to locate the segments of interest. It could also include
min/max values and bloom filters on user data, as ORC
<https://orc.apache.org/docs/indexes.html> does.)
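To make the aside a bit more concrete, here is a rough sketch in Python of
the pruning step I have in mind. It is not how Pulsar or BookKeeper work
today; the names SegmentStats and segments_to_scan are made up for
illustration, and it assumes each closed segment has already published one
record with its minimum and maximum publish time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    """Hypothetical per-segment record, as it might appear on a meta-topic."""
    segment_id: int
    min_publish_time: datetime
    max_publish_time: datetime


def segments_to_scan(index: List[SegmentStats],
                     start: datetime,
                     end: datetime) -> List[int]:
    """Return ids of segments whose [min, max] range overlaps [start, end]."""
    return [
        s.segment_id
        for s in index
        # Two closed intervals overlap when each starts before the other ends.
        if s.min_publish_time <= end and s.max_publish_time >= start
    ]


# Example index, as if read back from the (assumed) meta-topic.
index = [
    SegmentStats(1, datetime(2019, 1, 1, 10, 0), datetime(2019, 1, 1, 11, 30)),
    SegmentStats(2, datetime(2019, 1, 1, 11, 30), datetime(2019, 1, 1, 12, 45)),
    SegmentStats(3, datetime(2019, 1, 1, 12, 45), datetime(2019, 1, 1, 14, 0)),
    SegmentStats(4, datetime(2019, 1, 1, 14, 0), datetime(2019, 1, 1, 15, 0)),
]

hits = segments_to_scan(index,
                        datetime(2019, 1, 1, 12, 0),
                        datetime(2019, 1, 1, 13, 0))
print(hits)  # [2, 3]
```

For the query above, only segments 2 and 3 would need to be read; the
"like" filter would then run over just those segments rather than the
whole topic.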