On Fri, Dec 16, 2016 at 10:53 PM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> : Yep, that's what came up in my search. See how TTL works in HBase/
> : Cassandra/RocksDB <https://github.com/facebook/rocksdb/wiki/Time-to-Live>.
> : There isn't a "delete old docs" query; instead, old docs are deleted by
> : the storage engine when merging. Looks like this needs to be a Lucene
> : module which can then be configured by Solr?
>         ...
> : Just like in HBase, Cassandra, and RocksDB, when you "select" a
> : row/document that has expired, it still exists in storage but isn't
> : returned by the DB.
>
>
> What you're describing is exactly how segment merges work in Lucene; it's
> just a question of terminology.
>
> In Lucene, "deleting" a document is a *logical* operation, the data still
> lives in the (existing) segments but the affected docs are recorded in a
> list of deletions (and automatically excluded from future searchers that
> are opened against them) ... once the segments are merged then the deleted
> documents are "expunged" rather then being copied over to the new
> segments.
>
> Where this diverges from what you describe is that, as things stand in
> Lucene, something has to "mark" the segments as deleted in order for them
> to later be expunged -- in Solr right now, the code in question does this
> via an (internal) DBQ.
>
Note, it doesn't mark the "segment", it marks the "document".
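
To make that concrete, here's roughly what that lifecycle looks like
against a bare IndexWriter (a minimal sketch, not Solr's actual code
path):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    // Minimal sketch of Lucene's delete-then-merge lifecycle.
    static void deleteThenExpunge(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(new StandardAnalyzer()));

        // Logical delete: the doc is only marked in the segment's
        // live-docs bitmap; its bytes stay on disk until a merge
        // rewrites the segment.
        writer.deleteDocuments(new Term("id", "42"));
        writer.commit();  // searchers opened after this skip the doc

        // Merging "expunges" marked docs instead of copying them over:
        writer.forceMergeDeletes();  // explicit; background merges do it too
        writer.close();
    }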

>
> The dissatisfaction you expressed with this approach confuses me...
>
Really?
If you have many expiring docs, that's a lot of recurring delete work.

>
> >> I did some searching for TTL on Solr, and found only a way to do it
> >> with a delete query. But that ~sucks, because you have to do a lot of
> >> inserts (and queries).
>
> ...nothing about this approach does any "inserts" (or queries -- unless
> you mean the DBQ itself?), so w/o more elaboration on what exactly you
> find problematic about this approach, it's hard to make any sense of your
> objection or request for an alternative.
>
"For example, with the configuration below the
DocExpirationUpdateProcessorFactory will create a timer thread that wakes
up every 30 seconds. When the timer triggers, it will execute a
*deleteByQuery* command to *remove any documents* with a value in the
press_release_expiration_date field value that is in the past "
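
For reference, the configuration being quoted there (paraphrasing the
example in the DocExpirationUpdateProcessorFactory javadocs) is along
these lines:

    <updateRequestProcessorChain default="true">
      <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
        <int name="autoDeletePeriodSeconds">30</int>
        <str name="expirationFieldName">press_release_expiration_date</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>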


>
> With all those caveats out of the way...
>
> What you're ultimately requesting -- new code that hooks into segment
> merging to exclude "expired" documents from being copied into the new
> merged segments -- should be theoretically possible with a custom
> MergePolicy, but I don't really see how it would be better than the
> current approach in typical use cases (i.e.: I want docs excluded from
> results after the expiration date is reached, with a min tolerance of
> X) ...
>
I mentioned that the client would also have to make a range query, since
expired documents in this case would still be in the index.
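
For anyone curious what that merge-time hook would look like: the core of
it is a reader wrapper that reports expired docs as already-deleted, so a
merge never copies them into the new segment. A rough sketch -- the
"expire_at" numeric doc-values field (epoch millis) is a made-up name, and
this uses the iterator-style doc values API from newer Lucene releases:

    import java.io.IOException;
    import org.apache.lucene.index.CodecReader;
    import org.apache.lucene.index.FilterCodecReader;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.FixedBitSet;

    // Hides docs whose "expire_at" value is in the past, so that a merge
    // reading through this wrapper drops them like ordinary deletions.
    final class ExpirationFilterReader extends FilterCodecReader {
        private final FixedBitSet liveDocs;
        private final int numDocs;

        ExpirationFilterReader(CodecReader in, long nowMillis)
                throws IOException {
            super(in);
            FixedBitSet live = new FixedBitSet(in.maxDoc());
            Bits oldLive = in.getLiveDocs();
            NumericDocValues expireAt = in.getNumericDocValues("expire_at");
            int count = 0;
            for (int doc = 0; doc < in.maxDoc(); doc++) {
                if (oldLive != null && !oldLive.get(doc)) {
                    continue;  // already deleted the normal way
                }
                boolean expired = expireAt != null
                    && expireAt.advanceExact(doc)
                    && expireAt.longValue() <= nowMillis;  // TTL reached
                if (!expired) {
                    live.set(doc);
                    count++;
                }
            }
            this.liveDocs = live;
            this.numDocs = count;
        }

        @Override public Bits getLiveDocs() { return liveDocs; }
        @Override public int numDocs() { return numDocs; }
        @Override public CacheHelper getCoreCacheHelper() { return null; }
        @Override public CacheHelper getReaderCacheHelper() { return null; }
    }

The fiddly part is the remaining wiring: getting a MergePolicy to apply
this wrapper to the readers it merges. And it still only runs when a merge
actually happens -- which is Hoss's point 1) below.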

>
> 1) nothing would ensure that docs *ever* get removed during periods when
> docs aren't being added (thus no new segments, thus no merging)
>
This can be done with a periodic/smart thread that wakes up every TTL
interval and checks the min/max (or a histogram) of the timestamps in each
segment. If a segment holds a lot of expired docs, trigger a merge (or
just drop the segment outright if everything in it is dead). At least
that's how those systems do it.
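
If the expiry timestamp is indexed as a long point field, that per-segment
min/max is already sitting in the index, so the check reads no documents.
A small sketch (again with a made-up "expire_at" field, using the
per-field points API from newer Lucene releases):

    import java.io.IOException;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PointValues;
    import org.apache.lucene.util.NumericUtils;

    // A whole segment is dead if even its newest expiry is in the past.
    static boolean segmentFullyExpired(LeafReaderContext leaf, long nowMillis)
            throws IOException {
        PointValues points = leaf.reader().getPointValues("expire_at");
        if (points == null) {
            return false;  // segment has no expiry data: keep it
        }
        long newestExpiry =
            NumericUtils.sortableBytesToLong(points.getMaxPackedValue(), 0);
        return newestExpiry <= nowMillis;
    }

A periodic thread could scan segments with a check like this and feed the
fully-dead ones into a forced merge.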

>
> 2) as you described, query clients would be required to specify date range
> filters on every query to identify the "logically live docs at this
> moment" on a per-request basis -- something that's far less efficient from
> a caching standpoint than letting the system do a DBQ on the backend to
> affect the *global* set of logically live docs at the index level.
>
This makes sense. The deleted-docs bitset caches better than the range
query I suggested, since a filter anchored to the current time changes on
every request.
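
(The usual query-side mitigation, for what it's worth, is to round NOW so
the same filter is reusable across requests, e.g. something like

    fq=press_release_expiration_date:[NOW/MINUTE TO *]

since an unrounded NOW differs on every request and each filter entry
would be a cache miss.)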

>
>
> Frankly: it seems to me that you've looked at how other non-Lucene-based
> systems X & Y handle TTL-type logic and decided that's the best possible
> solution, and therefore that the solution used by Solr "sucks", w/o
> taking into account that what's efficient in the underlying Lucene
> storage implementation might just be different than what's efficient in
> the underlying storage implementation of X & Y.
>
Yes.

>
> If you'd like to tackle implementing TTL as a lower-level primitive
> concept in Lucene, then by all means be my guest -- but personally I
> don't think you're going to find any real perf improvements in an
> approach like you describe compared to what we offer today.  I look
> forward to being proved wrong.
>
Since the implementation is apparently more efficient than I thought, I'm
gonna leave it.

>
>
>
> -Hoss
> http://www.lucidworks.com/
>
