Re: Index documents in async way

Đạt Cao Mạnh Thu, 15 Oct 2020 03:30:11 -0700

> The biggest problem I have with this is that the client doesn't know
about indexing problems without awkward callbacks later to see if something
went wrong.  Even simple stuff like a schema problem (e.g. undefined
field).  It's a useful *option*, any way.
> Currently we now guarantee that if solr sends you an OK response the
document WILL eventually become searchable without further action.
Maintaining that guarantee becomes impossible if we haven't verified that
the data is formatted correctly (i.e. dates are in ISO format, etc).
This may be an acceptable cost for those opting for async indexing but it
may be very hard for some folks to swallow if it became the only option
however.


I don't mean that we're gonna replace the sync way. Plus the way sync
works can be changed to leverage the async method. For example, the sync
thread basically waits until the indexer threads finish indexing that
update. Some notice here is all the validations in update processors will
still happen in *sync* way in this model. Only indexing part to Lucene is
executed in an async way.

> I'm a bit skeptical that would boost indexing performance.  Please
understand the intent of that API is about transactionality (atomic add)
and ensuring all docs go in the same segment.  Solr *does* use that API for
nested / parent-child documents, and because it has to.  If that API were
to get called for normal docs, I could see the configured indexing buffer
RAM or doc limit could be exceeded substantially.  Perhaps not a big deal.
You could test your performance theory on a hacked Solr without much
modifications, I think?  Just buffer then send in bulk.

>
>
> What I mean here is right now, when we send a batch of documents to Solr.
We still process it as concrete - unrelated documents by indexing one by
one. If indexing the fifth document causing error, that won't affect
already indexed 4 documents. Using this model we can index the batch in an
atomic way.

> I think that's the real thrust of your motivations, and sounds good to
me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778 for
making the optionality of the updateLog be a better supportable option in
SolrCloud.

Thank you for bring it out, I will take a look.

All other parts that did not get quoted are very valuable things and I
really appreciate those.

On Wed, Oct 14, 2020 at 12:06 AM Gus Heck <[email protected]> wrote:

> This is interesting, though it opens a few of cans of worms IMHO.
>
>    1. Currently we now guarantee that if solr sends you an OK response
>    the document WILL eventually become searchable without further action.
>    Maintaining that guarantee becomes impossible if we haven't verified that
>    the data is formatted correctly (i.e. dates are in ISO format, etc).
>    This may be an acceptable cost for those opting for async indexing but it
>    may be very hard for some folks to swallow if it became the only option
>    however.
>    2. In the case of errors we need to hold the error message
>    indefinitely for later discovery by the client, this needs to not
>    accumulate forever. Thus:
>       1. We have a timed cleanup, leasing or some other self limiting
>       pattern... possibly by indexing the failures in a TRA with autodelete so
>       that clients can efficiently find the status of the particular 
> document(s)
>       they sent, obviouysly there's at least an asyc id involved, probably the
>       uniqueKey (where available) and timestamps for recieved, and processed 
> as
>       well.
>       2. We log more simply with a sequential id and let clients keep
>       track of what they have seen... This can lead us down the path of
>       re-inventing kafka, or making kafka a required dependency.
>       3. We provide a push oriented connection (websocket? HTTP2?) that
>       clients that care about failures can listen to and store nothing. A less
>       appetizing variant is to publish errors to a message bus.
>    3. If we have more than one thread picking up the submitted documents
>    and writing them, we need a state machine that identifies in-progress
>    documents to prevent multiple pickups and resets processing to new on
>    startup to ensure we don't index the same document twice and don't lose
>    things that were in-flight on power loss.
>    4. Backpressure/throttling. If we're losing ground continuously on the
>    submissions because indexing is heavier than accepting documents, we may
>    fill up the disk. Of course the index itself can do that, but need to think
>    about if this makes it worse.
>
> A big plus to this however is that batches with errors could optionally
> just omit the (one or two?) errored document(s) and publish the error for
> each errored document rather than failing the whole batch, meaning that the
> indexing infrastructure submitting in batches doesn't have to leave several
> hundred docs unprocessed, or alternately do a slow doc at a time resubmit
> to weed out the offenders.
>
> Certainly the involvement of kafka sounds interesting. If one persists to
> an externally addressable location like a kafka queue one might leave the
> option for the write-on-receipt queue to be different from the
> read-to-actually-index queue and put a pipeline behind solr instead of
> infront of it... possibly atomic updates could then be given identical
> processing as initial indexing....
>
> On Sat, Oct 10, 2020 at 12:41 AM David Smiley <[email protected]> wrote:
>
>>
>>
>> On Thu, Oct 8, 2020 at 10:21 AM Cao Mạnh Đạt <[email protected]> wrote:
>>
>>> Hi guys,
>>>
>>> First of all it seems that I used the term async a lot recently :D.
>>> Recently I have been thinking a lot about changing the current indexing
>>> model of Solr from sync way like currently (user submit an update request
>>> waiting for response). What about changing it to async model, where nodes
>>> will only persist the update into tlog then return immediately much like
>>> what tlog is doing now. Then we have a dedicated executor which reads from
>>> tlog to do indexing (producer consumer model with tlog acting like the
>>> queue).
>>>
>>
>> The biggest problem I have with this is that the client doesn't know
>> about indexing problems without awkward callbacks later to see if something
>> went wrong.  Even simple stuff like a schema problem (e.g. undefined
>> field).  It's a useful *option*, any way.
>>
>>
>>>
>>> I do see several big benefits of this approach
>>>
>>>    - We can batching updates in a single call, right now we do not use
>>>    writer.add(documents) api from lucene, by batching updates this gonna 
>>> boost
>>>    the performance of indexing
>>>
>>> I'm a bit skeptical that would boost indexing performance.  Please
>> understand the intent of that API is about transactionality (atomic add)
>> and ensuring all docs go in the same segment.  Solr *does* use that API for
>> nested / parent-child documents, and because it has to.  If that API were
>> to get called for normal docs, I could see the configured indexing buffer
>> RAM or doc limit could be exceeded substantially.  Perhaps not a big deal.
>> You could test your performance theory on a hacked Solr without much
>> modifications, I think?  Just buffer then send in bulk.
>>
>>>
>>>    - One common problems with Solr now is we have lot of threads doing
>>>    indexing so that can ends up with many small segments. Using this model 
>>> we
>>>    can have bigger segments so less merge cost
>>>
>>> This is app/use-case dependent of course.  If you observe the segment
>> count to be high, I think it's more likely due to a sub-optimal commit
>> strategy.  Many users should put more thought into this.  If update
>> requests to Solr have a small number of docs and it explicitly gives a
>> commit (commitWithin on the other hand is fine), then this is a recipe for
>> small segments and is generally expensive as well (commits are kinda
>> expensive).  Many apps would do well to either pass commitWithin or rely on
>> a configured commitWithin, accomplishing the same instead of commit.  For
>> apps that can't do that (e.g. need to immediately read-your-write or for
>> other reasons where I work), then such an app can't use async any way.
>>
>> An idea I've had to help throughput for this case would be for commits
>> that are about to happen concurrently with other indexing to voluntarily
>> wait a short period of time (500ms?) in an attempt to coalesce the commit
>> needs of both concurrent indexing requests.  Very opt-in, probably a
>> solrconfig value, and probably a wait time in the 100s of milliseconds
>> range.  An ideal implementation would be conservative to avoid this waiting
>> if there is no concurrent indexing request that did not start after the
>> current request or that which doesn't require a commit as well.
>>
>> If your goal is fewer segments, then definitely check out the recent
>> improvements to Lucene to do some lightweight merge-on-commit.  The
>> Solr-side hook is SOLR-14582
>> <https://issues.apache.org/jira/browse/SOLR-14582> and it requires a
>> custom MergePolicy.  I contributed such a MergePolicy policy here:
>> https://github.com/apache/lucene-solr/pull/1222 although it needs to be
>> updated in light of Lucene changes that occured since then.  We've been
>> using that MergePolicy at Salesforce for a couple years and it has cut our
>> segment count in half!  Of course if you can generate fewer segments in the
>> first place, that's preferable and is more to your point.
>>
>>>
>>>    - Another huge reason here is after switching to this model, we can
>>>    remove tlog and use a distributed queue like Kafka, Pulsar. Since the
>>>    purpose of leader in SolrCloud now is ordering updates, the distributed
>>>    queue is already ordering updates for us, so no need to have a dedicated
>>>    leader. That is just the beginning of things that we can do after using a
>>>    distributed queue.
>>>
>>> I think that's the real thrust of your motivations, and sounds good to
>> me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778
>> for making the optionality of the updateLog be a better supportable option
>> in SolrCloud.
>>
>>
>>> What do your guys think about this? Just want to hear from your guys
>>> before going deep into this rabbit hole.
>>>
>>> Thanks!
>>>
>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>


-- 
*Best regards,*
*Cao Mạnh Đạt*
*E-mail: [email protected] <[email protected]>*

Re: Index documents in async way

Reply via email to