Re: Index documents in async way

Erick Erickson Thu, 08 Oct 2020 08:31:18 -0700

I suppose failures would be returned to the client on the async response?

How would one keep the tlog from growing forever if the actual indexing
took a long time?


I'm guessing that this would be optional.

On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya <[email protected]>
wrote:

> Can there be a situation where the index writer fails after the document
> was added to the tlog and a success is sent to the user? I think we want to
> avoid such a situation, don't we?
>
> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt, <[email protected]> wrote:
>
>> > Can you explain a little more on how this would impact durability of
>> updates?
>> Since we persist updates into the tlog, I do not think this will be an issue.
>>
>> > What does a failure look like, and how does that information get
>> propagated back to the client app?
>> I have not been able to do much research, but I think this will be the
>> same as the way our current asyncId works. In this case the asyncId will be
>> the version of an update (with a distributed queue it will be the offset).
>> Failed updates will be put into a time-to-live map so users can query the
>> failure; for successes we can skip that by leveraging the max succeeded
>> version so far.
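
A minimal sketch of the failure-tracking idea described above, assuming update versions are monotonically increasing longs. The class and method names (AsyncUpdateTracker, recordSuccessUpTo, recordFailure, status) are hypothetical and do not exist in Solr; the clock is passed in explicitly only to keep the example deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: failures live in a TTL map, successes are implied by a watermark. */
final class AsyncUpdateTracker {
    enum Status { SUCCEEDED, FAILED, PENDING }

    /** A failure message plus the time at which it may be forgotten. */
    private static final class Failure {
        final String message;
        final long expiresAtMillis;

        Failure(String message, long expiresAtMillis) {
            this.message = message;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<Long, Failure> failures = new ConcurrentHashMap<>();
    private final AtomicLong maxSucceededVersion = new AtomicLong(-1);
    private final long ttlMillis;

    AsyncUpdateTracker(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Called by the indexing side: everything up to this version has been processed. */
    void recordSuccessUpTo(long version) {
        maxSucceededVersion.accumulateAndGet(version, Math::max);
    }

    /** Called by the indexing side when a single update fails. */
    void recordFailure(long version, String message, long nowMillis) {
        failures.put(version, new Failure(message, nowMillis + ttlMillis));
    }

    /** Called by the client, using the version it received as its asyncId. */
    Status status(long version, long nowMillis) {
        // Drop failures whose time-to-live has elapsed.
        failures.values().removeIf(f -> f.expiresAtMillis <= nowMillis);
        if (failures.containsKey(version)) {
            return Status.FAILED;
        }
        return version <= maxSucceededVersion.get() ? Status.SUCCEEDED : Status.PENDING;
    }
}
```

One consequence of this design is that a failure becomes indistinguishable from a success once its entry expires, so a client would have to poll within the time-to-live window.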
>>
>> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob <[email protected]> wrote:
>>
>>> Interesting idea! Can you explain a little more on how this would impact
>>> durability of updates? What does a failure look like, and how does that
>>> information get propagated back to the client app?
>>>
>>> Mike
>>>
>>> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt <[email protected]> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> First of all, it seems that I have used the term async a lot recently :D.
>>>> I have been thinking a lot about changing Solr's current indexing model
>>>> from the synchronous one we have today (the user submits an update request
>>>> and waits for the response) to an async model, where nodes only persist
>>>> the update into the tlog and then return immediately, much like what the
>>>> tlog is doing now. Then we have a dedicated executor which reads from the
>>>> tlog to do the indexing (a producer-consumer model with the tlog acting as
>>>> the queue).
>>>>
>>>> I do see several big benefits of this approach:
>>>>
>>>>    - We can batch updates in a single call. Right now we do not use the
>>>>    writer.add(documents) API from Lucene; batching updates will boost
>>>>    indexing performance.
>>>>    - One common problem with Solr now is that we have a lot of threads
>>>>    doing indexing, which can end up producing many small segments. With
>>>>    this model we can have bigger segments and so less merge cost.
>>>>    - Another huge reason is that after switching to this model, we can
>>>>    remove the tlog and use a distributed queue like Kafka or Pulsar. Since
>>>>    the purpose of the leader in SolrCloud now is ordering updates, and the
>>>>    distributed queue already orders updates for us, there is no need for a
>>>>    dedicated leader. That is just the beginning of what we can do after
>>>>    adopting a distributed queue.
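
The producer-consumer model and the batching benefit above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory BlockingQueue stands in for the tlog (so nothing here is durable), documents are plain strings, and AsyncIndexingSketch, submit and drainOnce are hypothetical names, not Solr or Lucene APIs. In a real indexer the batch callback could be something like Lucene's IndexWriter.addDocuments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: a queue stands in for the tlog, one consumer indexes in batches. */
final class AsyncIndexingSketch {
    // Assumption: an in-memory queue replaces the durable tlog for illustration.
    private final BlockingQueue<String> log = new LinkedBlockingQueue<>();
    private final Consumer<List<String>> batchIndexer;
    private final int maxBatchSize;

    AsyncIndexingSketch(Consumer<List<String>> batchIndexer, int maxBatchSize) {
        this.batchIndexer = batchIndexer;
        this.maxBatchSize = maxBatchSize;
    }

    /** Producer side: append the update to the log and return immediately. */
    void submit(String doc) {
        log.add(doc);
    }

    /**
     * Consumer side: take up to maxBatchSize pending updates, in arrival order,
     * and hand them to the indexer in a single call. Returns the batch size.
     */
    int drainOnce() {
        List<String> batch = new ArrayList<>(maxBatchSize);
        log.drainTo(batch, maxBatchSize);
        if (!batch.isEmpty()) {
            batchIndexer.accept(batch);
        }
        return batch.size();
    }
}
```

A dedicated executor thread would call drainOnce() in a loop. Because a single consumer performs all the writes, updates arrive at the indexer in log order and in larger groups than with one indexing thread per request.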
>>>>
>>>> What do you guys think about this? I just want to hear from you
>>>> before going deep into this rabbit hole.
>>>>
>>>> Thanks!
>>>>
>>>>
