Re: Handling updates from multiple sources

Michał Zgliczyński Fri, 02 May 2014 12:26:38 -0700

My system is changing rapidly. The end goal is to have all the data 
inside the ES index. As currently set up, two different systems write 
to the ES index:
1) Bulk job. Runs through all the dbs, fetches records in batches of 5k 
and sends them to ES.
2) Live updating job. Picks up the newest changes and sends them to ES, as 
either updates or inserts. Note: the updates don't contain full documents.


After steps (1) and (2), I would like to have an (almost) 100% guarantee 
that the index is complete and up to date.

I think this is quite a common use case if you want an index with live 
data, rather than data that is stale as of the start of the bulk job.
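
A minimal sketch of the merge rule discussed in this thread, written in 
Python outside of ES, assuming every document carries a trusted 
`timestamp` field. The function name `merge_by_timestamp` and the sample 
field names are hypothetical. It compares document-level timestamps only, 
so it illustrates the rule and is not a tested ES script:

```python
from typing import Any, Dict, Optional

Doc = Dict[str, Any]


def merge_by_timestamp(stored: Optional[Doc], incoming: Doc,
                       ts_field: str = "timestamp") -> Doc:
    """Merge two versions of a document; the newer one wins per field.

    Fields present only in the older version are kept, so a complete but
    older document fills the gaps left by a newer partial update.
    """
    if stored is None:
        # Nothing indexed yet: plain insert.
        return dict(incoming)

    if incoming[ts_field] > stored[ts_field]:
        older, newer = stored, incoming
    else:
        # Stored doc is newer (or the same age): it keeps its fields.
        older, newer = incoming, stored

    merged = dict(older)
    merged.update(newer)  # newer values overwrite older ones
    return merged


# Live update (partial, newer) reached the index first ...
stored_doc = {"id": 1, "timestamp": 20, "title": "new title"}
# ... then the bulk job delivers the complete but older document.
bulk_doc = {"id": 1, "timestamp": 10, "title": "old title", "body": "text"}

result = merge_by_timestamp(stored_doc, bulk_doc)
```

Here `result` keeps `title` and `timestamp` from the newer partial 
document and takes `body` from the older complete one. Because only one 
timestamp is kept per document, a field last written by an even older 
update cannot be told apart from a fresh one; per-field timestamps would 
be needed for that.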

On Thursday, 1 May 2014 19:45:53 UTC-7, Rob Ottaway wrote:
>
> I missed that the later doc would only be partial. What is the reason to 
> use the partial doc? That really complicates things.
>
> Filling in missing fields is going to be a very large headache. You'll 
> probably kill performance trying to do it too. Likely it'll be so complex 
> it will present a lot more trouble.
>
> I think if you can better present the overall use cases you will get 
> better insight into how to work this out.
>
>
> On Thursday, May 1, 2014 4:51:03 PM UTC-7, Michał Zgliczyński wrote:
>>
>> Hi,
>> Thank you for your response. I have looked through this blog post: 
>> http://www.elasticsearch.org/blog/elasticsearch-versioning-support/
>> It looks as if external versioning would be the way to go. Have the 
>> timestamps act as version numbers and let ES only pick the document with 
>> the newest version as the correct document. However, with the situation I 
>> have presented above, ES will fail. A quote from the post:
>> "With version_type set to external, Elasticsearch will store the version 
>> number as given and will not increment it. Also, instead of checking for an 
>> exact match, Elasticsearch will only return a version collision error if 
>> the version currently stored is greater or equal to the one in the indexing 
>> command. This effectively means “only store this information if no one else 
>> has supplied the same or a more recent version in the meantime”. 
>> Concretely, the above request will succeed if the stored version number is 
>> smaller than 526. 526 and above will cause the request to fail."
>>
>> In my example, we would have that situation. A partial doc with a larger 
>> version number (later timestamp) is already stored in ES and we get the 
>> complete document with a smaller timestamp. In this situation we would like 
>> to merge these 2 documents so that we keep all of the fields from 
>> the partial doc and fill the other fields (not currently present in the ES 
>> document) from the complete document.
>>
>> Thanks!
>> Michal Zgliczynski
>>
>> On Thursday, 1 May 2014 14:58:31 UTC-7, Rob Ottaway wrote:
>>>
>>> Have you looked at using versioning?
>>>
>>>
>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning
>>>
>>> cheers,
>>> Rob
>>>
>>> On Thursday, May 1, 2014 2:47:39 PM UTC-7, Michał Zgliczyński wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am building a system in which I will have two sources of updates:
>>>> 1) Bulk updating from the source of truth (db) <- Always inserting 
>>>> documents (complete docs)
>>>> 2) Live updates <- Adding inserts and updates (complete and incomplete 
>>>> docs)
>>>>
>>>> Also, let's assume that each insert/update has a timestamp which we 
>>>> trust (not the ES timestamp).
>>>>
>>>> The idea is to have a complete, up to date index once the bulk updating 
>>>> finishes. To achieve this I need to guarantee that I will have the correct 
>>>> data. This would mostly work if everything we did were upserts and 
>>>> the inserts/updates coming into ES had strictly increasing timestamps.
>>>> But one can imagine the following problematic situation:
>>>>
>>>> 1) We are performing bulk indexing,
>>>>   a) we read an object from the db
>>>>   b) process it
>>>>   c) send it to ES.
>>>> 2) We have an update on the same object, after step (a) and before it 
>>>> makes it to ES in the bulk updating phase (c). That is, ES gets an update 
>>>> with new data and only after that we get the insert with the entire 
>>>> document from the source of truth with older data. Hence, in ES we have a 
>>>> document with a newer timestamp than the one newly added in phase (c).
>>>>
>>>> My theoretical solution: For each operation, have the timestamp for 
>>>> that change (timestamp from the system that made the change, not from 
>>>> Elasticsearch). Let's say that all of the operations that we will perform 
>>>> are upserts.
>>>> Then once we get an insert or an update (let's call it doc), we have to 
>>>> perform the following script (pseudo mvel) inside ES.
>>>> {
>>>>   if (doc.timestamp > ctx._source.timestamp) {
>>>>     // doc is newer than what was in ES:
>>>>     // update the index with all of the info from the new doc
>>>>     upsert(doc);
>>>>   } else {
>>>>     // there is already a document in ES with a newer timestamp;
>>>>     // note, this may be an incomplete document (an update)
>>>>     __fill the missing fields in the document in ES with values from doc__
>>>>   }
>>>> }
>>>>
>>>> My questions are:
>>>> 1) Is there a better approach?
>>>> 2) If not, is there a simple way to do the '__fill the missing 
>>>> fields in the document in ES with values from doc__' operation/script?
>>>>
>>>> Thanks!
>>>> Michal Zgliczynski
>>>>
>>>
