Re: Handling updates from multiple sources

2014-05-02 Thread Michał Zgliczyński
My system is changing rapidly. The end goal is to have all of the data 
inside the ES index. Currently I have 2 different systems that write to 
the ES index:
1) Bulk job. It runs through all the DBs, fetches documents in batches of 
5k, and sends them to ES.
2) Live updating job. It picks up the newest changes and sends them to 
ES, as either updates or inserts. Note: the updates don't contain full 
documents.
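
Roughly, the two kinds of writes look like this (just a sketch; host, 
index, type, and field names are made up):

# (1) bulk job: full documents, indexed in batches
curl -XPOST 'localhost:9200/_bulk' --data-binary '
{ "index" : { "_index" : "myindex", "_type" : "doc", "_id" : "42" } }
{ "user" : "42", "name" : "John", "timestamp" : 1399000000 }
{ "index" : { "_index" : "myindex", "_type" : "doc", "_id" : "43" } }
{ "user" : "43", "name" : "Jane", "timestamp" : 1399000000 }
'

# (2) live job: partial update to a single document
curl -XPOST 'localhost:9200/myindex/doc/42/_update' -d '
{ "doc" : { "name" : "John Smith", "timestamp" : 1399000100 } }
'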

After steps (1) and (2), I would like an (almost) 100% guarantee that the 
index is complete and up to date.

I think this is a fairly common use case if you want an index with live 
data, rather than data that is stale as of the start of the bulk job.

On Thursday, May 1, 2014 at 19:45:53 UTC-7, Rob Ottaway wrote:
>
> I missed that the later doc would only be partial. What is the reason for 
> using partial docs? That really complicates things.
>
> Filling in missing fields is going to be a very large headache. You'll 
> probably kill performance trying to do it, too, and it will likely be 
> complex enough to cause a lot more trouble.
>
> I think if you can present the overall use case more fully, you will get 
> better insight into how to work this out.
>
>
> On Thursday, May 1, 2014 4:51:03 PM UTC-7, Michał Zgliczyński wrote:
>>
>> Hi,
>> Thank you for your response. I have looked through this blog post: 
>> http://www.elasticsearch.org/blog/elasticsearch-versioning-support/
>> It looks as if external versioning would be the way to go: have the 
>> timestamps act as version numbers and let ES keep only the document with 
>> the newest version as the correct one. However, in the situation I 
>> presented above, ES will fail the request. A quote from the post:
>> "With version_type set to external, Elasticsearch will store the version 
>> number as given and will not increment it. Also, instead of checking for an 
>> exact match, Elasticsearch will only return a version collision error if 
>> the version currently stored is greater or equal to the one in the indexing 
>> command. This effectively means “only store this information if no one else 
>> has supplied the same or a more recent version in the meantime”. 
>> Concretely, the above request will succeed if the stored version number is 
>> smaller than 526. 526 and above will cause the request to fail."
>>
>> In my example, we would hit exactly that situation: a partial doc with a 
>> larger version number (later timestamp) is already stored in ES, and then 
>> we get the complete document with a smaller timestamp. In this situation 
>> we would like to merge the 2 documents so that we keep all of the fields 
>> from the partial doc, and the remaining fields (not currently present in 
>> the ES document) are filled in from the complete document.
>>
>> Thanks!
>> Michal Zgliczynski
>>
>> On Thursday, May 1, 2014 at 14:58:31 UTC-7, Rob Ottaway wrote:
>>>
>>> Have you looked at using versioning?
>>>
>>>
>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning
>>>
>>> cheers,
>>> Rob
>>>
>>> On Thursday, May 1, 2014 2:47:39 PM UTC-7, Michał Zgliczyński wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am building a system in which I will have two sources of updates:
>>>> 1) Bulk updating from the source of truth (DB) <- always inserting 
>>>> documents (complete docs)
>>>> 2) Live updates <- inserts and updates (complete and incomplete docs)
>>>>
>>>> Also, let's assume that each insert/update has a timestamp that we 
>>>> trust (not the ES timestamp).
>>>>
>>>> The idea is to have a complete, up-to-date index once the bulk updating 
>>>> finishes. To achieve this, I need to guarantee that I end up with the 
>>>> correct data. This would mostly work if everything were done as upserts 
>>>> and the inserts/updates coming into ES had strictly increasing 
>>>> timestamps. But one can imagine a problematic situation:
>>>>
>>>> 1) We are performing bulk indexing:
>>>>   a) we read an object from the db
>>>>   b) process it
>>>>   c) send it to ES.
>>>> 2) We get a live update for the same object after step (a) but before 
>>>> the bulk update reaches ES in phase (c). That is, ES first gets an 
>>>> update with new data, and only after that does it get the insert with 
>>>> the entire document from the source of truth, carrying older data. 
>>>> Hence, ES already holds a document with a newer timestamp than the one 
>>>> added in phase (c).

Re: Handling updates from multiple sources

2014-05-01 Thread Michał Zgliczyński
Hi,
Thank you for your response. I have looked through this blog 
post: http://www.elasticsearch.org/blog/elasticsearch-versioning-support/
It looks as if external versioning would be the way to go: have the 
timestamps act as version numbers and let ES keep only the document with 
the newest version as the correct one. However, in the situation I 
presented above, ES will fail the request. A quote from the post:
"With version_type set to external, Elasticsearch will store the version 
number as given and will not increment it. Also, instead of checking for an 
exact match, Elasticsearch will only return a version collision error if 
the version currently stored is greater or equal to the one in the indexing 
command. This effectively means “only store this information if no one else 
has supplied the same or a more recent version in the meantime”. 
Concretely, the above request will succeed if the stored version number is 
smaller than 526. 526 and above will cause the request to fail."
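
For reference, the kind of indexing request the post describes, with the 
same 526 example (host, index, type, and field names are made up):

curl -XPUT 'localhost:9200/myindex/doc/42?version=526&version_type=external' -d '
{ "user" : "42", "name" : "John" }
'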

In my example, we would hit exactly that situation: a partial doc with a 
larger version number (later timestamp) is already stored in ES, and then 
we get the complete document with a smaller timestamp. In this situation 
we would like to merge the 2 documents so that we keep all of the fields 
from the partial doc, and the remaining fields (not currently present in 
the ES document) are filled in from the complete document.

Thanks!
Michal Zgliczynski

On Thursday, May 1, 2014 at 14:58:31 UTC-7, Rob Ottaway wrote:
>
> Have you looked at using versioning?
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning
>
> cheers,
> Rob
>
> On Thursday, May 1, 2014 2:47:39 PM UTC-7, Michał Zgliczyński wrote:
>>
>> Hi,
>>
>> I am building a system in which I will have two sources of updates:
>> 1) Bulk updating from the source of truth (DB) <- always inserting 
>> documents (complete docs)
>> 2) Live updates <- inserts and updates (complete and incomplete docs)
>>
>> Also, let's assume that each insert/update has a timestamp that we trust 
>> (not the ES timestamp).
>>
>> The idea is to have a complete, up-to-date index once the bulk updating 
>> finishes. To achieve this, I need to guarantee that I end up with the 
>> correct data. This would mostly work if everything were done as upserts 
>> and the inserts/updates coming into ES had strictly increasing 
>> timestamps. But one can imagine a problematic situation:
>>
>> 1) We are performing bulk indexing:
>>   a) we read an object from the db
>>   b) process it
>>   c) send it to ES.
>> 2) We get a live update for the same object after step (a) but before the 
>> bulk update reaches ES in phase (c). That is, ES first gets an update with 
>> new data, and only after that does it get the insert with the entire 
>> document from the source of truth, carrying older data. Hence, ES already 
>> holds a document with a newer timestamp than the one added in phase (c).
>>
>> My theoretical solution: for each operation, keep the timestamp of that 
>> change (the timestamp from the system that made the change, not from 
>> Elasticsearch). Let's say that all of the operations we perform are 
>> upserts. Then, once we get an insert or an update (let's call it doc), we 
>> have to run the following script (pseudo-MVEL) inside ES:
>> {
>>   if (doc.timestamp > ctx._source.timestamp) {
>>     // doc is newer than what is in ES:
>>     // update the index with all of the info from the new doc
>>     upsert(doc);
>>   } else {
>>     // there is already a document in ES with a newer timestamp;
>>     // note: this may be an incomplete document (an update)
>>     __fill the missing fields in the document in ES with values from doc__
>>   }
>> }
>>
>> My questions are:
>> 1) Is there a better approach?
>> 2) If not, is there a simple way to implement the '__fill the missing 
>> fields in the document in ES with values from doc__' operation/script?
>>
>> Thanks!
>> Michal Zgliczynski
>>
>



Handling updates from multiple sources

2014-05-01 Thread Michał Zgliczyński
Hi,

I am building a system in which I will have two sources of updates:
1) Bulk updating from the source of truth (DB) <- always inserting 
documents (complete docs)
2) Live updates <- inserts and updates (complete and incomplete docs)

Also, let's assume that each insert/update has a timestamp that we trust 
(not the ES timestamp).

The idea is to have a complete, up-to-date index once the bulk updating 
finishes. To achieve this, I need to guarantee that I end up with the 
correct data. This would mostly work if everything were done as upserts 
and the inserts/updates coming into ES had strictly increasing timestamps. 
But one can imagine a problematic situation:

1) We are performing bulk indexing:
  a) we read an object from the db
  b) process it
  c) send it to ES.
2) We get a live update for the same object after step (a) but before the 
bulk update reaches ES in phase (c). That is, ES first gets an update with 
new data, and only after that does it get the insert with the entire 
document from the source of truth, carrying older data. Hence, ES already 
holds a document with a newer timestamp than the one added in phase (c).

My theoretical solution: for each operation, keep the timestamp of that 
change (the timestamp from the system that made the change, not from 
Elasticsearch). Let's say that all of the operations we perform are 
upserts. Then, once we get an insert or an update (let's call it doc), we 
have to run the following script (pseudo-MVEL) inside ES:
{
  if (doc.timestamp > ctx._source.timestamp) {
    // doc is newer than what is in ES:
    // update the index with all of the info from the new doc
    upsert(doc);
  } else {
    // there is already a document in ES with a newer timestamp;
    // note: this may be an incomplete document (an update)
    __fill the missing fields in the document in ES with values from doc__
  }
}
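
Concretely, I imagine attaching this to the Update API as a scripted 
upsert, something like the following (an untested sketch; host, index, 
type, and field names are made up):

curl -XPOST 'localhost:9200/myindex/doc/42/_update' -d '
{
  "script" : "if (newdoc.timestamp > ctx._source.timestamp) { ctx._source.putAll(newdoc) } else { foreach (k : newdoc.keySet()) { if (!ctx._source.containsKey(k)) { ctx._source[k] = newdoc[k] } } }",
  "params" : {
    "newdoc" : { "user" : "42", "name" : "John", "timestamp" : 1399000000 }
  },
  "upsert" : { "user" : "42", "name" : "John", "timestamp" : 1399000000 }
}
'

The else branch is the part I do not know how to do cleanly; the loop 
above is only my guess at how the merge could look in MVEL.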

My questions are:
1) Is there a better approach?
2) If not, is there a simple way to implement the '__fill the missing 
fields in the document in ES with values from doc__' operation/script?

Thanks!
Michal Zgliczynski



Re: Adding 1mln+ aliases is really slow.

2014-03-06 Thread Michał Zgliczyński
Currently this runs on 4 nodes, with the possibility of adding more. I get 
nothing in the logs.
I don't completely understand how this applies to my use case: 
https://github.com/elasticsearch/elasticsearch/pull/5180
Ideally what I would like, would be not to create so many aliases, but to 
have a template doing the work for me. The template could hold the data 
pertaining the filtering and routing. The template could be very simple, 
for a request:
host:9200/user_{id} => this would automatically match the template: 
"user_*" and use its options. Also, this would very much simplify my work 
later on. As the server is alive and a new user would appear, the template 
would automatically use the templates settings, instead of me checking if 
the alias exists and then adding the alias.

This would allow me to create 1 template instead of so many similar 
aliases. Or maybe this is already implemented?
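
To make this concrete, the kind of template I am imagining (purely 
hypothetical syntax; as far as I know nothing like this exists today):

{
  "alias_template" : "user_*",
  "index" : "index_name",
  "filter" : {
    "term" : { "user" : "{id}" }
  },
  "routing" : "r{id}"
}

Again, this is just sketching the idea, not a real API.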

Thanks!

On Thursday, March 6, 2014 at 13:49:29 UTC-8, David Pilato wrote:
>
> I have never seen that number of aliases. That means you have 5 million 
> users?
> Nice project ;-)
>
> I guess here that the cluster state is getting so big that it takes more 
> and more time to update it and copy it to all nodes.
>
> BTW, how many nodes do you have for those 200 shards?
>
> Do you see anything in the logs?
>
> Thinking out loud.
> Wondering if creating some alias template could help here to minimize the 
> cluster state size?
> Something like exactly what you describe:
> {
>   'index' : 'index_name',
>   'alias' : 'user_{user_id}',
>   'filter' : {
>     'term' : {
>       'user' : '{user_id}'
>     }
>   },
>   'routing' : 'r{user_id}'
> }
>
> It looks somewhat similar to what Luca just did in 
> https://github.com/elasticsearch/elasticsearch/pull/5180
>
> Does someone else have an idea? 
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
>
> On March 6, 2014, at 22:15, Michał Zgliczyński wrote:
>
> First of all, thank you for building Elasticsearch. It is a truly awesome 
> product.
>
> I am trying to use the "User data flow". For this, I create a single 
> index with multiple aliases inside it. In my use case, I have about 5 
> million aliases to add.
>
> The alias structure roughly looks like this:
> {
>   'index' : 'index_name',
>   'alias' : 'user_' + user_id,
>   'filter' : {
>     'term' : {
>       'user' : user_id
>     }
>   },
>   'routing' : 'r' + user_id
> }
>
> I create the index with these settings:
> {
>   "index_name" : {
>     "settings" : {
>       "index.number_of_replicas" : "1",
>       "index.number_of_shards" : "100"
>     }
>   }
> }
>
>
> Adding aliases works reasonably well for up to about 100k aliases, but it 
> slows down for later updates.
>
> The following timings were taken after creating the index and then adding 
> aliases. No other operations were performed on the cluster or index 
> during that time.
> These are the times needed to send and add aliases in batches of 5000:
> batch: 5000 - time: 2311ms
> batch: 5000 - time: 4096ms
> batch: 5000 - time: 6022ms
> batch: 5000 - time: 8127ms
> batch: 5000 - time: 10174ms
> batch: 5000 - time: 11403ms
> batch: 5000 - time: 13126ms
> batch: 5000 - time: 14335ms
> batch: 5000 - time: 16500ms
> batch: 5000 - time: 20663ms
> batch: 5000 - time: 23002ms
> batch: 5000 - time: 24457ms
> batch: 5000 - time: 26375ms
> batch: 5000 - time: 28984ms
> batch: 5000 - time: 30559ms
> batch: 5000 - time: 32234ms
> batch: 5000 - time: 35098ms
> batch: 5000 - time: 38922ms
> batch: 5000 - time: 41776ms
> batch: 5000 - time: 53402ms
> batch: 5000 - time: 58600ms
> batch: 5000 - time: 65567ms
> batch: 5000 - time: 79885ms
> batch: 5000 - time: 89900ms
> batch: 5000 - time: 89368ms
> batch: 5000 - time: 104109ms
>
> As you can see, it gradually slows down. Is this expected? It looks like 
> the addition time grows linearly with the number of aliases. Is that 
> correct? Thanks!
>



Adding 1mln+ aliases is really slow.

2014-03-06 Thread Michał Zgliczyński
First of all, thank you for building Elasticsearch. It is a truly awesome 
product.

I am trying to use the "User data flow". For this, I create a single 
index with multiple aliases inside it. In my use case, I have about 5 
million aliases to add.

The alias structure roughly looks like this:
{
  'index' : 'index_name',
  'alias' : 'user_' + user_id,
  'filter' : {
    'term' : {
      'user' : user_id
    }
  },
  'routing' : 'r' + user_id
}
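
For a single user, this corresponds to an aliases API call like the 
following (host name made up):

curl -XPOST 'localhost:9200/_aliases' -d '
{
  "actions" : [
    { "add" : {
        "index" : "index_name",
        "alias" : "user_12",
        "filter" : { "term" : { "user" : "12" } },
        "routing" : "r12"
    } }
  ]
}
'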

I create the index with these settings:
{
  "index_name" : {
    "settings" : {
      "index.number_of_replicas" : "1",
      "index.number_of_shards" : "100"
    }
  }
}
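
That is, the index is created with something like (host name made up):

curl -XPUT 'localhost:9200/index_name' -d '
{
  "settings" : {
    "index.number_of_replicas" : "1",
    "index.number_of_shards" : "100"
  }
}
'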


Adding aliases works reasonably well for up to about 100k aliases, but it 
slows down for later updates.

The following timings were taken after creating the index and then adding 
aliases. No other operations were performed on the cluster or index during 
that time.
These are the times needed to send and add aliases in batches of 5000:
batch: 5000 - time: 2311ms
batch: 5000 - time: 4096ms
batch: 5000 - time: 6022ms
batch: 5000 - time: 8127ms
batch: 5000 - time: 10174ms
batch: 5000 - time: 11403ms
batch: 5000 - time: 13126ms
batch: 5000 - time: 14335ms
batch: 5000 - time: 16500ms
batch: 5000 - time: 20663ms
batch: 5000 - time: 23002ms
batch: 5000 - time: 24457ms
batch: 5000 - time: 26375ms
batch: 5000 - time: 28984ms
batch: 5000 - time: 30559ms
batch: 5000 - time: 32234ms
batch: 5000 - time: 35098ms
batch: 5000 - time: 38922ms
batch: 5000 - time: 41776ms
batch: 5000 - time: 53402ms
batch: 5000 - time: 58600ms
batch: 5000 - time: 65567ms
batch: 5000 - time: 79885ms
batch: 5000 - time: 89900ms
batch: 5000 - time: 89368ms
batch: 5000 - time: 104109ms

As you can see, it gradually slows down. Is this expected? It looks like 
the addition time grows linearly with the number of aliases. Is that 
correct? Thanks!
