[Hadoop] - Difference between task creation for a write and read-update-write operation in ES

piyush goyal Wed, 06 May 2015 03:42:07 -0700

Hi Costin,

While working on my project I noticed that tasks are created differently 
for the two kinds of write-to-ES operations described below:


1.) Write only to ES - When I create an RDD of my own to insert data into 
ES, the tasks are created based on the properties "es.batch.size.bytes" and 
"es.batch.size.entries". Number of tasks created = number of documents in 
the RDD / the value of either of these properties. The request hits a node, 
and that node decides which shard the document is routed to, based on the 
routing value (if specified).

2.) Read-update-write to ES - Consider the case where I read data from ES, 
store it in an RDD, update the documents in the RDD and then index them 
back into ES. While reading, the number of tasks is based on the number of 
shards, and I presume that each task fetches data from one shard (I am not 
sure how this works: does the task delegate the request to a node, which 
then serves the data from a particular shard?). When I then update/re-index 
the data using the same RDD and the function saveToEsWithMeta, the number 
of tasks created does not follow the rule in point 1. If the data in each 
partition is smaller than "es.batch.size.entries", as many tasks are 
created as there are shards; otherwise more tasks are created.
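
The two observations above can be written down as a rough model. This is 
only a sketch of the behavior reported in this post, not of how the 
connector actually schedules tasks. The function names, the example 
numbers and the rounding up per partition are assumptions made for 
illustration.

```python
import math


def write_only_tasks(num_docs, batch_entries):
    # Point 1 as reported: documents in the RDD divided by
    # es.batch.size.entries (rounding up is an assumption).
    return math.ceil(num_docs / batch_entries)


def read_update_write_tasks(docs_per_partition, batch_entries):
    # Point 2 as reported: one task per shard-backed partition while each
    # partition holds fewer documents than es.batch.size.entries, and more
    # tasks otherwise (how many more is an assumption: one per batch).
    return sum(max(1, math.ceil(n / batch_entries)) for n in docs_per_partition)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 documents, 1,000 entries per batch.
    print(write_only_tasks(10_000, 1_000))
    # Five shards, each partition below the batch size.
    print(read_update_write_tasks([500] * 5, 1_000))
    # Three shards, one partition larger than the batch size.
    print(read_update_write_tasks([2_500, 500, 500], 1_000))
```

With small partitions the second function returns the shard count; once a 
partition exceeds the batch size the count grows past it, which is the 
pattern being asked about.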

What is the reason behind this? Also, just as a read request is served from 
a particular shard, does a write operation write to a particular shard, or 
do all the tasks delegate their requests to the node?

Thanks in advance
Piyush
