Hi,

This section [1] of the Elasticsearch for Apache Hadoop reference should answer most of your questions. In short, as opposed to a 'normal' client, es-hadoop 'parallelizes' your reads and writes: if you have a Hadoop job with 5 tasks running in parallel, you'll end up with 5 parallel writes to Es. In a similar vein, when you read data from Es you'll get parallel reads, so if your index has 5 shards, you'll end up with 5 different tasks streaming/reading data from Es.
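
To make the read side concrete, here is a toy sketch in plain Python. It is not es-hadoop code and does not talk to Es: the shard layout and the names SHARDS, read_shard and parallel_read are made up for illustration, and threads stand in for Hadoop tasks. It only shows the shape of the idea, one reader per shard, all running in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index layout: shard id -> documents stored in that shard.
SHARDS = {
    0: ["doc-a", "doc-b"],
    1: ["doc-c"],
    2: ["doc-d", "doc-e", "doc-f"],
    3: [],
    4: ["doc-g"],
}

def read_shard(shard_id):
    """Stand-in for one task streaming the contents of a single shard."""
    return shard_id, list(SHARDS[shard_id])

def parallel_read(shards):
    """Start one reader per shard and collect the results keyed by shard id."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # Iterating over the dict yields its keys, i.e. the shard ids.
        return dict(pool.map(read_shard, shards))

result = parallel_read(SHARDS)
```

With 5 shards you get 5 readers, which mirrors the 5 tasks described above; the write side is the same picture in reverse, with each task sending its own batch of documents.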

Regarding compatibility, es-hadoop works against Apache Hadoop 1.x and 2.x and various other Hadoop distros. There's nothing extra that you have to do to your cluster; again, this is covered in the reference docs [2].

As for deployment, you can install Es on the same physical cluster as Hadoop or on a separate one; it's really up to you and your hardware. As long as you have spare RAM and CPU, you can co-locate the two (which es-hadoop will take advantage of). In fact, you don't need the same number of Es and Hadoop nodes; you can mix and match depending on your requirements.
If I understand correctly, you are already reusing the same machines, which is fine.

I suggest taking a look at the docs and trying out the examples in them (which you can find in the readme as well). Things are easy to install and there's no extra provisioning that needs to be done.

Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/arch.html
[2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/features.html

On 30/01/2014 9:55 PM, Josh Harrison wrote:
In looking around I haven't been able to find explicit answers to these questions - though the questions may entirely be because I'm a hadoop newbie.
If we were to deploy ES within a hadoop environment:
The primary benefit is allowing direct interaction with ES from Hadoop, running queries or indexing data, is that right?
Are there explicit benefits to search speed and capability when run through the normal REST or other client APIs? That is to say, if I have a set of N documents and a query that takes T seconds to run on a normal cluster through curl, would there be a marked improvement in T when running the same query through curl against a hadoop enabled cluster?
Are the ideal architecture designs for a hadoop enabled ES cluster the same, or similar to, a "regular" cluster?
If they're the same, does a hadoop enabled cluster need to be designed as such from the start, or can that functionality be tacked on to an already functioning cluster with data? Situation is, we're on a cluster of machines running hadoop, but the ES nodes are just running on the compute nodes like a regular service. Wondering what it would take to enable the hadoop capabilities.

Thanks!

--
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to
elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/4dea6c95-75b8-4ed7-a054-3f9eaedde9d3%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Costin

