Re: [Hadoop] capability clarification questions

Costin Leau Mon, 03 Feb 2014 02:53:11 -0800

Hi,

This section [1] of the Elasticsearch for Apache Hadoop reference tries to answer your questions. In short, as opposed to a 'normal' client, es-hadoop 'parallelizes' your reads and writes, so if you have a Hadoop job with 5 tasks running in parallel, you'll end up with 5 parallel writes to Es. In a similar vein, when you read data from Es you'll get parallel reads - so if your index has 5 shards, you'll end up with 5 different tasks streaming/reading data from Es.
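
To make the shard-to-task mapping concrete, here is a toy Python model of the read side. It is not es-hadoop code and does not talk to Es: the SHARDS dictionary and the function names plan_read_tasks, read_shard and parallel_read are made up for illustration, and threads stand in for Hadoop tasks. It only sketches the idea that an index with 5 shards yields 5 read tasks, each streaming one shard.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an index: shard id -> documents held by that shard.
SHARDS = {
    0: ["doc-a", "doc-b"],
    1: ["doc-c"],
    2: ["doc-d", "doc-e"],
    3: [],
    4: ["doc-f"],
}


def plan_read_tasks(shards):
    """Return one read task (here simply the shard id) per shard."""
    return sorted(shards)


def read_shard(shard_id):
    """Simulate a single task streaming the documents of one shard."""
    return list(SHARDS[shard_id])


def parallel_read(shards):
    """Read every shard concurrently and return all documents."""
    tasks = plan_read_tasks(shards)
    # One worker per task, so all shards are read at the same time.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        per_shard = list(pool.map(read_shard, tasks))
    return [doc for docs in per_shard for doc in docs]


if __name__ == "__main__":
    print(len(plan_read_tasks(SHARDS)), "read tasks")
    print(parallel_read(SHARDS))
```

In the real connector the splitting is done for you by the Hadoop integration; the sketch is only meant to show why the number of shards bounds the read parallelism.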

Regarding deployment, es-hadoop works against Apache Hadoop 1.x and 2.x and various other Hadoop distros. There's nothing extra that you have to do to your cluster; again, this is covered in the reference docs here [2].

As for cluster layout, you can install Es on the same physical cluster as Hadoop or on a separate one; it's really up to you and your hardware. As long as you have spare RAM and CPU, you can co-locate the two (which es-hadoop will take advantage of) - in fact, you don't have to have the same number of ES and Hadoop nodes; you can mix and match depending on your requirements.

If I understand correctly, you are already reusing the same machines, which is fine.

I suggest taking a look at the docs and trying out the examples in them (which you can find in the readme as well) - things are easy to install and there's no extra provisioning that needs to be done.


Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/arch.html
[2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/features.html

On 30/01/2014 9:55 PM, Josh Harrison wrote:

In looking around I haven't been able to find explicit answers to these questions - though the questions may entirely be because I'm a hadoop newbie.

If we were to deploy ES within a hadoop environment:

The primary benefit is allowing direct interaction with ES from Hadoop, running queries or indexing data, is that right?

Are there explicit benefits to search speed and capability when run through the normal REST or other client APIs? That is to say, if I have a set of N documents and a query that takes T seconds to run on a normal cluster through curl, would there be a marked improvement in T when running the same query through curl against a hadoop enabled cluster?

Are the ideal architecture designs for a hadoop enabled ES cluster the same, or similar to, a "regular" cluster?

If they're the same, does a hadoop enabled cluster need to be designed as such from the start, or can that functionality be tacked on to an already functioning cluster with data? Situation is, we're on a cluster of machines running hadoop, but the ES nodes are just running on the compute nodes like a regular service. Wondering what it would take to enable the hadoop capabilities.

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to
elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/4dea6c95-75b8-4ed7-a054-3f9eaedde9d3%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


--
Costin

