We're using a multi-user Hadoop MapReduce installation with up to 100 computing nodes, without HDFS. Since we have a shared cluster and not all apps use Hadoop, we grow and shrink the Hadoop cluster as the load changes. It's working, and because of our hardware setup, performance is quite close to what we had with HDFS. We store everything directly on the SAN.
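Concretely, running MapReduce without HDFS mostly comes down to pointing Hadoop's default filesystem at the locally mounted SAN instead of HDFS. A minimal Hadoop 1.x core-site.xml sketch (the /mnt/san mount point below is just an illustrative assumption; adjust to your own setup):

```xml
<!-- core-site.xml: use the local (SAN-mounted) filesystem instead of HDFS -->
<configuration>
  <property>
    <!-- fs.default.name in Hadoop 1.x; fs.defaultFS in later releases -->
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```

Job input and output paths then refer to the shared mount, e.g. file:///mnt/san/input, which every node must see at the same path.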

The only problem so far has been getting the system to work without running the JobTracker as root (I posted about that problem yesterday).


Luca




On 05/18/2012 06:10 AM, Pierre Antoine DuBoDeNa wrote:
Did you use HDFS too, or did you store everything on the SAN from the start?

I don't have an exact figure in GB/TB (it might be about 2 TB, so not really that
"huge"), but there are more than 100 million documents to be processed. On a
single machine we can currently process about 200,000 docs/day (several
parsing, indexing, and metadata extraction steps have to be done). So in the worst
case we want to use the 50 VMs to distribute the processing.

2012/5/17 Sagar Shukla<sagar_shu...@persistent.co.in>

Hi PA,
     In my environment we had SAN storage and I/O was pretty good, so if
you have a similar environment, I don't see any performance issues.

Just out of curiosity: what amount of data are you looking to
process?

Regards,
Sagar

-----Original Message-----
From: Pierre Antoine Du Bois De Naurois [mailto:pad...@gmail.com]
Sent: Thursday, May 17, 2012 8:29 PM
To: common-user@hadoop.apache.org
Subject: Re: is hadoop suitable for us?

Thanks Sagar, Mathias and Michael for your replies.

It seems we will have to go with Hadoop, even if I/O will be slow due to
our configuration.

I will try to post an update on how it worked out in our case.

Best,
PA



2012/5/17 Michael Segel<michael_se...@hotmail.com>

The short answer is yes.
The longer answer is that you will have to account for the latencies.

There is more, but you get the idea.

Sent from my iPhone

On May 17, 2012, at 5:33 PM, "Pierre Antoine Du Bois De Naurois"<
pad...@gmail.com>  wrote:

We have a large amount of text files that we want to process and index
(plus apply other algorithms).

The problem is that our configuration is share-everything, while Hadoop
has a share-nothing architecture.

We have 50 VMs rather than actual servers, and they share one huge
central storage. So using HDFS might not be very useful: replication
will not help, and distributing files is meaningless since all files
will end up on the same disks anyway. I am afraid that I/O will be
very slow with or without HDFS. So I am wondering whether it will
really help us to use Hadoop/HBase/Pig etc. to distribute the work and
run several tasks in parallel, or whether it is "better" to install
something different (though I am not sure what). We heard myHadoop is
better for this kind of configuration; do you have any clue about it?

For example, we now have a central MySQL database to check whether we
have already processed a document, and we keep several metadata fields
there. Soon we will have to distribute it, as there is not enough
space on one VM. Would Hadoop/HBase be useful here? We don't want to
do any complex joins/sorts of the data; we just want to query whether
a document has already been processed and, if not, add it together
with several of its metadata fields.
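The check described above ("is this document already processed; if not, add it with metadata") is essentially an atomic put-if-absent. A runnable sketch of the semantics needed, using an in-memory map as a stand-in for the table keyed by document id (in HBase the analogous primitive is an atomic check-and-put on the row key; class and method names here are illustrative, not any library's API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DocRegistry {
    // In-memory stand-in for the "already processed?" table, keyed by document id.
    private final ConcurrentMap<String, String> processed =
            new ConcurrentHashMap<String, String>();

    // Returns true if this call claimed the document (it had not been processed).
    // putIfAbsent is atomic, so two concurrent workers cannot both claim the same doc.
    public boolean markProcessed(String docId, String metadata) {
        return processed.putIfAbsent(docId, metadata) == null;
    }

    public static void main(String[] args) {
        DocRegistry registry = new DocRegistry();
        System.out.println(registry.markProcessed("doc-42", "parsed=true"));  // true: first worker wins
        System.out.println(registry.markProcessed("doc-42", "parsed=true"));  // false: already processed, skip
    }
}
```

The key point is that the existence check and the insert must be a single atomic step; a separate SELECT-then-INSERT (or get-then-put) would let two workers process the same document.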

We heard sungrid, for example, is another way to go, but it's
commercial. We are somewhat lost, so any help/ideas/suggestions are
appreciated.

Best,
PA



2012/5/17 Abhishek Pratap Singh<manu.i...@gmail.com>

Hi,

To your question whether Hadoop can be used without HDFS, the answer is
yes: Hadoop can be used with any kind of distributed file system.
But I'm not able to understand the problem statement clearly enough to
offer my point of view.
Are you processing text files and saving them in a distributed database?

Regards,
Abhishek

On Thu, May 17, 2012 at 1:46 PM, Pierre Antoine Du Bois De Naurois
<  pad...@gmail.com>  wrote:

We want to distribute the processing of text files, run large machine
learning tasks, and have a distributed database, as we have a big
amount of data.

The problem is that each VM can hold up to 2 TB of data (a limitation
of the VM), and we have 20 TB of data. So we have to distribute the
processing, the database, etc. But all that data will live on a
shared, huge central file system.

We heard about myHadoop, but we are not sure how it differs from plain
Hadoop.

Can we run Hadoop/MapReduce without using HDFS? Is that an option?

best,
PA


2012/5/17 Mathias Herberts<mathias.herbe...@gmail.com>

Hadoop does not perform well with shared storage and VMs.

The question should first be asked in terms of what you're trying to
achieve, not of your infrastructure.
On May 17, 2012 10:39 PM, "Pierre Antoine Du Bois De Naurois"<
pad...@gmail.com>  wrote:

Hello,

We have about 50 VMs and we want to distribute processing across them.
However, these VMs share a huge data storage system, and thus their
"virtual" HDDs are all located on the same machine. Would Hadoop be
useful for such a configuration? Could we use Hadoop without HDFS, so
that we retrieve and store everything on the same storage?

Thanks,
PA




--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452
