Ideally you would want your data to be on HDFS and run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. A common workaround is to wrap the image files in XML and create large files, each containing many images. Hadoop offers something called streaming, which lets you split such files at XML boundaries and feed the records to your map/reduce tasks. Streaming also lets you write your tasks in any language, like Perl/PHP/C++.
You can find information about streaming here:
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
And information about parsing XML files with streaming here:
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
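Just as a rough, untested sketch: assuming you wrapped each image in <image>...</image> tags (the tag names, paths, and fields below are only assumptions for illustration), you could point streaming's XML record reader at those tags and use a small mapper that reads records from stdin. A Python example might look roughly like this; a PHP or Perl mapper would work the same way, reading records from stdin and writing key<TAB>value lines to stdout.

# mapper.py -- minimal sketch of a streaming mapper (untested, names are
# illustrative). You would run it with something like:
#
#   hadoop jar hadoop-streaming.jar \
#     -inputreader "StreamXmlRecord,begin=<image>,end=</image>" \
#     -input /user/you/images.xml -output /user/you/out \
#     -mapper mapper.py -file mapper.py
#
# Streaming hands each <image>...</image> record to the mapper on stdin.

import re
import sys

def main():
    # Read everything from stdin and pull out the <image> records.
    # The <name> field is an assumption about how you wrap your images.
    text = sys.stdin.read()
    for record in re.findall(r"<image>.*?</image>", text, re.DOTALL):
        name_match = re.search(r"<name>(.*?)</name>", record, re.DOTALL)
        name = name_match.group(1).strip() if name_match else "unknown"
        # Emit key<TAB>value pairs; here just the image name and record size.
        print("%s\t%d" % (name, len(record)))

if __name__ == "__main__":
    main()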

Thanks,
Lohit

----- Original Message ----
From: Chanchal James <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, June 12, 2008 9:42:46 AM
Subject: Question about Hadoop

Hi,

I have a question about Hadoop. I am a beginner and just testing Hadoop. Would like to know how a php application would benefit from this, say an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?

Thank you.
