solved [Re: streaming command [Re: no output written to HDFS]]
The problem is solved. I had to make sure that the streaming file is given in "-input" and the other file is given in "-file". That solved the issue. Thanks, PD

On Fri, Aug 31, 2012 at 10:07 AM, Periya.Data wrote:
> Yes, both input files need to be processed by the mapper, but not in the same fashion. Essentially, this is what my Python script does:
> - read two text files, A and B. File A has a list of account-IDs (all numeric). File B has about 10 records, some of which have the same account_ID as those listed in file A.
> - mapper: read both files, compare, and print out those records that have matching account_IDs.
>
> I have tried placing both input files under a single input directory. Same behavior.
>
> And, from what I have read so far, "-mapper" or "-reducer" should have ONLY the name of the executable (in my case, "test2.py"). But if I do that, nothing happens. I have to explicitly mention: -mapper "cat $1 | python $GHU_HOME/test2.py $2"...something like that...which looks unconventional, but it produces "some" output - not the correct one though.
>
> Again, if I run my script on just a plain Linux machine, using the basic command: cat $1 | python test2.py $2, it produces the expected output.
>
> *Observation*: If I do not specify the two files under the "-file" option, then I see no output written to HDFS, even though the output directory has empty part-files and a SUCCESS directory. The 3 part-files are reasonable, as 3 mappers are configured for each job.
>
> My current command:
>
> hadoop jar ...streaming.jar \
> -input /user/ghu/input/* \
> -output /user/ghu/out \
> -file /home/ghu/test2.py \
> -mapper "cat $1 | python test2.py $2" \
> -file /home/ghu/$1 \
> -file /home/ghu/$2
>
> Learning,
> /PD
>
> On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala wrote:
>> Hi,
>>
>> Do both input files contain data that needs to be processed by the mapper in the same fashion? In which case, you could just put the input files under a directory in HDFS and provide that as input. The -input option does accept a directory as argument.
>>
>> Otherwise, can you please explain a little more what you're trying to do with the two inputs.
>>
>> Thanks
>> Hemanth
>>
>> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data wrote:
>>> This is interesting. I changed my command to:
>>>
>>> -mapper "cat $1 | $GHU_HOME/test2.py $2" \
>>>
>>> and it is producing output to HDFS. But the output is not what I expected and is not the same as when I do "cat | map" on Linux. It is producing part-0, part-1 and part-2. I expected only one output file with just 2 records.
>>>
>>> I think I have to understand what exactly "-file" does and what exactly "-input" does. I am experimenting with what happens if I give my input files on the command line (like: test2.py arg1 arg2) as against specifying the input files via "-file" and "-input" options...
>>>
>>> The problem is I have 2 input files...and have no idea how to pass them. Should I keep one in HDFS and stream in the other?
>>>
>>> More digging,
>>> PD/
>>>
>>> On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data wrote:
>>>> Hi Bertrand,
>>>> No, I do not observe the same when I run using cat | map. I can see the output in STDOUT when I run my program.
>>>>
>>>> I do not have any reducer. In my command, I provide "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be written directly to HDFS.
>>>>
>>>> Your suspicion may be right about the output. In my counters, "map input records" = 40 and "map output records" = 0. I am trying to see if I am messing up in my command...(see below)
>>>>
>>>> Initially, I had my mapper - "test2.py" - take in 2 arguments. Now, I am streaming one file in and test2.py takes in only one argument. How should I frame my command below? I think that is where I am messing up..
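The fix described above (ship the account-ID list with "-file" so it lands in each task's working directory, and stream the record file through "-input") implies a mapper that reads records from stdin and the ID list from a local side file. The thread never posts test2.py itself, so the following is only a minimal sketch of that pattern; the side-file name "accounts.txt" and the tab-separated record layout are assumptions.

```python
#!/usr/bin/env python
# Sketch (not the thread's actual test2.py): a streaming mapper that reads
# records from stdin (supplied via -input) and a lookup list from a side
# file shipped to the task's working directory via -file.
import sys

def load_account_ids(path):
    """Read one account ID per line from the side file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def run(stdin, account_ids, stdout):
    """Emit only the records whose first field matches a known account ID."""
    for line in stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        account_id = line.split("\t", 1)[0]  # assumed tab-separated records
        if account_id in account_ids:
            stdout.write(line + "\n")

# In the real job this would be wired up as:
#   ids = load_account_ids("accounts.txt")   # side file shipped via -file
#   run(sys.stdin, ids, sys.stdout)
```

With this shape, no shell pipe is needed in the mapper string at all: -mapper "python test2.py", -file test2.py, -file accounts.txt, and -input pointing at the record file in HDFS.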
streaming command [Re: no output written to HDFS]
Yes, both input files need to be processed by the mapper, but not in the same fashion. Essentially, this is what my Python script does:
- read two text files, A and B. File A has a list of account-IDs (all numeric). File B has about 10 records, some of which have the same account_ID as those listed in file A.
- mapper: read both files, compare, and print out those records that have matching account_IDs.

I have tried placing both input files under a single input directory. Same behavior.

And, from what I have read so far, "-mapper" or "-reducer" should have ONLY the name of the executable (in my case, "test2.py"). But if I do that, nothing happens. I have to explicitly mention: -mapper "cat $1 | python $GHU_HOME/test2.py $2"...something like that...which looks unconventional, but it produces "some" output - not the correct one though.

Again, if I run my script on just a plain Linux machine, using the basic command: cat $1 | python test2.py $2, it produces the expected output.

*Observation*: If I do not specify the two files under the "-file" option, then I see no output written to HDFS, even though the output directory has empty part-files and a SUCCESS directory. The 3 part-files are reasonable, as 3 mappers are configured for each job.

My current command:

hadoop jar ...streaming.jar \
-input /user/ghu/input/* \
-output /user/ghu/out \
-file /home/ghu/test2.py \
-mapper "cat $1 | python test2.py $2" \
-file /home/ghu/$1 \
-file /home/ghu/$2

Learning,
/PD

On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala wrote:
> Hi,
>
> Do both input files contain data that needs to be processed by the mapper in the same fashion? In which case, you could just put the input files under a directory in HDFS and provide that as input. The -input option does accept a directory as argument.
>
> Otherwise, can you please explain a little more what you're trying to do with the two inputs.
>
> Thanks
> Hemanth
>
> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data wrote:
>> This is interesting. I changed my command to:
>>
>> -mapper "cat $1 | $GHU_HOME/test2.py $2" \
>>
>> and it is producing output to HDFS. But the output is not what I expected and is not the same as when I do "cat | map" on Linux. It is producing part-0, part-1 and part-2. I expected only one output file with just 2 records.
>>
>> I think I have to understand what exactly "-file" does and what exactly "-input" does. I am experimenting with what happens if I give my input files on the command line (like: test2.py arg1 arg2) as against specifying the input files via "-file" and "-input" options...
>>
>> The problem is I have 2 input files...and have no idea how to pass them. Should I keep one in HDFS and stream in the other?
>>
>> More digging,
>> PD/
>>
>> On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data wrote:
>>> Hi Bertrand,
>>> No, I do not observe the same when I run using cat | map. I can see the output in STDOUT when I run my program.
>>>
>>> I do not have any reducer. In my command, I provide "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be written directly to HDFS.
>>>
>>> Your suspicion may be right about the output. In my counters, "map input records" = 40 and "map output records" = 0. I am trying to see if I am messing up in my command...(see below)
>>>
>>> Initially, I had my mapper - "test2.py" - take in 2 arguments. Now, I am streaming one file in and test2.py takes in only one argument. How should I frame my command below? I think that is where I am messing up..
>>>
>>> run.sh (run as: cat | ./run.sh):
>>> ---
>>> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>>> -D mapred.reduce.tasks=0 \
>>> -verbose \
>>> -input "$HDFS_INPUT" \
>>> -input "$HDFS_INPUT_2" \
>>> -output "$HDFS_OUTPUT" \
>>> -file "$GHU_HOME/test2.py" \
>>> -mapper "python $GHU_HOME/test2.py $1" \
>>> -file "$GHU_HOME/$1"
>>>
>>> If I modify my mapper
Re: no output written to HDFS
This is interesting. I changed my command to:

-mapper "cat $1 | $GHU_HOME/test2.py $2" \

and it is producing output to HDFS. But the output is not what I expected and is not the same as when I do "cat | map" on Linux. It is producing part-0, part-1 and part-2. I expected only one output file with just 2 records.

I think I have to understand what exactly "-file" does and what exactly "-input" does. I am experimenting with what happens if I give my input files on the command line (like: test2.py arg1 arg2) as against specifying the input files via "-file" and "-input" options...

The problem is I have 2 input files...and have no idea how to pass them. Should I keep one in HDFS and stream in the other?

More digging,
PD/

On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data wrote:
> Hi Bertrand,
> No, I do not observe the same when I run using cat | map. I can see the output in STDOUT when I run my program.
>
> I do not have any reducer. In my command, I provide "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be written directly to HDFS.
>
> Your suspicion may be right about the output. In my counters, "map input records" = 40 and "map output records" = 0. I am trying to see if I am messing up in my command...(see below)
>
> Initially, I had my mapper - "test2.py" - take in 2 arguments. Now, I am streaming one file in and test2.py takes in only one argument. How should I frame my command below? I think that is where I am messing up..
>
> run.sh (run as: cat | ./run.sh):
> ---
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> -D mapred.reduce.tasks=0 \
> -verbose \
> -input "$HDFS_INPUT" \
> -input "$HDFS_INPUT_2" \
> -output "$HDFS_OUTPUT" \
> -file "$GHU_HOME/test2.py" \
> -mapper "python $GHU_HOME/test2.py $1" \
> -file "$GHU_HOME/$1"
>
> If I modify my mapper to take in 2 arguments, then I would run it as:
>
> run.sh (run as: ./run.sh):
> ---
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> -D mapred.reduce.tasks=0 \
> -verbose \
> -input "$HDFS_INPUT" \
> -input "$HDFS_INPUT_2" \
> -output "$HDFS_OUTPUT" \
> -file "$GHU_HOME/test2.py" \
> -mapper "python $GHU_HOME/test2.py $1 $2" \
> -file "$GHU_HOME/$1" \
> -file "$GHU_HOME/$2"
>
> Please let me know if I am making a mistake here.
>
> Thanks.
> PD
>
> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux wrote:
>> Do you observe the same thing when running without Hadoop? (cat, map, sort and then reduce)
>>
>> Could you provide the counters of your job? You should be able to get them using the job tracker interface.
>>
>> The most probable answer without more information would be that your reducer does not output any <key, value>s.
>>
>> Regards
>>
>> Bertrand
>>
>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data wrote:
>>> Hi All,
>>> My Hadoop streaming job (in Python) runs to "completion" (both map and reduce say 100% complete). But when I look at the output directory in HDFS, the part files are empty. I do not know what might be causing this behavior. I understand that the percentages represent the records that have been read in (not processed).
>>>
>>> The following are some of the logs. The detailed logs from Cloudera Manager say that there were no Map Outputs...which is interesting. Any suggestions?
>>>
>>> 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>>> 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=x.yyy.com:8021 -kill job_201208232245_3182
>>> 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xx..com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>>> 12/08/30 03:27:15 INFO streaming.StreamJob: map 0% reduce 0%
>>> 12/08/30 03:27:20 INFO streaming.StreamJob: map 33% reduce 0%
>>> 12/08/30 03:27:23 INFO streaming.StreamJob: map 67% reduce 0%
>>> 12/08/30 03:27:29 INFO streaming.
Re: no output written to HDFS
Hi Bertrand,
No, I do not observe the same when I run using cat | map. I can see the output in STDOUT when I run my program.

I do not have any reducer. In my command, I provide "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be written directly to HDFS.

Your suspicion may be right about the output. In my counters, "map input records" = 40 and "map output records" = 0. I am trying to see if I am messing up in my command...(see below)

Initially, I had my mapper - "test2.py" - take in 2 arguments. Now, I am streaming one file in and test2.py takes in only one argument. How should I frame my command below? I think that is where I am messing up..

run.sh (run as: cat | ./run.sh):
---
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.reduce.tasks=0 \
-verbose \
-input "$HDFS_INPUT" \
-input "$HDFS_INPUT_2" \
-output "$HDFS_OUTPUT" \
-file "$GHU_HOME/test2.py" \
-mapper "python $GHU_HOME/test2.py $1" \
-file "$GHU_HOME/$1"

If I modify my mapper to take in 2 arguments, then I would run it as:

run.sh (run as: ./run.sh):
---
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.reduce.tasks=0 \
-verbose \
-input "$HDFS_INPUT" \
-input "$HDFS_INPUT_2" \
-output "$HDFS_OUTPUT" \
-file "$GHU_HOME/test2.py" \
-mapper "python $GHU_HOME/test2.py $1 $2" \
-file "$GHU_HOME/$1" \
-file "$GHU_HOME/$2"

Please let me know if I am making a mistake here.

Thanks.
PD

On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux wrote:
> Do you observe the same thing when running without Hadoop? (cat, map, sort and then reduce)
>
> Could you provide the counters of your job? You should be able to get them using the job tracker interface.
>
> The most probable answer without more information would be that your reducer does not output any <key, value>s.
>
> Regards
>
> Bertrand
>
> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data wrote:
>> Hi All,
>> My Hadoop streaming job (in Python) runs to "completion" (both map and reduce say 100% complete). But when I look at the output directory in HDFS, the part files are empty. I do not know what might be causing this behavior. I understand that the percentages represent the records that have been read in (not processed).
>>
>> The following are some of the logs. The detailed logs from Cloudera Manager say that there were no Map Outputs...which is interesting. Any suggestions?
>>
>> 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>> 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=x.yyy.com:8021 -kill job_201208232245_3182
>> 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xx..com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>> 12/08/30 03:27:15 INFO streaming.StreamJob: map 0% reduce 0%
>> 12/08/30 03:27:20 INFO streaming.StreamJob: map 33% reduce 0%
>> 12/08/30 03:27:23 INFO streaming.StreamJob: map 67% reduce 0%
>> 12/08/30 03:27:29 INFO streaming.StreamJob: map 100% reduce 0%
>> 12/08/30 03:27:33 INFO streaming.StreamJob: map 100% reduce 100%
>> 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete: job_201208232245_3182
>> 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
>> Thu Aug 30 03:27:24 GMT 2012
>> *** END
>> bash-3.2$
>> bash-3.2$ hadoop fs -ls /user/ghu/
>> Found 5 items
>> -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/_SUCCESS
>> drwxrwxrwx - ghu hadoop 0 2012-08-30 03:27 /user/GHU/_logs
>> -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-0
>> -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-1
>> -rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-2
>> bash-3.2$
>>
>> Metadata: Status Succeeded, Type MapReduce, Id job_201208232245_3182
>> Name CaidMatch
>> User srisrini, Mapper class PipeMapper, Reducer class
>> Scheduler pool name default, Job input directory hdfs://x.yyy.txt,hdfs://..com/user/GHUcaidlist
no output written to HDFS
Hi All,
My Hadoop streaming job (in Python) runs to "completion" (both map and reduce say 100% complete). But when I look at the output directory in HDFS, the part files are empty. I do not know what might be causing this behavior. I understand that the percentages represent the records that have been read in (not processed).

The following are some of the logs. The detailed logs from Cloudera Manager say that there were no Map Outputs...which is interesting. Any suggestions?

12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=x.yyy.com:8021 -kill job_201208232245_3182
12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xx..com:60030/jobdetails.jsp?jobid=job_201208232245_3182
12/08/30 03:27:15 INFO streaming.StreamJob: map 0% reduce 0%
12/08/30 03:27:20 INFO streaming.StreamJob: map 33% reduce 0%
12/08/30 03:27:23 INFO streaming.StreamJob: map 67% reduce 0%
12/08/30 03:27:29 INFO streaming.StreamJob: map 100% reduce 0%
12/08/30 03:27:33 INFO streaming.StreamJob: map 100% reduce 100%
12/08/30 03:27:35 INFO streaming.StreamJob: Job complete: job_201208232245_3182
12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
Thu Aug 30 03:27:24 GMT 2012
*** END
bash-3.2$
bash-3.2$ hadoop fs -ls /user/ghu/
Found 5 items
-rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/_SUCCESS
drwxrwxrwx - ghu hadoop 0 2012-08-30 03:27 /user/GHU/_logs
-rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-0
-rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-1
-rw-r--r-- 3 ghu hadoop 0 2012-08-30 03:27 /user/GHU/part-2
bash-3.2$

Metadata:
Status: Succeeded
Type: MapReduce
Id: job_201208232245_3182
Name: CaidMatch
User: srisrini
Mapper class: PipeMapper
Reducer class:
Scheduler pool name: default
Job input directory: hdfs://x.yyy.txt,hdfs://..com/user/GHUcaidlist.txt
Job output directory: hdfs://..com/user/GHU/

Timing:
Duration: 20.977s
Submit time: Wed, 29 Aug 2012 08:27 PM
Start time: Wed, 29 Aug 2012 08:27 PM
Finish time: Wed, 29 Aug 2012 08:27 PM

Progress and Scheduling:
Map Progress: 100.0%
Reduce Progress: 100.0%
Launched maps: 4
Data-local maps: 3
Rack-local maps: 1
Other local maps:
Desired maps: 3
Launched reducers:
Desired reducers: 0
Fairscheduler running tasks:
Fairscheduler minimum share:
Fairscheduler demand:

Current Resource Usage:
Current User CPUs: 0
Current System CPUs: 0
Resident memory: 0 B
Running maps: 0
Running reducers: 0

Aggregate Resource Usage and Counters:
User CPU: 0s
System CPU: 0s
Map Slot Time: 12.135s
Reduce slot time: 0s
Cumulative disk reads:
Cumulative disk writes: 155.0 KiB
Cumulative HDFS reads: 3.6 KiB
Cumulative HDFS writes:
Map input bytes: 2.5 KiB
Map input records: 45
Map output records: 0
Reducer input groups:
Reducer input records:
Reducer output records:
Reducer shuffle bytes:
Spilled records:
Hadoop streaming - Subprocess failed
Hi,
I am running a map-reduce job in Python and I get this error message. I do not understand what it means. Output is not written to HDFS. I am using CDH3u3. Any suggestion is appreciated.

MapAttempt TASK_TYPE="MAP" TASKID="task_201208232245_2812_m_00" TASK_ATTEMPT_ID="attempt_201208232245_2812_m_00_0" TASK_STATUS="FAILED"
*ERROR="java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1*
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
"
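For context on the error above: Hadoop Streaming reports "subprocess failed with code 1" when the mapper process exits with a non-zero status, which for a Python mapper usually means an uncaught exception. A common way to make the real cause visible is to catch exceptions and write the traceback to stderr, which ends up in the task's stderr log. The sketch below is an illustrative pattern, not code from this thread; process_line and its tab-separated record format are made-up placeholders.

```python
#!/usr/bin/env python
# Sketch (not from the thread): a streaming-mapper pattern that surfaces
# the traceback of a failing record on stderr instead of silently dying
# with "subprocess failed with code 1".
import sys
import traceback

def process_line(line):
    """Placeholder record handler; raises on malformed input."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least 2 tab-separated fields: %r" % line)
    return "%s\t%s" % (fields[0], fields[1])

def main(stdin, stdout, stderr):
    for line in stdin:
        try:
            stdout.write(process_line(line) + "\n")
        except Exception:
            # Log the traceback and the offending record, then keep going
            # (or re-raise / sys.exit(1) if one bad record should fail the task).
            stderr.write(traceback.format_exc())
            stderr.write("bad record: %r\n" % line)
    return 0

# Real job wiring would be: sys.exit(main(sys.stdin, sys.stdout, sys.stderr))
```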
Re: streaming data ingest into HDFS
Sorry...misworded my statement. What I meant was that the sources are meant to be untouched, and the admins do not want to mess with them or add more tools there. All I've got is source addresses and port numbers. Once I know what technique(s) I will be using, I will accordingly be given access via firewalls and other access credentials.

-PD

On Thu, Dec 15, 2011 at 5:05 PM, Russell Jurney wrote:
> Just curious - what is the situation you're in where no collectors are possible? Sounds interesting.
>
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com
>
> On Dec 15, 2011, at 5:01 PM, "Periya.Data" wrote:
>> Hi all,
>> I would like to know what options I have to ingest terabytes of data that are being generated very fast from a small set of sources. I have thought about:
>>
>> 1. Flume
>> 2. Have an intermediate staging server(s) where you can offload data and from there use dfs -put to load into HDFS.
>> 3. Anything else??
>>
>> Suppose I am unable to use Flume (since the sources do not support its installation), and suppose that I do not have the luxury of an intermediate staging place - what options do I have? In this case, I might have to directly (preferably in parallel) ingest data into HDFS.
>>
>> I have read about a technique to use Map-Reduce where the map would read data and use the Java API to store it in HDFS. We could have multiple threads of maps to get parallel ingestion. It would be nice to know about ways to ingest data "directly" into HDFS considering my assumptions.
>>
>> Suggestions are appreciated,
>>
>> /PD.
streaming data ingest into HDFS
Hi all,
I would like to know what options I have to ingest terabytes of data that are being generated very fast from a small set of sources. I have thought about:

1. Flume
2. Have an intermediate staging server(s) where you can offload data and from there use dfs -put to load into HDFS.
3. Anything else??

Suppose I am unable to use Flume (since the sources do not support its installation), and suppose that I do not have the luxury of an intermediate staging place - what options do I have? In this case, I might have to directly (preferably in parallel) ingest data into HDFS.

I have read about a technique to use Map-Reduce where the map would read data and use the Java API to store it in HDFS. We could have multiple threads of maps to get parallel ingestion. It would be nice to know about ways to ingest data "directly" into HDFS considering my assumptions.

Suggestions are appreciated,

/PD.
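The parallel-ingest idea in the post above (many workers, each pushing a subset of the source files into HDFS) can also be sketched without MapReduce, using a local thread pool around the stock CLI. This is only an illustration of the partitioning-plus-workers shape; the worker count and round-robin split are arbitrary choices, and `hadoop fs -put` is the standard upload command.

```python
# Sketch (assumptions, not from the thread): parallel ingest into HDFS by
# running several "hadoop fs -put" uploads concurrently from one client.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition(files, n_workers):
    """Deal files round-robin into n_workers batches of roughly equal size."""
    batches = [[] for _ in range(n_workers)]
    for i, f in enumerate(files):
        batches[i % n_workers].append(f)
    return [b for b in batches if b]  # drop empty batches

def upload_batch(batch, hdfs_dir):
    """Upload one batch with the stock CLI; returns the exit status."""
    return subprocess.call(["hadoop", "fs", "-put"] + batch + [hdfs_dir])

def parallel_ingest(files, hdfs_dir, n_workers=4):
    """Run the batches concurrently; returns one exit status per batch."""
    batches = partition(files, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda b: upload_batch(b, hdfs_dir), batches))
```

Note that a single client machine's network link is still the bottleneck here; spreading the workers across machines (or using a collector such as Flume, as discussed above) is what buys real parallelism.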
Re: choices for deploying a small hadoop cluster on EC2
Thanks for all your help and replies. Though I am leaning towards option 1 or 2, I looked up BigTop, an Incubator project in Apache. Could not find enough info on it on its website. I have a few more questions, and hope they apply to this mailing list:

1. Cos: Can you please point me to a link that talks about BigTop & EC2?

2. Regarding Whirr, can I just choose an Ubuntu EBS-backed AMI? Would that be any different from choosing a normal Hadoop AMI and (later) trying to mount an EBS volume to that instance?

3. John: I like your idea of using S3 to store input and output. But say I start a Hadoop cluster, configure Sqoop and Hive, and run it. Then, after I get my output in S3, I either stop it or terminate it (since I do not have EBS, I don't care). Now, after a while, I want to bring up a similar cluster and run Hive and Sqoop and do more experiments. In this case, will I have to reconfigure all my Sqoop settings, Hive table schemas, etc.? Because I think once I "stop" an instance, I will lose the configs, and when I restart a Hadoop AMI, I will only have Hadoop nicely running in that instance and nothing else. I ideally want everything to persist, even configs and newly installed tools (Hive, Sqoop). Or, should I create a custom Ubuntu AMI with Hadoop, Sqoop, Hive, etc. "pre-cooked" in it? Probably this is the ideal way to proceed, even if it is a little painful.

I think I really want an EBS-backed instance, as it maintains its internal state when stopped and restarted. Please let me know your opinion. This discussion is deviating from what I originally started as... A little Googling turns up similar posts: https://forums.aws.amazon.com/message.jspa?messageID=131157

I know I can find out by trying these out, but I want to lessen my burden in the trial-and-error process.

Thanks very much,
PD.
On Tue, Nov 29, 2011 at 12:40 PM, Konstantin Boudnik wrote:
> I'd suggest you use BigTop-produced bits (cross-posting to the bigtop-dev@ list), which also possess Puppet recipes allowing for fully automated deployment and configuration. BigTop also uses the Jenkins EC2 plugin for the deployment part and it seems to work real great!
>
> Cos
>
> On Tue, Nov 29, 2011 at 12:28 PM, Periya.Data wrote:
>> Hi All,
>> I am just beginning to learn how to deploy a small cluster (a 3 node cluster) on EC2. After some quick Googling, I see the following approaches:
>>
>> 1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it have features for persisting (EBS)?
>> 2. CDH Cloud Scripts - has an EC2 AMI - again for temp Hadoop clusters/POC etc. Good stuff - I can persist using EBS snapshots. But this uses CDH2.
>> 3. Install Hadoop manually, and related stuff like Hive, on each cluster node on EC2 (or use some automation tool like Chef). I do not prefer it.
>> 4. The Hadoop distribution comes with EC2 support (under src/contrib) and there are several Hadoop EC2 AMIs available. I have not studied enough to know if that is easy for a beginner like me.
>> 5. Anything else??
>>
>> 1 and 2 look promising as a beginner. If any of you have any thoughts about this, I would like to know (like what to keep in mind, what to take care of, caveats etc). I want my data/config to persist (using EBS) and continue from where I left off (after a few days). Also, I want to have Hive and Sqoop installed. Can this be done using 1 or 2? Or will installation of them have to be done manually after I set up the cluster?
>>
>> Thanks very much,
>>
>> PD.
choices for deploying a small hadoop cluster on EC2
Hi All,
I am just beginning to learn how to deploy a small cluster (a 3 node cluster) on EC2. After some quick Googling, I see the following approaches:

1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it have features for persisting (EBS)?
2. CDH Cloud Scripts - has an EC2 AMI - again for temp Hadoop clusters/POC etc. Good stuff - I can persist using EBS snapshots. But this uses CDH2.
3. Install Hadoop manually, and related stuff like Hive, on each cluster node on EC2 (or use some automation tool like Chef). I do not prefer it.
4. The Hadoop distribution comes with EC2 support (under src/contrib) and there are several Hadoop EC2 AMIs available. I have not studied enough to know if that is easy for a beginner like me.
5. Anything else??

1 and 2 look promising as a beginner. If any of you have any thoughts about this, I would like to know (like what to keep in mind, what to take care of, caveats etc). I want my data/config to persist (using EBS) and continue from where I left off (after a few days). Also, I want to have Hive and Sqoop installed. Can this be done using 1 or 2? Or will installation of them have to be done manually after I set up the cluster?

Thanks very much,
PD.
Re: mapreduce linear chaining: ClassCastException
Fantastic! Thanks much, Bejoy. Now I am able to get the output of my MR-2 nicely. I had to convert the sum (in text format) to IntWritable, and I am able to get all the word frequencies in ascending order. I used "KeyValueTextInputFormat.class". My program was complaining when I used "KeyValueInputFormat". Now, let me investigate how to do that in descending order...and then top-20...etc. I know I must look into RawComparator and more...

Thanks,
PD.

On Sat, Oct 15, 2011 at 1:08 AM, wrote:
> Hi
> I believe what is happening in your case is this. The first map reduce job runs to completion. When you trigger the second map reduce job, it is triggered with the default input format, TextInputFormat, and definitely expects the key and value to be of LongWritable and Text type. By default a MapReduce job's output format is TextOutputFormat, with key and value tab separated. When you need to consume this output of an MR job as key value pairs in another MR job, use KeyValueInputFormat, i.e. while setting config parameters for the second job set jobConf.setInputFormat(KeyValueInputFormat.class). Now, if your output key value pairs use a different separator other than the default tab, then for the second job you need to specify that as well, using key.value.separator.in.input.line.
>
> In short, for your case, doing the following in the second map reduce job would get things in place:
> - use jobConf.setInputFormat(KeyValueInputFormat.class)
> - alter your mapper to accept key values of type Text, Text
> - swap the key and values within the mapper for output to the reducer, with conversions.
>
> To be noted here, AFAIK KeyValueInputFormat is not a part of the new mapreduce API.
>
> Hope it helps.
>
> Regards
> Bejoy K S
>
> -Original Message-
> From: "Periya.Data"
> Date: Fri, 14 Oct 2011 17:31:27
> To: ;
> Reply-To: common-user@hadoop.apache.org
> Subject: mapreduce linear chaining: ClassCastException
>
> Hi all,
> I am trying a simple extension of the WordCount example in Hadoop. I want to get a frequency of wordcounts in descending order. For that I employ a linear chain of MR jobs. The first MR job (MR-1) does the regular wordcount (the usual example). For the next MR job, I set the mapper to swap the <word, count> to <count, word>. Then, I have the identity reducer simply store the results.
>
> My MR-1 does its job correctly and stores the result in a temp path.
>
> Question 1: The mapper of the second MR job (MR-2) doesn't like the input format. I have properly set the input format for MapClass2 of what it expects and what its output must be. It seems to be expecting a LongWritable. I suspect that it is trying to look at some index file. I am not sure.
>
> It throws an error like this:
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
>
> Some Info:
> - I use the old API (org.apache.hadoop.mapred.*). I am asked to stick with it for now.
> - I use hadoop-0.20.2
>
> For MR-1:
> - conf1.setOutputKeyClass(Text.class);
> - conf1.setOutputValueClass(IntWritable.class);
>
> For MR-2:
> - takes in a Text (word) and IntWritable (sum)
> - conf2.setOutputKeyClass(IntWritable.class);
> - conf2.setOutputValueClass(Text.class);
>
> public class MapClass2 extends MapReduceBase
>     implements Mapper<Text, IntWritable, IntWritable, Text> {
>
>     @Override
>     public void map(Text word, IntWritable sum,
>             OutputCollector<IntWritable, Text> output,
>             Reporter reporter) throws IOException {
>
>         output.collect(sum, word); // swap
>     }
> }
>
> Any suggestions would be helpful. Is my MapClass2 code right in the first place...for swapping? Or should I assume that the mapper reads line by line, so it must read in one line, then use StringTokenizer to split it up and convert the second token (sum) from string to int? Or should I mess around with the OutputKeyComparator class?
>
> Thanks,
> PD
mapreduce linear chaining: ClassCastException
Hi all,
I am trying a simple extension of the WordCount example in Hadoop. I want to get a frequency of wordcounts in descending order. For that I employ a linear chain of MR jobs. The first MR job (MR-1) does the regular wordcount (the usual example). For the next MR job, I set the mapper to swap the <word, count> to <count, word>. Then, I have the identity reducer simply store the results.

My MR-1 does its job correctly and stores the result in a temp path.

Question 1: The mapper of the second MR job (MR-2) doesn't like the input format. I have properly set the input format for MapClass2 of what it expects and what its output must be. It seems to be expecting a LongWritable. I suspect that it is trying to look at some index file. I am not sure.

It throws an error like this:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

Some Info:
- I use the old API (org.apache.hadoop.mapred.*). I am asked to stick with it for now.
- I use hadoop-0.20.2

For MR-1:
- conf1.setOutputKeyClass(Text.class);
- conf1.setOutputValueClass(IntWritable.class);

For MR-2:
- takes in a Text (word) and IntWritable (sum)
- conf2.setOutputKeyClass(IntWritable.class);
- conf2.setOutputValueClass(Text.class);

public class MapClass2 extends MapReduceBase
    implements Mapper<Text, IntWritable, IntWritable, Text> {

    @Override
    public void map(Text word, IntWritable sum,
            OutputCollector<IntWritable, Text> output,
            Reporter reporter) throws IOException {

        output.collect(sum, word); // swap
    }
}

Any suggestions would be helpful. Is my MapClass2 code right in the first place...for swapping? Or should I assume that the mapper reads line by line, so it must read in one line, then use StringTokenizer to split it up and convert the second token (sum) from string to int? Or should I mess around with the OutputKeyComparator class?

Thanks,
PD
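The two-job chain described above (wordcount, then a swap so results can be sorted by count) can be simulated locally in plain Python to sanity-check the data flow before debugging the Java jobs. This sketch only illustrates the logic; it is not the Hadoop mapred API.

```python
# Sketch (illustration only, not the Java mapred API): simulate the two-job
# chain locally. MR-1 is the classic wordcount; MR-2's mapper swaps
# (word, count) to (count, word), and the framework's sort-by-key, simulated
# here with sorted(), yields the frequency ordering the thread is after.
from collections import Counter

def mr1_wordcount(lines):
    """MR-1: map words to 1 and reduce by summing (wordcount)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

def mr2_swap_and_sort(word_counts, descending=True):
    """MR-2: mapper swaps (word, count) -> (count, word); the identity
    reducer just passes the sorted pairs through."""
    swapped = [(count, word) for word, count in word_counts.items()]
    return sorted(swapped, reverse=descending)

counts = mr1_wordcount(["a b a", "b a c"])
ranked = mr2_swap_and_sort(counts)  # [(3, 'a'), (2, 'b'), (1, 'c')]
```

In the real MR-2, "descending" corresponds to the OutputKeyComparator mentioned at the end of the post: the shuffle sorts IntWritable keys in ascending order unless a reversing comparator is supplied.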
Re: Simple Hadoop program build with Maven
Fantastic! Worked like a charm. Thanks much Bochun. For those who are facing similar issues, here is the command and output:

$ hadoop jar ../MyHadoopProgram.jar com.ABC.MyHadoopProgram -libjars ~/CDH3/extJars/json-rpc-1.0.jar /usr/PD/input/sample22.json /usr/PD/output
11/10/08 17:51:45 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/08 17:51:46 INFO mapred.JobClient: Running job: job_201110072230_0005
11/10/08 17:51:47 INFO mapred.JobClient:  map 0% reduce 0%
11/10/08 17:51:58 INFO mapred.JobClient:  map 50% reduce 0%
11/10/08 17:51:59 INFO mapred.JobClient:  map 100% reduce 0%
11/10/08 17:52:08 INFO mapred.JobClient:  map 100% reduce 100%
11/10/08 17:52:10 INFO mapred.JobClient: Job complete: job_201110072230_0005
11/10/08 17:52:10 INFO mapred.JobClient: Counters: 23
11/10/08 17:52:10 INFO mapred.JobClient: Job Counters
11/10/08 17:52:10 INFO mapred.JobClient:   Launched reduce tasks=1
11/10/08 17:52:10 INFO mapred.JobClient:   SLOTS_MILLIS_MAPS=17981
11/10/08 17:52:10 INFO mapred.JobClient:   Total time spent by all reduces waiting after reserving slots (ms)=0
11/10/08 17:52:10 INFO mapred.JobClient:   Total time spent by all maps waiting after reserving slots (ms)=0
11/10/08 17:52:10 INFO mapred.JobClient:   Launched map tasks=2
11/10/08 17:52:10 INFO mapred.JobClient:   Data-local map tasks=2
11/10/08 17:52:10 INFO mapred.JobClient:   SLOTS_MILLIS_REDUCES=9421
11/10/08 17:52:10 INFO mapred.JobClient: FileSystemCounters
11/10/08 17:52:10 INFO mapred.JobClient:   FILE_BYTES_READ=606
11/10/08 17:52:10 INFO mapred.JobClient:   HDFS_BYTES_READ=56375
11/10/08 17:52:10 INFO mapred.JobClient:   FILE_BYTES_WRITTEN=157057
11/10/08 17:52:10 INFO mapred.JobClient:   HDFS_BYTES_WRITTEN=504
11/10/08 17:52:10 INFO mapred.JobClient: Map-Reduce Framework
11/10/08 17:52:10 INFO mapred.JobClient:   Reduce input groups=24
11/10/08 17:52:10 INFO mapred.JobClient:   Combine output records=24
11/10/08 17:52:10 INFO mapred.JobClient:   Map input records=24
11/10/08 17:52:10 INFO mapred.JobClient:   Reduce shuffle bytes=306
11/10/08 17:52:10 INFO mapred.JobClient:   Reduce output records=24
11/10/08 17:52:10 INFO mapred.JobClient:   Spilled Records=48
11/10/08 17:52:10 INFO mapred.JobClient:   Map output bytes=552
11/10/08 17:52:10 INFO mapred.JobClient:   Map input bytes=54923
11/10/08 17:52:10 INFO mapred.JobClient:   Combine input records=24
11/10/08 17:52:10 INFO mapred.JobClient:   Map output records=24
11/10/08 17:52:10 INFO mapred.JobClient:   SPLIT_RAW_BYTES=240
11/10/08 17:52:10 INFO mapred.JobClient:   Reduce input records=24
$

Appreciate your help.
PD.

On Fri, Oct 7, 2011 at 11:31 PM, Bochun Bai wrote:
> To make a big bundled jar file using maven I suggest this plugin:
>   http://anydoby.com/fatjar/usage.html
> But I prefer not doing so, because the classpath order is different
> in different environments.
>
> I guess your old myHadoopProgram.jar should contain Main-Class meta info.
> So the following ***xxx*** part is omitted. It originally looks like:
>
>   hadoop jar jar/myHadoopProgram.jar ***com.ABC.xxx*** -libjars ../lib/json-rpc-1.0.jar /usr/PD/input/sample22.json /usr/PD/output/
>
> I suggest you add the Main-Class meta following this:
>   http://maven.apache.org/plugins/maven-assembly-plugin/usage.html#Advanced_Configuration
> or pay attention to the order of <main-class> and <-libjars ..> using:
>   hadoop jar <jar> <main-class> <-libjars ...>
>
> On Sat, Oct 8, 2011 at 12:05 PM, Periya.Data wrote:
> > Hi all,
> > I am migrating from ant builds to maven. So, brand new to Maven and do
> > not yet understand many parts of it.
> >
> > Problem: I have a perfectly working map-reduce program (working by ant
> > build). This program needs an external jar file (json-rpc-1.0.jar). So, when
> > I run the program, I do the following to get a nice output:
> >
> > $ hadoop jar jar/myHadoopProgram.jar -libjars ../lib/json-rpc-1.0.jar
> > /usr/PD/input/sample22.json /usr/PD/output/
> >
> > (note that I include the external jar file by the "-libjars" option as
> > mentioned in "Hadoop: The Definitive Guide", 2nd edition, page 253).
> > Everything is fine with my ant build.
> >
> > So, now, I move on to Maven. I had some trouble getting my pom.xml right. I
> > am still unsure if it is right, but it builds "successfully" (the resulting
> > jar file has the class files of my program). The essential part of my
> > pom.xml has the two following dependencies (a complete pom.xml is at the end
> > of this email).
> >
> >   <dependency>
> >     <groupId>com.metaparadigm</groupId>
> >     <artifactId>json-rpc</artifactId>
> >     <version>1.0</version>
> >   </dependency>
Simple Hadoop program build with Maven
Hi all,

I am migrating from ant builds to maven. So, brand new to Maven and do not yet understand many parts of it.

Problem: I have a perfectly working map-reduce program (working by ant build). This program needs an external jar file (json-rpc-1.0.jar). So, when I run the program, I do the following to get a nice output:

$ hadoop jar jar/myHadoopProgram.jar -libjars ../lib/json-rpc-1.0.jar /usr/PD/input/sample22.json /usr/PD/output/

(note that I include the external jar file by the "-libjars" option as mentioned in "Hadoop: The Definitive Guide", 2nd edition, page 253). Everything is fine with my ant build.

So, now, I move on to Maven. I had some trouble getting my pom.xml right. I am still unsure if it is right, but it builds "successfully" (the resulting jar file has the class files of my program). The essential part of my pom.xml has the two following dependencies (a complete pom.xml is at the end of this email).

    <dependency>
      <groupId>com.metaparadigm</groupId>
      <artifactId>json-rpc</artifactId>
      <version>1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
      <scope>provided</scope>
    </dependency>

I try to run it like this:

$ hadoop jar ../myHadoopProgram.jar -libjars ../json-rpc-1.0.jar com.ABC.MyHadoopProgram /usr/PD/input/sample22.json /usr/PD/output
Exception in thread "main" java.lang.ClassNotFoundException: -libjars
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:179)
$

Then, I thought, maybe it is not necessary to include the classpath. So, I ran with the following command:

$ hadoop jar ../myHadoopProgram.jar -libjars ../json-rpc-1.0.jar /usr/PD/input/sample22.json /usr/PD/output
Exception in thread "main" java.lang.ClassNotFoundException: -libjars
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:179)
$

Question: What am I doing wrong? I know, since I am new to Maven, I may be missing some key pieces/concepts. What really happens when one builds the classes, where my java program imports org.json.JSONArray and org.json.JSONObject? This import is just for compilation I suppose and it does not get "embedded" into the final jar. Am I right?

I want to either bundle up the external jar(s) into a single jar and conveniently run hadoop using that, or know how to include the external jars on my command line.

This is what I have:
- maven 3.0.3
- Mac OS X
- Java 1.6.0_26
- Hadoop - CDH 0.20.2-cdh3u0

I have Googled, and looked at Tom White's github repo (https://github.com/cloudera/repository-example/blob/master/pom.xml). The more I Google, the more confused I get. Any help is highly appreciated.

Thanks,
PD.

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.ABC</groupId>
      <artifactId>MyHadoopProgram</artifactId>
      <version>1.0</version>
      <packaging>jar</packaging>
      <name>MyHadoopProgram</name>
      <url>http://maven.apache.org</url>
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>
      <dependencies>
        <dependency>
          <groupId>com.metaparadigm</groupId>
          <artifactId>json-rpc</artifactId>
          <version>1.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>0.20.2</version>
          <scope>provided</scope>
        </dependency>
      </dependencies>
    </project>
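As a hedged illustration of the "Main-Class meta info" idea that comes up in the replies: one common way to bake a Main-Class entry into the jar's manifest is via the maven-jar-plugin. The class name below is just PD's driver class from this thread, and the snippet is a sketch, not a tested configuration:

```xml
<!-- Sketch: add a Main-Class manifest entry so "hadoop jar" can find
     the driver without it being named on the command line. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.ABC.MyHadoopProgram</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>
```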
Re: Hadoop : Linux-Window interface
Hi Aditya,

You may want to investigate using Flume...that is designed to collect unstructured data from disparate sources and store it in HDFS (or directly into Hive tables). I do not know if Flume provides interoperability with Windows systems (maybe you can hack it and make it work with Cygwin...).

http://archive.cloudera.com/cdh/3/flume/Cookbook/

-PD.

On Wed, Oct 5, 2011 at 8:14 AM, Bejoy KS wrote:
> Hi Aditya
> Definitely you can do it. As a very basic solution you can ftp the
> contents to the LFS (local/Linux file system) and then do a copyFromLocal into
> HDFS. Create a hive table with appropriate regex support and load the data
> in. Hive has classes that effectively support parsing and loading of Apache
> log files into hive tables.
> For the entire data transfer, you just need to write a shell script for the
> same. Log analysis won't be real time, right? So you can schedule the job
> with some scheduler like cron, or, to be used in conjunction with hadoop
> jobs, you can use some workflow management within the hadoop ecosystem.
>
> On Wed, Oct 5, 2011 at 3:43 PM, Aditya Singh30 wrote:
> >
> > Hi,
> >
> > We want to use Hadoop and Hive to store and analyze some Web Servers' log
> > files. The servers are running on the Windows platform. As mentioned about
> > Hadoop, it is only supported for development on Windows. I wanted to know is
> > there a way that we can run the Hadoop server (namenode server) and cluster
> > nodes on Linux, and have an interface using which we can send files and run
> > analysis queries from the web server's Windows environment.
> > I would really appreciate if you could point me in the right direction.
> >
> > Regards,
> > Aditya Singh
> > Infosys. India
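Bejoy's "ftp to LFS, then copyFromLocal" suggestion can be sketched as a small shell script. Everything below (the host, the paths, the dry-run switch) is a made-up placeholder; with RUN set to echo, the script only prints the commands it would run:

```shell
#!/bin/sh
# Sketch of the transfer flow from the thread: pull logs from the
# Windows web server to the local (Linux) filesystem, then push them
# into HDFS. Host and paths are hypothetical placeholders.
RUN="echo"                    # dry run: print commands; set RUN="" to execute
LOCAL_DIR="/tmp/weblogs"      # staging area on the Linux box
HDFS_DIR="/user/logs/raw"     # target directory in HDFS

# 1. ftp/pull the log files from the Windows server (placeholder URL)
$RUN wget -r "ftp://winserver.example.com/logs/" -P "$LOCAL_DIR"

# 2. copy the staged files into HDFS, as Bejoy describes
$RUN hadoop fs -copyFromLocal "$LOCAL_DIR" "$HDFS_DIR"

# 3. since the analysis is not real time, schedule this script with cron,
#    e.g. a crontab line like: 0 2 * * * /path/to/this_script.sh
```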
example of splitting a binary file
Hi all,

Is there a nice example that shows how to split a large binary file into splits? If there is one, please let me know. It will be a great place for me to start.

More ideally, I want to create a custom InputFormat from SequenceFileAsBinaryInputFormat and a custom record-reader that can properly read well-defined records (with known offsets) in my binary input file. But, for now, to begin, I want to learn the basics => read a binary file, break it into splits of known size, and play with a record-reader and get some output. I do not want to do any map-reduce on them yet. Once I know how to do those, I can gradually build on it.

Please let me know if there are any links to such examples.

Thanks,
PD.
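Until such an example turns up, the fixed-offset record idea in the question can be prototyped outside Hadoop in plain Java. The class below is a made-up illustration (not SequenceFileAsBinaryInputFormat): it just carves a byte buffer into fixed-size records the way a simple record-reader would walk known offsets:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Made-up sketch: split a binary buffer into fixed-size records with
// known offsets, ignoring any trailing partial record. A real Hadoop
// RecordReader would do the same offset arithmetic against a FileSplit.
public class FixedRecordSplitter {
    static List<byte[]> split(byte[] data, int recordSize) {
        List<byte[]> records = new ArrayList<>();
        for (int off = 0; off + recordSize <= data.length; off += recordSize) {
            records.add(Arrays.copyOfRange(data, off, off + recordSize));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7};   // 7 bytes, record size 3
        List<byte[]> recs = split(data, 3);
        System.out.println(recs.size());        // prints 2: byte 7 is a partial tail
    }
}
```

The same loop, rewritten over a stream with readFully(), is essentially what a record-reader's next() method does before any map-reduce gets involved.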
Hadoop with Eclipse Plugin: connection issues
Hi,

After working with Hadoop for a while, I thought I would integrate with Eclipse and give that a shot. I am seeing a seemingly trivial issue...but I could not figure out what is going on. I have tried googling and, despite that, I am unable to fix my issue. Any suggestion on the following would be appreciated.

- Have a Mac with Hadoop 0.20.2-cdh3u0, java version 1.6.0_26, Eclipse Indigo release.
- Hadoop normally runs fine - jps shows all the daemons running. Able to see the namenode and jobtracker on the web interface: http://localhost:50070 and :50030. (That makes me wonder if I have to be using localhost or PDMac as my hostname in Eclipse.)
- Mapreduce on port 9001 and dfs on port 9000, as per the xml configs.
- NOTE: my host name is PDMac, which I had initially changed...using "sudo scutil --set Hostname PDMac". I am not sure if this is an issue.
- I configured the Eclipse Hadoop plugin "appropriately": I see a mapreduce elephant logo, and created DFS locations - actually 2: one with localhost and the other for PDMac.
- For Map/Reduce Master, I entered Host: localhost and port: 9001. For DFS: localhost, 9000. ==> Error: Call to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException.
- Then, I tried with my current running hostname. I entered Host: PDMac and port: 9001. For DFS: PDMac, 9000. ==> Error: Call to PDMac/192.168.1.102:9000 failed on connection exception: java.net.ConnectException: Connection refused.

1. I checked if this was something to do with /etc/hosts. I entered 192.168.1.102 as PDMac. I get the same error in Eclipse.
2. I checked if this was due to ssh. I did "ssh localhost" and immediately got a response: "last login: Sep 11".
3. But "ssh PDMac" does not respond. Is that an issue? Because nodes in dfs need ssh to connect...
4. I checked the namenode logs:

2011-09-11 07:33:07,002 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = PDMac/192.168.1.102
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2-cdh3u0
STARTUP_MSG:   build = -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14; compiled by 'hudson' on Fri Mar 25 19:56:23 PDT 2011
2011-09-11 07:33:07,851 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 9000
2011-09-11 07:33:07,854 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2011-09-11 07:33:07,856 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
*2011-09-11 07:33:07,866 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost/127.0.0.1:9000*
2011-09-11 07:33:07,987 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
.
2011-09-11 07:33:22,822 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 127.0.0.1:50010 to delete blk_8310136671599400924_1002
2011-09-11 07:34:05,136 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 127.0.0.1:60733 got version 3 expected version 4
2011-09-11 07:34:08,614 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 127.0.0.1:60734 got version 3 expected version 4
2011-09-11 07:34:15,228 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 127.0.0.1:60735 got version 3 expected version 4
2011-09-11 07:38:12,349 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 127.0.0.1

===

So, I am not sure what is going on. First, I do not know what my server is: localhost (127.0.0.1) or PDMac (192.168.1.102). Then, the config options. I think 9001 and 9000 are right...as my hadoop/conf/core-site.xml and dfs xml say.

Any suggestions would be very much appreciated.

Thanks,
PD.
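For reference, the "Namenode up at: localhost/127.0.0.1:9000" line in the pasted log reflects whatever host fs.default.name is set to in conf/core-site.xml. Below is a hedged sketch of that setting using the machine's hostname instead of localhost; whether this is the right fix for the plugin errors above is not established in this thread:

```xml
<!-- Sketch of conf/core-site.xml (Hadoop 0.20.x): fs.default.name is
     the host/port clients connect to for HDFS. The hostname value
     here is illustrative, taken from the question. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://PDMac:9000</value>
  </property>
</configuration>
```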