solved [Re: streaming command [Re: no output written to HDFS]]

2012-08-31 Thread Periya.Data
The problem is solved. I had to make sure that the file to be streamed is
given via "-input" and the other file is shipped via "-file". That solved the issue.

Thanks,
PD

On Fri, Aug 31, 2012 at 10:07 AM, Periya.Data  wrote:

> Yes, both input files need to be processed by the mapper..but not in the
> same fashion. Essentially, this is what my Python script does:
> - read two text files - A and B. file A has a list of account-IDs (all
> numeric). File B has about 10 records - some of which has the same
> account_ID as those listed in file A.
> - mapper: read both the files, compares and prints out those records that
> have matching account_IDs.
>
> I have tried placing both the input files under a single input directory.
> Same behavior.
>
> And, from what I have read so far, "-mapper" or "-reducer" should have
> "ONLY" the name of the executable (like...in my case, "test2.py".). But, if
> I do that, nothing happens. I have to explicitly mention:
> -mapper "cat $1 | python $GHU_HOME/test2.py $2"...something like
> that...which looks unconventional...but, it produces "some" output...not
> the correct one though.
>
> Again, if I run my script in just plain linux machine, using the basic
> commands :
> cat $1 | python test2.py $2,
> it produces the expected output.
>
>
> *Observation*: If I do not specify the two files via the "-file" option,
> then I see no output written to HDFS, even though the output directory has
> empty part-files and a _SUCCESS marker. The 3 part-files are reasonable, as
> 3 mappers are configured for each job.
>
>
> My current command:
>
> hadoop jar ...streaming.jar \
>  -input /user/ghu/input/* \
>  -output /user/ghu/out \
>  -file /home/ghu/test2.py \
>  -mapper "cat $1 | python test2.py $2" \
>  -file /home/ghu/$1 \
>  -file /home/ghu/$2
>
>
> Learning,
> /PD
>
> On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala wrote:
>
>> Hi,
>>
>> Do both input files contain data that needs to be processed by the
>> mapper in the same fashion ? In which case, you could just put the
>> input files under a directory in HDFS and provide that as input. The
>> -input option does accept a directory as argument.
>>
>> Otherwise, can you please explain a little more what you're trying to
>> do with the two inputs.
>>
>> Thanks
>> Hemanth
>>
>> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data 
>> wrote:
>> > This is interesting. I changed my command to:
>> >
>> > -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
>> >
>> > is producing output to HDFS. But, the output is not what I expected and
>> is
>> > not the same as when I do "cat | map " on Linux. It is producing
>> > part-0, part-1 and part-2. I expected only one output file
>> with
>> > just 2 records.
>> >
>> > I think I have to understand what exactly "-file" does and what exactly
>> > "-input" does. I am experimenting what happens if I give my input files
>> on
>> > the command line (like: test2.py arg1 arg2) as against specifying the
>> input
>> > files via "-file" and "-input" options...
>> >
>> > The problem is I have 2 input files...and have no idea how to pass them.
> > Should I keep one in HDFS and stream in the other?
>> >
>> > More digging,
>> > PD/
>> >
>> >
>> >
>> > On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data 
>> wrote:
>> >
>> >> Hi Bertrand,
>> >> No, I do not observe the same when I run using cat | map. I can see
>> >> the output in STDOUT when I run my program.
>> >>
>> >> I do not have any reducer. In my command, I provide
>> >> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
>> >> written directly to HDFS.
>> >>
> >> Your suspicion may be right...about the output. In my counters, the "map
>> >> input records" = 40 and "map.output records" = 0. I am trying to see
>> if I
>> >> am messing up in my command...(see below)
>> >>
>> >> Initially, I had my mapper - "test2.py" to take in 2 arguments. Now, I
>> am
>> >> streaming one file in and test2.py takes in only one argument. How
>> should I
>> >> frame my command below? I think that is where I am messing up..
>> >>
>> >>
>> >

streaming command [Re: no output written to HDFS]

2012-08-31 Thread Periya.Data
Yes, both input files need to be processed by the mapper...but not in the
same fashion. Essentially, this is what my Python script does:
- read two text files, A and B. File A has a list of account-IDs (all
numeric). File B has about 10 records, some of which have the same
account_ID as those listed in file A.
- mapper: read both files, compare, and print out those records that
have matching account_IDs.

I have tried placing both the input files under a single input directory.
Same behavior.

And, from what I have read so far, "-mapper" or "-reducer" should take
ONLY the name of the executable (in my case, "test2.py"). But if
I do that, nothing happens. I have to explicitly specify something like:
-mapper "cat $1 | python $GHU_HOME/test2.py $2"
which looks unconventional...but it produces "some" output...not
the correct one, though.

Again, if I run my script on a plain Linux machine, using the basic
command:
cat $1 | python test2.py $2
it produces the expected output.


*Observation*: If I do not specify the two files via the "-file" option,
then I see no output written to HDFS, even though the output directory has
empty part-files and a _SUCCESS marker. The 3 part-files are reasonable, as
3 mappers are configured for each job.


My current command:

hadoop jar ...streaming.jar \
 -input /user/ghu/input/* \
 -output /user/ghu/out \
 -file /home/ghu/test2.py \
 -mapper "cat $1 | python test2.py $2" \
 -file /home/ghu/$1 \
 -file /home/ghu/$2


Learning,
/PD

On Thu, Aug 30, 2012 at 9:46 PM, Hemanth Yamijala wrote:

> Hi,
>
> Do both input files contain data that needs to be processed by the
> mapper in the same fashion ? In which case, you could just put the
> input files under a directory in HDFS and provide that as input. The
> -input option does accept a directory as argument.
>
> Otherwise, can you please explain a little more what you're trying to
> do with the two inputs.
>
> Thanks
> Hemanth
>
> On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data 
> wrote:
> > This is interesting. I changed my command to:
> >
> > -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
> >
> > is producing output to HDFS. But, the output is not what I expected and
> is
> > not the same as when I do "cat | map " on Linux. It is producing
> > part-0, part-1 and part-2. I expected only one output file
> with
> > just 2 records.
> >
> > I think I have to understand what exactly "-file" does and what exactly
> > "-input" does. I am experimenting what happens if I give my input files
> on
> > the command line (like: test2.py arg1 arg2) as against specifying the
> input
> > files via "-file" and "-input" options...
> >
> > The problem is I have 2 input files...and have no idea how to pass them.
> > Should I keep one in HDFS and stream in the other?
> >
> > More digging,
> > PD/
> >
> >
> >
> > On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data 
> wrote:
> >
> >> Hi Bertrand,
> >> No, I do not observe the same when I run using cat | map. I can see
> >> the output in STDOUT when I run my program.
> >>
> >> I do not have any reducer. In my command, I provide
> >> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
> >> written directly to HDFS.
> >>
> >> Your suspicion may be right...about the output. In my counters, the "map
> >> input records" = 40 and "map.output records" = 0. I am trying to see if
> I
> >> am messing up in my command...(see below)
> >>
> >> Initially, I had my mapper - "test2.py" to take in 2 arguments. Now, I
> am
> >> streaming one file in and test2.py takes in only one argument. How
> should I
> >> frame my command below? I think that is where I am messing up..
> >>
> >>
> >> run.sh:(run as:   cat  | ./run.sh  )
> >> ---
> >>
> >> hadoop jar
> >> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> >> -D mapred.reduce.tasks=0 \
> >> -verbose \
> >> -input "$HDFS_INPUT" \
> >> -input "$HDFS_INPUT_2" \
> >> -output "$HDFS_OUTPUT" \
> >> -file   "$GHU_HOME/test2.py" \
> >> -mapper "python $GHU_HOME/test2.py $1" \
> >> -file   "$GHU_HOME/$1"
> >>
> >>
> >>
> >> If I modify my mapper 

Re: no output written to HDFS

2012-08-30 Thread Periya.Data
This is interesting. Changing my command to:

-mapper "cat $1 |  $GHU_HOME/test2.py $2" \

is producing output to HDFS. But the output is not what I expected, and it is
not the same as when I do "cat | map" on Linux. It is producing
part-0, part-1 and part-2. I expected only one output file with
just 2 records.

I think I have to understand what exactly "-file" does and what exactly
"-input" does. I am experimenting with what happens if I give my input files on
the command line (like: test2.py arg1 arg2) as opposed to specifying the input
files via the "-file" and "-input" options...

The problem is I have 2 input files...and have no idea how to pass them.
Should I keep one in HDFS and stream in the other?

More digging,
PD/



On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data  wrote:

> Hi Bertrand,
> No, I do not observe the same when I run using cat | map. I can see
> the output in STDOUT when I run my program.
>
> I do not have any reducer. In my command, I provide
> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
> written directly to HDFS.
>
> Your suspicion may be right...about the output. In my counters, the "map
> input records" = 40 and "map.output records" = 0. I am trying to see if I
> am messing up in my command...(see below)
>
> Initially, I had my mapper - "test2.py" to take in 2 arguments. Now, I am
> streaming one file in and test2.py takes in only one argument. How should I
> frame my command below? I think that is where I am messing up..
>
>
> run.sh:(run as:   cat  | ./run.sh  )
> ---
>
> hadoop jar
> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> -D mapred.reduce.tasks=0 \
> -verbose \
> -input "$HDFS_INPUT" \
> -input "$HDFS_INPUT_2" \
> -output "$HDFS_OUTPUT" \
> -file   "$GHU_HOME/test2.py" \
> -mapper "python $GHU_HOME/test2.py $1" \
> -file   "$GHU_HOME/$1"
>
>
>
> If I modify my mapper to take in 2 arguments, then, I would run it as:
>
> run.sh:(run as:   ./run.sh   )
> ---
>
> hadoop jar
> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
> -D mapred.reduce.tasks=0 \
> -verbose \
> -input "$HDFS_INPUT" \
> -input "$HDFS_INPUT_2" \
> -output "$HDFS_OUTPUT" \
> -file   "$GHU_HOME/test2.py" \
> -mapper "python $GHU_HOME/test2.py $1 $2" \
> -file   "$GHU_HOME/$1" \
> -file   "GHU_HOME/$2"
>
>
> Please let me know if I am making a mistake here.
>
>
> Thanks.
> PD
>
>
>
>
>
>
> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux wrote:
>
>> Do you observe the same thing when running without Hadoop? (cat, map, sort
>> and then reduce)
>>
>> Could you provide the counters of your job? You should be able to get them
>> using the job tracker interface.
>>
>> The most probable answer without more information would be that your
> >> reducer does not output any <key, value>s.
>>
>> Regards
>>
>> Bertrand
>>
>>
>>
>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data 
>> wrote:
>>
>> > Hi All,
>> >My Hadoop streaming job (in Python) runs to "completion" (both map
>> and
>> > reduce says 100% complete). But, when I look at the output directory in
>> > HDFS, the part files are empty. I do not know what might be causing this
>> > behavior. I understand that the percentages represent the records that
>> have
>> > been read in (not processed).
>> >
>> > The following are some of the logs. The detailed logs from Cloudera
>> Manager
>> > says that there were no Map Outputs...which is interesting. Any
>> > suggestions?
>> >
>> >
>> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>> > 12/08/30 03:27:14 INFO streaming.StreamJob:
>> /usr/lib/hadoop-0.20/bin/hadoop
>> > job  -Dmapred.job.tracker=x.yyy.com:8021 -kill
>> job_201208232245_3182
>> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL:
>> > http://xx..com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>> > 12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
>> > 12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
>> > 12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
>> > 12/08/30 03:27:29 INFO streaming.

Re: no output written to HDFS

2012-08-30 Thread Periya.Data
Hi Bertrand,
No, I do not observe the same when I run using cat | map. I can see the
output in STDOUT when I run my program.

I do not have any reducer. In my command, I provide
"-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
written directly to HDFS.

Your suspicion may be right...about the output. In my counters, the "map
input records" = 40 and "map.output records" = 0. I am trying to see if I
am messing up in my command...(see below)

Initially, I had my mapper - "test2.py" to take in 2 arguments. Now, I am
streaming one file in and test2.py takes in only one argument. How should I
frame my command below? I think that is where I am messing up..


run.sh:   (run as:   cat <file1> | ./run.sh <file2>)
---

hadoop jar
/usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.reduce.tasks=0 \
-verbose \
-input "$HDFS_INPUT" \
-input "$HDFS_INPUT_2" \
-output "$HDFS_OUTPUT" \
-file   "$GHU_HOME/test2.py" \
-mapper "python $GHU_HOME/test2.py $1" \
-file   "$GHU_HOME/$1"



If I modify my mapper to take in 2 arguments, then, I would run it as:

run.sh:   (run as:   ./run.sh <file1> <file2>)
---

hadoop jar
/usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.reduce.tasks=0 \
-verbose \
-input "$HDFS_INPUT" \
-input "$HDFS_INPUT_2" \
-output "$HDFS_OUTPUT" \
-file   "$GHU_HOME/test2.py" \
-mapper "python $GHU_HOME/test2.py $1 $2" \
-file   "$GHU_HOME/$1" \
-file   "GHU_HOME/$2"


Please let me know if I am making a mistake here.


Thanks.
PD





On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux wrote:

> Do you observe the same thing when running without Hadoop? (cat, map, sort
> and then reduce)
>
> Could you provide the counters of your job? You should be able to get them
> using the job tracker interface.
>
> The most probable answer without more information would be that your
> reducer does not output any <key, value>s.
>
> Regards
>
> Bertrand
>
>
>
> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data 
> wrote:
>
> > Hi All,
> >My Hadoop streaming job (in Python) runs to "completion" (both map and
> > reduce says 100% complete). But, when I look at the output directory in
> > HDFS, the part files are empty. I do not know what might be causing this
> > behavior. I understand that the percentages represent the records that
> have
> > been read in (not processed).
> >
> > The following are some of the logs. The detailed logs from Cloudera
> Manager
> > says that there were no Map Outputs...which is interesting. Any
> > suggestions?
> >
> >
> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
> > 12/08/30 03:27:14 INFO streaming.StreamJob:
> /usr/lib/hadoop-0.20/bin/hadoop
> > job  -Dmapred.job.tracker=x.yyy.com:8021 -kill job_201208232245_3182
> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL:
> > http://xx..com:60030/jobdetails.jsp?jobid=job_201208232245_3182
> > 12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
> > 12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
> > 12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
> > 12/08/30 03:27:29 INFO streaming.StreamJob:  map 100%  reduce 0%
> > 12/08/30 03:27:33 INFO streaming.StreamJob:  map 100%  reduce 100%
> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete:
> > job_201208232245_3182
> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
> > Thu Aug 30 03:27:24 GMT 2012
> > *** END
> > bash-3.2$
> > bash-3.2$ hadoop fs -ls /user/ghu/
> > Found 5 items
> > -rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27 /user/GHU/_SUCCESS
> > drwxrwxrwx   - ghu hadoop  0 2012-08-30 03:27 /user/GHU/_logs
> > -rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27
> /user/GHU/part-0
> > -rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27
> /user/GHU/part-1
> > -rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27
> /user/GHU/part-2
> > bash-3.2$
> >
> >
> 
> >
> >
> > Metadata Status Succeeded  Type MapReduce  Id job_201208232245_3182
> > Name CaidMatch
> >  User srisrini  Mapper class PipeMapper  Reducer class
> >  Scheduler pool name default  Job input directory
> > hdfs://x.yyy.txt,hdfs://..com/user/GHUcaidlist

no output written to HDFS

2012-08-29 Thread Periya.Data
Hi All,
   My Hadoop streaming job (in Python) runs to "completion" (both map and
reduce say 100% complete). But when I look at the output directory in
HDFS, the part files are empty. I do not know what might be causing this
behavior. I understand that the percentages represent the records that have
been read in (not processed).

The following are some of the logs. The detailed logs from Cloudera Manager
say that there were no map outputs...which is interesting. Any suggestions?


12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop
job  -Dmapred.job.tracker=x.yyy.com:8021 -kill job_201208232245_3182
12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL:
http://xx..com:60030/jobdetails.jsp?jobid=job_201208232245_3182
12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
12/08/30 03:27:29 INFO streaming.StreamJob:  map 100%  reduce 0%
12/08/30 03:27:33 INFO streaming.StreamJob:  map 100%  reduce 100%
12/08/30 03:27:35 INFO streaming.StreamJob: Job complete:
job_201208232245_3182
12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
Thu Aug 30 03:27:24 GMT 2012
*** END
bash-3.2$
bash-3.2$ hadoop fs -ls /user/ghu/
Found 5 items
-rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27 /user/GHU/_SUCCESS
drwxrwxrwx   - ghu hadoop  0 2012-08-30 03:27 /user/GHU/_logs
-rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27 /user/GHU/part-0
-rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27 /user/GHU/part-1
-rw-r--r--   3 ghu hadoop  0 2012-08-30 03:27 /user/GHU/part-2
bash-3.2$



Metadata:
  Status: Succeeded
  Type: MapReduce
  Id: job_201208232245_3182
  Name: CaidMatch
  User: srisrini
  Mapper class: PipeMapper
  Scheduler pool name: default
  Job input directory: hdfs://x.yyy.txt,hdfs://..com/user/GHUcaidlist.txt
  Job output directory: hdfs://..com/user/GHU/

Timing:
  Duration: 20.977s
  Submit time: Wed, 29 Aug 2012 08:27 PM
  Start time: Wed, 29 Aug 2012 08:27 PM
  Finish time: Wed, 29 Aug 2012 08:27 PM

Progress and Scheduling:
  Map Progress: 100.0%
  Reduce Progress: 100.0%
  Launched maps: 4
  Data-local maps: 3
  Rack-local maps: 1
  Desired maps: 3
  Desired reducers: 0

Current Resource Usage:
  Current User CPUs: 0
  Current System CPUs: 0
  Resident memory: 0 B
  Running maps: 0
  Running reducers: 0

Aggregate Resource Usage and Counters:
  User CPU: 0s
  System CPU: 0s
  Map Slot Time: 12.135s
  Reduce slot time: 0s
  Cumulative disk writes: 155.0 KiB
  Cumulative HDFS reads: 3.6 KiB
  Map input bytes: 2.5 KiB
  Map input records: 45
  Map output records: 0
  (Reducer and spill counters are empty.)


Hadoop streaming - Subprocess failed

2012-08-29 Thread Periya.Data
Hi,
I am running a map-reduce job in Python and I get this error message. I
do not understand what it means. Output is not written to HDFS. I am using
CDH3u3. Any suggestion is appreciated.

MapAttempt TASK_TYPE="MAP" TASKID="task_201208232245_2812_m_00"
TASK_ATTEMPT_ID="attempt_201208232245_2812_m_00_0"
TASK_STATUS="FAILED"  *ERROR="java\.lang\.RuntimeException:
PipeMapRed\.waitOutputThreads(): subprocess failed with code 1*
at
org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:362)
at
org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:572)
at
org\.apache\.hadoop\.streaming\.PipeMapper\.close(PipeMapper\.java:136)
at org\.apache\.hadoop\.mapred\.MapRunner\.run(MapRunner\.java:57)
at
org\.apache\.hadoop\.streaming\.PipeMapRunner\.run(PipeMapRunner\.java:34)
at
org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:391)
at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:325)
at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:270)
at java\.security\.AccessController\.doPrivileged(Native Method)
at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
at
org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1157)
at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:264)
" .


Re: streaming data ingest into HDFS

2011-12-15 Thread Periya.Data
Sorry...I misworded my statement. What I meant was that the sources are meant
to be untouched and the admins do not want to mess with them or add more tools
there. All I've got are source addresses and port numbers. Once I know what
technique(s) I will be using, I will accordingly be given access via
firewalls and other access credentials.


-PD

On Thu, Dec 15, 2011 at 5:05 PM, Russell Jurney wrote:

> Just curious - what is the situation you're in where no collectors are
> possible?  Sounds interesting.
>
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com
>
> On Dec 15, 2011, at 5:01 PM, "Periya.Data"  wrote:
>
> > Hi all,
> > I would like to know what options I have to ingest terabytes of data
> > that are being generated very fast from a small set of sources. I have
> > thought about :
> >
> >   1. Flume
> >   2. Have an intermediate staging server(s) where you can offload data
> and
> >   from there use dfs -put to load into HDFS.
> >   3. Anything else??
> >
> > Suppose I am unable to use Flume (since the sources do not support their
> > installation) and suppose that I do not have the luxury of having an
> > intermediate staging place, what options do I have? In this case, I might
> > have to directly (preferably in parallel) ingest data into HDFS.
> >
> > I have read about a technique to use Map-Reduce where the map would read
> > data and use JAVA API to store in HDFS. We could have multiple threads of
> > maps to get parallel ingestion. It would be nice to know about ways to
> > ingest data "directly" into HDFS considering my assumptions.
> >
> > Suggestions are appreciated,
> >
> > /PD.
>


streaming data ingest into HDFS

2011-12-15 Thread Periya.Data
Hi all,
 I would like to know what options I have to ingest terabytes of data
that are being generated very fast from a small set of sources. I have
thought about :

   1. Flume
   2. Have an intermediate staging server(s) where you can offload data and
   from there use dfs -put to load into HDFS.
   3. Anything else??

Suppose I am unable to use Flume (since the sources do not support their
installation) and suppose that I do not have the luxury of having an
intermediate staging place, what options do I have? In this case, I might
have to directly (preferably in parallel) ingest data into HDFS.

I have read about a technique that uses MapReduce, where the map would read
data and use the Java API to store it in HDFS. We could have multiple maps
running to get parallel ingestion. It would be nice to know about ways to
ingest data "directly" into HDFS, considering my assumptions.

Suggestions are appreciated,

/PD.


Re: choices for deploying a small hadoop cluster on EC2

2011-11-29 Thread Periya.Data
Thanks for all your help and replies. Though I am leaning towards option 1
or 2, I looked up Bigtop, an Apache Incubator project. I could not
find enough info on it on its website. I have a few more questions...and
hope they apply to these mailing lists...

1. Cos: Can you please point me to a link that talks about BigTop & EC2?

2. Regarding Whirr, can I just choose an Ubuntu EBS-backed AMI? Would that
be any different from choosing a normal Hadoop AMI and (later) trying to mount
an EBS volume on this instance?

3. John: I like your idea of using S3 to store input and output. But, say I
start a hadoop cluster, configure Sqoop and Hive and run it. Then, after I
get my output in S3, I either stop it or terminate it (since I do not have
EBS, I don't care). Now, after a while, I want to bring up a similar
cluster and run Hive and Sqoop and do more experiments. In this case, will
I have to reconfigure all my Sqoop settings, Hive table schemas etc?
Because, I think once I "stop" an instance, I will lose the configs and
when I restart a Hadoop AMI, I will only have hadoop nicely running in that
instance and nothing else.

I ideally want everything to persist...even configs and newly installed
tools (Hive, Sqoop). Or, should I create a custom Ubuntu AMI with Hadoop,
Sqoop, Hive etc. "pre-cooked" in it? Probably, this is the ideal way to
proceed...even if it is a little painful. I think I really want an EBS-backed
instance...as it maintains its internal state when stopped and restarted.

Please let me know your opinion. This discussion is deviating from what I
originally started with...

A little Googling has similar posts:
https://forums.aws.amazon.com/message.jspa?messageID=131157


I know I can figure this out by trying these options, but I want to lessen the
burden of trial and error.

Thanks very much,
PD.


On Tue, Nov 29, 2011 at 12:40 PM, Konstantin Boudnik  wrote:

> I'd suggest you use BigTop (cross-posting to bigtop-dev@ list) produced
> bits, which also possess Puppet recipes allowing for fully automated deployment
> and
> configuration. BigTop also uses Jenkins EC2 plugin for deployment part and
> it
> seems to work real great!
>
> Cos
>
> On Tue, Nov 29, 2011 at 12:28PM, Periya.Data wrote:
> > Hi All,
> > I am just beginning to learn how to deploy a small cluster (a 3
> > node cluster) on EC2. After some quick Googling, I see the following
> > approaches:
> >
> >1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it
> >have features for persisting (EBS)?
> >2. CDH Cloud Scripts - has EC2 AMI - again for temp Hadoop
> clusters/POC
> >etc. Good stuff - I can persist using EBS snapshots. But, this uses
> CDH2.
> >3. Install hadoop manually and related stuff like Hive...on each
> cluster
> >node...on EC2 (or use some automation tool like Chef). I do not
> prefer it.
> >4. Hadoop distribution comes with EC2 (under src/contrib) and there
> are
> >several Hadoop EC2 AMIs available. I have not studied enough to know
> if
> >that is easy for a beginner like me.
> >5. Anything else??
> >
> > 1 and 2 look promising as a beginner. If any of you have any thoughts
> about
> > this, I would like to know (like what to keep in mind, what to take care
> > of, caveats etc). I want my data /config to persist (using EBS) and
> > continue from where I left off...(after a few days).  Also, I want to
> have
> > HIVE and SQOOP installed. Can this done using 1 or 2? Or, will
> installation
> > of them have to be done manually after I set up the cluster?
> >
> > Thanks very much,
> >
> > PD.
>


choices for deploying a small hadoop cluster on EC2

2011-11-29 Thread Periya.Data
Hi All,
I am just beginning to learn how to deploy a small cluster (a 3
node cluster) on EC2. After some quick Googling, I see the following
approaches:

   1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it
   have features for persisting (EBS)?
   2. CDH Cloud Scripts - has EC2 AMI - again for temp Hadoop clusters/POC
   etc. Good stuff - I can persist using EBS snapshots. But, this uses CDH2.
   3. Install hadoop manually and related stuff like Hive...on each cluster
   node...on EC2 (or use some automation tool like Chef). I do not prefer it.
   4. The Hadoop distribution comes with EC2 scripts (under src/contrib) and there are
   several Hadoop EC2 AMIs available. I have not studied enough to know if
   that is easy for a beginner like me.
   5. Anything else??

1 and 2 look promising as a beginner. If any of you have any thoughts about
this, I would like to know (like what to keep in mind, what to take care
of, caveats etc.). I want my data/config to persist (using EBS) and
continue from where I left off...(after a few days). Also, I want to have
HIVE and SQOOP installed. Can this be done using 1 or 2? Or, will they
have to be installed manually after I set up the cluster?

Thanks very much,

PD.


Re: mapreduce linear chaining: ClassCastException

2011-10-15 Thread Periya.Data
Fantastic! Thanks much, Bejoy. Now, I am able to get the output of my MR-2
nicely. I had to convert the sum (in Text format) to IntWritable, and I am
able to get all the word frequencies in ascending order. I used
"KeyValueTextInputFormat.class"; my program was complaining when I used
"KeyValueInputFormat".

Now, let me investigate how to do that in descending order...and then
top-20...etc. I know I must look into RawComparator and more...

Thanks,
PD.

On Sat, Oct 15, 2011 at 1:08 AM,  wrote:

> Hi
>I believe what is happening in your case is that.
> The first map reduce jobs runs to completion
> When you trigger the second map reduce job, it is triggered with the
> default input format, TextInputFormat, and definitely expects the key and value
> as LongWritable and Text types. By default the MapReduce job's output format
> is TextOutputFormat, with key and value tab-separated. When you need to consume
> the output of an MR job as key/value pairs in another MR job, use
> KeyValueInputFormat, i.e. while setting config parameters for the second job set
> jobConf.setInputFormat(KeyValueInputFormat.class).
> Now if your output key value pairs use a different separator other than
> default tab then for second job you need to specify that as well using
> key.value.separator.in.input.line
>
> In short for your case in second map reduce job doing the following would
> get things in place
> -use jobConf.setInputFormat(KeyValueInputFormat.class)
> -alter your mapper to accept key values of type Text,Text
> -swap the key and values within mapper for output to reducer with
> conversions.
>
> To be noted here, AFAIK KeyValueInputFormat is not a part of the new mapreduce
> API.
>
> Hope it helps.
>
> Regards
> Bejoy K S
>
> -Original Message-
> From: "Periya.Data" 
> Date: Fri, 14 Oct 2011 17:31:27
> To: ; 
> Reply-To: common-user@hadoop.apache.org
> Subject: mapreduce linear chaining: ClassCastException
>
> Hi all,
>   I am trying a simple extension of WordCount example in Hadoop. I want to
> get a frequency of wordcounts in descending order. To that I employ a
> linear
> chain of MR jobs. The first MR job (MR-1) does the regular wordcount (the
> usual example). For the next MR job => I set the mapper to swap the <word, count> to <count, word>. Then, I have the Identity reducer to simply store
> the results.
>
> My MR-1 does its job correctly and store the result in a temp path.
>
> Question 1: The mapper of the second MR job (MR-2) doesn't like the input
> format. I have properly set the input format for MapClass2 of what it
> expects and what its output must be. It seems to be expecting a LongWritable.
> I
> suspect that it is trying to look at some index file. I am not sure.
>
>
> It throws an error like this:
>
> 
>java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
> be cast to org.apache.hadoop.io.Text
> 
>
> Some Info:
> - I use old API (org.apache.hadoop.mapred.*). I am asked to stick with it
> for now.
> - I use hadoop-0.20.2
>
> For MR-1:
> - conf1.setOutputKeyClass(Text.class);
> - conf1.setOutputValueClass(IntWritable.class);
>
> For MR-2
> - takes in a Text (word) and IntWritable (sum)
> - conf2.setOutputKeyClass(IntWritable.class);
> - conf2.setOutputValueClass(Text.class);
>
> 
> public class MapClass2 extends MapReduceBase
>  implements Mapper<Text, IntWritable, IntWritable, Text> {
>
>  @Override
>  public void map(Text word, IntWritable sum,
>  OutputCollector<IntWritable, Text> output,
>  Reporter reporter) throws IOException {
>
>  output.collect(sum, word);   // emit <count, word>
>  }
>  }
> 
>
> Any suggestions would be helpful. Is my MapClass2 code right in the first
> place...for swapping? Or should I assume that mapper reads line by line,
> so,  must read in one line, then, use StrTokenizer to split them up and
> convert the second token (sum) from str to Int?? Or should I mess
> around
> with OutputKeyComparator class?
>
> Thanks,
> PD
>
>


mapreduce linear chaining: ClassCastException

2011-10-14 Thread Periya.Data
Hi all,
   I am trying a simple extension of the WordCount example in Hadoop. I want to
get a frequency of word counts in descending order. For that I employ a linear
chain of MR jobs. The first MR job (MR-1) does the regular wordcount (the
usual example). For the next MR job => I set the mapper to swap the <word, count> to <count, word>. Then, I have the Identity reducer to simply store
the results.

My MR-1 does its job correctly and stores the result in a temp path.

Question 1: The mapper of the second MR job (MR-2) doesn't like the input
format. I have properly set, for MapClass2, what it expects and what its
output must be. It seems to be expecting a LongWritable. I
suspect that it is trying to look at some index file. I am not sure.


It throws an error like this:


java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
be cast to org.apache.hadoop.io.Text


Some Info:
- I use old API (org.apache.hadoop.mapred.*). I am asked to stick with it
for now.
- I use hadoop-0.20.2

For MR-1:
- conf1.setOutputKeyClass(Text.class);
- conf1.setOutputValueClass(IntWritable.class);

For MR-2
- takes in a Text (word) and IntWritable (sum)
- conf2.setOutputKeyClass(IntWritable.class);
- conf2.setOutputValueClass(Text.class);


public class MapClass2 extends MapReduceBase
    implements Mapper<Text, IntWritable, IntWritable, Text> {

  @Override
  public void map(Text word, IntWritable sum,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {

    output.collect(sum, word);   // emit <count, word>
  }
}


Any suggestions would be helpful. Is my MapClass2 code right in the first
place...for swapping? Or should I assume that the mapper reads line by line,
so I must read in one line, then use a StringTokenizer to split it up and
convert the second token (sum) from str to int? Or should I mess around
with the OutputKeyComparator class?
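
On the OutputKeyComparator question: one minimal sketch (old mapred API; the class name is illustrative) is a comparator that inverts IntWritable's natural order, so the <count, word> pairs come out of the sort with the largest counts first.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;

// Hedged sketch: reverses the ascending IntWritable ordering in both the
// object-based and the raw byte-based comparison paths used during the sort.
public class DescendingIntComparator extends IntWritable.Comparator {

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }
}

It would be registered on the second job with conf2.setOutputKeyComparatorClass(DescendingIntComparator.class).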

Thanks,
PD


Re: Simple Hadoop program build with Maven

2011-10-08 Thread Periya.Data
Fantastic ! Worked like a charm. Thanks much Bochun.

For those who are facing similar issues, here is the command and output:

$ hadoop jar ../MyHadoopProgram.jar com.ABC.MyHadoopProgram -libjars
~/CDH3/extJars/json-rpc-1.0.jar /usr/PD/input/sample22.json /usr/PD/output
11/10/08 17:51:45 INFO mapred.FileInputFormat: Total input paths to process
: 1
11/10/08 17:51:46 INFO mapred.JobClient: Running job: job_201110072230_0005
11/10/08 17:51:47 INFO mapred.JobClient:  map 0% reduce 0%
11/10/08 17:51:58 INFO mapred.JobClient:  map 50% reduce 0%
11/10/08 17:51:59 INFO mapred.JobClient:  map 100% reduce 0%
11/10/08 17:52:08 INFO mapred.JobClient:  map 100% reduce 100%
11/10/08 17:52:10 INFO mapred.JobClient: Job complete: job_201110072230_0005
11/10/08 17:52:10 INFO mapred.JobClient: Counters: 23
11/10/08 17:52:10 INFO mapred.JobClient:   Job Counters
11/10/08 17:52:10 INFO mapred.JobClient: Launched reduce tasks=1
11/10/08 17:52:10 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=17981
11/10/08 17:52:10 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
11/10/08 17:52:10 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
11/10/08 17:52:10 INFO mapred.JobClient: Launched map tasks=2
11/10/08 17:52:10 INFO mapred.JobClient: Data-local map tasks=2
11/10/08 17:52:10 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9421
11/10/08 17:52:10 INFO mapred.JobClient:   FileSystemCounters
11/10/08 17:52:10 INFO mapred.JobClient: FILE_BYTES_READ=606
11/10/08 17:52:10 INFO mapred.JobClient: HDFS_BYTES_READ=56375
11/10/08 17:52:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=157057
11/10/08 17:52:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=504
11/10/08 17:52:10 INFO mapred.JobClient:   Map-Reduce Framework
11/10/08 17:52:10 INFO mapred.JobClient: Reduce input groups=24
11/10/08 17:52:10 INFO mapred.JobClient: Combine output records=24
11/10/08 17:52:10 INFO mapred.JobClient: Map input records=24
11/10/08 17:52:10 INFO mapred.JobClient: Reduce shuffle bytes=306
11/10/08 17:52:10 INFO mapred.JobClient: Reduce output records=24
11/10/08 17:52:10 INFO mapred.JobClient: Spilled Records=48
11/10/08 17:52:10 INFO mapred.JobClient: Map output bytes=552
11/10/08 17:52:10 INFO mapred.JobClient: Map input bytes=54923
11/10/08 17:52:10 INFO mapred.JobClient: Combine input records=24
11/10/08 17:52:10 INFO mapred.JobClient: Map output records=24
11/10/08 17:52:10 INFO mapred.JobClient: SPLIT_RAW_BYTES=240
11/10/08 17:52:10 INFO mapred.JobClient: Reduce input records=24
$



Appreciate your help.
PD.
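
As Bochun notes below, the main class has to come before -libjars; the other piece is that -libjars (like -D and -files) is handled by GenericOptionsParser, which ToolRunner invokes before run() sees the remaining arguments. A minimal driver sketch under that assumption (the class name, job name, and path handling are illustrative, not the actual MyHadoopProgram):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hedged sketch: ToolRunner strips the generic options (-libjars, -D, -files)
// so run() only receives the program arguments (input and output paths here).
public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyDriver.class);
    conf.setJobName("my-json-job");                          // illustrative name
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // conf.setMapperClass(...); conf.setReducerClass(...); etc.
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

Invoked as in the working command above: jar first, then the main class, then the generic options, then the program arguments.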

On Fri, Oct 7, 2011 at 11:31 PM, Bochun Bai  wrote:

> To make a big bundled jar file using maven I suggest this plugin:
>http://anydoby.com/fatjar/usage.html
> But I prefer not doing so, because the classpath order is different
> in different environments.
>
> I guess your old myHadoopProgram.jar should contain Main-Class meta info.
> So the following ***xxx*** part is omitted. It originally looks like:
>
> hadoop jar jar/myHadoopProgram.jar ***com.ABC.xxx*** -libjars
> ../lib/json-rpc-1.0.jar
> /usr/PD/input/sample22.json /usr/PD/output/
>
> I suggest you add the Main-Class meta following this:
>
> http://maven.apache.org/plugins/maven-assembly-plugin/usage.html#Advanced_Configuration
> or
>    pay attention to the order of <mainclass> and <-libjars ...> using:
>    hadoop jar <jar> <mainclass> <-libjars ...> <args>
>
> On Sat, Oct 8, 2011 at 12:05 PM, Periya.Data 
> wrote:
> > Hi all,
> >I am migrating from ant builds to maven. So, brand new to Maven and do
> > not yet understand many parts of it.
> >
> > Problem: I have a perfectly working map-reduce program (working by ant
> > build). This program needs an external jar file (json-rpc-1.0.jar). So,
> when
> > I run the program, I do the following to get a nice output:
> >
> > $ hadoop jar jar/myHadoopProgram.jar -libjars ../lib/json-rpc-1.0.jar
> > /usr/PD/input/sample22.json /usr/PD/output/
> >
> > (note that I include the external jar file by the "-libjars" option as
> > mentioned in the "Hadoop: The Definitive Guide 2nd Edition" - page 253).
> > Everything is fine with my ant build.
> >
> > So, now, I move on to Maven. I had some trouble getting my pom.xml right.
> I
> > am still unsure if it is right, but, it builds "successfully" (the
> resulting
> > jar file has the class files of my program).  The essential part of my
> > pom.xml has the two following dependencies (a complete pom.xml is at the
> end
> > of this email).
> >
> > <dependency>
> >   <groupId>com.metaparadigm</groupId>
> >   <artifactId>json-rpc</artifactId>
> >   <version>1.0</version>
> > </dependency>
> >
> >  
> > 
> > 

Simple Hadoop program build with Maven

2011-10-07 Thread Periya.Data
Hi all,
I am migrating from ant builds to maven. So, brand new to Maven and do
not yet understand many parts of it.

Problem: I have a perfectly working map-reduce program (working by ant
build). This program needs an external jar file (json-rpc-1.0.jar). So, when
I run the program, I do the following to get a nice output:

$ hadoop jar jar/myHadoopProgram.jar -libjars ../lib/json-rpc-1.0.jar
/usr/PD/input/sample22.json /usr/PD/output/

(note that I include the external jar file by the "-libjars" option as
mentioned in the "Hadoop: The Definitive Guide 2nd Edition" - page 253).
Everything is fine with my ant build.

So, now, I move on to Maven. I had some trouble getting my pom.xml right. I
am still unsure if it is right, but, it builds "successfully" (the resulting
jar file has the class files of my program).  The essential part of my
pom.xml has the two following dependencies (a complete pom.xml is at the end
of this email).


 
  <dependency>
    <groupId>com.metaparadigm</groupId>
    <artifactId>json-rpc</artifactId>
    <version>1.0</version>
  </dependency>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
    <scope>provided</scope>
  </dependency>

I try to run it like this:

$ hadoop jar ../myHadoopProgram.jar -libjars ../json-rpc-1.0.jar
com.ABC.MyHadoopProgram /usr/PD/input/sample22.json /usr/PD/output
Exception in thread "main" java.lang.ClassNotFoundException: -libjars
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:179)
$

Then, I thought, maybe it is not necessary to include the main class name. So, I
ran with the following command:

$ hadoop jar ../myHadoopProgram.jar -libjars ../json-rpc-1.0.jar
/usr/PD/input/sample22.json /usr/PD/output
Exception in thread "main" java.lang.ClassNotFoundException: -libjars
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:179)
$

Question: What am I doing wrong? I know, since I am new to Maven, I may be
missing some key pieces/concepts. What really happens when one builds the
classes, where my java program imports org.json.JSONArray and
org.json.JSONObject? This import is just for compilation I suppose and it
does not get "embedded" into the final jar. Am I right?

I want to either bundle-up the external jar(s) into a single jar and
conveniently run hadoop using that, or, know how to include the external
jars in my command-line.


This is what I have:
- maven 3.0.3
- Mac OSX
- Java 1.6.0_26
- Hadoop - CDH 0.20.2-cdh3u0

I have Googled, looked at Tom White's github repo (
https://github.com/cloudera/repository-example/blob/master/pom.xml). The
more I Google, the more confused I get.

Any help is highly appreciated.

Thanks,
PD.





<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.ABC</groupId>
  <artifactId>MyHadoopProgram</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <name>MyHadoopProgram</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>com.metaparadigm</groupId>
      <artifactId>json-rpc</artifactId>
      <version>1.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>


Re: Hadoop : Linux-Window interface

2011-10-05 Thread Periya.Data
Hi Aditya,
You may want to investigate using Flume...it is designed to
collect unstructured data from disparate sources and store it in HDFS (or
directly into Hive tables). I do not know if Flume provides interoperability
with Windows systems (maybe you can hack it and make it work with Cygwin...).

http://archive.cloudera.com/cdh/3/flume/Cookbook/


-PD.

On Wed, Oct 5, 2011 at 8:14 AM, Bejoy KS  wrote:

> Hi Aditya
> Definitely you can do it. As a very basic solution you can ftp the
> contents to LFS (the local/Linux file system) and then do a copyFromLocal into
> HDFS. Create a Hive table with appropriate regex support and load the data
> in. Hive has classes that effectively support parsing and loading of Apache
> log files into Hive tables.
> For the entire data transfer, you just need to write a shell script for the
> same. Log analysis won't be real time, right? So you can schedule the job
> with some scheduler like cron, or, to be used in conjunction with hadoop
> jobs, you can use some workflow management within the hadoop ecosystem.
>
>
> On Wed, Oct 5, 2011 at 3:43 PM, Aditya Singh30
> wrote:
>
> > Hi,
> >
> > We want to use Hadoop and Hive to store and analyze some Web Servers' Log
> > files. The servers are running on windows platform. As mentioned about
> > Hadoop, it is only supported for development on windows. I wanted to know
> is
> > there a way that we can run the Hadoop server(namenode server) and
> cluster
> > nodes on  Linux, and have an interface using which we can send files and
> run
> > analysis queries from the WebServer's windows environment.
> > I would really appreciate if you could point me to a right direction.
> >
> >
> > Regards,
> > Aditya Singh
> > Infosys. India
> >
> >
> >
>


example of splitting a binary file

2011-09-15 Thread Periya.Data
Hi all,
Is there a nice example that shows how to split a large binary file into
splits? If there is one, please let me know. It will be a great place for
me to start.

More ideally, I want to create a custom InputFormat from
SequenceFileAsBinaryInputFormat and a custom record-reader that can properly
read well-defined records (with known offsets) in my binary input file.

But, for now, to begin, I want to learn the basics => read a binary file,
break it into splits of known size and play with a record-reader and get
some output. I do not want to do any map-reduce yet on them. Once I know how
to do those, I can gradually build on it.
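
One rough starting point, sketched with the old mapred API (the class name and the record-length property are made up, and it assumes fixed-size records packed back to back, so record offsets are multiples of the record length):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hedged sketch: each split yields only the fixed-length records whose starting
// offset falls inside it; keys are byte offsets, values are the raw record bytes.
// "binary.record.length" is a made-up job property name.
public class FixedLengthInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FixedLengthRecordReader((FileSplit) split, job);
  }

  static class FixedLengthRecordReader implements RecordReader<LongWritable, BytesWritable> {
    private final FSDataInputStream in;
    private final long start, end;
    private final int recordLength;
    private long pos;

    FixedLengthRecordReader(FileSplit split, JobConf job) throws IOException {
      recordLength = job.getInt("binary.record.length", 1024);
      start = split.getStart();
      end = start + split.getLength();
      Path file = split.getPath();
      in = file.getFileSystem(job).open(file);
      // Round the split's start up to the next record boundary so that no
      // record is read by two adjacent splits.
      pos = ((start + recordLength - 1) / recordLength) * recordLength;
    }

    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (pos >= end) {
        return false;            // the next record belongs to the following split
      }
      byte[] buf = new byte[recordLength];
      in.readFully(pos, buf);    // positioned read; assumes the file length is a
                                 // whole multiple of recordLength
      key.set(pos);
      value.set(buf, 0, recordLength);
      pos += recordLength;
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return pos; }
    public void close() throws IOException { in.close(); }
    public float getProgress() {
      return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
  }
}

Wiring this into a throwaway job with conf.setInputFormat(FixedLengthInputFormat.class) and zero reducers is a quick way to confirm the record boundaries line up before doing any real map-reduce work.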

Please let me know if there are any links to such examples.

Thanks,
PD.


Hadoop with Eclipse Plugin: connection issues

2011-09-11 Thread Periya.Data
Hi,
After working on Hadoop for a while, I thought I would integrate with
Eclipse and give that a shot. I am seeing a seemingly trivial issue...but I
could not figure out what is going on. I have tried googling and, despite
that, I am unable to fix my issue. Any suggestion on the following would be
appreciated.


   - Have a MAC with Hadoop 0.20.2-cdh3u0, java version 1.6.0_26, Eclipse
   Indigo release.
   - Hadoop normally runs fine - jps shows all the daemons running. Able to
   see namenode and jobtracker on the web interface :
http://localhost:50070 and 50030.  (that makes me wonder if I have to
be using localhost or PDMac
   as my hostname in Eclipse).
   - Mapreduce on port 9001 and dfs on port 9000 as the xml configs.
   - NOTE: my host name is PDMac  which I had initially changed..using "sudo
   scutil --set Hostname PDMac". I am not sure if this is an issue.
   - I configured Eclipse Hadoop plugin "appropriately": I see a mapreduce -
   elephant logo, create DFS locations - actually 2: one with localhost   and
   the other for PDMac.
   - For Map/Reduce Master, I entered Host: localhost and port: 9001. For
   DFS: localhost, 9000. Call to localhost/127.0.0.1:9000 failed on local
   exception: java.io.EOFException.
   - Then, I tried with my current running hostname. I entered Host: PDMac
   and port: 9001. For DFS: PDMac, 9000. ==> Error: Call to PDMac/
   192.168.1.102:9000 failed on connection exception:
   java.net.ConnectException: Connection refused.


   1. I checked if this was something to do with /etc/hosts. I entered
   192.168.1.102 as PDMac. I get the same error in Eclipse.
   2. I checked if this was due to ssh.  I did "ssh localhost" and
   immediately got a response "last login: Sep 11"
   3. But, "ssh PDMac" does not respond. Is that an issue? Because nodes in
   dfs need ssh to connect...
   4. I checked the namenode logs:


/
2011-09-11 07:33:07,002 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = PDMac/192.168.1.102
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2-cdh3u0
STARTUP_MSG:   build =  -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14;
compiled by 'hudson' on Fri Mar 25 19:56:23 PDT 2011
/

2011-09-11 07:33:07,851 INFO org.apache.hadoop.ipc.Server: Starting Socket
Reader #1 for port 9000
2011-09-11 07:33:07,854 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2011-09-11 07:33:07,856 INFO
org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics
with hostName=NameNode, port=9000
*2011-09-11 07:33:07,866 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost/
127.0.0.1:9000*
2011-09-11 07:33:07,987 INFO org.mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog

.
2011-09-11 07:33:22,822 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
127.0.0.1:50010 to delete  blk_8310136671599400924_1002
2011-09-11 07:34:05,136 WARN org.apache.hadoop.ipc.Server: Incorrect header
or version mismatch from 127.0.0.1:60733 got version 3 expected version 4
2011-09-11 07:34:08,614 WARN org.apache.hadoop.ipc.Server: Incorrect header
or version mismatch from 127.0.0.1:60734 got version 3 expected version 4
2011-09-11 07:34:15,228 WARN org.apache.hadoop.ipc.Server: Incorrect header
or version mismatch from 127.0.0.1:60735 got version 3 expected version 4
2011-09-11 07:38:12,349 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from
127.0.0.1

===

So, I am not sure what is going on. First, I do not know what my server is:
localhost (127.0.0.1) or PDMac (192.168.1.102). Then, the config options...I
think 9001 and 9000 are right...as my hadoop/conf/core-site.xml and the dfs
xml say.

Any suggestions would be very much appreciated.

Thanks,
PD.