Hi,

Do both input files contain data that needs to be processed by the
mapper in the same fashion? If so, you could just put the input files
under a directory in HDFS and provide that directory as the input. The
-input option does accept a directory as an argument.
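
For example, a minimal invocation along these lines (the jar path and the
HDFS paths below are just placeholders, adjust them for your setup):

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/ghu/input_dir \
        -output /user/ghu/output \
        -file test2.py \
        -mapper "python test2.py"

Every file under /user/ghu/input_dir then becomes mapper input, and its
records are fed to test2.py on stdin.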

Otherwise, can you please explain a little more about what you're trying to
do with the two inputs?

Thanks
Hemanth

On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <periya.d...@gmail.com> wrote:
> This is interesting. I changed my command to:
>
> -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
>
> which is producing output to HDFS. But the output is not what I expected, and it
> is not the same as when I do "cat | map" on Linux. It is producing
> part-00000, part-00001 and part-00002, whereas I expected only one output
> file with just 2 records.
>
> I think I have to understand what exactly "-file" does and what exactly
> "-input" does. I am experimenting with what happens if I give my input files
> on the command line (like: test2.py arg1 arg2) as against specifying the
> input files via the "-file" and "-input" options...
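>
> (My current understanding, which could well be off: "-file" ships a local
> file into each task's working directory so the script can open it by name,
> e.g.
>
>     -file "$GHU_HOME/caidlist.txt"   # then readable as ./caidlist.txt in the mapper
>
> while "-input" names HDFS paths whose records get piped to the mapper's
> stdin.)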
>
> The problem is I have 2 input files...and have no idea how to pass them.
> Should I keep one in HDFS and stream in the other?
>
> More digging,
> PD/
>
>
>
> On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <periya.d...@gmail.com> wrote:
>
>> Hi Bertrand,
>>     No, I do not observe the same when I run using cat | map. I can see
>> the output in STDOUT when I run my program.
>>
>> I do not have any reducer. In my command, I provide
>> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
>> written directly to HDFS.
>>
>> Your suspicion may be right about the output. In my counters, "map
>> input records" = 40 and "map output records" = 0. I am trying to see if I
>> am messing something up in my command... (see below)
>>
>> Initially, my mapper "test2.py" took in 2 arguments. Now, I am
>> streaming one file in, and test2.py takes only one argument. How should I
>> frame my command below? I think that is where I am messing up...
>>
>>
>> run.sh:        (run as:   cat <arg2> | ./run.sh <arg1> )
>> -----------
>>
>> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>>         -D mapred.reduce.tasks=0 \
>>         -verbose \
>>         -input "$HDFS_INPUT" \
>>         -input "$HDFS_INPUT_2" \
>>         -output "$HDFS_OUTPUT" \
>>         -file   "$GHU_HOME/test2.py" \
>>         -mapper "python $GHU_HOME/test2.py $1" \
>>         -file   "$GHU_HOME/$1"
>>
>>
>>
>> If I modify my mapper to take in 2 arguments, then, I would run it as:
>>
>> run.sh:        (run as:   ./run.sh <arg1>  <arg2>)
>> -----------
>>
>> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>>         -D mapred.reduce.tasks=0 \
>>         -verbose \
>>         -input "$HDFS_INPUT" \
>>         -input "$HDFS_INPUT_2" \
>>         -output "$HDFS_OUTPUT" \
>>         -file   "$GHU_HOME/test2.py" \
>>         -mapper "python $GHU_HOME/test2.py $1 $2" \
>>         -file   "$GHU_HOME/$1" \
>>         -file   "$GHU_HOME/$2"
>>
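>> (For what it's worth, my mental model is that the streaming mapper reads its
>> records line by line from stdin and writes tab-separated key/value pairs to
>> stdout, while the file shipped with "-file" is opened by name inside the
>> script. A stripped-down sketch of what I am aiming for in test2.py, where the
>> comma split and the "matched" value are just made-up placeholders:
>>
>> #!/usr/bin/env python
>> import sys
>>
>> lookup_file = sys.argv[1]            # the file shipped with -file
>> lookup = set(line.strip() for line in open(lookup_file))
>>
>> for line in sys.stdin:               # records from the -input paths arrive here
>>     key = line.strip().split(',')[0]
>>     if key in lookup:
>>         # with mapred.reduce.tasks=0 these lines land directly in the part-* files
>>         print('%s\t%s' % (key, 'matched'))
>>
>> If nothing is written to stdout, "Map output records" stays at 0, which is
>> what I am seeing.)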
>>
>> Please let me know if I am making a mistake here.
>>
>>
>> Thanks.
>> PD
>>
>>
>>
>>
>>
>>
>> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>
>>> Do you observe the same thing when running without Hadoop? (cat, map, sort
>>> and then reduce)
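>>>
>>> For instance, with placeholder file and script names:
>>>
>>>     cat input.txt | python test2.py | sort | python reducer.py
>>>
>>> or just "cat input.txt | python test2.py" if there is no reduce step.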
>>>
>>> Could you provide the counters of your job? You should be able to get them
>>> using the job tracker interface.
>>>
>>> The most probable answer without more information would be that your
>>> reducer does not output any <key,value>s.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>>
>>>
>>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <periya.d...@gmail.com>
>>> wrote:
>>>
>>> > Hi All,
>>> >    My Hadoop streaming job (in Python) runs to "completion" (both map and
>>> > reduce say 100% complete). But when I look at the output directory in
>>> > HDFS, the part files are empty. I do not know what might be causing this
>>> > behavior. I understand that the percentages represent the records that
>>> > have been read in (not processed).
>>> >
>>> > The following are some of the logs. The detailed logs from Cloudera Manager
>>> > say that there were no Map Outputs... which is interesting. Any
>>> > suggestions?
>>> >
>>> >
>>> > 12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
>>> > 12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
>>> > 12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL: http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
>>> > 12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
>>> > 12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
>>> > 12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
>>> > 12/08/30 03:27:29 INFO streaming.StreamJob:  map 100%  reduce 0%
>>> > 12/08/30 03:27:33 INFO streaming.StreamJob:  map 100%  reduce 100%
>>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Job complete: job_201208232245_3182
>>> > 12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
>>> > Thu Aug 30 03:27:24 GMT 2012
>>> > *** END
>>> > bash-3.2$
>>> > bash-3.2$ hadoop fs -ls /user/ghu/
>>> > Found 5 items
>>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/_SUCCESS
>>> > drwxrwxrwx   - ghu hadoop          0 2012-08-30 03:27 /user/GHU/_logs
>>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00000
>>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00001
>>> > -rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00002
>>> > bash-3.2$
>>> >
>>> >
>>> --------------------------------------------------------------------------------------------------------------------
>>> >
>>> >
>>> > Metadata:
>>> >   Status: Succeeded
>>> >   Type: MapReduce
>>> >   Id: job_201208232245_3182
>>> >   Name: CaidMatch
>>> >   User: srisrini
>>> >   Mapper class: PipeMapper
>>> >   Reducer class:
>>> >   Scheduler pool name: default
>>> >   Job input directory: hdfs://xxxxx.yyy.txt,hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
>>> >   Job output directory: hdfs://xxxx.yyyy.com/user/GHU/
>>> >
>>> > Timing:
>>> >   Duration: 20.977s
>>> >   Submit time: Wed, 29 Aug 2012 08:27 PM
>>> >   Start time: Wed, 29 Aug 2012 08:27 PM
>>> >   Finish time: Wed, 29 Aug 2012 08:27 PM
>>> >
>>> > Progress and Scheduling:
>>> >   Map Progress: 100.0%
>>> >   Reduce Progress: 100.0%
>>> >   Launched maps: 4
>>> >   Data-local maps: 3
>>> >   Rack-local maps: 1
>>> >   Other local maps:
>>> >   Desired maps: 3
>>> >   Launched reducers:
>>> >   Desired reducers: 0
>>> >   Fairscheduler running tasks:
>>> >   Fairscheduler minimum share:
>>> >   Fairscheduler demand:
>>> >
>>> > Current Resource Usage:
>>> >   Current User CPUs: 0
>>> >   Current System CPUs: 0
>>> >   Resident memory: 0 B
>>> >   Running maps: 0
>>> >   Running reducers: 0
>>> >
>>> > Aggregate Resource Usage and Counters:
>>> >   User CPU: 0s
>>> >   System CPU: 0s
>>> >   Map Slot Time: 12.135s
>>> >   Reduce slot time: 0s
>>> >   Cumulative disk reads:
>>> >   Cumulative disk writes: 155.0 KiB
>>> >   Cumulative HDFS reads: 3.6 KiB
>>> >   Cumulative HDFS writes:
>>> >   Map input bytes: 2.5 KiB
>>> >   Map input records: 45
>>> >   Map output records: 0
>>> >   Reducer input groups:
>>> >   Reducer input records:
>>> >   Reducer output records:
>>> >   Reducer shuffle bytes:
>>> >   Spilled records:
>>> >
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>>
>>
>>
