Re: Merge Reducers Output

2012-07-31 Thread Mike S
Thank you all for the responses.

I cannot really use hadoop fs -getMerge, as the data could be generated on
any file system, such as S3, and that way of merging will not work for those
files.

I should say that the files contain binary data, and I assume I cannot use
the default mappers, as they would chunk up my files when I really just want
to concatenate the reducers' output files. I mean that the records within my
reducer output files are custom records (say, several images), and I just
want the reducer outputs concatenated together as one blob of binary data.
That is why I wrote my own InputFileFormat to read each reducer's output as
a whole (as a byte[]) and then pass these blobs up to the single reducer to
concatenate them. Does this make sense?
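
For reference, below is a minimal sketch of the kind of whole-file InputFormat being
described (class and method names are hypothetical, not the actual code from this
thread), assuming the new org.apache.hadoop.mapreduce API: it marks every input file
as non-splittable and delivers each reducer output file to a mapper as one byte[]
record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical sketch: reads each input file unsplit, as one byte[] record.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so one mapper sees one whole part file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the whole file into memory in one go.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

With something like this, each map task sees exactly one whole part file as a single
BytesWritable value.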

I assume the only way to do the above is to have the custom InputFileFormat
and then the map and reduce as in my earlier email. I can certainly add the
combiner, but do I still definitely need the single reducer? I also assume I
cannot use a map-only job, as the map results are written into a sequence
file, which is not what I am after, and reading the files with multiple
mappers is probably better anyway?

Even if the solution I put together seems to work, my main issue is still
that I can have only one reducer. This bottlenecks my concatenation job, and
I am wondering if my approach could be done better/faster.

So again, the final result does not have to be sorted. Each file is a blob
of binary data with no keys in it, and by merge I really mean concatenating
the reducers' binary output files into one file on whatever file system my
MR job uses. I hope this makes the problem statement clearer.
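
One possible way around the single-reducer bottleneck, sketched below under the
assumption that a client-side copy step is acceptable, is to skip the second MR job
entirely and concatenate the part files through the FileSystem API, which works
against any Hadoop FileSystem implementation (HDFS, S3, local). The class name and
argument layout are hypothetical.

import java.io.InputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical sketch: concatenate all part files of a job into one output file.
public class ConcatPartFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path inputDir = new Path(args[0]);   // e.g. the reducers' output folder
        Path outputFile = new Path(args[1]); // the single merged blob

        FileSystem srcFs = inputDir.getFileSystem(conf);
        FileSystem dstFs = outputFile.getFileSystem(conf);

        FileStatus[] parts = srcFs.listStatus(inputDir);
        Arrays.sort(parts); // FileStatus sorts by path, so part-r-00000 < part-r-00001 < ...

        FSDataOutputStream out = dstFs.create(outputFile, true);
        try {
            for (FileStatus part : parts) {
                if (!part.getPath().getName().startsWith("part-")) {
                    continue; // skip _SUCCESS, _logs, etc.
                }
                InputStream in = srcFs.open(part.getPath());
                try {
                    // 'false' keeps 'out' open for the next part file.
                    IOUtils.copyBytes(in, out, conf, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}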




On Tue, Jul 31, 2012 at 1:44 PM, Michael Segel
 wrote:
> Sorry, but the OP was saying he had a map/reduce job with multiple reducers and 
> wanted to then combine the output into a single file.
> While you could merge the output files, you could also use a combiner and then an 
> identity reducer, all within the same M/R job.
>
>
> On Jul 31, 2012, at 10:10 AM, Raj Vishwanathan  wrote:
>
>> Is there a requirement for the final reduce file to be sorted? If not, 
>> wouldn't a map only job ( +  a combiner, ) and a merge only job provide the 
>> answer?
>>
>> Raj
>>
>>
>>
>>> 
>>> From: Michael Segel 
>>> To: common-user@hadoop.apache.org
>>> Sent: Tuesday, July 31, 2012 5:24 AM
>>> Subject: Re: Merge Reducers Output
>>>
>>> You really don't want to run a single reducer unless you know that you 
>>> don't have a lot of mappers.
>>>
>>> As long as the output data types and structure are the same as the input, 
>>> you can run your code as the combiner, and then run it again as the 
>>> reducer. Problem solved with one or two lines of code.
>>> If your input and output don't match, then you can use the existing code as 
>>> a combiner, and then write a new reducer. It could as easily be an identity 
>>> reducer too. (Don't know the exact problem.)
>>>
>>> So here's a silly question. Why wouldn't you want to run a combiner?
>>>
>>>
>>> On Jul 31, 2012, at 12:08 AM, Jay Vyas  wrote:
>>>
 Its not clear to me that you need custom input formats

 1) Getmerge might work or

 2) Simply run a SINGLE reducer job (have mappers output static final int
 key=1, or specify numReducers=1).

 In this case, only one reducer will be called, and it will read through all
 the values.

 On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS  wrote:

> Hi
>
> Why not use 'hadoop fs -getMerge 
> ' while copying files out of hdfs for the end users 
> to
> consume. This will merge all the files in 'outputFolderInHdfs'  into one
> file and put it in lfs.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -Original Message-
> From: Michael Segel 
> Date: Mon, 30 Jul 2012 21:08:22
> To: 
> Reply-To: common-user@hadoop.apache.org
> Subject: Re: Merge Reducers Output
>
> Why not use a combiner?
>
> On Jul 30, 2012, at 7:59 PM, Mike S wrote:
>
>> As asked several times, I need to merge my reducers' output files.
>> Imagine I have many reducers which will generate 200 files. Now, to
>> merge them together, I have written another map reduce job where each
>> mapper reads a complete file in full into memory and outputs it, and
>> then only one reducer has to merge them together. To do so, I had to
>> write a custom fileinputreader that reads the complete file into
>> memory and then another custom fileoutputfileformat to append each
>> reducer item's bytes together. This is how my mapper and reducers
>> look:
>>
>> public static class MapClass extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable>
>> {
>>     @Override
>>     public void map(NullWritable key, BytesWritable value, Context context)
>>             throws IOException, InterruptedException
>>     {
>>

Re: task jvm bootstrapping via distributed cache

2012-07-31 Thread Michael Segel
Hi Stan,

If I understood your question... you want to ship a jar to the nodes where the 
task will run prior to the start of the task? 

I'm not sure what it is you're trying to do...
Your example isn't really clear.

See: 
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html

When you pull stuff out of the cache you get the path to the jar. 
Or you should be able to get it. 

I'm assuming you're doing this in your setup() method? 

Can you give a better example? There may be a different way to handle this...
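
For illustration, a rough sketch of what looking the path up in setup() could look
like with the 0.20-era API linked above (the class name and key/value types are
hypothetical); note this only helps if the path is needed inside the task, not on
the child JVM's command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: log where the distributed cache localized each file.
public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // Local filesystem paths of the files that were added to the cache.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        if (localFiles != null) {
            for (Path p : localFiles) {
                System.err.println("cached file localized at: " + p);
            }
        }
    }
}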

On Jul 31, 2012, at 3:50 PM, Stan Rosenberg  wrote:

> Forwarding to common-user to hopefully get more exposure...
> 
> 
> -- Forwarded message --
> From: Stan Rosenberg 
> Date: Tue, Jul 31, 2012 at 11:55 AM
> Subject: Re: task jvm bootstrapping via distributed cache
> To: mapreduce-u...@hadoop.apache.org
> 
> 
> I am guessing this is either a well-known problem or an edge case.  In
> any case, would it be a bad idea to designate predetermined output
> paths?
> E.g., DistributedCache.addCacheFileInto(uri, conf, outputPath) would
> attempt to copy the cached file into the specified path resolving to a
> task's local filesystem.
> 
> Thanks,
> 
> stan
> 
> On Mon, Jul 30, 2012 at 6:23 PM, Stan Rosenberg
>  wrote:
>> Hi,
>> 
>> I am seeking a way to leverage hadoop's distributed cache in order to
>> ship jars that are required to bootstrap a task's jvm, i.e., before a
>> map/reduce task is launched.
>> As a concrete example, let's say that I need to launch with
>> '-javaagent:/path/profiler.jar'.  In theory, the task tracker is
>> responsible for downloading cached files onto its local filesystem.
>> However, the absolute path to a given cached file is not known a
>> priori, yet we need that path in order to configure '-javaagent'.
>> 
>> Is this currently possible with the distributed cache? If not, is the
>> use case appealing enough to open a jira ticket?
>> 
>> Thanks,
>> 
>> stan
> 



Fwd: task jvm bootstrapping via distributed cache

2012-07-31 Thread Stan Rosenberg
Forwarding to common-user to hopefully get more exposure...


-- Forwarded message --
From: Stan Rosenberg 
Date: Tue, Jul 31, 2012 at 11:55 AM
Subject: Re: task jvm bootstrapping via distributed cache
To: mapreduce-u...@hadoop.apache.org


I am guessing this is either a well-known problem or an edge case.  In
any case, would it be a bad idea to designate predetermined output
paths?
E.g., DistributedCache.addCacheFileInto(uri, conf, outputPath) would
attempt to copy the cached file into the specified path resolving to a
task's local filesystem.

Thanks,

stan

On Mon, Jul 30, 2012 at 6:23 PM, Stan Rosenberg
 wrote:
> Hi,
>
> I am seeking a way to leverage hadoop's distributed cache in order to
> ship jars that are required to bootstrap a task's jvm, i.e., before a
> map/reduce task is launched.
> As a concrete example, let's say that I need to launch with
> '-javaagent:/path/profiler.jar'.  In theory, the task tracker is
> responsible for downloading cached files onto its local filesystem.
> However, the absolute path to a given cached file is not known a
> priori, yet we need that path in order to configure '-javaagent'.
>
> Is this currently possible with the distributed cache? If not, is the
> use case appealing enough to open a jira ticket?
>
> Thanks,
>
> stan
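
One approach that may already cover this use case, sketched below as an assumption
rather than a tested recipe: add the cache file with a URI fragment and enable
symlinks, so the jar shows up under a predictable name in each task's working
directory, which can then be referenced with a relative path in
mapred.child.java.opts (the HDFS path and class name below are hypothetical).

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Hypothetical sketch: give the cached jar a predictable name via a symlink.
public class AgentJobSetup {
    public static void configure(Configuration conf) throws Exception {
        // The '#profiler.jar' fragment names the symlink that is created in
        // each task's working directory (the HDFS path is made up).
        DistributedCache.addCacheFile(
                new URI("hdfs:///libs/profiler.jar#profiler.jar"), conf);
        DistributedCache.createSymlink(conf);

        // With the symlink in the task's working directory, a relative path
        // can be passed to the child JVM.
        conf.set("mapred.child.java.opts", "-javaagent:./profiler.jar");
    }
}

Whether the symlink is guaranteed to exist before the child JVM parses -javaagent
would need to be verified against the Hadoop version in use.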


Re: Merge Reducers Output

2012-07-31 Thread Michael Segel
Sorry, but the OP was saying he had a map/reduce job with multiple reducers and 
wanted to then combine the output into a single file.
While you could merge the output files, you could also use a combiner and then an 
identity reducer, all within the same M/R job.
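
A rough sketch of the wiring being suggested (all class names are hypothetical; the
mapper and reducer are simple pass-throughs like the ones in the OP's code, and
SequenceFile input/output is assumed only to keep the example self-contained):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CombinerMergeJob {

    // Pass-through mapper, same shape as the OP's MapClass.
    public static class PassMapper
            extends Mapper<NullWritable, BytesWritable, NullWritable, BytesWritable> {
        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // Identity reducer; the same class is reused as the combiner below.
    public static class PassReducer
            extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<BytesWritable> values,
                Context context) throws IOException, InterruptedException {
            for (BytesWritable value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "concatenate-blobs");
        job.setJarByClass(CombinerMergeJob.class);

        // SequenceFiles of <NullWritable, BytesWritable> are assumed here only to
        // keep the sketch compilable; the OP's custom whole-file formats would be
        // plugged in instead.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapperClass(PassMapper.class);
        job.setCombinerClass(PassReducer.class); // same code run on the map side
        job.setReducerClass(PassReducer.class);  // and again as the reducer
        job.setNumReduceTasks(1);                // single final output file

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Whether the combiner actually helps here depends on whether it can meaningfully
shrink the map output; for pure pass-through concatenation the sketch mostly
illustrates the wiring.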


On Jul 31, 2012, at 10:10 AM, Raj Vishwanathan  wrote:

> Is there a requirement for the final reduce file to be sorted? If not, 
> wouldn't a map only job ( +  a combiner, ) and a merge only job provide the 
> answer?
> 
> Raj
> 
> 
> 
>> 
>> From: Michael Segel 
>> To: common-user@hadoop.apache.org 
>> Sent: Tuesday, July 31, 2012 5:24 AM
>> Subject: Re: Merge Reducers Output
>> 
>> You really don't want to run a single reducer unless you know that you don't 
>> have a lot of mappers. 
>> 
>> As long as the output data types and structure are the same as the input, 
>> you can run your code as the combiner, and then run it again as the reducer. 
>> Problem solved with one or two lines of code. 
>> If your input and output don't match, then you can use the existing code as 
>> a combiner, and then write a new reducer. It could as easily be an identity 
>> reducer too. (Don't know the exact problem.) 
>> 
>> So here's a silly question. Why wouldn't you want to run a combiner? 
>> 
>> 
>> On Jul 31, 2012, at 12:08 AM, Jay Vyas  wrote:
>> 
>>> Its not clear to me that you need custom input formats
>>> 
>>> 1) Getmerge might work or
>>> 
>>> 2) Simply run a SINGLE reducer job (have mappers output static final int
>>> key=1, or specify numReducers=1).
>>> 
>>> In this case, only one reducer will be called, and it will read through all
>>> the values.
>>> 
>>> On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS  wrote:
>>> 
 Hi
 
 Why not use 'hadoop fs -getMerge 
 ' while copying files out of hdfs for the end users to
 consume. This will merge all the files in 'outputFolderInHdfs'  into one
 file and put it in lfs.
 
 Regards
 Bejoy KS
 
 Sent from handheld, please excuse typos.
 
 -Original Message-
 From: Michael Segel 
 Date: Mon, 30 Jul 2012 21:08:22
 To: 
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: Merge Reducers Output
 
 Why not use a combiner?
 
 On Jul 30, 2012, at 7:59 PM, Mike S wrote:
 
> As asked several times, I need to merge my reducers' output files.
> Imagine I have many reducers which will generate 200 files. Now, to
> merge them together, I have written another map reduce job where each
> mapper reads a complete file in full into memory and outputs it, and
> then only one reducer has to merge them together. To do so, I had to
> write a custom fileinputreader that reads the complete file into
> memory and then another custom fileoutputfileformat to append each
> reducer item's bytes together. This is how my mapper and reducers
> look:
> 
> public static class MapClass extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable>
> {
>     @Override
>     public void map(NullWritable key, BytesWritable value, Context context)
>             throws IOException, InterruptedException
>     {
>         context.write(key, value);
>     }
> }
> 
> public static class Reduce extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable>
> {
>     @Override
>     public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context)
>             throws IOException, InterruptedException
>     {
>         for (BytesWritable value : values)
>         {
>             context.write(NullWritable.get(), value);
>         }
>     }
> }
> 
> I still have to have one reducer and that is a bottleneck. Please
> note that I must do this merging, as the users of my MR job are outside
> my hadoop environment and need the result as one file.
> 
> Is there a better way to merge the reducers' output files?
> 
 
 
>>> 
>>> 
>>> -- 
>>> Jay Vyas
>>> MMSB/UCHC
>> 
>> 
>> 



RE: Multinode cluster only recognizes 1 node

2012-07-31 Thread Barry, Sean F
After a good number of hours of tinkering with my setup, I am still only
seeing results for 1 node, not 2.

This is the tut I followed
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

After executing ./start-all.sh

starting namenode, logging to 
/usr/local/hadoop/logs/hadoop-hduser-namenode-master.out
slave: starting datanode, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-slave.out
master: starting datanode, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-master.out
master: starting secondarynamenode, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-master.out
starting jobtracker, logging to 
/usr/local/hadoop/logs/hadoop-hduser-jobtracker-master.out
slave: starting tasktracker, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-slave.out
master: starting tasktracker, logging to 
/usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-master.out





-Original Message-
From: syed kather [mailto:in.ab...@gmail.com] 
Sent: Thursday, July 26, 2012 6:06 PM
To: common-user@hadoop.apache.org
Subject: Re: Multinode cluster only recognizes 1 node

Can you paste the output you get when you execute start-all.sh in the terminal?
And when you ssh to the slave, is it working fine?
On Jul 27, 2012 4:50 AM, "Barry, Sean F"  wrote:

> Hi,
>
> I just set up a 2 node POC cluster and I am currently having an issue 
> with it. I ran a wordcount MR test on my cluster to see if it was 
> working and noticed that the Web ui at localhost:50030 showed that I 
> only have 1 live node. I followed the tutorial step by step and I 
> cannot seem to figure out my problem. When I ran start-all.sh all of 
> the daemons on my master node and my slave node start up perfectly 
> fine. If you have any suggestions please let me know.
>
> -Sean
>


Re: YARN Pi example job stuck at 0%(No MR tasks are started by ResourceManager)

2012-07-31 Thread anil gupta
Hi Harsh and Others,

I was able to run the job when I log in as user "hdfs". However, it fails if
I run it as "root". I suspected this was the problem before as well, and it
turned out to be true.

Thanks,
Anil gupta

On Mon, Jul 30, 2012 at 9:21 PM, abhiTowson cal
wrote:

> Hi anil,
>
> I was trying several things. I didn't have a hadoop-env.sh, so I created it.
>
> Regards
> Abhishek
>
> On Mon, Jul 30, 2012 at 11:51 PM, anil gupta 
> wrote:
> > That's pretty interesting!! Where did you figure out that you need to add
> > that property? Just trying to understand how adding that property fixed
> > the issue.
> >
> > On Mon, Jul 30, 2012 at 8:12 PM, abhiTowson cal
> > wrote:
> >
> >> Hi anil,
> >>
> >> Adding the property resolved the issue for me, and I also made this change:
> >>
> >> vim hadoop-env.sh
> >>
> >> export JAVA_HOME=/usr/lib/java-1.6.0/jdk1.6.0_33
> >> if [ "$JAVA_HOME" != "" ]; then
> >>   #echo "run java in $JAVA_HOME"
> >>   JAVA_HOME=$JAVA_HOME
> >> fi
> >>
> >> if [ "$JAVA_HOME" = "" ]; then
> >>   echo "Error: JAVA_HOME is not set."
> >>   exit 1
> >> fi
> >>
> >> JAVA=$JAVA_HOME/bin/java
> >> JAVA_HEAP_MAX=-Xmx1000m
> >>
> >> Regards
> >> Abhishek
> >>
> >>
> >> On Mon, Jul 30, 2012 at 10:47 PM, anil gupta 
> >> wrote:
> >> > Hi Abhishek,
> >> >
> >> > Did you mean that adding yarn.resourcemanager.resource-tracker.address
> >> > along with yarn.log-aggregation-enable in my configuration will
> resolve
> >> the
> >> > problem in which map-reduce job fails at 0% with the following error:
> In
> >> > the web page of
> >> >
> >>
> http://data-node:8042/node/containerlogs/container_1343687008058_0003_01_01/root the
> >> > page says:
> >> > Failed redirect for container_1343687008058_0003_01_01  Failed
> while
> >> > trying to construct the redirect url to the log server. Log Server url
> >> may
> >> > not be configured. Unknown container. Container either has not
> started or
> >> > has already completed or doesn't belong to this node at all.
> >> > Please let me know.
> >> >
> >> > Thanks,
> >> > Anil Gupta
> >> >
> >> > On Mon, Jul 30, 2012 at 7:30 PM, abhiTowson cal
> >> > wrote:
> >> >
> >> >> hi anil,
> >> >>
> >> >> Adding this property helped me resolve the issue:
> >> >> yarn.resourcemanager.resource-tracker.address
> >> >>
> >> >> Regards
> >> >> Abhishek
> >> >>
> >> >> On Mon, Jul 30, 2012 at 7:56 PM, anil gupta 
> >> wrote:
> >> >> > Hi Rahul,
> >> >> >
> >> >> > Thanks for your response. I can certainly set
> >> >> > yarn.log-aggregation-enable to true. But after enabling this, what
> >> >> > manual steps will I have to take to run jobs? Could you please elaborate.
> >> >> >
> >> >> > Thanks,
> >> >> > Anil
> >> >> >
> >> >> > On Mon, Jul 30, 2012 at 4:26 PM, Rahul Jain 
> wrote:
> >> >> >
> >> >> >> The inability to look at map-reduce logs for failed jobs is due to a
> >> >> >> number of open issues in yarn; see my recent comment here:
> >> >> >>
> >> >> >>
> >> >>
> >>
> https://issues.apache.org/jira/browse/MAPREDUCE-4428?focusedCommentId=13412995&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13412995
> >> >> >>
> >> >> >> You can workaround this by enabling log aggregation and manually
> >> copying
> >> >> >> job logs from HDFS log location. Of course that is a painful way
> till
> >> >> the
> >> >> >> yarn log collection and history bugs are resolved in an upcoming
> >> >> release.
> >> >> >>
> >> >> >> -Rahul
> >> >> >>
> >> >> >>
> >> >> >> > 12/07/27 09:38:27 INFO mapred.ResourceMgrDelegate: Submitted
> >> >> application
> >> >> >> > application_1343365114818_0002 to ResourceManager at ihub-an-l1/
> >> >> >> > 172.31.192.151:8040
> >> >> >> > 12/07/27 09:38:27 INFO mapreduce.Job: The url to track the job:
> >> >> >> > http://ihub-an-l1:/proxy/application_1343365114818_0002/
> >> >> >> > 12/07/27 09:38:27 INFO mapreduce.Job: Running job:
> >> >> job_1343365114818_0002
> >> >> >> >
> >> >> >> > No Map-Reduce task are started by the cluster. I dont see any
> >> errors
> >> >> >> > anywhere in the application. Please help me in resolving this
> >> problem.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Anil Gupta
> >> >> >> >
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Thanks & Regards,
> >> >> > Anil Gupta
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks & Regards,
> >> > Anil Gupta
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>



-- 
Thanks & Regards,
Anil Gupta


NodeStatusUpdaterImpl is stopped whenever a Yarn Job is Run.

2012-07-31 Thread anil gupta
Hi All,

I have a fully distributed hadoop-2.0.0 alpha cluster (cdh4). Each node in
the cluster has 3.2 GB of RAM and runs CentOS 6.0. Since the memory on each
node is low, I modified yarn-site.xml and mapred-site.xml to run with less
memory. Here is the mapred-site.xml: http://pastebin.com/Fxjie6kg and
yarn-site.xml: http://pastebin.com/TCJuDAhe. If I run a job, I get the
following message in the log file of a nodemanager: http://pastebin.com/d8tsBA2a

On the console I get the following error:
[root@ihub-nn-a1 ~]# hadoop --config /etc/hadoop/conf/ jar
/usr/lib/hadoop-mapreduce/hadoop-*-examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/07/31 00:15:12 INFO input.FileInputFormat: Total input paths to process
: 10
12/07/31 00:15:12 INFO mapreduce.JobSubmitter: number of splits:10
12/07/31 00:15:12 WARN conf.Configuration: mapred.jar is deprecated.
Instead, use mapreduce.job.jar
12/07/31 00:15:12 WARN conf.Configuration:
mapred.map.tasks.speculative.execution is deprecated. Instead, use
mapreduce.map.speculative
12/07/31 00:15:12 WARN conf.Configuration: mapred.reduce.tasks is
deprecated. Instead, use mapreduce.job.reduces
12/07/31 00:15:12 WARN conf.Configuration: mapred.output.value.class is
deprecated. Instead, use mapreduce.job.output.value.class
12/07/31 00:15:12 WARN conf.Configuration:
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
mapreduce.reduce.speculative
12/07/31 00:15:12 WARN conf.Configuration: mapreduce.map.class is
deprecated. Instead, use mapreduce.job.map.class
12/07/31 00:15:12 WARN conf.Configuration: mapred.job.name is deprecated.
Instead, use mapreduce.job.name
12/07/31 00:15:12 WARN conf.Configuration: mapreduce.reduce.class is
deprecated. Instead, use mapreduce.job.reduce.class
12/07/31 00:15:12 WARN conf.Configuration: mapreduce.inputformat.class is
deprecated. Instead, use mapreduce.job.inputformat.class
12/07/31 00:15:12 WARN conf.Configuration: mapred.input.dir is deprecated.
Instead, use mapreduce.input.fileinputformat.inputdir
12/07/31 00:15:12 WARN conf.Configuration: mapred.output.dir is deprecated.
Instead, use mapreduce.output.fileoutputformat.outputdir
12/07/31 00:15:12 WARN conf.Configuration: mapreduce.outputformat.class is
deprecated. Instead, use mapreduce.job.outputformat.class
12/07/31 00:15:12 WARN conf.Configuration: mapred.map.tasks is deprecated.
Instead, use mapreduce.job.maps
12/07/31 00:15:12 WARN conf.Configuration: mapred.output.key.class is
deprecated. Instead, use mapreduce.job.output.key.class
12/07/31 00:15:12 WARN conf.Configuration: mapred.working.dir is
deprecated. Instead, use mapreduce.job.working.dir
12/07/31 00:15:12 INFO mapred.ResourceMgrDelegate: Submitted application
application_1343717845091_0003 to ResourceManager at ihub-an-l1/
172.31.192.151:8040
12/07/31 00:15:12 INFO mapreduce.Job: The url to track the job:
http://ihub-an-l1:/proxy/application_1343717845091_0003/
12/07/31 00:15:12 INFO mapreduce.Job: Running job: job_1343717845091_0003
12/07/31 00:15:18 INFO mapreduce.Job: Job job_1343717845091_0003 running in
uber mode : false
12/07/31 00:15:18 INFO mapreduce.Job:  map 0% reduce 0%
12/07/31 00:15:18 INFO mapreduce.Job: Job job_1343717845091_0003 failed
with state FAILED due to: Application application_1343717845091_0003 failed
1 times due to AM Container for appattempt_1343717845091_0003_01 exited
with  exitCode: 1 due to:
.Failing this attempt.. Failing the application.
12/07/31 00:15:18 INFO mapreduce.Job: Counters: 0
Job Finished in 6.898 seconds
java.io.FileNotFoundException: File does not exist:
hdfs://ihubcluster/user/root/QuasiMonteCarlo_TMP_3_141592654/out/reduce-out
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:736)
at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1685)
at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1709)
at
org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
at
org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:351)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at
org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:360)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144

Re: Merge Reducers Output

2012-07-31 Thread Raj Vishwanathan
Is there a requirement for the final reduce file to be sorted? If not, wouldn't 
a map only job ( +  a combiner, ) and a merge only job provide the answer?

Raj
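
For comparison, a minimal sketch of the map-only half of that idea (hypothetical
names; SequenceFile input/output is assumed only to keep it compilable): with zero
reduce tasks there is no sort or shuffle at all, and the merge then happens in a
separate pass.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MapOnlyPassThrough {

    public static class PassMapper
            extends Mapper<NullWritable, BytesWritable, NullWritable, BytesWritable> {
        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value); // no per-record transformation
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-pass");
        job.setJarByClass(MapOnlyPassThrough.class);
        // SequenceFile formats assumed only to keep the sketch compilable.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(PassMapper.class);
        job.setNumReduceTasks(0); // map-only: no sort, no shuffle, no reducer
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}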



>
> From: Michael Segel 
>To: common-user@hadoop.apache.org 
>Sent: Tuesday, July 31, 2012 5:24 AM
>Subject: Re: Merge Reducers Output
> 
>You really don't want to run a single reducer unless you know that you don't 
>have a lot of mappers. 
>
>As long as the output data types and structure are the same as the input, you 
>can run your code as the combiner, and then run it again as the reducer. 
>Problem solved with one or two lines of code. 
>If your input and output don't match, then you can use the existing code as a 
>combiner, and then write a new reducer. It could as easily be an identity 
>reducer too. (Don't know the exact problem.) 
>
>So here's a silly question. Why wouldn't you want to run a combiner? 
>
>
>On Jul 31, 2012, at 12:08 AM, Jay Vyas  wrote:
>
>> Its not clear to me that you need custom input formats
>> 
>> 1) Getmerge might work or
>> 
>> 2) Simply run a SINGLE reducer job (have mappers output static final int
>> key=1, or specify numReducers=1).
>> 
>> In this case, only one reducer will be called, and it will read through all
>> the values.
>> 
>> On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS  wrote:
>> 
>>> Hi
>>> 
>>> Why not use 'hadoop fs -getMerge 
>>> ' while copying files out of hdfs for the end users to
>>> consume. This will merge all the files in 'outputFolderInHdfs'  into one
>>> file and put it in lfs.
>>> 
>>> Regards
>>> Bejoy KS
>>> 
>>> Sent from handheld, please excuse typos.
>>> 
>>> -Original Message-
>>> From: Michael Segel 
>>> Date: Mon, 30 Jul 2012 21:08:22
>>> To: 
>>> Reply-To: common-user@hadoop.apache.org
>>> Subject: Re: Merge Reducers Output
>>> 
>>> Why not use a combiner?
>>> 
>>> On Jul 30, 2012, at 7:59 PM, Mike S wrote:
>>> 
As asked several times, I need to merge my reducers' output files.
Imagine I have many reducers which will generate 200 files. Now, to
merge them together, I have written another map reduce job where each
mapper reads a complete file in full into memory and outputs it, and
then only one reducer has to merge them together. To do so, I had to
write a custom fileinputreader that reads the complete file into
memory and then another custom fileoutputfileformat to append each
reducer item's bytes together. This is how my mapper and reducers
look:
 
public static class MapClass extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable>
{
    @Override
    public void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException
    {
        context.write(key, value);
    }
}

public static class Reduce extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable>
{
    @Override
    public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException
    {
        for (BytesWritable value : values)
        {
            context.write(NullWritable.get(), value);
        }
    }
}
 
I still have to have one reducer and that is a bottleneck. Please
note that I must do this merging, as the users of my MR job are outside
my hadoop environment and need the result as one file.

Is there a better way to merge the reducers' output files?
 
>>> 
>>> 
>> 
>> 
>> -- 
>> Jay Vyas
>> MMSB/UCHC
>
>
>
>

Re: Merge Reducers Output

2012-07-31 Thread Michael Segel
You really don't want to run a single reducer unless you know that you don't 
have a lot of mappers. 

As long as the output data types and structure are the same as the input, you 
can run your code as the combiner, and then run it again as the reducer. 
Problem solved with one or two lines of code. 
If your input and output don't match, then you can use the existing code as a 
combiner, and then write a new reducer. It could as easily be an identity 
reducer too. (Don't know the exact problem.) 

So here's a silly question. Why wouldn't you want to run a combiner? 


On Jul 31, 2012, at 12:08 AM, Jay Vyas  wrote:

> Its not clear to me that you need custom input formats
> 
> 1) Getmerge might work or
> 
> 2) Simply run a SINGLE reducer job (have mappers output static final int
> key=1, or specify numReducers=1).
> 
> In this case, only one reducer will be called, and it will read through all
> the values.
> 
> On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS  wrote:
> 
>> Hi
>> 
>> Why not use 'hadoop fs -getMerge 
>> ' while copying files out of hdfs for the end users to
>> consume. This will merge all the files in 'outputFolderInHdfs'  into one
>> file and put it in lfs.
>> 
>> Regards
>> Bejoy KS
>> 
>> Sent from handheld, please excuse typos.
>> 
>> -Original Message-
>> From: Michael Segel 
>> Date: Mon, 30 Jul 2012 21:08:22
>> To: 
>> Reply-To: common-user@hadoop.apache.org
>> Subject: Re: Merge Reducers Output
>> 
>> Why not use a combiner?
>> 
>> On Jul 30, 2012, at 7:59 PM, Mike S wrote:
>> 
>>> As asked several times, I need to merge my reducers' output files.
>>> Imagine I have many reducers which will generate 200 files. Now, to
>>> merge them together, I have written another map reduce job where each
>>> mapper reads a complete file in full into memory and outputs it, and
>>> then only one reducer has to merge them together. To do so, I had to
>>> write a custom fileinputreader that reads the complete file into
>>> memory and then another custom fileoutputfileformat to append each
>>> reducer item's bytes together. This is how my mapper and reducers
>>> look:
>>> 
>>> public static class MapClass extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable>
>>> {
>>>     @Override
>>>     public void map(NullWritable key, BytesWritable value, Context context)
>>>             throws IOException, InterruptedException
>>>     {
>>>         context.write(key, value);
>>>     }
>>> }
>>> 
>>> public static class Reduce extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable>
>>> {
>>>     @Override
>>>     public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context)
>>>             throws IOException, InterruptedException
>>>     {
>>>         for (BytesWritable value : values)
>>>         {
>>>             context.write(NullWritable.get(), value);
>>>         }
>>>     }
>>> }
>>> 
>>> I still have to have one reducer and that is a bottleneck. Please
>>> note that I must do this merging, as the users of my MR job are outside
>>> my hadoop environment and need the result as one file.
>>> 
>>> Is there a better way to merge the reducers' output files?
>>> 
>> 
>> 
> 
> 
> -- 
> Jay Vyas
> MMSB/UCHC