DataXceiver WRITE_BLOCK: Premature EOF from inputStream: Using Avro Multiple Outputs

2015-07-04 Thread ed
Hello,

We are running a job that makes use of Avro Multiple Outputs (Avro 1.7.5).
When there are lots of output files, the job fails with the following error,
which I believed was the cause of the failure:


hc1hdfs2p.thecarlylegroup.local:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:137)
        at java.lang.Thread.run(Thread.java:744)

This error starts to appear when we have lots of output directories due to
our use of AvroMultipleOutputs (all maps complete without issue; the
multi-output writes happen in the reducers, which are what fail). I went
ahead and increased dfs.datanode.max.xcievers to 8192 and reran the job.
Using Cloudera Manager I saw that the transceiver count across nodes maxed
out at 5376, so raising the limit to 8192 resolved that first error.
Unfortunately, the job still failed. Checking the HDFS logs, the error about
the xceiver limit was gone, but I now saw lots of the following:

hc1hdfs3p.thecarlylegroup.local:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.14.5.83:53280 dest: /10.14.5.81:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:711)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
        at java.lang.Thread.run(Thread.java:744)

The above errors seem to be happening a lot and I'm not sure whether they are
related to the job failure. They match exactly the error pattern seen in the
thread below (which unfortunately got no responses):

http://mail-archives.apache.org/mod_mbox/hadoop-user/201408.mbox/%3CCAJOOh6E1D1bx_9NrAUPPzAb6x1=fxd52rgqwxfzwy5tpjiw...@mail.gmail.com%3E

The only other warnings I see occurring around the same time as the job
failure are:

WARN Failed to place enough replicas, still in need of 1 to reach 3. For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy

WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor Exit code from container container_1417712817932_31879_01_002464 is : 143



Does anyone have any ideas about what could be causing the job to fail? I did
not see anything obvious looking through the Cloudera Manager charts or logs.
For example, open files were below the limit and memory usage was well within
what the nodes have (6 nodes with 90 GB each). No errors in YARN either.

Thank you!

Best,

Ed
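
For reference, a minimal sketch of how AvroMultipleOutputs (Avro 1.7.x, new mapreduce API) is typically wired up in a reducer; the named output "events", the per-key directory layout, the eventSchema variable and the toRecord() helper are illustrative assumptions, not details taken from the failing job above. The point it illustrates: every distinct base output path keeps its own Avro writer and HDFS output stream open until close(), and each open write pipeline ties up an xceiver thread on every datanode in the pipeline (three per open file at replication 3), so heavy fan-out multiplies the datanodes' transceiver count quickly.

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Driver side (once, before submission):
//   AvroMultipleOutputs.addNamedOutput(job, "events", AvroKeyOutputFormat.class, eventSchema);

public class FanOutReducer
        extends Reducer<Text, Text, AvroKey<GenericRecord>, NullWritable> {

    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            GenericRecord record = toRecord(value);
            String dir = key.toString();   // assumed: one output directory per key, e.g. "2015/07/04"
            // Each distinct baseOutputPath gets its own writer and DFS output stream,
            // which stays open (holding xceivers on the pipeline datanodes) until close().
            amos.write("events", new AvroKey<GenericRecord>(record),
                       NullWritable.get(), dir + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        amos.close();   // closes every open writer; forgetting this leaks streams until the task exits
    }

    /** Hypothetical conversion of an input line into an Avro record (schema handling omitted). */
    private GenericRecord toRecord(Text line) {
        throw new UnsupportedOperationException("replace with the job's real record construction");
    }
}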


Re: Job fails while re attempting the task in multiple outputs case

2013-12-30 Thread Jiayu Ji
I think if the task fails, the output related to that task will be cleaned up
before the second attempt. I am guessing you are getting this exception because
two reducers tried to write to the same file. One thing you need to be aware of
is that all data that is supposed to be in the same file should go to the same
reducer. Let's say you have data1 on reducer1 and data2 on reducer2, and both
are going to be stored in fileAll; then you will end up with that exception.
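
A minimal sketch of the routing Jiayu describes, assuming (hypothetically) that the first '|'-separated field of the key determines which output file a record belongs to: partition on that fragment alone, so every record bound for the same file reaches the same reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OutputFilePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Assumed key layout: "<fileKey>|<rest>"; only the fileKey decides the partition.
        String fileKey = key.toString().split("\\|", 2)[0];
        return (fileKey.hashCode() & Integer.MAX_VALUE) % numPartitions;  // non-negative, stable per file
    }
}

// Driver: job.setPartitionerClass(OutputFilePartitioner.class);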


On Mon, Dec 30, 2013 at 11:22 AM, AnilKumar B  wrote:

> Thanks Harsh.
>
> @Are you using the MultipleOutputs class shipped with Apache Hadoop or
> one of your own?
> I am using Apache Hadoop's MultipleOutputs.
>
> But as you can see in the stack trace, it's not appending the attempt id to the
> file name; it only contains the task id.
>
>
>
> Thanks & Regards,
> B Anil Kumar.

Re: Job fails while re attempting the task in multiple outputs case

2013-12-30 Thread AnilKumar B
Thanks Harsh.

@Are you using the MultipleOutputs class shipped with Apache Hadoop or
one of your own?
I am using Apache Hadoop's MultipleOutputs.

But as you can see in the stack trace, it's not appending the attempt id to the
file name; it only contains the task id.



Thanks & Regards,
B Anil Kumar.


On Mon, Dec 30, 2013 at 7:42 PM, Harsh J  wrote:

> Are you using the MultipleOutputs class shipped with Apache Hadoop or
> one of your own?
>
> If it's the latter, please take a look at the gotchas to take care of,
> described at
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2Fwrite-to_hdfs_files_directly_from_map.2Freduce_tasks.3F

Re: Job fails while re attempting the task in multiple outputs case

2013-12-30 Thread Harsh J
Are you using the MultipleOutputs class shipped with Apache Hadoop or
one of your own?

If it's the latter, please take a look at the gotchas to take care of,
described at
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2Fwrite-to_hdfs_files_directly_from_map.2Freduce_tasks.3F
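
A hedged sketch of the pattern that FAQ entry describes: any file a task creates on its own should go under the task attempt's work directory, so the OutputCommitter promotes it only when the attempt succeeds and a retried attempt never collides with files left by a failed one. The directory layout and file name below are illustrative, not taken from the job in this thread.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Per-attempt work directory under the job output dir; promoted only on successful commit.
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        Path sideFile = new Path(workDir, "2013/12/29/04/" + key.toString() + "-side.txt");  // illustrative layout
        FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(sideFile, false);  // attempt-private path, so no clash on retry
        try {
            for (Text value : values) {
                out.writeBytes(value.toString() + "\n");
            }
        } finally {
            out.close();
        }
    }
}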


Job fails while re attempting the task in multiple outputs case

2013-12-30 Thread AnilKumar B
Hi,

I am using multiple outputs in our job, so whenever any reduce task fails,
all of its subsequent task attempts fail with a file-exists exception.


The output file name should also include the task attempt, right? But it
only includes the task id. Is this a bug, or is something wrong on my side?

Where should I look in the source code? I went through the code at
FileOutputFormat$getTaskOutputPath(), but it only considers the task id.


Exception Trace:
13/12/29 09:13:00 INFO mapred.JobClient: Task Id : attempt_201312162255_60465_r_08_0, Status : FAILED
13/12/29 09:14:42 WARN mapred.JobClient: Error reading task output http://localhost:50050/tasklog?plaintext=true&attemptid=attempt_201312162255_60465_r_08_0&filter=stdout
13/12/29 09:14:42 WARN mapred.JobClient: Error reading task output http://localhost:50050/tasklog?plaintext=true&attemptid=attempt_201312162255_60465_r_08_0&filter=stderr
13/12/29 09:15:04 INFO mapred.JobClient:  map 100% reduce 93%
13/12/29 09:15:23 INFO mapred.JobClient:  map 100% reduce 96%
13/12/29 09:17:31 INFO mapred.JobClient:  map 100% reduce 97%
13/12/29 09:19:34 INFO mapred.JobClient: Task Id : attempt_201312162255_60465_r_08_1, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: failed to create /x/y/z/2013/12/29/04/o_2013_12_29_03-r-8.gz on client 10.103.10.31 either because the filename is invalid or the file exists
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1672)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1599)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:732)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:711)
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1448)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1442)

        at org.apache.hadoop.ipc.Client.call(Client.java:1118)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
        at $Proxy7.create(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
        at $Proxy7.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3753)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:937)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:207)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443)
        at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:131)
        at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:411)
        at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
        at com.team.hadoop.mapreduce1$Reducer1.reduce(MapReduce1.java:254)
        at com.team.hadoop.mapreduce1$Reducer1.reduce(MapReduce1.java:144)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.sec

Real Multiple Outputs for Hadoop -- is this implementation correct?

2013-09-13 Thread Paul Houle
Hey guys, I spent some time last week thinking about Hadoop before I wrote
my own class, RealMultipleOutputs, that does something like what
MultipleOutputs does, except that you can specify different HDFS paths for
the different output streams. My pals were telling me to use Cascading or
Pig if I want this functionality, but otherwise I was happy writing plain
M/R jars.

I wrote up the implementation here:

https://github.com/paulhoule/infovore/wiki/Real-Multiple-Outputs-in-Hadoop

And this works hand in hand with an abstraction layer that supports unit
testing with Mockito:

https://github.com/paulhoule/infovore/wiki/Unit-Testing-Hadoop-Mappers-and-Reducers

Anyway, I'd appreciate anybody looking at this code and trying to poke
holes in it. It runs OK on my tiny dev cluster on 1.0.4, 1.1.2 and on
AMZN EMR, but I am wondering if I missed something.


Re: Real Multiple Outputs for Hadoop -- is this implementation correct?

2013-09-13 Thread Harsh J
I took a very brief look, and the approach to use multiple OCs, one
per unique parent path from a task, seems the right thing to do. Nice
work! Do consider contributing this if it's working well for you :)
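
For anyone skimming the wiki page above, here is a rough sketch of that idea, assuming "OCs" refers to OutputCommitters: keep one FileOutputCommitter per unique parent path the task writes to and fan the task lifecycle calls out to each of them. This is only an illustration of the approach, not the actual RealMultipleOutputs code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class PerPathCommitter extends OutputCommitter {

    private final Map<Path, FileOutputCommitter> committers =
            new HashMap<Path, FileOutputCommitter>();

    /** Lazily create one committer per distinct parent path; the record-writer side
     *  calls this the first time it writes under a new path. */
    public FileOutputCommitter committerFor(Path parent, TaskAttemptContext ctx) throws IOException {
        FileOutputCommitter c = committers.get(parent);
        if (c == null) {
            c = new FileOutputCommitter(parent, ctx);
            c.setupTask(ctx);
            committers.put(parent, c);
        }
        return c;
    }

    @Override
    public void setupJob(JobContext ctx) throws IOException {
        // Per-path setup happens lazily in committerFor().
    }

    @Override
    public void setupTask(TaskAttemptContext ctx) throws IOException {
        // Nothing to do up front; paths are only known once the task starts writing.
    }

    @Override
    public boolean needsTaskCommit(TaskAttemptContext ctx) throws IOException {
        for (FileOutputCommitter c : committers.values()) {
            if (c.needsTaskCommit(ctx)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void commitTask(TaskAttemptContext ctx) throws IOException {
        for (FileOutputCommitter c : committers.values()) {
            c.commitTask(ctx);   // promote this path's attempt output to its final location
        }
    }

    @Override
    public void abortTask(TaskAttemptContext ctx) throws IOException {
        for (FileOutputCommitter c : committers.values()) {
            c.abortTask(ctx);    // discard this path's attempt output
        }
    }
}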




-- 
Harsh J


Re: Multiple outputs

2013-03-18 Thread Harsh J
MultipleOutputs is the way to go :)




--
Harsh J


Multiple outputs

2013-03-12 Thread Fatih Haltas
Hi Everyone,

I would like to have 2 different outputs (containing different columns of the
same input text file).
When I googled a bit, I found the MultipleOutputs classes. Is this the common
way of doing it, or is there any way to create something like an array of
contexts, i.e. to have two different Context objects as reducer output by
changing the "public void reduce(Text key, Iterable<Text> values, Context
context)" signature to take one more Context (Context context1, Context
context2)?

Any help will be appreciated.
Thank you very much.




Below is my reducer function; how should I modify it?



static class MyReducer extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        Iterator<Text> iter = values.iterator();
        while (iter.hasNext())
        {
            Text externalip_starttime_endtime = iter.next();
            Text outValue = new Text(externalip_starttime_endtime);
            context.write(key, new Text(outValue));
        }
    }
}
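
A minimal sketch of how the reducer above could be adapted to MultipleOutputs, per Harsh's suggestion. The named outputs "colsA" and "colsB", and the assumption that the value is a tab-separated line with at least three columns, are illustrative; adjust to the real record layout.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class TwoOutputReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            String[] cols = value.toString().split("\t");                // assumed: tab-separated columns
            mos.write("colsA", key, new Text(cols[0]));                   // first column set
            mos.write("colsB", key, new Text(cols[1] + "\t" + cols[2]));  // second column set
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // flushes and closes both named outputs
    }
}

// Driver (before submission):
//   MultipleOutputs.addNamedOutput(job, "colsA", TextOutputFormat.class, Text.class, Text.class);
//   MultipleOutputs.addNamedOutput(job, "colsB", TextOutputFormat.class, Text.class, Text.class);
//   // optional: LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class)
//   // so the default (unused) part-r-* files are not created.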