Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Harsh J
Tom, What I meant to say was that doing this is well supported by the existing API/libraries themselves: - The class MultipleOutputs supports providing a filename for an output. See MultipleOutputs.addNamedOutput usage [1]. - The type 'NullWritable' is a special writable that doesn't do anything. So

Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Tom Melendez
Hi Harsh, Cool, thanks for the details. For anyone interested, with your tip and description I was able to find an example in the book Hadoop in Action (Chapter 7, p. 168). Another question, though: it doesn't look like MultipleOutputs will let me control the filename in a per-key (per map)

Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Harsh J
Tom, You can theoretically add any number of named outputs from a single task itself, even from within the map() calls (addNamedOutput or addMultiNamedOutput checks within itself for dupes, so you don't have to). So yes, you can keep adding outputs and using them per-key, and given your earlier

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom, That assumes that you will never write to the same file from two different mappers or processes. HDFS currently does not support writing to a single file from multiple processes. --Bobby On 7/25/11 3:25 PM, Tom Melendez t...@supertom.com wrote: Hi Folks, Just doing a sanity check

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom, I also forgot to mention that if you are writing to lots of little files it could cause issues too. HDFS is designed to handle relatively few BIG files. There is some work to improve this, but it is still a ways off. So it is likely going to be very slow and put a big load on the

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Robert, In this specific case, that's OK. I'll never write to the same file from two different mappers. Otherwise, think it's cool? I haven't played with the outputformat before. Thanks, Tom On Mon, Jul 25, 2011 at 1:30 PM, Robert Evans ev...@yahoo-inc.com wrote: Tom, That assumes

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Bobby, Yeah, that won't be a big deal in this case. It will create about 40 files, each about 60MB. This job is kind of an odd one that won't be run very often. Thanks, Tom On Mon, Jul 25, 2011 at 1:34 PM, Robert Evans ev...@yahoo-inc.com wrote: Tom, I also forgot to mention that

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Harsh J
You can use MultipleOutputs (or MultipleTextOutputFormat for direct key-file mapping, but I'd still prefer the stable MultipleOutputs). Your output key can be of NullWritable type, and you can keep passing NullWritable.get() to it in every cycle. This would write just the value, while
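
The idea in the message above (route each record to a file chosen by its key, and write only the value) can be sketched without Hadoop. What follows is a plain-Java analogy using only the standard library. It is not the MultipleOutputs API. The class name PerKeyValueWriter, the ".txt" suffix, and the demo keys and values are all hypothetical, chosen only for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java analogy (no Hadoop classes): one output file per key, values only. */
class PerKeyValueWriter implements AutoCloseable {
    private final Path outputDir;
    // One open writer per key, created lazily the first time a key is seen.
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    PerKeyValueWriter(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Writes only the value, as one line, to the file named after the key. */
    void write(String key, String value) {
        try {
            BufferedWriter w = writers.get(key);
            if (w == null) {
                // Assumption: keys are safe to use as file names.
                w = Files.newBufferedWriter(
                        outputDir.resolve(key + ".txt"), StandardCharsets.UTF_8);
                writers.put(key, w);
            }
            w.write(value);
            w.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes every per-key writer, reporting the first failure if any. */
    @Override
    public void close() {
        IOException first = null;
        for (BufferedWriter w : writers.values()) {
            try {
                w.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        writers.clear();
        if (first != null) {
            throw new UncheckedIOException(first);
        }
    }

    /** Writes three records into a temp directory and reads the files back. */
    static Map<String, List<String>> demo() {
        try {
            Path dir = Files.createTempDirectory("per-key-demo");
            try (PerKeyValueWriter out = new PerKeyValueWriter(dir)) {
                out.write("alpha", "first");
                out.write("beta", "second");
                out.write("alpha", "third");
            }
            Map<String, List<String>> result = new TreeMap<>();
            result.put("alpha",
                    Files.readAllLines(dir.resolve("alpha.txt"), StandardCharsets.UTF_8));
            result.put("beta",
                    Files.readAllLines(dir.resolve("beta.txt"), StandardCharsets.UTF_8));
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints: {alpha=[first, third], beta=[second]}
        System.out.println(demo());
    }
}
```

Each file ends up holding only values, with no key column. As noted elsewhere in the thread, this only works if no two processes write to the same file, and it is best kept to a modest number of reasonably large files.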

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Harsh, Thanks for the response. Unfortunately, I'm not following your response. :-) Could you elaborate a bit? Thanks, Tom On Mon, Jul 25, 2011 at 2:10 PM, Harsh J ha...@cloudera.com wrote: You can use MultipleOutputs (or MultipleTextOutputFormat for direct key-file mapping, but I'd still