Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Folks, Just doing a sanity check here. I have a map-only job, which produces a filename for a key and data as a value. I want to write the value (data) into the key (filename) in the path specified when I run the job. The value (data) doesn't need any formatting, I can just write it to HDFS
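
A minimal plain-Java sketch of the idea in the question, where the key is the filename and the value is written unformatted into the output directory. This is not a Hadoop RecordWriter: `KeyNamedFileWriter` and its `write` method are hypothetical names, and the sketch targets the local filesystem rather than HDFS. Refusing to overwrite an existing file is an assumption of the sketch, not something the post asks for.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in: key = filename, value = raw bytes, local filesystem only.
final class KeyNamedFileWriter {
    private final Path outputDir;

    KeyNamedFileWriter(Path outputDir) {
        this.outputDir = outputDir.toAbsolutePath().normalize();
    }

    // Writes the value as-is to outputDir/<key> and returns the path written.
    Path write(String key, byte[] value) {
        Path target = outputDir.resolve(key).normalize();
        // Reject keys such as "../x" that would land outside the output directory.
        if (!target.startsWith(outputDir)) {
            throw new IllegalArgumentException("key escapes output dir: " + key);
        }
        try {
            Files.createDirectories(target.getParent());
            // CREATE_NEW fails if the file already exists (sketch assumption: one writer per file).
            return Files.write(target, value,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("job-output");
        KeyNamedFileWriter writer = new KeyNamedFileWriter(dir);
        Path written = writer.write("part-a.dat", "raw bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

In a real job the same logic would sit behind Hadoop's output classes, with the output directory taken from the job configuration.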

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom, That assumes that you will never write to the same file from two different mappers or processes. HDFS currently does not support writing to a single file from multiple processes. --Bobby

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Robert Evans
Tom, I also forgot to mention that if you are writing to lots of little files it could cause issues too. HDFS is designed to handle relatively few BIG files. There is some work to improve this, but it is still a ways off. So it is likely going to be very slow and put a big load on the nameno

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Robert, In this specific case, that's OK. I'll never write to the same file from two different mappers. Otherwise, think it's cool? I haven't played with the outputformat before. Thanks, Tom

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Bobby, Yeah, that won't be a big deal in this case. It will create about 40 files of about 60MB each. This job is kind of an odd one that won't be run very often. Thanks, Tom

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Harsh J
You can use MultipleOutputs (or MultipleTextOutputFormat for direct key-file mapping, but I'd still prefer the stable MultipleOutputs). Your sinking Key can be of NullWritable type, and you can keep passing NullWritable.get() to it in every cycle. This would write just the value, while

Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Harsh, Thanks for the response. Unfortunately, I'm not following your response. :-) Could you elaborate a bit? Thanks, Tom

Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Harsh J
Tom, What I meant to say was that doing this is well supported by the existing API and libraries:
- The class MultipleOutputs supports providing a filename for an output. See MultipleOutputs.addNamedOutput usage [1].
- The type 'NullWritable' is a special writable that doesn't do anything. So if
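
The effect of a "do nothing" key can be sketched in plain Java, assuming a text-style writer that emits the key, a tab, and then the value on each line. `LineFormatSketch`, `NO_KEY`, and `formatLine` are hypothetical names; `NO_KEY` only plays the role that NullWritable plays in Hadoop, and no Hadoop classes are used.

```java
// Hypothetical model of how a text-style writer can skip a "null" key.
final class LineFormatSketch {
    // Sentinel standing in for NullWritable.get(): a key that contributes no output.
    static final Object NO_KEY = new Object();

    // Returns the line that would be written for one (key, value) record.
    static String formatLine(Object key, Object value) {
        boolean skipKey = key == null || key == NO_KEY;
        StringBuilder line = new StringBuilder();
        if (!skipKey) {
            // Normal case: key, then a tab separator.
            line.append(key).append('\t');
        }
        // The value is always written, followed by a newline.
        line.append(value).append('\n');
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.print(formatLine("report.csv", "a,b,c")); // key, tab, value
        System.out.print(formatLine(NO_KEY, "a,b,c"));       // value only
    }
}
```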

Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Tom Melendez
Hi Harsh, Cool, thanks for the details. For anyone interested, with your tip and description I was able to find an example in the "Hadoop in Action" book (Chapter 7, p168). Another question, though: it doesn't look like MultipleOutputs will let me control the filename in a per-key (per map)

Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Harsh J
Tom, You can theoretically add any number of named outputs from a single task itself, even from within the map() calls (addNamedOutput or addMultiNamedOutput checks within itself for dupes, so you don't have to). So yes, you can keep adding outputs and using them per-key, and given your earlier d
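
The "one output per key, added on first use" pattern can be sketched in plain Java as a map of open streams. This is only a model of the idea: `PerKeyOutputs` and its methods are hypothetical, it writes to the local filesystem, and it does not call MultipleOutputs or any other Hadoop class.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: lazily opens one output file per key and reuses it afterwards.
final class PerKeyOutputs implements AutoCloseable {
    private final Path outputDir;
    private final Map<String, OutputStream> open = new LinkedHashMap<>();

    PerKeyOutputs(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Appends the value to the stream for this key, opening it on first use.
    void write(String key, byte[] value) {
        try {
            OutputStream out = open.get(key);
            if (out == null) {
                // First record for this key: create the file and remember the stream.
                Files.createDirectories(outputDir);
                out = Files.newOutputStream(outputDir.resolve(key));
                open.put(key, out);
            }
            out.write(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Number of distinct keys that currently have an open output.
    int openCount() {
        return open.size();
    }

    // Closes every stream that was opened; call once at the end of the task.
    @Override
    public void close() {
        try {
            for (OutputStream out : open.values()) {
                out.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            open.clear();
        }
    }
}
```

Keeping every stream open until the end is reasonable for a few dozen keys, as in this thread, but it would not scale to a very large number of distinct keys.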