Tom,
What I meant to say was that this is well supported by the existing
API/libraries:
- The class MultipleOutputs supports providing a filename for an
output. See MultipleOutputs.addNamedOutput usage [1].
- The type 'NullWritable' is a special Writable that carries no
data. So
Hi Harsh,
Cool, thanks for the details. For anyone interested, with your tip
and description I was able to find an example in the book Hadoop in
Action (Chapter 7, p. 168).
Another question, though, it doesn't look like MultipleOutputs will
let me control the filename in a per-key (per map)
Tom,
You can, in theory, add any number of named outputs from a single
task, even from within the map() calls (addNamedOutput and
addMultiNamedOutput check for dupes themselves, so you don't have
to). So yes, you can keep adding outputs and using them per-key, and
given your earlier
Tom,
That assumes that you will never write to the same file from two different
mappers or processes. HDFS currently does not support writing to a single file
from multiple processes.
--Bobby
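
A minimal plain-Java sketch of the pattern discussed so far: one output
file per key, only the value written on each line, and exactly one writer
per file. This is not Hadoop code. The class PerKeyWriter, the method
fileNameFor and the ".txt" suffix are made up for illustration; in a real
job MultipleOutputs would own the writers and a NullWritable key would
take the place of "write the value only".

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: routes each record to a file
// named after its key and writes only the value.
public class PerKeyWriter implements AutoCloseable {
    private final Path dir;
    // One open writer per file name, so a file never has two writers here.
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    public PerKeyWriter(Path dir) {
        this.dir = dir;
    }

    // Keep only letters and digits so the key is safe to use as a file name.
    // Keys that differ only in other characters end up in the same file.
    public static String fileNameFor(String key) {
        return key.replaceAll("[^A-Za-z0-9]", "") + ".txt";
    }

    // Append the value as one line to the file for this key; the key itself
    // is never written.
    public void write(String key, String value) throws IOException {
        String name = fileNameFor(key);
        BufferedWriter w = writers.get(name);
        if (w == null) {
            w = Files.newBufferedWriter(dir.resolve(name), StandardCharsets.UTF_8);
            writers.put(name, w);
        }
        w.write(value);
        w.newLine();
    }

    @Override
    public void close() throws IOException {
        for (BufferedWriter w : writers.values()) {
            w.close();
        }
    }
}
```

The sketch keeps all writers inside one process. It does nothing to stop a
second process from opening the same file, which is the case the message
above warns about.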
On 7/25/11 3:25 PM, Tom Melendez t...@supertom.com wrote:
Hi Folks,
Just doing a sanity check
Tom,
I also forgot to mention that if you are writing to lots of little files it
could cause issues too. HDFS is designed to handle relatively few BIG files.
There is some work to improve this, but it is still a ways off. So it is
likely going to be very slow and put a big load on the
Hi Robert,
In this specific case, that's OK. I'll never write to the same file
from two different mappers. Otherwise, think it's cool? I haven't
played with the outputformat before.
Thanks,
Tom
On Mon, Jul 25, 2011 at 1:30 PM, Robert Evans ev...@yahoo-inc.com wrote:
Tom,
That assumes
Hi Bobby,
Yeah, that won't be a big deal in this case. It will create about 40
files, each about 60MB. This job is kind of an odd one that
won't be run very often.
Thanks,
Tom
On Mon, Jul 25, 2011 at 1:34 PM, Robert Evans ev...@yahoo-inc.com wrote:
Tom,
I also forgot to mention that
You can use MultipleOutputs (or MultipleTextOutputFormat for direct
key-file mapping, but I'd still prefer the stable MultipleOutputs).
Your output key can be of type NullWritable, and you can pass
NullWritable.get() as the key on every write. This would
write just the value, while
Hi Harsh,
Thanks for the response. Unfortunately, I'm not following it. :-)
Could you elaborate a bit?
Thanks,
Tom
On Mon, Jul 25, 2011 at 2:10 PM, Harsh J ha...@cloudera.com wrote:
You can use MultipleOutputs (or MultipleTextOutputFormat for direct
key-file mapping, but I'd still