Hi All,
I had an MR job that processed 2000 small (<3MB each) files, and it took 40
minutes on 8 nodes. Since the files are small, it triggered 2000 map tasks.
I packed my 2000 files into a single 445MB sequence file
(K,V == Text,Text, i.e. file name, file contents). The new MR job
triggers 7 map tasks (approx 64MB each) bu
To get around the small-file problem (I have thousands of 2MB log files) I wrote
a class to convert all my log files into a single SequenceFile in
(Text key, BytesWritable value) format. That works fine. I can run this:
hadoop fs -text /my.seq | grep peemt114.log | head -1
10/07/08 15:02:
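For what it's worth, a converter like the one described above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual class: the class name, the local input directory argument, and the use of the old createWriter signature are my own assumptions.

```java
// Sketch: pack many small local log files into one SequenceFile on HDFS,
// keyed by file name (Text) with the raw file bytes as the value
// (BytesWritable). Names and argument layout are illustrative.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LogsToSequenceFile {
    public static void main(String[] args) throws Exception {
        // args[0] = local directory of log files, args[1] = HDFS output, e.g. /my.seq
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        try {
            for (File f : new File(args[0]).listFiles()) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                // One record per log file: (file name, file contents)
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}
```

Because the key is the file name, `hadoop fs -text /my.seq | grep somefile.log` finds the record for a single original file, as shown above.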
Hi,
I'm using Cloudera's 0.20.2+228 release.
How do I create a custom Counter using the NEW API?
In my Mapper class I tried this:
public class MyMapper extends Mapper {
static enum recordTypes { GOOD, BAD, IGNORED };
public void map(Object key, Text value, Context context)
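With the new (org.apache.hadoop.mapreduce) API, the enum approach is right; the counter is incremented through the Context. A minimal sketch, in which the Mapper type parameters and the map body are illustrative assumptions rather than the poster's code:

```java
// Sketch: custom Counters via a static enum in a new-API Mapper.
// Hadoop groups the counters under the enum's class name; increment
// them with context.getCounter(enumValue).increment(n).
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    static enum recordTypes { GOOD, BAD, IGNORED };

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            // Count empty records as IGNORED (illustrative rule)
            context.getCounter(recordTypes.IGNORED).increment(1);
        } else {
            context.getCounter(recordTypes.GOOD).increment(1);
            context.write(value, new LongWritable(1));
        }
    }
}
```

The totals show up in the job's counter listing alongside the built-in counters, and can be read back from the Job object after completion via getCounters().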
specify the number of reducers as the
number of files you want, which is not the best option if some days have more
data than others. You also don't have control over the file name. See Tom
White's Hadoop: The Definitive Guide for an excellent example and usage.
Thanks and Regards,
Sonal
Hi,
I'm trying to understand how to generate multiple outputs in my reducer (using
0.20.2+228).
Do I need MultipleOutputs, or should I partition my output in the mapper?
My reducer currently gets key/value input pairs like this, which all end up in
my single part-r-* file:
hostA_VarX_2010-05-01_mor
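One possible approach is the new-API MultipleOutputs, assuming your distribution ships org.apache.hadoop.mapreduce.lib.output.MultipleOutputs (it was backported into some 0.20-based releases). A hedged sketch; the reducer name, the key-parsing rule, and the output path scheme are my own illustrative assumptions:

```java
// Sketch: route reducer output into per-host files using MultipleOutputs.
// The write(key, value, baseOutputPath) variant reuses the job's configured
// OutputFormat, so no named outputs need to be declared in the driver.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Keys look like "hostA_VarX_2010-05-01_..."; take the host prefix
        // and write each record under <output dir>/hostA/part-r-*.
        String host = key.toString().split("_")[0];
        for (Text v : values) {
            mos.write(key, v, host + "/part");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();  // flush all side outputs
    }
}
```

Partitioning in the mapper only controls which reducer a key reaches, not which file it lands in, so MultipleOutputs (or one job per output) is the usual route when you want separate files per key group.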