Ted,

Good point. Patches are welcome :) I will add it to my to-do list.
Edward

On Sat, Sep 25, 2010 at 12:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Edward:
> Thanks for the tool.
>
> I think the last parameter can be omitted if you follow what hadoop fs -text
> does. It looks at a file's magic number so that it can attempt to *detect*
> the type of the file.
>
> Cheers
>
> On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> Many times a Hadoop job produces one file per reducer, and the job has
>> many reducers. Or a map-only job produces one output file per input file,
>> and you have many input files. Or you simply have many small files from
>> some external process. Hadoop handles small files suboptimally.
>> There are ways to deal with this inside a MapReduce program, for example
>> IdentityMapper + IdentityReducer, or MultipleOutputs. However, we wanted
>> a tool that could be used by people using Hive, Pig, or MapReduce. We
>> wanted to let people combine a directory with multiple files, or a
>> hierarchy of directories such as the root of a Hive-partitioned table.
>> We also wanted to be able to combine text or sequence files.
>>
>> What we came up with is the filecrusher.
>>
>> Usage:
>> /usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
>> /user/edward/backup 50 SEQUENCE
>> (50 is the number of mappers here)
>>
>> Code is Apache V2 and you can get it here:
>> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>>
>> Enjoy,
>> Edward
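For reference, the magic-number detection Ted suggests can be sketched roughly as follows. A SequenceFile begins with the three bytes "SEQ" followed by a version byte, so the file-type argument could in principle be inferred instead of passed explicitly. This is only an illustrative sketch, not part of filecrush; the function name and paths are made up.

```python
def detect_file_type(path):
    """Guess SEQUENCE vs TEXT from the file's magic number.

    A Hadoop SequenceFile starts with the bytes b"SEQ" (followed by a
    version byte); anything else is treated as plain text here. This is
    a simplification: real detection would also need to handle other
    formats and empty files.
    """
    with open(path, "rb") as f:
        header = f.read(3)
    return "SEQUENCE" if header == b"SEQ" else "TEXT"
```

With something like this, the crusher could look at the first file in the input directory and pick the output format automatically, the way `hadoop fs -text` decides how to decode a file.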