Can't you solve for the --max-file-blocks option, given that you know the sizes of the input files and the desired number of output files?
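Back-of-the-envelope sketch (my numbers are assumptions: ~2.5 GB total, taken from the "50 files of ~50 MB" mentioned down-thread, and a 128 MB dfs block size; filecrush caps each output file at roughly max-file-blocks * dfs.block.size, so this only approximates the file count):

    public class MaxFileBlocks {
      public static void main(String[] args) {
        long totalInputBytes = 50L * 50 * 1024 * 1024; // ~2.5 GB of small parts (from the thread)
        long blockSize       = 128L * 1024 * 1024;     // assumption: 128 MB dfs.block.size
        int  desiredOutputs  = 50;

        // Bytes each output file would need to hold, rounded up.
        long bytesPerOutput = (totalInputBytes + desiredOutputs - 1) / desiredOutputs;
        // Blocks per output file, rounded up, at least 1.
        long maxFileBlocks = Math.max(1, (bytesPerOutput + blockSize - 1) / blockSize);
        System.out.println("--max-file-blocks " + maxFileBlocks); // prints 1 for these numbers
      }
    }

Note this solves for a size cap, not an exact file count: with these numbers you'd likely get about 20 files of up to 128 MB each, not exactly 50.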
On Wed, Jul 31, 2013 at 12:21 PM, Something Something <mailinglist...@gmail.com> wrote:

> Thanks, John. But I don't see an option to specify the number of output
> files. How does Crush decide how many files to create? Is it only based
> on file sizes?
>
> On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meag...@gmail.com> wrote:
>
> > Here's a great tool for handling exactly that case:
> > https://github.com/edwardcapriolo/filecrush
> >
> > On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> > <mailinglist...@gmail.com> wrote:
> >
> > > Each bz2 file after merging is about 50 MB. The reducers take about
> > > 9 minutes.
> > >
> > > Note: 'getmerge' is not an option. There isn't enough disk space to
> > > do a getmerge on the local production box. Plus, we need a scalable
> > > solution, as these files will get a lot bigger soon.
> > >
> > > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjij...@gmail.com> wrote:
> > >
> > > > How big are your 50 files? How long are the reducers taking?
> > > >
> > > > On Jul 30, 2013, at 10:26 PM, Something Something
> > > > <mailinglist...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > One of our Pig scripts creates over 500 small part files. To save
> > > > > on namespace, we need to cut down the number of files, so instead
> > > > > of saving 500 small files we need to merge them into 50. We tried
> > > > > the following:
> > > > >
> > > > > 1) When we set the parallel number to 50, the Pig script takes a
> > > > > long time - for obvious reasons.
> > > > > 2) If we use Hadoop Streaming, it puts some garbage values into
> > > > > the key field.
> > > > > 3) We wrote our own MapReduce program that reads these 500 small
> > > > > part files and uses 50 reducers. Basically, the mappers simply
> > > > > write each line, and the reducers loop through the values and
> > > > > write them out. We set job.setOutputKeyClass(NullWritable.class)
> > > > > so that the key is not written to the output file. This performs
> > > > > better than Pig. The mappers run very fast, but the reducers take
> > > > > some time to complete; still, this approach seems to be working
> > > > > well.
> > > > >
> > > > > Is there a better way to do this? What strategy can you think of
> > > > > to increase the speed of the reducers?
> > > > >
> > > > > Any help in this regard will be greatly appreciated. Thanks.

--
https://github.com/bearrito
@deepbearrito
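For reference, a minimal sketch of the merge job described up-thread, assuming plain-text input. The random bucket key is my own addition (the thread doesn't say how lines are spread across the 50 reducers); the NullWritable output key keeps TextOutputFormat from writing a key:

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeParts {

      // Scatter each line into a random bucket so all reducers get a
      // roughly even share of the input.
      public static class ScatterMapper
          extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final IntWritable bucket = new IntWritable();
        private final Random rng = new Random();
        private int numBuckets;

        @Override
        protected void setup(Context ctx) {
          numBuckets = ctx.getNumReduceTasks();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          bucket.set(rng.nextInt(numBuckets));
          ctx.write(bucket, line);
        }
      }

      // Drop the bucket key; with a NullWritable key, TextOutputFormat
      // writes only the line itself.
      public static class PassThroughReducer
          extends Reducer<IntWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(IntWritable bucket, Iterable<Text> lines, Context ctx)
            throws IOException, InterruptedException {
          for (Text line : lines) {
            ctx.write(NullWritable.get(), line);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge-small-parts");
        job.setJarByClass(MergeParts.class);
        job.setMapperClass(ScatterMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setNumReduceTasks(50);                    // 500 parts -> 50 files
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);    // key not written to output
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setCompressOutput(job, true);               // bz2, as in the thread
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));        // dir of small parts
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

One caveat: with bz2 output, the reducers spend much of their time compressing, which may explain the 9-minute reduce phase more than the shuffle itself.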