let me rephrase my question: are all parts of a MapFile necessarily affected by a merge? if so, it isn't scalable, no matter what the block size is. however, since a MapFile is essentially a directory and not a single file, I don't see why all of its parts would have to be rewritten. can anyone comment on the actual implementation of the merge algorithm?
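For concreteness, here is a minimal sketch of what I mean by merge-by-rewrite, using the org.apache.hadoop.io.MapFile API. This is not the actual Hadoop implementation (that's what I'm asking about), and the input/output paths and Text key/value types are made up. It just illustrates that a straightforward merge reads every entry of both inputs and writes a whole new MapFile, nothing is patched in place:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileMergeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // MapFile keys are already sorted, so a two-way merge is enough.
    MapFile.Reader a = new MapFile.Reader(fs, "in/a", conf);
    MapFile.Reader b = new MapFile.Reader(fs, "in/b", conf);
    MapFile.Writer out =
        new MapFile.Writer(conf, fs, "out/merged", Text.class, Text.class);

    Text ka = new Text(), va = new Text();
    Text kb = new Text(), vb = new Text();
    boolean hasA = a.next(ka, va);
    boolean hasB = b.next(kb, vb);

    // Standard merge over sorted keys: always append the smaller key,
    // then advance the reader it came from.
    while (hasA || hasB) {
      if (!hasB || (hasA && ka.compareTo(kb) <= 0)) {
        out.append(ka, va);
        hasA = a.next(ka, va);
      } else {
        out.append(kb, vb);
        hasB = b.next(kb, vb);
      }
    }
    a.close();
    b.close();
    out.close();
  }
}

The cost of this is proportional to the total size of both inputs, which is exactly the scaling problem I'm worried about if the large MapFile has to be one of them every time.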
Elia Mazzawi-2 wrote:
> it has to do with the data block size.
>
> I had many small files and the performance became much better when I
> merged them.
>
> the default block size is 64 MB, so redo your files to <= 64 MB (what I
> did and recommend), or reconfigure your Hadoop:
>
> <property>
>   <name>dfs.block.size</name>
>   <value>67108864</value>
>   <description>The default block size for new files.</description>
> </property>
>
> do something like
>
>   cat * | rotatelogs ./merged/m 64M
>
> it will merge and chop the data into 64 MB pieces for you.
>
> yoav.morag wrote:
>> hi all -
>> can anyone comment on the performance cost of merging many small files
>> into an increasingly large MapFile? will that cost depend on the size
>> of the larger MapFile (since I have to rewrite it), or is there a
>> built-in strategy to split it into smaller parts, affecting only those
>> that were touched?
>> thanks -
>> Yoav.
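For anyone who would rather not edit hadoop-site.xml cluster-wide: the block size can also be set from client code. This is a hedged sketch, the path, buffer size, and replication factor below are made-up example values, using the client Configuration and the long form of FileSystem.create(), which takes the block size per file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Applies to new files created through this client.
    conf.setLong("dfs.block.size", 64L * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    // Or set it per file via the long create() overload.
    FSDataOutputStream out = fs.create(new Path("/tmp/merged"),
        true,                // overwrite if it exists
        4096,                // io buffer size
        (short) 3,           // replication factor
        64L * 1024 * 1024);  // block size in bytes
    out.close();
  }
}

Either way, the setting only affects files written after the change; existing files keep the block size they were created with.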