On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas <mcsri...@gmail.com> wrote:
>
> > There are more factors to consider than just the size of the file. How
> > long can you wait before you *have to* process the data? 5 minutes?
> > 5 hours? 5 days? If you want good timeliness, you need to roll over
> > faster. The longer you wait:
> >
> > 1. the lower the load on the NN,
> > 2. but the poorer the timeliness,
> > 3. and the larger the chance of lost data (i.e., the data is not saved
> >    until the file is closed and rolled over, unless you want to sync()
> >    after every write).
>
> To begin with, I was going to use Flume and specify a rollover file size.
> I understand the above parameters; I just want to ensure that too many
> small files don't cause problems on the NameNode. For instance, there
> would be times when we get GBs of data in an hour and at times only a few
> hundred MB. From what Harsh, Edward and you have described, too many small
> files don't cause issues with the NameNode but rather increase processing
> times. Looks like I need to find that balance.
>
> It would also be interesting to see how others solve this problem when
> not using Flume.

They use NFS with MapR. Any and all log-rotators (like the one in log4j)
simply work over NFS, and MapR does not have a NN, so the problems with
small files or number of files do not exist.

> > On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia <mohitanch...@gmail.com>
> > wrote:
> >
> > > We have a continuous flow of data into the sequence file. I am
> > > wondering what would be the ideal file size before the file gets
> > > rolled over. I know too many small files are not good, but could
> > > someone tell me what would be the ideal size such that it doesn't
> > > overload the NameNode?
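
The trade-off discussed in this thread (fewer files for the NN versus timeliness and the window of unsaved data) is commonly handled by rolling on whichever of a size cap or an age cap is reached first. Below is a minimal Python sketch of that idea. The `RollPolicy` class, the `files_per_hour` helper, and the 128 MB / 10 minute thresholds are all hypothetical and chosen only for illustration; they are not Flume or Hadoop APIs or recommended values.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class RollPolicy:
    """Hypothetical policy: roll on size OR age, whichever is hit first."""
    max_bytes: int   # size cap: bounds file size during busy hours
    max_age_s: int   # age cap: bounds how long data waits before it is closed

    def should_roll(self, bytes_written: int, age_s: float) -> bool:
        # Never roll an empty file; it would only add a useless entry.
        if bytes_written == 0:
            return False
        return bytes_written >= self.max_bytes or age_s >= self.max_age_s


def files_per_hour(policy: RollPolicy, bytes_per_hour: int) -> int:
    """Rough number of files created in one hour at a steady ingest rate."""
    if bytes_per_hour <= 0:
        return 0
    by_size = -(-bytes_per_hour // policy.max_bytes)  # ceiling division
    by_age = -(-3600 // policy.max_age_s)             # ceiling division
    # The file rolls at the earlier of the two limits, so the larger
    # of the two counts is the one that applies.
    return max(by_size, by_age)


# Illustrative thresholds only (assumptions, not tuned values).
policy = RollPolicy(max_bytes=128 * MB, max_age_s=600)
busy_hour = files_per_hour(policy, 4 * 1024 * MB)   # several GB in an hour
quiet_hour = files_per_hour(policy, 300 * MB)       # a few hundred MB
print(busy_hour, quiet_hour)
```

With these assumed thresholds, the size cap dominates in a busy hour and the age cap dominates in a quiet hour, so the file count per hour stays bounded on both ends while data never waits longer than the age cap before the file is closed.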