On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas <mcsri...@gmail.com> wrote:
>
> > There are more factors to consider than just the size of the file. How long can
> > you wait before you *have to* process the data?  5 minutes? 5 hours? 5
> > days?  If you want good timeliness, you need to roll-over faster.  The
> > longer you wait:
> >
> > 1.  the lower the load on the NN,
> > 2.  but the poorer the timeliness,
> > 3.  and the greater the chance of losing data (i.e., the data is not saved
> > until the file is closed and rolled over, unless you sync() after every
> > write).
> >
> To begin with, I was going to use Flume and specify a rollover file size. I
> understand the above parameters; I just want to ensure that too many small
> files don't cause problems on the NameNode. For instance, there would be
> times when we get GBs of data in an hour and at times only a few hundred MB.
> From what Harsh, Edward and you've described, too many small files don't
> cause issues with the NameNode but rather increase processing times. Looks
> like I need to find that balance.
>
> It would also be interesting to see how others solve this problem when not
> using Flume.
>


They use NFS with MapR.

Any and all log-rotators (like the one in log4j) simply work over NFS,
and MapR does not have a NN, so the problems with small files or the number
of files do not exist.
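
For setups that do roll files on HDFS, the trade-off quoted above (fewer,
larger files against timeliness and exposure to data loss) can be sketched as
a policy that rolls on whichever limit is hit first. This is only an
illustration: RollPolicy, max_bytes, and max_age_s are hypothetical names, not
Flume or Hadoop settings, and the thresholds below are placeholder values.

```python
import time


class RollPolicy:
    """Hypothetical roll-over policy: roll on size or age, whichever comes first."""

    def __init__(self, max_bytes, max_age_s, clock=time.monotonic):
        self.max_bytes = max_bytes    # size cap: keeps files large enough to be few
        self.max_age_s = max_age_s    # age cap: bounds how stale unclosed data gets
        self.clock = clock            # injectable so the policy can be tested
        self.reset()

    def reset(self):
        """Call after opening a new file."""
        self.bytes_written = 0
        self.opened_at = self.clock()

    def record(self, nbytes):
        """Call after each write with the number of bytes appended."""
        self.bytes_written += nbytes

    def should_roll(self):
        """True when the current file should be closed and a new one opened."""
        if self.bytes_written == 0:
            return False  # never roll an empty file; that only adds small files
        too_big = self.bytes_written >= self.max_bytes
        too_old = self.clock() - self.opened_at >= self.max_age_s
        return too_big or too_old


if __name__ == "__main__":
    # Placeholder thresholds (assumptions): 1 GiB or one hour.
    policy = RollPolicy(max_bytes=1024 ** 3, max_age_s=3600)
    policy.record(200 * 1024 ** 2)  # a quiet hour: only 200 MiB so far
    print(policy.should_roll())     # size cap not reached, file just opened
```

The size cap bounds the file count during heavy hours, and the age cap bounds
how long data sits unclosed during quiet ones.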



>
>
> >
> >
> > On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> >
> > > We have a continuous flow of data into the sequence file. I am wondering
> > > what would be the ideal file size before the file gets rolled over. I know
> > > too many small files are not good, but could someone tell me what would be
> > > the ideal size such that it doesn't overload the NameNode?
> > >
> >
>
