Right now data is received in parallel and written to a queue; a single thread then reads the queue and writes those messages to an FSDataOutputStream, which is kept open, but the messages never get flushed. I've tried flush() and sync() with no joy.

1. outputStream.writeBytes(rawMessage.toString());
2. log.debug("Flushing stream, size = " + s.getOutputStream().size());
   s.getOutputStream().sync();
   log.debug("Flushed stream, size = " + s.getOutputStream().size());

or

   log.debug("Flushing stream, size = " + s.getOutputStream().size());
   s.getOutputStream().flush();
   log.debug("Flushed stream, size = " + s.getOutputStream().size());

Either way, size() just remains the same after performing the action. This is using hadoop-0.20.0.

-sd

On Sun, May 10, 2009 at 4:45 PM, Stefan Podkowinski <spo...@gmail.com> wrote:
> You just can't have many distributed jobs write into the same file
> without locking/synchronizing these writes, even with append(). It's
> no different than using a regular file from multiple processes in
> this respect.
> Maybe you need to collect your data up front before processing it in
> Hadoop? Have a look at Chukwa: http://wiki.apache.org/hadoop/Chukwa
>
>
> On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
> > Would WritableFactories not allow me to open one output stream and
> > continue to write() and sync()?
> >
> > Maybe I'm reading into that wrong. Although UUID would be nice, it
> > would still leave me with the problem of having lots of little files
> > instead of a few large files.
> >
> > -sd
> >
> > On Sat, May 9, 2009 at 8:37 AM, jason hadoop <jason.had...@gmail.com> wrote:
> >
> >> You must create unique file names; I don't believe (but I do not
> >> know) that the append code will allow multiple writers.
> >>
> >> Are you writing from within a task, or as an external application
> >> writing into Hadoop?
> >>
> >> You may try using UUID,
> >> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part
> >> of your filename.
> >> Without knowing more about your goals, environment and constraints
> >> it is hard to offer any more detailed suggestions.
> >> You could also have an application aggregate the streams and write
> >> out chunks, with one or more writers, one per output file.
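The parallel-receive / single-writer pattern described at the top of the thread can be sketched in plain Java. This is a minimal illustration, not the poster's code: the class name SingleWriter, the poison-pill sentinel, and the use of a generic OutputStream (a ByteArrayOutputStream here, standing in for FSDataOutputStream so the sketch runs without Hadoop) are all assumptions. The key design point it shows is that many receiver threads only touch the BlockingQueue, while exactly one thread owns the stream and performs write()/flush() calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Many producers enqueue raw messages; a single consumer thread owns the
 * output stream, so no locking is needed around the stream itself.
 */
public class SingleWriter {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    private static final String POISON = "\u0000EOF"; // illustrative shutdown sentinel

    /** Called from any number of receiver threads. */
    public void enqueue(String rawMessage) throws InterruptedException {
        queue.put(rawMessage);
    }

    /** Start the single writer thread; only this thread touches 'out'. */
    public Thread startWriter(OutputStream out) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String msg = queue.take();      // blocks until a message arrives
                    if (POISON.equals(msg)) break;  // drain complete, exit loop
                    out.write(msg.getBytes("UTF-8"));
                    out.flush();                    // against HDFS this is where sync() would go
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        t.start();
        return t;
    }

    /** Tell the writer thread to finish after draining the queue. */
    public void shutdown() throws InterruptedException {
        queue.put(POISON);
    }

    public static void main(String[] args) throws Exception {
        SingleWriter w = new SingleWriter();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Thread writer = w.startWriter(sink);
        w.enqueue("a");
        w.enqueue("b");
        w.shutdown();
        writer.join();
        System.out.println(sink.toString("UTF-8")); // prints "ab"
    }
}
```

Whether flush()/sync() makes the bytes visible to other readers is then a separate question about the filesystem behind the stream; the append/sync behavior in hadoop-0.20.0 that the thread is wrestling with is exactly that second question.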