2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event
2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986
2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
So although the offset is changing as expected, the output stream isn't being flushed or cleared out, and nothing is being written to the file...

On Tue, May 12, 2009 at 5:26 PM, Sasha Dolgy <sdo...@gmail.com> wrote:
> Right now data is received in parallel and written to a queue; a single
> thread then reads the queue and writes those messages to an
> FSDataOutputStream which is kept open, but the messages never get
> flushed. Tried flush() and sync() with no joy.
>
> 1.
>     outputStream.writeBytes(rawMessage.toString());
>
> 2.
>     log.debug("Flushing stream, size = " + s.getOutputStream().size());
>     s.getOutputStream().sync();
>     log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> or
>
>     log.debug("Flushing stream, size = " + s.getOutputStream().size());
>     s.getOutputStream().flush();
>     log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> Either way, size() remains the same after performing this action.
>
> This is using hadoop-0.20.0.
>
> -sd
>
> On Sun, May 10, 2009 at 4:45 PM, Stefan Podkowinski <spo...@gmail.com> wrote:
>> You just can't have many distributed jobs write into the same file
>> without locking/synchronizing those writes, even with append(). It's
>> no different from using a regular file from multiple processes in
>> this respect.
>> Maybe you need to collect your data up front, before processing it in
>> Hadoop?
>> Have a look at Chukwa, http://wiki.apache.org/hadoop/Chukwa
>>
>> On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>> > Would WritableFactories not allow me to open one output stream and
>> > continue to write() and sync()?
>> >
>> > Maybe I'm reading into that wrong. Although UUID would be nice, it
>> > would still leave me with the problem of having lots of little files
>> > instead of a few large files.
>> >
>> > -sd
>> >
>> > On Sat, May 9, 2009 at 8:37 AM, jason hadoop <jason.had...@gmail.com> wrote:
>> >
>> >> You must create unique file names; I don't believe (though I do not
>> >> know) that the append code will allow multiple writers.
>> >>
>> >> Are you writing from within a task, or as an external application
>> >> writing into Hadoop?
>> >>
>> >> You may try using UUID,
>> >> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part
>> >> of your filename.
>> >> Without knowing more about your goals, environment and constraints,
>> >> it is hard to offer more detailed suggestions.
>> >> You could also have an application aggregate the streams and write
>> >> out chunks, with one or more writers, one per output file.

-- 
Sasha Dolgy
sasha.do...@gmail.com
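[Editor's note on the unchanged size(): FSDataOutputStream inherits size() from java.io.DataOutputStream, where it returns the running count of bytes written to the stream, not the number of bytes still buffered. It therefore grows with every write and never shrinks after a flush or sync, so an unchanged size() does not by itself show that the flush failed. A minimal sketch of the write-and-sync call sequence, with a made-up path and payload (sync() is the 0.20-era call; later releases expose the same idea as hflush()):]

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncSketch {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical output path, for illustration only.
            FSDataOutputStream out = fs.create(new Path("/tmp/events.log"));

            out.writeBytes("event payload\n");
            // size() is the count of bytes written so far; it does not drop
            // back to zero here, even when the flush succeeds.
            out.sync();   // push buffered bytes toward the datanodes

            // On 0.20.0 the file length is only reliably visible to other
            // readers once the stream is closed.
            out.close();
            fs.close();
        }
    }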
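[Putting jason's UUID suggestion together with the single-consumer queue described in the thread, a rough sketch of the pattern; the class name, queue contents, and directory are made up for illustration:]

    import java.io.IOException;
    import java.util.UUID;
    import java.util.concurrent.BlockingQueue;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // One consumer thread owns the one open stream, so the HDFS file has a
    // single writer; the parallel receivers only ever touch the queue.
    public class HdfsQueueConsumerSketch implements Runnable {
        private final BlockingQueue<String> events;
        private final FSDataOutputStream out;

        public HdfsQueueConsumerSketch(BlockingQueue<String> events, FileSystem fs)
                throws IOException {
            this.events = events;
            // A UUID in the name keeps each consumer's file unique, avoiding
            // multiple writers on one file.
            this.out = fs.create(new Path("/events/" + UUID.randomUUID() + ".log"));
        }

        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String event = events.take();  // blocks until a producer offers one
                    out.writeBytes(event + "\n");
                    out.sync();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                try { out.close(); } catch (IOException ignored) { }
            }
        }
    }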