I had a lot of questions regarding the data flow as well. I spent a while reverse engineering it and wrote something up on our internal wiki. I believe this is what's happening. If others with more knowledge could verify what I have below, I'll gladly move it to a wiki on the Chukwa site.
Regarding your specific question, I believe the DemuxManager process is the first step in aggregating the data sink files. It moves the chunks to the dataSinkArchives directory once it's done with them. The ArchiveManager later archives those chunks.

1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size or a given time interval is reached.
   - to: logs/*.chukwa

2. Collectors close chunks and rename them to *.done
   - from: logs/*.chukwa
   - to: logs/*.done

3. DemuxManager wakes up every 20 seconds, runs M/R to merge *.done files and moves them.
   - from: logs/*.done
   - to: demuxProcessing/mrInput
   - to: demuxProcessing/mrOutput
   - to: dataSinkArchives/[yyyyMMdd]/*/*.done

4. PostProcessManager wakes up every few minutes and aggregates, orders and de-dups record files.
   - from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
   - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt

5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs to hourly.
   - from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
   - to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
   - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
   - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/

6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily.
   - from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
   - to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
   - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
   - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/

7. ChukwaArchiveManager runs every half hour or so and uses M/R to aggregate and remove dataSinkArchives data.
   - from: dataSinkArchives/[yyyyMMdd]/*/*.done
   - to: archivesProcessing/mrInput
   - to: archivesProcessing/mrOutput
   - to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*

thanks,
Bill

On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <cor...@tynt.com> wrote:
> I am trying to understand the flow of data inside hdfs as it's processed by
> the data processor script.
> I see that archive.sh and demux.sh are run, which run ArchiveManager and
> DemuxManager. It appears to me, just from reading the code, that they both
> are looking at the data sink (default /chukwa/logs).
>
> Can someone shed some light on how ArchiveManager and DemuxManager
> interact? E.g. I was under the impression that the data flowed through the
> archiving process first then got fed into the demuxing after it had created
> .arc files.
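
P.S. In case it helps to see the hand-offs at a glance, here is a rough Python sketch of the flow I described in this message. It only records which top-level directory each stage reads from and writes to, as far as I understand it. The names STAGES and consumers() are made up for illustration and are not part of Chukwa, and the table is only as accurate as my reverse engineering.

```python
# Which top-level directories each stage reads and writes, transcribed from
# the numbered list in this message. STAGES and consumers() are hypothetical
# names for illustration only; nothing here is Chukwa API.
STAGES = [
    ("Collector", [], ["logs"]),
    ("DemuxManager", ["logs"], ["demuxProcessing", "dataSinkArchives"]),
    ("PostProcessManager", ["postProcess"], ["repos"]),
    ("HourlyChukwaRecordRolling", ["repos"], ["temp", "repos"]),
    ("DailyChukwaRecordRolling", ["repos"], ["temp", "repos"]),
    ("ChukwaArchiveManager", ["dataSinkArchives"],
     ["archivesProcessing", "finalArchives"]),
]


def consumers(directory):
    """Return the names of the stages that read from the given directory."""
    return [name for name, reads, _writes in STAGES if directory in reads]


if __name__ == "__main__":
    # Show who picks up the data sink and who picks up the demuxed chunks.
    for d in ("logs", "dataSinkArchives"):
        print(d, "->", consumers(d))
```

If the table is right, only DemuxManager consumes logs/, and ChukwaArchiveManager only ever sees chunks after DemuxManager has moved them to dataSinkArchives/, so demux comes before archiving rather than after it.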