I had a lot of questions regarding the data flow as well. I spent a while
reverse engineering it and wrote something up on our internal wiki. I
believe this is what's happening. If others with more knowledge could verify
what I have below, I'll gladly move it to a wiki on the Chukwa site.

Regarding your specific question, I believe the DemuxManager process is the
first step in aggregating the data sink files. It moves the chunks to the
dataSinkArchives directory once it's done with them. The ArchiveManager
later archives those chunks.


   1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size
   is reached or a given time interval is reached.
      - to: logs/*.chukwa
   2. Collectors close chunks and rename them to *.done
      - from: logs/*.chukwa
      - to: logs/*.done
   3. DemuxManager wakes up every 20 seconds, runs M/R to merges
*.donefiles and moves them.
      - from: logs/*.done
      - to: demuxProcessing/mrInput
      - to: demuxProcessing/mrOutput
      - to: dataSinkArchives/[yyyyMMdd]/*/*.done
   4. PostProcessManager wakes up every few minutes and aggregates, orders
   and de-dups record files.
      - from:
      
postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
      - to:
      
repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
   5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5
   minute logs to hourly.
      - from:
      
repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
      - to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
      - to:
      
repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
      - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
   6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs
   to daily.
      - from:
      
repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
      - to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
      - to:
      
repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
      - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
   7. ChukwaArchiveManager every half hour or so aggregates and removes
   dataSinkArchives data using M/R.
      - from: dataSinkArchives/[yyyyMMdd]/*/*.done
      - to: archivesProcessing/mrInput
      - to: archivesProcessing/mrOutput
      - to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*


thanks,
Bill

On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <cor...@tynt.com> wrote:

> I am trying to understand the flow of data inside hdfs as it's processed by
> the data processor script.
> I see the archive.sh and demux.sh are run which runs ArchiveManager and
> DemuxManager.   It appears to that just reading the code that they both are
> looking at the data sink (default /chukwa/logs).
>
> Can someone shed some light on how ArchiveManager and DemuxManager
> interact?  E.g. I was under the impression that the data flowed through the
> archiving process first then got fed into the demuxing after it had created
> .arc files.
>
>

Reply via email to