Ok, that's what I thought. I have been trying to backtrack through the code.
The one thing I can't figure out is that currently a chunk can be multiple lines
in a logfile (for example). I can't see how to get the parser to return
multiple "rows" for a single chunk to be inserted back into HBase.
From backtracking, it looks like the plan would be (for parsing Apache logs;
rough sketches below):
1 - set chukwaCollector.pipeline to include
org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter.
Need two methods, add and init.
2 - write a new parser (ApacheLogProcessor) in
chukwa.extraction.demux.processor.mapper that extends AbstractProcessor
and parses the Chunk (logfile entries).
3 - the "add" method of CassandraWriter calls processor.process on the
ApacheLogProcessor with a single chunk from the list of chunks.
4 - the ApacheLogProcessor parse method (called from process) does a
record.add("logfield", "logvalue") for each field in the Apache log, in
addition to the buildGenericRecord call.
5 - ApacheLogProcessor calls output.collect (may need to write a custom
OutputCollector) for a single row in Cassandra. This sets up the "put".
6 - "add" method of CassandraWriter does a put of the new rowkey
7 - loop to next chunk.
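Roughly what I have in mind for the writer side, as an untested sketch. The
ChukwaWriter/MapProcessor signatures are from memory of trunk (adjust if your
version differs), ApacheLogProcessor doesn't exist yet, and putRow is just a
placeholder for whatever Cassandra client (Hector/Thrift) ends up doing the
actual write:

package org.apache.hadoop.chukwa.datacollection.writer.cassandra;

import java.util.List;

import org.apache.hadoop.chukwa.Chunk;
import org.apache.hadoop.chukwa.datacollection.writer.ChukwaWriter;
import org.apache.hadoop.chukwa.datacollection.writer.WriterException;
import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MapProcessor;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.OutputCollector;

public class CassandraWriter implements ChukwaWriter {

  // Collects the (key, record) pairs the processor emits and turns each
  // collect() call into one Cassandra row.
  private final OutputCollector<ChukwaRecordKey, ChukwaRecord> collector =
      new OutputCollector<ChukwaRecordKey, ChukwaRecord>() {
        public void collect(ChukwaRecordKey key, ChukwaRecord record) {
          putRow(key, record);
        }
      };

  public void init(Configuration conf) throws WriterException {
    // Open the Cassandra connection here (hosts, keyspace, column family
    // read from conf).
  }

  public CommitStatus add(List<Chunk> chunks) throws WriterException {
    for (Chunk chunk : chunks) {
      // Step 3: hand each chunk to the parser. Null archive key/reporter,
      // since we are outside MapReduce (same idea as HBaseWriter).
      MapProcessor processor = new ApacheLogProcessor();  // not written yet
      processor.process(null, chunk, collector, null);
    }
    return COMMIT_OK;
  }

  public void close() throws WriterException {
    // Shut down the Cassandra client.
  }

  // Hypothetical helper: build the row key and columns from the ChukwaRecord
  // and write them with the Cassandra client.
  private void putRow(ChukwaRecordKey key, ChukwaRecord record) {
    // e.g. rowkey = key.getKey(), one column per record field
  }
}
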
Look about right? If so, the only open question is how to deal with
"chunks" that span multiple lines in a logfile and map each line to its own
row in Cassandra.
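On that multi-line question: from what I can tell, AbstractProcessor.process()
already walks the chunk's record offsets and calls parse() once per record
(typically one log line), so parse() may only ever see a single line and each
output.collect() becomes its own row, meaning CassandraWriter would never have
to split chunks itself. If that reading is right, the parser side could look
roughly like this (sketch only; the parse* field helpers are made up, and the
buildGenericRecord/key/chunk members are from memory of AbstractProcessor):

package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ApacheLogProcessor extends AbstractProcessor {

  @Override
  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output, Reporter reporter)
      throws Throwable {
    // recordEntry should be one log line here, since process() hands us one
    // record (as delimited by the chunk's record offsets) at a time.
    ChukwaRecord record = new ChukwaRecord();
    long timestamp = parseTimestamp(recordEntry);
    // Sets the key plus the standard fields (body, csource, ...).
    buildGenericRecord(record, recordEntry, timestamp, chunk.getDataType());
    // Step 4: one record.add("logfield", "logvalue") per Apache log field.
    record.add("remote_host", parseRemoteHost(recordEntry));
    record.add("request", parseRequest(recordEntry));
    record.add("status", parseStatus(recordEntry));
    // Step 5: one collect() call == one row in the sink.
    output.collect(key, record);
  }

  // Placeholders, not part of Chukwa: a real implementation would use a
  // combined-log-format regex to pull these fields out.
  private long parseTimestamp(String line) { return System.currentTimeMillis(); }
  private String parseRemoteHost(String line) { return line.split(" ")[0]; }
  private String parseRequest(String line) { return ""; }
  private String parseStatus(String line) { return ""; }
}
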
On Sat, Oct 29, 2011 at 9:44 PM, Eric Yang <[email protected]> wrote:
> The demux parsers can work with either MapReduce or HBaseWriter. If you
> want the ability to write the parser once and operate with multiple
> data sinks, it would be good to implement a version of HBaseWriter for
> Cassandra, keeping the parsing logic separate from the loading logic.
>
> For performance reasons, it is entirely possible to implement a Cassandra
> loader with the parsing logic inside, but it will put you on a different
> course than what is planned for Chukwa.
>
> regards,
> Eric
>
> On Oct 29, 2011, at 8:39 AM, AD wrote:
>
> > With the new imminent trunk (0.5) getting wired into HBase, does it
> > make sense for me to keep the Demux parser as the place to put this logic
> > for writing to Cassandra? Or does it make sense to implement a version of
> > src/java/org/apache/hadoop/chukwa/datacollection/writer/hbase/HbaseWriter.java
> > for Cassandra, so that the collector pushes it straight to Cassandra?
> >
> > If I want to use both HDFS and Cassandra, it seems the current pipeline
> > config would support this by doing something like:
> >
> > <property>
> >   <name>chukwaCollector.pipeline</name>
> >   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter</value>
> > </property>
> >
> > Thoughts ?
> >
> >
> > On Wed, Oct 26, 2011 at 10:16 PM, AD <[email protected]> wrote:
> > Yep, that did it. I just updated my initial_adaptors to have dataType
> > TsProcessor and saw demux kick in.
> >
> > Thanks for the help.
> >
> >
> >
> > On Wed, Oct 26, 2011 at 9:22 PM, Eric Yang <[email protected]> wrote:
> > See: http://incubator.apache.org/chukwa/docs/r0.4.0/agent.html and
> http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html
> >
> > The configuration is the same for collector-based demux. Hope this
> > helps.
> >
> > regards,
> > Eric
> >
> > On Oct 26, 2011, at 4:20 PM, AD wrote:
> >
> > > Thanks. Sorry for being dense here, but where does the data type get
> > > mapped from the agent to the collector when passing data, so that demux
> > > will match?
> > >
> > > On Wed, Oct 26, 2011 at 12:34 PM, Eric Yang <[email protected]> wrote:
> > > "dp" serves as two functions, first it loads data to mysql, second, it
> runs SQL for aggregated views. demuxOutputDir_* is created if the demux
> mapreduce produces data. Hence, make sure that there is a demux processor
> mapped to your data type for the extracting process in
> chukwa-demux-conf.xml.
> > >
> > > regards,
> > > Eric
> > >
> > > On Oct 26, 2011, at 5:15 AM, AD wrote:
> > >
> > > > Hmm, I am running bin/chukwa demux and I don't have anything past
> > > > dataSinkArchives; there is no directory named demuxOutputDir_*.
> > > >
> > > > Also, isn't dp an aggregate view? I need to parse the Apache logs to
> > > > do custom reports on things like remote_host, query strings, etc., so I
> > > > was hoping to parse the raw record, load it into Cassandra, and run M/R
> > > > there to do the aggregate views. I thought a new version of TsProcessor
> > > > was the right place here, but I could be wrong.
> > > >
> > > > Thoughts?
> > > >
> > > >
> > > >
> > > > If not, how do you write a custom postProcessor?
> > > >
> > > > On Wed, Oct 26, 2011 at 12:57 AM, Eric Yang <[email protected]>
> wrote:
> > > > Hi AD,
> > > >
> > > > Data is stored in demuxOutputDir_* by demux, and there is a
> > > > PostProcessorManager (bin/chukwa dp) which monitors the postProcess
> > > > directory and loads data into MySQL. For your use case, you will need
> > > > to modify PostProcessorManager.java to adapt it. Hope this helps.
> > > >
> > > > regards,
> > > > Eric
> > > >
> > > > On Tue, Oct 25, 2011 at 6:34 PM, AD <[email protected]> wrote:
> > > > > Hello,
> > > > > I currently push Apache logs into Chukwa. I am trying to figure out
> > > > > how to get all those logs into Cassandra and run MapReduce there. Is
> > > > > the best place to do this in Demux (write my own version of
> > > > > TsProcessor)?
> > > > > Also, the data flow seems to miss a step. The page
> > > > > http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html says in
> > > > > 3.3 that:
> > > > > - demux moves complete files to dataSinkArchives/[yyyyMMdd]/*/*.done
> > > > > - the next step is to move files from
> > > > > postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> > > > > How do they get from dataSinkArchives to postProcess? Does this run
> > > > > inside of DemuxManager or a separate process (bin/chukwa demux)?
> > > > > Thanks
> > > > > AD
> > > >
> > >
> > >
> >
> >
> >
>
>