Eric,

Just wanted to say thank you for all your help. Everything is now working perfectly with a custom demux parser for my Apache log format and a custom writer into Cassandra (I copied your HBase writer as a starting point).
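For anyone curious, the general shape of the writer is roughly the following. This is a heavily simplified sketch, not the exact code: it assumes the trunk (0.5) writer interface, where add returns a CommitStatus, and the Hector client; the cluster, keyspace, and column family names are placeholders.

package org.apache.hadoop.chukwa.datacollection.writer.cassandra;

import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

import org.apache.hadoop.chukwa.Chunk;
import org.apache.hadoop.chukwa.datacollection.writer.PipelineableWriter;
import org.apache.hadoop.chukwa.datacollection.writer.WriterException;
import org.apache.hadoop.conf.Configuration;

public class CassandraWriter extends PipelineableWriter {
  private Keyspace keyspace;

  public void init(Configuration conf) throws WriterException {
    // Host and keyspace are hard-coded placeholders; a real version
    // would read them from the collector configuration.
    Cluster cluster = HFactory.getOrCreateCluster("chukwa", "localhost:9160");
    keyspace = HFactory.createKeyspace("Chukwa", cluster);
  }

  public CommitStatus add(List<Chunk> chunks) throws WriterException {
    Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    for (Chunk chunk : chunks) {
      // Simplest possible mapping: one column per chunk, keyed by
      // source + sequence id. The real version runs the demux processor
      // for the chunk's data type and writes one row per parsed record.
      String rowKey = chunk.getSource() + ":" + chunk.getSeqID();
      mutator.addInsertion(rowKey, "Records",
          HFactory.createStringColumn(chunk.getDataType(),
              new String(chunk.getData())));
    }
    mutator.execute();
    // Forward chunks if another writer is chained after this one.
    return (next == null) ? COMMIT_OK : next.add(chunks);
  }

  public void close() throws WriterException {
  }
}

With SocketTeeWriter ahead of it in the pipeline, this stage only has to persist the chunks and pass them along.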
To the broader community: if anyone is interested in helping refactor the Cassandra writer into a real working add-on for Chukwa, I am happy to share what I have so far. I am not a Java developer by trade, so I am sure it needs some tweaks, but I am happy to give back to the Chukwa community to help advance its capabilities.

Cheers,
AD

On Sun, Oct 30, 2011 at 4:18 PM, Eric Yang <[email protected]> wrote:
> Sounds about right. For multiple lines that map to a chunk, you can modify
> Apache rotatelogs to add a Control-A character to the end of each log entry,
> and use the UTF8 File Tailing Adaptor to read the file. In this case, you
> will get your log entries in the same chunk rather than in multiple chunks.
> Hope this helps.
>
> regards,
> Eric
>
> On Oct 29, 2011, at 7:42 PM, AD wrote:
>
> > Ok, that's what I thought. I have been trying to backtrack through the
> > code. The one thing I can't figure out is that currently a chunk can be
> > multiple lines in a logfile (for example). I can't see how to get the
> > parser to return multiple "rows" for a single chunk to be inserted back
> > into HBase.
> >
> > From backtracking, it looks like the plan would be (for parsing Apache logs):
> >
> > 1 - set the collector.pipeline to include
> > org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter.
> > It needs two methods, add and init.
> > 2 - write a new parser (ApacheLogProcessor) in
> > chukwa.extraction.demux.processor.mapper that extends AbstractProcessor
> > and parses the Chunk (logfile entries).
> > 3 - the "add" method of CassandraWriter calls processor.process on the
> > ApacheLogProcessor with a single chunk from the array of chunks.
> > 4 - the ApacheLogProcessor parse method (called from process) does a
> > record.add("logfield", "logvalue") for each field in the Apache log, in
> > addition to the buildGenericRecord call.
> > 5 - ApacheLogProcessor calls output.collect (may need to write a custom
> > OutputCollector) for a single row in Cassandra. This sets up the "put".
> > 6 - the "add" method of CassandraWriter does a put of the new row key.
> > 7 - loop to the next chunk.
> >
> > Look about right? If so, the only open question is how to deal with
> > "chunks" that span multiple lines in a logfile and map them to a single
> > row in Cassandra.
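A minimal sketch of steps 2, 4, and 5 above, assuming the trunk demux interfaces (AbstractProcessor, its protected key field, and buildGenericRecord); the regex, the "ApacheLog" data type name, and the field names are illustrative only:

package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

import java.text.SimpleDateFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ApacheLogProcessor extends AbstractProcessor {
  // Common Log Format: host ident user [time] "request" status bytes
  private static final Pattern CLF = Pattern.compile(
      "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");
  private final SimpleDateFormat dateFormat =
      new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z");

  @Override
  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output, Reporter reporter)
      throws Throwable {
    Matcher m = CLF.matcher(recordEntry);
    if (!m.find()) {
      return; // not an access-log line; skip it
    }
    long timestamp = dateFormat.parse(m.group(4)).getTime();
    ChukwaRecord record = new ChukwaRecord();
    // Fills in the standard key and metadata fields for this record.
    buildGenericRecord(record, recordEntry, timestamp, "ApacheLog");
    // Step 4: one record.add per parsed field.
    record.add("remote_host", m.group(1));
    record.add("request", m.group(5));
    record.add("status", m.group(6));
    record.add("bytes", m.group(7));
    // Step 5: each collect() is one row for the downstream writer.
    output.collect(key, record);
  }
}

On the open question: if I am reading the code right, AbstractProcessor.process() already splits a chunk at its record offsets and calls parse() once per log entry, so a single chunk naturally yields multiple collected rows.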
> > On Sat, Oct 29, 2011 at 9:44 PM, Eric Yang <[email protected]> wrote:
> > The demux parsers can work with either MapReduce or HBaseWriter. If you
> > want the ability to write a parser once and have it work with multiple
> > data sinks, it would be good to implement a version of HBaseWriter for
> > Cassandra, keeping the parsing logic separate from the loading logic.
> >
> > For performance reasons, it is entirely possible to implement a Cassandra
> > loader with the parsing logic inside, but that will put you on a different
> > course from what is planned for Chukwa.
> >
> > regards,
> > Eric
> >
> > On Oct 29, 2011, at 8:39 AM, AD wrote:
> >
> > > With the new imminent trunk (0.5) getting wired into HBase, does it
> > > make sense for me to keep the demux parser as the place to put this
> > > logic for writing to Cassandra? Or does it make sense to implement a
> > > version of
> > > src/java/org/apache/hadoop/chukwa/datacollection/writer/hbase/HbaseWriter.java
> > > for Cassandra, so that the collector pushes the data straight in?
> > >
> > > If I want to use both HDFS and Cassandra, it seems the current
> > > pipeline config would support this with something like:
> > >
> > > <property>
> > >   <name>chukwaCollector.pipeline</name>
> > >   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter</value>
> > > </property>
> > >
> > > Thoughts?
> > >
> > > On Wed, Oct 26, 2011 at 10:16 PM, AD <[email protected]> wrote:
> > > yep, that did it. I just updated my initial_adaptors to have dataType
> > > TsProcessor and saw demux kick in.
> > >
> > > Thanks for the help.
> > >
> > > On Wed, Oct 26, 2011 at 9:22 PM, Eric Yang <[email protected]> wrote:
> > > See: http://incubator.apache.org/chukwa/docs/r0.4.0/agent.html and
> > > http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html
> > >
> > > The configuration is the same for collector-based demux. Hope this helps.
> > >
> > > regards,
> > > Eric
> > >
> > > On Oct 26, 2011, at 4:20 PM, AD wrote:
> > >
> > > > Thanks. Sorry for being dense here, but where does the data type
> > > > get mapped from the agent to the collector when passing data, so
> > > > that demux will match?
> > > >
> > > > On Wed, Oct 26, 2011 at 12:34 PM, Eric Yang <[email protected]> wrote:
> > > > "dp" serves two functions: first, it loads data into MySQL; second,
> > > > it runs SQL for the aggregated views. demuxOutputDir_* is created if
> > > > the demux mapreduce produces data. Hence, make sure that there is a
> > > > demux processor mapped to your data type for the extracting process
> > > > in chukwa-demux-conf.xml.
> > > >
> > > > regards,
> > > > Eric
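For reference, the wiring being described looks roughly like this; the log path is illustrative, and the convention, as I understand it, is that the property name in chukwa-demux-conf.xml is the data type and the value is the parser class. In the agent's initial_adaptors, tag the tailed file with the data type:

add filetailer.CharFileTailingAdaptorUTF8 TsProcessor /var/log/httpd/access_log 0

And in chukwa-demux-conf.xml, map that data type to a demux parser:

<property>
  <name>TsProcessor</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value>
  <description>parser for the TsProcessor data type</description>
</property>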
> > > > On Oct 26, 2011, at 5:15 AM, AD wrote:
> > > >
> > > > > Hmm, I am running bin/chukwa demux and I don't have anything past
> > > > > dataSinkArchives; there is no directory named demuxOutputDir_*.
> > > > >
> > > > > Also, isn't dp an aggregate view? I need to parse the Apache logs
> > > > > to do custom reports on things like remote_host, query strings, etc.,
> > > > > so I was hoping to parse the raw record, load it into Cassandra, and
> > > > > run M/R there to do the aggregate views. I thought a new version of
> > > > > TsProcessor was the right place for this, but I could be wrong.
> > > > >
> > > > > Thoughts? If not, how do you write a custom postProcessor?
> > > > >
> > > > > On Wed, Oct 26, 2011 at 12:57 AM, Eric Yang <[email protected]> wrote:
> > > > > Hi AD,
> > > > >
> > > > > Data is stored in demuxOutputDir_* by demux, and there is a
> > > > > PostProcessorManager (bin/chukwa dp) which monitors the postProcess
> > > > > directory and loads data into MySQL. For your use case, you will
> > > > > need to modify PostProcessorManager.java. Hope this helps.
> > > > >
> > > > > regards,
> > > > > Eric
> > > > >
> > > > > On Tue, Oct 25, 2011 at 6:34 PM, AD <[email protected]> wrote:
> > > > > > hello,
> > > > > >
> > > > > > I currently push Apache logs into Chukwa, and I am trying to figure
> > > > > > out how to get all those logs into Cassandra and run mapreduce
> > > > > > there. Is the best place to do this in demux (write my own version
> > > > > > of TsProcessor?)
> > > > > >
> > > > > > Also, the data flow seems to miss a step. The page
> > > > > > http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html says in
> > > > > > 3.3 that:
> > > > > >
> > > > > > - demux moves complete files to dataSinkArchives/[yyyyMMdd]/*/*.done
> > > > > > - the next step is to move files from
> > > > > > postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> > > > > >
> > > > > > How do the files get from dataSinkArchives to postProcess? Does this
> > > > > > run inside of DemuxManager, or is it a separate process (bin/chukwa
> > > > > > demux)?
> > > > > >
> > > > > > Thanks,
> > > > > > AD
