awesome, patch submitted. https://issues.apache.org/jira/browse/CHUKWA-600
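
For anyone who wants the shape of such a writer before opening the patch, here is a minimal sketch. It assumes the 0.5 trunk ChukwaWriter/PipelineableWriter interfaces (init, add, close, with add returning a CommitStatus, and the next stage held in a protected field); the actual Cassandra client calls are left as placeholder comments. See the CHUKWA-600 patch for the real implementation.

    package org.apache.hadoop.chukwa.datacollection.writer.cassandra;

    import java.util.List;

    import org.apache.hadoop.chukwa.Chunk;
    import org.apache.hadoop.chukwa.datacollection.writer.PipelineableWriter;
    import org.apache.hadoop.chukwa.datacollection.writer.WriterException;
    import org.apache.hadoop.conf.Configuration;

    public class CassandraWriter extends PipelineableWriter {

      @Override
      public void init(Configuration conf) throws WriterException {
        // Placeholder: open a connection to Cassandra here, reading the
        // host/port and keyspace from conf (property names are up to you).
      }

      @Override
      public CommitStatus add(List<Chunk> chunks) throws WriterException {
        for (Chunk chunk : chunks) {
          // Placeholder: run the chunk through a demux processor and write
          // each resulting record as a row insert into Cassandra.
        }
        // Forward the chunks so any later stage configured in
        // chukwaCollector.pipeline still sees the data.
        if (next != null) {
          return next.add(chunks);
        }
        return COMMIT_OK;
      }

      @Override
      public void close() throws WriterException {
        // Placeholder: release the Cassandra connection.
      }
    }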
On Tue, Nov 1, 2011 at 12:46 PM, Eric Yang <[email protected]> wrote:
> Hi AD,
>
> Glad it works for you. If you are interested in contributing what you
> currently have for the Cassandra writer, feel free to post your code as a
> patch in a jira (http://issues.apache.org/jira/browse/CHUKWA). With a
> license grant of your work, it will be easier for the open source
> community to enhance your work. :)
>
> regards,
> Eric
>
> On Oct 31, 2011, at 8:42 PM, AD wrote:
>
> > Eric,
> >
> > Just wanted to say thank you for all your help. Everything is now
> > working perfectly with a custom demux parser for my apache log format
> > and a custom writer (copied your HBase one as a starting point) into
> > Cassandra.
> >
> > To the broader community: if anyone is interested in helping refactor
> > the Cassandra writer into a real working add-in to Chukwa, I am happy
> > to share what I have so far. I am not a Java developer by trade, so I
> > am sure there are some tweaks to make, but I am happy to give back to
> > the Chukwa community to help advance its capabilities.
> >
> > Cheers,
> > AD
> >
> > On Sun, Oct 30, 2011 at 4:18 PM, Eric Yang <[email protected]> wrote:
> > Sounds about right. For the case where multiple lines map to a chunk,
> > you can modify Apache rotatelogs to add a Control-A character to the
> > end of each log entry, and use the UTF8 file tailing adaptor to read
> > the file. In this case, you will get your log entries in the same
> > chunk rather than split across multiple chunks. Hope this helps.
> >
> > regards,
> > Eric
> >
> > On Oct 29, 2011, at 7:42 PM, AD wrote:
> >
> > > Ok, that's what I thought. I have been trying to backtrack through
> > > the code. The one thing I can't figure out is that currently a chunk
> > > can be multiple lines in a logfile (for example). I can't see how to
> > > get the parser to return multiple "rows" for a single chunk to be
> > > inserted back into HBase.
> > >
> > > From backtracking, it looks like the plan would be (for parsing
> > > apache logs):
> > >
> > > 1 - set the collector.pipeline to include
> > > org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter.
> > > Need 2 methods, add and init.
> > > 2 - write a new parser (ApacheLogProcessor) in
> > > chukwa.extraction.demux.processor.mapper that extends
> > > AbstractProcessor and parses the Chunk (logfile entries).
> > > 3 - the "add" method of CassandraWriter calls processor.process on
> > > the ApacheLogProcessor with a single chunk from the array of chunks.
> > > 4 - ApacheLogProcessor's parse method (called from process) does a
> > > record.add("logfield", "logvalue") for each field in the apache log,
> > > in addition to the buildGenericRecord call.
> > > 5 - ApacheLogProcessor calls output.collect (may need to write a
> > > custom OutputCollector) for a single row in Cassandra. Sets up the
> > > "put".
> > > 6 - the "add" method of CassandraWriter does a put of the new rowkey.
> > > 7 - loop to the next chunk.
> > >
> > > Look about right? If so, the only open question is how to deal with
> > > chunks that span multiple lines in a logfile and map them to a
> > > single row in Cassandra.
> > >
> > > On Sat, Oct 29, 2011 at 9:44 PM, Eric Yang <[email protected]> wrote:
> > > The demux parsers can work with either MapReduce or HBaseWriter. If
> > > you want the ability to write a parser once and operate with
> > > multiple data sinks, it would be good to implement a version of
> > > HBaseWriter for Cassandra, keeping parsing logic separate from
> > > loading logic.
> > >
> > > For performance reasons, it is entirely possible to implement a
> > > Cassandra loader with parsing logic inside to improve performance,
> > > but it will put you on another course than what is planned for
> > > Chukwa.
> > >
> > > regards,
> > > Eric
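
Steps 2 and 4 of the plan above map closely onto the existing TsProcessor. Here is a minimal sketch of such a parser, assuming the AbstractProcessor base class in chukwa.extraction.demux.processor.mapper (its protected chunk and key fields, the buildGenericRecord helper, and the parse hook); the regex and field names cover only the Apache common log format and are illustrative, not a shipped parser.

    package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
    import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ApacheLogProcessor extends AbstractProcessor {

      // Common log format: host ident user [time] "request" status bytes
      private static final Pattern CLF = Pattern.compile(
          "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

      private final SimpleDateFormat dateFormat =
          new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);

      @Override
      protected void parse(String recordEntry,
          OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
          Reporter reporter) throws Throwable {
        Matcher m = CLF.matcher(recordEntry);
        if (!m.find()) {
          return; // unrecognized line; real code should count or log these
        }
        Date timestamp = dateFormat.parse(m.group(4));
        ChukwaRecord record = new ChukwaRecord();
        // Fills in the standard metadata (time, source, data type) from
        // the chunk currently being processed.
        buildGenericRecord(record, recordEntry, timestamp.getTime(),
            chunk.getDataType());
        // One record.add per log field, as in step 4 of the plan.
        record.add("remote_host", m.group(1));
        record.add("request", m.group(5));
        record.add("status", m.group(6));
        record.add("bytes_sent", m.group(7));
        // Each collected record becomes one row for the writer to load.
        output.collect(key, record);
      }
    }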
> > > On Oct 29, 2011, at 8:39 AM, AD wrote:
> > >
> > > > With the new imminent trunk (0.5) getting wired into HBase, does
> > > > it make sense for me to keep the demux parser as the place to put
> > > > this logic for writing to Cassandra? Or does it make sense to
> > > > implement a version of
> > > > src/java/org/apache/hadoop/chukwa/datacollection/writer/hbase/HbaseWriter.java
> > > > for Cassandra so that the collector pushes it straight?
> > > >
> > > > If I want to use both HDFS and Cassandra, it seems the current
> > > > pipeline config would support this by doing something like:
> > > >
> > > > <property>
> > > >   <name>chukwaCollector.pipeline</name>
> > > >   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter</value>
> > > > </property>
> > > >
> > > > Thoughts?
> > > >
> > > > On Wed, Oct 26, 2011 at 10:16 PM, AD <[email protected]> wrote:
> > > > Yep, that did it. I just updated my initial_adaptors to have
> > > > dataType TsProcessor and saw demux kick in.
> > > >
> > > > Thanks for the help.
> > > >
> > > > On Wed, Oct 26, 2011 at 9:22 PM, Eric Yang <[email protected]> wrote:
> > > > See: http://incubator.apache.org/chukwa/docs/r0.4.0/agent.html and
> > > > http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html
> > > >
> > > > The configuration is the same for collector-based demux. Hope this
> > > > helps.
> > > >
> > > > regards,
> > > > Eric
> > > >
> > > > On Oct 26, 2011, at 4:20 PM, AD wrote:
> > > >
> > > > > Thanks. Sorry for being dense here, but where does the data type
> > > > > get mapped from the agent to the collector when passing data, so
> > > > > that demux will match?
> > > > >
> > > > > On Wed, Oct 26, 2011 at 12:34 PM, Eric Yang <[email protected]> wrote:
> > > > > "dp" serves two functions: first, it loads data into MySQL;
> > > > > second, it runs SQL for aggregated views. demuxOutputDir_* is
> > > > > created only if the demux mapreduce produces data. Hence, make
> > > > > sure that there is a demux processor mapped to your data type
> > > > > for the extracting process in chukwa-demux-conf.xml.
> > > > >
> > > > > regards,
> > > > > Eric
> > > > >
> > > > > On Oct 26, 2011, at 5:15 AM, AD wrote:
> > > > >
> > > > > > Hmm, I am running bin/chukwa demux and I don't have anything
> > > > > > past dataSinkArchives; there is no directory named
> > > > > > demuxOutputDir_*.
> > > > > >
> > > > > > Also, isn't dp an aggregate view? I need to parse the apache
> > > > > > logs to do custom reports on things like remote_host, query
> > > > > > strings, etc., so I was hoping to parse the raw record, load
> > > > > > it into Cassandra, and run M/R there to do the aggregate
> > > > > > views. I thought a new version of TsProcessor was the right
> > > > > > place here, but I could be wrong.
> > > > > >
> > > > > > Thoughts? If not, how do you write a custom postProcessor?
> > > > > >
> > > > > > On Wed, Oct 26, 2011 at 12:57 AM, Eric Yang <[email protected]> wrote:
> > > > > > Hi AD,
> > > > > >
> > > > > > Data is stored in demuxOutputDir_* by demux, and there is a
> > > > > > PostProcessorManager (bin/chukwa dp) which monitors the
> > > > > > postProcess directory and loads data into MySQL. For your use
> > > > > > case, you will need to modify PostProcessorManager.java to
> > > > > > adapt it to your use case. Hope this helps.
> > > > > >
> > > > > > regards,
> > > > > > Eric
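
To make the dataType mapping discussed above concrete: the agent's initial_adaptors line assigns the data type when an adaptor is registered, and a property with that same name in chukwa-demux-conf.xml selects the parser class. An illustrative pairing, assuming the 0.4-style adaptor syntax from the agent docs; the ApacheLog type name, log path, and processor class are examples, not shipped defaults.

    # initial_adaptors on the agent: adaptor class, data type, file, offset
    add filetailer.CharFileTailingAdaptorUTF8 ApacheLog /var/log/httpd/access_log 0

    <!-- chukwa-demux-conf.xml: map the ApacheLog data type to a parser -->
    <property>
      <name>ApacheLog</name>
      <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.ApacheLogProcessor</value>
      <description>Demux parser for the ApacheLog data type</description>
    </property>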
> > > > > > On Tue, Oct 25, 2011 at 6:34 PM, AD <[email protected]> wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > I currently push apache logs into Chukwa. I am trying to
> > > > > > > figure out how to get all those logs into Cassandra and run
> > > > > > > mapreduce there. Is the best place to do this in demux
> > > > > > > (write my own version of TsProcessor)?
> > > > > > >
> > > > > > > Also, the data flow seems to miss a step. The page
> > > > > > > http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html
> > > > > > > says in 3.3 that:
> > > > > > >
> > > > > > > - demux moves complete files to:
> > > > > > >   dataSinkArchives/[yyyyMMdd]/*/*.done
> > > > > > > - the next step is to move files from
> > > > > > >   postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> > > > > > >
> > > > > > > How do they get from dataSinkArchives to postProcess? Does
> > > > > > > this run inside of DemuxManager or as a separate process
> > > > > > > (bin/chukwa demux)?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > AD
