Hi AD,

Glad it works for you.  If you are interested in contributing what you 
currently have for the Cassandra writer, feel free to post your code as a 
patch in a JIRA (http://issues.apache.org/jira/browse/CHUKWA).  With a 
license grant of your work, it will be easier for the open source community 
to enhance it. :)

regards,
Eric

On Oct 31, 2011, at 8:42 PM, AD wrote:

> Eric,
> 
>  Just wanted to say thank you for all your help.  Everything is now working 
> perfectly with a custom demux parser for my Apache log format and a custom 
> writer into Cassandra (I copied your HBase writer as a starting point).
> 
> To the broader community: if anyone is interested in helping refactor the 
> Cassandra writer into a real working add-in for Chukwa, I am happy to share 
> what I have so far. I am not a Java developer by trade, so I am sure some 
> tweaks are needed, but I am happy to give back to the Chukwa community to 
> help advance its capabilities.
> 
> Cheers,
> AD
> 
> On Sun, Oct 30, 2011 at 4:18 PM, Eric Yang <[email protected]> wrote:
> Sounds about right.  For the case where multiple lines map to a chunk, you 
> can modify Apache rotatelogs to add a Control-A character to the end of each 
> log entry, and use the UTF8 file tailing adaptor to read the file.  That 
> way, a complete log entry lands in a single chunk rather than being split 
> across multiple chunks.  Hope this helps.
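> 
> If modifying rotatelogs itself is too invasive, a small pipe filter placed 
> in front of it can do the same job.  A rough, untested sketch (assumes one 
> log entry per input line, as with standard access logs; the class name and 
> CustomLog line are hypothetical):
> 
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> 
> // Appends Control-A (\u0001) to each log entry so the UTF8 file
> // tailing adaptor can keep a whole entry together in one chunk.
> // Hypothetical usage in httpd.conf:
> //   CustomLog "|java MarkEntries | rotatelogs /var/log/access.%Y%m%d 86400" combined
> public class MarkEntries {
>   public static void main(String[] args) throws Exception {
>     BufferedReader in =
>         new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
>     String line;
>     while ((line = in.readLine()) != null) {
>       System.out.print(line);
>       System.out.print('\u0001'); // marks the end of the entry
>       System.out.println();
>     }
>   }
> }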
> 
> regards,
> Eric
> 
> On Oct 29, 2011, at 7:42 PM, AD wrote:
> 
> > Ok, that's what I thought.  I have been trying to backtrack through the 
> > code.  The one thing I can't figure out: currently a chunk can be multiple 
> > lines in a logfile (for example), and I can't see how to get the parser to 
> > return multiple "rows" for a single chunk to be inserted back into HBase.
> >
> > From backtracking, it looks like the plan would be (for parsing Apache logs):
> >
> > 1 - set chukwaCollector.pipeline to include 
> > org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter.  
> > Two methods are needed: add and init.
> > 2 - write a new parser (ApacheLogProcessor) in 
> > chukwa.extraction.demux.processor.mapper that extends AbstractProcessor 
> > and parses the Chunk (logfile entries).
> > 3 - the "add" method of CassandraWriter calls processor.process on the 
> > ApacheLogProcessor with a single chunk from an array of chunks.
> > 4 - ApacheLogProcessor's parse method (called from process) does a 
> > record.add("logfield","logvalue") for each field in the Apache log, in 
> > addition to the buildGenericRecord call.
> > 5 - ApacheLogProcessor calls output.collect (may need to write a custom 
> > OutputCollector) for a single row in Cassandra, setting up the "put".
> > 6 - the "add" method of CassandraWriter does a put of the new row key.
> > 7 - loop to the next chunk.
> >
> > Look about right?  If so, the only open question is how to deal with 
> > chunks that span multiple lines in a logfile and map them to a single row 
> > in Cassandra.
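> >
> > To make the structure concrete, here is the bare-bones skeleton I have in 
> > mind for the writer side.  This is an untested sketch: the method 
> > signatures follow the 0.4 ChukwaWriter interface as I read it, and the 
> > Cassandra calls are placeholders, not a real client API.
> >
> > package org.apache.hadoop.chukwa.datacollection.writer.cassandra;
> >
> > import java.util.List;
> >
> > import org.apache.hadoop.chukwa.Chunk;
> > import org.apache.hadoop.chukwa.datacollection.writer.ChukwaWriter;
> > import org.apache.hadoop.chukwa.datacollection.writer.WriterException;
> > import org.apache.hadoop.conf.Configuration;
> >
> > public class CassandraWriter implements ChukwaWriter {
> >
> >   public void init(Configuration conf) throws WriterException {
> >     // open the Cassandra connection here (Hector, Thrift, or
> >     // whatever client library ends up being used)
> >   }
> >
> >   public CommitStatus add(List<Chunk> chunks) throws WriterException {
> >     for (Chunk chunk : chunks) {
> >       // step 3: hand the chunk to the demux processor, which calls
> >       // parse() and emits records through an OutputCollector
> >       // step 6: turn each emitted record into a Cassandra "put"
> >     }
> >     return COMMIT_OK;
> >   }
> >
> >   public void close() throws WriterException {
> >     // release the Cassandra connection
> >   }
> > }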
> >
> > On Sat, Oct 29, 2011 at 9:44 PM, Eric Yang <[email protected]> wrote:
> > The demux parsers can work with either MapReduce or HBaseWriter.  If you 
> > want the ability to write a parser once and have it work with multiple 
> > data sinks, it would be good to implement a version of HBaseWriter for 
> > Cassandra, keeping the parsing logic separate from the loading logic.
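> >
> > For example, a mapper-side demux processor just extends AbstractProcessor 
> > and overrides parse().  A rough sketch (untested; the field names are 
> > illustrative only, and the inherited key is populated by 
> > buildGenericRecord):
> >
> > package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;
> >
> > import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
> > import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
> > import org.apache.hadoop.mapred.OutputCollector;
> > import org.apache.hadoop.mapred.Reporter;
> >
> > public class ApacheLogProcessor extends AbstractProcessor {
> >
> >   @Override
> >   protected void parse(String recordEntry,
> >       OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
> >       Reporter reporter) throws Throwable {
> >     ChukwaRecord record = new ChukwaRecord();
> >     long timestamp = System.currentTimeMillis(); // TODO: parse from line
> >     buildGenericRecord(record, recordEntry, timestamp, "ApacheLog");
> >     record.add("remote_host", "..."); // parsed fields go here
> >     output.collect(key, record);      // key is set by buildGenericRecord
> >   }
> > }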
> >
> > For performance reasons, it is entirely possible to implement a Cassandra 
> > loader with the parsing logic built in, but that will put you on a 
> > different course from what is planned for Chukwa.
> >
> > regards,
> > Eric
> >
> > On Oct 29, 2011, at 8:39 AM, AD wrote:
> >
> > > With the new imminent trunk (0.5) getting wired into HBase, does it make 
> > > sense for me to keep the demux parser as the place to put this logic for 
> > > writing to Cassandra?  Or does it make sense to implement a version of 
> > > src/java/org/apache/hadoop/chukwa/datacollection/writer/hbase/HbaseWriter.java
> > > for Cassandra so that the collector pushes data straight in?
> > >
> > > If I want to use both HDFS and Cassandra, it seems the current pipeline 
> > > config would support this by doing something like:
> > >
> > > <property>
> > >   <name>chukwaCollector.pipeline</name>
> > >   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter</value>
> > > </property>
> > >
> > > Thoughts?
> > >
> > >
> > > On Wed, Oct 26, 2011 at 10:16 PM, AD <[email protected]> wrote:
> > > Yep, that did it.  I just updated my initial_adaptors to use dataType 
> > > TsProcessor and saw demux kick in.
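> > >
> > > For anyone following along, the line looked something like this (the 
> > > adaptor class and log path here are examples, not exactly what I used):
> > >
> > > add filetailer.CharFileTailingAdaptorUTF8 TsProcessor /var/log/httpd/access_log 0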
> > >
> > > Thanks for the help.
> > >
> > >
> > >
> > > On Wed, Oct 26, 2011 at 9:22 PM, Eric Yang <[email protected]> wrote:
> > > See: http://incubator.apache.org/chukwa/docs/r0.4.0/agent.html and 
> > > http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html
> > >
> > > The configuration is the same for collector-based demux.  Hope this 
> > > helps.
> > >
> > > regards,
> > > Eric
> > >
> > > On Oct 26, 2011, at 4:20 PM, AD wrote:
> > >
> > > > Thanks.  Sorry for being dense here, but where does the data type get 
> > > > mapped from the agent to the collector when passing data, so that 
> > > > demux will match?
> > > >
> > > > On Wed, Oct 26, 2011 at 12:34 PM, Eric Yang <[email protected]> wrote:
> > > > "dp" serves as two functions, first it loads data to mysql, second, it 
> > > > runs SQL for aggregated views.  demuxOutputDir_* is created if the 
> > > > demux mapreduce produces data.  Hence, make sure that there is a demux 
> > > > processor mapped to your data type for the extracting process in 
> > > > chukwa-demux-conf.xml.
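> > > >
> > > > The mapping entry looks roughly like this (substitute your own data 
> > > > type name and processor class; "ApacheLog" here is just an example):
> > > >
> > > > <property>
> > > >   <name>ApacheLog</name>
> > > >   <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value>
> > > >   <description>Demux processor for the ApacheLog data type</description>
> > > > </property>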
> > > >
> > > > regards,
> > > > Eric
> > > >
> > > > On Oct 26, 2011, at 5:15 AM, AD wrote:
> > > >
> > > > > Hmm, I am running bin/chukwa demux and I don't have anything past 
> > > > > dataSinkArchives; there is no directory named demuxOutputDir_*.
> > > > >
> > > > > Also, isn't dp an aggregate view?  I need to parse the Apache logs 
> > > > > to do custom reports on things like remote_host, query strings, 
> > > > > etc., so I was hoping to parse the raw record, load it into 
> > > > > Cassandra, and run M/R there to do the aggregate views.  I thought 
> > > > > a new version of TsProcessor was the right place for this, but I 
> > > > > could be wrong.
> > > > >
> > > > > Thoughts?  If not, how do you write a custom postProcessor?
> > > > >
> > > > > On Wed, Oct 26, 2011 at 12:57 AM, Eric Yang <[email protected]> wrote:
> > > > > Hi AD,
> > > > >
> > > > > Data is stored in demuxOutputDir_* by demux, and there is a
> > > > > PostProcessorManager (bin/chukwa dp) which monitors the postProcess
> > > > > directory and loads data into MySQL.  For your use case, you will
> > > > > need to modify PostProcessorManager.java to adapt it to your needs.
> > > > > Hope this helps.
> > > > >
> > > > > regards,
> > > > > Eric
> > > > >
> > > > > On Tue, Oct 25, 2011 at 6:34 PM, AD <[email protected]> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I currently push Apache logs into Chukwa.  I am trying to figure 
> > > > > > out how to get all those logs into Cassandra and run MapReduce 
> > > > > > there.  Is the best place to do this in demux (write my own 
> > > > > > version of TsProcessor)?
> > > > > >
> > > > > > Also, the data flow seems to miss a step.  The page 
> > > > > > http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html says 
> > > > > > in 3.3 that:
> > > > > >
> > > > > > - demux moves complete files to: dataSinkArchives/[yyyyMMdd]/*/*.done
> > > > > > - the next step is to move files from 
> > > > > > postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> > > > > >
> > > > > > How do files get from dataSinkArchives to postProcess?  Does this 
> > > > > > run inside of DemuxManager or a separate process (bin/chukwa 
> > > > > > demux)?
> > > > > >
> > > > > > Thanks,
> > > > > > AD
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> 
> 
