I have uploaded a patch to FLUME-1252, which is a better-performing HBase sink. If you are experimenting, it would be a great help if you could try that out. The serializer has almost the same functionality/API. It would be great to get some verification of its correctness and performance.
Thanks!
Hari

On Sunday, June 10, 2012, Patrick Wendell wrote:

> Thanks Hari - that does help. I was envisioning something akin to the
> RegexSerde in Hive, where you can just write a regular expression to
> extract fields from the event data and put them into separate columns
> (within a CF). Sounds like a custom Serializer is exactly what I want here.
>
> - Patrick
>
> On Sat, Jun 9, 2012 at 11:01 PM, Hari Shreedharan <
> [email protected]> wrote:
>
> > Hi Patrick,
> >
> > The HbaseSink has 2 components - one being the sink itself and the other
> > being the serializer. When the sink picks up an event from the channel,
> > it is handed over to the serializer, which can process the event and
> > return Puts and/or Increments. So if you plan to write to different
> > columns within the same column family, all you need to do is write your
> > own serializer that implements HbaseEventSerializer, and set that as the
> > serializer for the HbaseSink.
> >
> > If you need to write to more than one column family, the way to do it is
> > to add a header to the event based on the column family/column, use the
> > multiplexing channel selector to divert the event to different flows, and
> > then use multiple HBase sinks. As of now, the HbaseSink writes only to
> > one table and one column family. This was done to simplify configuration
> > and the serializer interface.
> >
> > Basically - write an HBaseEventSerializer and plug it into the HbaseSink,
> > which will write to HBase.
> >
> > I hope this helps.
> >
> > Thanks,
> > Hari
> >
> > --
> > Hari Shreedharan
> >
> > On Saturday, June 9, 2012 at 11:27 PM, Patrick Wendell wrote:
> >
> > > Hi There,
> > >
> > > For certain types of event data, such as log files, it would be nice
> > > to have a way to write to HBase such that fields from the original
> > > file can be parsed into distinct columns.
> > >
> > > I want to implement this for a one-off project (and maybe for
> > > contribution back to Flume if this makes sense).
> > >
> > > What is the best way to go about it? Based on skimming the code, my
> > > sense is that writing a custom HBase sink makes the most sense. Is
> > > that heading down the right path, or is there some other component I
> > > should be modifying or extending?
> > >
> > > - Patrick
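For readers following this thread: a minimal sketch of the regex field-extraction step Patrick describes, using only the JDK. In a real Flume serializer this logic would live inside a class implementing HbaseEventSerializer (whose initialize() receives the event and getActions() returns the Puts); the class name, regex, column names, and sample log line below are all made up for illustration, not part of Flume's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: map regex capture groups to HBase column qualifiers.
// In a custom HbaseEventSerializer, this would run against the event
// body in initialize(), and getActions() would turn the resulting map
// into one Put with a qualifier per extracted field.
public class RegexFieldExtractor {
  private final Pattern pattern;
  private final String[] columns;

  public RegexFieldExtractor(String regex, String[] columns) {
    this.pattern = Pattern.compile(regex);
    this.columns = columns;
  }

  /** Returns column-qualifier -> value, or an empty map if no match. */
  public Map<String, String> extract(String eventBody) {
    Map<String, String> out = new LinkedHashMap<>();
    Matcher m = pattern.matcher(eventBody);
    if (m.matches()) {
      for (int i = 0; i < columns.length && i < m.groupCount(); i++) {
        out.put(columns[i], m.group(i + 1));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Hypothetical access-log line: host, method, path, status.
    RegexFieldExtractor ex = new RegexFieldExtractor(
        "(\\S+) (\\S+) (\\S+) (\\d+)",
        new String[] {"host", "method", "path", "status"});
    System.out.println(ex.extract("127.0.0.1 GET /index.html 200"));
    // prints {host=127.0.0.1, method=GET, path=/index.html, status=200}
  }
}
```

Each entry in the returned map would become one column within the sink's configured column family, matching Hari's point that a single serializer can fan an event out to many columns in the same CF.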
