Re: Questions about Morphline Solr Sink structure

Wolfgang Hoschek Mon, 11 Nov 2013 11:54:48 -0800

Hi Otis,

You bring up a lot of very good points here, indeed. I'll try to answer as best 
as I can...

In the early days this Flume Sink started out as being very Solr specific. Over 
time I have made it more generic and steadily reduced the dependency on Solr, 
and at this point there is in fact no dependency on Solr left in the code 
(except in some tests that straddle the boundary between unit tests and 
integration tests). So in effect it wouldn't be technically wrong to refer to 
this as a Morphline Sink. The name simply reflects how the code evolved, and it 
has been kept for backwards compatibility.

You could easily use this sink to extract, transform and load data into ES (or 
any other app or database or storage system) without pulling in any Solr 
related jar. To do so you'd write a loadElasticSearch morphline command in a 
separate morphline Maven module, and use that command instead of the loadSolr 
command in your morphline config files. The new loadElasticSearch command would 
convert a morphline record to a data structure appropriate for ES, e.g. ES 
JSON/Smile, and send that to ES. That's all there is to it, really.
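
As a rough illustration of the conversion step described above, here is a 
minimal sketch in plain Java. It is not the morphline API: the class name 
LoadElasticSearchSketch, the DocumentSender interface and the choice to emit 
single-valued fields as JSON scalars and multi-valued fields as JSON arrays are 
all assumptions made for this example. The record is modeled as a plain 
Map<String, List<Object>>, and the actual call to ES is left behind the 
hypothetical DocumentSender.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of what a loadElasticSearch command might do.
// It does not use the real morphline or Elasticsearch client APIs.
class LoadElasticSearchSketch {

    // Assumed transport abstraction; a real command would wrap an ES client here.
    interface DocumentSender {
        void send(String jsonDocument);
    }

    private final DocumentSender sender;

    LoadElasticSearchSketch(DocumentSender sender) {
        this.sender = sender;
    }

    // Converts the record to JSON and hands it to the sender.
    boolean process(Map<String, List<Object>> record) {
        sender.send(toJson(record));
        return true;
    }

    // Single-valued fields become JSON scalars, multi-valued fields become arrays.
    // Empty fields are skipped.
    static String toJson(Map<String, List<Object>> record) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, List<Object>> field : record.entrySet()) {
            List<Object> values = field.getValue();
            if (values.isEmpty()) {
                continue;
            }
            if (!first) {
                json.append(',');
            }
            first = false;
            json.append(quote(field.getKey())).append(':');
            if (values.size() == 1) {
                json.append(scalar(values.get(0)));
            } else {
                json.append('[');
                for (int i = 0; i < values.size(); i++) {
                    if (i > 0) {
                        json.append(',');
                    }
                    json.append(scalar(values.get(i)));
                }
                json.append(']');
            }
        }
        return json.append('}').toString();
    }

    // Numbers and booleans are written as-is; everything else is written as a
    // quoted string via toString(). Non-finite doubles are not handled here.
    private static String scalar(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        return quote(value.toString());
    }

    // Minimal JSON string escaping.
    private static String quote(String text) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        Map<String, List<Object>> record = new LinkedHashMap<>();
        record.put("id", new ArrayList<Object>(Arrays.asList("1")));
        record.put("tags", new ArrayList<Object>(Arrays.asList("a", "b")));
        record.put("count", new ArrayList<Object>(Arrays.asList(3)));
        // Stand-in sender that prints instead of talking to ES.
        new LoadElasticSearchSketch(System.out::println).process(record);
        // prints {"id":"1","tags":["a","b"],"count":3}
    }
}
```

A real command would plug into the morphline command chain and would probably 
batch documents rather than send them one at a time; both are out of scope for 
this sketch.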

A morphline record is essentially a hash table where the keys are strings and 
each value is a list of arbitrary Java objects. Those Java objects are 
typically Strings and Integers, but they can also be InputStreams or byte[] 
BLOBs, Avro objects, etc. This data model corresponds exactly to the features 
of the Lucene data model. It can also be seen as a superset of the Flume event 
data model - the Flume body is a byte[] value in the morphline _attachment_body 
field. The data model also maps well to the relational model. It can also be 
used for hierarchical data, since the values in a morphline record field can be 
Avro, JSON, XML, protobufs, or any other custom complex data structure.
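
To make the data model above concrete, here is a minimal sketch in plain Java. 
RecordSketch is a hypothetical stand-in written for this example, not the real 
morphline Record class, and its method names are assumptions. It shows string 
keys mapped to lists of arbitrary objects, and a Flume event turned into such a 
record with the body stored as a byte[] under _attachment_body. Copying the 
Flume headers into fields of the same name is also an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a morphline record: string keys, each mapped to a
// list of arbitrary Java objects. Not the real morphline class.
class RecordSketch {

    // Field name that holds the Flume event body.
    static final String ATTACHMENT_BODY = "_attachment_body";

    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Appends a value to the field, creating the field on first use.
    void put(String key, Object value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns all values of the field, or an empty list if it is absent.
    List<Object> get(String key) {
        return fields.getOrDefault(key, Collections.emptyList());
    }

    // Builds a record from Flume-style headers and body. Copying headers into
    // same-named fields is an assumption of this sketch.
    static RecordSketch fromFlumeEvent(Map<String, String> headers, byte[] body) {
        RecordSketch record = new RecordSketch();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            record.put(header.getKey(), header.getValue());
        }
        record.put(ATTACHMENT_BODY, body);
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("host", "web01");
        RecordSketch record =
            fromFlumeEvent(headers, "hello".getBytes(StandardCharsets.UTF_8));

        // A field can hold several values of different types.
        record.put("tags", "a");
        record.put("tags", "b");
        record.put("count", 3);

        System.out.println(record.get("tags"));  // prints [a, b]
        System.out.println(record.get("host"));  // prints [web01]
        byte[] body = (byte[]) record.get(ATTACHMENT_BODY).get(0);
        System.out.println(new String(body, StandardCharsets.UTF_8));  // prints hello
    }
}
```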

Wolfgang.

On Nov 10, 2013, at 4:42 PM, Otis Gospodnetic wrote:

> Hello,
> 
> One more "proactive" question.
> 
> Isn't all code under the .... solr/morphline package not really about
> Morphline *Solr* Sink, but really more about *Morphline* Sink?
> In other words, if where Morphline actually outputs is dictated by the
> Morphline command in Morphline config (e.g. loadSolr()), then as far
> as Flume is concerned, isn't that really just *Morphline* Sink?
> 
> For example, if I wanted to get Flume to pass events through Morphline
> and have Morphline output to Elasticsearch, I wouldn't really want to
> add a whole new Elasticsearch Morphline Sink.  I should really just be
> able to use the existing (misnamed?) Morphline Solr Sink and just
> point it to a Morphline config that has loadElasticsearch() instead of
> loadSolr().
> 
> (please ignore the fact Morphline doesn't actually have
> loadElasticsearch() yet - I think this is a Morphline issue, not a
> Flume issue)
> 
> Is the above correct?
> 
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> On Sun, Nov 10, 2013 at 7:29 PM, Otis Gospodnetic
> <otis.gospodne...@gmail.com> wrote:
>> Hello,
>> 
>> Warning: I'm a Flume NG and Morphlines newbie
>> 
>> I was looking at Morphline Solr Sink to see how one could write an
>> equivalent Morphline Elasticsearch Sink, but after looking at the
>> code, I'm a bit confused.  Here are my Qs:
>> 
>> 1)  interface MorphlineHandler mentions Solr in N places, but it
>> doesn't seem to be Solr-specific.  Couldn't one reuse this interface
>> for a Morphline ES Sink?
>> 
>> 2) In general, couldn't/shouldn't a few classes from
>> org.apache.flume.sink.solr.morphline package really be outside
>> anything solr-specific? e.g.  org.apache.flume.sink.morphline for
>> those that are Morphline-specific?
>> 
>> 3) Similarly, BlobDeserializer and BlobHandler don't seem to be even
>> Morphline-specific.  Shouldn't they be elsewhere?
>> 
>> 4) I was expecting to see SolrJ (Solr Java client library) being used
>> in MorphlineHandlerImpl or MorphlineSolrSink to send events to Solr,
>> but there is no trace of SolrJ there.  How exactly does this load
>> Flume events into Solr then?
>> Ooooh, is that because when using this sink one is supposed to provide
>> a Morphline config and this config has a hard-coded loadSolr()
>> command?
>> 
>> 5) Would it make sense to refactor any of the current Morphline Solr
>> Sink code to make it easier to add things like a Morphline Elasticsearch
>> Sink?  If so, any guidance you could provide would be very helpful.
>> 
>> Thanks,
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
