Yep, it sounds great, can't wait to see a beta version to play with. I gave a quick look at Logstash, but it is not exactly what I want, and it feels too 'resource-heavy' for me to install an additional framework on every node (or to have dedicated nodes for it).
It would be nice if the Elasticsearch team could extract the river functionality into some kind of plugin and contribute it to the community (before deprecating it), so anyone who still wants to use rivers will be able to.

Cheers,
Karol

On Thursday, December 26, 2013 12:37:36 PM UTC, Jörg Prante wrote:

> Rivers were once introduced for demo purposes, to quickly load some data into ES and make showcases from Twitter or Wikipedia data.
>
> The Elasticsearch team is now in favor of Logstash.
>
> I started this gatherer plugin for my use cases, where I am not able to use Logstash. I have very complex streams, e.g. ISO 2709 record formats with some hundred custom transformations in the data, which I reduce to primitive key/value streams and RDF triples. I also plan to build RDF feeds for semantic web/linked data platforms where ES is the search engine.
>
> The gatherer "uber" plugin should work like this:
>
> - It can be installed on one or more nodes and provides a common bulk indexing framework.
>
> - A gatherer plugin registers in the cluster state (on the node level).
>
> - There are standard capabilities, but a gatherer plugin's capabilities can be extended in a live cluster by submitting code for inputs, codecs, and filters, picked up by a custom class loader (for example, JDBC, plus a driver jar, with tabular key/value output).
>
> - A gatherer plugin idles and accepts jobs in the form of JSON commands (defining the selection of inputs, codecs, and filters), for example an SQL command.
>
> - If a gatherer is told to distribute jobs fairly and is too busy (active job queue length), it forwards them to other gatherers (other methods are crontab-like scheduling). The results of the jobs (ok, failed, retry) are also registered in the cluster state (maybe an internal index is better, because there can be tens of thousands of such jobs).
>
> - A client can ask for the state of all the gatherers and all the job results.
>
> - All jobs can be partitioned and
processed in parallel for maximum throughput.
>
> - The gatherer also creates metrics/statistics of the jobs successfully done.
>
> Another thing I find important is to enable scripting for processing the data streams (JSR 223 scripting, especially Groovy, Jython, JRuby, Rhino/Nashorn).
>
> Right now there is no repo; I plan to kickstart the repo in early 2014.
>
> Jörg
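To make the "jobs as JSON commands" idea concrete, here is a rough sketch of what such a job submission might look like for the JDBC case Jörg mentions. Every field name here is my own guess for illustration; the plugin and its actual command format were never published as described, so treat this purely as a hypothetical shape (input + codec + filter selection, a crontab-like schedule, and a bulk-indexing target):

```json
{
  "input": {
    "type": "jdbc",
    "url": "jdbc:mysql://localhost:3306/feeds",
    "statement": "select id as _id, title, author from books"
  },
  "codec": "tabular-keyvalue",
  "filter": "strip-empty-values",
  "schedule": "0 2 * * *",
  "target": {
    "index": "books",
    "bulk_size": 1000
  }
}
```

A client would submit this to any gatherer node, which either runs it or forwards it to a less busy gatherer, and the job result (ok, failed, retry) would then be queryable from the cluster state.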
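The JSR 223 scripting Jörg mentions can be sketched with the standard `javax.script` API. This is only an illustration of the general mechanism, not the plugin's actual code: a user-submitted script rewrites one key/value pair of a stream. Note that engine availability depends on the JDK (Nashorn ships with JDK 8–14; on newer JDKs an engine such as GraalJS must be added to the classpath), so the sketch checks for a missing engine:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ScriptFilterSketch {
    public static void main(String[] args) throws Exception {
        ScriptEngineManager manager = new ScriptEngineManager();
        // "javascript" maps to Nashorn on JDK 8-14; newer JDKs need an
        // engine (e.g. GraalJS) on the classpath, hence the null check.
        ScriptEngine js = manager.getEngineByName("javascript");
        if (js != null) {
            // Bindings expose the current key/value pair to the script.
            js.put("key", "dc:title");
            js.put("value", "linked data");
            // A user-submitted one-line filter rewrites the pair.
            Object out = js.eval("key + '=' + value.toUpperCase()");
            System.out.println(out);
        } else {
            System.out.println("no javascript engine on this JDK");
        }
        System.out.println("filter step done");
    }
}
```

The same lookup works for Groovy, Jython, or JRuby engines by name, which is what makes JSR 223 attractive for pluggable per-record transformations.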
