Re: Rivers are reimporting data at each ElasticSearch restart

joergpra...@gmail.com Wed, 25 Jun 2014 08:54:06 -0700

It is up to the river implementation how the data import is handled.

The JDBC river, in the "simple" strategy, imports data whenever the river
is started, regardless of the state of the cluster or of an existing index.
It is possible to implement other strategies, for example a strategy that
performs a check before indexing.
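
As a rough illustration of such a check (this is not the JDBC river's
actual code), a strategy could ask Elasticsearch how many documents the
target index already holds and skip the import when the index is not empty.
The sketch below assumes the JSON body returned by GET /<index>/_count,
which carries a "count" field. The function name should_import and the
force flag are made up for this example.

```python
import json


def should_import(count_response_body, force=False):
    """Decide whether an import run should start.

    count_response_body: JSON text as returned by GET /<index>/_count,
    for example '{"count": 1250}'.
    force: hypothetical override to re-import even if documents exist.
    """
    if force:
        return True
    # A body without a "count" field (e.g. an error for a missing index)
    # is treated as an empty index.
    count = json.loads(count_response_body).get("count", 0)
    # Import only into an empty index, so that documents updated after
    # the first import are not overwritten on the next river start.
    return count == 0


if __name__ == "__main__":
    print(should_import('{"count": 0}'))
    print(should_import('{"count": 1250}'))
```

With a check like this, a restart would leave an already populated index
alone unless the import is forced explicitly.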


Elasticsearch gives river implementations no support for reacting to node
start/stop or for deciding how to behave when it happens. The JDBC river
tries to compensate for this by persisting a JDBC river specific state.
This state is useful for flow control.

If you no longer need the river, you can delete the river with curl
-XDELETE. This shuts down the river instance threads gracefully and
releases resources.

If you delete the _river index with curl -XDELETE, you wipe all data that
is used by rivers. Active river instances are not stopped and are not aware
of what happened, so this is an unfriendly way to terminate river runs, and
all kinds of river errors may occur.
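
The two deletions above differ only in the URL they target. Here is a
minimal sketch that builds both requests without sending them. It assumes a
node at http://localhost:9200 and a river registered under the made-up name
my_jdbc_river; the helper names are hypothetical.

```python
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed address of a local node


def delete_river_request(river_name, base_url=ES_URL):
    """Build (not send) the DELETE request that removes a single river.

    This is the graceful variant: only the named river is removed.
    """
    return Request("%s/_river/%s/" % (base_url, river_name), method="DELETE")


def delete_river_index_request(base_url=ES_URL):
    """Build (not send) the DELETE request that wipes the whole _river index.

    This is the unfriendly variant: running rivers are not stopped.
    """
    return Request("%s/_river" % base_url, method="DELETE")


if __name__ == "__main__":
    for req in (delete_river_request("my_jdbc_river"),
                delete_river_index_request()):
        print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would send it. The
first form is the one to prefer when a river is no longer needed.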

Jörg



On Wed, Jun 25, 2014 at 5:38 PM, Stéphane Seng <seng.steph...@gmail.com>
wrote:

> Hello,
>
> I have a question about the fact that, when rivers are used to import data
> into ElasticSearch, rivers are also reimporting data at each ElasticSearch
> restart.
>
> In our project, what we are doing is as follows :
>
>    - Raw data is imported into ElasticSearch from a MySQL database using
>    the JDBC river (https://github.com/jprante/elasticsearch-river-jdbc);
>    - Some updates are executed directly on the newly imported data in
>    ElasticSearch using POST requests;
>    - In the end, the final data stored in ElasticSearch is not the same
>    as the imported raw data.
>
> The problem we are facing is that when ElasticSearch is restarted, the
> JDBC river reimports the raw data, thus overwriting the transformations
> made.
> We suppose that this is an intentional behavior from ElasticSearch rivers.
> One solution to avoid the reimporting of data is to delete the
> corresponding _river index, which is supposed to store the state of the
> rivers.
>
> Our questions are as follows :
>
>    - Is the reimporting of data from rivers at each restart a standard
>    use case? Is it useful for some applications?
>    - What is the point of the _river index state saving ?
>       - Is there a way to avoid the reimporting of data without having to
>       delete the corresponding _river index ?
>       - Are there any downsides (for our use case) to deleting the
>       corresponding _river index?
>
> Thanks,
> Stéphane.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/a59ade79-e474-466b-bf54-1476a7c506bb%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/a59ade79-e474-466b-bf54-1476a7c506bb%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

