Thanks for your quick reply,
I need some clarifications about what you meant by "delete the river",
"delete the _river index" and by "this state is useful for flow control".
From what I have understood from your reply, and supposing that I have
imported data into a "documents" river using the JDBC river:
- "Delete the river" means "DELETE _river/documents" (and does not mean
"DELETE documents"):
- This does not affect the already imported data.
- The data is not reimported into ElasticSearch at restart.
- Everything is fine for our use case.
- "Delete the _river index" means "DELETE _river":
- This does not affect the already imported data.
- The data is not reimported into ElasticSearch at restart.
- This should not be done because it affects all the rivers at the
same time (for the documents river, it is equivalent to doing "DELETE
_river/documents").
- "This state is useful for flow control" means that:
- The state keeps track of what data is already imported so that the
same raw data (left untouched in ElasticSearch) is not reimported
multiple
times ?
- OR The state keeps a trace of the SQL query so that, in case of an
error during a node start/stop, the river can be automatically replayed ?
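To make the two "delete" variants concrete, here is a sketch of the corresponding curl commands (assuming a local node on localhost:9200 and a JDBC river registered as "documents"; the paths are my reading of the thread, not verified against your setup):

```shell
# Delete only the "documents" river: shuts down its instance threads
# gracefully and leaves the already-imported data untouched.
curl -XDELETE 'localhost:9200/_river/documents'

# Delete the whole _river index: wipes the state of ALL rivers without
# stopping their threads, the "unfriendly" variant described in the thread.
curl -XDELETE 'localhost:9200/_river'
```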
Thanks again,
Stéphane.
On Wednesday, June 25, 2014 6:08:52 PM UTC+2, Jörg Prante wrote:
>
> Because each river can freely implement the data fetch, ES does not offer
> river monitoring.
>
> For JDBC river, I implemented some primitive river state query commands
> that allow polling for river state changes.
>
> Jörg
>
>
> On Wed, Jun 25, 2014 at 6:00 PM, Tanguy Bernard wrote:
>
>> Hello,
>> This post interested me.
>> Is there a way to know when indexing is finished, so that we can then
>> trigger the XDELETE _river?
>>
>> Le mercredi 25 juin 2014 17:54:01 UTC+2, Jörg Prante a écrit :
>>>
>>> It is up to the river implementation how the data import is handled.
>>>
>>> The JDBC river, in the "simple" strategy, imports data when the river is
>>> started, regardless of existing cluster or index. It is possible to
>>> implement other strategies, for example, a strategy that performs a check
>>> before indexing.
>>>
>>> There is no support for river implementations about node start/stop
>>> control and how to behave. JDBC river tries to compensate this by
>>> persisting a JDBC river specific state. This state is useful for flow
>>> control.
>>>
>>> If you no longer need the river, you can delete it with curl
>>> -XDELETE; this shuts down the river instance threads gracefully and
>>> releases resources.
>>>
>>> If you delete the _river index with curl -XDELETE, you wipe all the
>>> data that is used by rivers. Active river instances are not stopped and
>>> are not aware of what happened, so this is an unfriendly way to terminate
>>> river runs; all kinds of river errors may occur.
>>>
>>> Jörg
>>>
>>>
>>>
>>> On Wed, Jun 25, 2014 at 5:38 PM, Stéphane Seng wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a question about the fact that, when rivers are used to import
>>>> data into ElasticSearch, rivers are also reimporting data at each
>>>> ElasticSearch restart.
>>>>
>>>> In our project, what we are doing is as follows:
>>>>
>>>>    - Raw data is imported into ElasticSearch from a MySQL database
>>>>    using the JDBC river (https://github.com/jprante/elasticsearch-river-jdbc);
>>>>    - Some updates are executed directly on the newly imported data in
>>>>    ElasticSearch using POST requests;
>>>>    - In the end, the final data stored in ElasticSearch is not the
>>>>    same as the imported raw data.
>>>>
>>>> The problem we are facing is that, when ElasticSearch is restarted, the
>>>> JDBC river reimports the raw data, thus overwriting the transformations
>>>> we made.
>>>> We suppose that this is intentional behavior of ElasticSearch
>>>> rivers.
>>>> One solution to avoid the reimporting of data is to delete the
>>>> corresponding _river index, which is supposed to store the state of the
>>>> rivers.
>>>>
>>>> Our questions are as follows:
>>>>
>>>>- Is the reimporting of data fro