Re: Rivers are reimporting data at each ElasticSearch restart

2014-06-26 Thread Stéphane Seng
Thanks for your quick reply,

I need some clarifications about what you meant by "delete the river", 
"delete the _river index" and by "this state is useful for flow control".

From what I have understood from your reply, and supposing that I have 
imported data into a "documents" river using the JDBC river:

   - "Delete the river" means "DELETE _river/documents" (and does not mean 
   "DELETE documents"):
      - This does not affect the already imported data.
      - The data is not reimported into ElasticSearch at restart.
      - Everything is fine for our use case.
   - "Delete the _river index" means "DELETE _river":
      - This does not affect the already imported data.
      - The data is not reimported into ElasticSearch at restart.
      - This should not be done because it affects all the rivers at the 
      same time (for the documents river, it is equivalent to doing 
      "DELETE _river/documents").
   - "This state is useful for flow control" means that:
      - The state keeps track of what data is already imported, so that the 
      same raw data (left untouched in ElasticSearch) is not reimported 
      multiple times?
      - OR the state keeps a trace of the SQL query so that, in case of an 
      error during a node start/stop, the river can be automatically replayed?
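
To make the two readings of "delete" concrete, here is a minimal Python sketch of the two requests as I understand them. The node address (localhost:9200) and the helper names are assumptions for illustration only; the requests are built but not sent.

```python
import urllib.request

# Assumption: a single local node on the default HTTP port.
ES_URL = "http://localhost:9200"


def delete_river_request(river_name, base_url=ES_URL):
    """Build (but do not send) DELETE /_river/<river_name>.

    This targets one river only; the imported data index is not touched.
    """
    url = "{}/_river/{}".format(base_url.rstrip("/"), river_name)
    return urllib.request.Request(url, method="DELETE")


def delete_river_index_request(base_url=ES_URL):
    """Build (but do not send) DELETE /_river.

    This targets the whole _river index, i.e. the state of ALL rivers.
    """
    url = "{}/_river".format(base_url.rstrip("/"))
    return urllib.request.Request(url, method="DELETE")


if __name__ == "__main__":
    for req in (delete_river_request("documents"), delete_river_index_request()):
        print(req.get_method(), req.full_url)
    # To actually send one of them (hypothetical usage):
    # urllib.request.urlopen(delete_river_request("documents"))
```

The first helper corresponds to "DELETE _river/documents", the second to "DELETE _river".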
  
Thanks again,
Stéphane.

On Wednesday, June 25, 2014 6:08:52 PM UTC+2, Jörg Prante wrote:
>
> Because each river can freely implement the data fetch, ES does not offer 
> river monitoring.
>
> For JDBC river, I implemented some primitive river state query commands 
> that allow polling for river state changes.
>
> Jörg
>
>
> On Wed, Jun 25, 2014 at 6:00 PM, Tanguy Bernard wrote:
>
>> Hello,
>> This post interested me.
>> Do we have a way to know when indexing is finished, so that we can then 
>> trigger the XDELETE _river?
>>
>> On Wednesday, June 25, 2014 5:54:01 PM UTC+2, Jörg Prante wrote:
>>>
>>> It is up to the river implementation how the data import is handled.
>>>
>>> The JDBC river, in the "simple" strategy, imports data when the river is 
>>> started, regardless of existing cluster or index. It is possible to 
>>> implement other strategies, for example, a strategy that performs a check 
>>> before indexing.
>>>
>>> ES gives river implementations no support for node start/stop control 
>>> or for how to behave across restarts. The JDBC river tries to compensate 
>>> for this by persisting a JDBC river specific state. This state is useful 
>>> for flow control.
>>>
>>> If you no longer need the river, you can delete the river with curl 
>>> -XDELETE; this shuts down the river instance threads gracefully and 
>>> releases resources.
>>>
>>> If you delete the _river index with curl -XDELETE, you wipe all data 
>>> that is used by rivers. Active river instances are not stopped and are not 
>>> aware of what happened, so this is an unfriendly way to terminate river 
>>> runs, and all kinds of river errors may occur.
>>>
>>> Jörg
>>>
>>>
>>>
>>> On Wed, Jun 25, 2014 at 5:38 PM, Stéphane Seng wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a question about the fact that, when rivers are used to import 
>>>> data into ElasticSearch, rivers are also reimporting data at each 
>>>> ElasticSearch restart.
>>>>
>>>> In our project, what we are doing is as follows:
>>>>
>>>>    - Raw data is imported into ElasticSearch from a MySQL database 
>>>>    using the JDBC river 
>>>>    (https://github.com/jprante/elasticsearch-river-jdbc);
>>>>    - Some updates are executed directly on the newly imported data in 
>>>>    ElasticSearch using POST requests;
>>>>    - In the end, the final data stored in ElasticSearch is not the 
>>>>    same as the imported raw data.
>>>>
>>>> The problem we are facing is that when ElasticSearch is restarted, the 
>>>> JDBC river is reimporting the raw data thus overriding the transformations 
>>>> made.
>>>> We suppose that this is an intentional behavior from ElasticSearch 
>>>> rivers.
>>>> One solution to avoid the reimporting of data is to delete the 
>>>> corresponding _river index, which is supposed to store the state of the 
>>>> rivers.
>>>>
>>>> Our questions are as follows :
>>>>
>>>>- Is the reimporting of data fro

Rivers are reimporting data at each ElasticSearch restart

2014-06-25 Thread Stéphane Seng
Hello,

I have a question about the fact that, when rivers are used to import data 
into ElasticSearch, rivers are also reimporting data at each ElasticSearch 
restart.

In our project, what we are doing is as follows:

   - Raw data is imported into ElasticSearch from a MySQL database using 
   the JDBC river (https://github.com/jprante/elasticsearch-river-jdbc);
   - Some updates are executed directly on the newly imported data in 
   ElasticSearch using POST requests;
   - In the end, the final data stored in ElasticSearch is not the same 
   as the imported raw data.
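
As a rough illustration of the first two steps, here is a minimal Python sketch. It assumes the river is registered with a PUT to _river/<name>/_meta with type "jdbc", as in the JDBC river README, and that the later updates are partial updates sent to the _update endpoint, which is only one plausible form of "POST requests". The node address, index, type, id, field names, and helper names are all hypothetical. The requests are built but not sent.

```python
import json
import urllib.request

# Assumption: a single local node on the default HTTP port.
ES_URL = "http://localhost:9200"


def create_jdbc_river_request(river_name, jdbc_settings, base_url=ES_URL):
    """Build (but do not send) PUT /_river/<river_name>/_meta for a JDBC river."""
    body = json.dumps({"type": "jdbc", "jdbc": jdbc_settings}).encode("utf-8")
    url = "{}/_river/{}/_meta".format(base_url, river_name)
    return urllib.request.Request(url, data=body, method="PUT")


def partial_update_request(index, doc_type, doc_id, fields, base_url=ES_URL):
    """Build (but do not send) POST /<index>/<type>/<id>/_update.

    The body carries only the fields to change on the imported document.
    """
    body = json.dumps({"doc": fields}).encode("utf-8")
    url = "{}/{}/{}/{}/_update".format(base_url, index, doc_type, doc_id)
    return urllib.request.Request(url, data=body, method="POST")


if __name__ == "__main__":
    # Hypothetical connection settings and query.
    river = create_jdbc_river_request("documents", {
        "url": "jdbc:mysql://localhost:3306/test",
        "user": "",
        "password": "",
        "sql": "select * from orders",
    })
    # Hypothetical index, type, id and field.
    update = partial_update_request("myindex", "mytype", "1", {"status": "cleaned"})
    for req in (river, update):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

The point of the sketch is that the second request changes documents that the first one will write again whenever the river reruns its SQL query.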
   
The problem we are facing is that when ElasticSearch is restarted, the JDBC 
river reimports the raw data, thus overwriting the transformations made.
We suppose that this is intentional behavior on the part of ElasticSearch rivers.
One solution to avoid the reimporting of data is to delete the 
corresponding _river index, which is supposed to store the state of the 
rivers.

Our questions are as follows:

   - Is the reimporting of data from rivers at each restart a standard 
   use case? Is it useful for some applications?
   - What is the point of the _river index state saving?
      - Is there a way to avoid the reimporting of data without having to 
      delete the corresponding _river index?
      - Are there any downsides (for our use case) to deleting the 
      corresponding _river index?
  
Thanks,
Stéphane.
