Hi,

I understand that you are looking to index incremental data.
In that case, the following approach is the best I can think of:


   1. Make a unique key per document. This key can be the URL or the SHA
   hash of some other field that makes sense as a unique key.
   2. Use the unique key as the doc ID.
   3. Add a field that holds the hash of the content field. This hash
   field will change whenever the content changes.
   4. Now, whenever there is a new insert, do an upsert
   <http://www.elastic.co/guide/en/elasticsearch/reference/1.4/docs-update.html#upserts>
   on this document.
   5. During the upsert, check whether the content hash has changed. If
   there is no change, stop processing; if the content has changed, update
   both the content field and the content hash field.
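The steps above can be sketched in Python. This is only a sketch of how the doc ID and the upsert request body could be built; the field names ("content", "content_hash"), the example URL, and the inline script syntax are illustrative assumptions, not something from your setup:

```python
import hashlib


def make_doc_id(url):
    # Step 1/2: derive a stable doc ID from the unique key (here, the URL).
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def build_upsert(url, content):
    # Step 3: hash of the content field, stored alongside the content.
    content_hash = hashlib.sha1(content.encode("utf-8")).hexdigest()
    # Steps 4/5: a scripted upsert body -- if the stored hash matches,
    # skip the update (ctx.op = 'none'); otherwise update both fields.
    # The script string is illustrative, not tested against a cluster.
    body = {
        "script": (
            "if (ctx._source.content_hash == new_hash) { ctx.op = 'none' } "
            "else { ctx._source.content = new_content; "
            "ctx._source.content_hash = new_hash }"
        ),
        "params": {"new_content": content, "new_hash": content_hash},
        "upsert": {"content": content, "content_hash": content_hash},
    }
    return make_doc_id(url), body


doc_id, body = build_upsert("http://example.com/page", "hello world")
```

You would then send this body to the update endpoint for `doc_id`; because the ID is derived from the URL, re-crawling the same page always hits the same document.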


Thanks
           Vineeth Mohan,
           Elasticsearch consultant,
           qbox.io ( Elasticsearch service provider <http://qbox.io/>)


On Sun, Apr 5, 2015 at 6:14 PM, Employ <m...@employ.com> wrote:

> Hi,
>
> At different random times throughout the day I am going to do a "crawl" of
> data which I am going to feed into elasticsearch. This bit is working just
> fine.
>
> However the index should reflect only what was found in my most recent
> crawl and I currently have nothing to remove the content in the
> elasticsearch index which was left over from the previous crawl but wasn't
> found in the new crawl.
>
> From what I can see I have a few options:
>
> A) Delete items based on how old they are. Won't work because index times
> are random.
>
> B) Delete the entire index and feed it with fresh data. Doesn't seem very
> efficient, and will leave me for a time with an empty or partial index.
>
> C) Do an insert/modify query: if not found, insert; if already in the
> index, update the timestamp; then do a second pass to delete any items
> with an older timestamp.
>
> D) Something better.
>
> I would really appreciate any feedback on a logical and efficient way to
> removing old content in a situation like this.
>
> Thank you and happy Easter.
>
> James
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/AF36C2A5-8B38-4176-90B8-2E4210A0244F%40employ.com
> .
> For more options, visit https://groups.google.com/d/optout.
>
