Hi, I understand that you are looking to index incremental data. In that case, the following approach is the best I can think of:
1. Make a unique key per document. This key can be the URL, or a SHA hash of some other field that makes sense as a unique key.
2. Use the unique key as the doc ID.
3. Add a field that holds a hash of the content field. This hash field will change whenever the content changes.
4. Whenever there is a new insert, do an upsert <http://www.elastic.co/guide/en/elasticsearch/reference/1.4/docs-update.html#upserts> on this document.
5. During the upsert, check whether the content hash has changed. If there is no change, you can stop there; if the content has changed, update both the content field and the content hash field.

Thanks
Vineeth Mohan,
Elasticsearch consultant,
qbox.io ( Elasticsearch service provider <http://qbox.io/> )

On Sun, Apr 5, 2015 at 6:14 PM, Employ <m...@employ.com> wrote:

> Hi,
>
> At different random times throughout the day I am going to do a "crawl" of
> data which I am going to feed into Elasticsearch. This bit is working just
> fine.
>
> However, the index should reflect only what was found in my most recent
> crawl, and I currently have nothing to remove the content in the
> Elasticsearch index which was left over from the previous crawl but wasn't
> found in the new crawl.
>
> From what I can see I have a few options:
>
> A) Delete items based on how old they are. Won't work because index times
> are random.
>
> B) Delete the entire index and feed it with fresh data. Doesn't seem very
> efficient and will leave me, for a time, with an empty or partial index.
>
> C) Do an insert/modify query: if not found, insert; if found already in the
> index, update the timestamp, then do a second pass to delete any items with
> an older timestamp.
>
> D) Something better.
>
> I would really appreciate any feedback on a logical and efficient way of
> removing old content in a situation like this.
>
> Thank you and happy Easter.
>
> James

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGdPd5mrNgC8kMMnv8uNL%3DUh9Uk%3DN4TDqGLjwexAPtBucsnWEw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
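Steps 1-5 above, combined with the timestamp-and-sweep idea from option C in the question, can be sketched as follows. This is a minimal illustration only: it uses a plain dict as a stand-in for the index, and all function and field names (`upsert`, `sweep`, `content_hash`, `last_crawl`, the example URLs) are made up for the sketch, not Elasticsearch API names. In real use, the upsert would go through the update API linked above, and the sweep through a query that deletes documents whose crawl stamp is older than the current run.

```python
import hashlib

# Stand-in for an Elasticsearch index: doc_id -> stored document.
index = {}

def doc_id_for(url):
    # Steps 1-2: a stable unique key (here a SHA-256 of the URL), used as doc ID.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def content_hash(content):
    # Step 3: a hash that changes exactly when the content changes.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def upsert(url, content, crawl_id):
    """Steps 4-5: insert if new; on an existing doc, rewrite content and
    hash only when the hash differs. Always stamp the crawl_id so a later
    sweep can find documents the latest crawl did not touch."""
    _id = doc_id_for(url)
    new_hash = content_hash(content)
    doc = index.get(_id)
    if doc is None:
        index[_id] = {"url": url, "content": content,
                      "content_hash": new_hash, "last_crawl": crawl_id}
        return "inserted"
    doc["last_crawl"] = crawl_id
    if doc["content_hash"] == new_hash:
        return "unchanged"            # step 5: content identical, stop here
    doc["content"] = content          # content changed: update both fields
    doc["content_hash"] = new_hash
    return "updated"

def sweep(crawl_id):
    """Second pass (option C): delete anything the latest crawl missed."""
    stale = [k for k, d in index.items() if d["last_crawl"] != crawl_id]
    for k in stale:
        del index[k]
    return len(stale)
```

For example, if crawl 2 revisits only one of two previously indexed pages, `sweep(2)` removes the page that was not seen, which is exactly the "left over from the previous crawl" content James wants gone.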