Hi,

We have a service that is essentially a file system crawler, and we're using Ignite to store the overall state of the job. The state is represented by simple objects with fields like ID, Name, Path, and State. The State field is either "Candidate" or "Document". A Candidate is metadata only and is inserted before the content of the file is actually fetched. Once a Candidate is stored in Ignite, we send it back to the crawler so that it can read the file and send back the content. Once we receive the content, we update the item's State field to Document.
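To make the model concrete, here is a minimal sketch of the item and its Candidate to Document transition. It is only an illustration: a plain in-memory map stands in for the Ignite cache, and the names `CrawlStateSketch`, `Item`, `storeCandidate`, and `markDocument` are made up for this example, not our real classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a plain in-memory map stands in for the Ignite cache.
class CrawlStateSketch {
    enum State { CANDIDATE, DOCUMENT }

    // Immutable item with the fields mentioned above (ID, Name, Path, State).
    static final class Item {
        final String id;
        final String name;
        final String path;
        final State state;

        Item(String id, String name, String path, State state) {
            this.id = id;
            this.name = name;
            this.path = path;
            this.state = state;
        }

        // Returns a copy of this item with a different state.
        Item withState(State newState) {
            return new Item(id, name, path, newState);
        }
    }

    private final Map<String, Item> cache = new ConcurrentHashMap<>();

    // Step 1: metadata only, stored before the file content is fetched.
    void storeCandidate(String id, String name, String path) {
        cache.put(id, new Item(id, name, path, State.CANDIDATE));
    }

    // Step 2: the crawler has sent the content back, so the item becomes a Document.
    void markDocument(String id) {
        cache.computeIfPresent(id, (key, item) -> item.withState(State.DOCUMENT));
    }

    Item get(String id) {
        return cache.get(id);
    }
}
```

An item is first written with `storeCandidate(...)` and later flipped with `markDocument(...)`; unknown IDs are simply ignored by `markDocument`.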
We'd like to support stopping the crawl before it finishes and then, on the next start, picking up where we left off. This essentially means crawling all Candidates but skipping the Documents. This is all straightforward.

The case that gets tricky is when the second crawl finishes: we'd then like to have the option of re-evaluating everything on the next crawl. We could do this by sending everything to the crawler. The problem is that if _this_ crawl is then stopped before finishing, the state of the items becomes ambiguous. Items that were not crawled still have their previous state, Document, and items that were crawled have had their state updated from Document to Document. The two cases cannot be told apart, so restarting the job causes everything to be re-crawled. Obviously this approach is flawed.

So we tried the simplest thing we could think of as a solution: at the end of a job that has finished (and was not manually stopped), update the state of every item in the cache back to Candidate. This does the trick. Unfortunately, it is slow. We have a custom cache store, which may or may not be the bottleneck. While it is simple, this is indeed a brute-force solution.

So I'm wondering if there's something in Ignite that could help? Or has anyone dealt with this kind of problem before who can offer ideas for a better way?

Thanks!
- Matt
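P.S. For reference, the resume logic and the end-of-job reset described above amount to roughly the following. This is a rough sketch under assumptions: a plain `Map` stands in for the Ignite cache, items are reduced to an ID and a state, and `CrawlResetSketch`, `pendingIds`, and `resetAll` are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the Ignite cache.
class CrawlResetSketch {
    enum State { CANDIDATE, DOCUMENT }

    // On resume, only items still in the Candidate state are sent to the crawler.
    static List<String> pendingIds(Map<String, State> states) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, State> entry : states.entrySet()) {
            if (entry.getValue() == State.CANDIDATE) {
                pending.add(entry.getKey());
            }
        }
        return pending;
    }

    // Brute-force reset after a job that ran to completion:
    // every entry is rewritten, so the cost grows with the size of the cache.
    static void resetAll(Map<String, State> states) {
        states.replaceAll((id, state) -> State.CANDIDATE);
    }
}
```

After `resetAll`, `pendingIds` returns every ID again, which is the "re-evaluate everything" behavior we want. The cost is that each entry gets written once per completed job.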