Advice for updating large cache

matt Thu, 28 Jun 2018 10:12:47 -0700

Hi,

We have a service, which is essentially a file system crawler, and we're
using Ignite to store the overall state of the job. The state is represented
by simple objects with fields like ID, Name, Path, and State. The State
field is either "Candidate" or "Document". A Candidate is metadata only, and
is inserted before the content of the file is actually fetched. Once a
Candidate is stored in Ignite, we then send it back to the crawler for it to
read the file and send back the content. Once we receive the content, we
update the State field for the item to Document.
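
To make the model concrete, here is a minimal sketch of the item and its two state transitions. It is not our actual code: the class and method names are made up for illustration, and a plain java.util.Map stands in for the Ignite cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlStateSketch {
    // The two states described above.
    public enum State { CANDIDATE, DOCUMENT }

    // Hypothetical item shape; the fields follow the description above.
    public static final class Item {
        public final String id;
        public final String name;
        public final String path;
        public State state;

        public Item(String id, String name, String path, State state) {
            this.id = id;
            this.name = name;
            this.path = path;
            this.state = state;
        }
    }

    // Stand-in for the Ignite cache, keyed by item ID (an assumption).
    public final Map<String, Item> cache = new LinkedHashMap<>();

    // Step 1: store metadata only, before the file content is fetched.
    public void addCandidate(String id, String name, String path) {
        cache.put(id, new Item(id, name, path, State.CANDIDATE));
    }

    // Step 2: the crawler has sent back the content, so promote the item.
    public void contentReceived(String id) {
        Item item = cache.get(id);
        if (item != null) {
            item.state = State.DOCUMENT;
        }
    }
}
```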


We'd like to be able to support stopping the crawl before it finishes, and
then on the next start, pick up where we left off. This essentially means
crawling all Candidates, but skipping the Documents. This is all
straightforward.
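
The resume step can be sketched roughly as below. Again, the names are hypothetical and a Map of ID to state stands in for the cache; in the real service the selection would run against Ignite rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ResumeSketch {
    public enum State { CANDIDATE, DOCUMENT }

    // Returns the IDs that still need to go to the crawler after a restart:
    // every Candidate, and none of the Documents.
    public static List<String> pendingIds(Map<String, State> states) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, State> entry : states.entrySet()) {
            if (entry.getValue() == State.CANDIDATE) {
                pending.add(entry.getKey());
            }
        }
        return pending;
    }
}
```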

The case that gets tricky is when the second crawl finishes: we'd like to
then have the option of re-evaluating everything on the next crawl. We could
do this by sending everything to the crawler. The problem is that if _this_
crawl is then stopped before finishing, the state of the items becomes
ambiguous: items that were not crawled still have their previous state,
Document, and the items that were crawled have had their state set from
Document to Document. This means that restarting the job causes everything
to be re-crawled.
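
A small simulation shows why the states become ambiguous. This is only an illustration with hypothetical names: it re-crawls the first few items of a cache where everything is already a Document, then stops.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AmbiguitySketch {
    public enum State { CANDIDATE, DOCUMENT }

    // Simulates a full re-crawl that is stopped after crawledCount items.
    // Returns a copy of the states as they would look after the stop.
    public static Map<String, State> interruptedRecrawl(Map<String, State> before,
                                                        int crawledCount) {
        Map<String, State> after = new LinkedHashMap<>(before);
        int done = 0;
        for (String id : before.keySet()) {
            if (done >= crawledCount) {
                break; // the crawl was stopped here
            }
            after.put(id, State.DOCUMENT); // re-crawled: Document -> Document
            done++;
        }
        return after;
    }
}
```

If every item starts as a Document, the map returned after the stop is equal to the map before the crawl, so nothing in the stored state says which items were already re-crawled.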

Obviously this approach is flawed. So we tried the simplest thing we could
think of as a solution: at the end of a job that has finished (and was not
manually stopped), update the state of every item in the cache back to
Candidate. This does the trick. Unfortunately, it is slow; we have a
custom cache store, which may or may not be the bottleneck. While it is
simple, this is indeed a brute-force solution.
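
In outline, the reset looks something like the sketch below. The names are hypothetical and a Map stands in for the cache. The point is that it performs one write per entry, so the cost grows with the size of the cache.

```java
import java.util.Map;

public class ResetSketch {
    public enum State { CANDIDATE, DOCUMENT }

    // Brute-force reset run at the end of a job that finished on its own:
    // rewrite every entry back to Candidate. Returns the number of writes.
    public static int resetAll(Map<String, State> states) {
        int writes = 0;
        for (Map.Entry<String, State> entry : states.entrySet()) {
            entry.setValue(State.CANDIDATE); // one write per item
            writes++;
        }
        return writes;
    }
}
```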

So I'm wondering if there's something in Ignite that could help? Or if
anyone has dealt with this kind of problem before and can offer ideas for a
better way?

Thanks!
- Matt



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
