On Mon, May 27, 2019 at 5:58 PM Karl Wright wrote:
>
> (1) There should be no new tables needed for any of this. Your seed list
> can be stored in the job specification information. See the rss connector
> for a simple example of how this might be done.
Are you assuming the seed list is static?
One seed per job is an interesting approach, but in the interests of
fully understanding the alternatives, let me consider choice #2.
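Karl's point (1) above, keeping the seed list inside the job specification rather than in a new table, can be sketched roughly as below. This is a minimal illustration, not real ManifoldCF code: the `SpecNode` class is a stand-in for ManifoldCF's `SpecificationNode`, and the `"seed"` node type is an invented name (the rss connector stores its seed URLs as child nodes of the job's `Specification` object in a broadly similar way).

```java
import java.util.ArrayList;
import java.util.List;

public class SeedSpecSketch {
  // Stand-in for ManifoldCF's SpecificationNode; real nodes carry a type
  // string plus named attributes, modeled here as just (type, url).
  public static class SpecNode {
    public final String type;
    public final String url;
    public SpecNode(String type, String url) {
      this.type = type;
      this.url = url;
    }
  }

  // Collect every seed URL stored as a child node of the job specification.
  // The "seed" node type is a hypothetical name used for illustration.
  public static List<String> extractSeeds(List<SpecNode> specChildren) {
    List<String> seeds = new ArrayList<>();
    for (SpecNode node : specChildren) {
      if ("seed".equals(node.type)) {
        seeds.add(node.url);
      }
    }
    return seeds;
  }

  public static void main(String[] args) {
    List<SpecNode> spec = new ArrayList<>();
    spec.add(new SpecNode("seed", "http://example.com/feed1"));
    spec.add(new SpecNode("maxdocs", null)); // unrelated setting, skipped
    spec.add(new SpecNode("seed", "http://example.com/feed2"));
    System.out.println(extractSeeds(spec));
    // prints [http://example.com/feed1, http://example.com/feed2]
  }
}
```

Because the seeds live in the job specification, editing the seed list is just editing the job, and no extra schema or table maintenance is needed.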
> you might want to combine this all into one job, but then you would need to
> link your documents somehow to the seed they came from, so that if the seed
> was
Thanks for your help Karl. So I think I'm converging on a design.
First of all, per your recommendation, I've switched to scheduled
crawl and it executes as expected every minute with the "schedule
window anytime" setting.
My next problem is dealing with seed deletion. My upstream source
actually
Hi Raman,
(1) Continuous crawl is not a good model for you. It's meant for crawling
large web domains, not the kind of task you are doing.
(2) Scheduled crawl will work fine for you if you simply tell it "start
within schedule window" and make sure your schedule completely covers
all times, 24x7. So
Yes, we are indeed running it in continuous crawl mode. Scheduled mode
works, but given we have a delta API, we thought this was what made
sense, as the delta API is efficient and we don't need to wait an
entire day for a scheduled job to run. I see that if I change recrawl
interval and max
So MODEL_ADD_CHANGE does not work for you, eh?
You were saying that addSeedDocuments is being called every minute,
correct? It sounds to me like you are running this job in continuous crawl
mode. Can you try running the job in non-continuous mode, and just
repeating the job run once it
For any given job run, all documents that are added via addSeedDocuments()
should be processed. There is no magic in the framework that somehow knows
that a document has been created vs. modified vs. deleted until
processDocuments() is called. If your claim is that this contract is not
being
For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says
that you have to include *at least* the documents that were changed, added,
or deleted since the previous stamp, and if no stamp is provided, it should
return ALL specified documents. Are you doing that?
If you are, the
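The ADD_CHANGE_DELETE contract described above can be sketched in plain Java. Everything below (the `SourceSystem` interface, method names, and token format) is an invented stand-in for the source system's delta API, not real ManifoldCF code; the point is only the branch on whether a previous stamp exists:

```java
import java.util.List;

public class DeltaSeedingSketch {
  // Invented stand-in for one delta API response: the document ids that were
  // added, changed, or deleted, plus the token to persist for the next run.
  public static class Delta {
    public final List<String> documentIds;
    public final String newStamp;
    public Delta(List<String> documentIds, String newStamp) {
      this.documentIds = documentIds;
      this.newStamp = newStamp;
    }
  }

  // Invented stand-in for the source system; not a ManifoldCF interface.
  public interface SourceSystem {
    List<String> listAllDocumentIds(); // full enumeration
    String currentStamp();             // token representing "now"
    Delta deltaSince(String stamp);    // incremental enumeration
  }

  // The ADD_CHANGE_DELETE seeding contract: no stamp -> seed ALL documents;
  // stamp -> seed at least everything added/changed/deleted since it.
  public static Delta seedDocuments(SourceSystem source, String lastStamp) {
    if (lastStamp == null) {
      return new Delta(source.listAllDocumentIds(), source.currentStamp());
    }
    return source.deltaSince(lastStamp);
  }

  // Tiny in-memory source used for the demo below.
  public static SourceSystem demoSource() {
    return new SourceSystem() {
      public List<String> listAllDocumentIds() {
        return List.of("doc1", "doc2", "doc3");
      }
      public String currentStamp() { return "token-3"; }
      public Delta deltaSince(String stamp) {
        // Pretend doc2 changed and doc3 was deleted since the given stamp.
        return new Delta(List.of("doc2", "doc3"), "token-3");
      }
    };
  }

  public static void main(String[] args) {
    SourceSystem src = demoSource();
    System.out.println(seedDocuments(src, null).documentIds);      // [doc1, doc2, doc3]
    System.out.println(seedDocuments(src, "token-1").documentIds); // [doc2, doc3]
  }
}
```

Note that a deleted document id is still seeded; the framework only learns whether an id was added, changed, or deleted later, when processDocuments() handles it.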
My team is creating a new repository connector. The source system has
a delta API that lets us know of all new, modified, and deleted
individual folders and documents since the last call to the API. Each
call to the delta API provides the changes, as well as a token which
can be provided on