Re: Repository connector for source with delta API

2019-05-27 Thread Raman Gupta
On Mon, May 27, 2019 at 5:58 PM Karl Wright wrote: > (1) There should be no new tables needed for any of this. Your seed list can be stored in the job specification information. See the rss connector for a simple example of how this might be done. Are you assuming the seed list is static?

Re: Repository connector for source with delta API

2019-05-27 Thread Raman Gupta
One seed per job is an interesting approach, but in the interests of fully understanding the alternatives, let me consider choice #2. > you might want to combine this all into one job, but then you would need to link your documents somehow to the seed they came from, so that if the seed was

Re: Repository connector for source with delta API

2019-05-27 Thread Raman Gupta
Thanks for your help, Karl. So I think I'm converging on a design. First of all, per your recommendation, I've switched to a scheduled crawl, and it executes as expected every minute with the "schedule window anytime" setting. My next problem is dealing with seed deletion. My upstream source actually

Re: Repository connector for source with delta API

2019-05-24 Thread Karl Wright
Hi Raman, (1) Continuous crawl is not a good model for you. It's meant for crawling large web domains, not the kind of task you are doing. (2) Scheduled crawl will work fine for you if you simply tell it "start within schedule window" and make sure your schedule completely covers 7x24 times. So

Re: Repository connector for source with delta API

2019-05-24 Thread Raman Gupta
Yes, we are indeed running it in continuous crawl mode. Scheduled mode works, but given that we have a delta API, we thought this was what made sense: the delta API is efficient, and we don't need to wait an entire day for a scheduled job to run. I see that if I change recrawl interval and max

Re: Repository connector for source with delta API

2019-05-24 Thread Karl Wright
So MODEL_ADD_CHANGE does not work for you, eh? You were saying that addSeedDocuments is being called every minute, correct? It sounds to me like you are running this job in continuous crawl mode. Can you try running the job in non-continuous mode, and just repeating the job run once it

Re: Repository connector for source with delta API

2019-05-24 Thread Karl Wright
For any given job run, all documents that are added via addSeedDocuments() should be processed. There is no magic in the framework that somehow knows that a document has been created vs. modified vs. deleted until processDocuments() is called. If your claim is that this contract is not being

Re: Repository connector for source with delta API

2019-05-24 Thread Karl Wright
For ADD_CHANGE_DELETE, the contract for addSeedDocuments() basically says that you have to include *at least* the documents that were changed, added, or deleted since the previous stamp, and if no stamp is provided, it should return ALL specified documents. Are you doing that? If you are, the
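The contract described above can be sketched roughly as follows. Note this is a minimal, self-contained illustration of the seeding rule, not the actual ManifoldCF `addSeedDocuments()` API: the class, field, and method names here are hypothetical stand-ins.

```java
import java.util.*;

// Sketch of the ADD_CHANGE_DELETE seeding contract: with no prior stamp,
// seed ALL documents; with a stamp, seed at least every document added,
// changed, or deleted since that stamp. Names are hypothetical, not the
// real ManifoldCF connector API.
public class SeedingSketch {
    // Stand-in repository: document id -> last-modified stamp.
    static final Map<String, Long> repository = new TreeMap<>();

    static List<String> addSeedDocuments(Long lastStamp) {
        List<String> seeds = new ArrayList<>();
        for (Map.Entry<String, Long> e : repository.entrySet()) {
            // No stamp yet means a full crawl: every document is a seed.
            // Otherwise, only documents touched since the stamp must be seeded.
            if (lastStamp == null || e.getValue() > lastStamp) {
                seeds.add(e.getKey());
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        repository.put("a", 100L);
        repository.put("b", 200L);
        System.out.println(addSeedDocuments(null)); // full crawl: [a, b]
        System.out.println(addSeedDocuments(150L)); // delta only: [b]
    }
}
```

The key design point is the two-branch behavior: a missing stamp must degrade gracefully to a complete re-seed, which is what lets the framework recover if its saved state is lost.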

Repository connector for source with delta API

2019-05-24 Thread Raman Gupta
My team is creating a new repository connector. The source system has a delta API that lets us know of all new, modified, and deleted individual folders and documents since the last call to the API. Each call to the delta API provides the changes, as well as a token which can be provided on
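The delta-token pattern described above can be illustrated with a small self-contained sketch. The `DeltaPage` type, `fetchDelta` method, and field names are invented for illustration; they are not the actual source system's API.

```java
import java.util.*;

// Hypothetical sketch of the delta-API polling pattern: each call returns
// the changes since the supplied token, plus a new token to pass next time.
public class DeltaPoll {
    // One page of changes plus the continuation token for the next call.
    static class DeltaPage {
        final List<String> changedIds;
        final String nextToken;
        DeltaPage(List<String> changedIds, String nextToken) {
            this.changedIds = changedIds;
            this.nextToken = nextToken;
        }
    }

    // Stand-in for the source system's delta endpoint: a null token means
    // "return everything"; otherwise only changes since that token.
    static DeltaPage fetchDelta(String token) {
        if (token == null) {
            return new DeltaPage(Arrays.asList("doc1", "doc2", "doc3"), "t1");
        }
        return new DeltaPage(Arrays.asList("doc2"), "t2");
    }

    public static void main(String[] args) {
        // First crawl: no token, so every document comes back.
        DeltaPage first = fetchDelta(null);
        System.out.println(first.changedIds.size()); // 3
        // Next crawl: resume from the stored token; only the delta comes back.
        DeltaPage next = fetchDelta(first.nextToken);
        System.out.println(next.changedIds); // [doc2]
    }
}
```

For a connector, the token would be persisted between job runs (for example, in the connector's version/state string) so that each crawl resumes exactly where the previous one left off.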