Hi Mario,

Setting up a schedule does not prevent you from starting the job manually.

But it sounds like you understand the solution.

Thanks,
Karl
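
The point above — that a schedule window only governs automatic starts, while a manual start is always allowed — can be sketched as follows. This is a hypothetical illustration of the semantics, not ManifoldCF's actual code; the function name and parameters are invented for clarity.

```python
from datetime import datetime, time, timedelta

def may_start(now: datetime, window_start: time, window_hours: int,
              manual: bool) -> bool:
    """Return True if a job may start at 'now'.

    A manual start is always permitted; an automatic start is
    permitted only while 'now' falls inside the schedule window.
    """
    if manual:
        return True  # manual starts ignore the window entirely
    opens = datetime.combine(now.date(), window_start)
    closes = opens + timedelta(hours=window_hours)
    return opens <= now < closes
```

With a window that opens at midnight and runs 24 hours, an automatic start is always allowed; with a narrow window, a manual start still succeeds outside it.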



On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario <[email protected]>
wrote:

> OK, so I think I have understood better now.
>
>
>
> But I have 3800 .pdf documents, so a “full crawl” through Tika is very
> slow: it takes 2 days. (Perhaps I need to increase RAM?)
>
>
>
> I am using the “web connector”, so I see the “Start minimal” option.
>
>
>
> I understand that I can do this:
> 1) a full crawl on Saturday night, so it deletes orphaned files
>
> 2) a minimal crawl every night except Saturday, so it crawls only changed
> documents
>
> Are 1) and 2) right, or have I misunderstood?
>
> Furthermore, I'm still not clear on the option
> “Start even inside a scheduled window”, because I tried “Start when
> scheduled window starts” but I am able to start the job manually, too.
>
>
>
> Thanks a lot!
>
> Mario
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, 12 August 2014 14:54
>
> *To:* [email protected]
> *Subject:* Re: How delete unreachable documents on continous crawling?
>
>
>
> Hi Mario,
>
> What I would do is set up a single job.  (Multiple jobs that share the
> same documents may work but they aren't recommended because a document must
> vanish from ALL jobs that share it before it is removed.)  There are two
> different possibilities for the schedule, depending on the kind of
> connector you are using:
>
> (1) Repeated full crawls
>
> (2) Mostly minimal crawls, with periodic full crawls
>
> If the connector you are using makes any distinction between minimal and
> full crawls, then (2) would probably be more efficient for you.  But only
> on full crawls will unreachable documents be removed.
>
> To do the setup:
>
> -- you will need multiple scheduling records for (2), but may be able to
> do (1) with a single scheduling record
>
> -- for each day, you want the window to start at midnight, and its length
> to be the equivalent of 24 hours
>
> -- you want to select the option to start crawls in the middle of a
> window, not just at the beginning
>
> This should give you what you want.
> Karl
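
The setup Karl describes — one scheduling record per day, each window opening at midnight with a 24-hour length, a full crawl on one night and minimal crawls on the others — can be sketched as below. This is a hypothetical illustration only; the record fields are invented for the sketch and are not the ManifoldCF API.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

def weekly_schedule(full_crawl_day: str = "Saturday"):
    """Build one scheduling record per day of the week.

    Every window opens at midnight and spans 24 hours, and jobs may
    start even in the middle of a window; the chosen night gets a
    full crawl (which removes unreachable documents), the rest get
    minimal crawls (changed documents only).
    """
    return [
        {
            "day": day,
            "window_start": "00:00",      # window opens at midnight
            "window_hours": 24,           # and spans the whole day
            "start_inside_window": True,  # start even mid-window
            "crawl": "full" if day == full_crawl_day else "minimal",
        }
        for day in DAYS
    ]
```

This yields seven records, exactly one of which is a full crawl.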
>
>
>
> On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario <[email protected]>
> wrote:
>
> So, I suppose, the best solution could be:
>
> continuous recrawling, plus one periodic recrawl to delete orphaned
> documents.
>
>
>
> Can I overlap the two jobs?
>
>
>
> *Mario Bisonti*
>
> Information and Communications Technology
>
>
>
> VIMAR SpA
>
> Tel. +39 0424 488 644
>
> [email protected]
>
> *Take care of the environment. Print only if necessary.*
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, 12 August 2014 12:21
>
> *To:* [email protected]
> *Subject:* Re: How delete unreachable documents on continous crawling?
>
>
>
> Hi Mario,
>
> Yes, periodic recrawling allows ManifoldCF the opportunity to discover
> abandoned documents and remove them.
>
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario <[email protected]>
> wrote:
>
> OK, thanks.
>
>
>
> So you suggest that I not use continuous crawling, and instead schedule a
> periodic re-crawl of all documents?
>
> Is that better?
>
> Thanks a lot.
>
>
> *Mario*
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, 12 August 2014 12:16
> *To:* [email protected]
> *Subject:* Re: How delete unreachable documents on continous crawling?
>
>
>
> Hi Mario,
>
> Please read ManifoldCF in Action Chapter 1.  Continuous crawling has no
> mechanism for deleting unreachable documents, and never will, because it is
> fundamentally impossible to do.
>
> Thanks,
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario <[email protected]>
> wrote:
>
> Hello.
>
> I set up continuous crawling on a folder of a website to index the PDF
> files it contains.
>
>
>
> Schedule type: Rescan documents dynamically
>
> Recrawl interval (if continuous): 5
>
>
>
> I see that if documents are added to the folder, they are indexed, but if
> documents are deleted they aren't removed from the index.
>
> I see that “ManifoldCF in Action” mentions “…that continuous crawling
> seems to be missing a phase – the ‘delete unreachable documents’ phase.”
>
>
>
> But how could I solve this problem, please?
>
> Thanks a lot for your help.
> Mario