Hi Bisonti,

I meant the throttling parameters on the "Bandwidth" tab.

http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#webrepository
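
For illustration, a single throttle bin on that tab might look like the
following (hypothetical values; the exact field labels can vary by
version, so check the page linked above):

    Bin regular expression:  (empty - matches every domain)
    Max connections:         10
    Max KB/sec:              64
    Max fetches/minute:      12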

Karl



On Wed, Aug 27, 2014 at 6:36 AM, Bisonti Mario <[email protected]>
wrote:

>  Thanks a lot.
>
> I understood the difference between full crawls and minimal crawls.
>
>
>
> Third, regarding throttling:
>
>
>
> For the web repository connection, I set throttling = 100.
>
> For the Solr output connection, I set throttling max connections = 1000.
>
>
>
> I am using ManifoldCF 1.7
>
>
>
>
>
> My documents are PDF files, so Tika performs the content extraction.
>
>
>
> Karl, do you think the throttling parameters are right?
>
>
>
> Thanks a lot!
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Wednesday, August 27, 2014 12:03
>
> *To:* [email protected]
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> First, you don't need a lot of memory for ManifoldCF, although you may
> need it for your search index (e.g. Solr).
>
> Second, different connectors behave differently for full crawls vs.
> minimal crawls.  The web connector makes no distinction, except for the
> removal of unreachable documents at the end of the crawl.
>
> Third, most of the time in your crawl is probably going into waiting
> because of throttling.  Depending on what you are crawling, and whether it
> is your own local pages, you might want to relax the throttling
> constraints.  It is also the case that ManifoldCF 1.5 had a bug in the
> throttling code that made byte-rate throttling 1000x too restrictive.  This
> was fixed in 1.6.
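>
> (To illustrate the scale of that bug, here is a minimal, hypothetical
> sketch of a KB-versus-bytes units mix-up, which is one way a byte-rate
> throttle ends up 1000x too restrictive. This is illustrative arithmetic
> only, not the actual ManifoldCF code.)
>
>     # Seconds to wait between fetches so the average transfer rate
>     # stays under the configured throttle.
>     def wait_between_fetches(doc_bytes, configured_rate, kb_units):
>         bytes_per_sec = configured_rate * 1000 if kb_units else configured_rate
>         return doc_bytes / bytes_per_sec
>
>     doc = 500_000  # a 500 KB PDF
>     print(wait_between_fetches(doc, 64, kb_units=True))   # ~7.8 s   (intended)
>     print(wait_between_fetches(doc, 64, kb_units=False))  # ~7812 s  (the 1000x bug)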
>
> Karl
>
>
>
>
>
> On Wed, Aug 27, 2014 at 5:38 AM, Bisonti Mario <[email protected]>
> wrote:
>
>
>
> Hello.
>
>
>
> I increased RAM to 4GB and manually executed the job to crawl the “Web
> repository” containing 3800 PDF documents.
>
>
>
> I understood that “Start” executes a full scan, while “Start minimal”
> executes an incremental scan of modified documents only.
>
>
>
>
>
> I executed the job with “Start”: it took nearly 20 hours.
>
> Then
>
> I executed the job with “Start minimal”: it rescanned the same 3800
> documents, so it again took 20 hours.
>
>
>
> Why is this?
>
>
>
> Note that no new documents were added between the moment I started the job
> with “Start” and the time I started the job with “Start minimal”.
>
>
>
>
>
> Thanks for your help!
>
>
>
> Mario
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, August 12, 2014 17:26
>
> *To:* [email protected]
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> Setting up a schedule does not prevent you from starting the job manually.
>
> But it sounds like you understand the solution.
>
> Thanks,
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario <[email protected]>
> wrote:
>
>  OK, so I think I understand better now.
>
>
>
> But I have 3800 PDF documents, so a “full crawl” through Tika takes a very
> long time: about 2 days. (Perhaps I need to increase RAM?)
>
>
>
> I am using the “web connector”, so I see the “Start minimal” option.
>
>
>
> I understand that I can do this:
> 1) a full crawl on Saturday night, so that orphaned files are deleted
>
> 2) a minimal crawl every night except Saturday, so that only changed
> documents are crawled
>
>
>
> Are 1) and 2) right, or have I misunderstood?
>
>
>
>
>
> Furthermore, the option “Start even inside a scheduled window” is not
> clear to me, because I tried “Start when scheduled window start” but I am
> able to start the job manually, too.
>
>
>
> Thanks a lot!
>
>
>
>
>
> Mario
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, August 12, 2014 14:54
>
> *To:* [email protected]
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> What I would do is set up a single job.  (Multiple jobs that share the
> same documents may work but they aren't recommended because a document must
> vanish from ALL jobs that share it before it is removed.)  There are two
> different possibilities for the schedule, depending on the kind of
> connector you are using:
>
> (1) Repeated full crawls
>
> (2) Mostly minimal crawls, with periodic full crawls
>
> If the connector you are using makes any distinction between minimal and
> full crawls, then (2) would probably be more efficient for you.  But only
> on full crawls will unreachable documents be removed.
>
> To do the setup:
>
> -- you will need multiple scheduling records for (2), but may be able to
> do (1) with a single scheduling record
>
> -- for each day, you want the window to start at midnight, and its length
> to be the equivalent of 24 hours
>
> -- you want to select the option to start crawls in the middle of a
> window, not just at the beginning
>
> This should give you what you want.
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario <[email protected]>
> wrote:
>
>  So, I suppose, the best solution could be:
>
> Continuous recrawling, plus a periodic full recrawl to delete orphaned
> documents.
>
>
>
> Can I overlap the two jobs?
>
>
>
> *Mario Bisonti*
>
> Information and Communications Technology
>
>
>
> VIMAR SpA
>
> Tel. +39 0424 488 644
>
> [email protected]
>
> Take care of the environment. Print only if necessary.
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, August 12, 2014 12:21
>
> *To:* [email protected]
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> Yes, periodic recrawling allows ManifoldCF the opportunity to discover
> abandoned documents and remove them.
>
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario <[email protected]>
> wrote:
>
>  OK, thanks.
>
>
>
> So you suggest that I not use continuous crawling, and instead schedule a
> periodic recrawl of all documents?
>
> Is that better?
>
> Thanks a lot.
>
>
> *Mario*
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Tuesday, August 12, 2014 12:16
> *To:* [email protected]
> *Subject:* Re: How to delete unreachable documents on continuous crawling?
>
>
>
> Hi Mario,
>
> Please read ManifoldCF in Action Chapter 1.  Continuous crawling has no
> mechanism for deleting unreachable documents, and never will, because it is
> fundamentally impossible to do.
>
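> Here is a sketch of why. It is an illustration in plain Python, not code
> from the book or from ManifoldCF: "unreachable" is only well defined at
> the moment a crawl terminates, and a continuous crawl never terminates.
>
>     # index: URLs currently in the search index; links: URL -> outlinks.
>     def full_crawl(seeds, links, index):
>         seen, queue = set(), list(seeds)
>         while queue:
>             url = queue.pop()
>             if url in seen:
>                 continue
>             seen.add(url)
>             index.add(url)
>             queue.extend(links.get(url, []))
>         # Only here, after the traversal has finished, do we know what is
>         # unreachable: everything indexed earlier but never seen this pass.
>         for url in index - seen:
>             index.discard(url)
>         # A continuous crawl never leaves the loop, so this phase never runs.
>
>     links = {"/": ["/a", "/b"], "/a": [], "/b": []}
>     index = {"/", "/a", "/b", "/old.pdf"}  # /old.pdf was removed from the site
>     full_crawl(["/"], links, index)
>     print(index)  # /old.pdf is gone from the index
>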
> Thanks,
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario <[email protected]>
> wrote:
>
>  Hello.
>
> I set up continuous crawling on a folder of a website to index the PDF
> files it contains.
>
>
>
> Schedule type: Rescan documents dynamically
>
> Recrawl interval (if continuous): 5
>
>
>
> I see that if documents are added to the folder they are indexed, but if
> documents are deleted they are not removed from the index.
>
> I see that in “ManifoldCF in Action” it is mentioned “…that continuous
> crawling seems to be missing a phase – the “delete unreachable documents”
> phase.”
>
>
>
> But how could I solve this problem, please?
>
> Thanks a lot for your help.
> Mario
>
