Hi Bisonti, I meant the throttling parameters on the "Bandwidth" tab.
http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#webrepository

Karl

On Wed, Aug 27, 2014 at 6:36 AM, Bisonti Mario <[email protected]> wrote:

> Thanks a lot.
>
> I understood about full crawls vs. minimal crawls.
>
> Third, throttling:
>
> For the web repository connection I set throttling = 100.
> For the Solr output connection I set Throttling, max connections = 1000.
>
> I am using ManifoldCF 1.7.
>
> My documents are .pdf files, so Tika extracts the content.
>
> Karl, do you think that the throttling parameters are right?
>
> Thanks a lot!
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Wednesday, August 27, 2014 12:03
> To: [email protected]
> Subject: Re: How to delete unreachable documents on continuous crawling?
>
> Hi Mario,
>
> First, you don't need a lot of memory for ManifoldCF, although you may
> need it for your search index (e.g. Solr).
>
> Second, different connectors behave differently for full crawls vs.
> minimal crawls. The web connector makes no distinction, except for the
> removal of unreachable documents at the end of the crawl.
>
> Third, most of the time in your crawl is probably going into waiting
> because of throttling. Depending on what you are crawling, and whether it
> is your own local pages, you might want to relax the throttling
> constraints. It is also the case that ManifoldCF 1.5 had a bug in the
> throttling code that made byte-rate throttling 1000x too restrictive. This
> was fixed in 1.6.
>
> Karl
>
> On Wed, Aug 27, 2014 at 5:38 AM, Bisonti Mario <[email protected]> wrote:
>
> Hello.
>
> I increased RAM to 4 GB and manually executed the job to crawl the "Web
> repository" containing 3800 PDF documents.
>
> I understood that "Start" executes a full scan, while "Start minimal"
> executes an incremental scan of modified documents only.
>
> I executed the job with "Start": it took nearly 20 hours.
> Then I executed the job with "Start minimal": it rescanned the same 3800
> documents, so it also took 20 hours.
>
> Why is this?
>
> Note that no new documents were added between the time I started the job
> with "Start" and the time I started it with "Start minimal".
>
> Thanks for your help!
>
> Mario
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Tuesday, August 12, 2014 17:26
> To: [email protected]
> Subject: Re: How to delete unreachable documents on continuous crawling?
>
> Hi Mario,
>
> Setting up a schedule does not prevent you from starting the job manually.
>
> But it sounds like you understand the solution.
>
> Thanks,
> Karl
>
> On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario <[email protected]> wrote:
>
> OK, I think I have understood better now.
>
> But I have 3800 .pdf documents, so a "full crawl" through Tika is very
> long: it takes 2 days. (Perhaps I need to increase RAM?)
>
> I am using the web connector, so I see the "Start minimal" option.
>
> I understand that I can do this:
> 1) a full crawl on Saturday night, so it deletes orphaned files
> 2) a minimal crawl every night except Saturday, so it crawls only changed
> documents
>
> Are 1) and 2) right, or haven't I understood?
>
> Furthermore, the option "Start even inside a scheduled window" is not yet
> clear to me: I tried "Start when scheduled window starts", but I am able
> to start the job manually, too.
>
> Thanks a lot!
>
> Mario
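An aside on the two scheduling options Mario mentions: "Start when scheduled window starts" fires a job only at the window boundary, while "Start even inside a scheduled window" also lets a due job begin mid-window; manual starts bypass the window either way, as Karl notes above. Below is a minimal Java sketch of that semantics, assuming the midnight-start, 24-hour window Karl recommends further down the thread; all names are hypothetical illustrations, not ManifoldCF internals.

    import java.time.Duration;
    import java.time.LocalDateTime;
    import java.time.LocalTime;

    // Hypothetical sketch of schedule-window semantics; not ManifoldCF source.
    public class ScheduleWindowSketch {

        // The window Karl recommends below: starts at midnight, lasts 24 hours,
        // so the job may run at any time of day.
        static final LocalTime WINDOW_START = LocalTime.MIDNIGHT;
        static final Duration WINDOW_LENGTH = Duration.ofHours(24);

        // True if 'now' falls inside today's scheduling window.
        static boolean insideWindow(LocalDateTime now) {
            LocalDateTime start = now.toLocalDate().atTime(WINDOW_START);
            return !now.isBefore(start) && now.isBefore(start.plus(WINDOW_LENGTH));
        }

        // Decide whether a crawl may begin. Manual requests bypass the window,
        // which is why Mario can still start the job by hand.
        static boolean shouldStart(LocalDateTime now, boolean startInsideWindow,
                                   boolean manualRequest) {
            if (manualRequest) return true;
            LocalDateTime start = now.toLocalDate().atTime(WINDOW_START);
            boolean atWindowStart =
                Duration.between(start, now).abs().compareTo(Duration.ofMinutes(1)) < 0;
            return startInsideWindow ? insideWindow(now) : atWindowStart;
        }
    }

With a 24-hour window, insideWindow() is always true, so "start even inside a scheduled window" lets the scheduler launch the job whenever it comes due, not just at midnight.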
> From: Karl Wright [mailto:[email protected]]
> Sent: Tuesday, August 12, 2014 14:54
> To: [email protected]
> Subject: Re: How to delete unreachable documents on continuous crawling?
>
> Hi Mario,
>
> What I would do is set up a single job. (Multiple jobs that share the
> same documents may work, but they aren't recommended, because a document
> must vanish from ALL jobs that share it before it is removed.) There are
> two different possibilities for the schedule, depending on the kind of
> connector you are using:
>
> (1) Repeated full crawls
>
> (2) Mostly minimal crawls, with periodic full crawls
>
> If the connector you are using makes any distinction between minimal and
> full crawls, then (2) would probably be more efficient for you. But only
> on full crawls will unreachable documents be removed.
>
> To do the setup:
>
> -- you will need multiple scheduling records for (2), but may be able to
> do (1) with a single scheduling record
>
> -- for each day, you want the window to start at midnight, and its length
> to be the equivalent of 24 hours
>
> -- you want to select the option to start crawls in the middle of a
> window, not just at the beginning
>
> This should give you what you want.
>
> Karl
>
> On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario <[email protected]> wrote:
>
> So, I suppose, the best solution could be: continuous recrawling, plus a
> periodic recrawl to delete orphaned documents.
>
> Can I overlap the two jobs?
>
> Mario Bisonti
> Information and Communications Technology
> VIMAR SpA
> Tel. +39 0424 488 644
> [email protected]
> Take care of the environment. Print only if necessary.
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Tuesday, August 12, 2014 12:21
> To: [email protected]
> Subject: Re: How to delete unreachable documents on continuous crawling?
>
> Hi Mario,
>
> Yes, periodic recrawling allows ManifoldCF the opportunity to discover
> abandoned documents and remove them.
>
> Karl
>
> On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario <[email protected]> wrote:
>
> OK, thanks.
>
> So you suggest that I not use continuous crawling, and instead schedule a
> periodic re-crawl of all documents?
>
> Is that better?
>
> Thanks a lot.
>
> Mario
>
> From: Karl Wright [mailto:[email protected]]
> Sent: Tuesday, August 12, 2014 12:16
> To: [email protected]
> Subject: Re: How to delete unreachable documents on continuous crawling?
>
> Hi Mario,
>
> Please read ManifoldCF in Action, Chapter 1. Continuous crawling has no
> mechanism for deleting unreachable documents, and never will, because it
> is fundamentally impossible to do.
>
> Thanks,
> Karl
>
> On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario <[email protected]> wrote:
>
> Hello.
>
> I set up continuous crawling on a folder of a website to index the PDF
> files it contains:
>
> Schedule type: Rescan documents dynamically
> Recrawl interval (if continuous): 5
>
> I see that if documents are added to the folder they are indexed, but if
> documents are deleted they aren't removed from the index.
>
> I see that "ManifoldCF in Action" mentions "...that continuous crawling
> seems to be missing a phase – the 'delete unreachable documents' phase."
>
> But how could I solve the problem, please?
>
> Thanks a lot for your help.
> Mario
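A back-of-the-envelope check of the throttling discussion at the top of the thread: if the Bandwidth-tab rate were effectively 100 KB/sec and the 3800 PDFs averaged about 2 MB each (both assumptions, neither confirmed in the thread), byte-rate throttling alone would predict crawls of roughly the 20 hours Mario observed.

    // Back-of-the-envelope estimate; the 100 KB/sec interpretation and the
    // 2 MB average PDF size are assumptions, not figures from the thread.
    public class ThrottleEstimate {
        public static void main(String[] args) {
            double docs = 3800;                       // PDFs in the web repository
            double avgDocBytes = 2_000_000;           // assumed average PDF size
            double throttleBytesPerSec = 100 * 1024;  // assumed 100 KB/sec cap

            double hours = docs * avgDocBytes / throttleBytesPerSec / 3600;
            System.out.printf("Estimated crawl time: %.1f hours%n", hours);
            // Prints ~20.6 hours. Under the ManifoldCF 1.5 bug Karl mentions,
            // byte-rate throttling was 1000x too restrictive, so the same
            // fetch volume would have been paced 1000x slower; fixed in 1.6.
        }
    }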

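Finally, why deleting unreachable documents is "fundamentally impossible" under continuous crawling: "unreachable" is only defined relative to a complete traversal from the seeds, and a continuous crawl never completes one, so it can never distinguish "no longer linked" from "not visited yet". A full crawl can, which is what the delete phase relies on. A minimal mark-and-sweep sketch of that idea (illustrative names, not ManifoldCF internals):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative mark-and-sweep over a link graph; not ManifoldCF source.
    public class UnreachableDocSweep {

        // Mark phase: walk the whole link graph from the seeds (a full crawl).
        static Set<String> reachable(Set<String> seeds,
                                     Map<String, Set<String>> links) {
            Set<String> marked = new HashSet<>(seeds);
            Deque<String> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                for (String child : links.getOrDefault(queue.pop(), Set.of())) {
                    if (marked.add(child)) queue.push(child);
                }
            }
            return marked;
        }

        // Sweep phase: anything indexed but unmarked is an orphan and can be
        // deleted -- but only because the mark phase visited everything.
        // A continuous crawl never finishes marking, so it can never sweep.
        static Set<String> orphans(Set<String> indexed, Set<String> marked) {
            Set<String> result = new HashSet<>(indexed);
            result.removeAll(marked);
            return result;
        }
    }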