OK, you are right. On the Bandwidth tab I see:

Max connections: 10
Max kbytes/sec: 256
Max fetches/min: 12
Can I increase that to:

Max connections: 100
Max kbytes/sec: 256
Max fetches/min: 120

Would those be good values?

Thanks
Mario

From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, August 27, 2014 13:19
To: [email protected]
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Bisonti,

I meant the throttling parameters on the "Bandwidth" tab.
http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#webrepository

Karl

On Wed, Aug 27, 2014 at 6:36 AM, Bisonti Mario <[email protected]> wrote:

Thanks a lot. I now understand the difference between full crawls and minimal crawls.

On the third point, throttling: for the web repository connection I set throttling = 100, and for the Solr output connection I set max connections = 1000. I am using ManifoldCF 1.7. My documents are .pdf files, so Tika extracts their content.

Karl, do you think those throttling parameters are right?

Thanks a lot!

From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, August 27, 2014 12:03
To: [email protected]
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

First, you don't need a lot of memory for ManifoldCF, although you may need it for your search index (e.g. Solr).

Second, different connectors behave differently for full crawls vs. minimal crawls. The web connector makes no distinction, except for the removal of unreachable documents at the end of the crawl.

Third, most of the time in your crawl is probably spent waiting because of throttling. Depending on what you are crawling, and whether the pages are your own local ones, you might want to relax the throttling constraints. It is also the case that ManifoldCF 1.5 had a bug in the throttling code that made byte-rate throttling 1000x too restrictive; this was fixed in 1.6.

Karl

On Wed, Aug 27, 2014 at 5:38 AM, Bisonti Mario <[email protected]> wrote:

Hello.
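As a back-of-envelope aside, the "Max fetches/min" throttle alone puts a hard floor on crawl time, which is easy to check. This is plain arithmetic on the numbers from the thread (3800 documents, 12 vs. 120 fetches/min), not a ManifoldCF API:

```python
# Lower bound on crawl wall-clock time imposed by the fetch-rate throttle.
# Document count (3800) and throttle values come from the thread above.

def min_crawl_hours(num_docs: int, max_fetches_per_min: int) -> float:
    """Hours needed just to fetch num_docs at the given throttle rate."""
    return num_docs / max_fetches_per_min / 60.0

print(f"{min_crawl_hours(3800, 12):.1f} h")   # original setting: about 5.3 h minimum
print(f"{min_crawl_hours(3800, 120):.1f} h")  # proposed setting: about 0.5 h minimum
```

Since the observed crawl took nearly 20 hours, well above the 5.3-hour throttle floor, other costs (the 256 kbytes/sec byte-rate limit, Tika PDF extraction) likely dominate, so raising fetches/min alone may help less than expected.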
I increased RAM to 4GB and manually executed the job to crawl the "Web repository" containing 3800 pdf documents. I understood that "Start" executes a full scan, while "Start minimal" executes an incremental scan of modified documents only.

I executed the job with "Start": it took nearly 20 hours. Afterwards I executed the job with "Start minimal": it rescanned the same 3800 documents, so it again took 20 hours. Why is this? Note that no documents were added between the moment I started the job with "Start" and the time I started the job with "Start minimal".

Thanks for your help!
Mario

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, August 12, 2014 17:26
To: [email protected]
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

Setting up a schedule does not prevent you from starting the job manually. But it sounds like you understand the solution.

Thanks,
Karl

On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario <[email protected]> wrote:

OK, so I think I understand better now. But I have 3800 .pdf documents, so a "full crawl" through Tika is very long: it takes 2 days. (Perhaps I need to increase RAM?)

I am using the web connector, so I see the "Start minimal" option. I understand that I can do this:

1) a full crawl on Saturday night, so that orphaned files are deleted
2) a minimal crawl every night except Saturday, so that only changed documents are crawled

Are 1) and 2) right, or have I misunderstood?

Furthermore, the option "Start even inside a scheduled window" is not yet clear to me, because I tried "Start when scheduled window starts" but I am able to start the job manually, too.

Thanks a lot!
Mario

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, August 12, 2014 14:54
To: [email protected]
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

What I would do is set up a single job.
(Multiple jobs that share the same documents may work, but they aren't recommended, because a document must vanish from ALL jobs that share it before it is removed.)

There are two different possibilities for the schedule, depending on the kind of connector you are using:

(1) Repeated full crawls
(2) Mostly minimal crawls, with periodic full crawls

If the connector you are using makes any distinction between minimal and full crawls, then (2) would probably be more efficient for you. But unreachable documents will be removed only on full crawls.

To do the setup:

-- you will need multiple scheduling records for (2), but may be able to do (1) with a single scheduling record
-- for each day, you want the window to start at midnight, and its length to be the equivalent of 24 hours
-- you want to select the option to start crawls in the middle of a window, not just at the beginning

This should give you what you want.

Karl

On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario <[email protected]> wrote:

So, I suppose, the best solution could be: continuous recrawling, plus one periodic recrawl to delete orphaned documents. Can I superimpose the two jobs?

Mario Bisonti
Information and Communications Technology
VIMAR SpA
Tel. +39 0424 488 644
[email protected]

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, August 12, 2014 12:21
To: [email protected]
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

Yes, periodic recrawling gives ManifoldCF the opportunity to discover abandoned documents and remove them.

Karl

On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario <[email protected]> wrote:

OK, thanks. So you suggest that I not use continuous crawling, and instead schedule a periodic re-crawl of all documents? Is that better?
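The weekly plan discussed in this thread (a full crawl once a week to purge orphans, minimal crawls the other nights) can be sketched as simple calendar logic. This is an illustrative sketch only, not ManifoldCF scheduling syntax, which is configured through the job's scheduling records in the UI:

```python
# Sketch of option (2): mostly minimal crawls, with a periodic full crawl
# so that unreachable documents get cleaned up once a week.
from datetime import date

def crawl_type(day: date) -> str:
    """Full crawl on Saturday (removes unreachable docs); minimal otherwise."""
    return "full" if day.weekday() == 5 else "minimal"  # Monday == 0, Saturday == 5

print(crawl_type(date(2014, 8, 30)))  # a Saturday  -> full
print(crawl_type(date(2014, 8, 27)))  # a Wednesday -> minimal
```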
Thanks a lot.
Mario

From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, August 12, 2014 12:16
To: [email protected]
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

Please read ManifoldCF in Action, Chapter 1. Continuous crawling has no mechanism for deleting unreachable documents, and never will, because it is fundamentally impossible to do.

Thanks,
Karl

On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario <[email protected]> wrote:

Hello.

I set continuous crawling on a folder of a website to index the pdf files it contains:

Schedule type: Rescan documents dynamically
Recrawl interval (if continuous): 5

I see that if documents are added to the folder they are indexed, but if documents are deleted they are not removed from the index. I see that "ManifoldCF in Action" mentions "...that continuous crawling seems to be missing a phase - the 'delete unreachable documents' phase."

How could I solve this problem, please?

Thanks a lot for your help.
Mario
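To illustrate why only a *complete* crawl can delete unreachable documents, here is a toy mark-and-sweep model (an assumption-laden sketch, not ManifoldCF internals): only after traversing everything reachable from the seeds can you know which previously indexed documents were not seen. A continuous crawl never finishes such a traversal, so it never has a complete "reachable" set to subtract from:

```python
# Toy model of the "delete unreachable documents" phase of a full crawl.
# All names (indexed, links, seeds) are hypothetical example data.

def sweep_unreachable(indexed: set, links: dict, seeds: list) -> set:
    """Return indexed docs no longer reachable from the seeds (candidates to delete)."""
    reachable, stack = set(), list(seeds)
    while stack:                      # depth-first traversal from the seed documents
        doc = stack.pop()
        if doc not in reachable:
            reachable.add(doc)
            stack.extend(links.get(doc, []))
    return indexed - reachable        # only valid once the traversal is COMPLETE

indexed = {"a.pdf", "b.pdf", "orphan.pdf"}      # what the index currently holds
links = {"index": ["a.pdf", "b.pdf"]}           # orphan.pdf is no longer linked
print(sweep_unreachable(indexed, links, ["index"]))  # {'orphan.pdf'}
```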
