Thanks a lot.
I understand the difference between full and minimal crawls now.

Third, about throttling:

For the web repository connection I set throttling = 100.
For the Solr output connection I set Throttling, Max connections = 1000.

I am using ManifoldCF 1.7


My documents are .pdf files, so Tika extracts the content.

Karl, do you think those throttling parameters are right?
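As a rough sanity check, the numbers above can be compared against the observed 20-hour crawl. This sketch assumes the web connection throttle value means fetches per minute, which may not match the exact ManifoldCF semantics:

```python
# Rough sanity check of the configured throttle values (a sketch; the
# exact meaning of ManifoldCF's throttle fields is an assumption here).

DOCS = 3800            # documents in the web repository
FETCHES_PER_MIN = 100  # assumed meaning of the web connection throttle value

# If this throttle were the only bottleneck, the whole crawl would take:
minutes = DOCS / FETCHES_PER_MIN
print(f"Lower bound from throttling alone: {minutes:.0f} minutes")
# A 20-hour crawl therefore suggests the time goes elsewhere
# (e.g. Tika extraction or per-host politeness delays), not this limit.
```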

Thanks a lot!







From: Karl Wright [mailto:[email protected]]
Sent: Wednesday, 27 August 2014 12:03
To: [email protected]
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,
First, you don't need a lot of memory for ManifoldCF, although you may need it 
for your search index (e.g. Solr).
Second, different connectors behave differently for full crawls vs. minimal 
crawls.  The web connector makes no distinction, except for the removal of 
unreachable documents at the end of the crawl.

Third, most of the time in your crawl is probably going into waiting because of 
throttling.  Depending on what you are crawling, and whether it is your own 
local pages, you might want to relax the throttling constraints.  It is also 
the case that ManifoldCF 1.5 had a bug in the throttling code that made 
byte-rate throttling 1000x too restrictive.  This was fixed in 1.6.
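The effect of that 1.5 bug can be illustrated with a small sketch. The configured rate and document size below are hypothetical example values, not Mario's actual settings:

```python
# Illustration of the ManifoldCF 1.5 byte-rate throttling bug described
# above: the configured rate was effectively 1000x too restrictive
# (fixed in 1.6). All numbers here are hypothetical examples.

configured_bps = 64_000              # example configured bytes/sec
effective_bps_in_1_5 = configured_bps / 1000

doc_size = 500_000                   # a typical 500 KB PDF (assumption)
seconds_ok = doc_size / configured_bps
seconds_bug = doc_size / effective_bps_in_1_5
print(f"Intended: {seconds_ok:.1f}s per doc; under the 1.5 bug: {seconds_bug:.1f}s")
```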
Karl


On Wed, Aug 27, 2014 at 5:38 AM, Bisonti Mario 
<[email protected]> wrote:

Hello.

I increased RAM to 4 GB and manually ran the job to crawl the “Web 
repository”, which contains 3800 pdf documents.

I understood that “Start” executes a full scan, while “Start minimal” 
executes an incremental scan of modified documents only.


I executed the job with “Start”: it took nearly 20 hours.
Then I executed the job with “Start minimal”: it rescanned the same 3800 
documents, so it took 20 hours again.

Why is this?

Note that no new documents were added between the time I started the job with 
“Start” and the time I started it with “Start minimal”.


Thanks for your help!

Mario






From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, 12 August 2014 17:26

To: [email protected]
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,

Setting up a schedule does not prevent you from starting the job manually.
But it sounds like you understand the solution.

Thanks,
Karl

On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario 
<[email protected]> wrote:
OK, I think I understand better now.

But I have 3800 .pdf documents, so a full crawl through Tika is very long: it 
takes 2 days. (Perhaps I need to increase RAM?)

I am using the web connector, so I see the “Start minimal” option.

I understand that I can do this:
1) a full crawl on Saturday night, so it deletes orphaned files
2) a minimal crawl every other night, so it crawls only changed documents

Are 1) and 2) right, or have I misunderstood?
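The weekly plan described above can be made explicit with a small sketch. In ManifoldCF this is configured with scheduling records on a single job, not code; this just spells out the intent:

```python
# Sketch of the weekly plan: a full crawl ("Start") on Saturday night,
# a minimal crawl ("Start minimal") on every other night.

PLAN = {
    "Mon": "minimal", "Tue": "minimal", "Wed": "minimal",
    "Thu": "minimal", "Fri": "minimal", "Sun": "minimal",
    "Sat": "full",  # only the full crawl removes unreachable documents
}

for day, kind in PLAN.items():
    print(day, "->", kind)
```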


Furthermore, the option “Start even inside a scheduled window” is not clear to 
me: I tried “Start when scheduled window starts”, but I am able to start the 
job manually, too.

Thanks a lot!


Mario



From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, 12 August 2014 14:54

To: [email protected]
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,

What I would do is set up a single job.  (Multiple jobs that share the same 
documents may work but they aren't recommended because a document must vanish 
from ALL jobs that share it before it is removed.)  There are two different 
possibilities for the schedule, depending on the kind of connector you are 
using:
(1) Repeated full crawls
(2) Mostly minimal crawls, with periodic full crawls
If the connector you are using makes any distinction between minimal and full 
crawls, then (2) would probably be more efficient for you.  But only on full 
crawls will unreachable documents be removed.
To do the setup:
-- you will need multiple scheduling records for (2), but may be able to do (1) 
with a single scheduling record
-- for each day, you want the window to start at midnight, and its length to be 
the equivalent of 24 hours
-- you want to select the option to start crawls in the middle of a window, not 
just at the beginning
This should give you what you want.
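The scheduling-window behavior described in the steps above can be sketched as follows. This mirrors the behavior, not ManifoldCF's actual code; the function and defaults are illustrative:

```python
# Sketch of the scheduling-window logic: each day's window starts at
# midnight and lasts 24 hours, and the job may start anywhere inside
# the window, not only at its beginning.

from datetime import datetime, time

def in_window(now: datetime, window_start: time = time(0, 0),
              window_hours: int = 24) -> bool:
    """True if `now` falls inside the daily scheduling window."""
    start_minutes = window_start.hour * 60 + window_start.minute
    now_minutes = now.hour * 60 + now.minute
    return start_minutes <= now_minutes < start_minutes + window_hours * 60

# With a midnight start and a 24-hour length, any time of day qualifies,
# so a crawl (scheduled or manual) can always begin:
print(in_window(datetime(2014, 8, 27, 15, 30)))
```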
Karl

On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario 
<[email protected]> wrote:
So, I suppose, the best solution could be:
continuous recrawling, plus one periodic recrawl to delete orphaned documents.

Can I overlap the two jobs?

Mario Bisonti
Information and Communications Technology

VIMAR SpA
Tel. +39 0424 488 644
[email protected]
Take care of the environment. Print only if necessary.





From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, 12 August 2014 12:21

To: [email protected]
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,

Yes, periodic recrawling allows ManifoldCF the opportunity to discover 
abandoned documents and remove them.

Karl

On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario 
<[email protected]> wrote:
OK, thanks.

So you suggest not using continuous crawling, and instead scheduling a 
periodic recrawl of all documents?
Is that better?
Thanks a lot.



Mario





From: Karl Wright [mailto:[email protected]]
Sent: Tuesday, 12 August 2014 12:16
To: [email protected]
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,
Please read ManifoldCF in Action Chapter 1.  Continuous crawling has no 
mechanism for deleting unreachable documents, and never will, because it is 
fundamentally impossible to do.
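The reason only a full crawl can do this deletion can be sketched as a set difference: a full crawl visits every reachable document, so anything indexed but not reached must be an orphan, while a continuous crawl never has a complete "reached" set to compare against. A conceptual illustration, not ManifoldCF's actual code:

```python
# Orphan detection at the end of a full crawl, as a set difference.
# Filenames are hypothetical examples.

previously_indexed = {"a.pdf", "b.pdf", "c.pdf"}
reached_in_full_crawl = {"a.pdf", "c.pdf"}   # b.pdf was deleted on the site

orphans = previously_indexed - reached_in_full_crawl
print(orphans)   # the documents to remove from the index
```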
Thanks,
Karl

On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario 
<[email protected]> wrote:
Hello.
I set continuous crawling on a folder of a website to index the pdf files it 
contains.

Schedule type: Rescan documents dynamically
Recrawl interval (if continuous): 5

I see that if documents are added to the folder they are indexed, but if 
documents are deleted they aren’t removed from the index.
I see that “ManifoldCF in Action” mentions “…that continuous crawling seems 
to be missing a phase – the ‘delete unreachable documents’ phase.”

But how could I solve this problem, please?
Thanks a lot for your help.
Mario








