Re: Specifications of HopFilters "Keep unreachable documents"

Issei Nishigata Fri, 08 Nov 2019 07:11:31 -0800

Hi Karl,


Thank you for the quick response.

It seems that I have completely misunderstood the specifications so it'd be 
helpful if you could show specific examples for each Hop count mode.

Is my understanding below correct?
- "Keep unreachable documents, for now" and "... forever" are settings that 
do not delete documents from the index when those documents were not crawled.

- Hop count dependency information is like a cache of the link structure. This link structure is not recreated in "Keep unreachable documents, forever" mode, so crawling is faster.


The reason I am asking these questions is that a document was deleted that I 
thought was not going to be.
Is there any way to prevent it from being deleted? What does "keep" refer to in 
"Keep unreachable documents"?


Sincerely,
Issei Nishigata



On 2019/11/08 2:19, Karl Wright wrote:

Hi Issei,
The setting of "Keep unreachable documents forever" basically means that no hop count dependency information is kept around for any crawls done when that setting is in place. That means that when links change or documents change the system does not know how to recompute the hop count accurately. This setting is appropriate if you want your crawl to be as fast as possible and do not expect ever to use hop count filtering for the job in question.
The "keep unreachable documents for now" means that enough information is kept around that if you decided to put a hop count filter into place later, it would still work properly.
Hope that helps.

Karl
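
As an illustration of why link information is needed to recompute hop counts, here is a minimal toy sketch. It is not ManifoldCF code and does not use any ManifoldCF API; the functions hop_counts and within_limit and the sample link graph are hypothetical names made up for this example. It assumes a hop count is simply the minimum number of links followed from a seed document.

```python
from collections import deque


def hop_counts(links, seeds):
    """Breadth-first search: minimum number of link hops from any seed.

    `links` maps a document to the documents it links to (toy data).
    """
    counts = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, ()):
            if target not in counts:  # first visit is the shortest path
                counts[target] = counts[doc] + 1
                queue.append(target)
    return counts


def within_limit(counts, max_hops):
    """Documents whose hop count does not exceed the filter value."""
    return {doc for doc, hops in counts.items() if hops <= max_hops}


# Hypothetical link structure: root -> a, b and a -> c
links = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
counts = hop_counts(links, ["root"])
print(counts)
print(sorted(within_limit(counts, 1)))

# If the link a -> c disappears, "c" can only be recognised as
# unreachable by recomputing from the stored link structure.
links_after = {"root": ["a", "b"], "a": [], "b": [], "c": []}
print("c" in hop_counts(links_after, ["root"]))
```

In this toy model, the recomputation in the last step is only possible because the link structure was kept; without it there is nothing to recompute from, which is the trade-off described above.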


On Thu, Nov 7, 2019 at 11:01 AM Issei Nishigata <duo.2...@gmail.com> wrote:

    Hi All,


    I use MCF 2.12, and I am confused about the specifications of the HopFilters
    "Keep unreachable documents" settings.

    I understand that the "Keep unreachable documents, for now" and "Keep
    unreachable documents, forever" settings of HopFilter
    take effect when a HopCount is specified.

    For example, if I crawl all data with an empty HopCount value the first
    time, and the second time
    set HopCount to 0 with "Keep unreachable documents, for now", I expect
    that only the first layer of the directory
    will be crawled and that the second and deeper layers, which are not
    crawled, will not be deleted from the index.

    However, when I actually run with the above settings, documents on the
    second layer are deleted from the index
    on the second and later runs. It works the same way when using "Keep
    unreachable documents, forever".

    Is there anything wrong with my understanding? Also, does anyone know the
    difference between these two settings,
    "Keep unreachable documents, for now" and "Keep unreachable documents,
    forever"?

    If any of you knows the specs of these settings, it would be very
    helpful if you could share your advice.
    Any clue would be much appreciated.


    Sincerely,
    Issei Nishigata
