Hi Karl,
Thank you for the quick response.
It seems that I have completely misunderstood the specification, so it would be
helpful if you could show specific examples for each hop count mode.
Is my understanding below correct?
- "Keep unreachable documents, for now" and "Keep unreachable documents, forever"
are the settings that do not delete documents from the index when those
documents were not crawled.
- Hop count dependency information is like a cache of the link structure. This
link structure is not recreated in "Keep unreachable documents, forever" mode,
so crawling is faster.
The reason I am asking these questions is that a document was deleted that I
did not expect to be deleted.
Is there any way to prevent it from being deleted? What exactly is kept in
"Keep unreachable documents"?
Sincerely,
Issei Nishigata
On 2019/11/08 2:19, Karl Wright wrote:
Hi Issei,
The setting of "Keep unreachable documents forever" basically means that no hop count dependency information is kept around for any crawls done
when that setting is in place. That means that when links change or documents change the system does not know how to recompute the hopcount
accurately. This setting is appropriate if you want your crawl to be as fast as possible and do not expect ever to use hop count filtering for
the job in question.
The "keep unreachable documents for now" means that enough information is kept around that if you decided to put a hop count filter into place
later, it would still work properly.
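As a rough illustration of why stored link information matters, here is a toy model in Python. It is not ManifoldCF's code or data model; the names `hop_counts` and `links` are made up for this sketch. It treats the hop count as the minimum number of links from a seed to a document, and shows that recomputing it after a link changes requires knowing the link structure.

```python
from collections import deque


def hop_counts(links, seeds):
    """Return the minimum number of links from any seed to each reachable document.

    `links` maps a document to the documents it links to (hypothetical structure).
    """
    counts = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, []):
            if target not in counts:  # first visit is the shortest path in BFS
                counts[target] = counts[doc] + 1
                queue.append(target)
    return counts


# Toy link structure: root -> a -> c, root -> b
links = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
before = hop_counts(links, ["root"])

# The link a -> c disappears. With the link structure available, the hop
# counts can be recomputed and "c" is found to be unreachable.
links["a"] = []
after = hop_counts(links, ["root"])

print(before)
print(after)
```

In this toy model, if `links` had been thrown away after the first crawl, there would be nothing to recompute from once a link changed.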
Hope that helps.
Karl
On Thu, Nov 7, 2019 at 11:01 AM Issei Nishigata <duo.2...@gmail.com> wrote:
Hi All,
I use MCF 2.12, and I am confused about the specification of the HopFilters
setting "Keep unreachable documents".
I understand that "Keep unreachable documents, for now" and "Keep
unreachable documents, forever" in HopFilters
are settings that take effect when a HopCount is specified.
For example, suppose I crawl all data with an empty HopCount value the
first time, and the second time
set HopCount to 0 with "Keep unreachable documents, for now". I expect that
only the first layer of the directory
will be crawled, and that the second and deeper layers, which are not crawled,
will not be deleted from the index.
However, when I actually run with the above settings, documents on the second
layer are deleted from the index
on the second and subsequent runs. The same thing happens when using "Keep
unreachable documents, forever".
Is there anything wrong with my understanding? Also, does anyone know the
difference between these two settings,
"Keep unreachable documents, for now" and "Keep unreachable documents,
forever"?
If anyone knows the specification of these settings, it would be very
helpful if you could share your advice.
Any clue would be much appreciated.
Sincerely,
Issei Nishigata