Thanks a lot, Karl.
In the “Simple History” report in ManifoldCF I see the following entries every
day for every document, even when the document has not been modified:
26/05/23, 08:47:47 document ingest (SolrShare)
file:/...Avanzato%202014.pptx
26/05/23, 08:47:46 extract [TikaTrasform]
file:/...Avanzato%202014.pptx
26/05/23, 08:47:45 access
file:/...Avanzato%202014.pptx
In Solr, I ran a query to find the document and I see the following (extended
results omitted):
{
"responseHeader":{
"status":0,
"QTime":977,
"params":{
"q":"id:*Avanzato*202014*",
"_":"1685082709862"}},
"response":{"numFound":1,"start":0,"docs":[
{
"id":"file:/...Avanzato%202014.pptx",
"last_modified":"2015-03-25T17:27:22Z",
"resourcename":"...Avanzato 2014.pptx",
"content_type":["application/vnd.openxmlformats-officedocument.presentationml.presentation"],
"allow_token_document":["Active+Directory:S-1-5-21-…..",
"Active+Directory:S-1-..."],
"deny_token_document":["Active+Directory:DEAD_AUTHORITY"],
"allow_token_share":["Active+Directory:S-1-1-0"],
"deny_token_share":["Active+Directory:DEAD_AUTHORITY"],
"deny_token_parent":["__nosecurity__"],
"allow_token_parent":["__nosecurity__"],
"content":["ESER..
"_version_":1766940934228934656}]
}}
Is this what you meant when you mentioned the “activity log”?
I can see the document in Solr, so I suppose it is indexed.
What else could I investigate?
Thanks a lot
Mario
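A possible way to check whether the document is really re-sent to Solr on each run is to compare its `_version_` value before and after a crawl, since Solr normally assigns a new `_version_` whenever a document is added or updated. The sketch below is only an illustration: the base URL is a placeholder, the helper names are hypothetical, and it assumes the standard `/select` handler with JSON output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def select_url(base_url: str, doc_id: str) -> str:
    """Build a /select URL that returns only id and _version_ for one document."""
    params = {
        # The term query parser matches the id literally, so the special
        # characters in a file: URL do not need query-syntax escaping.
        "q": "{!term f=id}" + doc_id,
        "fl": "id,_version_",
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def fetch_version(base_url: str, doc_id: str):
    """Return the document's current _version_, or None if it is not indexed."""
    with urlopen(select_url(base_url, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0]["_version_"] if docs else None


def was_reindexed(version_before, version_after) -> bool:
    """True when the document existed both times and its _version_ changed."""
    return (
        version_before is not None
        and version_after is not None
        and version_before != version_after
    )


# Hypothetical usage (placeholder URL and id, run once before and once after a crawl):
#   before = fetch_version("http://localhost:8983/solr/mycore", doc_id)
#   ... wait for the next job run ...
#   after = fetch_version("http://localhost:8983/solr/mycore", doc_id)
#   print(was_reindexed(before, after))
```

If `_version_` changes after every run while `last_modified` stays the same, the document is being re-sent to the index even though the file itself did not change.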
From: Karl Wright
Sent: Friday, 26 May 2023 07:20
To: user@manifoldcf.apache.org
Subject: Re: Long Job on Windows Share
The jcifs connector does not include a lot of information in the version string
for a file - basically, the length, and the modified date. So I would not
expect there to be a lot of actual work involved if there are no changes to a
document.
The activity "access" does imply that the system believes that the document
does need to be reindexed. It clearly reads the document properly. I would
check to be sure it actually indexes the document. I suspect that your job may
be reading the file but determining it is not suitable for indexing and then
repeating that every day. You can see this by looking for the document in the
activity log to see what ManifoldCF decided to do with it.
Karl
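The version-string idea described in the message above can be illustrated with a minimal sketch. This is not the jcifs connector's code: the function names and the string layout are assumptions made for illustration, and the only thing taken from the message is that the version is built from the file length and the modified date.

```python
import os


def version_string(path: str) -> str:
    """Build a version string from the file length and modification time only."""
    st = os.stat(path)
    # Hypothetical layout "<length>:<mtime in ns>"; a real connector may differ.
    return f"{st.st_size}:{st.st_mtime_ns}"


def needs_reindex(stored_version, current_version: str) -> bool:
    """Reprocess when nothing is stored yet or the stored version differs."""
    return stored_version is None or stored_version != current_version
```

Under this scheme an unchanged file yields the same version string on every crawl, so it would be skipped. A document that is fetched and ingested every day despite an unchanged length and modified date suggests that the stored version is missing or is not being matched.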
On Thu, May 25, 2023 at 6:03 AM Bisonti Mario <mario.biso...@vimar.com> wrote:
Hi,
I would like to understand how recrawling works.
My job, which uses the “Windows shares” connection type, runs for nearly 18 hours.
It covers roughly 1 million documents.
When I check the scanned documents in ManifoldCF I see, for example:
[inline image: image001.png]
It seems to reprocess every document every day, even when the document has not been modified.
Is this expected, or did I choose the wrong kind of job to crawl the documents?
Thanks a lot
Mario