Hi,
FYI, the issue is now tracked for Tika at
https://issues.apache.org/jira/browse/TIKA-4350
For a potential fix, see
https://github.com/apache/tika/pull/2045
@Raj: since the change only affects tika-mimetypes.xml,
you can apply the patch via configuration:
1. place the modified/patched tika-mimetypes.xml in the Nutch conf/ folder
2. preferably choose a custom name, e.g., custom-tika-mimetypes.xml
3. "register" the custom MIME type pattern file via the Nutch property "mime.types.file"
4. voilà, the HTML snippet is now recognized as HTML and properly parsed:
$> bin/nutch parsechecker -Dmime.types.file=custom-tika-mimetypes.xml \
'https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw=='
...
2024-11-14 12:23:37,869 INFO o.a.n.p.ParserChecker [main] contentType: text/html
Outlinks: 1
outlink: toUrl: https://www.cdsco.gov.in/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and Cosmetics Act, 1940.pdf anchor:
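To avoid passing -Dmime.types.file on every command line, the same property can also be set in conf/nutch-site.xml. A minimal sketch, assuming the patched file is named custom-tika-mimetypes.xml (the example name from step 2, not a required one) and has been placed in the Nutch conf/ folder:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Override the default MIME type pattern file.
       The file name is only the example name used above;
       any name works as long as the file is in conf/. -->
  <property>
    <name>mime.types.file</name>
    <value>custom-tika-mimetypes.xml</value>
  </property>
</configuration>
```

If nutch-site.xml already contains other properties, only the <property> element needs to be added inside the existing <configuration> element.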
Best,
Sebastian
On 11/8/24 23:12, Lewis John McGibbney wrote:
Hi Raj,
You should cross post your question over to the user@tika mailing list.
Good luck!
lewismc
On 2024/10/23 12:27:59 Raj Chidara wrote:
Hi
I get the following exception when Nutch is parsing this page
https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==