Hi,

FYI, the issue is now tracked for Tika at

  https://issues.apache.org/jira/browse/TIKA-4350

For a potential fix, see

  https://github.com/apache/tika/pull/2045


@Raj: since the change only affects tika-mimetypes.xml,
it is possible to apply the patch via configuration:

1. place the modified/patched tika-mimetypes.xml in the Nutch conf/ folder

2. preferably choose a custom name, e.g., custom-tika-mimetypes.xml

3. "register" the custom MIME type pattern file via the Nutch property
   "mime.types.file"

4. voilà, the HTML snippet is now recognized as HTML and properly parsed:

  $> bin/nutch parsechecker -Dmime.types.file=custom-tika-mimetypes.xml  \
'https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw=='
  ...
  2024-11-14 12:23:37,869 INFO o.a.n.p.ParserChecker [main] contentType: text/html

  Outlinks: 1
outlink: toUrl: https://www.cdsco.gov.in/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and Cosmetics Act, 1940.pdf anchor:
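
To make step 3 permanent instead of passing -D on every command line, the
same property can be set in conf/nutch-site.xml. A minimal sketch, assuming
the file name custom-tika-mimetypes.xml from step 2 (the name is only an
example) and that the element is placed inside the existing <configuration>
root of nutch-site.xml:

```xml
<!-- Fragment for conf/nutch-site.xml, inside <configuration>...</configuration>. -->
<!-- The value is the example file name from step 2; the file is looked up -->
<!-- on the classpath, which includes the Nutch conf/ folder. -->
<property>
  <name>mime.types.file</name>
  <value>custom-tika-mimetypes.xml</value>
</property>
```

With this in place, the parsechecker call above should behave the same
without the -Dmime.types.file=... option.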


Best,
Sebastian

On 11/8/24 23:12, Lewis John McGibbney wrote:
Hi Raj,
You should cross post your question over to the user@tika mailing list.
Good luck!
lewismc

On 2024/10/23 12:27:59 Raj Chidara wrote:
Hi

  I get the following exception when Nutch parses this page:
https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==
