Hi Sebastian

 Thanks for your support. It resolved the issue. However, I need one
clarification. In conf/nutch-default.xml, the following property is commented out:



<!-- mime properties -->

<!--
<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default Tika config
  if specified.
  </description>
</property>
-->

and this one is present (not commented out):



<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic resolution.
  </description>
</property>

I removed the comment markers around mime.types.file, placed the
modified/patched tika-mimetypes.xml file in the Nutch conf/ folder, and ran Nutch.



Am I doing this correctly?

 



Thanks and Regards

Raj Chidara








---- On Thu, 14 Nov 2024 16:55:24 +0530 Sebastian Nagel <[email protected]> wrote ---



Hi, 
 
fyi, the issue is now tracked for Tika at 
 
 https://issues.apache.org/jira/browse/TIKA-4350 
 
for a potential fix see 
 
 https://github.com/apache/tika/pull/2045 
 
 
@Raj: since the change only applies to the tika-mimetypes.xml, 
it is possible to patch via configuration: 
 
1. place the modified/patched tika-mimetypes.xml in the Nutch conf/ folder 
 
2. preferably choose a custom name, e.g., custom-tika-mimetypes.xml 
 
3. "register" the custom MIME type pattern file via the Nutch property 
 "mime.types.file" 
 
4. voilà, the HTML snippet is now recognized as HTML and properly parsed: 
 
 $> bin/nutch parsechecker -Dmime.types.file=custom-tika-mimetypes.xml \
      'https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw=='
 
 ... 
 2024-11-14 12:23:37,869 INFO o.a.n.p.ParserChecker [main] contentType: text/html
 
 Outlinks: 1
 outlink: toUrl: https://www.cdsco.gov.in/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and Cosmetics Act, 1940.pdf anchor:
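 
To make step 3 permanent instead of passing -D on every command, the same
property can be set as a configuration override. A minimal sketch, assuming
the override goes into conf/nutch-site.xml (the file Nutch reads for local
overrides of nutch-default.xml) and that the patched file uses the example
name from step 2:

```xml
<!-- conf/nutch-site.xml: sketch only; custom-tika-mimetypes.xml is the
     example file name from step 2 and must be in conf/ (on the classpath) -->
<property>
  <name>mime.types.file</name>
  <value>custom-tika-mimetypes.xml</value>
</property>
```

Keeping the override in nutch-site.xml leaves nutch-default.xml untouched,
which should make it easier to carry the change across Nutch upgrades.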
 
 
Best, 
Sebastian 
 
On 11/8/24 23:12, Lewis John McGibbney wrote: 
> Hi Raj, 
> You should cross post your question over to the user@tika mailing list. 
> Good luck! 
> lewismc 
> 
> On 2024/10/23 12:27:59 Raj Chidara wrote: 
>> Hi 
>> 
>>   I get following exception when Nutch is parsing this page 
>> https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==

 
 
 