Hi,
> In conf/nutch-default.xml, this is commented
Oh, sorry, yes of course. You'd need to uncomment it.
Or rather, the recommended way is to define all customized properties
in nutch-site.xml - this keeps defaults and customizations apart.
> I removed this comment and placed file modified/patched tika-mimetypes.xml in
> the Nutch conf/ folder and run nutch program.
It may be necessary to use a different name because a tika-mimetypes.xml
file is shipped with Tika. That is just to be safe, though: it should also work
with the default name because the conf/ folder comes first in the Java classpath.
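For reference, the override in conf/nutch-site.xml could look roughly like the
sketch below. The file name custom-tika-mimetypes.xml is only an example; it
has to match whatever name you gave the patched file in conf/. If your
nutch-site.xml already contains other properties, just add this one inside
the existing <configuration> element.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>mime.types.file</name>
    <!-- example name only; must match the patched file placed in conf/ -->
    <value>custom-tika-mimetypes.xml</value>
  </property>
</configuration>
```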
> <property>
> <name>mime.type.magic</name>
> <value>true</value>
> <description>Defines if the mime content type detector uses magic
> resolution.
> </description>
> </property>
Btw., in case you can be sure that all crawled servers always send the correct
MIME type in the Content-Type header, disabling the MIME magic could also be
an option.
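If you want to try that, a sketch of the corresponding override for
conf/nutch-site.xml (same property as quoted above, only the value changed;
it goes inside the <configuration> element):

```xml
<property>
  <name>mime.type.magic</name>
  <!-- false: turn off magic (content-based) MIME detection -->
  <value>false</value>
</property>
```

This only makes sense if the Content-Type headers of the crawled servers can
be trusted.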
> Am I doing this correctly?
Just try it out:
1. change the configuration
2. compile Nutch: `ant clean runtime`
3. run parsechecker without overriding any properties on the command line
4. check whether the page is properly parsed
Best,
Sebastian
On 11/17/24 11:15, Raj Chidara wrote:
Hi Sebastian
Thanks for your support. It resolved the issue. However, one clarification is
required. In conf/nutch-default.xml, this is commented
<!-- mime properties -->
<!--
<property>
<name>mime.types.file</name>
<value>tika-mimetypes.xml</value>
<description>Name of file in CLASSPATH containing filename extension and
magic sequence to mime types mapping information. Overrides the default Tika
config if specified.
</description>
</property>
-->
and this is present in
<property>
<name>mime.type.magic</name>
<value>true</value>
<description>Defines if the mime content type detector uses magic resolution.
</description>
</property>
I removed this comment and placed file modified/patched tika-mimetypes.xml in
the Nutch conf/ folder and run nutch program.
Am I doing this correctly?
Thanks and Regards
Raj Chidara
---- On Thu, 14 Nov 2024 16:55:24 +0530 *Sebastian Nagel <[email protected]>*
wrote ---
Hi,
fyi, the issue is now tracked for Tika at
https://issues.apache.org/jira/browse/TIKA-4350
for a potential fix see
https://github.com/apache/tika/pull/2045
@Raj: since the change only applies to the tika-mimetypes.xml,
it is possible to patch via configuration:
1. place the modified/patched tika-mimetypes.xml in the Nutch conf/ folder
2. preferably choose a custom name, e.g., custom-tika-mimetypes.xml
3. "register" the custom MIME type pattern file via the Nutch property
"mime.types.file"
4. voilà, the HTML snippet is now recognized as HTML and properly parsed:
$> bin/nutch parsechecker -Dmime.types.file=custom-tika-mimetypes.xml \
'https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw=='
...
2024-11-14 12:23:37,869 INFO o.a.n.p.ParserChecker [main] contentType: text/html
Outlinks: 1
outlink: toUrl: https://www.cdsco.gov.in/opencms/resources/UploadCDSCOWeb/2022/acts_and_rules/Drugs and Cosmetics Act, 1940.pdf anchor:
Best,
Sebastian
On 11/8/24 23:12, Lewis John McGibbney wrote:
> Hi Raj,
> You should cross post your question over to the user@tika mailing list.
> Good luck!
> lewismc
>
> On 2024/10/23 12:27:59 Raj Chidara wrote:
>> Hi
>>
>> I get the following exception when Nutch is parsing this page
>> https://www.cdsco.gov.in/opencms/opencms/system/modules/CDSCO.WEB/elements/download_file_division.jsp?num_id=OTIyNw==