Re: Nutch - Restriction by content type

Markus Jelsma Thu, 16 Nov 2023 06:37:20 -0800

Hello,

You can skip certain types of documents based on their file extension
using the urlfilter-suffix plugin. It only filters on the suffixes it has
been configured with. Filtering on content type before fetching is not
possible, because the content type is only known once a document has been
fetched and parsed.
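
To illustrate why a suffix check works before fetching while a content-type
check does not, here is a rough Python sketch. It is not Nutch code; the
suffix list and the function name are made up for the example. It only looks
at the URL string, which is all a URL filter has available at that stage.

```python
from urllib.parse import urlparse

# Hypothetical suffix list for illustration; not taken from any Nutch config.
BLOCKED_SUFFIXES = (".pdf", ".jpg", ".png", ".zip")


def passes_suffix_filter(url, blocked=BLOCKED_SUFFIXES):
    """Return True if the URL path does not end in one of the blocked suffixes."""
    # Only the path is inspected, so query strings do not affect the match.
    path = urlparse(url).path.lower()
    return not path.endswith(tuple(blocked))
```

A URL such as https://example.com/download?id=7 passes this check even if the
server answers with a PDF, which is the limitation described above.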


You can, however, skip specific content types at indexing time using the
Jexl indexing filter (index-jexl-filter), since the content type is known by
then.
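
The kind of decision such an indexing filter makes can be sketched in Python
as follows. This is only an illustration of the logic, not the Jexl
expression syntax or any Nutch API; the function name and the allowed list
are assumptions.

```python
def keep_for_indexing(content_type, allowed=("text/html",)):
    """Return True if the media type, ignoring parameters such as charset, is allowed."""
    if not content_type:
        # No content type recorded: treat the document as not allowed.
        return False
    # "text/html; charset=UTF-8" -> "text/html"
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in allowed
```

Note that documents rejected this way have still been fetched and parsed;
they are only kept out of the index.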

Regards,
Markus

On Thu, 16 Nov 2023 at 14:56, Raj Chidara <[email protected]> wrote:

> Hello
>   Can we control the crawling of web pages by their content type through
> any configuration setting?  For example, I want to crawl only pages whose
> content type is text/html from a website and do not want to crawl other
> pages/files.
>
> Thanks and Regards
>
> Raj Chidara
