Re: Index URL's based on a condition

Jorge Betancourt Tue, 26 Sep 2017 03:55:45 -0700

If you only want to avoid **indexing** old documents, you could write your
own `IndexingFilter` that checks your condition and skips the documents
that fail it. You don't mention your Nutch version, but assuming that
you're using 1.x, we have a new PR (https://github.com/apache/nutch/pull/219)
that should be ready for the next release. It offers this feature out of
the box, using JEXL expressions to allow or prevent documents from being
indexed.

If you could grab the PR, test it, and provide some feedback, that would be
amazing!

You could also write your own custom plugin if you want. The
mimetype-filter plugin does something similar to what you want (in that
case the filtering is based on the MIME type), so it is a good reference.
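
To make the idea concrete, here is a minimal, stand-alone sketch of the
decision such a filter would make. It deliberately does not import any Nutch
classes, so it is not a drop-in plugin: the class name, the cutoff constant and
the "keep documents with an unknown date" rule are all assumptions made for
illustration. In a real `IndexingFilter`, returning `null` from `filter(...)`
is what discards the document.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/**
 * Hypothetical helper holding the date check a custom IndexingFilter could call.
 * Not part of Nutch; names and policy are illustrative assumptions.
 */
final class PublishedDateCheck {

    /** Assumed cutoff: anything dated before 2017-01-01 (UTC) is not indexed. */
    static final Instant CUTOFF =
        LocalDate.of(2017, 1, 1).atStartOfDay(ZoneOffset.UTC).toInstant();

    private PublishedDateCheck() {
    }

    /**
     * @param modifiedTimeMillis document date in epoch milliseconds,
     *                           or a value <= 0 when no date is known
     * @return true if the document should be indexed
     */
    static boolean shouldIndex(long modifiedTimeMillis) {
        if (modifiedTimeMillis <= 0) {
            // Assumption: keep documents whose date is unknown.
            return true;
        }
        // Index only documents dated at or after the cutoff.
        return !Instant.ofEpochMilli(modifiedTimeMillis).isBefore(CUTOFF);
    }
}
```

Inside the plugin you would call something like
`PublishedDateCheck.shouldIndex(...)` with whatever date you trust, and return
`null` instead of the document when it answers `false`.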

Also, a warning is in order: at the moment the `fetchTime` or `modifiedTime`
that Nutch uses comes from the headers that the web server sends when the
resource is fetched. Keep in mind that these values should not be trusted
(unless you are 100% sure), because in most cases you'll get wrong dates.
https://issues.apache.org/jira/browse/NUTCH-1414 proposes a better
approach to extracting the publication date from the content of the page,
or you can implement your own parser.
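
As a rough illustration of reading the date from the page content rather than
from the headers, the sketch below looks for an Open Graph
`article:published_time` meta tag. This is only one possible heuristic and is
not what NUTCH-1414 implements. The class name is hypothetical, and the regex
assumes the `property` attribute comes before `content` and that both use
double quotes; a real parser should work on the parsed DOM instead.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Nutch. */
final class PublishedDateExtractor {

    // Assumed markup: <meta property="article:published_time" content="...">
    private static final Pattern META = Pattern.compile(
        "<meta\\s+property=\"article:published_time\"\\s+content=\"([^\"]+)\"",
        Pattern.CASE_INSENSITIVE);

    private PublishedDateExtractor() {
    }

    /** Returns the published date if the tag exists and holds an ISO-8601 offset date-time. */
    static Optional<OffsetDateTime> extract(String html) {
        Matcher m = META.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(OffsetDateTime.parse(m.group(1)));
        } catch (DateTimeParseException e) {
            // The tag is present but its value is not a parseable date.
            return Optional.empty();
        }
    }
}
```

An empty result means "no trustworthy date found", and it is up to your filter
to decide whether such documents are indexed or not.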

Keep in mind that with the indexing-filter approach you'll still fetch and
parse the old documents; you only avoid the indexing step.

Best Regards,
Jorge

On Tue, Sep 26, 2017 at 7:13 AM Abhishek Ramachandran <abhishe...@mstack.com>
wrote:

> Hello,
>
> I want to know whether it's possible to filter the URLs that are fetched,
> based on a condition (for example published date or time). I know that we
> can filter the URLs with regex-urlfilter for fetching.
>
> In my case I don't want to index old documents. So, if a document was
> published before 2017, it has to be rejected. Is a date filter plugin
> needed for this, or is there another solution already available?
>
> Any help will be appreciated. Thanks in advance.
>
> --
> Regards,
> *Abhishek Ramachandran*
> *abhishe...@mstack.com <abhishe...@mstack.com>*
> * <http://www.mstack.com/>*
>
