Re: Is there any way to block the hubpages while crawling

BlackIce Sun, 18 Mar 2018 02:11:03 -0700

Basically what you're saying is that you need more control over what is
being indexed?


That's an excellent question!

Greetz!

On Mar 17, 2018 11:46 AM, "ShivaKarthik S" <shivakarthik...@gmail.com>
wrote:

> Hi,
>
> Is there any way to block hub pages and index only the articles from a
> website? Hub pages should still be crawled so that their outlinks are
> extracted, but at indexing time I need only the articles to be indexed.
> E.g.
> www.abc.com/xyz & www.abc.com/abc are hub pages, and www.abc.com/xyz/1.html
> & www.abc.com/ABC/1.html are articles.
>
> In this case I can block all the URLs not ending with .html, .aspx,
> .jsp or any other extension. But not all websites follow the same
> format. Some use .html for hub pages as well as articles, and some use
> no extension for either. Considering these cases, I can't generalize a
> rule saying that a URL without an extension is a hub page and a URL
> with an extension is an article. Is there any way this can be handled
> in Nutch 1.x?
>
> Thanks & regards
> Shiva
>
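
For illustration, the extension-based rule described in the quoted message
can be sketched as below. This is only a minimal standalone Python sketch,
not Nutch code or a Nutch API; the function name looks_like_article, the
ARTICLE_EXTENSIONS set, and the www.example.org URLs are hypothetical and
made up for this example. It classifies the www.abc.com URLs from the
message as intended, and then shows the two cases the message says break
the rule.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of "article" extensions, taken from the quoted message.
ARTICLE_EXTENSIONS = {".html", ".aspx", ".jsp"}


def looks_like_article(url: str) -> bool:
    """Return True if the URL path ends with one of the known extensions."""
    # The URLs in the thread have no scheme; urlparse needs one to split
    # host and path correctly, so add a placeholder scheme.
    if "://" not in url:
        url = "http://" + url
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in ARTICLE_EXTENSIONS


# URLs from the quoted message: the rule matches the intended labels.
for url in ("www.abc.com/xyz", "www.abc.com/abc",
            "www.abc.com/xyz/1.html", "www.abc.com/ABC/1.html"):
    print(url, "->", "article" if looks_like_article(url) else "hub page")

# Made-up counterexamples of the kind the message describes:
# a hub page served as .html, and an article with no extension.
print(looks_like_article("www.example.org/news/index.html"))  # hub, but True
print(looks_like_article("www.example.org/news/some-story"))  # article, but False
```

The last two calls are the point of the sketch: a rule based on the URL
alone mislabels both, so any per-site decision would need some other
signal than the extension.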
