Hi,

> more control over what is being indexed?

It's possible to enable URL filters for the indexer:
   bin/nutch index ... -filter
With a little extra effort you can use different URL filter rules
during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR
to a different folder.
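
For example, a rough sketch in local mode (the folder name and filter
rules are just placeholders, adapt them to your crawl):

   cp -r conf conf-index
   # in conf-index/regex-urlfilter.txt keep only article-like URLs, e.g.
   #   +\.html$
   #   -.
   NUTCH_CONF_DIR=conf-index bin/nutch index crawl/crawldb \
     -linkdb crawl/linkdb crawl/segments/* -filter

This way the hub pages are still crawled and their outlinks followed,
but they are dropped at index time.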

>> I can't generalize any rule

What about classifying hubs by their number of outlinks?
Then you could skip those pages using an indexing filter: just return
null for any document that should be skipped.
For a larger crawl you'll definitely get lost trying to maintain URL
filter rules for every site.
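
A minimal sketch of such an indexing filter (the property name and
threshold are made up, and you still need the usual plugin.xml and
plugin.includes wiring):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.io.Text;
   import org.apache.nutch.crawl.CrawlDatum;
   import org.apache.nutch.crawl.Inlinks;
   import org.apache.nutch.indexer.IndexingFilter;
   import org.apache.nutch.indexer.NutchDocument;
   import org.apache.nutch.parse.Parse;

   /** Skips documents with many outlinks, assuming they are hubs. */
   public class HubPageIndexingFilter implements IndexingFilter {

     private Configuration conf;
     private int maxOutlinks;

     @Override
     public NutchDocument filter(NutchDocument doc, Parse parse,
         Text url, CrawlDatum datum, Inlinks inlinks) {
       int outlinks = parse.getData().getOutlinks().length;
       if (outlinks > maxOutlinks) {
         return null; // null drops the document from the index
       }
       return doc;
     }

     @Override
     public void setConf(Configuration conf) {
       this.conf = conf;
       // hypothetical property, set it in nutch-site.xml
       this.maxOutlinks = conf.getInt("index.hub.max.outlinks", 100);
     }

     @Override
     public Configuration getConf() {
       return conf;
     }
   }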

Maybe you can also treat this as a ranking problem: if hub pages are
only penalized instead of dropped, you can afford to apply simple but
noisy heuristics.
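
E.g., in the filter sketched above you could demote suspected hubs
instead of dropping them (the factor 0.1 is arbitrary):

   if (outlinks > maxOutlinks) {
     doc.setWeight(doc.getWeight() * 0.1f); // lower boost for hubs
   }
   return doc;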

Best,
Sebastian

On 03/18/2018 10:10 AM, BlackIce wrote:
> Basically what you're saying is that you need more control over what is
> being indexed?
> 
> That's an excellent question!
> 
> Greetz!
> 
> On Mar 17, 2018 11:46 AM, "ShivaKarthik S" <shivakarthik...@gmail.com>
> wrote:
> 
>> Hi,
>>
>> Is there any way to block the hub pages and index only the articles
>> from the websites? I want to index only the articles and not the hub
>> pages. Hub pages will be crawled and their outlinks extracted, but
>> while indexing, only the articles should be indexed.
>> E.g.
>> www.abc.com/xyz and www.abc.com/abc are hub pages, while
>> www.abc.com/xyz/1.html and www.abc.com/ABC/1.html are articles.
>>
>> In this case I can block all the URLs not ending with .html, .aspx,
>> .jsp, or any other extension. But not all websites follow the same
>> format: some use .html for hub pages as well as articles, and some
>> use no extension for both. Considering these cases, I can't
>> generalize a rule saying that whatever ends without an extension is
>> a hub page and whatever ends with an extension is an article. Is
>> there any way this can be handled in Nutch 1.x?
>>
>> Thanks & regards
>> Shiva
>>
>>
>> --
>> Thanks and Regards
>> Shiva
>>
> 
