Hi,

> more control over what is being indexed?
It's possible to enable URL filters for the indexer:

  bin/nutch index ... -filter

With little extra effort you can use different URL filter rules during the
index step, e.g. in local mode by pointing NUTCH_CONF_DIR to a different
folder.

>> I can't generalize any rule

What about classifying hub pages by the number of outlinks? Then you could
skip those pages using an indexing filter: just return null if a document
shall be skipped (a minimal sketch of such a filter is appended at the end
of this message). For a larger crawl you'll definitely get lost with a URL
filter.

Maybe you can also see this as a ranking problem: if hub pages are only
penalized, you could apply simple but noisy heuristics.

Best,
Sebastian

On 03/18/2018 10:10 AM, BlackIce wrote:
> Basically what you're saying is that you need more control over what is
> being indexed?
>
> That's an excellent question!
>
> Greetz!
>
> On Mar 17, 2018 11:46 AM, "ShivaKarthik S" <shivakarthik...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Is there any way to block the hub pages and index only the articles from
>> the websites? I want to index only the articles and not the hub pages.
>> Hub pages will be crawled and their outlinks extracted, but while
>> indexing I need only the articles to be indexed.
>> E.g. www.abc.com/xyz and www.abc.com/abc are hub pages, while
>> www.abc.com/xyz/1.html and www.abc.com/ABC/1.html are articles.
>>
>> In this case I can block all the URLs not ending with .html, .aspx, .jsp
>> or any other extension. But not all websites follow the same format: some
>> use .html for hub pages as well as articles, and some use no extension
>> for both hub pages and articles. Considering these cases, I can't
>> generalize any rule saying that whatever ends without an extension is a
>> hub page and whatever ends with an extension is an article. Is there any
>> way in Nutch 1.x this can be handled?
>>
>> Thanks and regards
>> Shiva
>>
>>
>> --
>> Thanks and Regards
>> Shiva
>>
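
Below is a minimal sketch of such an indexing filter for Nutch 1.x, only to
illustrate the "return null to skip" idea. The class name
HubPageIndexingFilter, the property index.hub.max.outlinks and the default
threshold of 50 are made-up placeholders (not part of Nutch), and counting
outlinks is just one possible, admittedly noisy, heuristic.

// Sketch of a custom Nutch 1.x indexing filter that drops likely hub pages.
// The property name "index.hub.max.outlinks" and the threshold of 50 are
// hypothetical; adjust them to your crawl.
package org.example.nutch.indexer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;

public class HubPageIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private int maxOutlinks;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Classify the page as a hub if it has "too many" outlinks.
    Outlink[] outlinks = parse.getData().getOutlinks();
    if (outlinks != null && outlinks.length > maxOutlinks) {
      // Returning null tells the indexer to skip this document entirely.
      return null;
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    this.maxOutlinks = conf.getInt("index.hub.max.outlinks", 50);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

You would then package this as a plugin (a plugin.xml declaring an extension
of the org.apache.nutch.indexer.IndexingFilter extension point) and add the
plugin id to plugin.includes in nutch-site.xml.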