Basically what you're saying is that you need more control over what is being indexed?
That's an excellent question!

Greetz!

On Mar 17, 2018 11:46 AM, "ShivaKarthik S" <shivakarthik...@gmail.com> wrote:

> Hi,
>
> Is there any way to block the hub pages & index only the articles from the
> websites. I wanted to index only the articles & not hubpage. Hub pages will
> be crawled & the outlines will be extracted, but while indexing, I needed
> only the articles to be indexed.
> E.g.
> www.abc.com/xyz & www.abc.com/abc are hub pages and www.abc.com/xyz/1.html
> & www.abc.com/ABC/1.html is an article.
>
> In this case I can block all the urls not ending with .html or .aspx or
> .JSP or any other extensions. But all the websites need not be following
> same format. Some follow . html for hub pages as well as articles & some
> follow no extension for both hub pages as well as articles. Considering
> these cases, I can't generalize any rule saying that whichever is ending
> without extension is hubpage & whichever is ending with extension is
> article. Is there any way in nutch 1.x this can be handled?
>
> Thanks & regards
> Shiva
>
>
> --
> Thanks and Regards
> Shiva
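
To make the extension rule from the quoted message concrete, here is a minimal sketch in plain Python. It is not Nutch code and does not use any Nutch API; the function name `looks_like_article`, the `ARTICLE_EXTENSIONS` set, and the `www.example.com` URL are all made up for illustration. It only shows how the "ends with .html/.aspx/.jsp" test behaves on the example URLs, and where it breaks down for a site that uses .html for hub pages too.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of extensions treated as "article" endings,
# taken from the ones named in the quoted message.
ARTICLE_EXTENSIONS = {".html", ".aspx", ".jsp"}


def looks_like_article(url: str) -> bool:
    """Guess whether a URL is an article purely from its path extension."""
    # urlparse only separates host and path reliably when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    path = urlparse(url).path
    # Compare case-insensitively so ".JSP" and ".jsp" are treated alike.
    extension = posixpath.splitext(path)[1].lower()
    return extension in ARTICLE_EXTENSIONS


if __name__ == "__main__":
    examples = [
        "www.abc.com/xyz",                  # hub page in the quoted example
        "www.abc.com/abc",                  # hub page in the quoted example
        "www.abc.com/xyz/1.html",           # article in the quoted example
        "www.abc.com/ABC/1.html",           # article in the quoted example
        "www.example.com/news/index.html",  # made-up hub page that ends in .html
    ]
    for url in examples:
        print(url, looks_like_article(url))
```

The last URL is reported as an article even though it is meant to be a hub page, which is exactly the case described above: a rule based only on the URL ending cannot tell the two apart on every site.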