OOPS...I meant IndexSegment. PruneIndexTool prunes existing Nutch indexes of unwanted content. :-)
karthik085 wrote:
>
> Hi,
>
> I am trying to index a website. That website has
> <meta name='ROBOTS' content='NOINDEX, NOFOLLOW'> in their html file.
>
> If they want to remove this, they will have to remove it in all their
> pages and they don't want to regenerate these pages from database.
>
> I already crawled this website. Is there any way I can make Nutch ignore
> the above and index the page?
>
> One way I can think of is:
> a) Retrieve HTML from segments
> b) Remove that line
> c) Write back
> d) Re-index
>
> Does anyone have a better solution? Can I use PruneIndexTool?
>
> If the above is the way I go about it, how do I do it...I mean, what are
> the commands I need to issue/classes I need to call and modify?
>
> Any help is appreciated. Thanks.
>
> Karthik
>
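For step (b) in the quoted list, a minimal sketch of stripping the robots meta tag from a page's HTML might look like the following. It is plain Java and does not touch Nutch at all: reading the content out of the segments and writing it back (steps a and c) are not shown, and the class and method names here are hypothetical. A regex is a crude way to edit HTML, so treat this as an illustration rather than a robust solution.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <meta name="robots" ...> tags
// from an HTML string.
final class RobotsMetaStripper {

    // Matches a <meta> tag whose name attribute is "robots", with single quotes,
    // double quotes, or no quotes, ignoring case.
    private static final Pattern ROBOTS_META = Pattern.compile(
            "<meta\\b[^>]*\\bname\\s*=\\s*(['\"]?)robots\\1[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the HTML with every matching robots meta tag removed.
    static String strip(String html) {
        return ROBOTS_META.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name='ROBOTS' content='NOINDEX, NOFOLLOW'>"
                + "<title>t</title></head></html>";
        System.out.println(strip(page));
    }
}
```

Other meta tags (for example a description tag) are left alone, since the pattern only matches when the name attribute is "robots".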
