[Nutch-general] Ignore Robots meta tag

karthik085 Fri, 27 Apr 2007 11:48:25 -0700

Hi,

I am trying to index a website. Its pages contain
  <meta name='ROBOTS' content='NOINDEX, NOFOLLOW'> in their HTML.


To get rid of it, the site owners would have to remove it from all their pages,
and they don't want to regenerate those pages from the database.

I have already crawled this website. Is there any way I can make Nutch ignore
the tag above and index the pages?

One way I can think of is:
a) Retrieve HTML from segments
b) Remove that line
c) Write back
d) Re-index
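
Step (b) on its own could be sketched roughly as below. This is only an
illustration in Python of the string transformation, not Nutch code: the
function name strip_robots_meta is hypothetical, the regex assumes the tag
contains no '>' inside an attribute value, and reading the HTML out of the
segments and writing it back (steps a and c) is not shown.

```python
import re

# Matches any <meta ...> tag whose name attribute is "robots"
# (case-insensitive, single, double, or no quotes, attributes in any order).
# Assumption: no '>' character appears inside an attribute value.
ROBOTS_META = re.compile(
    r"""<meta\b[^>]*\bname\s*=\s*(['"]?)robots\1(?=[\s/>])[^>]*>""",
    re.IGNORECASE,
)


def strip_robots_meta(html: str) -> str:
    """Return the HTML with every robots meta tag removed (hypothetical helper)."""
    return ROBOTS_META.sub("", html)


if __name__ == "__main__":
    page = (
        "<html><head>\n"
        "  <meta name='ROBOTS' content='NOINDEX, NOFOLLOW'>\n"
        "<title>Example</title></head><body></body></html>"
    )
    print(strip_robots_meta(page))
```

Other meta tags (description, keywords and so on) are left alone, so the
rest of the page should reach the indexer unchanged.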

Does anyone have a better solution? Can I use PruneIndexTool?

If the above is the way to go, how do I do it? I mean, what commands do I
need to issue, and which classes do I need to call and modify?

Any help is appreciated. Thanks.

Karthik

-- 
View this message in context: 
http://www.nabble.com/Ignore-Robots-meta-tag-tf3659247.html#a10224500
Sent from the Nutch - User mailing list archive at Nabble.com.

