Hi Andrea,
It sounds like your addition is useful only for people crawling sites
under their control. It's not useful for Internet crawling. Is there
a way to make this useful to a wider audience without overly
complicating the code?
I'm trying to think of specific scenarios where this could be useful
in an internet crawl:
1) Known ad-hosting providers? (Doubleclick, etc)
2) Known domain-parking (godaddy) or domain-squatting patterns?
But, these examples have problems:
(1) -- I can't think of any offhand that would put indexable text
into the page (it would be an <embed>, <img>, or <script>)
(2) -- I'd want to filter domain-squatters earlier in the processing
chain, so we ignore their links too
--matt
On Nov 29, 2007, at 6:13 AM, Andrea Spinelli (JIRA) wrote:
[PARSE-HTML plugin] Block certain parts of HTML code from being
indexed
-----------------------------------------------------------------------
Key: NUTCH-585
URL: https://issues.apache.org/jira/browse/NUTCH-585
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: All operating systems
Reporter: Andrea Spinelli
Priority: Minor
We are using Nutch to index our own web sites; we would like not to
index certain parts of our pages, because we know they are not
relevant (for instance, there are several links to change the
background color) and they generate spurious matches.
We have modified the plugin so that it ignores HTML code between
certain HTML comments, like
<!-- START-IGNORE -->
... ignored part ...
<!-- STOP-IGNORE -->
We feel this might be useful to someone else, perhaps with the
comment strings factored out as constants in the configuration files
(say parser.html.ignore.start and parser.html.ignore.stop in
nutch-site.xml).
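To illustrate the marker semantics described above, here is a minimal standalone sketch in Java. It assumes the regex-based approach and the class name are hypothetical (the actual patch presumably hooks into the parse-html plugin's DOM handling rather than pre-filtering raw markup), and it hard-codes the marker strings that the issue proposes making configurable:

```java
import java.util.regex.Pattern;

public class IgnoreSectionStripper {
    // Marker strings as proposed in the issue; in a real patch these would
    // be read from nutch-site.xml (parser.html.ignore.start / .stop).
    static final String START = "<!-- START-IGNORE -->";
    static final String STOP  = "<!-- STOP-IGNORE -->";

    /** Removes every region between a START marker and the next STOP marker. */
    static String strip(String html) {
        // Non-greedy match so each START pairs with the nearest STOP;
        // DOTALL lets the ignored region span multiple lines.
        Pattern p = Pattern.compile(
                Pattern.quote(START) + ".*?" + Pattern.quote(STOP),
                Pattern.DOTALL);
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p>"
                + "<!-- START-IGNORE --><a href=\"#\">blue background</a><!-- STOP-IGNORE -->"
                + "<p>also keep</p>";
        System.out.println(strip(page));
    }
}
```

Running this prints only the content outside the markers; everything between the comments is dropped before it would reach the indexer.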
We are almost ready to contribute our code snippet. Looking
forward to any expression of interest - or to an explanation of why
what we are doing is plain wrong!
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--
Matt Kangas / [EMAIL PROTECTED]