Hi Andrea,
It sounds like your addition is useful only for people crawling sites
under their control. It's not useful for Internet crawling. Is there
a way to make this useful to a wider audience without overly
complicating the code?
I'm trying to think of specific scenarios where this could be useful
in an internet crawl:
1) Known ad-hosting providers? (Doubleclick, etc)
2) Known domain-parking (godaddy) or domain-squatting patterns?
But, these examples have problems:
(1) -- I can't think of any offhand that would put indexable text
into the page (it would be an <embed>, <img>, or <script>)
(2) -- I'd want to filter domain-squatters earlier in the processing
chain, so we ignore their links too
--matt
On Nov 29, 2007, at 6:13 AM, Andrea Spinelli (JIRA) wrote:
[PARSE-HTML plugin] Block certain parts of HTML code from being
indexed
-----------------------------------------------------------------------
Key: NUTCH-585
URL: https://issues.apache.org/jira/browse/NUTCH-585
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: All operating systems
Reporter: Andrea Spinelli
Priority: Minor
We are using Nutch to index our own web sites; we would like not to
index certain parts of our pages, because we know they are not
relevant (for instance, there are several links to change the
background color) and they generate spurious matches.
We have modified the plugin so that it ignores HTML code between
certain HTML comments, like
<!-- START-IGNORE -->
... ignored part ...
<!-- STOP-IGNORE -->
We feel this might be useful to someone else, perhaps with the
comment strings factored out as constants in the configuration files
(say parser.html.ignore.start and parser.html.ignore.stop in
nutch-site.xml).
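To illustrate the marker semantics described above, here is a minimal standalone sketch in Java. It assumes the regex-based approach and the class name are hypothetical (the actual patch presumably hooks into the parse-html plugin's DOM handling rather than pre-filtering raw markup), and it hard-codes the marker strings that the issue proposes making configurable:

```java
import java.util.regex.Pattern;

public class IgnoreSectionStripper {
    // Marker strings as proposed in the issue; in a real patch these would
    // be read from nutch-site.xml (parser.html.ignore.start / .stop).
    static final String START = "<!-- START-IGNORE -->";
    static final String STOP  = "<!-- STOP-IGNORE -->";

    /** Removes every region between a START marker and the next STOP marker. */
    static String strip(String html) {
        // Non-greedy match so each START pairs with the nearest STOP;
        // DOTALL lets the ignored region span multiple lines.
        Pattern p = Pattern.compile(
                Pattern.quote(START) + ".*?" + Pattern.quote(STOP),
                Pattern.DOTALL);
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p>"
                + "<!-- START-IGNORE --><a href=\"#\">blue background</a><!-- STOP-IGNORE -->"
                + "<p>also keep</p>";
        System.out.println(strip(page));
    }
}
```

Running this prints only the content outside the markers; everything between the comments is dropped before it would reach the indexer.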
We are almost ready to contribute our code snippet. Looking
forward to any expression of interest - or to an explanation of why
what we are doing is plain wrong!
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--
Matt Kangas / [EMAIL PROTECTED]