[Nutch-general] Re: document markup to control indexing

Jack Tang Wed, 28 Dec 2005 18:38:20 -0800

Hi,
Sorry, I meant the getTextHelper() method, not getText().

Say I want to index the content in this block:
<!--indexware-->
This is not Ads
<!--/indexware-->


The code may look like this:

boolean contentStart = false;  // set once <!--indexware--> has been seen
boolean contentEnd = false;    // set once <!--/indexware--> has been seen

if (node.getNodeType() == Node.COMMENT_NODE) {
    // The marker name could be moved to your configuration file.
    if ("indexware".equalsIgnoreCase(node.getNodeValue().trim())) {
        contentStart = true;  // set your flags here
        return true;          // keep walking deeper
    }

    .......
    return false;
}

if (contentStart && !contentEnd && node.getNodeType() == Node.TEXT_NODE) {
    // collect the text between <!--indexware--> and <!--/indexware-->
}
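
To make the idea above concrete, here is a minimal standalone sketch. It is not the actual parse-html code: it uses only the JDK's DOM parser, and the class name MarkerTextExtractor and its extract() method are made up for illustration. It assumes the page is well-formed XHTML and that the marker comments appear in document order.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the text found between
// <!--marker--> and <!--/marker--> comments while walking a DOM tree.
class MarkerTextExtractor {
    private final String marker;     // e.g. "indexware"
    private boolean inside = false;  // true between the opening and closing comment
    private final StringBuilder out = new StringBuilder();

    MarkerTextExtractor(String marker) {
        this.marker = marker;
    }

    // Depth-first walk in document order, toggling the flag on marker comments.
    void walk(Node node) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            String value = node.getNodeValue().trim();
            if (marker.equalsIgnoreCase(value)) {
                inside = true;
            } else if (("/" + marker).equalsIgnoreCase(value)) {
                inside = false;
            }
            return; // comments have no children
        }
        if (inside && node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c);
        }
    }

    // Parses a well-formed XHTML string and returns the text inside the markers.
    static String extract(String xhtml, String marker) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            MarkerTextExtractor ex = new MarkerTextExtractor(marker);
            ex.walk(doc.getDocumentElement());
            return ex.out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Buy now!</p>"
                + "<!--indexware--><p>This is not Ads</p><!--/indexware-->"
                + "<p>More ads</p></body></html>";
        System.out.println(extract(page, "indexware"));
    }
}
```

In Nutch itself the same flag logic would have to live inside the DOM walk of the parse-html plugin (or in your own parse filter); the sketch only shows the comment/flag/text-node handling in isolation.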

/Jack

On 28 Dec 2005 19:26:21 -0000, [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> Hi, I asked this question a while back and didn't get a response, so I rolled
> my own parse solution using jericho-html and applying it to the
> HTMLParseFilter extension point.
>
> I just took a look at the getText() method of the DOMContentUtils
> class and I don't see any way to add your own custom tags (or comment tags)
> short of modifying the parse-html code directly and recompiling.
>
> Is that what is meant by adding your own filter?
>
> Thanks in advance for the help,
> -a
>
> --- [email protected] wrote:
> > Hi Jeff
> >
> > Pls refer to getText() method in
> > org.apache.nutch.parse.html.DOMContentUtils class (of course
> > parse-html plugin). You can add your filter easily;)
> >
> > /Jack
> >
> > On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Another open source search engine, HtDig, allows web page authors to
> > > mark up a page such that some sections are not indexed.  The syntax
> > > looks like the following:
> > >
> > > <!--htdig_noindex-->
> > > ... material inside is not indexed ...
> > > <!--/htdig_noindex-->
> > >
> > > Does a similar feature exist in Nutch? If the answer is "write a
> > > plugin" does anyone have tips on where to start? Also, how hard is
> > > something like this for a Nutch newbie who doesn't know anything about
> > > HTML parsing? I have a bunch of documents already marked up with the
> > > htdig syntax, and in the interests of interoperability I'm tempted to
> > > follow the syntax exactly.
> > >
> > > -Jeff
> > >
> >
> >
> > --
> > Keep Discovering ... ...
> >
> > http://www.jroller.com/page/jmars
> >
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

