Sorry, I meant that
I would like to filter all the MSWord and PDF files after fetching and
before parsing.
Thanks,
Rafit
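
One way to picture this request is a check on the fetched document's content type that runs before any parser is chosen. The sketch below is plain Java and does not use any Nutch API; the class name ContentTypeFilter, the method shouldSkip, and the two MIME types listed are assumptions made for illustration, and where such a check would hook into the fetch/parse cycle depends on the Nutch version.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a fetched
// document should be skipped before parsing, based on its content type.
final class ContentTypeFilter {
    // Assumed MIME types for PDF and MS Word documents.
    private static final Set<String> SKIPPED = new HashSet<>(
        Arrays.asList("application/pdf", "application/msword"));

    private ContentTypeFilter() {}

    static boolean shouldSkip(String contentType) {
        if (contentType == null) {
            return false; // unknown type: let the parser decide
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
            .trim()
            .toLowerCase(Locale.ROOT);
        return SKIPPED.contains(base);
    }
}
```

A caller would test the fetched content type with ContentTypeFilter.shouldSkip(...) and only hand the document to a parser when it returns false.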
From: "Rafit Izhak_Ratzin" <[EMAIL PROTECTED]>
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: Filtering content before parsing?
Date: Wed, 25 Jan 2006 00:05:18 +0000
Hi,
I would like to filter all the MSWord and PDF files after fetching and before
filtering.
Is there a way to do that?
Thanks,
Rafit
From: Gal Nitzan <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Subject: Re: Filtering content before parsing?
Date: Tue, 24 Jan 2006 21:55:57 +0200
Create your own parser.
Create a new plugin, parse-chunwei :) for example (just copy the whole
parse-html plugin and take it from there).
Subclass HtmlParser and there you have it.
G.
P.S. Do not forget to replace parse-html in the plugin.includes entry in
nutch-site.xml.
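
The form-stripping step that such a custom parser would perform could look roughly like the sketch below. It is plain Java with no Nutch dependency; the class name FormStripper and the method stripForms are hypothetical, and a regular expression is only a rough approximation of HTML handling that assumes reasonably well-formed form tags.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <form>...</form> blocks
// from raw HTML so that option values never reach the parsed text.
final class FormStripper {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // The reluctant '.*?' stops at the first closing </form> tag.
    private static final Pattern FORM_BLOCK =
        Pattern.compile("(?is)<form\\b[^>]*>.*?</form\\s*>");

    private FormStripper() {}

    static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }
}
```

In a parser subclass, the idea would be to decode the fetched bytes, pass the text through FormStripper.stripForms, and hand the result to the inherited HTML parsing code. The exact Nutch classes and method signatures for that wiring vary by version and are not shown here.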
On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:
> Is there an easy way to filter content after fetching but before
> parsing?
>
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form html block before parsing (as per above question). Does
> anyone have another method around this?
>
> Thanks
> CW
>
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general