In other words, you want to crawl the whole site, but index only some of its pages? To be honest, this is something I would like to do as well. I haven't fully verified it yet, but it seems you can write an IndexingFilter which throws an exception if the page shouldn't be indexed. Unfortunately you cannot simply return null, because that leads to a NullPointerException. Throwing the exception causes a WARN log message, which may flood your logs if you have a large site.
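The decision such a filter would make (crawl everything, index only URLs matching a whitelist pattern) can be sanity-checked outside Nutch with plain java.util.regex. This is only a sketch of the gating logic; the class name, pattern, and sample URLs below are made up for illustration, not taken from the thread or from Nutch itself:

```java
import java.util.regex.Pattern;

// Sketch of the decision an IndexingFilter-style hook would make:
// crawl everything, but only pass pages whose URL matches an
// "index me" pattern on to the index.
public class IndexUrlGate {

    // Hypothetical whitelist pattern for pages that should be indexed;
    // substitute the expression you actually want.
    private static final Pattern INDEX_PATTERN =
        Pattern.compile("^http://([a-z0-9]*\\.)*example\\.com/.*\\.html$");

    /** Returns true if the page at this URL should go into the index. */
    public static boolean shouldIndex(String url) {
        return INDEX_PATTERN.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("http://www.example.com/red-widget.html"));
        System.out.println(shouldIndex("http://www.example.com/login"));
    }
}
```

Inside an actual IndexingFilter you would run this check on the page URL and throw the exception (with the log-volume caveat above) when it returns false.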
I hope it helps,
Marcin Okraszewski

On 5/5/07, simon_ece <[EMAIL PROTECTED]> wrote:
>
> hi, thanks for the reply,
>
> this is my conf/crawl-urlfilter.txt file content:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*example.com/
>
> # skip everything else
> -.
>
> It's crawling the whole site and I can view all the related matches while
> searching, but I need to filter out some of the pages.
> For example, if I search for some category (red),
> it will list out all the links,
> but I want to show only the particular links that match the regular
> expression
>
> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
>
> kindly post your suggestion
> Regards,
> Simon
> __________________________________________________________________
>
> Marcin Okraszewski wrote:
> >
> > How about conf/crawl-urlfilter.txt?
> >
> > Marcin
> >
> > On 5/4/07, simon_ece <[EMAIL PROTECTED]> wrote:
> >>
> >> hi all,
> >> I am new to Nutch. I would like to crawl a particular site and get the
> >> results in the following pattern. I don't want to list other URLs from
> >> the crawled site.
> >>
> >> Site to be crawled: e.g. www.example.com
> >> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
> >>
> >> I can crawl and I'm getting all the matching URLs from the site,
> >> but I don't know how to filter out the URLs and keep only the
> >> particular ones. Kindly post your suggestions.
> >> Thanks & Regards
> >> Simon
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10318059
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> View this message in context:
> http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10334300
> Sent from the Nutch - User mailing list archive at Nabble.com.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
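Simon's expression can be exercised outside Nutch with java.util.regex before putting it into a filter. One thing worth noticing: `\(` and `\)` in the pattern match literal parentheses in the URL, so only URLs that actually contain `(...)` segments will match; if grouping was intended, unescaped parentheses would be needed. The sample URLs below are invented purely to exercise the pattern:

```java
import java.util.regex.Pattern;

public class PatternCheck {
    // The exact expression from the thread, with backslashes doubled
    // for a Java string literal. \( and \) match literal parentheses.
    static final Pattern P = Pattern.compile(
        "^http://([a-z0-9]*\\.)example.com/([a-zA-Z]*)-\\([a-z0-9]*\\)-.*-\\([0-9]*-[A-Za-z0-9]*\\)\\.html$");

    static boolean matches(String url) {
        return P.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, only to demonstrate what the pattern accepts.
        System.out.println(matches(
            "http://www.example.com/Shoes-(red)-summer-sale-(12-AB34).html")); // true
        System.out.println(matches(
            "http://www.example.com/about.html")); // false: no (...) segments
    }
}
```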
