In other words, you want to crawl the whole site, but index only some of its pages? To be honest, this is something I would like to do as well. I haven't fully verified it yet, but it seems you can write an IndexingFilter which throws an exception if the page shouldn't be indexed. Unfortunately you cannot simply return null, because that leads to a NullPointerException. Throwing the exception causes a WARN log message, which may flood your logs if you have a large site.
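The decision such a filter would make (crawl everything, index only URLs matching a whitelist pattern) can be sanity-checked outside Nutch with plain java.util.regex. This is only a sketch of the gating logic; the class name, pattern, and sample URLs below are made up for illustration, not taken from the thread or from Nutch itself:

```java
import java.util.regex.Pattern;

// Sketch of the decision an IndexingFilter-style hook would make:
// crawl everything, but only pass pages whose URL matches an
// "index me" pattern on to the index.
public class IndexUrlGate {

    // Hypothetical whitelist pattern for pages that should be indexed;
    // substitute the expression you actually want.
    private static final Pattern INDEX_PATTERN =
        Pattern.compile("^http://([a-z0-9]*\\.)*example\\.com/.*\\.html$");

    /** Returns true if the page at this URL should go into the index. */
    public static boolean shouldIndex(String url) {
        return INDEX_PATTERN.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("http://www.example.com/red-widget.html"));
        System.out.println(shouldIndex("http://www.example.com/login"));
    }
}
```

Inside an actual IndexingFilter you would run this check on the page URL and throw the exception (with the log-volume caveat above) when it returns false.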
I hope it helps,
Marcin Okraszewski

On 5/5/07, simon_ece <[EMAIL PROTECTED]> wrote:
>
> hi, thanks for the reply,
>
> this is my conf/crawl-urlfilter.txt file content:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*example.com/
>
> # skip everything else
> -.
>
> It's crawling the whole site and I can view all the related matches while
> searching, but I need to filter out some of the pages.
> For example, if I search for some category (red),
> it will list out all the links,
> but I want to show only the particular links that match the regular
> expression
>
> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
>
> kindly post your suggestion
> Regards,
> Simon
> __________________________________________________________________
>
> Marcin Okraszewski wrote:
> >
> > How about conf/crawl-urlfilter.txt?
> >
> > Marcin
> >
> > On 5/4/07, simon_ece <[EMAIL PROTECTED]> wrote:
> >>
> >> hi all,
> >> I am new to Nutch. I would like to crawl a particular site and get the
> >> results in the following pattern. I don't want to list other URLs from
> >> the crawled site.
> >>
> >> Site to be crawled: e.g. www.example.com
> >> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
> >>
> >> I can crawl and I'm getting all the matching URLs from the site,
> >> but I don't know how to filter out the URLs and keep only the
> >> particular ones. Kindly post your suggestions.
> >> Thanks & Regards
> >> Simon
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10318059
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> View this message in context:
> http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10334300
> Sent from the Nutch - User mailing list archive at Nabble.com.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
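Simon's expression can be exercised outside Nutch with java.util.regex before putting it into a filter. One thing worth noticing: `\(` and `\)` in the pattern match literal parentheses in the URL, so only URLs that actually contain `(...)` segments will match; if grouping was intended, unescaped parentheses would be needed. The sample URLs below are invented purely to exercise the pattern:

```java
import java.util.regex.Pattern;

public class PatternCheck {
    // The exact expression from the thread, with backslashes doubled
    // for a Java string literal. \( and \) match literal parentheses.
    static final Pattern P = Pattern.compile(
        "^http://([a-z0-9]*\\.)example.com/([a-zA-Z]*)-\\([a-z0-9]*\\)-.*-\\([0-9]*-[A-Za-z0-9]*\\)\\.html$");

    static boolean matches(String url) {
        return P.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, only to demonstrate what the pattern accepts.
        System.out.println(matches(
            "http://www.example.com/Shoes-(red)-summer-sale-(12-AB34).html")); // true
        System.out.println(matches(
            "http://www.example.com/about.html")); // false: no (...) segments
    }
}
```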
