Hi, thanks for the reply. This is the content of my conf/crawl-urlfilter.txt file:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*example.com/

# skip everything else
-.

It is crawling the whole site, and I can view all the related matches while searching, but I need to filter out some of the pages. For example, if I search for some category (red), the results list all the links, but I want to show only the links that match this regular expression:

^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$

Kindly post your suggestions.

Regards,
Simon

__________________________________________________________________

Marcin Okraszewski wrote:
>
> How about conf/crawl-urlfilter.txt ??
>
> Marcin
>
> On 5/4/07, simon_ece <[EMAIL PROTECTED]> wrote:
>>
>> Hi all,
>> I am new to Nutch. I would like to crawl a particular site and get the
>> results in the following pattern. I don't want to list other URLs from the
>> crawled site.
>>
>> Site to be crawled, e.g. www.example.com
>> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
>>
>> I can crawl and get all the matching URLs from the site, but
>> I don't know how to filter out the URLs and get only the particular URLs.
>> Kindly post your suggestions.
>> Thanks & Regards
>> Simon
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10318059
>> Sent from the Nutch - User mailing list archive at Nabble.com.
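
For anyone following the thread, the way the posted filter file is evaluated can be sketched in Python. This is only an approximation and not Nutch code (Nutch is Java): it assumes each rule is a + or - followed by a regex, that rules are tried from top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The helper names (compile_rules, accepts), the test URLs, and the UNESCAPED variant are made up for illustration; UNESCAPED assumes the \( and \) in the posted pattern were meant as groups rather than literal parentheses, which the thread does not confirm.

```python
import re

# Rules copied from the posted filter file, in file order. The rule that the
# archive rendered as "[EMAIL PROTECTED]" is omitted because its original
# text cannot be recovered from the message.
BASE_RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-.*(/.+?)/.*?\1/.*?\1/",
]

# Accept rule from the posted file: every page on the site.
ACCEPT_SITE = r"+^http://([a-z0-9]*\.)*example.com/"

# The pattern from the question, verbatim. In this regex dialect \( and \)
# match literal parentheses in the URL.
AS_POSTED = (r"+^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)"
             r"-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$")

# Hypothetical variant (an assumption, not from the thread): the same
# pattern with the backslashes removed so the parentheses act as groups.
UNESCAPED = (r"+^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-([a-z0-9]*)"
             r"-.*-([0-9]*-[A-Za-z0-9]*)\.html$")

SKIP_REST = "-."


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    return [(line[0] == "+", re.compile(line[1:])) for line in lines]


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


RULE_SETS = {
    "whole site": compile_rules(BASE_RULES + [ACCEPT_SITE, SKIP_REST]),
    "as posted": compile_rules(BASE_RULES + [AS_POSTED, SKIP_REST]),
    "unescaped": compile_rules(BASE_RULES + [UNESCAPED, SKIP_REST]),
}

# Made-up URLs, only to exercise the rules.
TEST_URLS = [
    "http://www.example.com/about.html",
    "http://www.example.com/logo.png",
    "http://www.example.com/red-shoes-sale-123-ab.html",
    "http://www.example.com/red-(shoes)-sale-(123-ab).html",
]

if __name__ == "__main__":
    for name, rules in RULE_SETS.items():
        for url in TEST_URLS:
            verdict = "accept" if accepts(rules, url) else "reject"
            print(f"{name:10} {verdict:6} {url}")
```

Swapping the broad accept rule for the narrower pattern, while keeping the final "-." line, is what restricts the accepted URLs in this sketch. One caveat to keep in mind: a URL filter decides what gets fetched, so rejecting the category or listing pages may also keep the crawler from ever discovering the pages that do match.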

--
View this message in context: http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10334300
Sent from the Nutch - User mailing list archive at Nabble.com.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
