Hi, thanks for the reply.

This is the content of my conf/crawl-urlfilter.txt file:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*example.com/



# skip everything else
-.
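
To make the effect of the rule order easier to see, here is a minimal sketch of how such a rule list could be evaluated. It assumes the rules are tried in file order, that the first pattern found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is dropped. Python's re module stands in for the Java regex engine Nutch uses, the names RULES and accepts are hypothetical, and the character-class rule is left out because it is garbled in the archive.

```python
import re

# (sign, pattern) pairs in the same order as the filter file above.
# The character-class rule is omitted: it is garbled in the archived message.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)*example.com/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/logo.png",
        "http://www.example.com/a/b/a/b/a/b/",
        "http://other.org/",
    ):
        print(url, accepts(url))
```

Under these assumptions, every page on example.com that is not an image, archive, or looping path is accepted, which is consistent with the whole site being crawled.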

It is crawling the whole site, and I can see all the related matches when
searching, but I need to filter out some of the pages.
For example, if I search for some category (red), this lists all the links,
but I want to show only the links that match the regular expression:

^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
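
As a quick check of what this expression accepts, the sketch below tries it against a few made-up URLs. The sample URLs and the name is_wanted are hypothetical, and Python's re is used in place of Java's regex engine; for this pattern the two should behave the same. Note that \( and \) match literal parentheses in the URL, and that ([a-z0-9]*\.) without a trailing * requires exactly one subdomain label such as "www.".

```python
import re

# The expression from the message, unchanged.
TARGET = re.compile(
    r"^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*"
    r"-\([0-9]*-[A-Za-z0-9]*\)\.html$"
)


def is_wanted(url):
    """Return True if url matches the target expression."""
    return TARGET.search(url) is not None


if __name__ == "__main__":
    samples = [
        # Hypothetical URL with literal parentheses: matches.
        "http://www.example.com/red-(abc1)-foo-(123-xyz).html",
        # Same URL without parentheses: does not match.
        "http://www.example.com/red-abc1-foo-123-xyz.html",
        # No subdomain label before example.com: does not match.
        "http://example.com/red-(abc1)-foo-(123-xyz).html",
        # The site's start page: does not match.
        "http://www.example.com/",
    ]
    for url in samples:
        print(url, is_wanted(url))
```

Because the start page itself does not match, using this expression as the only accept rule during the crawl would presumably stop the crawler before it ever reaches the wanted pages, so the filtering may need to happen at a later stage than fetching.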

Kindly post your suggestions.
Regards,
Simon
__________________________________________________________________

Marcin Okraszewski wrote:
> 
> How about  conf/crawl-urlfilter.txt  ??
> 
> Marcin
> 
> On 5/4/07, simon_ece <[EMAIL PROTECTED]> wrote:
>>
>> Hi all,
>> I am new to Nutch. I would like to crawl a particular site and get the
>> results in the following pattern. I don't want to list other URLs from the
>> crawled site.
>>
>> Site to be crawled, e.g.: www.example.com
>> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
>>
>> I can crawl and am getting all the matching URLs from the site, but
>> I don't know how to filter the URLs and get only the particular ones.
>> Kindly post your suggestions.
>> Thanks & Regards
>> Simon
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10334300
Sent from the Nutch - User mailing list archive at Nabble.com.

