Cha,
Are you updating regex-urlfilter.txt or crawl-urlfilter.txt?
If you use the "nutch crawl" command, you need to update crawl-urlfilter.txt.
If you do whole-web crawling, you need to update regex-urlfilter.txt.
Jason
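
A minimal sketch of why the rule has to sit in the file that is actually read, and ahead of any broader accept rule. It is written in Python rather than Nutch's Java, and it assumes two things the thread does not state: that rules are tried top to bottom with the first match deciding, and that a pattern counts as matching when it is found anywhere in the URL. The helper names (load_rules, accepts) and the "+" accept rule are hypothetical.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Exclusion rule quoted later in this thread, followed by a hypothetical accept rule.
RULES = r"""
# skip the merchandise pages
-^http://([a-z0-9]*\.)*example.com/stores/[^/]+/(merch-cats-pg|merch-cats|merch)/
# accept everything else on the site
+^http://([a-z0-9]*\.)*example.com/
"""

rules = load_rules(RULES)
print(accepts(rules, "http://www.example.com/stores/abcd/merch/abd.html"))
print(accepts(rules, "http://www.example.com/stores/abcd/index.html"))
```

Under those assumptions, swapping the two rules would let the accept rule win first, so the exclusion would never be reached.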

On Mar 22, 2007, at 10:26 PM, cha wrote:

>
> Hi,
>
> I have tried the filters you provided, but it is still not working.
>
> I have enabled the urlfilter-regex plugin in the configuration as well.
>
> I can't figure out what the problem is.
>
> Cheers,
> cha
>
>
>
> Jason Culverhouse wrote:
>>
>> Cha,
>> You want something like this
>>
>> -^http://([a-z0-9]*\.)*example.com/stores/[^/]+/(merch-cats-pg|merch-cats|merch)/
>>
>> Your regex fails to cover the whole URL because in the last segment,
>> '/merch-cats-pg\.*', the '\.' matches only a literal dot.
>> So it covers http://www.example.com/stores/abcd/merch-cats-pg.
>> but not http://www.example.com/stores/abcd/merch-cats-pg/
>>
>> You could also just change that to  '/merch-cats-pg.*'
>>
>> Get a copy of Mastering Regular Expressions By Jeffrey E. F. Friedl
>> http://www.oreilly.com/catalog/regex3/index.html
>>
>> Jason
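
To make the difference between '\.*' and '.*' concrete, here is a small sketch in Python (not Nutch's Java regex engine, though both behave the same for these simple patterns). The helper name covered and the sample path ending are hypothetical; the two patterns are the tails discussed above.

```python
import re

# Tail of the original rule: "\." is an escaped dot, so "\.*" matches only a run of dots.
ESCAPED_TAIL = re.compile(r"/merch-cats-pg\.*")
# Tail with an unescaped dot: ".*" matches any run of characters.
ANY_TAIL = re.compile(r"/merch-cats-pg.*")

def covered(pattern, path):
    """Return the part of `path` the pattern consumes from its start, or None."""
    m = pattern.match(path)
    return m.group(0) if m else None

path = "/merch-cats-pg/abcd.html"  # hypothetical path ending
print(covered(ESCAPED_TAIL, path))  # stops before the "/"
print(covered(ANY_TAIL, path))      # runs to the end of the path
```

The escaped form stops at '/merch-cats-pg', while the unescaped form consumes the rest of the path.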
>>
>> On Mar 21, 2007, at 8:37 AM, cha wrote:
>>
>>>
>>>
>>> Hi,
>>>
>>> I want to exclude the following URLs from crawling,
>>>
>>> for example:
>>>
>>> http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
>>> http://www.example.com/stores/abcd/merch-cats/abcd.*
>>> http://www.example.com/stores/abcd/merch/abd.*
>>>
>>>
>>> I edited the regex-urlfilter.txt file and added exclusion rules for these URLs:
>>>
>>>
>>> # skip URLs containing certain characters as probable queries, etc.
>>> -[?*!@=]
>>> -http://([a-z0-9]*\.)*example.com/stores/.*/merch-cats-pg\.*
>>> -http://([a-z0-9]*\.)*example.com/stores/.*/merch-cats\.*
>>> -http://([a-z0-9]*\.)*example.com/stores/.*/merch\.*
>>>
>>> The above filters still don't filter out all of these URLs.
>>>
>>> Is there any way to solve this? Are there any alternatives?
>>>
>>> Awaiting,
>>>
>>> Cha
>>>
>>>
>>>
>>> --  
>>> View this message in context: http://www.nabble.com/help-needed-%3A-filters-in-regex-urlfilter.txt-tf3441531.html#a9596460
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>
>>
>>
>
> --  
> View this message in context: http://www.nabble.com/help-needed-%3A-filters-in-regex-urlfilter.txt-tf3441531.html#a9629082
> Sent from the Nutch - User mailing list archive at Nabble.com.
>


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
