Hi,

I want to crawl only .htm, .html and .do pages from my web site. Secondly, I want to ignore the following URLs when crawling:

http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
http://www.example.com/stores/abcd/merch-cats/abcd.*
http://www.example.com/stores/abcd/merch/abd.*

I have set all the filters in the regex-urlfilter and crawl-urlfilter files.

Following is the filter code that is meant to fulfil my purpose:

# skip the merchant category pages under /stores/
-^http://www\.example\.com/stores/.*/merch.*

# accept .htm and .do pages in example.com
+^http://([a-z0-9]*\.)*example\.com/.*\.htm$
+^http://([a-z0-9]*\.)*example\.com/.*\.do
+^http://([a-z0-9]*\.)*example\.com/$


It crawls all the required pages correctly. The only problem is that I was getting a '?' or some other characters after .htm in the URLs, so I added the $ anchor (.htm$).

But after adding that, it no longer crawls the merchant pages and it skips a lot of URLs that I require.

So I don't know what to do.
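
For what it is worth, here is a variant I sketched but have not tested yet. It assumes that the skipped URLs are the .html pages (and similar) that the .htm$ rule no longer matches, and it relies on the first matching rule winning, so the exclusions come before the accepts:

# skip the three merch directories explicitly (exclusions before accepts)
-^http://www\.example\.com/stores/abcd/merch-cats-pg/
-^http://www\.example\.com/stores/abcd/merch-cats/
-^http://www\.example\.com/stores/abcd/merch/

# accept .htm, .html and .do pages on example.com, without query strings
+^http://([a-z0-9]*\.)*example\.com/.*\.html?$
+^http://([a-z0-9]*\.)*example\.com/.*\.do$
+^http://([a-z0-9]*\.)*example\.com/?$

# skip everything else
-.

Is this the right way to structure it, or is there a better approach?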

Any suggestions would be much appreciated.

Cheers,
Cha

