Re: Prevent crawl of parent URL

stone2dbone Mon, 12 Aug 2013 11:07:07 -0700

Thanks for the reference to debuggex. I have followed your recommendations
and tested other regexes, but I am still having a problem. I appreciate your
patience in helping me understand what I am missing. In seed.txt I have:


http://my.domain.name/dir/

I need to index the files but not the directories, e.g.

http://my.domain.name/dir/testA.xls
http://my.domain.name/dir/testB.doc
http://my.domain.name/dir/subdir1/testAA.pdf
http://my.domain.name/dir/subdir1/testBB.mpp
http://my.domain.name/dir/subdir2/testCC.pub
http://my.domain.name/dir/subdir2/testDD.docx

Your recommendation:

-^http://my.domain.name/dir/.*/$
+^http://my.domain.name/dir/.*/.*

gives me nothing.  The seed URL is rejected.
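
To probe this outside Nutch, here is a minimal Python sketch of how I
understand the filter to work. It assumes the rules are tried in order, the
first pattern found anywhere in the URL decides, and a URL matching no rule
is rejected. The decide() helper is hypothetical, not part of Nutch.

```python
import re

# The two rules from above, in file order: '+' accepts, '-' rejects.
RULES = [
    ("-", r"^http://my.domain.name/dir/.*/$"),
    ("+", r"^http://my.domain.name/dir/.*/.*"),
]

def decide(url, rules=RULES):
    """Return the sign of the first rule whose pattern is found in url.

    Returns None when no rule matches (assumed to mean 'rejected').
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign
    return None

seed = "http://my.domain.name/dir/"
for url in (
    seed,
    seed + "testA.xls",
    seed + "subdir1/",
    seed + "subdir1/testAA.pdf",
):
    print(decide(url), url)
```

Under those assumptions, both patterns need one more '/' after 'dir/', so
neither matches the seed itself, which would explain why it is rejected.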



+^http://my.domain.name/dir/*

gives me 9 'documents' (3 of which I don't want):

http://my.domain.name/dir/
http://my.domain.name/dir/testA.xls
http://my.domain.name/dir/testB.doc
http://my.domain.name/dir/subdir1/
http://my.domain.name/dir/subdir1/testAA.pdf
http://my.domain.name/dir/subdir1/testBB.mpp
http://my.domain.name/dir/subdir2/
http://my.domain.name/dir/subdir2/testCC.pub
http://my.domain.name/dir/subdir2/testDD.docx
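
A quick check of that pattern in Python, as a sketch only. It assumes the
pattern is searched for within the URL rather than required to match the
whole URL. The trailing '*' applies to the '/' before it, so the pattern
means "dir followed by zero or more slashes" and accepts anything that
starts with that prefix. The last URL in the example is made up to show this.

```python
import re

PATTERN = re.compile(r"^http://my.domain.name/dir/*")

BASE = "http://my.domain.name/dir/"
URLS = [
    BASE,
    BASE + "testA.xls",
    BASE + "testB.doc",
    BASE + "subdir1/",
    BASE + "subdir1/testAA.pdf",
    BASE + "subdir1/testBB.mpp",
    BASE + "subdir2/",
    BASE + "subdir2/testCC.pub",
    BASE + "subdir2/testDD.docx",
]

# Every listed URL starts with the prefix, so all of them are accepted.
matched = [u for u in URLS if PATTERN.search(u)]
print(len(matched))

# Hypothetical URL outside /dir/: still matches, because '/*' allows zero slashes.
outside = "http://my.domain.name/directory-other/file.txt"
print(bool(PATTERN.search(outside)))
```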




-^http://my.domain.name/dir/([^/]+/)+$
+^http://my.domain.name/dir/*

gives me only 3 'documents' (1 of which I don't want):

http://my.domain.name/dir/
http://my.domain.name/dir/testA.xls
http://my.domain.name/dir/testB.doc
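
A possible explanation, sketched in Python under stated assumptions: if a
rejected URL is never fetched, the links on that page are never discovered.
The link graph below is hypothetical (inferred from the listing above), and
accepted() and crawl() are illustrative helpers, not Nutch code.

```python
import re
from collections import deque

# Rules in file order; the first pattern found in the URL decides.
RULES = [
    ("-", r"^http://my.domain.name/dir/([^/]+/)+$"),
    ("+", r"^http://my.domain.name/dir/*"),
]

BASE = "http://my.domain.name/dir/"

# Assumed outlinks of each directory listing page.
LINKS = {
    BASE: [BASE + "testA.xls", BASE + "testB.doc",
           BASE + "subdir1/", BASE + "subdir2/"],
    BASE + "subdir1/": [BASE + "subdir1/testAA.pdf",
                        BASE + "subdir1/testBB.mpp"],
    BASE + "subdir2/": [BASE + "subdir2/testCC.pub",
                        BASE + "subdir2/testDD.docx"],
}

def accepted(url):
    """True if the first matching rule is '+'; False if '-' or no match."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

def crawl(seed):
    """Breadth-first walk that only follows links from accepted URLs."""
    seen, queue = [], deque([seed])
    while queue:
        url = queue.popleft()
        if url in seen or not accepted(url):
            continue
        seen.append(url)
        queue.extend(LINKS.get(url, []))
    return seen

reached = crawl(BASE)
for url in reached:
    print(url)
```

In this model the '-' rule rejects subdir1/ and subdir2/ before they are
fetched, so the four files beneath them are never seen, which matches the
3 'documents' above.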

How can I get only the 6 documents I want? What am I doing wrong?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4084057.html
Sent from the Nutch - User mailing list archive at Nabble.com.
