Thanks for the reference to debuggex. I have followed your recommendations and tested other regexes, but I am still having a problem. I appreciate your patience in helping me understand what I am missing. In seed.txt I have:

http://my.domain.name/dir/

I need to index the files but not the directories, e.g.

http://my.domain.name/dir/testA.xls
http://my.domain.name/dir/testB.doc
http://my.domain.name/dir/subdir1/testAA.pdf
http://my.domain.name/dir/subdir1/testBB.mpp
http://my.domain.name/dir/subdir2/testCC.pub
http://my.domain.name/dir/subdir2/testDD.docx

Your recommendation:

-^http://my.domain.name/dir/.*/$
+^http://my.domain.name/dir/.*/.*

gives me nothing. The seed URL is rejected.

+^http://my.domain.name/dir/*

gives me 9 'documents' (3 of which I don't want):

http://my.domain.name/dir/
http://my.domain.name/dir/testA.xls
http://my.domain.name/dir/testB.doc
http://my.domain.name/dir/subdir1/
http://my.domain.name/dir/subdir1/testAA.pdf
http://my.domain.name/dir/subdir1/testBB.mpp
http://my.domain.name/dir/subdir2/
http://my.domain.name/dir/subdir2/testCC.pub
http://my.domain.name/dir/subdir2/testDD.docx

-^http://my.domain.name/dir/([^/]+/)+$
+^http://my.domain.name/dir/*

gives me only 3 'documents' (1 of which I don't want):

http://my.domain.name/dir/
http://my.domain.name/dir/testA.xls
http://my.domain.name/dir/testB.doc

How can I get only the 6 documents I want? What am I doing wrong?

--
View this message in context: http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4084057.html
Sent from the Nutch - User mailing list archive at Nabble.com.
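
The three rule sets above can be checked outside Nutch with a small Python sketch. It is not Nutch code. It assumes the regex URL filter reads rules top to bottom, lets the first rule whose pattern is found anywhere in the URL decide (+ accepts, - rejects), and rejects a URL that no rule matches. The names parse_rules, url_filter, accepted and ATTEMPTS are hypothetical helpers introduced only for this illustration, and Python's re module stands in for Java's regex engine, which should behave the same way for these simple patterns.

```python
import re

BASE = "http://my.domain.name/dir/"

# The nine URLs from the post: the seed, two subdirectories, six files.
URLS = [BASE + path for path in (
    "", "testA.xls", "testB.doc",
    "subdir1/", "subdir1/testAA.pdf", "subdir1/testBB.mpp",
    "subdir2/", "subdir2/testCC.pub", "subdir2/testDD.docx",
)]

# The three rule sets tried in the post, one rule per line.
ATTEMPTS = {
    "attempt 1": "-^http://my.domain.name/dir/.*/$\n"
                 "+^http://my.domain.name/dir/.*/.*",
    "attempt 2": "+^http://my.domain.name/dir/*",
    "attempt 3": "-^http://my.domain.name/dir/([^/]+/)+$\n"
                 "+^http://my.domain.name/dir/*",
}


def parse_rules(text):
    """Split rule text into (sign, pattern) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0], line[1:]))
    return rules


def url_filter(rules, url):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


def accepted(rule_text):
    """Return the URLs from URLS that the given rule set accepts."""
    rules = parse_rules(rule_text)
    return [url for url in URLS if url_filter(rules, url)]


if __name__ == "__main__":
    for name, rule_text in ATTEMPTS.items():
        print(name)
        for url in URLS:
            verdict = "accept" if url in accepted(rule_text) else "reject"
            print("  ", verdict, url)
```

Note that the sketch tests every URL in isolation, whereas a crawl only reaches URLs it discovers by following links from pages that were themselves accepted, so the per-URL verdicts printed here need not equal the set of documents a crawl ends up with.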