Hi,

regexes must follow the Java regex syntax, see
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

I think your intention was:

# skip .../test and .../test/
-^https://my\.domain\.name/inside/test/?$

# allow paths below .../test/
+^https://my\.domain\.name/inside/test/.+

Finally, seeds are also filtered: you cannot use
https://my.domain.name/inside/test/ as a seed URL.

Sebastian

On 07/25/2013 02:49 PM, stone2dbone wrote:
> When I perform a crawl, one of the documents returned by Nutch is the index
> of documents. e.g.
>
> for a crawl of:
> https://my.domain.name/inside/test/
>
> the content of the first document is:
> Index of /inside/test Index of /inside/test Parent Directory test_css.css
> test_css.html test_css1.html test_css2.html test_css3.html test_css4.css
> test_css4.html test_css5.cfm test_css6.cfm
>
> How do I prevent this from happening?
>
> regex-urlfilter.txt has the following:
> # skip URLs
> -^https://my.domain.name/inside/test$
>
> # accept URLs
> +^https://my.domain.name/inside/test/*
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-returns-index-as-document-tp4080323.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
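
For what it's worth, the difference between the original rules and the corrected ones can be checked in plain Java. The following is only a rough sketch, not Nutch code: it assumes the filter walks the rules top to bottom, lets the first matching rule decide, rejects URLs that match no rule, and uses a partial match (Matcher.find()) rather than a full match. The class, field, and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    /** One line of a rule file: '+' (accept) or '-' (skip) plus a Java regex. */
    public static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Rules as posted in the original question.
    public static final List<Rule> ORIGINAL = Arrays.asList(
        new Rule(false, "^https://my.domain.name/inside/test$"),
        new Rule(true,  "^https://my.domain.name/inside/test/*"));

    // Rules as suggested in the reply.
    public static final List<Rule> CORRECTED = Arrays.asList(
        new Rule(false, "^https://my\\.domain\\.name/inside/test/?$"),
        new Rule(true,  "^https://my\\.domain\\.name/inside/test/.+"));

    /**
     * Assumed semantics: the first rule whose pattern is found in the URL
     * decides; a URL that matches no rule is rejected.
     */
    public static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://my.domain.name/inside/test",
            "https://my.domain.name/inside/test/",
            "https://my.domain.name/inside/test/test_css.html"
        };
        for (String url : urls) {
            System.out.println(url
                + "  original=" + accepts(ORIGINAL, url)
                + "  corrected=" + accepts(CORRECTED, url));
        }
    }
}
```

Under these assumptions the original rules let https://my.domain.name/inside/test/ through: the skip rule ends in `test$`, so it does not match the trailing slash, while `/*` in the accept rule means "zero or more slashes" and therefore matches the directory URL itself. With the corrected rules the directory URL is skipped with or without the trailing slash, and only paths below it are accepted.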

