Turns out it was because I had a copy of the default file sitting in the
directory I was calling nutch from.

Once I removed that it correctly found my copy in the conf directory.
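
In case anyone else hits this, a quick way to spot a stray copy is to list every place the file exists. This is only a rough sketch, not anything from Nutch itself: the helper name find_config_copies is made up for illustration, and it assumes the conf directory is whatever NUTCH_CONF_DIR points at (falling back to ./conf). It does not reproduce how Nutch actually resolves the file; it just shows where the copies are.

```python
import os
from pathlib import Path


def find_config_copies(filename, search_dirs):
    """Return every existing copy of `filename` in `search_dirs`, in order.

    Hypothetical helper for illustration; not part of Nutch.
    """
    copies = []
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            copies.append(candidate)
    return copies


if __name__ == "__main__":
    # Assumption: NUTCH_CONF_DIR, when set, names the conf directory;
    # otherwise fall back to ./conf relative to the current directory.
    conf_dir = os.environ.get("NUTCH_CONF_DIR", "conf")
    for path in find_config_copies("urlfilter-suffix.txt", [Path.cwd(), conf_dir]):
        print(path.resolve())
```

If this prints more than one path, there is a second copy lying around that may be the one getting picked up.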


On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen <[email protected]> wrote:

> Doh!  I really should just read the code of things before posting.
>
> I ran the URLFilterChecker and passed it in a url that the SuffixFilter
> should flag and it still passed it.  However, if I change the url to end in
> a format that is in the default config file, it rejects the url.
>
> So it looks like the problem is that it's not loading the altered config
> file from my conf directory.  Not sure why since the regex filter correctly
> finds its config file.
>
>
> On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma <[email protected]> wrote:
>
>> We happily use that filter just as it is shipped with Nutch. Just
>> enabling it in plugin.includes works for us. To ease testing you can use
>> the bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>>
>>
>> -----Original message-----
>> > From:Bai Shen <[email protected]>
>> > Sent: Wed 12-Jun-2013 14:32
>> > To: [email protected]
>> > Subject: Suffix URLFilter not working
>> >
>> > I'm dealing with a lot of file types that I don't want to index.  I was
>> > originally using the regex filter to exclude them but it was getting out
>> > of hand.
>> >
>> > I changed my plugin.includes from
>> >
>> > urlfilter-regex
>> >
>> > to
>> >
>> > urlfilter-(regex|suffix)
>> >
>> > I've tried both editing the default urlfilter-suffix.txt file to add the
>> > extensions I don't want and making my own file that starts with + and
>> > includes the extensions I do want.
>> >
>> > Neither of these approaches seems to work.  I continue to get urls added
>> > to the database which contain extensions I don't want.  Even adding a
>> > urlfilter.order section to my nutch-site.xml doesn't work.
>> >
>> > I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
>> > suggestions for what else to look at?
>> >
>> > Thanks.
>> >
>>
>
>
