Glad to help and good luck!

On Fri, Aug 3, 2012 at 1:43 AM, Ian Piper <ianpi...@tellura.co.uk> wrote:
> Hi,
>
> Thanks very much for the suggestions, particularly from AC Nutch. You were
> correct in both cases: my regular expressions were unescaped in places, and
> there was a catch-all include at the top of the file. This was the crucial
> mistake - I hadn't got it straight in my head that the processing in this
> file stops once the first match is made, so anything beyond that catch-all
> was never being evaluated anyway!
>
> The indexing is still a little problematic, but I'm a lot further forward
> now.
>
> Thanks again for all of the suggestions.
>
>
> Ian.
> --
>
>
> On 31 Jul 2012, at 20:22, AC Nutch wrote:
>
> A couple of things I could think of are:
>
> (1) Make sure those regex excludes aren't below a "catch-all" include. If
> you had "+." right above those for example in the regex-urlfilter file, it
> is my understanding that Nutch will index them.
>
> (2) I know everyone keeps saying this but make sure the regexes are
> correct. One thing I noticed is that your dots are not escaped. I would try
> making it more general and narrow it down, or use an online regex
> validation tool. If you're feeling lazy try the following:
>
> -^http://.*\.elaweb\.org\.uk/resources/type\..*
> -^http://.*\.elaweb\.org\.uk/resources/topic\..*
>
> It's a little more general and easier to not screw up ;-) If that's not
> acceptable for your purposes let us know I'm sure someone could help with
> the specific regexes.
>
>
>
> On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper <ianpi...@tellura.co.uk> wrote:
>
>> Hi all,
>>
>> I have been trying to get to the bottom of this problem for ages and
>> cannot resolve it - you're my last hope, Obi-Wan...
>>
>> I have a job that crawls over a client's site. I want to exclude urls
>> that look like this:
>>
>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>>
>> and
>>
>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>>
>>
>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>>
>> [...]
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
>> [...]
>>
>> Yet when I next run the crawl I see things like this:
>>
>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>> [...]
>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>> [...]
>>
>> and the corresponding pages seem to appear in the final Solr index. So
>> clearly they are not being excluded.
>>
>> Is anyone able to explain what I have missed? Any guidance much
>> appreciated.
>>
>> Thanks,
>>
>>
>> Ian.
>> --
>> Dr Ian Piper
>> Tellura Information Services - the web, document and information people
>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>> http://www.tellura.co.uk/
>> Creator of monickr: http://monickr.com
>> 01926 813736 | 07973 156616
>> --
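
For anyone finding this thread later, the two points above (rule order and
unescaped dots) can be sketched in a few lines of Python. This is only a rough
model of how the thread describes regex-urlfilter.txt behaving, not Nutch's own
code: it assumes rules are tried top to bottom, the first pattern found in the
URL decides the outcome, and a URL matching no rule is rejected. The names
parse_rules and accepts, and the about.aspx URL, are made up for illustration.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; later rules are never consulted.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False

# Catch-all include ABOVE the excludes: the excludes are never reached.
CATCH_ALL_FIRST = parse_rules(r"""
+.
-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*
""")

# Excludes first, catch-all include last: the excludes take effect.
EXCLUDES_FIRST = parse_rules(r"""
-^http://.*\.elaweb\.org\.uk/resources/type\..*
-^http://.*\.elaweb\.org\.uk/resources/topic\..*
+.
""")

topic_url = "http://www.elaweb.org.uk/resources/topic.aspx?topic=10"
other_url = "http://www.elaweb.org.uk/about.aspx"  # hypothetical page

fetched_with_catch_all_first = accepts(CATCH_ALL_FIRST, topic_url)
fetched_with_excludes_first = accepts(EXCLUDES_FIRST, topic_url)
other_still_fetched = accepts(EXCLUDES_FIRST, other_url)

# An unescaped dot matches any character, so the pattern is looser than intended.
loose_matches_typo = re.search(r"resources/type.aspx", "resources/typeXaspx") is not None
strict_matches_typo = re.search(r"resources/type\.aspx", "resources/typeXaspx") is not None
```

With the catch-all on top, the topic.aspx URL is accepted because "+." matches
first; with the excludes on top it is rejected, while other pages still fall
through to "+.". Note that the unescaped dots alone would not have let those
URLs through (a bare "." still matches a literal dot); they just make the
pattern match more than intended.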