Hi
Some files aren't being indexed because the StringMatch library is not
working properly (I mean, I guess the findFirst is not working properly :).
Here are some simple facts:
1- Here's the portion of my log when the file WASN'T being indexed:
[...]
image: http://www.nrc.ca/corporate/english/images/navw-nrc.gif
Tag: /A>, matched 3
href: http://www.nrc.ca/corporate/english/tools/institut.html ()
Not added because: Item in the exclude list! (In fact the 38th
value, 2 large)*
url rejected: (level
1)http://www.nrc.ca/corporate/english/tools/institut.html
Tag: IMG SRC="images/navw-bar.gif" WIDTH="2" HEIGHT="20"
ALIGN="BOTTOM" BORDER="0">, matched 18
[...]
* Geoff, I modified Retriever::IsValidUrl() so that when a URL is
rejected as invalid, it prints out the reason. Could that go into the new
version, since it only takes a few minutes to add? It is REALLY useful when
you want to follow closely what the dig is doing (...when debugging).
But why is the length 2??? It should be 3, the length of "rct". I say
that because a link with "cgi-bin" (7 characters) gave me this message:
href: http://www.nrc.ca/cgi-bin/corporate/external.pl ()
Not added because: Item in the exclude list! (In fact the 0th value,
7 large)
Anyway, that told me it was the 38th string (counting from 0) in my
exclude list that was matching the URL ("rct").
2- I had this exclude list:
exclude_urls: cgi-bin .cgi cwis ctn.nrc.ca rct.nrc.ca irap /catalog_3d/ IRIX
irix hrb /infocisti/ cwis /test/ /temp/ /temp1/ /temp2/ /tempdir/ /zone/
ccbfc ptcbs cccme icsti /arctic /aic-journal acst /conferences/ ccsg /ctn
/dtf-gtn/ fptt /programs/ /nzdl/ /wusage/ /w-usage/ /catalog_int_ascii/
harvest gatherer broker rct stats /backup/ /testdir/ /confserv/ fox lynx
... all on a single line.
As we can see, "rct" is not located in my url:
www.nrc.ca/corporate/english/tools/institut.html
So, I removed that string ("rct") from my exclude list and the file was
indexed (WEIRD!).
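
For comparison, here is a minimal sketch of the result I would expect from a plain substring test. This is NOT the StringMatch code: splitPatterns, firstExcludeHit and ExcludeHit are names I made up for illustration, and it returns the first pattern in list order, whereas findFirst presumably reports the first match by position in the string. It only shows which (index, length) pair a naive check would give.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Result of a plain substring check against an exclude list.
// (Hypothetical type, for illustration only.)
struct ExcludeHit {
    int index;           // position of the matching pattern in the list, -1 if none
    std::size_t length;  // length of the matching pattern, 0 if none
};

// Split a whitespace-separated attribute value (such as exclude_urls)
// into its individual patterns.
std::vector<std::string> splitPatterns(const std::string &value) {
    std::istringstream in(value);
    std::vector<std::string> patterns;
    std::string word;
    while (in >> word) {
        patterns.push_back(word);
    }
    return patterns;
}

// Hypothetical reference check, not ht://Dig code: return the first
// pattern, in list order, that occurs anywhere in the URL.
ExcludeHit firstExcludeHit(const std::string &url,
                           const std::vector<std::string> &patterns) {
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (url.find(patterns[i]) != std::string::npos) {
            return {static_cast<int>(i), patterns[i].size()};
        }
    }
    return {-1, 0};  // nothing in the exclude list occurs in the URL
}
```

With the exclude list above, such a check should report index 0 and length 7 for the cgi-bin URL, and no match at all for institut.html, since "rct" never occurs in it. Any hit reported on "rct" should have length 3, not 2.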
I tried to look at the findFirst function, but I'm not a big fan of binary
and masking stuff; I don't find it really intuitive ;). Anyway, too bad we
can't use the "=~" operator from Perl, it would have taken 5 lines instead
of 40 ;))).
Anyway, did I miss something? I would really like to know the root cause
of all this.
P.S.: Btw, the exclude_urls list is loaded properly into the StringMatch
library, so the problem is obviously in the step where the string is
compared against the patterns =)