----- Original Message -----
From: "Toby Thain" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, March 25, 2004 10:49 PM
Subject: Re: Fwd: [htdig] query parameters should be ignored by extension filter?

> David Adams wrote:
>
> > Toby,
> >
> > Did you have a valid_extensions: statement originally? If you did, then it
> > might be worthwhile trying without it,
>
> No, I had no valid_extensions originally. The query URLs were ignored
> regardless.
>
> > as then all extensions not listed in
> > your bad_extensions: will be valid.
>
> I don't think you're correct in the above: the doc says, "This is a list
> of extensions on URLs which are the only ones considered acceptable."

The 3.1.6 documentation says: "If the list is empty, then all extensions are
acceptable, provided they pass other criteria for acceptance or rejection.
If the list is not empty, only documents with one of the extensions in the
list are parsed."

> > Do you really have no limit_urls_to: statement? That doesn't strike me as a
> > good idea.
>
> It's not needed, because "the value of start_url will be the default
> value for limit_urls_to," (see doc).

Good point, but beware of the trap of setting http://www.foo.bar/index.htm
as your starting point.

> > I have a list of 108 bad extensions if anyone is interested, but I make no
> > claims that it is anywhere near complete.
>
> I needed to add .ico & .swf because they were actually used on the site.
> I don't need a catch-all list as we are responsible for all site content.

No list is going to make everybody happy. I would not want .swf as a bad
extension because we parse .swf files for links.

> Toby
>
> > ----- Original Message -----
> > From: "Toby Thain" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Thursday, March 25, 2004 12:49 AM
> > Subject: Re: Fwd: [htdig] query parameters should be ignored by extension
> > filter?
> >
> >>Toby Thain wrote:
> >>
> >>>Begin forwarded message:
> >>>
> >>> From: "David Adams" <[EMAIL PROTECTED]>
> >>> Date: 24 March 2004 9:18:17 PM
> >>> To: <[EMAIL PROTECTED]>, "Toby Thain" <[EMAIL PROTECTED]>
> >>> Subject: Re: [htdig] query parameters should be ignored by extension filter?
> >>>
> >>> I am also using ht://Dig version 3.1.6 and for me it IS indexing
> >>> URLs like
> >>>
> >>> http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg
> >>>
> >>> even though I have .jpg in my bad_extensions: list.
> >>>
> >>> I suggest that you take a hard look at your configuration file and check
> >>> that one of:
> >>>
> >>> exclude_urls:
> >>> limit_urls_to:
> >>> bad_querystr:
> >>> url_rewrite_rules:
> >>>
> >>> isn't excluding them.
> >>
> >>David,
> >>
> >>Thanks for your suggestions.
> >>
> >>I am not using any of those directives; the .conf is vanilla except for
> >>customising the search results wrapper.
> >>
> >>I did need to add .swf and .ico to the bad extensions list. IMHO these
> >>should really be in there by default (may be fixed in later version?)
> >>
> >>Adding "valid_extensions: .php3 .html" did not help either; the URLs are
> >>still not being indexed. Even adding a fake "&q" to the end of the URL
> >>doesn't stop htdig rejecting it - a sample rejection from rundig -vvv:
> >>
> >>-----
> >>href: http://stegbar.intranet/php/photo.php3?f=s_rc_pl_wd_t_aw_1.jpg&q
> >>(Thumbnail: windows_and_doors
> >> Enlarge)
> >>
> >> Rejected: Extension is not valid!
> >>-----
> >>
> >>Toby
> >>
> >>> Personally, I don't need those ~lopsoc/...jpg files and will be adding them
> >>> to exclude_urls: if they publish many more of them!
> >>>
> >>> David Adams
> >>> Corporate Information Services
> >>> Information Systems Services
> >>> University of Southampton
> >>>
> >>> ----- Original Message -----
> >>> From: "Toby Thain" <[EMAIL PROTECTED]>
> >>> To: <[EMAIL PROTECTED]>
> >>> Sent: Wednesday, March 24, 2004 9:58 AM
> >>> Subject: [htdig] query parameters should be ignored by extension filter?
> >>>
> >>> List,
> >>>
> >>> I noticed today that htdig is not indexing URLs like:
> >>>
> >>> /foo/page.php3?f=bar.jpg
> >>>
> >>> because it notices the URL ends with ".jpg". I am surprised that it's
> >>> not smart enough to realise that the fetched object is actually a
> >>> ".php3", and I definitely want that URL followed.
> >>>
> >>> Is this fixed in a recent version (I am using ht://Dig 3.1.6)? Or is
> >>> there a simple configuration fix?
> >>>
> >>> Toby
> >>>
> >>> -------------------------------------------------------
> >>> This SF.Net email is sponsored by: IBM Linux Tutorials
> >>> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> >>> GenToo technologies. Learn everything from fundamentals to system
> >>> administration. http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> >>> _______________________________________________
> >>> ht://Dig general mailing list: <[EMAIL PROTECTED]>
> >>> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> >>> List information (subscribe/unsubscribe, etc.)
> >>> https://lists.sourceforge.net/lists/listinfo/htdig-general
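
For anyone following the thread, here is a minimal sketch in Python of the
behaviour Toby was asking for: take the extension from the path part of the
URL only, so that a query string such as "?f=bar.jpg" plays no part in the
check. This is not ht://Dig's code, and the function names extension_of and
is_acceptable are made up for illustration. The handling of the two lists
follows the 3.1.6 documentation quoted above (an empty valid_extensions list
accepts everything not in bad_extensions); how URLs with no extension at all
are treated here is a simplification.

```python
from urllib.parse import urlsplit
import posixpath


def extension_of(url):
    """Return the lowercase extension of the URL's path, ignoring any query string."""
    path = urlsplit(url).path          # "/php/photo.php3" for ".../photo.php3?f=x.jpg"
    return posixpath.splitext(path)[1].lower()


def is_acceptable(url, bad_extensions, valid_extensions):
    """Hypothetical filter: reject bad extensions, then apply the valid list if non-empty."""
    ext = extension_of(url)
    if ext in bad_extensions:
        return False
    # Simplification: with a non-empty valid list, a URL with no extension is rejected too.
    if valid_extensions and ext not in valid_extensions:
        return False
    return True


if __name__ == "__main__":
    url = "http://stegbar.intranet/php/photo.php3?f=s_rc_pl_wd_t_aw_1.jpg&q"
    print(extension_of(url))
    print(is_acceptable(url, {".jpg", ".ico", ".swf"}, {".php3", ".html"}))
```

With a check like this, the query URLs in the thread would be judged on
".php3" and ".php" and not on the ".jpg" in their parameters.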

