Toby,

I was trying to find some possible reason why htdig 3.1.6, with .jpg in the
bad_extensions: list, indexed pages like
http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg
for me but not for you. Gilles Detillieux has suggested that my source code
might have been patched (any news on that, Gilles?). Meanwhile, having
eliminated several other configuration file statements, I was checking out a
wild guess on my part that the valid_extensions: statement might be relevant.
It seems not.

The swfparser (also called swfdumper) available from the htdig web site does
not output any text from within SWF files, so "picking up spurious and/or
useless text" is not possible with that. Those who think such text might
actually be worth indexing can use the jGenerator Java application. I would be
interested to hear if anyone knows of another utility for parsing Flash files.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Toby Thain" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, March 27, 2004 8:22 AM
Subject: Re: Fwd: [htdig] query parameters should be ignored by extension filter?

> > From: "David Adams" <[EMAIL PROTECTED]>
> > To: "Toby Thain" <[EMAIL PROTECTED]>,
> > <[EMAIL PROTECTED]>
> > Subject: Re: Fwd: [htdig] query parameters should be ignored by
> > extension filter?
> > Date: Fri, 26 Mar 2004 10:10:01 -0000
> >
> > ----- Original Message -----
> > From: "Toby Thain" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Thursday, March 25, 2004 10:49 PM
> > Subject: Re: Fwd: [htdig] query parameters should be ignored by
> > extension filter?
> >
> >> David Adams wrote:
> >>
> >>> Toby,
> >>>
> >>> Did you have a valid_extensions: statement originally? If you did,
> >>> then it might be worthwhile trying without it,
> >>
> >> No, I had no valid_extensions originally. The query URLs were ignored
> >> regardless.
> >>
> >>> as then all extensions not listed in
> >>> your bad_extensions: will be valid.
> >>
> >> I don't think you're correct in the above: the doc says, "This is a
> >> list of extensions on URLs which are the only ones considered
> >> acceptable."
> >>
> >
> > The 3.1.6 documentation says: "If the list is empty, then all
> > extensions are acceptable, provided they pass other criteria for
> > acceptance or rejection. If the list is not empty, only documents with
> > one of the extensions in the list are parsed."
>
> That refers to an empty directive. I was using ".html .php3", so
> bad_extensions would still have been rejected.
>
> >>
> >
> > Good point, but beware of the trap of setting
> > http://www.foo.bar/index.htm
> > as your starting point.
>
> I didn't :-)
>
> >>>
> >>> I have a list of 108 bad extensions if anyone is interested, but I
> >>> make no claims that it is anywhere near complete.
> >>
> >> I needed to add .ico & .swf because they were actually used on the
> >> site. I don't need a catch-all list as we are responsible for all
> >> site content.
> >
> > No list is going to make everybody happy. I would not want .swf as a
> > bad extension because we parse .swf files for links.
>
> Would this not risk the parser picking up spurious and/or useless text
> inside the swf?
>
> Toby
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> ht://Dig general mailing list: <[EMAIL PROTECTED]>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
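
For readers following the thread, the question in the subject line comes down to which part of the URL the extension filter looks at. The sketch below is only an illustration in Python and is not htdig's code: the function names and the BAD_EXTENSIONS tuple are invented for the example, and nothing here claims to show which behaviour any particular htdig release implements.

```python
from urllib.parse import urlsplit
import posixpath

# Hypothetical list for illustration; it mirrors the extensions named in the thread.
BAD_EXTENSIONS = (".jpg", ".ico", ".swf")


def rejected_by_full_url(url, bad_extensions=BAD_EXTENSIONS):
    """Reject when the whole URL string, query included, ends with a bad extension."""
    return url.lower().endswith(tuple(bad_extensions))


def rejected_by_path_only(url, bad_extensions=BAD_EXTENSIONS):
    """Reject on the extension of the path alone; the query string is ignored."""
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in bad_extensions


url = "http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg"
print(rejected_by_full_url(url))   # True: the URL string ends in ".jpg"
print(rejected_by_path_only(url))  # False: the path ends in ".php"
```

A filter of the first kind would skip the gallery.php URL above, while one of the second kind would index it, which is one possible explanation for two installations treating the same URL differently.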

