Toby,

I was trying to find some possible reason why htdig 3.1.6, with .jpg in the
bad_extensions: list, indexed pages like


http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg

for me but not for you.  Gilles Detillieux has suggested that my source code
might have been patched (any news on that, Gilles?).
Meanwhile, having eliminated several other configuration file statements, I
was checking a wild guess of mine that the valid_extensions: statement
might be relevant.  It seems not.
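
To make the question concrete, here is a rough Python sketch of the two ways
an extension filter could read that URL.  It is an illustration only and is
not taken from the htdig source: the function names and the ignore_query
switch are made up for this example, and treating a URL with no extension as
rejected when valid_extensions is non-empty is an assumption.

```python
import posixpath
from urllib.parse import urlsplit


def extension_of(url, ignore_query):
    """Return the lowercased extension the filter would see, or ''."""
    # Hypothetical switch: look at the path only, or at the whole URL string.
    target = urlsplit(url).path if ignore_query else url
    return posixpath.splitext(target)[1].lower()


def is_acceptable(url, bad_extensions, valid_extensions, ignore_query):
    """Apply bad_extensions first, then valid_extensions if it is non-empty."""
    ext = extension_of(url, ignore_query)
    if ext in bad_extensions:
        return False
    # An empty valid_extensions list means every extension is acceptable.
    # Assumption: with a non-empty list, a URL with no extension is rejected.
    if valid_extensions and ext not in valid_extensions:
        return False
    return True


url = ("http://www.soton.ac.uk/~lopsoc/gallery.php"
       "?gallery=sorcerer1&photo=CNV00023.jpg")

# Whole URL: the trailing ".jpg" in the query string is what gets matched.
print(extension_of(url, ignore_query=False))
# Path only: the filter sees ".php" and the query string plays no part.
print(extension_of(url, ignore_query=True))
print(is_acceptable(url, {".jpg"}, set(), ignore_query=False))
print(is_acceptable(url, {".jpg"}, set(), ignore_query=True))
```

If one build matches against the whole URL and another against the path only,
the same bad_extensions: list would reject the gallery page in the first case
and index it in the second, which is the difference we are seeing.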

The swfparser (also called swfdumper) available from the htdig web site does
not output any text from within SWF files, so "picking up spurious and/or
useless text" is not possible with it.  Those who think such text might
actually be worth indexing can use the jGenerator Java application.  I would
be interested to hear of any other utility for parsing Flash files.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message ----- 
From: "Toby Thain" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, March 27, 2004 8:22 AM
Subject: Re: Fwd: [htdig] query parameters should be ignored by extension
filter?


> >
> > From: "David Adams" <[EMAIL PROTECTED]>
> > To: "Toby Thain" <[EMAIL PROTECTED]>,
> > <[EMAIL PROTECTED]>
> > Subject: Re: Fwd: [htdig] query parameters should be ignored by
> > extension filter?
> > Date: Fri, 26 Mar 2004 10:10:01 -0000
> >
> >
> > ----- Original Message -----
> > From: "Toby Thain" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Thursday, March 25, 2004 10:49 PM
> > Subject: Re: Fwd: [htdig] query parameters should be ignored by
> > extension
> > filter?
> >
> >
> >> David Adams wrote:
> >>
> >>> Toby,
> >>>
> >>> Did you have a valid_extensions: statement originally?  If you did,
> >>> then it might be worthwhile trying without it,
> >>
> >> No, I had no valid_extensions originally. The query URLs were ignored
> >> regardless.
> >>
> >>> as then all extensions not listed in
> >>> your bad_extensions: will be valid.
> >>
> >> I don't think you're correct in the above: the doc says, "This is a
> >> list of extensions on URLs which are the only ones considered
> >> acceptable."
> >>
> >
> > The 3.1.6 documentation says: "If the list is empty, then all
> > extensions are acceptable, provided they pass other criteria for
> > acceptance or rejection.  If the list is not empty, only documents
> > with one of the extensions in the list are parsed."
>
> That refers to an empty directive. I was using ".html .php3", so
> bad_extensions would still have been rejected.
>
> >>
> >
> > Good point, but beware of the trap of setting
> > http://www.foo.bar/index.htm as your starting point.
>
> I didn't :-)
>
> >
> >>>
> >>> I have a list of 108 bad extensions if anyone is interested, but I
> >>> make no claims that it is anywhere near complete.
> >>
> >> I needed to add .ico & .swf because they were actually used on the
> >> site.  I don't need a catch-all list as we are responsible for all
> >> site content.
> >
> > No list is going to make everybody happy.  I would not want .swf as a
> > bad extension because we parse .swf files for links.
>
> Would this not risk the parser picking up spurious and/or useless text
> inside the swf?
>
> Toby
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> ht://Dig general mailing list: <[EMAIL PROTECTED]>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
>



