----- Original Message ----- 
From: "Toby Thain" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, March 25, 2004 10:49 PM
Subject: Re: Fwd: [htdig] query parameters should be ignored by extension
filter?


> David Adams wrote:
>
> > Toby,
> >
> > Did you have a valid_extensions: statement originally?  If you did, then
> > it might be worthwhile trying without it,
>
> No, I had no valid_extensions originally. The query URLs were ignored
> regardless.
>
> > as then all extensions not listed in
> > your bad_extensions: will be valid.
>
> I don't think you're correct in the above: the doc says, "This is a list
> of extensions on URLs which are the only ones considered acceptable."
>

The 3.1.6 documentation says: "If the list is empty, then all extensions are
acceptable, provided they pass other criteria for acceptance or rejection.
If the list is not empty, only documents with one of the extensions in the
list are parsed."
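
As a rough illustration of the rule quoted above, the following Python sketch
treats an empty valid_extensions list as "no restriction" and a non-empty one
as exclusive, with bad_extensions standing in for the "other criteria". It is
not ht://Dig's code: the function name, the order of the checks, and the
case-insensitive comparison are all assumptions made for the example.

```python
def extension_allowed(ext, valid_extensions=(), bad_extensions=()):
    """Return True if `ext` (e.g. ".jpg") passes both lists.

    Hypothetical helper for illustration; not part of ht://Dig.
    """
    ext = ext.lower()
    # Assumption: anything in bad_extensions is rejected outright.
    if ext in {e.lower() for e in bad_extensions}:
        return False
    # An empty valid_extensions list places no further restriction.
    if not valid_extensions:
        return True
    # A non-empty list is exclusive: only listed extensions pass.
    return ext in {e.lower() for e in valid_extensions}
```

Under these assumptions, extension_allowed(".php3", bad_extensions=[".jpg"])
is accepted, while extension_allowed(".pdf", valid_extensions=[".php3",
".html"]) is rejected even though ".pdf" is in no bad list.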


> >
> > Do you really have no limit_urls_to: statement?  That doesn't strike me
> > as a good idea.
>
> It's not needed, because "the value of start_url  will be the default
> value for limit_urls_to," (see doc).
>

Good point, but beware of the trap of setting http://www.foo.bar/index.htm
as your starting point: limit_urls_to then defaults to that full URL, so
only URLs containing it will be followed.
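
A small Python sketch of why that starting point is a trap. It assumes that a
limit pattern matches when it occurs anywhere in the URL, and that an unset
limit_urls_to falls back to start_url as the documentation quoted above says.
The function name and signature are hypothetical, not ht://Dig's.

```python
def within_limits(url, start_url, limit_urls_to=None):
    """Hypothetical check: does `url` match at least one limit pattern?"""
    # When limit_urls_to is unset, fall back to the value of start_url.
    patterns = (limit_urls_to or start_url).split()
    # Assumption: a pattern matches if it appears anywhere in the URL.
    return any(pattern in url for pattern in patterns)
```

With start_url set to http://www.foo.bar/index.htm, a page such as
http://www.foo.bar/about.htm does not contain the pattern and falls outside
the limit; with start_url set to http://www.foo.bar/ it is inside.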

> >
> > I have a list of 108 bad extensions if anyone is interested, but I make
> > no claims that it is anywhere near complete.
>
> I needed to add .ico & .swf because they were actually used on the site.
> I don't need a catch-all list as we are responsible for all site content.

No list is going to make everybody happy.  I would not want .swf  as a bad
extension because we parse .swf files for links.

>
> Toby
>
> >
> > ----- Original Message ----- 
> > From: "Toby Thain" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Thursday, March 25, 2004 12:49 AM
> > Subject: Re: Fwd: [htdig] query parameters should be ignored by
> > extension filter?
> >
> >
> >
> >>Toby Thain wrote:
> >>
> >>
> >>>
> >>>
> >>>Begin forwarded message:
> >>>
> >>>    *From: *"David Adams" <[EMAIL PROTECTED]>
> >>>    *Date: *24 March 2004 9:18:17 PM
> >>>    *To: *<[EMAIL PROTECTED]>, "Toby Thain"
> >>>    <[EMAIL PROTECTED]>
> >>>    *Subject: Re: [htdig] query parameters should be ignored by
> >>>    extension filter?
> >>>    *
> >>>    I am also using ht://Dig version 3.1.6 and for me it IS indexing
> >>>    URLs like
> >>>
> >>>    http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg
> >>>
> >>>    even though I have .jpg in my bad_extensions: list.
> >>>
> >>>    I suggest that you take a hard look at your configuration file and
> >>>    check
> >>>    that one of:
> >>>
> >>>    exclude_urls:
> >>>    limit_urls_to:
> >>>    bad_querystr:
> >>>    url_rewrite_rules:
> >>>
> >>>    isn't excluding them.
> >>
> >>David,
> >>
> >>Thanks for your suggestions.
> >>
> >>I am not using any of those directives; the .conf is vanilla except for
> >>customising the search results wrapper.
> >>
> >>I did need to add .swf and .ico to the bad extensions list. IMHO these
> >>should really be in there by default (may be fixed in later version?)
> >>
> >>Adding "valid_extensions: .php3 .html" did not help either; the URLs are
> >>still not being indexed. Even adding a fake "&q" to the end of the URL
> >>doesn't stop htdig rejecting it - a sample rejection from rundig -vvv:
> >>
> >>-----
> >>href: http://stegbar.intranet/php/photo.php3?f=s_rc_pl_wd_t_aw_1.jpg&q
> >>(Thumbnail: windows_and_doors
> >>  Enlarge)
> >>
> >>    Rejected: Extension is not valid!
> >>-----
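
One possible reading of the rejection above: a filter that takes the
extension from the tail of the whole URL sees the query string, not the
script name. The Python sketch below contrasts that with taking the extension
from the path component only. It does not claim to reproduce ht://Dig 3.1.6
internals (elsewhere in this thread a similar URL is reported as being
indexed), and both function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def ext_from_whole_url(url):
    """Text from the last dot to the end of the URL, query string included."""
    dot = url.rfind(".")
    return url[dot:] if dot != -1 else ""


def ext_from_path(url):
    """Extension of the path component only; query and fragment are ignored."""
    return posixpath.splitext(urlsplit(url).path)[1]
```

For /foo/page.php3?f=bar.jpg the first function yields ".jpg" and the second
".php3". For the sample URL ending in "&q", the first yields ".jpg&q", which
would match neither ".php3" nor ".html" in a valid_extensions list.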
> >>
> >>Toby
> >>
> >>
> >>>    Personally, I don't need those ~lopsoc/...jpg files and will be
> >>>    adding them
> >>>    to exclude_urls: if they publish many more of them!
> >>>
> >>>    David Adams
> >>>    Corporate Information Services
> >>>    Information Systems Services
> >>>    University of Southampton
> >>>
> >>>    ----- Original Message -----
> >>>    From: "Toby Thain" <[EMAIL PROTECTED]>
> >>>    To: <[EMAIL PROTECTED]>
> >>>    Sent: Wednesday, March 24, 2004 9:58 AM
> >>>    Subject: [htdig] query parameters should be ignored by extension
> >>>    filter?
> >>>
> >>>
> >>>        List,
> >>>
> >>>        I noticed today that htdig is not indexing URLs like:
> >>>
> >>>        /foo/page.php3?f=bar.jpg
> >>>
> >>>        because it notices the URL ends with ".jpg". I am surprised that
> >>>        it's not smart enough to realise that the fetched object is
> >>>        actually a ".php3", and I definitely want that URL followed.
> >>>
> >>>        Is this fixed in a recent version (I am using ht://Dig 3.1.6)?
> >>>        Or is
> >>>        there a simple configuration fix?
> >>>
> >>>        Toby
> >>>
> >>>
> >>>
> >>>        -------------------------------------------------------
> >>>        This SF.Net email is sponsored by: IBM Linux Tutorials
> >>>        Free Linux tutorial presented by Daniel Robbins, President and
> >>>        CEO of
> >>>        GenToo technologies. Learn everything from fundamentals to
> >>>        system administration.
> >>>        http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> >
> >>>        _______________________________________________
> >>>        ht://Dig general mailing list:
> >>>        <[EMAIL PROTECTED]>
> >>>        ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> >>>        List information (subscribe/unsubscribe, etc.)
> >>>        https://lists.sourceforge.net/lists/listinfo/htdig-general
> >>>
> >>>
