apache irritations

2002-04-22 Thread Jamie Zawinski

I know this would be somewhat evil, but can we have a special case in
wget to assume that files named "?N=D" and "index.html?N=D" are the same
as "index.html"?  I'm tired of those dumb apache sorting directives
showing up in my mirrors as if they were real files...

-- 
Jamie Zawinski
[EMAIL PROTECTED] http://www.jwz.org/
[EMAIL PROTECTED]   http://www.dnalounge.com/



Re: apache irritations

2002-04-22 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002, Jamie Zawinski wrote:

> I know this would be somewhat evil, but can we have a special case in
> wget to assume that files named "?N=D" and "index.html?N=D" are the same
> as "index.html"?  I'm tired of those dumb apache sorting directives
> showing up in my mirrors as if they were real files...

 How about using the "-R" option of wget?  A brief test proves "-R
'*\?[A-Z]=[A-Z]'" works as it should. 

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--+
+e-mail: [EMAIL PROTECTED], PGP key available+




Re: apache irritations

2002-04-22 Thread Hrvoje Niksic

"Maciej W. Rozycki" <[EMAIL PROTECTED]> writes:

> On Mon, 22 Apr 2002, Jamie Zawinski wrote:
>
>> I know this would be somewhat evil, but can we have a special case in
>> wget to assume that files named "?N=D" and "index.html?N=D" are the same
>> as "index.html"?  I'm tired of those dumb apache sorting directives
>> showing up in my mirrors as if they were real files...
>
>  How about using the "-R" option of wget?  A brief test proves "-R
> '*\?[A-Z]=[A-Z]'" works as it should.

Or maybe the default system wgetrc should ship with something like:

reject = *?[A-Z]=[A-Z]

Adding new reject patterns will correctly append to this.  If the user
wanted to nullify that in his `.wgetrc', he'd need to set `reject' to
empty string.



Re: apache irritations

2002-04-22 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002, Hrvoje Niksic wrote:

> >  How about using the "-R" option of wget?  A brief test proves "-R
> > '*\?[A-Z]=[A-Z]'" works as it should.
> 
> Or maybe the default system wgetrc should ship with something like:
> 
> reject = *?[A-Z]=[A-Z]

 Note the difference between strings! -- the backslash before the
quotation mark is essential as otherwise it's a glob character. 
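
The effect of escaping can be checked outside wget with a short Python sketch. It assumes Python's `fnmatch` module as a stand-in for wget's own matcher, which it is not. Because `fnmatch` has no backslash escaping, the literal question mark is spelled `[?]` here rather than `\?`. The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Unescaped '?' is a glob wildcard: it matches any single character.
LOOSE = "*?[A-Z]=[A-Z]"
# '[?]' matches only a literal question mark (Python's spelling of '\?').
STRICT = "*[?][A-Z]=[A-Z]"

# Hypothetical file names as they might appear in a mirror.
names = ["index.html?N=D", "?M=A", "notesXA=B", "index.html"]

for name in names:
    loose = fnmatchcase(name, LOOSE)
    strict = fnmatchcase(name, STRICT)
    print(f"{name!r}: loose={loose} strict={strict}")
```

Only the loose pattern matches `notesXA=B`, a name with no question mark in it at all.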

> Adding new reject patterns will correctly append to this.  If the user
> wanted to nullify that in his `.wgetrc', he'd need to set `reject' to
> empty string.

 Well, I don't think it's sane but adding a *commented-out* reject line
with an appropriate annotation to the default system wgetrc looks like a
good idea to me.





Re: apache irritations

2002-04-22 Thread csaba . raduly


On 22/04/2002 16:38:15 "Maciej W. Rozycki" wrote:

>On Mon, 22 Apr 2002, Hrvoje Niksic wrote:
>
>> >  How about using the "-R" option of wget?  A brief test proves "-R
>> > '*\?[A-Z]=[A-Z]'" works as it should.
>>
>> Or maybe the default system wgetrc should ship with something like:
>>
>> reject = *?[A-Z]=[A-Z]
>
>Note the difference between strings! -- the backslash before the
>quotation mark is essential as otherwise it's a glob character.
>

[A-Z] is a bit extreme, IMHO. How about

reject = *\?[NMSD]=[AD]
          ^^ literal '?' needed here


>
>Well, I don't think it's sane but adding a *commented-out* reject line
>with an appropriate annotation to the default system wgetrc looks like a
>good idea to me.
>

A good idea.

--
Csaba Ráduly, Software Engineer   Sophos Anti-Virus
email: [EMAIL PROTECTED]    http://www.sophos.com
US Support: +1 888 SOPHOS 9 UK Support: +44 1235 559933




Re: apache irritations

2002-04-22 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002 [EMAIL PROTECTED] wrote:

> >> reject = *?[A-Z]=[A-Z]
> >
> >Note the difference between strings! -- the backslash before the
> >quotation mark is essential as otherwise it's a glob character.
> 
> [A-Z] is a bit extreme, IMHO. How about
> 
> reject = *\?[NMSD]=[AD]

 Hmm, it's too fragile in my opinion.  What if a new version of Apache
defines a new format? 

>   ^^ literal '?' needed here

 Exactly -- I meant the question mark above, of course.





Re: apache irritations

2002-04-22 Thread Tony Lewis

Maciej W. Rozycki wrote:

>  Hmm, it's too fragile in my opinion.  What if a new version of Apache
> defines a new format?

I think all of the expressions proposed thus far are too fragile. Consider
the following URL:

http://www.google.com/search?num=100&q=%2Bwget+-GNU

The regular expression needs to account for multiple arguments separated by
ampersands. It also needs to account for any valid URI character between an
equal sign and either end of string or an ampersand.

I'm not fluent enough in regular expressions to compose one myself. (Some
day I'll absorb all of Friedl's "Mastering Regular Expressions", but not
today.)

Tony




Re: apache irritations

2002-04-22 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002, Tony Lewis wrote:

> I think all of the expressions proposed thus far are too fragile. Consider
> the following URL:
> 
> http://www.google.com/search?num=100&q=%2Bwget+-GNU
> 
> The regular expression needs to account for multiple arguments separated by
> ampersands. It also needs to account for any valid URI character between an
> equal sign and either end of string or an ampersand.

 I'm not sure what you are referring to.  We are discussing a common
problem with "static" pages generated by default by Apache as "index.html" 
objects for server's filesystem directories providing no default page. 
Any dynamic content should probably be protected by "robots.txt" and
otherwise dealt with by the user specifically, depending on the content.

 BTW, wget's accept/reject rules are not regular expressions but simple
shell globbing patterns. 





Re: apache irritations

2002-04-22 Thread Hrvoje Niksic

"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Maciej W. Rozycki wrote:
>
>>  Hmm, it's too fragile in my opinion.  What if a new version of Apache
>> defines a new format?
>
> I think all of the expressions proposed thus far are too fragile. Consider
> the following URL:
>
> http://www.google.com/search?num=100&q=%2Bwget+-GNU

That URL will not match the proposed pattern.

As Maciej said, Wget's "reject" feature implements shell-style
patterns that are much simpler than regexps.  Also, they always match
the entire string by default.
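
A small Python sketch of that point follows. It assumes Python's `fnmatch` as a stand-in for wget's matcher, `[?]` as the spelling of the escaped question mark, and that the pattern is compared against the last path component of the URL. The helper name `is_rejected` is hypothetical. Since the whole name must match, the pattern only fits names that end in `?`, one capital letter, `=`, and one capital letter.

```python
from fnmatch import fnmatchcase

# Escaped-question-mark pattern from the thread, in Python's glob spelling.
PATTERN = "*[?][A-Z]=[A-Z]"


def is_rejected(name: str) -> bool:
    """Return True if the whole name matches the reject pattern."""
    return fnmatchcase(name, PATTERN)


# Last path component of the Google URL quoted above: no match.
print(is_rejected("search?num=100&q=%2Bwget+-GNU"))  # False
# Apache-style sorting link: match.
print(is_rejected("index.html?N=D"))  # True
```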



Re: apache irritations

2002-04-22 Thread Tony Lewis

Maciej W. Rozycki wrote:

>  I'm not sure what you are referring to.  We are discussing a common
> problem with "static" pages generated by default by Apache as "index.html"
> objects for server's filesystem directories providing no default page.

Really? The original posting from Jamie Zawinski said:

> I know this would be somewhat evil, but can we have a special case in
> wget to assume that files named "?N=D" and "index.html?N=D" are the same
> as "index.html"?  I'm tired of those dumb apache sorting directives
> showing up in my mirrors as if they were real files...

I understood the question to be about URLs containing query strings (which
Jamie called sorting directives) showing up as separate files. I thought the
discussion was related to that topic. Maybe it diverged from that later in
the chain and I missed the change of topic.

I think what Jamie wants is one copy of index.html no matter how many links
of the form index.html?N=D appear.

>  BTW, wget's accept/reject rules are not regular expressions but simple
> shell globbing patterns.

OK.

Tony




Re: apache irritations

2002-04-23 Thread Maciej W. Rozycki

On Mon, 22 Apr 2002, Tony Lewis wrote:

> >  I'm not sure what you are referring to.  We are discussing a common
> > problem with "static" pages generated by default by Apache as "index.html"
> > objects for server's filesystem directories providing no default page.
> 
> Really? The original posting from Jamie Zawinski said:
> 
> > I know this would be somewhat evil, but can we have a special case in
> > wget to assume that files named "?N=D" and "index.html?N=D" are the same
> > as "index.html"?  I'm tired of those dumb apache sorting directives
> > showing up in my mirrors as if they were real files...
> 
> I understood the question to be about URLs containing query strings (which
> Jamie called sorting directives) showing up as separate files. I thought the
> discussion was related to that topic. Maybe it diverged from that later in
> the chain and I missed the change of topic.

 These sorting directives are specific to Apache when it builds a
replacement "index.html" file for the server's file system directories
containing no default page (assuming neither such building nor the
directives are disabled).  They always have the form of
"?<letter>=<letter>" appended to the base URL of a directory.  See e.g.
"http://www.kernel.org/pub/linux/" and its subdirectories for what it
looks like.

> I think what Jamie wants is one copy of index.html no matter how many links
> of the form index.html?N=D appear.

 So do I and my shell pattern will work as expected.
