Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Tony Lewis [EMAIL PROTECTED] writes:


I don't think ,r complicates the command that much. Internally,
the only additional work for supporting both globs and regular
expressions is a function that converts a glob into a regexp when
,r is not requested.  That's a straightforward transformation.


,r makes it harder to input regexps, which are the whole point of
introducing --filter.  Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.


i agree 100%. and don't forget that globs are already supported by 
current filtering options.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Curtis Hatter wrote:

On Friday 31 March 2006 06:52, Mauro Tortonesi wrote:


while i like the idea of supporting modifiers like quick (short
circuit) and maybe i (case insensitive comparison), i think that (?i:)
and (?-i:) constructs would be overkill and rather hard to implement.


I figured that the (?i:) and (?-i:) constructs would be provided by the 
regular expression engine and that the --filter switch would simply be able 
to use any construct provided by that engine.


i know, that would be really nice.

If, as you said, this would be hard to implement or require extra effort by 
you that is above and beyond that required for the more standard constructs 
then I would say that they shouldn't be implemented; at least at first.


i agree.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Scott Scriven wrote:

* Mauro Tortonesi [EMAIL PROTECTED] wrote:


wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match www.yoyodyne.com, www--.yoyodyne.com,
www---.yoyodyne.com, and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com.

It would most likely also match www---zyoyodyneXcom.  


yes.


Perhaps you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.


no. i was talking about regexps. they are more expressive and powerful 
than simple globs. i don't see what's the point in supporting both.



If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[[:digit:]] or [0-9].


i agree, but adding a dependency on PCRE to wget is asking for 
infinite maintenance nightmares. and i don't know if we can simply 
bundle code from PCRE in wget, as it has a BSD license.



--filter=[+|-][file|path|domain]:REGEXP

is it consistent? is it flawed? is there a more convenient one?


It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a raw
type in addition to file, domain, etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.


do you mean you would like to have a regex class working on the content 
of downloaded files as well?



Below is the original message I sent to the wget list a few
months ago, about this same topic:

=
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any add to cart or buy links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436&amp;g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_returnName=album"
class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  allow is + for include or - for exclude.
  It defaults to + if omitted.

  flags, is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, i
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of /foo/i, it would be --filter=+i,:foo

  scope controls how much of the <a> or similar tag gets used
  as input to the regex.  Values include:
raw: use the entire tag and all contents (default)
     <a href="/path/to/foo.ext">bar</a>
domain: use only the domain name
 www.example.com
file: use only the file name
 foo.ext
path: use the directory, but not the file name
 /path/to
others...  can be added as desired

  : is required if allow or flags or scope is given

So, for example, to exclude the add to cart links in my
previous post, this could be used:

  --filter=-raw:'AddToCart|add to cart'
or
  --filter=-raw:AddToCart\|add\ to\ cart
or
  --filter=-:'AddToCart|add to cart'
or
  --filter=-i,raw:'add ?to ?cart'
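
To make the proposed grammar concrete, here is a minimal Perl sketch of
how a --filter argument of the form [allow][flags,][scope][:]pattern
might be parsed.  This is a hypothetical illustration only, not wget
code; the function name, the defaults, and the fixed scope list are all
assumptions:

#!/usr/bin/perl -w
use strict;

# Split "[allow][flags,][scope][:]pattern" into its parts.
sub parse_filter {
    my ($spec) = @_;
    my %f = (allow => '+', flags => '', scope => 'raw');
    $f{allow} = $1 if $spec =~ s/^([+-])//;                   # optional +/-
    $f{flags} = $1 if $spec =~ s/^([a-z]+),//;                # optional "flags,"
    $f{scope} = $1 if $spec =~ s/^(raw|domain|file|path)://;  # optional "scope:"
    $spec =~ s/^://;          # a bare ':' may still precede the pattern
    $f{pattern} = $spec;      # note: a pattern starting with "domain:" etc.
    return %f;                # would need quoting under this scheme
}

my %f = parse_filter('-i,raw:add ?to ?cart');
print "$f{allow} [$f{flags}] $f{scope} /$f{pattern}/\n";  # - [i] raw /add ?to ?cart/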

Alternately, the --filter option could be split into two options:
one for including content, and one for excluding.  This would be
more consistent with wget's existing parameters, and would
slightly simplify the syntax.

I hope I haven't been too full of hot air.  This is a feature I've
wanted in wget for a long time, and I'm a bit excited that it
might happen soon.  :)


i don't like your raw proposal as it is HTML-specific. i would like 
instead to develop a mechanism which could work for all supported 
protocols.


--

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 Scott Scriven wrote:
 * Mauro Tortonesi [EMAIL PROTECTED] wrote:

wget -r --filter=-domain:www-*.yoyodyne.com
 This appears to match www.yoyodyne.com, www--.yoyodyne.com,
 www---.yoyodyne.com, and so on, if interpreted as a regex.

 not really. it would not match www.yoyodyne.com.

Why not?

 Perhaps you want glob patterns instead?  I know I wouldn't mind
 having glob patterns in addition to regexes...  glob is much
 easier when you're not doing complex matches.

 no. i was talking about regexps. they are more expressive and
 powerful than simple globs. i don't see what's the point in
 supporting both.

I agree with this.


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Mauro Tortonesi [EMAIL PROTECTED] writes:



Scott Scriven wrote:


* Mauro Tortonesi [EMAIL PROTECTED] wrote:



wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match www.yoyodyne.com, www--.yoyodyne.com,
www---.yoyodyne.com, and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com. 


Why not?


i may be wrong, but if - is not a special character, the previous 
expression should match only domains starting with www- and ending in 
[randomchar]yoyodyne[randomchar]com.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Oliver Schulze L. wrote:

Hrvoje Niksic wrote:


 The regexp API's found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.


My personal idea on this is to: enable regex in Unix and disable it on 
Windows.


We all use Unix/Linux and regex is really useful. I think not having 
regex on Windows will not do any more harm than it is doing now (not 
having it at all)


for consistency and to avoid maintenance problems, i would like wget to 
have the same behavior on windows and unix. please, notice that if we 
implemented regex support only on unix, windows binaries of wget built 
with cygwin would have regex support but native binaries wouldn't. that 
would be very confusing for windows users, IMHO.


I hope wget can get connection cache, 


this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing connection cache data structure.


URL regex 


this is planned for wget 1.11. i've already started working on it.


and advanced mirror functions (sync 2 folders) in the near future.


this is very interesting.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Curtis Hatter wrote:

On Thursday 30 March 2006 13:42, Tony Lewis wrote:


Perhaps --filter=path,i:/path/to/krs would work.


That would look to be the most elegant method. I do hope that the (?i:) and 
(?-i:) constructs are supported since I may not want the entire path/file to 
be case (in)?sensitive =), but that will depend on the regex engine chosen.


while i like the idea of supporting modifiers like quick (short 
circuit) and maybe i (case insensitive comparison), i think that (?i:) 
and (?-i:) constructs would be overkill and rather hard to implement.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match www.yoyodyne.com, www--.yoyodyne.com,
www---.yoyodyne.com, and so on, if interpreted as a regex.

 not really. it would not match www.yoyodyne.com.
 Why not?

 i may be wrong, but if - is not a special character, the previous
 expression should match only domains starting with www- and ending
 in [randomchar]yoyodyne[randomchar]com.

* matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where * alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore www-* matches www, www-, www--, etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with
www-.*.yoyodyne.com.


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Herold Heiko [EMAIL PROTECTED] writes:


Get the best of both, use a syntax permitting a first-match-exits
ACL, where a single ACE permits several statements ANDed together.
Cooking up a simple syntax for users without much regexp experience
won't be easy.


I assume ACL stands for access control list, but what is ACE?


access control entry, i guess.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Wincent Colaiuta

El 31/03/2006, a las 14:37, Hrvoje Niksic escribió:


* matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where * alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore www-* matches www, www-, www--, etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with
www-.*.yoyodyne.com.


Are you sure that www-* matches www?

As far as I know www-* matches one w, another w, a third w, a
hyphen, then 0 or more hyphens. In other words, www does not match.


Wincent






Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Wincent Colaiuta [EMAIL PROTECTED] writes:

 Are you sure that www-* matches www?

Yes.

 As far as I know www-* matches one w, another w, a third w, a
 hyphen, then 0 or more hyphens.

That would be www--* or www-+.


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Wincent Colaiuta [EMAIL PROTECTED] writes:



Are you sure that www-* matches www?


Yes.


hrvoje is right. try this perl script:


#!/usr/bin/perl -w

use strict;

my @strings = ("www-.yoyodyne.com",
               "www.yoyodyne.com");

foreach my $str (@strings) {
    $str =~ /www-*.yoyodyne.com/ or print "$str doesn't match\n";
}


both strings match.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Oliver Schulze L.

Mauro Tortonesi wrote:
for consistency and to avoid maintenance problems, i would like wget 
to have the same behavior on windows and unix. please, notice that if 
we implemented regex support only on unix, windows binaries of wget 
built with cygwin would have regex support but native binaries 
wouldn't. that would be very confusing for windows users, IMHO.

Ok, I understand.
I was thinking of an #ifdef in the source code so you can:
- enable all regex code/command line parameters in Unix/Linux
- at runtime, print the error "regex not yet supported on windows" if
  any regex-related command line parameter is passed to wget on
  windows/cygwin

this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing connection cache data structure.

Excellent!

URL regex
this is planned for wget 1.11. i've already started working on it.

looking forward to it, many thanks!

--
Oliver Schulze L.
[EMAIL PROTECTED]



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: 

 no. i was talking about regexps. they are more expressive
 and powerful than simple globs. i don't see what's the
 point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases their
expressions will simply work, which will result in significant confusion
when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat it
as a regular expression. For the vast majority of cases, glob will work just
fine.

One might argue that it's a lot of work to implement regular expressions if
the default input format is a glob, but I think we should aim for both lack
of confusion and robust functionality. Using ,r means people get regular
expressions when they want them and know what they're doing. The universe of
wget users who know what they're doing are mostly subscribed to this
mailing list; the rest of them send us mail saying please CC me as I'm not
on the list. :-)

If we go this route, I'm wondering if the appropriate conversion from glob
to regular expression should take directory separators into account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match path/to/sub/dir? (I suspect it shouldn't.)
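
A rough Perl sketch of the kind of glob-to-regex conversion discussed
above (hypothetical, not wget code).  Here '*' deliberately stops at
'/', so the glob path/to/* does not match path/to/sub/dir:

sub glob_to_regex {
    my ($glob) = @_;
    my $re = '';
    for my $c (split //, $glob) {
        if    ($c eq '*') { $re .= '[^/]*' }   # '*': any run of non-separators
        elsif ($c eq '?') { $re .= '[^/]'  }   # '?': one non-separator char
        else              { $re .= quotemeta($c) }
    }
    return "^$re\$";
}

print glob_to_regex('path/to/*'), "\n";   # ^path\/to\/[^/]*$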

Tony



Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Mauro Tortonesi wrote: 

 no. i was talking about regexps. they are more expressive
 and powerful than simple globs. i don't see what's the
 point in supporting both.

 The problem is that users who are expecting globs will try things like
 --filter=-file:*.pdf

The --filter command will be documented from the start to support
regexps.  Since most Unix utilities work with regexps and very few
with globs (excepting the shell), this should not be a problem.

 It is pretty easy to programmatically convert a glob into a regular
 expression.

But it's harder to document and explain, and it requires more options
and logic.  Supporting two different syntaxes (the old one for
backward compatibility) is bad enough: supporting three is at least
one too many.

 One possibility is to make glob the default input and allow regular
 expressions. For example, the following could be equivalent:

 --filter:-domain:www-*.yoyodyne.com
 --filter:-domain,r:www-.*\.yoyodyne\.com

But that misses the point, which is that we *want* to make the more
expressive language, already used elsewhere on Unix, the default.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: 

 But that misses the point, which is that we *want* to make the
 more expressive language, already used elsewhere on Unix, the
 default.

I didn't miss the point at all. I'm trying to make a completely different
one, which is that regular expressions will confuse most users (even if you
tell them that the argument to --filter is a regular expression). This
mailing list will get a huge number of bug reports when users try to use
globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere. The
shell is the most obvious comparison for user input dealing with expressions
that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I just
don't think the same thing can be said for the typical wget user. We've
already had disagreements in this chain about what would match a particular
regular expression; I suspect everyone involved in the conversation could
have correctly predicted what the equivalent glob would do.

I don't think ,r complicates the command that much. Internally, the only
additional work for supporting both globs and regular expressions is a
function that converts a glob into a regexp when ,r is not requested.
That's a straightforward transformation.

Tony



Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 I didn't miss the point at all. I'm trying to make a completely different
 one, which is that regular expressions will confuse most users (even if you
 tell them that the argument to --filter is a regular expression).

Well, most users will probably not use --filter in the first place.
Those that do will have to look at the documentation where they'll
find that it accepts regexps.  Since Wget is hardly the first program
to use regexps, I don't see why most users would be confused by that
choice.

 Yes, regular expressions are used elsewhere on Unix, but not
 everywhere. The shell is the most obvious comparison for user input
 dealing with expressions that select multiple objects; the shell
 uses globs.

I don't see a clear line that connects --filter to glob patterns as
used by the shell.  If anything, the connection is with grep and other
commands that provide powerful filtering (awk and Perl's //
operators), which all seem to work on regexps.  Where the context can
be thought of shell-like (as in wget ftp://blah/*), Wget happily
obliges by providing shell-compatible patterns.

 I don't think ,r complicates the command that much. Internally,
 the only additional work for supporting both globs and regular
 expressions is a function that converts a glob into a regexp when
 ,r is not requested.  That's a straightforward transformation.

,r makes it harder to input regexps, which are the whole point of
introducing --filter.  Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't see a clear line that connects --filter to glob patterns as used
 by the shell.

I want to list all PDFs in the shell: ls -l *.pdf

I want a filter to keep all PDFs: --filter=+file:*.pdf

Note that *.pdf is not a valid regular expression even though it's what
most people will try naturally. Perl complains:
/*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.

Tony



Re: regex support RFC

2006-03-31 Thread Curtis Hatter
On Friday 31 March 2006 06:52, Mauro Tortonesi wrote:
 while i like the idea of supporting modifiers like quick (short
 circuit) and maybe i (case insensitive comparison), i think that (?i:)
 and (?-i:) constructs would be overkill and rather hard to implement.

I figured that the (?i:) and (?-i:) constructs would be provided by the 
regular expression engine and that the --filter switch would simply be able 
to use any construct provided by that engine.

I was more trying to argue for the use of a regex engine that supports such 
constructs (like Perl's). Some other constructs I find useful are the 
lookaround and grouping constructs: (?=), (?!), (?<=), (?<!), (?:)

These may be overkill but I would rather have the expressiveness of a regex 
engine like Perl's when I need it instead of writing regexes in another engine 
that have to be twice as long to compensate for the lack of language 
constructs. Those who don't want to use them, or don't know of them, can 
write regexes as normal.

If, as you said, this would be hard to implement or require extra effort by 
you that is above and beyond that required for the more standard constructs 
then I would say that they shouldn't be implemented; at least at first.

Curtis


Re: regex support RFC

2006-03-31 Thread Scott Scriven
* Mauro Tortonesi [EMAIL PROTECTED] wrote:
 I'm hoping for ... a raw type in addition to file,
 domain, etc.
 
 do you mean you would like to have a regex class working on the
 content of downloaded files as well?

Not exactly.  (details below)

 i don't like your raw proposal as it is HTML-specific. i
 would like instead to develop a mechanism which could work for
 all supported protocols.

I see.  It would be problematic for other protocols.  :(
A raw match would be more complicated than I originally thought,
because it is HTML-specific and uses extra data which isn't
currently available to the filters.

Would it be feasible to make raw simply return the full URI
when the document is not HTML?

I think there is some value in matching based on the entire link
tag, instead of just the URI.  Wget already has --follow-tags and
--ignore-tags, and a raw match would be like an extension to
that concept.  I would find it useful to be able to filter
according to things which are not part of the URI.  For example:

  follow: <a href="/a38bef9c" class="content">article</a>
  skip:   <a href="/cb31d512" class="advertisement">buy now</a>

Either the class property or the visible link text could be used
to decide if the link is worth following, but the URI in this
case is pretty useless.

It may need to be a different option; use --filter to filter
the URI list, and use --filter-tag earlier in the process (same
place as --follow-tags), to help generate the URI list.
Regardless, I think it would be useful.

Any thoughts?


-- Scott


RE: regex support RFC

2006-03-31 Thread Sandhu, Ranjit
I agree with Tony. I think most basic users, me included, thought
www-*.yoyodyne.com would not match www.yoyodyne.com.

Support globs as default, regexp as the more powerful option.

Ranjit Sandhu
SRA
 

-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 31, 2006 10:03 AM
To: wget@sunsite.dk
Subject: RE: regex support RFC

Mauro Tortonesi wrote: 

 no. i was talking about regexps. they are more expressive and powerful

 than simple globs. i don't see what's the point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases
their expressions will simply work, which will result in significant
confusion when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat
it as a regular expression. For the vast majority of cases, glob will
work just fine.

One might argue that it's a lot of work to implement regular expressions
if the default input format is a glob, but I think we should aim for
both lack of confusion and robust functionality. Using ,r means people
get regular expressions when they want them and know what they're doing.
The universe of wget users who know what they're doing are mostly
subscribed to this mailing list; the rest of them send us mail saying
please CC me as I'm not on the list. :-)

If we go this route, I'm wondering if the appropriate conversion from
glob to regular expression should take directory separators into
account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match path/to/sub/dir? (I suspect it shouldn't.)

Tony




RE: regex support RFC

2006-03-30 Thread Herold Heiko
 From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
 I don't think such a thing is necessary in practice, though; remember
 that even if you don't escape the dot, it still matches the (intended)
 dot, along with other characters.  So for quickdirty usage not
 escaping dots will just work, and those who want to be precise can
 escape them.

I agree. Just how often will there be problems in a single wget run due to
both some.domain.com and somedomain.com being present (famous last words...)

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko [EMAIL PROTECTED] writes:

 From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
 I don't think such a thing is necessary in practice, though; remember
 that even if you don't escape the dot, it still matches the (intended)
 dot, along with other characters.  So for quickdirty usage not
 escaping dots will just work, and those who want to be precise can
 escape them.

 I agree. Just how often will there be problems in a single wget run due to
 both some.domain.com and somedomain.com being present (famous last
 words...)

Actually it would have to be some<letter-or-number>domain.com -- a .
will not match the null string.  My point was that people who care
about that potential problem will carefully quote their dots, while
the rest of us will use the more convenient notation.


Re: regex support RFC

2006-03-30 Thread Jim Wright
On Thu, 30 Mar 2006, Mauro Tortonesi wrote:
 
  I do like the [file|path|domain]: approach.  very nice and flexible.
  (and would be a huge help to one specific need I have!)  I suggest also
  including an any option as a shortcut for putting the same pattern in
  all three options.
 
 do you think the any option would be really useful? if so, could you please
 give us an example?

Depends on how individual [file|path|domain]: entries are combined.
AND, OR?  Suppose you want files from some.dom.com://*/foo/*.png.
The part I'm thinking of here is foo as last directory component,
and png as filename extension.  Can the individual rules be combined
to express this?  I guess the real question is, how are rules combined.

Jim


Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Wednesday 29 March 2006 12:05, you wrote:
 we also have to reach consensus on the filtering algorithm. for
 instance, should we simply require that a url passes all the filtering
 rules to allow its download (just like the current -A/R behaviour), or
 should we instead adopt a short circuit algorithm that applies all rules
 in the same order in which they were given in the command line and
 immediately allows the download of an url if it passes the first allow
 match? should we also support apache-like deny-from-all and
 allow-from-all policies? and what would be the best syntax to trigger
 the usage of these policies?

I would recommend parsing the filters in the order given; that puts the onus 
on the user to optimize the filters and not you. Another way could possibly 
be to apply all filters by domain, then path, and finally file.

Regardless of how you ultimately decide to order the filters, would it be 
possible to allow for users to specify a short circuit? I'm thinking 
something similar to PF's (http://www.openbsd.org/faq/pf/filter.html#quick) 
quick keyword. Example usage of this would be something like:

Need to mirror a site that uses several domains:

--filter=+domain:example.(net|org|com)

Within that domain are several paths. One of those paths, which is four levels 
deep, I know I want everything from, regardless of its file name/type/etc.

--filter=+path,quick:([^/]+/){3}/thefiles

The quick keyword is used to skip all other filters, because I've told wget 
that I'm sure I want everything in that path if it matches.

Wget would first evaluate the domain, if it passes evaluate the path and if 
that passes then skip all other filters. Should it fail, wget continues to 
evaluate the rest of the filters.

Another example: I know I want nothing from any site other than example.com

--filter=-domain,quick:^(?!example.com)

That should ignore any domain that doesn't begin with example.com and skip all 
other rules because of the quick keyword. This would make processing more 
efficient, since other filters don't have to be evaluated.
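
A small Perl sketch of the short-circuit evaluation described above
(hypothetical, not wget code; the filter fields and the sample patterns
are assumptions).  Rules are tried in command-line order; a matching
rule marked "quick" decides immediately, and a matching deny rule
rejects:

#!/usr/bin/perl -w
use strict;

sub url_allowed {
    my ($url, @filters) = @_;          # each filter: { allow, quick, pattern }
    for my $f (@filters) {
        if ($url =~ $f->{pattern}) {
            return $f->{allow} if $f->{quick};  # quick: stop evaluating here
            return 0 unless $f->{allow};        # a matching deny rule rejects
        }
    }
    return 1;                          # passed every rule
}

my @filters = (
    { allow => 1, quick => 1, pattern => qr{([^/]+/){3}thefiles} },
    { allow => 0, quick => 0, pattern => qr{\.pdf$} },
);
# The quick rule matches first, so the .pdf deny rule is never consulted:
print url_allowed('http://example.com/a/b/c/thefiles/x.pdf', @filters), "\n";  # 1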

Curtis


RE: regex support RFC

2006-03-30 Thread Tony Lewis
How many keywords do we need to provide maximum flexibility on the
components of the URI? (I'm thinking we need five.)

Consider http://www.example.com/path/to/script.cgi?foo=bar

--filter=uri:regex could match against any part of the URI
--filter=domain:regex could match against www.example.com
--filter=path:regex could match against /path/to/script.cgi
--filter=file:regex could match against script.cgi
--filter=query:regex could match against foo=bar
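
For illustration, a Perl fragment (hypothetical, not wget code) showing
how one regex can carve the example URI into those five match targets:

my $uri = 'http://www.example.com/path/to/script.cgi?foo=bar';
if ($uri =~ m{^\w+://([^/?]+)(/[^?]*)?(?:\?(.*))?$}) {
    my ($domain, $path, $query) = ($1, $2 || '/', $3 || '');
    (my $file = $path) =~ s{.*/}{};    # strip everything up to the last '/'
    print "domain: $domain\n";         # www.example.com
    print "path:   $path\n";           # /path/to/script.cgi
    print "file:   $file\n";           # script.cgi
    print "query:  $query\n";          # foo=bar
}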

I think there are good arguments for and against matching against the file
name in path:

Tony



Re: regex support RFC

2006-03-30 Thread Scott Scriven
* Mauro Tortonesi [EMAIL PROTECTED] wrote:
 wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match www.yoyodyne.com, www--.yoyodyne.com,
www---.yoyodyne.com, and so on, if interpreted as a regex.
It would most likely also match www---zyoyodyneXcom.  Perhaps
you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.

If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[[:digit:]] or [0-9].

 --filter=[+|-][file|path|domain]:REGEXP
 
 is it consistent? is it flawed? is there a more convenient one?

It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a raw
type in addition to file, domain, etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.

 please notice that supporting multiple comma-separated regexp in a 
 single --filter option:
 
 --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...

Commas for multiple regexes are unnecessary.  Regexes already
have an or operator built in.  If you want to match fee or
fie or foe or fum, the pattern is fee|fie|foe|fum.

 we also have to reach consensus on the filtering algorithm. for
 instance, should we simply require that a url passes all the
 filtering rules to allow its download (just like the current
 -A/R behaviour), or should we instead adopt a short circuit
 algorithm that applies all rules in the same order in which
 they were given in the command line and immediately allows the
 download of an url if it passes the first allow match?

Regexes implicitly have or functionality built in, via the pipe
operator.  They also have and built in simply by extending the
pattern.  To require both foo and bar in a match, you could
do something like foo.*bar|bar.*foo.  So, it's not strictly
necessary to support more than one regex unless you specify both
an include pattern and an exclude pattern.

However, if multiple patterns are supported, I think it would be
more helpful to implement them as and rather than or.  This
is just because and doubles the length of the filter, so it may
be more convenient to say --filter=foo --filter=bar than
--filter='foo.*bar|bar.*foo'.
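
A two-line Perl check of that equivalence (assumed "and" semantics for
multiple filters; not wget code):

my $url = 'http://host/shop/cart/buy.html';
my $two = ($url =~ /shop/ && $url =~ /buy/);   # --filter=shop --filter=buy, ANDed
my $one = ($url =~ /shop.*buy|buy.*shop/);     # single combined pattern
print "equivalent here\n" if $two && $one;     # both forms accept this URL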


Below is the original message I sent to the wget list a few
months ago, about this same topic:

=
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any add to cart or buy links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436&amp;g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_returnName=album"
class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  allow is + for include or - for exclude.
  It defaults to + if omitted.

  flags, is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, i
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of /foo/i, it would be --filter=+i,:foo

  scope controls how much of the <a> or similar tag gets used
  as input to the regex.  Values include:
raw: use the entire tag and all contents (default)
     <a href="/path/to/foo.ext">bar</a>
domain: use only the domain name
 www.example.com
file: use only the file name
 foo.ext
path: use the directory, but not the file name
 /path/to
others...  can be added as desired

RE: regex support RFC

2006-03-30 Thread Tony Lewis
Curtis Hatter wrote:

 Also any way to add modifiers to the regexs? 

Perhaps --filter=path,i:/path/to/krs would work.

Tony



Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Thursday 30 March 2006 13:42, Tony Lewis wrote:
 Perhaps --filter=path,i:/path/to/krs would work.

That would look to be the most elegant method. I do hope that the (?i:) and 
(?-i:) constructs are supported since I may not want the entire path/file to 
be case (in)?sensitive =), but that will depend on the regex engine chosen.

Curtis


Re: regex support RFC

2006-03-30 Thread Oliver Schulze L.

Hrvoje Niksic wrote:

 The regexp API's found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.
My personal idea on this is to: enable regex in Unix and disable it on 
Windows.


We all use Unix/Linux and regex is really useful. I think not having 
regex on Windows will not do any more harm than it is doing now (not 
having it at all)


I hope wget can get connection cache, URL regex and advanced mirror functions
(sync 2 folders) in the near future.
That's all I still want from wget and still could not find in other 
OSS software.


Thanks
Oliver

--
Oliver Schulze L.
[EMAIL PROTECTED]



Re: regex support RFC

2006-03-30 Thread Scott Scriven
* [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
 
 soon leads to non wget related links being downloaded, eg. 
 http://www.gnu.org/graphics/agnuhead.html

In that particular case, I think --no-parent would solve the
problem.

Maybe I misunderstood, though.  It seems awfully risky to use -r
and -H without having something to strictly limit the links
followed.  So, I suppose the content filter would be an effective
way to make cross-host downloading safer.

I think I'd prefer to have a different option, for that sort of
thing -- filter by using external programs.  If the program
returns a specific code, follow the link or recurse into the
links contained in the file.  Then you could do far more complex
filtering, including things like interactive pruning.


-- Scott


Re: regex support RFC

2006-03-29 Thread Jim Wright
what definition of regexp would you be following?  or would this be
making up something new?  I'm not quite understanding the comment about
the comma and needing escaping for literal commas.  this is true for any
character in the regexp language, so why the special concern for comma?

I do like the [file|path|domain]: approach.  very nice and flexible.
(and would be a huge help to one specific need I have!)  I suggest also
including an any option as a shortcut for putting the same pattern in
all three options.

Jim



On Wed, 29 Mar 2006, Mauro Tortonesi wrote:

 
 hrvoje and i have been recently talking about adding regex support to wget. we
 were considering adding a new --filter option which, by supporting regular
 expressions, would allow more powerful ways of filtering urls to download.
 
 for instance the new option could allow the filtering of domain names, file
 names and url paths. in the following case --filter is used to prevent any
 download from the www-*.yoyodyne.com domain and to restrict download only to
 .gif files:
 
 wget -r --filter=-domain:www-*.yoyodyne.com --filter=+file:\.gif$
 http://yoyodyne.com
 
 (notice that --filter interprets every given rule as a regex).
 
 i personally think the --filter option would be a great new feature for wget,
 and i have already started working on its implementation, but we still have a
 few opened questions.
 
 for instance, the syntax for --filter presented above is basically the
 following:
 
 --filter=[+|-][file|path|domain]:REGEXP
 
 is it consistent? is it flawed? is there a more convenient one?
 
 please notice that supporting multiple comma-separated regexp in a single
 --filter option:
 
 --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...
 
 would significantly complicate the implementation and usage of --filter, as it
 would require escaping of the , character. also notice that current
 filtering options like -A/R are somewhat broken, as they do not allow the
 usage of the , character in filtering rules.
 
 we also have to reach consensus on the filtering algorithm. for instance,
 should we simply require that a url passes all the filtering rules to allow
 its download (just like the current -A/R behaviour), or should we instead
 adopt a short circuit algorithm that applies all rules in the same order in
 which they were given in the command line and immediately allows the download
 of an url if it passes the first allow match? should we also support
 apache-like deny-from-all and allow-from-all policies? and what would be the
 best syntax to trigger the usage of these policies?
 
 i am looking forward to read your opinions on this topic.
 
 
 P.S.: the new --filter option would replace and extend the old -D, -I/X
 and -A/R options, which will be deprecated but still supported.
 
 -- 
 Aequam memento rebus in arduis servare mentem...
 
 Mauro Tortonesi  http://www.tortonesi.com
 
 University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
 GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
 Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
 Ferrara Linux User Group                 http://www.ferrara.linux.it
 
 


Re: regex support RFC

2006-03-29 Thread Hrvoje Niksic
Mauro Tortonesi [EMAIL PROTECTED] writes:

 for instance, the syntax for --filter presented above is basically the
 following:

 --filter=[+|-][file|path|domain]:REGEXP

I think there should also be url for filtering on the entire URL.
People have been asking for that kind of thing a lot over the years.


Re: regex support RFC

2006-03-29 Thread Hrvoje Niksic
Jim Wright [EMAIL PROTECTED] writes:

 what definition of regexp would you be following?  or would this be
 making up something new?

It wouldn't be new, Mauro is definitely referring to regexps as
normally understood.  The regexp API's found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.
They also lack the support for the very useful non-greedy matching
quantifier (the ? modifier to the * operator) introduced by Perl 5
and supported by most of today's major regexp implementations: Python,
Java, Tcl, etc.

One idea was to use PCRE, bundling it with Wget for the sake of
Windows and systems without PCRE.  Another (http://tinyurl.com/elp7h)
was to use and bundle Emacs's regex.c, the version of GNU regex
shipped with GNU Emacs.  It is small (one source) and offers
Unix-compatible basic and extended regexps, but also supports the
non-greedy quantifier and non-capturing groups.

See the message and the related discussion at http://tinyurl.com/mdwhx
for more about this topic.

 I'm not quite understanding the comment about the comma and needing
 escaping for literal commas.

Supporting PATTERN1,PATTERN2,... would require having a way to quote
the comma character.  But there is little reason for a specific comma
syntax since one can always use (PATTERN1|PATTERN2|...).

Being unable to have a comma in the pattern is a shortcoming in the
current -R/-A options.

 I do like the [file|path|domain]: approach.  very nice and flexible.

Thanks.