Micah Cowan wrote:

> - It should use extended regular expressions

Agreed.
> PCREs are less important

I have a very strong preference for \s over [[:space:]].

> - It should be possible to match against just certain components of an
> URL

Agreed. In your exchange with Matthew some possible labels were
discussed. I compared the identifiers you suggested with the definition
of Location in JavaScript and noted that there is very little overlap.
(I'm not sure that JavaScript should be the deciding factor, but these
are well-known names for the components.)

    url    (href)
    scheme (protocol)
    domain (is it host, which includes port, or hostname, which does not?)
    path   (pathname)
    query  (search [includes ?])
    field  (no equivalent)

Also, Location includes port and hash. How do you plan to deal with
these aspects of a URL?

There should be a simple way of matching both www.site.com and
site.com. It might be explicitly specified as ':domain:^.*\bsite\.com$',
but I suspect most people will really want ':domain:site.com' to match
both, yet not match othersite.com.

> - We should avoid adding many more options than are necessary.

Agreed.

> - It should be easy to match against individual fields from HTML
> forms, within query strings.

I agree that it is convenient to separate the query string into
"fields" by splitting on '&'. However, I think it should also be easy
to match on the name and value portions of name=value. For example,
exclude any URL where 'action' is specified. Perhaps that would be
':field:^action='.

> - We should avoid unnecessary external dependencies if possible.

Agreed, but we should not lose functionality for most users because
some implementation has a broken or missing regex library.

> - We should provide short options.

Perhaps, but I would put this in the "nice to have" category.

> { --match | --no-match } [ : components [ / flags ] : ] regex

Sounds OK, but I think you mean:

    [ : [ components ] [ / flags ] : ]

That is, I think you meant to allow ':/i:foo', since you use that
syntax later in your message.
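To make the domain-matching proposal concrete, here is a small sketch
(in Python, purely for illustration; `domain_matches` is a hypothetical
helper, not anything in wget) of treating a bare ':domain:' pattern as
implicitly anchored with a word boundary, so that 'site.com' matches
both site.com and www.site.com but not othersite.com:

```python
import re

def domain_matches(pattern, hostname):
    # Hypothetical sketch: escape literal dots in the bare pattern,
    # then apply the implied anchoring '^.*\b' ... '$' described above.
    anchored = r'^.*\b' + re.escape(pattern) + r'$'
    return re.search(anchored, hostname) is not None

assert domain_matches('site.com', 'site.com')
assert domain_matches('site.com', 'www.site.com')
assert not domain_matches('site.com', 'othersite.com')
```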
> With short options -z and -Z for --match and --no-match, respectively.

Those are not intuitive choices to me, but OK.

> it is implicitly anchored at the beginning and end

I think this is a bad idea. If someone wants ^ and $, they should
specify them. I realize that my argument for domain matching is not
entirely consistent with explicit anchors. Going back to domain
matching, I think ':domain:site.com' should be interpreted as
':domain:^.*\bsite\.com$', but I also think domain matching is a
special case. In the more general case of anchoring, I think
':path:foo' should match both '/path/to/foo.html' and
'/foo/baz/index.html'.

> If the components aren't specified, it would default to matching just
> the pathname portion of the URL.

I'm not sure this is the obvious behavior, but I would get used to it.

> - If we make the components-specification mandatory, we could eschew
> the initial colon.

I don't like this approach.

It is not clear to me how one would combine matches. Let's say that I
want all ZIP files from directory a (but not from directory A) and all
JPG files from directory b (but not from directory B). How do I
indicate that I want to match:

    (':path:\ba\b' AND ':path/i:\.zip$') OR
    (':path:\bb\b' AND ':path/i:\.jpg$')

> == The "--traverse" option ==

In general, I agree with the thinking in this entire section.

> Additionally, the --traverse settings would be ignored when we're one
> level away from the maximum recursion depth. Why download something
> just to throw it out without doing anything more?

What if you're recording unfollowed links to the SIDB? Don't you still
want those links to appear?

> Caveat: I'm against giving --traverse an implicit default value of
> '.*\.html?'

What's wrong with treating --traverse as meaning
--traverse ':path/i:^.*\.html?$', and then having
--traverse ':path/i:^.*\.php$' override that behavior and download
only PHP pages?
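The combination question above can be stated as code. A sketch (in
Python, for illustration only; the rule structure and `path_wanted`
are my own invention, not proposed wget syntax) of evaluating the
example as an OR over AND-groups:

```python
import re

# Each inner list is ANDed together; the outer list is ORed, mirroring
#   (':path:\ba\b' AND ':path/i:\.zip$') OR
#   (':path:\bb\b' AND ':path/i:\.jpg$')
rules = [
    [(r'\ba\b', 0), (r'\.zip$', re.IGNORECASE)],
    [(r'\bb\b', 0), (r'\.jpg$', re.IGNORECASE)],
]

def path_wanted(path):
    # Accept the path if every pattern in at least one group matches.
    return any(all(re.search(pat, path, flags) for pat, flags in group)
               for group in rules)

assert path_wanted('/a/file.ZIP')        # ZIP from directory a
assert not path_wanted('/A/file.zip')    # directory A is case-sensitive
assert path_wanted('/b/photo.jpg')       # JPG from directory b
assert not path_wanted('/a/photo.jpg')   # JPGs only wanted from b
```

The point is that a flat list of --match options has no obvious way to
express this grouping; some AND/OR structure would be needed.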
In other words, if you don't specify a matching pattern to traverse,
it behaves the way it does now; but if you do specify one, you have to
include '.html' if you want HTML suffixes as well.

Given that the most common use case is to match against suffixes in
the path, perhaps ':path/i:^.*\.' and '$' should be implied, so that
--traverse '(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.

By the way, it would probably be helpful to have a variation of
--traverse that looks for Content-Type headers containing "text/html",
regardless of path extension.
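The implied-suffix idea above amounts to a simple pattern rewrite. A
sketch (Python, illustrative only; `expand_suffix_pattern` is a
hypothetical helper):

```python
import re

def expand_suffix_pattern(suffixes):
    # Wrap a bare suffix alternation in the implied
    # ':path/i:^.*\.' ... '$' discussed above.
    return re.compile(r'^.*\.' + suffixes + r'$', re.IGNORECASE)

pat = expand_suffix_pattern(r'(html?|php)')
assert pat.match('/docs/index.html')
assert pat.match('/cgi/page.PHP')          # /i flag is implied
assert not pat.match('/files/archive.zip')
```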