Micah Cowan wrote:

> - It should use extended regular expressions

Agreed.
> PCREs are less important

I have a very strong preference for \s over [[:space:]].

> - It should be possible to match against just certain components of an
> URL

Agreed. In your exchange with Matthew some possible labels were
discussed. I compared the identifiers you suggested with the definition
of Location in JavaScript and noted that there is very little overlap.
(I'm not sure that JavaScript should be the deciding factor, but these
are well-known names for the components.)

    url    (href)
    scheme (protocol)
    domain (is it host, which includes port, or hostname, which does not?)
    path   (pathname)
    query  (search [includes ?])
    field  (no equivalent)

Also, Location includes port and hash. How do you plan to deal with
these aspects of a URL?

There should be a simple way of matching both www.site.com and
site.com. It might be explicitly specified as ':domain:^.*\bsite\.com$',
but I suspect most people will really want ':domain:site.com' to match
both, yet not match othersite.com.

> - We should avoid adding many more options than are necessary.

Agreed.

> - It should be easy to match against individual fields from HTML
> forms, within query strings.

I agree that it is convenient to separate the query string into
"fields" by splitting on '&'. However, I think it should also be easy
to match on the name and value portions of name=value. For example,
exclude any URL where 'action' is specified. Perhaps that would be
':field:^action='.

> - We should avoid unnecessary external dependencies if possible.

Agreed, but we should not lose functionality for most users because
some implementation has a broken or missing regex library.

> - We should provide short options.

Perhaps, but I would put this in the "nice to have" category.

> { --match | --no-match } [ : components [ / flags ] : ] regex

Sounds OK, but I think you mean:

    [ : [ components ] [ / flags ] : ]

That is, I think you meant to allow ':/i:foo', since you use that
syntax later in your message.
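To make the domain-matching proposal concrete, here is a small sketch
(in Python, purely for illustration; `domain_matches` is a hypothetical
helper, not anything in wget) of treating a bare ':domain:' pattern as
implicitly anchored with a word boundary, so that 'site.com' matches
both site.com and www.site.com but not othersite.com:

```python
import re

def domain_matches(pattern, hostname):
    # Hypothetical sketch: escape literal dots in the bare pattern,
    # then apply the implied anchoring '^.*\b' ... '$' described above.
    anchored = r'^.*\b' + re.escape(pattern) + r'$'
    return re.search(anchored, hostname) is not None

assert domain_matches('site.com', 'site.com')
assert domain_matches('site.com', 'www.site.com')
assert not domain_matches('site.com', 'othersite.com')
```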
> With short options -z and -Z for --match and --no-match, respectively.

Those are not intuitive choices to me, but OK.

> it is implicitly anchored at the beginning and end

I think this is a bad idea. If someone wants ^ and $, they should
specify them. I realize that my argument for domain matching is not
entirely consistent with explicit anchors. Going back to domain
matching, I think ':domain:site.com' should be interpreted as
':domain:^.*\bsite\.com$', but I also think domain matching is a
special case. In the more general case of anchoring, I think
':path:foo' should match both '/path/to/foo.html' and
'/foo/baz/index.html'.

> If the components aren't specified, it would default to matching just
> the pathname portion of the URL.

I'm not sure this is the obvious behavior, but I would get used to it.

> - If we make the components-specification mandatory, we could eschew
> the initial colon.

I don't like this approach.

It is not clear to me how one would combine matches. Let's say that I
want all ZIP files from directory a (but not from directory A) and all
JPG files from directory b (but not from directory B). How do I
indicate that I want to match:

    (':path:\ba\b' AND ':path/i:\.zip$') OR
    (':path:\bb\b' AND ':path/i:\.jpg$')

> == The "--traverse" option ==

In general, I agree with the thinking in this entire section.

> Additionally, the --traverse settings would be ignored when we're one
> level away from the maximum recursion depth. Why download something
> just to throw it out without doing anything more?

What if you're recording unfollowed links to the SIDB? Don't you still
want those links to appear?

> Caveat: I'm against giving --traverse an implicit default value of
> '.*\.html?'

What's wrong with treating --traverse as meaning
--traverse ':path/i:^.*\.html?$', and then having
--traverse ':path/i:^.*\.php$' override that behavior and download
only PHP pages?
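The combination question above can be stated as code. A sketch (in
Python, for illustration only; the rule structure and `path_wanted`
are my own invention, not proposed wget syntax) of evaluating the
example as an OR over AND-groups:

```python
import re

# Each inner list is ANDed together; the outer list is ORed, mirroring
#   (':path:\ba\b' AND ':path/i:\.zip$') OR
#   (':path:\bb\b' AND ':path/i:\.jpg$')
rules = [
    [(r'\ba\b', 0), (r'\.zip$', re.IGNORECASE)],
    [(r'\bb\b', 0), (r'\.jpg$', re.IGNORECASE)],
]

def path_wanted(path):
    # Accept the path if every pattern in at least one group matches.
    return any(all(re.search(pat, path, flags) for pat, flags in group)
               for group in rules)

assert path_wanted('/a/file.ZIP')        # ZIP from directory a
assert not path_wanted('/A/file.zip')    # directory A is case-sensitive
assert path_wanted('/b/photo.jpg')       # JPG from directory b
assert not path_wanted('/a/photo.jpg')   # JPGs only wanted from b
```

The point is that a flat list of --match options has no obvious way to
express this grouping; some AND/OR structure would be needed.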
In other words, if you don't specify a matching pattern to traverse,
it behaves the way it does now; but if you do specify one, you have to
include '.html' if you want HTML suffixes as well.

Given that the most common use case is to match against suffixes in
the path, perhaps ':path/i:^.*\.' and '$' should be implied, so that
--traverse '(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.

By the way, it would probably be helpful to have a variation of
--traverse that looks for Content-Type headers containing "text/html",
regardless of path extension.
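The implied-suffix idea above amounts to a simple pattern rewrite. A
sketch (Python, illustrative only; `expand_suffix_pattern` is a
hypothetical helper):

```python
import re

def expand_suffix_pattern(suffixes):
    # Wrap a bare suffix alternation in the implied
    # ':path/i:^.*\.' ... '$' discussed above.
    return re.compile(r'^.*\.' + suffixes + r'$', re.IGNORECASE)

pat = expand_suffix_pattern(r'(html?|php)')
assert pat.match('/docs/index.html')
assert pat.match('/cgi/page.PHP')          # /i flag is implied
assert not pat.match('/files/archive.zip')
```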