Re: regex support RFC
Scott Scriven wrote:
> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>> wget -r --filter=-domain:www-*.yoyodyne.com
>
> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
> "www---.yoyodyne.com", and so on, if interpreted as a regex.

not really. it would not match www.yoyodyne.com.

> It would most likely also match "www---zyoyodyneXcom".

yes.

> Perhaps you want glob patterns instead? I know I wouldn't mind having
> glob patterns in addition to regexes... glob is much easier when
> you're not doing complex matches.

no. i was talking about regexps. they are more expressive and powerful
than simple globs. i don't see what's the point in supporting both.

> If I had to choose just one though, I'd prefer to use PCRE,
> Perl-Compatible Regular Expressions. They offer a richer, more
> concise syntax than traditional regexes, such as \d instead of
> [:digit:] or [0-9].

i agree, but adding a dependency on PCRE to wget is asking for infinite
maintenance nightmares. and i don't know if we can simply bundle code
from PCRE in wget, as it has a BSD license.

>> --filter=[+|-][file|path|domain]:REGEXP
>>
>> is it consistent? is it flawed? is there a more convenient one?
>
> It seems like a good idea, but wouldn't actually provide the
> regex-filtering features I'm hoping for unless there was a "raw" type
> in addition to "file", "domain", etc. I'll give details below.
> Basically, I need to match based on things like the inline CSS data,
> the visible link text, etc.

do you mean you would like to have a regex class working on the content
of downloaded files as well?

> Below is the original message I sent to the wget list a few months
> ago, about this same topic:
>
> I'd find it useful to guide wget by using regular expressions to
> control which links get followed. For example, to avoid following
> links based on embedded css styles or link text.
>
> I've needed this several times, but the most recent was when I wanted
> to avoid following any "add to cart" or "buy" links on a site which
> uses GET parameters instead of directories to select content. Given a
> link like this...
>
> <a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album" class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>
>
> ... a useful parameter could be --ignore-regex='AddToCart|add to cart'
> so the class or link text (really, anything inside the tag) could be
> used to decide whether the link should be followed.
>
> Or... if there's already a way to do this, let me know. I didn't see
> anything in the docs, but I may have missed something. :)
>
> I think what I want could be implemented via the --filter option,
> with a few small modifications to what was proposed. I'm not sure
> exactly what syntax to use, but it should be able to specify whether
> to include/exclude the link, which PCRE flags to use, how much of the
> raw HTML tag to use as input, and what pattern to use for matching.
> Here's an idea:
>
>   --filter=[allow][flags,][scope][:]pattern
>
> Example: '--filter=-i,raw:add ?to ?cart'
> (the quotes are there only to make the shell treat it as one
> parameter)
>
> The details are:
>
> "allow" is "+" for "include" or "-" for "exclude". It defaults to "+"
> if omitted.
>
> "flags," is a set of letters to control regex options, followed by a
> comma (to separate it from scope). For example, "i" specifies a
> case-insensitive search. These would be the same flags that perl
> appends to the end of search patterns. So, instead of "/foo/i", it
> would be "--filter=+i,:foo"
>
> "scope" controls how much of the <a> or similar tag gets used as
> input to the regex. Values include:
>
>   raw:    use the entire tag and all contents (default)
>   domain: use only the domain name (www.example.com)
>   file:   use only the file name (foo.ext)
>   path:   use the directory, but not the file name (/path/to)
>   others... can be added as desired
>
> ":" is required if "allow" or "flags" or "scope" is given.
>
> So, for example, to exclude the "add to cart" links in my previous
> post, this could be used:
>
>   --filter=-raw:'AddToCart|add to cart'
> or
>   --filter=-raw:AddToCart\|add\ to\ cart
> or
>   --filter=-:'AddToCart|add to cart'
> or
>   --filter=-i,raw:'add ?to ?cart'
>
> Alternately, the --filter option could be split into two options: one
> for including content, and one for excluding. This would be more
> consistent with wget's existing parameters, and would slightly
> simplify the syntax.
>
> I hope I haven't been too full of hot air. This is a feature I've
> wanted in wget for a long time, and I'm a bit excited that it might
> happen soon. :)

i don't like your "raw" proposal as it is HTML-specific. i would like
instead to develop a mechanism which could work for all supported
protocols.
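[Editor's note: the proposed --filter=[allow][flags,][scope][:]pattern
grammar can be sketched as a small parser. The Python below is only an
illustration of the syntax under discussion; parse_filter and the scope
list are hypothetical names, and wget itself is written in C.]

```python
import re

SCOPES = {"raw", "domain", "file", "path"}

def parse_filter(spec):
    """Parse the proposed --filter=[allow][flags,][scope][:]pattern syntax.

    Returns (allow, scope, compiled_pattern).  This is only a sketch of
    the grammar discussed in the thread, not wget code; a real parser
    would need to cope with ":" appearing inside patterns that have no
    explicit scope prefix.
    """
    allow = True
    if spec and spec[0] in "+-":
        allow = spec[0] == "+"
        spec = spec[1:]

    flags = ""
    head, sep, rest = spec.partition(":")
    if sep:  # an explicit ":" separates modifiers from the pattern
        if "," in head:
            flags, head = head.split(",", 1)
        scope = head or "raw"   # ":" given but scope omitted
        pattern = rest
    else:    # no ":" at all -- the whole spec is the pattern
        scope, pattern = "raw", spec

    if scope not in SCOPES:
        raise ValueError("unknown scope: %r" % scope)

    re_flags = re.IGNORECASE if "i" in flags else 0
    return allow, scope, re.compile(pattern, re_flags)

# The example from the message above: exclude "add to cart" links,
# case-insensitively, matching against the raw tag.
allow, scope, rx = parse_filter("-i,raw:add ?to ?cart")
print(allow, scope, bool(rx.search("Add To Cart")))
```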
Re: regex support RFC
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> Scott Scriven wrote:
>> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>>
>>> wget -r --filter=-domain:www-*.yoyodyne.com
>>
>> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
>> "www---.yoyodyne.com", and so on, if interpreted as a regex.
>
> not really. it would not match www.yoyodyne.com.

Why not?

>> Perhaps you want glob patterns instead? I know I wouldn't mind
>> having glob patterns in addition to regexes... glob is much
>> easier when you're not doing complex matches.
>
> no. i was talking about regexps. they are more expressive and
> powerful than simple globs. i don't see what's the point in
> supporting both.

I agree with this.
Re: regex support RFC
Hrvoje Niksic wrote:
> Mauro Tortonesi <[EMAIL PROTECTED]> writes:
>> Scott Scriven wrote:
>>> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>>>> wget -r --filter=-domain:www-*.yoyodyne.com
>>> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
>>> "www---.yoyodyne.com", and so on, if interpreted as a regex.
>> not really. it would not match www.yoyodyne.com.
>
> Why not?

i may be wrong, but if - is not a special character, the previous
expression should match only domains starting with www- and ending in
[randomchar]yoyodyne[randomchar]com.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
Re: regex support RFC
Oliver Schulze L. wrote:
> Hrvoje Niksic wrote:
>> The regexp API's found on today's Unix systems might be usable, but
>> unfortunately those are not available on Windows.
>
> My personal idea on this is to: enable regex in Unix and disable it
> on Windows. We all use Unix/Linux and regex is really useful. I think
> not having regex on Windows will not do any more harm than it is
> doing now (not having it at all)

for consistency and to avoid maintenance problems, i would like wget to
have the same behavior on windows and unix. please, notice that if we
implemented regex support only on unix, windows binaries of wget built
with cygwin would have regex support but native binaries wouldn't. that
would be very confusing for windows users, IMHO.

> I hope wget can get connection cache,

this is planned for wget 1.12 (which might become 2.0). i already have
some code implementing the connection cache data structure.

> URL regex

this is planned for wget 1.11. i've already started working on it.

> and advanced mirror functions (sync 2 folders) in the near future.

this is very interesting.
Re: regex support RFC
Curtis Hatter wrote:
> On Thursday 30 March 2006 13:42, Tony Lewis wrote:
>> Perhaps --filter=path,i:/path/to/krs would work.
>
> That would look to be the most elegant method. I do hope that the
> (?i:) and (?-i:) constructs are supported since I may not want the
> entire path/file to be case (in)?sensitive =), but that will depend
> on the regex engine chosen.

while i like the idea of supporting modifiers like "quick" (short
circuit) and maybe "i" (case insensitive comparison), i think that
(?i:) and (?-i:) constructs would be overkill and rather hard to
implement.
RE: regex support RFC
> From: Oliver Schulze L. [mailto:[EMAIL PROTECTED]]
> My personal idea on this is to: enable regex in Unix and
> disable it on Windows.
> We all use Unix/Linux and regex is really useful. I think not having

We all use Unix/Linux? You would be surprised how many wget users on
windows are out there.

Beside that, Those Who Know The Code better than me please consider how
bad portability issues in using native regexp engines could be. Are the
interfaces and capabilities all the same, or are there consistent
differences between the various flavors (gnu, several BSD, hpux, aix,
sunos, solaris, older flavours...)? If so, that would be a point
favouring an external library (hopefully supported on as many flavours
as possible).

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax
Re: regex support RFC
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

>>>>> wget -r --filter=-domain:www-*.yoyodyne.com
>>>> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
>>>> "www---.yoyodyne.com", and so on, if interpreted as a regex.
>>> not really. it would not match www.yoyodyne.com.
>> Why not?
>
> i may be wrong, but if - is not a special character, the previous
> expression should match only domains starting with www- and ending
> in [randomchar]yoyodyne[randomchar]com.

"*" matches the previous character repeated 0 or more times. This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times. (This is part of why regexps are often confusing to people
used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct. What you describe is achieved with
"www-.*.yoyodyne.com".
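[Editor's note: Hrvoje's point is easy to check mechanically. The
Python below contrasts the same pattern string interpreted as a regex
and as a shell-style glob; it is an illustration only, not wget code.]

```python
import fnmatch
import re

domains = ["www.yoyodyne.com", "www-.yoyodyne.com", "www--.yoyodyne.com"]
pattern = "www-*.yoyodyne.com"

# As a regex, "-*" means "zero or more hyphens", and the unescaped "."
# matches any character, so even "www.yoyodyne.com" matches.
as_regex = [d for d in domains if re.match(pattern, d)]

# As a glob, "*" alone matches any run of characters, but the literal
# "www-" prefix is still required, so "www.yoyodyne.com" is rejected.
as_glob = [d for d in domains if fnmatch.fnmatch(d, pattern)]

print(as_regex)  # all three domains
print(as_glob)   # only the two that really start with "www-"
```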
Re: regex support RFC
Hrvoje Niksic wrote:
> "*" matches the previous character repeated 0 or more times. [...]
> Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
> interpretation was correct. What you describe is achieved with
> "www-.*.yoyodyne.com".

you're right. ok, it is official. i must stop drinking this much - it
just doesn't work. i have to start drinking less or, even better, more.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Hrvoje Niksic wrote:
> Herold Heiko <[EMAIL PROTECTED]> writes:
>> Get the best of both, use a syntax permitting a "first match exits"
>> ACL, single ACE permits several statements ANDed together.
>
> Cooking up a simple syntax for users without much regexp experience
> won't be easy. I assume ACL stands for "access control list", but
> what is ACE?

access control entry, i guess.
Re: regex support RFC
On 31/03/2006, at 14:37, Hrvoje Niksic wrote:
> "*" matches the previous character repeated 0 or more times. This is
> in contrast to wildcards, where "*" alone matches any character 0 or
> more times. [...] Therefore "www-*" matches "www", "www-", "www--",
> etc., i.e. Scott's interpretation was correct.

Are you sure that "www-*" matches "www"? As far as I know "www-*"
matches "one w, another w, a third w, a hyphen, then 0 or more
hyphens". In other words, "www" does not match.

Wincent
Re: Bug in ETA code on x64
On 29/03/2006, at 14:39, Hrvoje Niksic wrote:
>>> I can't see any good reason to use "," here. Why not write the
>>> line as:
>>>
>>>   eta_hrs = eta / 3600; eta %= 3600;
>>
>> Because that's not equivalent.
>
> Well, it should be, because the comma operator has lower precedence
> than the assignment operator (see http://tinyurl.com/evo5a,
> http://tinyurl.com/ff4pp and numerous other locations).

Indeed you are right. So:

  eta_hrs = eta / 3600, eta %= 3600;

is equivalent to the following (with explicit parentheses to make the
effect of the precedence obvious):

  (eta_hrs = eta / 3600), (eta %= 3600);

Or of course:

  eta_hrs = eta / 3600; eta %= 3600;

Greg
Re: regex support RFC
Wincent Colaiuta <[EMAIL PROTECTED]> writes:

> Are you sure that "www-*" matches "www"?

Yes.

> As far as I know "www-*" matches "one w, another w, a third w, a
> hyphen, then 0 or more hyphens".

That would be "www--*" or "www-+".
Re: regex support RFC
Hrvoje Niksic wrote:
> Wincent Colaiuta <[EMAIL PROTECTED]> writes:
>> Are you sure that "www-*" matches "www"?
>
> Yes.

hrvoje is right. try this perl script:

#!/usr/bin/perl -w
use strict;

my @strings = ("www-.yoyodyne.com", "www.yoyodyne.com");

foreach my $str (@strings) {
    $str =~ /www-*.yoyodyne.com/
        or print "$str doesn't match\n";
}

both the strings match.
Re: regex support RFC
Mauro Tortonesi wrote:
> for consistency and to avoid maintenance problems, i would like wget
> to have the same behavior on windows and unix. please, notice that if
> we implemented regex support only on unix, windows binaries of wget
> built with cygwin would have regex support but native binaries
> wouldn't. that would be very confusing for windows users, IMHO.

Ok, I understand. I was thinking of an #ifdef in the source code so you
can:
- enable all regex code/command line parameters in Unix/Linux
- at runtime, print the error "regex not yet supported on windows" if
  any regex-related command line parameter is passed to wget on
  windows/cygwin

> this is planned for wget 1.12 (which might become 2.0). i already
> have some code implementing the connection cache data structure.

Excellent!

>> URL regex
> this is planned for wget 1.11. i've already started working on it.

looking forward to it, many thanks!

-- 
Oliver Schulze L. <[EMAIL PROTECTED]>
RE: regex support RFC
Mauro Tortonesi wrote:
> no. i was talking about regexps. they are more expressive
> and powerful than simple globs. i don't see what's the
> point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter=-file:.*\.pdf. In many cases
their expressions will simply work, which will result in significant
confusion when some expression doesn't work, such as
--filter=-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

  --filter=-domain:www-*.yoyodyne.com
  --filter=-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat
it as a regular expression. For the vast majority of cases, glob will
work just fine.

One might argue that it's a lot of work to implement regular
expressions if the default input format is a glob, but I think we
should aim for both lack of confusion and robust functionality. Using
",r" means people get regular expressions when they want them and know
what they're doing. The universe of wget users who "know what they're
doing" are mostly subscribed to this mailing list; the rest of them
send us mail saying "please CC me as I'm not on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from
glob to regular expression should take directory separators into
account, such as:

  --filter=-path:path/to/*

becoming the same as:

  --filter=-path,r:path/to/[^/]*

or even:

  --filter=-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony
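[Editor's note: the glob-to-regex conversion Tony describes, including
his question about directory separators, can be sketched in a few
lines. The Python below is an illustration only; glob_to_regex and the
path_scope switch are hypothetical names, not wget code.]

```python
import re

def glob_to_regex(glob, path_scope=False):
    """Translate a simple glob into an anchored regex.

    With path_scope=True, "*" stops at "/", so "path/to/*" does not
    match "path/to/sub/dir", which is the behavior Tony suspects the
    hypothetical ",r"-less --filter default should have.
    """
    star = "[^/]*" if path_scope else ".*"
    out = []
    for ch in glob:
        if ch == "*":
            out.append(star)
        elif ch == "?":
            out.append("[^/]" if path_scope else ".")
        else:
            out.append(re.escape(ch))  # treat everything else literally
    return "^" + "".join(out) + "$"

rx = re.compile(glob_to_regex("path/to/*", path_scope=True))
print(bool(rx.match("path/to/file.html")))  # matches
print(bool(rx.match("path/to/sub/dir")))    # does not match
```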
Re: regex support RFC
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Mauro Tortonesi wrote:
>> no. i was talking about regexps. they are more expressive
>> and powerful than simple globs. i don't see what's the
>> point in supporting both.
>
> The problem is that users who are expecting globs will try things
> like --filter=-file:*.pdf

The --filter command will be documented from the start to support
regexps. Since most Unix utilities work with regexps and very few with
globs (excepting the shell), this should not be a problem.

> It is pretty easy to programmatically convert a glob into a regular
> expression.

But it's harder to document and explain, and it requires more options
and logic. Supporting two different syntaxes (the old one for backward
compatibility) is bad enough; supporting three is at least one too
many.

> One possibility is to make glob the default input and allow regular
> expressions. For example, the following could be equivalent:
>
> --filter:-domain:www-*.yoyodyne.com
> --filter:-domain,r:www-.*\.yoyodyne\.com

But that misses the point, which is that we *want* to make the more
expressive language, already used elsewhere on Unix, the default.
RE: regex support RFC
Hrvoje Niksic wrote:
> But that misses the point, which is that we *want* to make the
> more expressive language, already used elsewhere on Unix, the
> default.

I didn't miss the point at all. I'm trying to make a completely
different one, which is that regular expressions will confuse most
users (even if you tell them that the argument to --filter is a regular
expression). This mailing list will get a huge number of bug reports
when users try to use globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not
everywhere. The shell is the most obvious comparison for user input
dealing with expressions that select multiple objects; the shell uses
globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I
just don't think the same thing can be said for the typical wget user.
We've already had disagreements in this chain about what would match a
particular regular expression; I suspect everyone involved in the
conversation could have correctly predicted what the equivalent glob
would do.

I don't think ",r" complicates the command that much. Internally, the
only additional work for supporting both globs and regular expressions
is a function that converts a glob into a regexp when ",r" is not
requested. That's a straightforward transformation.

Tony
Re: regex support RFC
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> I didn't miss the point at all. I'm trying to make a completely
> different one, which is that regular expressions will confuse most
> users (even if you tell them that the argument to --filter is a
> regular expression).

Well, "most users" will probably not use --filter in the first place.
Those that do will have to look at the documentation, where they'll
find that it accepts regexps. Since Wget is hardly the first program to
use regexps, I don't see why most users would be confused by that
choice.

> Yes, regular expressions are used elsewhere on Unix, but not
> everywhere. The shell is the most obvious comparison for user input
> dealing with expressions that select multiple objects; the shell
> uses globs.

I don't see a clear line that connects --filter to glob patterns as
used by the shell. If anything, the connection is with grep and other
commands that provide powerful filtering (awk and Perl's // operators),
which all seem to work on regexps. Where the context can be thought of
as shell-like (as in wget ftp://blah/*), Wget happily obliges by
providing shell-compatible patterns.

> I don't think ",r" complicates the command that much. Internally,
> the only additional work for supporting both globs and regular
> expressions is a function that converts a glob into a regexp when
> ",r" is not requested. That's a straightforward transformation.

",r" makes it harder to input regexps, which are the whole point of
introducing --filter. Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.
RE: regex support RFC
Hrvoje Niksic wrote:
> I don't see a clear line that connects --filter to glob patterns as
> used by the shell.

I want to list all PDFs in the shell:

  ls -l *.pdf

I want a filter to keep all PDFs:

  --filter=+file:*.pdf

Note that "*.pdf" is not a valid regular expression even though it's
what most people will try naturally. Perl complains:

  /*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests
will be for users who are trying a glob rather than a regular
expression.

Tony
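[Editor's note: the failure mode Tony predicts is easy to reproduce.
The Python below is an illustration only; Python's re engine rejects
"*.pdf" for the same reason Perl does.]

```python
import fnmatch
import re

# The glob most users will reach for first works fine as a glob:
glob_ok = fnmatch.fnmatch("report.pdf", "*.pdf")

# The same string handed to a regex engine is a syntax error, which is
# what Perl's "?+*{} follows nothing in regexp" message is about: the
# leading "*" has nothing to repeat.
try:
    re.compile("*.pdf")
    regex_error = None
except re.error as err:
    regex_error = str(err)

# What the user actually needed, spelled as a regex:
regex_ok = bool(re.search(r"\.pdf$", "report.pdf"))

print(glob_ok, regex_error, regex_ok)
```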
Re: regex support RFC
On Friday 31 March 2006 06:52, Mauro Tortonesi wrote:
> while i like the idea of supporting modifiers like "quick" (short
> circuit) and maybe "i" (case insensitive comparison), i think that
> (?i:) and (?-i:) constructs would be overkill and rather hard to
> implement.

I figured that the (?i:) and (?-i:) constructs would be provided by the
regular expression engine and that the --filter switch would simply be
able to use any construct provided by that engine. I was more trying to
argue for the use of a regex engine that supports such constructs (like
Perl's).

Some other constructs I find useful are: (?=), (?!=), (?)

These may be overkill, but I would rather have the expressiveness of a
regex engine like Perl's when I need it instead of writing regexes in
another engine that have to be twice as long to compensate for the lack
of language constructs. Those who don't want to use them, or don't know
of them, can write regexes as normal.

If, as you said, this would be hard to implement or require extra
effort by you above and beyond that required for the more "standard"
constructs, then I would say that they shouldn't be implemented; at
least not at first.

Curtis
Re: regex support RFC
> * [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
>>
>> soon leads to non wget related links being downloaded, eg.
>> http://www.gnu.org/graphics/agnuhead.html
>
> In that particular case, I think --no-parent would solve the
> problem.

No. The idea is not to be restricted to not descending the tree.

> Maybe I misunderstood, though. It seems awfully risky to use -r
> and -H without having something to strictly limit the links
> followed. So, I suppose the content filter would be an effective
> way to make cross-host downloading safer.

Absolutely. That is why I proposed a 'contents' regexp.

> I think I'd prefer to have a different option, for that sort of
> thing -- filter by using external programs. If the program
> returns a specific code, follow the link or recurse into the
> links contained in the file. Then you could do far more complex
> filtering, including things like interactive pruning.

True. That could be a future feature request, but now that the wget
team are writing regexp code, it seems an ideal time to implement it.
By constructing suitable regexps, one could use this feature to search
for any string in the html file (as above), or just in metatags etc.
IMHO it gives a lot of flexibility for little extra developer
programming. Any comments, Mauro & Hrvoje?

Thanks
Tom Crane
-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London,
Egham Hill, Egham, Surrey, TW20 0EX, England.
Email: [EMAIL PROTECTED]
Fax: +44 (0) 1784 472794
Problem with double slashes in URI
I'm having a problem downloading files from a Novell Netware server.
When I do it manually with FTP, I first 'cd //ccp1' to change servers.
Ncftpget seems to do this, but wget doesn't:

[EMAIL PROTECTED]:/tmp$ ncftpget -u xxx -p yyy ccp3 /tmp/ '//ccp1/data/shared/news/motd/qotd.txt'
/tmp/qotd.txt: 109.00 B 2.54 kB/s

[EMAIL PROTECTED]:/tmp$ wget --timestamping --no-host-directories --glob=on --recursive --cut-dirs=4 'ftp://xxx:[EMAIL PROTECTED]/%2Fccp1/data/shared/news/motd/qotd.txt'
--12:46:27-- ftp://isgadmin:[EMAIL PROTECTED]/%2Fccp1/data/shared/news/motd/qotd.txt
           => `motd/.listing'
Resolving ccp3... 162.129.45.72
Connecting to ccp3[162.129.45.72]:21... connected.
Logging in as isgadmin ... Logged in!
==> SYST ... done.  ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /ccp1/data/shared/news/motd ...
No such directory `/ccp1/data/shared/news/motd'.
unlink: No such file or directory
FINISHED --12:46:27--
Downloaded: 0 bytes in 0 files
[EMAIL PROTECTED]:/tmp$

I searched this list's archives on 'double slash' to try to solve this
problem. This unanswered post from 2002 is exactly my problem (using
Novell Netware servers):
http://www.mail-archive.com/wget@sunsite.dk/msg04424.html. Here's also
an unanswered post describing this from 2004:
http://www.mail-archive.com/wget@sunsite.dk/msg06904.html. This 2001
thread talks about this problem, but ends with the note that this is
now broken: http://www.mail-archive.com/wget@sunsite.dk/msg00380.html.
This 2002 thread that refers to version 1.8.1 also might apply:
http://www.mail-archive.com/wget@sunsite.dk/msg03272.html.

I'm using 1.9.1 (Debian sarge). Can anyone tell me the current status
of this, and whether I'm doing anything wrong or if there's an easy
workaround? Thanks.

-Kevin Zembower
Re: regex support RFC
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>> I'm hoping for ... a "raw" type in addition to "file",
>> "domain", etc.
>
> do you mean you would like to have a regex class working on the
> content of downloaded files as well?

Not exactly. (details below)

> i don't like your "raw" proposal as it is HTML-specific. i
> would like instead to develop a mechanism which could work for
> all supported protocols.

I see. It would be problematic for other protocols. :( A raw match
would be more complicated than I originally thought, because it is
HTML-specific and uses extra data which isn't currently available to
the filters. Would it be feasible to make "raw" simply return the full
URI when the document is not HTML?

I think there is some value in matching based on the entire link tag,
instead of just the URI. Wget already has --follow-tags and
--ignore-tags, and a "raw" match would be like an extension to that
concept. I would find it useful to be able to filter according to
things which are not part of the URI. For example:

  follow: article
  skip:   buy now

Either the class property or the visible link text could be used to
decide if the link is worth following, but the URI in this case is
pretty useless.

It may need to be a different option; use "--filter" to filter the URI
list, and use "--filter-tag" earlier in the process (same place as
"--follow-tags"), to help generate the URI list. Regardless, I think it
would be useful. Any thoughts?

-- Scott
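[Editor's note: the "raw" scope Scott describes amounts to running the
filter pattern over the whole anchor tag rather than the extracted URI.
The Python below is an illustration of that idea only; the sample tag
is adapted from the gallery2 example earlier in the thread, and nothing
here is wget code.]

```python
import re

# Hypothetical "raw" scope: match against the entire <a> tag, so the
# class attribute and the visible link text are available to the
# filter, not just the URI.
tag = ('<a href="gallery2.php?g2_controller=cart.AddToCart" '
       'class="gbLink-cart_AddToCart">add to cart</a>')

# The exclusion pattern from Scott's --ignore-regex example:
skip = re.compile(r"AddToCart|add to cart", re.IGNORECASE)

print(bool(skip.search(tag)))  # matched -> do not follow this link
```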
RE: regex support RFC
I agree with Tony. I think most basic users, me included, thought
www-*.yoyodyne.com would not match www.yoyodyne.com.

Support globs as default, regexp as the more powerful option.

Ranjit Sandhu
SRA
Re: Problem with double slashes in URI
"Zembower, Kevin" <[EMAIL PROTECTED]> writes:

> [EMAIL PROTECTED]:/tmp$ wget --timestamping --no-host-directories
> --glob=on --recursive --cut-dirs=4
> 'ftp://xxx:[EMAIL PROTECTED]/%2Fccp1/data/shared/news/motd/qotd.txt'

If you need a double slash, you must spell it explicitly:

  wget [...] ftp://xxx:[EMAIL PROTECTED]/%2F%2Fccp1/data/shared/news/motd/qotd.txt

Substituting for ccp3 something that I can connect to:

  $ wget -S ftp://gnjilux.srk.fer.hr/%2F%2Fccp1/data/shared/news/motd/qotd.txt
  ...
  --> CWD //ccp1/data/shared/news/motd

That's with Wget 1.10.2. It might not work on versions before 1.10.
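[Editor's note: the percent-decoding behind Hrvoje's answer can be
shown directly. The Python below uses urllib purely for illustration;
wget performs the equivalent decoding in C before issuing the FTP CWD.]

```python
from urllib.parse import unquote

# With a single %2F, decoding yields one leading slash, so the server
# gets "CWD /ccp1/..." and never switches volumes:
single = unquote("%2Fccp1/data/shared/news/motd")

# Spelling both slashes explicitly preserves "//ccp1", which is what
# the Netware server needs in order to change to the ccp1 volume:
double = unquote("%2F%2Fccp1/data/shared/news/motd")

print(single)  # /ccp1/data/shared/news/motd
print(double)  # //ccp1/data/shared/news/motd
```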