Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Scott Scriven wrote:

* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:


wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com.

It would most likely also match "www---zyoyodyneXcom".  


yes.


Perhaps you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.


no. i was talking about regexps. they are more expressive and powerful 
than simple globs. i don't see the point in supporting both.



If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[:digit:] or [0-9].


i agree, but adding a dependency on PCRE to wget is asking for 
infinite maintenance nightmares. and i don't know if we can simply 
bundle code from PCRE in wget, as it has a BSD license.



--filter=[+|-][file|path|domain]:REGEXP

is it consistent? is it flawed? is there a more convenient one?


It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a "raw"
type in addition to "file", "domain", etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.


do you mean you would like to have a regex class working on the content 
of downloaded files as well?



Below is the original message I sent to the wget list a few
months ago, about this same topic:

=
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album" class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  "allow" is "+" for "include" or "-" for "exclude".
  It defaults to "+" if omitted.

  "flags," is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, "i"
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of "/foo/i", it would be "--filter=+i,:foo"

  "scope" controls how much of the <a> or similar tag gets used
  as input to the regex.  Values include:
    raw: use the entire tag and all contents (default)
      <a href="http://www.example.com/path/to/foo.ext">bar</a>
    domain: use only the domain name
      www.example.com
    file: use only the file name
      foo.ext
    path: use the directory, but not the file name
      /path/to
    others... can be added as desired

  ":" is required if "allow" or "flags" or "scope" is given

So, for example, to exclude the "add to cart" links in my
previous post, this could be used:

  --filter=-raw:'AddToCart|add to cart'
or
  --filter=-raw:AddToCart\|add\ to\ cart
or
  --filter=-:'AddToCart|add to cart'
or
  --filter=-i,raw:'add ?to ?cart'

Alternately, the --filter option could be split into two options:
one for including content, and one for excluding.  This would be
more consistent with wget's existing parameters, and would
slightly simplify the syntax.

I hope I haven't been too full of hot air.  This is a feature I've
wanted in wget for a long time, and I'm a bit excited that it
might happen soon.  :)


i don't like your "raw" proposal as it is HTML-specific. i would like 
instead to develop a mechanism which could work for all supported 
protocols.

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> Scott Scriven wrote:
>> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>>
>>>wget -r --filter=-domain:www-*.yoyodyne.com
>> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
>> "www---.yoyodyne.com", and so on, if interpreted as a regex.
>
> not really. it would not match www.yoyodyne.com.

Why not?

>> Perhaps you want glob patterns instead?  I know I wouldn't mind
>> having glob patterns in addition to regexes...  glob is much
>> easier when you're not doing complex matches.
>
> no. i was talking about regexps. they are more expressive and
> powerful than simple globs. i don't see the point in
> supporting both.

I agree with this.


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Mauro Tortonesi <[EMAIL PROTECTED]> writes:



Scott Scriven wrote:


* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:



wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com. 


Why not?


i may be wrong, but if - is not a special character, the previous 
expression should match only domains starting with www- and ending in 
[randomchar]yoyodyne[randomchar]com.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Oliver Schulze L. wrote:

Hrvoje Niksic wrote:


 The regexp APIs found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.


My personal idea on this is to: enable regex in Unix and disable it on 
Windows.

We all use Unix/Linux and regex is really useful. I think not having 
regex on Windows will not do any more harm than it is doing now (not 
having it at all).


for consistency and to avoid maintenance problems, i would like wget to 
have the same behavior on windows and unix. please, notice that if we 
implemented regex support only on unix, windows binaries of wget built 
with cygwin would have regex support but native binaries wouldn't. that 
would be very confusing for windows users, IMHO.


I hope wget can get connection cache, 


this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing connection cache data structure.


URL regex 


this is planned for wget 1.11. i've already started working on it.


and advanced mirror functions (sync 2 folders) in the near future.


this is very interesting.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Curtis Hatter wrote:

On Thursday 30 March 2006 13:42, Tony Lewis wrote:


Perhaps --filter=path,i:/path/to/krs would work.


That would look to be the most elegant method. I do hope that the (?i:) and 
(?-i:) constructs are supported since I may not want the entire path/file to 
be case (in)?sensitive =), but that will depend on the regex engine chosen.


while i like the idea of supporting modifiers like "quick" (short 
circuit) and maybe "i" (case insensitive comparison), i think that (?i:) 
and (?-i:) constructs would be overkill and rather hard to implement.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


RE: regex support RFC

2006-03-31 Thread Herold Heiko
> From: Oliver Schulze L. [mailto:[EMAIL PROTECTED]
> My personal idea on this is to: enable regex in Unix and 
> disable it on 
> Windows.
> 
> We all use Unix/Linux and regex is really useful. I think not having 

We all use Unix/Linux? You would be surprised how many wget users on
windows are out there.

Besides that, Those Who Know The Code better than me, please consider how
bad portability issues in using native regexp engines could be.
Are the interfaces and capabilities all the same, or are there consistent
differences between various flavors (gnu, several BSD, hpux, aix, sunos,
solaris, older flavours...)? If so, that would be a point favouring an
external library (hopefully supported on as many flavours as possible).

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

>wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.
>>>
>>> not really. it would not match www.yoyodyne.com.
>> Why not?
>
> i may be wrong, but if - is not a special character, the previous
> expression should match only domains starting with www- and ending
> in [randomchar]yoyodyne[randomchar]com.

"*" matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with the
"www-.*.yoyodyne.com".


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Mauro Tortonesi <[EMAIL PROTECTED]> writes:


wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com.


Why not?


i may be wrong, but if - is not a special character, the previous
expression should match only domains starting with www- and ending
in [randomchar]yoyodyne[randomchar]com.


"*" matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with the
"www-.*.yoyodyne.com".


you're right. ok, it is official. i must stop drinking this much - it 
just doesn't work. i have to start drinking less or, even better, more.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Herold Heiko <[EMAIL PROTECTED]> writes:


Get the best of both: use a syntax permitting a "first match exits"
ACL, where a single ACE permits several statements ANDed together. Cooking
up a simple syntax for users without much regexp experience won't be
easy.


I assume ACL stands for "access control list", but what is ACE?


access control entry, i guess.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Wincent Colaiuta

El 31/03/2006, a las 14:37, Hrvoje Niksic escribió:


"*" matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with the
"www-.*.yoyodyne.com".


Are you sure that "www-*" matches "www"?

As far as I know "www-*" matches "one w, another w, a third w, a
hyphen, then 0 or more hyphens".  In other words, "www" does not match.


Wincent






Re: Bug in ETA code on x64

2006-03-31 Thread Greg Hurrell

El 29/03/2006, a las 14:39, Hrvoje Niksic escribió:



I can't see any good reason to use "," here. Why not write the line
as:
  eta_hrs = eta / 3600; eta %= 3600;


Because that's not equivalent.


Well, it should be, because the comma operator has lower precedence
than the assignment operator (see http://tinyurl.com/evo5a,
http://tinyurl.com/ff4pp and numerous other locations).


Indeed you are right. So:

eta_hrs = eta / 3600, eta %= 3600;

Is equivalent to the following (with explicit parentheses to make the
effect of the precedence obvious):


(eta_hrs = eta / 3600), (eta %= 3600);

Or of course:

eta_hrs = eta / 3600; eta %= 3600;

Greg





Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Wincent Colaiuta <[EMAIL PROTECTED]> writes:

> Are you sure that "www-*" matches "www"?

Yes.

> As far as I know "www-*" matches "one w, another w, a third w, a
> hyphen, then 0 or more hyphens".

That would be "www--*" or "www-+".


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Wincent Colaiuta <[EMAIL PROTECTED]> writes:



Are you sure that "www-*" matches "www"?


Yes.


hrvoje is right. try this perl script:


#!/usr/bin/perl -w

use strict;

my @strings = ("www-.yoyodyne.com",
   "www.yoyodyne.com");

foreach my $str (@strings) {
$str =~ /www-*.yoyodyne.com/ or print "$str doesn't match\n";
}


both the strings match.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Oliver Schulze L.

Mauro Tortonesi wrote:
for consistency and to avoid maintenance problems, i would like wget 
to have the same behavior on windows and unix. please, notice that if 
we implemented regex support only on unix, windows binaries of wget 
built with cygwin would have regex support but native binaries 
wouldn't. that would be very confusing for windows users, IMHO.

Ok, I understand.
I was thinking of an #ifdef in the source code so you can:
- enable all regex code/command line parameters in Unix/Linux
- at runtime, print the error "regex not yet supported on windows" if 
any regex-related command line parameter is passed to wget on 
windows/cygwin

this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing connection cache data structure.

Excellent!

URL regex
this is planned for wget 1.11. i've already started working on it.

looking forward to it, many thanks!

--
Oliver Schulze L.
<[EMAIL PROTECTED]>



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: 

> no. i was talking about regexps. they are more expressive
> and powerful than simple globs. i don't see the
> point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases their
expressions will simply work, which will result in significant confusion
when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat it
as a regular expression. For the vast majority of cases, glob will work just
fine.

One might argue that it's a lot of work to implement regular expressions if
the default input format is a glob, but I think we should aim for both lack
of confusion and robust functionality. Using ",r" means people get regular
expressions when they want them and know what they're doing. The universe of
wget users who "know what they're doing" are mostly subscribed to this
mailing list; the rest of them send us mail saying "please CC me as I'm not
on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from glob
to regular expression should take directory separators into account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony



Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Mauro Tortonesi wrote: 
>
>> no. i was talking about regexps. they are more expressive
>> and powerful than simple globs. i don't see the
>> point in supporting both.
>
> The problem is that users who are expecting globs will try things like
> --filter=-file:*.pdf

The --filter command will be documented from the start to support
regexps.  Since most Unix utilities work with regexps and very few
with globs (excepting the shell), this should not be a problem.

> It is pretty easy to programmatically convert a glob into a regular
> expression.

But it's harder to document and explain, and it requires more options
and logic.  Supporting two different syntaxes (the old one for
backward compatibility) is bad enough: supporting three is at least
one too many.

> One possibility is to make glob the default input and allow regular
> expressions. For example, the following could be equivalent:
>
> --filter:-domain:www-*.yoyodyne.com
> --filter:-domain,r:www-.*\.yoyodyne\.com

But that misses the point, which is that we *want* to make the more
expressive language, already used elsewhere on Unix, the default.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: 

> But that misses the point, which is that we *want* to make the
> more expressive language, already used elsewhere on Unix, the
> default.

I didn't miss the point at all. I'm trying to make a completely different
one, which is that regular expressions will confuse most users (even if you
tell them that the argument to --filter is a regular expression). This
mailing list will get a huge number of bug reports when users try to use
globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere. The
shell is the most obvious comparison for user input dealing with expressions
that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I just
don't think the same thing can be said for the typical wget user. We've
already had disagreements in this chain about what would match a particular
regular expression; I suspect everyone involved in the conversation could
have correctly predicted what the equivalent glob would do.

I don't think ",r" complicates the command that much. Internally, the only
additional work for supporting both globs and regular expressions is a
function that converts a glob into a regexp when ",r" is not requested.
That's a straightforward transformation.

Tony



Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> I didn't miss the point at all. I'm trying to make a completely different
> one, which is that regular expressions will confuse most users (even if you
> tell them that the argument to --filter is a regular expression).

Well, "most users" will probably not use --filter in the first place.
Those that do will have to look at the documentation where they'll
find that it accepts regexps.  Since Wget is hardly the first program
to use regexps, I don't see why most users would be confused by that
choice.

> Yes, regular expressions are used elsewhere on Unix, but not
> everywhere. The shell is the most obvious comparison for user input
> dealing with expressions that select multiple objects; the shell
> uses globs.

I don't see a clear line that connects --filter to glob patterns as
used by the shell.  If anything, the connection is with grep and other
commands that provide powerful filtering (awk and Perl's //
operators), which all seem to work on regexps.  Where the context can
be thought of as shell-like (as in wget ftp://blah/*), Wget happily
obliges by providing shell-compatible patterns.

> I don't think ",r" complicates the command that much. Internally,
> the only additional work for supporting both globs and regular
> expressions is a function that converts a glob into a regexp when
> ",r" is not requested.  That's a straightforward transformation.

",r" makes it harder to input regexps, which are the whole point of
introducing --filter.  Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote:

> I don't see a clear line that connects --filter to glob patterns as used
> by the shell.

I want to list all PDFs in the shell, ls -l *.pdf

I want a filter to keep all PDFs, --filter=+file:*.pdf

Note that "*.pdf" is not a valid regular expression even though it's what
most people will try naturally. Perl complains:
/*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.

Tony



Re: regex support RFC

2006-03-31 Thread Curtis Hatter
On Friday 31 March 2006 06:52, Mauro Tortonesi wrote:
> while i like the idea of supporting modifiers like "quick" (short
> circuit) and maybe "i" (case insensitive comparison), i think that (?i:)
> and (?-i:) constructs would be overkill and rather hard to implement.

I figured that the (?i:) and (?-i:) constructs would be provided by the 
regular expression engine and that the --filter switch would simply be able 
to use any construct provided by that engine.

I was more trying to argue for the use of a regex engine that supports such 
constructs (like Perl's). Some other constructs I find useful are: (?=), 
(?!=), (?<=), (?<!)

These may be overkill, but I would rather have the expressiveness of a regex 
engine like Perl's when I need it instead of writing regexes in another engine 
that have to be twice as long to compensate for the lack of language 
constructs. Those who don't want to use them, or don't know of them, can 
write regexes as normal.

If, as you said, this would be hard to implement or require extra effort by 
you that is above and beyond that required for the more "standard" constructs 
then I would say that they shouldn't be implemented; at least at first.

Curtis


Re: regex support RFC

2006-03-31 Thread TPCnospam
> * [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
> > 
> > soon leads to non wget related links being downloaded, eg. 
> > http://www.gnu.org/graphics/agnuhead.html
> 
> In that particular case, I think --no-parent would solve the
> problem.

No.  The idea is not to be restricted to not descending the tree. 

> 
> Maybe I misunderstood, though.  It seems awfully risky to use -r
> and -H without having something to strictly limit the links
> followed.  So, I suppose the content filter would be an effective
> way to make cross-host downloading safer.

Absolutely.  That is why I proposed a 'contents' regexp.

> 
> I think I'd prefer to have a different option, for that sort of
> thing -- filter by using external programs.  If the program
> returns a specific code, follow the link or recurse into the
> links contained in the file.  Then you could do far more complex
> filtering, including things like interactive pruning.

True.  That could be a future feature request but now that the wget team 
are writing regexp code, it seems an ideal time to implement it.  By 
constructing suitable regexps, one could use this feature to search for 
any string in the html file (as above), or just in metatags etc.  IMHO it 
gives a lot of flexibility for little extra developer programming.

Any comments, Mauro & Hrvoje?

Thanks
Tom Crane

-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Email:  [EMAIL PROTECTED]
Fax:+44 (0) 1784 472794


Problem with double slashes in URI

2006-03-31 Thread Zembower, Kevin
I'm having a problem downloading files from a Novell Netware server.
When I do it manually with FTP, I first 'cd //ccp1' to change servers.
Ncftpget seems to do this, but wget doesn't:

[EMAIL PROTECTED]:/tmp$ ncftpget -u xxx -p yyy ccp3 /tmp/
'//ccp1/data/shared/news/motd/qotd.txt'
/tmp/qotd.txt: 109.00 B 2.54
kB/s  
[EMAIL PROTECTED]:/tmp$ wget --timestamping --no-host-directories --glob=on
--recursive --cut-dirs=4
'ftp://xxx:[EMAIL PROTECTED]/%2Fccp1/data/shared/news/motd/qotd.txt'
--12:46:27--
ftp://isgadmin:[EMAIL PROTECTED]/%2Fccp1/data/shared/news/motd/qotd.txt
   => `motd/.listing'
Resolving ccp3... 162.129.45.72
Connecting to ccp3[162.129.45.72]:21... connected.
Logging in as isgadmin ... Logged in!
==> SYST ... done.==> PWD ... done.
==> TYPE I ... done.  ==> CWD /ccp1/data/shared/news/motd ... 
No such directory `/ccp1/data/shared/news/motd'.

unlink: No such file or directory

FINISHED --12:46:27--
Downloaded: 0 bytes in 0 files
[EMAIL PROTECTED]:/tmp$

I searched this list's archives for 'double slash' to try to solve this
problem. This unanswered post from 2002 is exactly my problem (using
Novell Netware servers):
http://www.mail-archive.com/wget@sunsite.dk/msg04424.html.
Here's also an unanswered post describing this from 2004:
http://www.mail-archive.com/wget@sunsite.dk/msg06904.html.
This 2001 thread talks about this problem, but ends with the note that
this is now broken:
http://www.mail-archive.com/wget@sunsite.dk/msg00380.html. This 2002
thread that refers to version 1.8.1 also might apply:
http://www.mail-archive.com/wget@sunsite.dk/msg03272.html. I'm using
1.9.1 (Debian sarge).

Can anyone tell me the current status of this, and if I'm doing anything
wrong or if there's an easy work around?

Thanks.

-Kevin Zembower


Re: regex support RFC

2006-03-31 Thread Scott Scriven
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>> I'm hoping for ... a "raw" type in addition to "file",
>> "domain", etc.
> 
> do you mean you would like to have a regex class working on the
> content of downloaded files as well?

Not exactly.  (details below)

> i don't like your "raw" proposal as it is HTML-specific. i
> would like instead to develop a mechanism which could work for
> all supported protocols.

I see.  It would be problematic for other protocols.  :(
A raw match would be more complicated than I originally thought,
because it is HTML-specific and uses extra data which isn't
currently available to the filters.

Would it be feasible to make "raw" simply return the full URI
when the document is not HTML?

I think there is some value in matching based on the entire link
tag, instead of just the URI.  Wget already has --follow-tags and
--ignore-tags, and a "raw" match would be like an extension to
that concept.  I would find it useful to be able to filter
according to things which are not part of the URI.  For example:

  follow: article
  skip:   buy now

Either the class property or the visible link text could be used
to decide if the link is worth following, but the URI in this
case is pretty useless.

It may need to be a different option; use "--filter" to filter
the URI list, and use "--filter-tag" earlier in the process (same
place as "--follow-tags"), to help generate the URI list.
Regardless, I think it would be useful.

Any thoughts?


-- Scott


RE: regex support RFC

2006-03-31 Thread Sandhu, Ranjit
I agree with Tony. I think most basic users, me included, thought
www-*.yoyodyne.com would not match www.yoyodyne.com.

Support globs as default, regexp as the more powerful option.

Ranjit Sandhu
SRA
 





Re: Problem with double slashes in URI

2006-03-31 Thread Hrvoje Niksic
"Zembower, Kevin" <[EMAIL PROTECTED]> writes:

> [EMAIL PROTECTED]:/tmp$ wget --timestamping --no-host-directories --glob=on
> --recursive --cut-dirs=4
> 'ftp://xxx:[EMAIL PROTECTED]/%2Fccp1/data/shared/news/motd/qotd.txt'

If you need double slash, you must spell it explicitly:

wget [...] ftp://xxx:[EMAIL PROTECTED]/%2F%2Fccp1/data/shared/news/motd/qotd.txt

Substituting ccp3 for something that I can connect to:

$ wget -S ftp://gnjilux.srk.fer.hr/%2F%2Fccp1/data/shared/news/motd/qotd.txt
...
--> CWD //ccp1/data/shared/news/motd

That's with Wget 1.10.2.  It might not work on versions before 1.10.