Re: Wikipedia page

2005-06-24 Thread Oliver Schulze L.




Good idea Hrvoje.

Oliver

Hrvoje Niksic wrote:

> "Oliver Schulze L." <[EMAIL PROTECTED]> writes:
>
>> I think that having a link to an email address is not that useful,
>> because people can just write to that email address because it's a
>> mailing list.
>
> Good point.  An even better link might be to the gmane archive, where
> you can read the list, but which also allows posting.


-- 
Oliver Schulze L.
<[EMAIL PROTECTED]>




No more Libtool (long)

2005-06-24 Thread Hrvoje Niksic
Thanks to the effort of Mauro Tortonesi and the prior work of Bruno
Haible, Wget has been modified to no longer use Libtool for linking in
external libraries.  If you are interested in why that might be a
cause for celebration, read on.


A bit of history: Libtool was integrated in Wget by Dan Harkless,
despite protests (see http://tinyurl.com/98zkt), to ensure portable
linking to external libraries.  Linking with a system library, such as
librt or libpthread is as easy as using -lLIBNAME.  However, linking
to third-party libraries installed in /usr/local/lib or elsewhere is
harder because: a) you have to find the location of the library, and
b) you have to produce an executable with runtime path information to
find the library when it is run (the system's dynamic linker cannot be
expected to know about non-standard library locations).

The b) part is really tricky because the compiler and linker flags
vary from system to system, and it is hard or impossible to get access
to a large number of different systems to test it on.  For example, on
Linux you would use -Wl,-rpath /usr/local/lib, on Solaris you would
use -R/usr/local/lib, on AIX you might use -Wl,-blibpath
/usr/local/lib:/usr/lib:/lib, and so on.  Of course, the "-Wl," part
also differs between compilers.  And the GNU linker may be used by GCC
on some of the systems, which means you have to use its flags, not the
system ones.  And so on -- you get the idea.

Libtool, normally used for *building* shared libraries, can also be
used to help link them in because it contains code that handles the
above runpath conundrum.  It supports a unified interface where make
can simply use -R/usr/local/lib, and depend on libtool to convert that
to the incantation appropriate for the system linker.  configure.in
was made to detect OpenSSL in this way; however, what started as a
simple use of libtool turned into 200 lines of *hard* configure.in
code.

Despite the apparent improvement over simply not specifying the
runpath, and arguably over trying to duplicate libtool's rpath logic
in configure.in, Libtool brought many painful disadvantages, which I
will proceed to list, in no particular order:

* It made the configure script much larger and slower, exercising
  many weird and unnecessary checks, such as how to run one of ~20
  supported FORTRAN compilers, how to parse output of `nm', how to run
  the C++ preprocessor, where to find `ar', `ranlib', and `strip', how
  to produce PIC, how to tell C++ not to use RTTI and exceptions (!),
  and so on.

* It is unclear what would happen if some of the checks Libtool thinks
  are important (the nm one comes to mind) failed on a platform on
  which the rest of Wget builds just fine.  The experience with a
  Libtool version that caused Wget to fail to build when there was no
  C++ compiler on the platform suggests the worst.

* Such use of Libtool is complete overkill.  While Libtool may be the
  appropriate solution for building shared libraries (although there
  are opposing views to that), it was certainly not designed with this
  use in mind, which is amply proven by the amount of documentation
  devoted to the issue -- none.

* The merge of Wget's configure and libtool was far from clean, simply
  because such use was not envisioned and is therefore not documented.
  It involved digging into Autoconf internals, such as unsetting
  cache-related variables, temporarily changing CC to "$SHELL
  ./libtool $CC" and then reverting it, hackery to LDFLAGS and LIBS,
  and more.

* It was completely specific to OpenSSL's libssl and libcrypto, and
  non-reusable to other external libraries.  Adding a *new* external
  library would have required rethinking the entire scheme, and
  possibly rewriting that very tricky code.

* Libtool created unnecessary cruft, such as the .libs directories,
  and unnecessary restrictions, such as the inability to use `make
  CC=some-other-compiler' without rerunning configure, even though the
  other compiler would work just fine with the Makefile variables
  currently available -- for example, it can be another version of
  gcc.  (This had to do with the "tags" feature of Libtool that the
  documentation didn't explain sufficiently to turn it off.)

* The complex and fragile Libtool code base required frequent updates.
  Some versions of Wget didn't compile on otherwise unexceptional
  operating systems simply because of Libtool bugs.  While it can be
  argued that all software requires updates in one form or another,
  Libtool has required much more hand-holding than other software we
  use to build Wget, for example Autoconf.

* IT DIDN'T WORK, despite all the invested effort.  After Wget 1.10
  was released, this list received reports of OpenSSL libraries not
  being detected on some operating systems, apparently because Libtool
  insisted on creating executables in the .libs directory (where the
  Autoconf test system doesn't find them).  Of course, Libtool doesn't
  do that on Linux, nor on So

Re: wget bug report

2005-06-24 Thread Hrvoje Niksic
<[EMAIL PROTECTED]> writes:

> Sorry for the crosspost, but the wget Web site is a little confusing
> on the point of where to send bug reports/patches.

Sorry about that.  In this case, either address is fine, and we don't
mind the crosspost.

> After taking a look at it, i implemented the following change to
> http.c and tried again. It works for me, but i don't know what other
> implications my change might have.

It's exactly the correct change.  A similar fix has already been
integrated in the CVS (in fact subversion) code base.

Thanks for the report and the patch.


Re: Bug handling session cookies

2005-06-24 Thread Hrvoje Niksic
"Mark Street" <[EMAIL PROTECTED]> writes:

> Many thanks for the explanation and the patch.  Yes, this patch
> successfully resolves the problem for my particular test case.

Thanks for testing it.  It has been applied to the code and will be in
Wget 1.10.1 and later.


Re: Wikipedia page

2005-06-24 Thread Hrvoje Niksic
"Oliver Schulze L." <[EMAIL PROTECTED]> writes:

> I think that having a link to an email address is not that useful,
> because people can just write to that email address because it's a
> mailing list.

Good point.  An even better link might be to the gmane archive, where
you can read the list, but which also allows posting.


Re: Removing thousand separators from file size output

2005-06-24 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote: 
>
>> In fact, I know of no application that accepts numbers as Wget
>> prints them.
>
> Microsoft Calculator does.

Sorry, I forgot to qualify that as "(Unix) command-line application"
or something to that effect.  I know that many GUI applications, such
as Excel, accept numbers in a variety of formats, including (depending
on locale and possibly number format customizations) that one.


Re: Bug handling session cookies

2005-06-24 Thread Mark Street

Hrvoje,

Many thanks for the explanation and the patch.
Yes, this patch successfully resolves the problem for my particular test
case.

Best regards,

Mark Street.




Re: Removing thousand separators from file size output

2005-06-24 Thread Hrvoje Niksic
Leonid <[EMAIL PROTECTED]> writes:

> Those guys who find numbers like 11782023180 easy to read and can
> tell in a fraction of a second that it was 11Gb

I'm not such a person; Wget would in fact print:

Length: 11782023180 (11.0G)
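
As a rough illustration of how a rounded figure such as the "(11.0G)"
above can be derived from the exact byte count, here is a minimal C
sketch.  It assumes units based on powers of 1024 and one decimal
place; the name format_size is hypothetical and this is not Wget's
actual code.

```c
#include <stdio.h>

/* Format SIZE (a byte count) into BUF as a short human-readable
   string.  Assumption: units are powers of 1024 with one decimal.
   Illustrative sketch only, not taken from the Wget sources.  */
void
format_size (unsigned long long size, char *buf, size_t buflen)
{
  static const char units[] = { 'K', 'M', 'G', 'T' };
  double value = (double) size;
  int i = -1;                   /* -1 means "plain bytes" */

  /* Divide by 1024 until the value fits the current unit.  */
  while (value >= 1024.0 && i < 3)
    {
      value /= 1024.0;
      ++i;
    }

  if (i < 0)
    snprintf (buf, buflen, "%llu", size);
  else
    snprintf (buf, buflen, "%.1f%c", value, units[i]);
}
```

Called with 11782023180, the sketch divides three times by 1024,
arrives at roughly 10.97, and rounds that to the "11.0G" shown above.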


Re: Wikipedia page

2005-06-24 Thread Oliver Schulze L.




Added :)

I think that having a link to an email address is not that useful,
because people can just write to that email address because it's a
mailing list.

It's just an idea; if you want, you can revert the changes.

Thanks
Oliver

Hrvoje Niksic wrote:

> "Oliver Schulze L." <[EMAIL PROTECTED]> writes:
>
>> Looks really nice.  Maybe it needs a link to instructions on how to
>> subscribe to the mailing list.
>
> You can always add it.  :-)
>
> But we already have a link to the home page where the information
> resides.  A link to subscription details probably doesn't belong in a
> Wikipedia article.
  


-- 
Oliver Schulze L.
<[EMAIL PROTECTED]>




RE: Removing thousand separators from file size output

2005-06-24 Thread Tony Lewis
Hrvoje Niksic wrote: 

> In fact, I know of no application that accepts numbers as Wget
> prints them.

Microsoft Calculator does.

Tony




RE: Removing thousand separators from file size output

2005-06-24 Thread Leonid

Hrvoje,

> What do you think?

  To add a new (oh!) option in .wgetrc and call it decimal_separator

  Those guys who find numbers like 11782023180 easy
to read and can tell in a fraction of a second that it was
11Gb downloaded, not 1.1Gb, will use

decimal_separator = ""

  I personally would specify

decimal_separator = ","

  Germans may prefer  decimal_separator = "."
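
A minimal C sketch of how such a configurable separator could be
applied when printing a size.  The function name group_digits and its
interface are hypothetical; nothing like it is claimed to exist in
Wget, and the option itself is only a proposal.

```c
#include <stdio.h>
#include <string.h>

/* Write VALUE into OUT with SEP inserted between groups of three
   digits.  SEP may be "" to disable grouping.  OUT must be large
   enough: 64 bytes suffice for a 64-bit value and a one-character
   separator.  Hypothetical helper, for illustration only.  */
void
group_digits (unsigned long long value, const char *sep, char *out)
{
  char digits[32];
  int len = snprintf (digits, sizeof digits, "%llu", value);
  size_t seplen = strlen (sep);
  char *p = out;
  int i;

  for (i = 0; i < len; i++)
    {
      /* A separator goes before every digit whose distance from the
         end is a multiple of three, except before the first digit.  */
      if (i > 0 && (len - i) % 3 == 0)
        {
          memcpy (p, sep, seplen);
          p += seplen;
        }
      *p++ = digits[i];
    }
  *p = '\0';
}
```

With the three settings above, 11782023180 would come out as
11782023180, 11,782,023,180 and 11.782.023.180 respectively.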

Leonid


Re: ChangeLog-branches

2005-06-24 Thread Hrvoje Niksic
Alain Bench <[EMAIL PROTECTED]> writes:

> MHO: They are ununderstandable, unusable, unclean, and big. They may
> give a false bad impression of source/project misorganization. We
> want to drop them, wipe any proof of their existence from any
> archives and mirrors, then honestly deny they ever existed. No need
> to kill witnesses though: Who would believe them?

The pesky subversion software allows their restoration...  I *knew* we
should have stuck to CVS!  :-)

(They're gone.)


Re: Removing thousand separators from file size output

2005-06-24 Thread Hrvoje Niksic
Alain Bench <[EMAIL PROTECTED]> writes:

>  On Thursday, June 23, 2005 at 3:16:28 PM +0200, Hrvoje Niksic wrote:
>
>> Since Wget 1.10 also prints sizes in kilobytes/megabytes/etc., I am
>> thinking of removing the thousand separators from size display.
>
> IMHO thousand (or myriad) separators are necessary.
>
> This size display is primarily intended for humans, not for other
> apps.

Primarily yes -- which is why Wget 1.10 also shows the size in units.
But it is also convenient as input for other applications, which is
very hard with the thousand separators.

> If separators constitute a difficulty for other apps, then it's
> these other apps' problem. Or sed's task (s/,//g).

It's not the other apps' problem.  In many applications (e.g.
programming languages, but also programmable calculators) "," is a
separator between function arguments and cannot be used inside the
number.  In fact, I know of no application that accepts numbers as
Wget prints them.

(sed is not readily available when I paste Wget's output into another
program such as bc or calc.)
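
To make the parsing point concrete, here is a small C sketch showing
what the standard strtoull() does with a grouped number, and the extra
step a consumer has to add.  The helper names parse_plain and
parse_grouped are hypothetical, and "," is assumed to be the only
grouping character.

```c
#include <stdlib.h>

/* Parse S with the standard library alone.  Conversion stops at the
   first character that is not a digit, so a grouped number such as
   "11,782,023,180" yields only 11.  */
unsigned long long
parse_plain (const char *s)
{
  return strtoull (s, NULL, 10);
}

/* Parse S after dropping every ',' character.  This is the extra
   work the consumer must do when the producer prints grouped digits.
   Input longer than the local buffer is silently truncated.  */
unsigned long long
parse_grouped (const char *s)
{
  char digits[32];
  size_t n = 0;

  for (; *s != '\0' && n < sizeof digits - 1; s++)
    if (*s != ',')
      digits[n++] = *s;
  digits[n] = '\0';
  return strtoull (digits, NULL, 10);
}
```

An ungrouped number needs neither helper: it can be handed directly
to strtoull(), or pasted as-is into a calculator.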

> Humans can have the habit of looking at exact unit size, or rounded
> kilo/mega/tera size, or both. It would be a regression to reduce
> readability of legacy exact bytes count,

The way I see it, with the unit sizes present, omitting the thousand
separators merely removes redundancy, not useful information.

More importantly, I know of no other command-line program that prints
sizes with thousand separators the way Wget does, with no way to get
the ordinary parsable numbers.  If the users were so used to
separators, they would surely request them in other programs, such as
`ls', `du', or `df'?

> I don't really understand nor follow your reasons against
> localization. User's cultural preferences should be respected.

You can make a case that the correct character and layout should be
used for digit grouping when it is deployed, but I don't see how you
can argue that grouping *must* be used in all applications!  The
appearance of grouped digits can be and is described by the locale,
but no locale mandates grouping to be used for display of all numbers.


As for localization, I'm not against it.  The argument was that, where
possible, I prefer the output of applications to remain parsable.  For
example, I consider the ISO 8601 date format a clear advantage over
the asctime() format.  The same goes for the display of integers.
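
To illustrate the date comparison, a minimal C sketch that produces an
ISO 8601 timestamp with the standard strftime().  The function name
format_iso8601 is made up for this example, and UTC is assumed.

```c
#include <time.h>

/* Format T as an ISO 8601 UTC timestamp such as 2005-06-24T09:22:38Z.
   Returns the number of characters written, or 0 on failure.
   Hypothetical helper, for illustration only.  */
size_t
format_iso8601 (time_t t, char *buf, size_t buflen)
{
  struct tm *tm = gmtime (&t);

  if (tm == NULL)
    return 0;
  return strftime (buf, buflen, "%Y-%m-%dT%H:%M:%SZ", tm);
}
```

For the same instant, asctime() would yield "Fri Jun 24 09:22:38 2005",
which requires knowledge of English day and month names to parse,
whereas the ISO form consists of fixed-width numeric fields that also
sort correctly as plain strings.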


Re: Removing thousand separators from file size output

2005-06-24 Thread Alain Bench
 On Thursday, June 23, 2005 at 3:16:28 PM +0200, Hrvoje Niksic wrote:

> Since Wget 1.10 also prints sizes in kilobytes/megabytes/etc., I am
> thinking of removing the thousand separators from size display.

IMHO thousand (or myriad) separators are necessary.

This size display is primarily intended for humans, not for other
apps. If separators constitute a difficulty for other apps, then it's
these other apps' problem. Or sed's task (s/,//g).

Humans can have the habit of looking at exact unit size, or rounded
kilo/mega/tera size, or both. It would be a regression to reduce
readability of legacy exact bytes count, just because we have a new
added more human-readable but rounded count.


> The separators are interpunction which introduces clutter, especially
> with complex size output also containing the "remaining" size next to
> the whole size.

True: The more info, the more confusion. But that's the contrary of
a valid reason to reduce readability of those infos. And IMHO removing
thousand separators reduces readability.


> replace the "," character with the character mandated by the locale

This seems naturally desirable. I don't really understand nor follow
your reasons against localization. User's cultural preferences should be
respected.

OTOS this is not so important nor urgent, compared to thousand
separators removal cons.


Bye!    Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a
digest. This often builds incorrect references and breaks threads.


Re: ChangeLog-branches

2005-06-24 Thread Alain Bench
Hello Hrvoje,

 On Thursday, June 23, 2005 at 9:00:44 PM +0200, Hrvoje Niksic wrote:

> the ChangeLog-branches directories distributed with Wget are desirable
> or necessary?

MHO: They are ununderstandable, unusable, unclean, and big. They may
give a false bad impression of source/project misorganization. We want
to drop them, wipe any proof of their existence from any archives and
mirrors, then honestly deny they ever existed. No need to kill witnesses
though: Who would believe them?


Bye!    Alain.
-- 
Microsoft Outlook Express users concerned about readability: For much
better viewing quotes in your messages, check the little freeware
program OE-QuoteFix by Dominik Jain on <http://flash.to/oblivion/>.
It'll change your life. :-) Now exists also for Outlook.


Re: Getting the list of the files to download before downloading them

2005-06-24 Thread Paul Smith
On 6/21/05, Isaac Grover <[EMAIL PROTECTED]> wrote:
> > > I wonder if someone on the list could come up with a sed one-liner?
> > > Or a snippet of perl perhaps.  It should be trivial to take a
> > > directory of html files, extract html tags that bracket each URL that
> > > mention a PDF file, and write a pseudo-HTML file that contains only
> > > the PDF links for wget.
> 
> I don't know sed, and it wouldn't be hard to do in perl I suppose, but this
> is more or less what I use:
> 
> #!/bin/sh
> 
> wget http://www.example.com/links/
> grep "http://" index.html > index.txt
> cat index.txt | awk 'BEGIN { FS="\"" } { print $2 }' > url_list.txt
> 
> Then if you wanted to only grab the PDF files, do:
> 
> grep "\.pdf" url_list.txt > new_url_list.txt
> wget -i new_url_list.txt
> 
> It is just after midnight here, so it may not work exactly as advertised,
> but cut-n-paste usually doesn't lie, so it should work okay.

Thanks, Isaac, but as far as I understand your script, it does not
apply with wget recursion.

Paul


Re: Bug handling session cookies

2005-06-24 Thread Hrvoje Niksic
"Mark Street" <[EMAIL PROTECTED]> writes:

> I'm not sure why this [catering for paths without a leading /] is
> done in the code.

rfc1808 declared that the leading / is not really part of path, but
merely a "separator", presumably to be consistent with its treatment
of ;params, ?queries, and #fragments.  The author of the code found it
appealing to disregard common sense and implement rfc1808 semantics.

In most cases the user shouldn't notice the difference, but it has
led to all kinds of implementation problems with code that assumes
that URL paths naturally begin with /.  Because of that it will be
changed later.

> Note that the forward slash is stripped from "prefix", hence never
> matches "full_path".  I'm not sure why this is done in the code.

Because PREFIX is the path declared by the cookie, which always begins
with /, and FULL_PATH is the URL path coming from the URL parsing
code, which doesn't begin with a /.  To match them, one must indeed
strip the leading / off PREFIX.

But paths without a slash still caused subtle problems.  For example,
cookies without a path attribute still had to be stored with the
correct cookie-path (with a leading slash).  To account for this, the
invocation of cookie_handle_set_cookie was modified to prepend the /
before the path.  This led to path_match unexpectedly receiving two
/-prefixed paths and being unable to match them.
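
The mismatch can be reproduced in isolation with a small C sketch.
The two functions below are simplified, hypothetical stand-ins (named
match_stripping_slash and match_both_slashed for this example), not the
actual Wget code; in particular they do a bare prefix comparison and
ignore path-segment boundaries.  The sample paths are the ones from the
bug report.

```c
#include <string.h>

/* Stand-in for the matching logic described above: strip the leading
   '/' from PREFIX, then compare it against a FULL_PATH that is
   expected to have no leading '/'.  Simplified for illustration.  */
int
match_stripping_slash (const char *full_path, const char *prefix)
{
  if (*prefix != '/')
    return 0;
  ++prefix;
  return strncmp (full_path, prefix, strlen (prefix)) == 0;
}

/* Variant that assumes both arguments already begin with '/', so no
   special casing is needed.  */
int
match_both_slashed (const char *full_path, const char *prefix)
{
  return strncmp (full_path, prefix, strlen (prefix)) == 0;
}
```

With full_path "testdb-bin/login" and prefix "/testdb-bin" the first
function matches.  Once full_path arrives as "/testdb-bin/login", the
first function compares '/' against 't' and fails, while the second
one matches.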

The attached patch fixes the problem by:

* Making sure that path consistently gets prepended in all entry
  points to cookie code;

* Removing the special logic from path_match.

With that change your test case seems to work, and so do all the other
tests I could think of.

Please let me know if it works for you, and thanks for the detailed
bug report.


2005-06-24  Hrvoje Niksic  <[EMAIL PROTECTED]>

* http.c (gethttp): Don't prepend / here.

* cookies.c (cookie_handle_set_cookie): Prepend / to PATH.
(cookie_header): Ditto.

Index: src/http.c
===
--- src/http.c  (revision 1794)
+++ src/http.c  (working copy)
@@ -1706,7 +1706,6 @@
   /* Handle (possibly multiple instances of) the Set-Cookie header. */
   if (opt.cookies)
 {
-  char *pth = NULL;
   int scpos;
   const char *scbeg, *scend;
   /* The jar should have been created by now. */
@@ -1717,15 +1716,8 @@
   ++scpos)
{
  char *set_cookie; BOUNDED_TO_ALLOCA (scbeg, scend, set_cookie);
- if (pth == NULL)
-   {
- /* u->path doesn't begin with /, which cookies.c expects. */
- pth = (char *) alloca (1 + strlen (u->path) + 1);
- pth[0] = '/';
- strcpy (pth + 1, u->path);
-   }
- cookie_handle_set_cookie (wget_cookie_jar, u->host, u->port, pth,
-   set_cookie);
+ cookie_handle_set_cookie (wget_cookie_jar, u->host, u->port,
+   u->path, set_cookie);
}
 }
 
Index: src/cookies.c
===
--- src/cookies.c   (revision 1794)
+++ src/cookies.c   (working copy)
@@ -822,6 +822,17 @@
 {
   return path_matches (path, cookie_path) != 0;
 }
+
+/* Prepend '/' to string S.  S is copied to fresh stack-allocated
+   space and its value is modified to point to the new location.  */
+
+#define PREPEND_SLASH(s) do {  \
+  char *PS_newstr = (char *) alloca (1 + strlen (s) + 1);  \
+  *PS_newstr = '/';\
+  strcpy (PS_newstr + 1, s);   \
+  s = PS_newstr;   \
+} while (0)
+
 
 /* Process the HTTP `Set-Cookie' header.  This results in storing the
cookie or discarding a matching one, or ignoring it completely, all
@@ -835,6 +846,11 @@
   struct cookie *cookie;
   cookies_now = time (NULL);
 
+  /* Wget's paths don't begin with '/' (blame rfc1808), but cookie
+ usage assumes /-prefixed paths.  Until the rest of Wget is fixed,
+ simply prepend slash to PATH.  */
+  PREPEND_SLASH (path);
+
   cookie = parse_set_cookies (set_cookie, update_cookie_field, false);
   if (!cookie)
 goto out;
@@ -977,17 +993,8 @@
 static int
 path_matches (const char *full_path, const char *prefix)
 {
-  int len;
+  int len = strlen (prefix);
 
-  if (*prefix != '/')
-/* Wget's HTTP paths do not begin with '/' (the URL code treats it
-   as a mere separator, inspired by rfc1808), but the '/' is
-   assumed when matching against the cookie stuff.  */
-return 0;
-
-  ++prefix;
-  len = strlen (prefix);
-
   if (0 != strncmp (full_path, prefix, len))
 /* FULL_PATH doesn't begin with PREFIX. */
 return 0;
@@ -1149,6 +1156,7 @@
   int count, i, ocnt;
   char *result;
   int result_size, pos;
+  PREPEND_SLASH (path);/* see cookie_handle_set_cookie */
 
   /* First, find the cooki

Bug handling session cookies

2005-06-24 Thread Mark Street
Hello folks,

I'm running wget v1.10 compiled from source (tested on HP-UX and Linux).

I am having problems handling session cookies.  The idea is to request a
web page which returns an ID number in a session cookie.  All subsequent
requests from the site must contain this session cookie.

I'm using a command line as follows:
wget --no-proxy --save-cookies cookies.txt --keep-session-cookies
http://ttms:9900/testdb-bin/login -O -

The headers returned from the webserver are as follows:

---request begin---
GET /testdb-bin/login HTTP/1.0
User-Agent: Wget/1.10
Accept: */*
Host: ttms:9900
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Fri, 24 Jun 2005 09:22:38 GMT
Server: Apache/2.0.51 (Unix) PHP/4.3.3
Set-Cookie: SessionID=1119604958; path=/testdb-bin
Connection: close
Content-Type: text/html; charset=ISO-8859-1

---response end---


However, the cookie.txt file is empty...

$ cat cookie.txt

# HTTP cookie file.
# Generated by Wget on 2005-06-24 10:22:38.
# Edit at your own risk.

$

I've looked at the source code, in cookies.c
I've added debug to print out the contents of full_path and prefix in the
path_matches() function.  The output is as follows:

path_matches() full_path: /testdb-bin/login, prefix: /testdb-bin  [ on
function entry, i.e. before ++prefix statement ]
path_matches() calling strncmp("/testdb-bin/login", "testdb-bin", 10) = -69

Note that the forward slash is stripped from "prefix", hence never matches
"full_path".
I'm not sure why this is done in the code.

Is there a problem here?  Or am I doing something wrong?
The path returned in the cookie from the webserver seems valid.  It's
generated by the Perl CGI module cookie method and seems consistent with
the CGI man page.

For now, I've hacked the path_matches() function to ensure that the slash
prefixes are always consistent...

  /* MNS hack for fixing cookie leading slashes */
  if (*prefix == '/' && *full_path != '/')
prefix++;
  if (*prefix != '/' && *full_path == '/')
full_path++;
  /* MNS end of hack */

  // ++prefix;  MNS was original code

If I try the same test with something like www.google.com, the cookie file
gets created successfully - although this isn't a session cookie, of course.

Cheers,

Mark.