RE: wget re-download fully downloaded files

2008-10-27 Thread Tony Lewis
Micah Cowan wrote:

> Actually, I'll have to confirm this, but I think that current Wget will
> re-download it, but not overwrite the existing content until it arrives
> at bytes beyond what is already on disk.
>
> I need to investigate further to see if this change was somehow
> intentional (though I can't imagine what the reasoning would be); if I
> don't find a good reason not to, I'll revert this behavior.

One reason to keep the current behavior is to retain all of the existing
content in the event of another partial download that is shorter than the
previous one. However, I think that only makes sense if wget is comparing
the new content with what is already on disk.

Tony




Re: wget re-download fully downloaded files

2008-10-27 Thread Micah Cowan

Maksim Ivanov wrote:
> I'm trying to download the same file from the same server; the command
> line I use is:
> wget --debug -o log -c -t 0 --load-cookies=cookie_file
> http://rapidshare.com/files/153131390/Blind-Test.rar
> 
> Attached below are two files: a log from 1.9.1 and a log from 1.10.2.
> Both logs were made when Blind-Test.rar was already on my HDD.
> Sorry for some "mess" in the logs; my console uses Russian.

This is currently being tracked at https://savannah.gnu.org/bugs/?24662

A similar and related bug report is at
https://savannah.gnu.org/bugs/?24642 in which the logs show that
rapidshare.com also issues erroneous Content-Range information
when it responds with a 206 Partial Content, which exercised a different
"regression"* introduced in 1.11.x.

* It's not really a regression, since the new behavior is desirable: we now
determine the size of the content from the Content-Range header, since
Content-Length is often missing or erroneous for partial content.
However, in this instance of server error, it resulted in less desirable
behavior than the previous version of Wget. Anyway...
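
For illustration (all numbers here are invented), the total size is the
final field of the Content-Range header in a 206 response, and it is that
field which 1.11.x now trusts over Content-Length:

    HTTP/1.1 206 Partial Content
    Content-Range: bytes 1000-152999/153000   <- total size: 153000
    Content-Length: 152000                    <- bytes in this response only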

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: wget re-download fully downloaded files

2008-10-27 Thread Micah Cowan

Maksim Ivanov wrote:
> I'm trying to download the same file from the same server; the command
> line I use is:
> wget --debug -o log -c -t 0 --load-cookies=cookie_file
> http://rapidshare.com/files/153131390/Blind-Test.rar
> 
> Attached below are two files: a log from 1.9.1 and a log from 1.10.2.
> Both logs were made when Blind-Test.rar was already on my HDD.
> Sorry for some "mess" in the logs; my console uses Russian.

Thanks very much for providing these, Maksim; they were very helpful.
(Sorry for getting back to you so late: it's been busy lately).

I've confirmed this behavioral difference (though I compared the current
development sources against 1.8.2, rather than comparing 1.10.2 to 1.9.1).
Your logs involve a 302 redirection before arriving at the real file, but
that's just a red herring.

The difference is that when 1.9.1 encountered a server that responded to a
byte-range request with "200" (meaning it doesn't know how to send partial
contents), but with a Content-Length value matching the size of the local
file, wget would close the connection and not proceed to re-download.
1.10.2, on the other hand, would just re-download the file.
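
To make the scenario concrete (the URL stands in for any server that
ignores Range requests), this sketch assumes the file has already been
downloaded completely:

    wget -c http://example.com/Blind-Test.rar   # re-run after completion
    # server answers "200 OK" with Content-Length == local file size:
    #   1.9.1  -> closes the connection; nothing is re-fetched
    #   1.10.2 -> re-downloads the whole file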

Actually, I'll have to confirm this, but I think that current Wget will
re-download it, but not overwrite the existing content until it arrives
at bytes beyond what is already on disk.

I need to investigate further to see if this change was somehow
intentional (though I can't imagine what the reasoning would be); if I
don't find a good reason not to, I'll revert this behavior. Probably for
the 1.12 release, but I might possibly punt it to 1.13 on the grounds
that it's not a recent regression (however, it should really be a quick
fix, so most likely it'll be in for 1.12).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: wget re-download fully downloaded files

2008-10-13 Thread Maksim Ivanov
I'm trying to download the same file from the same server; the command line
I use is:
wget --debug -o log -c -t 0 --load-cookies=cookie_file
http://rapidshare.com/files/153131390/Blind-Test.rar

Attached below are two files: a log from 1.9.1 and a log from 1.10.2.
Both logs were made when Blind-Test.rar was already on my HDD.
Sorry for some "mess" in the logs; my console uses Russian.

Yours faithfully, Maksim Ivanov



2008/10/13 Micah Cowan <[EMAIL PROTECTED]>

>
> Maksim Ivanov wrote:
> > Hello!
> >
> > Starting with version 1.10, wget has a very annoying bug: if you try to
> > download an already fully downloaded file, wget begins downloading it
> > all over again, but 1.9.1 says "Nothing to do", as it should.
>
> It all depends on what options you specify. That's as true for 1.9 as it
> is for 1.10 (or the current release 1.11.4).
>
> It can also depend on the server; not all of them support timestamping
> or partial fetches.
>
> Please post the minimal log that exhibits the problem you're experiencing.
>
> --
> Thanks,
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> GNU Maintainer: wget, screen, teseq
> http://micah.cowan.name/
>


log.1.9.1
Description: Binary data


log.1.10.2
Description: Binary data



Re: wget re-download fully downloaded files

2008-10-12 Thread Micah Cowan

Maksim Ivanov wrote:
> Hello!
> 
Starting with version 1.10, wget has a very annoying bug: if you try to
download an already fully downloaded file, wget begins downloading it all
over again, but 1.9.1 says "Nothing to do", as it should.

It all depends on what options you specify. That's as true for 1.9 as it
is for 1.10 (or the current release 1.11.4).

It can also depend on the server; not all of them support timestamping
or partial fetches.

Please post the minimal log that exhibits the problem you're experiencing.

--
Thanks,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-10 Thread Tony Godshall
And you'll probably have to do this again -- I bet
yahoo expires the session cookies!


On Tue, Sep 9, 2008 at 2:18 PM, Donald Allen <[EMAIL PROTECTED]> wrote:
> After surprisingly little struggle, I got Plan B working -- logged into
> yahoo with wget, saved the cookies, including session cookies, and then
> proceeded to fetch pages using the saved cookies. Those pages came back
> logged in as me, with my customizations. Thanks to Tony, Daniel, and Micah
> -- you all provided critical advice in solving this problem.
>
> /Don
>
> [...]


-- 
Best Regards.
Please keep in touch.
This is unedited.
P-)


Re: Wget and Yahoo login?

2008-09-09 Thread Donald Allen
After surprisingly little struggle, I got Plan B working -- logged into
yahoo with wget, saved the cookies, including session cookies, and then
proceeded to fetch pages using the saved cookies. Those pages came back
logged in as me, with my customizations. Thanks to Tony, Daniel, and Micah
-- you all provided critical advice in solving this problem.
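
A minimal sketch of that sequence (the login URL and form-field names here
are made up -- read the real ones out of the login page's HTML):

    # log in with wget itself, keeping session cookies:
    wget --save-cookies=cookies.txt --keep-session-cookies \
         --post-data='login=USER&passwd=PASS' \
         -O /dev/null 'https://login.example.com/login'
    # later fetches present the saved cookies, session cookies included:
    wget --load-cookies=cookies.txt -O page.html 'http://<url>'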

/Don

On Tue, Sep 9, 2008 at 2:21 PM, Donald Allen <[EMAIL PROTECTED]> wrote:

> [...]


Re: Wget and Yahoo login?

2008-09-09 Thread Donald Allen
On Tue, Sep 9, 2008 at 1:51 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:

>
> Donald Allen wrote:
> >
> >
> > On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:
> >
> > Donald Allen wrote:
> >>> I am doing the yahoo session login with firefox, not with wget,
> > so I'm
> >>> using the first and easier of your two suggested methods. I'm
> > guessing
> >>> you are thinking that I'm trying to login to the yahoo session with
> >>> wget, and thus --keep-session-cookies and
> > --save-cookies=<file> would
> >>> make perfect sense to me, but that's not what I'm doing (yet --
> > if I'm
> >>> right about what's happening here, I'm going to have to resort to
> > this).
> >>> But using firefox to initiate the session, it looks to me like wget
> >>> never gets to see the session cookies because I don't think firefox
> >>> writes them to its cookie file (which actually makes sense -- if they
> >>> only need to live as long as the session, why write them out?).
> >
> > Yes, and I understood this; the thing is, that if session cookies are
> > involved (i.e., cookies that are marked for immediate expiration and are
> > not meant to be saved to the cookies file), then I don't see how you
> > have much choice other than to use the "harder" method, or else to fake
> > the session cookies by manually inserting them to your cookies file or
> > whatnot (not sure how well that may be expected to work). Or, yeah, add
> > an explicit --header 'Cookie: ...'.
> >
> >
> >> Ah, the misunderstanding was that the stuff you thought I missed was
> >> intended to push me in the direction of Plan B -- log in to yahoo with
> >> wget.
>
> Yes; and that's entirely my fault, as I didn't explicitly say that.


No problem.

>
>
> >> I understand now. I'll look at trying to make this work. Thanks
> >> for all the help, though I can't guarantee that you are done yet :-)
> >> But, hopefully, this exchange will benefit others.
>
> I was actually surprised you kept going after I pointed out that it
> required the Accept-Encoding header that results in gzipped content.


That didn't faze me because the pages I'm after will be processed by a
python program, so having to gunzip would not require a manual step.

>
> This behavior is a little surprising to me from Yahoo!. It's not
> surprising in _general_, but for a site that really wants to be as
> accessible as possible (I would think?), insisting on "the latest"
> browsers seems ill-advised.
>
> Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape,
> visit a site, and get a server-generated page that's empty other than
> the phrase "You're not using Internet Explorer." :p


And taking it one step further, I'm greatly enjoying watching Microsoft
thrash around, trying to save themselves, which I don't think they will.
Perhaps they'll re-invent themselves, as IBM did, but their cash cow is not
going to produce milk too much longer. I've just installed the Chrome beta
on the Windows side of one of my machines (I grudgingly give it 10 Gb on
each machine; Linux gets the rest), and it looks very, very nice. They've
still got work to do, but they appear to be heading in a very good
direction. These are smart people at Google. All signs seem to be pointing
towards more and more computing happening on the server side in the coming
years.

/Don


>
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> GNU Maintainer: wget, screen, teseq
> http://micah.cowan.name/
>


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
> 
> 
> On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:
> 
> Donald Allen wrote:
>>> I am doing the yahoo session login with firefox, not with wget,
> so I'm
>>> using the first and easier of your two suggested methods. I'm
> guessing
>>> you are thinking that I'm trying to login to the yahoo session with
>>> wget, and thus --keep-session-cookies and
> --save-cookies=<file> would
>>> make perfect sense to me, but that's not what I'm doing (yet --
> if I'm
>>> right about what's happening here, I'm going to have to resort to
> this).
>>> But using firefox to initiate the session, it looks to me like wget
>>> never gets to see the session cookies because I don't think firefox
>>> writes them to its cookie file (which actually makes sense -- if they
>>> only need to live as long as the session, why write them out?).
> 
> Yes, and I understood this; the thing is, that if session cookies are
> involved (i.e., cookies that are marked for immediate expiration and are
> not meant to be saved to the cookies file), then I don't see how you
> have much choice other than to use the "harder" method, or else to fake
> the session cookies by manually inserting them to your cookies file or
> whatnot (not sure how well that may be expected to work). Or, yeah, add
> an explicit --header 'Cookie: ...'.
> 
> 
>> Ah, the misunderstanding was that the stuff you thought I missed was
>> intended to push me in the direction of Plan B -- log in to yahoo with
>> wget.

Yes; and that's entirely my fault, as I didn't explicitly say that.

>> I understand now. I'll look at trying to make this work. Thanks
>> for all the help, though I can't guarantee that you are done yet :-)
>> But, hopefully, this exchange will benefit others.

I was actually surprised you kept going after I pointed out that it
required the Accept-Encoding header that results in gzipped content.
This behavior is a little surprising to me from Yahoo!. It's not
surprising in _general_, but for a site that really wants to be as
accessible as possible (I would think?), insisting on "the latest"
browsers seems ill-advised.

Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape,
visit a site, and get a server-generated page that's empty other than
the phrase "You're not using Internet Explorer." :p

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Donald Allen
On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:

>
> Donald Allen wrote:
> >> I am doing the yahoo session login with firefox, not with wget, so I'm
> >> using the first and easier of your two suggested methods. I'm guessing
> >> you are thinking that I'm trying to login to the yahoo session with
> >> wget, and thus --keep-session-cookies and --save-cookies=<file> would
> >> make perfect sense to me, but that's not what I'm doing (yet -- if I'm
> >> right about what's happening here, I'm going to have to resort to this).
> >> But using firefox to initiate the session, it looks to me like wget
> >> never gets to see the session cookies because I don't think firefox
> >> writes them to its cookie file (which actually makes sense -- if they
> >> only need to live as long as the session, why write them out?).
>
> Yes, and I understood this; the thing is, that if session cookies are
> involved (i.e., cookies that are marked for immediate expiration and are
> not meant to be saved to the cookies file), then I don't see how you
> have much choice other than to use the "harder" method, or else to fake
> the session cookies by manually inserting them to your cookies file or
> whatnot (not sure how well that may be expected to work). Or, yeah, add
> an explicit --header 'Cookie: ...'.


Ah, the misunderstanding was that the stuff you thought I missed was
intended to push me in the direction of Plan B -- log in to yahoo with wget.
I understand now. I'll look at trying to make this work. Thanks for all the
help, though I can't guarantee that you are done yet :-) But, hopefully,
this exchange will benefit others.

/Don


>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> GNU Maintainer: wget, screen, teseq
> http://micah.cowan.name/
>


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
>> I am doing the yahoo session login with firefox, not with wget, so I'm
>> using the first and easier of your two suggested methods. I'm guessing
>> you are thinking that I'm trying to login to the yahoo session with
>> wget, and thus --keep-session-cookies and --save-cookies=<file> would
>> make perfect sense to me, but that's not what I'm doing (yet -- if I'm
>> right about what's happening here, I'm going to have to resort to this).
>> But using firefox to initiate the session, it looks to me like wget
>> never gets to see the session cookies because I don't think firefox
>> writes them to its cookie file (which actually makes sense -- if they
>> only need to live as long as the session, why write them out?).

Yes, and I understood this; the thing is, that if session cookies are
involved (i.e., cookies that are marked for immediate expiration and are
not meant to be saved to the cookies file), then I don't see how you
have much choice other than to use the "harder" method, or else to fake
the session cookies by manually inserting them to your cookies file or
whatnot (not sure how well that may be expected to work). Or, yeah, add
an explicit --header 'Cookie: ...'.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Donald Allen
On Tue, Sep 9, 2008 at 1:29 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:

>
> Donald Allen wrote:
> > The result of this test, just to be clear, was a page that indicated
> > yahoo thought I was not logged in. Those extra items firefox is sending
> > appear to be the difference: when I included them (from the
> > livehttpheaders output) while sending the cookies manually with
> > --header, I got back a page indicating that yahoo knew I was logged
> > in, formatted with my preferences.
>
> Perhaps you missed this in my last message:
>
> >> Probably there are session cookies involved, that are sent in the first
> >> page, that you're not sending back with the form submit.
> >> --keep-session-cookies and --save-cookies=<file> make a good
> >> combination.
>

I think we're mis-communicating, easily my fault, since I know just enough
about this stuff to be dangerous.

I am doing the yahoo session login with firefox, not with wget, so I'm using
the first and easier of your two suggested methods. I'm guessing you are
thinking that I'm trying to log in to the yahoo session with wget, and thus
--keep-session-cookies and --save-cookies=<file> would make perfect sense
to me, but that's not what I'm doing (yet -- if I'm right about what's
happening here, I'm going to have to resort to this). But using firefox to
initiate the session, it looks to me like wget never gets to see the session
cookies because I don't think firefox writes them to its cookie file (which
actually makes sense -- if they only need to live as long as the session,
why write them out?).

/Don



>
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> GNU Maintainer: wget, screen, teseq
> http://micah.cowan.name/
>


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
> The result of this test, just to be clear, was a page that indicated
> yahoo thought I was not logged in. Those extra items firefox is sending
> appear to be the difference: when I included them (from the
> livehttpheaders output) while sending the cookies manually with
> --header, I got back a page indicating that yahoo knew I was logged in,
> formatted with my preferences.

Perhaps you missed this in my last message:

>> Probably there are session cookies involved, that are sent in the first
>> page, that you're not sending back with the form submit.
>> --keep-session-cookies and --save-cookies=<file> make a good
>> combination.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Donald Allen
On Tue, Sep 9, 2008 at 12:23 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:

>
> Donald Allen wrote:
> > [...]
> >
> > I have not been able to retrieve a page with wget as if I were logged
> > in using --load-cookies and Micah's suggestion about 'Accept-Encoding'
> > (there was a typo in his message -- it's 'Accept-Encoding', not
> > 'Accept-Encodings'). I did install livehttpheaders and tried
> > --no-cookies and --header <cookie header> and that
> > did work.
>
> That's how I did it as well (except I got the headers from tcpdump); I'm
> using Firefox 3, so don't have access to FF's new sqlite-based cookies
> file (apart from the patch at
>
> http://wget.addictivecode.org/FrontPage?action=AttachFile&do=view&target=wget-firefox3-cookie.patch
> ).
>
> > Some of the cookie info sent by Firefox was a mystery,
> > because it's not in the cookie file. Perhaps that's the crucial
> > difference -- I'm speculating that wget isn't sending quite the same
> > thing as Firefox when --load-cookies is used, because Firefox is
> > adding stuff that isn't in the cookie file. Just a guess.
>
> Probably there are session cookies involved, that are sent in the first
> page, that you're not sending back with the form submit.
> --keep-session-cookies and --save-cookies=<file> make a good
> combination.
>
> > Is there a
> > way to ask wget to print the headers it sends (ala livehttpheaders)?
> > I've looked through the options on the man page and didn't see
> > anything, though I might have missed it.
>
> --debug


Well, I rebuilt my wget with the 'debug' use flag and ran it on the yahoo
test page (after having logged in to yahoo with firefox, of course) with
--load-cookies and the accept-encoding header item, with --debug. Very
useful. wget is sending every cookie item in firefox's cookies.txt. But
firefox sends three additional cookie items in the header that wget does not
send. Those items are *not* in firefox's cookies.txt so wget has no way of
knowing about them. Is it possible that firefox is not writing session
cookies to the file?

The result of this test, just to be clear, was a page that indicated yahoo
thought I was not logged in. Those extra items firefox is sending appear to
be the difference: when I included them (from the livehttpheaders output)
while sending the cookies manually with --header, I got back a page
indicating that yahoo knew I was logged in, formatted with my preferences.
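
In command form, the variant that worked looks roughly like this (the
cookie names and values are placeholders for whatever livehttpheaders
shows):

    wget --no-cookies \
         --header='Cookie: name1=value1; name2=value2; name3=value3' \
         --header='Accept-Encoding: gzip' \
         -O page.html 'http://<url>'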

/Don



>
>
> --
> HTH,
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> GNU Maintainer: wget, screen, teseq
> http://micah.cowan.name/
>


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
> [...]
> 
> I have not been able to retrieve a page with wget as if I were logged
> in using --load-cookies and Micah's suggestion about 'Accept-Encoding'
> (there was a typo in his message -- it's 'Accept-Encoding', not
> 'Accept-Encodings'). I did install livehttpheaders and tried
> --no-cookies and --header <cookie header> and that
> did work.

That's how I did it as well (except I got the headers from tcpdump); I'm
using Firefox 3, so don't have access to FF's new sqlite-based cookies
file (apart from the patch at
http://wget.addictivecode.org/FrontPage?action=AttachFile&do=view&target=wget-firefox3-cookie.patch).

> Some of the cookie info sent by Firefox was a mystery,
> because it's not in the cookie file. Perhaps that's the crucial
> difference -- I'm speculating that wget isn't sending quite the same
> thing as Firefox when --load-cookies is used, because Firefox is
> adding stuff that isn't in the cookie file. Just a guess.

Probably there are session cookies involved, that are sent in the first
page, that you're not sending back with the form submit.
--keep-session-cookies and --save-cookies=<file> make a good
combination.

> Is there a
> way to ask wget to print the headers it sends (ala livehttpheaders)?
> I've looked through the options on the man page and didn't see
> anything, though I might have missed it.

--debug

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Donald Allen
On Tue, Sep 9, 2008 at 3:14 AM, Daniel Stenberg <[EMAIL PROTECTED]> wrote:
> On Mon, 8 Sep 2008, Donald Allen wrote:
>
>> The page I get is what would be obtained if an un-logged-in user went to
>> the specified url. Opening that same url in Firefox *does* correctly
>> indicate that it is logged in as me and reflects my customizations.
>
> First, LiveHTTPHeaders is the Firefox plugin that everyone who tries these
> stunts needs. Then you read the capture and replay the requests as closely
> as possible using your tool.
>
> As you will find out, sites like this use all sorts of funny tricks to
> figure out who you are and to make it hard to automate what you're trying
> to do. They tend to use JavaScript for redirects and for fiddling with
> cookies just to make sure you have a JavaScript- and cookie-enabled
> browser. So you need to work hard(er) when trying this with non-browsers.
>
> It's certainly still possible, even without using the browser to get the
> first cookie file. But it may take some effort.

I have not been able to retrieve a page with wget as if I were logged
in using --load-cookies and Micah's suggestion about 'Accept-Encoding'
(there was a typo in his message -- it's 'Accept-Encoding', not
'Accept-Encodings'). I did install livehttpheaders and tried
--no-cookies and --header <cookie header> and that
did work. Some of the cookie info sent by Firefox was a mystery,
because it's not in the cookie file. Perhaps that's the crucial
difference -- I'm speculating that wget isn't sending quite the same
thing as Firefox when --load-cookies is used, because Firefox is
adding stuff that isn't in the cookie file. Just a guess. Is there a
way to ask wget to print the headers it sends (ala livehttpheaders)?
I've looked through the options on the man page and didn't see
anything, though I might have missed it.

>
> --
>
>  / daniel.haxx.se
>


Re: Wget and Yahoo login?

2008-09-09 Thread Daniel Stenberg

On Mon, 8 Sep 2008, Donald Allen wrote:

> The page I get is what would be obtained if an un-logged-in user went to
> the specified url. Opening that same url in Firefox *does* correctly
> indicate that it is logged in as me and reflects my customizations.


First, LiveHTTPHeaders is the Firefox plugin that everyone who tries these
stunts needs. Then you read the capture and replay the requests as closely
as possible using your tool.


As you will find out, sites like this use all sorts of funny tricks to
figure out who you are and to make it hard to automate what you're trying to
do. They tend to use JavaScript for redirects and for fiddling with cookies
just to make sure you have a JavaScript- and cookie-enabled browser. So you
need to work hard(er) when trying this with non-browsers.


It's certainly still possible, even without using the browser to get the first 
cookie file. But it may take some effort.


--

 / daniel.haxx.se


Re: Wget and Yahoo login?

2008-09-08 Thread Micah Cowan

Donald Allen wrote:
> There was a recent discussion concerning using wget to obtain pages
> from yahoo logged into yahoo as a particular user. Micah replied to
> Rick Nakroshis with instructions describing two methods for doing
> this. This information has also been added by Micah to the wiki.
> 
> I just tried the simpler of the two methods -- logging into yahoo with
> my browser (Firefox 2.0.0.16) and then downloading a page with
> 
> wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies <home directory>/.mozilla/firefox/id2dmo7r.default/cookies.txt
> 'http://<url>'
> 
> The page I get is what would be obtained if an un-logged-in user went
> to the specified url. Opening that same url in Firefox *does*
> correctly indicate that it is logged in as me and reflects my
> customizations.

Are you signing into the main Yahoo! site?

When I try to do so, whether I use the cookies or not, I get a message
about "update your browser to something more modern" or the like. The
difference appears to be a combination of _both_ User-Agent (as you've
done), _and_ --header "Accept-Encodings: gzip,deflate". This plus
appropriate cookies gets me a decent logged-in page, but of course it's
gzip-compressed.

Since Wget doesn't currently support gzip-decoding and the like, that
makes the use of Wget in this situation cumbersome. Support for
something like this probably won't be seen until 1.13 or 1.14, I'm afraid.
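
In the meantime, one workaround sketch (the header, UA string, and URL here
are illustrative) is to accept the gzipped page and decompress it by hand:

    wget --load-cookies=cookies.txt \
         --header='Accept-Encoding: gzip' \
         -U 'Mozilla/5.0 (X11; U; Linux i686; en-US)' \
         -O page.html.gz 'http://<url>'
    gunzip -c page.html.gz > page.html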

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-08 Thread Donald Allen
2008/9/8 Tony Godshall <[EMAIL PROTECTED]>:
> I haven't done this but I can speculate that you need to
> have wget identify itself as firefox.

When I read this, I thought it looked promising, but it doesn't work.
I tried sending exactly the user-agent string firefox is sending and
still got a page from yahoo that clearly indicates yahoo thinks I'm
not logged in.

/Don

>
> Quote from man wget...
>
>   -U agent-string
>   --user-agent=agent-string
>   Identify as agent-string to the HTTP server.
>
>   The HTTP protocol allows the clients to identify themselves using a
>   "User-Agent" header field. This enables distinguishing the WWW software,
>   usually for statistical purposes or for tracing of protocol violations.
>   Wget normally identifies as Wget/version, version being the current
>   version number of Wget.
>
>   However, some sites have been known to impose the policy of tailoring
>   the output according to the "User-Agent"-supplied information. While
>   this is not such a bad idea in theory, it has been abused by servers
>   denying information to clients other than (historically) Netscape or,
>   more frequently, Microsoft Internet Explorer. This option allows you to
>   change the "User-Agent" line issued by Wget. Use of this option is
>   discouraged, unless you really know what you are doing.
>
>
> On Mon, Sep 8, 2008 at 12:25 PM, Donald Allen <[EMAIL PROTECTED]> wrote:
>> There was a recent discussion concerning using wget to obtain pages
>> from yahoo logged into yahoo as a particular user. Micah replied to
>> Rick Nakroshis with instructions describing two methods for doing
>> this. This information has also been added by Micah to the wiki.
>>
>> I just tried the simpler of the two methods -- logging into yahoo with
>> my browser (Firefox 2.0.0.16) and then downloading a page with
>>
>> wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies <home directory>/.mozilla/firefox/id2dmo7r.default/cookies.txt
>> 'http://<url>'
>>
>> The page I get is what would be obtained if an un-logged-in user went
>> to the specified url. Opening that same url in Firefox *does*
>> correctly indicate that it is logged in as me and reflects my
>> customizations.
>>
>> wget -V:
>> GNU Wget 1.11.1
>>
>> I am running a reasonably up-to-date Gentoo system (updated within the
>> last month) on a Thinkpad X61.
>>
>> Have I missed something here? Any help will be appreciated. Please
>> include my personal address in your replies as I am not (yet) a
>> subscriber to this list.
>>
>> Thanks --
>> /Don Allen
>>
>
>
>
> --
> Best Regards.
> Please keep in touch.
> This is unedited.
> P-)
>


Re: [wget-notify] add a new option

2008-09-02 Thread Micah Cowan

houda hocine wrote:
>  Hi,

Hi houda.

This message was sent to wget-notify, which is not the proper
forum. Wget-notify is reserved for bug-change and (previously) commit
notifications, and is not intended for discussion (though I obviously
haven't blocked discussions; the original intent was to be able to
discuss commits, but I'm not sure I need to allow discussions any more,
so it may be disallowed soon).

The appropriate list would be wget@sunsite.dk, to which this discussion
has been redirected.

> we created a new format for archiving (.warc), and we want to ensure
> that wget generates this format directly from the input url.
> Can you help me with some ideas on how to achieve this new option?
> The format is (warc -wget url).
> I am in the process of trying to understand the source code to add this
> new option. Which .c file allows me to do this?

Doing this is not likely to be a trivial undertaking: the current
file-output interface isn't really abstracted enough to allow this, so
basically you'll need to modify most of the existing .c files. We are
hoping at some future point to allow for a more generic output format,
for direct output to (for instance) tarballs and .mhtml archives. At
that point, it'd probably be fairly easy to write extensions to do what
you want.

In the meantime, though, it'll be a pain in the butt. I can't really
offer much help; the best way to understand the source is to read and
explore it. However, on the general topic of adding new options to Wget,
Tony Lewis has written the excellent guide at
http://wget.addictivecode.org/OptionsHowto. Hope that helps!

Please note that I won't likely be entertaining patches to Wget to make
it output to non-mainstream archive formats, and even once generic
output mechanisms are supported, the mainstream archive formats will
most likely be supported as extension plugins or similar, and not as
built-in support within Wget.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget function

2008-08-25 Thread Micah Cowan

Please keep the list in the replies.

karlito wrote:
> hi, thank you for the reply. can my problem be fixed in the next version?
>
> because it's for batch use:
>
> i have more than 1000 urls to fetch, so that is why i need to find a solution
>
> also, when you say rename,
>
> what is the function to rename with wget?

I mean, just use the "mv" or "rename" command on your operating system.
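
For a batch of URLs, one possible sketch (the urls.txt format and the
naming scheme are invented, and it assumes each page saves as index.html,
as in your google.fr example) is to let wget pick its own file name in a
scratch directory and rename afterwards, since -O would break -k's link
conversion:

    while read -r name url; do
        dir=$(mktemp -d)
        # $OLDPWD is the directory we started in, set by cd:
        ( cd "$dir" && wget -k "$url" && mv index.html "$OLDPWD/$name.html" )
        rm -rf "$dir"
    done < urls.txt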

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: WGET :: [Correction de texte]

2008-08-25 Thread Micah Cowan

Tom wrote:
> Téléchargement récursif:
>   -r,  --recursive  spécifer un téléchargement récursif.
>   -l,  --level=NOMBRE   _*profondeeur*_ maximale de récursion (inf
> ou 0 pour infini).
> 
> Juste un "e" à enlever de profondeeur, et ca sera réglé !

This issue appears to have been fixed with the latest French
translation. It will be released with Wget 1.12.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget function

2008-08-25 Thread Micah Cowan

karlito wrote:
> 
> 
> Hello,
>  
> First of all, i would like to thank you for your great tool.
>
> I have a request.
>
> i use this command to save a url with absolute links, and it works very well:
>
> wget -k http://www.google.fr/
>
> but i want to save this file under a name other than index.html, for
> example google-is-good.html
>
> i have tried this:
>
> wget -k --output-document=google-is-good.html http://www.google.fr/
>
> it works, except that i lose the absolute links, and it's terrible

Yeah. Conversions won't work with --output-document, which behaves
rather like a shell redirection.

> i don't know how to fix this problem; which combination do i have to use
> to run wget -k with another name??

You could always rename it afterwards.

In your specific case, the current development sources (which will
become Wget 1.12) have a --default-page=google-is-good.html option for
specifying the default page name, thanks to Joao Ferreira. It's not yet
available in any release.
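
In other words (file names illustrative), either of these should do:

    # today: rename after the fact
    wget -k http://www.google.fr/ && mv index.html google-is-good.html
    # with the 1.12 development sources mentioned above:
    wget -k --default-page=google-is-good.html http://www.google.fr/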

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget function

2008-08-25 Thread karlito
 Hello,
>
> First of all, i would like to thank you for your great tool.
>
> I have a request.
>
> i use this command to save a url with absolute links, and it works very well:
>
> wget -k http://www.google.fr/
>
> but i want to save this file under a name other than index.html, for
> example google-is-good.html
>
> i have tried this:
>
> wget -k --output-document=google-is-good.html http://www.google.fr/
>
> it works, except that i lose the absolute links, and it's terrible
>
> i don't know how to fix this problem; which combination do i have to use
> to run wget -k with another name??
>
> can you help me? i can't find the solution. also, where can i find the
> latest version for windows?
>
> thank you for your time
>
>
> regards, carlos.
>


Re: wget and wiki crawling

2008-08-22 Thread Micah Cowan

asm c wrote:
> I've recently been using wget, and got it working for the most part, but
> there's one issue that's really been bugging me. One of the parameters I
> use is '-R "*action=*,*oldid=*"' (side note on the platform: ZSH on
> NetBSD on the SDF public access unix system, although I've also used it
> on windows with the same result). The purpose of this parameter is so
> that, when wget crawls a mid-sized wiki I'd like to have a local copy
> of, it doesn't bother with all the history pages, edit pages, and so
> forth. Not downloading these would save me an enormous amount of time.
> Unfortunately, the parameter is ignored until after the php page is
> downloaded. So, because it waits until it's downloaded to delete it,
> using the param doesn't really help at all.
> 
> Does anyone know how I can stop wget from even downloading matching pages?

Well, you don't mention it, but I'll assume that those patterns occur in
the "query string" portion of the URL: that is, they follow a question
mark (?) that appears at some point.

Unfortunately, the -R and -A options only apply to the "filename"
portion of the URL: that is, whatever falls between the last slash (/)
and the first question mark (?). Confusingly, the check is also then
applied _after_ files are downloaded, to determine whether they should
be deleted after the fact: so Wget probably downloads those files you
really wish it wouldn't, and then deletes them afterwards anyway.
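
A picture of which part of a URL the patterns see (URL invented for
illustration):

    http://wiki.example.org/w/index.php?action=history&oldid=1234
                              \_______/ \___________________/
                              filename       query string
                           (-R/-A match)   (not matched today)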

Worse, there's no way around this, currently. This is part of a suite of
problems that are currently slated to be addressed soon. The most
pertinent to your problem, though, is the need for a way to match
against query strings. I'm very much hoping to get around to this before
the next major Wget release, version 1.12. It's being tracked here:

https://savannah.gnu.org/bugs/index.php?22089

If you add yourself to the Cc list, you'll be able to follow along on
its progress.

--
Cheers!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-08-21 Thread Micah Cowan

Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> The easiest way to do what you want may be to log in using your browser,
>> and then tell Wget to use the cookies from your browser, using
> 
> Given the frequency of the "login and then download a file" use case, it
> should probably be documented on the wiki. (Perhaps it already is. :-)

Yeah, at
http://wget.addictivecode.org/FrequentlyAskedQuestions#password-protected

I think you missed the final sentence of my how-to:

> (I'm going to put this up on the Wgiki Faq now, at
> http://wget.addictivecode.org/FrequentlyAskedQuestions)

:)

(Back to you:)
> Also, it would probably be helpful to have a shell script to automate this.

I filed the following issue some time ago:
https://savannah.gnu.org/bugs/index.php?22561

The report is low on details; but I was envisioning something that would
spew out forms and their fields, accept values for fields in one form,
and invoke the appropriate Wget command to do the submission.

I don't know if it could be _completely_ automated, since it's not 100%
possible for the script to know which form fields are the ones it should
be filling out.

OTOH, there are some damn good heuristics that could be done: I imagine
that the "right form" (in the event of more than one) can usually be
guessed by seeing which one has a "password"-type input (assuming
there's also only one of those). If that form has only one "text"-type
input, then we've found the username field as well. Name-based
heuristics (with "pass", "user", "uname", "login", etc) could also help.

If someone wants to do this, that'd be terrific. Could probably reuse
the existing HTML parser code from Wget. Otherwise, it'd probably be a
while before I could get to it, since I've got higher priorities that
have been languishing.

Such a tool might also be an appropriate place to add FF3 sqlite
cookies support.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


RE: Wget and Yahoo login?

2008-08-21 Thread Tony Lewis
Micah Cowan wrote:

> The easiest way to do what you want may be to log in using your browser,
> and then tell Wget to use the cookies from your browser, using

Given the frequency of the "login and then download a file" use case, it
should probably be documented on the wiki. (Perhaps it already is. :-)

Also, it would probably be helpful to have a shell script to automate this.

Tony



Re: WGET :: [Text correction]

2008-08-11 Thread Micah Cowan

Saint Xavier wrote:
> * Tom ([EMAIL PROTECTED]) wrote:
>> Hello!
> 
> hello,
> 
>> I'd like to let you know about a key that seems to have been held down
>> a quarter of a second too long!
> ...
>> Téléchargement récursif:
>>   -r,  --recursive  spécifer un téléchargement récursif.
>>   -l,  --level=NOMBRE   *profondeeur* maximale de récursion (inf ou 0
>> Just one "e" to remove from "profondeeur", and it'll be fixed!
> 
> Indeed, thanks!
> 
> Micah, instead of "profondeeur" it should be "profondeur".
> Where do you forward that info, the French GNU translation team?
> (./po/fr.po around line 1472)

Yup. The mailing address for the French translation team is at
[EMAIL PROTECTED] The team page is
http://translationproject.org/team/fr.html; other translation teams are
listed at http://translationproject.org/team/index.html

Looks like it's still present in the latest fr.po file at
http://translationproject.org/latest/wget/fr.po

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: WGET :: [Text correction]

2008-08-11 Thread Saint Xavier
* Tom ([EMAIL PROTECTED]) wrote:
> Hello!

hello,

> I'd like to let you know about a key that seems to have been held down
> a quarter of a second too long!
...
> Téléchargement récursif:
>   -r,  --recursive  spécifer un téléchargement récursif.
>   -l,  --level=NOMBRE   *profondeeur* maximale de récursion (inf ou 0
> Just one "e" to remove from "profondeeur", and it'll be fixed!

Indeed, thanks!

Micah, instead of "profondeeur" it should be "profondeur".
Where do you forward that info, the French GNU translation team?
(./po/fr.po around line 1472)

Saint Xavier.


Re: WGET :: [Text correction]

2008-08-11 Thread Julien

Hi Tom,

Thanks for this information.
But, could you tell us what version of Wget are you using?
You can see that using: wget --version
I advise you to try the last version, available here:
http://wget.addictivecode.org/FrequentlyAskedQuestions#download
Moreover, the language of this mailing list is English.

Thanks,
Julien.

2008/8/11 Tom <[EMAIL PROTECTED]>:
> Hello!
>
> I'd like to let you know about a key that seems to have been held down
> a quarter of a second too long!
>
> In Wget's help output (wget --help), we indeed find:
>
>
> Téléchargement récursif:
>   -r,  --recursive  spécifer un téléchargement récursif.
>   -l,  --level=NOMBRE   profondeeur maximale de récursion (inf ou 0 pour
> infini).
>
>
> Just one "e" to remove from "profondeeur", and it'll be fixed!
>
> Since the help says "Report any problems or suggestions to
> <[EMAIL PROTECTED]>.", I took the liberty of letting you know!
>
> Thanks for this tool, and keep up the good work!
>
> Regards,
>
> Tom


Re: Wget and Yahoo login?

2008-08-11 Thread Rick Nakroshis

At 04:27 PM 8/10/2008, you wrote:


Rick Nakroshis wrote:
> Micah,
>
> If you will excuse a quick question about Wget, I'm trying to find out
> if I can use it to download a page from Yahoo that requires me to be
> logged in using my Yahoo profile name and password.  It's a display of a
> CSV file, and the only wrinkle is trying to get past the Yahoo login.
>
> Try as I may, I just can't seem to find anything about Wget and Yahoo.
> Any suggestions or pointers?

Hi Rick,

In the future, it's better if you post questions to the mailing list at
wget@sunsite.dk; I don't always have time to respond.

The easiest way to do what you want may be to log in using your browser,
and then tell Wget to use the cookies from your browser, using
--load-cookies=<file>. Of course, this only works if your browser saves
its cookies in the standard text format (Firefox prior to version 3 will
do this), or can export to that format (note that someone contributed a
patch to allow Wget to work with Firefox 3 cookies; it's linked from
http://wget.addictivecode.org/; it's unofficial, so I can't vouch for
its quality).

Otherwise, you can perform the login using Wget, saving the cookies to a
file of your choice, using --post-data=..., --save-cookies=cookies.txt,
and probably --keep-session-cookies. This will require that you know
what data to place in --post-data, which generally requires that you dig
around in the HTML to find the right form field names, and where to post
them.

For instance, if you find a form like the following within the page
containing the log-in form:

<form action="doLogin.php" method="post">
  <input type="text" name="s-login" />
  <input type="password" name="s-pass" />
</form>

then you need to do something like:

  $ wget --post-data='s-login=USERNAME&s-pass=PASSWORD' \
--save-cookies=my-cookies.txt --keep-session-cookies \
http://HOSTNAME/doLogin.php

(Note that you _don't_ necessarily send the information to the page that
had the login form: you send it to the spot mentioned in the "action"
attribute of the password form.)

Once this is done, you _should_ be able to perform further operations
with Wget as if you're logged in, by using

  $ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
--keep-session-cookies ...

(I'm going to put this up on the Wgiki Faq now, at
http://wget.addictivecode.org/FrequentlyAskedQuestions)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/



Micah,

Thank you for taking the time to answer so thoroughly, and doing so 
promptly, too.  You've given me a great boost forward, and I appreciate it.


Thank you, sir!


Rick



Re: Wget and Yahoo login?

2008-08-10 Thread Micah Cowan

Rick Nakroshis wrote:
> Micah,
> 
> If you will excuse a quick question about Wget, I'm trying to find out
> if I can use it to download a page from Yahoo that requires me to be
> logged in using my Yahoo profile name and password.  It's a display of a
> CSV file, and the only wrinkle is trying to get past the Yahoo login.
> 
> Try as I may, I just can't seem to find anything about Wget and Yahoo. 
> Any suggestions or pointers?

Hi Rick,

In the future, it's better if you post questions to the mailing list at
wget@sunsite.dk; I don't always have time to respond.

The easiest way to do what you want may be to log in using your browser,
and then tell Wget to use the cookies from your browser, using
--load-cookies=<file>. Of course, this only works if your browser saves
its cookies in the standard text format (Firefox prior to version 3 will
do this), or can export to that format (note that someone contributed a
patch to allow Wget to work with Firefox 3 cookies; it's linked from
http://wget.addictivecode.org/; it's unofficial, so I can't vouch for
its quality).

Otherwise, you can perform the login using Wget, saving the cookies to a
file of your choice, using --post-data=..., --save-cookies=cookies.txt,
and probably --keep-session-cookies. This will require that you know
what data to place in --post-data, which generally requires that you dig
around in the HTML to find the right form field names, and where to post
them.

For instance, if you find a form like the following within the page
containing the log-in form:

<form action="doLogin.php" method="post">
  <input type="text" name="s-login" />
  <input type="password" name="s-pass" />
</form>

then you need to do something like:

  $ wget --post-data='s-login=USERNAME&s-pass=PASSWORD' \
--save-cookies=my-cookies.txt --keep-session-cookies \
http://HOSTNAME/doLogin.php

(Note that you _don't_ necessarily send the information to the page that
had the login form: you send it to the spot mentioned in the "action"
attribute of the password form.)

Once this is done, you _should_ be able to perform further operations
with Wget as if you're logged in, by using

  $ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
--keep-session-cookies ...

(I'm going to put this up on the Wgiki Faq now, at
http://wget.addictivecode.org/FrequentlyAskedQuestions)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: WGET Date-Time

2008-08-07 Thread Micah Cowan

Andreas Weller wrote:
> Hi!
> I use wget to download files from a ftp server in a bash script.
> For example:
> touch last.time
> wget -nc ftp://[]/*.txt .
> find -newer last.time
> 
> This fails if the files on the FTP server are older than my last.time. So I 
> want
> wget to set file date/time to the local creation time not the server's...
> 
> How to do this?

You can't, currently. This behavior is intended to support Wget's
timestamping (-N) functionality.

However, I'd accept a patch for an option that disables this.
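
In the meantime, one workaround sketch (assuming bash, with SERVER as a
placeholder): compare directory listings taken before and after the
download, instead of comparing file timestamps, so the server-supplied
mtimes don't matter:

  ls *.txt > before.list 2>/dev/null
  wget -nc 'ftp://SERVER/*.txt'
  ls *.txt > after.list 2>/dev/null
  comm -13 <(sort before.list) <(sort after.list)  # the newly fetched files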

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget scriptability

2008-08-03 Thread Anthony Bryan
> support for MetaLink
>
> Current Wget? I think someone's actually working on this. But, given Wget's 
> current single-connection support, it couldn't be much more than falling back 
> on one URL when another is broken.
> Pluggable/Library Wget (with multiple connections)? Doable. A level of 
> difficulty.
> Pipelines Wget? Use a Metalink "getter" rather than the stock Pipes-Wget 
> "getter". The Metalink "getter" itself would probably manage the use of  
> several invocations of stock Pipes-Wget "getter".

I don't think anyone is specifically working on a patch for wget yet.
But a new version of Tatsuhiro's libmetalink (C library) [1] is
getting close to release & only lacks documentation.

Is anyone from the wget community interested in adding Metalink
support? If you are, please contact me!

For those unfamiliar, Metalink is an XML format listing URLs
(mirrors), checksums, and other information about downloads. It's used
by projects such as OpenOffice.org, openSUSE, Ubuntu, and many others.

Metalink support in the current wget (which I think would still be
useful) would include:

- download from a single URL (no multi-source downloads); wget would get
  a URL from the mirror list in the .metalink (preferably one with the
  highest priority).
- verification of the whole-file checksum at the end of the transfer.
- optionally, if that server/URL went down, wget could switch to the next
  highest priority URL.
- optionally, if there was an error in transfer, wget could request that
  chunk again and compare against the chunk checksums.

Inclusion in wget is dependent on code quality, adherence to GNU
standards, etc. It will be disabled by default, but enabled with the
appropriate switch to ./configure.

-- 
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
 )) Easier, More Reliable, Self Healing Downloads

[1] http://code.google.com/p/libmetalink/


Re: Wget scriptability

2008-08-03 Thread Micah Cowan

Dražen Kačar wrote:
> Micah Cowan wrote:
> 
>> Okay, so there's been a lot of thought in the past, regarding better
>> extensibility features for Wget. Things like hooks for adding support
>> for traversal of new Content-Types besides text/html, or adding some
>> form of JavaScript support, or support for MetaLink. Also, support for
>> being able to filter results pre- and post-processing by Wget: for
>> example, being able to do some filtering on the HTML to change how Wget
>> sees it before parsing for links, but without affecting the actual
>> downloaded version; or filtering the links themselves to alter what Wget
>> fetches.
> 
>> However, another thing that's been vaguely itching at me lately, is the
>> fact that Wget's design is not particularly unix-y. Instead of doing one
>> thing, and doing it well, it does a lot of things, some well, some not.
> 
> It does what various people needed. It wasn't an exercise in writing a
> unixy utility. It was a program that solved real problems for real
> people.

>> But the thing everyone loves about Unix and GNU (and certainly the thing
>> that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
>> paradigm,
> 
> I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command utilities
(that is, to pipeline). The power and flexibility of pipelines is
extremely well-established in the Unix world; I feel no need whatsoever
to waste breath arguing for it, particularly when you haven't provided
the reasons you hate it.

For my part, I'm not exaggerating that it's single-handedly responsible
for why I'm a Unix/GNU user at all, and why I continue to highly enjoy
developing on it.

  find . -name '*.html' -exec sed -i \
's#http://oldhost/#http://newhost/#g' '{}' \;

  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  pic | tbl | eqn | eff-ing | troff -ms

Each one of these demonstrates the enormously powerful technique of
using distinct tools with distinct feature domains, together to form a
cohesive solution for the need. The best part is (with the possible
exception of the troff pipeline), each of these components are
immediately available for use in some other pipeline that does some
other completely different function.

Note, though, that I don't intend that using "Piped-Wget" would actually
mean the user types in a special pipeline each time he wants to do
something with it. The primary driver would read in some config file
that would tell wget how it should do the piping. You just tweak the
config file when you want to add new functionality.

>>  - The tools themselves, as much as possible, should be written in an
>> easily-hackable scripting language. Python makes a good candidate. Where
>> we want efficiency, we can implement modules in C to do the work.
> 
> At the time Wget was conceived, that was Tcl's mantra. It failed
> miserably. :-)

Are you claiming that Tcl's failure was due to the ability to integrate
it with C, rather than its abysmal inadequacy as a programming language
(changing it from an ability to integrate with C, to an absolute
requirement to do so in order to get anything accomplished)?

> How about concentrating on the problems listed in your first paragraph
> (which is why I quoted it)? Could you show us how a bunch of shell
> tools would solve them? Or how a library-ized Wget would solve them? Or
> how any other paradigm or architecture or whatever would solve them?

It should be trivially obvious: you plug them in, rather than "wait for
the Wget developers to get around to implementing it".

The thing that both library-ized Wget and pipeline-ized Wget would offer
is the same: extreme flexibility. It puts the users in control of what
Wget does, rather than just perpetually hearing, "sorry, Wget can't do
it: you could hack the source, though." :p

The difference between the two is that a pipelined Wget offers this
flexibility to a wider range of users, whereas a library Wget offers it
to C programmers.

Or how would you expect to do these things without a library-ized (at
least) Wget? Implementing them in the core app (at least by default) is
clearly wrong (scope bloat). Giving Wget a plugin architecture is good,
but then there's only as much flexibility as there are hooks.
Libraryizing Wget is equivalent to providing everything as hooks, and
puts the program using it in the driver's seat (and, naturally, there'd
be a wrapper implementation, like curl for libcurl). A suite of
interconnected utilities does the same, but is more accessible to
greater numbers of people. Generally at some expense to efficiency
(aren't all flexible architectures?); but Wget isn't CPU-bound, it's
network-bound.

As mentioned in my original post, this would be a separate project from
Wget. Wget would not be going away (though it seems l

Re: Wget scriptability

2008-08-02 Thread Dražen Kačar
Micah Cowan wrote:

> Okay, so there's been a lot of thought in the past, regarding better
> extensibility features for Wget. Things like hooks for adding support
> for traversal of new Content-Types besides text/html, or adding some
> form of JavaScript support, or support for MetaLink. Also, support for
> being able to filter results pre- and post-processing by Wget: for
> example, being able to do some filtering on the HTML to change how Wget
> sees it before parsing for links, but without affecting the actual
> downloaded version; or filtering the links themselves to alter what Wget
> fetches.

> However, another thing that's been vaguely itching at me lately, is the
> fact that Wget's design is not particularly unix-y. Instead of doing one
> thing, and doing it well, it does a lot of things, some well, some not.

It does what various people needed. It wasn't an exercise in writing a
unixy utility. It was a program that solved real problems for real
people.

> But the thing everyone loves about Unix and GNU (and certainly the thing
> that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
> paradigm,

I have always hated that. With a passion.

>  - The tools themselves, as much as possible, should be written in an
> easily-hackable scripting language. Python makes a good candidate. Where
> we want efficiency, we can implement modules in C to do the work.

At the time Wget was conceived, that was Tcl's mantra. It failed
miserably. :-)

How about concentrating on the problems listed in your first paragraph
(which is why I quoted it)? Could you show us how a bunch of shell
tools would solve them? Or how a library-ized Wget would solve them? Or
how any other paradigm or architecture or whatever would solve them?

-- 
 .-.   .-.Yes, I am an agent of Satan, but my duties are largely
(_  \ /  _)   ceremonial.
 |
 |[EMAIL PROTECTED]


Re: wget does not like this URL

2008-07-31 Thread Micah Cowan

Kevin O'Gorman wrote:
> Is there a reason i get this:
>> [EMAIL PROTECTED] Pending $ wget -O foo 
>> "http://www.littlegolem.net/jsp/info/player_game_list_txt.jsp?plid=1107&gtid=hex"
>> Cannot specify -r, -p or -N if -O is given.
>> Usage: wget [OPTION]... [URL]...
>> [EMAIL PROTECTED] Pending $
> 
> While I do have "-O", I don't have the ones it seems to think I've specified.
> 
> Without the "-O foo" it works fine, but of course puts the results in
> a different place.
> I get the same error message if I use the long-form parameter.

You most likely have "timestamping=on" in your wgetrc. -N and -O were
disallowed for version 1.11, but were re-enabled for 1.11.3 (I think)
with a warning. The latest version of wget is 1.11.4.
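
(One way to confirm and override that for a single run, as a sketch:
-e executes a wgetrc command from the command line, so

  grep -i timestamping ~/.wgetrc
  wget -e timestamping=off -O foo 'http://...'

with your original URL in place of 'http://...' should work without
editing your wgetrc.)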

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


RE: wget-1.11.4 bug

2008-07-26 Thread kuang-cheng chao

Micah Cowan wrote:
> The thing is, though, those two threads should be running wgets under
> separate processes

Yes, the two threads are running wgets under separate processes, via
"system".

> What operating system are you running? Vista?

mipsel-linux with kernel v2.4, built with gcc v3.3.5

Best regards,
K.C. Chao
_
Discover the new Windows Vista
http://search.msn.com/results.aspx?q=windows+vista&mkt=en-US&form=QBRE

Re: wget-1.11.4 bug

2008-07-25 Thread Micah Cowan

k.c. chao wrote:
> Micah Cowan wrote:
> > Have you reproduced this, or is this in theory? If the latter, what has
> > led you to this conclusion? I don't see anything in the code that would
> > cause this behavior.
>
> I reproduced this. But I can't be sure the real problem is in
> "resolve_bind_address." In the attached message, both
> api.yougotphoto.com and farm1.static.flickr.com get the same
> IP (74.124.203.218). The two wgets are called from two threads of a
> program.

Yeah, I get 68.142.213.135 for the flickr.com address, currently.

The thing is, though, those two threads should be running wgets under
separate processes (I'm not sure how they couldn't be, but if they
somehow weren't that would be using Wget other than how it was designed
to be used).

This problem sounds much more like an issue with the OS's API than an
issue with Wget, to me. But we'd still want to work around it if it were
feasible.

What operating system are you running? Vista?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


RE: wget-1.11.4 bug

2008-07-25 Thread kuang-cheng chao

Micah Cowan wrote:
> Have you reproduced this, or is this in theory? If the latter, what has
> led you to this conclusion? I don't see anything in the code that would
> cause this behavior.

I reproduced this. But I can't be sure the real problem is in
"resolve_bind_address." In the attached message, both api.yougotphoto.com
and farm1.static.flickr.com get the same IP (74.124.203.218). The two
wgets are called from two threads of a program.
 
Best regards,
k.c. chao
 
P.S. The log follows:
 
wget -4 -t 6 "http://api.yougotphoto.com/device/?action=get_device_new_photo&api=2.2&api_key=f10df554a958fd10050e2d305241c7a3&device_class=2&serial_no=000E2EE5676F&url_no=24616&cksn=44fe191d6cb4e7807f75938b5d72f07c" -O /tmp/webii/ygp_new_photo_list.txt
--1999-11-30 00:04:21--  http://api.yougotphoto.com/device/?action=get_device_new_photo&api=2.2&api_key=f10df554a958fd10050e2d305241c7a3&device_class=2&serial_no=000E2EE5676F&url_no=24616&cksn=44fe191d6cb4e7807f75938b5d72f07c
Resolving api.yougotphoto.com... wget -4 -t 6 "http://farm1.static.flickr.com/33/49038824_e4b04b7d9f_b.jpg" -O /tmp/webii/24616
74.124.203.218
Connecting to api.yougotphoto.com|74.124.203.218|:80... --1999-11-30 00:04:22--  http://farm1.static.flickr.com/33/49038824_e4b04b7d9f_b.jpg
Resolving farm1.static.flickr.com... 74.124.203.218
Connecting to farm1.static.flickr.com|74.124.203.218|:80... connected.
_
Discover the new Windows Vista
http://search.msn.com/results.aspx?q=windows+vista&mkt=en-US&form=QBRE

Re: wget-1.11.4 bug

2008-07-25 Thread Micah Cowan

kuang-cheng chao wrote:
> Dear Micah:
>  
> Thanks for your work of wget.
>  
> There is a question about two wgets run simultaneously.
> In the method resolve_bind_address, wget assumes that it is called once.
> However, this can leave two domain names with the same IP if two wgets
> run the same method concurrently.

Have you reproduced this, or is this in theory? If the latter, what has
led you to this conclusion? I don't see anything in the code that would
cause this behavior.

Also, please use the mailing list for discussions about Wget. I've added
it to the recipients list.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: wget i18N issue

2008-07-24 Thread Micah Cowan

Li Ru An wrote:
> Hi all:
> 
> Greeting! I'm new in this list, hope I can help here.
> 
> I found that there's some I18N issue with wget. For example, my OS is
> using GBK, when wget tries to get a URL coded in UTF-8, there's some
> issue in the coding translation. 
> 
> I have managed to resolve my issue by changing some code, but I think
> it's not a solution for I18N. Is someone working on I18N or facing the
> same issue like me? We can work together to get it resolved completely.

Yes, Saint Xavier, one of our Google Summer of Code students, has been
actively working on this area; his improvements are expected to be part
of Wget release 1.12, which will probably be out around the turn of the
year, possibly sooner.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Wget

2008-07-22 Thread Micah Cowan

Hor Meng Yoong wrote:
> Hi:
> 
>   I understand that you are a very busy person. Sorry to disturb you.

Hi; please use the mailing list for support requests. I've copied the
list in my response.

>   I am using wget to mirror (using ftp://) a user home directory from a
> unix machine. Wget defaults to the user's home directory. However, I also
> need to get the /etc folder. So, I tried to use ../../../etc. It works, but
> the resulting ftp'd files end up in %2E%2E/ %2E%2E/ %2E%2E
> 
> Any means to overcome this, or rename the directory?

Try the -nd option (you may also need -nH). You might prefer to fetch
/etc in a separate invocation from the other things; perhaps with the -P
option to specify a directory name.
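
For example (a sketch; USER, PASS, and HOST are placeholders, and the
leading %2F makes the FTP path absolute instead of relative to the
login directory):

  wget -r -nH -P etc-copy 'ftp://USER:PASS@HOST/%2Fetc/'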

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: WGET bug...

2008-07-11 Thread Micah Cowan

HARPREET SAWHNEY wrote:
> Hi,
> 
> Thanks for the prompt response.
> 
> I am using
> 
> GNU Wget 1.10.2
> 
> I tried a few things on your suggestion but the problem remains.
> 
> 1. I exported the cookies file in Internet Explorer and specified
> that in the Wget command line. But same error occurs.
> 
> 2. I have an open session on the site with my username and password.
> 
> 3. I also tried running wget while I am downloading a file from the
> IE session on the site, but the same error.

Sounds like you'll need to get the appropriate cookie by using Wget to
log in to the website. This requires site-specific information from the
user-login form page, though, so I can't help you without that.

If you know how to read some HTML, then you can find the HTML form used
for posting username/password stuff, and use

wget --keep-session-cookies --save-cookies=cookies.txt \
  --post-data='USERNAME=FOO&PASSWORD=BAR' ACTION

where ACTION is the value of the form's action attribute, USERNAME and
PASSWORD (and possibly further required values) are field names from the
HTML form, and FOO and BAR are the username/password.
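
Once that works, the saved cookies can be reused for the actual
download; a sketch, with the URL as a placeholder:

  wget --load-cookies=cookies.txt 'http://SITE/path/to/file'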

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: WGET bug...

2008-07-11 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

HARPREET SAWHNEY wrote:
> Hi,
> 
> I am getting a strange bug when I use wget to download a binary file
> from a URL versus when I manually download.
> 
> The attached ZIP file contains two files:
> 
> 05.upc --- manually downloaded
> dum.upc--- downloaded through wget
> 
> wget adds a number of ascii characters to the head of the file and seems
> to delete a similar number from the tail.
> 
> So the file sizes are the same but the addition and deletion renders
> the file useless.
> 
> Could you please direct me on whether I should be using some specific
> option to avoid this problem?

In the future, it's useful to mention which version of Wget you're using.

The problem you're having is that the server is adding the extra HTML at
the front of your session, and then giving you the file contents anyway.
It's a bug in the PHP code that serves the file.

You're getting this extra content because you are not logged in when
you're fetching it. You need to have Wget send a cookie with the
login-session information, and then the server will probably stop
sending the corrupting information at the head of the file. The site
does not appear to use HTTP's authentication mechanisms, so the
<[EMAIL PROTECTED]> bit in the URL doesn't do you any good. It uses
forms-and-cookies authentication.

Hopefully, you're using a browser that stores its cookies in a text
format, or that is capable of exporting to a text format. In that case,
you can just ensure that you're logged in in your browser, and use the
--load-cookies=<file> option to Wget to use the same session
information.
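
For instance (a sketch; PROFILE stands for your Firefox profile
directory, and the URL is a placeholder):

  wget --load-cookies ~/.mozilla/firefox/PROFILE/cookies.txt \
'http://SITE/path/to/05.upc'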

Otherwise, you'll need to use --save-cookies with Wget to simulate the
login form post, which is tricky and requires some understanding of HTML
Forms.

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Coombe, Allan David (DPS)
Sorry Guys - just an ID 10 T error on my part.

I think I need to change 2 things in the proxy server.

1.  URLs in the HTML being returned to wget - this works OK
2.  The "Content-Location" header used when the web server reports a
"301 Moved Permanently" response - I think this works OK.

When I reported that it wasn't working I hadn't done both at the same
time.

Cheers

Allan

-Original Message-
From: Micah Cowan [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 25 June 2008 6:44 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensitivity and URLs



Tony Lewis wrote:
> Coombe, Allan David (DPS) wrote:
> 
>> However, the case of the files on disk is still mixed - so I assume 
>> that wget is not using the URL it originally requested (harvested 
>> from the HTML?) to create directories and files on disk.  So what is 
>> it using? An HTTP header (if so, which one??).
> 
> I think wget uses the case from the HTML page(s) for the file name; 
> your proxy would need to change the URLs in the HTML pages to lower 
> case too.

My understanding from David's post is that he claimed to have been doing
just that:

> I modified the response from the web site to lowercase the urls in the
> html (actually I lowercased the whole response) and the data that wget
> put on disk was fully lowercased - problem solved - or so I thought.

My suspicion is it's not quite working, though, as otherwise where would
Wget be getting the mixed-case URLs?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Micah Cowan

Tony Lewis wrote:
> Coombe, Allan David (DPS) wrote:
> 
>> However, the case of the files on disk is still mixed - so I assume
>> that wget is not using the URL it originally requested (harvested
>> from the HTML?) to create directories and files on disk.  So what
>> is it using? An HTTP header (if so, which one??).
> 
> I think wget uses the case from the HTML page(s) for the file name;
> your proxy would need to change the URLs in the HTML pages to lower
> case too.

My understanding from David's post is that he claimed to have been doing
just that:

> I modified the response from the web site to lowercase the urls in
> the html (actually I lowercased the whole response) and the data that
> wget put on disk was fully lowercased - problem solved - or so I
> thought.

My suspicion is it's not quite working, though, as otherwise
where would Wget be getting the mixed-case URLs?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/



RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Tony Lewis
Coombe, Allan David (DPS) wrote:

> However, the case of the files on disk is still mixed - so I assume that
> wget is not using the URL it originally requested (harvested from the
> HTML?) to create directories and files on disk.  So what is it using? An
> HTTP header (if so, which one??).

I think wget uses the case from the HTML page(s) for the file name; your
proxy would need to change the URLs in the HTML pages to lower case too.

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-21 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Coombe, Allan David (DPS) wrote:
> OK - now I am confused.
> 
> I found a perl based http proxy (named "http::proxy" funnily enough)
> that has filters to change both the request and response headers and
> data.  I modified the response from the web site to lowercase the urls
> in the html (actually I lowercased the whole response) and the data that
> wget put on disk was fully lowercased - problem solved - or so I thought.
> 
> However, the case of the files on disk is still mixed - so I assume that
> wget is not using the URL it originally requested (harvested from the
> HTML?) to create directories and files on disk.  So what is it using? An
> HTTP header (if so, which one??).

I think you're missing something on your end; I couldn't begin to tell
you what. Running with --debug will likely be informative.

Wget uses the URL that successfully results in a file download. If the
files on disk have mixed case, then it's because it was the result of a
mixed-case request from Wget (which, in turn, must have either resulted
from an explicit argument, or from HTML content).

The only exception to the above is when you explicitly enable
--content-disposition support, in which case Wget will use any filename
specified in a Content-Disposition header. Those are virtually never
issued, except for CGI-based downloads (and you have to explicitly
enable it).

--
Good luck!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-21 Thread Coombe, Allan David (DPS)
OK - now I am confused.

I found a perl based http proxy (named "http::proxy" funnily enough)
that has filters to change both the request and response headers and
data.  I modified the response from the web site to lowercase the urls
in the html (actually I lowercased the whole response) and the data that
wget put on disk was fully lowercased - problem solved - or so I
thought.

However, the case of the files on disk is still mixed - so I assume that
wget is not using the URL it originally requested (harvested from the
HTML?) to create directories and files on disk.  So what is it using? An
HTTP header (if so, which one??).

Any ideas??

Cheers
Allan


Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-20 Thread Michelle Konzack
Hello Stefan,

I have a question:

On 2008-06-18 12:17:12, Stefan Nowak wrote:
> wget \
> --page-requisites \
> --html-extension \
> --convert-links \
> --span-hosts \
> --no-check-certificate \
> --debug \
> https://help.ubuntu.com/community/MacBookPro/ &> log.txt

Why do you use

&> log.txt
instead of
--output-file=log.txt
or
--append-output=log.txt

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
24V Electronic Engineer
Tamay Dogan Network
Debian GNU/Linux Consultant


-- 
Linux-User #280138 with the Linux Counter, http://counter.li.org/
# Debian GNU/Linux Consultant #
Michelle Konzack   Apt. 917  ICQ #328449886
+49/177/935194750, rue de Soultz MSN LinuxMichi
+33/6/61925193 67100 Strasbourg/France   IRC #Debian (irc.icq.com)




Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread mm w
not all, but in this particular case I'm pretty sure they have

On Thu, Jun 19, 2008 at 10:42 AM, Tony Lewis <[EMAIL PROTECTED]> wrote:
> mm w wrote:
>
>> a simple url-rewriting conf should fix the problem, wihout touch the file 
>> system
>> everything can be done server side
>
> Why do you assume the user of wget has any control over the server from which 
> content is being downloaded?
>
>



-- 
-mmw


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread Tony Lewis
mm w wrote:

> a simple url-rewriting conf should fix the problem, wihout touch the file 
> system
> everything can be done server side

Why do you assume the user of wget has any control over the server from which 
content is being downloaded?



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread mm w
without touching the file system

On Thu, Jun 19, 2008 at 9:23 AM, mm w <[EMAIL PROTECTED]> wrote:
> a simple url-rewriting conf should fix the problem, wihout touch the file 
> system
> everything can be done server side
>
> Best Regards
>
> On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS)
> <[EMAIL PROTECTED]> wrote:
>> Thanks everyone for the contributions.
>>
>> Ultimately, our purpose is to process documents from the site into our
>> search database, so probably the most important thing is to limit the
>> number of files being processed.  The case of  the URLs in the html
>> probably wouldn't cause us much concern, but I could see that it might
>> be useful to "convert" a site for mirroring from a non-case sensitive
>> (windows) environment to a case sensitive (li|u)nix one - this would
>> need to include translation of urls in content as well as filenames on
>> disk.
>>
>> In the meantime - does anyone know of a proxy server that could
>> translate urls from mixed case to lower case.  I thought that if we
>> downloaded using wget via such a proxy server we might get the
>> appropriate result.
>>
>> The other alternative we were thinking of was to post process the files
>> with symlinks for all mixed case versions of files and directories (I
>> think someone already suggested this - great minds and all that...). I
>> assume that wget would correctly use the symlink to determine the
>> time/date stamp of the file for determining if it requires updating (or
>> would it use the time/date stamp of the symlink?). I also assume that if
>> wget downloaded the file it would overwrite the symlink and we would
>> have to run our "convert files to" symlinks process again.
>>
>> Just to put it in perspective, the actual site is approximately 45gb
>> (that's what the administrator said) and wget downloaded > 100gb
>> (463,000 files) when I did the first process.
>>
>> Cheers
>> Allan
>>
>> -Original Message-
>> From: Micah Cowan [mailto:[EMAIL PROTECTED]
>> Sent: Saturday, 14 June 2008 7:30 AM
>> To: Tony Lewis
>> Cc: Coombe, Allan David (DPS); 'Wget'
>> Subject: Re: Wget 1.11.3 - case sensitivity and URLs
>>
>>
>>
>> Tony Lewis wrote:
>>> Micah Cowan wrote:
>>>
>>>> Unfortunately, nothing really comes to mind. If you'd like, you could
>>
>>>> file a feature request at
>>>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>>
>>>> asking Wget to treat URLs case-insensitively.
>>>
>>> To have the effect that Allan seeks, I think the option would have to
>>> convert all URIs to lower case at an appropriate point in the process.
>>
>>> I think you probably want to send the original case to the server
>>> (just in case it really does matter to the server). If you're going to
>>
>>> treat different case URIs as matching then the lower-case version will
>>
>>> have to be stored in the hash. The most important part (from the
>>> perspective that Allan voices) is that the versions written to disk
>>> use lower case characters.
>>
>> Well, that really depends. If it's doing a straight recursive download,
>> without preexisting local files, then all that's really necessary is to
>> do lookups/stores in the blacklist in a case-normalized manner.
>>
>> If preexisting files matter, then yes, your solution would fix it.
>> Another solution would be to scan directory contents for the first name
>> that matches case insensitively. That's obviously much less efficient,
>> but has the advantage that the file will match at least one of the
>> "real" cases from the server.
>>
>> As Matthias points out, your lower-case normalization solution could be
>> achieved in a more general manner with a hook. Which is something I was
>> planning on introducing perhaps in 1.13 anyway (so you could, say, run
>> sed on the filenames before Wget uses them), so that's probably the
>> approach I'd take. But probably not before 1.13, even if someone
>> provides a patch for it in time for 1.12 (too many other things to focus
>> on, and I'd like to introduce the "external command" hooks as a suite,
>> if possible).
>>
>> OTOH, case normalization in the blacklists would still be useful, in
>> addition to that mechanism. Could make another good addition for 1.13
>> (because it'll be more useful in combination with the rename hooks).
>>
>> --
>> Micah J. Cowan
>> Programmer, musician, typesetting enthusiast, gamer,
>> and GNU Wget Project Maintainer.
>> http://micah.cowan.name/
>>
>
>
>
> --
> -mmw
>



-- 
-mmw


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread mm w
a simple url-rewriting conf should fix the problem, wihout touch the file system
everything can be done server side

Best Regards

On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS)
<[EMAIL PROTECTED]> wrote:
> Thanks everyone for the contributions.
>
> Ultimately, our purpose is to process documents from the site into our
> search database, so probably the most important thing is to limit the
> number of files being processed.  The case of  the URLs in the html
> probably wouldn't cause us much concern, but I could see that it might
> be useful to "convert" a site for mirroring from a non-case sensitive
> (windows) environment to a case sensitive (li|u)nix one - this would
> need to include translation of urls in content as well as filenames on
> disk.
>
> In the meantime - does anyone know of a proxy server that could
> translate urls from mixed case to lower case.  I thought that if we
> downloaded using wget via such a proxy server we might get the
> appropriate result.
>
> The other alternative we were thinking of was to post process the files
> with symlinks for all mixed case versions of files and directories (I
> think someone already suggested this - great minds and all that...). I
> assume that wget would correctly use the symlink to determine the
> time/date stamp of the file for determining if it requires updating (or
> would it use the time/date stamp of the symlink?). I also assume that if
> wget downloaded the file it would overwrite the symlink and we would
> have to run our "convert files to" symlinks process again.
>
> Just to put it in perspective, the actual site is approximately 45gb
> (that's what the administrator said) and wget downloaded > 100gb
> (463,000 files) when I did the first process.
>
> Cheers
> Allan
>
> -Original Message-
> From: Micah Cowan [mailto:[EMAIL PROTECTED]
> Sent: Saturday, 14 June 2008 7:30 AM
> To: Tony Lewis
> Cc: Coombe, Allan David (DPS); 'Wget'
> Subject: Re: Wget 1.11.3 - case sensitivity and URLs
>
>
>
> Tony Lewis wrote:
>> Micah Cowan wrote:
>>
>>> Unfortunately, nothing really comes to mind. If you'd like, you could
>
>>> file a feature request at
>>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>
>>> asking Wget to treat URLs case-insensitively.
>>
>> To have the effect that Allan seeks, I think the option would have to
>> convert all URIs to lower case at an appropriate point in the process.
>
>> I think you probably want to send the original case to the server
>> (just in case it really does matter to the server). If you're going to
>
>> treat different case URIs as matching then the lower-case version will
>
>> have to be stored in the hash. The most important part (from the
>> perspective that Allan voices) is that the versions written to disk
>> use lower case characters.
>
> Well, that really depends. If it's doing a straight recursive download,
> without preexisting local files, then all that's really necessary is to
> do lookups/stores in the blacklist in a case-normalized manner.
>
> If preexisting files matter, then yes, your solution would fix it.
> Another solution would be to scan directory contents for the first name
> that matches case insensitively. That's obviously much less efficient,
> but has the advantage that the file will match at least one of the
> "real" cases from the server.
>
> As Matthias points out, your lower-case normalization solution could be
> achieved in a more general manner with a hook. Which is something I was
> planning on introducing perhaps in 1.13 anyway (so you could, say, run
> sed on the filenames before Wget uses them), so that's probably the
> approach I'd take. But probably not before 1.13, even if someone
> provides a patch for it in time for 1.12 (too many other things to focus
> on, and I'd like to introduce the "external command" hooks as a suite,
> if possible).
>
> OTOH, case normalization in the blacklists would still be useful, in
> addition to that mechanism. Could make another good addition for 1.13
> (because it'll be more useful in combination with the rename hooks).
>
> --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer,
> and GNU Wget Project Maintainer.
> http://micah.cowan.name/
>



-- 
-mmw


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread Coombe, Allan David (DPS)
Thanks everyone for the contributions.

Ultimately, our purpose is to process documents from the site into our
search database, so probably the most important thing is to limit the
number of files being processed.  The case of the URLs in the html
probably wouldn't cause us much concern, but I could see that it might
be useful to "convert" a site for mirroring from a non-case sensitive
(windows) environment to a case sensitive (li|u)nix one - this would
need to include translation of urls in content as well as filenames on
disk.

In the meantime - does anyone know of a proxy server that could
translate urls from mixed case to lower case.  I thought that if we
downloaded using wget via such a proxy server we might get the
appropriate result.  

The other alternative we were thinking of was to post process the files
with symlinks for all mixed case versions of files and directories (I
think someone already suggested this - great minds and all that...). I
assume that wget would correctly use the symlink to determine the
time/date stamp of the file for determining if it requires updating (or
would it use the time/date stamp of the symlink?). I also assume that if
wget downloaded the file it would overwrite the symlink and we would
have to run our "convert files to" symlinks process again.
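
(A rough sketch of that post-processing step, assuming GNU userland and
a mirror rooted at ./mirror; existing names are left alone:

  find mirror -name '*[A-Z]*' | while read -r p; do
    d=$(dirname "$p"); b=$(basename "$p")
    l=$(printf '%s' "$b" | tr 'A-Z' 'a-z')
    [ -e "$d/$l" ] || ln -s "$b" "$d/$l"   # lowercase alias beside original
  done
)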

Just to put it in perspective, the actual site is approximately 45gb
(that's what the administrator said) and wget downloaded > 100gb
(463,000 files) when I did the first process.

Cheers
Allan

-Original Message-
From: Micah Cowan [mailto:[EMAIL PROTECTED] 
Sent: Saturday, 14 June 2008 7:30 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensitivity and URLs



Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> Unfortunately, nothing really comes to mind. If you'd like, you could
>> file a feature request at
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>> asking Wget to treat URLs case-insensitively.
> 
> To have the effect that Allan seeks, I think the option would have to
> convert all URIs to lower case at an appropriate point in the process.
> I think you probably want to send the original case to the server
> (just in case it really does matter to the server). If you're going to
> treat different case URIs as matching then the lower-case version will
> have to be stored in the hash. The most important part (from the
> perspective that Allan voices) is that the versions written to disk
> use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
"real" cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook. Which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the "external command" hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Micah Cowan

Ryan Schmidt wrote:
> For example, if you want American English, set LANG to en_US.
> 
> In the Bash shell, you can type "export LANG=en_US"
> 
> In the Tcsh shell, you can type "setenv LANG en_US"
> 
> To find out which shell you use, type "echo $SHELL"

FYI: It's not in any current release, but current mainline has support
for the special "en@boldquot" value for LANGUAGE (you still may need to
set LANG=en_US or something). This causes all quoted strings to be
rendered in boldface, using terminal escape sequences. I've found it
pleasant to use that setting for my own purposes.

The "en@quot" LANGUAGE setting is also supported (converts to proper
left/right-quotemarks, but no terminal sequences); but I've rigged
"LANG=en_US" to have the same effect (en@quot.po is copied to en_US.po).

Again, this is only in the mainline repo, and not in any release.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Ryan Schmidt

On Jun 18, 2008, at 5:17 AM, Stefan Nowak wrote:

> where do I set the locale of the CLI environment of MacOSX?

You should set the LANG environment variable to the desired locale,
and one which is supported on your system; you can look at the
directories in /usr/share/locale to see what locales are available.

For example, if you want American English, set LANG to en_US.

In the Bash shell, you can type "export LANG=en_US"

In the Tcsh shell, you can type "setenv LANG en_US"

To find out which shell you use, type "echo $SHELL"



Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Valentin
Dear Stefan,

If you take a look at the source of the page, you'll see this:

<meta name="robots" content="index,nofollow">

Simply add "-e robots=off" to your arguments and wget will ignore any
robots.txt files or tags. With that it should download everything you
want. (I did not find this myself, credits go to sxav for pointing this
out. ;)
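
(Stefan's original command with that flag added, as a sketch:

  wget -e robots=off --page-requisites --html-extension \
    --convert-links --span-hosts --no-check-certificate \
    https://help.ubuntu.com/community/MacBookPro/
)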

Cheers,

Valentin


-- 
The last time someone listened to a Bush, a bunch of people wandered in
the desert for 40 years.


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-16 Thread mm w
On Sat, Jun 14, 2008 at 4:30 PM, Tony Lewis <[EMAIL PROTECTED]> wrote:
> mm w wrote:
>
>> Hi, after all it's only my point of view :D
>> anyway,
>>
>> "/dir/file",
>> "dir/File", non-standard
>> "Dir/file", non-standard
>> and "/Dir/File" non-standard
>
> According to RFC 2396: The path component contains data, specific to the 
> authority (or the scheme if there is no authority component), identifying the 
> resource within the scope of that scheme and authority.
>
> In other words, those names are well within the standard when the server 
> understands them. As far as I know, there is nothing in Internet standards 
> restricting mixed case paths.
>
:) read again, nobody does except some punk-head folks

>> that's it, if the server manages non-standard URL, it's not my
>> concern, for me it doesn't exist
>
> Oh. I see. You're writing to say that wget should only implement features 
> that are meaningful to you. Thanks for your narcissistic input.

No, I'm not such a jerk. A simple grep/sed on the website source to
remove the offending URLs should be fine, or an HTTP redirection for
when a non-standard URL is requested.

On the other hand, if wget changed every link to lowercase, some people
would have the opposite problem. A golden rule: never distribute
mixed-case URLs (to your users); as a simple courtesy to them, keep
everything in lower case.

>
> Tony
>
>



-- 
-mmw


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-14 Thread Tony Lewis
mm w wrote:

> Hi, after all it's only my point of view :D
> anyway,
> 
> "/dir/file",
> "dir/File", non-standard
> "Dir/file", non-standard
> and "/Dir/File" non-standard

According to RFC 2396: The path component contains data, specific to the 
authority (or the scheme if there is no authority component), identifying the 
resource within the scope of that scheme and authority.

In other words, those names are well within the standard when the server 
understands them. As far as I know, there is nothing in Internet standards 
restricting mixed case paths.

> that's it; if a server manages non-standard URLs, that's not my
> concern: for me it doesn't exist

Oh. I see. You're writing to say that wget should only implement features that 
are meaningful to you. Thanks for your narcissistic input.

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread mm w
Hi, after all it's only my point of view :D
anyway,

"/dir/file",
"dir/File", non-standard
"Dir/file", non-standard
and "/Dir/File" non-standard

that's it; if a server manages non-standard URLs, that's not my
concern: for me it doesn't exist


On Fri, Jun 13, 2008 at 3:12 PM, Tony Lewis <[EMAIL PROTECTED]> wrote:
> mm w wrote:
>
>> standard: URLs are case-insensitive
>>
>> you can adapt your software because some people don't respect the
>> standard; we are not in the '90s anymore. Let people doing crappy
>> things deal with their crappy world.
>
> You obviously missed the point of the original posting: how can one 
> conveniently mirror a site whose server uses case insensitive names onto a 
> server that uses case sensitive names.
>
> If the original site has the URI strings "/dir/file", "dir/File", "Dir/file", 
> and "/Dir/File", the same local file will be returned. However, wget will 
> treat those as unique directories and files and you wind up with four copies.
>
> Allan asked if there is a way to have wget just create one copy and proposed 
> one way that might accomplish that goal.
>
> Tony
>
>



-- 
-mmw


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Tony Lewis
Steven M. Schweda wrote:

> From Tony Lewis:
> > To have the effect that Allan seeks, I think the option would have to
> > convert all URIs to lower case at an appropriate point in the process.

>   I think that that's the wrong way to look at it.  Implementation
> details like name hashing may also need to be adjusted, but this
> shouldn't be too hard.

OK. How would you normalize the names?

Tony



RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Tony Lewis
mm w wrote:

> standard: URLs are case-insensitive
>
> you can adapt your software because some people don't respect the
> standard; we are not in the '90s anymore. Let people doing crappy
> things deal with their crappy world.

You obviously missed the point of the original posting: how can one 
conveniently mirror a site whose server uses case insensitive names onto a 
server that uses case sensitive names.

If the original site has the URI strings "/dir/file", "dir/File", "Dir/file", 
and "/Dir/File", the same local file will be returned. However, wget will treat 
those as unique directories and files and you wind up with four copies.

Allan asked if there is a way to have wget just create one copy and proposed 
one way that might accomplish that goal.

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Steven M. Schweda
   In the VMS world, where file name case may matter, but usually
doesn't, the normal scheme is to preserve case when creating files, but
to do case-insensitive comparisons on file names.

From Tony Lewis:

> To have the effect that Allan seeks, I think the option would have to
> convert all URIs to lower case at an appropriate point in the process.

   I think that that's the wrong way to look at it.  Implementation
details like name hashing may also need to be adjusted, but this
shouldn't be too hard.



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street(+1) 651-699-9818
   Saint Paul  MN  55105-2547


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> Unfortunately, nothing really comes to mind. If you'd like, you
>> could file a feature request at 
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an
>> option asking Wget to treat URLs case-insensitively.
> 
> To have the effect that Allan seeks, I think the option would have to
> convert all URIs to lower case at an appropriate point in the
> process. I think you probably want to send the original case to the
> server (just in case it really does matter to the server). If you're
> going to treat different case URIs as matching then the lower-case
> version will have to be stored in the hash. The most important part
> (from the perspective that Allan voices) is that the versions written
> to disk use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.
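
A minimal sketch of the case-normalized key idea (a hypothetical
helper, not actual Wget code; assumes POSIX strdup):

#include <ctype.h>
#include <string.h>

/* Hypothetical helper: return a lower-cased copy of PATH for use as
   a blacklist hash key.  Only the key is folded; the URL sent to the
   server keeps its original case.  Caller frees.  */
static char *
blacklist_key (const char *path)
{
  char *key = strdup (path);
  char *p;
  if (key)
    for (p = key; *p; p++)
      *p = tolower ((unsigned char) *p);
  return key;
}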

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
"real" cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook. Which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the "external command" hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
nVYivipui+0TRmmK04kD2JE=
=OMsD
-END PGP SIGNATURE-


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread mm w
standard: URLs are case-insensitive

you can adapt your software because some people don't respect the
standard; we are not in the '90s anymore. Let people doing crappy
things deal with their crappy world.

Cheers!

On Fri, Jun 13, 2008 at 2:08 PM, Tony Lewis <[EMAIL PROTECTED]> wrote:
> Micah Cowan wrote:
>
>> Unfortunately, nothing really comes to mind. If you'd like, you could
>> file a feature request at
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>> asking Wget to treat URLs case-insensitively.
>
> To have the effect that Allan seeks, I think the option would have to convert 
> all URIs to lower case at an appropriate point in the process. I think you 
> probably want to send the original case to the server (just in case it really 
> does matter to the server). If you're going to treat different case URIs as 
> matching then the lower-case version will have to be stored in the hash. The 
> most important part (from the perspective that Allan voices) is that the 
> versions written to disk use lower case characters.
>
> Tony
>
>



-- 
-mmw


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Tony Lewis
Micah Cowan wrote:

> Unfortunately, nothing really comes to mind. If you'd like, you could
> file a feature request at
> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
> asking Wget to treat URLs case-insensitively.

To have the effect that Allan seeks, I think the option would have to convert 
all URIs to lower case at an appropriate point in the process. I think you 
probably want to send the original case to the server (just in case it really 
does matter to the server). If you're going to treat different case URIs as 
matching then the lower-case version will have to be stored in the hash. The 
most important part (from the perspective that Allan voices) is that the 
versions written to disk use lower case characters.

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-12 Thread Matthias Vill

Hi list!

sadly I couldn't find the e-mail address of Allan (maybe because I'm
attached via the news gateway) so this is a list-only post.


Micah Cowan wrote:

> Hi Allan,
>
> You'll generally get better results if you post to the mailing list
> (wget@sunsite.dk). I've added it to the recipients list.
>
> Coombe, Allan David (DPS) wrote:
>
>> Hi Micah,
>>
>> First some context…
>>
>> We are using wget 1.11.3 to mirror a web site so we can do some offline
>> processing on it.  The mirror is on a Solaris 10 x86 server.
>>
>> The problem we are getting appears to be because the URLs in the HTML
>> pages that are harvested by wget for downloading have mixed case (the
>> site we are mirroring is running on a Windows 2000 server using IIS) and
>> the directory structure created on the mirror has 'duplicate'
>> directories because of the mixed case.
>>
>> For example, the URLs in HTML pages /Senate/committees/index.htm and
>> /senate/committees/index.htm refer to the same file but wget creates 2
>> different directory structures on the mirror site for these URLs.


OK... at this point I need to ask whether you want to mirror or just
back up the site.
The main problem is easy: the moment you want a working mirror, you
either need those mixed-case files or have to rewrite the URLs to a
unique casing.
It seems most practical to introduce a hook like --restrict-file-names
to modify the name of the local copy and the links inside the
downloaded files in the same way.
Another option is to create symlinks for the different directory cases;
that would save half the overhead, I guess.

To create such a symlink structure you could use the output of:

  find /mirror/basedir -type d | sort -f

Hope that helps.

Matthias


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-11 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Allan,

You'll generally get better results if you post to the mailing list
(wget@sunsite.dk). I've added it to the recipients list.

Coombe, Allan David (DPS) wrote:
> Hi Micah,
> 
> First some context…
> We are using wget 1.11.3 to mirror a web site so we can do some offline
> processing on it.  The mirror is on a Solaris 10 x86 server.
> 
> The problem we are getting appears to be because the URLs in the HTML
> pages that are harvested by wget for downloading have mixed case (the
> site we are mirroring is running on a Windows 2000 server using IIS) and
> the directory structure created on the mirror has 'duplicate'
> directories because of the mixed case.
> 
> For example,  the URLs in HTML pages /Senate/committees/index.htm and
> /senate/committees/index.htm refer to the same file but wget creates 2
> different directory structures on the mirror site for these URLs.
> 
> This appears to be a fairly basic thing, but we can't see any wget
> options that allow us to treat URLs case-insensitively.
> 
> We don't really want to post-process the site just to merge the files
> and directories with different case.

Unfortunately, nothing really comes to mind. If you'd like, you could
file a feature request at
https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
asking Wget to treat URLs case-insensitively. Finding local files
case-insensitively, on a case-sensitive filesystem, would be a PITA; but
adding and looking up URLs in the internal blacklist hash wouldn't be
too hard. I probably wouldn't get to that for a while, though.

Another useful option might be to change the name of "index" files, so
that, for instance, you could have URLs like http://foo/ result in
"foo/index.htm" or "foo/default.html", rather than "foo/index.html".

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUG937M8hyUobTrERAqq2AJ48mGvcFCSxnouTFqYTuRHzVgwYdgCeLegI
vkdzf3Lu+Vn5diCOHk5CRhc=
=IlG9
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-04 Thread Alain Guibert
 On Thursday, April 3, 2008 at 9:14:52 -0700, Micah Cowan wrote:

> Are you certain you rebuilt cmpt.o? This seems pretty unlikely, to me.

Certain: make test after touching src/sysdep.h rebuilds both cmpt.o
files, the normal one in src/ and the one in tests/. And both cmpt.o
files become 784 bytes bigger without SYSTEM_FNMATCH.


Alain.


Re: wget 1.11.1 make test fails

2008-04-04 Thread Alain Guibert
 On Thursday, April 3, 2008 at 22:37:41 +0200, Hrvoje Niksic wrote:

> Or it could be that you're picking up a different fnmatch.h that sets
> up a different value for FNM_PATHNAME.  Do you have more than one
> "fnmatch.h" installed on your system?

I have only /usr/include/fnmatch.h installed, identical to the file in
the libc-5.4.33 tarball, and defining the same values as wget's
src/sysdep.h (even comments are identical). Just "my" fnmatch.h defines
two more flags, FNM_LEADING_DIR=8 and FNM_CASEFOLD=16, and defines an
FNM_FILE_NAME alias (commented as "Preferred GNU name") to
FNM_PATHNAME=1 (the libc code uses only this alias). Anyway I had
noticed your comment about incompatible headers, and double-checked your
little test program also with explicit value 1: same results.


BTW everybody should be able to reproduce the make test failure, on any
system, just by #undefining SYSTEM_FNMATCH in src/sysdep.h


Alain.


Re: wget 1.11.1 make test fails

2008-04-04 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hrvoje Niksic wrote:
> Alain Guibert <[EMAIL PROTECTED]> writes:
> 
>>> Maybe you could put a breakpoint in fnmatch and see what goes wrong?
>> The for loop intended to eat several characters from the string also
>> advances the pattern pointer. This one reaches the end of the pattern,
>> and points to a NUL. It is not a '*' anymore, so the loop exits
>> prematurely. Just below, a test for NUL returns 0.
> 
> Thanks for the analysis.  Looking at the current fnmatch code in
> gnulib, it seems that the fix is to change that NUL test to something
> like:
> 
>   if (c == '\0')
> {
>   /* The wildcard(s) is/are the last element of the pattern.
>  If the name is a file name and contains another slash
>  this means it cannot match. */
>   int result = (flags & FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
>   if (flags & FNM_PATHNAME)
> {
>   if (!strchr (n, '/'))
> result = 0;
> }
>   return result;
> }
> 
> But I'm not at all sure that it covers all the needed cases.

I'm thinking not: the loop still shouldn't be incrementing n, since that
forces each additional * to match at least one character, doesn't it?
Gnulib's version seems to handle that better.

> Maybe we
> should simply switch to gnulib-provided fnmatch?  Unfortunately that
> one is quite complex, and hard to adapt for the '**' extension Micah
> envisions.  There might be other fnmatch implementations out there in
> GNU which are debugged but still simpler than the gnulib/glibc one.

Maybe. I'm not sure ** would be too hard to add to gnulib's fnmatch;
we'd just have to adjust the FNM_FILE_NAME tests within the '*' case
when we see an immediate second '*'. But maybe ** as part of a *?**?
sequence is more complex. I don't think so, though.

The main thing is that we need it to support the invalid sequence stuff.

Hm; I'm not sure we'll ever want fnmatch() to be locale-aware, though.
User-specified match patterns should interpret characters based on the
locale; but the source strings may be in different encodings altogether.
If we solve this by transcoding to the current locale, we may find that
the user's locale doesn't support all of the characters that the
original string's encoding does. Probably we'll need to transcode both
to Unicode before comparison.

In the meantime, though, I think we want a simple byte-by-byte match.
Perhaps it's best to (a) use our custom matcher, ignoring the system's
(so we don't get locale specialness), and (b) fix it, providing as
thorough test coverage as possible.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9jWi7M8hyUobTrERAglwAKCDnpnDjr44Ovgh/oBuzkM4mu/gKACeNnN8
arvFSrCEBatNeO29fzHxuU4=
=QDMp
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-04 Thread Hrvoje Niksic
Alain Guibert <[EMAIL PROTECTED]> writes:

>> Maybe you could put a breakpoint in fnmatch and see what goes wrong?
>
> The for loop intended to eat several characters from the string also
> advances the pattern pointer. This one reaches the end of the pattern,
> and points to a NUL. It is not a '*' anymore, so the loop exits
> prematurely. Just below, a test for NUL returns 0.

Thanks for the analysis.  Looking at the current fnmatch code in
gnulib, it seems that the fix is to change that NUL test to something
like:

  if (c == '\0')
{
  /* The wildcard(s) is/are the last element of the pattern.
 If the name is a file name and contains another slash
 this means it cannot match. */
  int result = (flags & FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
  if (flags & FNM_PATHNAME)
{
  if (!strchr (n, '/'))
result = 0;
}
  return result;
}

But I'm not at all sure that it covers all the needed cases.  Maybe we
should simply switch to gnulib-provided fnmatch?  Unfortunately that
one is quite complex, and hard to adapt for the '**' extension Micah
envisions.  There might be other fnmatch implementations out there in
GNU which are debugged but still simpler than the gnulib/glibc one.


It's kind of ironic that while the various system fnmatches were
considered broken, the one Wget was using (for many years
unconditionally!) was also broken.


Re: wget 1.11.1 make test fails

2008-04-04 Thread Hrvoje Niksic
Alain Guibert <[EMAIL PROTECTED]> writes:

>  On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote:
>
>> Micah Cowan <[EMAIL PROTECTED]> writes:
>>> It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME
>
> The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code
> apparently intending to return FNM_NOMATCH on a slash. But this code
> seems to be rather broken.

Or it could be that you're picking up a different fnmatch.h that sets
up a different value for FNM_PATHNAME.  Do you have more than one
"fnmatch.h" installed on your system?


fnmatch [Re: wget 1.11.1 make test fails]

2008-04-03 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alain Guibert wrote:
> The for loop intended to eat several characters from the string also
> advances the pattern pointer. This one reaches the end of the pattern,
> and points to a NUL. It is not a '*' anymore, so the loop exits
> prematurely. Just below, a test for NUL returns 0.
> 
> The body of the loop, returning FNM_NOMATCH on a slash, is not executed
> at all. That isn't moderately broken, is it?

I haven't stepped through it, but it sure looks broken to my eyes too. I
am tired at the moment, though, so may be missing something.

GNUlib has an fnmatch, which might be worth considering for use; but
AIUI it suffers from the same overly-locale-aware problem that system
fnmatches can suffer from (fnmatch fails when the string isn't encoded
properly for the current locale; we often don't even _know_ the original
encoding, especially for FTP, and mainly want * to match any arbitrary
string of byte values). They were looking for someone to address that issue:

http://lists.gnu.org/archive/html/bug-gnulib/2008-02/msg00019.html

Perhaps, if I'm motivated and somehow scrounge the time, I can fix the
problem in their code, and then use it in ours? :)

Or, if someone else with more time would like to tackle it, I'm sure
that'd also be welcome. :)

I responded to the message linked above with a note that Wget also had a
need for such functionality, along with some questions about the
approach, but hadn't received a response. Maybe I'll try again.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9XBy7M8hyUobTrERAtReAJ94Ac0ClInQOE7qq7OQxon87zj7JACeOTz3
Lfafi0U2phRDnFqQ2IPSx+s=
=9yU/
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-03 Thread Alain Guibert
 On Thursday, April 3, 2008 at 11:08:27 +0200, Hrvoje Niksic wrote:

> Well, it would point to a problem with both the fnmatch replacement
> and the older system fnmatch.  "Our" fnmatch (coming from an old
> release of Bash

The fnmatch()es in libc 5.4.33 and in Wget are twins. They differ on
some minor details like FNM_CASEFOLD support, and cosmetic things like
parentheses around return(code). The part dealing with '*' in the
pattern is functionally identical.


> Maybe you could put a breakpoint in fnmatch and see what goes wrong?

The for loop intended to eat several characters from the string also
advances the pattern pointer. This one reaches the end of the pattern,
and points to a NUL. It is not a '*' anymore, so the loop exits
prematurely. Just below, a test for NUL returns 0.

The body of the loop, returning FNM_NOMATCH on a slash, is not executed
at all. That isn't moderately broken, is it?
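
For readers without the source handy, the '*' case in question looks
roughly like this (paraphrased from the Bash-derived code; not an
exact quote):

    case '*':
      /* Meant to consume string characters while collapsing a run of
         wildcards, rejecting '/' under FNM_PATHNAME ...  */
      for (c = *p++; c == '?' || c == '*'; c = *p++, ++n)
        if (((flags & FNM_PATHNAME) && *n == '/')
            || (c == '?' && *n == '\0'))
          return FNM_NOMATCH;

      /* ... but with a trailing '*' (e.g. "foo*" vs "foo/bar") the
         loop steps past the pattern's end without executing its body
         even once, so the '/' test is skipped and the NUL test below
         reports a match.  */
      if (c == '\0')
        return 0;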


Alain.


Re: wget fails using proxy with https-protocol

2008-04-03 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Micah Cowan wrote:
> The log shows that:
> 
>   1. Wget still doesn't wait for the Proxy to ask for authentication,
> before sending Proxy-Authorization headers with its first request.
>   2. Apparently, when going through a proxy, Wget now correctly waits to
> receive a challenge from the destination server (as I intended), but
> then _doesn't_ respond to the challenge with an Authorization header,
> instead just treating the (first) 401 as a final header.

Slava, could you perhaps download and install Wget 1.11.1, and try it
with the --auth-no-challenge option? That was added to support a case
when there was a genuine need for Wget's older, less secure
authentication behavior; it's intended to disable the new behavior. It
may or may not fix your problem, and I'd be interested to know which it
is. :)
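
For instance (reusing the command from your earlier report):

  wget --auth-no-challenge --no-check-certificate https://_something_.com/_some_file_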

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9QzK7M8hyUobTrERAuOUAJ4ygaAyhihkeM/tG0j7hMexnHJZwwCeKhzi
r3OHfZk8bDZu0DnQljyP7vU=
=6i/0
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-03 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alain Guibert wrote:
> Hello Hrvoje,
> 
>  On Wednesday, April 2, 2008 at 12:51:20 +0200, Hrvoje Niksic wrote:
> 
>> Alain Guibert <[EMAIL PROTECTED]> writes:
>>> The only failing src/utils.c test_array[] line is:
>>> | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false },
>> Try #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.
> 
> This old system does HAVE_WORKING_FNMATCH_H (and thus SYSTEM_FNMATCH).
> When #undefining SYSTEM_FNMATCH, the test still fails at the very same
> line. And then it also fails on modern systems. I guess this points at
> the embedded src/cmpt.c:fnmatch() replacement?

Are you certain you rebuilt cmpt.o? This seems pretty unlikely, to me.

> That also demonstrates the major value of test suites. Who would have
> noticed the runtime consequences of such an obscure libc problem
> otherwise? Well done, Micah!

Heh, thanks. However, I haven't done much yet with testsuites, despite
really really wanting to. In this case, I just added two or three lines
to a test that Mauro had written, when I noticed that none of the tests
were against slashes or strange-ish characters. Guess that was a pretty
lucky addition, then!

- --
Coincidence? Or proof that God exists,
and wants me to find Wget bugs? :)

Micah J. Cowan
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9QJ87M8hyUobTrERAsqQAJsFjRfUjjCo63Srs2XbuRBBMVJVQgCfTwZU
/sp4Vz8QnIV3I3W3/D6Mgq8=
=drfg
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-03 Thread Alain Guibert
 On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote:

> Micah Cowan <[EMAIL PROTECTED]> writes:
>> It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME

The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code
apparently intending to return FNM_NOMATCH on a slash. But this code
seems to be rather broken.


>| printf("%d\n", fnmatch("foo*", "foo/bar", FNM_PATHNAME));
> It should print a non-zero value.

Zero on the old system, FNM_NOMATCH on a recent one.


Alain.


Re: wget 1.11.1 make test fails

2008-04-03 Thread Hrvoje Niksic
Alain Guibert <[EMAIL PROTECTED]> writes:

> This old system does HAVE_WORKING_FNMATCH_H (and thus
> SYSTEM_FNMATCH).  When #undefining SYSTEM_FNMATCH, the test still
> fails at the very same line. And then it also fails on modern
> systems. I guess this points at the embedded src/cmpt.c:fnmatch()
> replacement?

Well, it would point to a problem with both the fnmatch replacement
and the older system fnmatch.  "Our" fnmatch (coming from an old
release of Bash, but otherwise very well-tested, both in Bash and
Wget) is careful to special-case '/' only if FNM_PATHNAME is
specified.

Maybe you could put a breakpoint in fnmatch and see what goes wrong?


Re: wget 1.11.1 make test fails

2008-04-03 Thread Alain Guibert
Hello Hrvoje,

 On Wednesday, April 2, 2008 at 12:51:20 +0200, Hrvoje Niksic wrote:

> Alain Guibert <[EMAIL PROTECTED]> writes:
>> The only failing src/utils.c test_array[] line is:
>> | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false },
> Try #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.

This old system does HAVE_WORKING_FNMATCH_H (and thus SYSTEM_FNMATCH).
When #undefining SYSTEM_FNMATCH, the test still fails at the very same
line. And then it also fails on modern systems. I guess this points at
the embedded src/cmpt.c:fnmatch() replacement?

That also demonstrates the major value of test suites. Who would have
noticed the runtime consequences of such an obscure libc problem
otherwise? Well done, Micah!


Alain.


Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic
Micah Cowan <[EMAIL PROTECTED]> writes:

> I'm wondering whether it might make sense to go back to completely
> ignoring the system-provided fnmatch?

One argument against that approach is that it increases code size on
systems that do correctly implement fnmatch, i.e. on most modern
Unixes that we are targeting.  Supporting I18N file names would
require modifications to our fnmatch; but on the other hand, we still
need it for Windows, so we'd have to make those changes anyway.

Providing added value in our fnmatch implementation should go a long
way towards preventing complaints of code bloat.

> In particular, it would probably resolve the remaining issue with
> that one bug you reported about fnmatch() failing on strings whose
> encoding didn't match the locale.

It would.

> Additionally, I've been toying with the idea of adding something
> like a "**" to match all characters, including slashes.

That would be great.  That kind of thing is known to zsh users anyway,
and it's a useful feature.


Re: wget 1.11.1 make test fails

2008-04-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
> 
>>> It sounds like a libc problem rather than a gcc problem.  Try
>>> #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.
>> It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I
>> mean, don't most shells rely on this to handle file globbing and
>> whatnot?
> 
> The conventional wisdom among free software of the 90s was that
> fnmatch() was too buggy to be useful.  For that reason all free shells
> rolled their own fnmatch, as did other programs that needed it,
> including Wget.  Maybe the conventional wisdom was right for the
> reporter's system.
> 
> Another possibility is that something else is installing fnmatch.h in
> a directory on the compiler's search path and breaking the system
> fnmatch.  IIRC Apache was a known culprit that installed fnmatch.h in
> /usr/local/include.  That was another reason why Wget used to
> completely ignore system-provided fnmatch.

I'm wondering whether it might make sense to go back to completely
ignoring the system-provided fnmatch? In particular, it would probably
resolve the remaining issue with that one bug you reported about
fnmatch() failing on strings whose encoding didn't match the locale.

Additionally, I've been toying with the idea of adding something like a
"**" to match all characters, including slashes. There was a user who
had trouble using wildcards to match any directory whose name was (as in
the problem example here) "!COMPLETE". At the time I wasn't fully
certain that it wasn't a bug in Wget; as I understand it now, in order
to match _any_ directory !COMPLETE, you'd have to be sure to exclude
"!COMPLETE", "*/!COMPLETE", "*/*/!COMPLETE", etc. I'm not sure if it's
original there, but Vim uses a ** pattern, so that you could simply
write "**!COMPLETE" (or, if you wanted to be more correct I suppose,
just "!COMPLETE" and "**/!COMPLETE").

What do you think?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFH8/zx7M8hyUobTrERAtpoAJiQHrzVjFwKXxEjteqMGAGgBCMgAJ9rZIah
k+92ivTBGpSsmHcLnlsjfQ==
=JLn9
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic
Micah Cowan <[EMAIL PROTECTED]> writes:

>> It sounds like a libc problem rather than a gcc problem.  Try
>> #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.
>
> It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I
> mean, don't most shells rely on this to handle file globbing and
> whatnot?

The conventional wisdom among free software of the 90s was that
fnmatch() was too buggy to be useful.  For that reason all free shells
rolled their own fnmatch, as did other programs that needed it,
including Wget.  Maybe the conventional wisdom was right for the
reporter's system.

Another possibility is that something else is installing fnmatch.h in
a directory on the compiler's search path and breaking the system
fnmatch.  IIRC Apache was a known culprit that installed fnmatch.h in
/usr/local/include.  That was another reason why Wget used to
completely ignore system-provided fnmatch.

In any case, it should be easy enough to isolate the problem:

#include <fnmatch.h>
#include <stdio.h>
int main()
{
  printf("%d\n", fnmatch("foo*", "foo/bar", FNM_PATHNAME));
  return 0;
}

It should print a non-zero value.


Re: wget fails using proxy with https-protocol

2008-04-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Micah Cowan wrote:
> Julien, I've CC'd you, in case you think this might be something you'd
> want to add to your GSoC proposal. If it _is_, it's probably something
> that should be done before the rest, so I can backport it into the 1.11
> branch for a 1.11.2 release (since this is an important regression),
> rather than make people wait for 1.12 to come out (which is where I
> expect the rest of the authorization improvements would go).

Er, on reflection, that's a terrible idea, given that coding for GSoC
doesn't even start until nearly June, and this is a serious regression
that should be fixed as soon as it can be got to.

Still, if you'd like to tackle it out-of-band, that'd be handy. :)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH87mY7M8hyUobTrERAuyjAJ0XJ8ImAFZ/J49EGQlc+HWWNdxhQACgiK3U
bgyhQErH//V6bDkaeE9mLYM=
=3fn1
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hrvoje Niksic wrote:
> Alain Guibert <[EMAIL PROTECTED]> writes:
> 
>> Hello Micah,
>>
>>  On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:
>>
>>> could you try to isolate which part of test_dir_matches_p is failing?
>> The only failing src/utils.c test_array[] line is:
>>
>> | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false },
>>
>> I don't understand enough of dir_matches_p() and fnmatch() to guess
>> what is supposed to happen. But with false replaced by true, this
>> test and following succeed.
> 
> '*' is not supposed to match '/' in regular fnmatch.

Well, that's assuming you pass it the FNM_PATHNAME flag (which, for
dir_matches_p, we always do).

> It sounds like a libc problem rather than a gcc problem.  Try
> #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.

It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I
mean, don't most shells rely on this to handle file globbing and whatnot?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH86+L7M8hyUobTrERApHKAJsFbO8+PtAqFhHJ2Psv1AuKSy17YwCcDsi2
9WHcJ0Pzkc4XmNbcEUCXf6U=
=r8ZV
-END PGP SIGNATURE-


Re: wget fails using proxy with https-protocol

2008-04-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I could've sworn I sent my response to this, to the list, but I must
have accidentally hit "Reply" instead of "Reply All".

Slava Grig wrote:
>Hello!
> 
> I try to set the environment variable https_proxy and download something through
> 
> wget --no-check-certificate https://_something_.com/_some_file_
> 
> but can't. Wget does not initiate an SSL connection to the proxy but
> instead connects directly to the target host.

This bit was wrong.

I asked for --debug information to be sent privately (just in case) and
explained how to ensure that his password doesn't show up Base64'd in
the Authorization headers and what not.

The log shows that:

  1. Wget still doesn't wait for the Proxy to ask for authentication,
before sending Proxy-Authorization headers with its first request.
  2. Apparently, when going through a proxy, Wget now correctly waits to
receive a challenge from the destination server (as I intended), but
then _doesn't_ respond to the challenge with an Authorization header,
instead just treating the (first) 401 as a final header.

Julien, I've CC'd you, in case you think this might be something you'd
want to add to your GSoC proposal. If it _is_, it's probably something
that should be done before the rest, so I can backport it into the 1.11
branch for a 1.11.2 release (since this is an important regression),
rather than make people wait for 1.12 to come out (which is where I
expect the rest of the authorization improvements would go).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH86zl7M8hyUobTrERAvhoAJ0UY8cu9OtvVwIG7XDxCm0RPdfFZgCfVvSt
NEeHVBhey76f2KdewsZAdds=
=8ZUM
-END PGP SIGNATURE-


Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic
Alain Guibert <[EMAIL PROTECTED]> writes:

> Hello Micah,
>
>  On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:
>
>> could you try to isolate which part of test_dir_matches_p is failing?
>
> The only failing src/utils.c test_array[] line is:
>
> | { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false },
>
> I don't understand enough of dir_matches_p() and fnmatch() to guess
> what is supposed to happen. But with false replaced by true, this
> test and the following ones succeed.

'*' is not supposed to match '/' in regular fnmatch.

It sounds like a libc problem rather than a gcc problem.  Try
#undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.


Re: wget 1.11.1 make test fails

2008-04-02 Thread Alain Guibert
Hello Micah,

 On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:

> could you try to isolate which part of test_dir_matches_p is failing?

The only failing src/utils.c test_array[] line is:

| { { "*COMPLETE", NULL, NULL }, "foo/!COMPLETE", false },

I don't understand enough of dir_matches_p() and fnmatch() to guess what
is supposed to happen. But with false replaced by true, this test and
the following ones succeed.

| ALL TESTS PASSED
| Tests run: 7

Of course this test then fails on newer systems.


Alain.


Re: wget 1.11.1 make test fails

2008-03-31 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alain Guibert wrote:
> Hello,
> 
> With an old gcc 2.7.2.1 compiler, wget 1.11.1 make test fails:
> 
> | gcc -I. -I. -I./../src  -DHAVE_CONFIG_H 
> -DSYSTEM_WGETRC=\"/usr/local/etc/wgetrc\" 
> -DLOCALEDIR=\"/usr/local/share/locale\" -O2 -Wall -DTESTING -c ../src/test.c
> | ../src/test.c: In function `all_tests':
> | ../src/test.c:51: parse error before `const'



> The attached make-test.patch seems to fix this.

Yeah; that's invalid C90 code; declaration following statement. I'll fix
that.
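
The general shape of the problem and fix, for anyone hitting it with
another pre-C99 compiler (illustrative only, not the actual test.c
change):

#include <stdio.h>

/* C90 requires all declarations at the start of a block; gcc 2.7
   emits "parse error before `const'" when one follows a statement.  */
int
main (void)
{
  const char *msg;      /* declare first ...             */
  puts ("RUNNING TEST");
  msg = "PASSED";       /* ... assign after the statement */
  puts (msg);
  return 0;
}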

> However later the 3rd
> test fails:
> 
> | ./unit-tests
> | RUNNING TEST test_parse_content_disposition...
> | PASSED
> |
> | RUNNING TEST test_subdir_p...
> | PASSED
> |
> | RUNNING TEST test_dir_matches_p...
> | test_dir_matches_p: wrong result
> | Tests run: 3
> | make[1]: *** [run-unit-tests] Error 1
> | make[1]: Leaving directory `/tmp/wget-1.11.1/tests'
> | make: *** [test] Error 2

That's an interesting failure. I wonder if it's one of the new cases I
just added...

In any case, it runs through fine for me. This suggests a difference in
behavior between your system fnmatch function and mine (since that
should be the only bit of external code that dir_matches_p relies on).

Pity the tests don't give much clue as to the specifics of what
failed... there are about 10 tests for test_dir_matches_p, any of which
could have caused the problem.

The whole testing thing needs some serious rework; which is my current
top priority, when I find time for it (GSoC is eating everything, right
now).

"make test" isn't actually expected to work completely, right now; some
of the .px tests are known to be broken/missing. They're basically
provided "as-is". I thought about removing them for the official
package; maybe I should have.

But if I had, I'd still be blissfully unaware of this potential problem.

If you know how, and don't mind, could you try to isolate which part of
test_dir_matches_p is failing? Perhaps by augmenting the error message
to spit out the match-list and string arguments...

- --
Thanks,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH8S/v7M8hyUobTrERAhrPAJ9N+XqLeVP0NN9HkLxO162Zf2uJnACeMwUo
kew/FkMA2GljqWiPG6IC+zs=
=fQSH
-END PGP SIGNATURE-


Re: Wget 1.11 build fails on old Linux

2008-03-31 Thread Alain Guibert
 On Monday, February 25, 2008 at 16:32:21 +0100, Alain Guibert wrote:

> On an old Debian Bo system (kernel 2.0.40, gcc 2.7.2.1, libc 5.4.33),
> building Wget 1.11 fails:

While wget 1.11.1 builds and works OK. Thank you very much, gentlemen!


Alain.


Re: wget doesn't account reject if loads from a list

2008-03-19 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Aleksandar Radulovic wrote:
> (I am not subscribed to the bug-list)
> 
>   Hello,
> 
>   I found another bug in wget:
> 
>   if an input file with relative links is specified
> (extracted from an index.html) together with the -B
> option, and I have a list of files to reject in
> .wgetrc, it doesn't work. It simply downloads
> everything that is in the list. From the original
> file it works fine and rejects according to the list.

accept/reject lists are only applied to recursively-fetched files; never
to explicitly-requested URLs, which includes URLs you specify as
arguments or via an input file.

>   It seems I have to write a program which does it file
> by file (grmbh!).

If you don't want the file, don't ask wget for it!
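
(If the unwanted names are known up front, one untested workaround is
to filter the list in the shell and feed the remainder to wget on
stdin:

  grep -v -f reject-patterns.txt urls.txt | wget -B http://example.com/ -i -

where reject-patterns.txt is a hypothetical file holding the patterns
you would otherwise have put in .wgetrc.)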

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH4T2M7M8hyUobTrERAhy+AKCGC75YqxCgDxnfc9gKrKkw5j1JjgCfbRLN
aSigPz6tfcwP1/nnGDZX1sU=
=7fHa
-END PGP SIGNATURE-


Re: wget aborts when file exists

2008-03-19 Thread Aleksandar Radulovic

(I am not subscribed to the bug-list)


**
On 12.03.2008 at 20:36 Charles wrote:

> On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar Radulovic <[EMAIL PROTECTED]> wrote:
>
>> (I am not subscribed to the bug-list)
>>
>> Hello,
>>
>> I use wget to retrieve recursively images from a site, which are
>> randomly changed on a daily basis. I wrote a small batch which worked
>> until a system upgrade. Now the new version of wget is installed but
>> it aborts when any file already exists.
>
> When I tried this in my wget, I got different behavior with wget 1.11
> alpha and wget 1.10.2
>
> D:\>wget --proxy=off -r -l 1 -nc -np http://localhost/test/
> File `localhost/test/index.html' already there; not retrieving.
>
>
> D:\>wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
> File `localhost/test/index.html' already there; not retrieving.
>
> File `localhost/test/a.gif' already there; not retrieving.
>
> File `localhost/test/b.gif' already there; not retrieving.
>
> File `localhost/test/c.jpg' already there; not retrieving.
>
> FINISHED --20:31:41--
> Downloaded: 0 bytes in 0 files
>
> I think wget 1.10.2 behavior is more correct. Anyway it did not abort
> in my case.
>
> ---
> Charles

It breaks, and it is wget 1.10.2. I really don't know why, and I can't
influence that because I am not an administrator of the system, just a
user. However, it seems that this bug occurs.

Aca





  



Re: wget aborts when file exists

2008-03-12 Thread Hrvoje Niksic
Charles <[EMAIL PROTECTED]> writes:

> On Thu, Mar 13, 2008 at 1:17 AM, Hrvoje Niksic <[EMAIL PROTECTED]> wrote:
>>  > It assums, though, that the preexisting index.html corresponds to
>>  > the one that you were trying to download; it's unclear to me how
>>  > wise that is.
>>
>>  That's what -nc does.  But the question is why it assumes that
>>  dependent files are also present.
>
> Because I repeated the command, and the files have all been downloaded
> before.

We know that, but Wget 1.11 doesn't seem to check it.  It only checks
index.html, but not the other dependent files.


Re: wget aborts when file exists

2008-03-12 Thread Charles
On Thu, Mar 13, 2008 at 1:17 AM, Hrvoje Niksic <[EMAIL PROTECTED]> wrote:
>  > It assums, though, that the preexisting index.html corresponds to
>  > the one that you were trying to download; it's unclear to me how
>  > wise that is.
>
>  That's what -nc does.  But the question is why it assumes that
>  dependent files are also present.

Because I repeated the command, and the files have all been downloaded
before. By the way, the index.html contains a link to the three
images. I was trying what Aleksandar Radulovic was reporting.


Re: wget aborts when file exists

2008-03-12 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
> 
>>> When I tried this in my wget, I got different behavior with wget 1.11
>>> alpha and wget 1.10.2
>>>
>>> D:\>wget --proxy=off -r -l 1 -nc -np http://localhost/test/
>>> File `localhost/test/index.html' already there; not retrieving.
>>>
>>>
>>> D:\>wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
>>> File `localhost/test/index.html' already there; not retrieving.
>>>
>>> File `localhost/test/a.gif' already there; not retrieving.
>>>
>>> File `localhost/test/b.gif' already there; not retrieving.
>>>
>>> File `localhost/test/c.jpg' already there; not retrieving.
>>>
>>> FINISHED --20:31:41--
>>> Downloaded: 0 bytes in 0 files
>>>
>>> I think wget 1.10.2 behavior is more correct. Anyway it did not abort
>>> in my case.
>> I think I like the 1.11 behavior (I'm assuming it's intentional).
> 
> Let me recap to see if I understand the difference.  From the above
> output, it seems that 1.10's -r descended into an HTML even if it was
> downloaded.  1.11's -r assumes that if an HTML file is already there,
> then so are all the other files it references.
> 
> If this analysis is correct, I don't see the benefit of the new
> behavior.  If index.html happens to be present, it doesn't mean that
> the files it references are also present.  I don't know if the change
> was intentional, but it looks incorrect to me.

Oh. Um, yeah, I think I had it swapped. I was thinking the first example
was 1.10.2, and the second 1.11, but judging by the names I'm thinking
you're right. In that case, it looks to me like a regression.

Thanks, Hrvoje.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH2CGX7M8hyUobTrERAgQ3AJ4hNg/ujDOwhHHUuFPj0WnrnVPDWACgidpw
wNx435+A5Gjt4tr2LHxFzqo=
=CydB
-END PGP SIGNATURE-


Re: wget aborts when file exists

2008-03-12 Thread Hrvoje Niksic
Micah Cowan <[EMAIL PROTECTED]> writes:

>> When I tried this in my wget, I got different behavior with wget 1.11
>> alpha and wget 1.10.2
>> 
>> D:\>wget --proxy=off -r -l 1 -nc -np http://localhost/test/
>> File `localhost/test/index.html' already there; not retrieving.
>> 
>> 
>> D:\>wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
>> File `localhost/test/index.html' already there; not retrieving.
>> 
>> File `localhost/test/a.gif' already there; not retrieving.
>> 
>> File `localhost/test/b.gif' already there; not retrieving.
>> 
>> File `localhost/test/c.jpg' already there; not retrieving.
>> 
>> FINISHED --20:31:41--
>> Downloaded: 0 bytes in 0 files
>> 
>> I think wget 1.10.2 behavior is more correct. Anyway it did not abort
>> in my case.
>
> I think I like the 1.11 behavior (I'm assuming it's intentional).

Let me recap to see if I understand the difference.  From the above
output, it seems that 1.10's -r descended into an HTML even if it was
downloaded.  1.11's -r assumes that if an HTML file is already there,
then so are all the other files it references.

If this analysis is correct, I don't see the benefit of the new
behavior.  If index.html happens to be present, it doesn't mean that
the files it references are also present.  I don't know if the change
was intentional, but it looks incorrect to me.

> It assums, though, that the preexisting index.html corresponds to
> the one that you were trying to download; it's unclear to me how
> wise that is.

That's what -nc does.  But the question is why it assumes that
dependent files are also present.


Re: wget aborts when file exists

2008-03-12 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Charles wrote:
> On Wed, Mar 12, 2008 at 11:03 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:
>>  I think I like the 1.11 behavior (I'm assuming it's intentional). It
>>  assums, though, that the preexisting index.html corresponds to the one
>>  that you were trying to download; it's unclear to me how wise that is.
>>  Hrvoje, are you aware of this change and its rationale?
> 
> Hi,
> 
> One drawback of this behavior is that when we mirror a website and
> then cancel it, but the server does not provide a Last-Modified
> header (because the content is dynamically generated, for example), we
> cannot continue from the point where we cancelled the download (all
> the files have to be downloaded again).

If you didn't want that, you probably shouldn't specify -nc, and should
instead specify -c.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH2A8x7M8hyUobTrERAvqZAJ9JGvX60DJBheqB/BjiEQh9KIRpPgCbBccX
bD/mUv5ee+dRxFXPBZtGE+o=
=7fvu
-END PGP SIGNATURE-

