Re: [bug #64808] When I use wget to download some files from a web server, files with russian names do not get proper names

2023-11-17 Thread Eli Zaretskii
> Date: Fri, 17 Nov 2023 20:34:37 +0100
> From: grafgrim...@gmx.de
> 
> I use Linux, so no exe files. I use Gentoo Linux.
> 
> Command line example:
> One line (wget and the url):
> 
> wget
> http://releases.mozilla.org/pub/firefox/releases/119.0.1/source/firefox-119.0.1.source.tar.xz
> 
> result: a file with a wrong checksum.

But the above file name has no Russian characters, so why did you say
"files with russian names do not get proper names"?  What am I
missing?



[bug #60287] Windows recursive download escapes utf8 URLs twice

2021-03-28 Thread Eli Zaretskii
Follow-up Comment #12, bug #60287 (project wget):

You are welcome to send patches which would implement what you think should be
the correct behavior in Wget.  At the time, based on my study of the Wget
sources and its basic design of fetching Web pages, my conclusion was that the
only reliable way in Wget on Windows to deal with non-ASCII characters in URLs
specified by Web pages is to provide Wget with the remote and local encodings,
especially since UTF-8 support on Windows is rudimentary at best.  I thought I
was doing fine by helping you and others deal with these situations by
explaining how to use those options to your benefit...


___

Reply to this item at:

  

___
  Message sent via Savannah
  https://savannah.gnu.org/




[bug #60287] Windows recursive download escapes utf8 URLs twice

2021-03-28 Thread Eli Zaretskii
Follow-up Comment #10, bug #60287 (project wget):

Without converting charsets, it would be difficult to rely on certain library
functions and support certain features.

For example, locale-dependent C library functions work only with the locale's
encoding, and will produce wrong results if presented with strings encoded
differently.  The IRI support needs to work in UTF-8 internally.  And when
writing Web pages to disk, Wget needs to encode the page name so that it would
be acceptable as a file name by the local filesystem.

That is why conversion to the locale's charset is rather necessary. Using the
original bytes might work for some operations, but not for others, so keeping
the original bytes would need some logic for where they can and cannot be
used, which is a complication.  It is better to convert once, and then forget
about it.

The 404 error is most probably because Wget does attempt to convert encoding,
but does it incorrectly when you don't tell it the actual encodings.  So the
re-encoded URL is garbled.






[bug #60287] Windows recursive download escapes utf8 URLs twice

2021-03-27 Thread Eli Zaretskii
Follow-up Comment #8, bug #60287 (project wget):

> Is this because wget first downloads the html file and then reads the
contents off disk

No.  It's because Wget downloads the pages you told it to, and saves them as
disk files.  Any links in the downloaded pages that lead to other pages
produce additional disk files (e.g., if you told Wget to download
recursively).

IOW, the file-name encoding issue happens when a Web page needs to be saved to
a file for some reason.

> If the bytes were downloaded with the correct encoding, and written to the
file system with the correct encoding, I would expect it to be able to parse
the file with the correct encoding.

What is the "correct encoding", though?

> the file `wget-test.html` has no non-ascii characters in it

Of course, it doesn't: the non-ASCII characters appear when we decode the
hex-encoded bytes.







[bug #60287] Windows recursive download escapes utf8 URLs twice

2021-03-26 Thread Eli Zaretskii
Follow-up Comment #6, bug #60287 (project wget):

> Isn't the encoding specified in the HTTP header?

Not the local one.  (And not every page you download has these headers, so the
remote one isn't always known, either.)

You must specify the local encoding, especially on MS-Windows, because
Windows filesystems aren't agnostic about how file names are encoded:
they don't allow arbitrary byte sequences to be part of a file name.
File names are stored on disk in UTF-16, so the file I/O APIs on
Windows must convert file names to UTF-16, and for that they need to
know the names' original encoding.

> It feels like a bug because my browser handles the links just fine, without
the charset specified by the server.

The browser just shows the page, it doesn't save it to a disk file.  So
encoding of the page's name isn't an issue for the browser, as it is for
Wget.






[bug #60287] Windows recursive download escapes utf8 URLs twice

2021-03-26 Thread Eli Zaretskii
Follow-up Comment #4, bug #60287 (project wget):

Why does this feel like a bug to you?  How can Wget be expected to guess the
correct encoding, if you don't tell it?






[bug #60287] Windows recursive download escapes utf8 URLs twice

2021-03-25 Thread Eli Zaretskii
Follow-up Comment #2, bug #60287 (project wget):

What was the locale on the GNU/Linux machine, where this "just works"?  I'm
guessing it was a UTF-8 locale, in which case I'd try the same with a
different locale.

I think you must use --remote-encoding=UTF-8 (and perhaps also a suitable
--local-encoding) to make this work correctly on MS-Windows.  Did you try
that?






Re: Error SSL al ejecutar desde Windows

2021-02-13 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Sat, 13 Feb 2021 18:23:31 +0100
> 
> Try without the " quotes, or remove the spaces between them and the URL.
> (Possibly I am misguided by your mailer's line breaks.)
> 
> wget --no-check-certificate 
> "https://www.datos.gov.co/api/views/gt2j-8ykr/rows.csv?accessType=DOWNLOAD"

I don't think this is the problem, as the image clearly shows the
program received the URL correctly.

Wget 1.17 I have here doesn't have any problems downloading that file
on Windows, FWIW.



Re: Confusing "Success" error message

2019-11-08 Thread Eli Zaretskii
> Date: Fri, 8 Nov 2019 17:29:21 +0100
> From: "Andries E. Brouwer" 
> Cc: "Andries E. Brouwer" , tim.rueh...@gmx.de,
> ftu...@fastmail.fm, bug-wget@gnu.org
> 
> Did you read the line "a function that succeeds is allowed to change errno"?

Yes, but that's against every library whose sources I've ever read.



Re: Confusing "Success" error message

2019-11-08 Thread Eli Zaretskii
> Date: Fri, 8 Nov 2019 16:47:30 +0100
> From: "Andries E. Brouwer" 
> Cc: "Andries E. Brouwer" , tim.rueh...@gmx.de,
> ftu...@fastmail.fm, bug-wget@gnu.org
> 
> On Fri, Nov 08, 2019 at 04:34:10PM +0200, Eli Zaretskii wrote:
> 
> > > Libc functions are free to call other functions internally,
> > > and such internal calls may fail where the outer level call
> > > does not fail. So even if a libc function does not return
> > > an error, errno can have changed.
> > 
> > That would be a bug in libc, I think.  Its functions should save and
> > restore errno if other functions they call error out without causing
> > the calling function to fail.
> 
> % man 3 errno
> ...
>A common mistake is to do
> 
>if (somecall() == -1) {
>printf("somecall() failed\n");
>if (errno == ...) { ... }
>}
> 
>where errno no longer needs to have the value it had upon  return  from
>somecall()  (i.e.,  it may have been changed by the printf(3)).  If the
>value of errno should be preserved across a library call,  it  must  be
>saved:
> 
>if (somecall() == -1) {
>int errsv = errno;
>printf("somecall() failed\n");
>if (errsv == ...) { ... }
>}
> 
> That was the Linux man page. Here is the POSIX man page:
> 
> ...
>The  value  in  errno  is significant only when the return value of the
>call indicated an error (i.e., -1 from most system calls;  -1  or  NULL
>from  most  library  functions); a function that succeeds is allowed to
>change errno.

Thanks, but AFAIU this says the same as I did: if a function succeeds,
it should not modify errno.

In the above example from a man page, the "may have been changed by
printf" part alludes to the possibility that printf fails in some way,
e.g. because the format is in error or stdout is closed or somesuch.



Re: Confusing "Success" error message

2019-11-08 Thread Eli Zaretskii
> Date: Fri, 8 Nov 2019 15:03:21 +0100
> From: "Andries E. Brouwer" 
> Cc: Francesco Turco , bug-wget@gnu.org,
>  "Andries E. Brouwer" 
> 
> > Libc functions only touch errno if there *is* an error
> 
> Libc functions are free to call other functions internally,
> and such internal calls may fail where the outer level call
> does not fail. So even if a libc function does not return
> an error, errno can have changed.

That would be a bug in libc, I think.  Its functions should save and
restore errno if other functions they call error out without causing
the calling function to fail.

IOW, if a libc function succeeds, it should do whatever it takes to
preserve errno.



Re: [Bug-wget] Problem downloading with RIGHT SINGLE QUOTATION MARK (U+2019) in filename

2019-10-11 Thread Eli Zaretskii
> From: Cameron Tacklind 
> Date: Thu, 10 Oct 2019 20:31:02 -0700
> 
> The error is pretty clearly an encoding conversion issue, going from UTF-8,
> assumed to be CP1252, converting into UTF-8, which becomes wrong.

I think you need to tell Wget that the page encoding is UTF-8, by
using the --remote-encoding switch.  Did you try that?



Re: [Bug-wget] Wget on Windows handling of wildcards

2018-06-06 Thread Eli Zaretskii
> From: Sam Habiel 
> Date: Wed, 6 Jun 2018 08:27:44 -0400
> Cc: bug-wget@gnu.org
> 
> Is there a valid argument to be made that some arguments for wget
> should not be expanded, like accept and reject?

Probably.  The problem is that wildcard expansion of the command line
doesn't understand the command, and so doesn't know what to expand and
what not.  Quoting was supposed to be the user's tool to control the
expansion, and it did work up until Windows Vista, when Microsoft in
their infinite wisdom changed the long-standing behavior of their
setargv code.

I came up with the *.[d]at and similar tricks because I frequently
need to invoke a port of Grep using the --include switch, where I have
the same problem on Vista and newer systems.



Re: [Bug-wget] Wget on Windows handling of wildcards

2018-06-05 Thread Eli Zaretskii
> From: Sam Habiel 
> Date: Tue, 5 Jun 2018 14:16:27 -0400
> 
> I have a wget command that has a -A flag that contains a wildcard.
> It's '*.DAT'. That works fine on Linux. I am trying to get the same
> thing to run on Windows, but *.DAT keeps getting expanded by wget (cmd
> does no expansion itself). There is no way that I found of suppressing
> that. I think I tried everything: single quotes, double quotes, escape
> * with ^ (cmd escape char), etc.

What version of Windows is that?

> For reference, here's the whole command:
> 
> wget -rNndp -A "*.DAT"
> "https://foia-vista.osehra.org:443/Patches_By_Application/PSN-NATIONAL
> DRUG FILE (NDF)/PPS_DATS/" -P .
> 
> Run it twice on Windows to see the problem.

Did you try using "*.[D]AT"?

The problem AFAIK is that C runtime on modern versions of Windows
expands wildcards even when quoted.  So either you need to build wget
with wildcard expansion disabled (using the appropriate global
variable whose details depend on whether you use MSVC or MinGW and
which version of MinGW), or you use the above trick (assuming that
wget can expand such wildcards).  Disabling expansions altogether is
usually not a good option in this case, since you probably need it
with other use cases.

HTH



[Bug-wget] Run-time issues with Wget2 1.99.1 built with MinGW

2018-05-12 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Tue, 1 May 2018 15:15:26 +0200
> 
> GNU Wget2 is the successor of GNU Wget, a file and recursive website
> downloader.
> 
> Designed and written from scratch it wraps around libwget, that provides
> the basic functions needed by a web client.
> 
> Wget2 works multi-threaded and uses many features to allow fast operation.
> 
> In many cases Wget2 downloads much faster than Wget1.x due to HTTP zlib
> compression, parallel connections and use of If-Modified-Since HTTP header.

Thanks.  I've built this using mingw.org's MinGW GCC and runtime
support, and found the following run-time issues:

 . The help screen shows the command name as "wget", not "wget2".  Is
   that deliberate?

 . Error message is displayed at startup about False Start, due to
   using GnuTLS 3.4.  Why not simply silently avoid using the False
   Start option by default on such systems?

 . Progress bar displays escape sequences, which are not converted to
   colors on MS-Windows.  I see there are functions in
   libwget/console.c that produce colors on Windows: would you prefer
   using them for progress bar, or would you rather have
   Windows-specific code in bar.c?

 . The tests that use libmicrohttpd all fail, and all pop up the
   Windows UAC elevation dialogue.  I don't yet know why that happens,
   and I'm looking into this issue.  To help me out, could you perhaps
   describe how libmicrohttpd is involved in wget2 tests that use it?
   What is the main idea, and how should I interpret the test logs
   (which include quite a bit of text that doesn't really explain
   itself)?

Thanks.



[Bug-wget] Build issues when building Wget2 1.99.1 with MinGW

2018-05-12 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Tue, 1 May 2018 15:15:26 +0200
> 
> GNU Wget2 is the successor of GNU Wget, a file and recursive website
> downloader.
> 
> Designed and written from scratch it wraps around libwget, that provides
> the basic functions needed by a web client.
> 
> Wget2 works multi-threaded and uses many features to allow fast operation.
> 
> In many cases Wget2 downloads much faster than Wget1.x due to HTTP zlib
> compression, parallel connections and use of If-Modified-Since HTTP header.

Thanks.  I've built this using mingw.org's MinGW GCC and runtime
support, and found the following issues that affect the build:

 . Several issues with Gnulib headers and functions, already reported
   to Gnulib mailing list.

 . The README says libmicrohttpd is required for running the test
   suite, but it doesn't tell which optional libmicrohttpd features
   are expected/recommended for the testing.  For example, is HTTPS
   support by libmicrohttpd required?  I presume yes, because
   otherwise HTTPS cannot be tested.  Likewise for other options -- it
   would be good to know how to build libmicrohttpd for optimal
   coverage of the test suite.  (This is especially important on
   Windows, since the existing binary of libmicrohttpd distributed by
   its developers was built without dependencies, so no HTTPS support,
   for example; I needed to build my own port.)

 . The configure time test for external regexp seems to assume that no
   library needs to be added to the link command line to get that
   functionality.  In my case, I needed a -lregex added, but the only
   way to do that seems to be to set LIBS at configure time.

 . Compiling lib/thread.c produces a warning:

 In file included from thread.c:43:0:
 thread.c: In function 'wget_thread_self':
 ../lib/glthread/thread.h:353:5: warning: return makes integer from pointer 
without a cast [-Wint-conversion]
  gl_thread_self_func ()
  ^~
 thread.c:279:9: note: in expansion of macro 'gl_thread_self'
   return gl_thread_self();
  ^~

   This is because wget.h does this:

 typedef unsigned long wget_thread_id_t;

   which conflicts with gl_thread_t, which on Windows is a pointer to
   a structure.

   To fix this, we could either make wget_thread_id_t a wide enough
   type (unsigned long is not wide enough for 64-bit Windows, we need
   uintptr_t instead), and then use an explicit cast in
   wget_thread_self; or wget_thread_id_t should be an opaque data
   type, like gl_thread_t, but then the rest of the code shouldn't
   treat it as a simple scalar integral type.

 . Compiling programs in examples/ produces this warning from libtool:

   CCLD getstream.exe
 libtool: warning: '-no-install' is ignored for i686-pc-mingw32
 libtool: warning: assuming '-no-fast-install' instead

 . "make install" installs wget2_noinstall.exe, which is presumably a
   mistake.  It does NOT install wget2.info, perhaps because docs were
   not built (I have neither Doxygen nor Pandoc on that system).

Thanks.



Re: [Bug-wget] Patch: Make url_file_name also convert remote path to local encoded

2017-11-12 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Sun, 12 Nov 2017 14:50:47 +0100
> Cc: YX Hao 
> 
> As I understand, the second patch is still in discussion with Eli. Since I do 
> not have Windows, I can't help you here. Though what I saw from the 
> discussion, you address a portability issue that likely should be solved 
> within gnulib. Maybe you could (in parallel) send a mail to 
> bug-gnu...@gnu.org 
> with a link to your discussion with Eli. There might be some people with 
> deeper knowledge.

I don't think it's a Gnulib issue.  The problem is that on Windows,
the implicit call at the beginning of Wget

  setlocale (LC_ALL, "C");

is not good enough to work in multibyte locales of the Far East,
because the Windows runtime assumes a single-byte locale after that
call.  And since Wget happens to need to display text and create files
with non-ASCII characters, it gets hit more than other programs.

The proposed solution is to add a special call to setlocale which gets
this right on Windows.




Re: [Bug-wget] Patch: Fix printing multibyte characters as unprintable characters on Windows

2017-11-11 Thread Eli Zaretskii
> From: "YX Hao" 
> Cc: 
> Date: Sun, 5 Nov 2017 23:01:22 +0800
> 
> And I can tell you that 'GetConsoleOutputCP' returns the codepage as command
> 'chcp'. It is right. The gnu 'vsnprintf' doesn't work right with 'setlocale'
> omitted.

I guess this means wget needs to call 'setlocale' with the right
codepage even when NLS is not enabled, because the naïve belief that
the default C locale will show non-ASCII characters correctly is
false on Windows, especially in multibyte locales.  The MSDN
documentation of 'setlocale' confirms that by saying:

  The C locale assumes that all char data types are 1 byte and that
  their value is always less than 256.



Re: [Bug-wget] Patch: Fix printing multibyte characters as unprintable characters on Windows

2017-11-05 Thread Eli Zaretskii
> From: "YX Hao" 
> Cc: 
> Date: Sun, 5 Nov 2017 23:01:22 +0800
> 
> >> '_getmbcp' is used
> > Maybe the problem is that the codepage used for the console output is
> > different from the system's ANSI codepage?  What does GetConsoleOutputCP
> > return in the case you describe?
> >
> > What happens if ENABLE_NLS is defined?  Your patch only handles the
> > situation where ENABLE_NLS is NOT defined.
> 
> Yes, my patch only handles the situation I meet with, by using necessary
> predefined conditions. I leave the others untouched, because I don't have
> the needed libraries.
> I hope others who has the environments can test it and turn on the switches
> when necessary.
> 
> And I can tell you that 'GetConsoleOutputCP' returns the codepage as command
> 'chcp'. It is right. The gnu 'vsnprintf' doesn't work right with 'setlocale'
> omitted.

Sorry, I don't follow.  Does GetConsoleOutputCP return the same value
as _getmbcp, or does it return a different value?



Re: [Bug-wget] Patch: Fix printing multibyte characters as unprintable characters on Windows

2017-11-04 Thread Eli Zaretskii
> From: "YX Hao" <lifenjoi...@163.com>
> Cc: "'Eli Zaretskii'" <e...@gnu.org>
> Date: Fri, 3 Nov 2017 20:14:02 +0800
> 
> Second, as my test, 'setlocale' is needed for the gnu printf related
> functions to work correctly on multibyte characters. You can see that as
> attached screenshots:
> setlocale_936.png
> setlocale_empty.png
> setlocale_omitted-OCP-0.png, the 3 results are the same

What do you mean by "gnu printf related functions"?  If this is a
build that doesn't define ENABLE_NLS, then wget outputs the original
text using the MS runtime versions of printf.  And in a build that
does define ENABLE_NLS, the text is additionally processed by the GNU
gettext library.  So is the problem with the build which defines
ENABLE_NLS or the build that didn't define ENABLE_NLS?  Or is it with
both?

> > Your change calls setlocale with a different value, does that even when
> 
> One tricky situation: one PC is all set to United States, except the
> multibyte code page is 936, for example.
> So, '_getmbcp' is used.

Maybe the problem is that the codepage used for the console output is
different from the system's ANSI codepage?  What does
GetConsoleOutputCP return in the case you describe?

> static void
> i18n_initialize (void)
> {
> +#if defined(WINDOWS) && !defined(ENABLE_NLS)
> +  char MBCP[8] = "";
> +  int CP;
> +#endif
> +
>   /* ENABLE_NLS implies existence of functions invoked here.  */
> #ifdef ENABLE_NLS
>   /* Set the current locale.  */
>   setlocale (LC_ALL, "");
>   /* Set the text message domain.  */
>   bindtextdomain ("wget", LOCALEDIR);
>   textdomain ("wget");
> #endif /* ENABLE_NLS */
> +
> +#if defined(WINDOWS) && !defined(ENABLE_NLS)
> +  CP = _getmbcp(); /* Consider it's different from default. */
> +  if (CP > 0)
> +sprintf(MBCP, ".%d", CP);
> +  setlocale(LC_ALL, MBCP);
> +#endif }

What happens if ENABLE_NLS is defined?  Your patch only handles the
situation where ENABLE_NLS is NOT defined.



Re: [Bug-wget] Patch: Fix printing multibyte characters as unprintable characters on Windows

2017-11-02 Thread Eli Zaretskii
> From: "YX Hao" 
> Date: Thu, 2 Nov 2017 21:09:31 +0800
> 
> During my daily use, I've found a few small bugs and made the patches.
> I will email them in standalone topics. Patch is attached.
> 
> I made the patch on Windows. I think it shouldn't break anything on other
> platforms. Please take a review :)

Thanks.

I'm not Tim, but I have a few questions about your patches.

> 1. setlocale

Can you explain why you needed this?  wget already calls setlocale:

  static void
  i18n_initialize (void)
  {
/* ENABLE_NLS implies existence of functions invoked here.  */
  #ifdef ENABLE_NLS
/* Set the current locale.  */
setlocale (LC_ALL, "");
/* Set the text message domain.  */
bindtextdomain ("wget", LOCALEDIR);
textdomain ("wget");
  #endif /* ENABLE_NLS */
  }

Your change calls setlocale with a different value, does that even
when ENABLE_NLS is not defined, and also runs the risk of using a
wrong codepage, if _getmbcp returns zero (as MSDN says it could).  Why
is that needed?

> +#ifdef WINDOWS
> +  CP = _getmbcp(); /* Consider it's different from default. */

Why would it be different from default, and if it is, why doesn't the
call to setlocale shown above do its job?



Re: [Bug-wget] Wget keeps crashing on me

2017-05-14 Thread Eli Zaretskii
> From: William Higgs 
> Cc: ,
>   'Jernej Simončič' 
> Date: Sun, 14 May 2017 21:17:02 -0400
> 
> Hey guys.  So while I was doing some research, I found the following post
> located at
> https://stackoverflow.com/questions/35004832/wget-exe-for-windows-10/37962965#37962965 :
> "eternallybored build will crash when you are downloading a large file.
> This can be avoided by disabling LFH (Low Fragmentation Heap) by GlobalFlag
> registry."

Makes absolutely no sense to me.  LFH is the default heap allocation
strategy on MS-Windows since Vista; disabling it is only justified
when running a program under a debugger.  Disabling LFH globally for
your entire system means you risk running out of heap memory in some
memory-intensive applications, utterly unrelated to wget.

If that particular build of wget crashes when LFH is in use, it most
probably means a subtle memory-allocation bug, which is simply swept
under the carpet by changing the algorithm for heap allocation.  So I
would suggest to simply switch to a different build of wget, instead
of compromising your entire system.

> However, after looking into how to do this, I cannot find an explanation as
> to how to do this.  Can someone please provide some assistance?

  
https://support.microsoft.com/en-us/help/929136/why-the-low-fragmentation-heap-lfh-mechanism-may-be-disabled-on-some-computers-that-are-running-windows-server-2003,-windows-xp,-or-windows-2000

But I'm not sure this will work on Windows 10, and I urge you not to
do this in the first place.



Re: [Bug-wget] Wget keeps crashing on me

2017-05-14 Thread Eli Zaretskii
> From: William Higgs 
> Date: Sun, 14 May 2017 12:13:58 -0400
> 
> So just to be clear, you want me to use an older release of wget?

No, I'm just saying that the version I built worked without crashing.
You may wish to try it; if it works on your system, it might mean the
problem is not with the OS, but with the wget build you have.



Re: [Bug-wget] Wget keeps crashing on me

2017-05-14 Thread Eli Zaretskii
> From: William Higgs 
> Date: Sun, 14 May 2017 10:27:12 -0400
> 
> And I saw that you had stated that it was working on Windows 7, which
> further convinces me that it is probably a windows 10 thing.  I originally
> thought this was the case because, while the faulting application is wget,
> the faulting module (module I assume to mean what actually caused the crash
> in the application), is ntdll.dll, which is a core system dll.  But sfc
> scans return no issues..

I'm not sure this is a Windows problem.  Crashes inside system DLLs
more often than not are caused by bugs in the applications.



Re: [Bug-wget] Wget keeps crashing on me

2017-05-14 Thread Eli Zaretskii
> From: William Higgs 
> Date: Sun, 14 May 2017 10:23:35 -0400
> 
> The txt file contains the output from the command "wget --version".

Maybe I'm missing something, but I don't see that.  All I see is this:

  Description   : Faulting application name: wget.exe, version: 0.0.0.0, 
time stamp: 0x003cc610

which doesn't show the wget version.

I tried with wget 1.16.1, which I built myself.  You can find its
binaries here:

  
https://sourceforge.net/projects/ezwinports/files/wget-1.16.1-w32-bin.zip/download

> As for the second question, I obtained the binaries by utilizing
> chocolatey's package management system.  If you are more familiar
> with Linux, you can think of chocolatey as the official unofficial
> (while not directly supported by Microsoft, Microsoft has
> incorporated its use into Powershell 5's package management cmdlets)
> "apt-get" for Windows (https://chocolatey.org/).

Thanks for the info.

P.S. Please keep the list address on the CC.



Re: [Bug-wget] Wget keeps crashing on me

2017-05-14 Thread Eli Zaretskii
> From: William Higgs 
> Date: Sat, 13 May 2017 17:37:13 -0400
> 
> So this may have nothing to do with wget (probably very likely, as Windows
> 10 creators update continues to be a very large thorn in my side), but wget
> keeps crashing when I run the attached bat file (converted to txt).  I saved
> the event logs associated with the crash, but again, it looks more like an
> os issue than wget.  Still, wanted to get your opinion on the matter.  Also,
> thanks for the awesome, quality, free software.

FWIW, this works for me, on Windows 7.

What version of wget did you use, and where did you get the binaries?



Re: [Bug-wget] GSoC Project | Design and Implementation of a Framework for Plugins

2017-03-20 Thread Eli Zaretskii
> Date: Tue, 21 Mar 2017 02:29:20 +0700
> From: Didik Setiawan 
> 
> > One way to implement plugins is via libdl (dlopen(), ...), and that is what 
> > I 
> > have in mind. That is not perfectly portable, but our first goal will be 
> > systems that support dlopen().
> 
> So, should I continue to use dlopen() or there is another better method? In 
> case 
> we need more portability.

I'd suggest to use libltdl (part of libtool), which will make these
features more portable.

Note that the GNU project's practice for plugins is to require that
any compatible plugin exports a symbol named plugin_is_GPL_compatible,
to signal that it's released under GPL or a compatible license.  I
think any framework should have the verification of this as its part.

Thanks.



Re: [Bug-wget] Vulnerability Report - CRLF Injection in Wget Host Part

2017-03-06 Thread Eli Zaretskii
> From: Tim Ruehsen 
> Date: Mon, 06 Mar 2017 10:17:25 +0100
> Cc: Orange Tsai 
> 
> Thanks, just pushed a commit, not allowing control chars in host part.

Hmm... is it really enough to reject only ASCII control characters?
Maybe we should also reject control characters from other Unicode
ranges?  Just a thought.



Re: [Bug-wget] Fwd: PATCH: bugs 20369 and 20389

2017-03-04 Thread Eli Zaretskii
> From: Vijo Cherian 
> Date: Fri, 3 Mar 2017 11:33:05 -0800
> Cc: bug-wget@gnu.org
> 
> bool
> file_exists_p (const char *filename, file_stats_t *fstats)
> {
>   struct stat buf;
> 
> #if defined(WINDOWS) || defined(__VMS)
> return stat (filename, &buf) >= 0;
> #else

This leaves fstats untouched on Windows.  At least access_err should
be set, I think.



Re: [Bug-wget] Fwd: PATCH: bugs 20369 and 20389

2017-03-03 Thread Eli Zaretskii
> From: Vijo Cherian 
> Date: Thu, 2 Mar 2017 18:47:11 -0800
> 
> Changes
>   - Bug #20369 - Safeguards against TOCTTOU
> Added safe_fopen() and safe_open() that checks to makes sure the file
> didn't change underneath us.
>   - Bug #20389 - Return error from file_exists_p()
> Added a way to return error from this file without major surgery to
> the callers.

Allow me a few comments to your patch.

> +  errno = 0;
> +  if (stat (filename, &buf) >= 0 && S_ISREG(buf.st_mode) &&

'stat' is documented to return 0 upon success, so I don't think a
positive return value should be considered a success.

> +  (((S_IRUSR & buf.st_mode) && (getuid() == buf.st_uid))  ||
> +   ((S_IRGRP & buf.st_mode) && group_member(buf.st_gid))  ||
> +(S_IROTH & buf.st_mode))) {

These tests assume Posix semantics, and will be too restrictive on
MS-Windows, for example.

> +if (fstats != NULL) {
> +  logprintf (LOG_VERBOSE, _("File %s exists, but NULL for fstats\n"), 
> filename);

The log message says fstats is NULL, but it isn't.

> +  fstats->access_err = 0;
> +  fstats->st_ino = buf.st_ino;
> +  fstats->st_dev = buf.st_dev;
> +}
> +logprintf (LOG_VERBOSE, _("%s exists!!\n"), filename);
> +return true;
> +  } else {
> +if (fstats != NULL) {
> +  fstats->access_err = (errno == 0 ? EACCES : errno);
> +  logprintf (LOG_VERBOSE, _("File %s is not accessible\n"), filename);
> +}
> +logprintf (LOG_VERBOSE, _("File %s doesn't exist\n"), filename);
> +errno = 0;
> +return false;

Do we really need such detailed log messages for such a trivial check?

Also, the name of the function and its commentary seem to no longer
describe what it actually does.  The commentary should also describe
the return value.

> +/* Safe_fopen assumes that file_exists_p() was called earlier. 

The name of the function doesn't describe what it does.

Also, instead of "assumes that file_exists_p() was called earlier",
I'd suggest to state that the FSTATS argument should be available,
e.g. by calling file_exists_p.

> +  if (fstats != NULL && 
> +  (fdstats.st_dev != fstats->st_dev ||
> +   fdstats.st_ino != fstats->st_ino)) {

These are Posix assumptions; on Windows you will get meaningless
results from such a test.  I suggest to have a function for this, with
different implementations on Posix and non-Posix platforms.

Same comments for safe_open.

Thanks for working on this.



Re: [Bug-wget] patch: Improve the rolling file name length for downloading progress image when without NLS

2017-02-17 Thread Eli Zaretskii
> Date: Fri, 17 Feb 2017 14:44:02 +0100
> From: "Andries E. Brouwer" <andries.brou...@cwi.nl>
> Cc: "Andries E. Brouwer" <andries.brou...@cwi.nl>, tim.rueh...@gmx.de,
> bug-wget@gnu.org, lifenjoi...@163.com
> 
> On Fri, Feb 17, 2017 at 03:38:40PM +0200, Eli Zaretskii wrote:
> 
> > Fonts indeed can affect the visual width, but if we assume that the
> > terminal font is a fixed-pitch one, that problem is much less
> > significant, IME.
> 
> Experience shows that one may get a single-width replacement symbol
> when the actual double width symbol is not available.

That happens, yes.  But again, it's unrelated to the encoding of the
writes to the terminal.  It only depends on the terminal fonts used.



Re: [Bug-wget] patch: Stored file name conversion logic correction

2017-02-15 Thread Eli Zaretskii
> Date: Thu, 16 Feb 2017 12:42:23 +0800 (CST)
> From: "YX Hao" 
> 
> I downloaded the 'mbox format' original, and found out the reason why you 
> can't reproduce the issue.
> The non-ASCII characters you use is encoded in "iso-8859-1" in your email, 
> and should be displayed correctly in your environment.
> So, your encoding is compatible with 'UTF8', which is the remote server's 
> default encoding. That won't cause iconv error :)
> Think about 'UTF-8'-incompatible encoding environments ...

Maybe I misunderstand, but ISO-8859-1 (a.k.a. "Latin-1") is NOT
compatible with UTF-8.  Trying to decode Latin-1 text as UTF-8 will
get you errors from the conversion routines, because Latin-1 byte
sequences are generally not valid UTF-8 sequences.
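The point can be shown with a few lines of C (an illustrative sketch, not Wget code; a real decoder would also reject overlong forms and surrogates). In Latin-1, "é" is the single byte 0xE9; in UTF-8, a byte of the form 1110xxxx announces a 3-byte sequence that must be followed by two 10xxxxxx continuation bytes, so the Latin-1 text fails validation.

```c
/* Minimal UTF-8 well-formedness check, enough to demonstrate why
   Latin-1 byte sequences are generally not valid UTF-8. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool
valid_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t need;
      if (c < 0x80)
        need = 0;                   /* plain ASCII byte */
      else if ((c & 0xE0) == 0xC0)
        need = 1;                   /* start of a 2-byte sequence */
      else if ((c & 0xF0) == 0xE0)
        need = 2;                   /* start of a 3-byte sequence */
      else if ((c & 0xF8) == 0xF0)
        need = 3;                   /* start of a 4-byte sequence */
      else
        return false;               /* stray continuation or invalid byte */
      if (i + 1 + need > len)
        return false;               /* sequence runs past end of input */
      for (size_t j = 1; j <= need; j++)
        if ((s[i + j] & 0xC0) != 0x80)
          return false;             /* not a 10xxxxxx continuation byte */
      i += 1 + need;
    }
  return true;
}
```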



Re: [Bug-wget] [PATCH] utils: rename base64_{encode,decode}

2016-12-15 Thread Eli Zaretskii
> From: Rahul Bedarkar 
> Date: Thu, 15 Dec 2016 12:57:16 +0530
> 
> In case of static build, all symbols are visible. Since GnuTLS is static 
> library, which is just archive of object files, linking happens at 
> caller end i.e. wget, linker don't know what to (un)export. That's why 
> we see definition clash in static builds. Please correct me if I'm 
> missing something.

That's not so: static linking will only pull from a static library
symbols that are not already resolved by earlier object files the
linker processed.  In this case, since that symbol should have been
satisfied by wget's own function, the linker had no reason to use the
one in the library.  Unless, that is, you submitted the explicit
library file name to the linker command line, instead of using -lgnutls.



Re: [Bug-wget] Query about correcting for DST with Wget

2016-11-15 Thread Eli Zaretskii
> From: Tim Ruehsen 
> Date: Tue, 15 Nov 2016 10:41:40 +0100
> 
> > If we care about this, we could have a private implementation of
> > utimes in mswindows.c, which would DTRT in the use cases needed by
> > Wget.
> 
> Wget uses utime() if available. This function is not covered by gnulib and it 
> is obsoleted by POSIX 2008.
> 
> Instead we should use utimens which is covered by gnulib and circumvents 
> several issues. Currently, I can see no special Windows code in gnulib - but 
> if the issue persists, it should IMO be fixed in gnulib.
> 
> WDYT ?

I won't hold my breath for this to solve the issue.  At best, gnulib
will probably call utimes in its utimens implementation, which will
reset us back to square one.  At worse, they will ask us to provide
the missing implementation for MS-Windows.

This issue is not solvable without calling win32 APIs directly,
because the MS C runtime function all behave consistently -- and
wrongly -- in this case.



Re: [Bug-wget] Query about correcting for DST with Wget

2016-11-14 Thread Eli Zaretskii
> Date: Sun, 13 Nov 2016 21:39:32 +0100
> From: Jernej Simončič <jernej|s-w...@eternallybored.org>
> 
> On Sunday, November 13, 2016, 19:53:06, Eli Zaretskii wrote:
> 
> > Does "while DST is in effect" mean that you download the file when DST
> > is in effect, or you examine the timestamp of the file when the DST is
> > in effect?
> 
> I download the file when DST is(n't) in effect (I download that
> specific URL quite often, on different computers).

Then yes, it's a known problem with how the MS-Windows implementation
of the utimes function works: it converts (internally) from local time
to UTC using the current setting of the DST flag, not its setting at
the time being converted.  The irony of this is that there should be
no need to go through local time in this case, because the timestamp
provided by Wget is in UTC to begin with, and the low-level Windows
APIs that timestamp files accept UTC values.

If we care about this, we could have a private implementation of
utimes in mswindows.c, which would DTRT in the use cases needed by
Wget.
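The core of such a private utimes would be pure arithmetic (a sketch under the usual definitions, not actual Wget or Windows code): a FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC, a Unix time_t counts seconds since 1970-01-01 UTC, and the offset between the two epochs is a well-known constant. Since both are UTC, no local-time round trip is needed.

```c
/* time_t (UTC) -> FILETIME ticks (UTC).  On Windows the result would
   be split into dwLowDateTime/dwHighDateTime and handed to
   SetFileTime, bypassing the C runtime's broken DST handling. */
#include <assert.h>
#include <stdint.h>

#define EPOCH_DIFF_100NS 116444736000000000ULL  /* 1601..1970 in 100ns units */

static uint64_t
unix_time_to_filetime_ticks (int64_t t)
{
  return (uint64_t) t * 10000000ULL + EPOCH_DIFF_100NS;
}
```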

> I just remembered that there may be a 3rd explanation: some msvcrt
> functions return different timestamps depending on whether DST is in
> effect or not - at least with GIMP you can observe that it'll rescan
> all fonts the first time it's run after DST change (I'm not sure if
> this applies only to msvcrt.dll [which MinGW uses by default], or also
> to the runtimes shipped with newer Visual Studio versions).

You are talking about 'stat', I believe.  Yes, they, too, have a
similar bug, but 'stat' is not involved in the code in Wget that sets
the timestamps of downloaded files.



Re: [Bug-wget] Query about correcting for DST with Wget

2016-11-13 Thread Eli Zaretskii
> Date: Sun, 13 Nov 2016 19:10:57 +0100
> From: Jernej Simončič 
> 
> I'm not sure if this is a problem with wget, Windows or the server
> hosting the file, but I observed this happening with
>  - while DST is in effect,
> the file gets timestamp of 22:19, and when it's not it's 23:19 (I'm in
> the CET timezone).

Does "while DST is in effect" mean that you download the file when DST
is in effect, or you examine the timestamp of the file when the DST is
in effect?

Also, how do you display the timestamp of the file? with what program?



Re: [Bug-wget] Query about correcting for DST with Wget

2016-11-10 Thread Eli Zaretskii
> From: "Tim" 
> Date: Thu, 10 Nov 2016 19:26:45 -
> 
> I would be very grateful for any help with an issue I am having with 
> downloading files from a website using Version 1.11.4.3287 of Wget on a 
> Windows XP computer.

That's a very old version of Wget.

> When I use Wget to download a file from a website, the timestamp is out by an 
> hour. I think this is because of Daylight Saving Time. Do any of you know how 
> I can correct this?

Can you tell more details, like the exact URL you downloaded and how
you see the 1-hour difference?  I'd like to try to reproduce this
here.

In general, Windows XP has a database of DST offsets and should use it
to avoid such problems, but maybe Wget doesn't DTRT in this matter,
somehow, at least in your case.



Re: [Bug-wget] Fwd: Re: [PATCH v3] bug #45790: wget prints it's progress even when background

2016-10-19 Thread Eli Zaretskii
> From: "Wajda, Piotr" 
> Date: Wed, 19 Oct 2016 12:18:13 +0200
>
> For CTRL+Break we could probably go to background on windows by forking
> process using current fake_fork method. Child process should be then
> started with -c and -b.

Could be, although it'd be a strange thing to do for Ctrl+BREAK,
IMO, because it's akin to Unix SIGQUIT.



Re: [Bug-wget] [PATCH v3] bug #45790: wget prints it's progress even when background

2016-10-19 Thread Eli Zaretskii
> From: "Wajda, Piotr" 
> Date: Wed, 19 Oct 2016 11:57:06 +0200
> 
> My only confusion was that during testing on windows, when sending 
> CTRL+C or CTRL+Break it immediately terminates, which is basically what 
> I think it should do for CTRL+C. Not sure about CTRL+Break.

What else is reasonable for CTRL+Break?  We can arrange for them to
produce different effects, if there are two alternative behaviors that
would make sense.

Thanks.



Re: [Bug-wget] strerror() on Win32

2016-10-14 Thread Eli Zaretskii
> From: Gisle Vanem 
> Date: Thu, 13 Oct 2016 22:42:03 +0200
> 
> I think I've mentioned earlier; the troubles with strerror()
> returning "Unknown error" for seemingly common 'errno' values.
> 
> I hit me today, when connection to my ftp-hosting service. From
> the Wsock-trace [1] of connect():
> 
>   * 49.163 sec: f:/MingW32/src/gnu/gnulib/lib/connect.c(43) (rpl_connect+64):
> connect (620, 46.30.213.77:21, fam AF_INET) --> WSAETIMEDOUT (10060).
> 
> failed: Unknown error.
> 
> I put some trace-code in Wget's connect.c and do see 'errno' is 138.
> Which is ETIMEDOUT as defined by Gnulib's . But I fail to
> understand why Gnulib's strerror(138) is incapable of handling it.
> 
> Looking at Gnulib's strerror-override.c, I see it should return
> "Connection timed out" there. But it doesn't. Any pointers?

Didn't we already have a similar discussion?  I think you told about
some connect attempt that times out, but the error message doesn't
mention timeout?  And I tried that with my MinGW-compiled Wget, and
couldn't reproduce the problem, because my Wget did report "Connection
timed out"?

My guess is that for some reason Wget calls the MS-Windows strerror,
not its Gnulib replacement.  But that's a guess, and I don't know how
to explain it.  Perhaps put a breakpoint both at the Gnulib strerror
and the MS runtime one, and see what happens in your scenario.

Failing that, if you can show a recipe for reproducing this, including
a URL to use, I could see what happens on my system, and maybe we will
see the light.



Re: [Bug-wget] [PATCH v3] bug #45790: wget prints it's progress even when background

2016-10-07 Thread Eli Zaretskii
> From: losgrandes 
> Date: Thu,  6 Oct 2016 09:47:01 +0200
> 
> Fortunately I tested wget.exe in normal mode and background mode (-b). Was ok.
> Unfortunately I haven't tested wget.exe with CTRL+Break/CTRL+C (is it really 
> works on windows?).

Yes, CTRL-C/CTRL-BREAK should work on Windows.  What didn't work in
your case?

> 1. Could you be so kind and test my wget.exe with CTRL+Break?

Send your test instructions, and I will try to build and test it here.

> 2. Advise me with error I get while compiling:
>   main.o:main.c:(.text+0x579): undefined reference to `pipe'

The Windows runtime doesn't have 'pipe', it has '_pipe' (with a
slightly different argument list).  I believe we need the Gnulib pipe
module to get this to compile.  However, just as a quick hack, replace
the call to 'pipe' with a corresponding call to '_pipe', you can find
its documentation here:

  https://msdn.microsoft.com/en-us/library/edze9h7e.aspx

(This problem is unrelated to your changes, the call to 'pipe' is
already in the repository.)

>   url.o:url.c:(.text+0x1e78): undefined reference to `libiconv_open'
>   url.o:url.c:(.text+0x1f25): undefined reference to `libiconv'
>   url.o:url.c:(.text+0x1f57): undefined reference to `libiconv'
>   url.o:url.c:(.text+0x1f7e): undefined reference to `libiconv_close'
>   url.o:url.c:(.text+0x20ee): undefined reference to `libiconv_close'
>   collect2: error: ld returned 1 exit status
> 
> This was generated by:
> ./configure --host=i686-w64-mingw32 --without-ssl --without-libidn 
> --without-metalink --with-gpgme-prefix=/dev/null 
> CFLAGS=-I$BUILDDIR/tmp/include LDFLAGS=-L$BUILDDIR/tmp/lib 
> --with-libiconv-prefix=$BUILDDIR/tmp CFLAGS=-liconv

I think you need to add -liconv to LIBS, not to CFLAGS.  GNU ld is a
one-pass linker, so it needs to see -liconv _after_ all the object
files, not before.



Re: [Bug-wget] [PATCH v2] bug #45790: wget prints it's progress even when background

2016-10-03 Thread Eli Zaretskii
> Cc: Eli Zaretskii <e...@gnu.org>
> From: "pwa...@gmail.net.pl" <pwa...@gmail.net.pl>
> Date: Sun, 2 Oct 2016 21:54:58 +0200
> 
> Is there a instruction on how to compile current wget version on windows?

Nothing special, just "./configure && make", as you'd do on a Posix
system.

The trick is to have a development environment that supports the
above.  I use MSYS and MinGW from mingw.org:

  https://sourceforge.net/projects/mingw/files/

If you don't have the dependency libraries, you need to install them
first.  They are also built as above, but you can find precompiled
32-bit binaries and the corresponding header files here:

  https://sourceforge.net/projects/ezwinports/files

If you want to build a 64-bit version of Wget, I suggest to get the
dependencies from the MSYS2 project, starting with this page's
instructions:

  https://msys2.github.io/

The MSYS2/MinGW64 development environment can be installed from that
page as well (if you don't have it).



Re: [Bug-wget] wget for windows - current build?

2016-10-02 Thread Eli Zaretskii
> From: Tim Rühsen 
> Cc: ge...@mweb.co.za
> Date: Sat, 01 Oct 2016 20:12:28 +0200
> 
> If you like to create a README.windows maybe with (basic) explanations on how 
> to build wget on Windows plus pointers to your port(s), we include it into 
> the 
> project.

That's okay (will do when I have time), but I think it would be more
useful to have a link on the Wget Web page to the places where Windows
binaries can be found.

> BTW, meanwhile libidn fixed several security issues, as well as gnutls, 
> libpsl 
> and wget itself ;-)

Duly noted.



Re: [Bug-wget] wget for windows - current build?

2016-10-01 Thread Eli Zaretskii
> From: Tim Rühsen 
> Cc: ge...@mweb.co.za
> Date: Sat, 01 Oct 2016 18:10:26 +0200
> 
> > > It shouldn't be too hard to write a script that cross-compiles wget and
> > > some dependencies via mingw. But would such an .exe really work on a real
> > > Windows machine ?
> > 
> > I'm not sure I understand the question.  If cross-compiling works,
> > then why won't the result run as expected?
> 
> Well, some years ago I copied cross compiled executables (32bit) onto a WinXP 
> machine. Executing these didn't error, but they immediately returned without 
> doing anything. Even the first printf() line didn't do anything.
> While executing the same executables with wine on the machine that I used for 
> compilation, they worked fine.

Sounds like some incompatibility between the import libraries you had
in that cross-environment and the corresponding DLLs on the target
Windows XP machine.

> While it seems pretty easy to generate a wget.exe on Linux and even run it 
> through wine, it seems not to work out that easily on a real Windows. At 
> least 
> these questions for a recent Windows executable are pretty common - and the 
> Windows affine users here do not have a easy solution as it seems.

Building Wget on Windows is easy if you have an operational
development environment.  What's not easy is running the test suite
and figuring out what each failure means, then fixing the sources as
needed.

Anyway, in addition to my site, which offers a 32-bit build, the MSYS2
project offers both 32-bit and 64-bit builds (although I cannot vouch
for their thoroughness in running the test suite -- not that I know
they didn't, mind you).  People who ask about that must be doing that
out of ignorance; perhaps we should include pointers to those places
in the distribution.



Re: [Bug-wget] wget for windows - current build?

2016-10-01 Thread Eli Zaretskii
> From: Tim Rühsen 
> Cc: "ge...@mweb.co.za" 
> Date: Sat, 01 Oct 2016 13:04:25 +0200
> 
> It shouldn't be too hard to write a script that cross-compiles wget and some 
> dependencies via mingw. But would such an .exe really work on a real Windows 
> machine ?

I'm not sure I understand the question.  If cross-compiling works,
then why won't the result run as expected?



Re: [Bug-wget] wget for windows - current build?

2016-09-30 Thread Eli Zaretskii
> Date: Fri, 30 Sep 2016 16:52:55 +0200 (SAST)
> From: "ge...@mweb.co.za" 
> 
> So, is there a "secret" new place hosting a newer version for Windows? Or is 
> the 1.11 on sourceforge actually okay? And - while I am already asking all 
> these stupid questions - would that version actually handle larger file sizes 
> already? 

You can find a 32-bit Windows port of 1.16.1 here:

  https://sourceforge.net/projects/ezwinports/files/?source=navbar



Re: [Bug-wget] [PATCH v2] bug #45790: wget prints it's progress even when background

2016-09-30 Thread Eli Zaretskii
> From: Piotr Wajda 
> Date: Fri, 30 Sep 2016 09:51:37 +0200
> 
> Hi, Reworked recent patch to behave correctly on fg and bg. Now user can 
> switch from fg to bg and vice versa and wget will select fd accordingly.

Thanks.

> +  /* Initialize this values so we don't have to ask every time we print line 
> */
> +  shell_is_interactive = isatty (STDIN_FILENO);

The MS-Windows version of isatty returns non-zero when its argument
file descriptor is open on any character device.  Notably, this
includes the null device, which is definitely not what we want in this
case, I think.

So I think using this logic will need to import isatty from Gnulib, or
provide an alternative implementation in mswindows.c.

>  static void
>  check_redirect_output (void)
>  {
> -  if (redirect_request == RR_REQUESTED)
> +  /* If it was redirected already to log file by SIGHUP or SIGUSR1, it was 
> permanent */
> +  if(!redirect_request_signal_name && shell_is_interactive)
>  {
> -  redirect_request = RR_DONE;
> -  redirect_output ();
> +  if(tcgetpgrp(STDIN_FILENO) != getpgrp()) 

Neither tcgetpgrp nor getpgrp exist on MS-Windows.

AFAIU, this test is intended to check whether wget was backgrounded.
Since AFAIK that's not possible on MS-Windows, this test should always
return zero on Windows, so I suggest a separate predicate function
with 2 implementations: one on Windows, the other on Posix platforms.
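A sketch of the suggested predicate might look like this (illustrative only; the function name is invented, and the treatment of a non-terminal stdin as "foreground" is an assumption, since the real patch tracks interactivity separately):

```c
/* "Is this process running in the foreground?" with one implementation
   per platform family, as suggested above. */
#include <assert.h>
#include <stdbool.h>

#ifdef _WIN32

static bool
process_in_foreground (void)
{
  /* Console programs on Windows are not backgrounded the way POSIX
     job control does it, so always answer yes. */
  return true;
}

#else
# include <sys/types.h>
# include <unistd.h>

static bool
process_in_foreground (void)
{
  pid_t fg = tcgetpgrp (STDIN_FILENO);
  /* tcgetpgrp fails when stdin is not a terminal; treat that as
     "not job-controlled" rather than "backgrounded". */
  return fg == -1 || fg == getpgrp ();
}
#endif
```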



Re: [Bug-wget] [PATCH 20/25] New option --metalink-index to process Metalink application/metalink4+xml

2016-09-16 Thread Eli Zaretskii
> From: Tim Ruehsen 
> Cc: mehw.is...@inventati.org, bug-wget@gnu.org
> Date: Fri, 16 Sep 2016 11:15:31 +0200
> 
> > So if wget needs to create or open such files, it needs to replace the
> > colon with some other character, like '!'.
> 
> That is what I meant with 'Wget has functions to percent escape special 
> characters...'. It is not only colons. And it depends on the OS (and/or file 
> system).

OK, so the problem should not exist, good.

> From https://en.wikipedia.org/wiki/Comparison_of_file_systems:
> "MS-DOS, Microsoft Windows, and OS/2 disallow the characters \ / : ? * " > < 
> | 
> and NUL in file and directory names across all filesystems. Unices and Linux 
> disallow the characters / and NUL in file and directory names across all 
> filesystems."

This is a better reference, JFYI:

  
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx



Re: [Bug-wget] [PATCH 20/25] New option --metalink-index to process Metalink application/metalink4+xml

2016-09-16 Thread Eli Zaretskii
> From: Tim Ruehsen 
> Date: Fri, 16 Sep 2016 10:15:17 +0200
> Cc: bug-wget@gnu.org
> 
> >   *name  +  ref   -> result
> >   -
> >   NULL   + "foo/C:D:file" -> "file" [bare basename]
> >   "foobar"   + "foo/C:D:file" -> "file" [bare basename]
> >   "dir/old"  + "foo/C:D:file" -> "dir/C:D:file"
> >   "C:D:file/old" + "foo/E:F:new"  -> "C:D:file/E:F:new" [is this ok?]
> 
> Just make sure that no file name beginning with letter+colon is used for 
> system 
> calls on Windows (e.g. open("C:D:file/E:F:new", ...) is not a good idea). 
> Either you strip the 'C:D:', or percent escape ':' on Windows. Wget has 
> functions to percent escape special characters in file names, depending on 
> the 
> OS it is built on.

(I've lost track of this discussion, and don't understand the context
well enough to get back on track, so please bear with me.)

Windows filesystems will not allow file names that have embedded colon
characters, except if that colon is part of the drive specification at
the beginning of a file name, as in "D:/dir/file".  File names like
the 2 last results above are not allowed, and cannot be created or
opened.

So if wget needs to create or open such files, it needs to replace the
colon with some other character, like '!'.

Again, apologies if this comment makes no sense in the context of
whatever you've been discussing.
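A replacement along the lines suggested above could be sketched as follows (not Wget's actual escaping code; the function name is invented, and '!' as the replacement character is just the example from the comment). The rejected set comes from the MSDN "Naming Files" page; only a drive-letter colon in position 1 is allowed to stay, and the directory separators are deliberately left alone.

```c
/* Replace characters that Windows filesystems reject in file names. */
#include <assert.h>
#include <string.h>

static void
sanitize_windows_filename (char *name)
{
  size_t i = 0;

  /* Keep a leading drive specification such as "D:" intact.  */
  if (name[0] != '\0' && name[1] == ':'
      && ((name[0] >= 'A' && name[0] <= 'Z')
          || (name[0] >= 'a' && name[0] <= 'z')))
    i = 2;

  for (; name[i] != '\0'; i++)
    if (strchr ("<>:\"|?*", name[i]) != NULL)
      name[i] = '!';
}
```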



Re: [Bug-wget] [PATCH 09/25] Enforce Metalink file name verification, strip directory if necessary

2016-09-12 Thread Eli Zaretskii
> From: Tim Ruehsen 
> Date: Mon, 12 Sep 2016 13:00:32 +0200
> 
> > +  char *basename = name;
> > +
> > +  while ((name = strstr (basename, "/")))
> > +basename = name + 1;
> 
> Could you use strrchr() ? something like
> 
> char *basename = strrchr (name, '/');
> 
> if (basename)
>   basename += 1;
> else
>   basename = name;

I think we want to use ISSEP, no?  Otherwise Windows file names with
backslashes will misfire.
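The point can be sketched like this (illustrative only; Wget's real code uses its ISSEP macro, and the backslash handling below would normally be compiled in only on DOS/Windows builds):

```c
/* Take the last path component, honoring both '/' and '\\' as
   separators so that "dir\sub\file" also works on Windows. */
#include <assert.h>
#include <string.h>

static const char *
base_name_portable (const char *name)
{
  const char *slash = strrchr (name, '/');
  /* In Wget proper this part would be conditional on the platform. */
  const char *bslash = strrchr (name, '\\');
  if (bslash != NULL && (slash == NULL || bslash > slash))
    slash = bslash;
  return slash != NULL ? slash + 1 : name;
}
```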



Re: [Bug-wget] Wget - acess list bypass / race condition PoC

2016-08-21 Thread Eli Zaretskii
> From: Giuseppe Scrivano 
> Date: Sun, 21 Aug 2016 15:26:58 +0200
> Cc: "bug-wget@gnu.org" ,
>   Dawid Golunski ,
>   "kseifr...@redhat.com" 
> 
>  #else /* def __VMS */
> -  *fp = fopen (hs->local_file, "wb");
> +  if (opt.delete_after
> +|| opt.spider /* opt.recursive is implicitely true */
> +|| !acceptable (hs->local_file))
> +{
> +  *fp = fdopen (open (hs->local_file, O_CREAT | O_TRUNC | 
> O_WRONLY, S_IRUSR | S_IWUSR), "wb");
> +}

For this to work on MS-Windows, the 'open' call should use O_BINARY,
in addition to the other flags.  Otherwise, the "b" in "wb" of
'fdopen' will be ignored by the MS runtime.
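The usual portable idiom for this is to define O_BINARY to 0 where the headers don't provide it and always include it in the flags (a sketch of the idiom, not the patch itself): on POSIX it is a no-op, and on Windows it stops the C runtime from translating "\n" to "\r\n" in the downloaded data.

```c
/* Open a file for binary writing on both POSIX and Windows. */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#ifdef _WIN32
# include <io.h>
#else
# include <unistd.h>
#endif

#ifndef O_BINARY
# define O_BINARY 0   /* POSIX systems have no text/binary distinction */
#endif

static int
open_binary_for_writing (const char *path)
{
  return open (path, O_CREAT | O_TRUNC | O_WRONLY | O_BINARY,
               S_IRUSR | S_IWUSR);
}
```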

Thanks.



Re: [Bug-wget] [PATCH] wget: Add --ssh-askpass support

2016-07-23 Thread Eli Zaretskii
> From: j...@wxcvbn.org (Jeremie Courreges-Anglas)
> Cc: "Liam R. Howlett" , bug-wget@gnu.org
> Date: Sat, 23 Jul 2016 21:24:33 +0200
> 
> > This implementation is unnecessarily non-portable ('fork' doesn't
> > exist on some supported platforms).  I suggest to use a much more
> > portable 'popen' instead.
> 
> popen(3) may be more portable but is it subject to all the problems
> brought by "sh -c": the string may contain shell metacharacters, etc.

Nothing command-line quoting cannot handle, surely.

> What worries me is the use of strace(1), which is afaik available only
> on Linux. OpenBSD for example doesn't have it.  Why would strace(1) be
> needed here?

Right.



Re: [Bug-wget] [PATCH] wget: Add --ssh-askpass support

2016-07-23 Thread Eli Zaretskii
> From: "Liam R. Howlett" 
> Date: Fri, 22 Jul 2016 20:24:05 -0400
> Cc: liam.howl...@windriver.com
> 
> This adds the --ssh-askpass option which is disabled by default.

Thanks.

> +
> +/* Execute external application SSH_ASKPASS which is stored in 
> opt.ssh_askpass
> + */
> +void
> +run_ssh_askpass(const char *question, char **answer)
> +{
> +  char tmp[1024];
> +  pid_t pid;
> +  int com[2];
> +
> +  if (pipe(com) == -1)
> +  {
> +fprintf(stderr, _("Cannot create pipe"));
> +exit (WGET_EXIT_GENERIC_ERROR);
> +  }
> +
> +  pid = fork();
> +  if (pid == -1)
> +  {
> +fprintf(stderr, "Error forking SSH_ASKPASS");
> +exit (WGET_EXIT_GENERIC_ERROR);
> +  }
> +  else if (pid == 0)
> +  {
> +/* Child */
> +dup2(com[1], STDOUT_FILENO);
> +close(com[0]);
> +close(com[1]);
> +fprintf(stdout, "test");
> +execlp("/usr/bin/strace", "-s256", "-otest.out", opt.ssh_askpass, 
> question, (char*)NULL);
> +assert("Execlp failed!");
> +  }
> +  else
> +  {
> +close(com[1]);
> +unsigned int bytes = read(com[0], tmp, sizeof(tmp));
> +if (!bytes)
> +{
> +  fprintf(stderr,
> +_("Error reading response from SSH_ASKPASS %s %s\n"),
> +opt.ssh_askpass, question);
> +  exit (WGET_EXIT_GENERIC_ERROR);
> +}
> +else if (bytes > 1)
> +  *answer = strndup(tmp, bytes-1);
> +  }
> +}

This implementation is unnecessarily non-portable ('fork' doesn't
exist on some supported platforms).  I suggest to use a much more
portable 'popen' instead.



Re: [Bug-wget] [PATCH] Trivial changes in HSTS

2016-06-18 Thread Eli Zaretskii
> From: Gisle Vanem 
> Date: Fri, 17 Jun 2016 22:50:27 +0200
> 
> > +static bool
> > +hsts_file_access_valid (const char *filename)
> > +{
> > +  struct_stat st;
> > +
> > +  if (stat (filename, &st) == -1)
> > +return false;
> > +
> > +  return !(st.st_mode & S_IWOTH) && S_ISREG (st.st_mode);
> 
> Due to the above patch, the following output on Wget/Windows seems
> a bit paranoid; wget -d https://vortex.data.microsoft.com/collect/v1
>   ...
>   Reading HSTS entries from c:\Users\Gisle\AppData\Roaming/.wget-hsts
>   Will not apply HSTS. The HSTS database must be a regular and 
> non-world-writable file.
>   ERROR: could not open HSTS store at 
> 'c:\Users\Gisle\AppData\Roaming/.wget-hsts'. HSTS will be disabled.
> 
> On Windows this file is *not* "world-writeable" AFAICS (and yes, it does 
> exists).
> Hence this "paranoia" should be accounted for. I'm not so much into Posix,
> so I'll leave it to you experts to comment & patch.

IMO, this test should be bypassed on Windows.  The "world" part in
"world-writeable" is a Unix-centric notion, and its translation into
MS-Windows ACLs is non-trivial (read: "impossible").  (For example,
your "non-world-writeable" file is accessible to certain users and
groups of users on Windows, other than Administrator.)  So the sanest
solution for this is simply not to make this test on Windows.



Re: [Bug-wget] retrieval failure:Forbidden? for UTF-8-URL in wget that works on FF and IE

2016-06-08 Thread Eli Zaretskii
> Date: Wed, 08 Jun 2016 11:47:46 -0700
> From: "L. A. Walsh" 
> 
> I tried:
> 
> wget "http://translate.google.com/#ja/en/クイーンズブレイド・メインテーマB;
> 
> But get a an Error "403: Forbidden" (tried w/ and w/o proxy) -- same.

On what OS and with which version of wget?



Re: [Bug-wget] Progress bar on MS-Windows

2016-06-07 Thread Eli Zaretskii
> From: Gisle Vanem 
> Date: Tue, 7 Jun 2016 09:00:43 +0200
> 
> Compare the attached image wget-progress-1.png:
>   wget --show-progress --quiet -np -r www.watt-32.net/watt-doc/
> 
> VS wget-progress-2.png:
>   wget --show-progress --quiet --limit-rate=2k -np -r 
> www.watt-32.net/watt-doc/
> 
> I think it's a bit strange the final d/l speed isn't "sticky" in both cases.
> Is it because the speed is too high?

I think the download time is so short that wget doesn't have enough to
estimate the speed, before it's all over.



Re: [Bug-wget] Progress bar on MS-Windows

2016-06-06 Thread Eli Zaretskii
> Date: Sat, 04 Jun 2016 13:40:12 +0300
> From: Eli Zaretskii <e...@gnu.org>
> 
> > Here's a build as of commit 7c0752c4cb6575c6720d6e2d4bf4eda61b63e0f1:
> > https://eternallybored.org/misc/wget/test/wget.exe
> 
> Thanks, will try it.

Finally had an opportunity to try this build (needed a very slow
connection and a large file, to see the longest ETA string
displayed).  Indeed, the problem I was trying to solve doesn't exist
in this build, so the patch I proposed is not needed.

Thanks.



Re: [Bug-wget] Progress bar on MS-Windows

2016-06-04 Thread Eli Zaretskii
> Date: Sat, 4 Jun 2016 11:08:18 +0200
> From: Jernej Simončič <jernej|s-w...@eternallybored.org>
> 
> On Saturday, June 4, 2016, 10:27:56, Eli Zaretskii wrote:
> 
> > Sorry, no.  Not unless someone will be kind enough to produce a
> > complete tarball.  Building from Git requires all kinds of utilities
> > that are not simple to set up on Windows.
> 
> Here's a build as of commit 7c0752c4cb6575c6720d6e2d4bf4eda61b63e0f1:
> https://eternallybored.org/misc/wget/test/wget.exe

Thanks, will try it.



Re: [Bug-wget] Progress bar on MS-Windows

2016-06-04 Thread Eli Zaretskii
> Date: Wed, 1 Jun 2016 10:28:35 +0200
> From: Darshit Shah 
> Cc: bug-wget@gnu.org
> 
> Can you please try with the latest HEAD once? As far as I am aware, all 
> the off-by-one errors have been fixed.

Sorry, no.  Not unless someone will be kind enough to produce a
complete tarball.  Building from Git requires all kinds of utilities
that are not simple to set up on Windows.

In any case, I did look at the latest sources in Git, and I don't
think the patch I suggested is fixed, because the calculation in
determine_screen_width didn't change.  The problem here is that the
Windows cmd console moves to the next line as soon as you display the
last character that fits on the line.  So the last column must never
be occupied if we want a proper progress display.

> If there is something left over, we should fix it in the progress bar 
> output itself instead of a hack somewhere in the Windows specific code.  

It's not a hack.

> Unless the issue is in how the window size is reported in Windows.

It is.

Thanks.



[Bug-wget] Progress bar on MS-Windows

2016-05-28 Thread Eli Zaretskii
Running wget from the Windows cmd window, I see that we write 1 column
too many, when we display "eta XXm YYs" -- this causes the next
progress bar be displayed on the next screen line.  So I came up with
a small patch below, in the Windows specific portion of
determine_screen_width.

Does anyone else see this?  I see this in Wget 1.16.1, but I don't see
any changes in the related code in the current Git master.  Did I miss
something?  If not, OK to push this change?

--- src/utils.c~0   2014-11-23 18:49:06.0 +0200
+++ src/utils.c 2016-05-28 21:09:24.91675 +0300
@@ -1822,7 +1824,7 @@ determine_screen_width (void)
   CONSOLE_SCREEN_BUFFER_INFO csbi;
   if (!GetConsoleScreenBufferInfo (GetStdHandle (STD_ERROR_HANDLE), &csbi))
 return 0;
-  return csbi.dwSize.X;
+  return csbi.dwSize.X - 1;
 #else  /* neither TIOCGWINSZ nor WINDOWS */
   return 0;
 #endif /* neither TIOCGWINSZ nor WINDOWS */



Re: [Bug-wget] wget IRI test failures on Mac OS X

2016-05-18 Thread Eli Zaretskii
> From: Ryan Schmidt 
> Date: Wed, 18 May 2016 02:39:56 -0500
> Cc: bug-wget@gnu.org
> 
> Thanks Eli. I tried the latest commit from April 2016, 
> 42cc84b6b6cceeb146a668797ceaafe60743ce6d, and the IRI tests still failed:

Does OS X have a function that can compare equal strings with composed
and decomposed characters that are equivalent sequences?



Re: [Bug-wget] Wget 1.17.1 bug?

2016-05-17 Thread Eli Zaretskii
> Date: Tue, 17 May 2016 11:22:25 +0300
> From: "Zeroes & Ones" 
> 
> I not compiled himself, i use binaries installed used setup-x86.exe (v2.874 
> 32 bit) 
> 
> chosen site:
> cygwin.mirror.constant.com
> 
> trouble reproduced 100%
> 
> wget -V
> 
> GNU Wget 1.17.1 built on cygwin.

That's a Cygwin build, so you need to use chmod to make the downloaded
file executable.  Cygwin programs emulate Posix permissions using NT
security features, so you need to play by Cygwin rules.




Re: [Bug-wget] wget IRI test failures on Mac OS X

2016-05-16 Thread Eli Zaretskii
> From: Ryan Schmidt 
> Date: Thu, 12 May 2016 23:52:08 -0500
> Cc: Micah Cowan 
> 
> Hello, just wanted to gently remind you that this bug in the wget test suite 
> running on OS X that I reported in 2009 with wget 1.12 still exists today 
> with wget 1.17.1.

Please try the latest Git master, some progress was made, although I'm
not quite sure those changes are enough for the HFS+ canonical
decomposition of file names.  But it could.



Re: [Bug-wget] Wget 1.17.1 bug?

2016-05-16 Thread Eli Zaretskii
> Date: Mon, 16 May 2016 10:24:25 +0300
> From: "Zeroes & Ones" 
> 
> i update Wget 1.11.4 to latest 1.17.1 and i have troubles
> 
> output file have wrong permission on NTFS (checked on W2008R2, Win8.1)
> 
> 
> for example
> wget.exe http://www.nch.com.au/components/burnsetup.exe
> after complete downloading i see what i can't execute file
> 
> accesschk.exe burnsetup.exe
> 
> Accesschk v6.01 - Reports effective permissions for securable objects
> Copyright (C) 2006-2016 Mark Russinovich
> Sysinternals - www.sysinternals.com
> 
> Error: C:\6\burnsetup.exe has a non-canonical DACL:
>Explicit Deny after Explicit Allow
> C:\6\burnsetup.exe
>   RW GOMELENERGO\s.dindikov
>   R  GOMELENERGO\Domain Users
>   RW NT AUTHORITY\SYSTEM
>   RW BUILTIN\Administrators
>   R  BUILTIN\Users
>   R  Everyone
> 
> with older version Wget all OK:
> 
> Accesschk v6.01 - Reports effective permissions for securable objects
> Copyright (C) 2006-2016 Mark Russinovich
> Sysinternals - www.sysinternals.com
> 
> C:\5\burnsetup.exe
>   RW NT AUTHORITY\SYSTEM
>   RW BUILTIN\Administrators
>   R  BUILTIN\Users

I don't think the native Windows port of Wget uses any NT security
related system calls, so it's hard to believe what you see is due to
some changes in Wget code proper.

Did you build both versions of Wget yourself?  If not, where did you
get the binaries?  Could it be that the latter was built differently,
with some changes in the sources, or with some optional libraries
which could explain this?  E.g., could it be that Wget 1.17 is a
Cygwin build or an MSYS build?  (What does "wget --version" display?)



[Bug-wget] [bug #47701] wget 1.17.1 fails to convert from percent encoding to unicode correctly (mingw32)

2016-04-22 Thread Eli Zaretskii
Follow-up Comment #5, bug #47701 (project wget):

In order for you to see the files with non-ASCII names correctly named on your
Windows disk, all the non-ASCII characters in the file names must be supported
by the current system codepage. In addition, your wget must be built with
libiconv.  If any of these two conditions is not true, you will see mojibake
in the file names, because Windows doesn't support UTF-8 encoded file names.

A way to lift one of these limitations -- that the file names be expressible
in the system codepage -- was discussed, but no one has submitted a clean
patchset to fix it. (Doing so on Windows requires to replace/wrap C library
functions that deal with file names with versions that can accept UTF-8
encoded name, convert it to UTF-16, and then call the appropriate library
function, like call _wopen instead of open etc.)

One other thing: a few months back I submitted changes to make non-ASCII file
name support more correct, and I'm not sure that patch is in wget 1.17.1. 
Perhaps Tim or Giuseppe could tell.  If the patch is not in 1.17.1, I suggest
to build wget from the Git repository and see if some of the problems are
gone.


___

Reply to this item at:

  

___
  Message sent via/by Savannah
  http://savannah.gnu.org/




[Bug-wget] [bug #47701] wget 1.17.1 fails to convert from percent encoding to unicode correctly (mingw32)

2016-04-15 Thread Eli Zaretskii
Follow-up Comment #1, bug #47701 (project wget):

You need to give Wget the --local-encoding=UTF-8 command-line option, because
the URL you are trying to fetch is actually in UTF-8 encoding (and then each
byte of the UTF-8 sequence is encoded with percent escapes on top of that).

When I use that switch, the command works for me (with Wget 1.16.1 compiled
with MinGW on MS-Windows).






[Bug-wget] [bug #47689] Support for UTF-16 encoding.

2016-04-13 Thread Eli Zaretskii
Follow-up Comment #1, bug #47689 (project wget):

This site does work for me with Wget 1.16.1 on MS-Windows, with the exact
command you have shown.  The file index.html is downloaded and it is encoded
in UTF-16LE on my disk.

So I'm unsure why it doesn't work for you.






Re: [Bug-wget] HAVE_CARES on Windows

2016-04-11 Thread Eli Zaretskii
> From: Gisle Vanem 
> Date: Mon, 11 Apr 2016 11:30:45 +0200
> 
> Tim Rühsen wrote:
> 
> > As Eli, I would like to know a few more details.
> > Is it possible to make c-ares return the 'native' socket numbers to not get 
> > in 
> > conflict with gnulib ?
> 
> As Eli pointed out, it's vice-versa; C-ares *do* return 'native'
> socket numbers. While Gnulib's socket(), select() etc. creates and
> expects 'file descriptors'. Normally in the range >= 3 (?). (I assume
> this has something to do with POSIX compliance. Winsock's socket() never returns
> such low numbers).

Windows sockets are handles in disguise, not file descriptors.

> Eli> However, converting a handle into a
> Eli> file descriptor and vice versa involves using 2 simple functions,
> 
> I'm not sure what those functions are since I'm not so much into Gnulib.

It's not a Gnulib thing, it's a Windows runtime library thing:

  HANDLE h = _get_osfhandle (int fd);

will produce a handle underlying a file descriptor, while

  int fd = _open_osfhandle (HANDLE h, O_RDWR | O_BINARY);

will do the opposite.

> My intuition told me the 'rpl_select()' was the cause for the resolve-
> failure, hence this 'undef'. And since the host.c 'select()' is used only for
> 'HAVE_LIBCARES' code, I felt it won't hurt do '#undef select' in host.c.

Is it a good idea to have 2 different implementations of 'select' in
the same program?  Can it happen that Wget wants to wait on both the
libcares sockets and the other kind?

> But I'm open to alternatives. Eli, can you try building with
> 'HAVE_LIBCARES'?

Not right now, as I'm quite busy these days.



Re: [Bug-wget] HAVE_CARES on Windows

2016-04-10 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Sun, 10 Apr 2016 20:29:36 +0200
> 
> > I have tried building latest Wget with '-DHAVE_LIBCARES'
> > and all resolve attempts failed due to Gnulib's select()
> > is not compatible with the socket-number(s) returned from
> > a normal C-ares library on Windows.
> 
> As Eli, I would like to know a few more details.
> Is it possible to make c-ares return the 'native' socket numbers to not get 
> in 
> conflict with gnulib ?

I should tell here what I wrote to Gisle privately: Gnulib attempts to
hide the Windows socket-related idiosyncrasies by returning a file
descriptor instead of the 'native' handle, and Gnulib's 'select'
expects such file descriptors.  However, converting a handle into a
file descriptor and vise versa involves using 2 simple functions, so I
hope a more elegant solution should be possible.

Admittedly, Gisle already built Wget with libcares, while I didn't, so
his inputs and opinions carry more weight than mine...



Re: [Bug-wget] HAVE_CARES on Windows

2016-04-09 Thread Eli Zaretskii
> From: Gisle Vanem 
> Date: Sat, 9 Apr 2016 21:58:18 +0200
> 
> I have tried building latest Wget with '-DHAVE_LIBCARES'
> and all resolve attempts failed due to Gnulib's select()
> is not compatible with the socket-number(s) returned from
> a normal C-ares library on Windows.

What is a "socket number" that libcares returns?  Is it a file
descriptor, a handle, or something else?



Re: [Bug-wget] --no-check-certificate does not work in 1.11.4.3287

2016-02-20 Thread Eli Zaretskii
> From: Tim Paulson 
> Date: Fri, 19 Feb 2016 16:32:21 +
> 
> The following command with -no-check-certificate works in wget 1.10.2 but 
> fails in 1.11.4 on Win7.
> wget -t 3 --http-user=admin --http-passwd=password --no-check-certificate 
> https://192.168.1.51/admin/backups/latest
> 
> Has the syntax changed in 1.11.4 for the ability to ignore self signed 
> certificate errors?
> 
> Sample output from wget 1.10.2 and wget 1.11.4 shown below.
> 
> 
> c:\Program Files (x86)\GnuWin32\bin>wget -t 3 --http-user=admin 
> --http-passwd=password --no-check-certificate 
> https://192.168.1.51/admin/backups/latest
> SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrcsyswgetrc = c:\Program Files 
> (x86)\GnuWin32/etc/wgetrc

What do you have in your wgetrc?  Could that be the cause?

FWIW, this option works for me in Wget 1.16.1.  I cannot verify that
with the command line you gave, as it uses an address on your private
network, but if you show a command using an address I can reach, I
will try that and see if it works.



Re: [Bug-wget] [Patch] Use -isystem in Makefile to suppress warnings from libraries

2016-01-29 Thread Eli Zaretskii
> From: Darshit Shah 
> Date: Fri, 29 Jan 2016 15:18:57 +0100
> 
> A recent GCC / LLVM update has caused my setup to spew far too many
> warnings on compiling Wget. On a closer look, they all come from
> Gnulib code. I propose the attached patch to explicitly mark those
> files as libraries and have the compiler suppress warnings from them.
> This way we can focus on the warnings generated by Wget codebase
> alone.

If we do this, who will tell Gnulib people to get their act together
and fix those warnings?  I think the right solution to this is in
Gnulib, not in Wget.

Thanks.



Re: [Bug-wget] [Patch] Use -isystem in Makefile to suppress warnings from libraries

2016-01-29 Thread Eli Zaretskii
> From: Darshit Shah 
> Date: Fri, 29 Jan 2016 15:45:02 +0100
> Cc: Bug-Wget 
> 
> Most of them are actually false positives, probably due to us. Gnulib
> uses some more modern code extensions and the compiler keeps warning
> us about it since we set the C language to std=gnu89.

Does Gnulib assume C99?  Is that documented somewhere?

> I'm not happy about this fact, but this discussion has happened
> multiple times and I don't think we will be moving to a more modern
> setup anytime soon. I would personally prefer using *at least* C99,
> a more recent version like C11 would be even better, but not all
> compiler would support that.

If Gnulib requires C99, why not use -std=gnu99 if it is supported?
Many packages already do.
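
For illustration only, a hypothetical configure.ac fragment (not taken
from Wget's actual build setup) that asks Autoconf to put the compiler
into C99 mode where available -- which for GCC typically results in
-std=gnu99 -- would be:

```m4
dnl Hypothetical sketch: select C99 mode when the compiler supports it.
dnl AC_PROG_CC_C99 appends the needed option (e.g. -std=gnu99 for GCC)
dnl to CC, and leaves CC unchanged if no such option exists.
AC_PROG_CC
AC_PROG_CC_C99
```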



[Bug-wget] [bug #46943] Crash on old CPU w/o SSE2

2016-01-21 Thread Eli Zaretskii
Follow-up Comment #2, bug #46943 (project wget):

Did you build wget yourself, or did you download its binary from somewhere? 
If you downloaded a precompiled binary, can you tell which site did you
download it from?






Re: [Bug-wget] Support non-ASCII URLs

2016-01-12 Thread Eli Zaretskii
> From: Giuseppe Scrivano <gscriv...@gnu.org>
> Cc: tim.rueh...@gmx.de, bug-wget@gnu.org
> Date: Tue, 12 Jan 2016 20:58:16 +0100
> 
> Eli Zaretskii <e...@gnu.org> writes:
> 
> > This was fixed by Tim in the meantime.  Are you running the current
> > Git version?
> 
> sorry my mistake, I was using an outdated version.  All works now for me
> as well.

Great, thanks for testing.



Re: [Bug-wget] Support non-ASCII URLs

2016-01-12 Thread Eli Zaretskii
> From: Giuseppe Scrivano 
> Cc: Tim Rühsen , bug-wget@gnu.org
> Date: Tue, 12 Jan 2016 12:19:06 +0100
> 
> >> FAIL: Test-iri-forced-remote
> >> 
> >> My son has birthday tomorrow, so I am not sure how much time I can spend 
> >> on 
> >> the weekend on this issue. Maybe Eli or you could have a look ?
> >
> > I cannot bootstrap the Git repo (too many prerequisites I don't have).
> > Can you or someone else produce a distribution tarball out of Git that
> > I could then build "as usual"?
> >
> > Also, can you show me the log of the failed test?  Turkish locales
> > have "an issue" with certain upper/lower-case characters, maybe that's
> > the problem.  Or maybe it's something else; looking at the log might
> > give good clues.
> 
> sorry for taking so long, this is the log I get when I run
> 
> $ TESTS_ENVIRONMENT="LC_ALL=tr_TR.utf8 VALGRIND_TESTS=0" make check
> 
> ===
>wget 1.17.1.10-c78d: tests/test-suite.log
> ===
> 
> # TOTAL: 85
> # PASS:  84
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  1
> # XPASS: 0
> # ERROR: 0
> 
> .. contents:: :depth: 2
> 
> FAIL: Test-iri-forced-remote

This was fixed by Tim in the meantime.  Are you running the current
Git version?



Re: [Bug-wget] Support non-ASCII URLs

2015-12-20 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Sun, 20 Dec 2015 21:34:18 +0100
> 
> Please review this patch.

Looks good to me, thanks.



Re: [Bug-wget] Support non-ASCII URLs

2015-12-19 Thread Eli Zaretskii
> From: Tim Rühsen <tim.rueh...@gmx.de>
> Cc: Giuseppe Scrivano <gscriv...@gnu.org>, Eli Zaretskii <e...@gnu.org>
> Date: Fri, 18 Dec 2015 22:41:29 +0100
> 
> 1. Maybe do_conversion() should take a char * argument instead of const
> char *. We avoid one ugly const -> non-const cast and also a warning
> about iconv.

I agree.

> 2. contrib/check-hard fails with
> TESTS_ENVIRONMENT="LC_ALL=tr_TR.utf8 VALGRIND_TESTS=0" make check
> 
> FAIL: Test-iri-forced-remote
> 
> My son has birthday tomorrow, so I am not sure how much time I can spend on 
> the weekend on this issue. Maybe Eli or you could have a look ?

I cannot bootstrap the Git repo (too many prerequisites I don't have).
Can you or someone else produce a distribution tarball out of Git that
I could then build "as usual"?

Also, can you show me the log of the failed test?  Turkish locales
have "an issue" with certain upper/lower-case characters, maybe that's
the problem.  Or maybe it's something else; looking at the log might
give good clues.



Re: [Bug-wget] Support non-ASCII URLs

2015-12-19 Thread Eli Zaretskii
> Date: Sat, 19 Dec 2015 10:15:03 +0200
> From: Eli Zaretskii <e...@gnu.org>
> Cc: bug-wget@gnu.org
> 
> > 2. contrib/check-hard fails with
> > TESTS_ENVIRONMENT="LC_ALL=tr_TR.utf8 VALGRIND_TESTS=0" make check
> > 
> > FAIL: Test-iri-forced-remote
> > 
> > My son has birthday tomorrow, so I am not sure how much time I can spend on 
> > the weekend on this issue. Maybe Eli or you could have a look ?
> 
> I cannot bootstrap the Git repo (too many prerequisites I don't have).
> Can you or someone else produce a distribution tarball out of Git that
> I could then build "as usual"?
> 
> Also, can you show me the log of the failed test?  Turkish locales
> have "an issue" with certain upper/lower-case characters, maybe that's
> the problem.  Or maybe it's something else; looking at the log might
> give good clues.

Tim sent me the tarball and the log off-list (thanks!).  I didn't yet
try to build Wget, but just looking at the test, I guess I don't
understand its idea.  It has an index.html page that's encoded in
ISO-8859-15, but Wget is invoked with --remote-encoding=iso-8859-1,
and the URLs themselves in "my %urls" are all encoded in UTF-8.  How's
this supposed to work?

Also, I'm not following the logic of overriding Content-type by the
remote encoding: p1_fran%C3%A7ais.html states "charset=UTF-8", but
includes a link encoded in ISO-8859-1, and the test seems to expect
Wget to use the remote encoding in preference to what "charset=" says.
Does the remote encoding override the encoding for the _contents_ of
the URL, not just for the URL itself?  That seems to make little sense
to me: the contents and the name can legitimately be encoded
differently, I think.

I guess I lack some basic info about what Wget is supposed to do in
these tricky situations, and how.  Can you help me understand that?
The manual doesn't seem to be very detailed about what's expected here.

TIA



Re: [Bug-wget] Support non-ASCII URLs

2015-12-18 Thread Eli Zaretskii
> From: Giuseppe Scrivano <gscriv...@gnu.org>
> Cc: bug-wget@gnu.org, Eli Zaretskii <e...@gnu.org>
> Date: Fri, 18 Dec 2015 11:31:17 +0100
> 
> >> Attached.
> >
> > Nice, thank you.
> >
> > There is just one test not passing: Test-ftp-iri.px
> >
> > Maybe the test is wrong (using --local-encoding=iso-8859-1, but writing to 
> > an 
> > UTF-8 filename). I am not very much into FTP. How do we know the remote 
> > encoding ?
> 
> the patch looks fine to me.  Eli, could you please modify the test the
> pass and add a note in NEWS?

Attached.

From 8ce8fc66bd6d994194eabd2768aefccbe2090e43 Mon Sep 17 00:00:00 2001
From: Eli Zaretskii <e...@gnu.org>
Date: Fri, 18 Dec 2015 17:03:26 +0200
Subject: [PATCH] Support non-ASCII URLs

* src/url.c [HAVE_ICONV]: Include iconv.h and langinfo.h.
(convert_fname): New function.
[HAVE_ICONV]: Convert file name from remote encoding to local
encoding.
(url_file_name): Call convert_fname.
(filechr_table): Don't consider bytes in 128..159 as control
characters.

* tests/Test-ftp-iri.px: Fix the expected file name to match the
new file-name recoding.  State the remote encoding explicitly on
the Wget command line.

* NEWS: Mention the URI recoding when built with libiconv.
---
 NEWS  |  7 +
 src/url.c | 87 +--
 tests/Test-ftp-iri.px |  4 +--
 3 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index c8cebad..c63c678 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,13 @@ Please send GNU Wget bug reports to <bug-wget@gnu.org>.
 
 * Changes in Wget X.Y.Z
 
+* When Wget is built with libiconv, it now converts non-ASCII URIs to
+  the locale's codeset when it creates files.  The encoding of the
+  remote files and URIs is taken from --remote-encoding, defaulting to
+  UTF-8.  The result is that non-ASCII URIs and files downloaded via
+  HTTP/HTTPS and FTP will have names on the local filesystem that
+  correspond to their remote names.
+
 * Changes in Wget 1.17.1
 
 * Fix compile error when IPv6 is disabled or SSL is not present.
diff --git a/src/url.c b/src/url.c
index c62867f..ca7fe29 100644
--- a/src/url.c
+++ b/src/url.c
@@ -43,6 +43,11 @@ as that of the covered work.  */
 #include "host.h"  /* for is_valid_ipv6_address */
 #include "c-strcase.h"
 
+#if HAVE_ICONV
+#include <iconv.h>
+#include <langinfo.h>
+#endif
+
 #ifdef __VMS
 #include "vms.h"
 #endif /* def __VMS */
@@ -1399,8 +1404,8 @@ UVWC, VC, VC, VC,  VC, VC, VC, VC,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   st   u   v   w   */
0,  0,  0,  0,   W,  0,  0,  C,   /* x   y   z   {|   }   ~   DEL */
 
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 128-143 */
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 144-159 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, /* 128-143 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, /* 144-159 */
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
 
@@ -1531,6 +1536,82 @@ append_uri_pathel (const char *b, const char *e, bool escaped,
   append_null (dest);
 }
 
+static char *
+convert_fname (const char *fname)
+{
+  char *converted_fname = (char *)fname;
+#if HAVE_ICONV
+  const char *from_encoding = opt.encoding_remote;
+  const char *to_encoding = opt.locale;
+  iconv_t cd;
+  size_t len, done, inlen, outlen;
+  char *s;
+  const char *orig_fname = fname;;
+
+  /* Defaults for remote and local encodings.  */
+  if (!from_encoding)
+from_encoding = "UTF-8";
+  if (!to_encoding)
+to_encoding = nl_langinfo (CODESET);
+
+  cd = iconv_open (to_encoding, from_encoding);
+  if (cd == (iconv_t)(-1))
+logprintf (LOG_VERBOSE, _("Conversion from %s to %s isn't supported\n"),
+	   quote (from_encoding), quote (to_encoding));
+  else
+{
+  inlen = strlen (fname);
+  len = outlen = inlen * 2;
+  converted_fname = s = xmalloc (outlen + 1);
+  done = 0;
+
+  for (;;)
+	{
+	  if (iconv (cd, (char **) &fname, &inlen, &s, &outlen) != (size_t)(-1)
+	      && iconv (cd, NULL, NULL, &s, &outlen) != (size_t)(-1))
+	{
+	  *(converted_fname + len - outlen - done) = '\0';
+	  iconv_close(cd);
+	  DEBUGP (("Converted file name '%s' (%s) -> '%s' (%s)\n",
+		   orig_fname, from_encoding, converted_fname, to_encoding));
+	  xfree (orig_fname);
+	  return converted_fname;
+	}
+
+	  /* Incomplete or invalid multibyte sequence */
+	  if (errno == EINVAL || errno == EILSEQ)
+	{
+	  logprintf (LOG_VERBOSE,
+			 _("Incomplete or invalid multibyte sequence encountered\n"));
+	  xfree (converted_fname);
+	  converted_fname = (char *)orig_fname;
+	  break;
+	}
+	  else if (errno == E2BIG) /* Output buffer full */
+	{
+	  done = len;
+	   

Re: [Bug-wget] Support non-ASCII URLs

2015-12-17 Thread Eli Zaretskii
> From: Tim Ruehsen <tim.rueh...@gmx.de>
> Cc: Giuseppe Scrivano <gscriv...@gnu.org>
> Date: Thu, 17 Dec 2015 17:50:47 +0100
> 
> @Eli: If my change is ok for Giuseppe, please apply the changes from iri.c to 
> your patch. If possible, make a local commit and create the attachment/patch 
> with 'git format -1' (or -2 for the latest two commits). That makes it easier 
> for us to apply the patch since author (you) and commit message are copied as 
> well.

Attached.

From 197483b6c62dcea1a900d626c79ba7e65a0c1e67 Mon Sep 17 00:00:00 2001
From: Eli Zaretskii <e...@gnu.org>
Date: Thu, 17 Dec 2015 20:06:30 +0200
Subject: [PATCH] Support non-ASCII URLs

* src/url.c [HAVE_ICONV]: Include iconv.h and langinfo.h.
(convert_fname): New function.
[HAVE_ICONV]: Convert file name from remote encoding to local
encoding.
(url_file_name): Call convert_fname.
(filechr_table): Don't consider bytes in 128..159 as control
characters.
---
 src/url.c | 87 +--
 1 file changed, 85 insertions(+), 2 deletions(-)

diff --git a/src/url.c b/src/url.c
index c62867f..ca7fe29 100644
--- a/src/url.c
+++ b/src/url.c
@@ -43,6 +43,11 @@ as that of the covered work.  */
 #include "host.h"  /* for is_valid_ipv6_address */
 #include "c-strcase.h"
 
+#if HAVE_ICONV
+#include <iconv.h>
+#include <langinfo.h>
+#endif
+
 #ifdef __VMS
 #include "vms.h"
 #endif /* def __VMS */
@@ -1399,8 +1404,8 @@ UVWC, VC, VC, VC,  VC, VC, VC, VC,   /* NUL SOH STX ETX  EOT ENQ ACK BEL */
0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   st   u   v   w   */
0,  0,  0,  0,   W,  0,  0,  C,   /* x   y   z   {|   }   ~   DEL */
 
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 128-143 */
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 144-159 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, /* 128-143 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, /* 144-159 */
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
 
@@ -1531,6 +1536,82 @@ append_uri_pathel (const char *b, const char *e, bool escaped,
   append_null (dest);
 }
 
+static char *
+convert_fname (const char *fname)
+{
+  char *converted_fname = (char *)fname;
+#if HAVE_ICONV
+  const char *from_encoding = opt.encoding_remote;
+  const char *to_encoding = opt.locale;
+  iconv_t cd;
+  size_t len, done, inlen, outlen;
+  char *s;
+  const char *orig_fname = fname;;
+
+  /* Defaults for remote and local encodings.  */
+  if (!from_encoding)
+from_encoding = "UTF-8";
+  if (!to_encoding)
+to_encoding = nl_langinfo (CODESET);
+
+  cd = iconv_open (to_encoding, from_encoding);
+  if (cd == (iconv_t)(-1))
+logprintf (LOG_VERBOSE, _("Conversion from %s to %s isn't supported\n"),
+	   quote (from_encoding), quote (to_encoding));
+  else
+{
+  inlen = strlen (fname);
+  len = outlen = inlen * 2;
+  converted_fname = s = xmalloc (outlen + 1);
+  done = 0;
+
+  for (;;)
+	{
+	  if (iconv (cd, (char **) &fname, &inlen, &s, &outlen) != (size_t)(-1)
+	      && iconv (cd, NULL, NULL, &s, &outlen) != (size_t)(-1))
+	{
+	  *(converted_fname + len - outlen - done) = '\0';
+	  iconv_close(cd);
+	  DEBUGP (("Converted file name '%s' (%s) -> '%s' (%s)\n",
+		   orig_fname, from_encoding, converted_fname, to_encoding));
+	  xfree (orig_fname);
+	  return converted_fname;
+	}
+
+	  /* Incomplete or invalid multibyte sequence */
+	  if (errno == EINVAL || errno == EILSEQ)
+	{
+	  logprintf (LOG_VERBOSE,
+			 _("Incomplete or invalid multibyte sequence encountered\n"));
+	  xfree (converted_fname);
+	  converted_fname = (char *)orig_fname;
+	  break;
+	}
+	  else if (errno == E2BIG) /* Output buffer full */
+	{
+	  done = len;
+	  len = outlen = done + inlen * 2;
+	  converted_fname = xrealloc (converted_fname, outlen + 1);
+	  s = converted_fname + done;
+	}
+	  else /* Weird, we got an unspecified error */
+	{
+	  logprintf (LOG_VERBOSE, _("Unhandled errno %d\n"), errno);
+	  xfree (converted_fname);
+	  converted_fname = (char *)orig_fname;
+	  break;
+	}
+	}
+  DEBUGP (("Failed to convert file name '%s' (%s) -> '?' (%s)\n",
+	   orig_fname, from_encoding, to_encoding));
+}
+
+iconv_close(cd);
+#endif
+
+  return converted_fname;
+}
+
 /* Append to DEST the directory structure that corresponds the
directory part of URL's path.  For example, if the URL is
http://server/dir1/dir2/file, this appends "/dir1/dir2".
@@ -1706,6 +1787,8 @@ url_file_name (const struct url *u, char *replaced_filename)
 
   xfree (temp_fnres.base);
 
+  fname = convert_fname (fname);
+
   /* Check the cases in which the unique extensions are not used:
  1) Clobbering is turned off (-nc).
  2) Retrieval with regetting.
-- 
2.6.4.windows.1



Re: [Bug-wget] Support non-ASCII URLs

2015-12-17 Thread Eli Zaretskii
> From: Tim Rühsen 
> Cc: gscriv...@gnu.org
> Date: Thu, 17 Dec 2015 21:16:28 +0100
> 
> There is just one test not passing: Test-ftp-iri.px
> 
> Maybe the test is wrong (using --local-encoding=iso-8859-1, but writing to an 
> UTF-8 filename). I am not very much into FTP. How do we know the remote 
> encoding ?

From --remote-encoding, and falling back to UTF-8.  It looks like the
file name is Latin-1 encoded in that test case, but you expect the
downloaded file name to be in UTF-8, although you use
"--local-encoding=iso-8859-1", is that right?  That should create a
file whose name is encoded in Latin-1, not UTF-8.



Re: [Bug-wget] Marking Release v1.17.1?

2015-12-16 Thread Eli Zaretskii
> From: Giuseppe Scrivano <gscriv...@gnu.org>
> Cc: Gisle Vanem <gva...@yahoo.no>, bug-wget@gnu.org
> Date: Wed, 16 Dec 2015 10:34:12 +0100
> 
> do you mind to send it in the git am format with a ChangeLog entry?

Attached.  (I presume by "ChangeLog entry" you meant a commit log
message formatted according to ChangeLog rules.)

From 9a0c637b07be7b842b9be21488238d578f39d781 Mon Sep 17 00:00:00 2001
From: Eli Zaretskii <e...@gnu.org>
Date: Wed, 16 Dec 2015 14:40:17 +0200
Subject: [PATCH] Avoid hanging on MS-Windows when invoked with
 --connect-timeout

* src/connect.c (connect_to_ip) [WIN32]: Don't call fd_close if
the connection timed out, to avoid hanging.
---
 src/connect.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/connect.c b/src/connect.c
index 024b231..0704000 100644
--- a/src/connect.c
+++ b/src/connect.c
@@ -369,7 +369,14 @@ connect_to_ip (const ip_address *ip, int port, const char *print)
logprintf.  */
 int save_errno = errno;
 if (sock >= 0)
-  fd_close (sock);
+  {
+#ifdef WIN32
+	/* If the connection timed out, fd_close will hang in Gnulib's
+	   close_fd_maybe_socket, inside the call to WSAEnumNetworkEvents.  */
+	if (errno != ETIMEDOUT)
+#endif
+	  fd_close (sock);
+  }
 if (print)
   logprintf (LOG_NOTQUIET, _("failed: %s.\n"), strerror (errno));
 errno = save_errno;
-- 
2.6.3.windows.1



Re: [Bug-wget] Support non-ASCII URLs

2015-12-16 Thread Eli Zaretskii
> From: Giuseppe Scrivano 
> Cc: bug-wget@gnu.org, andries.brou...@cwi.nl
> Date: Wed, 16 Dec 2015 10:53:51 +0100
> 
> > +  for (;;)
> > +   {
> > + if (iconv (cd, (char **) &fname, &inlen, &s, &outlen) != (size_t)(-1))
> > +   {
> > + /* Flush the last bytes.  */
> > + iconv (cd, NULL, NULL, &s, &outlen);
> 
> should not the return code be checked here?

We should probably simply copy what iri.c does in a similar function,
yes.

> > + else if (errno == E2BIG) /* Output buffer full */
> > +   {
> > + char *new;
> > +
> > + done = len;
> > + outlen = done + inlen * 2;
> > + new = xmalloc (outlen + 1);
> > + memcpy (new, converted_fname, done);
> > + xfree (converted_fname);
> 
> What would be the extra cost in terms of copied bytes if we just replace
> the three lines above with xrealloc?

I don't know, probably nothing.  This is simply copied (with trivial
changes) from do_conversion in iri.c, so if we want to make that
change, we should do it there as well.

Thanks.



Re: [Bug-wget] Support non-ASCII URLs (Was: GNU wget 1.17.1 released)

2015-12-15 Thread Eli Zaretskii
> Date: Sun, 13 Dec 2015 20:04:31 +0100
> From: "Andries E. Brouwer" <andries.brou...@cwi.nl>
> Cc: "Andries E. Brouwer" <andries.brou...@cwi.nl>, bug-wget@gnu.org
> 
> On Sun, Dec 13, 2015 at 08:01:27PM +0200, Eli Zaretskii wrote:
> 
> > If no one is going to pick up the gauntlet, I will sit down and do it
> > myself, although I'm terribly busy with Emacs 25.1 release.
> 
> Good!

OK, I'm ready to send the patch series.  I tested it on GNU/Linux and
on MS-Windows, and it passed all my tests.

I will send the patch in 2 parts.  This 1st part stops wget from
treating codepoints between 128 and 159 as control characters.  This
only makes sense with ISO-8859 encodings, which are used by a tiny
minority of systems nowadays.  Both UTF-8 and the Windows codepages
have printable characters and/or meaningful codes in that range that
must not be munged.

If we want to preserve back-compatibility in this respect, then a
variant of Tim's or Andries's patch could be used here, but the test
in it should be inverted: only if the locale's codeset is
ISO-8859-SOMETHING, we should treat these codepoints as control
characters.  All the other codesets should pass these codes unaltered.


diff --git a/src/url.c b/src/url.c
index c62867f..d984bf7 100644
--- a/src/url.c
+++ b/src/url.c
@@ -1399,8 +1404,8 @@ UVWC, VC, VC, VC,  VC, VC, VC, VC,   /* NUL SOH STX ETX  
EOT ENQ ACK BEL */
0,  0,  0,  0,   0,  0,  0,  0,   /* p   q   r   st   u   v   w   */
0,  0,  0,  0,   W,  0,  0,  C,   /* x   y   z   {|   }   ~   DEL */
 
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 128-143 */
-  C, C, C, C,  C, C, C, C,  C, C, C, C,  C, C, C, C, /* 144-159 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, /* 128-143 */
+  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0, /* 144-159 */
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
   0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,
 



Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)

2015-12-15 Thread Eli Zaretskii
> From: Tim Ruehsen <tim.rueh...@gmx.de>
> Cc: Eli Zaretskii <e...@gnu.org>
> Date: Tue, 15 Dec 2015 11:02:21 +0100
> 
> I pushed a conversion fix to master.

Thanks!

> There is another bug in wget that comes out with
> wget -d --local-encoding=cp1255 
> 'http://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4'
> 
> Wget double escapes/converts to UTF-8... Maybe you can address this when you 
> are working on the code !?

You mean, because http redirects to https?  Yes, I've seen that
already.  The simple patch below fixes that.  The problem seems to be
that wget assumes the redirected URL to be encoded in the same
encoding as the original one (which, as described earlier, starts with
the local encoding), whereas it is much more reasonable to use the
value provided by --remote-encoding.

And if the 'if' in the patch looks strange to you, it's rightfully
so.  Look at this strange logic in set_uri_encoding:

  /* Set uri_encoding of struct iri i. If a remote encoding was specified, use
 it unless force is true. */
  void
  set_uri_encoding (struct iri *i, const char *charset, bool force)
  {
DEBUGP (("URI encoding = %s\n", charset ? quote (charset) : "None"));
if (!force && opt.encoding_remote)
  return;

I understand the reason to prefer opt.encoding_remote when the 'force'
flag is false -- the user-provided remote encoding should take
preference.  But why return without making sure the URI's encoding is
in fact set to that??  I guess there's some assumption that
iri->uri_encoding is already set to opt.encoding_remote, but this
assumption is certainly false in this case.  So I think this function
should be changed to actually use opt.encoding_remote, if non-NULL,
and otherwise use 'charset' even if 'force' is false.  Then the patch
below could be simplified to avoid the test.  WDYT?

Here's the patch I promised.  With it, wget survives redirection from
http to https and successfully retrieves that page.


diff --git a/src/retr.c b/src/retr.c
index a6a9bd7..6af26a0 100644
--- a/src/retr.c
+++ b/src/retr.c
@@ -872,9 +872,11 @@ retrieve_url (struct url * orig_parsed, const char 
*origurl, char **file,
   xfree (mynewloc);
   mynewloc = construced_newloc;
 
-  /* Reset UTF-8 encoding state, keep the URI encoding and reset
+  /* Reset UTF-8 encoding state, set the URI encoding and reset
  the content encoding. */
   iri->utf8_encode = opt.enable_iri;
+  if (opt.encoding_remote)
+   set_uri_encoding (iri, opt.encoding_remote, true);
   set_content_encoding (iri, NULL);
   xfree (iri->orig_url);
 



Re: [Bug-wget] Support non-ASCII URLs

2015-12-15 Thread Eli Zaretskii
This second part is the main part of the change.  It uses 'iconv',
when available, to convert the file names to the local encoding,
before saving the files.  Note that the same function I modified is
used by ftp.c, so downloading via FTP should also work with non-ASCII
file names now; however, I didn't test that.

Thanks.

diff --git a/src/url.c b/src/url.c
index c62867f..d984bf7 100644
--- a/src/url.c
+++ b/src/url.c
@@ -43,6 +43,11 @@ as that of the covered work.  */
 #include "host.h"  /* for is_valid_ipv6_address */
 #include "c-strcase.h"
 
+#if HAVE_ICONV
+#include <iconv.h>
+#include <langinfo.h>
+#endif
+
 #ifdef __VMS
 #include "vms.h"
 #endif /* def __VMS */
@@ -1531,6 +1536,90 @@ append_uri_pathel (const char *b, const char *e, bool 
escaped,
   append_null (dest);
 }
 
+static char *
+convert_fname (const char *fname)
+{
+  char *converted_fname = (char *)fname;
+#if HAVE_ICONV
+  const char *from_encoding = opt.encoding_remote;
+  const char *to_encoding = opt.locale;
+  iconv_t cd;
+  /* sXXXav : hummm hard to guess... */
+  size_t len, done, inlen, outlen;
+  char *s;
+  const char *orig_fname = fname;
+
+  /* Defaults for remote and local encodings.  */
+  if (!from_encoding)
+from_encoding = "UTF-8";
+  if (!to_encoding)
+to_encoding = nl_langinfo (CODESET);
+
+  cd = iconv_open (to_encoding, from_encoding);
+  if (cd == (iconv_t)(-1))
+logprintf (LOG_VERBOSE, _("Conversion from %s to %s isn't supported\n"),
+  quote (from_encoding), quote (to_encoding));
+  else
+{
+  inlen = strlen (fname);
+  len = outlen = inlen * 2;
+  converted_fname = s = xmalloc (outlen + 1);
+  done = 0;
+
+  for (;;)
+   {
+ if (iconv (cd, (char **) &fname, &inlen, &s, &outlen) != (size_t)(-1))
+   {
+ /* Flush the last bytes.  */
+ iconv (cd, NULL, NULL, &s, &outlen);
+ *(converted_fname + len - outlen - done) = '\0';
+ iconv_close(cd);
+ DEBUGP (("Converted file name '%s' (%s) -> '%s' (%s)\n",
+  orig_fname, from_encoding, converted_fname, to_encoding));
+ return converted_fname;
+   }
+
+ /* Incomplete or invalid multibyte sequence */
+ if (errno == EINVAL || errno == EILSEQ)
+   {
+ logprintf (LOG_VERBOSE,
+_("Incomplete or invalid multibyte sequence encountered\n"));
+ xfree (converted_fname);
+ converted_fname = (char *)orig_fname;
+ break;
+   }
+ else if (errno == E2BIG) /* Output buffer full */
+   {
+ char *new;
+
+ done = len;
+ outlen = done + inlen * 2;
+ new = xmalloc (outlen + 1);
+ memcpy (new, converted_fname, done);
+ xfree (converted_fname);
+ converted_fname = new;
+ len = outlen;
+ s = converted_fname + done;
+   }
+ else /* Weird, we got an unspecified error */
+   {
+ logprintf (LOG_VERBOSE, _("Unhandled errno %d\n"), errno);
+ xfree (converted_fname);
+ converted_fname = (char *)orig_fname;
+ break;
+   }
+   }
+  DEBUGP (("Failed to convert file name '%s' (%s) -> '?' (%s)\n",
+  orig_fname, from_encoding, to_encoding));
+  /* Don't free fname here: on failure we return orig_fname, which
+ aliases the caller's buffer.  Close cd only if it was opened.  */
+  iconv_close(cd);
+}
+#endif
+
+  return converted_fname;
+}
+
 /* Append to DEST the directory structure that corresponds the
directory part of URL's path.  For example, if the URL is
http://server/dir1/dir2/file, this appends "/dir1/dir2".
@@ -1706,6 +1795,8 @@ url_file_name (const struct url *u, char *replaced_filename)
 
   xfree (temp_fnres.base);
 
+  fname = convert_fname (fname);
+
   /* Check the cases in which the unique extensions are not used:
  1) Clobbering is turned off (-nc).
  2) Retrieval with regetting.



Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)

2015-12-14 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Mon, 14 Dec 2015 20:22:41 +0100
> 
> >  1. The functions that call 'iconv' (in iri.c) don't make a point of
> > flushing the last portion of the converted URL after 'iconv'
> > returns successfully having converted the input string in its
> > entirety.  IME, you need then to call 'iconv' one last time with
> > either the 2nd or the 3rd argument set to NULL, otherwise
> > sometimes the last converted character doesn't get output.  In my
> > case, some URLs converted from CP1255 to UTF-8 lost their last
> > character.  It sounds like no one has actually used this
> > conversion in iri.c, except for trivially converting UTF-8 to
> > itself.  Is that possible/reasonable?
> 
> Possibly. 
> Could you please give an example string ? I would like to test it on 
> GNU/Linux, BSD and Solaris to see if the output is always the same.

This is what gave me trouble:

https://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4

This is https://he.wikipedia.org/wiki/ש._שפרה that Andries was using
in his tests, but it's encoded in CP1255 (and hex-encoded after that).
Try converting it into UTF-8, and you will get the last character
chopped off after 'iconv' returns.  Or at least that's what happens
for me.

> >  2. Wget assumes that the URL given on its command line is encoded in
> > the locale's encoding.  This is a good assumption when the user
> > herself types the URL at the shell prompt, but not when the URL is
> > copy-pasted from a browser's address bar.  In the latter case, the
> > URL tends to be in UTF-8 (sometimes hex-encoded).  At least that's
> > what I get from Firefox.  We don't seem to have in wget any
> > facilities to specify a separate (3rd) encoding for the URLs on
> > the command line, do we?
> 
> I stumbled upon this a while ago when thinking about the design of wget2. And 
> wget2 already has a working --input-encoding option for such cases.
> AFAIK, nobody asked for such an option during the last years - so I assume 
> this to be a somewhat 'expert' or 'fancy' option, at least a low priority one.
> It is an optional goodie.

IMO, it's a sorely missing feature, since copy/pasting URLs from a
browser is something people do very often.  I do it all the time,
because many times wget is much better in downloading large files than
a browser.



Re: [Bug-wget] GNU wget 1.17.1 released

2015-12-13 Thread Eli Zaretskii
> From: Tim Rühsen 
> Date: Sun, 13 Dec 2015 15:17:02 +0100
> Cc: andries.brou...@cwi.nl
> 
> Andries, thanks for insisting.
> 
> As Andries says, I came up with a polished version of his patch (17th 
> August), 
> but got no review resp. 'ok for pushing'.

AFAICS, the patch you posted does not cover the discussion in its
entirety, or at least doesn't follow the agreement reached near its
end.  I proposed a method to deal with the problem reported by
Andries, in a way that will work on Windows as well:

  http://lists.gnu.org/archive/html/bug-wget/2015-08/msg00154.html

Let me summarize it:

 . If the user asked for unmodified file names, do nothing with them

 . Otherwise, convert file names from their remote charset to the
   local charset using 'iconv'

 . 'iconv' needs the from-charset and the to-charset, which should be
   computed as follows:

   . if the user specified a from-charset, use that; otherwise assume
 UTF-8
   . if the user specified to-charset, use that; otherwise call
 nl_langinfo(CODESET) to find out the current locale's encoding,
 and use that

 . If 'iconv' fails, convert to ASCII using %NN hex-encoding

I believe the above is portable, with the sole exception of
nl_langinfo, which doesn't exist on Windows.  But there's a Gnulib
replacement; alternatively, a simple replacement can be put on
mswindows.c.

Comments?



Re: [Bug-wget] Marking Release v1.17.1?

2015-12-13 Thread Eli Zaretskii
> From: Gisle Vanem 
> Date: Sat, 12 Dec 2015 13:58:07 +0100
> 
> > Here's another one that I thought was already fixed, but apparently
> > wasn't - --connect-timeout doesn't work on Windows without this patch
> 
> You're right. This is needed:
> 
> --- src/connect.c~0 2014-12-02 09:49:37.0 +0200
> +++ src/connect.c   2015-03-17 17:14:48.414375000 +0200
> @@ -364,7 +364,12 @@ connect_to_ip (const ip_address *ip, int
> logprintf.  */
>  int save_errno = errno;
>  if (sock >= 0)
> -  fd_close (sock);
> +  {
> +#ifdef WIN32
> +   if (errno != ETIMEDOUT)
> +#endif
> + fd_close (sock);
> +  }
> 
> 
> But I don't really understand why. Care to explain?

I thought I explained this back in March, see

  http://lists.gnu.org/archive/html/bug-wget/2015-03/msg00134.html

If we call fd_close here with a socket that failed to connect, wget
hangs inside Gnulib's close_fd_maybe_socket, waiting for
WSAEnumNetworkEvents that never returns.  Why it never returns, I
don't know, but I suspect that a failed connection and a blocking
socket have something to do with that.

And yes, I think this should be applied.

>   --2015-12-12 12:43:06--  (try: 3)  http://10.0.0.22:21/
>   Connecting to 10.0.0.22:21... failed: Unknown error.
>   Giving up.
> 
>   Timer 1 off: 13.53.40  Elapsed: 0.00.33,08
> 
> Without your patch, that command never finishes.
> 
> The message wrongly says "Unknown error", but that is another matter...

In my testing (see the Mar 2015 message above) it said "Connection
timed out", as expected.  Can you see where did the value of errno get
overwritten?

Thanks.




Re: [Bug-wget] GNU wget 1.17.1 released

2015-12-13 Thread Eli Zaretskii
> From: Tim Rühsen 
> Cc: andries.brou...@cwi.nl
> Date: Sun, 13 Dec 2015 17:49:46 +0100
> 
> Someone has to implement/code it in a backward compatible fashion.
> We have coded this (or similar) already in the 'wget2' branch - which is 
> absolutely not mergable with master. I would like to see 'wget2' being 
> released within the next months... but it still needs help, testing, time.
> This is one of the reason why I personally won't put much time into coding 
> larger changes for wget1.x. I can still take time for reviews, cleanups, bug 
> fixes and small changes.

Fair enough.  Perhaps Andries could do this, then.  I believe he
agreed with my suggestions back then.  I can code the nl_langinfo
replacement for mswindows.c, if needed.

Thanks.



Re: [Bug-wget] GNU wget 1.17.1 released

2015-12-13 Thread Eli Zaretskii
> Date: Sun, 13 Dec 2015 18:35:30 +0100
> From: "Andries E. Brouwer" <andries.brou...@cwi.nl>
> Cc: bug-wget@gnu.org, Eli Zaretskii <e...@gnu.org>, andries.brou...@cwi.nl
> 
> The current state of affairs as I see it:
> 
> 1. wget is seriously broken: the default invocation cannot download
>remote utf8 files to a local utf8 system.
>People have complained for over ten years, there are many bug reports.
>The reason is some old name-mangling that had an ISO 8859-2 origin,
>and is totally inappropriate for UTF-8.
>It is extremely easy to fix the problem. Just remove this old name 
> mangling.
>The current problem may also be a security risk.
> 
> 2. I submitted a Unix-only patch that works.
> 
> 3. Eli Zaretskii discussed what should be done on Windows.
>We partly agree and partly disagree, but the details of our
>points of view are unimportant as there is no code for Windows.
> 
> 4. Within a few days the problem can be fixed for wget on Unix.
>For Windows we need someone, Eli or someone else, willing to
>actually write the code. Or we need to wait for Tim's wget2
>(although it might be that there are problems there as well,
>since using iconv may be problematic).

My memory is different: we agreed that iconv should be used.  Doing
that should be portable, and quite simple.

If no one is going to pick up the gauntlet, I will sit down and do it
myself, although I'm terribly busy with Emacs 25.1 release.  But
PLEASE do not release code that unnecessarily leaves Windows out of
this, it's simply unjustified in this case.  Especially after so much
time was invested into discussing this and arriving at the right
conclusions.

Thanks.



Re: [Bug-wget] Windows cert store support

2015-12-11 Thread Eli Zaretskii
> Date: Thu, 10 Dec 2015 01:12:37 +0100
> From: Ángel González 
> Cc: bug-wget 
> 
> On 09/12/15 03:06, Random Coder wrote:
> > I'm not sure if the wget maintainers would be interested, but I've
> > been carrying this patch around in my private builds of wget for a
> > while.  It allows wget to load SSL certs from the default Windows cert
> > store.
> >
> > The patch itself is fairly straightforward, but as it changes the
> > default SSL behavior, and no care was taken to follow coding conventions
> > when I wrote it, so it's probably not ready for inclusion in the
> > codebase.  Still, if it's useful, feel free to use it for ideas.
> Wow, supporting the OS store would certainly be very cool.
> 
> I would probably move it to windows.c and attempt to make it also work 
> in gnutls, but in general it looks good.

Wget compiled with GnuTLS already supports this feature: it calls
gnutls_certificate_set_x509_system_trust when the GnuTLS library
supports that.  gnutls_certificate_set_x509_system_trust does
internally what the proposed patch does.

So I think this code should indeed go only to openssl.c, as gnutls.c
already has its equivalent.

One other comment I have about the patch is that it's inconsistent
with what gnutls.c does:

  if (!opt.ca_directory)
ncerts = gnutls_certificate_set_x509_system_trust (credentials);
  /* If GnuTLS version is too old or CA loading failed, fallback to old 
behaviour.
   * Also use old behaviour if the CA directory is user-provided.  */
  if (ncerts <= 0)
{

IOW, condition the attempt to load the system certs on
opt.ca_directory, and fall back to the certs from files if that fails.

Thanks.




Re: [Bug-wget] bad filenames (again)

2015-08-23 Thread Eli Zaretskii
 Date: Sun, 23 Aug 2015 17:16:37 +0200
 From: Ángel González keis...@gmail.com
 CC: bug-wget@gnu.org
 
 On 23/08/15 16:47, Eli Zaretskii wrote:
  Wrong. I can work with a larger one by using a UNC path.
  But then you will be unable to use relative file names, and will have
  to convert all the file names to the UNC format by hand, and any file
  names we create that exceed the 260-character limit will be almost
  unusable, since almost any program will be unable to
  read/write/delete/copy/whatever it.  So this method is impractical,
  and it doesn't lift the limit anyway, see below.
 {{reference needed}}

For what part do you need a reference?

 I'm quite sure explorer will happily work with UNC paths, which means
 the user will be able to flawlessly move/copy/delete them.

No, the Explorer cannot handle files longer than 260 characters.  The
Explorer uses shell APIs that are limited to 260 characters.

Like I said: creating files whose names are longer than 260 characters
is asking for trouble.  You will need to write your own programs to
manipulate such files.

 And actually, I think most programs will happily open (and read,
 edit, etc.) a file that was provided in UNC format.

UNC format is indeed supported by most (if not all) programs, but as
soon as the file name is longer than 260 characters, all file-related
APIs begin to fail.

  * _Some_ Windows when using _some_ filesystems / apis have fixed limits,
  but there are ways to produce larger paths...
  The issue here is not whether the size limits differ, the issue is
  whether the largest limit is still fixed.  And it is, on Windows.
  I had tried to skip over the specific details in my previous mail. I
  didn't meant that
  the limit would be bigger, but that there isn't one (that you can rely
  on, at least). On
  Windows 95/98 you had this 260 character limit, and you currently still
  do depending
  on the API you are using. But that's not a system limit any more.
  This is wrong, and the URL I posted clearly describes the limitation:
  If you use UNCs, the size is still limited to 32K characters.  So even
  if we want to convert every file name to the UNC \\?\x:\foo\bar form
  and create unusable files (which I don't recommend), the maximum
  length is still known in advance.
 Ok, it is possible that there *is* a limit of 32K characters. Still, 
 it's not a
 practical one to hardcode.

Why not?  Here's a simple code snippet that should work:

  int
  open_utf8 (const char *fn, int mode)
  {
wchar_t fn_utf16[32*1024];
int result = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS, fn, -1,
 fn_utf16, 32*1024);

if (!result)
  {
DWORD err = GetLastError ();

switch (err)
  {
  case ERROR_INVALID_FLAGS:
  case ERROR_INVALID_PARAMETER:
errno = EINVAL;
break;
  case ERROR_INSUFFICIENT_BUFFER:
errno = ENAMETOOLONG;
break;
  case ERROR_NO_UNICODE_TRANSLATION:
  default:
errno = ENOENT;
break;
  }
return -1;
  }
return _wopen (fn_utf16, mode);
  }

 And we would be risking a stack overflow if attempting to create
 such buffer in the stack.

The default stack size of Windows programs is 2MB, so I think we are
safe using 64K here.




Re: [Bug-wget] bad filenames (again)

2015-08-23 Thread Eli Zaretskii
 Date: Sun, 23 Aug 2015 16:15:04 +0200
 From: Ángel González keis...@gmail.com
 CC: bug-wget@gnu.org
 
 On 20/08/15 04:42, Eli Zaretskii wrote:
  From: Ángel González wrote:
 
  On 19/08/15 16:38, Eli Zaretskii wrote:
  Indeed.  Actually, there's no need to allocate memory dynamically,
  neither will malloc nor with alloca, since Windows file names have
  fixed size limitation that is known in advance.  So each conversion
  function can use a fixed-sized local wchar_t array.  Doing that will
  also avoid the need for 2 calls to MultiByteToWideChar, the first one
  to find out how much space to allocate.
  Nope. These functions would receive full path names, so there's no
  maximum length.*
  Please see the URL I mentioned earlier in this thread: _all_ Windows
  file-related APIs are limited to 260 characters, including the drive
  letter and all the leading directories.
 Wrong. I can work with a larger one by using a UNC path.

But then you will be unable to use relative file names, and will have
to convert all the file names to the UNC format by hand, and any file
names we create that exceed the 260-character limit will be almost
unusable, since almost any program will be unable to
read/write/delete/copy/whatever it.  So this method is impractical,
and it doesn't lift the limit anyway, see below.

  * _Some_ Windows when using _some_ filesystems / apis have fixed limits,
  but there are ways to produce larger paths...
  The issue here is not whether the size limits differ, the issue is
  whether the largest limit is still fixed.  And it is, on Windows.
 I had tried to skip over the specific details in my previous mail. I 
 didn't meant that
 the limit would be bigger, but that there isn't one (that you can rely 
 on, at least). On
 Windows 95/98 you had this 260 character limit, and you currently still 
 do depending
 on the API you are using. But that's not a system limit any more.

This is wrong, and the URL I posted clearly describes the limitation:
If you use UNCs, the size is still limited to 32K characters.  So even
if we want to convert every file name to the UNC \\?\x:\foo\bar form
and create unusable files (which I don't recommend), the maximum
length is still known in advance.




Re: [Bug-wget] bad filenames (again)

2015-08-20 Thread Eli Zaretskii
 From: Tim Ruehsen tim.rueh...@gmx.de
 Cc: Andries E. Brouwer andries.brou...@cwi.nl
 Date: Thu, 20 Aug 2015 10:47:35 +0200
 
  Tim says he has some/most of that coded on a branch, so I think we
  should start by merging that branch, and then take it from there.
 
 It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can just 
 'click on the merge button' to merge.
 Basically, I keep track of the charset of each URL input (command line, input 
 file, stdin, downloaded+scanned). So when generating the filename we have the 
 to and from charset. When iconv fails here (e.g. Chinese input, ASCII 
 output), 
 escaping takes place.

Sounds good to me.  Is something holding the merge of this to master?



Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Wed, 19 Aug 2015 02:52:57 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: bug-wget@gnu.org
 
 Look at the remote filename.
 
 Assign a character set as follows:
 - if the user specified a from-charset, use that
 - if the name is printable ASCII (in 0x20-0x7f), take ASCII
 - if the name is non-ASCII and valid UTF-8, take UTF-8
 - otherwise take Unknown.

I think this is simpler and produces the same results:
 - if the user specified a from-charset, use that
 - otherwise assume UTF-8

 Determine a local character set as follows:
 - if the user specified a to-charset, use that
 - if the locale uses UTF-8, use that
 - otherwise take ASCII

I suggest this instead:
 - if the user specified a to-charset, use that
 - otherwise, call nl_langinfo(CODESET) to find out the current
   locale's encoding

 Convert the name from from-charset to to-charset:
 - if the user asked for unmodified filenames, do nothing
 - if the name is ASCII, do nothing
 - if the name is UTF-8 and the locale uses UTF-8, do nothing
 - convert from Unknown by hex-escaping the entire name
 - convert to ASCII by hex-escaping the entire name
 - otherwise invoke iconv(); upon failure, escape the illegal bytes

My suggestion:
 - if the user asked for unmodified filenames, do nothing
 - else invoke 'iconv' to convert from remote to local encoding
 - if 'iconv' fails, convert to ASCII by hex-escaping

Hex-escaping only the bytes that fail 'iconv' is better than
hex-escaping all of them, but it's more complex, and I'm not sure it's
worth the hassle.  But if it can be implemented without undue trouble,
I'm all for it, as it will make wget more user-friendly in those
cases.

 Once we know what we want it is trivial to write the code,
 but it may take a while to figure out what we want.
 I think we should start applying the current patch.

Tim says he has some/most of that coded on a branch, so I think we
should start by merging that branch, and then take it from there.



Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Tue, 18 Aug 2015 22:28:21 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, tim.rueh...@gmx.de,
 bug-wget@gnu.org
 
  What is needed to have a full Unicode support in wget on Windows is to
  provide replacements for all the file-name related libc functions
  ('fopen', 'open', 'stat', 'access', etc.) which will accept file names
  encoded in UTF-8, convert them internally into UTF-16, and call the
  wchar_t equivalents of those functions ('_wfopen', '_wopen', '_wstat',
  '_waccess', etc.) with the converted file name.  Another thing that is
  needed is similar replacements for 'printf', 'puts', 'fprintf',
  etc. when they are used for writing file names to the console --
  because we cannot write UTF-8 sequences to the Windows console.
 
 Aha. That reminds me of a patch by I think Aleksey Bykov.
 Yes - see http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00080.html
 
 There we had a similar discussion, and he wrote mswindows.diff with
 
 +int 
 +wc_utime (unsigned char *filename, struct _utimbuf *times)
 +{
 +  wchar_t *w_filename;
 +  int buffer_size;
 +
 +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
 +  w_filename = alloca (buffer_size);
 +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
 +  return _wutime (w_filename, times);
 +}
 
 and similar for stat, open, etc. Something similar is what would be needed on 
 Windows?

Yes, thanks for pointing out those patches.  Any reasons they weren't
accepted back then?

 Is his patch usable?

It needs some minor polishing, but in general it should do the job,
yes.

I admit that I don't understand the need for the url.c patch.  Why do
we need to convert to wchar_t when the locale's codeset is already
UTF-8?  (I could understand that for non-UTF-8 locales, but the patch
explicitly limits the conversion to wchar_t and back to UTF-8 locales,
where the normal string functions should do the job.)  Is this only
for converting to upper/lower-case?

There's still the part with writing UTF-8 encoded file/URL names to
the Windows console; that will have to be added.



Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Wed, 19 Aug 2015 01:43:51 +0200
 From: Ángel González keis...@gmail.com
 
 +int
 +wc_utime (unsigned char *filename, struct _utimbuf *times)
 +{
 +  wchar_t *w_filename;
 +  int buffer_size;
 +
 +  buffer_size = sizeof (wchar_t) * MultiByteToWideChar(65001, 0, filename, -1, w_filename, 0);
 +  w_filename = alloca (buffer_size);
 +  MultiByteToWideChar(65001, 0, filename, -1, w_filename, buffer_size);
 +  return _wutime (w_filename, times);
 +}
 
 and similar for stat, open, etc. Something similar is what would be 
 needed on 
 Windows?
 Is his patch usable? Maybe I also commented a little in
 http://lists.gnu.org/archive/html/bug-wget/2014-04/msg00081.html
 but after that nothing happened, it seems.
 
 That would probably work, but would need a review. On a quick look, some of 
 the functions have memory leaks (seems he first used malloc, then changed to 
 alloca just some of them).

Indeed.  Actually, there's no need to allocate memory dynamically,
neither will malloc nor with alloca, since Windows file names have
fixed size limitation that is known in advance.  So each conversion
function can use a fixed-sized local wchar_t array.  Doing that will
also avoid the need for 2 calls to MultiByteToWideChar, the first one
to find out how much space to allocate.

 And of course, there's the question of what to do if the filename we are 
 trying to convert to utf-16 is not in fact valid utf-8.

The calls to MultiByteToWideChar should use a flag
(MB_ERR_INVALID_CHARS) in its 2nd argument that makes the function
fail with a distinct error code in that case.  When it fails like
that, the wc_* wrappers should simply call the normal unibyte
functions with the original 'char *' argument.  This makes the
modified code fall back on previous behavior when the source file
names are not in UTF-8.

And regardless, wget should convert to the locale's codeset (on all
platforms).  Once the above patches are accepted, the Windows build
will pretend that its locale's codeset is UTF-8, and that will ensure
the conversions with MultiByteToWideChar will work in most situations.




Re: [Bug-wget] bad filenames (again)

2015-08-19 Thread Eli Zaretskii
 Date: Wed, 19 Aug 2015 20:50:55 +0200
 From: Andries E. Brouwer andries.brou...@cwi.nl
 Cc: Andries E. Brouwer andries.brou...@cwi.nl, keis...@gmail.com,
 bug-wget@gnu.org
 
 On Wed, Aug 19, 2015 at 09:46:04PM +0300, Eli Zaretskii wrote:
 
  OK, but how is this different from what we'd get using your suggested
  4 alternatives?
 
 What can I reply? Just read my letter again.
 I think I said what I wanted to say.

OK, then let me explain my line of reasoning.  Plain ASCII is valid
UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
know it's not valid UTF-8.  So the last 3 possibilities in your
suggestion boil down to try converting as if it were UTF-8, and if
that fails, you know it's Unknown.


