Re: [Bug-wget] wget -crNl inf --- filenames mangled

Andres Valloud Thu, 14 Feb 2019 03:32:21 -0800

Tim,

On 2/14/19 02:03, Tim Rühsen wrote:

I looked at the downloaded html files with grep.  They do contain the
substring "1f43", seemingly after a ^M character (I did not check every
single occurrence).  Sometimes, the ^M character is within a file name
such as this:


<tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
1f43^M
"


If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers simply
ignore it. This is nothing that can be addressed with --restrict-file-names.
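
To check every occurrence rather than a sample with grep, the downloaded
files could be scanned for the pattern shown in the excerpt: a ^M, a short
hex-looking token such as "1f43", and another ^M. What follows is only a
rough sketch. The function names, the regular expression, and the 1-8
digit bound on the token are assumptions made for illustration, not
anything wget does, and the pattern can produce false positives on files
that legitimately contain a short hex word on a line of its own.

```python
import re

# CR, optional LF, a short run of hex digits, CR, optional LF.
# The 1-8 digit bound is an assumption chosen for illustration.
STRAY_TOKEN = re.compile(rb"\r\n?([0-9A-Fa-f]{1,8})\r\n?")


def find_stray_tokens(data: bytes):
    """Return (offset, token) pairs for hex-looking tokens wedged between CRs."""
    return [(m.start(), m.group(1).decode("ascii"))
            for m in STRAY_TOKEN.finditer(data)]


def strip_stray_tokens(data: bytes) -> bytes:
    """Return data with every matched CR/token/CR run removed."""
    return STRAY_TOKEN.sub(b"", data)
```

Reading each mirrored file in binary mode and passing its bytes to
find_stray_tokens() would list where the tokens sit, which makes it easier
to tell whether they only show up in files fetched during the mirror run.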

But to make sure, look at the original file by downloading it with 'wget
<URL>'. Does the file have the above '1f43'/^M stuff in it as well ? If
so, we can't do much about it.

If all looks ok in there, please attach both files so we can compare and
possibly reproduce.

If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming via Firefox.
curl and wget both have the --user-agent option for this.

Do you get a different file when using that option ?

There was one additional detail to make this work. Instead of placing a request for index.html, I had to ask curl to get just the directory name ending with a slash. Then the server responded with (essentially) index.html.

Both curl and wget retrieve index.html contents without '1f43' when asking for just that URL. vimdiff says the retrieved files are identical.
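
For anyone who wants to repeat this check without curl or wget, here is a
minimal sketch using only the Python standard library. It sends the
Firefox User-Agent string quoted above and compares two payloads by
digest instead of vimdiff. The helper names (build_request, fetch, digest)
and the 30 second timeout are hypothetical choices for illustration, and
the URL passed in has to be the directory URL ending with a slash, which
is not spelled out in this message.

```python
import hashlib
import urllib.request

# User-Agent string quoted earlier in this message.
FIREFOX_UA = ("Mozilla/5.0 (X11; Linux x86_64; rv:65.0) "
              "Gecko/20100101 Firefox/65.0")


def build_request(url: str, user_agent: str = FIREFOX_UA) -> urllib.request.Request:
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str = FIREFOX_UA, timeout: float = 30.0) -> bytes:
    """Download the body at url and return it as raw bytes."""
    with urllib.request.urlopen(build_request(url, user_agent),
                                timeout=timeout) as resp:
        return resp.read()


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used here as a stand-in for a byte-wise diff."""
    return hashlib.sha256(data).hexdigest()
```

Calling fetch() on the directory URL and comparing digest() of the result
with digest() of the copy stored in the mirror would show whether the two
differ at all, before looking for '1f43' specifically.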

I am at a loss as to how to explain how the '1f43' problem appears when asking wget to update the mirror of the site (rather than downloading a single file). I'll look at the log file tomorrow and see if I get more ideas.


Andres.
