Hi Andres,

On 2/14/19 6:23 AM, Andres Valloud wrote:
> Hi,
>
> I've run into an issue with wget, I don't know what else to do to debug
> the problem. The use case is mirroring a website with the command
>
> wget -crNl inf https://... -P local/folder
>
> Initially I was running wget 1.17, and that was very slow in the case
> the files were already downloaded. I switched to 1.20.1 (latest), the
> behavior is way faster now. Things progress nicely for about 3 hours,
> but deterministically the file name "greenbrq.669" is transformed into
> "1f43greenbrq.669", this results in a 404 and wget aborts after 5805 files.
>
> Running this with logging turned on results in a 399 megabyte log file.
> Looking at the occurrences of greenbrq.669, I suspect because of -l inf
> the file is found several times. The last time, however, it looks like
> there is an index.html file on the server that has the wrong name. But
> using a web browser to presumably look at said index.html file does not
> result in a link to the wrongly named file, because the file downloads
> fine.
>
> Next I noted that when the 1f43 prefix to greenbrq.669 appears, there is
> a mention to IRI. I suspected that perhaps there was some confusion
> going on with filename encoding, so I provided --no-iri to wget and ran
> the job again. Another 399 megabyte log file was produced, and the
> result was the same. Interestingly, however, the log file has "[IRI"
> entries, even though the --no-iri switch was provided. Is this as
> expected? In both log files, egrep "^.IRI" results in lines that always
> end with "None".
>
> Looking at the log, it looks like the file URL is encountered several
> times. Some times it is mentioned with UTF-8, sometimes it isn't.
> Before the first time greenbrq.669 appears with the seemingly bogus 1f43
> prefix, the previous occurrence of greenbrq.669 in the log file is a log
> entry that says "no-follow".
>
> Also looking at the log, there are other files with mangled names,
> except these have 1f43 suffixed to the filename, e.g.: mod.png1f43. A
> quick check shows many of these mangled file names have URLs sizes that
> are zero modulo 4 (I did not check *every* mangled file name).
>
> I looked at the downloaded html files with grep. They do contain the
> substring "1f43", seemingly after a ^M character (I did not check every
> single occurrence). Sometimes, the ^M character is within a file name
> such as this:
>
> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
> 1f43^M
> "

If this is contained in the HTML file, then 'mp3ogg.png1f43' seems correct.
^M is a Carriage Return (Microsoft uses ^M plus linefeed for End-Of-Line
(EOL)). In an HTML file, EOL has no meaning - parsers simply ignore it.

This is nothing that can be addressed with --restrict-file-names.

But to make sure, look at the original file by downloading it with
'wget <URL>'. Does that file have the above '1f43'/^M stuff in it as well?
If so, we can't do much about it. If all looks ok in there, please attach
both files so we can compare and possibly reproduce.

>
> (and wget thinks it has to download "mp3ogg.png1f43", as if it had
> ignored ^M and had merged the path with the 1f43 segment) and some
> others, like this:
>
> <td align="right">2014-10-02 20:24 ^M
> 1f43^M
> </td>
>
> I have no idea whether these HTML files are valid or even meaningful. I
> tried using curl to get one of those HTML files with another mechanism,
> but unfortunately the site maintainer does not allow using curl.

If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming from Firefox. curl and wget both have the --user-agent
option for this. Do you get a different file when using that option?

> There
> are additional restrictions on the web browsers allowed. I can look at
> the website with Safari (which downloads greenbrq.669 properly), and I
> can also ask Safari to save the page where the file greenbrq.669 is
> listed --- the saved file does not have any occurrences of "1f43".
>
> Googling for answers, and especially instances of "1f43", didn't turn up
> anything immediately interesting. However, I found the following with
> seems somewhat related to the problem.
>
> https://www.win.tue.nl/~aeb/linux/misc/wget.html
>
> Is there any credence to the above report? Just to make sure, doing as
> it said with --restrict-file-names=nocontrol did not eliminate the
> apparently spurious occurrences of "1f43" from the wget log file.
>
> What else can I do to diagnose why this apparent misbehavior is occurring?
>
> Andres.
>

Regards, Tim
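
To compare the file fetched during the mirror run with one fetched
directly via 'wget <URL>', a small script can count the suspicious
sequences in each. The following is a minimal Python sketch, not anything
wget ships: the function names and the regular expression are illustrative
assumptions, and the pattern (CR, optional LF, up to eight hex digits, CR)
is inferred only from the two snippets quoted above.

```python
import re

# Assumed shape of the artifact, based on the quoted snippets:
# carriage return, optional line feed, a short hex-looking token
# such as "1f43", then another carriage return (checked via lookahead
# so that back-to-back occurrences are all reported).
STRAY_TOKEN = re.compile(rb"\r\n?([0-9a-fA-F]{1,8})(?=\r)")


def find_stray_tokens(data: bytes) -> list[tuple[int, str]]:
    """Return (byte offset, token) for each CR + hex token + CR sequence."""
    return [
        (match.start(), match.group(1).decode("ascii"))
        for match in STRAY_TOKEN.finditer(data)
    ]


def scan_file(path: str) -> list[tuple[int, str]]:
    """Read a downloaded file as raw bytes and scan it for stray tokens."""
    with open(path, "rb") as handle:
        return find_stray_tokens(handle.read())
```

Calling scan_file() on both copies of the same page and comparing the
results shows whether the token is present in one download only. A line
that legitimately consists of nothing but hex digits would also match, so
any hits still need a look by eye.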
