Re: jigdo-file: Does not report package rejections because checksum mismatch
Hi, i wrote: > > us.cdimage.debian.org is a quick one. Nicholas Geovanis wrote: > From which location? Germany? Ja. How about this final message in case that files are missing and that mismatching downloads were detected ? (The mismatches shown are fake and recorded twice to get more than 2.) == begin of example - Aaargh - 1 files remain missing. This should not happen! 4 download attempts yielded files with mismatching MD5 checksums: http://archive.debian.org/debian/pool/main/p/partman-reiserfs/partman-reiserfs_50_all.udeb http://archive.debian.org/debian/pool/main/p/partman-reiserfs/partman-reiserfs_50_all.udeb http://us.cdimage.debian.org/cdimage/snapshot/Debian/pool/main/p/partman-reiserfs/partman-reiserfs_50_all.udeb ... the WARNING messages of this run report more ... After a retry with the same mirror consider to search the web for the names of remaining mismatching packages. As mirror name use the found URL up to the "/" before the directory name "pool". Press Return to retry downloading the missing files. Press Ctrl-C to abort. (If you re-run jigdo-lite later, it will resume from here, the downloaded data is not lost if you press Ctrl-C now.) : == end of example Is the advise understandable ? I propose one repetition to remove any warnings from the user defined mirror which were resolved by the fallback mirror. In the second run only messages about the problematic files should emerge, hopefully making it easier to spot the checksum warnings. The list of URLs must be restricted to 3 and the usual messages must be omitted in order to keep the text below 24 lines if the URLs are longer than two lines. The usual message appears if no mismatches were detected but still files are missing. I changed the old message Aaargh - 1 files could not be downloaded. This should not happen! ... in both cases to Aaargh - 1 files remain missing. This should not happen! because it matches better the spectrum of potential problem causes. --- /usr/bin/jigdo-lite.sid 2017-12-28 14:20:23.882643023 +0100 +++ /home/thomas/projekte/jigdo_dir/jigdo-lite.sid.with_md5_check 2017-12-29 20:09:34.055048360 +0100 @@ -29,6 +29,13 @@ else windows=false nl='\n' fi + +# Counter of MD5 mismatches after download +fileMismatchCount=0 +# Short list of URLs which yielded mismatch +fileMismatchList="" +# Very few mismatching URLs shall be shown at the end. They can be long. +fileMismatchMaxRec=3 #__ # read with readline, only if running bash >=2.03 (-e gives error on POSIX) @@ -75,10 +82,84 @@ fetch() { } #__ -# Given URLs, fetch them into $imageTmp, then merge them into image +# DEVELOPMENT TEST: +# Set to a package name of the ISO to get simulated MD5 mismatch +# simulateMD5Mismatch="partman-reiserfs_50_all.udeb" +simulateMD5Mismatch="NOT_A_PACKAGE_NAME" + +# Given URLs and MD5s, fetch them into $imageTmp and verify, +# then merge them into image fetchAndMerge() { + + # The other arguments are URLs in the same sequence as the words in md5List + md5List="$1" + shift 1 + if test "$#" -eq 0; then return 0; fi fetch --force-directories --directory-prefix="$imageTmp" -- "$@" + + # Try to verify downloaded files + for md5 in $md5List + do +url="$1" +shift 1 +test "$md5" = ".no.MD5.known." && continue + +# Simulated MD5 mismatch +if echo "$url" | grep '/'"$simulateMD5Mismatch"'$' >/dev/null 2>&1 +then + echo "DEVELOPMENT TEST: Faking a checksum mismatch with package $simulateMD5Mismatch" >&2 + md5="*INVALIDATED*CHECKSUM*" +fi + +# Alternative proposals by Philip Hands: +# localPath="$imageTmp"/`echo "$url" | sed -e 's,^\(https\?\|ftp\|file\)://,,i'` +# localPath="$imageTmp/${url#[[:alpha:]]*://}" + +localPath="$imageTmp"/`echo "$url" | \ +sed -e 's/^[hH][tT][tT][pP]:\/\///' \ +-e 's/^[hH][tT][tT][pP][sS]:\/\///' \ +-e 's/^[fF][tT][pP]:\/\///' \ +-e 's/^[fF][iI][lL][eE]:\/\///'` + +if test ! -e "$localPath" +then + # Maybe the file was downloaded but above guess was wrong + baseName=`basename "$url"` + localPath=`find "$imageTmp" -name "$baseName" | head -1` +fi + +if test -n "$localPath" -a -e "$localPath" +then + fileMD5=`$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'` + if test "$md5" != "$fileMD5" + then +echo >&2 +echo "WARNING: Downloaded file does not match expected MD5:" >&2 +echo " $url" >&2 +echo "
Re: jigdo-file: Does not report package rejections because checksum mismatch
On Fri, Dec 29, 2017 at 5:35 AM, Thomas Schmittwrote: > I wrote: >> > > > sed -e 's/^[hH][tT][tT][pP]:\/\///' \ >> > > >... more -e for https, ftp, and file ... > > Philip Hands wrote: >> > > sed -e 's,^\(https\?\|ftp\|file\)://,,i' >> > > ... >> > > "$imageTmp/${url#[[:alpha:]]*://}" > >> > Are these widely portable enough ? > >> [Philip Hands analyzed portability with positive result] > > I still think that the long explicit sed is clearer. FWIW I agree. And with respect to the new awk dependency mentioned earlier, it's a simple expression and awk tempts me with that all the time because it's so much easier. Just you know, be advised that on say Debian 8.9 it's gawk while on Ubuntu 17.10 it's mawk, etc. Debian 9.2 is still gawk. > > The effective throughput of roughly 1.5 to 2.5 MB/s is still much slower > than wget's speed report of about 5.5 MB/s. > I tried with 100 files per run of wget and "jigdo-file make-image". > No significant difference to see. It's all about mirror server latency > with each single file, i guess. > us.cdimage.debian.org is a quick one. >From which location? Germany? I find it a bit slower than average from here inside the US, but I have large, local mirrors. And at long last, one of my own is budgeted in 2018. You betcha ;-) > I wrote: >> > > > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'` > > Philip Hands wrote: >> `$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'` > > I'll take this one. Oh, OK. Yeah, better. > Next development step will be to issue a correct "Aaargh" message and to > tell at least some of the mismatching files in that message. > Have a nice day :) Vielen Dank Thomas. > Thomas >
Re: jigdo-file: Does not report package rejections because checksum mismatch
Hi, I wrote: > > > > sed -e 's/^[hH][tT][tT][pP]:\/\///' \ > > > >... more -e for https, ftp, and file ... Philip Hands wrote: > > > sed -e 's,^\(https\?\|ftp\|file\)://,,i' > > > ... > > > "$imageTmp/${url#[[:alpha:]]*://}" > > Are these widely portable enough ? > [Philip Hands analyzed portability with positive result] I still think that the long explicit sed is clearer. But in the end it will be up to Steve to decide which one to use. I tested both proposals of yours and have put them as comments into my evolving changeset. The importance of this expression has decreased by my decision to run "find" if the guessed local path does not lead to an existing file: localPath=... guessed from URL ... if test ! -e "$localPath" then # Maybe above guess was wrong baseName=`basename "$url"` localPath=`find "$imageTmp" -name "$baseName" | head -1` fi if test -n "$localPath" -a -e "$localPath" then ... checksum verification ... The use of "head" and "find" will be new in the script. But the increased ruggedness makes it worthwhile in my opinion. I made mini benchmarks with guessed names and found names. No significant differences were to see. (The tree is really small because fetchAndMerge() deletes it when the 10 files are processed.) The effective throughput of roughly 1.5 to 2.5 MB/s is still much slower than wget's speed report of about 5.5 MB/s. I tried with 100 files per run of wget and "jigdo-file make-image". No significant difference to see. It's all about mirror server latency with each single file, i guess. us.cdimage.debian.org is a quick one. I wrote: > > > > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'` Philip Hands wrote: > `$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'` I'll take this one. Next development step will be to issue a correct "Aaargh" message and to tell at least some of the mismatching files in that message. Have a nice day :) Thomas
Re: jigdo-file: Does not report package rejections because checksum mismatch
On Thu, 28 Dec 2017, "Thomas Schmitt"wrote: > Hi, > > first a correction of my proposal: > > The else-case with > echo "WARNING: File not found after download:" >&2 > is not good. > It floods the log if one uses a mirror with few matching files. > Not finding a file after the download is a normal situation. > > --- > > I wrote: >> >sed -e 's/^[hH][tT][tT][pP]:\/\///' \ >> >-e 's/^[hH][tT][tT][pP][sS]:\/\///' \ >> >-e 's/^[fF][tT][pP]:\/\///' \ >> >-e 's/^[fF][iI][lL][eE]:\/\///'` > > Philip Hands wrote: >> sed -e 's,^\(https\?\|ftp\|file\)://,,i' >> ... >> "$imageTmp/${url#[[:alpha:]]*://}" > > Are these widely portable enough ? the , rather than / feature is already in use in the script (except that its using s%%%). \( \) is already in use, and AFAIK \| has been there for as long \? _might_ be a bit later as a feature, in which case one could add \|https, but then again isURI() doesn't match https: anyway The i flag is a GNU extension, so is probably not that portable, so one could go for \(http\|HTTP\|... For the shell, I suspect that [[:alpha:]] is an innovation from the 90's, so one could play it safe (well, except that it might break with odd codings) with [a-zA-Z]. posh doesn't seem to know about [:alpha:] for instance. posh does know about the ${ # } thing, but that wasn't in Solaris SVR4 shell AFAIK. > Mine can be justified by S.R.Bourne's "The Unix System", i guess, > and it is coordinated with function isURI. > > Well, my scruples are mainly about what wget guarantees to use as > local disk path. I understand that jigdo-file would be quite tolerant > as long as the file is somewhere in the "$imageTmp" tree. > Maybe i should invest a "find" run in case of missing file. The tree is small. > > > I wrote: >> > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'` > >> The way it's done elsewhere in the script (which I happen to think is >> pretty horrible, but that's what is already there) is using set, thus: >> set -- `$jigdoFile md5sum --report=quiet "$localPath"` > > Option "--report=quiet" instead of "2>/dev/null" is a good point. > > One would have to wrap the "set --" into a sub-shell, because fetchAndMerge > already tampers with its own arguments. > Like: > answer=`$jigdoFile md5sum --report=quiet "$localPath"` > fileMD5=`(set -- $answer ; echo "$1")` > Now that's really ugly. This seems preferable, and avoids new dependencies: `$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'` > If direct objections emerge against "awk", i'd consider some helper > function which echos "$1". > > >> I also happen to think that using `` rather than $() is pretty horrible >> in this day and age, but that's what's currently there throughout the > > Yep. Not to speak of the headless camelBack variable names. > > I strive to be minimally intrusive for the purpose and to be as > conservative as in an autotools script. Fair enough. Cheers, Phil. -- |)| Philip Hands [+44 (0)20 8530 9560] HANDS.COM Ltd. |-| http://www.hands.com/http://ftp.uk.debian.org/ |(| Hugo-Klemm-Strasse 34, 21075 Hamburg,GERMANY signature.asc Description: PGP signature
Re: jigdo-file: Does not report package rejections because checksum mismatch
Hi, first a correction of my proposal: The else-case with echo "WARNING: File not found after download:" >&2 is not good. It floods the log if one uses a mirror with few matching files. Not finding a file after the download is a normal situation. --- I wrote: > >sed -e 's/^[hH][tT][tT][pP]:\/\///' \ > >-e 's/^[hH][tT][tT][pP][sS]:\/\///' \ > >-e 's/^[fF][tT][pP]:\/\///' \ > >-e 's/^[fF][iI][lL][eE]:\/\///'` Philip Hands wrote: > sed -e 's,^\(https\?\|ftp\|file\)://,,i' > ... > "$imageTmp/${url#[[:alpha:]]*://}" Are these widely portable enough ? Mine can be justified by S.R.Bourne's "The Unix System", i guess, and it is coordinated with function isURI. Well, my scruples are mainly about what wget guarantees to use as local disk path. I understand that jigdo-file would be quite tolerant as long as the file is somewhere in the "$imageTmp" tree. Maybe i should invest a "find" run in case of missing file. The tree is small. I wrote: > > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'` > The way it's done elsewhere in the script (which I happen to think is > pretty horrible, but that's what is already there) is using set, thus: > set -- `$jigdoFile md5sum --report=quiet "$localPath"` Option "--report=quiet" instead of "2>/dev/null" is a good point. One would have to wrap the "set --" into a sub-shell, because fetchAndMerge already tampers with its own arguments. Like: answer=`$jigdoFile md5sum --report=quiet "$localPath"` fileMD5=`(set -- $answer ; echo "$1")` Now that's really ugly. If direct objections emerge against "awk", i'd consider some helper function which echos "$1". > I also happen to think that using `` rather than $() is pretty horrible > in this day and age, but that's what's currently there throughout the Yep. Not to speak of the headless camelBack variable names. I strive to be minimally intrusive for the purpose and to be as conservative as in an autotools script. An alternative to changing the code would still be to tell the user with the "Aaargh" text that repeated download and subsequent "Aaargh" could indicate damaged files on the mirror. In this case the user shall search the web for other mirrors which offer the repeatedly downloaded packages. But that would be embarrassing for the involved programmers. (Having script jigdo-lite instead of doing the job inside jigdo-file is also not overly glorious ...) Have a nice day :) Thomas
Re: jigdo-file: Does not report package rejections because checksum mismatch
On Thu, 28 Dec 2017, "Thomas Schmitt"wrote: ... > Especially in need of attention: > > - I could not find a clear description how "wget" determines the local > file paths from URLs and option --directory-prefix="$imageTmp". > My current conversion from URL to path is purely heuristic therefore: > > localPath="$imageTmp"/`echo "$url" | \ >sed -e 's/^[hH][tT][tT][pP]:\/\///' \ >-e 's/^[hH][tT][tT][pP][sS]:\/\///' \ >-e 's/^[fF][tT][pP]:\/\///' \ >-e 's/^[fF][iI][lL][eE]:\/\///'` A rather less laboured way of getting the same effect with sed would be: sed -e 's,^\(https\?\|ftp\|file\)://,,i' [ Things to note about that: s,,, in place of s/// means that no escaping of / is needed the 'i' flag at the end makes the match case insensitive s\? means match zero or one 's' ] However, I doubt that it's important to worry about the potential for unexpectedly removing a prefix of e.g. cdrom:// or ://, in which case you could dispense with sed and instead do this: localpath="$imageTmp/${url#[[:alpha:]]*://}" > - I introduced a dependency on "awk", which was not used in jigdo-lite > before. The task is to obtain the first word of jigdo-file's output: > > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'` The way it's done elsewhere in the script (which I happen to think is pretty horrible, but that's what is already there) is using set, thus: set -- `$jigdoFile md5sum --report=quiet "$localPath"` which leaves the value that you are after in $1. I also happen to think that using `` rather than $() is pretty horrible in this day and age, but that's what's currently there throughout the script, so I guess one should stick with that, or fix it everywhere. Cheers, Phil. -- |)| Philip Hands [+44 (0)20 8530 9560] HANDS.COM Ltd. |-| http://www.hands.com/http://ftp.uk.debian.org/ |(| Hugo-Klemm-Strasse 34, 21075 Hamburg,GERMANY signature.asc Description: PGP signature
Re: jigdo-file: Does not report package rejections because checksum mismatch
Hi, i Cc: debian-cd with this follow-up to bug 884526 in the hope to get some review for the endeavor to detect damaged downloaded package files during a run of jigdo-lite. Some disputable aspects remain (plus a possible bug in current jigdo-lite, which will vanish by my proposal). I put them behind my changeset, which is based on Sid of today. How it works: Currently and in my proposal jigdo-lite asks jigdo-file for a list of files yet missing in the emerging image. It downloads them in groups of 10 and submits each group to jigdo-file for grafting them into the image. The list consists of text blocks of the form: URL of package Alternative URL of same package ... more alternative URLs ... MD5Sum:... One of the URLs from each block gets picked by a loop with variable "pass", but currently jigdo-file ignores the MD5Sum lines. My proposal collects the MD5s and submits them to function fetchAndMerge which currently only gets the 10 URLs. I had to re-arrange the entrails of the list reading loop in imageDownload. It currently ends when the 10th URL is found and thus would not have yet reached the 10th checksum line. My proposal uses the empty line as trigger to append the picked URL and the MD5 to the argument list. Tested with Jessie's jigdo-file and these three input lines: http://cdimage.debian.org/mirror/cdimage/archive/6.0.7/amd64/jigdo-cd/debian-6.0.7-amd64-businesscard.jigdo http://us.cdimage.debian.org/cdimage/snapshot/Debian/ --- /usr/bin/jigdo-lite.sid 2017-12-28 14:20:23.882643023 +0100 +++ /home/thomas/projekte/jigdo_dir/jigdo-lite.sid.with_md5_check 2017-12-28 15:12:37.738654856 +0100 @@ -75,10 +75,61 @@ fetch() { } #__ -# Given URLs, fetch them into $imageTmp, then merge them into image +# Simulated MD5 mismatch: +simulateMD5Mismatch="partman-reiserfs_50_all.udeb" +# simulateMD5Mismatch="NOT_A_PACKAGE_NAME" + +# Given URLs and MD5s, fetch them into $imageTmp and verify, +# then merge them into image fetchAndMerge() { + + # The other arguments are URLs in the same sequence as the words in md5List + md5List="$1" + shift 1 + if test "$#" -eq 0; then return 0; fi fetch --force-directories --directory-prefix="$imageTmp" -- "$@" + + # Try to verify downloaded files + for md5 in $md5List + do +url="$1" +shift 1 +test "$md5" = ".no.MD5.known." && continue + +# Simulated MD5 mismatch +if echo "$url" | grep '/'"$simulateMD5Mismatch"'$' >/dev/null 2>&1 +then + echo "ATTENTION : Faking a checksum mismatch with package $simulateMD5Mismatch" >&2 + md5="*INVALIDATED*CHECKSUM*" +fi + +localPath="$imageTmp"/`echo "$url" | \ +sed -e 's/^[hH][tT][tT][pP]:\/\///' \ +-e 's/^[hH][tT][tT][pP][sS]:\/\///' \ +-e 's/^[fF][tT][pP]:\/\///' \ +-e 's/^[fF][iI][lL][eE]:\/\///'` +if test -e "$localPath" +then + fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'` + if test "$md5" != "$fileMD5" + then +echo >&2 +echo "WARNING: Downloaded file does not match expected MD5:" >&2 +echo " $url" >&2 +echo " $localPath" >&2 +echo " expected: $md5 | $fileMD5 :downloaded" >&2 +echo >&2 + fi +else + echo >&2 + echo "WARNING: File not fot found after download:" >&2 + echo " $url" >&2 + echo " $localPath" >&2 + echo >&2 +fi + done + # Merge into the image $jigdoFile $jigdoOpts --no-cache make-image --image="$image" \ --jigdo="$jigdoF" --template="$template" "$imageTmp" @@ -596,31 +647,56 @@ imageDownload() { for pass in x xx xxx x xx xxx ; do $jigdoFile print-missing-all --image="$image" --jigdo="$jigdoF" \ --template="$template" $jigdoOpts $uriOpts \ - | egrep -i '^(http:|ftp:|$)' >"$list" + >"$list" + # This counts the empty lines missingCount=`egrep '^$' <"$list" | wc -l | sed -e 's/ *//g'` # Accumulate URLs in $@, pass them to fetchAndMerge in batches shift "$#" # Solaris /bin/sh doesn't understand "set --" count="" exec 3<"$list" + useUrl="." + md5=".no.MD5.known." + md5List="" while $readLine url <&3; do count="x$count" -if strEmpty "$url"; then count=""; continue; fi -if test "$count" != "$pass"; then continue; fi -if $noMorePasses; then - hrule - echo "$missingCount files not found in previous pass, trying" - echo "alternative download locations:" - echo -fi -noMorePasses=false -set -- "$@" "$url" -if test "$#" -ge "$filesPerFetch"; then - if fetchAndMerge "$@"; then true;