Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-29 Thread Thomas Schmitt
Hi,

i wrote:
> > us.cdimage.debian.org is a quick one.

Nicholas Geovanis wrote:
> From which location? Germany?

Ja.


How about this final message in case that files are missing and that
mismatching downloads were detected ?
(The mismatches shown are fake and recorded twice to get more than 2.)

== begin of example
-
Aaargh - 1 files remain missing. This should not happen!
4 download attempts yielded files with mismatching MD5 checksums:

   
http://archive.debian.org/debian/pool/main/p/partman-reiserfs/partman-reiserfs_50_all.udeb
   
http://archive.debian.org/debian/pool/main/p/partman-reiserfs/partman-reiserfs_50_all.udeb
   
http://us.cdimage.debian.org/cdimage/snapshot/Debian/pool/main/p/partman-reiserfs/partman-reiserfs_50_all.udeb
   ... the WARNING messages of this run report more ...

After a retry with the same mirror consider to search the web
for the names of remaining mismatching packages. As mirror name
use the found URL up to the "/" before the directory name "pool".

Press Return to retry downloading the missing files.
Press Ctrl-C to abort. (If you re-run jigdo-lite later, it will
resume from here, the downloaded data is not lost if you press
Ctrl-C now.)
:
== end of example

Is the advise understandable ?
I propose one repetition to remove any warnings from the user defined mirror
which were resolved by the fallback mirror. In the second run only messages
about the problematic files should emerge, hopefully making it easier to
spot the checksum warnings.

The list of URLs must be restricted to 3 and the usual messages must be
omitted in order to keep the text below 24 lines if the URLs are longer
than two lines.

The usual message appears if no mismatches were detected but still files
are missing.

I changed the old message
  Aaargh - 1 files could not be downloaded. This should not
  happen! ...
in both cases to
  Aaargh - 1 files remain missing. This should not happen!
because it matches better the spectrum of potential problem causes.


--- /usr/bin/jigdo-lite.sid 2017-12-28 14:20:23.882643023 +0100
+++ /home/thomas/projekte/jigdo_dir/jigdo-lite.sid.with_md5_check   
2017-12-29 20:09:34.055048360 +0100
@@ -29,6 +29,13 @@ else
   windows=false
   nl='\n'
 fi
+
+# Counter of MD5 mismatches after download
+fileMismatchCount=0
+# Short list of URLs which yielded mismatch
+fileMismatchList=""
+# Very few mismatching URLs shall be shown at the end. They can be long.
+fileMismatchMaxRec=3
 #__
 
 # read with readline, only if running bash >=2.03 (-e gives error on POSIX)
@@ -75,10 +82,84 @@ fetch() {
 }
 #__
 
-# Given URLs, fetch them into $imageTmp, then merge them into image
+# DEVELOPMENT TEST:
+# Set to a package name of the ISO to get simulated MD5 mismatch
+# simulateMD5Mismatch="partman-reiserfs_50_all.udeb"
+simulateMD5Mismatch="NOT_A_PACKAGE_NAME"
+
+# Given URLs and MD5s, fetch them into $imageTmp and verify,
+# then merge them into image
 fetchAndMerge() {
+
+  # The other arguments are URLs in the same sequence as the words in md5List
+  md5List="$1"
+  shift 1
+
   if test "$#" -eq 0; then return 0; fi
   fetch --force-directories --directory-prefix="$imageTmp" -- "$@"
+
+  # Try to verify downloaded files
+  for md5 in $md5List
+  do
+url="$1"
+shift 1
+test "$md5" = ".no.MD5.known." && continue
+
+# Simulated MD5 mismatch
+if echo "$url" | grep '/'"$simulateMD5Mismatch"'$' >/dev/null 2>&1
+then
+  echo "DEVELOPMENT TEST: Faking a checksum mismatch with package 
$simulateMD5Mismatch" >&2
+  md5="*INVALIDATED*CHECKSUM*"
+fi
+
+# Alternative proposals by Philip Hands:
+# localPath="$imageTmp"/`echo "$url" | sed -e 
's,^\(https\?\|ftp\|file\)://,,i'`
+# localPath="$imageTmp/${url#[[:alpha:]]*://}"
+
+localPath="$imageTmp"/`echo "$url" | \
+sed -e 's/^[hH][tT][tT][pP]:\/\///' \
+-e 's/^[hH][tT][tT][pP][sS]:\/\///' \
+-e 's/^[fF][tT][pP]:\/\///' \
+-e 's/^[fF][iI][lL][eE]:\/\///'`
+
+if test ! -e "$localPath"
+then
+  # Maybe the file was downloaded but above guess was wrong
+  baseName=`basename "$url"`
+  localPath=`find "$imageTmp" -name "$baseName" | head -1`
+fi
+
+if test -n "$localPath" -a -e "$localPath"
+then
+  fileMD5=`$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'`
+  if test "$md5" != "$fileMD5"
+  then
+echo >&2
+echo "WARNING: Downloaded file does not match expected MD5:" >&2
+echo " $url" >&2
+echo " 

Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-29 Thread Nicholas Geovanis
On Fri, Dec 29, 2017 at 5:35 AM, Thomas Schmitt  wrote:
> I wrote:
>> > > >   sed -e 's/^[hH][tT][tT][pP]:\/\///' \
>> > > >... more -e for https, ftp, and file ...
>
> Philip Hands wrote:
>> > >   sed -e 's,^\(https\?\|ftp\|file\)://,,i'
>> > > ...
>> > >  "$imageTmp/${url#[[:alpha:]]*://}"
>
>> > Are these widely portable enough ?
>
>> [Philip Hands analyzed portability with positive result]
>
> I still think that the long explicit sed is clearer.

FWIW I agree.

And with respect to the new awk dependency mentioned earlier, it's a simple
expression and awk tempts me with that all the time because it's so much easier.
Just you know, be advised that on say Debian 8.9 it's gawk while on Ubuntu 17.10
it's mawk, etc. Debian 9.2 is still gawk.

> 
> The effective throughput of roughly 1.5 to 2.5 MB/s is still much slower
> than wget's speed report of about 5.5 MB/s.
> I tried with 100 files per run of wget and "jigdo-file make-image".
> No significant difference to see. It's all about mirror server latency
> with each single file, i guess.
> us.cdimage.debian.org is a quick one.

>From which location? Germany?
I find it a bit slower than average from here inside the US, but I
have large, local
mirrors. And at long last, one of my own is budgeted in 2018. You betcha ;-)

> I wrote:
>> > > > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'`
>
> Philip Hands wrote:
>> `$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'`
>
> I'll take this one.

Oh, OK. Yeah, better.

> Next development step will be to issue a correct "Aaargh" message and to
> tell at least some of the mismatching files in that message.
> Have a nice day :)

Vielen Dank Thomas.

> Thomas
>



Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-29 Thread Thomas Schmitt
Hi,

I wrote:
> > > >   sed -e 's/^[hH][tT][tT][pP]:\/\///' \
> > > >... more -e for https, ftp, and file ...

Philip Hands wrote:
> > >   sed -e 's,^\(https\?\|ftp\|file\)://,,i'
> > > ...
> > >  "$imageTmp/${url#[[:alpha:]]*://}"

> > Are these widely portable enough ?

> [Philip Hands analyzed portability with positive result]

I still think that the long explicit sed is clearer. But in the end it will
be up to Steve to decide which one to use.
I tested both proposals of yours and have put them as comments into my
evolving changeset.

The importance of this expression has decreased by my decision to run "find"
if the guessed local path does not lead to an existing file:

localPath=... guessed from URL ...

if test ! -e "$localPath"
then
  # Maybe above guess was wrong
  baseName=`basename "$url"`
  localPath=`find "$imageTmp" -name "$baseName" | head -1`
fi

if test -n "$localPath" -a -e "$localPath"
then
   ... checksum verification ...

The use of "head" and "find" will be new in the script. But the increased
ruggedness makes it worthwhile in my opinion.

I made mini benchmarks with guessed names and found names. No significant
differences were to see.
(The tree is really small because fetchAndMerge() deletes it when the 10
 files are processed.)

The effective throughput of roughly 1.5 to 2.5 MB/s is still much slower
than wget's speed report of about 5.5 MB/s.
I tried with 100 files per run of wget and "jigdo-file make-image".
No significant difference to see. It's all about mirror server latency
with each single file, i guess.
us.cdimage.debian.org is a quick one.


I wrote:
> > > > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'`

Philip Hands wrote:
> `$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'`

I'll take this one.


Next development step will be to issue a correct "Aaargh" message and to
tell at least some of the mismatching files in that message.


Have a nice day :)

Thomas



Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-29 Thread Philip Hands
On Thu, 28 Dec 2017, "Thomas Schmitt"  wrote:
> Hi,
>
> first a correction of my proposal:
>
> The else-case with
>   echo "WARNING: File not found after download:" >&2
> is not good.
> It floods the log if one uses a mirror with few matching files.
> Not finding a file after the download is a normal situation.
>
> ---
>
> I wrote:
>> >sed -e 's/^[hH][tT][tT][pP]:\/\///' \
>> >-e 's/^[hH][tT][tT][pP][sS]:\/\///' \
>> >-e 's/^[fF][tT][pP]:\/\///' \
>> >-e 's/^[fF][iI][lL][eE]:\/\///'`
>
> Philip Hands wrote:
>>   sed -e 's,^\(https\?\|ftp\|file\)://,,i'
>> ...
>>   "$imageTmp/${url#[[:alpha:]]*://}"
>
> Are these widely portable enough ?

the , rather than / feature is already in use in the script (except that
its using s%%%).

\( \) is already in use, and AFAIK \| has been there for as long

\? _might_ be a bit later as a feature, in which case one could add
\|https, but then again isURI() doesn't match https: anyway

The i flag is a GNU extension, so is probably not that portable, so one
could go for \(http\|HTTP\|...

For the shell, I suspect that [[:alpha:]] is an innovation from the
90's, so one could play it safe (well, except that it might break with
odd codings) with [a-zA-Z]. posh doesn't seem to know about [:alpha:]
for instance.

posh does know about the ${ # } thing, but that wasn't in Solaris SVR4
shell AFAIK.

> Mine can be justified by S.R.Bourne's "The Unix System", i guess,
> and it is coordinated with function isURI.
>
> Well, my scruples are mainly about what wget guarantees to use as
> local disk path. I understand that jigdo-file would be quite tolerant
> as long as the file is somewhere in the "$imageTmp" tree.
> Maybe i should invest a "find" run in case of missing file. The tree is small.
>
>
> I wrote:
>> > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'`
>
>> The way it's done elsewhere in the script (which I happen to think is
>> pretty horrible, but that's what is already there) is using set, thus:
>>   set -- `$jigdoFile md5sum --report=quiet "$localPath"`
>
> Option "--report=quiet" instead of "2>/dev/null" is a good point.
>
> One would have to wrap the "set --" into a sub-shell, because fetchAndMerge
> already tampers with its own arguments.
> Like:
>   answer=`$jigdoFile md5sum --report=quiet "$localPath"`
>   fileMD5=`(set -- $answer ; echo "$1")`
> Now that's really ugly.

This seems preferable, and avoids new dependencies:

  `$jigdoFile md5sum --report=quiet "$localPath" | sed 's/ .*$//'`

> If direct objections emerge against "awk", i'd consider some helper
> function which echos "$1".
>
>
>> I also happen to think that using `` rather than $() is pretty horrible
>> in this day and age, but that's what's currently there throughout the
>
> Yep. Not to speak of the headless camelBack variable names.
>
> I strive to be minimally intrusive for the purpose and to be as
> conservative as in an autotools script.

Fair enough.

Cheers, Phil.
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,GERMANY


signature.asc
Description: PGP signature


Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-28 Thread Thomas Schmitt
Hi,

first a correction of my proposal:

The else-case with
  echo "WARNING: File not found after download:" >&2
is not good.
It floods the log if one uses a mirror with few matching files.
Not finding a file after the download is a normal situation.

---

I wrote:
> >sed -e 's/^[hH][tT][tT][pP]:\/\///' \
> >-e 's/^[hH][tT][tT][pP][sS]:\/\///' \
> >-e 's/^[fF][tT][pP]:\/\///' \
> >-e 's/^[fF][iI][lL][eE]:\/\///'`

Philip Hands wrote:
>   sed -e 's,^\(https\?\|ftp\|file\)://,,i'
> ...
>   "$imageTmp/${url#[[:alpha:]]*://}"

Are these widely portable enough ?
Mine can be justified by S.R.Bourne's "The Unix System", i guess,
and it is coordinated with function isURI.

Well, my scruples are mainly about what wget guarantees to use as
local disk path. I understand that jigdo-file would be quite tolerant
as long as the file is somewhere in the "$imageTmp" tree.
Maybe i should invest a "find" run in case of missing file. The tree is small.


I wrote:
> > fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'`

> The way it's done elsewhere in the script (which I happen to think is
> pretty horrible, but that's what is already there) is using set, thus:
>   set -- `$jigdoFile md5sum --report=quiet "$localPath"`

Option "--report=quiet" instead of "2>/dev/null" is a good point.

One would have to wrap the "set --" into a sub-shell, because fetchAndMerge
already tampers with its own arguments.
Like:
  answer=`$jigdoFile md5sum --report=quiet "$localPath"`
  fileMD5=`(set -- $answer ; echo "$1")`
Now that's really ugly.

If direct objections emerge against "awk", i'd consider some helper
function which echos "$1".


> I also happen to think that using `` rather than $() is pretty horrible
> in this day and age, but that's what's currently there throughout the

Yep. Not to speak of the headless camelBack variable names.

I strive to be minimally intrusive for the purpose and to be as
conservative as in an autotools script.


An alternative to changing the code would still be to tell the user with
the "Aaargh" text that repeated download and subsequent "Aaargh" could
indicate damaged files on the mirror. In this case the user shall search
the web for other mirrors which offer the repeatedly downloaded packages.

But that would be embarrassing for the involved programmers.
(Having script jigdo-lite instead of doing the job inside jigdo-file is
 also not overly glorious ...)


Have a nice day :)

Thomas



Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-28 Thread Philip Hands
On Thu, 28 Dec 2017, "Thomas Schmitt"  wrote:
...
> Especially in need of attention:
>
> - I could not find a clear description how "wget" determines the local
>   file paths from URLs and option --directory-prefix="$imageTmp".
>   My current conversion from URL to path is purely heuristic therefore:
>
> localPath="$imageTmp"/`echo "$url" | \
>sed -e 's/^[hH][tT][tT][pP]:\/\///' \
>-e 's/^[hH][tT][tT][pP][sS]:\/\///' \
>-e 's/^[fF][tT][pP]:\/\///' \
>-e 's/^[fF][iI][lL][eE]:\/\///'`

A rather less laboured way of getting the same effect with sed would be:

  sed -e 's,^\(https\?\|ftp\|file\)://,,i'

[ Things to note about that:
 s,,, in place of s/// means that no escaping of / is needed
 the 'i' flag at the end makes the match case insensitive
 s\? means match zero or one 's'
]

However, I doubt that it's important to worry about the potential for
unexpectedly removing a prefix of e.g. cdrom:// or ://, in which case
you could dispense with sed and instead do this:

  localpath="$imageTmp/${url#[[:alpha:]]*://}"

> - I introduced a dependency on "awk", which was not used in jigdo-lite
>   before. The task is to obtain the first word of jigdo-file's output:
>
> fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'`

The way it's done elsewhere in the script (which I happen to think is
pretty horrible, but that's what is already there) is using set, thus:

  set -- `$jigdoFile md5sum --report=quiet "$localPath"`

which leaves the value that you are after in $1.

I also happen to think that using `` rather than $() is pretty horrible
in this day and age, but that's what's currently there throughout the
script, so I guess one should stick with that, or fix it everywhere.

Cheers, Phil.
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,GERMANY


signature.asc
Description: PGP signature


Re: jigdo-file: Does not report package rejections because checksum mismatch

2017-12-28 Thread Thomas Schmitt
Hi,

i Cc: debian-cd with this follow-up to bug 884526 in the hope to get
some review for the endeavor to detect damaged downloaded package files
during a run of jigdo-lite.

Some disputable aspects remain (plus a possible bug in current jigdo-lite,
which will vanish by my proposal). I put them behind my changeset, which is
based on Sid of today.


How it works:

Currently and in my proposal jigdo-lite asks jigdo-file for a list of files
yet missing in the emerging image. It downloads them in groups of 10 and
submits each group to jigdo-file for grafting them into the image.
The list consists of text blocks of the form:
  URL of package
  Alternative URL of same package
  ... more alternative URLs ...
  MD5Sum:...
  
One of the URLs from each block gets picked by a loop with variable "pass",
but currently jigdo-file ignores the MD5Sum lines.

My proposal collects the MD5s and submits them to function fetchAndMerge
which currently only gets the 10 URLs.
I had to re-arrange the entrails of the list reading loop in imageDownload.
It currently ends when the 10th URL is found and thus would not have yet
reached the 10th checksum line. My proposal uses the empty line as trigger
to append the picked URL and the MD5 to the argument list.

Tested with Jessie's jigdo-file and these three input lines:

  
http://cdimage.debian.org/mirror/cdimage/archive/6.0.7/amd64/jigdo-cd/debian-6.0.7-amd64-businesscard.jigdo
  
  http://us.cdimage.debian.org/cdimage/snapshot/Debian/

--- /usr/bin/jigdo-lite.sid 2017-12-28 14:20:23.882643023 +0100
+++ /home/thomas/projekte/jigdo_dir/jigdo-lite.sid.with_md5_check   
2017-12-28 15:12:37.738654856 +0100
@@ -75,10 +75,61 @@ fetch() {
 }
 #__
 
-# Given URLs, fetch them into $imageTmp, then merge them into image
+# Simulated MD5 mismatch:
+simulateMD5Mismatch="partman-reiserfs_50_all.udeb"
+# simulateMD5Mismatch="NOT_A_PACKAGE_NAME"
+
+# Given URLs and MD5s, fetch them into $imageTmp and verify,
+# then merge them into image
 fetchAndMerge() {
+
+  # The other arguments are URLs in the same sequence as the words in md5List
+  md5List="$1"
+  shift 1
+
   if test "$#" -eq 0; then return 0; fi
   fetch --force-directories --directory-prefix="$imageTmp" -- "$@"
+
+  # Try to verify downloaded files
+  for md5 in $md5List
+  do
+url="$1"
+shift 1
+test "$md5" = ".no.MD5.known." && continue
+
+# Simulated MD5 mismatch
+if echo "$url" | grep '/'"$simulateMD5Mismatch"'$' >/dev/null 2>&1
+then
+  echo "ATTENTION : Faking a checksum mismatch with package 
$simulateMD5Mismatch" >&2
+  md5="*INVALIDATED*CHECKSUM*"
+fi
+
+localPath="$imageTmp"/`echo "$url" | \
+sed -e 's/^[hH][tT][tT][pP]:\/\///' \
+-e 's/^[hH][tT][tT][pP][sS]:\/\///' \
+-e 's/^[fF][tT][pP]:\/\///' \
+-e 's/^[fF][iI][lL][eE]:\/\///'`
+if test -e "$localPath"
+then
+  fileMD5=`$jigdoFile md5 "$localPath" 2>/dev/null | awk '{print $1}'`
+  if test "$md5" != "$fileMD5"
+  then
+echo >&2
+echo "WARNING: Downloaded file does not match expected MD5:" >&2
+echo " $url" >&2
+echo " $localPath" >&2
+echo " expected: $md5 | $fileMD5 :downloaded" >&2
+echo >&2
+  fi
+else
+  echo >&2
+  echo "WARNING: File not fot found after download:" >&2
+  echo " $url" >&2
+  echo " $localPath" >&2
+  echo >&2
+fi
+  done
+
   # Merge into the image
   $jigdoFile $jigdoOpts --no-cache make-image --image="$image" \
 --jigdo="$jigdoF" --template="$template" "$imageTmp"
@@ -596,31 +647,56 @@ imageDownload() {
 for pass in x xx xxx  x xx xxx ; do
   $jigdoFile print-missing-all --image="$image" --jigdo="$jigdoF" \
 --template="$template" $jigdoOpts $uriOpts \
-  | egrep -i '^(http:|ftp:|$)' >"$list"
+   >"$list"
+  # This counts the empty lines
   missingCount=`egrep '^$' <"$list" | wc -l | sed -e 's/ *//g'`
   # Accumulate URLs in $@, pass them to fetchAndMerge in batches
   shift "$#" # Solaris /bin/sh doesn't understand "set --"
   count=""
   exec 3<"$list"
+  useUrl="."
+  md5=".no.MD5.known."
+  md5List=""
   while $readLine url <&3; do
 count="x$count"
-if strEmpty "$url"; then count=""; continue; fi
-if test "$count" != "$pass"; then continue; fi
-if $noMorePasses; then
-  hrule
-  echo "$missingCount files not found in previous pass, trying"
-  echo "alternative download locations:"
-  echo
-fi
-noMorePasses=false
-set -- "$@" "$url"
-if test "$#" -ge "$filesPerFetch"; then
-  if fetchAndMerge "$@"; then true;