bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-18 Thread pelzflorian (Florian Pelz)
"pelzflorian (Florian Pelz)"  writes:
> * If I resume a crashed installer, I need to resume twice because the
>   first resume always fails immediately.

Hooray, you fixed it.  Ludo, your debugging speed is miraculous.  I did
not know SQLite uses multiple files per database.
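
As a small illustration of those multiple files, here is a Python sketch (not taken from Guix; the table name and the temporary path are made up) that puts a throwaway database into WAL mode and lists what appears next to it while a connection is open:

```python
import os
import sqlite3
import tempfile


def demo_wal_files():
    """Create a WAL-mode database; return the journal mode and the files on disk."""
    with tempfile.TemporaryDirectory() as directory:
        db_path = os.path.join(directory, "db.sqlite")
        conn = sqlite3.connect(db_path)
        try:
            # Switch to write-ahead logging; SQLite answers with the mode in effect.
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
            conn.execute("CREATE TABLE example (path TEXT)")
            conn.execute("INSERT INTO example VALUES ('/some/path')")
            conn.commit()
            # List the directory while the connection is still open.
            return mode, sorted(os.listdir(directory))
        finally:
            conn.close()
```

Calling `demo_wal_files()` should report the mode `wal` together with `db.sqlite`, `db.sqlite-shm` and `db.sqlite-wal`.  A clean close of the last connection normally checkpoints and removes the two companion files, so they tend to linger only after an unclean shutdown.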


> * With bad luck, it fails permanently, even on the second, third,
>   fourth, and fifth attempts.

It can still permanently fail to resume: for example, sometimes after
pressing Ctrl-C during the download of a substitute, it keeps saying
that nss-certs is an unknown package.  But that may be too rare to
happen by chance, and it is not what this bug was about.

Closing!

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-17 Thread Ludovic Courtès
After spending a few more hours on this, I became convinced that upon
restarting guix-daemon, even though we had restored
/var/guix/db/db.sqlite, the presence of stale db.sqlite-{wal,shm} files
could lead SQLite to behave as if transactions in the WAL file had been
committed.

Commit 495c50008be91429ebea3805e161a1e385a2a572 deletes these two
files, and it appears to solve the problem for me.
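
For illustration only, the idea behind that commit can be sketched in Python as follows.  The actual change is Scheme code in gnu/installer/final.scm; the function names and the temporary paths below are hypothetical, and the sketch assumes that no process has the database open while it runs:

```python
import os
import shutil
import tempfile


def restore_database(saved_database, database_file):
    """Copy the backup over the live database and drop stale WAL/shm files.

    Assumption: the caller has already stopped every process that uses
    database_file.
    """
    shutil.copyfile(saved_database, database_file)
    removed = []
    for suffix in ("-wal", "-shm"):
        stale = database_file + suffix
        if os.path.exists(stale):
            os.remove(stale)
            removed.append(os.path.basename(stale))
    return removed


def demo_restore():
    """Exercise restore_database on fake files in a temporary directory."""
    with tempfile.TemporaryDirectory() as directory:
        database = os.path.join(directory, "db.sqlite")
        saved = os.path.join(directory, "db.sqlite.saved")
        fake_files = [(saved, b"clean"), (database, b"dirty"),
                      (database + "-wal", b"stale"),
                      (database + "-shm", b"stale")]
        for path, content in fake_files:
            with open(path, "wb") as port:
                port.write(content)
        removed = restore_database(saved, database)
        with open(database, "rb") as port:
            return removed, port.read(), sorted(os.listdir(directory))
```

After `demo_restore()`, only the database and its saved copy remain in the directory, and the database holds the saved contents.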

I also pushed the patch previously shared in this thread, to make sure
db.sqlite is only copied when guix-daemon is stopped.

So we have this:

  495c50008b installer: final: Delete SQLite WAL and shm files upon completion.
  9b6703eabe installer: final: Stop guix-daemon before accessing store database.

I’ll go ahead and prepare for the release as planned, to be published on Monday.

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-17 Thread Ludovic Courtès
"pelzflorian (Florian Pelz)"  writes:

> I saw a comment
>> void LocalStore::registerValidPaths(const ValidPathInfos & infos)
>> {
>> /* SQLite will fsync by default, but the new valid paths may not be 
>> fsync-ed.
>>  * So some may want to fsync them before registering the validity, at the
>>  * expense of some speed of the path registering operation. */
>> if (settings.syncBeforeRegistering) sync();
>
> In vain, I therefore tried

Yeah, I don’t think this has much to do with syncing data on disk.  It’s
an inconsistency between the store database and the actual store.

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-17 Thread pelzflorian (Florian Pelz)
Ahoi. :)

Ludovic Courtès  writes:
>> Now that you found the dynamic-wind’s out-guard does not even run:
> It does not run on C-c, but it does run in other cases, typically if you
> just press Enter after reading the message that says “command failed,
> press Enter”.

Ahh.  Then would it be good if you at least pushed the partial fix about
replacing 'restart' with 'stop-service'?  I’m unsure now if it has an
effect on the likelihood that a second resume works again.  But maybe it
does.  And is closer to correct.


> I don’t see how to address the C-c issue so we’ll have to live with it.

Yes.  Thank you for all investigations!

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-17 Thread pelzflorian (Florian Pelz)
Ludovic Courtès  writes:
> The error message that’s haunting us:
>
>   opening file `/gnu/store/….drv': No such file or directory
>
> comes from guix-daemon.  It happens while the client is doing an
> ‘add-text-to-store’ RPC to add that .drv to the store.
> ‘LocalStore::addTextToStore’ supposedly creates the .drv file in
> /gnu/store and then reads it back (‘registerValidPath’ -> ‘addValidPath’
> -> ‘readDerivation’ -> ‘readFile’): this is where it gets ENOENT.
>
> It would suggest that the database is consistent, but that somehow
> writes don’t go through the overlay FS.

Most interesting.

I saw a comment
> void LocalStore::registerValidPaths(const ValidPathInfos & infos)
> {
> /* SQLite will fsync by default, but the new valid paths may not be 
> fsync-ed.
>  * So some may want to fsync them before registering the validity, at the
>  * expense of some speed of the path registering operation. */
> if (settings.syncBeforeRegistering) sync();

In vain, I therefore tried

diff --git a/nix/libstore/globals.cc b/nix/libstore/globals.cc
index d4f9a46a74..5f8a3a3031 100644
--- a/nix/libstore/globals.cc
+++ b/nix/libstore/globals.cc
@@ -40,7 +40,7 @@ Settings::Settings()
 reservedSize = 8 * 1024 * 1024;
 fsyncMetadata = true;
 useSQLiteWAL = true;
-syncBeforeRegistering = false;
+syncBeforeRegistering = true;
 useSubstitutes = true;
 useChroot = false;
 impersonateLinux26 = false;

But it changes nothing.

Regards,
Florian


bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-17 Thread Ludovic Courtès
Ludovic Courtès  writes:

> I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
> tty, or by killing the ‘guix substitute’ process, to cause failure of
> ‘guix system init’.  In that case the database is indeed restored, but I
> occasionally get errors like “/gnu/store/….drv: No such file or
> directory”.

The error message that’s haunting us:

  opening file `/gnu/store/….drv': No such file or directory

comes from guix-daemon.  It happens while the client is doing an
‘add-text-to-store’ RPC to add that .drv to the store.
‘LocalStore::addTextToStore’ supposedly creates the .drv file in
/gnu/store and then reads it back (‘registerValidPath’ -> ‘addValidPath’
-> ‘readDerivation’ -> ‘readFile’): this is where it gets ENOENT.

It would suggest that the database is consistent, but that somehow
writes don’t go through the overlay FS.

More investigation is needed, but we may have to live with this bug in
1.4.0.

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-17 Thread Ludovic Courtès
Moin!

"pelzflorian (Florian Pelz)"  writes:

> Ludovic Courtès  writes:
>> One finding: when hitting C-c, the dynamic-wind exit handler (the one
>> that restores the database and umounts the cow store) is *not* executed.
>
> Impressive findings.
>
> Now that you found the dynamic-wind’s out-guard does not even run:

It does not run on C-c, but it does run in other cases, typically if you
just press Enter after reading the message that says “command failed,
press Enter”.

I don’t see how to address the C-c issue so we’ll have to live with it.

Longer-term we may have to find a different strategy than the
‘call-with-mnt-container’ trick, but that’s difficult.

> After seeing that you opened a bug#60116 on setsid(), I tested removing
> the setsid call and it had no effect, but if the dynamic-wind’s
> out-guard does not even run, that is to be expected.

Right; #60116 is related, and it’s not great but it’s not critical.

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-16 Thread pelzflorian (Florian Pelz)
Ludovic Courtès  writes:
> One finding: when hitting C-c, the dynamic-wind exit handler (the one
> that restores the database and umounts the cow store) is *not* executed.

Impressive findings.

Now that you found the dynamic-wind’s out-guard does not even run: Uhh I
had misdiagnosed when I thought your 'stop-service' patch had made a
difference and caused a second resume to work.  Second resume was
already possible on rc2.  Except eventually resume stops working and on
some install attempts with rc2, resume stops working right away.

After seeing that you opened a bug#60116 on setsid(), I tested removing
the setsid call and it had no effect, but if the dynamic-wind’s
out-guard does not even run, that is to be expected.


> I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
> tty, or by killing the ‘guix substitute’ process, to cause failure of
> ‘guix system init’.  In that case the database is indeed restored, but I
> occasionally get errors like “/gnu/store/….drv: No such file or
> directory”.

Yes, this is the error message that I get on failing resumes.

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-16 Thread pelzflorian (Florian Pelz)
Maxime Devos  writes:
> So, I'm nominally 'on hiatus', but I noticed this mail, and noticed
> you copied a file (and fsync'ed it), but forgot to fsync the directory
> it was copied to -- from what I've read (but I don't recall the
> source), fsyncing the contents of the file isn't enough, you also need
> to fsync the directory such that the new file entry is in the
> directory after crashing.

Ohh indeed!  The Linux manpage on fsync confirms it.  That invalidates
my fsync testing.  Which was on a codepath that, as Ludo found out, did
not even run.  But I will remember to fsync the directory in the future.

Thank you very much Maxime!

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-16 Thread Ludovic Courtès
Hi,

"pelzflorian (Florian Pelz)"  writes:

> Desperately I tried also adding fsync, to no avail, both issues remain.
> Non-working patch attached.
>
> Maybe dynamic-wind is an inappropriate pattern here?
>
> If I interrupt installation using Ctrl-C (which I normally don’t,
> instead I unplug Ethernet), then I have to press Ctrl-C twice.  Maybe
> that could be related to why I need to resume twice?

One finding: when hitting C-c, the dynamic-wind exit handler (the one
that restores the database and umounts the cow store) is *not* executed.

This is because ‘call-with-mnt-container’ sets a SIGINT handler that
terminates that process with SIGKILL (I’m not entirely sure of the
rationale, but said process cannot handle signals in Scheme while it’s
in ‘waitpid’, called from ‘run-command’).
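
The effect can be reproduced outside Guile.  In this Python sketch (POSIX only; the child program and the marker file are invented for the demonstration), a `finally` block stands in for the dynamic-wind exit handler: it runs when the child fails on its own, but not when the child is terminated with SIGKILL:

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child program: its cleanup handler creates a marker file.  Depending on the
# mode, the child either fails by itself or blocks until it is killed.
CHILD = r"""
import sys, time
marker, mode = sys.argv[1], sys.argv[2]
try:
    print("ready", flush=True)
    if mode == "fail":
        raise SystemExit(1)
    time.sleep(60)
finally:
    open(marker, "w").close()
"""


def cleanup_ran(mode):
    """Run the child in the given mode; return True if its cleanup handler ran."""
    with tempfile.TemporaryDirectory() as directory:
        marker = os.path.join(directory, "cleanup-ran")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker, mode],
            stdout=subprocess.PIPE,
        )
        child.stdout.readline()  # wait until the child is inside the try block
        if mode == "kill":
            # SIGKILL cannot be caught, so no handler in the child gets a chance.
            child.send_signal(signal.SIGKILL)
        child.wait()
        child.stdout.close()
        return os.path.exists(marker)
```

Here `cleanup_ran("fail")` corresponds to the "command failed, press Enter" path, and `cleanup_ran("kill")` to the C-c path described above.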

I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
tty, or by killing the ‘guix substitute’ process, to cause failure of
‘guix system init’.  In that case the database is indeed restored, but I
occasionally get errors like “/gnu/store/….drv: No such file or
directory”.

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-16 Thread Maxime Devos



On 14-12-2022 22:47, pelzflorian (Florian Pelz) wrote:
> fsyncing the database had no effect.  (In addition to Ludo’s
> 'stop-service', I had done
>
> diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
> index ef487805f0..13deffef85 100644
> --- a/gnu/installer/final.scm
> +++ b/gnu/installer/final.scm
> @@ -217,8 +217,16 @@ (define (assert-exit x)
>    ;; alive.
>    (stop-service 'guix-daemon)
>
> - ;; Restore the database and restart it.
> + ;; Restore the database.
>    (copy-file saved-database database-file)
> +
> + ;; Sync it to the filesystem.
> + (let* ((flags O_RDONLY)
> +(fd (open database-file flags)))
> +   (fsync fd)
> +   (close fd))
> +


So, I'm nominally 'on hiatus', but I noticed this mail, and noticed you 
copied a file (and fsync'ed it), but forgot to fsync the directory it 
was copied to -- from what I've read (but I don't recall the source), 
fsyncing the contents of the file isn't enough, you also need to fsync 
the directory such that the new file entry is in the directory after 
crashing.
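
A minimal Python sketch of that advice (the helper names are hypothetical, and this was not part of any patch in this thread): fsync the copied file, then fsync the directory that contains it.  It assumes a POSIX system such as Linux, where a directory can be opened read-only and passed to fsync:

```python
import os
import shutil
import tempfile


def fsync_path(path):
    """Open path read-only (file or directory) and flush it to stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def copy_durably(source, destination):
    """Copy source to destination, then sync the file and its directory entry."""
    shutil.copyfile(source, destination)
    fsync_path(destination)                                    # file contents
    fsync_path(os.path.dirname(os.path.abspath(destination)))  # directory entry


def demo_copy_durably():
    """Copy a small file inside a temporary directory and read the copy back."""
    with tempfile.TemporaryDirectory() as directory:
        source = os.path.join(directory, "db.sqlite.saved")
        destination = os.path.join(directory, "db.sqlite")
        with open(source, "wb") as port:
            port.write(b"saved database")
        copy_durably(source, destination)
        with open(destination, "rb") as port:
            return port.read()
```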


Greetings,
Maxime.




bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-15 Thread pelzflorian (Florian Pelz)
Desperately I tried also adding fsync, to no avail, both issues remain.
Non-working patch attached.

Maybe dynamic-wind is an inappropriate pattern here?

If I interrupt installation using Ctrl-C (which I normally don’t,
instead I unplug Ethernet), then I have to press Ctrl-C twice.  Maybe
that could be related to why I need to resume twice?

I’m in the dark.

Regards,
Florian

does not help

diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 5f720f6641..f5935a29c9 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -201,6 +201,12 @@ (define (assert-exit x)
  (stop-service 'guix-daemon)
  (copy-file database-file saved-database)
 
+ ;; Sync it to the filesystem.
+ (let* ((flags O_RDONLY)
+(fd (open saved-database flags)))
+   (fsync fd)
+   (close fd))
+
  (mount-cow-store (%installer-target-dir) backing-directory))
(lambda ()
  ;; We need to drag the guix-daemon to the container MNT
@@ -218,8 +224,16 @@ (define (assert-exit x)
  ;; alive.
  (stop-service 'guix-daemon)
 
- ;; Restore the database and restart it.
+ ;; Restore the database.
  (copy-file saved-database database-file)
+
+ ;; Sync it to the filesystem.
+ (let* ((flags O_RDONLY)
+(fd (open database-file flags)))
+   (fsync fd)
+   (close fd))
+
+ ;; And restart guix-daemon.
  (start-service 'guix-daemon)
 
  ;; Finally umount the cow-store and exit the container.


bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-15 Thread pelzflorian (Florian Pelz)
Hi Ludo…

Ludovic Courtès  writes:
> This time, I believe we only ever copy the database when we’re sure no
> guix-daemon process is accessing it.

Failure.  In addition to your partially helpful patch from before
(with which a second resume now works most of the time), I now tried
further the new change:

diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..360b34d8cb 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -196,14 +196,15 @@ (define (assert-exit x)
  ;; the loaded cow-store locale files will prevent umounting.
  (install-locale locale)

- ;; Save the database, so that it can be restored once the
- ;; cow-store is umounted.
+ ;; Stop the daemon and save the database, so that it can be
+ ;; restored once the cow-store is umounted.
+ (stop-service 'guix-daemon)
  (copy-file database-file saved-database)
+
  (mount-cow-store (%installer-target-dir) backing-directory))
(lambda ()
  ;; We need to drag the guix-daemon to the container MNT
  ;; namespace, so that it can operate on the cow-store.
- (stop-service 'guix-daemon)
  (start-service 'guix-daemon (list (number->string (getpid))))

  (setvbuf (current-output-port) 'none)


No additional effect. :(  Perhaps at that time, the guix-daemon isn’t
doing anything anyway (though the addition makes sense in general and
may help some users).  There are the same two problems: needing to
resume twice each time, and eventually not being able to resume at all
(perhaps some multi-core issue?).  I sent installer-dump-89be04d5.

I tried interrupting the Ethernet on the same machine but with an
installed 1.4.0rc2 Guix System during `guix system reconfigure`.
This has no issues…  There must be corruption in the installer.

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-14 Thread Ludovic Courtès
Grrr, I’m really silly: we have the same problem (copying the database
before the daemon has been stopped) just a few lines above.

How about this:

diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..360b34d8cb 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -1,6 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2018, 2020 Mathieu Othacehe 
-;;; Copyright © 2019, 2020 Ludovic Courtès 
+;;; Copyright © 2019, 2020, 2022 Ludovic Courtès 
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -196,14 +196,15 @@ (define (assert-exit x)
  ;; the loaded cow-store locale files will prevent umounting.
  (install-locale locale)
 
- ;; Save the database, so that it can be restored once the
- ;; cow-store is umounted.
+ ;; Stop the daemon and save the database, so that it can be
+ ;; restored once the cow-store is umounted.
+ (stop-service 'guix-daemon)
  (copy-file database-file saved-database)
+
  (mount-cow-store (%installer-target-dir) backing-directory))
(lambda ()
  ;; We need to drag the guix-daemon to the container MNT
  ;; namespace, so that it can operate on the cow-store.
- (stop-service 'guix-daemon)
  (start-service 'guix-daemon (list (number->string (getpid))))
 
  (setvbuf (current-output-port) 'none)
@@ -213,10 +214,13 @@ (define (assert-exit x)
 
  (set! ret (run-command install-command #:tty? #t)))
(lambda ()
- ;; Restart guix-daemon so that it does no keep the MNT namespace
+ ;; Stop guix-daemon so that it does no keep the MNT namespace
  ;; alive.
- (restart-service 'guix-daemon)
+ (stop-service 'guix-daemon)
+
+ ;; Restore the database and restart it.
  (copy-file saved-database database-file)
+ (start-service 'guix-daemon)
 
  ;; Finally umount the cow-store and exit the container.
  (unmount-cow-store (%installer-target-dir) backing-directory)

?

This time, I believe we only ever copy the database when we’re sure no
guix-daemon process is accessing it.
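
In Python terms, the ordering this patch aims for looks roughly like the sketch below.  It is a simplification: the cow-store mount and the move of the daemon into the container's MNT namespace are left out, and `stop_daemon` and `start_daemon` are hypothetical callables standing in for the Shepherd service operations:

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def preserved_database(database_file, saved_database, stop_daemon, start_daemon):
    """Save the database on entry and restore it on exit.

    Both copies happen only between a stop_daemon() and a start_daemon() call.
    """
    stop_daemon()
    shutil.copyfile(database_file, saved_database)
    start_daemon()
    try:
        yield
    finally:
        stop_daemon()
        shutil.copyfile(saved_database, database_file)
        start_daemon()


def demo_preserved_database():
    """Simulate a failed install and return the daemon events and final contents."""
    events = []
    with tempfile.TemporaryDirectory() as directory:
        database = os.path.join(directory, "db.sqlite")
        saved = os.path.join(directory, "db.sqlite.saved")
        with open(database, "w") as port:
            port.write("before install")
        try:
            with preserved_database(database, saved,
                                    lambda: events.append("stop"),
                                    lambda: events.append("start")):
                with open(database, "w") as port:
                    port.write("half-finished install")
                raise RuntimeError("install command failed")
        except RuntimeError:
            pass
        with open(database) as port:
            return events, port.read()
```

In the demonstration the install step fails half-way, yet the database ends up with its pre-install contents and the daemon is stopped around each copy.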

Ludo’.


bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-14 Thread pelzflorian (Florian Pelz)
"pelzflorian (Florian Pelz)"  writes:
> I shall try with fsync now.

fsyncing the database had no effect.  (In addition to Ludo’s
'stop-service', I had done

diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index ef487805f0..13deffef85 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -217,8 +217,16 @@ (define (assert-exit x)
  ;; alive.
  (stop-service 'guix-daemon)
 
- ;; Restore the database and restart it.
+ ;; Restore the database.
  (copy-file saved-database database-file)
+
+ ;; Sync it to the filesystem.
+ (let* ((flags O_RDONLY)
+(fd (open database-file flags)))
+   (fsync fd)
+   (close fd))
+
+ ;; And restart guix-daemon.
  (start-service 'guix-daemon)
 
  ;; Finally umount the cow-store and exit the container.


The same two problems:

* If I resume a crashed installer, I need to resume twice because the
  first resume always fails immediately.

* With bad luck, it fails permanently, even on the second, third,
  fourth, and fifth attempts.

This is the same as without the fsync.  Fsync had no effect.  Still I
uploaded installer-dump-194618fa.

Regards,
Florian


bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-14 Thread pelzflorian (Florian Pelz)
Eventual success, partially.

First of all:

Ludovic Courtès  writes:
> "pelzflorian (Florian Pelz)"  skribis:
>> Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
>> make update-guix-package”.  Or else the installer was using a Guix that
>> did not have the lines swapped.
> Hmm this is surprising because we’re already using (current-guix) in
> (gnu installer).

Apparently no.  If I commit only those two diffs from your mail, with
`./pre-inst-env guix system image -t iso9660 --label=Guix
gnu/system/install.scm`, then

guix gc --references /gnu/store/*-installer-real

prints a Guix package that does not contain any of the changes to
gnu/installer/final.scm.

Nonetheless I used it and ran the installer with surprising failures
that make me doubt the health of my USB drive: `guix system
init --fallback` did not download substitutes but said the ACL seems to
be uninitialized and fell back to downloading/building the tar.xz
sources.  I pulled the Ethernet plug, resumed the installer to run
`guix system init` again, but this now complains that nss-certs is an
unknown package.  Sending a dump crashed the installer.  On TTY3, `ls
/tmp` tells me '-bash: ls: command not found'.

Another USB drive, another try, the installer again says there's no
ACL and downloads tar.xz, but otherwise behaves as rc2 and sometimes
bugs out when pulling Ethernet; final.scm does not contain the patch.

Is that second diff of yours perhaps really about ACLs?

I do the authorization dance, commit the diff about 'stop-service' and
the update-guix-package, then pull --branch=version-1.4.0.  I can now
resume happily, when pulling the Ethernet and even when pressing
Ctrl-C just for fun.

Except it is necessary to resume twice.  The first resume always fails
and the second resume resumes.  Does it confuse the two databases?

Except after a large number of resumes, not even the second resume
resumes anymore.  I sent an installer-dump-c82c7abf.

I shall try with fsync now.

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-13 Thread Ludovic Courtès
"pelzflorian (Florian Pelz)"  writes:

> Ludovic Courtès  writes:
>> So my guess is that things will be much better if we swap these two
>> lines.
>
> This was helpful, but not enough.

Sorry, I think I wasn’t thinking at full speed.  There needs to be zero
daemons running while we copy the database.  So the real fix is more
like this:

diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..9a6bbad122 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -213,10 +213,13 @@ (define (assert-exit x)
 
  (set! ret (run-command install-command #:tty? #t)))
(lambda ()
- ;; Restart guix-daemon so that it does no keep the MNT namespace
+ ;; Stop guix-daemon so that it does no keep the MNT namespace
  ;; alive.
- (restart-service 'guix-daemon)
+ (stop-service 'guix-daemon)
+
+ ;; Restore the database and restart it.
  (copy-file saved-database database-file)
+ (start-service 'guix-daemon)
 
  ;; Finally umount the cow-store and exit the container.
  (unmount-cow-store (%installer-target-dir) backing-directory)

>>   ./pre-inst-env guix system image -t iso9660 --label=Guix \
>> gnu/system/install.scm
>
> Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
> make update-guix-package”.  Or else the installer was using a Guix that
> did not have the lines swapped.

Hmm this is surprising because we’re already using (current-guix) in
(gnu installer).

> Also, beforehand I did the GPG authorization dance (my x86 machine isn’t
> worth getting my actual committer GPG keys, so I make sure its dummy GPG
> key is in the keyring branch and in the .guix-authorizations file, and
> that guix/channels.scm’s default guix channel points to the URL
> /home/florian/src/guix and to the commit with the new authorization).
> Then I guix pulled, so that building the installer succeeds.  I did
> *not* use ./pre-inst-env.

Ah yes, apologies.  You should be able to disable authentication with
this:

diff --git a/gnu/packages/package-management.scm b/gnu/packages/package-management.scm
index 5a09b1fcf8..374b187d8c 100644
--- a/gnu/packages/package-management.scm
+++ b/gnu/packages/package-management.scm
@@ -625,6 +625,7 @@ (define-public current-guix-package
(inherit guix)
(source source)
(build-system channel-build-system)
+   (arguments '(#:authenticate? #f))
(inputs '())
(native-inputs '())
(propagated-inputs '(

Thanks a lot for patiently testing, this is very helpful!

Ludo’.


bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-13 Thread pelzflorian (Florian Pelz)
Hi again.

Ludovic Courtès  writes:
> So my guess is that things will be much better if we swap these two
> lines.

This was helpful, but not enough.

Swapping them may have improved the likelihood of being able to retry,
but the issue is still there.  I uploaded as installer-dump-5f9f8dbe,
but it is pretty much the same as the previous dump.

Tomorrow, I will try to add an fsync call in between the two lines.

>   ./pre-inst-env guix system image -t iso9660 --label=Guix \
> gnu/system/install.scm

Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
make update-guix-package”.  Or else the installer was using a Guix that
did not have the lines swapped.

Also, beforehand I did the GPG authorization dance (my x86 machine isn’t
worth getting my actual committer GPG keys, so I make sure its dummy GPG
key is in the keyring branch and in the .guix-authorizations file, and
that guix/channels.scm’s default guix channel points to the URL
/home/florian/src/guix and to the commit with the new authorization).
Then I guix pulled, so that building the installer succeeds.  I did
*not* use ./pre-inst-env.

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-13 Thread Ludovic Courtès
Hi again,

Ludovic Courtès  writes:

> It looks like the store is in a broken state, with its database not
> matching its actual contents.  The ‘install-system’ procedure is
> supposed to protect against that by making a backup of the database
> before starting the installation and restoring it afterwards.  (It
> apparently worked for me when I interrupted ‘guix system init’ by
> hitting C-c.)

Actually, look at the excerpt from final.scm:

 ;; Restart guix-daemon so that it does no keep the MNT namespace
 ;; alive.
 (restart-service 'guix-daemon)
 (copy-file saved-database database-file)

We’re restarting the daemon *before* we have restored the database,
which is wrong: depending on how lucky you are, guix-daemon might load
the old database (all this depends on what exactly happens when sqlite
opens the database, but I think there’s a possibility that it will load
or cache a few things and thus fail to see the changes ‘copy-file’
introduces.)

So my guess is that things will be much better if we swap these two
lines.

Florian, it would be great if you could try that and run a new image
generated from ‘version-1.4.0’ with these two lines swapped.  To
produce the image, run:

  ./pre-inst-env guix system image -t iso9660 --label=Guix \
gnu/system/install.scm

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-13 Thread Ludovic Courtès
Hi,

"pelzflorian (Florian Pelz)"  writes:

> I now uploaded an installer-dump-bade9971 of me reproducing the issue.

Here’s the relevant syslog excerpt (this was with 1.4.0rc1) where we can
see the point where you unplugged the Ethernet connection:

--8<---cut here---start->8---
Dec 10 09:07:29 localhost installer[399]: running command ("guix" "system" 
"init" "--fallback" "/mnt/etc/config.scm" "/mnt") 
Dec 10 09:07:48 localhost installer[399]: ^[[1m10.3 MB will be downloaded^M 
Dec 10 09:07:49 localhost installer[399]: ^[[0m^M^[[K^M^[[K utf8proc-2.5.0  
52KiB716KiB/s 00:00 [##] 100.0%^M^[[K 
utf8proc-2.5.0  52KiB594KiB/s 00:00 [##] 
100.0%^M 

[...]

Dec 10 09:08:48 localhost installer[399]: ^[[0m^M^[[Kretrying download of 
'/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20' with other 
substitute URLs...^M 
Dec 10 09:08:48 localhost installer[399]: guix substitute: warning: 
bordeaux.guix.gnu.org: host not found: Name or service not known^M 
Dec 10 09:08:48 localhost installer[399]: guix substitute: error: failed to 
find alternative substitute for 
'/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20'^M 
Dec 10 09:08:48 localhost installer[399]: ^[[31;1msubstitution of 
/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20 failed^[[0m^M 
Dec 10 09:08:49 localhost installer[399]: ^M^[[K^M^[[Kretrying download of 
'/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash' with other substitute 
URLs...^M 
Dec 10 09:08:49 localhost installer[399]: guix substitute: warning: 
bordeaux.guix.gnu.org: host not found: Name or service not known^M 
Dec 10 09:08:49 localhost installer[399]: guix substitute: error: failed to 
find alternative substitute for 
'/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash'^M 
Dec 10 09:08:49 localhost installer[399]: ^[[31;1msubstitution of 
/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash failed^[[0m^M 
Dec 10 09:08:49 localhost installer[399]: guix system: ^[[1;31merror: 
^[[0mcorrupt input while restoring archive from #^M 
Dec 10 09:08:49 localhost installer[399]: command ("guix" "system" "init" 
"--fallback" "/mnt/etc/config.scm" "/mnt") exited with value 1 
Dec 10 09:08:58 localhost vmunix: [ 1220.571986] r8169 :02:00.0 enp2s0: 
Link is Up - 1Gbps/Full - flow control off

[...]

Dec 10 09:09:12 localhost shepherd[1]: Service guix-daemon has been stopped. 
Dec 10 09:09:12 localhost shepherd[1]: Service guix-daemon has been started. 
Dec 10 09:09:17 localhost installer[274]: unmounting "/mnt/" 
Dec 10 09:09:17 localhost vmunix: [ 1239.111442] EXT4-fs (sda3): unmounting 
filesystem.
Dec 10 09:09:19 localhost installer[274]: running form # 
("Installation menu") with 0 clients 
Dec 10 09:09:22 localhost installer[274]: running step 'final' 
Dec 10 09:09:22 localhost installer[274]: proceeding with final step 
Dec 10 09:09:23 localhost installer[274]: mounting "/dev/sda3" on "/mnt/" 
Dec 10 09:09:23 localhost vmunix: [ 1245.890840] EXT4-fs (sda3): mounted 
filesystem with ordered data mode. Quota mode: none.
Dec 10 09:09:23 localhost vmunix: [ 1245.893304] Adding 3905532k swap on 
/dev/sda2.  Priority:-2 extents:1 across:3905532k SSFS
Dec 10 09:09:23 localhost installer[274]: running form # 
("Configuration file") with 0 clients 
Dec 10 09:09:29 localhost installer[437]: install supported locale en_US.utf8. 
Dec 10 09:09:29 localhost shepherd[1]: Service guix-daemon has been stopped. 
Dec 10 09:09:29 localhost shepherd[1]: Service guix-daemon has been started. 
Dec 10 09:09:29 localhost installer[437]: running command ("guix" "system" 
"init" "--fallback" "/mnt/etc/config.scm" "/mnt") 
Dec 10 09:09:54 localhost installer[437]: ^[[1m60.8 MB will be downloaded^M 
Dec 10 09:09:54 localhost installer[437]: ^[[0mguix system: ^[[1;31merror: 
^[[0mopening file 
`/gnu/store/igxf1b1l2b19h7mx2s6r117270dbi6iq-guix-1.4.0rc1.drv': No such file 
or directory^M 
Dec 10 09:09:54 localhost installer[437]: command ("guix" "system" "init" 
"--fallback" "/mnt/etc/config.scm" "/mnt") exited with value 1 
Dec 10 09:10:21 localhost shepherd[1]: Service guix-daemon has been stopped. 
Dec 10 09:10:21 localhost shepherd[1]: Service guix-daemon has been started. 
Dec 10 09:10:21 localhost installer[274]: unmounting "/mnt/" 
Dec 10 09:10:21 localhost vmunix: [ 1303.398583] EXT4-fs (sda3): unmounting 
filesystem.
Dec 10 09:10:28 localhost installer[274]: crashing due to uncaught exception: 
%exception (#<>) 
--8<---cut here---end--->8---

It looks like the store is in a broken state, with its database not
matching its actual contents.  The ‘install-system’ procedure is
supposed to protect against that by making a backup of the database
before starting the installation and restoring it afterwards.  (It
apparently worked for me when I interrupted ‘guix system init’ by
hitting C-c.)

I wonder how that failed here.  Mathieu, ideas?

Thanks,
Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-12 Thread pelzflorian (Florian Pelz)
"pelzflorian (Florian Pelz)"  writes:
> shepherd: Service guix-daemon has been stopped.
> shepherd: Service guix-daemon has been started.
> guix system: error: opening file
> `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such
> file or directory
> Command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt")
> exited with value 1

Still happens with 1.4.0rc2.  I guess install-system in
gnu/installer/final.scm does not sync the disk on failure?

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-10 Thread pelzflorian (Florian Pelz)
Ludovic Courtès  writes:
> I tried to reproduce it:
>
>   0. I chose a basic installation to a fully-encrypted disk with a
>  single partition.
>
>   1. I hit Ctrl-C while ‘guix system init’ was downloading substitutes.
>
>   2. That led me to a confusing error screen that says “Command
>      cryptsetup failed” with Ignore/Abort/Retry buttons.  Shouldn’t this
>      have been “Command guix system init failed”?
>
>   3. I resumed starting with the “Configuration File” step, and there
>  ‘guix system init’ ran to completion just fine.

Yes, these were the steps, except I did not do encryption.  But I had
not told the whole story …  Sorry!

So what was missing is that the reason I pressed Ctrl-C was a rare
dropout by my Ethernet controller.  Because it is so rare and has not
happened anymore since, as a substitute, for reproducing, I did as
follows:

 0. Use Ethernet for the installation.

 1. During substitute downloading, pull the Ethernet plug.

 2. Get lucky so the installation will crash with an error and not just
pause.  Otherwise, if no crash, repeat.

 3. Press Ctrl-C.

 4. Resume the installation from the last step.

 5. It will fail now.

I now uploaded an installer-dump-bade9971 of me reproducing the issue.

> Maybe the difference is that you hit Ctrl-C when ‘guix system init’ had
> already started copying stuff to /mnt?

No, like you, I was in the substitute downloading step.

This issue is much rarer than I thought.

Thank you for investigating.

Regards,
Florian





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-09 Thread Ludovic Courtès
Ludovic Courtès  writes:

>   2. That led me to a confusing error screen that says “Command
>      cryptsetup failed” with Ignore/Abort/Retry buttons.

Actually it’s “External command ("cryptsetup" "close" "cryptroot")
exited with code 5” and “cryptroot device is busy”.

Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-09 Thread Ludovic Courtès
Hi,

"pelzflorian (Florian Pelz)"  writes:

> I aborted graphical system installation (Ctrl-C), retried the
> installation and got this:
>
> shepherd: Service guix-daemon has been stopped.
> shepherd: Service guix-daemon has been started.
> guix system: error: opening file
> `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such file or
> directory
> Command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt")
> exited with value 1
> Press Enter to continue.
>
> (It told me to press Enter to continue.)  I did so; retried; but again
> it did not really retry the installation, I always get this same error
> message.

Related to that, I found this old bug:

  https://issues.guix.gnu.org/35543

I tried to reproduce it:

  0. I chose a basic installation to a fully-encrypted disk with a
 single partition.

  1. I hit Ctrl-C while ‘guix system init’ was downloading substitutes.

  2. That led me to a confusing error screen that says “Command
     cryptsetup failed” with Ignore/Abort/Retry buttons.  Shouldn’t this
     have been “Command guix system init failed”?

  3. I resumed starting with the “Configuration File” step, and there
 ‘guix system init’ ran to completion just fine.

Maybe the difference is that you hit Ctrl-C when ‘guix system init’ had
already started copying stuff to /mnt?

Thanks,
Ludo’.





bug#59784: [version 1.4.0rc1] Retrying a failed install fails

2022-12-02 Thread pelzflorian (Florian Pelz)
I aborted graphical system installation (Ctrl-C), retried the
installation and got this:

shepherd: Service guix-daemon has been stopped.
shepherd: Service guix-daemon has been started.
guix system: error: opening file
`/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such file or
directory
Command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt")
exited with value 1
Press Enter to continue.

(It told me to press Enter to continue.)  I did so; retried; but again
it did not really retry the installation, I always get this same error
message.

Sorry in case this is a duplicate bug.

Regards,
Florian