newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

on a diskless server, I keep the ports tree and pkgng databases on a newnfs 
NFSv4 mount. After a bunch of "portmaster -a" runs, the pkgng sqlite database 
appears to get corrupted. For example, when I try to update an existing port, 
this happens:

root@five:~ # portmaster ports-mgmt/pkg
...
===>   Registering installation for pkg-1.0.11
Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed (pkgdb.c:925)
pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
*** [fake-pkg] Error code 70

I have removed all ports and the pkgng databases and reinstalled, but the 
corruption seems to return after a few days or weeks of installing and 
deinstalling ports.

On another system that has a disk, that corruption of the pkgng database has 
not happened over six months or so. I therefore wonder if storing the sqlite 
database on an NFS-mount is triggering some sort of bug, either in pkgng or in 
newnfs. AFAIK, pkgng is using locks on the database quite liberally, could that 
be where a bug is lurking?

I'm happy to help debug this, but someone would need to let me know what to try.

Thanks,
Lars
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: newnfs pkgng database corruption?

2013-04-10 Thread Baptiste Daroussin
On Wed, Apr 10, 2013 at 07:42:30AM +, Eggert, Lars wrote:
> Hi,
> 
> on a diskless server, I keep the ports tree and pkgng databases on a newnfs 
> NFSv4 mount. After a bunch of "portmaster -a" runs, the pkgng sqlite database 
> appears to get corrupted. For example, when I try to update an existing port, 
> this happens:
> 
> root@five:~ # portmaster ports-mgmt/pkg
> ...
> ===>   Registering installation for pkg-1.0.11
> Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed 
> (pkgdb.c:925)
> pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
> *** [fake-pkg] Error code 70
> 
> I have removed all ports and the pkgng databases and reinstalled, but the 
> corruption seems to return after a few days or weeks of installing and 
> deinstalling ports.
> 
> On another system that has a disk, that corruption of the pkgng database has 
> not happened over six months or so. I therefore wonder if storing the sqlite 
> database on an NFS-mount is triggering some sort of bug, either in pkgng or 
> in newnfs. AFAIK, pkgng is using locks on the database quite liberally, could 
> that be where a bug is lurking?
> 
> I'm happy to help debug this, but someone would need to let me know what to 
> try.
> 

This can usually happen when a user does not have the nfs lock system started.

Are you sure that nfs lock is correctly started?

If that is the case, there is a bug in pkgng anyway: it should catch the
problem and refuse to operate in such a situation. I know sqlite provides a
mechanism that would let us catch this, but I'm not sure yet about using it.

regards,
Bapt


Re: newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

On Apr 10, 2013, at 10:02, Baptiste Daroussin  wrote:
> This can usually happen when a user does not have the nfs lock system started.
> Are you sure that nfs lock is correctly started?

with NFSv4, the locking system is integrated with the main protocol; it's no 
longer separate.

> If that is the case, there is a bug in pkgng anyway: it should catch the
> problem and refuse to operate in such a situation. I know sqlite provides a
> mechanism that would let us catch this, but I'm not sure yet about using it.

Not sure about that.

In case anyone wonders, the corruption is quite substantial:

[elars@stanley ~]$ sqlite3 local/db/local.sqlite 
SQLite version 3.7.14.1 2012-10-04 19:37:12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> PRAGMA integrity_check; 
*** in database main ***
On tree page 1238 cell 17: 2nd reference to page 1237
On tree page 1238 cell 17: Child page depth differs
On tree page 1238 cell 18: Child page depth differs
On tree page 1241 cell 6: Rowid 17518 out of order (max larger than parent max of 12550)
On tree page 1242 cell 3: Rowid 17566 out of order (max larger than parent max of 12557)
On tree page 1243 cell 6: Rowid 12558 out of order (min less than parent min of 17566)
On tree page 2867 cell 28: 2nd reference to page 1241
On tree page 2867 cell 28: Child page depth differs
On tree page 2867 cell 29: 2nd reference to page 1242
On tree page 2867 cell 30: Child page depth differs
On tree page 1417 cell 66: 2nd reference to page 1239
On tree page 1417 cell 66: Child page depth differs
On tree page 1417 cell 67: 2nd reference to page 1240
On tree page 1417 cell 68: Child page depth differs
rowid 62 missing from index sqlite_autoindex_packages_1
wrong # of entries in index sqlite_autoindex_packages_1
rowid 96 missing from index scripts_package_id
rowid 96 missing from index sqlite_autoindex_scripts_1
rowid 97 missing from index scripts_package_id
rowid 97 missing from index sqlite_autoindex_scripts_1
rowid 98 missing from index scripts_package_id
rowid 98 missing from index sqlite_autoindex_scripts_1
wrong # of entries in index scripts_package_id
wrong # of entries in index sqlite_autoindex_scripts_1
rowid 12509 missing from index sqlite_autoindex_files_1
rowid 12510 missing from index sqlite_autoindex_files_1
rowid 12511 missing from index sqlite_autoindex_files_1
rowid 12512 missing from index sqlite_autoindex_files_1
rowid 86 missing from index files_package_id
rowid 86 missing from index sqlite_autoindex_files_1
rowid 87 missing from index files_package_id
rowid 87 missing from index sqlite_autoindex_files_1
Error: database disk image is malformed
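
The same check can be scripted, for example to run it periodically against the database on the NFS mount. The following is only a minimal sketch using Python's standard sqlite3 module; the function names are made up for illustration and nothing here is part of pkgng.

```python
import sqlite3


def integrity_report(db_path, max_errors=100):
    """Return the messages from PRAGMA integrity_check as a list of strings.

    A healthy database yields exactly ["ok"]. Note that sqlite3.connect()
    creates an empty database if db_path does not exist yet.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            f"PRAGMA integrity_check({int(max_errors)})"
        ).fetchall()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        # Severe damage can make the pragma itself fail; report that too.
        return [f"error: {exc}"]
    finally:
        con.close()


def is_healthy(db_path):
    """True only if integrity_check reports the single row "ok"."""
    return integrity_report(db_path) == ["ok"]
```

Calling is_healthy("local/db/local.sqlite") on the file shown above would be expected to return False, with integrity_report() listing messages like the ones in the transcript.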

Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Baptiste Daroussin
On Wed, Apr 10, 2013 at 08:09:42AM +, Eggert, Lars wrote:
> Hi,
> 
> On Apr 10, 2013, at 10:02, Baptiste Daroussin  wrote:
> > This can usually happen when a user does not have the nfs lock system started.
> > Are you sure that nfs lock is correctly started?
> 
> with NFSv4, the locking system is integrated with the main protocol; it's no 
> longer separate.
> 
> > If that is the case, there is a bug in pkgng anyway: it should catch the
> > problem and refuse to operate in such a situation. I know sqlite provides a
> > mechanism that would let us catch this, but I'm not sure yet about using it.
> 
> Not sure about that.
> 
> In case anyone wonders, the corruption is quite substantial:
> 

I think I know why. Give me a couple of days to test a patch.

Unfortunately, your database can't be recovered. Apparently sqlite has some
problems with the locking system of NFSv4, and has some workarounds for them.

Will you be able to test it?

Just to warn you: firefox may have the same problem with unbundled sqlite.

regards,
Bapt


Re: newnfs pkgng database corruption?

2013-04-10 Thread Rick Macklem
Lars Eggert wrote:
> Hi,
> 
> on a diskless server, I keep the ports tree and pkgng databases on a
> newnfs NFSv4 mount. After a bunch of "portmaster -a" runs, the pkgng
> sqlite database appears to get corrupted. For example, when I try to
> update an existing port, this happens:
> 
> root@five:~ # portmaster ports-mgmt/pkg
> ...
> ===> Registering installation for pkg-1.0.11
> Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed
> (pkgdb.c:925)
> pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
> *** [fake-pkg] Error code 70
> 
Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server
replies with that when the file no longer exists.

File locking doesn't stop a file from being removed, as far as I know.

rick

> I have removed all ports and the pkgng databases and reinstalled, but
> the corruption seems to return after a few days or weeks of installing
> and deinstalling ports.
> 
> On another system that has a disk, that corruption of the pkgng
> database has not happened over six months or so. I therefore wonder if
> storing the sqlite database on an NFS-mount is triggering some sort of
> bug, either in pkgng or in newnfs. AFAIK, pkgng is using locks on the
> database quite liberally, could that be where a bug is lurking?
> 
> I'm happy to help debug this, but someone would need to let me know
> what to try.
> 
> Thanks,
> Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

On Apr 11, 2013, at 1:28, Rick Macklem  wrote:
> Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server
> replies with that when the file no longer exists.
> 
> File locking doesn't stop a file from being removed, as far as I know.

but the file is still there.

Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

On Apr 11, 2013, at 0:16, Baptiste Daroussin  wrote:
> Will you be able to test it?

yes. (But I will be traveling for the next two weeks and so the turnaround may 
be a bit longer than normal.)

Lars


Re: newnfs pkgng database corruption?

2013-04-11 Thread Baptiste Daroussin
On Thu, Apr 11, 2013 at 05:52:52AM +, Eggert, Lars wrote:
> Hi,
> 
> On Apr 11, 2013, at 0:16, Baptiste Daroussin  wrote:
> > Will you be able to test it?
> 
> yes. (But I will be traveling for the next two weeks and so the turnaround 
> may be a bit longer than normal.)
> 
> Lars

First, I think you can recover your database.

Can you try the following commands:

# mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
# echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell
# echo 'pragma user_version=12;' | pkg shell

This should give you a working database again, I hope :)
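
For readers who want to see what this dump-and-reload recovery amounts to, here is a rough sketch in Python with the standard sqlite3 module. It is not what "pkg shell" does internally. The function name is hypothetical, and it assumes the final command is meant to set SQLite's user_version pragma (the schema version number stored in the database header).

```python
import sqlite3


def dump_and_reload(damaged_path, fresh_path, user_version=None):
    """Replay a SQL dump of damaged_path into a new database at fresh_path.

    Returns a list of (statement, error message) pairs for statements that
    could not be replayed. An error raised while reading the damaged file
    itself is not caught here and propagates to the caller.
    """
    src = sqlite3.connect(damaged_path)
    # Autocommit mode: the BEGIN/COMMIT contained in the dump control the
    # transaction, as they would when piping a dump into a sqlite shell.
    dst = sqlite3.connect(fresh_path, isolation_level=None)
    failed = []
    try:
        for statement in src.iterdump():
            try:
                dst.execute(statement)
            except sqlite3.Error as exc:
                # Keep going, e.g. past duplicate rows hitting a UNIQUE index.
                failed.append((statement, str(exc)))
        if user_version is not None:
            dst.execute(f"PRAGMA user_version = {int(user_version)}")
    finally:
        src.close()
        dst.close()
    return failed
```

A call such as dump_and_reload("backup.sqlite", "local.sqlite", user_version=12) mirrors the three commands; checking the returned list shows which rows were lost on the way.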

I think the corruption you are seeing is due to the synchronous pragma. I need
to dig in that direction.

regards,
Bapt


Re: newnfs pkgng database corruption?

2013-04-11 Thread Eggert, Lars
Hi,

On Apr 11, 2013, at 10:30, Baptiste Daroussin  wrote:
> First, I think you can recover your database.

that would be great.

> Can you try the following command:
> 
> # mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
> # echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell

That step doesn't quite work:

[root@stanley /usr/home/elars/local/db]# echo '.dump' | pkg shell backup.sqlite | pkg shell
Error: near line 15927: column path is not unique
Error: near line 15928: column path is not unique
Error: near line 15929: column path is not unique
Error: near line 15930: column path is not unique
Error: near line 15931: column path is not unique
Error: near line 15932: column path is not unique
Error: near line 15933: column path is not unique
Error: near line 15934: column path is not unique
Error: near line 15935: column path is not unique
Error: near line 15936: column path is not unique
Error: near line 15937: column path is not unique

[root@stanley /usr/home/elars/local/db]# ll local.sqlite 
-rw-r--r--  1 root  wheel  0 Apr 11 10:42 local.sqlite

I can send you the database off-list, if you like.

> I think the corruption you get are due to the synchronous pragma. I need to
> dig in that direction.

Thanks for looking into this!

Lars


Re: newnfs pkgng database corruption?

2013-04-11 Thread Baptiste Daroussin
On Thu, Apr 11, 2013 at 08:44:01AM +, Eggert, Lars wrote:
> Hi,
> 
> On Apr 11, 2013, at 10:30, Baptiste Daroussin  wrote:
> > First, I think you can recover your database.
> 
> that would be great.
> 
> > Can you try the following command:
> > 
> > # mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
> > # echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell
> 
> That step doesn't quite work:
> 
> [root@stanley /usr/home/elars/local/db]# echo '.dump' | pkg shell 
> backup.sqlite | pkg shell
> Error: near line 15927: column path is not unique
> Error: near line 15928: column path is not unique
> Error: near line 15929: column path is not unique
> Error: near line 15930: column path is not unique
> Error: near line 15931: column path is not unique
> Error: near line 15932: column path is not unique
> Error: near line 15933: column path is not unique
> Error: near line 15934: column path is not unique
> Error: near line 15935: column path is not unique
> Error: near line 15936: column path is not unique
> Error: near line 15937: column path is not unique
> 
> [root@stanley /usr/home/elars/local/db]# ll local.sqlite 
> -rw-r--r--  1 root  wheel  0 Apr 11 10:42 local.sqlite
> 
> I can send you the database off-list, if you like.
> 
Yes please.

regards,
Bapt


Re: newnfs pkgng database corruption?

2013-04-11 Thread Rick Macklem
Lars Eggert wrote:
> Hi,
> 
> On Apr 11, 2013, at 1:28, Rick Macklem  wrote:
> > Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server
> > replies with that when the file no longer exists.
> >
> > File locking doesn't stop a file from being removed, as far as I
> > know.
> 
> but the file is still there.
> 
Well, I have no idea why an NFS server would reply errno 70 if the file
still exists, unless the client has somehow sent a bogus file handle
to the server. (I am not aware of any client bug that might do that. I
am almost suspicious that there might be a memory problem or something
that corrupts bits in the network layer. Do you have TSO enabled for your
network interface by any chance? If so, I'd try disabling that on the
network interface. Same goes for checksum offload.)

rick
ps: If you can capture packets between the client and server at the
time this error occurs, looking at them in wireshark might be
useful?

> Lars


Re: newnfs pkgng database corruption?

2013-04-12 Thread Eggert, Lars
Hi,

On Apr 12, 2013, at 1:10, Rick Macklem  wrote:
> Well, I have no idea why an NFS server would reply errno 70 if the file
> still exists, unless the client has somehow sent a bogus file handle
> to the server. (I am not aware of any client bug that might do that. I
> am almost suspicious that there might be a memory problem or something
> that corrupts bits in the network layer. Do you have TSO enabled for your
> network interface by any chance? If so, I'd try disabling that on the
> network interface. Same goes for checksum offload.)
> 
> rick
> ps: If you can capture packets between the client and server at the
>     time this error occurs, looking at them in wireshark might be
>     useful?

I will try all of those things.

But first, a question that someone who understands pkgng will be able to 
answer: Is this "fake-pkg" process even running on the NFS mount? The WRKDIR 
is /tmp, which is an mfs mount.

Lars


Re: newnfs pkgng database corruption?

2013-04-12 Thread Baptiste Daroussin
On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
> Hi,
> 
> On Apr 12, 2013, at 1:10, Rick Macklem  wrote:
> > Well, I have no idea why an NFS server would reply errno 70 if the file
> > still exists, unless the client has somehow sent a bogus file handle
> > to the server. (I am not aware of any client bug that might do that. I
> > am almost suspicious that there might be a memory problem or something
> > that corrupts bits in the network layer. Do you have TSO enabled for your
> > network interface by any chance? If so, I'd try disabling that on the
> > network interface. Same goes for checksum offload.)
> > 
> > rick
> > ps: If you can capture packets between the client and server at the
> >time this error occurs, looking at them in wireshark might be
> >useful?
> 
> I will try all of those things.
> 
> But first, a question that someone who understands pkgng will be able to 
> answer: Is this "fake-pkg" process even running on the NFS mount? The WRKDIR 
> is /tmp, which is an mfs mount.

fake-pkg is run in WRKDIR, but it calls pkgng, which will open
/var/db/pkg/local.sqlite, i.e. a file on the NFS mount.

The Error 70 is EX_SOFTWARE returned by pkgng.
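
The confusion is easy to reproduce: the number 70 means different things depending on whether it is read as a process exit status (sysexits.h) or as an errno value, and the errno numbering differs between operating systems. A small sketch in Python (the helper name is made up for illustration):

```python
import errno
import os


def describe_code_70():
    """Show how the number 70 reads as an exit status and as an errno."""
    return {
        # sysexits.h: EX_SOFTWARE is 70 (exposed as os.EX_SOFTWARE on Unix).
        "sysexits": "EX_SOFTWARE" if os.EX_SOFTWARE == 70 else None,
        # errno.h: platform dependent. FreeBSD defines ESTALE as 70, other
        # systems may assign 70 to a different error or to none at all.
        "errno": errno.errorcode.get(70),
    }
```

The "*** [fake-pkg] Error code 70" line printed by make reports the exit status of the failed command, so the sysexits reading is the one that applies here.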

Can you try the following patch:
http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c

Just add that file to /usr/ports/ports-mgmt/pkg/files/

If that works for you, that means the POSIX advisory locks are somehow failing
on NFSv4 files.

Given that this is already known to fail on NFSv3 (because people often
misconfigure it), I'll probably make unix-dotfile the default locking system
when local.sqlite is stored on a network filesystem.
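
For illustration only, this is roughly what selecting SQLite's "unix-dotfile" VFS looks like from a client program. It is a sketch in Python, not the contents of the patch above, and the function name is hypothetical. With this VFS, SQLite takes its lock by creating a "<database>.lock" entry next to the database file instead of relying on POSIX advisory locks.

```python
import sqlite3
from urllib.parse import quote


def open_with_dotfile_locking(db_path):
    """Open db_path through SQLite's "unix-dotfile" VFS (Unix builds only)."""
    # The VFS is chosen with the "vfs" query parameter of a file: URI.
    uri = f"file:{quote(db_path)}?vfs=unix-dotfile"
    return sqlite3.connect(uri, uri=True)
```

Dot-file locks are coarser than POSIX locks (readers exclude each other too), and every process that opens the database has to use the same locking style for the locks to mean anything.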

regards,
Bapt


Re: newnfs pkgng database corruption?

2013-04-12 Thread Rick Macklem
Baptiste Daroussin wrote:
> On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
> > Hi,
> >
> > On Apr 12, 2013, at 1:10, Rick Macklem  wrote:
> > > Well, I have no idea why an NFS server would reply errno 70 if the
> > > file
> > > still exists, unless the client has somehow sent a bogus file
> > > handle
> > > to the server. (I am not aware of any client bug that might do
> > > that. I
> > > am almost suspicious that there might be a memory problem or
> > > something
> > > that corrupts bits in the network layer. Do you have TSO enabled
> > > for your
> > > network interface by any chance? If so, I'd try disabling that on
> > > the
> > > network interface. Same goes for checksum offload.)
> > >
> > > rick
> > > ps: If you can capture packets between the client and server at
> > > the
> > >time this error occurs, looking at them in wireshark might be
> > >useful?
> >
> > I will try all of those things.
> >
You might still try the above suggestions, but since Error 70 wasn't an
errno.h error number, it isn't a stale fh problem and, as such, there
isn't any evidence that bits are getting messed with by the network layers.

rick

> > But first, a question that someone who understands pkgng will be
> > able to answer: Is this "fake-pkg" process even running on the NFS
> > mount? The WRKDIR is /tmp, which is an mfs mount.
> 
> fake-pkg is run in WRKDIR, but it calls pkgng which will open
> /var/db/pkg/local.sqlite aka nfs mount.
> 
> The Error 70 is EX_SOFTWARE returned by pkgng.
> 
> Can you try the following patch:
> http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c
> 
> Just add that file to /usr/ports/ports-mgmt/pkg/files/
> 
> If that works for you, that means the POSIX advisory locks are somehow
> failing on NFSv4 files.
> 
> Given that this is already known to fail on NFSv3 (because people often
> misconfigure it), I'll probably make unix-dotfile the default locking
> system when local.sqlite is stored on a network filesystem.
> 
> regards,
> Bapt


Re: newnfs pkgng database corruption?

2013-04-22 Thread Baptiste Daroussin
On Fri, Apr 12, 2013 at 03:10:37PM +0200, Baptiste Daroussin wrote:
> On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
> > Hi,
> > 
> > On Apr 12, 2013, at 1:10, Rick Macklem  wrote:
> > > Well, I have no idea why an NFS server would reply errno 70 if the file
> > > still exists, unless the client has somehow sent a bogus file handle
> > > to the server. (I am not aware of any client bug that might do that. I
> > > am almost suspicious that there might be a memory problem or something
> > > that corrupts bits in the network layer. Do you have TSO enabled for your
> > > network interface by any chance? If so, I'd try disabling that on the
> > > network interface. Same goes for checksum offload.)
> > > 
> > > rick
> > > ps: If you can capture packets between the client and server at the
> > >time this error occurs, looking at them in wireshark might be
> > >useful?
> > 
> > I will try all of those things.
> > 
> > But first, a question that someone who understands pkgng will be able to 
> > answer: Is this "fake-pkg" process even running on the NFS mount? The 
> > WRKDIR is /tmp, which is an mfs mount.
> 
> fake-pkg is run in WRKDIR, but it calls pkgng which will open
> /var/db/pkg/local.sqlite aka nfs mount.
> 
> The Error 70 is EX_SOFTWARE returned by pkgng.
> 
> Can you try the following patch:
> http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c
> 
> Just add that file to /usr/ports/ports-mgmt/pkg/files/
> 
> If that works for you, that means the POSIX advisory locks are somehow failing
> on NFSv4 files.
> 
> Given that this is already known to fail on NFSv3 (because people often
> misconfigure it), I'll probably make unix-dotfile the default locking system
> when local.sqlite is stored on a network filesystem.
> 
> regards,
> Bapt

Has anyone been able to test this patch?

regards,
Bapt


Re: newnfs pkgng database corruption?

2013-04-23 Thread Eggert, Lars
Hi,

On Apr 22, 2013, at 2:56, Baptiste Daroussin  wrote:
> Has anyone been able to test this patch?

I've been running with it for a few days. I've done a reinstall of all ports 
plus a few "portmaster -a" runs without pkgng database corruption. I've not 
tested it for very long, but so far, things look good.

Lars


Re: newnfs pkgng database corruption?

2013-04-23 Thread Baptiste Daroussin
On Tue, Apr 23, 2013 at 08:44:43PM +, Eggert, Lars wrote:
> Hi,
> 
> On Apr 22, 2013, at 2:56, Baptiste Daroussin  wrote:
> > Has anyone been able to test this patch?
> 
> I've been running with it for a few days. I've done a reinstall of all ports 
> plus a few "portmaster -a" runs without pkgng database corruption. I've not 
> tested it for very long, but so far, things look good.
> 
> Lars

Great, thank you. I'll activate this for all databases located on a network
filesystem.

Thank you very much!
Bapt

