Re: newnfs pkgng database corruption?
Hi,

On Apr 22, 2013, at 2:56, Baptiste Daroussin b...@freebsd.org wrote:
> Has anyone been able to test this patch?

I've been running with it for a few days. I've done a reinstall of all ports plus a few portmaster -a runs without pkgng database corruption.

I've not tested it for very long, but so far, things look good.

Lars
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: newnfs pkgng database corruption?
On Tue, Apr 23, 2013 at 08:44:43PM +, Eggert, Lars wrote:
> Hi,
>
> On Apr 22, 2013, at 2:56, Baptiste Daroussin b...@freebsd.org wrote:
> > Has anyone been able to test this patch?
>
> I've been running with it for a few days. I've done a reinstall of all ports plus a few portmaster -a runs without pkgng database corruption.
>
> I've not tested it for very long, but so far, things look good.
>
> Lars

Great, thank you. I'll activate this for all databases located on a network filesystem.

Thank you very much!
Bapt
Re: newnfs pkgng database corruption?
On Fri, Apr 12, 2013 at 03:10:37PM +0200, Baptiste Daroussin wrote:
> On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
> > Hi,
> >
> > On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
> > > Well, I have no idea why an NFS server would reply errno 70 if the file still exists, unless the client has somehow sent a bogus file handle to the server. (I am not aware of any client bug that might do that. I am almost suspicious that there might be a memory problem or something that corrupts bits in the network layer. Do you have TSO enabled for your network interface by any chance? If so, I'd try disabling that on the network interface. Same goes for checksum offload.)
> > >
> > > rick
> > > ps: If you can capture packets between the client and server at the time this error occurs, looking at them in Wireshark might be useful?
> >
> > I will try all of those things.
> >
> > But first, a question that someone who understands pkgng will be able to answer: Is this fake-pkg process even running on the NFS mount? The WRKDIR is /tmp, which is an mfs mount.
>
> fake-pkg is run in WRKDIR, but it calls pkgng, which will open /var/db/pkg/local.sqlite, i.e. on the NFS mount.
>
> The Error 70 is EX_SOFTWARE returned by pkgng.
>
> Can you try the following patch:
> http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c
>
> Just add that file to /usr/ports/ports-mgmt/pkg/files/
>
> If that works for you, that means the POSIX advisory locks are somehow failing on NFSv4 files. Given that they are already known to fail on NFSv3 (because people often misconfigure it), I'll probably make unix-dotfile the default locking system when local.sqlite is stored on a network filesystem.
>
> regards,
> Bapt

Has anyone been able to test this patch?

regards,
Bapt
Re: newnfs pkgng database corruption?
Hi,

On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
> Well, I have no idea why an NFS server would reply errno 70 if the file still exists, unless the client has somehow sent a bogus file handle to the server. (I am not aware of any client bug that might do that. I am almost suspicious that there might be a memory problem or something that corrupts bits in the network layer. Do you have TSO enabled for your network interface by any chance? If so, I'd try disabling that on the network interface. Same goes for checksum offload.)
>
> rick
> ps: If you can capture packets between the client and server at the time this error occurs, looking at them in Wireshark might be useful?

I will try all of those things.

But first, a question that someone who understands pkgng will be able to answer: Is this fake-pkg process even running on the NFS mount? The WRKDIR is /tmp, which is an mfs mount.

Lars
Re: newnfs pkgng database corruption?
On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
> Hi,
>
> On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
> > Well, I have no idea why an NFS server would reply errno 70 if the file still exists, unless the client has somehow sent a bogus file handle to the server. (I am not aware of any client bug that might do that. I am almost suspicious that there might be a memory problem or something that corrupts bits in the network layer. Do you have TSO enabled for your network interface by any chance? If so, I'd try disabling that on the network interface. Same goes for checksum offload.)
> >
> > rick
> > ps: If you can capture packets between the client and server at the time this error occurs, looking at them in Wireshark might be useful?
>
> I will try all of those things.
>
> But first, a question that someone who understands pkgng will be able to answer: Is this fake-pkg process even running on the NFS mount? The WRKDIR is /tmp, which is an mfs mount.

fake-pkg is run in WRKDIR, but it calls pkgng, which will open /var/db/pkg/local.sqlite, i.e. on the NFS mount.

The Error 70 is EX_SOFTWARE returned by pkgng.

Can you try the following patch:
http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c

Just add that file to /usr/ports/ports-mgmt/pkg/files/

If that works for you, that means the POSIX advisory locks are somehow failing on NFSv4 files. Given that they are already known to fail on NFSv3 (because people often misconfigure it), I'll probably make unix-dotfile the default locking system when local.sqlite is stored on a network filesystem.

regards,
Bapt
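
For readers who want to experiment with the unix-dotfile locking mentioned above without patching pkgng, here is a minimal Python sketch. It assumes the SQLite library behind Python's sqlite3 module ships the `unix-dotfile` VFS (standard Unix builds normally do). The helper names `open_with_dotfile_locking` and `roundtrip` and the `packages` table are hypothetical, and this is not the patch referenced in the message.

```python
import os
import sqlite3
import tempfile


def open_with_dotfile_locking(path):
    """Open an SQLite database that uses dot-file locks rather than
    POSIX advisory locks.

    Hypothetical helper for illustration only. `vfs=unix-dotfile` is a
    URI parameter understood by SQLite itself on Unix builds.
    """
    uri = "file:{}?vfs=unix-dotfile".format(path)
    return sqlite3.connect(uri, uri=True)


def roundtrip(path):
    """Write one row through the dot-file VFS and read it back."""
    conn = open_with_dotfile_locking(path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS packages (name TEXT UNIQUE)")
        conn.execute("INSERT OR IGNORE INTO packages VALUES ('pkg-1.0.11')")
        conn.commit()
        return [row[0] for row in conn.execute("SELECT name FROM packages")]
    finally:
        conn.close()


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(roundtrip(os.path.join(workdir, "local.sqlite")))
```

Dot-file locking only protects the database if every process that opens it uses the same VFS, so mixing it with default-locking clients on the same file would defeat the purpose.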
Re: newnfs pkgng database corruption?
Baptiste Daroussin wrote:
> On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
> > Hi,
> >
> > On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
> > > Well, I have no idea why an NFS server would reply errno 70 if the file still exists, unless the client has somehow sent a bogus file handle to the server. (I am not aware of any client bug that might do that. I am almost suspicious that there might be a memory problem or something that corrupts bits in the network layer. Do you have TSO enabled for your network interface by any chance? If so, I'd try disabling that on the network interface. Same goes for checksum offload.)
> > >
> > > rick
> > > ps: If you can capture packets between the client and server at the time this error occurs, looking at them in Wireshark might be useful?
> >
> > I will try all of those things.

You might still try the above suggestions, but since Error 70 wasn't an errno.h error number, it isn't a stale file handle problem and, as such, there isn't any evidence that bits are getting messed with by the network layers.

rick

> > But first, a question that someone who understands pkgng will be able to answer: Is this fake-pkg process even running on the NFS mount? The WRKDIR is /tmp, which is an mfs mount.
>
> fake-pkg is run in WRKDIR, but it calls pkgng, which will open /var/db/pkg/local.sqlite, i.e. on the NFS mount.
>
> The Error 70 is EX_SOFTWARE returned by pkgng.
>
> Can you try the following patch:
> http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c
>
> Just add that file to /usr/ports/ports-mgmt/pkg/files/
>
> If that works for you, that means the POSIX advisory locks are somehow failing on NFSv4 files. Given that they are already known to fail on NFSv3 (because people often misconfigure it), I'll probably make unix-dotfile the default locking system when local.sqlite is stored on a network filesystem.
>
> regards,
> Bapt
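
The mix-up between ESTALE and EX_SOFTWARE comes from two unrelated number spaces sharing the value 70. The Python sketch below illustrates that what make(1) prints as "Error code 70" is a child's exit status. The constants are written out by hand from FreeBSD's <sysexits.h> and <errno.h>, and `exit_status_of` and `describe` are hypothetical helper names, not anything from pkgng.

```python
import subprocess
import sys

# Both values are 70 on FreeBSD, which is the source of the confusion.
EX_SOFTWARE = 70      # <sysexits.h>: internal software error (an exit status)
ESTALE_FREEBSD = 70   # FreeBSD <errno.h>: stale NFS file handle (an errno)


def exit_status_of(code):
    """Run a child that exits with `code` and return what the parent sees."""
    child = subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit({})".format(code)]
    )
    return child.returncode


def describe(status):
    """Interpret a status reported by make(1) as an exit status, never an errno."""
    if status == EX_SOFTWARE:
        return "EX_SOFTWARE (exit status, not an errno)"
    return "exit status {}".format(status)


if __name__ == "__main__":
    print(describe(exit_status_of(EX_SOFTWARE)))
```

An errno value only reaches the user if the failing program chooses to print it, so a bare "Error code N" from make says nothing about NFS file handles.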
Re: newnfs pkgng database corruption?
On Thu, Apr 11, 2013 at 05:52:52AM +, Eggert, Lars wrote:
> Hi,
>
> On Apr 11, 2013, at 0:16, Baptiste Daroussin b...@freebsd.org wrote:
> > Will you be able to test it?
>
> yes. (But I will be traveling for the next two weeks and so the turnaround may be a bit longer than normal.)
>
> Lars

First, I think you can recover your database.

Can you try the following commands:
# mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
# echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell
# echo 'pragma user_config=12;' | pkg shell

This should give you a working database again, I hope :)

I think the corruption you are seeing is due to the synchronous pragma. I need to dig in that direction.

regards,
Bapt
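
The dump-and-reload idea in the commands above can also be tried outside of `pkg shell`. The following Python sketch uses the standard sqlite3 module's `iterdump()`; the function name `dump_and_reload` is hypothetical, and the sketch assumes the source file is still readable enough to be dumped (a badly damaged file may raise `sqlite3.DatabaseError` instead).

```python
import sqlite3


def dump_and_reload(src_path, dst_path):
    """Replay a textual SQL dump of src_path into a new database at dst_path.

    Roughly what piping '.dump' from one SQLite shell into another does.
    Pragmas such as the schema version are not carried over by iterdump().
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        script = "\n".join(src.iterdump())
        dst.executescript(script)
        dst.commit()
    finally:
        src.close()
        dst.close()
```

A typical call would be `dump_and_reload("backup.sqlite", "local.sqlite")`, run against copies of the files rather than the live database.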
Re: newnfs pkgng database corruption?
Hi,

On Apr 11, 2013, at 10:30, Baptiste Daroussin b...@freebsd.org wrote:
> First, I think you can recover your database.

that would be great.

> Can you try the following command:
> # mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
> # echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell

That step doesn't quite work:

[root@stanley /usr/home/elars/local/db]# echo '.dump' | pkg shell backup.sqlite | pkg shell
Error: near line 15927: column path is not unique
Error: near line 15928: column path is not unique
Error: near line 15929: column path is not unique
Error: near line 15930: column path is not unique
Error: near line 15931: column path is not unique
Error: near line 15932: column path is not unique
Error: near line 15933: column path is not unique
Error: near line 15934: column path is not unique
Error: near line 15935: column path is not unique
Error: near line 15936: column path is not unique
Error: near line 15937: column path is not unique
[root@stanley /usr/home/elars/local/db]# ll local.sqlite
-rw-r--r-- 1 root wheel 0 Apr 11 10:42 local.sqlite

I can send you the database off-list, if you like.

> I think the corruption you get are due to the synchronous pragma. I need to dig in that direction.

Thanks for looking into this!

Lars
Re: newnfs pkgng database corruption?
On Thu, Apr 11, 2013 at 08:44:01AM +, Eggert, Lars wrote:
> Hi,
>
> On Apr 11, 2013, at 10:30, Baptiste Daroussin b...@freebsd.org wrote:
> > First, I think you can recover your database.
>
> that would be great.
>
> > Can you try the following command:
> > # mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
> > # echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell
>
> That step doesn't quite work:
>
> [root@stanley /usr/home/elars/local/db]# echo '.dump' | pkg shell backup.sqlite | pkg shell
> Error: near line 15927: column path is not unique
> Error: near line 15928: column path is not unique
> Error: near line 15929: column path is not unique
> Error: near line 15930: column path is not unique
> Error: near line 15931: column path is not unique
> Error: near line 15932: column path is not unique
> Error: near line 15933: column path is not unique
> Error: near line 15934: column path is not unique
> Error: near line 15935: column path is not unique
> Error: near line 15936: column path is not unique
> Error: near line 15937: column path is not unique
> [root@stanley /usr/home/elars/local/db]# ll local.sqlite
> -rw-r--r-- 1 root wheel 0 Apr 11 10:42 local.sqlite
>
> I can send you the database off-list, if you like.

Yes please.

regards,
Bapt
Re: newnfs pkgng database corruption?
Lars Eggert wrote:
> Hi,
>
> On Apr 11, 2013, at 1:28, Rick Macklem rmack...@uoguelph.ca wrote:
> > Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server replies with that when the file no longer exists. File locking doesn't stop a file from being removed, as far as I know.
>
> but the file is still there.

Well, I have no idea why an NFS server would reply errno 70 if the file still exists, unless the client has somehow sent a bogus file handle to the server. (I am not aware of any client bug that might do that. I am almost suspicious that there might be a memory problem or something that corrupts bits in the network layer. Do you have TSO enabled for your network interface by any chance? If so, I'd try disabling that on the network interface. Same goes for checksum offload.)

rick
ps: If you can capture packets between the client and server at the time this error occurs, looking at them in Wireshark might be useful?

> Lars
newnfs pkgng database corruption?
Hi,

on a diskless server, I keep the ports tree and pkgng databases on a newnfs NFSv4 mount.

After a bunch of portmaster -a runs, the pkgng sqlite database appears to get corrupted. For example, when I try to update an existing port, this happens:

root@five:~ # portmaster ports-mgmt/pkg
...
===> Registering installation for pkg-1.0.11
Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed (pkgdb.c:925)
pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
*** [fake-pkg] Error code 70

I have removed all ports and the pkgng databases and reinstalled, but the corruption seems to return after a few days or weeks of installing and deinstalling ports.

On another system that has a disk, that corruption of the pkgng database has not happened over six months or so. I therefore wonder if storing the sqlite database on an NFS mount is triggering some sort of bug, either in pkgng or in newnfs.

AFAIK, pkgng is using locks on the database quite liberally; could that be where a bug is lurking?

I'm happy to help debug this, but someone would need to let me know what to try.

Thanks,
Lars
Re: newnfs pkgng database corruption?
On Wed, Apr 10, 2013 at 07:42:30AM +, Eggert, Lars wrote:
> Hi,
>
> on a diskless server, I keep the ports tree and pkgng databases on a newnfs NFSv4 mount.
>
> After a bunch of portmaster -a runs, the pkgng sqlite database appears to get corrupted. For example, when I try to update an existing port, this happens:
>
> root@five:~ # portmaster ports-mgmt/pkg
> ...
> ===> Registering installation for pkg-1.0.11
> Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed (pkgdb.c:925)
> pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
> *** [fake-pkg] Error code 70
>
> I have removed all ports and the pkgng databases and reinstalled, but the corruption seems to return after a few days or weeks of installing and deinstalling ports.
>
> On another system that has a disk, that corruption of the pkgng database has not happened over six months or so. I therefore wonder if storing the sqlite database on an NFS mount is triggering some sort of bug, either in pkgng or in newnfs.
>
> AFAIK, pkgng is using locks on the database quite liberally; could that be where a bug is lurking?
>
> I'm happy to help debug this, but someone would need to let me know what to try.

This usually happens when a user does not have the NFS lock system started. Are you sure that NFS locking is correctly started?

If that is the case, there is a bug in pkgng anyway: it should catch the problem and refuse to operate in such a situation. I know sqlite provides a mechanism that allows us to catch this; I'm not yet sure about using it.

regards,
Bapt
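
One quick way to see whether POSIX advisory locking is available at all on a given mount is to try taking a lock there. The Python sketch below does that with the standard `fcntl.lockf` call. The helper name `advisory_locks_work` is hypothetical, and a successful result only shows that the lock call succeeds on that filesystem, not that the server honors the lock across clients.

```python
import fcntl
import os
import tempfile


def advisory_locks_work(directory):
    """Return True if a non-blocking POSIX advisory lock can be taken
    on a scratch file created inside `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # e.g. ENOLCK when the lock service is not reachable
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(advisory_locks_work(tempfile.gettempdir()))
```

Pointing the function at the directory that holds local.sqlite (for example /var/db/pkg) is the interesting case; a False result there would be a strong hint to look at the locking setup first.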
Re: newnfs pkgng database corruption?
Hi,

On Apr 10, 2013, at 10:02, Baptiste Daroussin b...@freebsd.org wrote:
> This usually happens when a user does not have the NFS lock system started. Are you sure that NFS locking is correctly started?

with NFSv4, the locking system is integrated with the main protocol; it's no longer separate.

> If that is the case, there is a bug in pkgng anyway: it should catch the problem and refuse to operate in such a situation. I know sqlite provides a mechanism that allows us to catch this; I'm not yet sure about using it.

Not sure about that.

In case anyone wonders, the corruption is quite substantial:

[elars@stanley ~]$ sqlite3 local/db/local.sqlite
SQLite version 3.7.14.1 2012-10-04 19:37:12
Enter .help for instructions
Enter SQL statements terminated with a ;
sqlite> PRAGMA integrity_check;
*** in database main ***
On tree page 1238 cell 17: 2nd reference to page 1237
On tree page 1238 cell 17: Child page depth differs
On tree page 1238 cell 18: Child page depth differs
On tree page 1241 cell 6: Rowid 17518 out of order (max larger than parent max of 12550)
On tree page 1242 cell 3: Rowid 17566 out of order (max larger than parent max of 12557)
On tree page 1243 cell 6: Rowid 12558 out of order (min less than parent min of 17566)
On tree page 2867 cell 28: 2nd reference to page 1241
On tree page 2867 cell 28: Child page depth differs
On tree page 2867 cell 29: 2nd reference to page 1242
On tree page 2867 cell 30: Child page depth differs
On tree page 1417 cell 66: 2nd reference to page 1239
On tree page 1417 cell 66: Child page depth differs
On tree page 1417 cell 67: 2nd reference to page 1240
On tree page 1417 cell 68: Child page depth differs
rowid 62 missing from index sqlite_autoindex_packages_1
wrong # of entries in index sqlite_autoindex_packages_1
rowid 96 missing from index scripts_package_id
rowid 96 missing from index sqlite_autoindex_scripts_1
rowid 97 missing from index scripts_package_id
rowid 97 missing from index sqlite_autoindex_scripts_1
rowid 98 missing from index scripts_package_id
rowid 98 missing from index sqlite_autoindex_scripts_1
wrong # of entries in index scripts_package_id
wrong # of entries in index sqlite_autoindex_scripts_1
rowid 12509 missing from index sqlite_autoindex_files_1
rowid 12510 missing from index sqlite_autoindex_files_1
rowid 12511 missing from index sqlite_autoindex_files_1
rowid 12512 missing from index sqlite_autoindex_files_1
rowid 86 missing from index files_package_id
rowid 86 missing from index sqlite_autoindex_files_1
rowid 87 missing from index files_package_id
rowid 87 missing from index sqlite_autoindex_files_1
Error: database disk image is malformed

Lars
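
The same `PRAGMA integrity_check` can be run from a script, which is handy for checking the database after each batch of port installs. This is a minimal Python sketch using the standard sqlite3 module; `integrity_report` and `is_healthy` are hypothetical helper names. On a severely damaged file the query itself may raise `sqlite3.DatabaseError` rather than return rows.

```python
import sqlite3


def integrity_report(path):
    """Return the rows produced by PRAGMA integrity_check as a list of strings.

    A healthy database yields the single row 'ok'.
    """
    conn = sqlite3.connect(path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()


def is_healthy(path):
    """True if integrity_check reports nothing but 'ok'."""
    return integrity_report(path) == ["ok"]
```

Calling `is_healthy("local/db/local.sqlite")` on a copy of the database is safer than checking the live file while pkg may be writing to it.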
Re: newnfs pkgng database corruption?
On Wed, Apr 10, 2013 at 08:09:42AM +, Eggert, Lars wrote:
> Hi,
>
> On Apr 10, 2013, at 10:02, Baptiste Daroussin b...@freebsd.org wrote:
> > This usually happens when a user does not have the NFS lock system started. Are you sure that NFS locking is correctly started?
>
> with NFSv4, the locking system is integrated with the main protocol; it's no longer separate.
>
> > If that is the case, there is a bug in pkgng anyway: it should catch the problem and refuse to operate in such a situation. I know sqlite provides a mechanism that allows us to catch this; I'm not yet sure about using it.
>
> Not sure about that.
>
> In case anyone wonders, the corruption is quite substantial:

I think I know why; give me a couple of days to test a patch. Unfortunately, your database can't be recovered. Apparently, sqlite has some problems with the locking system of NFSv4 and has some workarounds.

Will you be able to test it?

Just to warn you, Firefox may have the same problem with unbundled sqlite.

regards,
Bapt
Re: newnfs pkgng database corruption?
Lars Eggert wrote:
> Hi,
>
> on a diskless server, I keep the ports tree and pkgng databases on a newnfs NFSv4 mount.
>
> After a bunch of portmaster -a runs, the pkgng sqlite database appears to get corrupted. For example, when I try to update an existing port, this happens:
>
> root@five:~ # portmaster ports-mgmt/pkg
> ...
> ===> Registering installation for pkg-1.0.11
> Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed (pkgdb.c:925)
> pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
> *** [fake-pkg] Error code 70

Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server replies with that when the file no longer exists. File locking doesn't stop a file from being removed, as far as I know.

rick

> I have removed all ports and the pkgng databases and reinstalled, but the corruption seems to return after a few days or weeks of installing and deinstalling ports.
>
> On another system that has a disk, that corruption of the pkgng database has not happened over six months or so. I therefore wonder if storing the sqlite database on an NFS mount is triggering some sort of bug, either in pkgng or in newnfs.
>
> AFAIK, pkgng is using locks on the database quite liberally; could that be where a bug is lurking?
>
> I'm happy to help debug this, but someone would need to let me know what to try.
>
> Thanks,
> Lars
Re: newnfs pkgng database corruption?
Hi,

On Apr 11, 2013, at 1:28, Rick Macklem rmack...@uoguelph.ca wrote:
> Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server replies with that when the file no longer exists. File locking doesn't stop a file from being removed, as far as I know.

but the file is still there.

Lars
Re: newnfs pkgng database corruption?
Hi,

On Apr 11, 2013, at 0:16, Baptiste Daroussin b...@freebsd.org wrote:
> Will you be able to test it?

yes. (But I will be traveling for the next two weeks and so the turnaround may be a bit longer than normal.)

Lars