Re: newnfs pkgng database corruption?

2013-04-23 Thread Eggert, Lars
Hi,

On Apr 22, 2013, at 2:56, Baptiste Daroussin b...@freebsd.org wrote:
 Has anyone been able to test this patch?

I've been running with it for a few days. I've done a reinstall of all ports 
plus a few portmaster -a runs without pkgng database corruption. I've not 
tested it for very long, but so far, things look good.

Lars
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: newnfs pkgng database corruption?

2013-04-23 Thread Baptiste Daroussin
On Tue, Apr 23, 2013 at 08:44:43PM +, Eggert, Lars wrote:
 Hi,
 
 On Apr 22, 2013, at 2:56, Baptiste Daroussin b...@freebsd.org wrote:
  Has anyone been able to test this patch?
 
 I've been running with it for a few days. I've done a reinstall of all ports 
 plus a few portmaster -a runs without pkgng database corruption. I've not 
 tested it for very long, but so far, things look good.
 
 Lars

Great, thank you. I'll activate this for all databases located on a network
filesystem.

Thank you very much!
Bapt




Re: newnfs pkgng database corruption?

2013-04-22 Thread Baptiste Daroussin
On Fri, Apr 12, 2013 at 03:10:37PM +0200, Baptiste Daroussin wrote:
 On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
  Hi,
  
  On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
   Well, I have no idea why an NFS server would reply errno 70 if the file
   still exists, unless the client has somehow sent a bogus file handle
   to the server. (I am not aware of any client bug that might do that. I
   am almost suspicious that there might be a memory problem or something
   that corrupts bits in the network layer. Do you have TSO enabled for your
   network interface by any chance? If so, I'd try disabling that on the
   network interface. Same goes for checksum offload.)
   
   rick
   ps: If you can capture packets between the client and server at the
  time this error occurs, looking at them in wireshark might be
  useful?
  
  I will try all of those things.
  
  But first, a question that someone who understands pkgng will be able to 
  answer: Is this fake-pkg process even running on the NFS mount? The 
  WRKDIR is /tmp, which is an mfs mount.
 
 fake-pkg is run in WRKDIR, but it calls pkgng which will open
 /var/db/pkg/local.sqlite aka nfs mount.
 
 The Error 70 is EX_SOFTWARE returned by pkgng.
 
 Can you try the following patch:
 http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c
 
 Just add that file to /usr/ports/ports-mgmt/pkg/files/
 
 If that works for you, it means that POSIX advisory locks are somehow failing
 on NFSv4 files.
 
 Given that they are already known to fail on NFSv3 (because people often
 misconfigure it), I'll probably make unix-dotfile the default locking system
 when local.sqlite is stored on a network filesystem.
 
 regards,
 Bapt

Has anyone been able to test this patch?

regards,
Bapt




Re: newnfs pkgng database corruption?

2013-04-12 Thread Eggert, Lars
Hi,

On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
 Well, I have no idea why an NFS server would reply errno 70 if the file
 still exists, unless the client has somehow sent a bogus file handle
 to the server. (I am not aware of any client bug that might do that. I
 am almost suspicious that there might be a memory problem or something
 that corrupts bits in the network layer. Do you have TSO enabled for your
 network interface by any chance? If so, I'd try disabling that on the
 network interface. Same goes for checksum offload.)
 
 rick
 ps: If you can capture packets between the client and server at the
time this error occurs, looking at them in wireshark might be
useful?

I will try all of those things.

But first, a question that someone who understands pkgng will be able to 
answer: Is this fake-pkg process even running on the NFS mount? The WRKDIR 
is /tmp, which is an mfs mount.

Lars


Re: newnfs pkgng database corruption?

2013-04-12 Thread Baptiste Daroussin
On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
 Hi,
 
 On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
  Well, I have no idea why an NFS server would reply errno 70 if the file
  still exists, unless the client has somehow sent a bogus file handle
  to the server. (I am not aware of any client bug that might do that. I
  am almost suspicious that there might be a memory problem or something
  that corrupts bits in the network layer. Do you have TSO enabled for your
  network interface by any chance? If so, I'd try disabling that on the
  network interface. Same goes for checksum offload.)
  
  rick
  ps: If you can capture packets between the client and server at the
 time this error occurs, looking at them in wireshark might be
 useful?
 
 I will try all of those things.
 
 But first, a question that someone who understands pkgng will be able to 
 answer: Is this fake-pkg process even running on the NFS mount? The WRKDIR 
 is /tmp, which is an mfs mount.

fake-pkg is run in WRKDIR, but it calls pkgng, which will open
/var/db/pkg/local.sqlite, and that file is on the NFS mount.

The Error 70 is EX_SOFTWARE returned by pkgng.
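
A minimal sketch of the distinction drawn above, in Python rather than anything pkgng itself uses: the "Error code 70" printed by make is a process exit status, so it is looked up among the sysexits-style constants that Python exposes as `os.EX_*`. The helper name `describe_exit_status` is made up for this illustration. The errno value of ESTALE differs between systems, so it is only printed, not compared.

```python
import errno
import os


def describe_exit_status(status: int) -> str:
    """Map a process exit status to its sysexits-style name, if it has one."""
    # Python exposes the sysexits.h constants as os.EX_* on Unix-like systems.
    for name in dir(os):
        if name.startswith("EX_") and getattr(os, name) == status:
            return name
    return "unknown"


if __name__ == "__main__":
    # make(1) reports the exit status of the failed command, not an errno.
    print("exit status 70 ->", describe_exit_status(70))
    # errno numbering is platform dependent, so ESTALE is only printed here.
    print("errno.ESTALE on this system ->", errno.ESTALE)
```

The point of the lookup is only that the same number can mean different things depending on whether it is read as an exit status or as an errno.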

Can you try the following patch:
http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c

Just add that file to /usr/ports/ports-mgmt/pkg/files/

If that works for you, it means that POSIX advisory locks are somehow failing
on NFSv4 files.

Given that they are already known to fail on NFSv3 (because people often
misconfigure it), I'll probably make unix-dotfile the default locking system
when local.sqlite is stored on a network filesystem.
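
The contents of the patch are not reproduced in this thread, so the following is only a sketch of what selecting the unix-dotfile locking style looks like from a SQLite client, here Python's stdlib `sqlite3`. It assumes a Unix build of SQLite, where the "unix-dotfile" VFS is normally registered; the function name and the temporary database path are hypothetical stand-ins, not pkgng code.

```python
import pathlib
import sqlite3
import tempfile


def open_with_dotfile_locking(db_path: pathlib.Path) -> sqlite3.Connection:
    """Open db_path through SQLite's "unix-dotfile" VFS."""
    # The vfs= URI parameter selects the VFS; uri=True enables URI filenames.
    uri = db_path.resolve().as_uri() + "?vfs=unix-dotfile"
    return sqlite3.connect(uri, uri=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for /var/db/pkg/local.sqlite.
        conn = open_with_dotfile_locking(pathlib.Path(tmp) / "local.sqlite")
        conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY)")
        conn.execute("INSERT INTO packages VALUES ('pkg')")
        conn.commit()
        print(conn.execute("SELECT name FROM packages").fetchall())
        conn.close()
```

With this VFS, SQLite coordinates access through a lock entry created next to the database instead of through fcntl() advisory locks, which is why it does not depend on the NFS lock machinery.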

regards,
Bapt




Re: newnfs pkgng database corruption?

2013-04-12 Thread Rick Macklem
Baptiste Daroussin wrote:
 On Fri, Apr 12, 2013 at 12:56:10PM +, Eggert, Lars wrote:
  Hi,
 
  On Apr 12, 2013, at 1:10, Rick Macklem rmack...@uoguelph.ca wrote:
   Well, I have no idea why an NFS server would reply errno 70 if the
   file
   still exists, unless the client has somehow sent a bogus file
   handle
   to the server. (I am not aware of any client bug that might do
   that. I
   am almost suspicious that there might be a memory problem or
   something
   that corrupts bits in the network layer. Do you have TSO enabled
   for your
   network interface by any chance? If so, I'd try disabling that on
   the
   network interface. Same goes for checksum offload.)
  
   rick
   ps: If you can capture packets between the client and server at
   the
  time this error occurs, looking at them in wireshark might be
  useful?
 
  I will try all of those things.
 
You might still try the above suggestions, but since Error 70 wasn't an
errno.h error number, it isn't a stale fh problem and, as such, there
isn't any evidence that bits are getting messed with by the network layers.

rick

  But first, a question that someone who understands pkgng will be
  able to answer: Is this fake-pkg process even running on the NFS
  mount? The WRKDIR is /tmp, which is an mfs mount.
 
 fake-pkg is run in WRKDIR, but it calls pkgng which will open
 /var/db/pkg/local.sqlite aka nfs mount.
 
 The Error 70 is EX_SOFTWARE returned by pkgng.
 
 Can you try the following patch:
 http://people.freebsd.org/~bapt/patch-libpkg__pkgdb.c
 
 Just add that file to /usr/ports/ports-mgmt/pkg/files/
 
 If that works for you, it means that POSIX advisory locks are somehow
 failing on NFSv4 files.
 
 Given that they are already known to fail on NFSv3 (because people often
 misconfigure it), I'll probably make unix-dotfile the default locking
 system when local.sqlite is stored on a network filesystem.
 
 regards,
 Bapt


Re: newnfs pkgng database corruption?

2013-04-11 Thread Baptiste Daroussin
On Thu, Apr 11, 2013 at 05:52:52AM +, Eggert, Lars wrote:
 Hi,
 
 On Apr 11, 2013, at 0:16, Baptiste Daroussin b...@freebsd.org wrote:
  Will you be able to test it?
 
 yes. (But I will be traveling for the next two weeks and so the turnaround 
 may be a bit longer than normal.)
 
 Lars

First, I think you can recover your database.

Can you try the following commands:

# mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
# echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell
# echo 'pragma user_config=12;' | pkg shell

This should hopefully give you a working database again :)
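
The same dump-and-reload idea can be sketched with Python's stdlib `sqlite3` instead of `pkg shell`. This is an illustration only: `rebuild_database` and the file names are hypothetical, and it assumes the schema version is kept in SQLite's `user_version` pragma, which the sketch copies from the old file rather than hard-coding.

```python
import os
import sqlite3
import tempfile


def rebuild_database(src_path: str, dst_path: str) -> None:
    """Replay a SQL dump of src_path into a fresh database at dst_path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # iterdump() yields SQL statements comparable to the shell's .dump.
        dst.executescript("\n".join(src.iterdump()))
        # Set user_version explicitly, mirroring the separate pragma step.
        (version,) = src.execute("PRAGMA user_version").fetchone()
        dst.execute(f"PRAGMA user_version = {int(version)}")
        dst.commit()
    finally:
        src.close()
        dst.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "backup.sqlite")
        new = os.path.join(tmp, "local.sqlite")
        conn = sqlite3.connect(old)
        conn.execute("PRAGMA user_version = 12")
        conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY)")
        conn.execute("INSERT INTO files VALUES ('/bin/sh')")
        conn.commit()
        conn.close()
        rebuild_database(old, new)
        print(sqlite3.connect(new).execute("SELECT path FROM files").fetchall())
```

A reload like this can only succeed if the dumped rows still satisfy the schema's constraints, so it is not guaranteed to work on a damaged file.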

I think the corruption you are seeing is due to the synchronous pragma. I need
to dig in that direction.
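
For anyone who wants to look at that setting on their own database, here is a small sketch using Python's stdlib `sqlite3`. It only reads and changes `PRAGMA synchronous` on a scratch file; which value pkgng actually sets is not stated in this thread, and the helper name is invented for the example.

```python
import os
import sqlite3
import tempfile

# Names follow the values documented for PRAGMA synchronous.
SYNC_LEVELS = {0: "OFF", 1: "NORMAL", 2: "FULL", 3: "EXTRA"}


def synchronous_level(conn: sqlite3.Connection) -> str:
    """Return the connection's current synchronous setting by name."""
    (level,) = conn.execute("PRAGMA synchronous").fetchone()
    return SYNC_LEVELS.get(level, str(level))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "scratch.sqlite"))
        print("default:", synchronous_level(conn))
        # The setting is per connection and must be changed outside a transaction.
        conn.execute("PRAGMA synchronous = OFF")
        print("after change:", synchronous_level(conn))
        conn.close()
```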

regards,
Bapt




Re: newnfs pkgng database corruption?

2013-04-11 Thread Eggert, Lars
Hi,

On Apr 11, 2013, at 10:30, Baptiste Daroussin b...@freebsd.org wrote:
 First, I think you can recover your database.

that would be great.

 Can you try the following command:
 
 # mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
 # echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell

That step doesn't quite work:

[root@stanley /usr/home/elars/local/db]# echo '.dump' | pkg shell backup.sqlite | pkg shell
Error: near line 15927: column path is not unique
Error: near line 15928: column path is not unique
Error: near line 15929: column path is not unique
Error: near line 15930: column path is not unique
Error: near line 15931: column path is not unique
Error: near line 15932: column path is not unique
Error: near line 15933: column path is not unique
Error: near line 15934: column path is not unique
Error: near line 15935: column path is not unique
Error: near line 15936: column path is not unique
Error: near line 15937: column path is not unique

[root@stanley /usr/home/elars/local/db]# ll local.sqlite 
-rw-r--r--  1 root  wheel  0 Apr 11 10:42 local.sqlite

I can send you the database off-list, if you like.

 I think the corruption you get are due to the synchronous pragma. I need to 
 dig
 in that direction.

Thanks for looking into this!

Lars


Re: newnfs pkgng database corruption?

2013-04-11 Thread Baptiste Daroussin
On Thu, Apr 11, 2013 at 08:44:01AM +, Eggert, Lars wrote:
 Hi,
 
 On Apr 11, 2013, at 10:30, Baptiste Daroussin b...@freebsd.org wrote:
  First, I think you can recover your database.
 
 that would be great.
 
  Can you try the following command:
  
  # mv /var/db/pkg/local.sqlite /var/db/pkg/backup.sqlite
  # echo '.dump' | pkg shell /var/db/pkg/backup.sqlite | pkg shell
 
 That step doesn't quite work:
 
 [root@stanley /usr/home/elars/local/db]# echo '.dump' | pkg shell backup.sqlite | pkg shell
 Error: near line 15927: column path is not unique
 Error: near line 15928: column path is not unique
 Error: near line 15929: column path is not unique
 Error: near line 15930: column path is not unique
 Error: near line 15931: column path is not unique
 Error: near line 15932: column path is not unique
 Error: near line 15933: column path is not unique
 Error: near line 15934: column path is not unique
 Error: near line 15935: column path is not unique
 Error: near line 15936: column path is not unique
 Error: near line 15937: column path is not unique
 
 [root@stanley /usr/home/elars/local/db]# ll local.sqlite 
 -rw-r--r--  1 root  wheel  0 Apr 11 10:42 local.sqlite
 
 I can send you the database off-list, if you like.
 
Yes please.

regards,
Bapt




Re: newnfs pkgng database corruption?

2013-04-11 Thread Rick Macklem
Lars Eggert wrote:
 Hi,
 
 On Apr 11, 2013, at 1:28, Rick Macklem rmack...@uoguelph.ca wrote:
  Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server
  replies with that when the file no longer exists.
 
  File locking doesn't stop a file from being removed, as far as I
  know.
 
 but the file is still there.
 
Well, I have no idea why an NFS server would reply errno 70 if the file
still exists, unless the client has somehow sent a bogus file handle
to the server. (I am not aware of any client bug that might do that. I
am almost suspicious that there might be a memory problem or something
that corrupts bits in the network layer. Do you have TSO enabled for your
network interface by any chance? If so, I'd try disabling that on the
network interface. Same goes for checksum offload.)

rick
ps: If you can capture packets between the client and server at the
time this error occurs, looking at them in wireshark might be
useful?

 Lars


newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

on a diskless server, I keep the ports tree and pkgng databases on a newnfs 
NFSv4 mount. After a bunch of portmaster -a runs, the pkgng sqlite database 
appears to get corrupted. For example, when I try to update an existing port, 
this happens:

root@five:~ # portmaster ports-mgmt/pkg
...
===>   Registering installation for pkg-1.0.11
Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed (pkgdb.c:925)
pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
*** [fake-pkg] Error code 70

I have removed all ports and the pkgng databases and reinstalled, but the 
corruption seems to return after a few days or weeks of installing and 
deinstalling ports.

On another system that has a disk, that corruption of the pkgng database has 
not happened over six months or so. I therefore wonder if storing the sqlite 
database on an NFS-mount is triggering some sort of bug, either in pkgng or in 
newnfs. AFAIK, pkgng is using locks on the database quite liberally, could that 
be where a bug is lurking?

I'm happy to help debug this, but someone would need to let me know what to try.

Thanks,
Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Baptiste Daroussin
On Wed, Apr 10, 2013 at 07:42:30AM +, Eggert, Lars wrote:
 Hi,
 
 on a diskless server, I keep the ports tree and pkgng databases on a newnfs 
 NFSv4 mount. After a bunch of portmaster -a runs, the pkgng sqlite database 
 appears to get corrupted. For example, when I try to update an existing port, 
 this happens:
 
 root@five:~ # portmaster ports-mgmt/pkg
 ...
 ===   Registering installation for pkg-1.0.11
 Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed 
 (pkgdb.c:925)
 pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
 *** [fake-pkg] Error code 70
 
 I have removed all ports and the pkgng databases and reinstalled, but the 
 corruption seems to return after a few days or weeks of installing and 
 deinstalling ports.
 
 On another system that has a disk, that corruption of the pkgng database has 
 not happened over six months or so. I therefore wonder if storing the sqlite 
 database on an NFS-mount is triggering some sort of bug, either in pkgng or 
 in newnfs. AFAIK, pkgng is using locks on the database quite liberally, could 
 that be where a bug is lurking?
 
 I'm happy to help debug this, but someone would need to let me know what to 
 try.
 

This usually happens when the NFS lock system has not been started.

Are you sure that NFS locking is correctly started?

If that is the case, there is a bug in pkgng anyway: it should catch the
problem and refuse to operate in such a situation. I know that sqlite provides
a mechanism that would allow us to catch this, but I'm not yet sure how to use
it.
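
One way such a check could look, sketched in Python: try to take a POSIX advisory lock on a scratch file in the directory that holds the database. This is neither the sqlite mechanism mentioned above nor pkgng code; `posix_locks_work` is a hypothetical helper, and a successful probe only shows that the lock call was accepted, not that locks are honoured across NFS clients.

```python
import fcntl
import os
import tempfile


def posix_locks_work(directory: str) -> bool:
    """Return True if an fcntl() advisory lock can be taken in directory."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="lockprobe-")
    try:
        # Non-blocking exclusive lock; on NFS without a working lock
        # service this typically fails with OSError (for example ENOLCK).
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the directory holding local.sqlite.
        print("advisory locks usable:", posix_locks_work(tmp))
```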

regards,
Bapt




Re: newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

On Apr 10, 2013, at 10:02, Baptiste Daroussin b...@freebsd.org wrote:
 This can usually happen when a user do not have the nfs lock system started.
 Are you sure that nfs lock is correctly started?

With NFSv4, the locking system is integrated into the main protocol; it's no 
longer separate.

 If that is the case, there is anyway a bug in pkgng that should catch the
 problem and refuse to operate in such situation, I know sqlite to provide a
 mechanism that allow us to be able to catch this, I'm not sure yet to use it.

Not sure about that.

In case anyone wonders, the corruption is quite substantial:

[elars@stanley ~]$ sqlite3 local/db/local.sqlite 
SQLite version 3.7.14.1 2012-10-04 19:37:12
Enter .help for instructions
Enter SQL statements terminated with a ;
sqlite> PRAGMA integrity_check;
*** in database main ***
On tree page 1238 cell 17: 2nd reference to page 1237
On tree page 1238 cell 17: Child page depth differs
On tree page 1238 cell 18: Child page depth differs
On tree page 1241 cell 6: Rowid 17518 out of order (max larger than parent max of 12550)
On tree page 1242 cell 3: Rowid 17566 out of order (max larger than parent max of 12557)
On tree page 1243 cell 6: Rowid 12558 out of order (min less than parent min of 17566)
On tree page 2867 cell 28: 2nd reference to page 1241
On tree page 2867 cell 28: Child page depth differs
On tree page 2867 cell 29: 2nd reference to page 1242
On tree page 2867 cell 30: Child page depth differs
On tree page 1417 cell 66: 2nd reference to page 1239
On tree page 1417 cell 66: Child page depth differs
On tree page 1417 cell 67: 2nd reference to page 1240
On tree page 1417 cell 68: Child page depth differs
rowid 62 missing from index sqlite_autoindex_packages_1
wrong # of entries in index sqlite_autoindex_packages_1
rowid 96 missing from index scripts_package_id
rowid 96 missing from index sqlite_autoindex_scripts_1
rowid 97 missing from index scripts_package_id
rowid 97 missing from index sqlite_autoindex_scripts_1
rowid 98 missing from index scripts_package_id
rowid 98 missing from index sqlite_autoindex_scripts_1
wrong # of entries in index scripts_package_id
wrong # of entries in index sqlite_autoindex_scripts_1
rowid 12509 missing from index sqlite_autoindex_files_1
rowid 12510 missing from index sqlite_autoindex_files_1
rowid 12511 missing from index sqlite_autoindex_files_1
rowid 12512 missing from index sqlite_autoindex_files_1
rowid 86 missing from index files_package_id
rowid 86 missing from index sqlite_autoindex_files_1
rowid 87 missing from index files_package_id
rowid 87 missing from index sqlite_autoindex_files_1
Error: database disk image is malformed
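
The same check can be scripted, for example to run it after each batch of port upgrades. A minimal sketch with Python's stdlib `sqlite3` follows; the function name and the scratch file are assumptions for illustration, not part of pkgng.

```python
import os
import sqlite3
import tempfile


def integrity_problems(db_path: str) -> list[str]:
    """Return the messages from PRAGMA integrity_check; empty if healthy."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can make the check itself fail.
        return [str(exc)]
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # A healthy database reports the single row "ok".
    return [] if messages == ["ok"] else messages


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "local.sqlite")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY)")
        conn.commit()
        conn.close()
        print(integrity_problems(path))
```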

Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Baptiste Daroussin
On Wed, Apr 10, 2013 at 08:09:42AM +, Eggert, Lars wrote:
 Hi,
 
 On Apr 10, 2013, at 10:02, Baptiste Daroussin b...@freebsd.org wrote:
  This can usually happen when a user do not have the nfs lock system started.
  Are you sure that nfs lock is correctly started?
 
 with NFSv4, the locking system is integrated with the main protocol, it's no 
 longer separate.
 
  If that is the case, there is anyway a bug in pkgng that should catch the
  problem and refuse to operate in such situation, I know sqlite to provide a
  mechanism that allow us to be able to catch this, I'm not sure yet to use 
  it.
 
 Not sure about that.
 
 In case anyone wonders, the corruption is quite substantial:
 

I think I know why; give me a couple of days to test a patch.

Unfortunately, your database can't be recovered. Apparently sqlite has some
problems with the locking system of NFSv4, and it has some workarounds for
that.

Will you be able to test it?

Just to warn you: Firefox may have the same problem with unbundled sqlite.

regards,
Bapt




Re: newnfs pkgng database corruption?

2013-04-10 Thread Rick Macklem
Lars Eggert wrote:
 Hi,
 
 on a diskless server, I keep the ports tree and pkgng databases on a
 newnfs NFSv4 mount. After a bunch of portmaster -a runs, the pkgng
 sqlite database appears to get corrupted. For example, when I try to
 update an existing port, this happens:
 
 root@five:~ # portmaster ports-mgmt/pkg
 ...
 === Registering installation for pkg-1.0.11
 Installing pkg-1.0.11...pkg: sqlite: database disk image is malformed
 (pkgdb.c:925)
 pkg: sqlite: database disk image is malformed (pkgdb.c:1914)
 *** [fake-pkg] Error code 70
 
Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server
replies with that when the file no longer exists.

File locking doesn't stop a file from being removed, as far as I know.

rick

 I have removed all ports and the pkgng databases and reinstalled, but
 the corruption seems to return after a few days or weeks of installing
 and deinstalling ports.
 
 On another system that has a disk, that corruption of the pkgng
 database has not happened over six months or so. I therefore wonder if
 storing the sqlite database on an NFS-mount is triggering some sort of
 bug, either in pkgng or in newnfs. AFAIK, pkgng is using locks on the
 database quite liberally, could that be where a bug is lurking?
 
 I'm happy to help debug this, but someone would need to let me know
 what to try.
 
 Thanks,
 Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

On Apr 11, 2013, at 1:28, Rick Macklem rmack...@uoguelph.ca wrote:
 Error code 70 is ESTALE (or NFSERR_STALE, if you prefer). The server
 replies with that when the file no longer exists.
 
 File locking doesn't stop a file from being removed, as far as I know.

but the file is still there.

Lars


Re: newnfs pkgng database corruption?

2013-04-10 Thread Eggert, Lars
Hi,

On Apr 11, 2013, at 0:16, Baptiste Daroussin b...@freebsd.org wrote:
 Will you be able to test it?

yes. (But I will be traveling for the next two weeks and so the turnaround may 
be a bit longer than normal.)

Lars