Re: increase softint_bytes

2017-11-20 Thread Martin Husemann
On Tue, Nov 21, 2017 at 03:45:26PM +0900, Masanobu SAITOH wrote:
>  0) Apply the following change to -current.
> 
> 
> Index: kern_softint.c
> ===
> RCS file: /cvsroot/src/sys/kern/kern_softint.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 kern_softint.c
> --- kern_softint.c	4 Jul 2016 04:20:14 -0000	1.43
> +++ kern_softint.c	21 Nov 2017 06:41:35 -0000
> @@ -217,7 +217,7 @@ typedef struct softcpu {
>  
>  static void	softint_thread(void *);
>  
> -u_int		softint_bytes = 8192;
> +u_int		softint_bytes = 16384;
>  u_int		softint_timing;
>  static u_int	softint_max;
>  static kmutex_t	softint_lock;
> 
> 
>  1) Send the pullup request to netbsd-8
> 
>  2) Write auto-resize code and commit.
> 
>  3) If it's stable, send the pullup request to netbsd-8.
> 
> 
>  OK?

Sounds like a great plan!

Martin


Re: increase softint_bytes

2017-11-20 Thread Masanobu SAITOH

On 2017/11/20 17:28, Masanobu SAITOH wrote:

On 2017/11/17 18:42, 6b...@6bone.informatik.uni-leipzig.de wrote:

On Thu, 16 Nov 2017, Masanobu SAITOH wrote:


Hi, all.

Some device drivers now allocate a lot of softints.
See:

http://mail-index.netbsd.org/current-users/2017/11/09/msg032581.html

To avoid this panic, I wrote the following patch:

http://www.netbsd.org/~msaitoh/softint-20171116-0.dif



I tested the patch. Now the dump comes in another place.

https://suse.uni-leipzig.de/crash/crash-with-patch.jpg

Regards
Uwe


Could you test the following patch?

 http://www.netbsd.org/~msaitoh/vlan-20171120-0.dif


Updated patch

http://www.netbsd.org/~msaitoh/vlan-20171121-0.dif

Fix compile error (sorry)

Revert if_wmreg.h 1.104 and if_wm.c 1.542

--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)



Re: increase softint_bytes

2017-11-20 Thread Masanobu SAITOH



can't this be fixed by making it dynamic?


  It's not easy, because the return value of softint_establish() is
derived from this area's address. As you know, each driver keeps that value.


  I'm sorry. I misread kern_softint.c. The return value does not point directly
into the area; it is an offset into the area, so resizing it would be easy.

  I'll try to modify the code to do auto-resize.
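
The point about offsets can be illustrated with a small user-space sketch. It is not the code in kern_softint.c: the `arena` structure and the `arena_init`, `arena_alloc` and `arena_deref` helpers are hypothetical names made up for this example, and `realloc` stands in for whatever the kernel would actually use. It only shows that a handle stored as an offset stays valid when the backing area moves during a resize, whereas a raw pointer would not.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable area; handles are byte offsets from base. */
struct arena {
	unsigned char *base;	/* may move when the area is resized */
	size_t size;		/* bytes currently allocated */
	size_t used;		/* bytes handed out so far */
};

#define ARENA_NOMEM ((size_t)-1)

static int
arena_init(struct arena *a, size_t size)
{
	a->base = calloc(1, size);
	a->used = 0;
	a->size = (a->base != NULL) ? size : 0;
	return (a->base != NULL) ? 0 : -1;
}

/* Hand out len bytes and return their offset, doubling the area if needed. */
static size_t
arena_alloc(struct arena *a, size_t len)
{
	if (len > a->size - a->used) {
		size_t newsize = (a->size != 0) ? a->size : 1;
		unsigned char *p;

		while (newsize - a->used < len) {
			if (newsize > SIZE_MAX / 2)
				return ARENA_NOMEM;
			newsize *= 2;
		}
		p = realloc(a->base, newsize);
		if (p == NULL)
			return ARENA_NOMEM;
		/* Zero the newly added tail; existing contents are preserved. */
		memset(p + a->size, 0, newsize - a->size);
		a->base = p;
		a->size = newsize;
	}
	a->used += len;
	return a->used - len;
}

/* Turn a handle back into a pointer; only valid until the next resize. */
static void *
arena_deref(const struct arena *a, size_t handle)
{
	assert(handle < a->used);
	return a->base + handle;
}
```

A caller would keep the value returned by `arena_alloc()` and call `arena_deref()` each time it needs the pointer, so a later allocation that moves the area does not invalidate the handle.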


 It'll take a little time to write this change. And it's low-level,
important code, so it will take time to test its stability
before sending a pullup request. So,

 0) Apply the following change to -current.


Index: kern_softint.c
===
RCS file: /cvsroot/src/sys/kern/kern_softint.c,v
retrieving revision 1.43
diff -u -p -r1.43 kern_softint.c
--- kern_softint.c	4 Jul 2016 04:20:14 -0000	1.43
+++ kern_softint.c	21 Nov 2017 06:41:35 -0000
@@ -217,7 +217,7 @@ typedef struct softcpu {
 
 static void	softint_thread(void *);
 
-u_int		softint_bytes = 8192;
+u_int		softint_bytes = 16384;
 u_int		softint_timing;
 static u_int	softint_max;
 static kmutex_t	softint_lock;


 1) Send the pullup request to netbsd-8

 2) Write auto-resize code and commit.

 3) If it's stable, send the pullup request to netbsd-8.


 OK?

--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)



Re: FFS corruption

2017-11-20 Thread Mouse
>>> After this migration, the filesystem started corrupting the same
>>> file /usr/pkg/etc/httpd/httpd.conf at the same time, which happens
>>> to be /etc/daily end of execution.
>> Which filesystem?  The original or the copy?
> The copy.  It would blow my mind if making a copy of a VM could break
> the original.

Strictly speaking, yes.  But you not only copied the data off the
source VM - which, yes, should not break anything - you also wrote it
to the destination.  I can imagine cases where that writing is what
caused the trouble.  Writing data to the destination might do this if,
for example, the destination is another VM ultimately backed by the
same spindle and there is something wrong with a driver that causes it
to confuse nominally-distinct disk blocks with one another (like my
possibility 2).

>> [With Xen] that there's at least one more layer of mapping between
>> OS sector numbers and hardware sector numbers
> Indeed; the domU filesystem is a file in the dom0 filesystem.

Hm.  That does make it more difficult to come up with a plausible
failure mode to explain this.  Have you tried creating another file in
the dom0 filesystem of the same size with easily identifiable content,
to see if any of that content appears in the affected domU filesystem?
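
One way to make such a test file easy to recognise is sketched below. It is only an illustration of the idea: the tag string, the 512-byte sector size and the helper names `fill_marker_sector` and `parse_marker` are assumptions chosen for this example, not anything taken from the thread. Every 16-byte slot carries a fixed tag plus the index of the sector it was written to, so any fragment that turns up inside the domU filesystem also says where in the dom0 file it came from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE	512u	/* assumed sector size */
#define SLOT_SIZE	16u	/* 8-byte tag + 8-byte sector index */

/* Hypothetical tag: 7 characters plus the terminating NUL, 8 bytes total. */
static const char MARKER_TAG[8] = "DOM0TAG";

/* Fill one sector with repeated (tag, little-endian sector index) slots. */
static void
fill_marker_sector(uint8_t *sector, uint64_t index)
{
	size_t off;
	int i;

	assert(sector != NULL);
	for (off = 0; off + SLOT_SIZE <= SECTOR_SIZE; off += SLOT_SIZE) {
		memcpy(sector + off, MARKER_TAG, sizeof(MARKER_TAG));
		for (i = 0; i < 8; i++)
			sector[off + 8 + i] = (uint8_t)(index >> (8 * i));
	}
}

/* Return 1 and store the sector index if p points at a marker slot. */
static int
parse_marker(const uint8_t *p, uint64_t *index)
{
	uint64_t v = 0;
	int i;

	if (memcmp(p, MARKER_TAG, sizeof(MARKER_TAG)) != 0)
		return 0;
	for (i = 0; i < 8; i++)
		v |= (uint64_t)p[8 + i] << (8 * i);
	*index = v;
	return 1;
}
```

The test file would be written one sector at a time with `fill_marker_sector(buf, i)`, and the affected domU image could then be scanned in 16-byte steps with `parse_marker()`.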

dholland's identification of the overwrite data as inodes certainly
does feel provocative, but I'm not sure what to make of it.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: FFS corruption

2017-11-20 Thread Emmanuel Dreyfus
Mouse  wrote:

> > After this migration, the filesystem started corrupting the same file
> > /usr/pkg/etc/httpd/httpd.conf at the same time, which happens to be
> > /etc/daily end of execution.
> 
> Which filesystem?  The original or the copy?

The copy. It would blow my mind if making a copy of a VM could break the
original.

> [With Xen] that there's at least one more layer of mapping between
> OS sector numbers and hardware sector numbers

Indeed; the domU filesystem is a file in the dom0 filesystem.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: FFS corruption

2017-11-20 Thread Mouse
> Everything was fine until I scp'd it over the network to a new machine.

> After this migration, the filesystem started corrupting the same file
> /usr/pkg/etc/httpd/httpd.conf at the same time, which happens to be
> /etc/daily end of execution.

Which filesystem?  The original or the copy?

> httpd.conf metadata did not change, its content was just [filled]
> with some fixed length binary records (sample included below, in case
> it rings a bell to someone).  Setting immutable flags did not prevent
> the corruption; And using ktrace on /etc/daily showed it did not
> touch httpd.conf nor even its parent directory.

> And fsck did not [find] anything wrong.  Is there anything ringing a
> bell to someone here?  Any explanation?

Offhand, this sounds like one of two things:

(1) The same piece of disk is being used by two filesystems at once,
and that just happens to be the place where both filesystems actually
_use_ overlapping pieces of disk (if a filesystem is mostly empty, most
of the space it's nominally using can be scribbled on without
corrupting the filesystem; two mostly empty filesystems nominally using
overlapping areas of disk might end up almost never both actually
depending on the same sectors).

(2) Somewhere in the data path for disk writes, the high bits of the
disk block numbers are getting lost, thereby directing writes to two
nominally different pieces of disk to the same sectors.  This could be
a software bug or a hardware issue (which could be a hardware bug, a
software bug, or a case of incompatibility).  As a simple example that
probably is not what's going on in your case, a SCSI driver that
doesn't know how to use 10-byte CDBs can end up redirecting sectors
above the 1G point back onto the same sectors as others that are below
the 1G point.
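
The arithmetic behind that example can be sketched as follows. A 6-byte READ/WRITE CDB carries a 21-bit block address, so with 512-byte sectors (an assumption here) it can reach 2^21 * 512 bytes = 1 GiB. The helper names are hypothetical, and the masking is a model of what a broken data path could do, not code from any real driver.

```c
#include <assert.h>
#include <stdint.h>

#define CDB6_LBA_BITS	21u	/* block address width in a 6-byte CDB */
#define SECTOR_SIZE	512u	/* assumed sector size */

/* Model of the bug: bits above the 21-bit field are silently dropped. */
static uint32_t
cdb6_truncate_lba(uint64_t lba)
{
	return (uint32_t)(lba & ((UINT64_C(1) << CDB6_LBA_BITS) - 1));
}

/* Two distinct blocks collide if they truncate to the same address. */
static int
lbas_alias(uint64_t a, uint64_t b)
{
	return a != b && cdb6_truncate_lba(a) == cdb6_truncate_lba(b);
}

/* Byte offset of a block, to relate block numbers to the 1G boundary. */
static uint64_t
lba_to_bytes(uint64_t lba)
{
	assert(lba <= UINT64_MAX / SECTOR_SIZE);
	return lba * SECTOR_SIZE;
}
```

Under this model, block 5 and block 2^21 + 5 land on the same sector, even though the filesystem treats them as unrelated.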

You mentioned that at least one of these machines was a Xen instance.
I don't know enough about Xen to do more than guess here, but it does
mean that there's at least one more layer of mapping between OS sector
numbers and hardware sector numbers, and thus at least one more layer
where two supposedly different pieces of disk could get mapped to the
same real sectors.  Those additional layers are also additional places
where the sort of botch outlined in (2) could strike.

I realize this isn't very helpful, but it's about all that comes to
mind that explains your observations.  In particular, the metadata not
changing, the immutable flag making no difference, ktrace showing no
accesses - those all, to me, point to something corrupting the disk
behind the OS's back.  It could be either of the above, or perhaps even
broken disk firmware, though that strikes me as unlikely compared to
the above.  fsck noticing nothing wrong probably just means that the
only thing that got hit was data blocks.  Hit a metadata block (inode
table, superblock, etc) instead and fsck should get upset, but if all
you're damaging is data blocks, fsck shouldn't care.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: FFS corruption

2017-11-20 Thread David Holland
On Mon, Nov 20, 2017 at 08:09:28AM +0000, Emmanuel Dreyfus wrote:
 > 0000  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 > 0010  5a 8d 0e 5a 60 8e 09 0f  5a 8d 0e 5a 60 8e 09 0f  |Z..Z`...Z..Z`...|
 > 0020  5a 8d 0e 5a 60 8e 09 0f  00 00 00 00 00 00 00 00  |Z..Z`...........|
 > 0030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 > *

That's an inode.
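
A rough way to check that reading is to decode the first record by hand. The sketch below assumes the UFS1 on-disk inode layout (16-bit mode at offset 0, 16-bit link count at offset 2, 32-bit atime, mtime and ctime seconds at offsets 16, 24 and 32) and little-endian byte order, as on i386. The `inode_prefix` structure and the helper names are made up for this example. The bytes are the first 0x28 bytes of the sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 0x28 bytes of the first record in the corrupted httpd.conf. */
static const uint8_t sample[40] = {
	0x80, 0x81, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x5a, 0x8d, 0x0e, 0x5a, 0x60, 0x8e, 0x09, 0x0f,
	0x5a, 0x8d, 0x0e, 0x5a, 0x60, 0x8e, 0x09, 0x0f,
	0x5a, 0x8d, 0x0e, 0x5a, 0x60, 0x8e, 0x09, 0x0f,
};

/* Hypothetical subset of the fields, enough to sanity-check the record. */
struct inode_prefix {
	uint16_t mode;		/* file type and permission bits */
	int16_t nlink;		/* link count */
	uint32_t atime;		/* seconds since the epoch */
	uint32_t mtime;
	uint32_t ctime;
};

static uint16_t
le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t
le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the leading fields, assuming the UFS1 offsets listed above. */
static struct inode_prefix
decode_inode_prefix(const uint8_t *p, size_t len)
{
	struct inode_prefix d;

	assert(len >= 40);
	d.mode = le16(p + 0);
	d.nlink = (int16_t)le16(p + 2);
	d.atime = le32(p + 16);
	d.mtime = le32(p + 24);
	d.ctime = le32(p + 32);
	return d;
}
```

Under those assumptions the first record decodes to mode 0100600 (a regular file with permissions 0600), a link count of 1, and three identical timestamps of 0x5a0e8d5a seconds, which falls in mid-November 2017. In the original sample the records also repeat every 0x80 bytes, which matches the 128-byte size of a UFS1 on-disk inode.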

-- 
David A. Holland
dholl...@netbsd.org


Re: FFS corruption

2017-11-20 Thread Emmanuel Dreyfus
On Mon, Nov 20, 2017 at 01:33:44PM +0000, Christos Zoulas wrote:
> I think if the block allocation fails in a bad spot on ffsv2, fsck does
> not correct it, so new file allocation from those blocks will fail.

But that happened on FFSv1, and on an existing file.

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: FFS corruption

2017-11-20 Thread Christos Zoulas
In article <20171120103643.gh4...@trav.math.uni-bonn.de>,
Edgar Fuß   wrote:
>> Is there anything ringing a bell to someone here?
>Yes, but I guess that doesn't help.
>I experienced something remotely similar after a disc firmware crash followed 
>by a mpt(4) lockup (before I wrote the timeout recovery buhrow@ committed). 
>I would get a "mangled directory" panic on the same directory again and again; 
>fsck repaired it but found nothing else. I was just short of 
>dump/newfs/restore, but then something (I guess removing that directory) 
>helped. That was on a FFSv2, though.
>
>> Any explanation?
>No. Only that apparently, an FFS can be inconsistent in a way fsck doesn't 
>recognize.

I think if the block allocation fails in a bad spot on ffsv2, fsck does
not correct it, so new file allocation from those blocks will fail.

christos



Re: FFS corruption

2017-11-20 Thread Edgar Fuß
> Is there anything ringing a bell to someone here?
Yes, but I guess that doesn't help.
I experienced something remotely similar after a disc firmware crash followed 
by a mpt(4) lockup (before I wrote the timeout recovery buhrow@ committed). 
I would get a "mangled directory" panic on the same directory again and again; 
fsck repaired it but found nothing else. I was just short of 
dump/newfs/restore, but then something (I guess removing that directory) 
helped. That was on a FFSv2, though.

> Any explanation?
No. Only that apparently, an FFS can be inconsistent in a way fsck doesn't 
recognize.


Re: increase softint_bytes

2017-11-20 Thread Masanobu SAITOH

On 2017/11/17 18:42, 6b...@6bone.informatik.uni-leipzig.de wrote:

On Thu, 16 Nov 2017, Masanobu SAITOH wrote:


Hi, all.

Some device drivers now allocate a lot of softints.
See:

http://mail-index.netbsd.org/current-users/2017/11/09/msg032581.html

To avoid this panic, I wrote the following patch:

http://www.netbsd.org/~msaitoh/softint-20171116-0.dif



I tested the patch. Now the dump comes in another place.

https://suse.uni-leipzig.de/crash/crash-with-patch.jpg

Regards
Uwe


Could you test the following patch?

http://www.netbsd.org/~msaitoh/vlan-20171120-0.dif


--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)



FFS corruption

2017-11-20 Thread Emmanuel Dreyfus
Hello

I experienced some nasty FFS corruption, which was only resolved by
reformatting.

The filesystem was a root partition image for a Xen NetBSD-7.1/i386
domU. It was formatted FFSv1 level 4. Everything was fine until I
scp'd it over the network to a new machine.

After this migration, the filesystem started corrupting the same file
/usr/pkg/etc/httpd/httpd.conf at the same time, which happens to be
/etc/daily end of execution. Other files were affected, including a
/usr/pkg/etc/httpd/httpd.conf.bak set there to recover, but it is
difficult to assess the span of the problem. I assume few files were
touched because the machine could still work after just restoring
httpd.conf.

httpd.conf metadata did not change; its content was just filled with
some fixed-length binary records (sample included below, in case it
rings a bell for someone). Setting immutable flags did not prevent the
corruption, and using ktrace on /etc/daily showed it did not touch
httpd.conf or even its parent directory.

And fsck did not find anything wrong. Does this ring a bell for
anyone here? Any explanation?

Corrupted httpd.conf sample:
0000  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0010  5a 8d 0e 5a 60 8e 09 0f  5a 8d 0e 5a 60 8e 09 0f  |Z..Z`...Z..Z`...|
0020  5a 8d 0e 5a 60 8e 09 0f  00 00 00 00 00 00 00 00  |Z..Z`...........|
0030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0060  00 00 00 00 00 00 00 00  00 00 00 00 f4 4e a9 72  |.............N.r|
0070  90 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0080  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0090  7f 8d 0e 5a 7e 77 dc 25  7f 8d 0e 5a 7e 77 dc 25  |...Z~w.%...Z~w.%|
00a0  7f 8d 0e 5a 7e 77 dc 25  00 00 00 00 00 00 00 00  |...Z~w.%........|
00b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00e0  00 00 00 00 00 00 00 00  00 00 00 00 8b 73 ed 77  |.............s.w|
00f0  90 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0100  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0110  92 8d 0e 5a 54 43 af 15  92 8d 0e 5a 54 43 af 15  |...ZTC.....ZTC..|
0120  92 8d 0e 5a 54 43 af 15  00 00 00 00 00 00 00 00  |...ZTC..........|
0130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*



-- 
Emmanuel Dreyfus
m...@netbsd.org