Re: Help with filing a [maybe] ZFS/mmap bug.

2013-09-22 Thread George Hartzell
Andriy Gapon writes:
 > on 18/07/2013 20:44 George Hartzell said the following:
 > > Andriy Gapon writes:
 > >  > on 17/07/2013 23:47 George Hartzell said the following:
 > >  > > How should I move forward with this?
 > >  > 
 > >  > Could you please try to reproduce this problem using a kernel built with
 > >  > INVARIANTS options?
 > > 
 > > I added INVARIANT_SUPPORT and INVARIANTS options to the GENERIC
 > > kernel, rebuilt it, installed it and running through my "test case"
 > > generated a lot of invalid flac files.  I'm not sure what the options
 > > are/were supposed to do though, it looks like they generally lead to
 > > KASSERTS, which lead to abort()'s.  Nothing in /var/log/messages or on
 > > the console.
 > 
 > George,
 > 
 > do you have anything new on this issue?
 > 
 > Could you please try the following patch?
 > http://people.freebsd.org/~avg/zfs-putpages.diff
 > 
 > I expect it to not really fix the issue, but it may help to narrow it down.
 > Please keep INVARIANTS.
 > Thank you.
 > -- 
 > Andriy Gapon

Hi Andriy,

This weekend I built up a system using the 10.0 beta 2 DVD, then
updated /usr/src from head.

I grabbed a fresh copy of your patch this afternoon.

I applied your patch with no problems.  I was unable to build a new
kernel, though: you have one reference to m->busy, where m is a
vm_page_t (if I remember correctly).  I dug around a bit and decided
that you meant m->busy_lock, which let me build a usable kernel.

It looks like INVARIANTS and INVARIANT_SUPPORT are included in the
GENERIC conf file.

I ran through my test routine with the original system and was able to
reproduce the problem.

After building and installing a kernel with your patch I was still
able to trigger the problem.  If anything it was worse (sample size =
1, I know...).

I did not see any interesting output in /var/log/messages or to the
console or anywhere else obvious.

I'm not sure what to do next.  It's likely that my m->busy to
m->busy_lock change was not The Right Thing to Do and might have
invalidated what the patch was trying to do.

In any case, I now have a system running HEAD and should be able to
test things more easily.

Thanks for the help,

g.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-08-07 Thread George Hartzell
Andriy Gapon writes:
 > on 18/07/2013 20:44 George Hartzell said the following:
 > > Andriy Gapon writes:
 > >  > on 17/07/2013 23:47 George Hartzell said the following:
 > >  > > How should I move forward with this?
 > >  > 
 > >  > Could you please try to reproduce this problem using a kernel built with
 > >  > INVARIANTS options?
 > > 
 > > I added INVARIANT_SUPPORT and INVARIANTS options to the GENERIC
 > > kernel, rebuilt it, installed it and running through my "test case"
 > > generated a lot of invalid flac files.  I'm not sure what the options
 > > are/were supposed to do though, it looks like they generally lead to
 > > KASSERTS, which lead to abort()'s.  Nothing in /var/log/messages or on
 > > the console.
 > 
 > George,
 > 
 > do you have anything new on this issue?

Since the message you quoted, I have narrowed down my "test case"
somewhat, but I have not yet produced a stand-alone tool that
reproduces it (you still have to go through Picard et al.).

 > Could you please try the following patch?
 > http://people.freebsd.org/~avg/zfs-putpages.diff
 > 
 > I expect it to not really fix the issue, but it may help to narrow it down.
 > Please keep INVARIANTS.

Absolutely.  Probably not until the weekend, but I'll give it a go.

Thanks for following up.

g.


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-08-07 Thread Andriy Gapon
on 18/07/2013 20:44 George Hartzell said the following:
> Andriy Gapon writes:
>  > on 17/07/2013 23:47 George Hartzell said the following:
>  > > How should I move forward with this?
>  > 
>  > Could you please try to reproduce this problem using a kernel built with
>  > INVARIANTS options?
> 
> I added INVARIANT_SUPPORT and INVARIANTS options to the GENERIC
> kernel, rebuilt it, installed it and running through my "test case"
> generated a lot of invalid flac files.  I'm not sure what the options
> are/were supposed to do though, it looks like they generally lead to
> KASSERTS, which lead to abort()'s.  Nothing in /var/log/messages or on
> the console.

George,

do you have anything new on this issue?

Could you please try the following patch?
http://people.freebsd.org/~avg/zfs-putpages.diff

I expect it to not really fix the issue, but it may help to narrow it down.
Please keep INVARIANTS.
Thank you.
-- 
Andriy Gapon


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-24 Thread George Hartzell
George Hartzell writes:
 > 
 > George Hartzell writes:
 >  > George Hartzell writes:
 >  >  > [...]
 >  >  > So, it would seem that there's something about the filesystem in which
 >  >  > my home directory resides that contributes to the problem.
 >  >  > [...]
 >  > 
 >  > Another data point.
 >  > 
 >  > [...]
 > 
 > Yet another data point or three.
 > 
 > I took an unused disk, set it up with a single pool and copied
 > everything from my two-disk system to it using zfs send & recv.  I was
 > hoping that if there was something goofy about the state of the
 > filesystems on the older two-disk pool it might get cleaned up in the
 > transfer.
 > 
 > I tagged the entire set of flac files, they were all successfully
 > validated via the plugin.  After exiting Picard, one failed
 > validation.  After rebooting, many failed validation.
 > 
 > Next I created a new filesystem on this new pool, mounted it,
 > configured Picard to save to that filesystem and ran through all of
 > the tracks.  They validated fine via the plugin and by hand after
 > exiting Picard.  They also validated properly after unmounting and
 > remounting the filesystem and after a reboot.  Sigh.
 > 
 > Then I destroyed all of the snapshots on the filesystems that I
 > transferred over from my "real" dual-disk system.  Tagging all of the
 > flac files into my home directory generated errors from the validation
 > plugin and by hand after exiting Picard.  I didn't bother rebooting
 > and checking.
 > 
 > So it seems to be something about the filesystem{s} themselves.
 > [...]

A [small] breakthrough.  I understand why saving to a freshly created
filesystem never led to any errors.

I'd tentatively concluded that there was something hinky with the
filesystem itself that was causing the problem, something that
came along when I recreated the filesystem via zfs send/recv.

This was based on my inability to trigger the problem when I saved the
files to a newly created zfs filesystem.

Yesterday I used dump and restore to transfer my trouble-free home
directory from its UFS partition to a newly created zfs filesystem (I
hadn't known that restore would write to a zfs filesystem, but it
appears to...).

The resulting system generated errors when I ran through my "test
case", even though it wasn't a zfs send/recv copy.

Next I created a new zfs filesystem and arranged to write the tagged
files there.  The resulting files were error free, even after a
reboot.

Next I copied the untagged source flacs onto the newly created zfs
filesystem and ran through the test routine, saving the tagged files
to the newly created zfs filesystem.  This resulted in a glorious pile
of errors.

Conclusion: my test case only generates errors when the untagged files
are on the filesystem to which the tagged files will be written.

A bit of poking around in the sources provided the explanation.
Picard tries to move the tagged file to its final destination.  If
it's within the same filesystem, this ends up being a rename operation
and I'm left with the inconsistent flac file.  If the destination is
in another filesystem, then it copies the file, which ends up reading
the clean memory-resident data.
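A minimal Python sketch of the two paths described above, assuming Picard's move behaves like Python's `shutil.move` (try `os.rename()`, and fall back to copy-and-delete when the kernel reports `EXDEV`, i.e. a cross-filesystem move). The helper name `move_tagged_file` is hypothetical, not Picard's code. The point is that the rename path leaves the destination pointing at the blocks already on disk, while the copy path reads the data back and writes a brand-new file.

```python
import errno
import os
import shutil


def move_tagged_file(src, dst):
    """Move src to dst; return which path was taken ("rename" or "copy").

    Hypothetical helper modelled on shutil.move, not Picard's actual code.
    """
    try:
        # Same filesystem: rename() just re-links the existing file, so dst
        # refers to exactly the blocks already written for src.
        os.rename(src, dst)
        return "rename"
    except OSError as exc:
        # Any error other than "cross-device link" is a real failure.
        if exc.errno != errno.EXDEV:
            raise
    # Different filesystem: read src back through the normal read path and
    # write a fresh file on the destination filesystem.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copy"
```

In the runs described above, only the equivalent of the `"rename"` path left corrupt files behind.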

So, now I have a smaller test version of my workflow that doesn't
involve rebooting my machine to generate the error.  I'll get back to
trying to come up with a variant of Richard's stand-alone
bug-tickler.

phew.

g.


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-21 Thread George Hartzell

George Hartzell writes:
 > George Hartzell writes:
 >  > [...]
 >  > So, it would seem that there's something about the filesystem in which
 >  > my home directory resides that contributes to the problem.
 >  > [...]
 > 
 > Another data point.
 > 
 > [...]

Yet another data point or three.

I took an unused disk, set it up with a single pool and copied
everything from my two-disk system to it using zfs send & recv.  I was
hoping that if there was something goofy about the state of the
filesystems on the older two-disk pool it might get cleaned up in the
transfer.

I tagged the entire set of flac files, they were all successfully
validated via the plugin.  After exiting Picard, one failed
validation.  After rebooting, many failed validation.

Next I created a new filesystem on this new pool, mounted it,
configured Picard to save to that filesystem and ran through all of
the tracks.  They validated fine via the plugin and by hand after
exiting Picard.  They also validated properly after unmounting and
remounting the filesystem and after a reboot.  Sigh.

Then I destroyed all of the snapshots on the filesystems that I
transferred over from my "real" dual-disk system.  Tagging all of the
flac files into my home directory generated errors from the validation
plugin and by hand after exiting Picard.  I didn't bother rebooting
and checking.

So it seems to be something about the filesystem{s} themselves.

I'm running a scrub now.

g.


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-19 Thread George Hartzell
George Hartzell writes:
 > [...]
 > So, it would seem that there's something about the filesystem in which
 > my home directory resides that contributes to the problem.
 > [...]

Another data point.

I just ran through my test case, saving the tagged and transcoded
files into /tmp, a zfs filesystem that was created back when I built
up the system (contemporaneously with /usr/home).  I was unable to
trigger the bug there.

As a control, I then ran through the test case, saving to a directory
in my home directory, and triggered the bug.

I then created a new directory /usr/home/foo (within the same zfs
filesystem as my home directory).  I was unable to trigger the bug
there either.

I then ran through all 165 flac files in the full "test case", saving
the results to /usr/home/foo.  After exiting Picard and running flac
-t on all of the files I had errors on many files, including the file
in my single-file test case above.  I did not even need to reboot.

I then ran the single-file test case, saving into /usr/home/foo (as
above), and was now able to observe the error after a reboot.

I then ran the single-file test case again (to make sure I wasn't
crazy), saving into /usr/home/foo (as above), and was again able to
observe the error after a reboot.

One more control.

Create /usr/home/bar.
Run single file test case.  Reboot.  This time I observed an invalid
flac.  Not sure what this means about the test case above.

Sigh.

g.


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-19 Thread George Hartzell
Richard Todd writes:
 > On Thu, Jul 18, 2013 at 11:40:51AM -0700, George Hartzell wrote:
 > [...]
 > > [...]  
 > > In my case I'd want to find a particular set of file size, offset, and
 > > insertion size that triggers the problem and code up a c/c++ equiv. of
 > > the mmap calls that py-mutagen does.  Right?
 > 
 > Yeah. 

I'm stuck.  Or I've discovered something relevant.  Or both.

I've identified a slightly simpler test case.  I load my handful of
test albums, look up a single, particular album, and save a particular
track.  The tagged flac file appears to be valid.  Then I reboot.  Now
the flac file is invalid.

It's repeatable, which is useful.

Following the lead of your test script, I created a new zfs filesystem,
mounted it, and had Picard save the tagged file there.  After exiting
Picard the file appears to be valid.  After unmounting and remounting
the filesystem the file *still* appears to be valid.  After rebooting,
the file *still* appears to be valid.

So, it would seem that there's something about the filesystem in which
my home directory resides that contributes to the problem.

The only obvious thing I saw is that my homedir filesystem has a quota
and is 80% full.  I tried creating a new, small, zfs filesystem and
running the test there.  The tagged flac file validates successfully,
I do not see the problem (the single file makes the filesystem 88%
full).

All of the filesystems have automagically created snapshots, so I
tried creating a snapshot of the new zfs filesystem before running
through the test case.  I was still unable to replicate the problem.

My spin on your gen4.cpp test case (modified to use the filesize and
offset that picard uses) does not generate a difference when run in my
home directory followed by a reboot (picard calls insert_bytes twice,
using either set of values does not cause a problem).

The only difference I see in "zfs get all" output (excluding obvious
sizes, etc...) is that the new filesystem has xattr on via the
"default", whereas my home directory has it off via "temporary".  I'm
not sure why it's off.

So, I currently have a repeatable, not-too-efficient test case using
my home directory.  I am unable to repeat the test case using a newly
created zfs filesystem (even a very small one) nor am I able to make
any headway with Richard's test case.

As I described in another thread with Andriy, adding INVARIANTS and
INVARIANT_SUPPORT to the kernel did not lead to any different
behaviour; in fact, the experiments described above were run on this
new kernel.

Any suggestions for a next step?

g.


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-18 Thread Richard Todd
On Thu, Jul 18, 2013 at 11:40:51AM -0700, George Hartzell wrote:
> Removing the mmap support from those two routines seems to avoid the
> issue.

Aha. 

>  > If so, then the issue is triggered by one or both of those two routines;
>  > hack them to print out the exact offsets used on each call and use that to 
>  > try and code up a simple C++ test case.  
>  > [...]
> 
> Your test case doesn't use mmap; I assume that you've offered it up as
> a hint, not as something that's nearly done.  The shell script in
> particular seems useful.

Um, go look at gen4.cpp again.  It uses mmap().  The insert_bytes and
delete_bytes functions should work the same way as the (mmap-using path of)
the functions of the same name in py-mutagen. 


> In my case I'd want to find a particular set of file size, offset, and
> insertion size that triggers the problem and code up a c/c++ equiv. of
> the mmap calls that py-mutagen does.  Right?

Yeah. 




Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-18 Thread Alexander Yerenkow
I know nobody loves "me-too" emails, but I'll try :)
There's rtorrent, which uses mmap.  And I had cases (related to reboot)
where big files (or average-sized files in many-file torrents) turned up
with broken checksums for no good reason.
The author of rtorrent, not very politely, always assumed that the other
filesystems were the broken ones, since rtorrent's logic (using mmap) is
simple...

This post is pretty interesting:
http://libtorrent.rakshasa.no/ticket/483

In the end I just switched to transmission.

Anyway, there could really be some weird/rare bugs with ZFS and mmap, or
just ZFS.
I hope this will at least help narrow the direction for potential bug-busting.



2013/7/18 George Hartzell 

> Richard Todd writes:
>  > George Hartzell  writes:
>  >
>  > > Hi All,
>  > >
>  > > I have what I think is a ZFS related bug.
>  > > [...]
>  >
>  > [summary: Picard seems to trigger an mmap consistency bug in ZFS].
>  >
>  > [...]
>  > Anyway, what I'd suggest is the following: see if my patch for py-mutagen
>  > disabling the mmap() in those two functions lets you run picard reliably.
>
> Removing the mmap support from those two routines seems to avoid the
> issue.
>
>  > If so, then the issue is triggered by one or both of those two routines;
>  > hack them to print out the exact offsets used on each call and use that to
>  > try and code up a simple C++ test case.
>  > [...]
>
> Your test case doesn't use mmap; I assume that you've offered it up as
> a hint, not as something that's nearly done.  The shell script in
> particular seems useful.
>
> In my case I'd want to find a particular set of file size, offset, and
> insertion size that triggers the problem and code up a c/c++ equiv. of
> the mmap calls that py-mutagen does.  Right?
>
> I'm hesitant about that.  I believe (and will try to prove) that the
> problem does not occur deterministically for a particular track
> between different test runs.  I'm worried that it's not as simple as
> "using mmap to insert 27 bytes into a 1024-byte file at pos 42 causes
> corruption" but rather that it depends on a more complex set of
> interactions.
>
> My next step will be to see if a track that has trouble in one run has
> trouble in another.  If not, then I'm not sure that a simple test will
> be successful.
>
> g.



-- 
Regards,
Alexander Yerenkow


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-18 Thread George Hartzell
Richard Todd writes:
 > George Hartzell  writes:
 > 
 > > Hi All,
 > >
 > > I have what I think is a ZFS related bug.
 > > [...]
 >
 > [summary: Picard seems to trigger an mmap consistency bug in ZFS].
 > 
 > [...]
 > Anyway, what I'd suggest is the following: see if my patch for py-mutagen
 > disabling the mmap() in those two functions lets you run picard reliably.

Removing the mmap support from those two routines seems to avoid the
issue.

 > If so, then the issue is triggered by one or both of those two routines;
 > hack them to print out the exact offsets used on each call and use that to 
 > try and code up a simple C++ test case.  
 > [...]

Your test case doesn't use mmap; I assume that you've offered it up as
a hint, not as something that's nearly done.  The shell script in
particular seems useful.

In my case I'd want to find a particular set of file size, offset, and
insertion size that triggers the problem and code up a c/c++ equiv. of
the mmap calls that py-mutagen does.  Right?

I'm hesitant about that.  I believe (and will try to prove) that the
problem does not occur deterministically for a particular track
between different test runs.  I'm worried that it's not as simple as
"using mmap to insert 27 bytes into a 1024-byte file at pos 42 causes
corruption" but rather that it depends on a more complex set of
interactions.

My next step will be to see if a track that has trouble in one run has
trouble in another.  If not, then I'm not sure that a simple test will
be successful.

g.


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-18 Thread George Hartzell
Andriy Gapon writes:
 > on 17/07/2013 23:47 George Hartzell said the following:
 > > How should I move forward with this?
 > 
 > Could you please try to reproduce this problem using a kernel built with
 > INVARIANTS options?

I added INVARIANT_SUPPORT and INVARIANTS options to the GENERIC
kernel, rebuilt it, installed it and running through my "test case"
generated a lot of invalid flac files.  I'm not sure what the options
are/were supposed to do though, it looks like they generally lead to
KASSERTS, which lead to abort()'s.  Nothing in /var/log/messages or on
the console.

g.



Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-18 Thread Andriy Gapon
on 17/07/2013 23:47 George Hartzell said the following:
> How should I move forward with this?

Could you please try to reproduce this problem using a kernel built with
INVARIANTS options?

-- 
Andriy Gapon


Re: Help with filing a [maybe] ZFS/mmap bug.

2013-07-17 Thread Richard Todd
George Hartzell  writes:

> Hi All,
>
> I have what I think is a ZFS related bug.  Unfortunately my simplest
> test case is a bit cumbersome and I haven't definitively proven that
> the problem is ZFS related.

> I'm hoping for some feedback on how to move forward.
>
> Quick background: I rip my CDs using grip and produce flac files.  I
> tag the music using MusicBrainz's Picard and transcode it to MP3s
> within Picard using a plugin that I wrote.  Picard is a Python-based
> app and uses the Mutagen library to tag files.

[summary: Picard seems to trigger an mmap consistency bug in ZFS].

Aw, crud.  I ran into what may be this same issue a few years ago,
under almost identical circumstances (picard with Ogg files instead of
flac), but never got around to writing up a proper mailing list post on it.
I *thought* it had been fixed some time back, though I'm not entirely
sure because I've been running with my local patch to py-mutagen to disable
its mmap() usage. 

Anyway, here's what I recall from when I was trying to track this down way
back when:

1) Your suspicions are right, it's definitely a ZFS/mmap interaction bug. 
UFS filesystems didn't show the bug.

2) It's triggered by the insert_bytes or delete_bytes function in py-mutagen.

3) As long as the in-memory version of that chunk of the file stayed
in memory, the file read OK, but the data on disk was corrupt.  So if
that data got evicted from the cache and had to be reread, or if you
forced the cached data to be discarded by unmounting and remounting the
FS, you would see a corrupt file.  (Rebooting, of course, would also
make the corrupted on-disk data visible.)
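For reference, the mmap path of those two routines (the `-` lines in the patch below) amounts to roughly the following. This is a Python 3 sketch written for illustration, not mutagen's code: `insert_bytes_mmap` and `delete_bytes_mmap` are hypothetical names, and locking and the non-mmap fallback are omitted. Note that it mixes ordinary `write()`/`truncate()` calls with stores through a shared mapping (Python's `mmap.mmap` defaults to `MAP_SHARED` on Unix) on the same file.

```python
import mmap
import os


def insert_bytes_mmap(fobj, size, offset):
    """Open a gap of `size` bytes at `offset` in a file opened "r+b".

    Sketch of the mmap path only; the gap still holds stale bytes afterwards.
    """
    fobj.seek(0, os.SEEK_END)
    filesize = fobj.tell()
    movesize = filesize - offset
    # Grow the file with an ordinary write() ...
    fobj.write(b"\x00" * size)
    fobj.flush()
    # ... then shift the tail up by `size` bytes through a shared mapping.
    m = mmap.mmap(fobj.fileno(), filesize + size)
    try:
        m.move(offset + size, offset, movesize)  # move(dest, src, count)
    finally:
        m.close()


def delete_bytes_mmap(fobj, size, offset):
    """Remove `size` bytes at `offset`, shifting the tail down via mmap."""
    fobj.seek(0, os.SEEK_END)
    filesize = fobj.tell()
    movesize = filesize - offset - size
    if movesize > 0:
        fobj.flush()
        m = mmap.mmap(fobj.fileno(), filesize)
        try:
            m.move(offset, offset + size, movesize)
        finally:
            m.close()
    # Drop the now-duplicated tail with an ordinary truncate().
    fobj.truncate(filesize - size)
    fobj.flush()
```

For example, inserting 3 bytes at offset 2 of `ABCDEFGH` yields `ABCDECDEFGH` (the gap at offsets 2..4 still reads `CDE` until the caller overwrites it), and deleting them again restores `ABCDEFGH`.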

I'll attach my patch to disable py-mutagen's mmap usage in insert/delete_bytes
below.  Try it and see if it makes the corruption problems go away.  If so,
that narrows your search down to those two routines in py-mutagen and what
they're doing.  

I *had* what at the time I recall was a simple C++ test program that
managed to trigger the bug more-or-less reliably.  Unfortunately, it
doesn't look like my test case still works, i.e., the test program
doesn't seem to trigger the bug either on the 10-CURRENT box or RELENG-9
VM I'm trying them on now.  As I recall, the bug's presence depended on
fairly picky details of what exact offsets were used for the mmap()s and the
write()s, so it may be that different offsets from what I tried will still show
the bug.  

Anyway, what I'd suggest is the following: see if my patch for py-mutagen
disabling the mmap() in those two functions lets you run picard reliably.
If so, then the issue is triggered by one or both of those two routines;
hack them to print out the exact offsets used on each call and use that to 
try and code up a simple C++ test case.  
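One way to do the offset logging suggested above is a small decorator. This is a hypothetical Python 3 sketch, assuming the wrapped function takes `(fobj, size, offset, ...)` as mutagen's `insert_bytes`/`delete_bytes` appear to; it also records the current file size, since a reproduction needs all three numbers. Adding `@log_offsets` directly above the two `def`s in a local copy of `mutagen/_util.py` affects every caller, whereas rebinding the module attribute afterwards would miss modules that already imported the functions by name.

```python
import functools
import os
import sys


def log_offsets(func, stream=None):
    """Wrap an insert_bytes/delete_bytes-style function (fobj, size, offset, ...)
    so that each call prints the file name, file size, size and offset."""
    @functools.wraps(func)
    def wrapper(fobj, size, offset, *args, **kwargs):
        pos = fobj.tell()
        fobj.seek(0, os.SEEK_END)
        filesize = fobj.tell()
        fobj.seek(pos)  # leave the file position as the caller had it
        print(f"{func.__name__}: name={getattr(fobj, 'name', '?')} "
              f"filesize={filesize} size={size} offset={offset}",
              file=stream if stream is not None else sys.stderr)
        return func(fobj, size, offset, *args, **kwargs)
    return wrapper
```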

Here's the py-mutagen patch:

--- mutagen/_util.py.orig	2008-06-01 01:33:00.0 -0500
+++ mutagen/_util.py	2009-04-11 18:16:53.363758128 -0500
@@ -213,12 +213,6 @@
     fobj.write('\x00' * size)
     fobj.flush()
     try:
-        try:
-            import mmap
-            map = mmap.mmap(fobj.fileno(), filesize + size)
-            try: map.move(offset + size, offset, movesize)
-            finally: map.close()
-        except (ValueError, EnvironmentError, ImportError):
             # handle broken mmap scenarios
             locked = lock(fobj)
             fobj.truncate(filesize)
@@ -272,17 +266,11 @@
     try:
         if movesize > 0:
             fobj.flush()
-            try:
-                import mmap
-                map = mmap.mmap(fobj.fileno(), filesize)
-                try: map.move(offset, offset + size, movesize)
-                finally: map.close()
-            except (ValueError, EnvironmentError, ImportError):
-                # handle broken mmap scenarios
-                locked = lock(fobj)
-                fobj.seek(offset + size)
-                buf = fobj.read(BUFFER_SIZE)
-                while buf:
+            # handle broken mmap scenarios
+            locked = lock(fobj)
+            fobj.seek(offset + size)
+            buf = fobj.read(BUFFER_SIZE)
+            while buf:
                     fobj.seek(offset)
                     fobj.write(buf)
                     offset += len(buf)


and here's my test case:

# This is a shell archive.  Save it in a file, remove anything before
# this line, and then unpack it by entering "sh file".  Note, it may
# create directories; files and directories will be owned by you and
# have default permissions.
#
# This archive contains:
#
#   gen4.cpp
#   test4.sh
#
echo x - gen4.cpp
sed 's/^X//' >gen4.cpp << 'END-of-gen4.cpp'
X/*
X** Program to create a file of data and do some mmap()ing writes to it. 
X*/
X
X#include 
X#include 
X#include 
X#include 
X#include 
X#include 
X#include 
X#include 
X
X/* Insert size bytes (zeros) into file at offset */
Xvoid
Xinsert_bytes(int fd, unsigned int size, unsigned int offset)
X{
Xunsigned int filesize = lseek(fd, (off_t)0, SEEK_END);
Xunsig

Help with filing a [maybe] ZFS/mmap bug.

2013-07-17 Thread George Hartzell

Hi All,

I have what I think is a ZFS related bug.  Unfortunately my simplest
test case is a bit cumbersome and I haven't definitively proven that
the problem is ZFS related.

I'm hoping for some feedback on how to move forward.

Quick background: I rip my CDs using grip and produce flac files.  I
tag the music using MusicBrainz's Picard and transcode it to MP3s
within Picard using a plugin that I wrote.  Picard is a Python-based
app and uses the Mutagen library to tag files.

I'm working on a Mac Pro with 10 GB of RAM and using Seagate ST31000340AS
drives updated to the latest firmware (SD1A).  The system is running
9-STABLE from late June.  It is ZFS only and boots from a mirrored
pool that provides a bunch of zfs filesystems, including my home
directory.

I recently realized that some of the flacs were corrupt and have been
chasing down the problem.  I've blamed Picard, my disks (there was
newer, "important" firmware, which they're now running), my RAM,
etc...

After blaming each of the moving parts in turn I offer up the
following experiment as evidence that I have found a ZFS problem.

- start with a bunch of untagged flac files that pass validation with
  "flac -t".

- load them into Picard, tag them and save them (this also transcodes
  them to MP3s using my plugin and runs a plugin which runs flac -t
  on the tagged file).

- run flac -t on all of the tagged flac files and collect the result as
  pre-exit-validation.

- exit Picard "politely" (using the menu options, not killing it from
  the command line...).

- run flac -t on all of the tagged flac files and collect the result as
  post-exit-validation.

- reboot the machine.

- run flac -t on all of the tagged flac files and collect the result as
  post-reboot-validation.
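A byte-level check can complement `flac -t` at each of the checkpoints above. Here is a minimal Python 3 sketch, with hypothetical helper names: record a SHA-256 per tagged file at each checkpoint and diff the snapshots. A file whose hash changes between post-exit and post-reboot, with nothing having written to it in between, points directly at a cache/disk mismatch. The post-exit hash may well be computed from cached data, while the post-reboot one necessarily comes from disk, which is exactly the comparison of interest.

```python
import hashlib
import json
import pathlib


def snapshot(root, pattern="*.flac"):
    """Return {path: sha256 hex digest} for every file under root matching pattern."""
    result = {}
    for path in sorted(pathlib.Path(root).expanduser().rglob(pattern)):
        if not path.is_file():
            continue
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Hash in 64 KiB chunks so large files are not read into memory at once.
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        result[str(path)] = h.hexdigest()
    return result


def changed(before, after):
    """Paths present in both snapshots whose contents differ."""
    return sorted(p for p in before.keys() & after.keys() if before[p] != after[p])


def save(snap, filename):
    pathlib.Path(filename).write_text(json.dumps(snap, indent=1, sort_keys=True))


def load(filename):
    return json.loads(pathlib.Path(filename).read_text())
```

For example, run `save(snapshot(TAGGED_DIR), "post-exit.json")` after quitting Picard, and after the reboot run `changed(load("post-exit.json"), snapshot(TAGGED_DIR))`, where `TAGGED_DIR` is a placeholder for wherever the tagged files live.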

On multiple runs through this routine I'll sometimes see errors in the
{pre,post}-exit-validations, but they'll often all validate perfectly.

On all of the runs through the validation I'll see many invalid files
in the post-reboot-validation output.

I've even scp'd the directories to an unrelated machine (Mac OS X
10.8) at the various points to do the "flac -t" validation, with the
same results.

Looking carefully at a couple of instances shows that they differ in a
few bytes.  E.g. one file differs by a few bytes starting at 139253 to
139264 (I might have an off by one counting issue, using emacs' buffer
positions here).  2^17 + 2^13 = 139264, which is an interesting
coincidence.  In another file I see a difference ending at 2^17+2^12
(again, I might be off by one or so in my counting).  Patching the
different hunk from a good file into a bad file (again via emacs)
results in a file that passes validation.
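The by-hand comparison in emacs can be scripted. Below is a minimal Python 3 sketch, with hypothetical helper names, that returns the differing byte ranges as 0-based, end-exclusive offsets (sidestepping the 1-based buffer-position off-by-one) and flags ranges whose end falls on a page boundary. The 4096-byte page size is an assumption (it is the base page size on amd64). Note that 139264 = 34 × 4096 and 2^17 + 2^12 = 135168 = 33 × 4096, so both endpoints quoted above are page-aligned, allowing for the off-by-one.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages (FreeBSD/amd64)


def diff_ranges(good, bad, page=PAGE_SIZE):
    """Return [(start, end), ...] 0-based, end-exclusive ranges where two
    equal-length byte strings differ. Identical pages are skipped quickly."""
    if len(good) != len(bad):
        raise ValueError("inputs differ in length")
    ranges, start = [], None
    for base in range(0, len(good), page):
        end = min(base + page, len(good))
        if good[base:end] == bad[base:end]:
            # Whole page matches: close any range still open at its start.
            if start is not None:
                ranges.append((start, base))
                start = None
            continue
        for i in range(base, end):
            if good[i] != bad[i]:
                if start is None:
                    start = i
            elif start is not None:
                ranges.append((start, i))
                start = None
    if start is not None:
        ranges.append((start, len(good)))
    return ranges


def ends_on_page_boundary(ranges, page=PAGE_SIZE):
    """Pair each range with whether its end-exclusive offset is page-aligned."""
    return [(s, e, e % page == 0) for s, e in ranges]
```

Typical use is to read a good copy and a bad copy of the same track with `open(..., "rb").read()` and pass both to `diff_ranges`.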

At one point I was blaming RAM and was pulling/swapping memory modules.
Running with less memory increased the likelihood of files being invalid.

I built up a similar system running 9-STABLE as of yesterday (7/16)
that uses UFS and have been unable to recreate the problem.

Given that the files are valid after exiting Picard, I do not think
that there is anything in my tagging pipeline that is causing the
problem.

The fact that the files "become" invalid after a reboot suggests
something in the ZFS buffering and/or interactions with the VM system.
The observation that running with less memory causes more/earlier
problems reinforces this.  The fact that the garbage in the file
happens near a power-of-two boundary also reinforces this.

My current test case involves my local version of Picard and my
plugins, and 165 flac files (some of which Picard can discover
automatically based on grip's freedb-based metadata, some of which
need a helping hand).  Not particularly minimal but I'm not sure that
I can ever get it trimmed down to something trivial that a ZFS
developer might be able to run locally.

Thanks for making it this far!

How should I move forward with this?

g.