Re: bad metadata crossing stripe boundary

2016-03-31 Thread Marc Haber
On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:
> On Thu, 31 Mar 2016 23:00:04 +0200, Marc Haber wrote:
> > I find it somewhere between funny and disturbing that the first call
> > of btrfs check made my kernel log the following:
> > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted filesystem with ordered data mode. Opts: (null)
> > Mar 31 22:45:38 fan kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid 67526 /dev/dm-31
> > 
> > No, the filesystem was not converted, it was directly created as
> > btrfs, and no, I didn't try mounting it.
> 
> I suggest that your partition contained ext4 before, and you didn't run
> wipefs before running mkfs.btrfs.

I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed it.
That should do a much better job than wipefsing it, shouldn't it?

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mail address in header
Leimen, Germany|  lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V15 00/15] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size

2016-03-31 Thread Chandan Rajendra
On Thursday 31 Mar 2016 15:59:11 David Sterba wrote:
> On Thu, Mar 31, 2016 at 11:31:06AM +0200, David Sterba wrote:
> > On Tue, Mar 22, 2016 at 06:50:32PM +0530, Chandan Rajendra wrote:
> > > On Tuesday 22 Mar 2016 12:04:23 David Sterba wrote:
> > > > On Thu, Feb 11, 2016 at 11:17:38PM +0530, Chandan Rajendra wrote:
> > > > > this patchset temporarily disables the commit
> > > > > f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
> > > > > 
> > > > > The commits for the Btrfs kernel module can be found at
> > > > > https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.
> > > > 
> > > > The branch does not apply cleanly to at least 4.5, I've tried to rebase
> > > > it but there are conflicts that are not simple. Please update it on top
> > > > of current master, ie. with the preparatory patchset merged.
> > > 
> > > I will rebase the branch and post the patchset soon.
> > 
> > JFYI, I've seen some minor compilation failures:
> > - __do_readpage: unused variable cached
> > - end_bio_extent_buffer_readpage: btree_readahead_hook must take fs_info
> > - fails build with (config) sanity checks enabled, the fs_info moved
> >   from eb to eb head
> 
> And the tests crash with my quick fixes, so I'll move the branch out of
> next for now. Please fix it and let me know. My fixes are on top of your
> latest branch, chandan-subpage-latest in my development gits.

Hi David,

After rebasing the patchset, I found a 'hard to reproduce' space
accounting bug. I am currently figuring out the root cause of the bug. I will
post the patchset once the issue is fixed.

-- 
chandan



Re: [PATCH v9 00/19] Btrfs dedupe framework

2016-03-31 Thread Qu Wenruo



David Sterba wrote on 2016/03/31 18:12 +0200:
> On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> > This March 30th patchset update mostly addresses the patchset structure
> > comment from David:
> > 1) Change the patchset sequence
> > Now, if only the first 14 patches are applied, they provide the full
> > backward-compatible, in-memory-only dedupe backend.
> >
> > The on-disk format changes only from patch 15 onwards.
> >
> > So patches 1~14 are going to be pushed for the next merge window, while I'll
> > still submit them all for review purposes.
>
> I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option
> until the interface is settled.




BTW, don't pick them directly, as I just forgot 2 independent bug fix
patches.


Thanks,
Qu




Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread J. Bruce Fields
On Fri, Apr 01, 2016 at 11:33:00AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote:
> > I haven't looked at the code, but I assume a JUKEBOX-returning write to
> > an absent file brings into cache the bits necessary to perform the
> > write, but stops short of actually doing the write.
> 
> Not exactly, as all subsequent read/write/truncate requests will
> EJUKEBOX until the absent file has been brought back onto disk. Once
> that is done, the next operation attempt will proceed.
> 
> > That allows
> > handling the retried write quickly without doing the wrong thing in the
> > case the retry never comes.
> 
> Essentially. But if a retry never comes it means there's either a
> bug in the client NFS implementation or the client crashed,

NFS clients are under no obligation to retry operations after JUKEBOX.
And I'd expect them not to in the case the calling process was
interrupted, for example.

> > I guess it doesn't matter as much in practice, since the only way you're
> > likely to notice that fallocate unexpectedly succeeded would be if it
> > caused you to hit ENOSPC elsewhere.  Is that right?  Still, it seems a
> > little weird.
> 
> s/succeeded/failed/ and that statement is right.

Sorry, I didn't explain clearly.

The case I was worrying about was the case where the on-the-wire ALLOCATE
call returns JUKEBOX, but the server allocates anyway.

That behavior violates the spec as I understand it.

The client therefore assumes there was no allocation, when in fact there
was.

So, technically a bug, but I wondered if it's likely to bite anyone.
One of the only ways it seems someone would notice would be if it caused
the filesystem to run out of space earlier than I expected.  But perhaps
that's unlikely.

--b.


Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-03-31 Thread Qu Wenruo



Henk Slager wrote on 2016/04/01 01:27 +0200:
> On Thu, Mar 31, 2016 at 10:44 PM, Kai Krakow  wrote:
> > Hello!
> >
> > I already reported this in another thread but it was a bit confusing by
> > intermixing multiple volumes. So let's start a new thread:
> >
> > Since one of the last kernel upgrades, I'm experiencing one VDI file
> > (containing a NTFS image with Windows 7) getting damaged when running
> > the machine in VirtualBox. I got knowledge about this after
> > experiencing an error "duplicate object" and btrfs went RO. I fixed it
> > by deleting the VDI and restoring from backup - but now I get csum
> > errors as soon as some VM IO goes into the VDI file.
> >
> > The FS is still usable. One effect is, that after reading all files
> > with rsync (to copy to my backup), each call of "du" or "df" hangs, also
> > similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one
> > outcome of this is, that the FS does not properly unmount during
> > shutdown.
> >
> > Kernel is 4.5.0 by now (the FS is much much older, dates back to 3.x
> > series, and never had problems), including Gentoo patch-set r1.
>
> One possibility could be that the vbox kernel modules somehow corrupt
> btrfs kernel area since kernel 4.5.
>
> In order to make this reproducible (or an attempt to reproduce) for
> others, you could unload VirtualBox stuff and restore the VDI file
> from backup (or whatever big file) and then make pseudo-random, but
> reproducible writes to the file.
>
> It is not clear to me what 'Gentoo patch-set r1' is and does. So just
> boot a vanilla v4.5 kernel from kernel.org and see if you get csum
> errors in dmesg.
>
> Also, where does 'duplicate object' come from? dmesg ? then please
> post its surroundings, straight from dmesg.

> > The device layout is:
> >
> > $ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
> > NAME          MODEL            FSTYPE LABEL  MOUNTPOINT
> > sda           Crucial_CT128MX1
> > ├─sda1                         vfat   ESP    /boot
> > ├─sda2
> > └─sda3                         bcache
> >   ├─bcache0                    btrfs  system
> >   ├─bcache1                    btrfs  system
> >   └─bcache2                    btrfs  system /usr/src
> > sdb           SAMSUNG HD103SJ
> > ├─sdb1                         swap   swap0  [SWAP]
> > └─sdb2                         bcache
> >   └─bcache2                    btrfs  system /usr/src
> > sdc           SAMSUNG HD103SJ
> > ├─sdc1                         swap   swap1  [SWAP]
> > └─sdc2                         bcache
> >   └─bcache1                    btrfs  system
> > sdd           SAMSUNG HD103UJ
> > ├─sdd1                         swap   swap2  [SWAP]
> > └─sdd2                         bcache
> >   └─bcache0                    btrfs  system
> >
> > Mount options are:
> >
> > $ mount|fgrep btrfs
> > /dev/bcache2 on / type btrfs
> > (rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)
> >
> > The FS uses mraid=1 and draid=0.
> >
> > Output of btrfsck is:
> > (also available here:
> > https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae)
> >
> > $ btrfsck /dev/disk/by-label/system
> > checking extents
> > ref mismatch on [10443660537856 524288] extent item 1, found 2
>
> This   10443660537856  number is bigger than the  1832931324360 number
> found for total bytes. AFAIK, this is already wrong.


Nope. That's a btrfs logical address, which can be beyond the real disk
bytenr.

The easiest way to reproduce such a case is to write something to a 256M
btrfs and balance the fs several times.
Then all chunks can be at a bytenr beyond 256M.

The real problem is that the extent has a mismatched reference.
Normally it can be fixed with the --init-extent-tree option, but it usually
indicates a bigger problem, especially as it has already caused a kernel
delayed-ref problem.

Not to mention the error "extent item 11271947091968 has multiple extent
items", which makes the problem more serious.

I assume some older kernel has already screwed up the extent tree;
although delayed-ref is bug-prone, it has improved in recent years.

But the fs tree seems less damaged, so I assume the extent tree
corruption could be fixed by "--init-extent-tree".

For the only fs tree error (missing csum), if "btrfsck
--init-extent-tree --repair" works without any problem, the simplest
fix would be to just remove the file.
Or you can spend a lot of CPU time and disk IO to rebuild the whole csum
tree with the "--init-csum-tree" option.


Thanks,
Qu



> [...]
>
> > checking fs roots
> > root 4336 inode 4284125 errors 1000, some csum missing
>
> What is in this inode?
>
> > Checking filesystem on /dev/disk/by-label/system
> > UUID: d2bb232a-2e8f-4951-8bcc-97e237f1b536
> > found 1832931324360 bytes used err is 1
> > total csum bytes: 1730105656
> > total tree bytes: 6494474240
> > total fs tree bytes: 3789783040
> > total extent tree bytes: 608219136
> > btree space waste bytes: 1221460063
> > file data blocks allocated: 2406059724800
> >  referenced 2040857763840


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote:
> On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote:
> > On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> > > > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  wrote:
> > > > 
> > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > >>>>> Or is it ok that fallocate could block, potentially for a long time as
> > > >>>>> we stream cows through the page cache (or however unshare works
> > > >>>>> internally)?  Those same programs might not be expecting fallocate to
> > > >>>>> take a long time.
> > > >>>>
> > > >>>> Yes, it's perfectly fine for fallocate to block for long periods of
> > > >>>> time. See what gfs2 does during preallocation of blocks - it ends up
> > > >>>> calling sb_issue_zeroout() because it doesn't have unwritten
> > > >>>> extents, and hence can block for long periods of time
> > > >>> 
> > > >>> gfs2 fallocate is an implementation that will cause all but the most
> > > >>> trivial users real pain.  Even the initial XFS implementation just
> > > >>> marking the transactions synchronous made it unusable for all kinds
> > > >>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> > > >>> to gfs2 will probably hang your connection for extended periods of
> > > >>> time.
> > > >>> 
> > > >>> If we need to support something like what gfs2 does we should have a
> > > >>> separate flag for it.
> > > >> 
> > > >> Using fallocate() for preallocation was always intended to
> > > >> be a faster, more efficient method allocating zeroed space
> > > >> than having userspace write blocks of data. Faster, more efficient
> > > >> does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
> > > >> that if the hardware has zeroing offloads (deterministic trim, write
> > > >> same, etc) it will use them, and that will be much faster than
> > > >> writing zeros from userspace.
> > > >> 
> > > >> IMO, what gfs2 does is definitely within the intended usage of
> > > >> fallocate() for accelerating the preallocation of blocks.
> > > >> 
> > > >> Yes, it may not be optimal for things like NFS servers which haven't
> > > >> considered that a fallocate based offload operation might take some
> > > >> time to execute, but that's not a problem with fallocate. i.e.
> > > >> that's a problem with the nfs server ALLOCATE implementation not
> > > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> > > >> and timeouts while the operation is run
> > > > 
> > > > That's an interesting idea, but I don't think it's really legal.  I take
> > > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> > > > later and it might succeed", not "OK, I'm working on it, try again and
> > > > you may find out I've done it".
> > > > 
> > > > So if the client gets a JUKEBOX error but the server goes ahead and does
> > > > the operation anyway, that'd be unexpected.
> > > 
> > > Well, the tape continued to be mounted in the background and/or the file
> > > restored from the tape into the filesystem...
> > 
> > Right, and SGI have been shipping a DMAPI-aware Linux NFS server for
> > many years, using the above NFSERR_JUKEBOX behaviour for operations
> > that may block for a long time due to the need to pull stuff into
> > > the filesystem from the slow backing store. Best explanation is in
> > the relevant commit in the last published XFS+DMAPI branch from SGI,
> > for example:
> > 
> > http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75
> 
> I haven't looked at the code, but I assume a JUKEBOX-returning write to
> an absent file brings into cache the bits necessary to perform the
> write, but stops short of actually doing the write.

Not exactly, as all subsequent read/write/truncate requests will
EJUKEBOX until the absent file has been brought back onto disk. Once
that is done, the next operation attempt will proceed.

> That allows
> handling the retried write quickly without doing the wrong thing in the
> case the retry never comes.

Essentially. But if a retry never comes it means there's either a
bug in the client NFS implementation or the client crashed, in which
case we don't really care.

> Implementing fallocate by returning JUKEBOX while still continuing the
> allocation in the background is a bit different.

Not really. Like the HSM case, we don't really care if a retry occurs
or not - the server simply needs to reply NFSERR_JUKEBOX for all
subsequent read/write/fallocate/truncate operations on that inode
until the fallocate completes...
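
The behaviour described above can be sketched as a toy model. This is only an illustration of the idea in this mail, not NFS server code: the class `ToyInode`, its methods and the string constants are hypothetical names made up for the example, and whether this behaviour is legal under the NFS spec is exactly what is being disputed in this thread.

```python
NFS_OK = "NFS_OK"
NFSERR_JUKEBOX = "NFSERR_JUKEBOX"


class ToyInode:
    """Hypothetical stand-in for a server-side inode; not a real NFS API."""

    def __init__(self):
        # True while a long-running fallocate is still in progress.
        self.fallocate_in_flight = False

    def start_fallocate(self):
        """Kick off the allocation and answer the client immediately."""
        self.fallocate_in_flight = True
        return NFSERR_JUKEBOX

    def finish_fallocate(self):
        """Called once the background allocation has completed."""
        self.fallocate_in_flight = False

    def handle(self, op):
        """Answer a read/write/fallocate/truncate request on this inode."""
        if self.fallocate_in_flight:
            return NFSERR_JUKEBOX  # the client may or may not retry later
        return NFS_OK


inode = ToyInode()
first_reply = inode.start_fallocate()   # JUKEBOX, work continues
reply_during = inode.handle("write")    # still JUKEBOX
inode.finish_fallocate()
reply_after = inode.handle("write")     # proceeds normally
print(first_reply, reply_during, reply_after)
```

Note that in this model the allocation completes whether or not the client ever retries, which is the point J. Bruce Fields raises.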

i.e. it requires O_NONBLOCK style operation for filesystem IO
operations to 

Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-03-31 Thread Qu Wenruo



David Sterba wrote on 2016/03/31 18:30 +0200:
> On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
> > At least 2 users on the mailing list reported that btrfsck raised a false
> > alert of "bad metadata [,) crossing stripe boundary",
> > while the reported numbers are all inside the same 64K boundary.
> > After some checking, all the false alerts share the same bytenr feature:
> > the bytenr is divisible by the stripe size (64K).
> >
> > The cause seems to be that the initial 'max_size' can be 0, causing 'start' +
> > 'max_size' - 1 to cross the stripe boundary.
> >
> > Fix it by always updating extent_record->cross_stripe when the
> > extent_record is updated, to avoid temporary false alerts being reported.
> >
> > Signed-off-by: Qu Wenruo
>
> Applied, thanks.
>
> Do you have a test image for that?



Unfortunately, no.

Although I figured out the cause of the false alert, I still haven't
found an image or method to reproduce it, except for the reporters' images.

I can dig a little further and try to make an image.
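
The arithmetic behind the false alert can be shown with a small sketch. It is not the btrfs-progs code: the function `crosses_stripe` is a hypothetical helper written for this illustration, and only the 64K stripe size and the 'start' + 'max_size' - 1 expression come from the patch description.

```python
STRIPE_LEN = 64 * 1024  # 64K stripe size, as stated in the patch description


def crosses_stripe(start, size, stripe_len=STRIPE_LEN):
    """Return True if the first and last byte land in different stripes."""
    first_stripe = start // stripe_len
    last_stripe = (start + size - 1) // stripe_len
    return first_stripe != last_stripe


# A metadata extent that starts exactly on a stripe boundary.
start = 10 * STRIPE_LEN

# With its real size (16K here, chosen for the example) it stays in one stripe.
with_real_size = crosses_stripe(start, 16 * 1024)

# With a temporary size of 0, start + 0 - 1 falls into the previous stripe,
# so the check reports a crossing that does not exist.
with_zero_size = crosses_stripe(start, 0)

print(with_real_size, with_zero_size)  # False True
```

A start that is not divisible by the stripe size does not trigger the problem with size 0, which matches the observation that all reports had a bytenr divisible by 64K.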

Thanks,
Qu




Re: [PATCH v9 00/19] Btrfs dedupe framework

2016-03-31 Thread Qu Wenruo



David Sterba wrote on 2016/03/31 18:12 +0200:
> On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> > This March 30th patchset update mostly addresses the patchset structure
> > comment from David:
> > 1) Change the patchset sequence
> > Now, if only the first 14 patches are applied, they provide the full
> > backward-compatible, in-memory-only dedupe backend.
> >
> > The on-disk format changes only from patch 15 onwards.
> >
> > So patches 1~14 are going to be pushed for the next merge window, while I'll
> > still submit them all for review purposes.
>
> I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option
> until the interface is settled.



Nice to hear that.

I'll add BTRFS_DEBUG config then.

BTW, any comment on btrfs-convert rewrite?

Thanks,
Qu




Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 10:44 PM, Kai Krakow  wrote:
> Hello!
>
> I already reported this in another thread but it was a bit confusing by
> intermixing multiple volumes. So let's start a new thread:
>
> Since one of the last kernel upgrades, I'm experiencing one VDI file
> (containing a NTFS image with Windows 7) getting damaged when running
> the machine in VirtualBox. I got knowledge about this after
> experiencing an error "duplicate object" and btrfs went RO. I fixed it
> by deleting the VDI and restoring from backup - but now I get csum
> errors as soon as some VM IO goes into the VDI file.
>
> The FS is still usable. One effect is, that after reading all files
> with rsync (to copy to my backup), each call of "du" or "df" hangs, also
> similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one
> outcome of this is, that the FS does not properly unmount during
> shutdown.
>
> Kernel is 4.5.0 by now (the FS is much much older, dates back to 3.x
> series, and never had problems), including Gentoo patch-set r1.

One possibility could be that the vbox kernel modules somehow corrupt
btrfs kernel area since kernel 4.5.

In order to make this reproducible (or an attempt to reproduce) for
others, you could unload VirtualBox stuff and restore the VDI file
from backup (or whatever big file) and then make pseudo-random, but
reproducible writes to the file.
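
A minimal sketch of such pseudo-random but reproducible writes, assuming a fixed seed is good enough for others to replay the same pattern. The function name `reproducible_writes`, its parameters and the 4096-byte block size are choices made for this example, not anything from the thread.

```python
import os
import random
import tempfile


def reproducible_writes(path, seed, count, block_size=4096):
    """Overwrite `count` pseudo-random blocks of an existing file.

    The same seed always yields the same offsets and the same data.
    The file must be at least one block long.
    Returns the list of offsets written, for logging or comparison.
    """
    rng = random.Random(seed)
    blocks = os.path.getsize(path) // block_size
    offsets = []
    with open(path, "r+b") as f:
        for _ in range(count):
            offset = rng.randrange(blocks) * block_size
            data = bytes(rng.getrandbits(8) for _ in range(block_size))
            f.seek(offset)
            f.write(data)
            offsets.append(offset)
        f.flush()
        os.fsync(f.fileno())  # push the writes down to the filesystem
    return offsets


if __name__ == "__main__":
    # Demo on a scratch file; point `path` at a restored copy of the big
    # file on the btrfs volume to exercise the real case.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"\0" * (256 * 4096))
    os.close(fd)
    print(reproducible_writes(path, seed=1, count=8))
    os.remove(path)
```

Anyone running this with the same seed, count and file size gets the same write pattern, so csum errors in dmesg can be compared between machines and kernels.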

It is not clear to me what 'Gentoo patch-set r1' is and does. So just
boot a vanilla v4.5 kernel from kernel.org and see if you get csum
errors in dmesg.

Also, where does 'duplicate object' come from? dmesg ? then please
post its surroundings, straight from dmesg.

> The device layout is:
>
> $ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
> NAME          MODEL            FSTYPE LABEL  MOUNTPOINT
> sda           Crucial_CT128MX1
> ├─sda1                         vfat   ESP    /boot
> ├─sda2
> └─sda3                         bcache
>   ├─bcache0                    btrfs  system
>   ├─bcache1                    btrfs  system
>   └─bcache2                    btrfs  system /usr/src
> sdb           SAMSUNG HD103SJ
> ├─sdb1                         swap   swap0  [SWAP]
> └─sdb2                         bcache
>   └─bcache2                    btrfs  system /usr/src
> sdc           SAMSUNG HD103SJ
> ├─sdc1                         swap   swap1  [SWAP]
> └─sdc2                         bcache
>   └─bcache1                    btrfs  system
> sdd           SAMSUNG HD103UJ
> ├─sdd1                         swap   swap2  [SWAP]
> └─sdd2                         bcache
>   └─bcache0                    btrfs  system
>
> Mount options are:
>
> $ mount|fgrep btrfs
> /dev/bcache2 on / type btrfs 
> (rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)
>
> The FS uses mraid=1 and draid=0.
>
> Output of btrfsck is:
> (also available here:
> https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae)
>
> $ btrfsck /dev/disk/by-label/system
> checking extents
> ref mismatch on [10443660537856 524288] extent item 1, found 2
This   10443660537856  number is bigger than the  1832931324360 number
found for total bytes. AFAIK, this is already wrong.

[...]

> checking fs roots
> root 4336 inode 4284125 errors 1000, some csum missing
What is in this inode?

> Checking filesystem on /dev/disk/by-label/system
> UUID: d2bb232a-2e8f-4951-8bcc-97e237f1b536
> found 1832931324360 bytes used err is 1
> total csum bytes: 1730105656
> total tree bytes: 6494474240
> total fs tree bytes: 3789783040
> total extent tree bytes: 608219136
> btree space waste bytes: 1221460063
> file data blocks allocated: 2406059724800
>  referenced 2040857763840


Re: "/tmp/mnt.", and not honouring compression

2016-03-31 Thread Lionel Bouton
On 31/03/2016 22:49, Chris Murray wrote:
> Hi,
>
> I'm trying to troubleshoot a ceph cluster which doesn't seem to be
> honouring BTRFS compression on some OSDs. Can anyone offer some help? Is
> it likely to be a ceph issue or a BTRFS one? Or something else? I've
> asked on ceph-users already, but not received a response yet.
>
> Config is set to mount with "noatime,nodiratime,compress-force=lzo"
>
> Some OSDs have been getting much more full than others though, which I
> think is something to do with these 'tmp' mounts e.g. below:

Note that there are other reasons for unbalanced storage on Ceph OSDs.
The main reason is too few PGs (there's a calculator on ceph.com, google
for it).
These tmp mounts aren't normal, you should find out what is causing them.

So it might be a Ceph issue (too few PGs) or a system issue (some
component trying to use your filesystems for its own purposes).
You might have more luck on the ceph-users list (post your Ceph version,
the result of ceph osd tree, df on all OSDs and hunt for the process
creating these mounts on your systems).

It's probably not a Btrfs issue (I run a Ceph on Btrfs cluster in
production and I've never seen this kind of problem).

Lionel


Re: "/tmp/mnt.", and not honouring compression

2016-03-31 Thread Duncan
Chris Murray posted on Thu, 31 Mar 2016 21:49:29 +0100 as excerpted:

> I'm using Proxmox, based on Debian. Kernel version 4.2.8-1-pve. Btrfs
> v3.17.

The problem itself is beyond my level, but aiming for the obvious low-
hanging fruit...

On this list, which is forward looking as btrfs is still stabilizing, not
yet fully stable and mature, kernel support comes in four tracks:
mainstream and btrfs development trees, mainstream current, mainstream
LTS, and everything else.

Mainstream and btrfs development trees should be obvious.  This track covers
mainstream current git and rc kernels as well as btrfs-integration and
linux-next.  Generally only recommended for bleeding edge testers willing
to lose what they're testing.

Mainstream current follows mainstream latest releases, with generally the 
latest two kernel series being best supported.  With 4.5 out, that's 4.5 
and 4.4.

Mainstream LTS follows mainstream LTS series, and until recently, again 
the latest two were best supported.  That's the 4.4 and 4.1 LTS series.  
However, as btrfs has matured, the previous LTS series, 3.18, hasn't 
turned out so bad and remains reasonably well supported as well, tho 
depending on the issue, you may still be asked to upgrade and see if it's 
still there in 4.1 or 4.4.

Then there's "everything else", which is where a 4.2 kernel such as 
you're running comes in.  These kernels are either long ago history 
(pre-3.18 LTS, for instance) in btrfs terms, or out of their mainstream 
kernel support windows, which is where 4.2 is.  While we recognize that 
various distros claiming btrfs support may still be using these kernels, 
because we're mainline focused we don't track what patches they may or 
may not have backported, and thus aren't in a particularly good position 
to support them.  If you're relying on your distro's support in such a 
case, that's where you need to look, as they know what they've backported 
and what they haven't and are thus in a far better position to provide 
support.

As for the list, we still do the best we can with these "everything else"
kernels, but unless it's a known problem recognized on sight, that's most
often simply to recommend upgrading to something that's better supported
and trying to duplicate the problem there.

Meanwhile, for long-term enterprise level stability, btrfs isn't likely 
to be a good choice in any case, as it really is still stabilizing and 
the expectation is that people running it will be upgrading to get the 
newer patches.  If that's not feasible, as it may not be for the 
enterprise-stability-level use-case, then it's very likely that btrfs 
isn't a good match for the use-case anyway, as it's simply not to that 
level of stability yet.  A more mature filesystem such as ext4, ext3, the 
old reiserfs which I still use on some spinning rust here (all my btrfs 
are on ssd), xfs, etc, is very likely to be a more appropriate choice for 
that use-case.

For kernel 4.2, that leaves you with a few choices:

1) Ask your distro for btrfs support if they offer it on the out-of-
mainline-support kernels which they've obviously chosen to use instead of 
the LTS series that /are/ still mainline supported.

2) Upgrade to the supported 4.4 LTS kernel series.

3) Downgrade to the older supported 4.1 LTS kernel series.

4) Decide btrfs is inappropriate for your use-case and switch to a fully 
stable and mature filesystem.

5) Continue with 4.2 and muddle thru, using our "best effort" help where 
you can and doing without or getting it elsewhere if the opportunity 
presents itself or you have money to buy it from a qualified provider.


Personally I'd choose option 2, upgrading to 4.4, but that's just me.  
The other choices may work better for you.


As for btrfs-progs userspace, when the filesystem is working it's not as 
critical, since other than filesystem creation with mkfs.btrfs, most 
operational commands simply invoke kernel code to do the real work.  
However, once problems appear, a newer version can be critical as patches 
to deal with newly discovered problems continue to be added to tools such 
as btrfs check (for detecting and repairing problems) and btrfs restore 
(for recovery of files off an unmountable filesystem).  And newer 
userspace is designed to work with older kernels, so newer isn't a 
problem in that regard.

As a result, to keep userspace from getting /too/ far behind and because 
userspace release version numbers are synced with kernel version, a good 
rule of thumb is to run a userspace version similar to that of your 
kernel, or newer.  Assuming you're already following the current or LTS 
track kernel recommendations, that will keep you reasonably current, and 
you can always upgrade to the newest available if you're trying to fix 
otherwise unfixable problems.
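
The rule of thumb can be written down as a quick check. This is only a sketch: the helper names `parse_version` and `userspace_is_current` are made up for the example, and comparing just major.minor is an assumption about what "similar to that of your kernel" means.

```python
def parse_version(text):
    """Turn a string like '4.2.8-1-pve' or 'v3.17' into (major, minor)."""
    core = text.lstrip("v").split("-")[0]
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def userspace_is_current(progs_version, kernel_version):
    """Rule of thumb: btrfs-progs at the kernel's version, or newer."""
    return parse_version(progs_version) >= parse_version(kernel_version)


# The versions reported in this thread: btrfs-progs v3.17 on kernel 4.2.8-1-pve.
print(userspace_is_current("v3.17", "4.2.8-1-pve"))  # False
# A 4.4 userspace on the same kernel would satisfy the rule.
print(userspace_is_current("4.4", "4.2.8-1-pve"))    # True
```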

Unfortunately your userspace falls well outside that recommendation as 
well, with 3.17 userspace being before the earliest supported 3.18 LTS 
kernel, let alone comparable to your current 

Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread J. Bruce Fields
On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  wrote:
> > > 
> > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > >>>>> Or is it ok that fallocate could block, potentially for a long time as
> > >>>>> we stream cows through the page cache (or however unshare works
> > >>>>> internally)?  Those same programs might not be expecting fallocate to
> > >>>>> take a long time.
> > >>>>
> > >>>> Yes, it's perfectly fine for fallocate to block for long periods of
> > >>>> time. See what gfs2 does during preallocation of blocks - it ends up
> > >>>> calling sb_issue_zeroout() because it doesn't have unwritten
> > >>>> extents, and hence can block for long periods of time
> > >>> 
> > >>> gfs2 fallocate is an implementation that will cause all but the most
> > >>> trivial users real pain.  Even the initial XFS implementation just
> > >>> marking the transactions synchronous made it unusable for all kinds
> > >>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> > >>> to gfs2 will probably hang your connection for extended periods of
> > >>> time.
> > >>> 
> > >>> If we need to support something like what gfs2 does we should have a
> > >>> separate flag for it.
> > >> 
> > >> Using fallocate() for preallocation was always intended to
> > >> be a faster, more efficient method allocating zeroed space
> > >> than having userspace write blocks of data. Faster, more efficient
> > >> does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
> > >> that if the hardware has zeroing offloads (deterministic trim, write
> > >> same, etc) it will use them, and that will be much faster than
> > >> writing zeros from userspace.
> > >> 
> > >> IMO, what gfs2 does is definitely within the intended usage of
> > >> fallocate() for accelerating the preallocation of blocks.
> > >> 
> > >> Yes, it may not be optimal for things like NFS servers which haven't
> > >> considered that a fallocate based offload operation might take some
> > >> time to execute, but that's not a problem with fallocate. i.e.
> > >> that's a problem with the nfs server ALLOCATE implementation not
> > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> > >> and timeouts while the operation is run
> > > 
> > > That's an interesting idea, but I don't think it's really legal.  I take
> > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> > > later and it might succeed", not "OK, I'm working on it, try again and
> > > you may find out I've done it".
> > > 
> > > So if the client gets a JUKEBOX error but the server goes ahead and does
> > > the operation anyway, that'd be unexpected.
> > 
> > Well, the tape continued to be mounted in the background and/or the file
> > restored from the tape into the filesystem...
> 
> Right, and SGI have been shipping a DMAPI-aware Linux NFS server for
> many years, using the above NFSERR_JUKEBOX behaviour for operations
> that may block for a long time due to the need to pull stuff into
> the filesystem from the slow backing store. Best explanation is in
> the relevant commit in the last published XFS+DMAPI branch from SGI,
> for example:
> 
> http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75

I haven't looked at the code, but I assume a JUKEBOX-returning write to
an absent file brings into cache the bits necessary to perform the
write, but stops short of actually doing the write.  That allows
handling the retried write quickly without doing the wrong thing in the
case the retry never comes.

Implementing fallocate by returning JUKEBOX while still continuing the
allocation in the background is a bit different.

I guess it doesn't matter as much in practice, since the only way you're
likely to notice that fallocate unexpectedly succeeded would be if it
caused you to hit ENOSPC elsewhere.  Is that right?  Still, it seems a
little weird.

--b.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  wrote:
> > 
> > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> >>>>> Or is it ok that fallocate could block, potentially for a long time as
> >>>>> we stream cows through the page cache (or however unshare works
> >>>>> internally)?  Those same programs might not be expecting fallocate to
> >>>>> take a long time.
> >>>> 
> >>>> Yes, it's perfectly fine for fallocate to block for long periods of
> >>>> time. See what gfs2 does during preallocation of blocks - it ends up
> >>>> calling sb_issue_zeroout() because it doesn't have unwritten
> >>>> extents, and hence can block for long periods of time
> >>> 
> >>> gfs2 fallocate is an implementation that will cause all but the most
> >>> trivial users real pain.  Even the initial XFS implementation just
> >>> marking the transactions synchronous made it unusable for all kinds
> >>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> >>> to gfs2 will probably hang your connection for extended periods of
> >>> time.
> >>> 
> >>> If we need to support something like what gfs2 does we should have a
> >>> separate flag for it.
> >> 
> >> Using fallocate() for preallocation was always intended to
> >> be a faster, more efficient method of allocating zeroed space
> >> than having userspace write blocks of data. Faster, more efficient
> >> does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
> >> that if the hardware has zeroing offloads (deterministic trim, write
> >> same, etc) it will use them, and that will be much faster than
> >> writing zeros from userspace.
> >> 
> >> IMO, what gfs2 does is definitely within the intended usage of
> >> fallocate() for accelerating the preallocation of blocks.
> >> 
> >> Yes, it may not be optimal for things like NFS servers which haven't
> >> considered that a fallocate based offload operation might take some
> >> time to execute, but that's not a problem with fallocate. i.e.
> >> that's a problem with the nfs server ALLOCATE implementation not
> >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> >> and timeouts while the operation is run
> > 
> > That's an interesting idea, but I don't think it's really legal.  I take
> > JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> > later and it might succeed", not "OK, I'm working on it, try again and
> > you may find out I've done it".
> > 
> > So if the client gets a JUKEBOX error but the server goes ahead and does
> > the operation anyway, that'd be unexpected.
> 
> Well, the tape continued to be mounted in the background and/or the file
> restored from the tape into the filesystem...

Right, and SGI have been shipping a DMAPI-aware Linux NFS server for
many years, using the above NFSERR_JUKEBOX behaviour for operations
that may block for a long time due to the need to pull stuff into
the filesystem from the slow backing store. Best explanation is in
the relevant commit in the last published XFS+DMAPI branch from SGI,
for example:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: bad metadata crossing stripe boundary

2016-03-31 Thread Kai Krakow
On Thu, 31 Mar 2016 23:16:30 +0200,
Kai Krakow wrote:

> On Thu, 31 Mar 2016 23:00:04 +0200,
> Marc Haber wrote:
> 
> > On Thu, Mar 31, 2016 at 10:31:49AM +0800, Qu Wenruo wrote:  
> > > Would you please try the following patch based on v4.5
> > > btrfs-progs? https://patchwork.kernel.org/patch/8706891/
> > 
> > This also fixes the "bad metadata crossing stripe boundary" on my
> > pet patient.
> > 
> > I find it somewhere between funny and disturbing that the first call
> > of btrfs check made my kernel log the following:
> > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted
> > filesystem with ordered data mode. Opts: (null) Mar 31 22:45:38 fan
> > kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid
> > 67526 /dev/dm-31
> > 
> > No, the filesystem was not converted, it was directly created as
> > btrfs, and no, I didn't try mounting it.  
> 
> I suspect that your partition contained ext4 before, and you didn't
> run wipefs before running mkfs.btrfs. Now, the ext4 superblock is
> still detected because no btrfs structure or block has overwritten it.
> I had a similar problem when I first tried btrfs.
> 
> I think there is some magic dd-fu to damage the ext4 superblock
> without hurting the btrfs itself. But I leave this to the fs devs,
> they could properly tell you.

That said, you could also try to force btrfs to be detected before ext4
by modifying /etc/filesystems.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: bad metadata crossing stripe boundary

2016-03-31 Thread Kai Krakow
On Thu, 31 Mar 2016 23:00:04 +0200,
Marc Haber wrote:

> On Thu, Mar 31, 2016 at 10:31:49AM +0800, Qu Wenruo wrote:
> > Would you please try the following patch based on v4.5 btrfs-progs?
> > https://patchwork.kernel.org/patch/8706891/  
> 
> This also fixes the "bad metadata crossing stripe boundary" on my pet
> patient.
> 
> I find it somewhere between funny and disturbing that the first call
> of btrfs check made my kernel log the following:
> Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted
> filesystem with ordered data mode. Opts: (null) Mar 31 22:45:38 fan
> kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid
> 67526 /dev/dm-31
> 
> No, the filesystem was not converted, it was directly created as
> btrfs, and no, I didn't try mounting it.

I suspect that your partition contained ext4 before, and you didn't run
wipefs before running mkfs.btrfs. Now, the ext4 superblock is still
detected because no btrfs structure or block has overwritten it. I had a
similar problem when I first tried btrfs.

I think there is some magic dd-fu to damage the ext4 superblock without
hurting the btrfs itself. But I leave this to the fs devs, they could
properly tell you.
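
To illustrate why both filesystems can be detected on one device, here is a
minimal read-only sketch in Python. It relies on two well-known on-disk
facts: the ext2/3/4 magic 0xEF53 sits at byte 1080 (superblock at 1 KiB,
field offset 0x38), and the btrfs magic "_BHRfS_M" sits at byte 65600
(primary superblock at 64 KiB, field offset 0x40). The function names and
the in-memory test image are made up for illustration; this only looks for
signatures and is not a replacement for wipefs.

```python
import io

EXT_MAGIC_OFFSET = 1024 + 0x38          # s_magic inside the ext superblock at 1 KiB
EXT_MAGIC = b"\x53\xef"                 # 0xEF53 stored little-endian
BTRFS_MAGIC_OFFSET = 64 * 1024 + 0x40   # magic inside the primary btrfs superblock
BTRFS_MAGIC = b"_BHRfS_M"


def detect_signatures(dev):
    """Return the set of filesystem signatures found in a seekable binary stream."""
    found = set()
    probes = (
        ("ext", EXT_MAGIC_OFFSET, EXT_MAGIC),
        ("btrfs", BTRFS_MAGIC_OFFSET, BTRFS_MAGIC),
    )
    for name, offset, magic in probes:
        dev.seek(offset)
        if dev.read(len(magic)) == magic:
            found.add(name)
    return found


def make_image(with_ext, with_btrfs):
    """Build a zero-filled in-memory image carrying the requested signatures (hypothetical helper)."""
    image = bytearray(BTRFS_MAGIC_OFFSET + len(BTRFS_MAGIC))
    if with_ext:
        image[EXT_MAGIC_OFFSET:EXT_MAGIC_OFFSET + len(EXT_MAGIC)] = EXT_MAGIC
    if with_btrfs:
        image[BTRFS_MAGIC_OFFSET:BTRFS_MAGIC_OFFSET + len(BTRFS_MAGIC)] = BTRFS_MAGIC
    return io.BytesIO(bytes(image))
```

Because the two signatures live at different offsets, a stale ext signature
can in principle survive next to a valid btrfs one. To check a real device,
one could pass `open("/dev/dm-31", "rb")` (needs read permission) instead of
the test image.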

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH v2] Btrfs: fix file/data loss caused by fsync after rename and new inode

2016-03-31 Thread Chris Mason
On Wed, Mar 30, 2016 at 11:37:21PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If we rename an inode A (be it a file or a directory), create a new
> inode B with the old name of inode A and under the same parent directory,
> fsync inode B and then power fail, at log tree replay time we end up
> removing inode A completely. If inode A is a directory then all its files
> are gone too.
> 
> Example scenarios where this happens:
> This is reproducible with the following steps, taken from a couple of
> test cases written for fstests which are going to be submitted upstream
> soon:

Thanks Filipe!  Since this is an older bug, I won't rush it into
tomorrow's pull, but I'll test and get it into next week.

-chris


Re: [PATCH v2] Btrfs: fix file/data loss caused by fsync after rename and new inode

2016-03-31 Thread Duncan
fdmanana posted on Wed, 30 Mar 2016 23:37:21 +0100 as excerpted:

> From: Filipe Manana 
> 
> If we rename an inode A (be it a file or a directory), create a new
> inode B with the old name of inode A and under the same parent
> directory, fsync inode B and then power fail, at log tree replay time
> we end up removing inode A completely. If inode A is a directory then
> all its files are gone too.

...

> V2: Node code changes, only updated the change log and the comment to
> be more clear about the problems solved by the new checks.

If there's a V3 anyway, apparent typo:

s/Node code/No code/

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: bad metadata crossing stripe boundary

2016-03-31 Thread Marc Haber
On Thu, Mar 31, 2016 at 10:31:49AM +0800, Qu Wenruo wrote:
> Would you please try the following patch based on v4.5 btrfs-progs?
> https://patchwork.kernel.org/patch/8706891/

This also fixes the "bad metadata crossing stripe boundary" on my pet
patient.

I find it somewhere between funny and disturbing that the first call
of btrfs check made my kernel log the following:
Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted filesystem with ordered data mode. Opts: (null)
Mar 31 22:45:38 fan kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid 67526 /dev/dm-31

No, the filesystem was not converted, it was directly created as
btrfs, and no, I didn't try mounting it.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mail address in header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


"/tmp/mnt.", and not honouring compression

2016-03-31 Thread Chris Murray
Hi,

I'm trying to troubleshoot a ceph cluster which doesn't seem to be
honouring BTRFS compression on some OSDs. Can anyone offer some help? Is
it likely to be a ceph issue or a BTRFS one? Or something else? I've
asked on ceph-users already, but not received a response yet.

Config is set to mount with "noatime,nodiratime,compress-force=lzo"

Some OSDs have been getting much more full than others though, which I
think is something to do with these 'tmp' mounts e.g. below:

/dev/sdc1 on /var/lib/ceph/tmp/mnt.AywYKY type btrfs
(rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=
/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs
(rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,sub
vol=/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=
/)

After a reboot, it's moved to another drive:

/dev/sdd1 on /var/lib/ceph/tmp/mnt.kWh2NA type btrfs
(rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs
(rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,sub
vol=/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=
/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs
(rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=
/)
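
One way to compare the two listings is to group the btrfs mounts by device
and flag which ones carry a compress option. The following Python sketch
does that for `mount`-style lines that have been re-joined onto one line
each; the function name and the SAMPLE constant are illustrative, and the
sample is the first listing above with the line wraps undone.

```python
import re
from collections import defaultdict

MOUNT_RE = re.compile(r"^(\S+) on (\S+) type btrfs \(([^)]*)\)$")


def btrfs_mounts_by_device(mount_output):
    """Map each device to a list of (mountpoint, has_compress_option) tuples."""
    by_device = defaultdict(list)
    for line in mount_output.splitlines():
        match = MOUNT_RE.match(line.strip())
        if not match:
            continue  # not a btrfs mount line
        device, mountpoint, options = match.groups()
        has_compress = any(opt.startswith("compress") for opt in options.split(","))
        by_device[device].append((mountpoint, has_compress))
    return dict(by_device)


# First listing from the message above, one mount per line.
SAMPLE = """\
/dev/sdc1 on /var/lib/ceph/tmp/mnt.AywYKY type btrfs (rw,noatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdd1 on /var/lib/ceph/osd/ceph-16 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)
/dev/sdc1 on /var/lib/ceph/osd/ceph-15 type btrfs (rw,noatime,nodiratime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
/dev/sdb1 on /var/lib/ceph/osd/ceph-20 type btrfs (rw,noatime,nodiratime,compress-force=lzo,space_cache,subvolid=5,subvol=/)
"""
```

For the sample, the only device whose mounts lack a compress option is
/dev/sdc1, which is also the device behind the tmp mount. That is only a
correlation in this listing, not a diagnosis.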

I'm using Proxmox, based on Debian. Kernel version 4.2.8-1-pve. Btrfs
v3.17.

Thank you,
Chris


btrfsck: backpointer mismatch (and multiple other errors)

2016-03-31 Thread Kai Krakow
Hello!

I already reported this in another thread, but that report was confusing
because it intermixed multiple volumes. So let's start a new thread:

Since one of the last kernel upgrades, one VDI file (containing an NTFS
image with Windows 7) has been getting damaged when I run the machine in
VirtualBox. I became aware of this after hitting a "duplicate object"
error, after which btrfs went RO. I fixed it by deleting the VDI and
restoring from backup - but now I get csum errors as soon as some VM IO
goes into the VDI file.

The FS is still usable. One effect is that after reading all files with
rsync (to copy to my backup), each call of "du" or "df" hangs, and
similar calls to "btrfs {sub|fi} ..." show the same effect. I guess one
outcome of this is that the FS does not unmount properly during
shutdown.

Kernel is 4.5.0 by now (the FS is much much older, dates back to 3.x
series, and never had problems), including Gentoo patch-set r1.

The device layout is:

$ lsblk -o NAME,MODEL,FSTYPE,LABEL,MOUNTPOINT
NAME        MODEL            FSTYPE LABEL  MOUNTPOINT
sda         Crucial_CT128MX1
├─sda1                       vfat   ESP    /boot
├─sda2
└─sda3                       bcache
  ├─bcache0                  btrfs  system
  ├─bcache1                  btrfs  system
  └─bcache2                  btrfs  system /usr/src
sdb         SAMSUNG HD103SJ
├─sdb1                       swap   swap0  [SWAP]
└─sdb2                       bcache
  └─bcache2                  btrfs  system /usr/src
sdc         SAMSUNG HD103SJ
├─sdc1                       swap   swap1  [SWAP]
└─sdc2                       bcache
  └─bcache1                  btrfs  system
sdd         SAMSUNG HD103UJ
├─sdd1                       swap   swap2  [SWAP]
└─sdd2                       bcache
  └─bcache0                  btrfs  system

Mount options are:

$ mount|fgrep btrfs
/dev/bcache2 on / type btrfs 
(rw,noatime,compress=lzo,nossd,discard,space_cache,autodefrag,subvolid=256,subvol=/gentoo/rootfs)

The FS uses mraid=1 and draid=0.

Output of btrfsck is:
(also available here:
https://gist.github.com/kakra/bfcce4af242f6548f4d6b45c8afb46ae)

$ btrfsck /dev/disk/by-label/system
checking extents
ref mismatch on [10443660537856 524288] extent item 1, found 2
Backref 10443660537856 root 256 owner 23536425 offset 1310720 num_refs 0 not 
found in extent tree
Incorrect local backref count on 10443660537856 root 256 owner 23536425 offset 
1310720 found 1 wanted 0 back 0x4ceee750
Backref disk bytenr does not match extent record, bytenr=10443660537856, ref 
bytenr=10443660914688
Backref bytes do not match extent backref, bytenr=10443660537856, ref 
bytes=524288, backref bytes=69632
backpointer mismatch on [10443660537856 524288]
extent item 11271946579968 has multiple extent items
ref mismatch on [11271946579968 110592] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946579968, ref 
bytenr=11271946629120
backpointer mismatch on [11271946579968 110592]
extent item 11271946690560 has multiple extent items
ref mismatch on [11271946690560 114688] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946690560, ref 
bytenr=11271946739712
Backref bytes do not match extent backref, bytenr=11271946690560, ref 
bytes=114688, backref bytes=110592
backpointer mismatch on [11271946690560 114688]
extent item 11271946805248 has multiple extent items
ref mismatch on [11271946805248 114688] extent item 1, found 3
Backref disk bytenr does not match extent record, bytenr=11271946805248, ref 
bytenr=11271946850304
Backref bytes do not match extent backref, bytenr=11271946805248, ref 
bytes=114688, backref bytes=53248
Backref disk bytenr does not match extent record, bytenr=11271946805248, ref 
bytenr=11271946903552
Backref bytes do not match extent backref, bytenr=11271946805248, ref 
bytes=114688, backref bytes=49152
backpointer mismatch on [11271946805248 114688]
extent item 11271946919936 has multiple extent items
ref mismatch on [11271946919936 61440] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946919936, ref 
bytenr=11271946952704
Backref bytes do not match extent backref, bytenr=11271946919936, ref 
bytes=61440, backref bytes=110592
backpointer mismatch on [11271946919936 61440]
extent item 11271946981376 has multiple extent items
ref mismatch on [11271946981376 110592] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271946981376, ref 
bytenr=11271947063296
backpointer mismatch on [11271946981376 110592]
extent item 11271947091968 has multiple extent items
ref mismatch on [11271947091968 110592] extent item 1, found 2
Backref disk bytenr does not match extent record, bytenr=11271947091968, ref 
bytenr=11271947173888
Backref bytes do not match extent backref, bytenr=11271947091968, ref 
bytes=110592, backref bytes=114688
backpointer mismatch on [11271947091968 110592]
extent item 11271947202560 has multiple 

Re: bad metadata crossing stripe boundary

2016-03-31 Thread Henk Slager
>> Would you please try the following patch based on v4.5 btrfs-progs?
>> https://patchwork.kernel.org/patch/8706891/
>>
>> According to your output, all the output is false alert.
>> All the extent starting bytenr can be divided by 64K, and I think at
>> initial time, its 'max_size' may be set to 0, causing "start + 0 - 1"
>> to be inside previous 64K range.
>>
>> The patch would update cross_stripe every time the extent is updated,
>> so such temporary false alert should disappear.
>
> Applied and no more reports of crossing stripe boundary - thanks.
>
> Will this go into 4.5.1 or 4.5.2?

It is not in 4.5.1
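
The false-alert mechanism described in the quoted explanation can be
reproduced with a few lines of arithmetic. This Python sketch is not the
btrfs-progs code; it only mirrors the quoted "start + max_size - 1" end
computation with an assumed 64 KiB stripe length, using a start value taken
from a report earlier in this thread.

```python
STRIPE_LEN = 64 * 1024  # 64K stripe length, as discussed in the thread


def stripe_of(bytenr):
    """Index of the 64K stripe that contains the given byte number."""
    return bytenr // STRIPE_LEN


def flagged_as_crossing(start, max_size):
    """True if the first and last byte of the extent fall into different stripes."""
    last_byte = start + max_size - 1
    return stripe_of(start) != stripe_of(last_byte)


start = 156041216  # 64K-aligned start from a reported "bad metadata" line

# With max_size still 0, the "last byte" is start - 1, which lies in the
# previous stripe, so an aligned extent is wrongly reported as crossing.
stale_result = flagged_as_crossing(start, 0)

# With the real 16K metadata size the extent stays inside one stripe.
real_result = flagged_as_crossing(start, 16 * 1024)
```

Here `stale_result` is True and `real_result` is False, which matches the
explanation that re-checking after the extent size is updated makes the
temporary alert disappear.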


Re: bad metadata crossing stripe boundary

2016-03-31 Thread Kai Krakow
On Thu, 31 Mar 2016 10:31:49 +0800,
Qu Wenruo wrote:

> Qu Wenruo wrote on 2016/03/31 09:33 +0800:
> >
> >
> > Kai Krakow wrote on 2016/03/28 12:02 +0200:  
> >> Changing subject to reflect the current topic...
> >>
> >> On Sun, 27 Mar 2016 21:55:40 +0800,
> >> Qu Wenruo wrote:
> >>  
>  [...]  
>  [...]  
> >>
> >> No, btrfs-progs 4.5 reports those, too (as far as I understood,
> >> this includes the fixes for bogus "bad metadata" errors, tho I
> >> thought this has already been fixed in 4.2.1, I used 4.4.1). There
> >> were some nbytes wrong errors before which I already repaired
> >> using "--repair". I think that's okay, I had those in the past and
> >> it looks like btrfsck can repair those now (and I don't have to
> >> delete and recreate the files). It caused problems with "du" and
> >> "df" in the past, a problem that I'm currently facing too. So I
> >> better fixed them.
> >>
> >> With that done, the backup fs now only reports "bad metadata" which
> >> have been there before space cache v2. Full output below.
> >>  
>  [...]  
>  [...]  
> >>
> >> Copy and paste problem. Claws mail pretends to be smarter than me
> >> - I missed to fix that one. ;-)  
> >
> > I was searching for the missing '\n' and hopes to find any chance to
> > submit a new patch.
> > What a pity. :(
> >  
> >>  
>  [...]  
> >>
> >> Helped, it automatically reverted the FS back to space cache v1
> >> with incompat flag cleared. (I wouldn't have enabled v2 if it
> >> wasn't documented that this is possible)
> >>  
>  [...]  
>  [...]  
> >>
> >> It's gone now, ignore that. It's back to the situation before space
> >> cache v2. Minus some "nbytes wrong" errors I had and fixed.  
> >
> > Nice to see it works.
> >  
> >>
> >> Nevertheless, I'm now using btrfs-progs 4.5. Here's the full
> >> output: (the lines seem to be partly out of order, probably due to
> >> the redirection)
> >>
> >> $ sudo btrfsck /dev/sde1 2>&1 | tee btrfsck-label-usb-backup.txt
> >> checking extents
> >> bad metadata [156041216, 156057600) crossing stripe boundary
> >> bad metadata [181403648, 181420032) crossing stripe boundary
> >> bad metadata [392167424, 392183808) crossing stripe boundary
> >> bad metadata [783482880, 783499264) crossing stripe boundary
> >> bad metadata [784924672, 784941056) crossing stripe boundary
> >> bad metadata [130151612416, 130151628800) crossing stripe boundary
> >> bad metadata [162826813440, 162826829824) crossing stripe boundary
> >> bad metadata [162927083520, 162927099904) crossing stripe boundary
> >> bad metadata [619740659712, 619740676096) crossing stripe boundary
> >> bad metadata [619781947392, 619781963776) crossing stripe boundary
> >> bad metadata [619795644416, 619795660800) crossing stripe boundary
> >> bad metadata [619816091648, 619816108032) crossing stripe boundary
> >> bad metadata [620011388928, 620011405312) crossing stripe boundary
> >> bad metadata [890992459776, 890992476160) crossing stripe boundary
> >> bad metadata [891022737408, 891022753792) crossing stripe boundary
> >> bad metadata [891101773824, 891101790208) crossing stripe boundary
> >> bad metadata [891301199872, 891301216256) crossing stripe boundary
> >> bad metadata [1012219314176, 1012219330560) crossing stripe boundary
> >> bad metadata [1017202409472, 1017202425856) crossing stripe boundary
> >> bad metadata [1017365397504, 1017365413888) crossing stripe boundary
> >> bad metadata [1020764422144, 1020764438528) crossing stripe boundary
> >> bad metadata [1251103342592, 1251103358976) crossing stripe boundary
> >> bad metadata [1251145809920, 1251145826304) crossing stripe boundary
> >> bad metadata [1251147055104, 1251147071488) crossing stripe boundary
> >> bad metadata [1259271225344, 1259271241728) crossing stripe boundary
> >> bad metadata [1266223611904, 1266223628288) crossing stripe boundary
> >> bad metadata [1304750063616, 130475008) crossing stripe boundary
> >> bad metadata [1304790106112, 1304790122496) crossing stripe boundary
> >> bad metadata [1304850792448, 1304850808832) crossing stripe boundary
> >> bad metadata [1304869928960, 1304869945344) crossing stripe boundary
> >> bad metadata [1305089540096, 1305089556480) crossing stripe boundary
> >> bad metadata [1309581443072, 1309581459456) crossing stripe boundary
> >> bad metadata [1309583671296, 1309583687680) crossing stripe boundary
> >> bad metadata [1309942808576, 1309942824960) crossing stripe boundary
> >> bad metadata [1310050549760, 1310050566144) crossing stripe boundary
> >> bad metadata [1313031585792, 1313031602176) crossing stripe boundary
> >> bad metadata [1313232912384, 1313232928768) crossing stripe boundary
> >> bad metadata [1555210764288, 1555210780672) crossing stripe boundary
> >> bad metadata [1555395182592, 1555395198976) crossing stripe boundary
> >> bad metadata [205057678, 2050576760832) crossing stripe boundary
> >> bad metadata [2050803957760, 

Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Liu Bo
On Thu, Mar 31, 2016 at 07:18:55AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-03-30 20:32, Liu Bo wrote:
> >On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> >>Hi all,
> >>
> >>Christoph and I have been working on adding reflink and CoW support to
> >>XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
> >>that future file writes cannot ENOSPC, I extended the XFS fallocate
> >>handler to unshare any shared blocks via the copy on write mechanism I
> >>built for it.  However, Christoph shared the following concerns with
> >>me about that interpretation:
> >>
> >>>I know that I suggested unsharing blocks on fallocate, but it turns out
> >>>this is causing problems.  Applications expect falloc to be a fast
> >>>metadata operation, and copying a potentially large number of blocks
> >>>is against that expectation.  This is especially bad for the NFS
> >>>server, which should not be blocked for a long time in a synchronous
> >>>operation.
> >>>
> >>>I think we'll have to remove the unshare and just fail the fallocate
> >>>for a reflinked region for now.  I still think it makes sense to expose
> >>>an unshare operation, and we probably should make that another
> >>>fallocate mode.
> >
> >I'm expecting fallocate to be fast, too.
> >
> >Well, btrfs fallocate doesn't allocate space if it's a shared one
> >because it thinks the space is already allocated.  So a later overwrite
> >over this shared extent may hit enospc errors.
> And this _really_ should get fixed, otherwise glibc will add a check for
> running posix_fallocate against BTRFS and force emulation, and people _will_
> complain about performance.

Even if glibc adds a check like that and emulates fallocate by writing
zeros to real blocks, btrfs still does cow and requests to allocate space
for new writes, so it's not only a performance problem: you can also
still get ENOSPC in extreme cases.

Thanks,

-liubo
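
For readers following along, this is roughly what the userspace side of a
mode-0 preallocation looks like. The sketch uses Python's
`os.posix_fallocate`, which wraps the C library's posix_fallocate(); the
helper names are made up for illustration. Note that it only requests the
reservation: on a CoW filesystem with shared extents, as discussed above,
a successful return may still not protect a later overwrite from ENOSPC,
and this sketch makes no attempt to detect that.

```python
import os
import tempfile


def preallocate(path, length):
    """Reserve `length` bytes for `path` starting at offset 0; return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Raises OSError (e.g. ENOSPC) if the space cannot be reserved.
        os.posix_fallocate(fd, 0, length)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def demo(length=1 << 20):
    """Preallocate a scratch file in a temporary directory and report its size."""
    with tempfile.TemporaryDirectory() as scratch:
        return preallocate(os.path.join(scratch, "prealloc.bin"), length)
```

If the underlying filesystem has no native fallocate support, the C library
may fall back to writing to every block, which is the emulation path
mentioned in the quoted text.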


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Andreas Dilger
On Mar 31, 2016, at 12:08 PM, J. Bruce Fields  wrote:
> 
> On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
>> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
>>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
>>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
>>>>> Or is it ok that fallocate could block, potentially for a long time as
>>>>> we stream cows through the page cache (or however unshare works
>>>>> internally)?  Those same programs might not be expecting fallocate to
>>>>> take a long time.
>>>> 
>>>> Yes, it's perfectly fine for fallocate to block for long periods of
>>>> time. See what gfs2 does during preallocation of blocks - it ends up
>>>> calling sb_issue_zeroout() because it doesn't have unwritten
>>>> extents, and hence can block for long periods of time
>>> 
>>> gfs2 fallocate is an implementation that will cause all but the most
>>> trivial users real pain.  Even the initial XFS implementation just
>>> marking the transactions synchronous made it unusable for all kinds
>>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
>>> to gfs2 will probably hang your connection for extended periods of
>>> time.
>>> 
>>> If we need to support something like what gfs2 does we should have a
>>> separate flag for it.
>> 
>> Using fallocate() for preallocation was always intended to
>> be a faster, more efficient method of allocating zeroed space
>> than having userspace write blocks of data. Faster, more efficient
>> does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
>> that if the hardware has zeroing offloads (deterministic trim, write
>> same, etc) it will use them, and that will be much faster than
>> writing zeros from userspace.
>> 
>> IMO, what gfs2 does is definitely within the intended usage of
>> fallocate() for accelerating the preallocation of blocks.
>> 
>> Yes, it may not be optimal for things like NFS servers which haven't
>> considered that a fallocate based offload operation might take some
>> time to execute, but that's not a problem with fallocate. i.e.
>> that's a problem with the nfs server ALLOCATE implementation not
>> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
>> and timeouts while the operation is run
> 
> That's an interesting idea, but I don't think it's really legal.  I take
> JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> later and it might succeed", not "OK, I'm working on it, try again and
> you may find out I've done it".
> 
> So if the client gets a JUKEBOX error but the server goes ahead and does
> the operation anyway, that'd be unexpected.

Well, the tape continued to be mounted in the background and/or the file
restored from the tape into the filesystem...

> I suppose it's comparable to the case where a slow fallocate is
> interrupted--would it be legal to return EINTR in that case and leave
> the application to sort out whether some part of the allocation had
> already happened?

If the later fallocate() was not re-doing the same work as the first one,
it should be fine for the client to re-send the fallocate() request.  The
fallocate() to reserve blocks does not touch the blocks that are already
allocated, so this is safe to do even if another process is writing to the
file.  If you have multiple processes writing and calling fallocate() with
PUNCH/ZERO/COLLAPSE/INSERT to overlapping regions at the same time then
the application is in for a world of hurt already.

> Would it be legal to continue the fallocate under the covers even after
> returning EINTR?

That might produce unexpected results in some cases, but it depends on
the options used.  Probably the safest is to not continue, and depend on
userspace to retry the operation on EINTR.  For fallocate() doing prealloc
or punch or zero this should eventually complete even if it is slow.
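
The "retry on EINTR" approach could look like the following Python sketch.
All names here are hypothetical, and the wrapper is only appropriate for
operations that are safe to repeat, such as the block-reserving fallocate
described above. Python's own `os` wrappers already retry interrupted
system calls (PEP 475), so the pattern is shown against a generic callable
rather than a specific syscall.

```python
import errno


def retry_on_eintr(operation, max_attempts=5):
    """Call `operation` until it completes without being interrupted.

    `operation` must be idempotent: repeating it after a partial run has to
    be harmless, as with preallocation of already-allocated blocks.
    """
    for _ in range(max_attempts):
        try:
            return operation()
        except InterruptedError:
            continue  # interrupted by a signal; simply try again
    raise InterruptedError(errno.EINTR, "operation still interrupted after retries")


class FlakyOperation:
    """Test double that is interrupted a fixed number of times before succeeding."""

    def __init__(self, interruptions):
        self.remaining = interruptions
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise InterruptedError(errno.EINTR, "interrupted")
        return "done"
```

The attempt limit is a design choice for the sketch: it keeps a caller from
spinning forever if the operation never manages to complete between
signals.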

Cheers, Andreas

> But anyway my first inclination is to say that the NFS FALLOCATE
> protocol just wasn't designed to handle long-running fallocates, and if
> we really need that then we need to give it a way to either report
> partial results or to report results asynchronously.
> 
> --b.








contents incomplete (?)

2016-03-31 Thread Jeff Mahoney
Hi all -

I recently spent some time writing up btrfs ioctl decoding for strace to
be used in debugging various third party tools (e.g. snapper) in their
interactions with btrfs.[1]  In doing so, I found that substantial parts
of the tree interface haven't been exported via .  For my
purposes, it was mostly flags and two structures passed in as ioctl
arguments. All things that are intended to be part of the API.  But as I
started to write up a patch to remedy that, I realized that we've
basically exported most of the file system internals via the SEARCH_TREE
ioctl.  In fact, the SEARCH_TREE ioctl is the intended interface for a
number of features (e.g. qgroup status).  As a result, pretty much every
item type, objectid values, etc are part of the API and should be
published via  for consumers.

I'm happy to do the lifting to move things around to make this happen,
but before I do, I wanted to ask:

Is it expected that consumers of the interface at this low a level will
provide their own copies of the structures, flags, types, and objectid
values they want to consume?  If so, why?  These are all on-disk format
values and effectively set in stone.  At least the ones that are already
defined.  Wouldn't things like new flags and reserved values being
claimed cause API drift pretty quickly and put the maintenance burden on
the third party developer?

-Jeff

[1] It's also pretty neat in that you can just write a quick program to
run an ioctl and not have to bother decoding most of it -- just look at
the strace output.  I do provide symbolic names for tree ids and key
types, but don't decode the item contents.

-- 
Jeff Mahoney
SUSE Labs





Re: "bad metadata" not fixed by btrfs repair

2016-03-31 Thread Henk Slager
On Mon, Mar 28, 2016 at 4:37 PM, Marc Haber  wrote:
> Hi,
>
> I have a btrfs which btrfs check --repair doesn't fix:
>
> # btrfs check --repair /dev/mapper/fanbtr
> bad metadata [4425377054720, 4425377071104) crossing stripe boundary
> bad metadata [4425380134912, 4425380151296) crossing stripe boundary
> bad metadata [4427532795904, 4427532812288) crossing stripe boundary
> bad metadata [4568321753088, 4568321769472) crossing stripe boundary
> bad metadata [4568489656320, 4568489672704) crossing stripe boundary
> bad metadata [4571474493440, 4571474509824) crossing stripe boundary
> bad metadata [4571946811392, 4571946827776) crossing stripe boundary
> bad metadata [4572782919680, 4572782936064) crossing stripe boundary
> bad metadata [4573086351360, 4573086367744) crossing stripe boundary
> bad metadata [4574221041664, 4574221058048) crossing stripe boundary
> bad metadata [4574373412864, 4574373429248) crossing stripe boundary
> bad metadata [4574958649344, 4574958665728) crossing stripe boundary
> bad metadata [4575996018688, 4575996035072) crossing stripe boundary
> bad metadata [4580376772608, 4580376788992) crossing stripe boundary

In this case, for all  ... [X,Y) ...

X is 64K aligned and Y - X = 16K

So also false alerts.
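
The arithmetic above is easy to reproduce. This is a minimal sketch,
assuming the 64K stripe length mentioned in this thread; crosses_stripe is a
hypothetical helper, not a btrfs-progs function. A half-open range [X, Y)
crosses a boundary only if its first and last byte land in different 64K
stripes. The last entry in the list is a made-up range that really does
cross, for contrast.

```python
STRIPE_LEN = 64 * 1024  # 64K, as discussed in this thread


def crosses_stripe(start, end, stripe_len=STRIPE_LEN):
    """True if the half-open byte range [start, end) spans two stripes."""
    return start // stripe_len != (end - 1) // stripe_len


ranges = [
    (4425377054720, 4425377071104),  # first reported range
    (4580376772608, 4580376788992),  # last reported range
    (61440, 77824),                  # made-up range that does cross
]
for start, end in ranges:
    print(start % STRIPE_LEN, end - start, crosses_stripe(start, end))
```

For the two reported ranges the start offset modulo 64K is 0 and the length
is 16K, so neither can reach the next boundary.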


Re: "bad metadata" not fixed by btrfs repair

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 4:23 AM, Qu Wenruo  wrote:
>
>
> Henk Slager wrote on 2016/03/30 16:03 +0200:
>>
>> On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo 
>> wrote:
>>>
>>> First of all.
>>>
>>> The "crossing stripe boundary" error message itself is *HARMLESS* for
>>> recent kernels.
>>>
>>> It only means that the metadata extent won't be checked by scrub on
>>> recent kernels, because the scrub code has a limitation: it can only
>>> check tree blocks which are inside a 64K block.
>>>
>>> An old kernel won't have anything wrong until that tree block is being
>>> scrubbed. When scrubbed, an old kernel just hits BUG_ON().
>>>
>>> Recent kernels handle this limitation by checking extent allocation
>>> and avoiding crossing the boundary, so a newly created fs on a new
>>> kernel won't cause such an error message at all.
>>>
>>> For an fs created earlier, the problem can't be avoided, but at least new
>>> kernels will not BUG_ON() when you scrub these extents; they just get
>>> ignored (not that good, but at least no BUG_ON).
>>>
>>> And the new fsck checks for such cases and gives this warning.
>>>
>>> Overall, you're OK if you are using recent kernels.
>>>
>>> Marc Haber wrote on 2016/03/29 08:43 +0200:
>>>>
>>>> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>>>>>
>>>>> Did you convert this filesystem from ext4 (or ext3)?
>>>>
>>>> No.
>>>>
>>>>> You hadn't mentioned what version of btrfs-progs you're using, and that
>>>>> is somewhat important for recovery.  I'm not sure if current versions of
>>>>> btrfs check can fix this issue, but I know for a fact that older versions
>>>>> (prior to at least 4.1) can not fix it.
>>>>
>>>> 4.1 for creation and btrfs check.
>>>
>>>
>>>
>>> I assume that you have run an older kernel on it, like v4.1 or v4.2.
>>>
>>> Those old kernels lack the check that avoids such extent allocations.
>>>

>>>>> As far as what the kernel is involved with, the easy way to check is if
>>>>> it's operating on a mounted filesystem or not.  If it only operates on
>>>>> mounted filesystems, it almost certainly goes through the kernel, if it
>>>>> only operates on unmounted filesystems, it's almost certainly done in
>>>>> userspace (except dev scan and technically fi show).
>>>>
>>>> Then btrfs check is a userspace-only matter, as it wants the fs
>>>> unmounted, and it is irrelevant that I did btrfs check from a rescue
>>>> system with an older kernel, 3.16 if I recall correctly.
>>>
>>>
>>>
>>> It is not recommended to use an older kernel to mount RW or an older fsck
>>> to do repair, as an older kernel/btrfsck may allocate extents that cross
>>> the 64K boundary.
>>>

>>>>> 2. Regarding general support:  If you're using an enterprise
>>>>> distribution (RHEL, SLES, CentOS, OEL, or something similar), you are
>>>>> almost certainly going to get better support from your vendor than from
>>>>> the mailing list or IRC.
>>>>
>>>> My "productive" desktops (fan is one of them) run Debian unstable with
>>>> a current vanilla kernel. At the moment, I can't use 4.5 because it
>>>> acts up with KVM.  When I need a rescue system, I use grml, which
>>>> unfortunately hasn't released since November 2014 and is still with
>>>> kernel 3.16
>>>
>>>
>>>
>>> To fix your problem (make these error messages disappear, even though
>>> they are harmless on recent kernels), the easiest way is to balance your
>>> metadata.
>>
>>
>> I did a balance with filter -musage=100  (kernel/tools 4.5/4.5) of the
>> filesystem mentioned in here:
>> http://www.spinics.net/lists/linux-btrfs/msg51405.html
>>
>> but still   bad metadata [ ),  crossing stripe boundary   messages,
>> double amount compared to 2 months ago
>>
>> Kernel operating this fs has always been maximum 1 month behind
>> 'Latest Stable Kernel' (kernel.org terminology)
>
>
> Would you please try the following patch?
> https://patchwork.kernel.org/patch/8706891/
>
> It is based on v4.5 and I think it should fix the false alert.

I have applied the patch to 4.5 and ran btrfs check again; now there are no
"bad metadata [ ), crossing stripe boundary" messages anymore
(and no other errors either).
Thanks.

> Thanks,
> Qu
>
>
>>
>>> As I explained, the bug only lies in metadata, and balance will allocate
>>> new tree blocks, then copy the old data into the new locations.
>>>
>>> The allocation process of a recent kernel avoids crossing the boundary,
>>> which fixes your problem.
>>>
>>> But if you are using old kernels, don't scrub your metadata.
>>>
>>> Thanks,
>>> Qu



>>>> Greetings
>>>> Marc

>>>
>>>
>>
>>
>>
>
>

Re: "bad metadata" not fixed by btrfs repair

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 2:28 AM, Qu Wenruo  wrote:
>
>
> Henk Slager wrote on 2016/03/30 16:03 +0200:
>>
>> On Wed, Mar 30, 2016 at 9:00 AM, Qu Wenruo 
>> wrote:
>>>
>>> First of all.
>>>
>>> The "crossing stripe boundary" error message itself is *HARMLESS* for
>>> recent kernels.
>>>
>>> It only means that the metadata extent won't be checked by scrub on
>>> recent kernels, because the scrub code has a limitation: it can only
>>> check tree blocks which are inside a 64K block.
>>>
>>> An old kernel won't have anything wrong until that tree block is being
>>> scrubbed. When scrubbed, an old kernel just hits BUG_ON().
>>>
>>> Recent kernels handle this limitation by checking extent allocation
>>> and avoiding crossing the boundary, so a newly created fs on a new
>>> kernel won't cause such an error message at all.
>>>
>>> For an fs created earlier, the problem can't be avoided, but at least new
>>> kernels will not BUG_ON() when you scrub these extents; they just get
>>> ignored (not that good, but at least no BUG_ON).
>>>
>>> And the new fsck checks for such cases and gives this warning.
>>>
>>> Overall, you're OK if you are using recent kernels.
>>>
>>> Marc Haber wrote on 2016/03/29 08:43 +0200:
>>>>
>>>> On Mon, Mar 28, 2016 at 03:35:32PM -0400, Austin S. Hemmelgarn wrote:
>>>>>
>>>>> Did you convert this filesystem from ext4 (or ext3)?
>>>>
>>>> No.
>>>>
>>>>> You hadn't mentioned what version of btrfs-progs you're using, and that
>>>>> is somewhat important for recovery.  I'm not sure if current versions of
>>>>> btrfs check can fix this issue, but I know for a fact that older versions
>>>>> (prior to at least 4.1) can not fix it.
>>>>
>>>> 4.1 for creation and btrfs check.
>>>
>>>
>>>
>>> I assume that you have run an older kernel on it, like v4.1 or v4.2.
>>>
>>> Those old kernels lack the check that avoids such extent allocations.
>>>

>>>>> As far as what the kernel is involved with, the easy way to check is if
>>>>> it's operating on a mounted filesystem or not.  If it only operates on
>>>>> mounted filesystems, it almost certainly goes through the kernel, if it
>>>>> only operates on unmounted filesystems, it's almost certainly done in
>>>>> userspace (except dev scan and technically fi show).
>>>>
>>>> Then btrfs check is a userspace-only matter, as it wants the fs
>>>> unmounted, and it is irrelevant that I did btrfs check from a rescue
>>>> system with an older kernel, 3.16 if I recall correctly.
>>>
>>>
>>>
>>> It is not recommended to use an older kernel to mount RW or an older fsck
>>> to do repair, as an older kernel/btrfsck may allocate extents that cross
>>> the 64K boundary.
>>>

>>>>> 2. Regarding general support:  If you're using an enterprise
>>>>> distribution (RHEL, SLES, CentOS, OEL, or something similar), you are
>>>>> almost certainly going to get better support from your vendor than from
>>>>> the mailing list or IRC.
>>>>
>>>> My "productive" desktops (fan is one of them) run Debian unstable with
>>>> a current vanilla kernel. At the moment, I can't use 4.5 because it
>>>> acts up with KVM.  When I need a rescue system, I use grml, which
>>>> unfortunately hasn't released since November 2014 and is still with
>>>> kernel 3.16
>>>
>>>
>>>
>>> To fix your problem (make these error messages disappear, even though
>>> they are harmless on recent kernels), the easiest way is to balance your
>>> metadata.
>>
>>
>> I did a balance with filter -musage=100  (kernel/tools 4.5/4.5) of the
>> filesystem mentioned in here:
>> http://www.spinics.net/lists/linux-btrfs/msg51405.html
>>
>> but still   bad metadata [ ),  crossing stripe boundary   messages,
>> double amount compared to 2 months ago
>
>
> Would you please give an example of the output?
> So I can check if it's really crossing the boundary.

This is the 1st one of the 105 messages:
bad metadata [8263437058048, 8263437062144) crossing stripe boundary

For all ... [X,Y) ...
X is 64K aligned and Y - X = 4K

So in my case, all false alerts.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Darrick J. Wong
On Thu, Mar 31, 2016 at 02:08:21PM -0400, J. Bruce Fields wrote:
> On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > > > Or is it ok that fallocate could block, potentially for a long time as
> > > > > we stream cows through the page cache (or however unshare works
> > > > > internally)?  Those same programs might not be expecting fallocate to
> > > > > take a long time.
> > > > 
> > > > Yes, it's perfectly fine for fallocate to block for long periods of
> > > > time. See what gfs2 does during preallocation of blocks - it ends up
> > > > > calling sb_issue_zeroout() because it doesn't have unwritten
> > > > extents, and hence can block for long periods of time
> > > 
> > > gfs2 fallocate is an implementation that will cause all but the most
> > > trivial users real pain.  Even the initial XFS implementation just
> > > marking the transactions synchronous made it unusable for all kinds
> > > of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> > > > to gfs2 will probably hang your connection for extended periods of
> > > time.
> > > 
> > > If we need to support something like what gfs2 does we should have a
> > > separate flag for it.
> > 
> > > Using fallocate() for preallocation was always intended to
> > > be a faster, more efficient method of allocating zeroed space
> > > than having userspace write blocks of data. Faster, more efficient
> > > does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
> > > that if the hardware has zeroing offloads (deterministic trim, write
> > > same, etc) it will use them, and that will be much faster than
> > > writing zeros from userspace.
> > > 
> > > IMO, what gfs2 does is definitely within the intended usage of
> > > fallocate() for accelerating the preallocation of blocks.
> > 
> > Yes, it may not be optimal for things like NFS servers which haven't
> > considered that a fallocate based offload operation might take some
> > time to execute, but that's not a problem with fallocate. i.e.
> > that's a problem with the nfs server ALLOCATE implementation not
> > being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> > and timeouts while the operation is run
> 
> That's an interesting idea, but I don't think it's really legal.  I take
> JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> later and it might succeed", not "OK, I'm working on it, try again and
> you may find out I've done it".
> 
> So if the client gets a JUKEBOX error but the server goes ahead and does
> the operation anyway, that'd be unexpected.
> 
> I suppose it's comparable to the case where a slow fallocate is
> interrupted--would it be legal to return EINTR in that case and leave
> the application to sort out whether some part of the allocation had
> already happened?

 The unshare component to XFS fallocate does this if something
sends a fatal signal to the process.  There's a difference between
shooting down a process in the middle of fallocate and fallocate
returning EINTR out of the blue, though...

...the manpage for fallocate says that "EINTR == a signal was caught".

> Would it be legal to continue the fallocate under the covers even
> after returning EINTR?

It doesn't do that, however.

--D

> But anyway my first inclination is to say that the NFS FALLOCATE
> protocol just wasn't designed to handle long-running fallocates, and if
> we really need that then we need to give it a way to either report
> partial results or to report results asynchronously.
> 
> --b.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread J. Bruce Fields
On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > > Or is it ok that fallocate could block, potentially for a long time as
> > > > we stream cows through the page cache (or however unshare works
> > > > internally)?  Those same programs might not be expecting fallocate to
> > > > take a long time.
> > > 
> > > Yes, it's perfectly fine for fallocate to block for long periods of
> > > time. See what gfs2 does during preallocation of blocks - it ends up
> > > > calling sb_issue_zeroout() because it doesn't have unwritten
> > > extents, and hence can block for long periods of time
> > 
> > gfs2 fallocate is an implementation that will cause all but the most
> > trivial users real pain.  Even the initial XFS implementation just
> > marking the transactions synchronous made it unusable for all kinds
> > of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> > > to gfs2 will probably hang your connection for extended periods of
> > time.
> > 
> > If we need to support something like what gfs2 does we should have a
> > separate flag for it.
> 
> > Using fallocate() for preallocation was always intended to
> > be a faster, more efficient method of allocating zeroed space
> > than having userspace write blocks of data. Faster, more efficient
> > does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
> > that if the hardware has zeroing offloads (deterministic trim, write
> > same, etc) it will use them, and that will be much faster than
> > writing zeros from userspace.
> > 
> > IMO, what gfs2 does is definitely within the intended usage of
> > fallocate() for accelerating the preallocation of blocks.
> 
> Yes, it may not be optimal for things like NFS servers which haven't
> considered that a fallocate based offload operation might take some
> time to execute, but that's not a problem with fallocate. i.e.
> that's a problem with the nfs server ALLOCATE implementation not
> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> and timeouts while the operation is run

That's an interesting idea, but I don't think it's really legal.  I take
JUKEBOX to mean "sorry, I'm failing this operation for now, try again
later and it might succeed", not "OK, I'm working on it, try again and
you may find out I've done it".

So if the client gets a JUKEBOX error but the server goes ahead and does
the operation anyway, that'd be unexpected.

I suppose it's comparable to the case where a slow fallocate is
interrupted--would it be legal to return EINTR in that case and leave
the application to sort out whether some part of the allocation had
already happened?  Would it be legal to continue the fallocate under the
covers even after returning EINTR?

But anyway my first inclination is to say that the NFS FALLOCATE
protocol just wasn't designed to handle long-running fallocates, and if
we really need that then we need to give it a way to either report
partial results or to report results asynchronously.

--b.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Henk Slager
On Thu, Mar 31, 2016 at 5:31 PM, Andreas Dilger  wrote:
> On Mar 31, 2016, at 1:55 AM, Christoph Hellwig  wrote:
>>
>> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>>> because it thinks the space is already allocated.  So a later overwrite
>>> over this shared extent may hit enospc errors.
>>
>> And this makes it an incorrect implementation of posix_fallocate,
>> which glibcs implements using fallocate if available.
>
> It isn't really useful for a COW filesystem to implement fallocate()
> to reserve blocks.  Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these blocks
> new blocks need to be allocated as the old ones will not be overwritten.

There are also use-cases on BTRFS with CoW disabled, like operations
on virtual machine images that aren't snapshotted.
Those files tend to be big and having fallocate() implemented and
working like for e.g. XFS, in order to achieve space and speed
efficiency, makes sense IMHO.
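
As a rough illustration of that use case, preallocating an image file from
userspace could look like the sketch below. It only shows the
posix_fallocate() call itself; the file name and the 8 MiB size are
placeholders, it does not disable CoW, and whether the reservation survives
a later overwrite depends on the filesystem, which is the point under
discussion in this thread.

```python
import os
import tempfile

SIZE = 8 * 1024 * 1024  # small stand-in for a much larger VM image

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "disk.img")  # hypothetical image name
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to reserve the whole range up front.
        os.posix_fallocate(fd, 0, SIZE)
        st = os.fstat(fd)
    finally:
        os.close(fd)

# st_blocks is counted in 512-byte units.
print(st.st_size, st.st_blocks * 512)
```

Comparing the reported size with the allocated bytes is a quick way to see
whether the filesystem actually backed the range with blocks.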


Re: [PATCH] btrfs: Reset IO error counters before start of device replacing

2016-03-31 Thread David Sterba
On Tue, Mar 29, 2016 at 02:17:48PM -0700, Yauhen Kharuzhy wrote:
> If device replace entry was found on disk at mounting and its num_write_errors
> stats counter has non-NULL value, then replace operation will never be
> finished and -EIO error will be reported by btrfs_scrub_dev() because
> this counter is never reset.
> 
>  # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
>  # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
>  Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs
>  # btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
>  ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error
> 
> Reset num_write_errors and num_uncorrectable_read_errors counters in the
> dev_replace structure before start of replacing.
> 
> Signed-off-by: Yauhen Kharuzhy 

Reviewed-by: David Sterba 


Re: [RESEND][PATCH] btrfs: Add qgroup tracing

2016-03-31 Thread David Sterba
On Tue, Mar 29, 2016 at 05:19:55PM -0700, Mark Fasheh wrote:
> This patch adds tracepoints to the qgroup code on both the reporting side
> (insert_dirty_extents) and the accounting side. Taken together it allows us
> to see what qgroup operations have happened, and what their result was.
> 
> Signed-off-by: Mark Fasheh 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: don't use src fd for printk

2016-03-31 Thread David Sterba
On Fri, Mar 25, 2016 at 10:02:41AM -0400, Josef Bacik wrote:
> The fd we pass in may not be on a btrfs file system, so don't try to do
> BTRFS_I() on it.  Thanks,
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba 


Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning

2016-03-31 Thread David Sterba
On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
> At least 2 users from the mailing list reported that btrfsck gave a false
> alert of "bad metadata [,) crossing stripe boundary".
> 
> The reported ranges are all inside the same 64K boundary.
> After some checking, all the false alerts share the same bytenr feature:
> the start can be divided by the stripe size (64K).
> 
> The cause seems to be that the initial 'max_size' can be 0, which makes
> 'start' + 'max_size' - 1 cross the stripe boundary.
> 
> Fix it by always updating extent_record->cross_stripe when the
> extent_record is updated, so that no temporary false alert is reported.
> 
> Signed-off-by: Qu Wenruo 
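
The failure mode described in the patch can be modelled in a few lines.
This is a sketch of the description only, not the btrfs-progs code:
ExtentRecord, crosses and the recompute switch are hypothetical names, and
the start value is the 64K-aligned bytenr reported earlier in this thread.
With a size of 0, the "last byte" is start - 1, which sits in the previous
stripe, so the flag comes out true unless it is recomputed once the real
size is known.

```python
STRIPE_LEN = 64 * 1024


def crosses(start, size):
    """Compare the stripe of the first byte with that of the last byte."""
    return start // STRIPE_LEN != (start + size - 1) // STRIPE_LEN


class ExtentRecord:
    """Hypothetical stand-in for the extent_record named in the patch."""

    def __init__(self, start, max_size=0):
        self.start = start
        self.max_size = max_size
        self.cross_stripe = crosses(start, max_size)  # stale if size is 0

    def update(self, max_size, recompute):
        self.max_size = max(self.max_size, max_size)
        if recompute:  # the behaviour the patch description asks for
            self.cross_stripe = crosses(self.start, self.max_size)


old = ExtentRecord(8263437058048)
old.update(4096, recompute=False)
new = ExtentRecord(8263437058048)
new.update(4096, recompute=True)
print(old.cross_stripe, new.cross_stripe)
```

Only records whose start is exactly stripe-aligned are affected in this
model, which matches the observation that every false alert had a start
divisible by 64K.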

Applied, thanks.

Do you have a test image for that?


Re: [PATCH v9 00/19] Btrfs dedupe framework

2016-03-31 Thread David Sterba
On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> This March 30th patchset update mostly addresses the patchset structure
> comment from David:
> 1) Change the patchset sequence
>    Now, if only the first 14 patches are applied, they provide the full
>    backward-compatible, in-memory-only dedupe backend.
>
>    Only from patch 15 onward is the on-disk format changed.
> 
>    So patches 1~14 are going to be pushed for the next merge window, while
>    I'll still submit them all for review purposes.

I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option
until the interface is settled.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Austin S. Hemmelgarn

On 2016-03-31 11:31, Andreas Dilger wrote:
> On Mar 31, 2016, at 1:55 AM, Christoph Hellwig  wrote:
>>
>> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>>> because it thinks the space is already allocated.  So a later overwrite
>>> over this shared extent may hit enospc errors.
>>
>> And this makes it an incorrect implementation of posix_fallocate,
>> which glibcs implements using fallocate if available.
>
> It isn't really useful for a COW filesystem to implement fallocate()
> to reserve blocks.  Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these blocks
> new blocks need to be allocated as the old ones will not be overwritten.
>
> Because of snapshots that could hold references to the old blocks,
> there isn't even the guarantee that the previous fallocated blocks will
> be released in a reasonable time to free up an equal amount of space.


That really depends on how it's done.  AFAIK, unwritten extents on BTRFS
are block reservations which make sure that you can write there (IOW,
the unwritten extent gets converted to a regular extent in-place, not
via COW).  This means that it is possible to guarantee that the first
write to that area will work, which is technically all that POSIX
requires.  This in turn means that stuff like SystemD and RDBMS software
don't exactly see things working as they expect them to, but that's
because they make assumptions based on existing technology.




Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Andreas Dilger
On Mar 31, 2016, at 1:55 AM, Christoph Hellwig  wrote:
> 
> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>> Well, btrfs fallocate doesn't allocate space if it's a shared one
>> because it thinks the space is already allocated.  So a later overwrite
>> over this shared extent may hit enospc errors.
> 
> And this makes it an incorrect implementation of posix_fallocate,
> which glibcs implements using fallocate if available.

It isn't really useful for a COW filesystem to implement fallocate()
to reserve blocks.  Even if it did allocate all of the blocks on the
initial fallocate() call, when it comes time to overwrite these blocks
new blocks need to be allocated as the old ones will not be overwritten.

Because of snapshots that could hold references to the old blocks,
there isn't even the guarantee that the previous fallocated blocks will
be released in a reasonable time to free up an equal amount of space.

Cheers, Andreas









Btrfs progs release 4.5.1

2016-03-31 Thread David Sterba
Hi,

btrfs-progs 4.5.1 has been released.  A bugfix release with several build
fixes, a few minor fixes, and improvements or preparatory work that do not
need to be delayed.

There's one user visible change:

* mkfs: allow DUP on multi-device filesystem

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (6):
  btrfs-progs: rearrange subvolume functions together
  btrfs-progs: move test_issubvolume() to utils.c
  btrfs-progs: remove duplicate function __is_subvol()
  btrfs-progs: move get_subvol_name() to utils.c
  btrfs-progs: create get_subvol_info()
  btrfs-progs: rename get_subvol_name() to subvol_strip_mountpoint()

Austin S. Hemmelgarn (1):
  btrfs-progs: fix fi du so it works in more cases

David Sterba (14):
  btrfs-progs: fragments: fix build
  btrfs-progs: utils: make more arguments const
  btrfs-progs: cleanup block group helpers types
  btrfs-progs: fix build of standalone utilities after clean
  btrfs-progs: tests: introduce mustfail helper
  btrfs-progs: tests: add misc 014-filesystem-label
  btrfs-progs: make error message from add_clone_source more generic
  btrfs-progs: mkfs: allow DUP on multidev fs, only warn
  btrfs-progs: rename __strncpy__null to __strncpy_null
  btrfs-progs: use safe copy for label buffer everywhere
  btrfs-progs: fix fd leak in get_subvol_info
  btrfs-progs: tests: update 001-basic-profiles, dup on multidev fs
  btrfs-progs: docs: update mkfs page for dup on multidev fs
  Btrfs progs v4.5.1

Julio Montes (1):
  btrfs-progs: fix unknown type name 'u64' in gccgo

Noah Massey (1):
  btrfs-progs: build: fix static standalone utilities

Petros Angelatos (1):
  btrfs-progs: utils: make sure set_label_mounted uses correct length buffers

Satoru Takeuchi (1):
  btrfs-progs: mkfs: fix an error when using DUP on multidev fs

Tsutomu Itoh (1):
  btrfs-progs: send: fix handling of multiple snapshots


Re: [PATCH] btrfs-progs: fix unknown type name 'u64' in gccgo

2016-03-31 Thread David Sterba
On Tue, Mar 29, 2016 at 03:34:48PM -0600, Julio Montes wrote:
> From: Julio Montes 
> 
> Signed-off-by: Julio Montes 

Applied, thanks.


Re: [PATCH V15 00/15] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size

2016-03-31 Thread David Sterba
On Thu, Mar 31, 2016 at 11:31:06AM +0200, David Sterba wrote:
> On Tue, Mar 22, 2016 at 06:50:32PM +0530, Chandan Rajendra wrote:
> > On Tuesday 22 Mar 2016 12:04:23 David Sterba wrote:
> > > On Thu, Feb 11, 2016 at 11:17:38PM +0530, Chandan Rajendra wrote:
> > > > this patchset temporarily disables the commit
> > > > f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
> > > > 
> > > > The commits for the Btrfs kernel module can be found at
> > > > https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.
> > > 
> > > The branch does not apply cleanly to at least 4.5, I've tried to rebase
> > > it but there are conflicts that are not simple. Please update it on top
> > > of current master, ie. with the preparatory patchset merged.
> > 
> > I will rebase the branch and post the patchset soon.
> 
> JFYI, I've seen some minor compilation failures:
> - __do_readpage: unused variable cached
> - end_bio_extent_buffer_readpage: btree_readahead_hook must take fs_info
> - fails build with (config) sanity checks enabled, the fs_info moved
>   from eb to eb head

And the tests crash with my quick fixes, so I'll move the branch out of
next for now. Please fix it and let me know. My fixes are on top of your
latest branch, chandan-subpage-latest in my development gits.


Re: How to cancel btrfs balance on unmounted filesystem

2016-03-31 Thread Marc Haber
On Thu, Mar 31, 2016 at 01:01:37PM +0500, Roman Mamedov wrote:
> On Thu, 31 Mar 2016 08:21:12 +0200
> Marc Haber  wrote:
> > the balance restarts immediately after mounting
> 
> You can use the skip_balance mount option to prevent that.

Thanks. I now have this in all fstabs. On the system in question, I
was able to sneak in a btrfs balance cancel before the system hung
itself.
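
For anyone wanting to do the same across several machines, the fstab
change can be scripted. The following is a minimal sketch, assuming plain
whitespace-separated fstab entries; add_skip_balance is a hypothetical
helper and the sample entries (including the mount point for fanbtr) are
made up for illustration. It rewrites text only and does not touch
/etc/fstab.

```python
def add_skip_balance(fstab_text, option="skip_balance"):
    """Append `option` to the options field of every btrfs entry."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 4 and not line.lstrip().startswith("#")
        if is_entry and fields[2] == "btrfs":
            opts = fields[3].split(",")
            if option not in opts:
                opts.append(option)
            fields[3] = ",".join(opts)
            line = "\t".join(fields)  # rewritten entries become tab-separated
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "# <fs> <mountpoint> <type> <opts> <dump> <pass>\n"
    "/dev/mapper/fanbtr / btrfs defaults,noatime 0 0\n"
    "/dev/sda1 /boot ext4 defaults 0 2\n"
)
result = add_skip_balance(sample)
print(result, end="")
```

Comments and non-btrfs entries pass through unchanged, and running it a
second time does not add the option twice.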

Mar 31 08:17:42 fan kernel: [  240.595465] INFO: task kworker/u16:0:6 blocked for more than 120 seconds.
Mar 31 08:17:42 fan kernel: [  240.595604]   Tainted: GW   4.4.6-zgws1 #2
Mar 31 08:17:42 fan kernel: [  240.595705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 31 08:17:42 fan kernel: [  240.595845] kworker/u16:0   D 88062fc956c0     0 6  2 0x
Mar 31 08:17:42 fan kernel: [  240.595913] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Mar 31 08:17:42 fan kernel: [  240.595919]  88017ca680c0 0002 88017ca78000 88017ca77ca0
Mar 31 08:17:42 fan kernel: [  240.595927]  8800c9388960 0002 81409e1c 88017ca680c0
Mar 31 08:17:42 fan kernel: [  240.595934]  81408329 7fff 81409e5a 00c0a044e7d3
Mar 31 08:17:42 fan kernel: [  240.595941] Call Trace:
Mar 31 08:17:42 fan kernel: [  240.595955]  [] ? usleep_range+0x35/0x35
Mar 31 08:17:42 fan kernel: [  240.595964]  [] ? schedule+0x6f/0x7c
Mar 31 08:17:42 fan kernel: [  240.595973]  [] ? schedule_timeout+0x3e/0x128
Mar 31 08:17:42 fan kernel: [  240.595981]  [] ? cache_alloc+0x1bd/0x277
Mar 31 08:17:42 fan kernel: [  240.595990]  [] ? __wait_for_common+0x121/0x16d
Mar 31 08:17:42 fan kernel: [  240.595997]  [] ? __wait_for_common+0x121/0x16d
Mar 31 08:17:42 fan kernel: [  240.596006]  [] ? wake_up_q+0x3b/0x3b
Mar 31 08:17:42 fan kernel: [  240.596047]  [] ? btrfs_async_run_delayed_refs+0xbf/0xd5 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596093]  [] ? __btrfs_end_transaction+0x291/0x2d5 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596140]  [] ? btrfs_finish_ordered_io+0x418/0x4d7 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596187]  [] ? btrfs_scrubparity_helper+0xf4/0x233 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596198]  [] ? process_one_work+0x178/0x27b
Mar 31 08:17:42 fan kernel: [  240.596206]  [] ? worker_thread+0x1da/0x280
Mar 31 08:17:42 fan kernel: [  240.596213]  [] ? rescuer_thread+0x284/0x284
Mar 31 08:17:42 fan kernel: [  240.596220]  [] ? kthread+0x95/0x9d
Mar 31 08:17:42 fan kernel: [  240.596227]  [] ? kthread_parkme+0x16/0x16
Mar 31 08:17:42 fan kernel: [  240.596234]  [] ? ret_from_fork+0x3f/0x70
Mar 31 08:17:42 fan kernel: [  240.596240]  [] ? kthread_parkme+0x16/0x16
Mar 31 08:17:42 fan kernel: [  240.596272] INFO: task kworker/u16:2:134 blocked for more than 120 seconds.
Mar 31 08:17:42 fan kernel: [  240.596399]   Tainted: GW   4.4.6-zgws1 #2
Mar 31 08:17:42 fan kernel: [  240.596499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 31 08:17:42 fan kernel: [  240.596637] kworker/u16:2   D 88062fcd56c0     0   134  2 0x
Mar 31 08:17:42 fan kernel: [  240.596688] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596692]  8806130e4780 0003 880613108000 880613107ca0
Mar 31 08:17:42 fan kernel: [  240.596699]  8805caa1d960 0002 81409e1c 8806130e4780
Mar 31 08:17:42 fan kernel: [  240.596706]  81408329 7fff 81409e5a 88062fd556c0
Mar 31 08:17:42 fan kernel: [  240.596712] Call Trace:
Mar 31 08:17:42 fan kernel: [  240.596721]  [] ? usleep_range+0x35/0x35
Mar 31 08:17:42 fan kernel: [  240.596728]  [] ? schedule+0x6f/0x7c
Mar 31 08:17:42 fan kernel: [  240.596735]  [] ? schedule_timeout+0x3e/0x128
Mar 31 08:17:42 fan kernel: [  240.596742]  [] ? check_preempt_curr+0x41/0x63
Mar 31 08:17:42 fan kernel: [  240.596750]  [] ? ttwu_do_wakeup+0xf/0xd0
Mar 31 08:17:42 fan kernel: [  240.596757]  [] ? __wait_for_common+0x121/0x16d
Mar 31 08:17:42 fan kernel: [  240.596764]  [] ? __wait_for_common+0x121/0x16d
Mar 31 08:17:42 fan kernel: [  240.596771]  [] ? wake_up_q+0x3b/0x3b
Mar 31 08:17:42 fan kernel: [  240.596812]  [] ? btrfs_async_run_delayed_refs+0xbf/0xd5 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596858]  [] ? __btrfs_end_transaction+0x291/0x2d5 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596904]  [] ? btrfs_finish_ordered_io+0x418/0x4d7 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596952]  [] ? btrfs_scrubparity_helper+0xf4/0x233 [btrfs]
Mar 31 08:17:42 fan kernel: [  240.596960]  [] ? process_one_work+0x178/0x27b
Mar 31 08:17:42 fan kernel: [  240.596968]  [] ? worker_thread+0x1da/0x280
Mar 31 08:17:42 fan kernel: [  240.596976]  [] ? rescuer_thread+0x284/0x284
Mar 31 08:17:42 fan kernel: [  240.596982]  [] ? kthread+0x95/0x9d
Mar 31 08:17:42 fan 

Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Austin S. Hemmelgarn

On 2016-03-31 07:18, Austin S. Hemmelgarn wrote:
> On 2016-03-30 20:32, Liu Bo wrote:
> > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > Hi all,
> > >
> > > Christoph and I have been working on adding reflink and CoW support to
> > > XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
> > > that future file writes cannot ENOSPC, I extended the XFS fallocate
> > > handler to unshare any shared blocks via the copy on write mechanism I
> > > built for it.  However, Christoph shared the following concerns with
> > > me about that interpretation:
> > >
> > > > I know that I suggested unsharing blocks on fallocate, but it turns out
> > > > this is causing problems.  Applications expect falloc to be a fast
> > > > metadata operation, and copying a potentially large number of blocks
> > > > is against that expectation.  This is especially bad for the NFS
> > > > server, which should not be blocked for a long time in a synchronous
> > > > operation.
> > > >
> > > > I think we'll have to remove the unshare and just fail the fallocate
> > > > for a reflinked region for now.  I still think it makes sense to expose
> > > > an unshare operation, and we probably should make that another
> > > > fallocate mode.
> >
> > I'm expecting fallocate to be fast, too.
> >
> > Well, btrfs fallocate doesn't allocate space if it's a shared one
> > because it thinks the space is already allocated.  So a later overwrite
> > over this shared extent may hit enospc errors.
>
> And this _really_ should get fixed, otherwise glibc will add a check for
> running posix_fallocate against BTRFS and force emulation, and people
> _will_ complain about performance.

Thinking a bit further about this, how hard would it be to add the
ability to have unwritten extents point somewhere else for reads?  Then
when we get an fallocate call, we create the unwritten extents, and add
the metadata to make them read from the shared region.  Then, when a
write gets issued to that extent, the parts that aren't being written in
that block get copied, the write happens, and then the link for that
block gets removed.  This way, fallocate would still provide the correct
semantics, it would be relatively fast (still not quite as fast as it is
now, but it wouldn't be anywhere near as slow as copying the data), and
the cost of copying gets amortized across writes (we may not need to
copy everything, but we'll still copy less than we would for just
un-sharing the extent).  This would of course need to be an incompat
feature, but I would personally say that's not as much of an issue, as
things are subtly broken in the common use-case right now (at this point
I'm just thinking BTRFS, as what Darrick suggested for XFS seems to be a
better solution there at least short term).
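
To make the idea a bit more concrete, here is a toy userspace model of it. Everything in it is an assumption for illustration (the class name RedirectedExtent, the 4-byte block size, the copied_bytes counter); it says nothing about how BTRFS would actually lay this out on disk. The fallocate step reserves private blocks that still read from the shared data, and the first write to a block copies only the bytes that write does not cover.

```python
BLOCK = 4  # tiny block size so the example is easy to follow


class RedirectedExtent:
    """Toy model: preallocated blocks that read from shared data until written."""

    def __init__(self, shared: bytes):
        assert len(shared) % BLOCK == 0
        self.shared = shared                 # data still shared with another file
        nblocks = len(shared) // BLOCK
        self.private = [None] * nblocks      # reserved by "fallocate", not yet written
        self.copied_bytes = 0                # bytes that had to be copied so far

    def read(self, block: int) -> bytes:
        own = self.private[block]
        if own is not None:
            return bytes(own)
        # Unwritten block: the read is redirected to the shared region.
        return self.shared[block * BLOCK:(block + 1) * BLOCK]

    def write(self, block: int, offset: int, data: bytes) -> None:
        assert 0 <= offset and offset + len(data) <= BLOCK
        if self.private[block] is None:
            # First write: start from the shared contents so the bytes outside
            # the written range survive, then drop the link to the shared block.
            self.private[block] = bytearray(self.read(block))
            self.copied_bytes += BLOCK - len(data)
        self.private[block][offset:offset + len(data)] = data


if __name__ == "__main__":
    ext = RedirectedExtent(b"aaaabbbbcccc")
    ext.write(1, 1, b"XY")       # partial write: 2 bytes need copying
    ext.write(2, 0, b"ZZZZ")     # full-block write: nothing needs copying
    print(ext.read(0), ext.read(1), ext.read(2), ext.copied_bytes)
```

The copied_bytes counter is there to show the amortization argument: a full-block overwrite copies nothing, and untouched blocks are never copied at all.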



Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Austin S. Hemmelgarn

On 2016-03-30 20:32, Liu Bo wrote:
> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > Hi all,
> >
> > Christoph and I have been working on adding reflink and CoW support to
> > XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
> > that future file writes cannot ENOSPC, I extended the XFS fallocate
> > handler to unshare any shared blocks via the copy on write mechanism I
> > built for it.  However, Christoph shared the following concerns with
> > me about that interpretation:
> >
> > > I know that I suggested unsharing blocks on fallocate, but it turns out
> > > this is causing problems.  Applications expect falloc to be a fast
> > > metadata operation, and copying a potentially large number of blocks
> > > is against that expectation.  This is especially bad for the NFS
> > > server, which should not be blocked for a long time in a synchronous
> > > operation.
> > >
> > > I think we'll have to remove the unshare and just fail the fallocate
> > > for a reflinked region for now.  I still think it makes sense to expose
> > > an unshare operation, and we probably should make that another
> > > fallocate mode.
>
> I'm expecting fallocate to be fast, too.
>
> Well, btrfs fallocate doesn't allocate space if it's a shared one
> because it thinks the space is already allocated.  So a later overwrite
> over this shared extent may hit enospc errors.

And this _really_ should get fixed, otherwise glibc will add a check for
running posix_fallocate against BTRFS and force emulation, and people
_will_ complain about performance.




Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > Or is it ok that fallocate could block, potentially for a long time as
> > > we stream cows through the page cache (or however unshare works
> > > internally)?  Those same programs might not be expecting fallocate to
> > > take a long time.
> > 
> > Yes, it's perfectly fine for fallocate to block for long periods of
> > time. See what gfs2 does during preallocation of blocks - it ends up
> > > calling sb_issue_zeroout() because it doesn't have unwritten
> > extents, and hence can block for long periods of time
> 
> gfs2 fallocate is an implementation that will cause all but the most
> trivial users real pain.  Even the initial XFS implementation just
> marking the transactions synchronous made it unusable for all kinds
> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> to gfs2 will probably hang your connection for extended periods of
> time.
> 
> If we need to support something like what gfs2 does we should have a
> separate flag for it.

Using fallocate() for preallocation was always intended to
be a faster, more efficient method of allocating zeroed space
than having userspace write blocks of data. Faster, more efficient
does not mean instantaneous, and gfs2 using sb_issue_zeroout() means
that if the hardware has zeroing offloads (deterministic trim, write
same, etc) it will use them, and that will be much faster than
writing zeros from userspace.

IMO, what gfs2 does is definitely within the intended usage of
fallocate() for accelerating the preallocation of blocks.

Yes, it may not be optimal for things like NFS servers which haven't
considered that a fallocate based offload operation might take some
time to execute, but that's not a problem with fallocate. i.e.
that's a problem with the nfs server ALLOCATE implementation not
being prepared to return NFSERR_JUKEBOX to prevent client side hangs
and timeouts while the operation is run.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Austin S. Hemmelgarn

On 2016-03-31 03:58, Christoph Hellwig wrote:
> On Wed, Mar 30, 2016 at 02:58:38PM -0400, Austin S. Hemmelgarn wrote:
> > Nothing that I can find in the man-pages or API documentation for Linux's
> > fallocate explicitly says that it will be fast.  There are bits that say it
> > should be efficient, but that is not itself well defined (given context, I
> > would assume it to mean that it doesn't use as much I/O as writing out that
> > many bytes of zero data, not necessarily that it will return quickly).
>
> And that's pretty much as narrow a definition as we get.  But apparently
> gfs2 already breaks that expectation :(

GFS2 breaks other expectations as well (mostly stuff with locking) in
arguably more significant ways, so I would not personally consider it to
be precedent for breaking this on other filesystems.

> > > delalloc system is careful enough to check that there are enough free
> > > blocks to handle both the allocation and the metadata updates.  The
> > > only gap in this scheme that I can see is if we fallocate, crash, and
> > > upon restart the program then tries to write without retrying the
> > > fallocate.  Can we trade some performance for the added requirement
> > > that we must fallocate -> write -> fsync, and retry the trio if we
> > > crash before the fsync returns?  I think that's already an implicit
> > > requirement, so we might be ok here.
> >
> > Most of the software I've seen that doesn't use fallocate like this is
> > either doing odd things otherwise, or is just making sure it has space for
> > temporary files, so I think it is probably safe to require this.
>
> posix_fallocate guarantees you that you don't get ENOSPC from the write,
> and there is plenty of software relying on that or crashing / causing data
> integrity problems that way.

posix_fallocate is not the same thing as the fallocate syscall.  It's
there for compatibility, it has less functionality, and most
importantly, it _can_ be slow (because at least glibc will emulate it if
the underlying FS doesn't support fallocate, which means it's no faster
than just using dd).
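
As a rough illustration of why the emulated path costs as much as dd, here is a sketch of preallocating by actually writing zeros. This is not glibc's code; preallocate_by_writing and its chunk parameter are names I made up, and it assumes a new, empty file (it would overwrite existing data otherwise).

```python
import os
import tempfile


def preallocate_by_writing(fd: int, length: int, chunk: int = 1 << 20) -> None:
    """Fallback preallocation for a new, empty file: write `length` zero bytes."""
    zeros = bytes(chunk)
    done = 0
    while done < length:
        n = min(chunk, length - done)
        # Every byte of the range goes through the normal write path,
        # which is exactly the I/O that a real fallocate avoids.
        done += os.pwrite(fd, zeros[:n], done)
    os.fsync(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prealloc.bin")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            preallocate_by_writing(fd, 10000, chunk=4096)
        finally:
            os.close(fd)
        print(os.path.getsize(path))
```

The run time here scales with the size of the range, while a filesystem-level fallocate mostly touches metadata.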



[GIT PULL] Btrfs file/data loss bug fix

2016-03-31 Thread fdmanana
From: Filipe Manana 

Hi Chris,

Please consider the following fix for the linux kernel 4.6 release. It is
not a regression in the code for 4.6 nor introduced in any recent release,
it's a problem that's been around for a long time (years). But since it's
a quite serious one, it's important in my opinion to get the fix into 4.6
(and to the stable releases) instead of waiting for the 4.7 merge window.
Three test cases were sent upstream for xfstests.

Thanks.

The following changes since commit 232cad8413a0bfbd25f11cc19fd13dfd85e1d8ad:

  Merge branch 'misc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 (2016-03-24 17:36:13 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git for-chris-4.6

for you to fetch changes up to 3943609915b333dd6c69ea2993e4c717da07ad46:

  Btrfs: fix file/data loss caused by fsync after rename and new inode (2016-03-30 23:39:06 +0100)


Filipe Manana (1):
  Btrfs: fix file/data loss caused by fsync after rename and new inode

 fs/btrfs/tree-log.c | 137 +
 1 file changed, 137 insertions(+)

-- 
2.7.0.rc3



[PATCH v2] Btrfs: fix file/data loss caused by fsync after rename and new inode

2016-03-31 Thread fdmanana
From: Filipe Manana 

If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.

This is reproducible with the following steps, taken from a couple of
test cases written for fstests:

   # Scenario 1

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir -p /mnt/a/x
   echo "hello" > /mnt/a/x/foo
   echo "world" > /mnt/a/x/bar
   sync
   mv /mnt/a/x /mnt/a/y
   mkdir /mnt/a/x
   xfs_io -c fsync /mnt/a/x
   <power fail>

   The next time the fs is mounted, log tree replay happens and
   the directory "y" does not exist nor do the files "foo" and
   "bar" exist anywhere (neither in "y" nor in "x", nor the root
   nor anywhere).

   # Scenario 2

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/a
   echo "hello" > /mnt/a/foo
   sync
   mv /mnt/a/foo /mnt/a/bar
   echo "world" > /mnt/a/foo
   xfs_io -c fsync /mnt/a/foo
   <power fail>

   The next time the fs is mounted, log tree replay happens and the
   file "bar" does not exist anymore. A file with the name "foo"
   exists and it matches the second file we created.

Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/testdir
   btrfs subvolume snapshot /mnt /mnt/testdir/snap
   btrfs subvolume delete /mnt/testdir/snap
   rmdir /mnt/testdir
   mkdir /mnt/testdir
   xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
   <power fail>

   The next time the fs is mounted the log replay procedure fails because
   it attempts to delete the snapshot entry (which has dir item key type
   of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
   resulting in the following error that causes mount to fail:

   [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
   [52174.512570] ------------[ cut here ]------------
   [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
   [52174.514681] BTRFS: Transaction aborted (error -2)
   [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
   [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: GW   4.5.0-rc6-btrfs-next-27+ #1
   [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
   [52174.524053]   8801df2a7710 81264e93 8801df2a7758
   [52174.524053]  0009 8801df2a7748 81051618 a03591cd
   [52174.524053]  fffe 88015e6e5000 88016dbc3c88 88016dbc3c88
   [52174.524053] Call Trace:
   [52174.524053]  [] dump_stack+0x67/0x90
   [52174.524053]  [] warn_slowpath_common+0x99/0xb2
   [52174.524053]  [] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
   [52174.524053]  [] warn_slowpath_fmt+0x48/0x50
   [52174.524053]  [] __btrfs_unlink_inode+0x178/0x351 [btrfs]
   [52174.524053]  [] ? iput+0xb0/0x284
   [52174.524053]  [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
   [52174.524053]  [] check_item_in_log+0x1fe/0x29b [btrfs]
   [52174.524053]  [] replay_dir_deletes+0x167/0x1cf [btrfs]
   [52174.524053]  [] fixup_inode_link_count+0x289/0x2aa [btrfs]
   [52174.524053]  [] fixup_inode_link_counts+0xcb/0x105 [btrfs]
   [52174.524053]  [] btrfs_recover_log_trees+0x258/0x32c [btrfs]
   [52174.524053]  [] ? replay_one_extent+0x511/0x511 [btrfs]
   [52174.524053]  [] open_ctree+0x1dd4/0x21b9 [btrfs]
   [52174.524053]  [] btrfs_mount+0x97e/0xaed [btrfs]
   [52174.524053]  [] ? trace_hardirqs_on+0xd/0xf
   [52174.524053]  [] mount_fs+0x67/0x131
   [52174.524053]  [] vfs_kern_mount+0x6c/0xde
   [52174.524053]  [] btrfs_mount+0x1ac/0xaed [btrfs]
   [52174.524053]  [] ? trace_hardirqs_on+0xd/0xf
   [52174.524053]  [] ? lockdep_init_map+0xb9/0x1b3
   [52174.524053]  [] mount_fs+0x67/0x131
   [52174.524053]  [] vfs_kern_mount+0x6c/0xde
   [52174.524053]  [] do_mount+0x8a6/0x9e8
   [52174.524053]  [] ? strndup_user+0x3f/0x59
   [52174.524053]  [] SyS_mount+0x77/0x9f
   [52174.524053]  [] entry_SYSCALL_64_fastpath+0x12/0x6b
   [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---

Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).
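
The check described above can be sketched as a toy userspace model. This is not the kernel code from tree-log.c: the function name needs_full_commit, the dictionary standing in for the commit root, and the inode numbers are all assumptions made for illustration.

```python
def needs_full_commit(commit_root, inode, refs, created_in_current_transaction):
    """Decide whether fsync should fall back to a full transaction commit.

    commit_root maps (parent inode number, name) to the inode number that
    owned that reference in the last committed transaction.
    refs lists the (parent inode number, name) references of the inode
    being fsync'ed.
    """
    if not created_in_current_transaction:
        # Only new inodes can collide with a name some other inode had
        # in the commit root.
        return False
    for parent, name in refs:
        previous_owner = commit_root.get((parent, name))
        if previous_owner is not None and previous_owner != inode:
            # Some other inode had this name under this parent: logging only
            # the new inode would make replay remove the old one.
            return True
    return False


if __name__ == "__main__":
    # Scenario 1 from above with made-up inode numbers:
    # "a" is 257, the original "a/x" is 258, the new "a/x" is 261.
    committed = {(256, "a"): 257, (257, "x"): 258}
    print(needs_full_commit(committed, 261, [(257, "x")], True))
    print(needs_full_commit(committed, 262, [(257, "z")], True))
```

In this model the new directory "x" triggers the fallback because the name belonged to inode 258 in the commit root, while a new inode under an unused name does not.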

Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:

  * fstests: generic test for fsync after 

Re: [PATCH V15 00/15] Btrfs: Subpagesize-blocksize: Allow I/O on blocks whose size is less than page size

2016-03-31 Thread David Sterba
On Tue, Mar 22, 2016 at 06:50:32PM +0530, Chandan Rajendra wrote:
> On Tuesday 22 Mar 2016 12:04:23 David Sterba wrote:
> > On Thu, Feb 11, 2016 at 11:17:38PM +0530, Chandan Rajendra wrote:
> > > this patchset temporarily disables the commit
> > > f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
> > > 
> > > The commits for the Btrfs kernel module can be found at
> > > https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.
> > 
> > The branch does not apply cleanly to at least 4.5, I've tried to rebase
> > it but there are conflicts that are not simple. Please update it on top
> > of current master, ie. with the preparatory patchset merged.
> 
> I will rebase the branch and post the patchset soon.

JFYI, I've seen some minor compilation failures:
- __do_readpage: unused variable cached
- end_bio_extent_buffer_readpage: btree_readahead_hook must take fs_info
- fails build with (config) sanity checks enabled, the fs_info moved
  from eb to eb head


Re: How to cancel btrfs balance on unmounted filesystem

2016-03-31 Thread Dmitrii Tcvetkov
Hello.
There is no tool to disable balance on an unmounted filesystem. But you can
use the mount option skip_balance for this.


-------- Original Message --------
From: Marc Haber 
Sent: March 31, 2016 9:21:12 AM GMT+03:00
To: linux-btrfs@vger.kernel.org
Subject: How to cancel btrfs balance on unmounted filesystem

Hi,

one of my problem btrfs instances went into a hung process state
while balancing metadata. This process is recorded in the file system
somehow and the balance restarts immediately after mounting the
filesystem with no chance to issue a btrfs balance cancel command
before the system hangs again.

Is there any possibility to cancel the pending balance without mounting
the fs first?

I have also filed https://bugzilla.kernel.org/show_bug.cgi?id=115581
to address this in a more elegant way.

Greetings
Marc




Re: How to cancel btrfs balance on unmounted filesystem

2016-03-31 Thread Roman Mamedov
On Thu, 31 Mar 2016 08:21:12 +0200
Marc Haber  wrote:

> the balance restarts immediately after mounting

You can use the skip_balance mount option to prevent that.

-- 
With respect,
Roman




Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Christoph Hellwig
On Wed, Mar 30, 2016 at 02:58:38PM -0400, Austin S. Hemmelgarn wrote:
> Nothing that I can find in the man-pages or API documentation for Linux's
> fallocate explicitly says that it will be fast.  There are bits that say it
> should be efficient, but that is not itself well defined (given context, I
> would assume it to mean that it doesn't use as much I/O as writing out that
> many bytes of zero data, not necessarily that it will return quickly).

And that's pretty much as narrow a definition as we get.  But apparently
gfs2 already breaks that expectation :(

> >delalloc system is careful enough to check that there are enough free
> >blocks to handle both the allocation and the metadata updates.  The
> >only gap in this scheme that I can see is if we fallocate, crash, and
> >upon restart the program then tries to write without retrying the
> >fallocate.  Can we trade some performance for the added requirement
> >that we must fallocate -> write -> fsync, and retry the trio if we
> >crash before the fsync returns?  I think that's already an implicit
> >requirement, so we might be ok here.
> Most of the software I've seen that doesn't use fallocate like this is
> either doing odd things otherwise, or is just making sure it has space for
> temporary files, so I think it is probably safe to require this.

posix_fallocate guarantees you that you don't get ENOSPC from the write,
and there is plenty of software relying on that or crashing / causing data
integrity problems that way.
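
The fallocate -> write -> fsync trio discussed above could look roughly like this from userspace. It is only a sketch under assumptions: reserve_write_sync is a made-up helper name, and it uses Python's os.posix_fallocate wrapper rather than the raw fallocate syscall.

```python
import os
import tempfile


def reserve_write_sync(path: str, payload: bytes) -> None:
    """Reserve space, write into it, and only then treat the data as durable."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Step 1: reserve the blocks up front; ENOSPC surfaces here as OSError.
        os.posix_fallocate(fd, 0, len(payload))
        # Step 2: write into the reserved range, handling short writes.
        written = 0
        while written < len(payload):
            written += os.pwrite(fd, payload[written:], written)
        # Step 3: the trio is complete only once fsync has returned.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.bin")
        reserve_write_sync(target, b"hello world")
        with open(target, "rb") as f:
            print(f.read())
```

A program following this pattern would simply call the helper again after a crash, so the reservation is retried together with the write instead of being assumed to have survived.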


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Christoph Hellwig
On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
> Well, btrfs fallocate doesn't allocate space if it's a shared one
> because it thinks the space is already allocated.  So a later overwrite
> over this shared extent may hit enospc errors.

And this makes it an incorrect implementation of posix_fallocate,
which glibc implements using fallocate if available.


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread Christoph Hellwig
On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > Or is it ok that fallocate could block, potentially for a long time as
> > we stream cows through the page cache (or however unshare works
> > internally)?  Those same programs might not be expecting fallocate to
> > take a long time.
> 
> Yes, it's perfectly fine for fallocate to block for long periods of
> time. See what gfs2 does during preallocation of blocks - it ends up
> calling sb_issue_zeroout() because it doesn't have unwritten
> extents, and hence can block for long periods of time

gfs2 fallocate is an implementation that will cause all but the most
trivial users real pain.  Even the initial XFS implementation just
marking the transactions synchronous made it unusable for all kinds
of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
to gfs2 will probably hang your connection for extended periods of
time.

If we need to support something like what gfs2 does we should have a
separate flag for it.


How to cancel btrfs balance on unmounted filesystem

2016-03-31 Thread Marc Haber
Hi,

one of my problem btrfs instances went into a hung process state
while balancing metadata. This process is recorded in the file system
somehow and the balance restarts immediately after mounting the
filesystem with no chance to issue a btrfs balance cancel command
before the system hangs again.

Is there any possibility to cancel the pending balance without mounting
the fs first?

I have also filed https://bugzilla.kernel.org/show_bug.cgi?id=115581
to address this in a more elegant way.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421