Re: Force recalculation of a data block checksum

2016-07-25 Thread Tomasz Melcer

On 24.07.2016 07:07, Andrei Borzenkov wrote:

Rewriting the single block in question should do it. As you can read from
the raw device and have both the physical and logical block address, something like

dd if=/dev/sdd1 skip=222940168 count=8 | dd of=/path/to/dysk/dysk.bin \
    seek=189669768 conv=notrunc count=8


Thank you for the one-liner. I had thought of doing it myself before, but
when I tried to figure out which numbers mean what in the scrub messages, I
got a bit lost.
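(As an aside, for mapping a logical address from a scrub error back to a file,
something like the sketch below can be used on the mounted filesystem. The
logical address and mount point here are only assumed examples, not values
from this report.)

# minimal sketch: resolve an assumed logical address 778895360 reported by
# scrub, with the filesystem assumed to be mounted at /media/liori/USB
btrfs inspect-internal logical-resolve 778895360 /media/liori/USB
# prints the path(s) of the file(s) that reference that logical address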


However, this one-liner doesn't seem to work here; `dd` prints:

#v+
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 5.2857e-05 s, 77.5 MB/s
dd: writing to ‘/media/liori/USB/dysk/dysk.bin’: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00364222 s, 0.0 kB/s
#v-


I'm not sure what this message would mean in this case. Is btrfs reading
the block before writing? Or is there maybe some block blacklist stored
somewhere (I see that the corruption counter in the scrub messages goes up
even if it reports the same errors in subsequent runs)?




If you have the possibility to verify the file content(s), this technique can be
used to read over unreadable parts of a file by combining multiple dd's
from the file and the device.


Some of the files stored in this `dysk/dysk.bin` disk image have
internal checksums (zip archives, etc.), including the ones hit by
the checksum error. I'll try verifying these later.
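(For reference, a minimal sketch of the combined-dd technique Andrei describes:
build a recovered copy of the image by reading everything from the file except
the affected 4 KiB block, which is read from the raw device instead. The
offsets reuse the numbers from the one-liner above and are assumptions, not
verified values.)

#!/bin/bash
# Splice the 4 KiB block read from the raw device into a copy of the image.
# 189669768 is the block's 512-byte-sector offset inside dysk.bin and
# 222940168 is its physical sector on /dev/sdd1 (both assumed here).
img=/path/to/dysk/dysk.bin
dev=/dev/sdd1
out=/tmp/dysk-recovered.bin

dd if="$img" of="$out" bs=512 count=189669768                  # data before the bad block
dd if="$dev" of="$out" bs=512 skip=222940168 seek=189669768 \
   count=8 conv=notrunc                                        # the bad 4 KiB, from the device
dd if="$img" of="$out" bs=512 skip=$((189669768 + 8)) \
   seek=$((189669768 + 8)) conv=notrunc                        # the rest of the image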




--
Tomasz Melcer


Re: [BTRFS-PROGS][PATCH] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'

2016-07-25 Thread Qu Wenruo



At 07/26/2016 01:14 AM, Goffredo Baroncelli wrote:

On 2016-07-25 04:14, Qu Wenruo wrote:

Hi Goffredo,

At 07/24/2016 07:03 PM, Goffredo Baroncelli wrote:

Hi all,

the following patches add two new commands: 1) btrfs
inspect-internal physical-find 2) btrfs inspect-internal
physical-dump

The aim of these two new commands is to locate (1) and dump (2) the
stripe elements stored on the disks. I developed these two new
command to simplify the debugging of some RAID5 bugs (but this is
another discussion).


That's a pretty nice function. However, it seems to be a
combination of fiemap (to resolve file to logical) and logical resolve
(to resolve logical to on-device bytenr), with RAID5/6-specific flags.

Instead of introducing new functions that combine existing
features, would you mind expanding the current logical-resolve?


[I suppose that you are referring to ./btrfs-map-logical and not 
logical-resolve]


Oh, sorry, I misunderstood the recent trend to merge standalone tools
into btrfs.


Before developing these two utils, I was unaware of logical-resolve.

Frankly speaking, I am skeptical about extending btrfs-map-logical, which is
outside the "btrfs" family of commands. So I consider it legacy code: code
which should not be extended, because it was born as a "quick and dirty"
utility.


IIRC, there is a trend to merge such standalone tools into inspect-internal
subcommands.




Finally, there is a big difference between my tools and btrfs-map-logical:
the former needs a mounted filesystem, the latter doesn't.


Oh, I really missed that difference.
So your tool is based on the tree search ioctl, not the offline tree search.

In that case, your tool is definitely useful, as we don't have any
online tool for that purpose yet.
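(For context, a rough sketch of how the existing offline tool is invoked; the
logical address is an assumed example and the filesystem must be unmounted.)

# btrfs-map-logical works on an unmounted filesystem and prints the
# physical location of a btrfs logical address on each mirror
btrfs-map-logical -l 61931520 /dev/loop2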




IMHO, it's not a good idea to cross two different logical
layers (the file<->logical mapping layer and the logical<->dev extent mapping
layer) in one function.


Indeed, that was the goal. The use case is a tool to help
find bugs in the RAID5 code. Before my tools, I used some hacks to
parse the output of filefrag and combine this with btrfs-debug-tree info.
So I developed "btrfs insp physical-find". Then Chris asked me to take a
further step, so I made 'btrfs insp physical-dump'. I think we definitely
need a tool like "btrfs insp physical-find/dump".

If you think that btrfs-map-logical has its use case, I can extend
physical-dump/find further to start from a "logical" address instead of
a file: that would be a lot simpler for me than extending btrfs-map-logical.


Please do; that would also make it possible to inspect metadata extents.

And it can make the code easier to use from other subcommands.

Thanks,
Qu





Thanks, Qu


An example of 'btrfs inspect-internal physical-find' is the
following:

# btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

In the output above, DATA is the stripe element containing the data.
OTHER is the sibling stripe element: it contains data related either to
other files or to the same file but at a different position. The two
PARITY stripe elements contain the RAID6 parity (P and Q).

It is possible to pass the offset within the file to inspect.

An example of 'btrfs inspect-internal physical-dump' is the
following

# btrfs insp physical-find mnt/out.txt
mnt/out.txt: 0
devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER
devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER
devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA
devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY
devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY

# btrfs insp physical-dump mnt/out.txt | xxd
mnt/out.txt: 0 file: /dev/loop2 off=56819712
0000: 6164 6161 6161 6161 6161 6161 6161 6161  adaaaaaaaaaaaaaa
0010: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
0020: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
0030: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
0040: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
[...]

In this case the content of the first 4k of the file is dumped.
It is also possible to pass an offset (in steps of 4k). Moreover, it
is possible to select what to dump: which copy (switch
-c, only for RAID1/RAID10/DUP); which parity (switch -p, only for
RAID5/RAID6); which stripe element other than data (switch -s, only
for RAID5/RAID6).

BR G.Baroncelli



Re: [PATCH RFC] btrfs: send: Disable clone detection

2016-07-25 Thread Qu Wenruo



At 07/25/2016 11:37 PM, David Sterba wrote:

On Mon, Jul 25, 2016 at 03:19:59PM +0800, Qu Wenruo wrote:

This patch will disable clone detection of send.


Making that unconditional is not the right way. We do have the send
operation flags in place so if you really think there's a need for
disabling the clones, let's add a flag for that (BTRFS_SEND_FLAG_*).



The problem is, we don't know whether it will cause a problem until it happens.
So a flag like BTRFS_SEND_FLAG_DISABLE_CLONE won't really help, unless
it becomes the default.


Anyway, the RFC patch is not meant to be merged, but only to gather ideas
and advice on how to really fix the problem.


I think Filipe will have some good ideas once he understands the real
problem behind the bug, since he is much more experienced than me with
the send code.


Thanks,
Qu




Re: [PATCH RFC] btrfs: send: Disable clone detection

2016-07-25 Thread Qu Wenruo



At 07/25/2016 09:48 PM, Filipe Manana wrote:

On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo  wrote:

This patch will disable clone detection of send.

The main problem with clone detection in send is its memory usage and long
execution time.

The clone detection is done by iterating all backrefs and adding every
backref whose root is the source.

However, iterating all backrefs is already quite a bad idea; we should
never try to do it in a loop, and unfortunately in-band/out-of-band dedupe
and reflink can easily create a file whose file extents all point to the
same extent.

In that case, btrfs will do the backref walk for the same extent again and
again, until either OOM or a soft lockup is triggered.

So disable clone detection until we find a method that iterates the
backrefs of one extent only once, just like what balance/qgroup do.


Is this really a common scenario?


If any user can create such a file without root privileges and tie up
kernel CPU for a long time, then no matter whether it's a common scenario,
it's a big problem.



I don't recall ever seeing a report of such a slowdown or OOM due to
the same file pointing to the same extent thousands of times.

Then check the test case I submitted a few days ago. No dedupe is involved at all.
IIRC, I Cced you.

https://patchwork.kernel.org/patch/9235825/

And the FIEMAP ioctl has the same problem; the test case is already merged into
fstests, check generic/352.
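(For reference, a minimal sketch in the spirit of generic/352 of how such a
file can be built with plain reflinks and no dedupe; the mount point, file
name and iteration count are assumptions, not the exact test case.)

#!/bin/bash
# Build a file whose file extents all point to one physical extent,
# using only clone ranges via xfs_io. /mnt/btrfs and the 8192 count
# are assumed example values.
mnt=/mnt/btrfs
xfs_io -f -c "pwrite 0 64k" -c "fsync" "$mnt/bigfile"
for ((i = 1; i < 8192; i++)); do
        # reflink <src> <src_offset> <dst_offset> <length>
        xfs_io -c "reflink $mnt/bigfile 0 $((i * 64 * 1024)) 64k" "$mnt/bigfile"
done
sync
# fiemap on this file (or send of a snapshot containing it) now walks
# the backrefs of that single extent once per file extent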



Sure, it's easy to create such a scenario with artificial test data for
our in-band deduplication feature (and we should really strive to make
what we have stable and reliable rather than add yet more features).


This can be triggered even without dedupe. Just check the test case.

And you should yell at out-of-band dedupe before in-band, because with
out-of-band dedupe all these bugs have been triggerable for a long
time, but no one really cared until they were exposed by in-band dedupe.

The only way dedupe is involved is that it made us aware there are still a
lot of bugs that no one really cared about before.




What needs to be fixed here is the common backref walking code, called
by send plus a few other places (fiemap and some ioctls at least), for
which IIRC there was a recent patch from one of your colleagues.


Yes, Lu's fix for fiemap is OK, but it won't really fix the problem.
The fix is only for fiemap, as it just does an early exit instead of
always doing a backref walk.


There are several backref callers, including:
1) balance
2) qgroup
3) fiemap (check_shared)
4) send

But for the same reflinked file, 1) and 2) won't cause such a long
execution time, simply because balance and qgroup call
find_parent_nodes() on the same *EXTENT* at most twice.

However, old fiemap and current send call find_parent_nodes()
on every *FILE EXTENT* that points to the same *EXTENT*.

And according to my ftrace data, for one extent shared 4096 times, it
takes about 1.6 sec to execute find_parent_nodes().

So for balance and qgroup, an extra 3.2 seconds won't be a big problem.
(Although mixing balance and qgroup is another problem.)

But for fiemap or send, it will be 4096 * 1.6 = 6553.6 secs, which is
definitely not the right thing.




Disabling this code makes the problem go away for the scenario of the
same file pointing to the same extent thousands of times (or tens or
hundreds of thousands, whatever).
But what happens if a file points to an extent once but the extent is
referenced by other 100k files (possibly in different snapshots or
subvolumes), isn't it the same problem?


No, it is not the same problem.
In that case, the extent will only be iterated once, so just 1.6 sec.

Nothing will be wrong at all.

The only problem is that send lacks a way to avoid calling
find_parent_nodes() again and again on the same extent.

And yes, we could fix it in the backref code, but right now send is the only
caller which may call find_parent_nodes() on the same extent again and again.

Qgroup has been changed from calling find_parent_nodes() every time to
calling it only twice.

Balance itself is already extent-based, so it has nothing to do with this bug.
And fiemap added an early quit.

Only send is left.


The backref code has to find
all such inodes/offsets/roots and take again the same long time...

Following your logic, it seems like everything that uses the backref
walking code should be disabled.

Just check my previous statement.

It's not about the execution time of find_parent_nodes() and its variants,
but about the caller's behavior.

If the caller calls find_parent_nodes() only once per extent, it takes less
than 2 seconds (almost linear growth with the number of references).

But if the caller calls find_parent_nodes() on every file extent it finds,
then there is the problem (squared to cubic growth).



And I know disabling clone detection is quite a bad idea, which is why I
sent this patch as an RFC: to raise this concern for any find_parent_nodes()
caller, and to find a good method to fix the bug.


Thanks,
Qu




Cc: Filipe Manana 

Re: Any suggestions for thousands of disk image snapshots ?

2016-07-25 Thread Chris Murphy
On Mon, Jul 25, 2016 at 2:49 PM, Chris Murphy  wrote:

> 1. mdadm raid10 + LVM thinp + either XFS or Btrfs.

You probably want to use LVM thin provisioning because its snapshot
implementation is much faster and more hands-off than thick
provisioning.

And you'll need to use mdadm raid10 instead of LVM raid10 because
thinp doesn't support RAID.
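(For reference, a rough sketch of that stack; device names, sizes and the
12-disk layout are assumptions loosely based on the setup described elsewhere
in this thread, not tested values.)

# mdadm raid10 across 12 SSDs (1024K chunk), then an LVM thin pool on
# top, and one thin volume to hold the constantly-updated master image
mdadm --create /dev/md0 --level=10 --raid-devices=12 --chunk=1024 /dev/sd[b-m]
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate --type thin-pool -L 2.5T --poolmetadatasize 16G -n pool0 vg0
lvcreate -n master -V 1.5T --thinpool pool0 vg0
mkfs.xfs /dev/vg0/master    # or mkfs.btrfs, per the quoted option 1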



-- 
Chris Murphy


Re: Any suggestions for thousands of disk image snapshots ?

2016-07-25 Thread Chris Murphy
On Mon, Jul 25, 2016 at 1:25 AM, Kurt Seo  wrote:
>  Hi all
>
>
>  I am currently running a project for building servers with btrfs.
> Purposes of servers are exporting disk images through iscsi targets
> and disk images are generated from btrfs subvolume snapshot.

How is the disk image generated from the Btrfs subvolume snapshot?

On what file system is the disk image stored?


> Maximum number of clients is 500 and each client uses two snapshots of
> disk images. The first disk image's size is about 50GB and the second one
> is about 1.5TB.
> The important thing is that the original 1.5TB disk image is mounted via a
> loop device and modified in real time - e.g. continuously downloading
> torrents into it.
> Snapshots are made when clients boot up and deleted when they are turned off.
>
> So the server has two original disk images and about a thousand
> snapshots in total.
> I made a list of factors that affect the server's performance and stability.
>
> 1. Raid Configuration - Mdadm raid vs btrfs raid, configuration and
> options for them.
> 2. How to format btrfs - nodesize, features
> 3. Mount options - nodatacow and compression things.
> 4. Kernel parameter tuning.
> 5. Hardware specification.
>
>
> My current setups are
>
> 1. mdadm raid10 with 1024k chunk and 12 disks of 512GB ssd.
> 2. nodesize 32k and nothing else.
> 3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
> 4. Ubuntu with 4.1.27 kernel without additional configurations.
> 5.
> CPU : Xeon E3-1225 v2 quad core 3.2GHz
> RAM : 2 x 8GB DDR3 ECC (16GB total)
> NIC : 2 x 10GbE
>
>
>  The results of the test so far are:
>
> 1. btrfs-transaction and btrfs-cleaner consume CPU regularly.
> 2. When the CPU is busy with those processes, creating snapshots takes a long time.
> 3. Performance degrades as time goes by.
>
>
> So if there are any wrong or missing configurations, can you suggest some?
> For example, do I need to increase physical memory?
>
> Any idea would help me a lot.

Offhand it sounds like you have a file system inside a disk image
which is itself stored on a file system, so there are two file systems.
And somehow you have to create the disk image from a subvolume, which
isn't going to be very fast. Also, something I read recently on the
XFS list makes me wonder whether loop devices are production-worthy.

I'd reconsider the layout for any one of these reasons alone.


1. mdadm raid10 + LVM thinp + either XFS or Btrfs. The first LV you
create is the one the host is constantly updating. You can use XFS
freeze to freeze the file system, take the snapshot, and then release
the freeze. You now have the original LV which is still being updated
by the host, but you have a 2nd LV that itself can be exported as an
iSCSI target to a client system. There's no need to create a disk
image, so the creation of the snapshot and iSCSI target is much
faster.
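(A minimal sketch of that freeze/snapshot cycle, assuming the master image
lives on an XFS-formatted thin LV vg0/master mounted at /srv/master; all
names are illustrative.)

# Quiesce the master filesystem, take a thin snapshot, resume, and
# activate the snapshot so it can be exported as an iSCSI target.
xfs_freeze -f /srv/master
lvcreate -s -n client1 vg0/master     # thin snapshot, no size needed
xfs_freeze -u /srv/master
lvchange -ay -K vg0/client1           # thin snapshots skip auto-activation by default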

2. Similar to the above, but you could make the 2nd LV (the snapshot)
a Btrfs seed device that all of the clients share, and they are each
pointed to their own additional LV used for the Btrfs sprout device.
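(A rough sketch of the seed/sprout idea, assuming the shared snapshot LV is
vg0/seedlv and each client gets its own thin LV for the sprout; all names are
assumptions.)

# Mark the shared LV's btrfs as a seed (read-only, shareable) device once
btrfstune -S 1 /dev/vg0/seedlv

# Per client: mount the seed, add the client's own LV as the sprout,
# then remount read-write so all new writes land on the client's LV
mount /dev/vg0/seedlv /mnt/client1
btrfs device add /dev/vg0/client1 /mnt/client1
mount -o remount,rw /mnt/client1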

The issue I had a year ago with LVM thin provisioning is that when the
metadata pool gets full, the entire VG implodes very badly. I didn't get
sufficient warning in advance that the setup was suboptimal or that it was
about to run out of metadata space, and it wasn't repairable. But it was just
a test, and I haven't substantially played with LVM thinp with more than a
dozen snapshots. LVM being like emacs, you've got a lot of levers to adjust
things depending on the workload, whereas Btrfs has very few.

Therefore, the plus of the 2nd option is that you're only using a handful
of LVM thinp snapshots. And you're not really using Btrfs snapshots either;
you're using the union-like seed-sprout capability of Btrfs. The other nice
thing is that when the clients quit, the LVs are removed at the LVM extent
level. There is no need for file system cleanup processes to decrement
reference counts on all the affected extents, so it'd be fast.


-- 
Chris Murphy


Re: [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely

2016-07-25 Thread Josef Bacik

On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:

This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The above script will fail with ENOSPC, but the fs actually still has free
space to satisfy this request. Please see the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
|   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
|   change bytes_may_use, and bytes_reserved += 64M. Now
|   bytes_may_use + bytes_reserved == 128M, which is greater
|   than btrfs_space_info's total_bytes, false enospc occurs.
|   Note, the bytes_may_use decrease operation will be done in
|   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
CPU 1  |  CPU 2
   |
|-> cow_file_range()   |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() |   |
|  |   |
|  |   |
|. |   |-> btrfs_check_data_free_space()
|  |
|  |
|-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
operation is delayed until extent_clear_unlock_delalloc().
Assume in this case that btrfs_reserve_extent() reserved 128MB of data and
CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
then btrfs_check_data_free_space() will try to allocate a new data chunk,
call btrfs_start_delalloc_roots(), or commit the current transaction in
order to reserve some free space, which is obviously a lot of work. But
this is not necessary: as long as bytes_may_use is decreased in a timely
manner (decreasing the 128M from bytes_may_use), we still have free space.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For the compression path,
the real extent length may not equal the file content length, so introduce
a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(); this is because bytes_may_use is increased by
the file content length. Then the compression path can update bytes_may_use
correctly. Also, now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In
run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
PREALLOC, we also need to update bytes_may_use, but we can not pass
EXTENT_DO_ACCOUNTING because it also clears the metadata reservation, so
here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() calls
btrfs_free_reserved_data_space() internally for both the successful and the
failed path, so btrfs_prealloc_file_range()'s callers do not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang 
---


Sigh sorry, baby kept me up most of the night, that should be

Reviewed-by: Josef Bacik 

Thanks,

Josef



Re: [BTRFS-PROGS][PATCH] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'

2016-07-25 Thread Goffredo Baroncelli
On 2016-07-25 04:14, Qu Wenruo wrote:
> Hi Goffredo,
> 
> At 07/24/2016 07:03 PM, Goffredo Baroncelli wrote:
>> Hi all,
>> 
>> the following patches add two new commands: 1) btrfs
>> inspect-internal physical-find 2) btrfs inspect-internal
>> physical-dump
>> 
>> The aim of these two new commands is to locate (1) and dump (2) the
>> stripe elements stored on the disks. I developed these two new
>> command to simplify the debugging of some RAID5 bugs (but this is
>> another discussion).
> 
> That's a pretty nice function. However, it seems to be a
> combination of fiemap (to resolve file to logical) and logical resolve
> (to resolve logical to on-device bytenr), with RAID5/6-specific
> flags.
> 
> Instead of introducing new functions that combine existing
> features, would you mind expanding the current logical-resolve?

[I suppose that you are referring to ./btrfs-map-logical and not 
logical-resolve]

Before developing these two utils, I was unaware of logical-resolve.

Frankly speaking, I am skeptical about extending btrfs-map-logical, which is
outside the "btrfs" family of commands. So I consider it legacy code: code
which should not be extended, because it was born as a "quick and dirty"
utility.

Finally, there is a big difference between my tools and btrfs-map-logical:
the former needs a mounted filesystem, the latter doesn't.

> 
> IMHO, it's not a good idea to cross two different logical
> layers (the file<->logical mapping layer and the logical<->dev extent mapping
> layer) in one function.

Indeed, that was the goal. The use case is a tool to help
find bugs in the RAID5 code. Before my tools, I used some hacks to
parse the output of filefrag and combine this with btrfs-debug-tree info.
So I developed "btrfs insp physical-find". Then Chris asked me to take a
further step, so I made 'btrfs insp physical-dump'. I think we definitely
need a tool like "btrfs insp physical-find/dump".

If you think that btrfs-map-logical has its use case, I can extend
physical-dump/find further to start from a "logical" address instead of
a file: that would be a lot simpler for me than extending btrfs-map-logical.

> 
> Thanks, Qu
>> 
>> An example of 'btrfs inspect-internal physical-find' is the
>> following:
>> 
>> # btrfs inspect physical-find mnt/out.txt
>> mnt/out.txt: 0
>> devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
>> devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>> devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
>> devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY
>> 
>> In the output above, DATA is the stripe element containing the data.
>> OTHER is the sibling stripe element: it contains data related either to
>> other files or to the same file but at a different position. The two
>> PARITY stripe elements contain the RAID6 parity (P and Q).
>> 
>> It is possible to pass the offset within the file to inspect.
>> 
>> An example of 'btrfs inspect-internal physical-dump' is the
>> following
>> 
>> # btrfs insp physical-find mnt/out.txt
>> mnt/out.txt: 0
>> devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER
>> devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER
>> devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA
>> devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY
>> devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY
>> 
>> # btrfs insp physical-dump mnt/out.txt | xxd
>> mnt/out.txt: 0 file: /dev/loop2 off=56819712
>> 0000: 6164 6161 6161 6161 6161 6161 6161 6161  adaaaaaaaaaaaaaa
>> 0010: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> 0020: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> 0030: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> 0040: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> [...]
>> 
>> In this case the content of the first 4k of the file is dumped.
>> It is also possible to pass an offset (in steps of 4k). Moreover, it
>> is possible to select what to dump: which copy (switch
>> -c, only for RAID1/RAID10/DUP); which parity (switch -p, only for
>> RAID5/RAID6); which stripe element other than data (switch -s, only
>> for RAID5/RAID6).
>> 
>> BR G.Baroncelli
>> 
>> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH RFC] btrfs: send: Disable clone detection

2016-07-25 Thread David Sterba
On Mon, Jul 25, 2016 at 03:19:59PM +0800, Qu Wenruo wrote:
> This patch will disable clone detection of send.

Making that unconditional is not the right way. We do have the send
operation flags in place so if you really think there's a need for
disabling the clones, let's add a flag for that (BTRFS_SEND_FLAG_*).


Re: [PATCH RFC] btrfs: send: Disable clone detection

2016-07-25 Thread Filipe Manana
On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo  wrote:
> This patch will disable clone detection of send.
>
> The main problem with clone detection in send is its memory usage and long
> execution time.
>
> The clone detection is done by iterating all backrefs and adding every
> backref whose root is the source.
>
> However, iterating all backrefs is already quite a bad idea; we should
> never try to do it in a loop, and unfortunately in-band/out-of-band dedupe
> and reflink can easily create a file whose file extents all point to the
> same extent.
>
> In that case, btrfs will do the backref walk for the same extent again and
> again, until either OOM or a soft lockup is triggered.
>
> So disable clone detection until we find a method that iterates the
> backrefs of one extent only once, just like what balance/qgroup do.

Is this really a common scenario?
I don't recall ever seeing a report of such a slowdown or OOM due to
the same file pointing to the same extent thousands of times.
Sure, it's easy to create such a scenario with artificial test data for
our in-band deduplication feature (and we should really strive to make
what we have stable and reliable rather than add yet more features).

What needs to be fixed here is the common backref walking code, called
by send plus a few other places (fiemap and some ioctls at least), for
which IIRC there was a recent patch from one of your colleagues.

Disabling this code makes the problem go away for the scenario of the
same file pointing to the same extent thousands of times (or tens or
hundreds of thousands, whatever).
But what happens if a file points to an extent once but the extent is
referenced by other 100k files (possibly in different snapshots or
subvolumes), isn't it the same problem? The backref code has to find
all such inodes/offsets/roots and take again the same long time...

Following your logic, it seems like everything that uses the backref
walking code should be disabled.

>
> Cc: Filipe Manana 
> Reported-by: Tsutomu Itoh 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/send.c | 22 --
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 2db8dc8..eed3f1c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -1166,6 +1166,7 @@ struct backref_ctx {
> int found_itself;
>  };
>
> +#if 0
>  static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>  {
> u64 root = (u64)(uintptr_t)key;
> @@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void *key, 
> const void *elt)
> return 1;
> return 0;
>  }
> +#endif
>
>  static int __clone_root_cmp_sort(const void *e1, const void *e2)
>  {
> @@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1, const 
> void *e2)
> return 0;
>  }
>
> +#if 0
>  /*
>   * Called for every backref that is found for the current extent.
>   * Results are collected in sctx->clone_roots->ino/offset/found_refs
> @@ -1445,6 +1448,7 @@ out:
> kfree(backref_ctx);
> return ret;
>  }
> +#endif
>
>  static int read_symlink(struct btrfs_root *root,
> u64 ino,
> @@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
>   struct btrfs_path *path,
>   struct btrfs_key *key)
>  {
> -   struct clone_root *found_clone = NULL;
> int ret = 0;
>
> if (S_ISLNK(sctx->cur_inode_mode))
> @@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
> }
> }
>
> +   /*
> +* Current clone detection is both time and memory consuming.
> +*
> +* Time consuming is caused by iterating all backref of extent.
> +* Memory consuming is caused by allocating "found_clone" every
> +* time for a backref.
> +*
> +* XXX: Disabling it is never the best method, but at least it
> +* won't cause OOM nor super long execution time.
> +* The root fix needs to change the iteration basis, from iterating
> +* file extents to iterating extents, so find_parent_nodes() and
> +* backref walk should be called only once for one extent.
> +*/
> +#if 0
> ret = find_extent_clone(sctx, path, key->objectid, key->offset,
> sctx->cur_inode_size, &found_clone);
> if (ret != -ENOENT && ret < 0)
> goto out;
> +#endif
>
> -   ret = send_write_or_clone(sctx, path, key, found_clone);
> +   ret = send_write_or_clone(sctx, path, key, NULL);
> if (ret)
> goto out;
>  out_hole:
> --
> 2.9.0
>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."

Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2016-07-25 Thread Josef Bacik

On 07/25/2016 09:01 AM, David Sterba wrote:

On Mon, Jul 18, 2016 at 09:42:50AM -0400, Josef Bacik wrote:


This makes the search for BLOCK_GROUP_ITEM very, very slow if the extent tree
is really big.

On the other hand, we search CHUNK_ITEM very fast, because CHUNK_ITEMs are in
their own tree.
(CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped)

So to completely fix it, btrfs needs an on-disk format change to put
BLOCK_GROUP_ITEM into its own tree.

IMHO there may be some objection from other devs though.

Anyway, I added the three maintainers to Cc, and hope we can get a better idea
of how to fix it.


Yeah I'm going to fix this when I do the per-block group extent tree thing.  
Thanks,


Will it be capable of "per-subvolume-set" extent trees? IOW, a set of
subvolumes could share extents only among the members of the same
group. The use case is to start an isolated subvolume and allow snapshots
(and obviously forbid reflinks outside of the group).



I suppose. The problem is that the btrfs_header doesn't have much room for
verbose descriptions of which root owns it. We have objectid since it was
always unique, but in the case of per-bg extent trees we can't use that
anymore, so we have to abuse flags to say "this is an extent root block", and
then we know that btrfs_header_owner(eb) is really the offset of the root and
not the objectid. Doing something like having it per subvolume would mean
having another flag that says this block belongs to a subvolume root, and
then having btrfs_header_owner(eb) set to the new offset.

The point I'm making is that we can do whatever we want here, but it'll be a
little strange since we have to use flag bits to indicate what type of root
the owner points to, so any future modifications will also be format changes.
At least once I get this work done we'll be able to more easily add new
variations on the per-whatever setup.  Thanks,


Josef


Re: [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely

2016-07-25 Thread Josef Bacik

On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:

This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The above script will fail with ENOSPC, but the fs actually still has free
space to satisfy this request. Please see the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
|   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
|   change bytes_may_use, and bytes_reserved += 64M. Now
|   bytes_may_use + bytes_reserved == 128M, which is greater
|   than btrfs_space_info's total_bytes, false enospc occurs.
|   Note, the bytes_may_use decrease operation will be done in
|   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
CPU 1  |  CPU 2
   |
|-> cow_file_range()   |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() |   |
|  |   |
|  |   |
|. |   |-> btrfs_check_data_free_space()
|  |
|  |
|-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
operation is delayed until extent_clear_unlock_delalloc().
Assume in this case that btrfs_reserve_extent() reserved 128MB of data and
CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
then btrfs_check_data_free_space() will try to allocate a new data chunk,
call btrfs_start_delalloc_roots(), or commit the current transaction in
order to reserve some free space, which is obviously a lot of work. But
this is not necessary: as long as bytes_may_use is decreased in a timely
manner (decreasing the 128M from bytes_may_use), we still have free space.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For the compression path,
the real extent length may not equal the file content length, so introduce
a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(); this is because bytes_may_use is increased by
the file content length. Then the compression path can update bytes_may_use
correctly. Also, now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In
run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
PREALLOC, we also need to update bytes_may_use, but we can not pass
EXTENT_DO_ACCOUNTING because it also clears the metadata reservation, so
here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() calls
btrfs_free_reserved_data_space() internally for both the successful and the
failed path, so btrfs_prealloc_file_range()'s callers do not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang 


Yup this is good, thanks for fixing this problem

Signed-off-by: Josef Bacik 



Re: [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space

2016-07-25 Thread Josef Bacik

On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:

cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also
result in false ENOSPC errors. Please see the race window below:

                CPU1                          |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |    |
    |   assume it returns ENOSPC, which means |    |
    |   btrfs_space_info is full and has no   |    |
    |   free space to satisfy the data request|    |
    |                                         |    |-> btrfs_delete_unused_bgs()
    |                                         |        it will decrease btrfs_space_info's
    |                                         |        total_bytes and make
    |                                         |        btrfs_space_info not full
    |                                         |
In this case, we may get an ENOSPC error, but btrfs_space_info is not full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block
btrfs_delete_unused_bgs(). So here we introduce a new rw_semaphore,
bg_delete_sem, to do this job.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/extent-tree.c | 40 ++--
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
+   struct rw_semaphore bg_delete_sem;

/*
 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..65a1465 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2676,6 +2676,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->ordered_operations_mutex);
mutex_init(&fs_info->tree_log_mutex);
mutex_init(&fs_info->chunk_mutex);
+   init_rwsem(&fs_info->bg_delete_sem);
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
mutex_init(&fs_info->volume_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..d1f8638 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, 
u64 bytes)
int ret = 0;
int need_commit = 2;
int have_pinned_space;
+   int have_bg_delete_sem = 0;

/* make sure bytes are sectorsize aligned */
bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, 
u64 bytes)
}

data_sinfo = fs_info->data_sinfo;
-   if (!data_sinfo)
+   if (!data_sinfo) {
+   down_read(&root->fs_info->bg_delete_sem);
+   have_bg_delete_sem = 1;
goto alloc;
+   }

 again:
/* make sure we have enough space to handle the data first */
@@ -4134,10 +4138,21 @@ again:
if (used + bytes > data_sinfo->total_bytes) {
struct btrfs_trans_handle *trans;

+   spin_unlock(&data_sinfo->lock);
+   /*
+* We may need to allocate new chunk, so we should block
+* btrfs_delete_unused_bgs()
+*/
+   if (have_bg_delete_sem == 0) {
+   down_read(&root->fs_info->bg_delete_sem);
+   have_bg_delete_sem = 1;
+   }
+


We could race with somebody else doing an allocation here, so instead do 
something like


if (!have_bg_delete_sem) {
        spin_unlock(&data_sinfo->lock);
        down_read(&root->fs_info->bg_delete_sem);
        have_bg_delete_sem = 1;
        spin_lock(&data_sinfo->lock);
        if (used + bytes <= data_sinfo->total_bytes)
                goto done; // add a done label at bytes_may_use += bytes
}

Thanks,

Josef


Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2016-07-25 Thread David Sterba
On Mon, Jul 18, 2016 at 09:42:50AM -0400, Josef Bacik wrote:
> >
> > This makes the search for BLOCK_GROUP_ITEM very, very slow if the extent
> > tree is really big.
> >
> > On the other hand, we search CHUNK_ITEM very fast, because CHUNK_ITEMs are
> > in their own tree.
> > (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped)
> >
> > So to completely fix it, btrfs needs an on-disk format change to put
> > BLOCK_GROUP_ITEM into its own tree.
> >
> > IMHO there may be some objection from other devs though.
> >
> > Anyway, I added the three maintainers to Cc, and hope we can get a better
> > idea of how to fix it.
> 
> Yeah I'm going to fix this when I do the per-block group extent tree thing.  
> Thanks,

Will it be capable of "per-subvolume-set" extent trees? IOW, a set of
subvolumes could share extents only among the members of the same
group. The use case is to start an isolated subvolume and allow snapshots
(and obviously forbid reflinks outside of the group).


Re: [PATCH v12.2 03/15] btrfs: dedupe: Introduce function to initialize dedupe info

2016-07-25 Thread kbuild test robot
Hi,

[auto build test ERROR on next-20160724]
[cannot apply to btrfs/next linus/master stable/master v4.7-rc7 v4.7-rc6 
v4.7-rc5 v4.7]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-In-band-De-duplication/20160725-140320
config: i386-randconfig-s0-07250950 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/built-in.o: In function `btrfs_dedupe_enable':
>> (.text+0xd3c77b): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


[PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue

2016-07-25 Thread Wang Xiaoguang
Currently in btrfs, for data space reservation, bytes_may_use is not updated
in btrfs_update_reserved_bytes(); the decrease operation is delayed until
extent_clear_unlock_delalloc(), and for fallocate(2) the decrease is delayed
even further, to the end of btrfs_fallocate(), which is too late. Obviously
such a delay puts unnecessary pressure on the ENOSPC system.

So in this patch set, we remove RESERVE_FREE, RESERVE_ALLOC and
RESERVE_ALLOC_NO_ACCOUNT, and always update bytes_may_use in a timely manner.

I have already sent an fstests test case for this issue, and I could send
[PATCH 4/4] as an independent patch, but its bug can also be revealed by
the same reproducer script, so I include it here.

Changelog:
v2:
  Fix a trace point issue.

Wang Xiaoguang (4):
  btrfs: use correct offset for reloc_inode in
prealloc_file_extent_cluster()
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: should block unused block groups deletion work when allocating
data space

 fs/btrfs/ctree.h   |   3 +-
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/extent-tree.c | 171 -
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c|  26 
 fs/btrfs/inode-map.c   |   3 +-
 fs/btrfs/inode.c   |  37 ---
 fs/btrfs/relocation.c  |  17 +++--
 8 files changed, 159 insertions(+), 100 deletions(-)

-- 
2.9.0





[PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()

2016-07-25 Thread Wang Xiaoguang
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the
wrong file offset for reloc_inode: it uses cluster->start and cluster->end,
which are in fact extent bytenrs. The correct value should be
cluster->[start|end] minus the block group's start bytenr.

start bytenr   cluster->start
|  | extent  |   extent   | ...| extent |
||
|block group reloc_inode |

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/relocation.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..a0de885 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3030,12 +3030,14 @@ int prealloc_file_extent_cluster(struct inode *inode,
u64 num_bytes;
int nr = 0;
int ret = 0;
+   u64 prealloc_start = cluster->start - offset;
+   u64 prealloc_end = cluster->end - offset;
 
BUG_ON(cluster->start != cluster->boundary[0]);
inode_lock(inode);
 
-   ret = btrfs_check_data_free_space(inode, cluster->start,
- cluster->end + 1 - cluster->start);
+   ret = btrfs_check_data_free_space(inode, prealloc_start,
+ prealloc_end + 1 - prealloc_start);
if (ret)
goto out;
 
@@ -3056,8 +3058,8 @@ int prealloc_file_extent_cluster(struct inode *inode,
break;
nr++;
}
-   btrfs_free_reserved_data_space(inode, cluster->start,
-  cluster->end + 1 - cluster->start);
+   btrfs_free_reserved_data_space(inode, prealloc_start,
+  prealloc_end + 1 - prealloc_start);
 out:
inode_unlock(inode);
return ret;
-- 
2.9.0





[PATCH v2 2/4] btrfs: divide btrfs_update_reserved_bytes() into two functions

2016-07-25 Thread Wang Xiaoguang
This patch divides btrfs_update_reserved_bytes() into
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and the
next patch will extend btrfs_add_reserved_bytes() to fix some
false ENOSPC errors; please see the later patch for detailed info.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 97 +-
 1 file changed, 57 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..8eaac39 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -104,9 +104,10 @@ static int find_next_key(struct btrfs_path *path, int 
level,
 struct btrfs_key *key);
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
int dump_block_groups);
-static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
-  u64 num_bytes, int reserve,
-  int delalloc);
+static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+   u64 num_bytes, int reserve, int delalloc);
+static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
+u64 num_bytes, int delalloc);
 static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
   u64 num_bytes);
 int btrfs_pin_extent(struct btrfs_root *root,
@@ -6297,19 +6298,14 @@ void btrfs_wait_block_group_reservations(struct 
btrfs_block_group_cache *bg)
 }
 
 /**
- * btrfs_update_reserved_bytes - update the block_group and space info counters
+ * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache: The cache we are manipulating
  * @num_bytes: The number of bytes in question
  * @reserve:   One of the reservation enums
  * @delalloc:   The blocks are allocated for the delalloc write
  *
- * This is called by the allocator when it reserves space, or by somebody who 
is
- * freeing space that was never actually used on disk.  For example if you
- * reserve some space for a new leaf in transaction A and before transaction A
- * commits you free that leaf, you call this with reserve set to 0 in order to
- * clear the reservation.
- *
- * Metadata reservations should be called with RESERVE_ALLOC so we do the 
proper
+ * This is called by the allocator when it reserves space. Metadata
+ * reservations should be called with RESERVE_ALLOC so we do the proper
  * ENOSPC accounting.  For data we handle the reservation through clearing the
  * delalloc bits in the io_tree.  We have to do this since we could end up
  * allocating less disk space for the amount of data we have reserved in the
@@ -6319,44 +6315,65 @@ void btrfs_wait_block_group_reservations(struct 
btrfs_block_group_cache *bg)
  * make the reservation and return -EAGAIN, otherwise this function always
  * succeeds.
  */
-static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
-  u64 num_bytes, int reserve, int delalloc)
+static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+   u64 num_bytes, int reserve, int delalloc)
 {
struct btrfs_space_info *space_info = cache->space_info;
int ret = 0;
 
spin_lock(&space_info->lock);
spin_lock(&cache->lock);
-   if (reserve != RESERVE_FREE) {
-   if (cache->ro) {
-   ret = -EAGAIN;
-   } else {
-   cache->reserved += num_bytes;
-   space_info->bytes_reserved += num_bytes;
-   if (reserve == RESERVE_ALLOC) {
-   trace_btrfs_space_reservation(cache->fs_info,
-   "space_info", space_info->flags,
-   num_bytes, 0);
-   space_info->bytes_may_use -= num_bytes;
-   }
-
-   if (delalloc)
-   cache->delalloc_bytes += num_bytes;
-   }
+   if (cache->ro) {
+   ret = -EAGAIN;
} else {
-   if (cache->ro)
-   space_info->bytes_readonly += num_bytes;
-   cache->reserved -= num_bytes;
-   space_info->bytes_reserved -= num_bytes;
+   cache->reserved += num_bytes;
+   space_info->bytes_reserved += num_bytes;
+   if (reserve == RESERVE_ALLOC) {
+   trace_btrfs_space_reservation(cache->fs_info,
+   "space_info", space_info->flags,
+   num_bytes, 0);
+   space_info->bytes_may_use -= num_bytes;
+   }
 
if (delalloc)
-  

[PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely

2016-07-25 Thread Wang Xiaoguang
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The above script will fail with ENOSPC, but the fs actually still has free
space to satisfy this request. Please see the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
|   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
|   change bytes_may_use, and bytes_reserved += 64M. Now
|   bytes_may_use + bytes_reserved == 128M, which is greater
|   than btrfs_space_info's total_bytes, false enospc occurs.
|   Note, the bytes_may_use decrease operation will be done in
|   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
CPU 1  |  CPU 2
   |
|-> cow_file_range()   |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() |   |
|  |   |
|  |   |
|. |   |-> btrfs_check_data_free_space()
|  |
|  |
|-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
operation is delayed until extent_clear_unlock_delalloc().
Assume in this case that btrfs_reserve_extent() reserved 128MB of data and
CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
then btrfs_check_data_free_space() will try to allocate a new data chunk,
call btrfs_start_delalloc_roots(), or commit the current transaction in
order to reserve some free space, which is obviously a lot of work. But
this is not necessary: as long as bytes_may_use is decreased in a timely
manner (decreasing the 128M from bytes_may_use), we still have free space.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For the compression path,
the real extent length may not equal the file content length, so introduce
a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(); this is because bytes_may_use is increased by
the file content length. Then the compression path can update bytes_may_use
correctly. Also, now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In
run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
PREALLOC, we also need to update bytes_may_use, but we can not pass
EXTENT_DO_ACCOUNTING because it also clears the metadata reservation, so
here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() calls
btrfs_free_reserved_data_space() internally for both the successful and the
failed path, so btrfs_prealloc_file_range()'s callers do not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/extent-tree.c | 56 +-
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/file.c| 26 +--
 fs/btrfs/inode-map.c   |  3 +--
 fs/btrfs/inode.c   | 37 -
 fs/btrfs/relocation.c  | 11 --
 7 files changed, 73 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4274a7b..7eb2913 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2556,7 +2556,7 @@ int btrfs_alloc_logged_file_extent(struct 
btrfs_trans_handle *trans,
   struct btrfs_root *root,
   u64 root_objectid, u64 owner, u64 offset,
   struct btrfs_key *ins);
-int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
+int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, u64 num_bytes,
 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 struct btrfs_key *ins, int is_data, int delalloc);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 

[PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space

2016-07-25 Thread Wang Xiaoguang
cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it may also
result in false ENOSPC errors. Please see the race window below:

                CPU1                          |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |    |
    |   assume it returns ENOSPC, which means |    |
    |   btrfs_space_info is full and has no   |    |
    |   free space to satisfy the data request|    |
    |                                         |    |-> btrfs_delete_unused_bgs()
    |                                         |        it will decrease btrfs_space_info's
    |                                         |        total_bytes and make
    |                                         |        btrfs_space_info not full
    |                                         |
In this case, we may get an ENOSPC error, but btrfs_space_info is not full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block
btrfs_delete_unused_bgs(). So here we introduce a new rw_semaphore,
bg_delete_sem, to do this job.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/extent-tree.c | 40 ++--
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
+   struct rw_semaphore bg_delete_sem;
 
/*
 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..65a1465 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2676,6 +2676,7 @@ int open_ctree(struct super_block *sb,
	mutex_init(&fs_info->ordered_operations_mutex);
	mutex_init(&fs_info->tree_log_mutex);
	mutex_init(&fs_info->chunk_mutex);
+	init_rwsem(&fs_info->bg_delete_sem);
	mutex_init(&fs_info->transaction_kthread_mutex);
	mutex_init(&fs_info->cleaner_mutex);
	mutex_init(&fs_info->volume_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..d1f8638 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, 
u64 bytes)
int ret = 0;
int need_commit = 2;
int have_pinned_space;
+   int have_bg_delete_sem = 0;
 
/* make sure bytes are sectorsize aligned */
bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, 
u64 bytes)
}
 
data_sinfo = fs_info->data_sinfo;
-   if (!data_sinfo)
+   if (!data_sinfo) {
+   down_read(&root->fs_info->bg_delete_sem);
+   have_bg_delete_sem = 1;
goto alloc;
+   }
 
 again:
/* make sure we have enough space to handle the data first */
@@ -4134,10 +4138,21 @@ again:
if (used + bytes > data_sinfo->total_bytes) {
struct btrfs_trans_handle *trans;
 
+   spin_unlock(&data_sinfo->lock);
+   /*
+    * We may need to allocate a new chunk, so we should block
+    * btrfs_delete_unused_bgs().
+    */
+   if (have_bg_delete_sem == 0) {
+   down_read(&root->fs_info->bg_delete_sem);
+   have_bg_delete_sem = 1;
+   }
+
/*
 * if we don't have enough free bytes in this space then we need
 * to alloc a new chunk.
 */
+   spin_lock(&data_sinfo->lock);
if (!data_sinfo->full) {
u64 alloc_target;
 
@@ -4156,17 +4171,20 @@ alloc:
 * the fs.
 */
trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans))
+   if (IS_ERR(trans)) {
+   up_read(&root->fs_info->bg_delete_sem);
return PTR_ERR(trans);
+   }
 
ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 alloc_target,
 CHUNK_ALLOC_NO_FORCE);
btrfs_end_transaction(trans, root);
if (ret < 0) {
-   if (ret != -ENOSPC)
+   if (ret != -ENOSPC) {
+   

[PATCH v2] generic/371: run write(2) and fallocate(2) in parallel

2016-07-25 Thread Wang Xiaoguang
Currently in btrfs, there is something wrong with fallocate(2)'s data
space reservation: it temporarily occupies more data space than it
really needs, which in turn impacts other operations' data requests.

In this test case, we run write(2) and fallocate(2) in parallel; the
total data space needed by these two operations doesn't exceed the whole
fs free data space, so we check whether we get any unexpected ENOSPC error.

Signed-off-by: Wang Xiaoguang 
---
v2: adopt Eryu Guan's suggestions to make this reproducer cleaner, thanks
---
 tests/generic/371 | 75 +++
 tests/generic/371.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 78 insertions(+)
 create mode 100755 tests/generic/371
 create mode 100644 tests/generic/371.out

diff --git a/tests/generic/371 b/tests/generic/371
new file mode 100755
index 000..7955856
--- /dev/null
+++ b/tests/generic/371
@@ -0,0 +1,75 @@
+#! /bin/bash
+# FS QA Test 371
+#
+# Run write(2) and fallocate(2) in parallel; the total data space needed by
+# these operations doesn't exceed the whole fs free data space, so we should
+# not see any unexpected ENOSPC error.
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "falloc"
+
+_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
+_scratch_mount
+
+testfile1=$SCRATCH_MNT/testfile1
+testfile2=$SCRATCH_MNT/testfile2
+
+echo "Silence is golden"
+for ((i=0; i<100; i++)); do
+   $XFS_IO_PROG -fc "pwrite 0 80M" $testfile1 >/dev/null
+   rm -f $testfile1
+done &
+pids=$!
+
+for ((i=0; i<100; i++)); do
+   $XFS_IO_PROG -fc "falloc 0 80M" $testfile2 >/dev/null
+   rm -f $testfile2
+done &
+pids="$pids $!"
+
+wait $pids
+status=0
+exit
diff --git a/tests/generic/371.out b/tests/generic/371.out
new file mode 100644
index 000..22ec8a2
--- /dev/null
+++ b/tests/generic/371.out
@@ -0,0 +1,2 @@
+QA output created by 371
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 97ecb65..1236281 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -373,3 +373,4 @@
 368 auto quick richacl
 369 auto quick richacl
 370 auto quick richacl
+371 auto quick enospc prealloc
-- 
2.9.0





Any suggestions for thousands of disk image snapshots?

2016-07-25 Thread Kurt Seo
 Hi all


 I am currently running a project building servers with btrfs.
The servers export disk images through iSCSI targets, and the disk images
are generated from btrfs subvolume snapshots.
The maximum number of clients is 500 and each client uses two snapshots of
disk images: the first disk image is about 50GB and the second one is
about 1.5TB.
The important thing is that the original 1.5TB disk image is mounted via a
loop device and modified in real time - e.g. continuously downloading
torrents into it.
Snapshots are made when clients boot up and deleted when they are turned off.

So the server has two original disk images and about a thousand snapshots
in total.
I made a list of factors that affect the server's performance and stability.

1. Raid Configuration - Mdadm raid vs btrfs raid, configuration and
options for them.
2. How to format btrfs - nodesize, features
3. Mount options - nodatacow and compression things.
4. Kernel parameter tuning.
5. Hardware specification.


My current setup is:

1. mdadm raid10 with a 1024k chunk and 12 x 512GB SSDs.
2. nodesize 32k and nothing else.
3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
4. Ubuntu with a 4.1.27 kernel, no additional configuration.
5. Hardware:
CPU : Xeon E3-1225 v2 quad core 3.2GHz
RAM : 2 x DDR3 8GB ECC (total 16GB)
NIC : 2 x 10GbE


 The test results so far are:

1. btrfs-transaction and btrfs-cleaner consume CPU regularly.
2. While the CPU is busy with those processes, creating snapshots takes a long time.
3. Performance degrades as time goes by.


So if any of my configuration is wrong or missing something, can you suggest
improvements? For example, do I need to increase physical memory?

Any idea would help me a lot.

Thank you


Seo


[PATCH RFC] btrfs: send: Disable clone detection

2016-07-25 Thread Qu Wenruo
This patch disables clone detection in send.

The main problems with clone detection in send are its memory usage and
long execution time.

Clone detection is done by iterating over all backrefs of an extent and
collecting those backrefs whose root is a clone source.

However, iterating over all backrefs is already quite a bad idea, and we
should never try to do it in a loop. Unfortunately, in-band/out-of-band
dedupe and reflink can easily create a file whose file extents all point
to the same extent.

In that case, btrfs will do a backref walk for the same extent again and
again, until either OOM or a soft lockup is triggered.

So disable clone detection until we find a method that iterates over the
backrefs of an extent only once, just like balance/qgroup does.
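
As a back-of-the-envelope illustration of the cost (toy code, not btrfs
code): if one shared data extent ends up with n file extent items pointing
at it, doing a full backref walk for every file extent item costs roughly
n * n backref visits.

#include <stdio.h>

int main(void)
{
	long long n;

	/* one full backref walk (cost ~n) per file extent item (n items) */
	for (n = 1000; n <= 1000000; n *= 10)
		printf("%8lld file extent items -> ~%15lld backref visits\n",
		       n, n * n);
	return 0;
}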

Cc: Filipe Manana 
Reported-by: Tsutomu Itoh 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/send.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2db8dc8..eed3f1c 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1166,6 +1166,7 @@ struct backref_ctx {
int found_itself;
 };
 
+#if 0
 static int __clone_root_cmp_bsearch(const void *key, const void *elt)
 {
u64 root = (u64)(uintptr_t)key;
@@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void *key, 
const void *elt)
return 1;
return 0;
 }
+#endif
 
 static int __clone_root_cmp_sort(const void *e1, const void *e2)
 {
@@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1, const 
void *e2)
return 0;
 }
 
+#if 0
 /*
  * Called for every backref that is found for the current extent.
  * Results are collected in sctx->clone_roots->ino/offset/found_refs
@@ -1445,6 +1448,7 @@ out:
kfree(backref_ctx);
return ret;
 }
+#endif
 
 static int read_symlink(struct btrfs_root *root,
u64 ino,
@@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
  struct btrfs_path *path,
  struct btrfs_key *key)
 {
-   struct clone_root *found_clone = NULL;
int ret = 0;
 
if (S_ISLNK(sctx->cur_inode_mode))
@@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
}
}
 
+   /*
+    * Current clone detection is both time and memory consuming.
+    *
+    * The time cost comes from iterating over all backrefs of an extent.
+    * The memory cost comes from allocating "found_clone" every time a
+    * backref is found.
+    *
+    * XXX: Disabling it is never the best method, but at least it
+    * won't cause OOM or super long execution time.
+    * The real fix needs to change the iteration basis from iterating
+    * file extents to iterating extents, so find_parent_nodes() and
+    * the backref walk are called only once per extent.
+    */
+#if 0
ret = find_extent_clone(sctx, path, key->objectid, key->offset,
			sctx->cur_inode_size, &found_clone);
if (ret != -ENOENT && ret < 0)
goto out;
+#endif
 
-   ret = send_write_or_clone(sctx, path, key, found_clone);
+   ret = send_write_or_clone(sctx, path, key, NULL);
if (ret)
goto out;
 out_hole:
-- 
2.9.0





Re: [PATCH] generic/371: run write(2) and fallocate(2) in parallel

2016-07-25 Thread Wang Xiaoguang

hello,

On 07/21/2016 06:41 PM, Eryu Guan wrote:

On Thu, Jul 21, 2016 at 03:30:25PM +0800, Wang Xiaoguang wrote:

Currently in btrfs, there is something wrong with fallocate(2)'s data
space reservation: it temporarily occupies more data space than it
really needs, which in turn impacts other operations' data requests.

In this test case, we run write(2) and fallocate(2) in parallel; the
total data space needed by these two operations doesn't exceed the whole
fs free data space, so we check whether we get any unexpected ENOSPC error.

Signed-off-by: Wang Xiaoguang 
---
  tests/generic/371 | 119 ++
  tests/generic/371.out |   2 +
  tests/generic/group   |   1 +
  3 files changed, 122 insertions(+)
  create mode 100755 tests/generic/371
  create mode 100644 tests/generic/371.out

diff --git a/tests/generic/371 b/tests/generic/371
new file mode 100755
index 000..b85327a
--- /dev/null
+++ b/tests/generic/371
@@ -0,0 +1,119 @@
+#! /bin/bash
+# FS QA Test 371
+#
+# Run write(2) and fallocate(2) in parallel; the total data space needed by
+# these operations doesn't exceed the whole fs free data space, so we should
+# not see any unexpected ENOSPC error.
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch

Need '_require_xfs_io_command "falloc"', otherwise the test fails on ext2/3,
because they don't support fallocate(2).

OK, I see.




+
+_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
+_scratch_mount
+
+testfile1=$SCRATCH_MNT/testfile1
+testfile2=$SCRATCH_MNT/testfile2
+
+write_work()
+{
+   rm -f $testfile1
+   while [ 1 ]; do
+   $XFS_IO_PROG -f -c "pwrite 0 80M" $testfile1 >>$seqres.full 2>&1
+   grep "No space left on device" $seqres.full >/dev/null
+   if [ $? -eq 0 ]; then
+   echo "unexpected ENOSPC error occurs"
+   exit 1
+   fi

No need to grep for the error message; just redirect stdout to /dev/null and
any error will appear on stderr, which will break the golden image.

And $seqres.full is where we dump logs for debugging purposes; tests should
dump output they need somewhere like $tmp.

I see.




+   rm -f $testfile1
+   done
+}
+
+fallocate_work()
+{
+   rm -f  $testfile2
+   while [ 1 ]; do
+   $XFS_IO_PROG -f -c "falloc 0 80M" $testfile2 >>$seqres.full 2>&1
+   grep "No space left on device" $seqres.full >/dev/null
+   if [ $? -eq 0 ]; then
+   echo "unexpected ENOSPC error occurs"
+   exit 1
+   fi
+   rm -f $testfile2
+   done
+}
+
+run_time=$((180 * $TIME_FACTOR))

180s is too long; I can reproduce it in around 10s on my test vm. Just loop
100 times for each operation (pwrite and falloc).


+write_work &
+write_work_pid=$!
+fallocate_work &
+fallocate_work_pid=$!
+
+for ((elapsed_time = 0; elapsed_time < run_time; elapsed_time += 5)); do
+   kill -0 $write_work_pid >/dev/null 2>&1
+   if [ $? -ne 0 ]; then
+   kill $fallocate_work_pid >/dev/null 2>&1
+   break
+   fi
+
+   kill -0 $fallocate_work_pid >/dev/null 2>&1
+   if [ $? -ne 0 ]; then
+   kill $write_work_pid >/dev/null 2>&1
+   break
+   fi
+   sleep 5
+done
+
+kill $fallocate_work_pid >/dev/null 2>&1
+kill $write_work_pid >/dev/null 2>&1
+wait
+
+# wait un-finished xfs_io
+while ps aux | grep "xfs_io" | grep -qv grep; do
+   sleep 1
+done

And this seems unnecessarily complicated

So I'd write it as:

echo "Silence is golden"
for ((i=0; i<100; i++)); do
 $XFS_IO_PROG -fc