Re: Force recalculation of a data block checksum
On 24.07.2016 07:07, Andrei Borzenkov wrote:
> Rewriting the single block in question should do it. As you can read
> from the raw device and have both physical and logical block address,
> something like:
>
> dd if=/dev/sdd1 skip=222940168 count=8 | dd of=/path/to/dysk/dysk.bin seek=189669768 conv=notrunc count=8

Thank you for the one-liner. I thought of doing this myself before, but when I tried to figure out which numbers mean what in the scrub messages, I got a bit lost.

However, this one-liner doesn't seem to work here; `dd` prints:

#v+
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 5.2857e-05 s, 77.5 MB/s
dd: writing to ‘/media/liori/USB/dysk/dysk.bin’: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00364222 s, 0.0 kB/s
#v-

I'm not sure what this message means in this case. Is btrfs reading the block before writing? Or maybe there is some block blacklist stored somewhere (I see that the corruption counter in the scrub messages goes up even if it reports the same errors in subsequent runs)?

> If you have the possibility to verify file content(s), this technique
> can be used to read over unreadable parts of a file by combining
> multiple dd's from file and device.

Some of the files stored in this `dysk/dysk.bin` disk image have internal checksums (zip archives, etc.), including the ones hit by the checksum error. I'll try verifying these later.

--
Tomasz Melcer

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
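Andrei's "combine multiple dd's" technique can be sketched on regular files; the snippet below is a hypothetical stand-in (temp files instead of the raw device, made-up offsets) that only demonstrates the splice mechanics:

```shell
# Toy demo of splicing a good 4 KiB block into an image file.
# $good stands in for the readable raw device (e.g. /dev/sdd1) and
# $img for dysk.bin; the sector offsets here are invented, on a real
# system they come from the scrub messages.
good=$(mktemp)
img=$(mktemp)
printf 'a%.0s' $(seq 4096) > "$good"   # 4 KiB of known-good data
head -c 16384 /dev/zero > "$img"       # 16 KiB "damaged" image
# splice 8 sectors (4 KiB) over sectors 8..15 of the image;
# conv=notrunc leaves the rest of the image untouched
dd if="$good" bs=512 count=8 2>/dev/null |
  dd of="$img" bs=512 seek=8 conv=notrunc count=8 2>/dev/null
```

On the real system the first dd would read from /dev/sdd1 at the physical sector offset, and the second would write into dysk.bin at the logical one, exactly as in the quoted one-liner.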
Re: [BTRFS-PROGS][PATCH] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'
At 07/26/2016 01:14 AM, Goffredo Baroncelli wrote:
> On 2016-07-25 04:14, Qu Wenruo wrote:
>> Hi Goffredo,
>>
>> At 07/24/2016 07:03 PM, Goffredo Baroncelli wrote:
>>> Hi all,
>>>
>>> the following patches add two new commands: 1) btrfs
>>> inspect-internal physical-find 2) btrfs inspect-internal
>>> physical-dump
>>>
>>> The aim of these two new commands is to locate (1) and dump (2) the
>>> stripe elements stored on the disks. I developed these two new
>>> commands to simplify the debugging of some RAID5 bugs (but this is
>>> another discussion).
>>
>> That's a pretty nice function. However the function seems to be a
>> combination of fiemap (to resolve file to logical) and resolve
>> logical (to resolve logical to on-device bytenr), with RAID5/6
>> specific flags.
>>
>> Instead of introducing a new function doing a combination of existing
>> features, would you mind expanding the current logical-resolve?
>
> [I suppose that you are referring to ./btrfs-map-logical and not
> logical-resolve]

Oh sorry, I misunderstood the recent trend to merge standalone tools into btrfs.

> Before developing these two utils, I was unaware of logical-resolve.
> Frankly speaking I am skeptical about extending btrfs-map-logical,
> which is outside the "btrfs" family of commands. So I consider it
> legacy code: the kind which should not be extended, because it was
> born as a "quick and dirty" utility.

IIRC, there is a trend to merge such tools into inspect-internal subcommands.

> Finally, there is a big difference between my tools and
> btrfs-map-logical. The former needs a mounted filesystem, the latter
> doesn't.

Oh, I really missed the difference. So your tool is based on the tree search ioctl, not the offline tree search. In that case, your tool is totally useful, as we don't have any online tool for that purpose yet.

>> IMHO, it's not a good idea to cross two different logical layers
>> (file<->logical mapping layer and logical<->dev extent mapping
>> layer) in one function.
>
> Indeed, it was the goal. The use case is a tool to help find bugs in
> the raid5 code.
> Before my tools, I used some hacks to parse the output of filefrag,
> combining this with btrfs-debug-tree info. So I developed "btrfs insp
> physical-find". Then Chris asked me to make a further step, so I made
> 'btrfs insp physical-dump'. I think that we definitely need a tool
> like "btrfs insp physical-find/dump".
>
> If you think that btrfs-map-logical has its use case, I can develop
> physical-dump/find further to start from a "logical" address instead
> of a file: it would be a lot simpler for me than extending
> btrfs-map-logical.

Please do it, as this makes it possible to inspect metadata extents too. And it can also make the code easier to use from other subcommands.

Thanks,
Qu

>> Thanks, Qu
>>
>>> An example of 'btrfs inspect-internal physical-find' is the
>>> following:
>>>
>>> # btrfs inspect physical-find mnt/out.txt
>>> mnt/out.txt: 0
>>> 	devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
>>> 	devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>>> 	devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
>>> 	devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY
>>>
>>> In the output above, DATA is the stripe element containing data.
>>> OTHER is the sibling stripe element: it contains data related either
>>> to other files or to the same file but at a different position. The
>>> two PARITY stripe elements contain the RAID6 parity (P and Q).
>>>
>>> It is possible to pass the offset of the file to inspect.
>>> An example of 'btrfs inspect-internal physical-dump' is the
>>> following:
>>>
>>> # btrfs insp physical-find mnt/out.txt
>>> mnt/out.txt: 0
>>> 	devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER
>>> 	devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER
>>> 	devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA
>>> 	devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY
>>> 	devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY
>>>
>>> # btrfs insp physical-dump mnt/out.txt | xxd
>>> mnt/out.txt: 0
>>> file: /dev/loop2 off=56819712
>>> 00000000: 6164 6161 6161 6161 6161 6161 6161 6161  adaaaaaaaaaaaaaa
>>> 00000010: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>>> 00000020: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>>> 00000030: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>>> 00000040: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>>> [...]
>>>
>>> In this case the content of the first 4k of the file is dumped. It
>>> is also possible to pass an offset (in steps of 4k). Moreover it is
>>> possible to select what to dump: which copy (switch -c, only for
>>> RAID1/RAID10/DUP); which parity (switch -p, only for RAID5/RAID6);
>>> which stripe element other than data (switch -s, only for
>>> RAID5/RAID6).
>>>
>>> BR G.Baroncelli
Re: [PATCH RFC] btrfs: send: Disable clone detection
At 07/25/2016 11:37 PM, David Sterba wrote:
> On Mon, Jul 25, 2016 at 03:19:59PM +0800, Qu Wenruo wrote:
>> This patch will disable clone detection of send.
>
> Making that unconditional is not the right way. We do have the send
> operation flags in place, so if you really think there's a need for
> disabling the clones, let's add a flag for that (BTRFS_SEND_FLAG_*).

The problem is, we don't know if it will cause problems until it happens. So a flag like BTRFS_SEND_FLAG_DISABLE_CLONE won't really help, unless it becomes the default.

Anyway, the RFC patch is not meant to be merged, but only to gather ideas and advice on how to really fix the problem. I think Filipe will have some good ideas once he understands the real problem behind the bug, since he is much more experienced than me in the send code.

Thanks,
Qu
Re: [PATCH RFC] btrfs: send: Disable clone detection
At 07/25/2016 09:48 PM, Filipe Manana wrote:
> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo wrote:
>> This patch will disable clone detection of send.
>>
>> The main problem of clone detection in send is its memory usage and
>> long execution time.
>>
>> The clone detection is done by iterating all backrefs and adding the
>> backrefs whose root is the source.
>>
>> However iterating all backrefs is already quite a bad idea; we should
>> never try to do it in a loop, and unfortunately in-band/out-of-band
>> dedupe and reflink can easily create a file whose file extents all
>> point to the same extent.
>>
>> In that case, btrfs will do a backref walk for the same extent again
>> and again, until either OOM or a soft lockup is triggered.
>>
>> So disable clone detection until we find a method that iterates the
>> backrefs of one extent only once, just like what balance/qgroup are
>> doing.
>
> Is this really a common scenario?

If any user can create such a file without root privilege, and it ties up kernel CPU for a long time, it's a big problem no matter whether the scenario is common.

> I don't recall ever seeing a report of such slow down or oom due to
> the same file pointing to the same extent thousands of times.

Then check the test case I submitted days ago. No dedupe involved at all. IIRC, I Cced you.
https://patchwork.kernel.org/patch/9235825/

And the FIEMAP ioctl has the same problem; a test case is already merged into fstests, check generic/352.

> Sure it's easy to create such scenario with artificial test data for
> our inband deduplication feature (and we should really strive to make
> what we have stable and reliable rather than add yet more features).

This can be triggered even without dedupe. Just check the test case.

And you should yell at out-of-band dedupe before in-band, because with out-of-band dedupe all these bugs have been triggerable for a long time, but no one really cared until they were exposed by in-band dedupe. The only part dedupe plays here is making us aware that there are still a lot of bugs that no one really cared about before.
> What needs to be fixed here is the common backref walking code, called
> by send plus a few other places (fiemap and some ioctls at least), for
> which IIRC there was some recent patch from one of your colleagues.

Yes, Lu's fix for fiemap is OK, but it won't really fix the problem here. The fix is only for fiemap, as it just does an early exit instead of always doing a backref walk.

There are several backref callers, including:
1) balance
2) qgroup
3) fiemap (check_shared)
4) send

But for the same reflinked file, 1) and 2) won't cause such long execution times, simply because balance and qgroup will at most call find_parent_nodes() on the same *EXTENT* twice.

However old fiemap, and current send, call find_parent_nodes() on every *FILE* *EXTENT* which points to the same *EXTENT*.

And according to my ftrace, for one extent shared 4096 times, it takes about 1.6 sec to execute find_parent_nodes(). So for balance and qgroup, an extra 3.2 seconds won't be a big problem. (Although mixing balance and qgroup is another problem.)

But for fiemap or send, it will be 4096 * 1.6 = 6553 secs, definitely not the right thing.

> Disabling this code makes the problem go away for the scenario of the
> same file pointing to the same extent thousands of times (or tens or
> hundreds of thousands, whatever). But what happens if a file points to
> an extent once but the extent is referenced by other 100k files
> (possibly in different snapshots or subvolumes), isn't it the same
> problem?

No, not the same problem. In that case, the extent will only be iterated once, just 1.6 sec. Nothing will be wrong at all.

The only wrong thing is that send lacks a method to avoid calling find_parent_nodes() again and again on the same extent. And yes, we could fix it in the backref code, but right now send is the only caller which may call find_parent_nodes() on the same extent again and again. Qgroup has already been changed from calling find_parent_nodes() every time to calling it only twice.
Balance itself is already extent based, nothing to do with the bug. And fiemap has gained an early quit. Only send is left here.

> The backref code has to find all such inodes/offsets/roots and take
> again the same long time... Following your logic, it seems like
> everything that uses the backref walking code should be disabled.

Just check my previous statement. It's not about the execution time of find_parent_nodes() and its variants, but about the caller's behavior.

If a caller calls find_parent_nodes() only once per extent, it's just less than 2 seconds (almost linear growth with the number of references). While if a caller calls find_parent_nodes() on every file extent it finds, then there is the problem (squared to cubic growth).

And I know disabling clone detection is quite a bad idea, so I sent this patch as RFC. Just to raise the concern about any find_parent_nodes() caller, and to find a good method to fix the bug.

Thanks,
Qu
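Qu's distinction between caller behaviors can be illustrated with a toy shell model (no btrfs involved; the counts and the extent name are made up): resolving backrefs once per *file extent* does N walks for a file whose N file extents share one extent, while memoizing per *extent*, as balance/qgroup effectively do, walks it once.

```shell
# Toy model (not btrfs code): one extent shared by N file extents.
N=4096
naive=0
memo=0
declare -A seen                 # memo table keyed by extent bytenr
for ((i = 0; i < N; i++)); do
    extent="12582912"           # every file extent points at one extent
    naive=$((naive + 1))        # one find_parent_nodes() per file extent
    if [ -z "${seen[$extent]}" ]; then
        seen[$extent]=1
        memo=$((memo + 1))      # one find_parent_nodes() per extent
    fi
done
echo "walks: naive=$naive memoized=$memo"
```

At roughly 1.6 sec per walk, that is the difference between Qu's 6553 sec figure for send/old fiemap and under 2 sec for the memoized callers.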
Re: Any suggestions for thousands of disk image snapshots ?
On Mon, Jul 25, 2016 at 2:49 PM, Chris Murphy wrote:
> 1. mdadm raid10 + LVM thinp + either XFS or Btrfs.

You probably want to use LVM thin provisioning because its snapshot implementation is much faster and more hands-off than thick provisioning. And you'll need to use mdadm raid10 instead of LVM raid10 because thinp doesn't support RAID.

--
Chris Murphy
Re: Any suggestions for thousands of disk image snapshots ?
On Mon, Jul 25, 2016 at 1:25 AM, Kurt Seo wrote:
> Hi all
>
> I am currently running a project for building servers with btrfs.
> The purpose of the servers is to export disk images through iscsi
> targets, and the disk images are generated from btrfs subvolume
> snapshots.

How is the disk image generated from a Btrfs subvolume snapshot? On what file system is the disk image stored?

> The maximum number of clients is 500 and each client uses two
> snapshots of disk images. The first disk image's size is about 50GB
> and the second one is about 1.5TB.
> The important thing is that the original 1.5TB disk image is mounted
> with a loop device and modified in real time - e.g. continuously
> downloading torrents in it.
> Snapshots are made when clients boot up and deleted when they turn
> off.
>
> So the server has two original disk images and about a thousand
> snapshots in total.
> I made a list of factors that affect the server's performance and
> stability:
>
> 1. Raid configuration - mdadm raid vs btrfs raid, configuration and
>    options for them.
> 2. How to format btrfs - nodesize, features.
> 3. Mount options - nodatacow and compression things.
> 4. Kernel parameter tuning.
> 5. Hardware specification.
>
> My current setup is
>
> 1. mdadm raid10 with 1024k chunk and 12 disks of 512GB ssd.
> 2. nodesize 32k and nothing else.
> 3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
> 4. Ubuntu with 4.1.27 kernel without additional configurations.
> 5. CPU : Xeon E3-1225v2 Quad Core 3.2Ghz
>    RAM : 2 x DDR3 8GB ECC (total 16GB)
>    NIC : 2 x 10Gbe
>
> The results of testing so far are
>
> 1. btrfs-transaction and btrfs-cleaner consume cpu regularly.
> 2. When the cpu is busy with those processes, creating snapshots
>    takes long.
> 3. The performance is getting slower as time goes by.
>
> So if there are any wrong or missing configurations, can you suggest
> some? Like whether I need to increase physical memory.
>
> Any idea would help me a lot.
Offhand it sounds like you have a file system inside a disk image which itself is stored on a file system. So there are two file systems. And somehow you have to create the disk image from a subvolume, which isn't going to be very fast. Also, something I read recently on the XFS list makes me wonder if loop devices are production worthy. I'd reconsider the layout for any one of these reasons alone.

1. mdadm raid10 + LVM thinp + either XFS or Btrfs. The first LV you create is the one the host is constantly updating. You can use XFS freeze to freeze the file system, take the snapshot, and then release the freeze. You now have the original LV, which is still being updated by the host, plus a 2nd LV that can itself be exported as an iSCSI target to a client system. There's no need to create a disk image, so the creation of the snapshot and iSCSI target is much faster.

2. Similar to the above, but you could make the 2nd LV (the snapshot) a Btrfs seed device that all of the clients share, and each client is pointed to its own additional LV used as the Btrfs sprout device.

The issue I had a year ago with LVM thin provisioning is that when the metadata pool gets full, the entire VG implodes very badly; I didn't get any sufficient warning in advance that the setup was suboptimal or that it was about to run out of metadata space, and it wasn't repairable. But it was just a test, and I haven't substantially played with LVM thinp with more than a dozen snapshots. But LVM being like emacs, you've got a lot of levers to adjust things depending on the workload, whereas Btrfs has very few.

Therefore, the plus of the 2nd option is that you're only using a handful of LVM thinp snapshots. And you're not really using Btrfs snapshots either; you're using the union-like fs feature of the seed-sprout capability of Btrfs. The other nice thing is that when the clients quit, the LVs are removed at an LVM extent level.
There is no need for file system cleanup processes to decrement reference counts on all the affected extents. So it'd be fast.

--
Chris Murphy
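Chris's option 1 could be sketched roughly as below. The commands (xfs_freeze, lvcreate -s) are real, but the VG/LV/mount-point names are invented and a thin pool is assumed to already exist, so treat it as an illustration rather than a tested recipe:

```shell
# Hypothetical freeze/snapshot/export cycle for one client.
# Assumes: XFS on thin LV vg0/master, mounted at /srv/master.
xfs_freeze -f /srv/master            # quiesce the XFS on the master LV
lvcreate -s -n client42 vg0/master   # thin snapshot; no size argument needed
xfs_freeze -u /srv/master            # resume writes immediately
# /dev/vg0/client42 can now be exported as an iSCSI target (e.g. via
# targetcli) without ever creating a disk image file.
```

The freeze window is only as long as the lvcreate call, which is why this avoids the slow image-creation step entirely.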
Re: [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely
On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:

This patch can fix some false ENOSPC errors; the test script below can reproduce one false ENOSPC error:

	#!/bin/bash
	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
	dev=$(losetup --show -f fs.img)
	mkfs.btrfs -f -M $dev
	mkdir /tmp/mntpoint
	mount $dev /tmp/mntpoint
	cd /tmp/mntpoint
	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The above script will fail with ENOSPC, but indeed the fs still has free space to satisfy this request. Please see the call graph:

	btrfs_fallocate()
	|-> btrfs_alloc_data_chunk_ondemand()
	|       bytes_may_use += 64M
	|-> btrfs_prealloc_file_range()
	    |-> btrfs_reserve_extent()
	        |-> btrfs_add_reserved_bytes()
	                alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does
	                not change bytes_may_use, and bytes_reserved += 64M.
	                Now bytes_may_use + bytes_reserved == 128M, which is
	                greater than btrfs_space_info's total_bytes; a false
	                enospc occurs. Note, the bytes_may_use decrease
	                operation will be done at the end of
	                btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:

	        CPU 1                          |        CPU 2
	                                       |
	|-> cow_file_range()                   |
	    |-> btrfs_reserve_extent()         |
	    |                                  |-> __btrfs_buffered_write()
	    |                                  |    |-> btrfs_check_data_free_space()
	    |                                  |
	    |-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease operation is delayed until extent_clear_unlock_delalloc(). Assume in this case that btrfs_reserve_extent() reserved 128MB of data, and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.

If

	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
	        data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
	        data_sinfo->bytes_readonly - data_sinfo->bytes_may_use

btrfs_check_data_free_space() will try to allocate a new data chunk, or call btrfs_start_delalloc_roots(), or commit the current transaction in order to reserve some free space - obviously a lot of work.
But indeed none of that is necessary as long as bytes_may_use is decreased in a timely manner: after decreasing 128M from bytes_may_use, we still have free space.

To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes().

For the compress path, the real extent length may not be equal to the file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(); that's because bytes_may_use is increased by the file content length. Then the compress path can update bytes_may_use correctly. Also we can now discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.

As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked as PREALLOC, we also need to update bytes_may_use, but we can not pass EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation. So here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both the successful and failed paths, so btrfs_prealloc_file_range()'s callers do not need to call btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang
---

Sigh, sorry, the baby kept me up most of the night; that should be

Reviewed-by: Josef Bacik

Thanks,

Josef
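The fallocate call graph above boils down to simple double counting, which a toy arithmetic check (plain shell, not btrfs code; the overhead-free 128M total is an idealization) makes concrete:

```shell
# Toy model of the double accounting: on a 128M fs, a 64M falloc bumps
# bytes_may_use in btrfs_alloc_data_chunk_ondemand() and bytes_reserved
# in btrfs_add_reserved_bytes(), so the space info looks full even
# though only 64M was actually requested.
M=$((1024 * 1024))
total_bytes=$((128 * M))      # data space_info total (ignoring overhead)
request=$((64 * M))
bytes_may_use=$request        # charged when fallocate reserves space
bytes_reserved=$request       # charged again when the extent is allocated
if [ $((bytes_may_use + bytes_reserved)) -ge "$total_bytes" ]; then
    echo "false ENOSPC"
fi
```

With the patch, btrfs_add_reserved_bytes() moves the 64M from bytes_may_use to bytes_reserved instead of charging it twice, so the sum stays at 64M.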
Re: [BTRFS-PROGS][PATCH] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'
On 2016-07-25 04:14, Qu Wenruo wrote:
> Hi Goffredo,
>
> At 07/24/2016 07:03 PM, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> the following patches add two new commands: 1) btrfs
>> inspect-internal physical-find 2) btrfs inspect-internal
>> physical-dump
>>
>> The aim of these two new commands is to locate (1) and dump (2) the
>> stripe elements stored on the disks. I developed these two new
>> commands to simplify the debugging of some RAID5 bugs (but this is
>> another discussion).
>
> That's a pretty nice function. However the function seems to be a
> combination of fiemap (to resolve file to logical) and resolve logical
> (to resolve logical to on-device bytenr), with RAID5/6 specific
> flags.
>
> Instead of introducing a new function doing a combination of existing
> features, would you mind expanding the current logical-resolve?

[I suppose that you are referring to ./btrfs-map-logical and not logical-resolve]

Before developing these two utils, I was unaware of logical-resolve. Frankly speaking I am skeptical about extending btrfs-map-logical, which is outside the "btrfs" family of commands. So I consider it legacy code: the kind which should not be extended, because it was born as a "quick and dirty" utility.

Finally, there is a big difference between my tools and btrfs-map-logical. The former needs a mounted filesystem, the latter doesn't.

> IMHO, it's not a good idea to cross two different logical layers
> (file<->logical mapping layer and logical<->dev extent mapping layer)
> in one function.

Indeed, it was the goal. The use case is a tool to help find bugs in the raid5 code. Before my tools, I used some hacks to parse the output of filefrag, combining this with btrfs-debug-tree info. So I developed "btrfs insp physical-find". Then Chris asked me to make a further step, so I made 'btrfs insp physical-dump'. I think that we definitely need a tool like "btrfs insp physical-find/dump".
If you think that btrfs-map-logical has its use case, I can develop physical-dump/find further to start from a "logical" address instead of a file: it would be a lot simpler for me than extending btrfs-map-logical.

> Thanks, Qu

>> An example of 'btrfs inspect-internal physical-find' is the
>> following:
>>
>> # btrfs inspect physical-find mnt/out.txt
>> mnt/out.txt: 0
>> 	devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
>> 	devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
>> 	devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
>> 	devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY
>>
>> In the output above, DATA is the stripe element containing data.
>> OTHER is the sibling stripe element: it contains data related either
>> to other files or to the same file but at a different position. The
>> two PARITY stripe elements contain the RAID6 parity (P and Q).
>>
>> It is possible to pass the offset of the file to inspect.
>>
>> An example of 'btrfs inspect-internal physical-dump' is the
>> following:
>>
>> # btrfs insp physical-find mnt/out.txt
>> mnt/out.txt: 0
>> 	devid: 5 dev_name: /dev/loop4 offset: 56819712 type: OTHER
>> 	devid: 4 dev_name: /dev/loop3 offset: 56819712 type: OTHER
>> 	devid: 3 dev_name: /dev/loop2 offset: 56819712 type: DATA
>> 	devid: 2 dev_name: /dev/loop1 offset: 56819712 type: PARITY
>> 	devid: 1 dev_name: /dev/loop0 offset: 76742656 type: PARITY
>>
>> # btrfs insp physical-dump mnt/out.txt | xxd
>> mnt/out.txt: 0
>> file: /dev/loop2 off=56819712
>> 00000000: 6164 6161 6161 6161 6161 6161 6161 6161  adaaaaaaaaaaaaaa
>> 00000010: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> 00000020: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> 00000030: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> 00000040: 6161 6161 6161 6161 6161 6161 6161 6161  aaaaaaaaaaaaaaaa
>> [...]
>>
>> In this case the content of the first 4k of the file is dumped. It
>> is also possible to pass an offset (in steps of 4k). Moreover it is
>> possible to select what to dump: which copy (switch -c, only for
>> RAID1/RAID10/DUP); which parity (switch -p, only for RAID5/RAID6);
>> which stripe element other than data (switch -s, only for
>> RAID5/RAID6).
>>
>> BR G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH RFC] btrfs: send: Disable clone detection
On Mon, Jul 25, 2016 at 03:19:59PM +0800, Qu Wenruo wrote:
> This patch will disable clone detection of send.

Making that unconditional is not the right way. We do have the send operation flags in place, so if you really think there's a need for disabling the clones, let's add a flag for that (BTRFS_SEND_FLAG_*).
Re: [PATCH RFC] btrfs: send: Disable clone detection
On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo wrote:
> This patch will disable clone detection of send.
>
> The main problem of clone detection in send is its memory usage and
> long execution time.
>
> The clone detection is done by iterating all backrefs and adding the
> backrefs whose root is the source.
>
> However iterating all backrefs is already quite a bad idea; we should
> never try to do it in a loop, and unfortunately in-band/out-of-band
> dedupe and reflink can easily create a file whose file extents all
> point to the same extent.
>
> In that case, btrfs will do a backref walk for the same extent again
> and again, until either OOM or a soft lockup is triggered.
>
> So disable clone detection until we find a method that iterates the
> backrefs of one extent only once, just like what balance/qgroup are
> doing.

Is this really a common scenario? I don't recall ever seeing a report of such a slowdown or OOM due to the same file pointing to the same extent thousands of times.

Sure, it's easy to create such a scenario with artificial test data for our inband deduplication feature (and we should really strive to make what we have stable and reliable rather than add yet more features).

What needs to be fixed here is the common backref walking code, called by send plus a few other places (fiemap and some ioctls at least), for which IIRC there was some recent patch from one of your colleagues.

Disabling this code makes the problem go away for the scenario of the same file pointing to the same extent thousands of times (or tens or hundreds of thousands, whatever). But what happens if a file points to an extent once but the extent is referenced by 100k other files (possibly in different snapshots or subvolumes)? Isn't it the same problem? The backref code has to find all such inodes/offsets/roots and again take the same long time... Following your logic, it seems like everything that uses the backref walking code should be disabled.
> Cc: Filipe Manana
> Reported-by: Tsutomu Itoh
> Signed-off-by: Qu Wenruo
> ---
>  fs/btrfs/send.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 2db8dc8..eed3f1c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -1166,6 +1166,7 @@ struct backref_ctx {
>  	int found_itself;
>  };
>
> +#if 0
>  static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>  {
>  	u64 root = (u64)(uintptr_t)key;
> @@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>  		return 1;
>  	return 0;
>  }
> +#endif
>
>  static int __clone_root_cmp_sort(const void *e1, const void *e2)
>  {
> @@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1, const void *e2)
>  	return 0;
>  }
>
> +#if 0
>  /*
>   * Called for every backref that is found for the current extent.
>   * Results are collected in sctx->clone_roots->ino/offset/found_refs
> @@ -1445,6 +1448,7 @@ out:
>  	kfree(backref_ctx);
>  	return ret;
>  }
> +#endif
>
>  static int read_symlink(struct btrfs_root *root,
>  			u64 ino,
> @@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
>  			  struct btrfs_path *path,
>  			  struct btrfs_key *key)
>  {
> -	struct clone_root *found_clone = NULL;
>  	int ret = 0;
>
>  	if (S_ISLNK(sctx->cur_inode_mode))
> @@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
>  		}
>  	}
>
> +	/*
> +	 * Current clone detection is both time and memory consuming.
> +	 *
> +	 * Time consuming is caused by iterating all backrefs of extent.
> +	 * Memory consuming is caused by allocating "found_clone" every
> +	 * time for a backref.
> +	 *
> +	 * XXX: Disabling it is never the best method, but at least it
> +	 * won't cause OOM nor super long execution time.
> +	 * The root fix needs to change the iteration basis, from iterating
> +	 * file extents to iterating extents, so find_parent_nodes() and
> +	 * backref walk should be called only once for one extent.
> +	 */
> +#if 0
>  	ret = find_extent_clone(sctx, path, key->objectid, key->offset,
>  			sctx->cur_inode_size, &found_clone);
>  	if (ret != -ENOENT && ret < 0)
>  		goto out;
> +#endif
>
> -	ret = send_write_or_clone(sctx, path, key, found_clone);
> +	ret = send_write_or_clone(sctx, path, key, NULL);
>  	if (ret)
>  		goto out;
>  out_hole:
> --
> 2.9.0

--
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 07/25/2016 09:01 AM, David Sterba wrote:
> On Mon, Jul 18, 2016 at 09:42:50AM -0400, Josef Bacik wrote:
>>> This makes searching for BLOCK_GROUP_ITEM very very very slow if the
>>> extent tree is really big. On the other hand, we search for
>>> CHUNK_ITEM very fast, because CHUNK_ITEMs are in their own tree.
>>> (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped)
>>>
>>> So to completely fix it, btrfs needs an on-disk format change to put
>>> BLOCK_GROUP_ITEMs into their own tree. IMHO there may be some
>>> objection from other devs though.
>>>
>>> Anyway, I added the three maintainers to Cc, and hope we can get a
>>> better idea of how to fix it.
>>
>> Yeah I'm going to fix this when I do the per-block group extent tree
>> thing.  Thanks,
>
> Will it be capable of "per-subvolume set" extent trees? IOW, a set of
> subvolumes could share extents only among the members of the same
> group. The usecase is to start an isolated subvolume and allow
> snapshots (and obviously forbid reflinks outside of the group).

I suppose. The problem is that the btrfs_header doesn't have much room for verbose descriptions of which root owns it. We have objectid, since it was always unique, but in the case of per-bg extent trees we can't use that anymore, so we have to abuse flags to say "this is an extent root block", and then we know that btrfs_header_owner(eb) is really the offset of the root and not the objectid.

Doing something like having it per subvolume would mean having another flag that says "this block belongs to a subvolume root", and then having btrfs_header_owner(eb) set to the new offset. The point I'm making is we can do whatever we want here, but it'll be a little strange, since we have to use flag bits to indicate what type of root the owner points to, so any future modifications will also be format changes. At least once I get this work done we'll be able to more easily add new variations on the per-whatever setup.
Thanks,
Josef
Re: [PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely
On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:

This patch can fix some false ENOSPC errors. The below test script can reproduce one false ENOSPC error:

  #!/bin/bash
  dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
  dev=$(losetup --show -f fs.img)
  mkfs.btrfs -f -M $dev
  mkdir /tmp/mntpoint
  mount $dev /tmp/mntpoint
  cd /tmp/mntpoint
  xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The above script will fail with ENOSPC, but indeed the fs still has free space to satisfy this request. Please see the call graph:

  btrfs_fallocate()
  |-> btrfs_alloc_data_chunk_ondemand()
  |       bytes_may_use += 64M
  |-> btrfs_prealloc_file_range()
      |-> btrfs_reserve_extent()
          |-> btrfs_add_reserved_bytes()
                  alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
                  change bytes_may_use, and bytes_reserved += 64M. Now
                  bytes_may_use + bytes_reserved == 128M, which is greater
                  than btrfs_space_info's total_bytes, so a false enospc
                  occurs. Note, the bytes_may_use decrease operation will
                  be done at the end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:

  CPU 1                                  |  CPU 2
                                         |
  |-> cow_file_range()                   |-> __btrfs_buffered_write()
      |-> btrfs_reserve_extent()         |       |
      |                                  |      ...
      |                                  |       |
      |                                  |       |-> btrfs_check_data_free_space()
      |                                  |
      |-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease operation is delayed until extent_clear_unlock_delalloc(). Assume in this case btrfs_reserve_extent() reserved 128MB of data, and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space. If

  100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
          data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
          data_sinfo->bytes_readonly - data_sinfo->bytes_may_use

then btrfs_check_data_free_space() will try to allocate a new data chunk, call btrfs_start_delalloc_roots(), or commit the current transaction in order to reserve some free space - obviously a lot of work. But indeed it's not necessary: as long as we decrease bytes_may_use in a timely fashion (decreasing the 128M from bytes_may_use here), we still have free space.

To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes(). For the compress path, the real extent length may not be equal to the file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(); this is because bytes_may_use is increased by the file content length. Then the compress path can update bytes_may_use correctly. Also, now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.

As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked as PREALLOC, we also need to update bytes_may_use, but we cannot pass EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation. So here we introduce an EXTENT_CLEAR_DATA_RESV flag to tell btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both the successful and the failed path, so btrfs_prealloc_file_range()'s callers do not need to call btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang

Yup, this is good, thanks for fixing this problem,

Signed-off-by: Josef Bacik
Re: [PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space
On 07/25/2016 03:51 AM, Wang Xiaoguang wrote:

cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs() to delete unused block groups. Because this work is asynchronous, it may also result in false ENOSPC errors. Please see the below race window:

  CPU1                                     |  CPU2
                                           |
  |-> btrfs_alloc_data_chunk_ondemand()    |-> cleaner_kthread()
      |-> do_chunk_alloc()                 |    |
      |   assume it returns ENOSPC, which  |    |
      |   means btrfs_space_info is full   |    |
      |   and does not have free space to  |    |
      |   satisfy the data request         |    |
      |                                    |    |-> btrfs_delete_unused_bgs()
      |                                    |        it will decrease btrfs_space_info's
      |                                    |        total_bytes, so btrfs_space_info
      |                                    |        is no longer full
      |                                    |

In this case, we may get an ENOSPC error even though btrfs_space_info is not full. To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs(). So here we introduce a new struct rw_semaphore, bg_delete_sem, to do this job.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 40 ++--
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..65a1465 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2676,6 +2676,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->ordered_operations_mutex);
 	mutex_init(&fs_info->tree_log_mutex);
 	mutex_init(&fs_info->chunk_mutex);
+	init_rwsem(&fs_info->bg_delete_sem);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
 	mutex_init(&fs_info->volume_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..d1f8638 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	}
 
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		down_read(&root->fs_info->bg_delete_sem);
+		have_bg_delete_sem = 1;
 		goto alloc;
+	}
 
again:
 	/* make sure we have enough space to handle the data first */
@@ -4134,10 +4138,21 @@ again:
 	if (used + bytes > data_sinfo->total_bytes) {
 		struct btrfs_trans_handle *trans;
 
+		spin_unlock(&data_sinfo->lock);
+
+		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (have_bg_delete_sem == 0) {
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+		}
+

We could race with somebody else doing an allocation here, so instead do something like:

	if (!have_bg_delete_sem) {
		spin_unlock(&data_sinfo->lock);
		down_read(&root->fs_info->bg_delete_sem);
		have_bg_delete_sem = 1;
		spin_lock(&data_sinfo->lock);
		if (used + bytes <= data_sinfo->total_bytes)
			goto done;	/* add a done label at bytes_may_use += bytes */
	}

Thanks,
Josef
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On Mon, Jul 18, 2016 at 09:42:50AM -0400, Josef Bacik wrote:
> > This makes search for BLOCK_GROUP_ITEM very very very slow if extent tree
> > is really big.
> >
> > On the other hand, we search CHUNK_ITEM very very fast, because CHUNK_ITEMs
> > are in their own tree. (CHUNK_ITEM and BLOCK_GROUP_ITEM are 1:1 mapped)
> >
> > So to completely fix it, btrfs needs an on-disk format change to put
> > BLOCK_GROUP_ITEMs into their own tree. IMHO there may be some objection
> > from other devs though. Anyway, I added the three maintainers to Cc, and
> > hope we can get a better idea to fix it.
>
> Yeah I'm going to fix this when I do the per-block group extent tree thing.
> Thanks,

Will it be capable of "per-subvolume-set" extent trees? IOW, a set of subvolumes could share extents only among the members of the same group. The usecase is to isolate a subvolume and allow snapshots (and obviously forbid reflinks outside of the group).
Re: [PATCH v12.2 03/15] btrfs: dedupe: Introduce function to initialize dedupe info
Hi,

[auto build test ERROR on next-20160724]
[cannot apply to btrfs/next linus/master stable/master v4.7-rc7 v4.7-rc6 v4.7-rc5 v4.7]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-In-band-De-duplication/20160725-140320
config: i386-randconfig-s0-07250950 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   fs/built-in.o: In function `btrfs_dedupe_enable':
>> (.text+0xd3c77b): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                    Intel Corporation
[PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue
Currently in btrfs, for data space reservation, bytes_may_use is not updated in btrfs_update_reserved_bytes(); the decrease operation is delayed until extent_clear_unlock_delalloc(). For fallocate(2), the decrease operation is even delayed until the end of btrfs_fallocate(), which is too late. Obviously such delays put unnecessary pressure on the enospc system.

So in this patch set, we remove RESERVE_FREE, RESERVE_ALLOC and RESERVE_ALLOC_NO_ACCOUNT, and always update bytes_may_use timely.

I have already sent an fstests test case for this issue. I could send [Patch 4/4] as an independent patch, but its bug can also be revealed by the same reproducer script, so I include it here.

Changelog:
v2: Fix a trace point issue.

Wang Xiaoguang (4):
  btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: should block unused block groups deletion work when allocating data space

 fs/btrfs/ctree.h       |   3 +-
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/extent-tree.c | 171 -
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c        |  26 
 fs/btrfs/inode-map.c   |   3 +-
 fs/btrfs/inode.c       |  37 ---
 fs/btrfs/relocation.c  |  17 +++--
 8 files changed, 159 insertions(+), 100 deletions(-)

-- 
2.9.0
[PATCH v2 1/4] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the wrong file offset for reloc_inode: it uses cluster->start and cluster->end, which are in fact the extent's bytenr. The correct value should be cluster->[start|end] minus the block group's start bytenr.

	start bytenr     cluster->start
	|                |
	| extent | extent | ... | extent |
	|--------------------------------|
	  block group  (= reloc_inode)

Signed-off-by: Wang Xiaoguang
Signed-off-by: Josef Bacik
---
 fs/btrfs/relocation.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..a0de885 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3030,12 +3030,14 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	u64 num_bytes;
 	int nr = 0;
 	int ret = 0;
+	u64 prealloc_start = cluster->start - offset;
+	u64 prealloc_end = cluster->end - offset;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	inode_lock(inode);
 
-	ret = btrfs_check_data_free_space(inode, cluster->start,
-					  cluster->end + 1 - cluster->start);
+	ret = btrfs_check_data_free_space(inode, prealloc_start,
+					  prealloc_end + 1 - prealloc_start);
 	if (ret)
 		goto out;
 
@@ -3056,8 +3058,8 @@ int prealloc_file_extent_cluster(struct inode *inode,
 			break;
 		nr++;
 	}
-	btrfs_free_reserved_data_space(inode, cluster->start,
-				       cluster->end + 1 - cluster->start);
+	btrfs_free_reserved_data_space(inode, prealloc_start,
+				       prealloc_end + 1 - prealloc_start);
 out:
 	inode_unlock(inode);
 	return ret;
-- 
2.9.0
[PATCH v2 2/4] btrfs: divide btrfs_update_reserved_bytes() into two functions
This patch divides btrfs_update_reserved_bytes() into btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(); the next patch will extend btrfs_add_reserved_bytes() to fix some false ENOSPC errors - please see the later patch for detailed info.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 97 +-
 1 file changed, 57 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..8eaac39 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -104,9 +104,10 @@ static int find_next_key(struct btrfs_path *path, int level,
 			 struct btrfs_key *key);
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups);
-static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
-				       u64 num_bytes, int reserve,
-				       int delalloc);
+static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+				    u64 num_bytes, int reserve, int delalloc);
+static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
+				     u64 num_bytes, int delalloc);
 static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
 			       u64 num_bytes);
 int btrfs_pin_extent(struct btrfs_root *root,
@@ -6297,19 +6298,14 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 }
 
 /**
- * btrfs_update_reserved_bytes - update the block_group and space info counters
+ * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
  * @num_bytes:	The number of bytes in question
  * @reserve:	One of the reservation enums
  * @delalloc:	The blocks are allocated for the delalloc write
  *
- * This is called by the allocator when it reserves space, or by somebody who is
- * freeing space that was never actually used on disk.  For example if you
- * reserve some space for a new leaf in transaction A and before transaction A
- * commits you free that leaf, you call this with reserve set to 0 in order to
- * clear the reservation.
- *
- * Metadata reservations should be called with RESERVE_ALLOC so we do the proper
+ * This is called by the allocator when it reserves space.  Metadata
+ * reservations should be called with RESERVE_ALLOC so we do the proper
  * ENOSPC accounting.  For data we handle the reservation through clearing the
  * delalloc bits in the io_tree.  We have to do this since we could end up
  * allocating less disk space for the amount of data we have reserved in the
@@ -6319,44 +6315,65 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 * make the reservation and return -EAGAIN, otherwise this function always
 * succeeds.
 */
-static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
-				       u64 num_bytes, int reserve, int delalloc)
+static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+				    u64 num_bytes, int reserve, int delalloc)
 {
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
 
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
-	if (reserve != RESERVE_FREE) {
-		if (cache->ro) {
-			ret = -EAGAIN;
-		} else {
-			cache->reserved += num_bytes;
-			space_info->bytes_reserved += num_bytes;
-			if (reserve == RESERVE_ALLOC) {
-				trace_btrfs_space_reservation(cache->fs_info,
-						"space_info", space_info->flags,
-						num_bytes, 0);
-				space_info->bytes_may_use -= num_bytes;
-			}
-
-			if (delalloc)
-				cache->delalloc_bytes += num_bytes;
-		}
-	} else {
-		if (cache->ro)
-			space_info->bytes_readonly += num_bytes;
-		cache->reserved -= num_bytes;
-		space_info->bytes_reserved -= num_bytes;
+	if (cache->ro) {
+		ret = -EAGAIN;
+	} else {
+		cache->reserved += num_bytes;
+		space_info->bytes_reserved += num_bytes;
+		if (reserve == RESERVE_ALLOC) {
+			trace_btrfs_space_reservation(cache->fs_info,
+					"space_info", space_info->flags,
+					num_bytes, 0);
+			space_info->bytes_may_use -= num_bytes;
+		}
 
 		if (delalloc)
-
[PATCH v2 3/4] btrfs: update btrfs_space_info's bytes_may_use timely
This patch can fix some false ENOSPC errors. The below test script can reproduce one false ENOSPC error:

  #!/bin/bash
  dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
  dev=$(losetup --show -f fs.img)
  mkfs.btrfs -f -M $dev
  mkdir /tmp/mntpoint
  mount $dev /tmp/mntpoint
  cd /tmp/mntpoint
  xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The above script will fail with ENOSPC, but indeed the fs still has free space to satisfy this request. Please see the call graph:

  btrfs_fallocate()
  |-> btrfs_alloc_data_chunk_ondemand()
  |       bytes_may_use += 64M
  |-> btrfs_prealloc_file_range()
      |-> btrfs_reserve_extent()
          |-> btrfs_add_reserved_bytes()
                  alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
                  change bytes_may_use, and bytes_reserved += 64M. Now
                  bytes_may_use + bytes_reserved == 128M, which is greater
                  than btrfs_space_info's total_bytes, so a false enospc
                  occurs. Note, the bytes_may_use decrease operation will
                  be done at the end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:

  CPU 1                                  |  CPU 2
                                         |
  |-> cow_file_range()                   |-> __btrfs_buffered_write()
      |-> btrfs_reserve_extent()         |       |
      |                                  |      ...
      |                                  |       |
      |                                  |       |-> btrfs_check_data_free_space()
      |                                  |
      |-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease operation is delayed until extent_clear_unlock_delalloc(). Assume in this case btrfs_reserve_extent() reserved 128MB of data, and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space. If

  100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
          data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
          data_sinfo->bytes_readonly - data_sinfo->bytes_may_use

then btrfs_check_data_free_space() will try to allocate a new data chunk, call btrfs_start_delalloc_roots(), or commit the current transaction in order to reserve some free space - obviously a lot of work. But indeed it's not necessary: as long as we decrease bytes_may_use in a timely fashion (decreasing the 128M from bytes_may_use here), we still have free space.

To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes(). For the compress path, the real extent length may not be equal to the file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(); this is because bytes_may_use is increased by the file content length. Then the compress path can update bytes_may_use correctly. Also, now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.

As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked as PREALLOC, we also need to update bytes_may_use, but we cannot pass EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation. So here we introduce an EXTENT_CLEAR_DATA_RESV flag to tell btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both the successful and the failed path, so btrfs_prealloc_file_range()'s callers do not need to call btrfs_free_reserved_data_space() any more.
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent-tree.c | 56 +-
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/file.c        | 26 +-
 fs/btrfs/inode-map.c   |  3 +--
 fs/btrfs/inode.c       | 37 -
 fs/btrfs/relocation.c  | 11 --
 7 files changed, 73 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4274a7b..7eb2913 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2556,7 +2556,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   u64 root_objectid, u64 owner, u64 offset,
 				   struct btrfs_key *ins);
-int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
+int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data, int delalloc);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index
[PATCH v2 4/4] btrfs: should block unused block groups deletion work when allocating data space
cleaner_kthread() may run at any time, in which it'll call btrfs_delete_unused_bgs() to delete unused block groups. Because this work is asynchronous, it may also result in false ENOSPC errors. Please see the below race window:

  CPU1                                     |  CPU2
                                           |
  |-> btrfs_alloc_data_chunk_ondemand()    |-> cleaner_kthread()
      |-> do_chunk_alloc()                 |    |
      |   assume it returns ENOSPC, which  |    |
      |   means btrfs_space_info is full   |    |
      |   and does not have free space to  |    |
      |   satisfy the data request         |    |
      |                                    |    |-> btrfs_delete_unused_bgs()
      |                                    |        it will decrease btrfs_space_info's
      |                                    |        total_bytes, so btrfs_space_info
      |                                    |        is no longer full
      |                                    |

In this case, we may get an ENOSPC error even though btrfs_space_info is not full. To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call do_chunk_alloc() to allocate a new chunk, we should block btrfs_delete_unused_bgs(). So here we introduce a new struct rw_semaphore, bg_delete_sem, to do this job.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent-tree.c | 40 ++--
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7eb2913..bf0751d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -800,6 +800,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60ce119..65a1465 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2676,6 +2676,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->ordered_operations_mutex);
 	mutex_init(&fs_info->tree_log_mutex);
 	mutex_init(&fs_info->chunk_mutex);
+	init_rwsem(&fs_info->bg_delete_sem);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
 	mutex_init(&fs_info->volume_mutex);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df8d756..d1f8638 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4111,6 +4111,7 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	int ret = 0;
 	int need_commit = 2;
 	int have_pinned_space;
+	int have_bg_delete_sem = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -4121,8 +4122,11 @@ int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 	}
 
 	data_sinfo = fs_info->data_sinfo;
-	if (!data_sinfo)
+	if (!data_sinfo) {
+		down_read(&root->fs_info->bg_delete_sem);
+		have_bg_delete_sem = 1;
 		goto alloc;
+	}
 
again:
 	/* make sure we have enough space to handle the data first */
@@ -4134,10 +4138,21 @@ again:
 	if (used + bytes > data_sinfo->total_bytes) {
 		struct btrfs_trans_handle *trans;
 
+		spin_unlock(&data_sinfo->lock);
+
+		/*
+		 * We may need to allocate new chunk, so we should block
+		 * btrfs_delete_unused_bgs()
+		 */
+		if (have_bg_delete_sem == 0) {
+			down_read(&root->fs_info->bg_delete_sem);
+			have_bg_delete_sem = 1;
+		}
+
 		/*
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
+		spin_lock(&data_sinfo->lock);
 		if (!data_sinfo->full) {
 			u64 alloc_target;
@@ -4156,17 +4171,20 @@ alloc:
 			 * the fs.
 			 */
 			trans = btrfs_join_transaction(root);
-			if (IS_ERR(trans))
+			if (IS_ERR(trans)) {
+				up_read(&root->fs_info->bg_delete_sem);
 				return PTR_ERR(trans);
+			}
 
 			ret = do_chunk_alloc(trans, root->fs_info->extent_root,
 					     alloc_target,
 					     CHUNK_ALLOC_NO_FORCE);
 			btrfs_end_transaction(trans, root);
 			if (ret < 0) {
-				if (ret != -ENOSPC)
+				if (ret != -ENOSPC) {
[PATCH v2] generic/371: run write(2) and fallocate(2) in parallel
Currently in btrfs, there is something wrong with fallocate(2)'s data space reservation: it'll temporarily occupy more data space than it really needs, which in turn will impact other operations' data requests.

In this test case, we run write(2) and fallocate(2) in parallel, and the total needed data space for these two operations doesn't exceed the whole fs's free data space, to see whether we will get any unexpected ENOSPC error.

Signed-off-by: Wang Xiaoguang
---
v2: adopt Eryu Guan's suggestions to make this reproducer cleaner, thanks
---
 tests/generic/371     | 75 +++
 tests/generic/371.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 78 insertions(+)
 create mode 100755 tests/generic/371
 create mode 100644 tests/generic/371.out

diff --git a/tests/generic/371 b/tests/generic/371
new file mode 100755
index 000..7955856
--- /dev/null
+++ b/tests/generic/371
@@ -0,0 +1,75 @@
+#! /bin/bash
+# FS QA Test 371
+#
+# Run write(2) and fallocate(2) in parallel and the total needed data space
+# for these operations don't exceed whole fs free data space, to see whether
+# we will get any unexpected ENOSPC error.
+#
+#---
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "falloc"
+
+_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
+_scratch_mount
+
+testfile1=$SCRATCH_MNT/testfile1
+testfile2=$SCRATCH_MNT/testfile2
+
+echo "Silence is golden"
+for ((i=0; i<100; i++)); do
+	$XFS_IO_PROG -fc "pwrite 0 80M" $testfile1 >/dev/null
+	rm -f $testfile1
+done &
+pids=$!
+
+for ((i=0; i<100; i++)); do
+	$XFS_IO_PROG -fc "falloc 0 80M" $testfile2 >/dev/null
+	rm -f $testfile2
+done &
+pids="$pids $!"
+
+wait $pids
+status=0
+exit
diff --git a/tests/generic/371.out b/tests/generic/371.out
new file mode 100644
index 000..22ec8a2
--- /dev/null
+++ b/tests/generic/371.out
@@ -0,0 +1,2 @@
+QA output created by 371
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 97ecb65..1236281 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -373,3 +373,4 @@
 368 auto quick richacl
 369 auto quick richacl
 370 auto quick richacl
+371 auto quick enospc prealloc
-- 
2.9.0
Any suggestions for thousands of disk image snapshots ?
Hi all,

I am currently running a project for building servers with btrfs. The purpose of the servers is exporting disk images through iscsi targets; the disk images are generated from btrfs subvolume snapshots. The maximum number of clients is 500 and each client uses two snapshots of disk images. The first disk image's size is about 50GB and the second one is about 1.5TB. An important thing is that the original 1.5TB disk image is mounted with a loop device and modified in real time - e.g. continuously downloading torrents in it. Snapshots are made when clients boot up and deleted when they are turned off. So the server has two original disk images and about a thousand snapshots in total.

I made a list of factors that affect the server's performance and stability:
1. Raid configuration - mdadm raid vs btrfs raid, configuration and options for them.
2. How to format btrfs - nodesize, features.
3. Mount options - nodatacow and compression things.
4. Kernel parameter tuning.
5. Hardware specification.

My current setup is:
1. mdadm raid10 with 1024k chunk and 12 disks of 512GB ssd.
2. nodesize 32k and nothing else.
3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo
4. Ubuntu with 4.1.27 kernel without additional configuration.
5. CPU: Xeon E3-1225v2 Quad Core 3.2GHz
   RAM: 2 x DDR3 8GB ECC (total 16GB)
   NIC: 2 x 10Gbe

The results of testing so far are:
1. btrfs-transaction and btrfs-cleaner consume cpu regularly.
2. When the cpu is busy with those processes, creating snapshots takes long.
3. The performance is getting slower as time goes by.

So if there are any wrong or missing configurations, can you suggest some? Like whether I need to increase physical memory. Any idea would help me a lot.

Thank you,
Seo
[PATCH RFC] btrfs: send: Disable clone detection
This patch disables the clone detection of send.

The main problem of clone detection in send is its memory usage and long execution time. The clone detection is done by iterating all backrefs and adding any backref whose root is the source. However, iterating all backrefs is already quite a bad idea - we should never try to do it in a loop - and unfortunately in-band/out-of-band dedupe and reflink can easily create a file whose file extents all point to the same extent. In that case, btrfs will do the backref walk for the same extent again and again, until either OOM or a soft lockup is triggered.

So disable clone detection until we find a method that iterates the backrefs of one extent only once, just like what balance/qgroup are doing.

Cc: Filipe Manana
Reported-by: Tsutomu Itoh
Signed-off-by: Qu Wenruo
---
 fs/btrfs/send.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2db8dc8..eed3f1c 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -1166,6 +1166,7 @@ struct backref_ctx {
 	int found_itself;
 };
 
+#if 0
 static int __clone_root_cmp_bsearch(const void *key, const void *elt)
 {
 	u64 root = (u64)(uintptr_t)key;
@@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void *key, const void *elt)
 		return 1;
 	return 0;
 }
+#endif
 
 static int __clone_root_cmp_sort(const void *e1, const void *e2)
 {
@@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1, const void *e2)
 	return 0;
 }
 
+#if 0
 /*
  * Called for every backref that is found for the current extent.
  * Results are collected in sctx->clone_roots->ino/offset/found_refs
@@ -1445,6 +1448,7 @@ out:
 	kfree(backref_ctx);
 	return ret;
 }
+#endif
 
 static int read_symlink(struct btrfs_root *root,
 			u64 ino,
@@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
 			  struct btrfs_path *path,
 			  struct btrfs_key *key)
 {
-	struct clone_root *found_clone = NULL;
 	int ret = 0;
 
 	if (S_ISLNK(sctx->cur_inode_mode))
@@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx *sctx,
 		}
 	}
 
+	/*
+	 * Current clone detection is both time and memory consuming.
+	 *
+	 * Time consuming is caused by iterating all backref of extent.
+	 * Memory consuming is caused by allocating "found_clone" every
+	 * time for a backref.
+	 *
+	 * XXX: Disabling it is never the best method, but at least it
+	 * won't cause OOM nor super long execution time.
+	 * The root fix needs to change the iteration basis, from iterating
+	 * file extents to iterating extents, so find_parent_nodes() and
+	 * backref walk should be called only once for one extent.
+	 */
+#if 0
 	ret = find_extent_clone(sctx, path, key->objectid, key->offset,
 			sctx->cur_inode_size, &found_clone);
 	if (ret != -ENOENT && ret < 0)
 		goto out;
+#endif
 
-	ret = send_write_or_clone(sctx, path, key, found_clone);
+	ret = send_write_or_clone(sctx, path, key, NULL);
 	if (ret)
 		goto out;
out_hole:
-- 
2.9.0
Re: [PATCH] generic/371: run write(2) and fallocate(2) in parallel
hello,

On 07/21/2016 06:41 PM, Eryu Guan wrote:
On Thu, Jul 21, 2016 at 03:30:25PM +0800, Wang Xiaoguang wrote:

Currently in btrfs, there is something wrong with fallocate(2)'s data
space reservation: it'll temporarily occupy more data space than it
really needs, which in turn will impact other operations' data requests.

In this test case, we run write(2) and fallocate(2) in parallel, where
the total data space needed for these two operations doesn't exceed the
whole fs free data space, to see whether we will get any unexpected
ENOSPC error.

Signed-off-by: Wang Xiaoguang
---
 tests/generic/371     | 119 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/371.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 122 insertions(+)
 create mode 100755 tests/generic/371
 create mode 100644 tests/generic/371.out

diff --git a/tests/generic/371 b/tests/generic/371
new file mode 100755
index 0000000..b85327a
--- /dev/null
+++ b/tests/generic/371
@@ -0,0 +1,119 @@
+#! /bin/bash
+# FS QA Test 371
+#
+# Run write(2) and fallocate(2) in parallel, where the total data space
+# needed for these operations doesn't exceed the whole fs free data
+# space, to see whether we will get any unexpected ENOSPC error.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch

Need '_require_xfs_io_command "falloc"' here, otherwise the test fails
on ext2/ext3, because they don't support fallocate(2).

OK, I see.

+
+_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1
+_scratch_mount
+
+testfile1=$SCRATCH_MNT/testfile1
+testfile2=$SCRATCH_MNT/testfile2
+
+write_work()
+{
+	rm -f $testfile1
+	while [ 1 ]; do
+		$XFS_IO_PROG -f -c "pwrite 0 80M" $testfile1 >>$seqres.full 2>&1
+		grep "No space left on device" $seqres.full >/dev/null
+		if [ $? -eq 0 ]; then
+			echo "unexpected ENOSPC error occurs"
+			exit 1
+		fi

No need to grep for the error message, just redirect stdout to /dev/null
and any error will appear on stderr, which will break the golden image.
And $seqres.full is where we dump logs for debugging purposes; tests
should dump output they need to somewhere like $tmp.

I see.

+		rm -f $testfile1
+	done
+}
+
+fallocate_work()
+{
+	rm -f $testfile2
+	while [ 1 ]; do
+		$XFS_IO_PROG -f -c "falloc 0 80M" $testfile2 >>$seqres.full 2>&1
+		grep "No space left on device" $seqres.full >/dev/null
+		if [ $? -eq 0 ]; then
+			echo "unexpected ENOSPC error occurs"
+			exit 1
+		fi
+		rm -f $testfile2
+	done
+}
+
+run_time=$((180 * $TIME_FACTOR))

180s is too long; I can reproduce the bug in around 10s on my test vm.
Just loop 100 times for each operation (pwrite and falloc).

+write_work &
+write_work_pid=$!
+fallocate_work &
+fallocate_work_pid=$!
+
+for ((elapsed_time = 0; elapsed_time < run_time; elapsed_time += 5)); do
+	kill -0 $write_work_pid >/dev/null 2>&1
+	if [ $? -ne 0 ]; then
+		kill $fallocate_work_pid >/dev/null 2>&1
+		break
+	fi
+
+	kill -0 $fallocate_work_pid >/dev/null 2>&1
+	if [ $? -ne 0 ]; then
+		kill $write_work_pid >/dev/null 2>&1
+		break
+	fi
+	sleep 5
+done
+
+kill $fallocate_work_pid >/dev/null 2>&1
+kill $write_work_pid >/dev/null 2>&1
+wait
+
+# wait for unfinished xfs_io
+while ps aux | grep "xfs_io" | grep -qv grep; do
+	sleep 1
+done

And this seems unnecessarily complicated. So I'd write it as:

echo "Silence is golden"
for ((i=0; i<100; i++)); do
	$XFS_IO_PROG -fc