On Thu, Dec 4, 2014 at 3:58 PM, Peter Volkov <p...@gentoo.org> wrote:
> Hi, guys, again. Looking at this issue, I suspect this is a bug in btrfs.
> We'll have to clean up this installation soon, so if there is any
> request to do some debugging, please ask. I'll try to reiterate what
> was said in this thread.
>
> Short story: a btrfs filesystem made of 22 1TB disks with lots of files
> (~30240000). The write load is 25 MByte/second. After some time the file
> system became unable to cope with this load. Also at this time `sync` takes
> ages to finish, and shutdown -r hangs (I guess related to sync).
>
> I also see there is one kernel kworker that is the main suspect for
> this behavior: it takes 100% of a CPU core all the time, jumping from core
> to core. At the same time, according to iostat, the read/write speed is close
> to zero and everything is stuck.
>
> Citing some details from previous messages:
>
>> > top - 13:10:58 up 1 day, 9:26, 5 users, load average: 157.76, 156.61, 149.29
>> > Tasks: 235 total, 2 running, 233 sleeping, 0 stopped, 0 zombie
>> > %Cpu(s): 19.8 us, 15.0 sy, 0.0 ni, 60.7 id, 3.9 wa, 0.0 hi, 0.6 si, 0.0 st
>> > KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
>> > KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem
>> >
>> >   PID USER  PR NI    VIRT    RES  SHR S %CPU %MEM     TIME+ COMMAND
>> >  8644 root  20  0       0      0    0 R 96.5  0.0 127:21.95 kworker/u16:16
>> >  5047 dvr   20  0 6884292 122668 4132 S  6.4  0.2 258:59.49 dvrserver
>> > 30223 root  20  0   20140   2600 2132 R  6.4  0.0   0:00.01 top
>> >     1 root  20  0    4276   1628 1524 S  0.0  0.0   0:40.19 init
>> >
>> > There are about 300 threads on the server, some of which are writing to disk.
>> > A bit of information about this btrfs filesystem: this is a 22-disk file
>> > system with raid1 for metadata and raid0 for data:
>> >
>> > # btrfs filesystem df /store/
>> > Data, single: total=11.92TiB, used=10.86TiB
>> > System, RAID1: total=8.00MiB, used=1.27MiB
>> > System, single: total=4.00MiB, used=0.00B
>> > Metadata, RAID1: total=46.00GiB, used=33.49GiB
>> > Metadata, single: total=8.00MiB, used=0.00B
>> > GlobalReserve, single: total=512.00MiB, used=128.00KiB
>> > # btrfs property get /store/
>> > ro=false
>> > label=store
>> > # btrfs device stats /store/
>> > (shows all zeros)
>> > # btrfs balance status /store/
>> > No balance found on '/store/'
>
> # btrfs filesystem show
> Label: 'store'  uuid: 296404d1-bd3f-417d-8501-02f8d7906bcf
>         Total devices 22 FS bytes used 6.50TiB
>         devid    1 size 931.51GiB used 558.02GiB path /dev/sdb
>         devid    2 size 931.51GiB used 559.00GiB path /dev/sdc
>         devid    3 size 931.51GiB used 559.00GiB path /dev/sdd
>         devid    4 size 931.51GiB used 559.00GiB path /dev/sde
>         devid    5 size 931.51GiB used 559.00GiB path /dev/sdf
>         devid    6 size 931.51GiB used 559.00GiB path /dev/sdg
>         devid    7 size 931.51GiB used 559.00GiB path /dev/sdh
>         devid    8 size 931.51GiB used 559.00GiB path /dev/sdi
>         devid    9 size 931.51GiB used 559.00GiB path /dev/sdj
>         devid   10 size 931.51GiB used 559.00GiB path /dev/sdk
>         devid   11 size 931.51GiB used 559.00GiB path /dev/sdl
>         devid   12 size 931.51GiB used 559.00GiB path /dev/sdm
>         devid   13 size 931.51GiB used 559.00GiB path /dev/sdn
>         devid   14 size 931.51GiB used 559.00GiB path /dev/sdo
>         devid   15 size 931.51GiB used 559.00GiB path /dev/sdp
>         devid   16 size 931.51GiB used 559.00GiB path /dev/sdq
>         devid   17 size 931.51GiB used 559.00GiB path /dev/sdr
>         devid   18 size 931.51GiB used 559.00GiB path /dev/sds
>         devid   19 size 931.51GiB used 559.00GiB path /dev/sdt
>         devid   20 size 931.51GiB used 559.00GiB path /dev/sdu
>         devid   21 size 931.51GiB used 559.01GiB path /dev/sdv
>         devid   22 size 931.51GiB used 560.01GiB path /dev/sdw
>
> Btrfs v3.17.1
>
>> > iostat 1 exposes the following problem:
>> >
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           16.96    0.00   17.09   65.95    0.00    0.00
>> >
>> > Device:   tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> > sda      0.00         0.00         0.00          0          0
>> > sdc      0.00         0.00         0.00          0          0
>> > sdb      0.00         0.00         0.00          0          0
>> > sde      0.00         0.00         0.00          0          0
>> > sdd      0.00         0.00         0.00          0          0
>> > sdf      0.00         0.00         0.00          0          0
>> > sdg      0.00         0.00         0.00          0          0
>> > sdj      0.00         0.00         0.00          0          0
>> > sdh      0.00         0.00         0.00          0          0
>> > sdk      0.00         0.00         0.00          0          0
>> > sdi      1.00         0.00       200.00          0        200
>> > sdl      0.00         0.00         0.00          0          0
>> > sdn     48.00         0.00     17260.00          0      17260
>> > sdm      0.00         0.00         0.00          0          0
>> > sdp      0.00         0.00         0.00          0          0
>> > sdo      0.00         0.00         0.00          0          0
>> > sdq      0.00         0.00         0.00          0          0
>> > sdr      0.00         0.00         0.00          0          0
>> > sds      0.00         0.00         0.00          0          0
>> > sdt      0.00         0.00         0.00          0          0
>> > sdv      0.00         0.00         0.00          0          0
>> > sdw      0.00         0.00         0.00          0          0
>> > sdu      0.00         0.00         0.00          0          0
>
> That was the load profile I saw at the time. The write load moved from disk
> to disk over time, so I do not suspect a broken disk. Currently the write
> profile is different:
> https://drive.google.com/file/d/0BygFL6N3ZVUAVmxaZ1Q5VTZpSGc/view?usp=sharing
> Sometimes it looks like the above, sometimes all zeros; most of the time the
> load is very low.
>
>> > write goes to one disk. I've tried to debug what's going on in the kworker and did
>> >
>> > $ echo workqueue:workqueue_queue_work >> /sys/kernel/debug/tracing/set_event
>> > $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2
>
> I've put the result here:
> https://drive.google.com/file/d/0BygFL6N3ZVUAMWxCQ0tDREE1Uzg/view?usp=sharing
>
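Beyond the workqueue tracepoint, it might be quicker to look at what that
kworker is actually spending its time on. A minimal sketch, untested here,
assuming the busy kworker is still PID 8644 as in the top output above and
that the kernel exposes /proc/<pid>/stack and has perf available:

# cat /proc/8644/stack                 (run a few times to sample its kernel stack)
# perf record -g -p 8644 -- sleep 30   (profile it for ~30 seconds)
# perf report --stdio | head -n 50     (show the hottest kernel call paths)

The hot paths in that output should at least say whether the time is going
into transaction commit, delayed refs, the allocator, or something else.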
Is the Btrfs single profile expected to write to block devices in
parallel? Because of COW, any write is initially a new write rather than
an overwrite. All writes go into a single chunk on a single device until
that chunk is full, then onto the next device with a new chunk until that
one is full, and so on. This behavior only changes once all space on all
block devices has been allocated to data or metadata chunks, which can
actually take some time. If there are many chunks on many devices that
are 90% full, I don't know how Btrfs decides which chunks it writes to,
but I still don't think it's highly parallelized the way it is on XFS.
Are reads parallelized in this case? Unless both reads and writes are
parallelized, the single profile isn't scalable. So before calling this a
bug, I'd wonder whether the design expects this layout to be used for the
intended use case, rather than raid0.

The chances of a single drive dying with 22 drives in the volume are
astronomically high, probably 100% over as short a period as 6 months,
and then what? I'm unaware of any existing or planned functionality that
would allow such a volume to remain functional: to do that, Btrfs would
need to delete all affected files so they're no longer referenced.

I've actually thought of this layout for use with GlusterFS and Ceph, in
such a way that a drive can die and Btrfs informs the distributed
filesystem above it which files are no longer available from this
particular storage brick; the brick's filesystem can then be "cleaned up"
by deleting all the missing files and then deleting the missing device,
thereby stabilizing the existing fs. The distributed filesystem then
starts replicating the missing files according to its policies.

But right now, if any device dies in your example layout, the filesystem
is functionally lost. Yes, you can get the remaining data out of it, but
it's in a sense 1/22nd broken and not fixable as far as I know. I haven't
tried fixing this manually, though, e.g. doing a scrub to get a listing of
the missing files and deleting those files, then adding a new device and
deleting the missing device (a rough sketch of that sequence is below).
If the missing files aren't explicitly deleted, I think the fs still holds
references to them and will just return read/corruption errors rather than
denying that the files even exist.

--
Chris Murphy
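P.S. A rough, untested sketch of that manual sequence, assuming the
degraded filesystem still mounts read-write; device names are
placeholders:

# mount -o degraded /dev/sdb /store      (mount with the dead device absent)
# btrfs scrub start -Bd /store           (errors in the kernel log should point at affected files)
# rm ...                                 (delete the files whose data lived on the dead device)
# btrfs device add /dev/sdX /store       (add a replacement; /dev/sdX is a placeholder)
# btrfs device delete missing /store     (then drop the missing device)

Whether the degraded read-write mount and the scrub behave sanely with a
whole device missing is exactly the part I haven't tested.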