Re: bio linked list corruption.
On 10/26/2016 04:00 PM, Chris Mason wrote:
>
>
> On 10/26/2016 03:06 PM, Linus Torvalds wrote:
>> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
>>>
>>> The stacks show nearly all of them are stuck in sync_inodes_sb
>>
>> That's just wb_wait_for_completion(), and it means that some IO isn't
>> completing.
>>
>> There's also a lot of processes waiting for inode_lock(), and a few
>> waiting for mnt_want_write()
>>
>> Ignoring those, we have
>>
>>> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>>> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
>>> [] sync_filesystem+0x6e/0xa0
>>> [] SyS_syncfs+0x3c/0x70
>>> [] do_syscall_64+0x5c/0x170
>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [] 0x
>>
>> Don't know this one. There's a couple of them. Could there be some
>> ABBA deadlock on the ordered roots waiting?
>
> It's always possible, but we haven't changed anything here.
>
> I've tried a long list of things to reproduce this on my test boxes,
> including days of trinity runs and a kernel module to exercise vmalloc,
> and thread creation.
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>

This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in < 10 minutes. I've done 30 minutes each with XFS and Ext4 without luck.

This is all in a virtual machine that I can copy on to a bunch of hosts. So I'll get some parallel tests going tonight to narrow it down.

[ cut here ]
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (e8d80b08), but was 88012b65fb88. (prev=880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
880104eff868 814fde0f 8151c46e 880104eff8c8 880104eff8c8 880104eff8b8 810648cf 880128cab2c0 00213fc57c68 8801384e8928 880128cab180
Call Trace:
[] dump_stack+0x53/0x74
[] ? __list_add+0xbe/0xd0
[] __warn+0xff/0x120
[] warn_slowpath_fmt+0x49/0x50
[] __list_add+0xbe/0xd0
[] blk_sq_make_request+0x388/0x580
[] generic_make_request+0x104/0x200
[] submit_bio+0x65/0x130
[] ? __percpu_counter_add+0x96/0xd0
[] btrfs_map_bio+0x23c/0x310
[] btrfs_submit_bio_hook+0xd3/0x190
[] submit_one_bio+0x6d/0xa0
[] flush_epd_write_bio+0x4e/0x70
[] extent_writepages+0x5d/0x70
[] ? btrfs_releasepage+0x50/0x50
[] ? wbc_attach_and_unlock_inode+0x6e/0x170
[] btrfs_writepages+0x27/0x30
[] do_writepages+0x20/0x30
[] __filemap_fdatawrite_range+0xb5/0x100
[] filemap_fdatawrite_range+0x13/0x20
[] btrfs_fdatawrite_range+0x2b/0x70
[] btrfs_sync_file+0x88/0x490
[] ? group_send_sig_info+0x42/0x80
[] ? kill_pid_info+0x5d/0x90
[] ? SYSC_kill+0xba/0x1d0
[] ? __sb_end_write+0x58/0x80
[] vfs_fsync_range+0x4c/0xb0
[] ? syscall_trace_enter+0x201/0x2e0
[] vfs_fsync+0x1c/0x20
[] do_fsync+0x3d/0x70
[] ? syscall_slow_exit_work+0xfb/0x100
[] SyS_fsync+0x10/0x20
[] do_syscall_64+0x55/0xd0
[] ? prepare_exit_to_usermode+0x37/0x40
[] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---
Re: btrfs bio linked list corruption.
On 10/11/2016 11:19 AM, Dave Jones wrote:
On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
> On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
>
> Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

> TBH, I don't see anything
> in splice-related stuff that could come anywhere near that (short of
> some general memory corruption having random effects of that sort).
>
> Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to reproduce today.

This call trace is reading metadata so we can finish the truncate. I'd say adding more memory pressure would make it happen more often.

I'll try to trigger.

-chris
[GIT PULL] Btrfs
Hi Linus, My for-linus-4.9 has our merge window pull: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.9 This is later than normal because I was tracking down a use-after-free during btrfs/101 in xfstests. I had hoped to fix up the offending patch, but wasn't happy with the size of the changes at this point in the merge window. The use-after-free was enough of a corner case that I didn't want to rebase things out at this point. So instead the top of the pull is my revert, and the rest of these were prepped by Dave Sterba (thanks Dave!). This is a big variety of fixes and cleanups. Liu Bo continues to fixup fuzzer related problems, and some of Josef's cleanups are prep for his bigger extent buffer changes (slated for v4.10). Liu Bo (13) commits (+207/-36): Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf (+5/-1) Btrfs: return gracefully from balance if fs tree is corrupted (+17/-6) Btrfs: improve check_node to avoid reading corrupted nodes (+28/-4) Btrfs: add error handling for extent buffer in print tree (+7/-0) Btrfs: memset to avoid stale content in btree node block (+11/-0) Btrfs: bail out if block group has different mixed flag (+14/-0) Btrfs: memset to avoid stale content in btree leaf (+28/-19) Btrfs: fix memory leak in reading btree blocks (+9/-0) Btrfs: fix memory leak of block group cache (+75/-0) Btrfs: kill BUG_ON in run_delayed_tree_ref (+7/-1) Btrfs: remove BUG_ON in start_transaction (+1/-4) Btrfs: fix memory leak in do_walk_down (+1/-0) Btrfs: remove BUG() in raid56 (+4/-1) Jeff Mahoney (7) commits (+849/-902): btrfs: btrfs_debug should consume fs_info when DEBUG is not defined (+10/-4) btrfs: clean the old superblocks before freeing the device (+11/-27) btrfs: convert send's verbose_printk to btrfs_debug (+38/-27) btrfs: convert printk(KERN_* to use pr_* calls (+205/-275) btrfs: convert pr_* to btrfs_* where possible (+231/-177) btrfs: unsplit printed strings (+324/-391) btrfs: add dynamic debug support 
(+30/-1) Josef Bacik (5) commits (+178/-156): Btrfs: kill the start argument to read_extent_buffer_pages (+15/-28) Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written (+33/-8) Btrfs: add a flags field to btrfs_fs_info (+99/-109) Btrfs: don't leak reloc root nodes on error (+4/-0) Btrfs: don't BUG() during drop snapshot (+27/-11) Goldwyn Rodrigues (3) commits (+3/-18): btrfs: Do not reassign count in btrfs_run_delayed_refs (+0/-1) btrfs: Remove already completed TODO comment (+0/-2) btrfs: parent_start initialization cleanup (+3/-15) Luis Henriques (2) commits (+0/-4): btrfs: Fix warning "variable ‘blocksize’ set but not used" (+0/-2) btrfs: Fix warning "variable ‘gen’ set but not used" (+0/-2) Eric Sandeen (1) commits (+1/-1): btrfs: fix perms on demonstration debugfs interface Anand Jain (1) commits (+20/-6): btrfs: fix a possible umount deadlock Lu Fengqi (1) commits (+369/-10): btrfs: fix check_shared for fiemap ioctl Chris Mason (1) commits (+15/-11): Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" Masahiro Yamada (1) commits (+8/-28): btrfs: squash lines for simple wrapper functions Qu Wenruo (1) commits (+37/-25): btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset Arnd Bergmann (1) commits (+7/-10): btrfs: fix btrfs_no_printk stub helper David Sterba (1) commits (+9/-0): btrfs: create example debugfs file only in debugging build Naohiro Aota (1) commits (+11/-15): btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs Total: (39) commits (+1714/-1222) fs/btrfs/backref.c| 409 ++ fs/btrfs/btrfs_inode.h| 11 -- fs/btrfs/check-integrity.c| 342 +++ fs/btrfs/compression.c| 6 +- fs/btrfs/ctree.c | 56 ++ fs/btrfs/ctree.h | 116 fs/btrfs/delayed-inode.c | 25 ++- fs/btrfs/delayed-ref.c| 15 +- fs/btrfs/dev-replace.c| 21 ++- fs/btrfs/dir-item.c | 7 +- fs/btrfs/disk-io.c| 237 fs/btrfs/disk-io.h| 2 + fs/btrfs/extent-tree.c| 198 +++- fs/btrfs/extent_io.c | 170 +++--- fs/btrfs/extent_io.h | 4 +- 
fs/btrfs/file.c | 43 - fs/btrfs/free-space-cache.c | 21 ++- fs/btrfs/free-space-cache.h | 6 +- fs/btrfs/free-space-tree.c| 20 ++- fs/btrfs/inode-map.c | 31 ++-- fs/btrfs/inode.c | 70 +--- fs/btrfs/ioctl.c | 14 +- fs/btrfs/lzo.c| 6 +- fs/btrfs/ordered-data.c | 4 +- fs/btrfs/print-tree.c | 93 +- fs/btrfs/qgroup.c | 77 fs/bt
Re: btrfs bio linked list corruption.
On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
>
> [ cut here ]
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (e8806648), but was
> c967fcd8. (prev=880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> c9d87458 8d32007c c9d874a8
> c9d87498 8d07a6c1 00210246 88050388e880
> 880503878b80 e8806648 e8c06600 880502808008
> Call Trace:
> [] dump_stack+0x4f/0x73
> [] __warn+0xc1/0xe0
> [] warn_slowpath_fmt+0x5a/0x80
> [] __list_add+0x89/0xb0
> [] blk_sq_make_request+0x2f8/0x350

	/*
	 * A task plug currently exists. Since this is completely lockless,
	 * utilize that to temporarily store requests until the task is
	 * either done or scheduled away.
	 */
	plug = current->plug;
	if (plug) {
		blk_mq_bio_to_request(rq, bio);
		if (!request_count)
			trace_block_plug(q);

		blk_mq_put_ctx(data.ctx);

		if (request_count >= BLK_MAX_REQUEST_COUNT) {
			blk_flush_plug_list(plug, false);
			trace_block_plug(q);
		}

		list_add_tail(&rq->queuelist, &plug->mq_list);
		^^

Dave, is this where we're crashing? This seems strange.

-chris
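For readers who have not met the list debugging check behind the "prev->next should be next" warning, here is a minimal user-space sketch of the idea. It is only loosely modeled on what lib/list_debug.c does, and every name in it (toy_list_head, toy_list_add_checked, toy_list_add_tail) is hypothetical rather than kernel API. Before linking a new node between prev and next, it verifies that the two neighbours still point at each other and refuses the insert if they do not.

```c
#include <stdbool.h>
#include <stdio.h>

/* Circular doubly linked list node, shaped like the kernel's list_head. */
struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void toy_list_init(struct toy_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Insert new_node between prev and next, but only if the two neighbours
 * still agree about each other. Returns false and leaves the list
 * untouched when the links are inconsistent.
 */
static bool toy_list_add_checked(struct toy_list_head *new_node,
                                 struct toy_list_head *prev,
                                 struct toy_list_head *next)
{
    if (next->prev != prev) {
        fprintf(stderr,
                "list_add corruption. next->prev should be prev (%p), but was %p.\n",
                (void *)prev, (void *)next->prev);
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr,
                "list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        return false;
    }

    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return true;
}

/* Append at the tail: the new node goes between head->prev and head. */
static bool toy_list_add_tail(struct toy_list_head *new_node,
                              struct toy_list_head *head)
{
    return toy_list_add_checked(new_node, head->prev, head);
}
```

In this sketch a failed check means some other code overwrote a neighbour's pointers after it was linked, which is why the warning tends to fire far from the code that caused the damage.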
Re: btrfs bio linked list corruption.
On 10/13/2016 02:16 PM, Dave Jones wrote:
On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
> On 10/12/2016 10:40 AM, Dave Jones wrote:
> > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> > > >
> > > >
> > > > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > > > This is from Linus' current tree, with Al's iovec fixups on top.
> > > > >
> > > > > [ cut here ]
> > > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> > > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > > > c9d87458 8d32007c c9d874a8
> > > > > c9d87498 8d07a6c1 00210246 88050388e880
> > >
> > > I hit this again overnight, it's the same trace, the only difference
> > > being slightly different addresses in the list pointers:
> > >
> > > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
> > >
> > > I'm actually a little surprised that ->next was the same across two
> > > reboots on two different kernel builds. That might be a sign this is
> > > more repeatable than I'd thought, even if it does take hours of runtime
> > > right now to trigger it. I'll try and narrow the scope of what trinity
> > > is doing to see if I can make it happen faster.
> >
> > .. and of course the first thing that happens is a completely different
> > btrfs trace..
> >
> >
> > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> > c900019076a8 b731ff3c
> > c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
> > 0801 880501cfa2a8 008a 008a
>
> This isn't even IO. Uuug. We're going to need a fast enough test
> that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync. Mix them together though, and something goes awry.

Hasn't triggered here yet. I'll leave it running though.

-chris
Re: btrfs bio linked list corruption.
On 10/12/2016 10:40 AM, Dave Jones wrote:
On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
> On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> >
> >
> > On 10/11/2016 10:45 AM, Dave Jones wrote:
> > > This is from Linus' current tree, with Al's iovec fixups on top.
> > >
> > > [ cut here ]
> > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > > c9d87458 8d32007c c9d874a8
> > > c9d87498 8d07a6c1 00210246 88050388e880
>
> I hit this again overnight, it's the same trace, the only difference
> being slightly different addresses in the list pointers:
>
> [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
>
> I'm actually a little surprised that ->next was the same across two
> reboots on two different kernel builds. That might be a sign this is
> more repeatable than I'd thought, even if it does take hours of runtime
> right now to trigger it. I'll try and narrow the scope of what trinity
> is doing to see if I can make it happen faster.

.. and of course the first thing that happens is a completely different btrfs trace..

WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
c900019076a8 b731ff3c
c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
0801 880501cfa2a8 008a 008a

This isn't even IO. Uuug. We're going to need a fast enough test that we can bisect.

-chris
Re: [PATCH] btrfs: limit async_work allocation and worker func duration
On 12/12/2016 03:35 PM, Maxim Patlasov wrote:
On 12/12/2016 06:54 AM, David Sterba wrote:

As far as we don't have any NO_THRESHOLD users of btrfs_workqueue_normal_congested for now, I tend to think it's better to add a descriptive comment and simply return "false" from btrfs_workqueue_normal_congested rather than trying to address some future needs now. See please v2 of the patch.

Thanks, I've got v2 and added a cc for stable to v3.15+, which isn't exactly right, but it's when the new workqueue system was put in place.

-chris
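The "return false for NO_THRESHOLD" approach discussed above can be sketched in user space as follows. This is an illustration only, not the patch itself: the struct, the TOY_NO_THRESHOLD constant, and in particular the "pending is more than twice the threshold" rule are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical marker for a queue that has thresholding disabled. */
#define TOY_NO_THRESHOLD (-1)

struct toy_workqueue {
    int pending; /* work items queued but not yet executed */
    int thresh;  /* TOY_NO_THRESHOLD when thresholding is off */
};

/*
 * Report whether the queue looks congested.
 *
 * Without a threshold there is nothing meaningful to compare the
 * pending count against, so such queues are simply reported as not
 * congested until some caller actually needs that case handled.
 */
static bool toy_workqueue_congested(const struct toy_workqueue *wq)
{
    if (wq->thresh == TOY_NO_THRESHOLD)
        return false;

    /* Assumed rule for the sketch: congested once pending exceeds 2x thresh. */
    return wq->pending > wq->thresh * 2;
}
```

A caller that allocates async work items could poll a check like this and back off while it returns true, which is one way to bound how many items are outstanding at once.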
Re: OOM: Better, but still there on 4.9
On 12/16/2016 02:39 AM, Michal Hocko wrote:
[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]
Of course, none of this are workloads that are new / special in any way - prior to 4.8, I never experienced any issues doing the exact same things.

Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
Dec 15 19:02:18 teela kernel: eff0b604 c142bcce eff0b734 eff0b634 c1163332 0292
Dec 15 19:02:18 teela kernel: eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
Dec 15 19:02:18 teela kernel: eff0b678 c110795f c1043895 eff0b664 c11075c7 0007
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel: [] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel: [] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel: [] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel: [] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel: [] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel: [] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel: [] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel: [] __alloc_pages_nodemask+0xbfb/0xc80
Dec 15 19:02:18 teela kernel: [] pagecache_get_page+0xad/0x270
Dec 15 19:02:18 teela kernel: [] alloc_extent_buffer+0x116/0x3e0
Dec 15 19:02:18 teela kernel: [] btrfs_find_create_tree_block+0xe/0x10
Dec 15 19:02:18 teela kernel: [] btrfs_alloc_tree_block+0x1ef/0x5f0
Dec 15 19:02:18 teela kernel: [] __btrfs_cow_block+0x143/0x5f0
Dec 15 19:02:18 teela kernel: [] btrfs_cow_block+0x13a/0x220
Dec 15 19:02:18 teela kernel: [] btrfs_search_slot+0x1d1/0x870
Dec 15 19:02:18 teela kernel: [] btrfs_lookup_file_extent+0x4d/0x60
Dec 15 19:02:18 teela kernel: [] __btrfs_drop_extents+0x176/0x1070
Dec 15 19:02:18 teela kernel: [] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel: [] ? start_transaction+0x65/0x4b0
Dec 15 19:02:18 teela kernel: [] ? __kmalloc+0x147/0x1e0
Dec 15 19:02:18 teela kernel: [] cow_file_range_inline+0x215/0x6b0
Dec 15 19:02:18 teela kernel: [] cow_file_range.isra.49+0x55c/0x6d0
Dec 15 19:02:18 teela kernel: [] ? lock_extent_bits+0x75/0x1e0
Dec 15 19:02:18 teela kernel: [] run_delalloc_range+0x441/0x470
Dec 15 19:02:18 teela kernel: [] writepage_delalloc.isra.47+0x144/0x1e0
Dec 15 19:02:18 teela kernel: [] __extent_writepage+0xd8/0x2b0
Dec 15 19:02:18 teela kernel: [] extent_writepages+0x25c/0x380
Dec 15 19:02:18 teela kernel: [] ? btrfs_real_readdir+0x610/0x610
Dec 15 19:02:18 teela kernel: [] btrfs_writepages+0x1f/0x30
Dec 15 19:02:18 teela kernel: [] do_writepages+0x15/0x40
Dec 15 19:02:18 teela kernel: [] __writeback_single_inode+0x35/0x2f0
Dec 15 19:02:18 teela kernel: [] writeback_sb_inodes+0x16e/0x340
Dec 15 19:02:18 teela kernel: [] wb_writeback+0xaa/0x280
Dec 15 19:02:18 teela kernel: [] wb_workfn+0xd8/0x3e0
Dec 15 19:02:18 teela kernel: [] process_one_work+0x114/0x3e0
Dec 15 19:02:18 teela kernel: [] worker_thread+0x2f/0x4b0
Dec 15 19:02:18 teela kernel: [] ? create_worker+0x180/0x180
Dec 15 19:02:18 teela kernel: [] kthread+0x97/0xb0
Dec 15 19:02:18 teela kernel: [] ? __kthread_parkme+0x60/0x60
Dec 15 19:02:18 teela kernel: [] ret_from_fork+0x1b/0x28
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0

OK, so there is still some anonymous memory that could be swapped out and quite a lot of page cache. This might be harder to reclaim because the allocation is a GFP_NOFS request which is limited in its reclaim capabilities. It might be possible that those pagecache pages are pinned in some way by the filesystem.

 unevictable:0 dirty:649 writeback:0 unstable:0
 slab_reclaimable:40662 slab_unreclaimable:17754
 mapped:7382 shmem:202 pagetables:351 bounce:0
 free:206736 free_pcp:332 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB
[GIT PULL] Btrfs
23/-96) btrfs: qgroup: Add comments explaining how btrfs qgroup works (+28/-0) Robbie Ko (3) commits (+5/-6): Btrfs: fix tree search logic when replaying directory entry deletes (+1/-2) Btrfs: fix deadlock caused by fsync when logging directory entries (+2/-2) Btrfs: fix enospc in hole punching (+2/-2) Wang Xiaoguang (3) commits (+42/-7): btrfs: cleanup: use already calculated value in btrfs_should_throttle_delayed_refs() (+1/-1) btrfs: add necessary comments about tickets_id (+4/-0) btrfs: improve delayed refs iterations (+37/-6) Liu Bo (2) commits (+12/-6): Btrfs: adjust len of writes if following a preallocated extent (+5/-3) Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty (+7/-3) Chris Mason (2) commits (+11/-8): Revert "Btrfs: adjust len of writes if following a preallocated extent" (+3/-5) Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors (+8/-3) Josef Bacik (2) commits (+29/-5): Btrfs: abort transaction if fill_holes() fails (+17/-2) Btrfs: fix file extent corruption (+12/-3) Omar Sandoval (1) commits (+3/-3): Btrfs: deal with existing encompassing extent map in btrfs_get_extent() Maxim Patlasov (1) commits (+19/-2): btrfs: limit async_work allocation and worker func duration Xiaoguang Wang (1) commits (+3/-10): btrfs: remove useless comments Adam Borowski (1) commits (+40/-3): btrfs: make block group flags in balance printks human-readable Nick Terrell (1) commits (+1/-0): btrfs: Call kunmap if zlib_inflateInit2 fails Christophe JAILLET (1) commits (+0/-2): btrfs: remove redundant check of btrfs_iget return value Domagoj Tršan (1) commits (+6/-6): btrfs: change btrfs_csum_final result param type to u8 Shailendra Verma (1) commits (+6/-15): btrfs: return early from failed memory allocations in ioctl handlers Total: (77) commits (+5389/-5304) fs/btrfs/async-thread.c| 14 + fs/btrfs/async-thread.h|1 + fs/btrfs/backref.c | 10 +- fs/btrfs/check-integrity.c | 103 +-- fs/btrfs/check-integrity.h |5 +- fs/btrfs/compression.c | 196 ++-- 
fs/btrfs/compression.h | 12 +- fs/btrfs/ctree.c | 495 +- fs/btrfs/ctree.h | 241 ++--- fs/btrfs/delayed-inode.c | 147 ++- fs/btrfs/delayed-inode.h | 21 +- fs/btrfs/delayed-ref.c | 20 +- fs/btrfs/delayed-ref.h | 14 +- fs/btrfs/dev-replace.c | 68 +- fs/btrfs/dev-replace.h |4 +- fs/btrfs/dir-item.c| 45 +- fs/btrfs/disk-io.c | 595 ++-- fs/btrfs/disk-io.h | 34 +- fs/btrfs/export.c | 10 +- fs/btrfs/extent-tree.c | 1551 ++-- fs/btrfs/extent_io.c | 112 ++- fs/btrfs/extent_io.h | 17 +- fs/btrfs/file-item.c | 207 ++--- fs/btrfs/file.c| 249 ++--- fs/btrfs/free-space-cache.c| 164 ++-- fs/btrfs/free-space-cache.h| 12 +- fs/btrfs/free-space-tree.c | 44 +- fs/btrfs/inode-item.c | 11 +- fs/btrfs/inode-map.c | 22 +- fs/btrfs/inode.c | 910 +-- fs/btrfs/ioctl.c | 603 +++-- fs/btrfs/lzo.c | 17 +- fs/btrfs/ordered-data.c| 38 +- fs/btrfs/ordered-data.h|4 +- fs/btrfs/print-tree.c | 19 +- fs/btrfs/print-tree.h |4 +- fs/btrfs/props.c |5 +- fs/btrfs/qgroup.c | 299 +- fs/btrfs/qgroup.h | 64 +- fs/btrfs/raid56.c | 78 +- fs/btrfs/raid56.h |8 +- fs/btrfs/reada.c | 62 +- fs/btrfs/relocation.c | 453 +- fs/btrfs/root-tree.c | 28 +- fs/btrfs/scrub.c | 181 ++-- fs/btrfs/send.c| 33 +- fs/btrfs/super.c | 138 ++- fs/btrfs/tests/btrfs-tests.c | 13 +- fs/btrfs/tests/btrfs-tests.h |4 +- fs/btrfs/tests/extent-buffer-tests.c |7 +- fs/btrfs/tests/extent-io-tests.c |7 +- fs/btrfs/tests/free-space-tests.c | 18 +- fs/btrfs/tests/free-space-tree-tests.c |9 +- fs/btrfs/tests/inode-tests.c | 16 +- fs/btrfs/tests/qgroup-tests.c | 11 +- fs/btrfs/transaction.c | 615 +++-- fs/btrfs/transaction.h | 29 +- fs/btrfs/tree-log.c| 202 +++-- fs/btrfs/uuid-tree.c | 23 +- fs/btrfs/volumes.c
Re: OOM: Better, but still there on 4.9
On 12/16/2016 02:39 AM, Michal Hocko wrote:
[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]
Of course, none of this are workloads that are new / special in any way - prior to 4.8, I never experienced any issues doing the exact same things.

Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
Dec 15 19:02:18 teela kernel: eff0b604 c142bcce eff0b734 eff0b634 c1163332 0292
Dec 15 19:02:18 teela kernel: eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734
Dec 15 19:02:18 teela kernel: eff0b678 c110795f c1043895 eff0b664 c11075c7 0007
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel: [] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel: [] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel: [] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel: [] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel: [] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel: [] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel: [] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel: [] __alloc_pages_nodemask+0xbfb/0xc80
Dec 15 19:02:18 teela kernel: [] pagecache_get_page+0xad/0x270
Dec 15 19:02:18 teela kernel: [] alloc_extent_buffer+0x116/0x3e0
Dec 15 19:02:18 teela kernel: [] btrfs_find_create_tree_block+0xe/0x10
Dec 15 19:02:18 teela kernel: [] btrfs_alloc_tree_block+0x1ef/0x5f0
Dec 15 19:02:18 teela kernel: [] __btrfs_cow_block+0x143/0x5f0
Dec 15 19:02:18 teela kernel: [] btrfs_cow_block+0x13a/0x220
Dec 15 19:02:18 teela kernel: [] btrfs_search_slot+0x1d1/0x870
Dec 15 19:02:18 teela kernel: [] btrfs_lookup_file_extent+0x4d/0x60
Dec 15 19:02:18 teela kernel: [] __btrfs_drop_extents+0x176/0x1070
Dec 15 19:02:18 teela kernel: [] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel: [] ? start_transaction+0x65/0x4b0
Dec 15 19:02:18 teela kernel: [] ? __kmalloc+0x147/0x1e0
Dec 15 19:02:18 teela kernel: [] cow_file_range_inline+0x215/0x6b0
Dec 15 19:02:18 teela kernel: [] cow_file_range.isra.49+0x55c/0x6d0
Dec 15 19:02:18 teela kernel: [] ? lock_extent_bits+0x75/0x1e0
Dec 15 19:02:18 teela kernel: [] run_delalloc_range+0x441/0x470
Dec 15 19:02:18 teela kernel: [] writepage_delalloc.isra.47+0x144/0x1e0
Dec 15 19:02:18 teela kernel: [] __extent_writepage+0xd8/0x2b0
Dec 15 19:02:18 teela kernel: [] extent_writepages+0x25c/0x380
Dec 15 19:02:18 teela kernel: [] ? btrfs_real_readdir+0x610/0x610
Dec 15 19:02:18 teela kernel: [] btrfs_writepages+0x1f/0x30
Dec 15 19:02:18 teela kernel: [] do_writepages+0x15/0x40
Dec 15 19:02:18 teela kernel: [] __writeback_single_inode+0x35/0x2f0
Dec 15 19:02:18 teela kernel: [] writeback_sb_inodes+0x16e/0x340
Dec 15 19:02:18 teela kernel: [] wb_writeback+0xaa/0x280
Dec 15 19:02:18 teela kernel: [] wb_workfn+0xd8/0x3e0
Dec 15 19:02:18 teela kernel: [] process_one_work+0x114/0x3e0
Dec 15 19:02:18 teela kernel: [] worker_thread+0x2f/0x4b0
Dec 15 19:02:18 teela kernel: [] ? create_worker+0x180/0x180
Dec 15 19:02:18 teela kernel: [] kthread+0x97/0xb0
Dec 15 19:02:18 teela kernel: [] ? __kthread_parkme+0x60/0x60
Dec 15 19:02:18 teela kernel: [] ret_from_fork+0x1b/0x28
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0

OK, so there is still some anonymous memory that could be swapped out and quite a lot of page cache. This might be harder to reclaim because the allocation is a GFP_NOFS request which is limited in its reclaim capabilities. It might be possible that those pagecache pages are pinned in some way by the filesystem.

Reading harder, it's possible those pagecache pages are all from the btree inode. They shouldn't be pinned by btrfs, kswapd should be able to wander in and free a good chunk. What btrfs wants to happen is for this allocation to sit and wait for kswapd to make progress.

-chris
[GIT PULL] Btrfs fixes
Hi Linus,

Dave Sterba queued up a few fixes for btrfs. I have them in my for-linus-4.10 branch:

These are all over the place. The tracepoint part of the pull fixes a crash and adds a little more information to two tracepoints, while the rest are good old fashioned fixes.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

Liu Bo (5) commits (+34/-11):
    Btrfs: adjust outstanding_extents counter properly when dio write is split (+9/-2)
    Btrfs: add truncated_len for ordered extent tracepoints (+4/-0)
    Btrfs: use down_read_nested to make lockdep silent (+2/-1)
    Btrfs: add 'inode' for extent map tracepoint (+9/-5)
    Btrfs: fix lockdep warning about log_mutex (+10/-3)

David Sterba (2) commits (+80/-69):
    btrfs: fix crash when tracepoint arguments are freed by wq callbacks (+24/-13)
    btrfs: make tracepoint format strings more compact (+56/-56)

Jeff Mahoney (2) commits (+4/-1):
    btrfs: fix locking when we put back a delayed ref that's too new (+1/-1)
    btrfs: fix error handling when run_delayed_extent_op fails (+3/-0)

Pan Bian (1) commits (+1/-3):
    btrfs: return the actual error value from from btrfs_uuid_tree_iterate

Total: (10) commits (+119/-84)

 fs/btrfs/async-thread.c      |  15 +++--
 fs/btrfs/extent-tree.c       |   8 ++-
 fs/btrfs/inode.c             |  13 +++-
 fs/btrfs/tree-log.c          |  13 +++-
 fs/btrfs/uuid-tree.c         |   4 +-
 include/trace/events/btrfs.h | 146 +++
 6 files changed, 117 insertions(+), 82 deletions(-)
Re: [Regression 4.7-rc1] btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
On 01/06/2017 12:22 PM, Joseph Salisbury wrote:
Hi Luke,

A kernel bug report was opened against Ubuntu [0]. This bug was fixed by the following commit in v4.7-rc1:

commit 4c63c2454eff996c5e27991221106eb511f7db38
Author: Luke Dashjr
Date: Thu Oct 29 08:22:21 2015 +

    btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl

However, this commit introduced a new regression. With this commit applied, "btrfs fi show" no longer works and the btrfs snapshot functionality breaks.

I was hoping to get your feedback, since you are the patch author. Do you think gathering any additional data will help diagnose this issue, or would it be best to submit a revert request?

This is working for me, could you please include an strace of the problem?

Thanks!

-chris
Re: OOM: Better, but still there on
On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:
On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:

One thing to note here, when we are talking about 32b kernel, things have changed in 4.8 when we moved from the zone based to node based reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") and associated patches). It is possible that the reporter is hitting some pathological path which needs fixing but it might be also related to something else. So I am rather not trying to blame 32b yet...

It might be interesting to put tracing on releasepage and see if btrfs is pinning pages around. I can't see how 32bit kernels would be different, but maybe we're hitting a weird corner.

-chris
Re: OOM: Better, but still there on 4.9
On 12/16/2016 05:14 PM, Michal Hocko wrote:
On Fri 16-12-16 13:15:18, Chris Mason wrote:
On 12/16/2016 02:39 AM, Michal Hocko wrote:
[...]

I believe the right way to go around this is to pursue what I've started in [1]. I will try to prepare something for testing today for you. Stay tuned. But I would be really happy if somebody from the btrfs camp could check the NOFS aspect of this allocation. We have already seen allocation stalls from this path quite recently

Just double checking, are you asking why we're using GFP_NOFS to avoid going into btrfs from the btrfs writepages call, or are you asking why we aren't allowing highmem?

I am more interested in the NOFS part. Why cannot this be a full GFP_KERNEL context? What kind of locks we would lock up when recursing to the fs via slab shrinkers?

Since this is our writepages call, any jump into direct reclaim would go to writepage, which would end up calling the same set of code to read metadata blocks, which would do a GFP_KERNEL allocation and end up back in writepage again.

We'd also have issues with blowing through transaction reservations since the writepage recursion would have to nest into the running transaction.

-chris
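The recursion described in that last answer can be shown with a deliberately tiny toy model. Nothing here is kernel code: TOY_ALLOW_FS stands in for "reclaim may call back into the filesystem" (roughly the difference between GFP_KERNEL and GFP_NOFS), TOY_MAX_DEPTH stands in for running out of stack, and the model assumes memory is always tight so that every allocation enters direct reclaim.

```c
/* Hypothetical flag: reclaim is allowed to call back into the filesystem. */
#define TOY_ALLOW_FS  0x1u
/* Stand-in for the point where a real kernel would run out of stack. */
#define TOY_MAX_DEPTH 16

static int toy_alloc(unsigned int flags, int depth);

/*
 * Writing a page back needs metadata, and reading that metadata needs
 * memory, so writepage allocates with the same flags it was entered with.
 */
static int toy_writepage(unsigned int flags, int depth)
{
    return toy_alloc(flags, depth);
}

/*
 * Toy allocator under permanent memory pressure. Returns the deepest
 * nesting level reached before the allocation "completes".
 */
static int toy_alloc(unsigned int flags, int depth)
{
    /* NOFS-style request: reclaim stays out of the filesystem. */
    if (!(flags & TOY_ALLOW_FS))
        return depth;

    /* Give up at the depth limit instead of recursing forever. */
    if (depth >= TOY_MAX_DEPTH)
        return depth;

    /* FS-allowed request: direct reclaim re-enters writepage. */
    return toy_writepage(flags, depth + 1);
}
```

With the filesystem flag set, the model only stops because of the artificial depth cap. Without it, the allocation never re-enters writepage, which is the property the writepages path relies on.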
[GIT PULL] Btrfs
Hi Linus

We have a small set of fixes for the next RC:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Zygo tracked down a very old bug with inline compressed extents. I didn't tag this one for stable because I want to do individual tested backports. It's a little tricky and I'd rather do some extra testing on it along the way.

Otherwise they are pretty obvious:

Liu Bo (1) commits (+2/-1):
    Btrfs: fix regression in lock_delalloc_pages

Dmitry V. Levin (1) commits (+0/-27):
    btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

Zygo Blaxell (1) commits (+14/-0):
    btrfs: add missing memset while reading compressed inline extents

Total: (3) commits (+16/-28)

 fs/btrfs/extent_io.c       |  3 ++-
 fs/btrfs/inode.c           | 14 ++
 include/uapi/linux/btrfs.h | 27 ---
 3 files changed, 16 insertions(+), 28 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
    btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
    Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
    Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h     |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c     |  6 +++---
 fs/btrfs/qgroup.c    | 10 +-
 fs/btrfs/send.c      |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)
Re: [PATCH] jump_label: Fix anonymous union initialization
On 03/02/2017 04:42 PM, Steven Rostedt wrote:
> On Thu, 2 Mar 2017 16:07:19 -0500 Jason Baron <jba...@akamai.com> wrote:
>> On 02/28/2017 11:32 AM, Boris Ostrovsky wrote:
>>> Pre-4.6 gcc do not allow direct static initialization of members of anonymous structs/unions. After commit 3821fd35b58d ("jump_label: Reduce the size of struct static_key") STATIC_KEY_INIT_{TRUE|FALSE} definitions cannot be compiled with those older compilers.
>>>
>>> Placing initializers inside curved brackets works around this problem.
>>>
>>> Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
>>> ---
>>>  include/linux/jump_label.h | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
>>> index 8e06d75..518020b 100644
>>> --- a/include/linux/jump_label.h
>>> +++ b/include/linux/jump_label.h
>>> @@ -166,10 +166,10 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
>>>   */
>>>  #define STATIC_KEY_INIT_TRUE \
>>>  	{ .enabled = { 1 }, \
>>> -	  .entries = (void *)JUMP_TYPE_TRUE }
>>> +	  { .entries = (void *)JUMP_TYPE_TRUE } }
>>>  #define STATIC_KEY_INIT_FALSE \
>>>  	{ .enabled = { 0 }, \
>>> -	  .entries = (void *)JUMP_TYPE_FALSE }
>>> +	  { .entries = (void *)JUMP_TYPE_FALSE } }
>>>  #else /* !HAVE_JUMP_LABEL */
>>
>> (Adding Steve to 'cc)
>>
>> Thanks for the fix.
>>
>> Reviewed-by: Jason Baron <jba...@akamai.com>
>
> Funny, Chris pinged me on IRC telling me that jump labels broke with my latest tree. And we discovered it was because of anonymous unions and he was using an older compiler (4.4 or something). I didn't know how to make it work, and we were just going to say "tough, jump labels are not for 4.4". Although, didn't goto asm get added into 4.5? Did someone backport it to the gcc 4.4 compilers? I believe 4.5 handles anonymous unions.
>
> Since the broken commit went through my tree, I'll take this patch. I'm getting ready for another git pull request to Linus.

Compiled-by: Chris Mason <c...@fb.com>

-chris
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has Btrfs round two. These are mostly a continuation of Dave Sterba's collection of cleanups, but Filipe also has some bug fixes and performance improvements.

Nikolay Borisov (42) commits (+611/-579):
    btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
    btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
    btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
    btrfs: all btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
    btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
    btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
    btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
    btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
    btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
    btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
    btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
    btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
    btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
    btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
    btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
    btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
    btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
    btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
    btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
    btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
    btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
    btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
    btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
    btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
    btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
    btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
    btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
    btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
    btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
    btrfs: make repair_io_failure take btrfs_inode (+12/-11)
    btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
    btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
    btrfs: make clean_io_failure take btrfs_inode (+15/-14)
    btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
    btrfs: make free_io_failure take btrfs_inode (+13/-11)
    btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
    btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
    btrfs: Make get_extent_t take btrfs_inode (+59/-54)
    btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
    btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
    btrfs: use predefined limits for calculating maximum number of pages for compression (+6/-5)
    btrfs: derive maximum output size in the compression implementation (+9/-14)
    btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
    btrfs: merge length input and output parameter in compress_pages (+18/-20)
    btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
    btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
    btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
    btrfs: constify device path passed to relevant helpers (+22/-18)
    btrfs: document existence of extent_io ops callbacks (+26/-11)
    btrfs: handle allocation error in update_dev_stat_item (+2/-1)
    btrfs: export compression buffer limits in a header (+15/-10)
    btrfs: constify name of subvolume in creation helpers (+3/-3)
    btrfs: constify buffers used by compression helpers (+3/-3)
    btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
    btrfs: constify input buffer of btrfs_csum_data (+3/-3)
    btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
    Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled (+5/-0)
    Btrfs: try harder to migrate items to left sibling before splitting a leaf (+7/-0)
    Btrfs: fix assertion failure when freeing block groups at close_ctree() (+9/-6)
    Btrfs: incremental send, fix unnecessary hole writes for sparse files (+86/-2)
    Btrfs: fix use-after-free due to wrong order of destroying work queues (+7/-2)
    Btrfs: incremental send, do not delay rename when parent inode is new (+16/-3)
    Btrfs: fix data loss after truncate when using the no-holes feature (+6/-13)
    Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits
[GIT PULL] Btrfs
Hi Linus,

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch we're testing/sending by hand for this release.

Liu Bo (3) commits (+11/-13):
    Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
    Btrfs: fix potential use-after-free for cloned bio (+1/-1)
    Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
    btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

 fs/btrfs/inode.c   | 22 ++
 fs/btrfs/super.c   |  3 +++
 fs/btrfs/volumes.c |  2 +-
 3 files changed, 14 insertions(+), 13 deletions(-)
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:25 PM, Hugo Mills wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
>> On 08/10/2017 04:30 AM, Eric Biggers wrote:
>>> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
>>
>> I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.
>>
>> It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.
>
> Could we please not add more mount options? I get that they're easy to implement, but it's a very blunt instrument. What we tend to see (with both nodatacow and compress) is people using the mount options, then asking for exceptions, discovering that they can't do that, and then falling back to doing it with attributes or btrfs properties. Could we just start with btrfs properties this time round, and cut out the mount option part of this cycle.
>
> In the long run, it'd be great to see most of the btrfs-specific mount options get deprecated and ultimately removed entirely, in favour of attributes/properties, where feasible.

It's a good point, and as was commented later down I'd just do mount -o compress=zstd:3 or something.

But I do prefer properties in general for this. My big point was just that next step is outside of Nick's scope.

-chris
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 04:30 AM, Eric Biggers wrote:
> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>> The memory reported is the amount of memory the compressor requests.
>>
>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>> |----------|-----------|----------|-------|---------|----------|----------|
>> | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
>> | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
>> | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
>> | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
>> | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
>> | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
>> | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
>> | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
>> | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
>> | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
>> | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>
> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).

I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.

-chris
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:00 PM, Eric Biggers wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
>> On 08/10/2017 04:30 AM, Eric Biggers wrote:
>>> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>>>> The memory reported is the amount of memory the compressor requests.
>>>>
>>>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>>>> |----------|-----------|----------|-------|---------|----------|----------|
>>>> | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
>>>> | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
>>>> | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
>>>> | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
>>>> | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
>>>> | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
>>>> | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
>>>> | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
>>>> | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
>>>> | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
>>>> | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>>>
>>> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
>>
>> I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.
>>
>> It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.
>
> I am not surprised --- Zstandard is closer to the state of the art, both format-wise and implementation-wise, than the other choices in BTRFS. My point is that benchmarks need to account for how much data is compressed at a time. This is a common mistake when comparing different compression algorithms; the algorithm name and compression level do not tell the whole story. The dictionary size is extremely significant. No one is going to compress or decompress a 200 MB file as a single stream in kernel mode, so it does not make sense to justify adding Zstandard *to the kernel* based on such a benchmark. It is going to be divided into chunks. How big are the chunks in BTRFS? I thought that it compressed only one page (4 KiB) at a time, but I hope that has been, or is being, improved; 32 KiB - 128 KiB should be a better amount. (And if the amount of data compressed at a time happens to be different between the different algorithms, note that BTRFS benchmarks are likely to be measuring that as much as the algorithms themselves.)

Btrfs hooks the compression code into the delayed allocation mechanism we use to gather large extents for COW. So if you write 100MB to a file, we'll have 100MB to compress at a time (within the limits of the number of pages we allow to collect before forcing it down).

But we want to balance how much memory you might need to uncompress during random reads. So we have an artificial limit of 128KB that we send at a time to the compression code. It's easy to change this, it's just a tradeoff made to limit the cost of reading small bits.

It's the same for zlib, lzo and the new zstd patch.

-chris
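The tradeoff described above (independent chunks of at most 128KB so that a small random read only has to inflate a small amount of data) can be sketched in userspace. This is only an illustration with Python's standard zlib module, not btrfs code: the helper names `compress_chunked` and `read_range` are made up for the example, and the level is an arbitrary choice.

```python
import zlib

CHUNK_SIZE = 128 * 1024  # the 128KB per-call limit mentioned above


def compress_chunked(data, chunk_size=CHUNK_SIZE, level=3):
    """Compress each chunk as its own independent zlib stream.

    Hypothetical helper: every chunk starts with a fresh dictionary,
    so any chunk can be decompressed without touching its neighbours.
    """
    return [zlib.compress(data[off:off + chunk_size], level)
            for off in range(0, len(data), chunk_size)]


def read_range(chunks, offset, length, chunk_size=CHUNK_SIZE):
    """Return data[offset:offset + length], inflating only overlapping chunks."""
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    buf = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    start = offset - first * chunk_size
    return buf[start:start + length]


if __name__ == "__main__":
    data = b"".join(b"line %d of some compressible text\n" % i
                    for i in range(50000))
    chunks = compress_chunked(data)
    whole = zlib.compress(data, 3)
    # Compare the single-stream size with the sum of the independent chunks.
    print(len(data), len(whole), sum(len(c) for c in chunks))
```

A 4 KiB read in the middle of the file decompresses at most two chunks here, whereas the single-stream version would have to inflate everything before the requested offset. That is the cost Eric's point about dictionary resets is weighed against.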
Re: Moving ndctl development into the kernel tree?
On 07/22/2017 02:49 PM, Dan Williams wrote:
> On Fri, Jul 21, 2017 at 7:52 PM, Dan Williams wrote:
>> [ adding Chris ]
>>
>> On Fri, Jul 21, 2017 at 4:44 PM, Dan Williams wrote:
>>> On Fri, Jul 21, 2017 at 3:58 PM, Ingo Molnar wrote:
>>>> * Dan Williams wrote:
>>>>> [...]
>>>>> * Like perf, ndctl borrows the sub-command architecture and option parsing from git. So, this code could be refactored into something shared / generic, i.e. the bits in tools/perf/util/.
>>>>
>>>> Just as a side note, stacktool (tools/stacktool/) is using the Git sub-command and options parsing code as well, and it's already sharing it with perf, via the tools/lib/subcmd/ library. ndctl could use that as well.
>>>
>>> Ah, nice, that refactoring happened about a year after ndctl was born.
>>>
>>> Which brings up the next question about what to do with the git history, but I'd want to know if ndctl is even welcome upstream before digging any deeper.
>>
>> I suspect this would be similar to what Chris did to merge btrfs while retaining the standalone history. Chris, any pointers on what worked well and what if anything you would do differently? I.e. I'm looking to use git filter-branch to rewrite ndctl history as if it had always been in tools/ndctl in the kernel tree. I found this old thread https://lkml.org/lkml/2008/10/30/523 and it seems to also recommend using an older kernel as the branch base.
>
> So it wasn't as painful as I thought it would be, I just used the script Linus recommended in that thread. Here is what I came up with merging the last ndctl release on top of v4.9, and then applying the pending development patches re-filtered to tools/ndctl:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=for-4.14/ndctl
>
> ...the next thing would be to rework the versioning to use the kernel version and switch to using tools/lib/subcmd/.

I'd like to say I figured it all out back then, but the truth is that Linus held my hand the whole way. My memory of it is that his script worked really well, I just ran that and verified the results.

-chris
Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
On 04/25/2017 04:49 PM, Tejun Heo wrote:
> On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
>> Will try that too. I can't see why HT would change it because I see single CPU queues misevaluated. Just in case, you need to tune the test params so that it doesn't load the machine too much and that there are some non-CPU intensive workloads going on to perturb things a bit. Anyways, I'm gonna try disabling HT.
>
> It's finickier but after changing the duty cycle a bit, it reproduces w/ HT off. I think the trick is setting the number of threads to the number of logical CPUs and tune -s/-c so that p99 starts climbing up. The following is from the root cgroup.

Since it's only measuring wakeup latency, schbench is best at exposing problems when the machine is just barely below saturated. At saturation, everyone has to wait for the CPUs, and if we're relatively idle there's always a CPU to be found.

There's schbench -a to try and find this magic tipping point, but I haven't found a great way to automate for every kind of machine yet (sorry).

-chris
[GIT PULL] Btrfs
Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise than we really want. The larger fix for the underflow got delayed a bit and it's better for now to put it under CONFIG_BTRFS_DEBUG.

David Sterba (1) commits (+7/-4):
    btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

 fs/btrfs/qgroup.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)
Re: [PATCH] btrfs: always write superblocks synchronously
On 05/03/2017 04:36 AM, Jan Kara wrote:
> On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:
>> Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since the REQ_FUA and REQ_FLUSH flags are stripped from submitted IO when the disk doesn't have a volatile write cache, this effectively makes the write async. This was seen to cause performance regressions of up to 90% in disk IO related benchmarks such as reaim and dbench[1].
>>
>> Fix the problem by making sure the first superblock write is also treated as synchronous since they can block progress of the journalling (commit, log syncs) machinery and thus the whole filesystem.
>>
>> Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
>> Cc: stable
>> Cc: Jan Kara
>> Signed-off-by: Davidlohr Bueso
>
> I wasn't patient enough and already sent the fix as part of my series fixing other filesystems [1]. It also fixes one more place in btrfs that needs REQ_SYNC to return to the original behavior.

Thanks guys.

-chris
[GIT PULL] Btrfs
bdev_get_queue (+3/-4) btrfs: check if the device is flush capable (+4/-0) btrfs: delete unused member nobarriers (+0/-4) Edmund Nadolski (2) commits (+25/-20): btrfs: provide enumeration for __merge_refs mode argument (+13/-10) btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10) Goldwyn Rodrigues (2) commits (+24/-3): btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1) btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2) Chris Mason (1) commits (+2/-2): btrfs: fix the gfp_mask for the reada_zones radix tree Adam Borowski (1) commits (+9/-3): btrfs: fix a bogus warning when converting only data or metadata Deepa Dinamani (1) commits (+2/-1): btrfs: Use ktime_get_real_ts for root ctime Dan Carpenter (1) commits (+15/-26): Btrfs: handle only applicable errors returned by btrfs_get_extent Dmitry V. Levin (1) commits (+2/-0): MAINTAINERS: add btrfs file entries for include directories Hans van Kranenburg (1) commits (+5/-5): Btrfs: consistent usage of types in balance_args Total: (71) commits MAINTAINERS | 2 + fs/btrfs/backref.c | 41 ++- fs/btrfs/btrfs_inode.h | 7 + fs/btrfs/compression.c | 18 +- fs/btrfs/ctree.c | 20 +- fs/btrfs/ctree.h | 34 +- fs/btrfs/delayed-inode.c | 46 +-- fs/btrfs/delayed-inode.h | 6 +- fs/btrfs/delayed-ref.c | 8 +- fs/btrfs/delayed-ref.h | 8 +- fs/btrfs/dev-replace.c | 9 +- fs/btrfs/disk-io.c | 13 +- fs/btrfs/disk-io.h | 4 +- fs/btrfs/extent-tree.c | 35 +- fs/btrfs/extent_io.c | 59 +-- fs/btrfs/extent_io.h | 8 +- fs/btrfs/extent_map.c| 10 +- fs/btrfs/extent_map.h| 3 +- fs/btrfs/file.c | 82 - fs/btrfs/free-space-cache.c | 2 +- fs/btrfs/inode.c | 289 +++ fs/btrfs/ioctl.c | 33 +- fs/btrfs/ordered-data.c | 20 +- fs/btrfs/ordered-data.h | 2 +- fs/btrfs/qgroup.c| 102 ++ fs/btrfs/qgroup.h| 51 ++- fs/btrfs/raid56.c| 38 +- fs/btrfs/reada.c | 37 +- fs/btrfs/root-tree.c | 3 +- fs/btrfs/scrub.c | 331 +++-- fs/btrfs/send.c | 23 +- fs/btrfs/super.c | 3 +- fs/btrfs/tests/btrfs-tests.c | 1 - fs/btrfs/transaction.c | 48 ++- 
fs/btrfs/transaction.h | 6 +- fs/btrfs/tree-log.c | 2 +- fs/btrfs/volumes.c | 854 +++ fs/btrfs/volumes.h | 8 +- include/trace/events/btrfs.h | 187 +- include/uapi/linux/btrfs.h | 10 +- 40 files changed, 1629 insertions(+), 834 deletions(-)
Re: [GIT PULL] Btrfs
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
>
> My for-linus-4.12 branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> for-linus-4.12

I hit send too soon, sorry. There's a trivial conflict with our WARN_ON fix that went into 4.11. I pushed the resolution to for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
  		qgroup->excl += sign * num_bytes;
  		qgroup->excl_cmpr += sign * num_bytes;
  		if (sign > 0) {
+ 			trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
 -			if (WARN_ON(qgroup->reserved < num_bytes))
 +			if (qgroup->reserved < num_bytes)
  				report_reserved_underflow(fs_info, qgroup, num_bytes);
  			else
  				qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
  		WARN_ON(sign < 0 && qgroup->excl < num_bytes);
  		qgroup->excl += sign * num_bytes;
  		if (sign > 0) {
+ 			trace_qgroup_update_reserve(fs_info, qgroup,
+ 						    -(s64)num_bytes);
 -			if (WARN_ON(qgroup->reserved < num_bytes))
 +			if (qgroup->reserved < num_bytes)
  				report_reserved_underflow(fs_info, qgroup,
  							  num_bytes);
  			else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  		qg = unode_aux_to_qgroup(unode);
+ 		trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
 -		if (WARN_ON(qg->reserved < num_bytes))
 +		if (qg->reserved < num_bytes)
  			report_reserved_underflow(fs_info, qg, num_bytes);
  		else
  			qg->reserved -= num_bytes;
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 05/17/2017 06:53 AM, Peter Zijlstra wrote:
> On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:
>> sched/fair, cpumask: Export for_each_cpu_wrap()
>
>> -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
>> -{
>> -	next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
>> -}
>
> OK, so this patch fixed an actual bug in the for_each_cpu_wrap() implementation. The above 'n+1' should be 'n', and the effect is that it'll skip over CPUs, potentially resulting in an iteration that only sees every other CPU (for a fully contiguous mask).
>
> This in turn causes hackbench to further suffer from the regression introduced by commit:
>
>   4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")
>
> So it's well past time to fix this.
>
> Where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan.
>
> Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing.
>
> It also appears to recover the tbench high-end, which also suffered like hackbench.
>
> I'm also hoping it will fix/preserve kitsunyan's interactivity issue.
>
> Please test..

We'll get some tests going here too.

-chris
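The "every other CPU" effect Peter describes can be reproduced with a toy model. This is not the kernel implementation: the quoted diff is elided, so the loop structure below is an assumption made purely for illustration, and `walk_wrapped`, `correct` and `off_by_one` are hypothetical names. The assumed setup is that the cursor handed to the search already points at the next candidate, so searching from `cursor + 1` steps over it.

```python
def find_next_bit(mask, size, offset):
    """Index of the first set bit at or after `offset`, or `size` if none."""
    for i in range(offset, size):
        if mask[i]:
            return i
    return size


def walk_wrapped(mask, start, search_from):
    """Visit set bits beginning at `start`, wrapping around once.

    Toy model only. `search_from(cursor)` picks the offset passed to
    find_next_bit(); `cursor` is the next candidate index.
    """
    size = len(mask)
    visited = []
    cursor = start
    wrapped = False
    while True:
        nxt = find_next_bit(mask, size, search_from(cursor))
        if nxt >= size:
            if wrapped:
                break
            wrapped, cursor = True, 0   # wrap to the beginning exactly once
            continue
        if wrapped and nxt >= start:
            break                       # back where we started: done
        visited.append(nxt)
        cursor = nxt + 1
    return visited


def correct(cursor):
    return cursor


def off_by_one(cursor):
    return cursor + 1


if __name__ == "__main__":
    mask = [True] * 8                   # fully contiguous mask of 8 CPUs
    print(walk_wrapped(mask, 2, correct))
    print(walk_wrapped(mask, 2, off_by_one))
```

With a contiguous 8-CPU mask and a start of 2, the first call walks all eight CPUs in wrapped order while the second only sees half of them, which matches the symptom in the mail even if the real code reaches it differently.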
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.12

We've been hitting an early enospc problem on production machines that Omar tracked down to an old int->u64 mistake. I waited a bit on this pull to make sure it was really the problem from production, but it's on ~2100 hosts now and I think we're good.

Omar also noticed a commit in the queue would make new early ENOSPC problems. I pulled that out for now, which is why the top three commits are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
    btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
    btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
    Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
    btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
    Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
    btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
    btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
    btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
    btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h       |   4 +-
 fs/btrfs/dir-item.c    |   2 +-
 fs/btrfs/disk-io.c     |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c       |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 06/06/2017 05:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
>> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
>>> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
>>>> Please test..
>>>
>>> Results are still coming in but things do look better with your patch applied. It does look like there's a regression when running hackbench in process mode and when the CPUs are not fully utilised, e.g. check this out:
>>
>> This turned out to be a false positive; your patch improves things as far as I can see.
>
> Hooray, I'll move it to a part of the queue intended for merging.

It's a little late, but Roman Gushchin helped get some runs of this with our production workload. The patch is ever so slightly better.

Thanks!

-chris
Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. The board has ten members, one of whom sits on the LF board of directors.

The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague.

Anyone is eligible to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Reminder: Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

Quick update on the TAB elections, we have 5 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso

The elections are next week, please feel free to contact me if you have any questions about the TAB.

-----

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. The board has ten members, one of whom sits on the LF board of directors.

The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague.

Anyone is eligible to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
[GIT PULL] zstd support (lib, btrfs, squashfs)
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd commit:

    I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:

        sudo modprobe zstd_compress_test
        sudo mknod zstd_compress_test c 245 0
        sudo cp silesia.tar zstd_compress_test

    The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.

    | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
    |----------|-----------|----------|-------|---------|----------|----------|
    | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
    | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
    | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
    | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
    | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
    | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
    | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
    | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
    | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
    | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
    | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

    I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

    | Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
    |----------|----------|---------|---------------|-------------|
    | none     |    0.025 | 8479.54 |             - |           - |
    | zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
    | zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
    | zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
    | zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
    | zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
    | zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
    | zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
    | zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
    | zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
    | zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (4) commits (+14578/-12):
    crypto: Add zstd support (+356/-0)
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (5) commits (+14756/-12)
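The Ratio, MB/s and Adj MB/s columns in the compression table can be re-derived from the Size and Time columns. The sketch below is only an illustration: the function names are made up here, and it assumes the byte count is the 211,988,480 B size of `silesia.tar` with 1 MB = 10^6 B, which is what the tabulated figures appear to be consistent with (for example the zstd -1 row).

```c
#include <stdio.h>

/* Size of silesia.tar in bytes, as stated in the benchmark description. */
#define INPUT_BYTES 211988480.0

/* Compression ratio: uncompressed size divided by compressed size. */
double ratio(double compressed_bytes)
{
    return INPUT_BYTES / compressed_bytes;
}

/* Raw throughput in MB/s; the time includes the copy from userland. */
double mb_per_s(double seconds)
{
    return INPUT_BYTES / seconds / 1e6;
}

/* Adjusted throughput: subtract the time of the "none" run first. */
double adj_mb_per_s(double seconds, double none_seconds)
{
    return INPUT_BYTES / (seconds - none_seconds) / 1e6;
}

/* Print one table row from a method's compressed size and wall time. */
void print_row(const char *method, double size, double secs, double none_secs)
{
    printf("%-8s ratio %.3f  %.2f MB/s  %.2f adj MB/s\n",
           method, ratio(size), mb_per_s(secs),
           adj_mb_per_s(secs, none_secs));
}
```

Calling `print_row("zstd -1", 73645762.0, 1.044, 0.100)` recomputes the zstd -1 row from its size and time.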
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On 09/08/2017 03:33 PM, Chris Mason wrote:
> Hi Linus,
>
> Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.
>
> I have it in my zstd branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd
>
> There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal.
>
> zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Just to clarify, we've been testing the kernel side of this here at FB, but our zstd use in prod is limited to the application side.

-chris
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:
> On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:
>>  crypto/Kconfig   |   9 +
>>  crypto/Makefile  |   1 +
>>  crypto/testmgr.c |  10 +
>>  crypto/testmgr.h |  71 +
>>  crypto/zstd.c    | 265
>
> Is there anyone going to use zstd through the crypto API? If not
> then I don't see the point in adding it at this point. Especially
> as the compression API is still in a state of flux.

That part was requested by Intel, but I'm happy to leave it out for another time. The rest of the patch series doesn't depend on it at all.

-chris
[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request.

Herbert had asked about the crypto patch when we discussed the pull, but I didn't realize he really meant not-right-now. I've rebased it out of this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week. Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd commit:

    I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:

        sudo modprobe zstd_compress_test
        sudo mknod zstd_compress_test c 245 0
        sudo cp silesia.tar zstd_compress_test

    The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.

    | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
    |----------|-----------|----------|-------|---------|----------|----------|
    | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
    | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
    | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
    | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
    | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
    | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
    | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
    | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
    | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
    | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
    | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

    I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

    | Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
    |----------|----------|---------|---------------|-------------|
    | none     |    0.025 | 8479.54 |             - |           - |
    | zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
    | zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
    | zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
    | zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
    | zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
    | zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
    | zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
    | zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
    | zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
    | zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (3) commits (+14222/-12):
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig       |   2 +
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/compression.c |   1 +
 fs/btrfs/compression.h |   6 +-
 fs/btrfs/ctree.h       |   1 +
 fs/btrfs/disk-io.c     |   2 +
 fs/btrfs/ioctl.c       |   6 +-
 fs/btrfs/props.c       |   6 +
 fs/btrfs/super.c       |  12 +-
 fs/btrfs/sysfs.c       |   2 +
 fs/btrfs/zstd.c        | 432 ++
 fs/squashfs/Kconfig    |  14 +
Reminder v2: Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

Quick update on the TAB elections: we have 6 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso
Tim Bird

The elections are coming soon, so please feel free to contact me if you have any questions about the TAB.

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand.

The board has ten members, one of whom sits on the LF board of directors.

The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague.

Anyone is eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here:

https://goo.gl/ADVFtT

Nominations will be accepted up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/30/2017 12:23 PM, David Sterba wrote:
> On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:
>> On 11/29/2017 12:05 PM, Tejun Heo wrote:
>>> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>>>> What has happened with this patch set?
>>>>
>>>> No idea. cc'ing Chris directly. Chris, if the patchset looks good,
>>>> can you please route them through the btrfs tree?
>>>
>>> lol looking at the patchset again, I'm not sure that's obviously the
>>> right tree. It can either be cgroup, block or btrfs. If no one
>>> objects, I'll just route them through cgroup.
>>
>> We'll have to coordinate a bit during the next merge window but I don't
>> have a problem with these going in through cgroup. Dave does this sound
>> good to you?
>
> There are only minor changes to btrfs code so cgroup tree would be
> better.
>
>> I'd like to include my patch to do all crcs inline (instead of handing
>> off to helper threads) when io controls are in place. By the merge
>> window we should have some good data on how much it's all helping.
>
> Are there any problems in sight if the inline crc and cgroup changes go
> separately? I assume there's a runtime dependency, not a code
> dependency, so it could be sorted by the right merge order.

The feature is just more useful with the inline crcs. Without them we end up with kworkers doing both high and low prio submissions and it all boils down to the speed of the lowest priority.

-chris
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/29/2017 12:05 PM, Tejun Heo wrote:
> On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
>>> What has happened with this patch set?
>>
>> No idea. cc'ing Chris directly. Chris, if the patchset looks good,
>> can you please route them through the btrfs tree?
>
> lol looking at the patchset again, I'm not sure that's obviously the
> right tree. It can either be cgroup, block or btrfs. If no one
> objects, I'll just route them through cgroup.

We'll have to coordinate a bit during the next merge window but I don't have a problem with these going in through cgroup. Dave does this sound good to you?

I'd like to include my patch to do all crcs inline (instead of handing off to helper threads) when io controls are in place. By the merge window we should have some good data on how much it's all helping.

-chris
Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

> On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote:
>> As the first step in development of bpfilter project [1] the
>> request_module() code is extended to allow user mode helpers to be
>> invoked. Idea is that user mode helpers are built as part of the
>> kernel build and installed as traditional kernel modules with .ko
>> file extension into distro specified location, such that from a
>> distribution point of view, they are no different than regular
>> kernel modules. Thus, allow request_module() logic to load such user
>> mode helper (umh) modules via:
>
> [,,]
>
> I like this, but I have one request: can we make sure that this action
> is visible in the system messages?
>
> When we load a regular module, at least it shows in lsmod afterwards,
> although I have a few times wanted to really see module load as an
> event in the logs too.
>
> When we load a module that just executes a user program, and there is
> no sign of it in the module list, I think we *really* need to make
> that event show to the admin some way.
>
> .. and yes, maybe we'll need to rate-limit the messages, and maybe it
> turns out that I'm entirely wrong and people will hate the messages
> after they get used to the concept of these pseudo-modules, but
> particularly for the early implementation when this is a new thing, I
> really want a message like
>
>     executed user process xyz-abc as a pseudo-module
>
> or something in dmesg.
>
> I do *not* want this to be a magical way to hide things.

Especially early on, this makes a lot of sense. But I wanted to plug bps and the hopefully growing set of bpf introspection tools:

https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going on.

-chris
Re: [PATCH 2/2] code-of-conduct: Strip the enforcement paragraph pending community discussion
On 6 Oct 2018, at 17:37, James Bottomley wrote:

> Significant concern has been expressed about the responsibilities
> outlined in the enforcement clause of the new code of conduct. Since
> there is concern that this becomes binding on the release of the 4.19
> kernel, strip the enforcement clauses to give the community time to
> consider and debate how this should be handled.

Even in the places where I don't agree with the discussion about what our code of conduct should be, I love that we're having it. Removing the enforcement clause basically goes back to the way things were. We'd be recognizing that we know issues happen, and explicitly stating that when serious events do happen, the community as a whole isn't committing to helping.

It's true there are a lot of questions about how the community resolves problems and holds each other accountable for maintaining any code of conduct. I think the enforcement section leaves us the room we need to continue discussions and still make it clear that we're making an effort to shift away from the harsh discussions in the past.

-chris
Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Linux Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. We're also working with kernel maintainers to help refine the new code of conduct, and serving as the initial point of contact for code of conduct issues.

The board has ten members, one of whom sits on the Linux Foundation board of directors.

The election to select five TAB members will be held at the 2018 Kernel Summit in Vancouver, Canada. The elections will take place at the conference center on Tuesday November 13th, at 5:30pm. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Vancouver.

Anyone is eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Nominations will be accepted up until the beginning of the event where the election is held.

In past years, everyone running for the TAB has given a short speech before the voting began. We've received feedback that the speeches add logistical complexity for the election, and may not be the best indicator of how well qualified someone is for the TAB.

Instead of speeches, this year we're asking candidates to include statements about why they would like to participate in the TAB. These will be combined into a slideshow running during the election, and available via a public google doc at this location:

https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins, any statements must be received by Monday November 12th at 5PM Pacific, so that we have time to set up the slideshow.

Current TAB members, and their election year:

Chris Mason         2016
H. Peter Anvin      2016
Olof Johansson      2016
Rik van Riel        2016
Dan Williams        2016
Jon Corbet          2017
Greg Kroah-Hartman  2017
Steven Rostedt      2017
Ted Tso             2017
Tim Bird            2017

The five slots from 2016 are all up for election. As always, please let us know if you have questions, and please do consider running.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone,

Friendly reminder that the TAB elections are coming soon.

The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Linux Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. We're also working with kernel maintainers to help refine the new code of conduct, and serving as the initial point of contact for code of conduct issues.

The board has ten members, one of whom sits on the Linux Foundation board of directors.

The election to select five TAB members will be held at the 2018 Kernel Summit in Vancouver, Canada. The elections will take place at the conference center on Tuesday November 13th, at 5:30pm. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Vancouver.

Anyone is eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Nominations will be accepted up until the beginning of the event where the election is held.

In past years, everyone running for the TAB has given a short speech before the voting began. We've received feedback that the speeches add logistical complexity for the election, and may not be the best indicator of how well qualified someone is for the TAB.

Instead of speeches, this year we're asking candidates to include statements about why they would like to participate in the TAB. These will be combined into a slideshow running during the election, and available via a public google doc at this location:

https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins, any statements must be received by Monday November 12th at 5PM Pacific, so that we have time to set up the slideshow.

Current TAB members, and their election year:

Chris Mason         2016
H. Peter Anvin      2016
Olof Johansson      2016
Rik van Riel        2016
Dan Williams        2016
Jon Corbet          2017
Greg Kroah-Hartman  2017
Steven Rostedt      2017
Ted Tso             2017
Tim Bird            2017

The five slots from 2016 are all up for election. As always, please let us know if you have questions, and please do consider running.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Re: [PATCH] writepage method changes
On Wednesday, May 09, 2001 10:51:17 PM -0300 Marcelo Tosatti <[EMAIL PROTECTED]> wrote:

>
>
> On Wed, 9 May 2001, Marcelo Tosatti wrote:
>
>> Locked for the "not wrote out case" (I will fix my patch now, thanks)
>
> I just found out that there are filesystems (eg reiserfs) which write out
> data even if an error occurred, which means the unlocking must be done by
> the filesystems, always.

I'm not horribly attached to the way reiserfs is doing it right now. If reiserfs writepage manages to map any blocks, it writes them to disk, even if mapping other blocks in the page failed.

These are only data blocks, so there are no special consistency rules. If we need to change this, it is not a big deal.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
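The writepage behavior described above (write whatever blocks could be mapped, and still report the failure) can be modeled in userspace roughly as follows. This is a sketch under stated assumptions and not the reiserfs code: `struct page_model`, `map_block()` and `write_mapped_blocks()` are hypothetical names invented for the illustration, and "writing" a block is just setting a flag.

```c
#include <stddef.h>

#define BLOCKS_PER_PAGE 4

/* Hypothetical stand-in for a page made of several filesystem blocks. */
struct page_model {
    int mappable[BLOCKS_PER_PAGE]; /* 1 if a disk block can be mapped */
    int written[BLOCKS_PER_PAGE];  /* set to 1 once the block is submitted */
};

/* Hypothetical mapper: 0 on success, -1 if the block cannot be mapped. */
int map_block(const struct page_model *p, size_t i)
{
    return p->mappable[i] ? 0 : -1;
}

/*
 * Submit every block that maps, even if mapping another block in the
 * same page fails.  Returns 0, or the first mapping error seen.
 */
int write_mapped_blocks(struct page_model *p)
{
    int err = 0;

    for (size_t i = 0; i < BLOCKS_PER_PAGE; i++) {
        int ret = map_block(p, i);

        if (ret < 0) {
            if (!err)
                err = ret;  /* remember the error, keep going */
            continue;
        }
        p->written[i] = 1;  /* this block still goes to disk */
    }
    return err;
}
```

Because blocks may have been submitted even when an error comes back, a caller of such a function cannot treat an error return as "nothing was written", which is the point Marcelo raises about who has to unlock the page.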
Re: [reiserfs-dev] Re: reiserfs, xfs, ext2, ext3
On Friday, May 11, 2001 04:00:20 AM -0700 Hans Reiser <[EMAIL PROTECTED]> wrote:

> Alan Cox wrote:
>
>> > Are you referring to Neil Brown's nfs operations patch as being as
>> > ugly as hell, or something else? Just want to understand what you are
>> > saying before arguing.
>>
>> Andi has sent me some stuff to look at. He listed four implementations
>> and I've only seen two of them
>
> did you see an implementation which adds operations to VFS and is written
> by Neil Brown (with reiserfs portions by Chris and Nikita)?

I coded up a mixture of Andi's 2.2.x apis and Neil's 2.4.x stuff and sent it out for review a little while ago. It isn't as good as Neil's stuff, but it doesn't require changing the other filesystems.

If it looked good to the NFS guys and the other FS guys don't hate it, I'll push it around for testing/inclusion. This would be my preferred solution right now, since it could also work for AFS.

-chris
Re: Reiserfs, Mongo and CPU question
On Tuesday, May 15, 2001 01:41:01 PM +0200 Ricardo Galli <[EMAIL PROTECTED]> wrote:

> Hans and reiserfs developers,
> the same student of my university
> (http://www.cs.helsinki.fi/linux/linux-kernel/2001-18/0654.html) was
> carrying up the mongo benchmarks against reiser, xfs, jfs and ext2 for
> different base sizes.
>
>
> For example, for the base size of 10.000 (the average of a clean
> distribution is about 16.000 bytes) ReiserFS is even slower than ext2.
> I've realised the bottleneck may be the CPU, a Cyrix MII 233MHz.
>

Would not surprise me, there's lots of room for improvement in reiserfs CPU usage. The 10k size is one of the worst cases for tail performance; those numbers should improve if you mount with -o notail.

Here's a simple patch that should help on balance-intensive apps (like creates/deletes). Please let me know if you see any difference with it.

-chris

diff -ur diff/linux/fs/reiserfs/fix_node.c linux/fs/reiserfs/fix_node.c
--- diff/linux/fs/reiserfs/fix_node.c	Mon Jan 15 18:31:19 2001
+++ linux/fs/reiserfs/fix_node.c	Fri Feb  2 15:40:54 2001
@@ -936,6 +936,7 @@
 	if (p_s_tb->FEB[p_s_tb->cur_blknum])
 		BUG();
 
+	mark_buffer_journal_new(p_s_new_bh) ;
 	p_s_tb->FEB[p_s_tb->cur_blknum++] = p_s_new_bh;
 }
 
Re: Re[2]: ReiserFS 2.4.4/3.x.0k-pre2
On Tuesday, May 15, 2001 02:24:36 PM +0400 Samium Gromoff <[EMAIL PROTECTED]> wrote:

> Hello,
> I`m still experiencing file tail corruptions
> on subj.
> And more: after i had restored bblocked patrition
> (by relying on drive`s ability to remap bblks on
> write by wroting small modification of debugreiserfs
> which zeroified all bblks), i had _runtime_ tail
> corruptions of the mc`s dir hotlist which i tried
> to rewrite again and again.
> i found, that "sync"ing after modifying helps to keep
> file fine, so it does until now.

Hmmm, are you sure the disk is good now? What kinds of things are you doing on the files where you see tail corruptions? Can you reliably reproduce the corruption?

-chris
Re: Getting FS access events
On Tuesday, May 15, 2001 04:33:57 AM -0400 Alexander Viro <[EMAIL PROTECTED]> wrote:

>
>
> On Tue, 15 May 2001, Linus Torvalds wrote:
>
>> Looks like there are 19 filesystems that use the buffer cache right now:
>>
>> grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc
>>
>> So quite a bit of work involved.
>
> Reiserfs... Dunno. They've got a private (slightly mutated) copy of
> ~60% of fs/buffer.c.

But, putting the log and the metadata in the page cache makes memory pressure and such cleaner, so this is one of my goals for 2.5. reiserfs will still have alias issues due to the packed tails (one copy in the btree, another in the page), but it will be no worse than it is now.

-chris
Re: ReiserFs: Cosmetic problem in linux/Documentation/Changes[2.4.x]
On Friday, May 18, 2001 01:26:01 PM +0200 "Martin.Knoblauch" <[EMAIL PROTECTED]> wrote:

> "Martin.Knoblauch" wrote:
>>
>> Hi,
>>
>> I submitted this a short while ago, only to realize later that the
>> subject line was not very informative. Sorry.
>>
>> As a suggestion: maybe the reiser-tools should support the common
>> -V/--version flag
>>

Newer versions (at least 3.x.0j) have a -V.

-chris
[PATCH] improve reiserfs 2.4.x O_SYNC and fsync speed
Hi guys, This patch has been lightly tested, I'd appreciate it if some of you could try it out on data you don't care about. The idea is to improve fsync and O_SYNC performance by only doing a commit on the last transaction the file was actually involved in. The old code always forced a commit of the current transaction, which is just about the slowest possible choice (but easy to verify as correct ;-) (2.2.x reiserfs already has similar optimizations) The words I want to stress here are data_you_don't_care_about. I'm looking for benchmarks and impressions while I test here to make sure the logging rules are not being broken. -chris diff -Nru a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c --- a/fs/reiserfs/dir.c Mon Apr 30 12:45:15 2001 +++ b/fs/reiserfs/dir.c Mon Apr 30 12:45:15 2001 @@ -47,22 +47,10 @@ }; int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry, int datasync) { - int ret = 0 ; - int windex ; - struct reiserfs_transaction_handle th ; - lock_kernel(); - - journal_begin(, dentry->d_inode->i_sb, 1) ; - windex = push_journal_writer("dir_fsync") ; - reiserfs_prepare_for_journal(th.t_super, SB_BUFFER_WITH_SB(th.t_super), 1) ; - journal_mark_dirty(, dentry->d_inode->i_sb, SB_BUFFER_WITH_SB (dentry->d_inode->i_sb)) ; - pop_journal_writer(windex) ; - journal_end_sync(, dentry->d_inode->i_sb, 1) ; - - unlock_kernel(); - - return ret ; + reiserfs_commit_for_inode(dentry->d_inode) ; + unlock_kernel() ; + return 0 ; } diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c --- a/fs/reiserfs/file.cMon Apr 30 12:45:15 2001 +++ b/fs/reiserfs/file.cMon Apr 30 12:45:15 2001 @@ -50,6 +50,7 @@ lock_kernel() ; down (>i_sem); journal_begin(, inode->i_sb, JOURNAL_PER_BALANCE_CNT * 3) ; +reiserfs_update_inode_transaction(inode) ; #ifdef REISERFS_PREALLOCATE reiserfs_discard_prealloc (, inode); @@ -83,10 +84,7 @@ int datasync ) { struct inode * p_s_inode = p_s_dentry->d_inode; - struct reiserfs_transaction_handle th ; int n_err; - int windex ; - int jbegin_count = 1 ; 
lock_kernel() ; @@ -95,14 +93,12 @@ n_err = fsync_inode_buffers(p_s_inode) ; n_err |= fsync_inode_data_buffers(p_s_inode); + /* commit the current transaction to flush any metadata ** changes. sys_fsync takes care of flushing the dirty pages for us */ - journal_begin(, p_s_inode->i_sb, jbegin_count) ; - windex = push_journal_writer("sync_file") ; - reiserfs_update_sd(, p_s_inode); - pop_journal_writer(windex) ; - journal_end_sync(, p_s_inode->i_sb,jbegin_count) ; + reiserfs_commit_for_inode(p_s_inode) ; + unlock_kernel() ; return ( n_err < 0 ) ? -EIO : 0; } diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c --- a/fs/reiserfs/inode.c Mon Apr 30 12:45:15 2001 +++ b/fs/reiserfs/inode.c Mon Apr 30 12:45:15 2001 @@ -40,6 +40,7 @@ down (>i_sem); journal_begin(, inode->i_sb, jbegin_count) ; + reiserfs_update_inode_transaction(inode) ; windex = push_journal_writer("delete_inode") ; reiserfs_delete_object (, inode); @@ -281,6 +282,7 @@ reiserfs_update_sd(th, inode) ; journal_end(th, s, len) ; journal_begin(th, s, len) ; + reiserfs_update_inode_transaction(inode) ; } // it is called by get_block when create == 0. 
Returns block number @@ -604,6 +606,7 @@ TYPE_ANY, 3/*key length*/); if ((new_offset + inode->i_sb->s_blocksize) >= inode->i_size) { journal_begin(, inode->i_sb, jbegin_count) ; + reiserfs_update_inode_transaction(inode) ; transaction_started = 1 ; } research: @@ -628,6 +631,7 @@ if (!transaction_started) { pathrelse() ; journal_begin(, inode->i_sb, jbegin_count) ; + reiserfs_update_inode_transaction(inode) ; transaction_started = 1 ; goto research ; } @@ -704,6 +708,7 @@ */ pathrelse() ; journal_begin(, inode->i_sb, jbegin_count) ; + reiserfs_update_inode_transaction(inode) ; transaction_started = 1 ; goto research; } @@ -1296,6 +1301,10 @@ return ; } lock_kernel() ; + +/* this is really only used for atime updates, so they don't have +** to be included in O_SYNC or fsync +*/ journal_begin(, inode->i_sb, 1) ; reiserfs_update_sd (, inode); journal_end(, inode->i_sb, 1) ; @@ -1660,6 +1669,7 @@ */ prevent_flush_page_lock(page, p_s_inode) ; journal_begin(, p_s_inode->i_sb, JOURNAL_PER_BALANCE_CNT * 2 ) ; +reiserfs_update_inode_transaction(p_s_inode) ; windex = push_journal_writer("reiserfs_vfs_truncate_file") ; reiserfs_do_truncate (, p_s_inode, page, update_timestamps) ; pop_journal_writer(windex) ; @@ -1708,6 +1718,7 @@ lock_kernel() ; prevent_flush_page_lock(bh_result->b_page, inode) ; journal_begin(, inode->i_sb,
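The idea behind the fsync patch above can be sketched in a few lines of C. This is only a toy model under stated assumptions, not the reiserfs code: every toy_* name and field is made up for illustration, and the real journal tracks far more state. Each inode remembers the last transaction it joined, and fsync forces a commit only if that transaction has not reached disk yet.

```c
#include <assert.h>

/* Hypothetical, simplified journal state (not the reiserfs structures). */
struct toy_journal {
    unsigned long current_trans_id;   /* transaction still accepting writers */
    unsigned long last_committed_id;  /* newest transaction safely on disk */
    unsigned long commits_done;       /* number of commits forced so far */
};

struct toy_inode {
    unsigned long i_trans_id;         /* last transaction that touched us, 0 = none */
};

/* Called right after joining a transaction, in the spirit of the
 * reiserfs_update_inode_transaction() calls the patch adds. */
static void toy_update_inode_transaction(struct toy_journal *j,
                                         struct toy_inode *inode)
{
    inode->i_trans_id = j->current_trans_id;
}

/* Commit the running transaction and open a new one. */
static void toy_commit_current(struct toy_journal *j)
{
    j->last_committed_id = j->current_trans_id;
    j->current_trans_id++;
    j->commits_done++;
}

/* fsync path: force a commit only when the inode's last transaction is
 * not on disk yet. Returns 1 if a commit was forced, 0 otherwise. */
static int toy_commit_for_inode(struct toy_journal *j, struct toy_inode *inode)
{
    if (inode->i_trans_id == 0 || inode->i_trans_id <= j->last_committed_id)
        return 0;
    /* in this toy model an uncommitted id can only be the running one */
    assert(inode->i_trans_id == j->current_trans_id);
    toy_commit_current(j);
    return 1;
}
```

With a journal initialised as {1, 0, 0}, an fsync on an inode that never joined a transaction costs nothing, and a second fsync on the same unchanged inode is also free. The old behaviour corresponds to calling toy_commit_current() unconditionally.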
Re: Dying disk and filesystem choice.
On Friday, May 25, 2001 09:21:42 AM -0700 Hans Reiser <[EMAIL PROTECTED]> wrote:

> No, our policy is strictly in sync with and reflective of that of the
> rest of the linux-kernel. Since the ac series has a different policy, we
> can be different in regards to the ac series.

Not really, our policy has been much more restrictive than the rest of the kernel. Look at the patches we didn't send in.

>
> And I don't begin to comprehend your not sending in the lost disk space
> after crash bug fix (I assume it is what you mean when you refer to lost
> files after a crash, because I know of no lost files after a crash bug,
> please phrase things more carefully), and it really annoys me and the
> users, frankly. Why you consider that a feature is beyond me.

The patch is a _huge_ change to the way files are deleted and truncated, to what happens during mount, and to the way transactions work. It is effectively a format extension, and must be verified against both 2.2.x kernels and 2.4.x kernels, in both disk formats.

Before I even consider introducing a change of this size, I want to be as sure as I can the rest of the code is stable. It is the only way we can debug it and stay sane.

Even after I release the code, I won't want it in an ac series for a while. It does much more harm than good if it somehow ruins compatibility with an older kernel, especially in 2.4.x.

Yes, it is a bug fix. But, it is a very different kind of bug fix than something that corrupts files at random, or something that doesn't get buffers to disk at the right time. I won't pretend the fix isn't important, but I won't allow larger changes to ruin the progress we've made so far.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Dying disk and filesystem choice.
On Thursday, May 24, 2001 11:16:58 PM +0100 Alan Cox <[EMAIL PROTECTED]> wrote:

>> IMHO we are not that deep into code freeze anymore. Freevxfs got added
>> in linux-2.4.5-pre*, so I think that a patch that adds a useful feature
>> like badblock support would be OK.
>
> FreeVxFS changes precisely nothing in the behaviour of any other fs - it's
> like adding a new driver.
>
> Updating Reiserfs requires a lot more care because it has the potential to
> harm existing stable setups

This has been mostly covered, but just in case: there are two different freezes, one in the kernel and one in reiserfs. The reiserfs part isn't something Alan or Linus have imposed on us; we just wanted to limit the reiserfs changes as much as possible during the early kernel releases.

The end result is that some larger scale issues are unfixed (memory pressure from VM, lost files after a crash), but we have been able to focus on the critical hoses-my-files/crashes-my-box kinds of bugs.

-chris
Re: 2.4.5 Oops at boot
On Wednesday, May 30, 2001 03:03:32 PM -0600 "D. Stimits" <[EMAIL PROTECTED]> wrote:

[ snip ]
> RAMDISK: Compressed image found at block 0
> Freeing initrd memory: 249k freed
> VFS: Mounted root (ext2 filesystem).
> Red Hat nash version 3.0.10 starting
> VFS: Mounted root (ext2 filesystem) readonly.
> change_root: old root has d_count=2
> Trying to unmount old root ... <1>Unable to handle kernel NULL pointer
> dereference at virtual address 0010
> printing eip:

Can't say for sure without the oops decoded through ksymoops, but this looks like the oops in rd_ioctl fixed by 2.4.5-ac3 and higher. I think the following patch (taken from ac3) will be sufficient:

-chris

--- linux.vanilla/fs/block_dev.c	Sat May 26 16:53:17 2001
+++ linux.ac/fs/block_dev.c	Mon May 28 16:10:59 2001
@@ -603,6 +602,7 @@
 	if (!bdev->bd_op->ioctl)
 		return -EINVAL;
 	inode_fake.i_rdev=rdev;
+	inode_fake.i_bdev=bdev;
 	init_waitqueue_head(&inode_fake.i_wait);
 	set_fs(KERNEL_DS);
 	res = bdev->bd_op->ioctl(&inode_fake, NULL, cmd, arg);
Re: reiserfs_read_inode2
On Thursday, May 31, 2001 02:27:26 PM +0200 Lukasz Trabinski <[EMAIL PROTECTED]> wrote:

> Hello
>
> What does this mean?
>
> portraits:~# dmesg
> vs-13042: reiserfs_read_inode2: [2299 593873 0x0 SD] not found
> vs-13048: reiserfs_iget: bad_inode. Stat data of (2299 593873) not found
> vs-13042: reiserfs_read_inode2: [2299 593807 0x0 SD] not found
> vs-13048: reiserfs_iget: bad_inode. Stat data of (2299 593807) not found
>
> 2.4.5 with lock_kernel/unlock patch, reiserfsprogs 3.x.0h, RH 7.1

In this case, it probably means you are serving NFS from that disk, which needs extra patches. Are you?

-chris
Re: NULL characters in file on ReiserFS again.
On Thursday, May 31, 2001 03:33:06 PM +0400 Andrej Borsenkow <[EMAIL PROTECTED]> wrote:

> This happened to me yesterday on kernel-2.4.4-6mdk (Mandrake cooker, based
> on 2.4.4-ac14), single reiser root filesystem, mounted with default
> options. Hardware - ASUS CUSL2 (i815e chipset), Fujitsu UDMA-4 drive.
>
> I tried to change hostname and did not have the corresponding entry in
> /etc/hosts (or anywhere). As a result, startx hung starting X server; it
> was not possible to switch to alpha console or kill X server. I pressed
> reset and after reboot looked into /var/log/XFree86*log - and there were
> a bunch of ^@ there.
>

There are two ways to get nulls in log files: reiserfs bugs, and a crash before data blocks are flushed to disk. You've probably hit the second. Reiserfs only logs metadata, so it is possible for newly allocated data blocks to have null bytes after a crash.

Patches are in progress to flush new data blocks before transaction commit. I'm about to send out the first building block for this...

-chris
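The crash case described above can be illustrated with a small toy model in C. It is a sketch under assumptions, not reiserfs code: the "disk" is a single zero-filled block, the toy_* names are invented, and a commit only records the file size unless it is told to write the data first. It shows why committing metadata before the data block leaves nulls visible after a crash, and why flushing new data blocks before the commit avoids that.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLOCK 8

/* What survives a crash: the data block and the committed file size. */
struct toy_disk {
    char data[TOY_BLOCK];
    int  size_on_disk;
};

/* What lives only in memory until it is written out. */
struct toy_file {
    char cached[TOY_BLOCK];
    int  size;
};

/* Commit the metadata (the file size). When flush_data_first is set,
 * the data block is written before the metadata, which is the ordering
 * the in-progress patches are described as aiming for. */
static void toy_commit(struct toy_disk *d, const struct toy_file *f,
                       int flush_data_first)
{
    assert(f->size >= 0 && f->size <= TOY_BLOCK);
    if (flush_data_first)
        memcpy(d->data, f->cached, TOY_BLOCK);
    d->size_on_disk = f->size;
}

/* After a crash only the disk state is left: count the null bytes a
 * reader would see inside the committed file size. */
static int toy_nulls_after_crash(const struct toy_disk *d)
{
    int i, nulls = 0;

    for (i = 0; i < d->size_on_disk; i++)
        if (d->data[i] == '\0')
            nulls++;
    return nulls;
}
```

In this model a seven-byte log line committed without flushing shows up as seven nulls after the crash; with the flush ordered first, none are visible.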
Re: [RFC] yet another knfsd-reiserfs patch
> On Monday, April 23, 2001 10:45:14 AM -0400 Chris Mason <[EMAIL PROTECTED]> wrote: > >> >> Hi guys, >> >> This patch is not meant to replace Neil Brown's knfsd ops stuff, the >> goal was to whip up something that had a chance of getting into 2.4.x, >> and that might be usable by the AFS guys too. Neil's patch tries to >> address a bunch of things that I didn't, and looks better for the >> long run. >> > Updated to 2.4.5, with the nfs list cc'd this time in hopes of comments or flames... -chris diff -Nru a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c --- a/fs/nfsd/nfsfh.c Fri Jun 1 16:08:41 2001 +++ b/fs/nfsd/nfsfh.c Fri Jun 1 16:08:41 2001 @@ -116,40 +116,12 @@ return error; } -/* this should be provided by each filesystem in an nfsd_operations interface as - * iget isn't really the right interface - */ -static struct dentry *nfsd_iget(struct super_block *sb, unsigned long ino, __u32 generation) +static struct dentry *dentry_from_inode(struct inode *inode) { - - /* iget isn't really right if the inode is currently unallocated!! -* This should really all be done inside each filesystem -* -* ext2fs' read_inode has been strengthed to return a bad_inode if the inode -* had been deleted. -* -* Currently we don't know the generation for parent directory, so a generation -* of 0 means "accept any" -*/ - struct inode *inode; struct list_head *lp; struct dentry *result; - inode = iget(sb, ino); - if (is_bad_inode(inode) - || (generation && inode->i_generation != generation) - ) { - /* we didn't find the right inode.. */ - dprintk("fh_verify: Inode %lu, Bad count: %d %d or version %u %u\n", - inode->i_ino, - inode->i_nlink, atomic_read(>i_count), - inode->i_generation, - generation); - - iput(inode); - return ERR_PTR(-ESTALE); - } - /* now to find a dentry. 
-* If possible, get a well-connected one + /* +* If possible, get a well-connected dentry */ spin_lock(_lock); for (lp = inode->i_dentry.next; lp != >i_dentry ; lp=lp->next) { @@ -173,6 +145,92 @@ return result; } +static struct inode *__inode_from_fh(struct super_block *sb, int ino, +int generation) +{ + struct inode *inode ; + + inode = iget(sb, ino); + if (is_bad_inode(inode) + || (generation && inode->i_generation != generation) + ) { + /* we didn't find the right inode.. */ + dprintk("fh_verify: Inode %lu, Bad count: %d %d or version %u %u\n", + inode->i_ino, + inode->i_nlink, atomic_read(>i_count), + inode->i_generation, + generation); + + iput(inode); + return ERR_PTR(-ESTALE); + } + return inode ; +} + +static struct inode *inode_from_fh(struct super_block *sb, + __u32 *datap, + int len) +{ + if (sb->s_op->inode_from_fh) + return sb->s_op->inode_from_fh(sb, datap, len) ; + return __inode_from_fh(sb, datap[0], datap[1]) ; +} + +static struct inode *parent_from_fh(struct super_block *sb, + __u32 *datap, + int len) +{ + if (sb->s_op->parent_from_fh) + return sb->s_op->parent_from_fh(sb, datap, len) ; + + if (len >= 3) + return __inode_from_fh(sb, datap[2], 0) ; + return ERR_PTR(-ESTALE); +} + +/* + * two iget funcs, one for inode, and one for parent directory + * + * this should be provided by each filesystem in an nfsd_operations interface as + * iget isn't really the right interface + * + * If the filesystem doesn't provide funcs to get inodes from datap, + * it must be: inum, generation, dir inum. Length of 2 means the + * dir inum isn't there. + * + * iget isn't really right if the inode is currently unallocated!! + * This should really all be done inside each filesystem + * + * ext2fs' read_inode has been strengthed to return a bad_inode if the inode + * had been deleted. 
+ * + * Currently we don't know the generation for parent directory, so a generation + * of 0 means "accept any" + */ +static struct dentry *nfsd_iget(struct super_block *sb, __u32 *datap, int len) +{ + + struct inode *inode; + + inode = inode_from_fh(sb, datap, len) ; + if (IS_ERR(inode))
Re: [2.4.5 and all ac-Patches] massive file corruption with reiser or NFS
On Saturday, June 02, 2001 02:41:04 PM +0200 Andreas Hartmann <[EMAIL PROTECTED]> wrote:

> On Saturday, 2 June 2001 12:52, Rasmus Bøg Hansen wrote:
>> On Sat, 2 Jun 2001, Andreas Hartmann wrote:
>> > I got massive file corruptions with the kernels mentioned in the
>> > subject. I can reproduce it every time.
>>
>> You cannot use NFS on reiserfs unless you apply the knfsd patch. Look at
>> www.namesys.com.
>
> Thank you very much for your advice.
>
> I tested your suggestion and run the machine without NFS-mounted devices
> - it seems to be working fine.
>
> Anyway - I'm wondering why I didn't get any problem until 2.4.4ac10 with
> this configuration without the appropriate patch on the client or on the
> server?

The problem only happens when the clients do an operation on a file that has gone out of cache on the server. Under light load, this might happen very rarely. You only need the patch on the server.

-chris
Re: [NFS] Re: [RFC] yet another knfsd-reiserfs patch
On Saturday, June 02, 2001 12:19:59 AM +0200 Trond Myklebust <[EMAIL PROTECTED]> wrote:

>
> Hi Chris,
>
> Do you really need the parent inode in the filehandle?
>
> That screws rename up pretty badly, since the filehandle changes when
> you rename into a different directory. It means for instance that when
> I do
>
> open(foo)
> mv foo bar/
> write (foo)
> close(foo)
>
> then I have a pretty good chance of getting an ESTALE on the write()
> statement.
>

Hmmm, didn't realize I had only answered this in private mail. The patch doesn't change when the parent dir's ino is included in the filehandle, it just adds wrappers for storing it and getting it out.

For ext2, the parent inum is only sent for files when the subtree checks are turned on (_fh_update is unchanged if no fill_fh func is provided).

The reiserfs one always puts the parent inum into the fh, but find_fh_dentry only pulls it out for directories or subtree checks so I didn't add the extra logic to the reiserfs fill_fh func.

-chris
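For readers following the filehandle discussion, here is a minimal sketch in C of the default layout the patch comment describes: inum, generation, then an optional parent dir inum, where a length of 2 means the parent is absent and a generation of 0 means "accept any". The struct and the toy_* helpers are hypothetical names for illustration only; they are not the nfsd or reiserfs functions.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of the default filehandle payload (hypothetical type). */
struct toy_fh {
    uint32_t ino;
    uint32_t generation;
    uint32_t parent_ino;   /* only meaningful when has_parent is set */
    int      has_parent;
};

/* Decode datap[0..len-1] as: inum, generation, optional parent dir inum.
 * Returns 0 on success, -1 if the handle is too short to be valid. */
static int toy_decode_fh(const uint32_t *datap, int len, struct toy_fh *out)
{
    if (len < 2)
        return -1;
    out->ino = datap[0];
    out->generation = datap[1];
    out->has_parent = (len >= 3);
    out->parent_ino = out->has_parent ? datap[2] : 0;
    return 0;
}

/* A wanted generation of 0 means "accept any", as in the patch comment. */
static int toy_generation_matches(uint32_t wanted, uint32_t actual)
{
    return wanted == 0 || wanted == actual;
}
```

A handle that always carries the parent (the reiserfs case described above) would simply always be decoded with len >= 3; whether the parent is then used is up to the caller.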
Re: [2.4.5 and all ac-Patches] massive file corruption with reiser or NFS
On Saturday, June 02, 2001 08:13:44 PM +0200 Andreas Hartmann <[EMAIL PROTECTED]> wrote:

> On Saturday, 2 June 2001 18:42, you wrote:
>> On Saturday, June 02, 2001 02:41:04 PM +0200 Andreas Hartmann
>> <[EMAIL PROTECTED]> wrote:
>> > On Saturday, 2 June 2001 12:52, Rasmus Bøg Hansen wrote:
>> >> On Sat, 2 Jun 2001, Andreas Hartmann wrote:
>> >> > I got massive file corruptions with the kernels mentioned in the
>> >> > subject. I can reproduce it every time.
>> >>
>> >> You cannot use NFS on reiserfs unless you apply the knfsd patch.
>> >> Look at www.namesys.com.
>> >
>> > Thank you very much for your advice.
>> >
>> > I tested your suggestion and run the machine without NFS-mounted
>> > devices - it seems to be working fine.
>> >
>> > Anyway - I'm wondering why I didn't get any problem until 2.4.4ac10
>> > with this configuration without the appropriate patch on the client
>> > or on the server?
>>
>> The problem only happens when the clients do an operation on a file that
>> has gone out of cache on the server. Under light load, this might happen
>> very rarely.
>
> The load didn't change. You can forget the load, it's very small. It's my
> private server and I'm doing always the same thing via NFS - compiling
> e.g. This has been working fine until 2.4.4.ac10, afterwards it has been
> broken.

Ok, there are two different problems here. The patch you posted to l-k is a generic NFS fix for 2.4.5. ext2 would need this too.

If you are serving NFS from your reiserfs disk, you need an additional patch on the server only (this is the one I was talking about). Check out the FAQ on www.namesys.com for all the details.

-chris
Re: [OOPS] 245ac7 - ncr53c8xx && reiserfs
On Tuesday, June 05, 2001 03:00:40 PM -0400 Carlos E Gorges <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I get some problems w/ 2.4.5-ac7, ncr53c8xx w/ 2.4.4-ac18 works fine.
>
> I took a quick look at the problem ..
> the problem apparently is w/ ncr53c8xx driver ( which reports a timeout ),
> and makes reiserfs call BUG() :
>

reiserfs does this when it fails to write metadata or log buffers; continuing without a panic or a readonly mount would result in FS corruption.

A forced readonly mount is a much better solution, but I haven't had a chance yet to make sure it safely prevents writeback of all metadata and cleans things up properly.

-chris
Re: [reiserfs-list] major security bug in reiserfs (may affect SuSE Linux)
On Wednesday, January 10, 2001 12:38:34 PM -0500 Alexander Viro <[EMAIL PROTECTED]> wrote:

> On Wed, 10 Jan 2001, Chris Mason wrote:
>
>> In filldir, I don't like the line where we ((char *)dirent += reclen ;
>> If reclen is much larger than the buffer sent from userspace, I don't
>> see how we stay in bounds.
>
> So? copy_to_user() and put_user() will refuse to scramble the
> kernel memory. IOW, dirent can be out of the userspace. Hell, user could
> call getdents() and pass it a kernel pointer. Try it and you'll see what
> happens.
>

Ah thanks, that makes more sense. But, copy_to_user is only working on namelen bytes, and reclen is bigger than that. So, who is checking the value for the buf->current_dir pointer?

-chris
Re: Possible deadlock with ->writepaged version of flush_dirty_buffers() and 2.4.0
On Wednesday, January 10, 2001 05:56:09 PM -0200 Marcelo Tosatti <[EMAIL PROTECTED]> wrote:

>
> Hi Chris,
>
> It seems there is a possible deadlock condition with your patch which
> changes flush_dirty_buffers() to use ->writepage (something which we
> _definitely_ want for 2.5). Take a look:
>

Yes, good catch.

>
> mark_buffer_dirty->balance_dirty->wakeup_bdflush->flush_dirty_buffers->
> writepage->block_write_full_page->__block_write_full_page->get_block->
> ext2_get_block->ext2_alloc_branch->
>
>   ext2_alloc_block->ext2_new_block->lock_super
>   or
>   getblk()->lock_super
>
>
> I don't see any reason why this deadlock couldn't happen in practice now.
>

It won't happen until someone other than fs/buffer.c starts marking ext2 pages dirty. The normal file write path will make sure that any dirty buffers are mapped, so the ext2_get_block code is never run.

> If I'm right, it will be pretty nasty to fix this. One possible solution is
> to _never_ call mark_buffer_dirty() with the superblock lock held (ext2
> has a lot of places like this right now)
>

This is probably the best solution, since it is a good idea regardless of my patch.

-chris
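The lock ordering problem in this exchange can be shown with a toy model in C. Everything here is an assumption made for illustration: toy_sb stands in for a superblock with a non-recursive lock, toy_mark_dirty() stands in for a mark_buffer_dirty() that may end up in a flush path needing the same lock, and a would-be deadlock is only counted instead of hanging.

```c
#include <assert.h>

/* Hypothetical superblock with a non-recursive lock. */
struct toy_sb {
    int locked;     /* 1 while the superblock lock is held */
    int deadlocks;  /* times someone tried to take the lock while it was held */
    int flushed;    /* successful flushes */
};

/* Returns 1 on success. Real code would sleep forever where this
 * function counts a deadlock and returns 0. */
static int toy_lock_super(struct toy_sb *sb)
{
    if (sb->locked) {
        sb->deadlocks++;
        return 0;
    }
    sb->locked = 1;
    return 1;
}

static void toy_unlock_super(struct toy_sb *sb)
{
    sb->locked = 0;
}

/* Stand-in for the flush path that ends in get_block -> lock_super. */
static void toy_flush_dirty(struct toy_sb *sb)
{
    if (!toy_lock_super(sb))
        return;
    sb->flushed++;
    toy_unlock_super(sb);
}

/* Stand-in for marking a buffer dirty, which may trigger a flush. */
static void toy_mark_dirty(struct toy_sb *sb)
{
    toy_flush_dirty(sb);
}

/* Problem pattern: mark dirty while still holding the superblock lock. */
static void toy_alloc_bad(struct toy_sb *sb)
{
    int ok = toy_lock_super(sb);

    assert(ok);
    (void)ok;
    toy_mark_dirty(sb);        /* re-enters the lock: counted as a deadlock */
    toy_unlock_super(sb);
}

/* Suggested pattern: drop the lock first, then mark dirty. */
static void toy_alloc_good(struct toy_sb *sb)
{
    int ok = toy_lock_super(sb);

    assert(ok);
    (void)ok;
    /* ... update allocation state under the lock ... */
    toy_unlock_super(sb);
    toy_mark_dirty(sb);
}
```

The second function is the "never call mark_buffer_dirty() with the superblock lock held" rule in miniature.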
generic_file_write change in 2.4.0-ac8
Hi guys,

This code for generic_file_write calls vmtruncate without i_sem held. Is that intentional? It should cause problems for reiserfs at least...

-chris

diff -u --new-file --recursive --exclude-from /usr/src/exclude linux-2.4.0/mm/filemap.c linux.ac/mm/filemap.c
--- linux-2.4.0/mm/filemap.c	Wed Jan 3 02:59:45 2001
+++ linux.ac/mm/filemap.c	Thu Jan 11 17:26:55 2001
@@ -2578,6 +2625,13 @@
 	ClearPageUptodate(page);
 	kunmap(page);
 	goto unlock;
+sync_failure:
+	UnlockPage(page);
+	deactivate_page(page);
+	page_cache_release(page);
+	if (pos + bytes > inode->i_size)
+		vmtruncate(inode, inode->i_size);
+	goto done;
 }
Re: generic_file_write change in 2.4.0-ac8
On Friday, January 12, 2001 04:30:44 PM -0500 Alexander Viro <[EMAIL PROTECTED]> wrote:

> On Fri, 12 Jan 2001, Chris Mason wrote:
>
>> Hi guys,
>>
>> This code for generic_file_write calls vmtruncate without i_sem held. Is
>> that intentional? It should cause problems for reiserfs at least...
>
> Erm... generic_file_write() grabs i_sem upon entry and drops it on exit.
> This call of vmtruncate() is deep inside the protected area.
>

Yup, I'm trying to track down a different problem, and saw what I wanted to instead of what was really there. Sigh.

-chris
Re: patch:reiserfs 3.6.25 + LVM(Fix oops reiserfs filesystem)
On Saturday, January 13, 2001 11:41:51 PM -0800 hugang <[EMAIL PROTECTED]> wrote:

[ patch ]

Odd, the create_vi op should never be null, so the real fix is somewhere else. We'll look into this.

-chris
Re: More information on reiserfs bug
On Tuesday, January 16, 2001 07:38:58 PM +0100 Jakob Borg <[EMAIL PROTECTED]> wrote:

> Hi again,
>
> It seems the problem occurs every time i start fetchmail... Attached are
> ksymoops output and .config (if i remember this time). If there is
> anything else I can do to help debug this, just tell me

Linus fixed that hunk of debugging code in his merge, and it found a bug in the reiserfs O_SYNC support. reiserfs_commit_write needs to hold the BKL. This should fix it:

--- linux/fs/reiserfs/inode.c.1	Tue Jan 16 13:46:35 2001
+++ linux/fs/reiserfs/inode.c	Tue Jan 16 13:49:21 2001
@@ -1853,6 +1853,11 @@
     struct reiserfs_transaction_handle th ;
     reiserfs_wait_on_write_block(inode->i_sb) ;
+
+    /* prevent_flush_page_lock must be called before generic_commit_write,
+    ** and the BKL must be held during the call.
+    */
+    lock_kernel() ;
     prevent_flush_page_lock(page, inode) ;
     ret = generic_commit_write(f, page, from, to) ;
     /* we test for O_SYNC here so we can commit the transaction
@@ -1866,6 +1871,8 @@
         journal_end_sync(&th, inode->i_sb, 1) ;
     }
     allow_flush_page_lock(page, inode) ;
+    unlock_kernel() ;
+
     return ret ;
 }
Re: kernel BUG with 2.4.1-pre7 reiserfs
On Tuesday, January 16, 2001 07:58:37 PM +0100 Jakob Borg <[EMAIL PROTECTED]> wrote:

> On Tue, Jan 16, 2001 at 10:36:43AM -0800, Linus Torvalds wrote:
>> > I seem to remember more possibly useful information scrolling by my
>> > screen, but it seems to not have made it to the logs, and I will shut
>> > down and fsck the filesystem now...
>>
>> It really needs the stack-trace to debug this sanely (along with
>> translations of what the hex numbers are - see the bugreporting
>> documentation in the kernel source tree).
>
> Got that in the other mail subjected "More information ... ". In the
> meantime it seems the filesystem is unhurt because of this, but reiserfsck
> says
>
> uread_super_block: bad block is found at a new superblock location
> uread_super_block: bad block is found at an old superblock location
>
> which seems bogus. This is reiserfsck from the same suite that mkreiserfs
> came from ("reiserfsprogs 3.x") so they should be talking about the same
> sort of filesystem.
>

The BUG you hit should not corrupt anything; that debugging code is actually there to prevent silent corruption due to lack of locking.

It is likely you are using an fsck version that can't read the 3.6.x format. They are still packaging the beta fsck tool for the new format, and I'm not sure of the exact download URL yet. When you mount the FS it tells you which version it is, please include that info as well.

-chris
Re: set_page_dirty/page_launder deadlock
On Sunday, January 14, 2001 10:56:10 AM -0800 Linus Torvalds <[EMAIL PROTECTED]> wrote:

>> Marcelo Tosatti writes:
>> >
>> > While taking a look at page_launder()...
>>
>> ...
>>
>> > set_page_dirty() may lock the pagecache_lock which means potential
>> > deadlock since we have the pagemap_lru_lock locked.
>> >
>
> Well, as the new shm code doesn't return 1 any more, the whole locked page
> handling should just be deleted. ramfs always just re-marked the page
> dirty in its own "writepage()" function, so it was only shmfs that ever
> returned this special case, and because of other issues it already got
> excised by Christoph..
>

Then I'm confused by the code in 2.4.1pre8:

-chris

/*
 * Move the page from the page cache to the swap cache
 */
static int shmem_writepage(struct page * page)
{
	int error;
	struct shmem_inode_info *info;
	swp_entry_t *entry, swap;

	info = &page->mapping->host->u.shmem_i;
	if (info->locked)
		return 1;
	swap = __get_swap_page(2);
	if (!swap.val)
		return 1;
Re: Kernel 2.4.x and 2.4.1-preX - Higher latency than 2.2.x kernels?
On Saturday, January 20, 2001 02:59:24 PM -0500 Gregory Maxwell <[EMAIL PROTECTED]> wrote:

> On Sat, Jan 20, 2001 at 02:50:16PM -0500, Shawn Starr wrote:
>> It just seems that since using 2.4 I've noticed my poor Pentium 200Mhz
>> slow down whether being in X or otherwise. It just seems that the system
>> is sluggish.
>>
>> I am using the new ReiserFS filesystem and I do know it's still in heavy
>> development; perhaps my latency is due to this (?)
>
> Reiserfs uses much more complex data structures than ext2 (trees..). I
> don't think that latency has ever been a design criterion, and all of the
> benchmarks they use are pretty much pure throughput tests.
>
> So it wouldn't be really surprising if reiserfs had very bad latency. You
> should apply the timepegs patch and profile your kernel latency to see
> where it's coming from.

I'm actually very interested in fixing any latency problems. If you do these tests, please send the results along.

-chris
Re: 2.4.1-pre10 slowdown at boot.
On Thursday, January 25, 2001 05:23:26 PM +0100 Ondrej Sury <[EMAIL PROTECTED]> wrote:

>
> 2.4.1-pre10 slows down after printing those (maybe ACPI or reiserfs
> issue), and even SysRQ-(s,u,b) is not immediate and waits several (two+)
> seconds before (syncing,remounting,booting).
>
> ACPI: System description tables found
> ACPI: System description tables loaded
> ACPI: Subsystem enabled
> ACPI: System firmware supports: C2
> ACPI: System firmware supports: S0 S1 S4 S5
> reiserfs: checking transaction log (device 03:04) ...
> Warning, log replay starting on readonly filesystem
>

Here, reiserfs is telling you that it has started replaying transactions in the log. You should also have a reiserfs message telling you how many transactions it replayed, and how long it took. Do you have that message?

-chris
Re: 2.4.1-pre10 slowdown at boot.
On Thursday, January 25, 2001 06:51:33 PM +0100 Ondrej Sury <[EMAIL PROTECTED]> wrote:

> Chris Mason <[EMAIL PROTECTED]> writes:
>> > reiserfs: checking transaction log (device 03:04) ...
>> > Warning, log replay starting on readonly filesystem
>> >
>>
>> Here, reiserfs is telling you that it has started replaying transactions
>> in the log. You should also have a reiserfs message telling you how many
>> transactions it replayed, and how long it took. Do you have that
>> message?
>
> Nope. I rebooted with Alt-SysRQ+B after some while (approx. more than 30
> sec, normally reiserfs replay is taking ~5 sec (pre9)). I wasn't so
> patient. I could test it before I'll go from work to home.
>

Ok, depending on the metadata load before the crash, replay can take 30 seconds or more. You usually have to try to generate that many metadata changes, something like creating 100,000 tiny files or directories.

Compiling with CONFIG_REISERFS_CHECK turned on will give you more details about the log replay. Or, perhaps DMA is now off on your IDE drive, making everything slower.

Regardless, rebooting in the middle of log replay is safe. Those transactions will just be replayed again on the next boot.

-chris
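Why an interrupted replay is harmless can be seen in a small C sketch. It rests on one assumption that is not spelled out in the mail: that each log entry carries the complete new contents of a block, so applying an entry twice gives the same result as applying it once. The toy_* names and the one-int-per-block "disk" are invented for illustration.

```c
#include <assert.h>

#define TOY_NBLOCKS 4

/* One logged block: where it lives and its complete new contents
 * (a single int stands in for the whole block image). */
struct toy_log_entry {
    int blocknr;
    int value;
};

/* Copy logged blocks to their home locations, oldest first.
 * stop_after >= 0 simulates a reboot after that many entries;
 * a negative value means the replay runs to completion.
 * Returns the number of entries actually replayed. */
static int toy_replay(int *disk, const struct toy_log_entry *log, int nr,
                      int stop_after)
{
    int i;

    for (i = 0; i < nr; i++) {
        if (stop_after >= 0 && i == stop_after)
            break;
        assert(log[i].blocknr >= 0 && log[i].blocknr < TOY_NBLOCKS);
        disk[log[i].blocknr] = log[i].value;
    }
    return i;
}
```

Replaying a log once, or replaying part of it, "rebooting", and replaying it again from the start, leaves the toy disk in the same final state.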
Re: ACPI error in 2.4.1-pre10 @ via82c686 (Was: 2.4.1-pre10 slowdown at boot.)
On Thursday, January 25, 2001 07:37:16 PM +0100 Ondrej Sury <[EMAIL PROTECTED]> wrote:

> I have discovered that it wasn't reiserfs problem. I have disabled ACPI
> in BIOS and everything is ok. So I assume that something has changed in
> ACPI between pre9 and pre10 versions and that something is broken in _my_
> system.
>

Ok. This isn't related to the slowdown problem you are seeing, but after a clean shutdown, there should not be any transactions that need replay. Keep an eye on the console as you shutdown, and make sure / is getting properly unmounted.

-chris
Re: Kernel 2.4.x and 2.4.1-preX - Higher latency than 2.2.x kernels?
On Sunday, January 28, 2001 02:29:09 PM +1100 Andrew Morton <[EMAIL PROTECTED]> wrote:

> Shawn Starr wrote:
>>
>> Andrew, the patch HAS made a difference. For example, while untaring
>> glibc-2.2.1.tar.gz the system was not sluggish (mouse movements in X)
>> etc.
>>
>> Seems to be a go for latency improvements on this system.
>
> hmm.. OK, thanks.
>
> Chris, this seems to be a worthwhile improvement to mainstream
> reiserfs, independent of the low-latency thing. You can
> probably achieve 10 milliseconds with just a few lines of
> code - a subset of the patch which Shawn tested. (Unless you
> were planning on magical algorithmic improvements...).
>
> I'm all set up to generate those few lines of code, so
> I'll propose a patch later this week.

Perfect, I was thinking exactly the same thing. We have to be careful here though, since the extra schedules will increase the chance that the search has to be redone from scratch, which can have big performance ramifications.

I think your change to search_by_key will be the safest for performance considerations, along with the change to prepare_for_delete_or_cut. If those won't be enough, we can attack reiserfs_get_block (which is probably the biggest single offender without your patch).

-chris
Re: Renaming lost+found
On Friday, January 26, 2001 01:19:49 PM -0500 James Lewis Nance <[EMAIL PROTECTED]> wrote:

> FWIW IBM's JFS file system does not have a lost+found directory. I don't
> remember if reiserfs does or not.
>

reiserfsck creates it.

-chris
Re: Reiserfs problem was: Re: Version 2.4.1 cannot be built.
On Tuesday, January 30, 2001 03:42:36 PM -0800 "Brett G. Person" <[EMAIL PROTECTED]> wrote:

> Worked fine here but i am getting segfaults on my Reiser filesystems.
> I've been distracted by a project over the last few days. Is what I'm
> seeing a symptom of the fs corruption people were talking about last week?
>

If reiserfs is the cause you should have some clues in /var/log/messages. Does the kernel compile on ext2 on the same box?

-chris
Re: reiserfs min size (was: [2.4.1] mkreiserfs on loopdevice freezes kernel)
On Wednesday, January 31, 2001 11:27:57 PM +0100 Bernd Eckenfels <[EMAIL PROTECTED]> wrote:

> On Wed, Jan 31, 2001 at 09:24:39AM +, James Sutherland wrote:
>> 32 megaBLOCK?? How big is it in Mbytes?
>
> Blocksize is 4k, mkreiserfs in my version is telling me it can not generate
> partitions smaller than 32M but it is not true, i have to do
>
> dd if=/dev/zero of=/var/loop.img count=32768 size=4096
>
>> You do know reiserfs defaults to
>> building a 32 Mbyte journal on the device, I take it?
>
> Yes, I wonder if it is a Error in mkreiserfs to require 128MB.

It is. The actual min is around 40MB (with 32MB used by the journal). The next version of mkreiserfs will be fixed.

-chris
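The sizes in this exchange are easy to mix up, so here is the arithmetic as a small C sketch. The helper names are made up for illustration, and the block counts are derived only from the numbers quoted above (4k blocks, 32768 blocks, a 32MB journal, a minimum of roughly 40MB), not from the mkreiserfs source.

```c
#include <stdint.h>

/* Bytes occupied by `blocks` filesystem blocks of `block_size` bytes each. */
static uint64_t blocks_to_bytes(uint64_t blocks, uint32_t block_size)
{
    return blocks * (uint64_t)block_size;
}

/* Whole mebibytes in `bytes` (1 MiB = 2^20 bytes). */
static uint64_t bytes_to_mib(uint64_t bytes)
{
    return bytes >> 20;
}

/* Number of `block_size`-byte blocks needed to hold `mib` mebibytes. */
static uint64_t mib_to_blocks(uint64_t mib, uint32_t block_size)
{
    return (mib << 20) / block_size;
}
```

With 4k blocks, 32768 blocks come to 128MB, the 32MB journal alone is 8192 blocks, and a filesystem of about 40MB is about 10240 blocks.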
[PATCH] reiserfs transaction overflow
Hi guys,

Under certain loads, the reiserfs journal can overflow the max transaction size, leading to a crash (but not corruption). When the transaction is too full for another writer to join, the writer triggers a commit, and waits for the next transaction. But, it doesn't properly check to make sure the next transaction has enough room, which can lead to overflow.

It is hard to hit because there is a large margin of error in the way log space is reserved (this bug was probably in v.1 of the journal code). A similar patch will be needed for 3.5.x reiserfs, that will follow soon.

Anyway, this patch should fix 2.4.x, please apply:

-chris

--- linux/fs/reiserfs/journal.c.1 Tue Apr 17 09:36:36 2001
+++ linux/fs/reiserfs/journal.c Tue Apr 17 09:37:50 2001
@@ -2052,7 +2052,7 @@
 sleep_on(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
 }
 }
-lock_journal(p_s_sb) ; /* relock to continue */
+goto relock ;
 }
 if (SB_JOURNAL(p_s_sb)->j_trans_start_time == 0) { /* we are the first writer, set trans_id */
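To make the failure mode concrete, here is a minimal single-threaded sketch of the pattern the one-line patch restores: after waiting for a commit, jump back and run the room check again instead of assuming the new transaction has space. Everything here is hypothetical (`struct journal`, `commit_and_wait`, `journal_join`, and the `next_prefill` field that simulates other writers getting into the new transaction first); it models the idea, not the reiserfs journal code.

```c
/* Toy model of a journal with a hard cap on blocks per transaction. */
struct journal {
    int trans_max;     /* hypothetical cap on blocks per transaction */
    int trans_len;     /* blocks already reserved in the running transaction */
    int next_prefill;  /* blocks other writers reserve in the next
                          transaction before we wake up (simulation only) */
    int commits;       /* commits forced so far, for illustration */
};

/* Commit the running transaction and "sleep" until the next one starts. */
static void commit_and_wait(struct journal *j)
{
    j->commits++;
    j->trans_len = j->next_prefill;  /* others may have joined while we slept */
    j->next_prefill = 0;
}

/* Reserve nblocks; returns 0 on success, -1 if the request can never fit. */
static int journal_join(struct journal *j, int nblocks)
{
    if (nblocks > j->trans_max)
        return -1;
relock:
    if (j->trans_len + nblocks > j->trans_max) {
        commit_and_wait(j);
        goto relock;   /* re-run the room check against the new transaction */
    }
    j->trans_len += nblocks;
    return 0;
}
```

If the `goto relock` were replaced by simply falling through, a writer asking for 3 blocks could land in a new transaction that already holds 9 of its 10 blocks, which is the overflow described in the message.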
[PATCH] ac only, allow reiserfs files > 4GB
This patch should set s_maxbytes correctly for reiserfs in the ac kernels, and adds a reiserfs_setattr call to catch expanding truncates past the MAX_NON_LFS limit for old format files. reiserfs_get_block already catches file writes and such for this case. It also adds a generic_inode_setattr call, mostly because I didn't want to copy/maintain that hunk of code in reiserfs. Testing has been light, I'll beat on it more this evening. patch against 2.4.3-ac7. -chris diff -Nru a/fs/attr.c b/fs/attr.c --- a/fs/attr.c Wed Apr 18 18:33:44 2001 +++ b/fs/attr.c Wed Apr 18 18:33:44 2001 @@ -111,6 +111,21 @@ return dn_mask; } +int generic_inode_setattr(struct inode *inode, struct iattr * attr) { + int error ; + unsigned int ia_valid = attr->ia_valid; + + error = inode_change_ok(inode, attr); + if (!error) { + if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) || + (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid)) + error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0; + if (!error) + error = inode_setattr(inode, attr); + } + return error ; +} + int notify_change(struct dentry * dentry, struct iattr * attr) { struct inode *inode = dentry->d_inode; @@ -131,14 +146,7 @@ if (inode->i_op && inode->i_op->setattr) error = inode->i_op->setattr(dentry, attr); else { - error = inode_change_ok(inode, attr); - if (!error) { - if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) || - (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid)) - error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0; - if (!error) - error = inode_setattr(inode, attr); - } + error = generic_inode_setattr(inode, attr) ; } unlock_kernel(); if (!error) { diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c --- a/fs/reiserfs/file.cWed Apr 18 18:33:44 2001 +++ b/fs/reiserfs/file.cWed Apr 18 18:33:44 2001 @@ -106,6 +106,18 @@ return ( n_err < 0 ) ? 
-EIO : 0; } +static int reiserfs_setattr(struct dentry *dentry, struct iattr *attr) { +struct inode *inode = dentry->d_inode ; +if (attr->ia_valid & ATTR_SIZE) { + /* version 2 items will be caught by the s_maxbytes check + ** done for us in vmtruncate + */ +if (inode_items_version(inode) == ITEM_VERSION_1 && + attr->ia_size > MAX_NON_LFS) +return -EFBIG ; +} +return generic_inode_setattr(inode, attr) ; +} struct file_operations reiserfs_file_operations = { read: generic_file_read, @@ -119,6 +131,7 @@ struct inode_operations reiserfs_file_inode_operations = { truncate: reiserfs_vfs_truncate_file, +setattr:reiserfs_setattr, }; diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c --- a/fs/reiserfs/super.c Wed Apr 18 18:33:44 2001 +++ b/fs/reiserfs/super.c Wed Apr 18 18:33:44 2001 @@ -412,7 +412,7 @@ SB_BUFFER_WITH_SB (s) = bh; SB_DISK_SUPER_BLOCK (s) = rs; s->s_op = _sops; -s->s_maxbytes = MAX_NON_LFS; +s->s_maxbytes = MAX_NON_LFS; /* old format is always limited at 2GB */ return 0; } #endif @@ -493,7 +493,11 @@ SB_BUFFER_WITH_SB (s) = bh; SB_DISK_SUPER_BLOCK (s) = rs; s->s_op = _sops; -s->s_maxbytes = 0x;/* 4Gig */ + +/* new format is limited by the 32 bit wide i_blocks field, want to +** be one full block below that. 
+*/ +s->s_maxbytes = (512LL << 32) - s->s_blocksize ; return 0; } diff -Nru a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.hWed Apr 18 18:33:44 2001 +++ b/include/linux/fs.hWed Apr 18 18:33:44 2001 @@ -1359,6 +1359,7 @@ extern int inode_change_ok(struct inode *, struct iattr *); extern int inode_setattr(struct inode *, struct iattr *); +extern int generic_inode_setattr(struct inode *, struct iattr *); /* * Common dentry functions for inclusion in the VFS diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c --- a/kernel/ksyms.cWed Apr 18 18:33:44 2001 +++ b/kernel/ksyms.cWed Apr 18 18:33:44 2001 @@ -180,6 +180,7 @@ EXPORT_SYMBOL(permission); EXPORT_SYMBOL(vfs_permission); EXPORT_SYMBOL(inode_setattr); +EXPORT_SYMBOL(generic_inode_setattr); EXPORT_SYMBOL(inode_change_ok); EXPORT_SYMBOL(write_inode_now); EXPORT_SYMBOL(notify_change);
[PATCH] reiserfs should daemonize
Hi guys,

The reiserfs commit thread needs to daemonize. This patch was actually from Andi Kleen eons ago (but blame me if it breaks). Please apply.

Against 2.4.3:

--- linux/fs/reiserfs/journal.c Thu Apr 19 14:02:56 2001
+++ linux/fs/reiserfs/journal.c Thu Apr 19 18:11:57 2001
@@ -1814,16 +1814,14 @@
 ** then run the per filesystem commit task queue when we wakeup.
 */
 static int reiserfs_journal_commit_thread(void *nullp) {
- exit_files(current);
- exit_mm(current);
+
+ daemonize() ;
 spin_lock_irq(&current->sigmask_lock);
 sigfillset(&current->blocked);
 recalc_sigpending(current);
 spin_unlock_irq(&current->sigmask_lock);
- current->session = 1;
- current->pgrp = 1;
 sprintf(current->comm, "kreiserfsd") ;
 lock_kernel() ;
 while(1) {
[RFC] yet another knfsd-reiserfs patch
Hi guys, This patch is not meant to replace Neil Brown's knfsd ops stuff, the goal was to whip up something that had a chance of getting into 2.4.x, and that might be usable by the AFS guys too. Neil's patch tries to address a bunch of things that I didn't, and looks better for the long run. Anyway, the basic idea is the FS provides: int fill_fh(struct dentry *, __u32 *fh, int size) ; fills the array of ints in fh with enough info to find the file and its parent later. struct inode *inode_from_fh(struct super_block *, __u32 *fh, int size) ; struct inode *parent_from_fh(struct super_block *, __u32 *fh, int size) ; iget the inode or parent directory inode based on data in the array. Default ops are provided, the other filesystems should work the same as before. Anyway, please take a look. -chris # This is a BitKeeper generated patch for the following project: # Project Name: local kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet1.6 -> 1.7 #fs/reiserfs/super.c1.1 -> 1.2 #fs/nfsd/nfsfh.c1.1 -> 1.2 # include/linux/fs.h1.2 -> 1.3 #fs/reiserfs/inode.c1.1 -> 1.2 # include/linux/reiserfs_fs.h 1.1 -> 1.2 # # The following is the BitKeeper ChangeSet Log # # 01/04/23 [EMAIL PROTECTED] 1.7 # reiserfs-knfsd-fh-ops-2 # # Introduce file handle operations into the super ops. Add generic support and # reiserfs support. Meant for use by NFS (and perhaps AFS) to get around # reiserfs' inability to find a file with an inode number alone. 
# # fs.h reiserfs-knfsd-fh-ops-2 # reiserfs_fs.h reiserfs-knfsd-fh-ops-2 # nfsfh.c reiserfs-knfsd-fh-ops-2 # super.c reiserfs-knfsd-fh-ops-2 # inode.c reiserfs-knfsd-fh-ops-2 # # diff -Nru a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c --- a/fs/nfsd/nfsfh.c Mon Apr 23 02:14:42 2001 +++ b/fs/nfsd/nfsfh.c Mon Apr 23 02:14:42 2001 @@ -116,40 +116,12 @@ return error; } -/* this should be provided by each filesystem in an nfsd_operations interface as - * iget isn't really the right interface - */ -static struct dentry *nfsd_iget(struct super_block *sb, unsigned long ino, __u32 generation) +static struct dentry *dentry_from_inode(struct inode *inode) { - - /* iget isn't really right if the inode is currently unallocated!! -* This should really all be done inside each filesystem -* -* ext2fs' read_inode has been strengthed to return a bad_inode if the inode -* had been deleted. -* -* Currently we don't know the generation for parent directory, so a generation -* of 0 means "accept any" -*/ - struct inode *inode; struct list_head *lp; struct dentry *result; - inode = iget(sb, ino); - if (is_bad_inode(inode) - || (generation && inode->i_generation != generation) - ) { - /* we didn't find the right inode.. */ - dprintk("fh_verify: Inode %lu, Bad count: %d %d or version %u %u\n", - inode->i_ino, - inode->i_nlink, atomic_read(>i_count), - inode->i_generation, - generation); - - iput(inode); - return ERR_PTR(-ESTALE); - } - /* now to find a dentry. -* If possible, get a well-connected one + /* +* If possible, get a well-connected dentry */ spin_lock(_lock); for (lp = inode->i_dentry.next; lp != >i_dentry ; lp=lp->next) { @@ -172,6 +144,92 @@ return result; } +static struct inode *__inode_from_fh(struct super_block *sb, int ino, +int generation) +{ + struct inode *inode ; + + inode = iget(sb, ino); + if (is_bad_inode(inode) + || (generation && inode->i_generation != generation) + ) { + /* we didn't find the right inode.. 
*/ + dprintk("fh_verify: Inode %lu, Bad count: %d %d or version %u %u\n", + inode->i_ino, + inode->i_nlink, atomic_read(>i_count), + inode->i_generation, + generation); + + iput(inode); + return ERR_PTR(-ESTALE); + } + return inode ; +} + +static struct inode *inode_from_fh(struct super_block *sb, + __u32 *datap, + int len) +{ + if (sb->s_op->inode_from_fh) + return sb->s_op->inode_from_fh(sb, datap, len) ; + return __inode_from_fh(sb, datap[0], datap[1]) ; +} + +static struct inode *parent_from_fh(struct super_block *sb, +
Re: [patch] linux likes to kill bad inodes
On Sunday, April 22, 2001 02:10:42 PM +0200 Pavel Machek <[EMAIL PROTECTED]> wrote:

> Hi!
>
> I had a temporary disk failure (played with acpi too much). What
> happened was that disk was not able to do anything for five minutes
> or so. When disk recovered, linux happily overwrote all inodes it
> could not read while disk was down with zeros -> massive disk
> corruption.
>
> Solution is not to write bad inodes back to disk.
>

Wouldn't we rather make it so bad inodes don't get marked dirty at all?

-chris
Re: [patch] linux likes to kill bad inodes
On Wednesday, April 25, 2001 10:01:20 PM +0200 Pavel Machek <[EMAIL PROTECTED]> wrote:

> Hi!
>
>> > Hi!
>> >
>> > I had a temporary disk failure (played with acpi too much). What
>> > happened was that disk was not able to do anything for five minutes
>> > or so. When disk recovered, linux happily overwrote all inodes it
>> > could not read while disk was down with zeros -> massive disk
>> > corruption.
>> >
>> > Solution is not to write bad inodes back to disk.
>> >
>>
>> Wouldn't we rather make it so bad inodes don't get marked dirty at all?
>
> I guess this is cheaper: we can mark inode dirty at 1000 points, but
> you only write it at one point.

Whoops, I worded that poorly. To me, it seems like a bug to dirty a bad inode. If this patch works, it is because somewhere, somebody did something with a bad inode, and thought the operation worked (otherwise, why dirty it?).

So yes, even if we dirty them in a 1000 different places, we need to find the one place that believes it can do something worthwhile to a bad inode.

-chris
[PATCH] reiserfs lfs fix for 2.4.4-pre5 and above
Hello everyone, 2.4.4-pre5 started honoring the s_maxbytes field, so reiserfs needs a patch to allow files > 4GB on 3.6.x format filesystems. If you work with large files on reiserfs and are willing to try the prerelease kernels (non-production), please give this a try, it works for me but I'd like a few confirmations before I send to Linus. This also prevents someone from using truncate to expand an old format file past the 2GB mark. -chris diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c --- a/fs/reiserfs/file.cTue Apr 24 13:37:21 2001 +++ b/fs/reiserfs/file.cTue Apr 24 13:37:21 2001 @@ -106,6 +106,24 @@ return ( n_err < 0 ) ? -EIO : 0; } +static int reiserfs_setattr(struct dentry *dentry, struct iattr *attr) { +struct inode *inode = dentry->d_inode ; +int error ; +if (attr->ia_valid & ATTR_SIZE) { + /* version 2 items will be caught by the s_maxbytes check + ** done for us in vmtruncate + */ +if (inode_items_version(inode) == ITEM_VERSION_1 && + attr->ia_size > MAX_NON_LFS) +return -EFBIG ; +} + +error = inode_change_ok(inode, attr) ; +if (!error) +inode_setattr(inode, attr) ; + +return error ; +} struct file_operations reiserfs_file_operations = { read: generic_file_read, @@ -119,6 +137,7 @@ struct inode_operations reiserfs_file_inode_operations = { truncate: reiserfs_vfs_truncate_file, +setattr:reiserfs_setattr, }; diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c --- a/fs/reiserfs/super.c Tue Apr 24 13:37:21 2001 +++ b/fs/reiserfs/super.c Tue Apr 24 13:37:21 2001 @@ -492,7 +492,11 @@ SB_BUFFER_WITH_SB (s) = bh; SB_DISK_SUPER_BLOCK (s) = rs; s->s_op = _sops; -s->s_maxbytes = 0x;/* 4Gig */ + +/* new format is limited by the 32 bit wide i_blocks field, want to +** be one full block below that. 
+*/ +s->s_maxbytes = (512LL << 32) - s->s_blocksize ; return 0; }
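For readers who want to see the numbers behind the two limits in this patch, here is a userspace sketch. It assumes i_blocks counts 512-byte units (so a 32-bit i_blocks covers 512 * 2^32 bytes) and that the old-format limit is 2^31 - 1 bytes, matching the "2GB mark" mentioned above. `MAX_NON_LFS_SKETCH`, `new_format_maxbytes` and `check_new_size` are illustrative names, not kernel symbols.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed old-format limit: the largest offset that fits a signed 32-bit value. */
#define MAX_NON_LFS_SKETCH ((INT64_C(1) << 31) - 1)

/*
 * New-format limit as computed in the patch: 512 bytes per i_blocks unit
 * times 2^32 units, minus one full filesystem block.
 */
static int64_t new_format_maxbytes(uint32_t blocksize)
{
    return (INT64_C(512) << 32) - (int64_t)blocksize;
}

/*
 * Mirrors the check added in reiserfs_setattr: an old-format file may not
 * be expanded past the non-LFS limit. Returns 0 or -EFBIG.
 */
static int check_new_size(int old_format, int64_t new_size)
{
    if (old_format && new_size > MAX_NON_LFS_SKETCH)
        return -EFBIG;
    return 0;
}
```

With 4k blocks the new-format limit works out to 2^41 - 4096 bytes, just under 2TB.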
[PATCH] reiserfs highmem bug on tail reads
Ok, so all the reiserfs tail bugs weren't quite fixed yet, the last tail fix can cause problems with highmem turned on. Both bugs are in fs/reiserfs/inode.c:_get_block_create_0.

When reading the tail in, if the buffer was already up to date, we skip the disk i/o and return. But the cleanup code assumes the page was kmap'd, which isn't right.

Also, there was a chance to double kmap the page if kmap scheduled and the tree balanced while we slept. This bug has been there for a long time.

Anyway, this was tested with Andrea's HIGHMEM_DEBUG_MERE_MORTALS patch to force highmem on my 128MB machine. It works for me, but more testers are always good.

-chris

against 2.4.4-pre6, should work against 2.4.3 or higher.

diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c Wed Apr 25 23:15:14 2001
+++ b/fs/reiserfs/inode.c Wed Apr 25 23:15:14 2001
@@ -374,9 +374,11 @@
 ** sure we need to. But, this means the item might move if
 ** kmap schedules
 */
-p = (char *)kmap(bh_result->b_page) ;
-if (fs_changed (fs_gen, inode->i_sb) && item_moved (&tmp_ih, &path)) {
-goto research;
+if (!p) {
+ p = (char *)kmap(bh_result->b_page) ;
+ if (fs_changed (fs_gen, inode->i_sb) && item_moved (&tmp_ih, &path)) {
+ goto research;
+ }
 }
 p += offset ;
 memset (p, 0, inode->i_sb->s_blocksize);
@@ -420,14 +422,15 @@
 ih = get_ih (&path);
 } while (1);
+flush_dcache_page(bh_result->b_page) ;
+kunmap(bh_result->b_page) ;
+
 finished:
 pathrelse (&path);
 bh_result->b_blocknr = 0 ;
 bh_result->b_dev = inode->i_dev;
 mark_buffer_uptodate (bh_result, 1);
 bh_result->b_state |= (1UL << BH_Mapped);
-flush_dcache_page(bh_result->b_page) ;
-kunmap(bh_result->b_page) ;
 return 0;
 }
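The shape of the two fixes can be shown with a mock in plain C. This is a sketch under stated assumptions: `struct page`, `kmap_sketch`, `kunmap_sketch` and `read_tail` are userspace stand-ins invented here, and `research_rounds` simulates the tree moving while the mapping call slept. The point is only the control flow: map at most once across retries, and unmap only on the path that actually mapped.

```c
#include <stddef.h>

/* Mock page that counts outstanding mappings. */
struct page {
    int map_count;
    char data[16];
};

static char *kmap_sketch(struct page *pg)
{
    pg->map_count++;
    return pg->data;
}

static void kunmap_sketch(struct page *pg)
{
    pg->map_count--;
}

/*
 * Returns the number of mappings still outstanding on exit; a correct
 * implementation always returns 0.
 */
static int read_tail(struct page *pg, int already_uptodate, int research_rounds)
{
    char *p = NULL;

research:
    if (already_uptodate)
        goto finished;          /* no i/o and no mapping was taken */

    if (!p) {                   /* map only once, even after a retry */
        p = kmap_sketch(pg);
        if (research_rounds > 0) {
            research_rounds--;  /* pretend the item moved while we slept */
            goto research;
        }
    }
    p[0] = 0;                   /* stand-in for copying the tail */
    kunmap_sketch(pg);          /* unmap before the shared exit label */

finished:
    return pg->map_count;
}
```

Moving the unmap above the `finished` label is what keeps the already-up-to-date path from unmapping a page it never mapped, and the `if (!p)` guard is what prevents the double map on retry.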
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thursday, April 26, 2001 02:24:26 PM -0400 Alexander Viro <[EMAIL PROTECTED]> wrote:

> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
>> correct. I bet other fs are affected as well btw.
>
> If only... block_read() vs. block_write() has the same race. I'm going
> through the list of all wait_on_buffer() users right now.
>

Looks like reiserfs has it too when allocating tree blocks, but it should be harder to hit. The fix should be small but it will take me a bit to make sure it doesn't affect the rest of the balancing code.

-chris
Re: ReiserFS question
On Thursday, April 26, 2001 11:05:25 PM +0400 Samium Gromoff <[EMAIL PROTECTED]> wrote:

> Hi People...
>got a following "dead of alive" question:
>how to find a root block on a ReiserFS partition
>with a corrupted superblock?
>
>reiserfsprogs-3.x.0.9j simply writes -2^32
>there at start (reset_super_block) and then simply
>crashes when attempting to access to such mad place
> ... got nearly lost my main partition ...
>
>

reiserfsck --rebuild-tree will find the root block for you. Now that you've rebuilt the super, run with --rebuild-tree and it should find everything.

-chris
Re: kernel panic with 2.4.x and reiserfs
On Friday, April 27, 2001 02:40:50 AM -0700 jason <[EMAIL PROTECTED]> wrote:

[ ouch ]

>
> reiserfs_read_super: can't find reiserfs filesystem on dev 03:01
> Invalid session # or type of track
> Kernel panic: VFS: Unable to mount root fs on 03:01
>
> In case it's any help, I'm running Debian "sid" under kernel 2.4.3. hda
> is a Western Digital WD400 (UDMA 100) while hdc is a Maxtor 36.5 GB. I
> have a 900 Mhz Athlon on an Abit KT7A, the latter containing the South
> Bridge VIA VT82C686B and a North Bridge VIA VT8363A.
> Any info on how I could possibly retrieve data from my disk (hda) would
> be greatly appreciated...
>

Looks like you've hit the pot-luck of VIA problems, and elevator bugs (2.4.1). When the last crash hit, did you recycle with the power button or the reset button?

Step one, if you can, get a backup of the raw device. This will make everything easier if there are problems in step 3.

Step two, grab the latest reiserfsprogs from ftp.reiserfs.org/pub/reiserfsprogs.

Step three, reiserfsck --rebuild-sb ; reiserfsck --rebuild-tree

Drop me a line if there are any questions.

-chris
Re: [patch] linux likes to kill bad inodes
On Friday, April 27, 2001 12:28:54 AM +0200 Pavel Machek <[EMAIL PROTECTED]> wrote:

> Okay, so what about following patch, followed by attempt to debug it?
> [I'd really like to get patch it; killing user's data without good
> reason seems evil to me, and this did quite a lot of damage to my
> $HOME.]

2.4.4-pre8 does have the patch to keep write_inode from syncing a bad_inode. In the short term this is the best way to go.

For debugging further, it is probably best to put the warning in when marking the inode dirty, and to randomly return bad_inodes from read_inode. I'll give this a try next week.

My guess is that UPDATE_ATIME is the offending caller, the follow_link path in open_namei is at least one place that should trigger it.

-chris
Re: kernel panic with 2.4.x and reiserfs
On Friday, April 27, 2001 04:33:15 PM +0100 Tony Hoyle <[EMAIL PROTECTED]> wrote:

> Reiserfs doesn't cope well with crashes Under 2.4 I wouldn't
> recommend using it on any kind of critical server - it seems to
> progressively corrupt itself (I'm looking at the second reformat and
> reinstall in a week, and I'm not a happy bunny).

Could you please forward along the details of these corruptions (including hardware)?

>
> As the warning on reiserfsck says, the rebuild-tree option is a last
> resort. It's as likely to make the problem worse then improve it (It
> rounds all the file lengths up to a block size, padding with zeros, which
> breaks lots of stuff). Backup what you can first.

It shouldn't always do this, most of the time it has enough info to get the size right. Which reiserfsck did you use?

-chris
Re: reiserfs autofix?
On Sunday, April 29, 2001 02:48:27 PM -0700 putter <[EMAIL PROTECTED]> wrote:

> Hi,
> I am kernel newbie, especially with logging filesystems.
> Now I am using Mandrake 7.1 with 2.4.3 kernel and imon patch
> and NVidia drivers compiled into the kernel.
^^^
The binary only nvidia drivers make it a bit hard for us to debug.

> Now, all my partitions are ReiserFS. I usually play quake once
> or twice a day. Sometimes graphics subsystem freezes up, so it takes
> keyboard input. Caps and Numlock are working fine, unless I try to kill
> X with ctrlalt-backspace. So I reset my machine with hardware switch.

Check your /var/log/messages. You probably have messages from reiserfs. Send along an lspci so we can see what your hardware is.

-chris
Re: reiserfs autofix?
On Monday, April 30, 2001 12:07:04 AM -0700 putter <[EMAIL PROTECTED]> wrote:

> I think I have tracked down the problem to the card itself. My machine is
> on @ graphics mode all the time, like 24hrs a day, and it seems that it
> is somewhat taxing on the cards performance. So now I switch down to text
> mode, everytime I leave the machine. How did I find out? I placed my
> finger of heatsink of my GeForce DDR. It was HOT! Fan works alright, so
> if I was to run computer a while, stress accumilates, and when I run
> GeForce understress of maximum resolutions, it craps out. So much for
> NVidia eh?

Do a search through the kernel archives for nvidia. The crashes could just be the driver. But heat is always a problem, add fans ;-)

>
> BTW, I don't question graphical subsystem crashes. I question reiserfs
> that suppose to leave my partitions in consistent state, no matter how
> trigger happy with power switch I am, or is my judgement is clouded?
>=)

After a crash, reiserfs only cleans up after itself. If someone else went in and hosed the metadata (nvidia, bad drive, controller, ide fun with via), you've still got bad blocks.

This is one possible reason that we've seen more reports than ext2 has. After a crash, ext2fsck fixes _whatever_ was broken. Log replay in reiserfs only fixes the operations that were in progress when the system crashed.

Anyway, those messages show that you've got metadata corruption. Grab the latest reiserfsprogs from ftp.reiserfs.org and run reiserfsck -x (after backing things up).

-chris
Re: reiserfs+lndir problem [was: 2.4.4 SMP: spurious EOVERFLOW "Value too large for defined data type"]
On Monday, April 30, 2001 10:55:57 PM +0200 Daniel Elstner <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> unfortunately I have to correct me again.
> The problem seems unrelated to the kernel version or SMP/UP
> (though only 2.4.[34] tried yet).
>
> Apparently it's a reiserfs/symlink problem.
> I tried doing the lndir on an ext2 partition, sources still
> on reiserfs. And it worked just fine!

Neat, thanks for the extra details. Does that mean you can consistently repeat on reiserfs now? What happens when you do the lndir on reiserfs and diff the directories? Any useful messages in /var/log/messages?
Re: [RFC] yet another knfsd-reiserfs patch
On Monday, April 23, 2001 10:45:14 AM -0400 Chris Mason <[EMAIL PROTECTED]> wrote: > > Hi guys, > > This patch is not meant to replace Neil Brown's knfsd ops stuff, the > goal was to whip up something that had a chance of getting into 2.4.x, > and that might be usable by the AFS guys too. Neil's patch tries to > address a bunch of things that I didn't, and looks better for the > long run. > Ok, here it is updated to 2.4.4. The only change was to adapt to the usage of comp_short_keys in reiserfs_iget under 2.4.4. -chris diff -Nru a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c --- a/fs/nfsd/nfsfh.c Sun Apr 29 18:01:04 2001 +++ b/fs/nfsd/nfsfh.c Sun Apr 29 18:01:04 2001 @@ -116,40 +116,12 @@ return error; } -/* this should be provided by each filesystem in an nfsd_operations interface as - * iget isn't really the right interface - */ -static struct dentry *nfsd_iget(struct super_block *sb, unsigned long ino, __u32 generation) +static struct dentry *dentry_from_inode(struct inode *inode) { - - /* iget isn't really right if the inode is currently unallocated!! -* This should really all be done inside each filesystem -* -* ext2fs' read_inode has been strengthed to return a bad_inode if the inode -* had been deleted. -* -* Currently we don't know the generation for parent directory, so a generation -* of 0 means "accept any" -*/ - struct inode *inode; struct list_head *lp; struct dentry *result; - inode = iget(sb, ino); - if (is_bad_inode(inode) - || (generation && inode->i_generation != generation) - ) { - /* we didn't find the right inode.. */ - dprintk("fh_verify: Inode %lu, Bad count: %d %d or version %u %u\n", - inode->i_ino, - inode->i_nlink, atomic_read(>i_count), - inode->i_generation, - generation); - - iput(inode); - return ERR_PTR(-ESTALE); - } - /* now to find a dentry. 
-* If possible, get a well-connected one + /* +* If possible, get a well-connected dentry */ spin_lock(_lock); for (lp = inode->i_dentry.next; lp != >i_dentry ; lp=lp->next) { @@ -172,6 +144,92 @@ return result; } +static struct inode *__inode_from_fh(struct super_block *sb, int ino, +int generation) +{ + struct inode *inode ; + + inode = iget(sb, ino); + if (is_bad_inode(inode) + || (generation && inode->i_generation != generation) + ) { + /* we didn't find the right inode.. */ + dprintk("fh_verify: Inode %lu, Bad count: %d %d or version %u %u\n", + inode->i_ino, + inode->i_nlink, atomic_read(>i_count), + inode->i_generation, + generation); + + iput(inode); + return ERR_PTR(-ESTALE); + } + return inode ; +} + +static struct inode *inode_from_fh(struct super_block *sb, + __u32 *datap, + int len) +{ + if (sb->s_op->inode_from_fh) + return sb->s_op->inode_from_fh(sb, datap, len) ; + return __inode_from_fh(sb, datap[0], datap[1]) ; +} + +static struct inode *parent_from_fh(struct super_block *sb, + __u32 *datap, + int len) +{ + if (sb->s_op->parent_from_fh) + return sb->s_op->parent_from_fh(sb, datap, len) ; + + if (len >= 3) + return __inode_from_fh(sb, datap[2], 0) ; + return ERR_PTR(-ESTALE); +} + +/* + * two iget funcs, one for inode, and one for parent directory + * + * this should be provided by each filesystem in an nfsd_operations interface as + * iget isn't really the right interface + * + * If the filesystem doesn't provide funcs to get inodes from datap, + * it must be: inum, generation, dir inum. Length of 2 means the + * dir inum isn't there. + * + * iget isn't really right if the inode is currently unallocated!! + * This should really all be done inside each filesystem + * + * ext2fs' read_inode has been strengthed to return a bad_inode if the inode + * had been deleted. 
+ * + * Currently we don't know the generation for parent directory, so a generation + * of 0 means "accept any" + */ +static struct dentry *nfsd_iget(struct super_block *sb, __u32 *datap, int len) +{ + + struct inode *inode; + + inode = inode_from_fh(sb, datap, len) ; + if (IS_ERR(inode)) { + return ERR_PTR(PTR_ERR
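The default file handle layout described in the patch comment (inum, generation, dir inum, with a length of 2 meaning the dir inum isn't there) can be illustrated in userspace C. This is a sketch only: `fill_fh_sketch` and `parent_ino_from_fh_sketch` are hypothetical helpers written for this note, and they return plain inode numbers instead of the `struct inode *` the real hooks hand back.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Pack the default layout into fh: [0] inode number, [1] generation,
 * [2] parent directory inode number (optional).
 * Returns the number of 32-bit words used, or -1 if fh is too small.
 */
static int fill_fh_sketch(uint32_t *fh, int size, uint32_t ino,
                          uint32_t generation, int have_parent,
                          uint32_t parent_ino)
{
    int need = have_parent ? 3 : 2;

    if (size < need)
        return -1;
    fh[0] = ino;
    fh[1] = generation;
    if (have_parent)
        fh[2] = parent_ino;
    return need;
}

/*
 * Recover the parent directory's inode number. A handle shorter than
 * three words carries no parent, so the lookup fails with -ESTALE,
 * the same error the patch returns in that case.
 */
static int parent_ino_from_fh_sketch(const uint32_t *fh, int len,
                                     uint32_t *parent_ino)
{
    if (len < 3)
        return -ESTALE;
    *parent_ino = fh[2];
    return 0;
}
```

A filesystem that cannot find a file from the inode number alone would pack whatever extra key data it needs into the same array, which is the reason the hooks take an opaque array plus a length.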
Re: reiserfs+lndir problem [was: 2.4.4 SMP: spurious EOVERFLOW "Value too large for defined data type"]
On Wednesday, May 02, 2001 12:41:52 AM +0200 Daniel Elstner <[EMAIL PROTECTED]> wrote:

> Hi,
>
> On Mon, 30 Apr 2001 21:03:47 -0400 Chris Mason <[EMAIL PROTECTED]> wrote:
>
>> > Apparently it's a reiserfs/symlink problem.
>> > I tried doing the lndir on an ext2 partition, sources still
>> > on reiserfs. And it worked just fine!
>>
>> Neat, thanks for the extra details. Does that mean you can consistently
>> repeat on reiserfs now? What happens when you do the lndir on reiserfs
>> and diff the directories?
>
> I just played around a bit with the following results:
>
> sources on reiserfs, lndir on reiserfs -> make fails, diff ok
> sources on reiserfs, lndir on ext2 -> make ok
> sources on ext2, lndir on reiserfs -> make fails, diff ok
>
> Doing the diff against a second copy of the tree shows no errors, too.
> Always the same behaviour: You have to run lndir at least twice to
> get the error. If the link tree was already set up after a boot, the
> error occurs only after rm + lndir + rm + lndir.
>
> There's a strange way to get things working just like after a reboot.
> After diff'ing the link tree with the 2nd copy (both on reiserfs),
> make World won't fail - at least once.

Ok, can you reproduce with a set of sources other than X? I would leave glibc alone for now, unless you can reproduce on ext2.

-chris
Re: * Re: Severe trashing in 2.4.4
On Tuesday, May 01, 2001 03:11:58 PM -0700 David <[EMAIL PROTECTED]> wrote:

> Can't say for a definite fact that it was reiserfs but I can say for a
> definite fact that something fishy happens sometimes.
>
> If I have a text file open, something.html comes to mind, if I edit it
> and save it in one rxvt and open it in another rxvt, my changes may not
> be there. If I save it *again* or exit the editing process, I will see
> the changes in the second term. No, I'm not accidentally forgetting to
> save it, I know for a fact that I saved it and the first terminal shows
> the non-modified state with the changes and the second term shows the
> previous data.
>
> Somewhere something is stuck in cache and what's on disk isn't what's in
> cache and a second process for some reason gets what is on disk and not
> what is in cache.
>
> It happens infrequently but it -does- happen.

Does it happen with -o notail? Which editor?

-chris
Re: Maximum files per Directory
On Tuesday, May 01, 2001 04:57:02 PM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:

> H. Peter Anvin writes:
>> Not correct, there can't be more than 2^15 *directories* in a single
>> directory. I believe this is an ext2 limitation.
>
>
> I see that reiserfs plays some tricks with the directory i_nlink count.
> If you exceed 64536 links in a directory, it reverts to "1" and no longer
> tracks the link count.

Correct. The link count isn't used at all when deciding if the directory
is empty (we use the size instead), so we can just lie to VFS if someone
tries to make tons of subdirs.

-chris
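The behaviour described above (link count reverts to 1 once the limit is exceeded, emptiness decided from the size rather than the link count) can be modelled in user space along these lines. This is a toy sketch only: `toy_dir`, `TOY_LINK_MAX` and the helper names are hypothetical, the limit is deliberately tiny, and `entries` merely stands in for the directory size that reiserfs actually consults.

```c
#include <stdbool.h>

/* Tiny illustrative limit; the real reiserfs threshold is far larger. */
#define TOY_LINK_MAX 5u

/* Hypothetical model of a directory: link count plus a stand-in for size. */
struct toy_dir {
	unsigned int nlink;	/* starts at 2 for an empty directory */
	unsigned int entries;	/* stands in for the on-disk directory size */
};

/* Adding a subdirectory: once tracking is off (nlink == 1) it stays off. */
static void toy_add_subdir(struct toy_dir *d)
{
	d->entries++;
	if (d->nlink == 1)
		return;
	if (d->nlink + 1 > TOY_LINK_MAX)
		d->nlink = 1;	/* exceeded the limit: stop tracking */
	else
		d->nlink++;
}

/* Removing a subdirectory never re-enables tracking. */
static void toy_remove_subdir(struct toy_dir *d)
{
	d->entries--;
	if (d->nlink != 1)
		d->nlink--;
}

/* Emptiness is decided from the size stand-in, never from nlink. */
static bool toy_dir_is_empty(const struct toy_dir *d)
{
	return d->entries == 0;
}
```

Since `toy_dir_is_empty` never looks at `nlink`, reporting a count of 1 to the caller costs nothing in correctness, which is the point made in the reply above.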
Re: Maximum files per Directory
On Friday, May 04, 2001 01:15:22 PM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:

> Chris writes:
>> On Tuesday, May 01, 2001 04:57:02 PM -0600 Andreas Dilger
>> <[EMAIL PROTECTED]> wrote:
>> > I see that reiserfs plays some tricks with the directory i_nlink count.
>> > If you exceed 64536 links in a directory, it reverts to "1" and no
>> > longer tracks the link count.
>>
>> Correct. The link count isn't used at all when deciding if the directory
>> is empty (we use the size instead), so we can just lie to VFS if someone
>> tries to make tons of subdirs.
>
> For that matter, ext2 doesn't use the link count on directories to
> determine if they are empty either, so it shouldn't be too hard to do the
> same with the ext2 indexed-directory code. Is there a reason that
> reiserfs chose to have "large number of directories" represented by "1"
> and not "LINK_MAX+1"?
>

find and a few others consider a link count of 1 to mean there is no link
count tracking being done.

-chris
Re: Maximum files per Directory
On Saturday, May 05, 2001 03:49:20 PM +0200 Jamie Lokier <[EMAIL PROTECTED]> wrote:

> Chris Mason wrote:
>> > Is there a reason that
>> > reiserfs chose to have "large number of directories" represented by "1"
>> > and not "LINK_MAX+1"?
>>
>> find and a few others consider a link count of 1 to mean there is no link
>> count tracking being done.
>
> Indeed, and thank you for getting this right!
>
> Btw, is it possible to add dirent->d_type information to reiserfs, and
> would there be any performance gain in doing so?

reiserfs doesn't store that information in its directory items right now,
but there are plenty of free bits to do so. It wouldn't be hard to add the
feature, and yes there should be a performance gain.

>
> I have code to add d_type for every other filesystem that can support it
> without additional disk reads, but I couldn't figure out whether
> reiserfs can do it or whether stat() following readdir() is cheap anyway.

stat is actually a little more expensive than ext2, since we have to search
for the inode data in the tree. It is a fast search, but...

-chris