Re: bio linked list corruption.

2016-10-26 Thread Chris Mason


On 10/26/2016 04:00 PM, Chris Mason wrote:
> 
> 
> On 10/26/2016 03:06 PM, Linus Torvalds wrote:
>> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <da...@codemonkey.org.uk> wrote:
>>>
>>> The stacks show nearly all of them are stuck in sync_inodes_sb
>>
>> That's just wb_wait_for_completion(), and it means that some IO isn't
>> completing.
>>
>> There's also a lot of processes waiting for inode_lock(), and a few
>> waiting for mnt_want_write()
>>
>> Ignoring those, we have
>>
>>> [] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>>> [] btrfs_sync_fs+0x31/0xc0 [btrfs]
>>> [] sync_filesystem+0x6e/0xa0
>>> [] SyS_syncfs+0x3c/0x70
>>> [] do_syscall_64+0x5c/0x170
>>> [] entry_SYSCALL64_slow_path+0x25/0x25
>>> [] 0x
>>
>> Don't know this one. There's a couple of them. Could there be some
>> ABBA deadlock on the ordered roots waiting?
> 
> It's always possible, but we haven't changed anything here.
> 
> I've tried a long list of things to reproduce this on my test boxes,
> including days of trinity runs and a kernel module to exercise vmalloc,
> and thread creation.
> 
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
> 

This one is special because CONFIG_VMAP_STACK is not set.  Btrfs triggers
in < 10 minutes.  I've done 30 minutes each with XFS and Ext4 without luck.

This is all in a virtual machine that I can copy on to a bunch of hosts.
So I'll get some parallel tests going tonight to narrow it down.

[ cut here ]
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (e8d80b08), but was 88012b65fb88. (prev=880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul 
ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr 
sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 
04/01/2014
 880104eff868 814fde0f 8151c46e 880104eff8c8
 880104eff8c8  880104eff8b8 810648cf
 880128cab2c0 00213fc57c68 8801384e8928 880128cab180
Call Trace:
 [] dump_stack+0x53/0x74
 [] ? __list_add+0xbe/0xd0
 [] __warn+0xff/0x120
 [] warn_slowpath_fmt+0x49/0x50
 [] __list_add+0xbe/0xd0
 [] blk_sq_make_request+0x388/0x580
 [] generic_make_request+0x104/0x200
 [] submit_bio+0x65/0x130
 [] ? __percpu_counter_add+0x96/0xd0
 [] btrfs_map_bio+0x23c/0x310
 [] btrfs_submit_bio_hook+0xd3/0x190
 [] submit_one_bio+0x6d/0xa0
 [] flush_epd_write_bio+0x4e/0x70
 [] extent_writepages+0x5d/0x70
 [] ? btrfs_releasepage+0x50/0x50
 [] ? wbc_attach_and_unlock_inode+0x6e/0x170
 [] btrfs_writepages+0x27/0x30
 [] do_writepages+0x20/0x30
 [] __filemap_fdatawrite_range+0xb5/0x100
 [] filemap_fdatawrite_range+0x13/0x20
 [] btrfs_fdatawrite_range+0x2b/0x70
 [] btrfs_sync_file+0x88/0x490
 [] ? group_send_sig_info+0x42/0x80
 [] ? kill_pid_info+0x5d/0x90
 [] ? SYSC_kill+0xba/0x1d0
 [] ? __sb_end_write+0x58/0x80
 [] vfs_fsync_range+0x4c/0xb0
 [] ? syscall_trace_enter+0x201/0x2e0
 [] vfs_fsync+0x1c/0x20
 [] do_fsync+0x3d/0x70
 [] ? syscall_slow_exit_work+0xfb/0x100
 [] SyS_fsync+0x10/0x20
 [] do_syscall_64+0x55/0xd0
 [] ? prepare_exit_to_usermode+0x37/0x40
 [] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---


Re: btrfs bio linked list corruption.

2016-10-11 Thread Chris Mason



On 10/11/2016 11:19 AM, Dave Jones wrote:

On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
 > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 >
 > Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

 > TBH, I don't see anything
 > in splice-related stuff that could come anywhere near that (short of
 > some general memory corruption having random effects of that sort).
 >
 > Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to
reproduce today.


This call trace is reading metadata so we can finish the truncate.  I'd 
say adding more memory pressure would make it happen more often.


I'll try to trigger.

-chris



[GIT PULL] Btrfs

2016-10-11 Thread Chris Mason
Hi Linus,

My for-linus-4.9 has our merge window pull:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.9

This is later than normal because I was tracking down a use-after-free
during btrfs/101 in xfstests.  I had hoped to fix up the offending
patch, but wasn't happy with the size of the changes at this point in
the merge window.

The use-after-free was enough of a corner case that I didn't want to
rebase things out at this point.  So instead the top of the pull is my
revert, and the rest of these were prepped by Dave Sterba (thanks Dave!).  

This is a big variety of fixes and cleanups.  Liu Bo continues to fix up
fuzzer-related problems, and some of Josef's cleanups are prep for his
bigger extent buffer changes (slated for v4.10).

Liu Bo (13) commits (+207/-36):
Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf (+5/-1)
Btrfs: return gracefully from balance if fs tree is corrupted (+17/-6)
Btrfs: improve check_node to avoid reading corrupted nodes (+28/-4)
Btrfs: add error handling for extent buffer in print tree (+7/-0)
Btrfs: memset to avoid stale content in btree node block (+11/-0)
Btrfs: bail out if block group has different mixed flag (+14/-0)
Btrfs: memset to avoid stale content in btree leaf (+28/-19)
Btrfs: fix memory leak in reading btree blocks (+9/-0)
Btrfs: fix memory leak of block group cache (+75/-0)
Btrfs: kill BUG_ON in run_delayed_tree_ref (+7/-1)
Btrfs: remove BUG_ON in start_transaction (+1/-4)
Btrfs: fix memory leak in do_walk_down (+1/-0)
Btrfs: remove BUG() in raid56 (+4/-1)

Jeff Mahoney (7) commits (+849/-902):
btrfs: btrfs_debug should consume fs_info when DEBUG is not defined (+10/-4)
btrfs: clean the old superblocks before freeing the device (+11/-27)
btrfs: convert send's verbose_printk to btrfs_debug (+38/-27)
btrfs: convert printk(KERN_* to use pr_* calls (+205/-275)
btrfs: convert pr_* to btrfs_* where possible (+231/-177)
btrfs: unsplit printed strings (+324/-391)
btrfs: add dynamic debug support (+30/-1)

Josef Bacik (5) commits (+178/-156):
Btrfs: kill the start argument to read_extent_buffer_pages (+15/-28)
Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written (+33/-8)
Btrfs: add a flags field to btrfs_fs_info (+99/-109)
Btrfs: don't leak reloc root nodes on error (+4/-0)
Btrfs: don't BUG() during drop snapshot (+27/-11)

Goldwyn Rodrigues (3) commits (+3/-18):
btrfs: Do not reassign count in btrfs_run_delayed_refs (+0/-1)
btrfs: Remove already completed TODO comment (+0/-2)
btrfs: parent_start initialization cleanup (+3/-15)

Luis Henriques (2) commits (+0/-4):
btrfs: Fix warning "variable ‘blocksize’ set but not used" (+0/-2)
btrfs: Fix warning "variable ‘gen’ set but not used" (+0/-2)

Eric Sandeen (1) commits (+1/-1):
btrfs: fix perms on demonstration debugfs interface

Anand Jain (1) commits (+20/-6):
btrfs: fix a possible umount deadlock

Lu Fengqi (1) commits (+369/-10):
btrfs: fix check_shared for fiemap ioctl

Chris Mason (1) commits (+15/-11):
Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"

Masahiro Yamada (1) commits (+8/-28):
btrfs: squash lines for simple wrapper functions

Qu Wenruo (1) commits (+37/-25):
btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band 
dedupe and subpage size patchset

Arnd Bergmann (1) commits (+7/-10):
btrfs: fix btrfs_no_printk stub helper

David Sterba (1) commits (+9/-0):
btrfs: create example debugfs file only in debugging build

Naohiro Aota (1) commits (+11/-15):
btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs

Total: (39) commits (+1714/-1222)

 fs/btrfs/backref.c| 409 ++
 fs/btrfs/btrfs_inode.h|  11 --
 fs/btrfs/check-integrity.c| 342 +++
 fs/btrfs/compression.c|   6 +-
 fs/btrfs/ctree.c  |  56 ++
 fs/btrfs/ctree.h  | 116 
 fs/btrfs/delayed-inode.c  |  25 ++-
 fs/btrfs/delayed-ref.c|  15 +-
 fs/btrfs/dev-replace.c|  21 ++-
 fs/btrfs/dir-item.c   |   7 +-
 fs/btrfs/disk-io.c| 237 
 fs/btrfs/disk-io.h|   2 +
 fs/btrfs/extent-tree.c| 198 +++-
 fs/btrfs/extent_io.c  | 170 +++---
 fs/btrfs/extent_io.h  |   4 +-
 fs/btrfs/file.c   |  43 -
 fs/btrfs/free-space-cache.c   |  21 ++-
 fs/btrfs/free-space-cache.h   |   6 +-
 fs/btrfs/free-space-tree.c|  20 ++-
 fs/btrfs/inode-map.c  |  31 ++--
 fs/btrfs/inode.c  |  70 +---
 fs/btrfs/ioctl.c  |  14 +-
 fs/btrfs/lzo.c|   6 +-
 fs/btrfs/ordered-data.c   |   4 +-
 fs/btrfs/print-tree.c |  93 +-
 fs/btrfs/qgroup.c |  77 
 fs/bt

Re: btrfs bio linked list corruption.

2016-10-11 Thread Chris Mason


On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
> 
> [ cut here ]
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (e8806648), but was 
> c967fcd8. (prev=880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
>  c9d87458 8d32007c c9d874a8 
>  c9d87498 8d07a6c1 00210246 88050388e880
>  880503878b80 e8806648 e8c06600 880502808008
> Call Trace:
> [] dump_stack+0x4f/0x73
> [] __warn+0xc1/0xe0
> [] warn_slowpath_fmt+0x5a/0x80
> [] __list_add+0x89/0xb0
> [] blk_sq_make_request+0x2f8/0x350

	/*
	 * A task plug currently exists. Since this is completely lockless,
	 * utilize that to temporarily store requests until the task is
	 * either done or scheduled away.
	 */
	plug = current->plug;
	if (plug) {
		blk_mq_bio_to_request(rq, bio);
		if (!request_count)
			trace_block_plug(q);

		blk_mq_put_ctx(data.ctx);

		if (request_count >= BLK_MAX_REQUEST_COUNT) {
			blk_flush_plug_list(plug, false);
			trace_block_plug(q);
		}

		list_add_tail(&rq->queuelist, &plug->mq_list);
		^^

Dave, is this where we're crashing?  This seems strange.
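
For reference, the check behind the "list_add corruption" warning can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel's lib/list_debug.c: the names list_add_valid() and list_add_tail_checked() are made up for illustration, and the real code warns rather than returning a bool. It only shows the invariant being tested: before inserting between prev and next, the two neighbours must still point at each other.

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal doubly linked list node, modelled on the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Hypothetical helper: verify that prev and next are still adjacent before
 * an insert. Returns false and prints a message if either link is stale.
 */
static bool list_add_valid(struct list_head *prev, struct list_head *next)
{
	if (next->prev != prev) {
		fprintf(stderr,
			"list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
			(void *)prev, (void *)next->prev, (void *)next);
		return false;
	}
	if (prev->next != next) {
		fprintf(stderr,
			"list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
			(void *)next, (void *)prev->next, (void *)prev);
		return false;
	}
	return true;
}

/* Hypothetical helper: checked tail insert; refuses to touch a broken list. */
static bool list_add_tail_checked(struct list_head *entry,
				  struct list_head *head)
{
	struct list_head *prev = head->prev;
	struct list_head *next = head;

	if (!list_add_valid(prev, next))
		return false;

	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
	return true;
}
```

In this model, the message in the trace above corresponds to the second branch: the last entry on the list (prev) no longer points back at the list head (next), which is what you would expect if that entry was freed, reused or overwritten while still linked.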

-chris


Re: btrfs bio linked list corruption.

2016-10-13 Thread Chris Mason

On 10/13/2016 02:16 PM, Dave Jones wrote:

On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
 > On 10/12/2016 10:40 AM, Dave Jones wrote:
 > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
 > >  > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 > >  >  >
 > >  >  >
 > >  >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
 > >  >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 > >  >  > >
 > >  >  > > [ cut here ]
 > >  >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 > >  >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 > >  >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
 > >  >  > >  c9d87458 8d32007c c9d874a8 
 > >  >  > >  c9d87498 8d07a6c1 00210246 88050388e880
 > >  >
 > >  > I hit this again overnight, it's the same trace, the only difference
 > >  > being slightly different addresses in the list pointers:
 > >  >
 > >  > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
 > >  >
 > >  > I'm actually a little surprised that ->next was the same across two
 > >  > reboots on two different kernel builds.  That might be a sign this is
 > >  > more repeatable than I'd thought, even if it does take hours of runtime
 > >  > right now to trigger it.  I'll try and narrow the scope of what trinity
 > >  > is doing to see if I can make it happen faster.
 > >
 > > .. and of course the first thing that happens is a completely different
 > > btrfs trace..
 > >
 > >
 > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
 > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 > >  c900019076a8 b731ff3c  
 > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 > >  0801 880501cfa2a8 008a 008a
 >
 > This isn't even IO.  Uuug.  We're going to need a fast enough test
 > that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
Mix them together though, and something goes awry.



Hasn't triggered here yet.  I'll leave it running though.

-chris


Re: btrfs bio linked list corruption.

2016-10-12 Thread Chris Mason

On 10/12/2016 10:40 AM, Dave Jones wrote:

On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
 > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 >  >
 >  >
 >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
 >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 >  > >
 >  > > [ cut here ]
 >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
 >  > >  c9d87458 8d32007c c9d874a8 
 >  > >  c9d87498 8d07a6c1 00210246 88050388e880
 >
 > I hit this again overnight, it's the same trace, the only difference
 > being slightly different addresses in the list pointers:
 >
 > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
 >
 > I'm actually a little surprised that ->next was the same across two
 > reboots on two different kernel builds.  That might be a sign this is
 > more repeatable than I'd thought, even if it does take hours of runtime
 > right now to trigger it.  I'll try and narrow the scope of what trinity
 > is doing to see if I can make it happen faster.

.. and of course the first thing that happens is a completely different
btrfs trace..


WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 c900019076a8 b731ff3c  
 c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 0801 880501cfa2a8 008a 008a


This isn't even IO.  Uuug.  We're going to need a fast enough test 
that we can bisect.


-chris


Re: [PATCH] btrfs: limit async_work allocation and worker func duration

2016-12-13 Thread Chris Mason

On 12/12/2016 03:35 PM, Maxim Patlasov wrote:

On 12/12/2016 06:54 AM, David Sterba wrote:

Since we don't have any NO_THRESHOLD users of
btrfs_workqueue_normal_congested for now, I tend to think it's better to
add a descriptive comment and simply return "false" from
btrfs_workqueue_normal_congested rather than trying to address some
future needs now. Please see v2 of the patch.



Thanks, I've got v2 and added a cc for stable to v3.15+, which isn't 
exactly right, but it's when the new workqueue system was put in place.


-chris


Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 02:39 AM, Michal Hocko wrote:

[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]

Of course, none of these are workloads that are new / special in any
way - prior to 4.8, I never experienced any issues doing the exact
same things.

Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 
4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook 
PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
Dec 15 19:02:18 teela kernel:  eff0b604 c142bcce eff0b734  eff0b634 
c1163332  0292
Dec 15 19:02:18 teela kernel:  eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 
e7fa2900 c1b58785 eff0b734
Dec 15 19:02:18 teela kernel:  eff0b678 c110795f c1043895 eff0b664 c11075c7 
0007  
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel:  [] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel:  [] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel:  [] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel:  [] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel:  [] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel:  [] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel:  [] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel:  [] __alloc_pages_nodemask+0xbfb/0xc80
Dec 15 19:02:18 teela kernel:  [] pagecache_get_page+0xad/0x270
Dec 15 19:02:18 teela kernel:  [] alloc_extent_buffer+0x116/0x3e0
Dec 15 19:02:18 teela kernel:  [] 
btrfs_find_create_tree_block+0xe/0x10
Dec 15 19:02:18 teela kernel:  [] btrfs_alloc_tree_block+0x1ef/0x5f0
Dec 15 19:02:18 teela kernel:  [] __btrfs_cow_block+0x143/0x5f0
Dec 15 19:02:18 teela kernel:  [] btrfs_cow_block+0x13a/0x220
Dec 15 19:02:18 teela kernel:  [] btrfs_search_slot+0x1d1/0x870
Dec 15 19:02:18 teela kernel:  [] btrfs_lookup_file_extent+0x4d/0x60
Dec 15 19:02:18 teela kernel:  [] __btrfs_drop_extents+0x176/0x1070
Dec 15 19:02:18 teela kernel:  [] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel:  [] ? start_transaction+0x65/0x4b0
Dec 15 19:02:18 teela kernel:  [] ? __kmalloc+0x147/0x1e0
Dec 15 19:02:18 teela kernel:  [] cow_file_range_inline+0x215/0x6b0
Dec 15 19:02:18 teela kernel:  [] cow_file_range.isra.49+0x55c/0x6d0
Dec 15 19:02:18 teela kernel:  [] ? lock_extent_bits+0x75/0x1e0
Dec 15 19:02:18 teela kernel:  [] run_delalloc_range+0x441/0x470
Dec 15 19:02:18 teela kernel:  [] 
writepage_delalloc.isra.47+0x144/0x1e0
Dec 15 19:02:18 teela kernel:  [] __extent_writepage+0xd8/0x2b0
Dec 15 19:02:18 teela kernel:  [] extent_writepages+0x25c/0x380
Dec 15 19:02:18 teela kernel:  [] ? btrfs_real_readdir+0x610/0x610
Dec 15 19:02:18 teela kernel:  [] btrfs_writepages+0x1f/0x30
Dec 15 19:02:18 teela kernel:  [] do_writepages+0x15/0x40
Dec 15 19:02:18 teela kernel:  [] __writeback_single_inode+0x35/0x2f0
Dec 15 19:02:18 teela kernel:  [] writeback_sb_inodes+0x16e/0x340
Dec 15 19:02:18 teela kernel:  [] wb_writeback+0xaa/0x280
Dec 15 19:02:18 teela kernel:  [] wb_workfn+0xd8/0x3e0
Dec 15 19:02:18 teela kernel:  [] process_one_work+0x114/0x3e0
Dec 15 19:02:18 teela kernel:  [] worker_thread+0x2f/0x4b0
Dec 15 19:02:18 teela kernel:  [] ? create_worker+0x180/0x180
Dec 15 19:02:18 teela kernel:  [] kthread+0x97/0xb0
Dec 15 19:02:18 teela kernel:  [] ? __kthread_parkme+0x60/0x60
Dec 15 19:02:18 teela kernel:  [] ret_from_fork+0x1b/0x28
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
   active_file:274324 inactive_file:281962 
isolated_file:0


OK, so there is still some anonymous memory that could be swapped out
and quite a lot of page cache. This might be harder to reclaim because
the allocation is a GFP_NOFS request which is limited in its reclaim
capabilities. It might be possible that those pagecache pages are pinned
in some way by the filesystem.


   unevictable:0 dirty:649 writeback:0 unstable:0
   slab_reclaimable:40662 slab_unreclaimable:17754
   mapped:7382 shmem:202 pagetables:351 bounce:0
   free:206736 free_pcp:332 free_cma:0
Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB 
active_file:1097296kB inactive_file:1127848kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB 
shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB 
writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB 
active_anon:0kB inactive_anon:0kB active_file:7316kB 

[GIT PULL] Btrfs

2016-12-16 Thread Chris Mason
23/-96)
btrfs: qgroup: Add comments explaining how btrfs qgroup works (+28/-0)

Robbie Ko (3) commits (+5/-6):
Btrfs: fix tree search logic when replaying directory entry deletes (+1/-2)
Btrfs: fix deadlock caused by fsync when logging directory entries (+2/-2)
Btrfs: fix enospc in hole punching (+2/-2)

Wang Xiaoguang (3) commits (+42/-7):
btrfs: cleanup: use already calculated value in 
btrfs_should_throttle_delayed_refs() (+1/-1)
btrfs: add necessary comments about tickets_id (+4/-0)
btrfs: improve delayed refs iterations (+37/-6)

Liu Bo (2) commits (+12/-6):
Btrfs: adjust len of writes if following a preallocated extent (+5/-3)
Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty (+7/-3)

Chris Mason (2) commits (+11/-8):
Revert "Btrfs: adjust len of writes if following a preallocated extent" 
(+3/-5)
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors (+8/-3)

Josef Bacik (2) commits (+29/-5):
Btrfs: abort transaction if fill_holes() fails (+17/-2)
Btrfs: fix file extent corruption (+12/-3)

Omar Sandoval (1) commits (+3/-3):
Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

Maxim Patlasov (1) commits (+19/-2):
btrfs: limit async_work allocation and worker func duration

Xiaoguang Wang (1) commits (+3/-10):
btrfs: remove useless comments

Adam Borowski (1) commits (+40/-3):
btrfs: make block group flags in balance printks human-readable

Nick Terrell (1) commits (+1/-0):
btrfs: Call kunmap if zlib_inflateInit2 fails

Christophe JAILLET (1) commits (+0/-2):
btrfs: remove redundant check of btrfs_iget return value

Domagoj Tršan (1) commits (+6/-6):
btrfs: change btrfs_csum_final result param type to u8

Shailendra Verma (1) commits (+6/-15):
btrfs: return early from failed memory allocations in ioctl handlers

Total: (77) commits (+5389/-5304)

 fs/btrfs/async-thread.c|   14 +
 fs/btrfs/async-thread.h|1 +
 fs/btrfs/backref.c |   10 +-
 fs/btrfs/check-integrity.c |  103 +--
 fs/btrfs/check-integrity.h |5 +-
 fs/btrfs/compression.c |  196 ++--
 fs/btrfs/compression.h |   12 +-
 fs/btrfs/ctree.c   |  495 +-
 fs/btrfs/ctree.h   |  241 ++---
 fs/btrfs/delayed-inode.c   |  147 ++-
 fs/btrfs/delayed-inode.h   |   21 +-
 fs/btrfs/delayed-ref.c |   20 +-
 fs/btrfs/delayed-ref.h |   14 +-
 fs/btrfs/dev-replace.c |   68 +-
 fs/btrfs/dev-replace.h |4 +-
 fs/btrfs/dir-item.c|   45 +-
 fs/btrfs/disk-io.c |  595 ++--
 fs/btrfs/disk-io.h |   34 +-
 fs/btrfs/export.c  |   10 +-
 fs/btrfs/extent-tree.c | 1551 ++--
 fs/btrfs/extent_io.c   |  112 ++-
 fs/btrfs/extent_io.h   |   17 +-
 fs/btrfs/file-item.c   |  207 ++---
 fs/btrfs/file.c|  249 ++---
 fs/btrfs/free-space-cache.c|  164 ++--
 fs/btrfs/free-space-cache.h|   12 +-
 fs/btrfs/free-space-tree.c |   44 +-
 fs/btrfs/inode-item.c  |   11 +-
 fs/btrfs/inode-map.c   |   22 +-
 fs/btrfs/inode.c   |  910 +--
 fs/btrfs/ioctl.c   |  603 +++--
 fs/btrfs/lzo.c |   17 +-
 fs/btrfs/ordered-data.c|   38 +-
 fs/btrfs/ordered-data.h|4 +-
 fs/btrfs/print-tree.c  |   19 +-
 fs/btrfs/print-tree.h  |4 +-
 fs/btrfs/props.c   |5 +-
 fs/btrfs/qgroup.c  |  299 +-
 fs/btrfs/qgroup.h  |   64 +-
 fs/btrfs/raid56.c  |   78 +-
 fs/btrfs/raid56.h  |8 +-
 fs/btrfs/reada.c   |   62 +-
 fs/btrfs/relocation.c  |  453 +-
 fs/btrfs/root-tree.c   |   28 +-
 fs/btrfs/scrub.c   |  181 ++--
 fs/btrfs/send.c|   33 +-
 fs/btrfs/super.c   |  138 ++-
 fs/btrfs/tests/btrfs-tests.c   |   13 +-
 fs/btrfs/tests/btrfs-tests.h   |4 +-
 fs/btrfs/tests/extent-buffer-tests.c   |7 +-
 fs/btrfs/tests/extent-io-tests.c   |7 +-
 fs/btrfs/tests/free-space-tests.c  |   18 +-
 fs/btrfs/tests/free-space-tree-tests.c |9 +-
 fs/btrfs/tests/inode-tests.c   |   16 +-
 fs/btrfs/tests/qgroup-tests.c  |   11 +-
 fs/btrfs/transaction.c |  615 +++--
 fs/btrfs/transaction.h |   29 +-
 fs/btrfs/tree-log.c|  202 +++--
 fs/btrfs/uuid-tree.c   |   23 +-
 fs/btrfs/volumes.c

Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 02:39 AM, Michal Hocko wrote:

[CC linux-mm and btrfs guys]

On Thu 15-12-16 23:57:04, Nils Holland wrote:
[...]

Of course, none of these are workloads that are new / special in any
way - prior to 4.8, I never experienced any issues doing the exact
same things.

Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0
Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 
4.9.0-gentoo #2
Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook 
PC/21F7, BIOS F.22 08/06/2014
Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
Dec 15 19:02:18 teela kernel:  eff0b604 c142bcce eff0b734  eff0b634 
c1163332  0292
Dec 15 19:02:18 teela kernel:  eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 
e7fa2900 c1b58785 eff0b734
Dec 15 19:02:18 teela kernel:  eff0b678 c110795f c1043895 eff0b664 c11075c7 
0007  
Dec 15 19:02:18 teela kernel: Call Trace:
Dec 15 19:02:18 teela kernel:  [] dump_stack+0x47/0x69
Dec 15 19:02:18 teela kernel:  [] dump_header+0x60/0x178
Dec 15 19:02:18 teela kernel:  [] ? ___ratelimit+0x86/0xe0
Dec 15 19:02:18 teela kernel:  [] oom_kill_process+0x20f/0x3d0
Dec 15 19:02:18 teela kernel:  [] ? has_capability_noaudit+0x15/0x20
Dec 15 19:02:18 teela kernel:  [] ? oom_badness.part.13+0xb7/0x130
Dec 15 19:02:18 teela kernel:  [] out_of_memory+0xd9/0x260
Dec 15 19:02:18 teela kernel:  [] __alloc_pages_nodemask+0xbfb/0xc80
Dec 15 19:02:18 teela kernel:  [] pagecache_get_page+0xad/0x270
Dec 15 19:02:18 teela kernel:  [] alloc_extent_buffer+0x116/0x3e0
Dec 15 19:02:18 teela kernel:  [] 
btrfs_find_create_tree_block+0xe/0x10
Dec 15 19:02:18 teela kernel:  [] btrfs_alloc_tree_block+0x1ef/0x5f0
Dec 15 19:02:18 teela kernel:  [] __btrfs_cow_block+0x143/0x5f0
Dec 15 19:02:18 teela kernel:  [] btrfs_cow_block+0x13a/0x220
Dec 15 19:02:18 teela kernel:  [] btrfs_search_slot+0x1d1/0x870
Dec 15 19:02:18 teela kernel:  [] btrfs_lookup_file_extent+0x4d/0x60
Dec 15 19:02:18 teela kernel:  [] __btrfs_drop_extents+0x176/0x1070
Dec 15 19:02:18 teela kernel:  [] ? kmem_cache_alloc+0xb7/0x190
Dec 15 19:02:18 teela kernel:  [] ? start_transaction+0x65/0x4b0
Dec 15 19:02:18 teela kernel:  [] ? __kmalloc+0x147/0x1e0
Dec 15 19:02:18 teela kernel:  [] cow_file_range_inline+0x215/0x6b0
Dec 15 19:02:18 teela kernel:  [] cow_file_range.isra.49+0x55c/0x6d0
Dec 15 19:02:18 teela kernel:  [] ? lock_extent_bits+0x75/0x1e0
Dec 15 19:02:18 teela kernel:  [] run_delalloc_range+0x441/0x470
Dec 15 19:02:18 teela kernel:  [] 
writepage_delalloc.isra.47+0x144/0x1e0
Dec 15 19:02:18 teela kernel:  [] __extent_writepage+0xd8/0x2b0
Dec 15 19:02:18 teela kernel:  [] extent_writepages+0x25c/0x380
Dec 15 19:02:18 teela kernel:  [] ? btrfs_real_readdir+0x610/0x610
Dec 15 19:02:18 teela kernel:  [] btrfs_writepages+0x1f/0x30
Dec 15 19:02:18 teela kernel:  [] do_writepages+0x15/0x40
Dec 15 19:02:18 teela kernel:  [] __writeback_single_inode+0x35/0x2f0
Dec 15 19:02:18 teela kernel:  [] writeback_sb_inodes+0x16e/0x340
Dec 15 19:02:18 teela kernel:  [] wb_writeback+0xaa/0x280
Dec 15 19:02:18 teela kernel:  [] wb_workfn+0xd8/0x3e0
Dec 15 19:02:18 teela kernel:  [] process_one_work+0x114/0x3e0
Dec 15 19:02:18 teela kernel:  [] worker_thread+0x2f/0x4b0
Dec 15 19:02:18 teela kernel:  [] ? create_worker+0x180/0x180
Dec 15 19:02:18 teela kernel:  [] kthread+0x97/0xb0
Dec 15 19:02:18 teela kernel:  [] ? __kthread_parkme+0x60/0x60
Dec 15 19:02:18 teela kernel:  [] ret_from_fork+0x1b/0x28
Dec 15 19:02:18 teela kernel: Mem-Info:
Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0
   active_file:274324 inactive_file:281962 
isolated_file:0


OK, so there is still some anonymous memory that could be swapped out
and quite a lot of page cache. This might be harder to reclaim because
the allocation is a GFP_NOFS request which is limited in its reclaim
capabilities. It might be possible that those pagecache pages are pinned
in some way by the filesystem.


Reading harder, it's possible those pagecache pages are all from the 
btree inode.  They shouldn't be pinned by btrfs, kswapd should be able 
to wander in and free a good chunk.  What btrfs wants to happen is for 
this allocation to sit and wait for kswapd to make progress.


-chris


[GIT PULL] Btrfs fixes

2017-01-13 Thread Chris Mason
Hi Linus,

Dave Sterba queued up a few fixes for btrfs.  I have them in my
for-linus-4.10 branch:

These are all over the place.  The tracepoint part of the pull fixes a
crash and adds a little more information to two tracepoints, while the
rest are good old fashioned fixes.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Liu Bo (5) commits (+34/-11):
Btrfs: adjust outstanding_extents counter properly when dio write is split 
(+9/-2)
Btrfs: add truncated_len for ordered extent tracepoints (+4/-0)
Btrfs: use down_read_nested to make lockdep silent (+2/-1)
Btrfs: add 'inode' for extent map tracepoint (+9/-5)
Btrfs: fix lockdep warning about log_mutex (+10/-3)

David Sterba (2) commits (+80/-69):
btrfs: fix crash when tracepoint arguments are freed by wq callbacks 
(+24/-13)
btrfs: make tracepoint format strings more compact (+56/-56)

Jeff Mahoney (2) commits (+4/-1):
btrfs: fix locking when we put back a delayed ref that's too new (+1/-1)
btrfs: fix error handling when run_delayed_extent_op fails (+3/-0)

Pan Bian (1) commits (+1/-3):
btrfs: return the actual error value from  from btrfs_uuid_tree_iterate

Total: (10) commits (+119/-84)

 fs/btrfs/async-thread.c  |  15 +++--
 fs/btrfs/extent-tree.c   |   8 ++-
 fs/btrfs/inode.c |  13 +++-
 fs/btrfs/tree-log.c  |  13 +++-
 fs/btrfs/uuid-tree.c |   4 +-
 include/trace/events/btrfs.h | 146 +++
 6 files changed, 117 insertions(+), 82 deletions(-)


Re: [Regression 4.7-rc1] btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl

2017-01-06 Thread Chris Mason

On 01/06/2017 12:22 PM, Joseph Salisbury wrote:

Hi Luke,

A kernel bug report was opened against Ubuntu [0].  This bug was fixed
by the following commit in v4.7-rc1:


commit 4c63c2454eff996c5e27991221106eb511f7db38

Author: Luke Dashjr 
Date:   Thu Oct 29 08:22:21 2015 +

btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in
btrfs_ioctl


However, this commit introduced a new regression.  With this commit
applied, "btrfs fi show" no longer works and the btrfs snapshot
functionality breaks.



I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?


This is working for me, could you please include an strace of the problem?

Thanks!

-chris



Re: OOM: Better, but still there on

2016-12-21 Thread Chris Mason

On Wed, Dec 21, 2016 at 12:16:53PM +0100, Michal Hocko wrote:

On Wed 21-12-16 20:00:38, Tetsuo Handa wrote:

One thing to note here, when we are talking about 32b kernel, things
have changed in 4.8 when we moved from the zone based to node based
reclaim (see b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a
per-node basis") and associated patches). It is possible that the
reporter is hitting some pathological path which needs fixing but it
might be also related to something else. So I am rather not trying to
blame 32b yet...


It might be interesting to put tracing on releasepage and see if btrfs 
is pinning pages around.  I can't see how 32bit kernels would be 
different, but maybe we're hitting a weird corner.


-chris



Re: OOM: Better, but still there on 4.9

2016-12-16 Thread Chris Mason

On 12/16/2016 05:14 PM, Michal Hocko wrote:

On Fri 16-12-16 13:15:18, Chris Mason wrote:

On 12/16/2016 02:39 AM, Michal Hocko wrote:

[...]

I believe the right way to go around this is to pursue what I've started
in [1]. I will try to prepare something for testing today for you. Stay
tuned. But I would be really happy if somebody from the btrfs camp could
check the NOFS aspect of this allocation. We have already seen
allocation stalls from this path quite recently


Just double checking, are you asking why we're using GFP_NOFS to avoid going
into btrfs from the btrfs writepages call, or are you asking why we aren't
allowing highmem?


I am more interested in the NOFS part. Why cannot this be a full
GFP_KERNEL context? What kind of locks we would lock up when recursing
to the fs via slab shrinkers?



Since this is our writepages call, any jump into direct reclaim would go 
to writepage, which would end up calling the same set of code to read 
metadata blocks, which would do a GFP_KERNEL allocation and end up back 
in writepage again.


We'd also have issues with blowing through transaction reservations 
since the writepage recursion would have to nest into the running 
transaction.


-chris



[GIT PULL] Btrfs

2017-03-23 Thread Chris Mason

Hi Linus

We have a small set of fixes for the next RC:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Zygo tracked down a very old bug with inline compressed extents.
I didn't tag this one for stable because I want to do individual tested
backports.  It's a little tricky and I'd rather do some extra testing
on it along the way.

Otherwise they are pretty obvious:

Liu Bo (1) commits (+2/-1):
   Btrfs: fix regression in lock_delalloc_pages

Dmitry V. Levin (1) commits (+0/-27):
   btrfs: remove btrfs_err_str function from uapi/linux/btrfs.h

Zygo Blaxell (1) commits (+14/-0):
   btrfs: add missing memset while reading compressed inline extents

Total: (3) commits (+16/-28)

fs/btrfs/extent_io.c   |  3 ++-
fs/btrfs/inode.c   | 14 ++
include/uapi/linux/btrfs.h | 27 ---
3 files changed, 16 insertions(+), 28 deletions(-)


[GIT PULL] Btrfs

2017-03-31 Thread Chris Mason
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c |  6 +++---
 fs/btrfs/qgroup.c| 10 +-
 fs/btrfs/send.c  |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)


Re: [PATCH] jump_label: Fix anonymous union initialization

2017-03-02 Thread Chris Mason

On 03/02/2017 04:42 PM, Steven Rostedt wrote:

On Thu, 2 Mar 2017 16:07:19 -0500
Jason Baron <jba...@akamai.com> wrote:


On 02/28/2017 11:32 AM, Boris Ostrovsky wrote:

Pre-4.6 gcc do not allow direct static initialization of members of
anonymous structs/unions. After commit 3821fd35b58d ("jump_label:
Reduce the size of struct static_key") STATIC_KEY_INIT_{TRUE|FALSE}
definitions cannot be compiled with those older compilers.

Placing initializers inside curved brackets works around this problem.

Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
 include/linux/jump_label.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 8e06d75..518020b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -166,10 +166,10 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
  */
 #define STATIC_KEY_INIT_TRUE   \
        { .enabled = { 1 }, \
-         .entries = (void *)JUMP_TYPE_TRUE }
+         { .entries = (void *)JUMP_TYPE_TRUE } }
 #define STATIC_KEY_INIT_FALSE  \
        { .enabled = { 0 }, \
-         .entries = (void *)JUMP_TYPE_FALSE }
+         { .entries = (void *)JUMP_TYPE_FALSE } }

 #else  /* !HAVE_JUMP_LABEL */




(Adding Steve to 'cc)

Thanks for the fix.

Reviewed-by: Jason Baron <jba...@akamai.com>


Funny, Chris pinged me on IRC telling me that jump labels broke with my
latest tree. And we discovered it was because of anonymous unions and
he was using an older compiler (4.4 or something). I didn't know how to
make it work, and we were just going to say "tough, jump labels are not
for 4.4". Although, didn't goto asm get added into 4.5? Did someone
backport it to the gcc 4.4 compilers? I believe 4.5 handles anonymous
unions.
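
For readers unfamiliar with the construct, here is a minimal userspace sketch of the initializer form the patch switches to. The names struct toy_key, TOY_TYPE_TRUE, TOY_TYPE_FALSE and the TOY_KEY_INIT_* macros are invented for illustration and are not the kernel's definitions. The sketch only shows that putting the union member's designator inside its own braces initializes the anonymous union as the next member in order, which is the form the patch description says older gcc accepts.

```c
#define TOY_TYPE_FALSE 0UL
#define TOY_TYPE_TRUE  1UL

/* Hypothetical stand-in for a struct that contains an anonymous union. */
struct toy_key {
    int enabled;
    union {
        unsigned long type;
        void *entries;
    };
};

/*
 * The inner braces initialize the anonymous union positionally (it is the
 * member that follows 'enabled'), so no designator has to reach through
 * the unnamed union from the outer initializer list.
 */
#define TOY_KEY_INIT_TRUE  { .enabled = 1, { .entries = (void *)TOY_TYPE_TRUE } }
#define TOY_KEY_INIT_FALSE { .enabled = 0, { .entries = (void *)TOY_TYPE_FALSE } }

struct toy_key toy_key_true = TOY_KEY_INIT_TRUE;
struct toy_key toy_key_false = TOY_KEY_INIT_FALSE;
```

Both objects are initialized statically, which is the case the STATIC_KEY_INIT_{TRUE|FALSE} macros have to support.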

Since the broken commit went through my tree, I'll take this patch.
I'm getting ready for another git pull request to Linus.



Compiled-by: Chris Mason <c...@fb.com>

-chris



[GIT PULL] Btrfs

2017-03-02 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has Btrfs round two.  These are mostly a continuation of Dave Sterba's
collection of cleanups, but Filipe also has some bug fixes and performance
improvements.

Nikolay Borisov (42) commits (+611/-579):
btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode (+14/-14)
btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode (+39/-38)
btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode (+10/-8)
btrfs: all btrfs_delalloc_release_metadata take btrfs_inode (+22/-19)
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode (+3/-4)
btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode (+7/-6)
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode (+3/-3)
btrfs: Make btrfs_orphan_release_metadata take btrfs_inode (+8/-8)
btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode (+7/-7)
btrfs: Make check_parent_dirs_for_sync take btrfs_inode (+14/-14)
btrfs: make btrfs_free_io_failure_record take btrfs_inode (+9/-7)
btrfs: Make btrfs_lookup_ordered_range take btrfs_inode (+19/-18)
btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode (+17/-16)
btrfs: make btrfs_print_data_csum_error take btrfs_inode (+8/-7)
btrfs: make btrfs_is_free_space_inode take btrfs_inode (+20/-19)
btrfs: make btrfs_set_inode_index_count take btrfs_inode (+8/-8)
btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode (+5/-5)
btrfs: Make clone_update_extent_map take btrfs_inode (+13/-14)
btrfs: Make btrfs_mark_extent_written take btrfs_inode (+6/-6)
btrfs: Make btrfs_drop_extent_cache take btrfs_inode (+30/-26)
btrfs: Make calc_csum_metadata_size take btrfs_inode (+12/-15)
btrfs: Make drop_outstanding_extent take btrfs_inode (+11/-12)
btrfs: Make btrfs_del_delalloc_inode take btrfs_inode (+7/-7)
btrfs: make btrfs_log_inode_parent take btrfs_inode (+24/-26)
btrfs: Make btrfs_set_inode_index take btrfs_inode (+13/-13)
btrfs: Make btrfs_clear_bit_hook take btrfs_inode (+25/-21)
btrfs: Make check_extent_to_block take btrfs_inode (+6/-5)
btrfs: make check_compressed_csum take btrfs_inode (+4/-5)
btrfs: Make btrfs_insert_dir_item take btrfs_inode (+7/-7)
btrfs: Make btrfs_log_all_parents take btrfs_inode (+5/-5)
btrfs: Make btrfs_i_size_write take btrfs_inode (+18/-19)
btrfs: make repair_io_failure take btrfs_inode (+12/-11)
btrfs: Make btrfs_orphan_add take btrfs_inode (+24/-22)
btrfs: make btrfs_orphan_del take btrfs_inode (+20/-20)
btrfs: make clean_io_failure take btrfs_inode (+15/-14)
btrfs: Make btrfs_add_nondir take btrfs_inode (+13/-9)
btrfs: make free_io_failure take btrfs_inode (+13/-11)
btrfs: Make check_can_nocow take btrfs_inode (+12/-10)
btrfs: Make btrfs_add_link take btrfs_inode (+26/-23)
btrfs: Make get_extent_t take btrfs_inode (+59/-54)
btrfs: Make hole_mergeable take btrfs_inode (+5/-4)
btrfs: Make fill_holes take btrfs_inode (+18/-19)

David Sterba (16) commits (+139/-124):
btrfs: use predefined limits for calculating maximum number of pages for compression (+6/-5)
btrfs: derive maximum output size in the compression implementation (+9/-14)
btrfs: merge nr_pages input and output parameter in compress_pages (+11/-15)
btrfs: merge length input and output parameter in compress_pages (+18/-20)
btrfs: add dummy callback for readpage_io_failed and drop checks (+10/-3)
btrfs: do proper error handling in btrfs_insert_xattr_item (+2/-1)
btrfs: drop checks for mandatory extent_io_ops callbacks (+3/-4)
btrfs: constify device path passed to relevant helpers (+22/-18)
btrfs: document existence of extent_io ops callbacks (+26/-11)
btrfs: handle allocation error in update_dev_stat_item (+2/-1)
btrfs: export compression buffer limits in a header (+15/-10)
btrfs: constify name of subvolume in creation helpers (+3/-3)
btrfs: constify buffers used by compression helpers (+3/-3)
btrfs: remove BUG_ON from __tree_mod_log_insert (+0/-2)
btrfs: constify input buffer of btrfs_csum_data (+3/-3)
btrfs: let writepage_end_io_hook return void (+6/-11)

Filipe Manana (8) commits (+163/-27):
Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled (+5/-0)
Btrfs: try harder to migrate items to left sibling before splitting a leaf (+7/-0)
Btrfs: fix assertion failure when freeing block groups at close_ctree() (+9/-6)
Btrfs: incremental send, fix unnecessary hole writes for sparse files (+86/-2)
Btrfs: fix use-after-free due to wrong order of destroying work queues (+7/-2)
Btrfs: incremental send, do not delay rename when parent inode is new (+16/-3)
Btrfs: fix data loss after truncate when using the no-holes feature (+6/-13)
Btrfs: bulk delete checksum items in the same leaf (+27/-1)

Robbie Ko (3) commits 

[GIT PULL] Btrfs

2017-04-14 Thread Chris Mason

Hi Linus

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch
we're testing/sending by hand for this release.


Liu Bo (3) commits (+11/-13):
   Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
   Btrfs: fix potential use-after-free for cloned bio (+1/-1)
   Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
   btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

fs/btrfs/inode.c   | 22 ++
fs/btrfs/super.c   |  3 +++
fs/btrfs/volumes.c |  2 +-
3 files changed, 14 insertions(+), 13 deletions(-)


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-11 Thread Chris Mason



On 08/10/2017 03:25 PM, Hugo Mills wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:


These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.


Could we please not add more mount options? I get that they're easy
to implement, but it's a very blunt instrument. What we tend to see
(with both nodatacow and compress) is people using the mount options,
then asking for exceptions, discovering that they can't do that, and
then falling back to doing it with attributes or btrfs properties.
Could we just start with btrfs properties this time round, and cut out
the mount option part of this cycle.

In the long run, it'd be great to see most of the btrfs-specific
mount options get deprecated and ultimately removed entirely, in
favour of attributes/properties, where feasible.



It's a good point, and as was commented later down I'd just do mount -o 
compress=zstd:3 or something.


But I do prefer properties in general for this.  My big point was just 
that next step is outside of Nick's scope.


-chris



Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as 
well.  The numbers were in line with what Nick is posting here.  zstd is 
a big win over both lzo and zlib from a btrfs point of view.


It's true Nick's patches only support a single compression level in 
btrfs, but that's because btrfs doesn't have a way to pass in the 
compression ratio.  It could easily be a mount option, it was just 
outside the scope of Nick's initial work.


-chris





Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 03:00 PM, Eric Biggers wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.



I am not surprised --- Zstandard is closer to the state of the art, both
format-wise and implementation-wise, than the other choices in BTRFS.  My point
is that benchmarks need to account for how much data is compressed at a time.
This is a common mistake when comparing different compression algorithms; the
algorithm name and compression level do not tell the whole story.  The
dictionary size is extremely significant.  No one is going to compress or
decompress a 200 MB file as a single stream in kernel mode, so it does not make
sense to justify adding Zstandard *to the kernel* based on such a benchmark.  It
is going to be divided into chunks.  How big are the chunks in BTRFS?  I thought
that it compressed only one page (4 KiB) at a time, but I hope that has been, or
is being, improved; 32 KiB - 128 KiB should be a better amount.  (And if the
amount of data compressed at a time happens to be different between the
different algorithms, note that BTRFS benchmarks are likely to be measuring that
as much as the algorithms themselves.)


Btrfs hooks the compression code into the delayed allocation mechanism 
we use to gather large extents for COW.  So if you write 100MB to a 
file, we'll have 100MB to compress at a time (within the limits of the 
amount of pages we allow to collect before forcing it down).


But we want to balance how much memory you might need to uncompress 
during random reads.  So we have an artificial limit of 128KB that we 
send at a time to the compression code.  It's easy to change this, it's 
just a tradeoff made to limit the cost of reading small bits.


It's the same for zlib, lzo and the new zstd patch.
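
As a rough illustration of what the 128KB limit means for a large write, the sketch below counts how many independently compressed pieces a range would be split into. The helper toy_compress_chunks and the constant TOY_MAX_COMPRESS_CHUNK are hypothetical names, not btrfs symbols, and "KB"/"MB" are read here as binary units (KiB/MiB), which is an assumption.

```c
#include <stdint.h>

/* Assumed from the message above: at most 128KB goes to the compressor at once. */
#define TOY_MAX_COMPRESS_CHUNK (128u * 1024u)

/* Number of separately compressed pieces for a range of 'len' bytes (rounds up). */
uint64_t toy_compress_chunks(uint64_t len)
{
    return (len + TOY_MAX_COMPRESS_CHUNK - 1) / TOY_MAX_COMPRESS_CHUNK;
}
```

Under these assumptions a 100MB write is handed to the compressor as 800 pieces, so a random read only has to decompress the one piece that covers it.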

-chris



Re: Moving ndctl development into the kernel tree?

2017-07-25 Thread Chris Mason

On 07/22/2017 02:49 PM, Dan Williams wrote:

On Fri, Jul 21, 2017 at 7:52 PM, Dan Williams  wrote:

[ adding Chris ]

On Fri, Jul 21, 2017 at 4:44 PM, Dan Williams  wrote:

On Fri, Jul 21, 2017 at 3:58 PM, Ingo Molnar  wrote:


* Dan Williams  wrote:


[...]

* Like perf, ndctl borrows the sub-command architecture and option
parsing from git. So, this code could be refactored into something
shared / generic, i.e. the bits in tools/perf/util/.


Just as a side note, stacktool (tools/stacktool/) is using the Git sub-command
and options parsing code as well, and it's already sharing it with perf, via
the tools/lib/subcmd/ library.

ndctl could use that as well.


Ah, nice, that refactoring happened about a year after ndctl was born.
Which brings up the next question about what to do with the git
history, but I'd want to know if ndctl is even welcome upstream before
digging any deeper.


I suspect this would be similar to what Chris did to merge btrfs while
retaining the standalone history. Chris, any pointers on what worked
well and what if anything you would do differently? I.e. I'm looking
to use git filter-branch to rewrite ndctl history as if it had always
been in tools/ndctl in the kernel tree. I found this old thread
https://lkml.org/lkml/2008/10/30/523 and it seems to also recommend
using an older kernel as the branch base.


So it wasn't as painful as I thought it would be, I just used the
script Linus recommended in that thread. Here is what I came up with
merging the last ndctl release on top of v4.9, and then applying the
pending development patches re-filtered to tools/ndctl:

 
https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=for-4.14/ndctl

...the next thing would be to rework the versioning to use the kernel
version and switch to using tools/lib/subcmd/.



I'd like to say I figured it all out back then, but the truth is that 
Linus held my hand the whole way.  My memory of it is that his script 
worked really well, I just ran that and verified the results.


-chris


Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

2017-04-25 Thread Chris Mason



On 04/25/2017 04:49 PM, Tejun Heo wrote:

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:

Will try that too.  I can't see why HT would change it because I see
single CPU queues misevaluated.  Just in case, you need to tune the
test params so that it doesn't load the machine too much and that
there are some non-CPU intensive workloads going on to perturb things
a bit.  Anyways, I'm gonna try disabling HT.


It's finickier but after changing the duty cycle a bit, it reproduces
w/ HT off.  I think the trick is setting the number of threads to the
number of logical CPUs and tune -s/-c so that p99 starts climbing up.
The following is from the root cgroup.


Since it's only measuring wakeup latency, schbench is best at exposing 
problems when the machine is just barely below saturated.  At 
saturation, everyone has to wait for the CPUs, and if we're relatively 
idle there's always a CPU to be found.


There's schbench -a to try and find this magic tipping point, but I 
haven't found a great way to automate for every kind of machine yet (sorry).


-chris


[GIT PULL] Btrfs

2017-04-27 Thread Chris Mason

Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise 
than we really want.  The larger fix for the underflow got delayed a bit 
and it's better for now to put it under CONFIG_BTRFS_DEBUG.


David Sterba (1) commits (+7/-4):
   btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

fs/btrfs/qgroup.c | 11 +++
1 file changed, 7 insertions(+), 4 deletions(-)


Re: [PATCH] btrfs: always write superblocks synchronously

2017-05-03 Thread Chris Mason



On 05/03/2017 04:36 AM, Jan Kara wrote:

On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed the REQ_SYNC flag from the WRITE_FUA implementation.
The REQ_FUA and REQ_FLUSH flags are stripped from submitted IO
when the disk doesn't have a volatile write cache, which effectively
makes the write async. This was seen to cause performance regressions
of up to 90% in disk IO related benchmarks such as reaim and
dbench[1].

Fix the problem by making sure the first superblock write is also
treated as synchronous since they can block progress of the
journalling (commit, log syncs) machinery and thus the whole filesystem.





Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
Cc: stable 
Cc: Jan Kara 
Signed-off-by: Davidlohr Bueso 


I wasn't patient enough and already sent the fix as part of my series
fixing other filesystems [1]. It also fixes one more place in btrfs that
needs REQ_SYNC to return to the original behavior.
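
The interaction described in the commit message can be modelled with a few bit flags. This is a toy userspace model only: the TOY_REQ_* values and both helper functions are invented for illustration, and the real flag definitions and the real stripping logic live in the block layer.

```c
/* Hypothetical bit values; these are not the kernel's flag definitions. */
enum {
    TOY_REQ_SYNC     = 1u << 0,
    TOY_REQ_FUA      = 1u << 1,
    TOY_REQ_PREFLUSH = 1u << 2,
};

/* Model of "FUA and PREFLUSH are treated as synchronous". */
int toy_is_sync(unsigned int flags)
{
    return (flags & (TOY_REQ_SYNC | TOY_REQ_FUA | TOY_REQ_PREFLUSH)) != 0;
}

/* Model of a device without a volatile write cache dropping the cache flags. */
unsigned int toy_strip_cache_flags(unsigned int flags, int has_volatile_cache)
{
    if (!has_volatile_cache)
        flags &= ~(unsigned int)(TOY_REQ_FUA | TOY_REQ_PREFLUSH);
    return flags;
}
```

In this model a write submitted with only TOY_REQ_FUA stops looking synchronous once the flag is stripped, while a write that also carries TOY_REQ_SYNC stays synchronous. That is the reason for setting the sync flag explicitly on superblock writes.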




Thanks guys.

-chris



[GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
 bdev_get_queue (+3/-4)
btrfs: check if the device is flush capable (+4/-0)
btrfs: delete unused member nobarriers (+0/-4)

Edmund Nadolski (2) commits (+25/-20):
btrfs: provide enumeration for __merge_refs mode argument (+13/-10)
btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10)

Goldwyn Rodrigues (2) commits (+24/-3):
btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1)
btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2)

Chris Mason (1) commits (+2/-2):
btrfs: fix the gfp_mask for the reada_zones radix tree

Adam Borowski (1) commits (+9/-3):
btrfs: fix a bogus warning when converting only data or metadata

Deepa Dinamani (1) commits (+2/-1):
btrfs: Use ktime_get_real_ts for root ctime

Dan Carpenter (1) commits (+15/-26):
Btrfs: handle only applicable errors returned by btrfs_get_extent

Dmitry V. Levin (1) commits (+2/-0):
MAINTAINERS: add btrfs file entries for include directories

Hans van Kranenburg (1) commits (+5/-5):
Btrfs: consistent usage of types in balance_args

Total: (71) commits

 MAINTAINERS  |   2 +
 fs/btrfs/backref.c   |  41 ++-
 fs/btrfs/btrfs_inode.h   |   7 +
 fs/btrfs/compression.c   |  18 +-
 fs/btrfs/ctree.c |  20 +-
 fs/btrfs/ctree.h |  34 +-
 fs/btrfs/delayed-inode.c |  46 +--
 fs/btrfs/delayed-inode.h |   6 +-
 fs/btrfs/delayed-ref.c   |   8 +-
 fs/btrfs/delayed-ref.h   |   8 +-
 fs/btrfs/dev-replace.c   |   9 +-
 fs/btrfs/disk-io.c   |  13 +-
 fs/btrfs/disk-io.h   |   4 +-
 fs/btrfs/extent-tree.c   |  35 +-
 fs/btrfs/extent_io.c |  59 +--
 fs/btrfs/extent_io.h |   8 +-
 fs/btrfs/extent_map.c|  10 +-
 fs/btrfs/extent_map.h|   3 +-
 fs/btrfs/file.c  |  82 -
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c | 289 +++
 fs/btrfs/ioctl.c |  33 +-
 fs/btrfs/ordered-data.c  |  20 +-
 fs/btrfs/ordered-data.h  |   2 +-
 fs/btrfs/qgroup.c| 102 ++
 fs/btrfs/qgroup.h|  51 ++-
 fs/btrfs/raid56.c|  38 +-
 fs/btrfs/reada.c |  37 +-
 fs/btrfs/root-tree.c |   3 +-
 fs/btrfs/scrub.c | 331 +++--
 fs/btrfs/send.c  |  23 +-
 fs/btrfs/super.c |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   1 -
 fs/btrfs/transaction.c   |  48 ++-
 fs/btrfs/transaction.h   |   6 +-
 fs/btrfs/tree-log.c  |   2 +-
 fs/btrfs/volumes.c   | 854 +++
 fs/btrfs/volumes.h   |   8 +-
 include/trace/events/btrfs.h | 187 +-
 include/uapi/linux/btrfs.h   |  10 +-
 40 files changed, 1629 insertions(+), 834 deletions(-)


Re: [GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
> 
> My for-linus-4.12 branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> for-linus-4.12

I hit send too soon, sorry.  There's a trivial conflict with our WARN_ON
fix that went into 4.11.  I pushed the resolution to
for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
qgroup->excl += sign * num_bytes;
qgroup->excl_cmpr += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup, num_bytes);
else
qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
WARN_ON(sign < 0 && qgroup->excl < num_bytes);
qgroup->excl += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup,
+   -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup,
  num_bytes);
else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  
qg = unode_aux_to_qgroup(unode);
  
+   trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
 -  if (WARN_ON(qg->reserved < num_bytes))
 +  if (qg->reserved < num_bytes)
report_reserved_underflow(fs_info, qg, num_bytes);
else
qg->reserved -= num_bytes;
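
The pattern that survives the conflict resolution is a plain underflow guard on an unsigned counter: compare before subtracting. A minimal standalone sketch follows. The function toy_release_reserved is a hypothetical name, and leaving the counter untouched on underflow is a choice made for this sketch; what report_reserved_underflow() does to the counter is not visible in the diff above.

```c
#include <stdint.h>

/*
 * Release 'num_bytes' from an unsigned reservation counter.
 * Returns 0 on success. Returns -1 if the subtraction would wrap below
 * zero, in which case this sketch leaves the counter unchanged.
 */
int toy_release_reserved(uint64_t *reserved, uint64_t num_bytes)
{
    if (*reserved < num_bytes)
        return -1;              /* would underflow: report instead of wrapping */
    *reserved -= num_bytes;
    return 0;
}
```

Without the comparison, subtracting from a u64 that is too small would silently wrap to a huge value rather than go negative.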


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-17 Thread Chris Mason

On 05/17/2017 06:53 AM, Peter Zijlstra wrote:

On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:

sched/fair, cpumask: Export for_each_cpu_wrap()



-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{



-   next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);



-}


OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
implementation. The above 'n+1' should be 'n', and the effect is that
it'll skip over CPUs, potentially resulting in an iteration that only
sees every other CPU (for a fully contiguous mask).
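
To make the effect of such an off-by-one concrete, here is a toy userspace model of a wrapping scan over a 64-bit mask. It is not the kernel's cpumask_next_wrap() and does not reproduce its exact control flow; the helpers toy_next_set_bit and toy_count_visited_wrap and the 'extra' parameter are invented for illustration. With extra = 0 the scan resumes right after the last hit; with extra = 1 it resumes one position further on, which is the kind of skipping described above.

```c
#include <stdint.h>

/* Return the first set bit at index >= from, or nbits if there is none. */
int toy_next_set_bit(uint64_t mask, int nbits, int from)
{
    for (int i = from; i < nbits; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return nbits;
}

/*
 * Count the set bits visited by a scan that begins at 'start', runs to the
 * end of the mask, wraps to bit 0 and stops on reaching 'start' again.
 * 'extra' is the number of positions skipped after each hit.
 */
int toy_count_visited_wrap(uint64_t mask, int nbits, int start, int extra)
{
    int visited = 0;
    int pos = start;            /* next index to examine */
    int wrapped = 0;

    for (;;) {
        int bit = toy_next_set_bit(mask, nbits, pos);

        if (bit >= nbits) {
            if (wrapped)
                break;          /* second pass exhausted */
            wrapped = 1;
            pos = 0;            /* wrap around to the beginning */
            continue;
        }
        if (wrapped && bit >= start)
            break;              /* back at the starting point */
        visited++;
        pos = bit + 1 + extra;
    }
    return visited;
}
```

For a fully set 8-bit mask and start = 3, the exact scan visits all 8 bits, while the skipping variant visits only 5 (bits 3, 5, 7, 0 and 2).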

This in turn causes hackbench to further suffer from the regression
introduced by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

So it's well past time to fix this.

Where the old scheme was a cliff-edge throttle on idle scanning, this
introduces a more gradual approach. Instead of stopping to scan
entirely, we limit how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

I'm also hoping it will fix/preserve kitsunyan's interactivity issue.

Please test..


We'll get some tests going here too.

-chris


[GIT PULL] Btrfs

2017-06-10 Thread Chris Mason
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.12

We've been hitting an early enospc problem on production machines that
Omar tracked down to an old int->u64 mistake.  I waited a bit on
this pull to make sure it was really the problem from production,
but it's on ~2100 hosts now and I think we're good.
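
The general class of mistake (a byte count passing through a variable that is too narrow) can be shown in a few lines. This sketch does not reproduce the btrfs code path in question, and toy_store_in_u32 is a hypothetical helper; it only demonstrates that a 64-bit value stored in a 32-bit variable keeps just its low 32 bits.

```c
#include <stdint.h>

/* Round-trip a 64-bit byte count through a 32-bit variable. */
uint64_t toy_store_in_u32(uint64_t bytes)
{
    uint32_t narrow = (uint32_t)bytes;  /* high 32 bits are discarded here */
    return narrow;
}
```

A count of 5GiB comes back as 1GiB, so any accounting built on the narrow value drifts away from the real total once sizes exceed 4GiB.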

Omar also noticed a commit in the queue would make new early ENOSPC
problems.  I pulled that out for now, which is why the top three commits
are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've
been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/dir-item.c|   2 +-
 fs/btrfs/disk-io.c |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c   |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-06-09 Thread Chris Mason

On 06/06/2017 05:21 AM, Peter Zijlstra wrote:

On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:

On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:

On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:


Please test..


Results are still coming in but things do look better with your patch
applied.

It does look like there's a regression when running hackbench in
process mode and when the CPUs are not fully utilised, e.g. check this
out:


This turned out to be a false positive; your patch improves things as
far as I can see.


Hooray, I'll move it to a part of the queue intended for merging.


It's a little late, but Roman Gushchin helped get some runs of this with 
our production workload.  The patch is ever so slightly better.


Thanks!

-chris



Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-09 Thread Chris Mason
Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  
The election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


Reminder: Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-16 Thread Chris Mason

Hello everyone,

Quick update on the TAB elections, we have 5 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso

The elections are next week, please feel free to contact me if you have 
any questions about the TAB.


-

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  The
election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


[GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the 
open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
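
As a rough cross-check of the MB/s and Adj MB/s columns, here is a minimal Python sketch. It assumes the numerator is the 211,988,480 B size of `silesia.tar` quoted in the text (the table values appear to match that size, not the 1,536,217,008 B written in the formula) and that MB means 10^6 bytes. The function names are made up for illustration.

```python
INPUT_SIZE_B = 211_988_480  # size of silesia.tar as stated in the text (assumed numerator)


def mb_per_s(seconds, size_b=INPUT_SIZE_B):
    """Raw throughput, including the time to copy from userland."""
    return size_b / seconds / 1e6


def adjusted_mb_per_s(seconds, baseline_seconds, size_b=INPUT_SIZE_B):
    """Throughput after subtracting the copy-only ('none') time."""
    return size_b / (seconds - baseline_seconds) / 1e6


none_time = 0.100   # 'none' row
zstd1_time = 1.044  # 'zstd -1' row
print(f"{mb_per_s(none_time):.2f}")                       # 2119.88
print(f"{mb_per_s(zstd1_time):.2f}")                      # 203.05
print(f"{adjusted_mb_per_s(zstd1_time, none_time):.2f}")  # 224.56
```

Subtracting the `none` time removes the cost of the userland copy, which is why the adjusted figure is always a bit higher than the raw one.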

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (4) commits (+14578/-12):
    crypto: Add zstd support (+356/-0)
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (5) commits (+14756/-12)

Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason



On 09/08/2017 03:33 PM, Chris Mason wrote:

Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the
open source zstd userland code.


Just to clarify, we've been testing the kernel side of this here at FB, 
but our zstd use in prod is limited to the application side.


-chris


Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason

On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:

On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:


 crypto/Kconfig   |    9 +
 crypto/Makefile  |    1 +
 crypto/testmgr.c |   10 +
 crypto/testmgr.h |   71 +
 crypto/zstd.c    |  265


Is there anyone going to use zstd through the crypto API? If not
then I don't see the point in adding it at this point.  Especially
as the compression API is still in a state of flux.


That part was requested by Intel, but I'm happy to leave it out for
another time.  The rest of the patch series doesn't depend on it at all.


-chris


[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)

2017-09-11 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert
and Phillip, we decided to send the whole thing in as one pull request.

Herbert had asked about the crypto patch when we discussed the pull, but
I didn't realize he really meant not-right-now.  I've rebased it out of
this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week.
Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo,
and the compression team here at FB has gotten great results using it in
production.  Nick will continue to update the kernel side with new
improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
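
The Ratio column is presumably the uncompressed size divided by the compressed size. A small Python sketch under that assumption, using the 211,988,480 B input size from the text and a few compressed sizes copied from the table (the helper name is hypothetical):

```python
ORIGINAL_SIZE_B = 211_988_480  # size of silesia.tar as stated in the text


def compression_ratio(compressed_size_b, original_size_b=ORIGINAL_SIZE_B):
    """Uncompressed size divided by compressed size."""
    return original_size_b / compressed_size_b


# Compressed sizes (bytes) copied from the table above.
sizes = {
    "zstd -1": 73_645_762,
    "zstd -19": 54_014_593,
    "zlib -1": 77_260_026,
    "zlib -9": 67_613_382,
}
for method, size in sizes.items():
    print(f"{method:8s} {compression_ratio(size):.3f}")
```

Read this way, the lowest zstd level already compresses slightly better than the lowest zlib level while taking roughly a third of the time in the table.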

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (3) commits (+14222/-12):
    btrfs: Add zstd support (+468/-12)
    lib: Add zstd modules (+13014/-0)
    lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
    squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig       |    2 +
 fs/btrfs/Makefile      |    2 +-
 fs/btrfs/compression.c |    1 +
 fs/btrfs/compression.h |    6 +-
 fs/btrfs/ctree.h       |    1 +
 fs/btrfs/disk-io.c     |    2 +
 fs/btrfs/ioctl.c       |    6 +-
 fs/btrfs/props.c       |    6 +
 fs/btrfs/super.c       |   12 +-
 fs/btrfs/sysfs.c       |    2 +
 fs/btrfs/zstd.c        |  432 ++
 fs/squashfs/Kconfig    |   14 +

Reminder v2: Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-22 Thread Chris Mason

Hello everyone,

Quick update on the TAB elections, we have 6 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso
Tim Bird

The elections are coming soon, please feel free to contact me if you 
have any questions about the TAB.


-

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  The
election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-30 Thread Chris Mason



On 11/30/2017 12:23 PM, David Sterba wrote:

On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:

On 11/29/2017 12:05 PM, Tejun Heo wrote:

On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:

Hello,

On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:

What has happened with this patch set?


No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
can you please route them through the btrfs tree?


lol looking at the patchset again, I'm not sure that's obviously the
right tree.  It can either be cgroup, block or btrfs.  If no one
objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't
have a problem with these going in through cgroup.  Dave, does this sound
good to you?


There are only minor changes to btrfs code so cgroup tree would be
better.


I'd like to include my patch to do all crcs inline (instead of handing
off to helper threads) when io controls are in place.  By the merge
window we should have some good data on how much it's all helping.


Are there any problems in sight if the inline crc and cgroup changes go
separately? I assume there's a runtime dependency, not a code
dependency, so it could be sorted by the right merge order.



The feature is just more useful with the inline crcs.  Without them we 
end up with kworkers doing both high and low prio submissions and it all 
boils down to the speed of the lowest priority.
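
To illustrate the point about shared helper threads, here is a toy Python model. It is not btrfs code. It assumes a single helper that handles submissions in arrival order, and the costs are made-up numbers standing in for how long a throttled cgroup's submission takes. All names are hypothetical.

```python
def shared_worker_finish_times(submissions):
    """One helper thread handles every submission in arrival order (FIFO)."""
    clock = 0.0
    finish = {}
    for owner, cost in submissions:
        clock += cost  # the helper stays busy until this submission completes
        finish[owner] = clock
    return finish


def inline_finish_times(submissions):
    """Each submitter does the work in its own context and pays only its own cost."""
    return {owner: cost for owner, cost in submissions}


# Hypothetical costs: the throttled low-priority submission takes 10 time
# units, the high-priority one takes 1, and the low-priority one arrives first.
submissions = [("low_prio", 10.0), ("high_prio", 1.0)]
print(shared_worker_finish_times(submissions))  # {'low_prio': 10.0, 'high_prio': 11.0}
print(inline_finish_times(submissions))         # {'low_prio': 10.0, 'high_prio': 1.0}
```

In the shared case the high-priority submission finishes only after the throttled one, which is the "speed of the lowest priority" effect; done inline, it no longer waits behind it.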


-chris



Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-29 Thread Chris Mason

On 11/29/2017 12:05 PM, Tejun Heo wrote:

On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:

Hello,

On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:

What has happened with this patch set?


No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
can you please route them through the btrfs tree?


lol looking at the patchset again, I'm not sure that's obviously the
right tree.  It can either be cgroup, block or btrfs.  If no one
objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't
have a problem with these going in through cgroup.  Dave, does this sound
good to you?


I'd like to include my patch to do all crcs inline (instead of handing 
off to helper threads) when io controls are in place.  By the merge 
window we should have some good data on how much it's all helping.


-chris



Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

2018-03-06 Thread Chris Mason

On 6 Mar 2018, at 11:12, Linus Torvalds wrote:

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov wrote:

As the first step in development of bpfilter project [1] the request_module()
code is extended to allow user mode helpers to be invoked. Idea is that
user mode helpers are built as part of the kernel build and installed as
traditional kernel modules with .ko file extension into distro specified
location, such that from a distribution point of view, they are no different
than regular kernel modules. Thus, allow request_module() logic to load such
user mode helper (umh) modules via:

[,,]

I like this, but I have one request: can we make sure that this action
is visible in the system messages?

When we load a regular module, at least it shows in lsmod afterwards,
although I have a few times wanted to really see module load as an
event in the logs too.

When we load a module that just executes a user program, and there is
no sign of it in the module list, I think we *really* need to make
that event show to the admin some way.

.. and yes, maybe we'll need to rate-limit the messages, and maybe it
turns out that I'm entirely wrong and people will hate the messages
after they get used to the concept of these pseudo-modules, but
particularly for the early implementation when this is a new thing, I
really want a message like

 executed user process xyz-abc as a pseudo-module

or something in dmesg.

I do *not* want this to be a magical way to hide things.


Especially early on, this makes a lot of sense.  But I wanted to plug 
bps and the hopefully growing set of bpf introspection tools:


https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt

Long term these are probably a good place to tell the admin what's going 
on.
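
For the rate-limited log message Linus asks for above, a minimal Python sketch of the idea follows. It models a simple "at most `burst` messages per `interval`" limiter. The class, the default values and the helper path are assumptions chosen for illustration, not the kernel's implementation; only the message text comes from Linus's mail.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval=5.0, burst=10):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.missed = 0  # messages suppressed so far

    def allow(self, now):
        # Start a new window on first use or once the interval has elapsed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False


def log_pseudo_module(limiter, path, now, log):
    """Append the dmesg-style line unless the limiter suppresses it."""
    if limiter.allow(now):
        log.append(f"executed user process {path} as a pseudo-module")


log = []
limiter = RateLimit(interval=5.0, burst=2)
for t in (0.0, 1.0, 2.0, 6.0):
    log_pseudo_module(limiter, "xyz-abc", t, log)
print(len(log), limiter.missed)  # 3 1
```

Keeping a count of suppressed messages means the admin can still be told that events were dropped, so the limiter does not itself become a way to hide things.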


-chris


Re: [PATCH 2/2] code-of-conduct: Strip the enforcement paragraph pending community discussion

2018-10-08 Thread Chris Mason

On 6 Oct 2018, at 17:37, James Bottomley wrote:

Significant concern has been expressed about the responsibilities outlined in
the enforcement clause of the new code of conduct.  Since there is concern
that this becomes binding on the release of the 4.19 kernel, strip the
enforcement clauses to give the community time to consider and debate how this
should be handled.


Even in the places where I don't agree with the discussion about what 
our code of conduct should be, I love that we're having it.  Removing 
the enforcement clause basically goes back to the way things were.  We'd 
be recognizing that we know issues happen, and explicitly stating that 
when serious events do happen, the community as a whole isn't committing 
to helping.


It's true there are a lot of questions about how the community resolves 
problems and holds each other accountable for maintaining any code of 
conduct.  I think the enforcement section leaves us the room we need to 
continue discussions and still make it clear that we're making an effort 
to shift away from the harsh discussions in the past.


-chris




Linux Foundation Technical Advisory Board Elections -- Call for nominations

2018-10-22 Thread Chris Mason



Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the 
interface between the kernel development community and the Linux 
Foundation. The TAB advises the Foundation on kernel-related matters, 
helps member companies learn to work with the community, and works to 
resolve community-related problems before they get out of hand.  We're 
also working with kernel maintainers to help refine the new code of 
conduct, and serving as the initial point of contact for code of conduct 
issues.


The board has ten members, one of whom sits on the Linux Foundation 
board of directors.


The election to select five TAB members will be held at the 2018 Kernel 
Summit in Vancouver, Canada.  The elections will take place at the 
conference center on Tuesday November 13th, at 5:30pm.


The election will be open to all attendees of all of the Linux 
Foundation events taking place that week in Vancouver.  Anyone is 
eligible to stand for election, simply send your nomination to:


tech-board-discuss at lists.linux-foundation.org

The deadline for receiving nominations is up until the beginning of the 
event where the election is held.


In past years, everyone running for the TAB has given a short speech 
before the voting began.  We've received feedback that the speeches add 
logistical complexity for the election, and may not be the best 
indicator of how well qualified someone is for the TAB.


Instead of speeches, this year we're asking candidates to include 
statements about why they would like to participate in the TAB.  These 
will be combined into a slideshow running during the election, and 
available via a public google doc at this location:


https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins, 
any statements must be received by Monday November 12th at 5PM Pacific, 
so that we have time to set up the slideshow.


Current TAB members, and their election year:

Chris Mason         2016
H. Peter Anvin      2016
Olof Johansson      2016
Rik van Riel        2016
Dan Williams        2016

Jon Corbet          2017
Greg Kroah-Hartman  2017
Steven Rostedt      2017
Ted Tso             2017
Tim Bird            2017

The five slots from 2016 are all up for election.  As always, please let 
us know if you have questions, and please do consider running.


Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.
The other five are halfway through their term and will be up for
election next year.


Linux Foundation Technical Advisory Board Elections -- Call for nominations

2018-11-04 Thread Chris Mason
Hello everyone,

Friendly reminder that the TAB elections are coming soon.

The Linux Foundation Technical Advisory Board (TAB) serves as the 
interface between the kernel development community and the Linux 
Foundation. The TAB advises the Foundation on kernel-related matters, 
helps member companies learn to work with the community, and works to 
resolve community-related problems before they get out of hand.  We're 
also working with kernel maintainers to help refine the new code of 
conduct, and serving as the initial point of contact for code of conduct 
issues.

The board has ten members, one of whom sits on the Linux Foundation 
board of directors.

The election to select five TAB members will be held at the 2018 Kernel 
Summit in Vancouver, Canada.  The elections will take place at the 
conference center on Tuesday November 13th, at 5:30pm.

The election will be open to all attendees of all of the Linux 
Foundation events taking place that week in Vancouver.  Anyone is 
eligible to stand for election, simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

The deadline for receiving nominations is up until the beginning of the 
event where the election is held.

In past years, everyone running for the TAB has given a short speech 
before the voting began.  We've received feedback that the speeches add 
logistical complexity for the election, and may not be the best 
indicator of how well qualified someone is for the TAB.

Instead of speeches, this year we're asking candidates to include 
statements about why they would like to participate in the TAB.  These 
will be combined into a slideshow running during the election, and 
available via a public google doc at this location:

https://goo.gl/rPEc2v

Even though the deadline for nominations is right before voting begins, 
any statements must be received by Monday November 12th at 5PM Pacific, 
so that we have time to set up the slideshow.

Current TAB members, and their election year:

Chris Mason         2016
H. Peter Anvin      2016
Olof Johansson      2016
Rik van Riel        2016
Dan Williams        2016

Jon Corbet          2017
Greg Kroah-Hartman  2017
Steven Rostedt      2017
Ted Tso             2017
Tim Bird            2017

The five slots from 2016 are all up for election.  As always, please let 
us know if you have questions, and please do consider running.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.
The other five are halfway through their term and will be up for
election next year.


Re: [PATCH] writepage method changes

2001-05-10 Thread Chris Mason



On Wednesday, May 09, 2001 10:51:17 PM -0300 Marcelo Tosatti
<[EMAIL PROTECTED]> wrote:

> 
> 
> On Wed, 9 May 2001, Marcelo Tosatti wrote:
> 
>> Locked for the "not wrote out case" (I will fix my patch now, thanks)
> 
> I just found out that there are filesystems (eg reiserfs) which write out
> data even if an error occurred, which means the unlocking must be done by
> the filesystems, always. 

I'm not horribly attached to the way reiserfs is doing it right now.  If
reiserfs writepage manages to map any blocks, it writes them to disk, even
if mapping other blocks in the page failed.  These are only data blocks, so
there are no special consistency rules.  If we need to change this, it is
not a big deal.
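
The behaviour described above (write whatever blocks could be mapped, even if mapping others failed) can be sketched in Python roughly as follows. This is a simplified model, not the reiserfs code; `map_block` and `submit_write` are hypothetical callbacks standing in for block mapping and I/O submission.

```python
def writepage(blocks, map_block, submit_write):
    """Write every block that can be mapped; return the first mapping error, if any."""
    first_error = None
    for block in blocks:
        try:
            disk_block = map_block(block)
        except OSError as err:
            # Remember the failure but keep going with the remaining blocks.
            if first_error is None:
                first_error = err
            continue
        submit_write(block, disk_block)
    return first_error


def demo_map(block):
    """Hypothetical mapper: block 2 cannot be mapped, the rest map to block + 100."""
    if block == 2:
        raise OSError("no space left to map block 2")
    return block + 100


written = []
error = writepage([0, 1, 2, 3], demo_map,
                  lambda blk, disk: written.append((blk, disk)))
print(written)            # [(0, 100), (1, 101), (3, 103)]
print(error is not None)  # True
```

The caller still sees an error for the page, but the data blocks that did map have already gone to disk, which is why the unlocking question raised above ends up with the filesystem.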

-chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [reiserfs-dev] Re: reiserfs, xfs, ext2, ext3

2001-05-13 Thread Chris Mason



On Friday, May 11, 2001 04:00:20 AM -0700 Hans Reiser <[EMAIL PROTECTED]>
wrote:

> Alan Cox wrote:
> 
>> > Are you referring to Neil Brown's nfs operations patch as being as
>> > ugly as hell, or something else?  Just want to understand what you are
>> > saying before arguing.
>> 
>> Andi has sent me some stuff to look at. He listed four implementations
>> and I've only seen two of them
> 
> did you see an implementation which adds operations to VFS and is written
> by Neil Brown (with reiserfs portions by Chris and Nikita)?

I coded up a mixture of Andi's 2.2.x apis and Neil's 2.4.x stuff and sent
it out for review a little while ago. It isn't as good as Neil's stuff, but
it doesn't require changing the other filesystems.  If it looked good to
the NFS guys and the other FS guys don't hate it, I'll push it around for
testing/inclusion.

This would be my preferred solution right now, since it could also work for
AFS.

-chris




Re: Reiserfs, Mongo and CPU question

2001-05-15 Thread Chris Mason



On Tuesday, May 15, 2001 01:41:01 PM +0200 Ricardo Galli <[EMAIL PROTECTED]>
wrote:

> Hans and reiserfs developers,
>   the same student of my university
> (http://www.cs.helsinki.fi/linux/linux-kernel/2001-18/0654.html) was
> carrying out the mongo benchmarks against reiser, xfs, jfs and ext2 for
> different base sizes.
> 
> 
> For example, for the base size of 10.000 (the average of a clean
> distribution is about 16.000 bytes) ReiserFS is even slower than ext2.
> I've realised the bottleneck may be the CPU, a Cyrix MII 233MHz.
> 

Would not surprise me, there's lots of room for improvement in reiserfs CPU
usage.  The 10k size is one of the worst cases for tail performance, those
numbers should increase if you mount with -o notail.

Here's a simple patch that should help on balance intensive apps (like
creates/deletes).  Please let me know if you see any difference with it.

-chris

diff -ur diff/linux/fs/reiserfs/fix_node.c linux/fs/reiserfs/fix_node.c
--- diff/linux/fs/reiserfs/fix_node.c   Mon Jan 15 18:31:19 2001
+++ linux/fs/reiserfs/fix_node.cFri Feb  2 15:40:54 2001
@@ -936,6 +936,7 @@
 if (p_s_tb->FEB[p_s_tb->cur_blknum])
   BUG();
 
+mark_buffer_journal_new(p_s_new_bh) ;
 p_s_tb->FEB[p_s_tb->cur_blknum++] = p_s_new_bh;
   }
 






Re: Re[2]: ReiserFS 2.4.4/3.x.0k-pre2

2001-05-15 Thread Chris Mason



On Tuesday, May 15, 2001 02:24:36 PM +0400 Samium Gromoff
<[EMAIL PROTECTED]> wrote:

>   Hello,
>  I`m still experiencing file tail corruptions
>   on subj.
>  And more: after i had restored bblocked patrition
>   (by relying on drive`s ability to remap bblks on
>   write by wroting small modification of debugreiserfs
>   which zeroified all bblks), i had _runtime_ tail
>corruptions of the mc`s dir hotlist which i tried 
>to rewrite again and again.
>   i found, that "sync"ing after modifying helps to keep
>   file fine, so it does until now.

Hmmm, are you sure the disk is good now?

What kinds of things are you doing on the files where you see tail
corruptions?  Can you reliably reproduce the corruption?

-chris




Re: Getting FS access events

2001-05-15 Thread Chris Mason



On Tuesday, May 15, 2001 04:33:57 AM -0400 Alexander Viro
<[EMAIL PROTECTED]> wrote:

> 
> 
> On Tue, 15 May 2001, Linus Torvalds wrote:
> 
>> Looks like there are 19 filesystems that use the buffer cache right now:
>> 
>>  grep -l bread fs/*/*.c | cut -d/ -f2 | sort -u | wc
>> 
>> So quite a bit of work involved.
> 
> Reiserfs... Dunno. They've got a private (slightly mutated) copy of
> ~60% of fs/buffer.c. 

But, putting the log and the metadata in the page cache makes memory
pressure and such cleaner, so this is one of my goals for 2.5.  reiserfs
will still have alias issues due to the packed tails (one copy in the
btree, another in the page), but it will be no worse than it is now.

-chris




Re: ReiserFs: Cosmetic problem in linux/Documentation/Changes[2.4.x]

2001-05-18 Thread Chris Mason



On Friday, May 18, 2001 01:26:01 PM +0200 "Martin.Knoblauch"
<[EMAIL PROTECTED]> wrote:

> "Martin.Knoblauch" wrote:
>> 
>> Hi,
>> 
>>  I submitted this a short while ago, only to realize later that the
>> subject line was not very informative. Sorry.
>> 
>>  As a suggestion: maybe the reiser-tools should support the common
>> -V/--version flag
>> 

Newer versions (at least 3.x.0j) have a -V.

-chris




[PATCH] improve reiserfs 2.4.x O_SYNC and fsync speed

2001-05-12 Thread Chris Mason


Hi guys,

This patch has been lightly tested, I'd appreciate it if some of you 
could try it out on data you don't care about.  The idea is to 
improve fsync and O_SYNC performance by only doing a commit on the last transaction 
the file was actually involved in.  The old code always forced a commit of the current 
transaction, which is just about the slowest possible choice (but easy to verify as 
correct ;-)

(2.2.x reiserfs already has similar optimizations)
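
A rough userspace model of the idea, not the patch itself: each inode
remembers the id of the last transaction it joined, and fsync forces a commit
only when that transaction is still open.  The helper names echo
reiserfs_update_inode_transaction and reiserfs_commit_for_inode from the diff
below, but the struct layout, the id scheme, and the function bodies are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy journal (hypothetical): transactions get increasing ids, and
 * everything up to last_committed is already safely on disk. */
struct journal {
    unsigned long current_id;      /* id of the running transaction */
    unsigned long last_committed;  /* highest id already committed */
    unsigned long commits_done;    /* how many commits were forced */
};

/* Toy inode (hypothetical): remembers the last transaction that touched it. */
struct toy_inode {
    unsigned long last_trans_id;   /* 0 means "never logged" */
};

/* Called whenever the inode joins the running transaction. */
static void update_inode_transaction(struct journal *j, struct toy_inode *ino)
{
    ino->last_trans_id = j->current_id;
}

/* Close the running transaction and start a new one. */
static void commit_running(struct journal *j)
{
    j->last_committed = j->current_id;
    j->current_id++;
    j->commits_done++;
}

/* fsync path: commit only if the inode's last transaction is still open.
 * Returns true if a commit was actually forced. */
static bool commit_for_inode(struct journal *j, struct toy_inode *ino)
{
    if (ino->last_trans_id == 0 || ino->last_trans_id <= j->last_committed)
        return false;              /* nothing of ours is pending */
    /* In this model only one transaction is ever open at a time. */
    assert(ino->last_trans_id == j->current_id);
    commit_running(j);
    return true;
}
```

Starting from a journal initialised as { 1, 0, 0 }, an fsync on an inode that
has not been logged since the last commit returns without forcing anything,
which is where the speedup over "always commit the current transaction" would
come from.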

The words I want to stress here are data_you_don't_care_about.  I'm looking
for benchmarks and impressions while I test here to make sure the logging rules are 
not being broken.

-chris
diff -Nru a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
--- a/fs/reiserfs/dir.c Mon Apr 30 12:45:15 2001
+++ b/fs/reiserfs/dir.c Mon Apr 30 12:45:15 2001
@@ -47,22 +47,10 @@
 };
 
 int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry, int datasync) {
-  int ret = 0 ;
-  int windex ;
-  struct reiserfs_transaction_handle th ;
-
   lock_kernel();
-
-  journal_begin(&th, dentry->d_inode->i_sb, 1) ;
-  windex = push_journal_writer("dir_fsync") ;
-  reiserfs_prepare_for_journal(th.t_super, SB_BUFFER_WITH_SB(th.t_super), 1) ;
-  journal_mark_dirty(&th, dentry->d_inode->i_sb, SB_BUFFER_WITH_SB (dentry->d_inode->i_sb)) ;
-  pop_journal_writer(windex) ;
-  journal_end_sync(&th, dentry->d_inode->i_sb, 1) ;
-
-  unlock_kernel();
-
-  return ret ;
+  reiserfs_commit_for_inode(dentry->d_inode) ;
+  unlock_kernel() ;
+  return 0 ;
 }
 
 
diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c
--- a/fs/reiserfs/file.cMon Apr 30 12:45:15 2001
+++ b/fs/reiserfs/file.cMon Apr 30 12:45:15 2001
@@ -50,6 +50,7 @@
 lock_kernel() ;
 down (&inode->i_sem); 
 journal_begin(&th, inode->i_sb, JOURNAL_PER_BALANCE_CNT * 3) ;
+reiserfs_update_inode_transaction(inode) ;
 
 #ifdef REISERFS_PREALLOCATE
 reiserfs_discard_prealloc (&th, inode);
@@ -83,10 +84,7 @@
  int datasync
  ) {
   struct inode * p_s_inode = p_s_dentry->d_inode;
-  struct reiserfs_transaction_handle th ;
   int n_err;
-  int windex ;
-  int jbegin_count = 1 ;
 
   lock_kernel() ;
 
@@ -95,14 +93,12 @@
 
   n_err = fsync_inode_buffers(p_s_inode) ;
   n_err |= fsync_inode_data_buffers(p_s_inode);
+
   /* commit the current transaction to flush any metadata
   ** changes.  sys_fsync takes care of flushing the dirty pages for us
   */
-  journal_begin(&th, p_s_inode->i_sb, jbegin_count) ;
-  windex = push_journal_writer("sync_file") ;
-  reiserfs_update_sd(&th, p_s_inode);
-  pop_journal_writer(windex) ;
-  journal_end_sync(&th, p_s_inode->i_sb,jbegin_count) ;
+  reiserfs_commit_for_inode(p_s_inode) ; 
+
   unlock_kernel() ;
   return ( n_err < 0 ) ? -EIO : 0;
 }
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Mon Apr 30 12:45:15 2001
+++ b/fs/reiserfs/inode.c   Mon Apr 30 12:45:15 2001
@@ -40,6 +40,7 @@
down (&inode->i_sem); 
 
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
windex = push_journal_writer("delete_inode") ;
 
reiserfs_delete_object (&th, inode);
@@ -281,6 +282,7 @@
   reiserfs_update_sd(th, inode) ;
   journal_end(th, s, len) ;
   journal_begin(th, s, len) ;
+  reiserfs_update_inode_transaction(inode) ;
 }
 
 // it is called by get_block when create == 0. Returns block number
@@ -604,6 +606,7 @@
  TYPE_ANY, 3/*key length*/);
 if ((new_offset + inode->i_sb->s_blocksize) >= inode->i_size) {
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
transaction_started = 1 ;
 }
  research:
@@ -628,6 +631,7 @@
if (!transaction_started) {
pathrelse(&path) ;
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
transaction_started = 1 ;
goto research ;
}
@@ -704,6 +708,7 @@
*/
pathrelse(&path) ;
journal_begin(&th, inode->i_sb, jbegin_count) ;
+   reiserfs_update_inode_transaction(inode) ;
transaction_started = 1 ;
goto research;
 }
@@ -1296,6 +1301,10 @@
 return ;
 }
 lock_kernel() ;
+
+/* this is really only used for atime updates, so they don't have
+** to be included in O_SYNC or fsync
+*/
 journal_begin(&th, inode->i_sb, 1) ;
 reiserfs_update_sd (&th, inode);
 journal_end(&th, inode->i_sb, 1) ;
@@ -1660,6 +1669,7 @@
 */
 prevent_flush_page_lock(page, p_s_inode) ;
 journal_begin(&th, p_s_inode->i_sb,  JOURNAL_PER_BALANCE_CNT * 2 ) ;
+reiserfs_update_inode_transaction(p_s_inode) ;
 windex = push_journal_writer("reiserfs_vfs_truncate_file") ;
 reiserfs_do_truncate (&th, p_s_inode, page, update_timestamps) ;
 pop_journal_writer(windex) ;
@@ -1708,6 +1718,7 @@
 lock_kernel() ;
 prevent_flush_page_lock(bh_result->b_page, inode) ;
 journal_begin(, inode->i_sb, 

Re: Dying disk and filesystem choice.

2001-05-25 Thread Chris Mason



On Friday, May 25, 2001 09:21:42 AM -0700 Hans Reiser <[EMAIL PROTECTED]>
wrote:
> No, our policy is strictly in sync with and reflective of that of the
> rest of the linux-kernel.  Since the ac series has a different policy, we
> can be different in regards to the ac series.  

Not really, our policy has been much more restrictive than the rest of the
kernel.  Look at the patches we didn't send in.

> 
> And I don't begin to comprehend your not sending in the lost disk space
> after crash bug fix (I assume it is what you mean when you refer to lost
> files after a crash, because I know of no lost files after a crash bug,
> please phrase things more carefully), and it really annoys me and the
> users, frankly.  Why you consider that a feature is beyond me.

The patch is a _huge_ change to the way files are deleted and truncated, to
what happens during mount, and to the way transactions work.  It is
effectively a format extension, and must be verified against both 2.2.x
kernels and 2.4.x kernels, in both disk formats.

Before I even consider introducing a change of this size, I want to be as
sure as I can the rest of the code is stable.  It is the only way we can
debug it and stay sane.  Even after I release the code, I won't want it in
an ac series for a while.  It does much more harm than good if it somehow
ruins compatibility with an older kernel, especially in 2.4.x.  

Yes, it is a bug fix.  But, it is a very different kind of bug fix than
something that corrupts files at random, or something that doesn't get
buffers to disk at the right time.  

I won't pretend the fix isn't important, but I won't allow larger changes
to ruin the progress we've made so far.

-chris




Re: Dying disk and filesystem choice.

2001-05-25 Thread Chris Mason



On Thursday, May 24, 2001 11:16:58 PM +0100 Alan Cox
<[EMAIL PROTECTED]> wrote:

>> IMHO we are not that deep into code freeze anymore. Freevxfs got added
>> in linux-2.4.5-pre*, so I think that a patch that adds a useful feature
>> like badblock support would be OK.
> 
> FreeVxFS changes precisely nothing in the behaviour of any other fs - its
> like adding a new driver.
> 
> Updating Reiserfs requires a lot more care because it has the potential to
> harm existing stable setups

This has been mostly covered, but just in case.  There are two different
freezes: one in the kernel, and one in reiserfs.  The reiserfs part isn't something
Alan or Linus have imposed on us, we just wanted to limit the reiserfs
changes as much as possible during the early kernel releases.

The end result is that some larger scale issues are unfixed (memory
pressure from VM, lost files after a crash), but we have been able to focus
on the critical hoses-my-files/crashes-my-box kinds of bugs.

-chris




Re: 2.4.5 Oops at boot

2001-05-30 Thread Chris Mason



On Wednesday, May 30, 2001 03:03:32 PM -0600 "D. Stimits"
<[EMAIL PROTECTED]> wrote:

[ snip ]

> RAMDISK: Compressed image found at block 0
> Freeing initrd memory: 249k freed
> VFS: Mounted root (ext2 filesystem).
> Red Hat nash version 3.0.10 starting
> VFS: Mounted root (ext2 filesystem) readonly.
> change_root: old root has d_count=2
> Trying to unmount old root ... <1>Unable to handle kernel NULL pointer
> dereference at virtual address 0010
>  printing eip:

Can't say for sure without the oops decoded through ksymoops, but this
looks like the oops in rd_ioctl fixed by 2.4.5-ac3 and higher.  I think the
following patch (taken from ac3) will be sufficient:

-chris

--- linux.vanilla/fs/block_dev.cSat May 26 16:53:17 2001
+++ linux.ac/fs/block_dev.c Mon May 28 16:10:59 2001
@@ -603,6 +602,7 @@
if (!bdev->bd_op->ioctl)
return -EINVAL;
inode_fake.i_rdev=rdev;
+   inode_fake.i_bdev=bdev;
init_waitqueue_head(&inode_fake.i_wait);
set_fs(KERNEL_DS);
res = bdev->bd_op->ioctl(&inode_fake, NULL, cmd, arg);




Re: reiserfs_read_inode2

2001-05-31 Thread Chris Mason



On Thursday, May 31, 2001 02:27:26 PM +0200 Lukasz Trabinski
<[EMAIL PROTECTED]> wrote:

> Hello
> 
> What it's means?
> 
> portraits:~# dmesg
> vs-13042: reiserfs_read_inode2: [2299 593873 0x0 SD] not found
> vs-13048: reiserfs_iget: bad_inode. Stat data of (2299 593873) not found
> vs-13042: reiserfs_read_inode2: [2299 593807 0x0 SD] not found
> vs-13048: reiserfs_iget: bad_inode. Stat data of (2299 593807) not found
> 
> 2.4.5 with lock_kernel/unlock patch,reiserfsprogs 3.x.0h, RH 7.1

In this case, it probably means you are serving NFS from that disk, which
needs extra patches.  Are you?

-chris




Re: NULL characters in file on ReiserFS again.

2001-05-31 Thread Chris Mason



On Thursday, May 31, 2001 03:33:06 PM +0400 Andrej Borsenkow
<[EMAIL PROTECTED]> wrote:

> This happened to me yesterday on kernel-2.4.4-6mdk (Mandrake cooker, based
> on 2.4.4-ac14), single reiser root filesystem, mounted with default
> options. Hardware - ASUS CUSL2 (i815e chipset), Fujitsu UDMA-4 drive.
> 
> I tried to change hostname and did not have the corresponding entry in
> /etc/hosts (or anywhere). As a tesult, startx hung starting X server; it
> was not possible to switch to alpha console or kill X server. I pressed
> reset and after reboot looked into /var/log/XFree86*log - and there were
> a bunch of ^@ there.
> 

There are two ways to get nulls in log files: reiserfs bugs, and a crash
before data blocks are flushed to disk.  You've probably hit the second.
Reiserfs only logs metadata, so it is possible for newly allocated data
blocks to have null bytes after a crash.

Patches are in progress to flush new data blocks before transaction commit.
I'm about to send out the first building block for this...
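
To make the ordering issue concrete, here is a toy userspace model (not
reiserfs code): a metadata-only commit records the new file size while the
data block is still zeroed on disk, whereas flushing the data block before the
commit avoids the nulls.  Every name and the one-block "disk" are assumptions
made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8

/* Toy disk (hypothetical): one data block plus the logged file size. */
struct toy_disk {
    char   data[BLOCK_SIZE];   /* newly allocated blocks start zeroed */
    size_t logged_size;        /* file size as recorded by the journal */
};

/* In-memory state that has not reached the disk yet. */
struct toy_file {
    char   dirty[BLOCK_SIZE];
    size_t size;
};

static void flush_data(struct toy_disk *d, const struct toy_file *f)
{
    memcpy(d->data, f->dirty, BLOCK_SIZE);
}

static void commit_metadata(struct toy_disk *d, const struct toy_file *f)
{
    d->logged_size = f->size;
}

/* Metadata-only logging: the commit does not wait for the data block. */
static void commit_metadata_only(struct toy_disk *d, const struct toy_file *f)
{
    commit_metadata(d, f);
}

/* Ordered variant: flush the new data block first, then commit. */
static void commit_ordered(struct toy_disk *d, const struct toy_file *f)
{
    flush_data(d, f);
    commit_metadata(d, f);
}

/* Count NUL bytes visible inside the logged file size after a "crash". */
static size_t nul_bytes_after_crash(const struct toy_disk *d)
{
    size_t i, n = 0;

    for (i = 0; i < d->logged_size; i++)
        if (d->data[i] == '\0')
            n++;
    return n;
}
```

If the machine dies right after commit_metadata_only(), the file has its new
length but reads back as NUL bytes; after commit_ordered() the same crash
leaves the written data visible.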

-chris





Re: [RFC] yet another knfsd-reiserfs patch

2001-06-01 Thread Chris Mason


> On Monday, April 23, 2001 10:45:14 AM -0400 Chris Mason <[EMAIL PROTECTED]> wrote:
> 
>> 
>> Hi guys,
>> 
>> This patch is not meant to replace Neil Brown's knfsd ops stuff, the 
>> goal was to whip up something that had a chance of getting into 2.4.x,
>> and that might be usable by the AFS guys too.  Neil's patch tries to 
>> address a bunch of things that I didn't, and looks better for the
>> long run.
>> 
> 

Updated to 2.4.5, with the nfs list cc'd this time in hopes of comments
or flames...

-chris

diff -Nru a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
--- a/fs/nfsd/nfsfh.c   Fri Jun  1 16:08:41 2001
+++ b/fs/nfsd/nfsfh.c   Fri Jun  1 16:08:41 2001
@@ -116,40 +116,12 @@
return error;
 }
 
-/* this should be provided by each filesystem in an nfsd_operations interface as
- * iget isn't really the right interface
- */
-static struct dentry *nfsd_iget(struct super_block *sb, unsigned long ino, __u32 generation)
+static struct dentry *dentry_from_inode(struct inode *inode) 
 {
-
-   /* iget isn't really right if the inode is currently unallocated!!
-* This should really all be done inside each filesystem
-*
-* ext2fs' read_inode has been strengthed to return a bad_inode if the inode
-*   had been deleted.
-*
-* Currently we don't know the generation for parent directory, so a generation
-* of 0 means "accept any"
-*/
-   struct inode *inode;
struct list_head *lp;
struct dentry *result;
-   inode = iget(sb, ino);
-   if (is_bad_inode(inode)
-   || (generation && inode->i_generation != generation)
-   ) {
-   /* we didn't find the right inode.. */
-   dprintk("fh_verify: Inode %lu, Bad count: %d %d or version  %u %u\n",
-   inode->i_ino,
-   inode->i_nlink, atomic_read(&inode->i_count),
-   inode->i_generation,
-   generation);
-
-   iput(inode);
-   return ERR_PTR(-ESTALE);
-   }
-   /* now to find a dentry.
-* If possible, get a well-connected one
+   /*
+* If possible, get a well-connected dentry
 */
spin_lock(&dcache_lock);
for (lp = inode->i_dentry.next; lp != &inode->i_dentry ; lp=lp->next) {
@@ -173,6 +145,92 @@
return result;
 }
 
+static struct inode *__inode_from_fh(struct super_block *sb, int ino,
+int generation) 
+{
+   struct inode *inode ;
+
+   inode = iget(sb, ino);
+   if (is_bad_inode(inode)
+   || (generation && inode->i_generation != generation)
+   ) {
+   /* we didn't find the right inode.. */
+   dprintk("fh_verify: Inode %lu, Bad count: %d %d or version  %u %u\n",
+   inode->i_ino,
+   inode->i_nlink, atomic_read(&inode->i_count),
+   inode->i_generation,
+   generation);
+
+   iput(inode);
+   return ERR_PTR(-ESTALE);
+   }
+   return inode ;
+}
+
+static struct inode *inode_from_fh(struct super_block *sb, 
+   __u32 *datap,
+   int len)
+{
+   if (sb->s_op->inode_from_fh)
+   return sb->s_op->inode_from_fh(sb, datap, len) ;
+   return __inode_from_fh(sb, datap[0], datap[1]) ;
+}
+
+static struct inode *parent_from_fh(struct super_block *sb, 
+   __u32 *datap,
+   int len)
+{
+   if (sb->s_op->parent_from_fh)
+   return sb->s_op->parent_from_fh(sb, datap, len) ;
+
+   if (len >= 3)
+   return __inode_from_fh(sb, datap[2], 0) ;
+   return ERR_PTR(-ESTALE);
+}
+
+/* 
+ * two iget funcs, one for inode, and one for parent directory
+ *
+ * this should be provided by each filesystem in an nfsd_operations interface as
+ * iget isn't really the right interface
+ *
+ * If the filesystem doesn't provide funcs to get inodes from datap,
+ * it must be: inum, generation, dir inum.  Length of 2 means the 
+ * dir inum isn't there.
+ *
+ * iget isn't really right if the inode is currently unallocated!!
+ * This should really all be done inside each filesystem
+ *
+ * ext2fs' read_inode has been strengthed to return a bad_inode if the inode
+ *   had been deleted.
+ *
+ * Currently we don't know the generation for parent directory, so a generation
+ * of 0 means "accept any"
+ */
+static struct dentry *nfsd_iget(struct super_block *sb, __u32 *datap, int len)
+{
+
+   struct inode *inode;
+
+   inode = inode_from_fh(sb, datap, len) ;
+   if (IS_ERR(inode)) 

Re: [2.4.5 and all ac-Patches] massive file corruption with reiser or NFS

2001-06-02 Thread Chris Mason



On Saturday, June 02, 2001 02:41:04 PM +0200 Andreas Hartmann
<[EMAIL PROTECTED]> wrote:

> On Saturday, 2 June 2001 12:52, Rasmus Bøg Hansen wrote:
>> On Sat, 2 Jun 2001, Andreas Hartmann wrote:
>>> I got massive file corruptions with the kernels mentioned in the
>>> subject. I can reproduce it every time.
>>
>> You cannot use NFS on reiserfs unless you apply the knfsd patch. Look at
>> www.namesys.com.
>
> Thank you very much for your advice.
> I tested your suggestion and run the machine without NFS-mounted devices
> - it seems to be working fine.
>
> Anyway - I'm wondering why I didn't get any problem until 2.4.4ac10 with
> this configuration without the appropriate patch on the client or on the
> server?

The problem only happens when the clients do an operation on a file that
has gone out of cache on the server.  Under light load, this might happen
very rarely.

You only need the patch on the server.

-chris




Re: [NFS] Re: [RFC] yet another knfsd-reiserfs patch

2001-06-02 Thread Chris Mason



On Saturday, June 02, 2001 12:19:59 AM +0200 Trond Myklebust
<[EMAIL PROTECTED]> wrote:

> 
> Hi Chris,
> 
> Do you really need the parent inode in the filehandle?
> 
> That screws rename up pretty badly, since the filehandle changes when
> you rename into a different directory. It means for instance that when
> I do
> 
> open(foo)
> mv foo bar/
> write (foo)
> close(foo)
> 
> then I have a pretty good chance of getting an ESTALE on the write()
> statement.
> 

Hmmm, didn't realize I had only answered this in private mail.

The patch doesn't change when the parent dir's ino is included in the
filehandle, it just adds wrappers for storing it and getting it out.

For ext2, the parent inum is only sent for files when the subtree checks
are turned on (_fh_update is unchanged if no fill_fh func is provided).  

The reiserfs one always puts the parent inum into the fh, but
find_fh_dentry only pulls it out for directories or subtree checks so I
didn't add the extra logic to the reiserfs fill_fh func.
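
For readers following along, a minimal userspace sketch of the default
filehandle layout described in the patch comments (inum, generation, then an
optional parent dir inum), with an optional per-filesystem hook taking
precedence.  The toy_* types, the flat inode table, and the len < 2 guard are
assumptions for illustration; only the datap[0]/datap[1]/datap[2] layout and
the "generation 0 means accept any" rule come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
    uint32_t ino;
    uint32_t generation;
};

struct toy_sb;
typedef const struct toy_inode *(*fh_lookup_fn)(const struct toy_sb *,
                                                const uint32_t *, int);

/* Toy superblock (hypothetical): a flat table of live inodes. */
struct toy_sb {
    const struct toy_inode *inodes;
    size_t                  count;
    fh_lookup_fn            inode_from_fh;   /* optional fs-specific hook */
};

/* Default lookup: match the inode number, and the generation unless it
 * is 0, which means "accept any". */
static const struct toy_inode *default_lookup(const struct toy_sb *sb,
                                              uint32_t ino, uint32_t gen)
{
    size_t i;

    for (i = 0; i < sb->count; i++) {
        const struct toy_inode *in = &sb->inodes[i];

        if (in->ino == ino && (gen == 0 || in->generation == gen))
            return in;
    }
    return NULL;   /* the caller would treat this as a stale handle */
}

/* Use the filesystem's own decoder when it provides one. */
static const struct toy_inode *inode_from_fh(const struct toy_sb *sb,
                                             const uint32_t *datap, int len)
{
    if (sb->inode_from_fh)
        return sb->inode_from_fh(sb, datap, len);
    if (len < 2)   /* guard added for the sketch */
        return NULL;
    return default_lookup(sb, datap[0], datap[1]);
}

/* The parent dir inum is only present when the handle has 3 words. */
static const struct toy_inode *parent_from_fh(const struct toy_sb *sb,
                                              const uint32_t *datap, int len)
{
    if (len >= 3)
        return default_lookup(sb, datap[2], 0);
    return NULL;
}
```

A handle whose generation no longer matches comes back as NULL here, which
corresponds to the ESTALE case in the real code; a two-word handle simply has
no parent to look up.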

-chris




Re: [2.4.5 and all ac-Patches] massive file corruption with reiser or NFS

2001-06-02 Thread Chris Mason



On Saturday, June 02, 2001 08:13:44 PM +0200 Andreas Hartmann
<[EMAIL PROTECTED]> wrote:

> On Saturday, 2 June 2001 18:42, you wrote:
>> On Saturday, June 02, 2001 02:41:04 PM +0200 Andreas Hartmann
>> <[EMAIL PROTECTED]> wrote:
>>> On Saturday, 2 June 2001 12:52, Rasmus Bøg Hansen wrote:
>>>> On Sat, 2 Jun 2001, Andreas Hartmann wrote:
>>>>> I got massive file corruptions with the kernels mentioned in the
>>>>> subject. I can reproduce it every time.
>>>>
>>>> You cannot use NFS on reiserfs unless you apply the knfsd patch.
>>>> Look at www.namesys.com.
>>>
>>> Thank you very much for your advice.
>>> I tested your suggestion and run the machine without NFS-mounted
>>> devices - it seems to be working fine.
>>>
>>> Anyway - I'm wondering why I didn't get any problem until 2.4.4ac10
>>> with this configuration without the appropriate patch on the client
>>> or on the server?
>>
>> The problem only happens when the clients do an operation on a file that
>> has gone out of cache on the server.  Under light load, this might happen
>> very rarely.
>
> The load didn't change. You can forget the load, it's very small. It's my
> private server and I'm always doing the same thing via NFS - compiling,
> e.g.  This has been working fine until 2.4.4.ac10, afterwards it has been
> broken.

Ok, there are two different problems here.  The patch you posted to l-k is
a generic NFS fix for 2.4.5.  ext2 would need this too.

If you are serving NFS from your reiserfs disk, you need an additional
patch on the server only (this is the one I was talking about).  Check out
the FAQ on www.namesys.com for all the details.

-chris




Re: [OOPS] 245ac7 - ncr53c8xx && reiserfs

2001-06-05 Thread Chris Mason



On Tuesday, June 05, 2001 03:00:40 PM -0400 Carlos E Gorges
<[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> I get some problems w/ 2.4.5-ac7, ncr53c8xx w/ 2.4.4-ac18 works fine.
> 
> I gave a small looked on problem  ..
> the problem apparently is w/ ncr53c8xx driver ( who accuses timeout ),
> and make reiserfs call BUG() :
> 

reiserfs does this when it fails to write metadata or log buffers;
continuing without a panic or readonly mount will result in FS corruption.  

A forced readonly mount is a much better solution, but I haven't had a
chance yet to make sure it safely prevents writeback of all metadata, and
cleans things up properly.
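
As a sketch of what "forced readonly" means in principle (a toy userspace
model, not the reiserfs implementation): the first failed metadata or log
write flips a flag, and every later write is refused instead of the
filesystem calling BUG().  The struct, the function name, and the TOY_* error
values are assumptions for illustration, and the hard part mentioned above,
safely stopping writeback of metadata that is already dirty, is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_EIO   5    /* stand-in for an I/O error code */
#define TOY_EROFS 30   /* stand-in for "read-only file system" */

/* Toy filesystem state (hypothetical). */
struct toy_fs {
    bool read_only;    /* set once a metadata/log write has failed */
    int  writes_done;  /* successful writes so far */
};

/* device_ok models whether the underlying device accepted the write. */
static int toy_write_metadata(struct toy_fs *fs, bool device_ok)
{
    if (fs->read_only)
        return -TOY_EROFS;     /* refuse all further writes */
    if (!device_ok) {
        fs->read_only = true;  /* degrade instead of panicking */
        return -TOY_EIO;
    }
    fs->writes_done++;
    return 0;
}
```

Once the flag is set, nothing new reaches the disk, so whatever was consistent
on disk before the failure stays that way.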

-chris





Re: [reiserfs-list] major security bug in reiserfs (may affect SuSE Linux)

2001-01-10 Thread Chris Mason



On Wednesday, January 10, 2001 12:38:34 PM -0500 Alexander Viro
<[EMAIL PROTECTED]> wrote:

> On Wed, 10 Jan 2001, Chris Mason wrote:
> 
>> In filldir, I don't like the line where we ((char *)dirent += reclen ;
>> If reclen is much larger than the buffer sent from userspace, I don't
>> see how we stay in bounds.
> 
>So? copy_to_user() and put_user() will refuse to scramble the
> kernel memory. IOW, dirent can be out of the userspace. Hell, user could
> call getdents() and pass it a kernel pointer. Try it and you'll see what
> happens.
> 

Ah thanks, that makes more sense.  But, copy_to_user is only working on
namelen bytes, and reclen is bigger than that.  So, who is checking the
value for the buf->current_dir pointer?

-chris




Re: Possible deadlock with ->writepaged version of flush_dirty_buffers() and 2.4.0

2001-01-11 Thread Chris Mason



On Wednesday, January 10, 2001 05:56:09 PM -0200 Marcelo Tosatti
<[EMAIL PROTECTED]> wrote:

> 
> Hi Chris,
> 
> It seems there is a possible deadlock condition with your patch which
> changes flush_dirty_buffers() to use ->writepage (something which we
> _definately_ want for 2.5). Take a look:
> 
Yes, good catch.

> 
> mark_buffer_dirty->balance_dirty->wakeup_bdflush->flush_dirty_buffers->
> writepage->block_write_full_page->__block_write_full_page->get_block->
> ext2_get_block->ext2_alloc_branch->
> 
>ext2_alloc_block->ext2_new_block->lock_super
>or 
>getblk()->lock_super
> 
> 
> I dont see any reason why this deadlock could'nt happen in practice now.
> 
It won't happen until someone other than fs/buffer.c starts marking ext2
pages dirty.  The normal file write path will make sure that any dirty
buffers are mapped, so the ext2_get_block code is never run.

> If I'm right, it will pretty nasty to fix this. One possible solution is
> to _never_ call mark_buffer_dirty() with the superblock lock held (ext2
> has a lot of places likes this right now)
> 

This is probably the best solution, since it is a good idea regardless of
my patch.
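
The pattern can be shown with a toy model (userspace, not ext2 code): the
dirtying step may trigger a flush that needs the superblock lock, so calling
it with the lock held self-deadlocks, while dropping the lock first does not.
All names here are hypothetical, and the lock is a plain flag that counts a
would-be deadlock instead of actually blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy superblock (hypothetical). */
struct toy_sb {
    bool locked;          /* non-recursive superblock lock */
    int  would_deadlock;  /* times a real lock would have blocked forever */
    int  flushes;         /* flushes that ran to completion */
};

/* Returns false where a real non-recursive lock would deadlock. */
static bool toy_lock_super(struct toy_sb *sb)
{
    if (sb->locked) {
        sb->would_deadlock++;
        return false;
    }
    sb->locked = true;
    return true;
}

static void toy_unlock_super(struct toy_sb *sb)
{
    assert(sb->locked);
    sb->locked = false;
}

/* Dirtying a buffer may kick off a flush, and the flush path may need
 * the superblock lock to allocate blocks. */
static void toy_mark_buffer_dirty(struct toy_sb *sb)
{
    if (!toy_lock_super(sb))
        return;           /* the flush could not proceed */
    sb->flushes++;
    toy_unlock_super(sb);
}

/* Risky pattern: dirty the buffer while still holding the lock. */
static void alloc_block_dirty_under_lock(struct toy_sb *sb)
{
    toy_lock_super(sb);
    toy_mark_buffer_dirty(sb);
    toy_unlock_super(sb);
}

/* Safer pattern: finish the locked work, drop the lock, then dirty. */
static void alloc_block_dirty_after_unlock(struct toy_sb *sb)
{
    toy_lock_super(sb);
    /* allocation bookkeeping would happen here */
    toy_unlock_super(sb);
    toy_mark_buffer_dirty(sb);
}
```

The second variant keeps the locked region free of anything that can re-enter
the flush path, which is the reason the rule is worth following independent of
the ->writepage change.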

-chris
 





generic_file_write change in 2.4.0-ac8

2001-01-12 Thread Chris Mason


Hi guys,

This code for generic_file_write calls vmtruncate without i_sem held.  Is
that intentional?  It should cause problems for reiserfs at least...

-chris

diff -u --new-file --recursive --exclude-from /usr/src/exclude
linux-2.4.0/mm/filemap.c linux.ac/mm/filemap.c
--- linux-2.4.0/mm/filemap.cWed Jan  3 02:59:45 2001
+++ linux.ac/mm/filemap.c   Thu Jan 11 17:26:55 2001
@@ -2578,6 +2625,13 @@
ClearPageUptodate(page);
kunmap(page);
goto unlock;
+sync_failure:
+   UnlockPage(page);
+   deactivate_page(page);
+   page_cache_release(page);
+   if (pos + bytes > inode->i_size)
+   vmtruncate(inode, inode->i_size);
+   goto done;
 }

 




Re: generic_file_write change in 2.4.0-ac8

2001-01-12 Thread Chris Mason



On Friday, January 12, 2001 04:30:44 PM -0500 Alexander Viro
<[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 12 Jan 2001, Chris Mason wrote:
> 
>> 
>> Hi guys,
>> 
>> This code for generic_file_write calls vmtruncate without i_sem held.  Is
>> that intentional?  It should cause problems for reiserfs at least...
> 
> Erm... generic_file_write() grabs i_sem upon entry and drops it on exit.
> This call of vmtruncate() is deep inside the protected area.
> 

Yup, I'm trying to track down a different problem, and saw what I wanted to
instead of what was really there.  Sigh.  

-chris




Re: patch:reiserfs 3.6.25 + LVM(Fix oops reiserfs filesystem)

2001-01-15 Thread Chris Mason



On Saturday, January 13, 2001 11:41:51 PM -0800 hugang
<[EMAIL PROTECTED]> wrote:

[ patch ]

Odd, the create_vi op should never be null, so the real fix is somewhere
else.  We'll look into this.

-chris




Re: More information on reiserfs bug

2001-01-16 Thread Chris Mason



On Tuesday, January 16, 2001 07:38:58 PM +0100 Jakob Borg
<[EMAIL PROTECTED]> wrote:

> Hi again,
> 
> It seems the problem occurs every time i start fetchmail... Attached are
> ksymoops output and .config (if i remember this time). If there is
> anything else I can do to help debug this, just tell me

Linus fixed that hunk of debugging code in his merge, and it found a bug in
the reiserfs O_SYNC support.  reiserfs_commit_write needs to hold the BKL.

This should fix it:

--- linux/fs/reiserfs/inode.c.1 Tue Jan 16 13:46:35 2001
+++ linux/fs/reiserfs/inode.c   Tue Jan 16 13:49:21 2001
@@ -1853,6 +1853,11 @@
 struct reiserfs_transaction_handle th ;
 
 reiserfs_wait_on_write_block(inode->i_sb) ;
+
+/* prevent_flush_page_lock must be called before generic_commit_write,
+** and the BKL must be held during the call.
+*/
+lock_kernel() ;
 prevent_flush_page_lock(page, inode) ;
 ret = generic_commit_write(f, page, from, to) ;
 /* we test for O_SYNC here so we can commit the transaction
@@ -1866,6 +1871,8 @@
journal_end_sync(&th, inode->i_sb, 1) ;
 }
 allow_flush_page_lock(page, inode) ;
+unlock_kernel() ;
+
 return ret ;
 }
 





Re: kernel BUG with 2.4.1-pre7 reiserfs

2001-01-16 Thread Chris Mason



On Tuesday, January 16, 2001 07:58:37 PM +0100 Jakob Borg
<[EMAIL PROTECTED]> wrote:

> On Tue, Jan 16, 2001 at 10:36:43AM -0800, Linus Torvalds wrote:
>> > I seem to remember more possibly useful information scrolling by my
>> > screen, but it seems to not have made it to the logs, and I will shut
>> > down and fsck the filesystem now...
>> 
>> It really needs the stack-trace to debug this sanely (along with
>> translations of what the hex numbers are - see the bugreporting
>> documentation in the kernel source tree). 
> 
> Got that in the other mail subjected "More information ... ". In the
> meantime it seems the filesystem is unhurt because of this, but reiserfsck
> says
> 
> uread_super_block: bad block is found at a new superblock location
> uread_super_block: bad block is found at an old superblock location
> 
> which seems bogus. This is reiserfsck from the same suite that mkreiserfs
> came from ("reiserfsprogs 3.x") so they should be talking about the same
> sort of filesystem.
> 

The BUG you hit should not corrupt anything, that debugging code is
actually there to prevent silent corruption due to lack of locking.

It is likely you are using an fsck version that can't read the 3.6.x
format.  They are still packaging the beta fsck tool for the new format, and
I'm not sure of the exact download URL yet.

When you mount the FS it tells you which version it is, please include that
info as well.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: set_page_dirty/page_launder deadlock

2001-01-17 Thread Chris Mason



On Sunday, January 14, 2001 10:56:10 AM -0800 Linus Torvalds
<[EMAIL PROTECTED]> wrote:

>> Marcelo Tosatti writes:
>>  > 
>>  > While taking a look at page_launder()...
>> 
>>  ...
>> 
>>  > set_page_dirty() may lock the pagecache_lock which means potential
>>  > deadlock since we have the pagemap_lru_lock locked.
>> 
> 
> Well, as the new shm code doesn't return 1 any more, the whole locked page
> handling should just be deleted. ramfs always just re-marked the page
> dirty in its own "writepage()" function, so it was only shmfs that ever
> returned this special case, and because of other issues it already got
> excised by Christoph..
> 

Then I'm confused by the code in 2.4.1pre8:

-chris

/*
 * Move the page from the page cache to the swap cache
 */
static int shmem_writepage(struct page * page)
{
int error;
struct shmem_inode_info *info;
swp_entry_t *entry, swap;

info = &page->mapping->host->u.shmem_i;
if (info->locked)
return 1;
swap = __get_swap_page(2);
if (!swap.val)
return 1;





Re: Kernel 2.4.x and 2.4.1-preX - Higher latency then 2.2.xkernels?

2001-01-21 Thread Chris Mason



On Saturday, January 20, 2001 02:59:24 PM -0500 Gregory Maxwell
<[EMAIL PROTECTED]> wrote:

> On Sat, Jan 20, 2001 at 02:50:16PM -0500, Shawn Starr wrote: 
>> It just seems that since using 2.4 ive noticed my poor Pentium 200Mhz
>> slow down whether being in X or otherwise. It just seems that the system
>> is sluggish.
>> 
>> I am using the new ReiserFS filesystem and I do know its still in heavy
>> development perhaps my latency is due to this (?)
> 
> Reiserfs uses much more complex data structures then ext2 (trees..). I
> don't think that latency has ever been a design criteria and all of the
> benchmarks they use are pretty much pure throughput tests.
> 
> So it wouldn't be really surprising if reiserfs had very bad latency. You
> should apply the timepegs patch and profile your kernel latency to see
> where it's coming from.

I'm actually very interested in fixing any latency problems.  If you do
these tests, please send the results along.

-chris



Re: 2.4.1-pre10 slowdown at boot.

2001-01-25 Thread Chris Mason



On Thursday, January 25, 2001 05:23:26 PM +0100 Ondrej Sury
<[EMAIL PROTECTED]> wrote:

> 
> 2.4.1-pre10 slows down after printing those (maybe ACPI or reiserfs
> issue), and even SysRQ-(s,u,b) is not imediate and waits several (two+)
> seconds before (syncing,remounting,booting).
> 
> ACPI: System description tables found
> ACPI: System description tables loaded
> ACPI: Subsystem enabled
> ACPI: System firmware supports: C2
> ACPI: System firmware supports: S0 S1 S4 S5
> reiserfs: checking transaction log (device 03:04) ...
> Warning, log replay starting on readonly filesystem
> 

Here, reiserfs is telling you that it has started replaying transactions in
the log.  You should also have a reiserfs message telling you how many
transactions it replayed, and how long it took.  Do you have that message?

-chris




Re: 2.4.1-pre10 slowdown at boot.

2001-01-25 Thread Chris Mason



On Thursday, January 25, 2001 06:51:33 PM +0100 Ondrej Sury
<[EMAIL PROTECTED]> wrote:

> Chris Mason <[EMAIL PROTECTED]> writes:
>> > reiserfs: checking transaction log (device 03:04) ...
>> > Warning, log replay starting on readonly filesystem
>> > 
>> 
>> Here, reiserfs is telling you that it has started replaying transactions
>> in the log.  You should also have a reiserfs message telling you how many
>> transactions it replayed, and how long it took.  Do you have that
>> message?
> 
> Nope.  I rebooted with Alt-SysRQ+B after some while (aprox more than 30
> sec, normally reiserfs replay is taking ~5 sec (pre9)).  I wasn't so
> patient.  I could test it before I'll go from work to home.
> 

Ok, depending on the metadata load before the crash, replay can take 30
seconds or more.  You usually have to try to generate that many metadata
changes, something like creating 100,000 tiny files or directories.
Compiling with CONFIG_REISERFS_CHECK turned on will give you more details
about the log replay.

Or, perhaps DMA is now off on your IDE drive, making everything slower.

Regardless, rebooting in the middle of log replay is safe.  Those
transactions will just be replayed again on the next boot.
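
Why an interrupted replay is harmless can be shown with a small user-space
sketch.  It assumes replay means copying each logged block to its home
location, so doing it twice gives the same result as doing it once.  The
names below (log_entry, replay_log, the tiny block size) are illustrative
and are not taken from the reiserfs source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK_SIZE    8   /* tiny blocks keep the example readable */
#define DISK_BLOCKS 4

/* One logged block: where it belongs on disk and what it should contain. */
struct log_entry {
    size_t home_block;
    char data[BLK_SIZE];
};

/*
 * Replay the first `limit` entries of a committed transaction by copying
 * each logged block to its home location.  Stopping early (limit < count)
 * models a reboot in the middle of replay.
 */
static void replay_log(char disk[DISK_BLOCKS][BLK_SIZE],
                       const struct log_entry *log, size_t count, size_t limit)
{
    size_t i;

    if (limit > count)
        limit = count;
    for (i = 0; i < limit; i++) {
        assert(log[i].home_block < DISK_BLOCKS);
        memcpy(disk[log[i].home_block], log[i].data, BLK_SIZE);
    }
}
```

A partial replay followed by a full replay leaves the disk in the same state
as a single full replay, because every copy overwrites its target with the
same logged contents.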

-chris




Re: ACPI error in 2.4.1-pre10 @ via82c686 (Was: 2.4.1-pre10slowdown at boot.)

2001-01-25 Thread Chris Mason



On Thursday, January 25, 2001 07:37:16 PM +0100 Ondrej Sury
<[EMAIL PROTECTED]> wrote:

> I have discovered that it wasn't reiserfs problem.  I have disabled ACPI
> in BIOS and everything is ok.  So I assume that something has changed in
> ACPI between pre9 and pre10 versions and that something is broken in _my_
> system.
> 

Ok.  This isn't related to the slowdown problem you are seeing, but after a
clean shutdown, there should not be any transactions that need replay.
Keep an eye on the console as you shutdown, and make sure / is getting
properly unmounted.

-chris




Re: Kernel 2.4.x and 2.4.1-preX - Higher latency then 2.2.xkernels?

2001-01-28 Thread Chris Mason



On Sunday, January 28, 2001 02:29:09 PM +1100 Andrew Morton
<[EMAIL PROTECTED]> wrote:

> Shawn Starr wrote:
>> 
>> Andrew, the patch HAS made a difference. For example, while untaring
>> glibc-2.2.1.tar.gz the system was not sluggish (mouse movements in X)
>> etc.
>> 
>> Seems to be a go for latency improvements on this system.
> 
> hmm..  OK, thanks.
> 
> Chris, this seems to be a worthwhile improvement to mainstream
> reiserfs, independent of the low-latency thing.   You can
> probably achieve 10 milliseconds with just a few lines of
> code - a subset of the patch which Shawn tested. (Unless you
> were planning on magical algorithmic improvements...).
> 
> I'm all set up to generate those few lines of code, so
> I'll propose a patch later this week.

Perfect, I was thinking exactly the same thing.  We have to be careful here
though, since the extra schedules will increase the chance the searching
has to be redone from scratch, which can have big performance ramifications.

I think your change to search_by_key will be the safest for performance
considerations, along with the change to prepare_for_delete_or_cut.  If
those won't be enough, we can attack reiserfs_get_block (which is probably
the biggest single offender without your patch).
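
The cost mentioned above, a search being redone after a schedule, can be
sketched in user space.  This is only a model under stated assumptions: a
generation counter is bumped whenever the tree changes, and the searcher
compares it before and after any point where it may sleep.  The struct and
function names are hypothetical, not the reiserfs ones.

```c
#include <assert.h>

/* Hypothetical stand-in for the filesystem: a generation counter that is
 * bumped whenever the tree is rebalanced. */
struct tree {
    unsigned long generation;
    int key_position;           /* where the key currently lives */
};

/* Called at each point where the search may sleep.  The callback models
 * another process running while we are scheduled out. */
typedef void (*sleep_hook)(struct tree *t);

/*
 * Look up the key position, restarting from scratch whenever the tree
 * changed while we slept.  Returns the position and reports the number of
 * passes through *passes.
 */
static int search_with_retry(struct tree *t, sleep_hook may_sleep, int *passes)
{
    int found;

    assert(t && passes);
    *passes = 0;
    for (;;) {
        unsigned long gen = t->generation;

        (*passes)++;
        found = t->key_position;    /* the "search" */
        if (may_sleep)
            may_sleep(t);           /* voluntary schedule point */
        if (t->generation == gen)
            return found;           /* nothing moved; result is valid */
        /* tree changed under us: redo the search */
    }
}

/* Example hook: rebalance the tree once, on the first sleep only. */
static void rebalance_once(struct tree *t)
{
    static int done;

    if (!done) {
        done = 1;
        t->generation++;
        t->key_position = 7;
    }
}
```

Each extra schedule point is one more chance for the generation to move, so
adding them trades latency for possible repeated passes.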

-chris




Re: Renaming lost+found

2001-01-28 Thread Chris Mason



On Friday, January 26, 2001 01:19:49 PM -0500 James Lewis Nance
<[EMAIL PROTECTED]> wrote:

> FWIW IBM's JFS file system does not have a lost+found directory.  I dont
> remember if reiserfs does or not.
> 

reiserfsck creates it.

-chris




Re: Reiserfs problem was: Re: Version 2.4.1 cannot be built.

2001-01-30 Thread Chris Mason



On Tuesday, January 30, 2001 03:42:36 PM -0800 "Brett G. Person"
<[EMAIL PROTECTED]> wrote:

> Worked fine here but  i am getting segfaults on my Reiser filesystems. 
> I've been distracted by a project over the last few days. Is what I'm
> seeing a symptom of the fs corruption people were talking about last week?
> 

If reiserfs is the cause you should have some clues in /var/log/messages.
Does the kernel compile on ext2 on the same box?

-chris





Re: reiserfs min size (was: [2.4.1] mkreiserfs on loopdevice freezes kernel)

2001-02-01 Thread Chris Mason



On Wednesday, January 31, 2001 11:27:57 PM +0100 Bernd Eckenfels <[EMAIL PROTECTED]> 
wrote:

> On Wed, Jan 31, 2001 at 09:24:39AM +, James Sutherland wrote:
>> 32 megaBLOCK?? How big is it in Mbytes?
> 
> Blocksize is 4k, mkreiserfs in my version is telling me it can not generate
> partitions smaller than 32M but it is not true, i have to do
> 
> dd if=/dev/zero of=/var/loop.img count=32768 size=4096
> 
>> You do know reiserfs defaults to
>> building a 32 Mbyte journal on the device, I take it?
> 
> Yes, I wonder if it is a Error in mkreiserfs to require 128MB.

It is.  The actual min is around 40MB (with 32MB used by the journal).  The next
version of mkreiserfs will be fixed.
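
The block versus byte confusion in this thread comes down to simple
arithmetic, sketched here.  The 8192-block journal figure is an assumption
chosen to match the 32MB quoted above with 4k blocks; the helper names are
made up for the example.

```c
#include <assert.h>

/* Bytes in `blocks` blocks of `block_size` bytes each. */
static unsigned long long blocks_to_bytes(unsigned long blocks,
                                          unsigned long block_size)
{
    assert(block_size != 0);
    return (unsigned long long)blocks * block_size;
}

/* Whole megabytes (2^20 bytes) in a byte count. */
static unsigned long long bytes_to_mb(unsigned long long bytes)
{
    return bytes >> 20;
}
```

With these, 32768 blocks of 4096 bytes come to 128MB, while 8192 journal
blocks of 4096 bytes come to 32MB.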

-chris




[PATCH] reiserfs transaction overflow

2001-04-18 Thread Chris Mason


Hi guys,

Under certain loads, the reiserfs journal can overflow the
max transaction size, leading to a crash (but not corruption).

When the transaction is too full for another writer to join,
the writer triggers a commit, and waits for the next transaction.
But, it doesn't properly check to make sure the next transaction
has enough room, which can lead to overflow.  It is hard to
hit because there is a large margin of error in the way log space
is reserved (this bug was probably in v.1 of the journal
code).

A similar patch will be needed for 3.5.x reiserfs; that will
follow soon.

Anyway, this patch should fix 2.4.x, please apply:

-chris

--- linux/fs/reiserfs/journal.c.1   Tue Apr 17 09:36:36 2001
+++ linux/fs/reiserfs/journal.c Tue Apr 17 09:37:50 2001
@@ -2052,7 +2052,7 @@
sleep_on(&(SB_JOURNAL(p_s_sb)->j_join_wait)) ;
   }
 }
-lock_journal(p_s_sb) ; /* relock to continue */
+goto relock ;
   }
 
   if (SB_JOURNAL(p_s_sb)->j_trans_start_time == 0) { /* we are the first writer, set trans_id */
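
The idea behind the "goto relock" change can be modelled outside the kernel.
The sketch below assumes a writer that must re-check for room every time it
wakes up in a new transaction, rather than assuming the new one is empty.
The limit, the struct layout and the function names are illustrative only.

```c
#include <assert.h>

#define MAX_TRANS_BLOCKS 1024   /* illustrative limit, not the kernel's value */

struct journal {
    unsigned long trans_id;
    int blocks_used;    /* blocks already reserved in the open transaction */
};

/* Stand-in for "commit the running transaction and open a new one". */
static void commit_and_start_next(struct journal *j, int preloaded)
{
    j->trans_id++;
    j->blocks_used = preloaded; /* other writers may have joined already */
}

/*
 * Join the open transaction, reserving nblocks.  After every commit the
 * room check is repeated instead of assuming the new transaction has
 * space.  `preloaded` models how full each new transaction already is when
 * we wake up; it shrinks on each retry so the loop terminates.
 */
static unsigned long join_transaction(struct journal *j, int nblocks,
                                      int preloaded)
{
    assert(nblocks > 0 && nblocks <= MAX_TRANS_BLOCKS);
    while (j->blocks_used + nblocks > MAX_TRANS_BLOCKS) {
        commit_and_start_next(j, preloaded);
        preloaded /= 2;
    }
    j->blocks_used += nblocks;
    return j->trans_id;
}
```

Starting from a transaction holding 1000 blocks, a writer asking for 100 does
not fit, and neither does the next transaction if it already holds 990; only
the re-check keeps the writer from overflowing it.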


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] ac only, allow reiserfs files > 4GB

2001-04-18 Thread Chris Mason


This patch should set s_maxbytes correctly for reiserfs in the
ac kernels, and adds a reiserfs_setattr call to catch expanding
truncates past the MAX_NON_LFS limit for old format files.
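
For reference, the new-format limit used in the patch, (512LL << 32) minus
one block, works out as in the sketch below.  It assumes the limit comes from
a 32-bit count of 512-byte sectors, as the comment in the patch says; the
helper names are invented for the example.

```c
#include <assert.h>

/* Largest byte count that a 32-bit counter of 512-byte sectors can
 * describe, less one filesystem block of headroom. */
static long long max_file_bytes(unsigned long block_size)
{
    assert(block_size >= 512);
    return (512LL << 32) - (long long)block_size;
}

/* Number of 512-byte sectors needed to hold `bytes`. */
static unsigned long long sectors_for(long long bytes)
{
    assert(bytes >= 0);
    return ((unsigned long long)bytes + 511) / 512;
}
```

With 4k blocks the limit is 2199023251456 bytes, just under 2TB, and the
sector count for a file of that size still fits in 32 bits.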

reiserfs_get_block already catches file writes and such for
this case.

It also adds a generic_inode_setattr call, mostly because I
didn't want to copy/maintain that hunk of code in reiserfs.

Testing has been light, I'll beat on it more this evening.

patch against 2.4.3-ac7.

-chris

diff -Nru a/fs/attr.c b/fs/attr.c
--- a/fs/attr.c Wed Apr 18 18:33:44 2001
+++ b/fs/attr.c Wed Apr 18 18:33:44 2001
@@ -111,6 +111,21 @@
return dn_mask;
 }
 
+int generic_inode_setattr(struct inode *inode, struct iattr * attr) {
+   int error  ;
+   unsigned int ia_valid = attr->ia_valid;
+
+   error = inode_change_ok(inode, attr);
+   if (!error) {
+   if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
+   (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid))
+   error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
+   if (!error)
+   error = inode_setattr(inode, attr);
+   }
+   return error ;
+}
+
 int notify_change(struct dentry * dentry, struct iattr * attr)
 {
struct inode *inode = dentry->d_inode;
@@ -131,14 +146,7 @@
if (inode->i_op && inode->i_op->setattr) 
error = inode->i_op->setattr(dentry, attr);
else {
-   error = inode_change_ok(inode, attr);
-   if (!error) {
-   if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
-   (ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid))
-   error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
-   if (!error)
-   error = inode_setattr(inode, attr);
-   }
+   error = generic_inode_setattr(inode, attr) ;
}
unlock_kernel();
if (!error) {
diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c
--- a/fs/reiserfs/file.cWed Apr 18 18:33:44 2001
+++ b/fs/reiserfs/file.cWed Apr 18 18:33:44 2001
@@ -106,6 +106,18 @@
   return ( n_err < 0 ) ? -EIO : 0;
 }
 
+static int reiserfs_setattr(struct dentry *dentry, struct iattr *attr) {
+struct inode *inode = dentry->d_inode ;
+if (attr->ia_valid & ATTR_SIZE) {
+   /* version 2 items will be caught by the s_maxbytes check
+   ** done for us in vmtruncate
+   */
+if (inode_items_version(inode) == ITEM_VERSION_1 && 
+   attr->ia_size > MAX_NON_LFS)
+return -EFBIG ;
+}
+return generic_inode_setattr(inode, attr) ;
+}
 
 struct file_operations reiserfs_file_operations = {
 read:  generic_file_read,
@@ -119,6 +131,7 @@
 
 struct  inode_operations reiserfs_file_inode_operations = {
 truncate:  reiserfs_vfs_truncate_file,
+setattr:reiserfs_setattr,
 };
 
 
diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c
--- a/fs/reiserfs/super.c   Wed Apr 18 18:33:44 2001
+++ b/fs/reiserfs/super.c   Wed Apr 18 18:33:44 2001
@@ -412,7 +412,7 @@
 SB_BUFFER_WITH_SB (s) = bh;
 SB_DISK_SUPER_BLOCK (s) = rs;
 s->s_op = &reiserfs_sops;
-s->s_maxbytes = MAX_NON_LFS;
+s->s_maxbytes = MAX_NON_LFS; /* old format is always limited at 2GB */
 return 0;
 }
 #endif
@@ -493,7 +493,11 @@
 SB_BUFFER_WITH_SB (s) = bh;
 SB_DISK_SUPER_BLOCK (s) = rs;
 s->s_op = &reiserfs_sops;
-s->s_maxbytes = 0xFFFFFFFF;/* 4Gig */
+
+/* new format is limited by the 32 bit wide i_blocks field, want to
+** be one full block below that.
+*/
+s->s_maxbytes = (512LL << 32) - s->s_blocksize ;
 return 0;
 }
 
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.hWed Apr 18 18:33:44 2001
+++ b/include/linux/fs.hWed Apr 18 18:33:44 2001
@@ -1359,6 +1359,7 @@
 
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int inode_setattr(struct inode *, struct iattr *);
+extern int generic_inode_setattr(struct inode *, struct iattr *);
 
 /*
  * Common dentry functions for inclusion in the VFS
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.cWed Apr 18 18:33:44 2001
+++ b/kernel/ksyms.cWed Apr 18 18:33:44 2001
@@ -180,6 +180,7 @@
 EXPORT_SYMBOL(permission);
 EXPORT_SYMBOL(vfs_permission);
 EXPORT_SYMBOL(inode_setattr);
+EXPORT_SYMBOL(generic_inode_setattr);
 EXPORT_SYMBOL(inode_change_ok);
 EXPORT_SYMBOL(write_inode_now);
 EXPORT_SYMBOL(notify_change);




[PATCH] reiserfs should daemonize

2001-04-19 Thread Chris Mason


Hi guys,

The reiserfs commit thread needs to daemonize.  This patch
was actually from Andi Kleen eons ago (but blame me if 
it breaks).  Please apply.

Against 2.4.3:

--- linux/fs/reiserfs/journal.c Thu Apr 19 14:02:56 2001
+++ linux/fs/reiserfs/journal.c Thu Apr 19 18:11:57 2001
@@ -1814,16 +1814,14 @@
 ** then run the per filesystem commit task queue when we wakeup.
 */
 static int reiserfs_journal_commit_thread(void *nullp) {
-  exit_files(current);
-  exit_mm(current);
+
+  daemonize() ;
 
   spin_lock_irq(&current->sigmask_lock);
   sigfillset(&current->blocked);
   recalc_sigpending(current);
   spin_unlock_irq(&current->sigmask_lock);
 
-  current->session = 1;
-  current->pgrp = 1;
   sprintf(current->comm, "kreiserfsd") ;
   lock_kernel() ;
   while(1) {





[RFC] yet another knfsd-reiserfs patch

2001-04-23 Thread Chris Mason


Hi guys,

This patch is not meant to replace Neil Brown's knfsd ops stuff, the 
goal was to whip up something that had a chance of getting into 2.4.x,
and that might be usable by the AFS guys too.  Neil's patch tries to 
address a bunch of things that I didn't, and looks better for the
long run.

Anyway, the basic idea is the FS provides:

int fill_fh(struct dentry *, __u32 *fh, int size) ;

fills the array of ints in fh with enough info to find the file and
its parent later.

struct inode *inode_from_fh(struct super_block *, __u32 *fh, int size) ;
struct inode *parent_from_fh(struct super_block *, __u32 *fh, int size) ;

iget the inode or parent directory inode based on data in the array.

Default ops are provided, the other filesystems should work the
same as before.  Anyway, please take a look.
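
A user-space sketch of the three operations may help show the shape of the
interface.  It assumes a file is identified by an (object id, directory id)
pair plus a generation number, and that a handle is stale when the
generation no longer matches.  The struct, the five-word layout and the
lookup table are hypothetical and are not what the patch below does in
detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory object; reiserfs needs more than an inode number
 * to find a file, modelled here as an (object id, directory id) pair. */
struct obj {
    uint32_t objectid;
    uint32_t dirid;             /* id of the directory holding the key */
    uint32_t generation;
    const struct obj *parent;
};

#define FH_WORDS 5

/* Pack enough data to find the object and its parent later.
 * Returns the number of 32-bit words used, or -1 if fh is too small. */
static int fill_fh(const struct obj *o, uint32_t *fh, int size)
{
    assert(o && o->parent && fh);
    if (size < FH_WORDS)
        return -1;
    fh[0] = o->objectid;
    fh[1] = o->dirid;
    fh[2] = o->generation;
    fh[3] = o->parent->objectid;
    fh[4] = o->parent->dirid;
    return FH_WORDS;
}

/* Linear lookup standing in for iget(); a generation mismatch means the
 * handle is stale and 0 is returned. */
static const struct obj *obj_from_fh(const struct obj *table, int n,
                                     const uint32_t *fh, int size)
{
    int i;

    if (size < 3)
        return 0;
    for (i = 0; i < n; i++)
        if (table[i].objectid == fh[0] && table[i].dirid == fh[1])
            return table[i].generation == fh[2] ? &table[i] : 0;
    return 0;
}

/* Find the parent directory recorded in the handle, or 0 if absent. */
static const struct obj *parent_from_fh(const struct obj *table, int n,
                                        const uint32_t *fh, int size)
{
    int i;

    if (size < FH_WORDS)
        return 0;
    for (i = 0; i < n; i++)
        if (table[i].objectid == fh[3] && table[i].dirid == fh[4])
            return &table[i];
    return 0;
}
```

The point of letting the filesystem fill the array itself is that each one
can store whatever it needs to find the file again, rather than being forced
into an inode number and generation.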

-chris

# This is a BitKeeper generated patch for the following project:
# Project Name: local kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#  ChangeSet1.6 -> 1.7
#fs/reiserfs/super.c1.1 -> 1.2
#fs/nfsd/nfsfh.c1.1 -> 1.2
# include/linux/fs.h1.2 -> 1.3
#fs/reiserfs/inode.c1.1 -> 1.2
#   include/linux/reiserfs_fs.h 1.1 -> 1.2
#
# The following is the BitKeeper ChangeSet Log
# 
# 01/04/23  [EMAIL PROTECTED]  1.7
# reiserfs-knfsd-fh-ops-2
# 
# Introduce file handle operations into the super ops.  Add generic support and
# reiserfs support.   Meant for use by NFS (and perhaps AFS) to get around
# reiserfs' inability to find a file with an inode number alone.
# 
# fs.h  reiserfs-knfsd-fh-ops-2
# reiserfs_fs.h reiserfs-knfsd-fh-ops-2
# nfsfh.c   reiserfs-knfsd-fh-ops-2
# super.c   reiserfs-knfsd-fh-ops-2
# inode.c   reiserfs-knfsd-fh-ops-2
# 
#
diff -Nru a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
--- a/fs/nfsd/nfsfh.c   Mon Apr 23 02:14:42 2001
+++ b/fs/nfsd/nfsfh.c   Mon Apr 23 02:14:42 2001
@@ -116,40 +116,12 @@
return error;
 }
 
-/* this should be provided by each filesystem in an nfsd_operations interface as
- * iget isn't really the right interface
- */
-static struct dentry *nfsd_iget(struct super_block *sb, unsigned long ino, __u32 generation)
+static struct dentry *dentry_from_inode(struct inode *inode) 
 {
-
-   /* iget isn't really right if the inode is currently unallocated!!
-* This should really all be done inside each filesystem
-*
-* ext2fs' read_inode has been strengthed to return a bad_inode if the inode
-*   had been deleted.
-*
-* Currently we don't know the generation for parent directory, so a generation
-* of 0 means "accept any"
-*/
-   struct inode *inode;
struct list_head *lp;
struct dentry *result;
-   inode = iget(sb, ino);
-   if (is_bad_inode(inode)
-   || (generation && inode->i_generation != generation)
-   ) {
-   /* we didn't find the right inode.. */
-   dprintk("fh_verify: Inode %lu, Bad count: %d %d or version  %u %u\n",
-   inode->i_ino,
-   inode->i_nlink, atomic_read(&inode->i_count),
-   inode->i_generation,
-   generation);
-
-   iput(inode);
-   return ERR_PTR(-ESTALE);
-   }
-   /* now to find a dentry.
-* If possible, get a well-connected one
+   /*
+* If possible, get a well-connected dentry
 */
spin_lock(&dcache_lock);
for (lp = inode->i_dentry.next; lp != &inode->i_dentry ; lp=lp->next) {
@@ -172,6 +144,92 @@
return result;
 }
 
+static struct inode *__inode_from_fh(struct super_block *sb, int ino,
+int generation) 
+{
+   struct inode *inode ;
+
+   inode = iget(sb, ino);
+   if (is_bad_inode(inode)
+   || (generation && inode->i_generation != generation)
+   ) {
+   /* we didn't find the right inode.. */
+   dprintk("fh_verify: Inode %lu, Bad count: %d %d or version  %u %u\n",
+   inode->i_ino,
+   inode->i_nlink, atomic_read(&inode->i_count),
+   inode->i_generation,
+   generation);
+
+   iput(inode);
+   return ERR_PTR(-ESTALE);
+   }
+   return inode ;
+}
+
+static struct inode *inode_from_fh(struct super_block *sb, 
+   __u32 *datap,
+   int len)
+{
+   if (sb->s_op->inode_from_fh)
+   return sb->s_op->inode_from_fh(sb, datap, len) ;
+   return __inode_from_fh(sb, datap[0], datap[1]) ;
+}
+
+static struct inode *parent_from_fh(struct super_block *sb, 
+  

Re: [patch] linux likes to kill bad inodes

2001-04-25 Thread Chris Mason



On Sunday, April 22, 2001 02:10:42 PM +0200 Pavel Machek <[EMAIL PROTECTED]>
wrote:

> Hi!
> 
> I had a temporary disk failure (played with acpi too much). What
> happened was that disk was not able to do anything for five minutes
> or so. When disk recovered, linux happily overwrote all inodes it
> could not read while disk was down with zeros -> massive disk
> corruption.
> 
> Solution is not to write bad inodes back to disk.
> 

Wouldn't we rather make it so bad inodes don't get marked dirty at all?

-chris




Re: [patch] linux likes to kill bad inodes

2001-04-25 Thread Chris Mason



On Wednesday, April 25, 2001 10:01:20 PM +0200 Pavel Machek <[EMAIL PROTECTED]>
wrote:

> Hi!
> 
>> > Hi!
>> > 
>> > I had a temporary disk failure (played with acpi too much). What
>> > happened was that disk was not able to do anything for five minutes
>> > or so. When disk recovered, linux happily overwrote all inodes it
>> > could not read while disk was down with zeros -> massive disk
>> > corruption.
>> > 
>> > Solution is not to write bad inodes back to disk.
>> > 
>> 
>> Wouldn't we rather make it so bad inodes don't get marked dirty at all?
> 
> I guess this is cheaper: we can mark inode dirty at 1000 points, but
> you only write it at one point.

Whoops, I worded that poorly.  To me, it seems like a bug to dirty a bad
inode.  If this patch works, it is because somewhere, somebody did
something with a bad inode, and thought the operation worked (otherwise,
why dirty it?).  

So yes, even if we dirty them in a 1000 different places, we need to find
the one place that believes it can do something worthwhile to a bad inode.

-chris





[PATCH] reiserfs lfs fix for 2.4.4-pre5 and above

2001-04-25 Thread Chris Mason


Hello everyone,

2.4.4-pre5 started honoring the s_maxbytes field, so reiserfs needs a 
patch to allow files > 4GB on 3.6.x format filesystems.

If you work with large files on reiserfs and are willing to try
the prerelease kernels (non-production), please give this a try, 
it works for me but I'd like a few confirmations before I send to Linus.

This also prevents someone from using truncate to expand an old 
format file past the 2GB mark.
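
The size check described here can be sketched on its own.  This is a
simplified model, assuming old-format files are capped at 2GB - 1 and
new-format files at the filesystem-wide maximum; the enum and function names
are hypothetical stand-ins for the item version test in the patch.

```c
#include <assert.h>
#include <errno.h>

#define MAX_NON_LFS_BYTES ((1LL << 31) - 1)     /* 2GB - 1 */

enum item_version { OLD_FORMAT = 1, NEW_FORMAT = 2 };   /* hypothetical names */

/*
 * Decide whether a file may be resized to new_size.  Old-format files are
 * refused above 2GB - 1; everything is refused above the filesystem limit.
 * Returns 0 if allowed, -EFBIG otherwise.
 */
static int check_new_size(enum item_version v, long long new_size,
                          long long fs_max_bytes)
{
    assert(new_size >= 0);
    if (v == OLD_FORMAT && new_size > MAX_NON_LFS_BYTES)
        return -EFBIG;
    if (new_size > fs_max_bytes)
        return -EFBIG;
    return 0;
}
```

An expanding truncate to exactly 2GB on an old-format file is rejected, while
the same request on a new-format file goes through.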

-chris

diff -Nru a/fs/reiserfs/file.c b/fs/reiserfs/file.c
--- a/fs/reiserfs/file.cTue Apr 24 13:37:21 2001
+++ b/fs/reiserfs/file.cTue Apr 24 13:37:21 2001
@@ -106,6 +106,24 @@
   return ( n_err < 0 ) ? -EIO : 0;
 }
 
+static int reiserfs_setattr(struct dentry *dentry, struct iattr *attr) {
+struct inode *inode = dentry->d_inode ;
+int error ;
+if (attr->ia_valid & ATTR_SIZE) {
+   /* version 2 items will be caught by the s_maxbytes check
+   ** done for us in vmtruncate
+   */
+if (inode_items_version(inode) == ITEM_VERSION_1 && 
+   attr->ia_size > MAX_NON_LFS)
+return -EFBIG ;
+}
+
+error = inode_change_ok(inode, attr) ;
+if (!error)
+inode_setattr(inode, attr) ;
+
+return error ;
+}
 
 struct file_operations reiserfs_file_operations = {
 read:  generic_file_read,
@@ -119,6 +137,7 @@
 
 struct  inode_operations reiserfs_file_inode_operations = {
 truncate:  reiserfs_vfs_truncate_file,
+setattr:reiserfs_setattr,
 };
 
 
diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c
--- a/fs/reiserfs/super.c   Tue Apr 24 13:37:21 2001
+++ b/fs/reiserfs/super.c   Tue Apr 24 13:37:21 2001
@@ -492,7 +492,11 @@
 SB_BUFFER_WITH_SB (s) = bh;
 SB_DISK_SUPER_BLOCK (s) = rs;
 s->s_op = &reiserfs_sops;
-s->s_maxbytes = 0xFFFFFFFF;/* 4Gig */
+
+/* new format is limited by the 32 bit wide i_blocks field, want to
+** be one full block below that.
+*/
+s->s_maxbytes = (512LL << 32) - s->s_blocksize ;
 return 0;
 }
 





[PATCH] reiserfs highmem bug on tail reads

2001-04-25 Thread Chris Mason


Ok, so all the reiserfs tail bugs weren't quite fixed yet, the last
tail fix can cause problems with highmem turned on.  Both bugs are
in fs/reiserfs/inode.c:_get_block_create_0

When reading the tail in, if the buffer was already up to date, 
we skip the disk i/o and return.  But the cleanup code assumes the 
page was kmap'd, which isn't right.

Also, there was a chance to double kmap the page if kmap scheduled
and the tree balanced while we slept.  This bug has been there for 
a long time.
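
Both problems are map/unmap balance errors, which a small user-space model
can illustrate.  It assumes a page with a mapping counter in place of the
real kmap()/kunmap(), an early-exit path that never maps, and a retry path
that must not map twice.  All names here are stand-ins.

```c
#include <assert.h>

/* Stand-in for a page with a mapping count; real kmap()/kunmap() calls
 * must balance in the same way. */
struct fake_page {
    int map_count;
    char data[16];
};

static char *fake_kmap(struct fake_page *pg)
{
    pg->map_count++;
    return pg->data;
}

static void fake_kunmap(struct fake_page *pg)
{
    assert(pg->map_count > 0);
    pg->map_count--;
}

/*
 * Read a "tail" into the page.  `uptodate` models the early-exit path where
 * no copy is needed; `retries` models how many times the search has to be
 * redone because the tree changed while we slept.  Returns the mapping
 * count left behind, which should be what it was on entry.
 */
static int read_tail(struct fake_page *pg, int uptodate, int retries)
{
    char *p = 0;

    if (uptodate)
        goto finished;          /* never mapped, so must not unmap */
research:
    if (!p)
        p = fake_kmap(pg);      /* map at most once across retries */
    if (retries-- > 0)
        goto research;
    p[0] = 'x';                 /* copy the data */
    fake_kunmap(pg);            /* unmap only on the path that mapped */
finished:
    return pg->map_count;
}
```

Moving the unmap above the shared exit label and guarding the map with a
check on the pointer are the two halves of the fix in this model.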

Anyway, this was tested with Andrea's HIGHMEM_DEBUG_MERE_MORTALS 
patch to force highmem on my 128MB machine.  It works for me, but 
more testers are always good.

-chris

against 2.4.4-pre6, should work against 2.4.3 or higher.

diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Wed Apr 25 23:15:14 2001
+++ b/fs/reiserfs/inode.c   Wed Apr 25 23:15:14 2001
@@ -374,9 +374,11 @@
 ** sure we need to.  But, this means the item might move if
 ** kmap schedules
 */
-p = (char *)kmap(bh_result->b_page) ;
-if (fs_changed (fs_gen, inode->i_sb) && item_moved (&tmp_ih, &path)) {
-goto research;
+if (!p) {
+   p = (char *)kmap(bh_result->b_page) ;
+   if (fs_changed (fs_gen, inode->i_sb) && item_moved (&tmp_ih, &path)) {
+   goto research;
+   }
 }
 p += offset ;
 memset (p, 0, inode->i_sb->s_blocksize);
@@ -420,14 +422,15 @@
ih = get_ih (&path);
 } while (1);
 
+flush_dcache_page(bh_result->b_page) ;
+kunmap(bh_result->b_page) ;
+
 finished:
 pathrelse (&path);
 bh_result->b_blocknr = 0 ;
 bh_result->b_dev = inode->i_dev;
 mark_buffer_uptodate (bh_result, 1);
 bh_result->b_state |= (1UL << BH_Mapped);
-flush_dcache_page(bh_result->b_page) ;
-kunmap(bh_result->b_page) ;
 return 0;
 }
 








Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-04-26 Thread Chris Mason



On Thursday, April 26, 2001 02:24:26 PM -0400 Alexander Viro
<[EMAIL PROTECTED]> wrote:

> 
> 
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> 
>> correct. I bet other fs are affected as well btw.
> 
> If only... block_read() vs. block_write() has the same race. I'm going
> through the list of all wait_on_buffer() users right now.
> 

Looks like reiserfs has it too when allocating tree blocks, but it should
be harder to hit.  The fix should be small but it will take me a bit to
make sure it doesn't affect the rest of the balancing code.

-chris




Re: ReiserFS question

2001-04-26 Thread Chris Mason



On Thursday, April 26, 2001 11:05:25 PM +0400 Samium Gromoff
<[EMAIL PROTECTED]> wrote:

>   Hi People...
>got a following "dead of alive" question:
>how to find a root block on a ReiserFS partition
>with a corrupted superblock?
> 
>reiserfsprogs-3.x.0.9j simply writes -2^32
>there at start (reset_super_block) and then simply
>crashes when attempting to access to such mad place
>   ... got nearly lost my main partition ...
> 
> 

reiserfsck --rebuild-tree will find the root block for you.  Now that
you've rebuilt the super, run with --rebuild-tree and it should find
everything.

-chris




Re: kernel panic with 2.4.x and reiserfs

2001-04-27 Thread Chris Mason



On Friday, April 27, 2001 02:40:50 AM -0700 jason
<[EMAIL PROTECTED]> wrote:

[ ouch ]

> 
> reiserfs_read_super: can't find reiserfs filesystem on dev 03:01
> Invalid session # or type of track
> Kernel panic: VFS: Unable to mount root fs on 03:01
> 
>   In case it's any help, I'm running Debian "sid" under kernel 2.4.3. hda
> is a Western Digital WD400 (UDMA 100) while hdc is a Maxtor 36.5 GB. I
> have a 900 Mhz Athlon on an Abit KT7A, the latter containing the South
> Bridge VIA VT82C686B and a North Bridge VIA VT8363A.
>   Any info on how I could possibly retrieve data from my disk (hda) would
> be greatly appreciated...
> 

Looks like you've hit the pot-luck of VIA problems, and elevator bugs
(2.4.1).  When the last crash hit, did you recycle with the power button or
the reset button?

Step one, if you can, get a backup of the raw device.  This will make
everything easier if there are problems in step 3.  

Step two, grab the latest reiserfsprogs from
ftp.reiserfs.org/pub/reiserfsprogs.

Step three, reiserfsck --rebuild-sb ; reiserfsck --rebuild-tree

Drop me a line if there are any questions.

-chris




Re: [patch] linux likes to kill bad inodes

2001-04-27 Thread Chris Mason



On Friday, April 27, 2001 12:28:54 AM +0200 Pavel Machek <[EMAIL PROTECTED]>
wrote:

> Okay, so what about following patch, followed by attempt to debug it?
> [I'd really like to get patch it; killing user's data without good
> reason seems evil to me, and this did quite a lot of damage to my
> $HOME.]

2.4.4-pre8 does have the patch to keep write_inode from syncing a
bad_inode.  In the short term this is the best way to go.

For debugging further, it is probably best to put the warning in when
marking the inode dirty, and randomly returning bad_inodes from read_inode.
I'll give this a try next week.  

My guess is that UPDATE_ATIME is the offending caller, the follow_link path
in open_namei is at least one place that should trigger it.
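
The two approaches discussed in this thread can be put side by side in a
small model.  It assumes an inode with a "bad" flag set when the read from
disk failed; the write-side guard is the short-term fix, and the counter on
the dirty side is the debugging aid for finding the caller.  Names are
illustrative and this is not the kernel code.

```c
#include <assert.h>

struct fake_inode {
    int bad;        /* set when the read from disk failed */
    int dirty;
};

static int dirty_bad_warnings;  /* how many times someone dirtied a bad inode */

/* Debugging aid: refuse to dirty a bad inode and count the attempt, so the
 * offending caller can be found. */
static void fake_mark_dirty(struct fake_inode *ino)
{
    assert(ino);
    if (ino->bad) {
        dirty_bad_warnings++;
        return;
    }
    ino->dirty = 1;
}

/* Short-term guard: never write a bad inode back.  Returns 1 if written. */
static int fake_write_inode(struct fake_inode *ino)
{
    assert(ino);
    if (ino->bad || !ino->dirty)
        return 0;
    ino->dirty = 0;
    return 1;
}
```

In the kernel the counter would be a warning with a stack trace, which is
what would point at a caller such as the atime update.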

-chris






Re: kernel panic with 2.4.x and reiserfs

2001-04-27 Thread Chris Mason



On Friday, April 27, 2001 04:33:15 PM +0100 Tony Hoyle
<[EMAIL PROTECTED]> wrote:
 
> Reiserfs doesn't cope well with crashes  Under 2.4 I wouldn't
> recommend using it on any kind of critical server - it seems to
> progressively corrupt itself (I'm looking at the second reformat and
> reinstall in a week, and I'm not a happy bunny).

Could you please forward along the details of these corruptions (including
hardware)?  

> 
> As the warning on reiserfsck says, the rebuild-tree option is a last
> resort.  It's as likely to make the problem worse then improve it (It
> rounds all the file lengths up to a block size, padding with zeros, which
> breaks lots of stuff).  Backup what you can first.

It shouldn't always do this, most of the time it has enough info to get the
size right.  Which reiserfsck did you use?

-chris




Re: reiserfs autofix?

2001-04-29 Thread Chris Mason



On Sunday, April 29, 2001 02:48:27 PM -0700 putter <[EMAIL PROTECTED]>
wrote:

> Hi,
> I am kernel newbie, especially with logging filesystems.
> Now I am using Mandrake 7.1 with 2.4.3 kernel and imon patch
> and NVidia drivers compiled into the kernel.
 ^^^

The binary only nvidia drivers make it a bit hard for us to debug.

> Now, all my partitions are ReiserFS. I usually play quake once
> or twice a day. Sometimes graphics subsystem freezes up, so it takes
> keyboard input. Caps and Numlock are working fine, unless I try to kill
> X with ctrl-alt-backspace. So I reset my machine with hardware switch.

Check your /var/log/messages.  You probably have messages from reiserfs.
Send along an lspci so we can see what your hardware is.

-chris




Re: reiserfs autofix?

2001-04-30 Thread Chris Mason



On Monday, April 30, 2001 12:07:04 AM -0700 putter <[EMAIL PROTECTED]>
wrote:

> I think I have tracked down the problem to the card itself. My machine is
> in graphics mode all the time, like 24hrs a day, and it seems that this
> is somewhat taxing on the card's performance. So now I switch down to text
> mode every time I leave the machine. How did I find out? I placed my
> finger on the heatsink of my GeForce DDR. It was HOT! The fan works alright,
> so if I run the computer for a while, stress accumulates, and when I run
> the GeForce under the stress of maximum resolutions, it craps out. So much
> for NVidia eh?

Do a search through the kernel archives for nvidia.  The crashes could just
be the driver.  But heat is always a problem, add fans ;-)

> 
> BTW, I don't question graphical subsystem crashes. I question reiserfs,
> which is supposed to leave my partitions in a consistent state no matter how
> trigger happy with the power switch I am. Or is my judgement clouded? >=)

After a crash, reiserfs only cleans up after itself.  If someone else went
in and hosed the metadata (nvidia, bad drive, controller, ide fun with
via), you've still got bad blocks.

This is one possible reason that we've seen more reports than ext2 has.
After a crash, e2fsck fixes _whatever_ was broken.  Log replay in
reiserfs only fixes the operations that were in progress when the system
crashed.
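
A toy model of that difference, as a rough sketch only.  Nothing below is
reiserfs code: replay_log, demo_replay, the four-block "disk" and the block
values are all made up for illustration.  The log holds only the block that
was being written at crash time, so replay repairs that block and leaves the
block that something else trashed exactly as it found it.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 4                 /* hypothetical tiny disk */

struct log_entry {
    int blocknr;                  /* home location of the logged block */
    int value;                    /* logged block contents */
};

/* Copy every logged block image back to its home location. */
static void replay_log(int *disk, const struct log_entry *log, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        assert(log[i].blocknr >= 0 && log[i].blocknr < NBLOCKS);
        disk[log[i].blocknr] = log[i].value;
    }
}

/* Returns how many blocks are still wrong after log replay. */
static int demo_replay(void)
{
    const int good[NBLOCKS] = { 10, 11, 12, 13 };
    /* block 1 was hosed by something else, block 3 was mid-write at crash */
    int disk[NBLOCKS] = { 10, 99, 12, 0 };
    const struct log_entry log[] = { { 3, 13 } };
    int i, bad = 0;

    replay_log(disk, log, sizeof(log) / sizeof(log[0]));
    for (i = 0; i < NBLOCKS; i++)
        if (disk[i] != good[i])
            bad++;
    return bad;
}
```

In this model demo_replay() leaves one bad block behind (block 1), which is
the case where a full fsck pass is still needed.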

Anyway, those messages show that you've got metadata corruption.  Grab the
latest reiserfsprogs from ftp.reiserfs.org and run reiserfsck -x (after
backing things up).

-chris




Re: reiserfs+lndir problem [was: 2.4.4 SMP: spurious EOVERFLOW "Value too large for defined data type"]

2001-04-30 Thread Chris Mason



On Monday, April 30, 2001 10:55:57 PM +0200 Daniel Elstner
<[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> unfortunately I have to correct myself again.
> The problem seems unrelated to the kernel version or SMP/UP
> (though only 2.4.[34] tried yet).
> 
> Apparently it's a reiserfs/symlink problem.
> I tried doing the lndir on an ext2 partition, sources still
> on reiserfs. And it worked just fine!

Neat, thanks for the extra details.  Does that mean you can consistently
repeat on reiserfs now?  What happens when you do the lndir on reiserfs and
diff the directories?

Any useful messages in /var/log/messages?




Re: [RFC] yet another knfsd-reiserfs patch

2001-04-30 Thread Chris Mason



On Monday, April 23, 2001 10:45:14 AM -0400 Chris Mason <[EMAIL PROTECTED]> wrote:

> 
> Hi guys,
> 
> This patch is not meant to replace Neil Brown's knfsd ops stuff, the 
> goal was to whip up something that had a chance of getting into 2.4.x,
> and that might be usable by the AFS guys too.  Neil's patch tries to 
> address a bunch of things that I didn't, and looks better for the
> long run.
>

Ok, here it is updated to 2.4.4.  The only change was to adapt to the usage
of comp_short_keys in reiserfs_iget under 2.4.4.

-chris

diff -Nru a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
--- a/fs/nfsd/nfsfh.c   Sun Apr 29 18:01:04 2001
+++ b/fs/nfsd/nfsfh.c   Sun Apr 29 18:01:04 2001
@@ -116,40 +116,12 @@
return error;
 }
 
-/* this should be provided by each filesystem in an nfsd_operations interface as
- * iget isn't really the right interface
- */
-static struct dentry *nfsd_iget(struct super_block *sb, unsigned long ino, __u32 generation)
+static struct dentry *dentry_from_inode(struct inode *inode) 
 {
-
-   /* iget isn't really right if the inode is currently unallocated!!
-* This should really all be done inside each filesystem
-*
-* ext2fs' read_inode has been strengthed to return a bad_inode if the inode
-*   had been deleted.
-*
-* Currently we don't know the generation for parent directory, so a generation
-* of 0 means "accept any"
-*/
-   struct inode *inode;
struct list_head *lp;
struct dentry *result;
-   inode = iget(sb, ino);
-   if (is_bad_inode(inode)
-   || (generation && inode->i_generation != generation)
-   ) {
-   /* we didn't find the right inode.. */
-   dprintk("fh_verify: Inode %lu, Bad count: %d %d or version  %u %u\n",
-   inode->i_ino,
-   inode->i_nlink, atomic_read(&inode->i_count),
-   inode->i_generation,
-   generation);
-
-   iput(inode);
-   return ERR_PTR(-ESTALE);
-   }
-   /* now to find a dentry.
-* If possible, get a well-connected one
+   /*
+* If possible, get a well-connected dentry
 */
spin_lock(&dcache_lock);
for (lp = inode->i_dentry.next; lp != &inode->i_dentry ; lp=lp->next) {
@@ -172,6 +144,92 @@
return result;
 }
 
+static struct inode *__inode_from_fh(struct super_block *sb, int ino,
+int generation) 
+{
+   struct inode *inode ;
+
+   inode = iget(sb, ino);
+   if (is_bad_inode(inode)
+   || (generation && inode->i_generation != generation)
+   ) {
+   /* we didn't find the right inode.. */
+   dprintk("fh_verify: Inode %lu, Bad count: %d %d or version  %u %u\n",
+   inode->i_ino,
+   inode->i_nlink, atomic_read(&inode->i_count),
+   inode->i_generation,
+   generation);
+
+   iput(inode);
+   return ERR_PTR(-ESTALE);
+   }
+   return inode ;
+}
+
+static struct inode *inode_from_fh(struct super_block *sb, 
+   __u32 *datap,
+   int len)
+{
+   if (sb->s_op->inode_from_fh)
+   return sb->s_op->inode_from_fh(sb, datap, len) ;
+   return __inode_from_fh(sb, datap[0], datap[1]) ;
+}
+
+static struct inode *parent_from_fh(struct super_block *sb, 
+   __u32 *datap,
+   int len)
+{
+   if (sb->s_op->parent_from_fh)
+   return sb->s_op->parent_from_fh(sb, datap, len) ;
+
+   if (len >= 3)
+   return __inode_from_fh(sb, datap[2], 0) ;
+   return ERR_PTR(-ESTALE);
+}
+
+/* 
+ * two iget funcs, one for inode, and one for parent directory
+ *
+ * this should be provided by each filesystem in an nfsd_operations interface as
+ * iget isn't really the right interface
+ *
+ * If the filesystem doesn't provide funcs to get inodes from datap,
+ * it must be: inum, generation, dir inum.  Length of 2 means the 
+ * dir inum isn't there.
+ *
+ * iget isn't really right if the inode is currently unallocated!!
+ * This should really all be done inside each filesystem
+ *
+ * ext2fs' read_inode has been strengthed to return a bad_inode if the inode
+ *   had been deleted.
+ *
+ * Currently we don't know the generation for parent directory, so a generation
+ * of 0 means "accept any"
+ */
+static struct dentry *nfsd_iget(struct super_block *sb, __u32 *datap, int len)
+{
+
+   struct inode *inode;
+
+   inode = inode_from_fh(sb, datap, len) ;
+   if (IS_ERR(inode)) {
+   return ERR_PTR(PTR_ERR

Re: reiserfs+lndir problem [was: 2.4.4 SMP: spurious EOVERFLOW "Value too large for defined data type"]

2001-05-01 Thread Chris Mason



On Wednesday, May 02, 2001 12:41:52 AM +0200 Daniel Elstner
<[EMAIL PROTECTED]> wrote:

> Hi,
> 
> On Mon, 30 Apr 2001 21:03:47 -0400 Chris Mason <[EMAIL PROTECTED]> wrote:
> 
>> > Apparently it's a reiserfs/symlink problem.
>> > I tried doing the lndir on an ext2 partition, sources still
>> > on reiserfs. And it worked just fine!
>> 
>> Neat, thanks for the extra details.  Does that mean you can consistently
>> repeat on reiserfs now?  What happens when you do the lndir on reiserfs
>> and diff the directories?
> 
> I just played around a bit with the following results:
> 
> sources on reiserfs, lndir on reiserfs -> make fails, diff ok
> sources on reiserfs, lndir on ext2 -> make ok
> sources on ext2, lndir on reiserfs -> make fails, diff ok
> 
> Doing the diff against a second copy of the tree shows no errors, too.
> Always the same behaviour: You have to run lndir at least twice to
> get the error. If the link tree was already set up after a boot, the
> error occurs only after rm + lndir + rm + lndir.
> 
> There's a strange way to get things working just like after a reboot.
> After diff'ing the link tree with the 2nd copy (both on reiserfs),
> make World won't fail - at least once.

Ok, can you reproduce with a set of sources other than X?  I would leave
glibc alone for now, unless you can reproduce on ext2.

-chris








Re: * Re: Severe trashing in 2.4.4

2001-05-01 Thread Chris Mason



On Tuesday, May 01, 2001 03:11:58 PM -0700 David <[EMAIL PROTECTED]>
wrote:

> Can't say for a definite fact that it was reiserfs but I can say for a
> definite fact that something fishy happens sometimes.
> 
> If I have a text file open, something.html comes to mind, If I edit it
> and save it in one rxvt and open it in another rxvt, my changes may not
> be there.  If I save it *again* or exit the editing process, I will see
> the changes in the second term.  No, I'm not accidently forgetting to
> save it, I know for a fact that I saved it and the first terminal shows
> the non-modified state with the changes and the second term shows the
> previous data.
> 
> Somewhere something is stuck in cache and what's on disk isn't what's in
> cache and a second process for some reason gets what is on disk and not
> what is in cache.
> 
> It happens infrequently but it -does- happen.

Does it happen with -o notail?  Which editor?

-chris






Re: Maximum files per Directory

2001-05-04 Thread Chris Mason



On Tuesday, May 01, 2001 04:57:02 PM -0600 Andreas Dilger
<[EMAIL PROTECTED]> wrote:

> H. Peter Anvin writes:
>> Not correct, there can't be more than 2^15 *directories* in a single
>> directory.  I believe this is an ext2 limitation.
> 
> 
> I see that reiserfs plays some tricks with the directory i_nlink count.
> If you exceed 64536 links in a directory, it reverts to "1" and no longer
> tracks the link count.

Correct.  The link count isn't used at all when deciding if the directory
is empty (we use the size instead), so we can just lie to VFS if someone
tries to make tons of subdirs.
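
Roughly, the bookkeeping looks like the sketch below.  This is a userspace
illustration, not the reiserfs source: struct demo_dir, DEMO_LINK_MAX (kept
tiny so the overflow is easy to see) and the helper names are all
hypothetical, and the entry counter only stands in for the directory size.
The point is that once the count is pinned at 1 it is never touched again,
and emptiness never looks at it.

```c
#include <assert.h>

#define DEMO_LINK_MAX 5           /* hypothetical limit, tiny for the demo */

struct demo_dir {
    unsigned int nlink;           /* what gets reported to VFS */
    unsigned int entries;         /* stands in for the directory size */
};

static void demo_add_subdir(struct demo_dir *d)
{
    d->entries++;
    if (d->nlink == 1)            /* already gave up tracking */
        return;
    if (++d->nlink > DEMO_LINK_MAX)
        d->nlink = 1;             /* lie to VFS from here on */
}

static void demo_remove_subdir(struct demo_dir *d)
{
    assert(d->entries > 0);
    d->entries--;
    if (d->nlink != 1)            /* untracked counts stay at 1 */
        d->nlink--;
}

/* Emptiness is decided from the size, never from the link count. */
static int demo_dir_is_empty(struct demo_dir d)
{
    return d.entries == 0;
}

/* Start from a fresh directory (nlink 2), add then remove subdirs.
 * Callers must keep removes <= adds. */
static struct demo_dir demo_run(unsigned int adds, unsigned int removes)
{
    struct demo_dir d = { 2, 0 };
    unsigned int i;

    for (i = 0; i < adds; i++)
        demo_add_subdir(&d);
    for (i = 0; i < removes; i++)
        demo_remove_subdir(&d);
    return d;
}
```

With this limit, demo_run(3, 0) still reports a real count, while
demo_run(4, 0) has gone past the limit and reports 1; removing all four
subdirs again leaves the count at 1 but the directory is correctly seen as
empty.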

-chris




Re: Maximum files per Directory

2001-05-04 Thread Chris Mason



On Friday, May 04, 2001 01:15:22 PM -0600 Andreas Dilger
<[EMAIL PROTECTED]> wrote:

> Chris writes:
>> On Tuesday, May 01, 2001 04:57:02 PM -0600 Andreas Dilger
>> <[EMAIL PROTECTED]> wrote:
>> > I see that reiserfs plays some tricks with the directory i_nlink count.
>> > If you exceed 64536 links in a directory, it reverts to "1" and no
>> > longer tracks the link count.
>> 
>> Correct.  The link count isn't used at all when deciding if the directory
>> is empty (we use the size instead), so we can just lie to VFS if someone
>> tries to make tons of subdirs.
> 
> For that matter, ext2 doesn't use the link count on directories to
> determine if they are empty either, so it shouldn't be too hard to do the
> same with the ext2 indexed-directory code.  Is there a reason that
> reiserfs chose to have "large number of directories" represented by "1"
> and not "LINK_MAX+1"?
> 

find and a few others consider a link count of 1 to mean there is no link
count tracking being done.
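
For reference, the heuristic on the reader's side is roughly the following.
It is a sketch of the traditional Unix convention (a directory's link count
is 2 plus the number of subdirectories), not code lifted from find; the
function names are hypothetical.  A count below 2 can't be a real directory
count, so it is treated as "not tracked" and the walker has to look at every
entry.

```c
#include <assert.h>

/* Expected number of subdirectories for a directory with this link
 * count, or -1 when the count can't be trusted (no tracking). */
static long subdir_hint(unsigned long nlink)
{
    if (nlink < 2)
        return -1;
    return (long)(nlink - 2);     /* minus "." and the parent's entry */
}

/* A tree walker may stop stat()ing the remaining entries only when
 * the hint is usable and every expected subdirectory has been seen. */
static int can_skip_stat(unsigned long nlink, long subdirs_seen)
{
    long hint = subdir_hint(nlink);

    assert(subdirs_seen >= 0);
    return hint >= 0 && subdirs_seen >= hint;
}
```

A value like LINK_MAX+1 would look like a valid count to this kind of code,
so the walker would stop early and miss directories; 1 can't be mistaken for
a real count.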

-chris




Re: Maximum files per Directory

2001-05-05 Thread Chris Mason



On Saturday, May 05, 2001 03:49:20 PM +0200 Jamie Lokier
<[EMAIL PROTECTED]> wrote:

> Chris Mason wrote:
>> > Is there a reason that
>> > reiserfs chose to have "large number of directories" represented by "1"
>> > and not "LINK_MAX+1"?
>> 
>> find and a few others consider a link count of 1 to mean there is no link
>> count tracking being done.
> 
> Indeed, and thank you for getting this right!
> 
> Btw, is it possible to add dirent->d_type information to reiserfs, and
> would there be any performance gain in doing so?

reiserfs doesn't store that information in its directory items right now,
but there are plenty of free bits to do so.  It wouldn't be hard to add the
feature, and yes there should be a performance gain.

> 
> I have code to add d_type for every other filesystem that can support it
> without additional disk reads, but I couldn't figure out whether
> reiserfs can do it or whether stat() following readdir() is cheap anyway.

stat is actually a little more expensive than ext2, since we have to search
for the inode data in the tree.  It is a fast search, but...
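
From userspace the tradeoff looks something like this sketch, assuming a
glibc-style struct dirent with a d_type field.  entry_is_dir and
count_subdirs are hypothetical names.  When the filesystem fills in d_type
the answer comes straight from readdir(); when it reports DT_UNKNOWN (which
is what a filesystem that doesn't store the type has to do), the caller
falls back to a stat call per entry, and that is the extra tree search.

```c
#define _DEFAULT_SOURCE           /* for d_type, DT_* and fstatat on glibc */
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 1 if the entry is a directory, 0 if not, -1 on error. */
static int entry_is_dir(DIR *dir, const struct dirent *de)
{
    struct stat st;

    assert(dir != NULL && de != NULL);
    if (de->d_type != DT_UNKNOWN)         /* cheap path: no extra lookup */
        return de->d_type == DT_DIR;
    /* slow path: one stat per entry, without following symlinks */
    if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

/* Count subdirectories of path, skipping "." and "..".
 * Returns -1 if the directory can't be opened. */
static long count_subdirs(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    long n = 0;

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (entry_is_dir(dir, de) == 1)
            n++;
    }
    closedir(dir);
    return n;
}
```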

-chris





