Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Dave Chinner
On Sun, Dec 17, 2017 at 04:28:44PM -0800, Joe Perches wrote:
> Some function definitions have either the initial open brace and/or
> the closing brace outside of column 1.
> 
> Move those braces to column 1.
> 
> This allows various function analyzers like gnu complexity to work
> properly for these modified functions.
> 
> Miscellanea:
> 
> o Remove extra trailing ; and blank line from xfs_agf_verify
> 
> Signed-off-by: Joe Perches <j...@perches.com>
> ---
....
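
For illustration, the kind of change being made looks like this - a made-up
example, not a hunk from the actual patch:

static int example_function(int arg)
	{
	return arg + 1;
	}

becomes:

static int example_function(int arg)
{
	return arg + 1;
}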

XFS bits look fine.

Acked-by: Dave Chinner <dchin...@redhat.com>

-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <da...@fromorbit.com> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been busy with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducible: create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.

The test machine I'm having all these other problems on? Stable and
steady as a rock using PMEM devices. The moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ea525140 count:-1 mapcount:0 mapping:  
(null) index:0x0
[ 2729.753763] flags: 0x4000()
[ 2729.754671] raw: 4000   

[ 2729.756469] raw: dead0100 dead0200  

[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: GB   
4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:c9000f0cf500 EFLAGS: 0202 ORIG_RAX: 
ff03
[ 2729.780852] RAX: 77541291 RBX: 88008b5efe40 RCX: 002e
[ 2729.782514] RDX: 0577 RSI: 05541291 RDI: 0001
[ 2729.784167] RBP: c9000f0cf500 R08: 0007 R09: c9000f0cf678
[ 2729.785818] R10: 0006 R11: 1000 R12: 0061
[ 2729.787480] R13: 0001 R14: 83214e30 R15: 0080
[ 2729.789124]  
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpa

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
> 
> But that check was only for discard (and discard-like) bios which
> had the magic single page that sometimes was unused attached.
> 
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway.  That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <da...@fromorbit.com> wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window.  I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > > 
> > > Hmm. There's not much to go by.
> > > 
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> > 
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd.  I'm saving that off for later, and seeing if I
> > can't do a bisect on the iSCSI issue.
> 
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
> 
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW  
>  4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691]  dump_stack+0x63/0x83
> [628867.637937]  __warn+0xcb/0xf0
> [628867.641359]  warn_slowpath_null+0x1d/0x20
> [628867.643362]  shadow_lru_isolate+0x171/0x220
> [628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780]  ? __list_lru_init+0x70/0x70
> [628867.646628]  list_lru_walk_one+0x17/0x20
> [628867.647488]  scan_shadow_nodes+0x34/0x50
> [628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506]  shrink_node+0x57/0x90
> [628867.650233]  do_try_to_free_pages+0xdd/0x230
> [628867.651157]  try_to_free_pages+0xce/0x1a0
> [628867.652342]  __alloc_pages_slowpath+0x2df/0x960
> [628867.653332]  ? __might_sleep+0x4a/0x80
> [628867.654148]  __alloc_pages_nodemask+0x24b/0x290
> [628867.655237]  kmalloc_order+0x21/0x50
> [628867.656016]  kmalloc_order_trace+0x24/0xc0
> [628867.656878]  __kmalloc+0x17d/0x1d0
> [628867.657644]  bucket_table_alloc+0x195/0x1d0
> [628867.658564]  ? __might_sleep+0x4a/0x80
> [628867.659449]  rht_deferred_worker+0x287/0x3c0
> [628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294]  process_one_work+0x1de/0x4d0
> [628867.662208]  worker_thread+0x4b/0x4f0
> [628867.662990]  kthread+0x10c/0x140
> [628867.663687]  ? process_one_work+0x4d0/0x4d0
> [628867.664564]  ? kthread_create_on_node+0x40/0x40
> [628867.665523]  ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
> 
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system is back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.

The workload normally averages around 230k files/s - I'm seeing
an average of ~175k files/s on your current kernel. And there are
periods where performance just completely tanks:

#  ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32  -d  /mnt/scratch/0  -d 
 /mnt/scratch/1  -d  /mnt/scratch/2

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> > 
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <da...@fromorbit.com> wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window.  I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> > 
> > Hmm. There's not much to go by.
> > 
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
> 
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd.  I'm saving that off for later, and seeing if I
> can't do a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:

[628867.607417] [ cut here ]
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW   
4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691]  dump_stack+0x63/0x83
[628867.637937]  __warn+0xcb/0xf0
[628867.641359]  warn_slowpath_null+0x1d/0x20
[628867.643362]  shadow_lru_isolate+0x171/0x220
[628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780]  ? __list_lru_init+0x70/0x70
[628867.646628]  list_lru_walk_one+0x17/0x20
[628867.647488]  scan_shadow_nodes+0x34/0x50
[628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506]  shrink_node+0x57/0x90
[628867.650233]  do_try_to_free_pages+0xdd/0x230
[628867.651157]  try_to_free_pages+0xce/0x1a0
[628867.652342]  __alloc_pages_slowpath+0x2df/0x960
[628867.653332]  ? __might_sleep+0x4a/0x80
[628867.654148]  __alloc_pages_nodemask+0x24b/0x290
[628867.655237]  kmalloc_order+0x21/0x50
[628867.656016]  kmalloc_order_trace+0x24/0xc0
[628867.656878]  __kmalloc+0x17d/0x1d0
[628867.657644]  bucket_table_alloc+0x195/0x1d0
[628867.658564]  ? __might_sleep+0x4a/0x80
[628867.659449]  rht_deferred_worker+0x287/0x3c0
[628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294]  process_one_work+0x1de/0x4d0
[628867.662208]  worker_thread+0x4b/0x4f0
[628867.662990]  kthread+0x10c/0x140
[628867.663687]  ? process_one_work+0x4d0/0x4d0
[628867.664564]  ? kthread_create_on_node+0x40/0x40
[628867.665523]  ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.

And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
35074 root  -2   0   0  0  0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root  -2   0   0  0  0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root  -2   0   0  0  0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root  -2   0   0  0  0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root  -2   0   0  0  0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root  -2   0   0  0  0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root  -2   0   0  0  0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root  -2   0   0  0  0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root  -2   0   0  0  0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root  -2   0   0  0  0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root  -2   0   0  0  0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root  -2   0   0  0  0 R  98.1  0.0  12:39.65 acpi_pad

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
> 
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window.  I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it.  I've also added
linux-scsi, linux-block to the cc list.

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > > 
> > > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature 
> > > enabled. Use at your own risk!
> > > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [  159.425710] XFS (sdb): Ending clean mount
> > > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> > > 000c
> > > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > FYI, crash is here:
> > 
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0x81b950bd is in iscsi_tcp_segment_done 
> > (drivers/scsi/libiscsi_tcp.c:102).
> > 97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98struct scatterlist *sg, unsigned int offset)
> > 99  {
> > 100 segment->sg = sg;
> > 101 segment->sg_offset = offset;
> > 102 segment->size = min(sg->length - offset,
> > 103 segment->total_size - 
> > segment->total_copied);
> > 104 segment->data = NULL;
> > 105 }
> > 106 
> > 
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> > 
> > -Dave.
> > 
> > > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > > PMD 0 [  160.277309]
> > > [  160.277523] Oops:  [#1] PREEMPT SMP
> > > [  160.278004] Modules linked in:
> > > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS Debian-1.8.2-1 04/01/2014
> > > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [  160.280919] task: 88003e28 task.stack: c908
> > > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > > [  160.282980] RAX:  RBX: 880039061730 RCX: 
> > > 
> > > [  160.283854] RDX: 1e00 RSI:  RDI: 
> > > 880039061730
> > > [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> > > 05a8
> > > [  160.285627] R10: 9835607d R11:  R12: 
> > > 0200
> > > [  160.286495] R13:  R14: 8800390615a0 R15: 
> > > 880039061730
> > > [  160.287362] FS:  () GS:88003fc0() 
> > > knlGS:
> > > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> > > 06f0
> > > [  160.290084] Call Trace:
> > > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > > [  160.293835]  process_one_work+0x1de/0x4d0
> > > [  160.294388]  worker_thread+0x4b/0x4f0
> > > [  160.294889]  kthread+0x10c/0x140
> > > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > > [  16

Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-19 Thread Dave Chinner
On Tue, Jul 19, 2016 at 02:22:47PM -0700, Calvin Owens wrote:
> On 07/18/2016 07:05 PM, Calvin Owens wrote:
> >On 07/17/2016 11:02 PM, Dave Chinner wrote:
> >>On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
> >>>On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:
> >>>>Hello all,
> >>>>
> >>>>I've found a nasty source of slab corruption. Based on seeing similar
> >>>>symptoms on boxes at Facebook, I suspect it's been around since at least 3.10.
> >>>>
> >>>>It only reproduces under memory pressure so far as I can tell: the issue
> >>>>seems to be that XFS reclaims pages from buffers that are still in use by
> >>>>scsi/block. I'm not sure which side the bug lies on, but I've only
> >>>>observed it with XFS.
> >>[]
> >>>But this indicates that the page is under writeback at this point,
> >>>so that tends to indicate that the above freeing was incorrect.
> >>>
> >>>Hmmm - it's clear we've got direct reclaim involved here, and the
> >>>suspicion of a dirty page that has had its bufferheads cleared.
> >>>Are there any other warnings in the log from XFS prior to kasan
> >>>throwing the error?
> >>
> >>Can you try the patch below?
> >
> >Thanks for getting this out so quickly :)
> >
> >So far so good: I booted Linus' tree as of this morning and reproduced the
> >ASAN splat. After applying your patch I haven't triggered it.
> >
> >I'm a bit wary since it was hard to trigger reliably in the first place...
> >so I lined up a few dozen boxes to run the test case overnight. I'll confirm
> >in the morning (-0700) they look good.
> 
> All right, my testcase ran 2099 times overnight without triggering anything.
> 
> For the overnight tests, I booted the boxes with "mem=" to artificially
> limit RAM, which makes my repro *much* more reliable (I feel silly for not
> thinking of that in the first place). With that setup, I hit the ASAN splat
> 21 times in 98 runs on vanilla 4.7-rc7. So I'm sold.
> 
> Tested-by: Calvin Owens <calvinow...@fb.com>

Thanks for testing, Calvin. I'll update the patch and get it
reviewed and committed.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-18 Thread Dave Chinner
On Sun, Jul 17, 2016 at 10:00:03AM +1000, Dave Chinner wrote:
> On Fri, Jul 15, 2016 at 05:18:02PM -0700, Calvin Owens wrote:
> > Hello all,
> > 
> > I've found a nasty source of slab corruption. Based on seeing similar
> > symptoms on boxes at Facebook, I suspect it's been around since at least 3.10.
> > 
> > It only reproduces under memory pressure so far as I can tell: the issue
> > seems to be that XFS reclaims pages from buffers that are still in use by
> > scsi/block. I'm not sure which side the bug lies on, but I've only
> > observed it with XFS.
[]
> But this indicates that the page is under writeback at this point,
> so that tends to indicate that the above freeing was incorrect.
> 
> Hmmm - it's clear we've got direct reclaim involved here, and the
> suspicion of a dirty page that has had its bufferheads cleared.
> Are there any other warnings in the log from XFS prior to kasan
> throwing the error?

Can you try the patch below?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: bufferhead chains are invalid after end_page_writeback

From: Dave Chinner <dchin...@redhat.com>

In xfs_finish_page_writeback(), we have a loop that looks like this:

	do {
		if (off < bvec->bv_offset)
			goto next_bh;
		if (off > end)
			break;
		bh->b_end_io(bh, !error);
next_bh:
		off += bh->b_size;
	} while ((bh = bh->b_this_page) != head);

The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have been marked as no longer
under IO.  The issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.

While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these are valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:

.
 INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
  free_buffer_head+0x41/0x90
  __slab_free+0x1ed/0x340
  kmem_cache_free+0x270/0x300
  free_buffer_head+0x41/0x90
  try_to_free_buffers+0x171/0x240
  xfs_vm_releasepage+0xcb/0x3b0
  try_to_release_page+0x106/0x190
  shrink_page_list+0x118e/0x1a10
  shrink_inactive_list+0x42c/0xdf0
  shrink_zone_memcg+0xa09/0xfa0
  shrink_zone+0x2c3/0xbc0
.
 Call Trace:
[] dump_stack+0x68/0x94
  [] print_trailer+0x115/0x1a0
  [] object_err+0x34/0x40
  [] kasan_report_error+0x217/0x530
  [] __asan_report_load8_noabort+0x43/0x50
  [] xfs_destroy_ioend+0x3bf/0x4c0
  [] xfs_end_bio+0x154/0x220
  [] bio_endio+0x158/0x1b0
  [] blk_update_request+0x18b/0xb80
  [] scsi_end_request+0x97/0x5a0
  [] scsi_io_completion+0x438/0x1690
  [] scsi_finish_command+0x375/0x4e0
  [] scsi_softirq_done+0x280/0x340


Where the access is occurring during IO completion after the buffer
had been freed from direct memory reclaim.

Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the bh->b_end_io() callout.

Yet another example of why Bufferheads Must Die.

Signed-off-by: Dave Chinner <dchin...@redhat.com>
---
 fs/xfs/xfs_aops.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 80714eb..0cfb944 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -87,6 +87,12 @@ xfs_find_bdev_for_inode(
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
  * we need to process on the page.
+ *
+ * Landmine Warning: bh->b_end_io() will call end_page_writeback() on the last
+ * buffer in the IO. Once it does this, it is unsafe to access the bufferhead or
+ * the page at all, as we may be racing with memory reclaim and it can free both
+ * the bufferhead chain and the page as it will see the page as clean and
+ * unused.
  */
 static void
 xfs_finish_page_writeback(
@@ -95,8 +101,9 @@ xfs_finish_page_writeback(
 	int			error)
 {
 	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
-	struct buffer_head	*head, *bh;
+	struct buffer_head	*head, *bh, *next;
 	unsigned int		off = 0;
+	unsigned int		bsi
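
A rough sketch of the pre-calculation approach described above - not the
committed patch itself, and the helper name is made up - would look like:

static void
finish_page_writeback_sketch(
	struct page		*page,
	struct bio_vec		*bvec,
	int			error)
{
	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
	unsigned int		off = 0;
	unsigned int		bsize;
	struct buffer_head	*head = page_buffers(page);
	struct buffer_head	*bh = head;
	struct buffer_head	*next;

	do {
		/*
		 * Read everything we need from the bufferhead before
		 * calling b_end_io(): the final b_end_io() can run
		 * end_page_writeback(), after which the page and its
		 * bufferhead chain may be torn down by reclaim.
		 */
		next = bh->b_this_page;
		bsize = bh->b_size;

		if (off < bvec->bv_offset)
			goto next_bh;
		if (off > end)
			break;
		bh->b_end_io(bh, !error);
next_bh:
		off += bsize;
	} while ((bh = next) != head);
}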

Re: [BUG] Slab corruption during XFS writeback under memory pressure

2016-07-16 Thread Dave Chinner

But this indicates that the page is under writeback at this point,
so that tends to indicate that the above freeing was incorrect.

Hmmm - it's clear we've got direct reclaim involved here, and the
suspicion of a dirty page that has had its bufferheads cleared.
Are there any other warnings in the log from XFS prior to kasan
throwing the error?

> Is there anything else I can send that might be helpful?

full console/dmesg output from a crashed machine, plus:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> --
> /*
>  * Run as "./repro outfile 1000", where "outfile" sits on an XFS filesystem.
>  */
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #define CHUNK (32768)
> 
> static const char crap[CHUNK];
> 
> int main(int argc, char **argv)
> {
>   int r, fd, i;
>   size_t allocsize, count;
>   void *p;
> 
>   if (argc != 3) {
>   printf("Usage: %s filename count\n", argv[0]);
>   return 1;
>   }
> 
>   fd = open(argv[1], O_RDWR|O_CREAT, 0644);
>   if (fd == -1) {
>   perror("Can't open");
>   return 1;
>   }
> 
>   if (!fork()) {
>   count = atol(argv[2]);
> 
>   while (1) {
>   for (i = 0; i < count; i++)
>   if (write(fd, crap, CHUNK) != CHUNK)
>   perror("Eh?");
> 
>   fsync(fd);
>   ftruncate(fd, 0);
>   }

Hmmmm. Truncate is used, but only after fsync. If the truncate
is removed, does the problem go away?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

2016-05-30 Thread Dave Chinner
On Mon, May 30, 2016 at 10:27:52AM +0200, Gernot Hillier wrote:
> Hi!
> 
> On 25.05.2016 01:13, Theodore Ts'o wrote:
> > On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
> >> We experience strange delays with kernel 4.1.18 during dpkg
> >> package installation on an ext4 filesystem after switching from
> >> Ubuntu 14.04 to 16.04. We can reproduce the issue with kernel 4.6.
> >> Installation of the same package takes 2s with ext3 and 31s with
> >> ext4 on the same partition.
> >>
> >> Hardware is an Intel-based server with Supermicro X8DTH board and
> >> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
> >> 0x1000:0x0072, mpt2sas driver).
> [...]
> >> To me, the problem looks comparable to
> >> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
> >> a full hang and there's no RAID involved for us), so a closer look on
> >> the SCSI layer or driver might be the next step?
> > 
> > What I would suggest is to create a small test case which compares the
> > time it takes to allocate 1 megabyte of memory, zero it, and then
> > write one megabytes of zeros using the write(2) system call.  Then try
> > writing one megabytes of zero using the BLKZEROOUT ioctl.
> 
> Ok, this is my test code:
> 
>   const int SIZE = 1*1024*1024;
>   char* buffer = malloc(SIZE);
>   uint64_t range[2] = { 0, SIZE };
>   int fd = open("/dev/sdb2", O_WRONLY);
> 
>   bzero(buffer, SIZE);
>   write(fd, buffer, SIZE);
>   sync_file_range(fd, 0, 0, 2);
> 
>   ioctl (fd, BLKZEROOUT, range);
> 
>   close(fd);
>   free(buffer);
> 
> # strace -tt ./test-tytso
> [...]
> 15:46:27.481636 open("/dev/sdb2", O_WRONLY) = 3
> 15:46:27.482004 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
> 15:46:27.482438 sync_file_range(3, 0, 0, SYNC_FILE_RANGE_WRITE) = 0
> 15:46:27.482698 ioctl(3, BLKZEROOUT, [0, 10]) = 0
> 15:46:27.546971 close(3)= 0
> 
> So the write() and sync_file_range() in the first case takes ~400 us
> each while BLKZEROOUT takes... 60 ms. Wow.

Comparing apples to oranges.

Unlike the name implies, sync_file_range() does not provide any data
integrity semantics what-so-ever: SYNC_FILE_RANGE_WRITE only submits
IO to clean dirty pages - that only takes 400us of CPU time.  It
does not wait for completion, nor does it flush the drive cache and
so by the time the syscall returns to userspace the IO may not have
even been sent to the device (e.g. it could be queued by the IO
scheduler in the block layer). i.e. you're not timing IO, you're
timing CPU overhead of IO submission.

For an apples to apples comparison, you need to use fsync() to
physically force the written data to stable storage and wait for
completion. This is what BLKZEROOUT is effectively doing, so I think
you'll find fdatasync() also takes around 60ms...
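
For an apples-to-apples comparison on the buffered path, a minimal sketch
like this (the device path is just a placeholder, error handling omitted)
times write() plus fdatasync() rather than sync_file_range():

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define SIZE (1024 * 1024)

int main(void)
{
	char *buf = calloc(1, SIZE);		/* one megabyte of zeros */
	int fd = open("/dev/sdb2", O_WRONLY);	/* placeholder device */
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	write(fd, buf, SIZE);
	fdatasync(fd);	/* waits for completion and issues a cache flush */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("write+fdatasync: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);

	close(fd);
	free(buf);
	return 0;
}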

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 13/35] xfs: set bi_op to REQ_OP

2016-01-06 Thread Dave Chinner
On Tue, Jan 05, 2016 at 02:53:16PM -0600, mchri...@redhat.com wrote:
> From: Mike Christie <mchri...@redhat.com>
> 
> This patch has xfs set the bio bi_op to a REQ_OP, and
> rq_flag_bits to bi_rw.
> 
> Note:
> I have run xfs tests on these btrfs patches. There were some failures
> with and without the patches. I have not had time to track down why
> xfstest fails without the patches.
> 
> Signed-off-by: Mike Christie <mchri...@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  3 ++-
>  fs/xfs/xfs_buf.c  | 27 +++
>  2 files changed, 17 insertions(+), 13 deletions(-)

Not sure which patches your note is referring to here.

The XFS change here looks fine.

Acked-by: Dave Chinner <dchin...@redhat.com>

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs

2016-01-06 Thread Dave Chinner
On Wed, Jan 06, 2016 at 08:40:09PM -0500, Martin K. Petersen wrote:
> >>>>> "Mike" == mchristi  <mchri...@redhat.com> writes:
> 
> Mike> The following patches begin to cleanup the request->cmd_flags and
> bio->bi_rw mess. We currently use cmd_flags to specify the operation,
> Mike> attributes and state of the request. For bi_rw we use it for
> Mike> similar info and also the priority but then also have another
> Mike> bi_flags field for state. At some point, we abused them so much we
> Mike> just made cmd_flags 64 bits, so we could add more.
> 
> Mike> The following patches seperate the operation (read, write discard,
> Mike> flush, etc) from cmd_flags/bi_rw.
> 
> Mike> This patchset was made against linux-next from today Jan 5 2016.
> Mike> (git tag next-20160105).
> 
> Very nice work. Thanks for doing this!
> 
> I think it's a much needed cleanup. I focused mainly on the core block,
> discard, write same and sd.c pieces and everything looks sensible to me.
> 
> I wonder what the best approach is to move a patch set with this many
> stakeholders forward? Set a "speak now or forever hold your peace"
> review deadline?

I say just ask Linus to pull it immediately after the next merge
window closes...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists

2015-04-08 Thread Dave Chinner
[ Sending again with a trimmed CC list to just the lists. Jeff - cc
lists that large get blocked by mailing lists... ]

On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
 The way the on-stack plugging currently works, each nesting level
 flushes its own list of I/Os.  This can be less than optimal (read
 awful) for certain workloads.  For example, consider an application
 that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
 I/Os together in a single io_submit call, only to have each of them
 dispatched individually down in the bowels of the dirct I/O code.
 The reason is that there are blk_plug-s instantiated both at the upper
 call site in do_io_submit and down in do_direct_IO.  The latter will
 submit as little as 1 I/O at a time (if you have a small enough I/O
 size) instead of performing the batching that the plugging
 infrastructure is supposed to provide.

I'm wondering what impact this will have on filesystem metadata IO
that needs to be issued immediately. e.g. we are doing writeback, so
there is a high level plug in place and we need to page in btree
blocks to do extent allocation. We do readahead at this point,
but it looks like this change will prevent the readahead from being
issued by the unplug in xfs_buf_iosubmit().

So while I can see how this can make your single microbenchmark
better (because it's only doing concurrent direct IO to the block
device and hence there are no dependencies between individual IOs),
I have significant reservations that it's actually a win for
filesystem-based workloads where we need direct control of flushing
to minimise IO latency due to IO dependencies...

Patches like this one:

https://lkml.org/lkml/2015/3/20/442

show similar real-world workload improvements to your patchset by
being smarter about using high level plugging to enable cross-file
merging of IO, but it still relies on the lower layers of plugging
to resolve latency bubbles caused by IO dependencies in the
filesystems.

 NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
 would always flush the plug list.  After this patch, this is only the
 case for the outer-most plug.  If you require the plug list to be
 flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
 maintainers should take a close look at this patch and ensure they get
 the right behavior in the end.

IOWs, you are saying we need to change all our current unplugs to
blk_flush_plug(current) to *try* to maintain the same behaviour as
we currently have? I say *try*, because now instead of just flushing
the readahead IO on the plug, we'll also flush all the queued data
writeback IO on the high level plug. We don't actually want to do
that; we only want to submit the readahead and not the bulk IO that
will delay the latency sensitive dependent IOs...

If that is the case, shouldn't you actually be trying to fix the
specific plugging problem you've identified (i.e. do_direct_IO() is
flushing far too frequently) rather than making a sweeping
generalisation that the IO stack plugging infrastructure
needs fundamental change?
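
For reference, the pattern in question - a minimal kernel-style sketch with
a made-up function name, not code from the patch - looks like this:

#include <linux/blkdev.h>
#include <linux/sched.h>

static void submit_with_nested_plug(struct bio *bio)
{
	struct blk_plug inner;

	/* An outer plug may already be active on this task (e.g. writeback). */
	blk_start_plug(&inner);

	submit_bio(WRITE, bio);		/* bio sits on the task's plug list */

	/*
	 * With the proposed change a nested blk_finish_plug() no longer
	 * flushes - only the outermost one does - so a caller that needs
	 * this IO issued immediately (e.g. readahead it is about to wait
	 * on) has to flush explicitly.
	 */
	blk_flush_plug(current);
	blk_finish_plug(&inner);
}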

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives

2015-03-17 Thread Dave Chinner
On Mon, Mar 16, 2015 at 08:12:16PM -0500, Alireza Haghdoost wrote:
 On Mon, Mar 16, 2015 at 3:32 PM, Dave Chinner da...@fromorbit.com wrote:
  On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
  Probably need to cc dm-devel here.  However, I think we're all agreed
  this is RAID across multiple devices, rather than within a single
  device?  In which case we just need a way of ensuring identical zoning
  on the raided devices and what you get is either a standard zone (for
  mirror) or a larger zone (for hamming etc).
 
  Any sort of RAID is a bloody hard problem, hence the fact that I'm
  designing a solution for a filesystem on top of an entire bare
  drive. I'm not trying to solve every use case in the world, just the
  one where the drive manufactures think SMR will be mostly used: the
  back end of never delete distributed storage environments
  We can't wait for years for infrastructure layers to catch up in the
  brave new world of shipping SMR drives. We may not like them, but we
  have to make stuff work. I'm not trying to solve every problem - I'm
  just trying to address the biggest use case I see for SMR devices
  and it just so happens that XFS is already used pervasively in that
  same use case, mostly within the same no raid, fs per entire
  device constraints as I've documented for this proposal...
 
 I am confused what kind of application you are referring to for this
 back end, no raid, fs per entire device. Are you gonna rely on the
 application to do replication for disk failure protection ?

Exactly. Think distributed storage such as Ceph and gluster where
the data redundancy and failure recovery algorithms are in layers
*above* the local filesystem, not in the storage below the fs.  The
no raid, fs per device model is already a very common back end
storage configuration for such deployments.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives

2015-03-16 Thread Dave Chinner
  and which are not.  The stack will have to collectively work together
  to find a way to request and use zones in an orderly fashion.
 
 Here I think the sense of LSF/MM was that only allowing a fixed number
 of zones to be open would get a bit unmanageable (unless the drive
 silently manages it for us).  The idea of different sized zones is also
 a complicating factor.

Not for XFS - my proposal handles variable sized zones without any
additional complexity. Indeed, it will handle zone sizes from 16MB
to 1TB without any modification - mkfs handles it all when it
queries the zones and sets up the zone allocation inodes...

And we limit the number of open zones by the number of zone groups
we allow concurrent allocation to...

 The other open question is that if we go for
 fully drive managed, what sort of alignment, size, trim + anything else
 should we do to make the drive's job easier.  I'm guessing we won't
 really have a practical answer to any of these until we see how the
 market responds.

I'm not aiming this proposal at drive managed, or even host-managed
drives: this proposal is for full host-aware (i.e. error on
out-of-order write) drive support. If you have drive managed SMR,
then there's pretty much nothing to change in existing filesystems.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives

2015-03-16 Thread Dave Chinner
On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
 [cc to linux-scsi added since this seems relevant]
 On Mon, 2015-03-16 at 17:00 +1100, Dave Chinner wrote:
  Hi Folks,
  
  As I told many people at Vault last week, I wrote a document
  outlining how we should modify the on-disk structures of XFS to
  support host aware SMR drives on the (long) plane flights to Boston.
  
  TL;DR: not a lot of change to the XFS kernel code is required, no
  specific SMR awareness is needed by the kernel code.  Only
  relatively minor tweaks to the on-disk format will be needed and
  most of the userspace changes are relatively straight forward, too.
  
  The source for that document can be found in this git tree here:
  
  git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
  
  in the file design/xfs-smr-structure.asciidoc. Alternatively,
  pull it straight from cgit:
  
  https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
  
  Or there is a pdf version built from the current TOT on the xfs.org
  wiki here:
  
  http://xfs.org/index.php/Host_Aware_SMR_architecture
  
  Happy reading!
 
 I don't think it would have caused too much heartache to post the entire
 doc to the list, but anyway
 
 The first is a meta question: What happened to the idea of separating
 the fs block allocator from filesystems?  It looks like a lot of the
 updates could be duplicated into other filesystems, so it might be a
 very opportune time to think about this.

Which requires a complete rework of the fs/block layer. That's the
long term goal, but we aren't going to be there for a few years yet.
Just look at how long it's taken for copy offload (which is trivial
compared to allocation offload) to be implemented...

  === RAID on SMR
  
  How does RAID work with SMR, and exactly what does that look like to
  the filesystem?
  
  How does libzbc work with RAID given it is implemented through the scsi ioctl
  interface?
 
 Probably need to cc dm-devel here.  However, I think we're all agreed
 this is RAID across multiple devices, rather than within a single
 device?  In which case we just need a way of ensuring identical zoning
 on the raided devices and what you get is either a standard zone (for
 mirror) or a larger zone (for hamming etc).

Any sort of RAID is a bloody hard problem, hence the fact that I'm
designing a solution for a filesystem on top of an entire bare
drive. I'm not trying to solve every use case in the world, just the
one where the drive manufactures think SMR will be mostly used: the
back end of never delete distributed storage environments

We can't wait for years for infrastructure layers to catch up in the
brave new world of shipping SMR drives. We may not like them, but we
have to make stuff work. I'm not trying to solve every problem - I'm
just trying to address the biggest use case I see for SMR devices
and it just so happens that XFS is already used pervasively in that
same use case, mostly within the same no raid, fs per entire
device constraints as I've documented for this proposal...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: BUG: scheduling while atomic in blk_mq codepath?

2014-06-19 Thread Dave Chinner
On Thu, Jun 19, 2014 at 12:21:44PM -0400, Theodore Ts'o wrote:
 On Thu, Jun 19, 2014 at 12:08:01PM -0400, Theodore Ts'o wrote:
   The other issue, not sure, not a lot of detail. It may be fixed by the pull
   request I sent out yesterday. You can try pulling in:
   
   git://git.kernel.dk/linux-block.git for-linus
  
  Thanks, I'll give that a try.
 
 I tried merging in your for-linus branch in v3.16-rc1, and I'm seeing
 the following.  On a 32-bit x86 3.15 kernel, run: mke2fs -t ext3
 /dev/vdc where /dev/vdc is a 5 gig virtio partition.

Short reads are more likely a bug in all the iovec iterator stuff
that got merged in from the vfs tree. ISTR a 32 bit-only bug in that
stuff go past, to do with not being able to partition a 32GB block
dev on a 32 bit system due to a 32 bit size_t overflow somewhere...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-29 Thread Dave Chinner
On Wed, Jan 29, 2014 at 09:52:46PM -0700, Matthew Wilcox wrote:
 On Fri, Jan 24, 2014 at 10:57:48AM +, Mel Gorman wrote:
  So far on the table is
  
  1. major filesystem overhaul
  2. major vm overhaul
  3. use compound pages as they are today and hope it does not go
 completely to hell, reboot when it does
 
 Is the below paragraph an exposition of option 2, or is it an option 4,
 change the VM unit of allocation?  Other than the names you're using,
 this is basically what I said to Kirill in an earlier thread; either
 scrap the difference between PAGE_SIZE and PAGE_CACHE_SIZE, or start
 making use of it.

Christoph Lameter's compound page patch set scrapped PAGE_CACHE_SIZE
and made it a variable that was set on the struct address_space when
it was instantiated by the filesystem. In effect, it allowed
filesystems to specify the unit of page cache allocation on a
per-inode basis.

 The fact that EVERYBODY in this thread has been using PAGE_SIZE when they
 should have been using PAGE_CACHE_SIZE makes me wonder if part of the
 problem is that the split in naming went the wrong way.  ie use PTE_SIZE
 for 'the amount of memory pointed to by a pte_t' and use PAGE_SIZE for
 'the amount of memory described by a struct page'.

PAGE_CACHE_SIZE was never distributed sufficiently to be used, and
if you #define it to something other than PAGE_SIZE stuff will
simply break.
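
For reference, this is roughly what sat in include/linux/pagemap.h at the
time - the macros existed, but were hard-wired to the CPU page size:

#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define PAGE_CACHE_SIZE		PAGE_SIZE
#define PAGE_CACHE_MASK		PAGE_MASK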

 (we need to remove the current users of PTE_SIZE; sparc32 and powerpc32,
 but that's just a detail)
 
 And we need to fix all the places that are currently getting the
 distinction wrong.  SMOP ... ;-)  What would help is correct typing of
 variables, possibly with sparse support to help us out.  Big Job.

Yes, that's what Christoph's patchset did.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 02:34:52PM +, Mel Gorman wrote:
 On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
  On 01/22/2014 04:34 AM, Mel Gorman wrote:
  On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
  One topic that has been lurking forever at the edges is the current
  4k limitation for file system block sizes. Some devices in
  production today and others coming soon have larger sectors and it
  would be interesting to see if it is time to poke at this topic
  again.
  
  Large block support was proposed years ago by Christoph Lameter
  (http://lwn.net/Articles/232757/). I think I was just getting started
  in the community at the time so I do not recall any of the details. I do
  believe it motivated an alternative by Nick Piggin called fsblock though
  (http://lwn.net/Articles/321390/). At the very least it would be nice to
  know why neither were never merged for those of us that were not around
  at the time and who may not have the chance to dive through mailing list
  archives between now and March.
  
  FWIW, I would expect that a show-stopper for any proposal is requiring
  high-order allocations to succeed for the system to behave correctly.
  
  
  I have a somewhat hazy memory of Andrew warning us that touching
  this code takes us into dark and scary places.
  
 
 That is a light summary. As Andrew tends to reject patches with poor
 documentation in case we forget the details in 6 months, I'm going to guess
 that he does not remember the details of a discussion from 7ish years ago.
 This is where Andrew swoops in with a dazzling display of his eidetic
 memory just to prove me wrong.
 
 Ric, are there any storage vendor that is pushing for this right now?
 Is someone working on this right now or planning to? If they are, have they
 looked into the history of fsblock (Nick) and large block support (Christoph)
 to see if they are candidates for forward porting or reimplementation?
 I ask because without that person there is a risk that the discussion
 will go as follows
 
 Topic leader: Does anyone have an objection to supporting larger block
   sizes than the page size?
 Room: Send patches and we'll talk.

So, from someone who was done in the trenches of the large
filesystem block size code wars, the main objection to Christoph
lameter's patchset was that it used high order compound pages in the
page cache so that nothing at filesystem level needed to be changed
to support large block sizes.

The patch to enable XFS to use 64k block sizes with Christoph's
patches was simply removing 5 lines of code that limited the block
size to PAGE_SIZE. And everything just worked.

Given that compound pages are used all over the place now and we
also have page migration, compaction and other MM support that
greatly improves high order memory allocation, perhaps we should
revisit this approach.

As to Nick's fsblock rewrite, he basically rewrote all the
bufferhead code to handle filesystem blocks larger than a page
whilst leaving the page cache untouched. i.e. the complete opposite
approach. The problem with this approach is that every filesystem
needs to be re-written to use fsblocks rather than bufferheads. For
some filesystems that isn't hard (e.g. ext2) but for filesystems
that use bufferheads in the core of their journalling subsystems
that's a completely different story.

And for filesystems like XFS, it doesn't solve any of the problem
with using bufferheads that we have now, so it simply introduces a
huge amount of IO path rework and validation without providing any
advantage from a feature or performance point of view. i.e. extent
based filesystems mostly negate the impact of filesystem block size
on IO performance...

Realistically, if I'm going to do something in XFS to add block size >
page size support, I'm going to do it with something XFS can track
through it's own journal so I can add data=journal functionality
with the same filesystem block/extent header structures used to
track the pages in blocks larger than PAGE_SIZE. And given that we
already have such infrastructure in XFS to support directory
blocks larger than filesystem block size

FWIW, as to the original large sector size support question, XFS
already supports sector sizes up to 32k in size. The limitation is
actually a limitation of the journal format, so going larger than
that would take some work...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
   On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
  
  [ I like big sectors and I cannot lie ]
 
 I think I might be sceptical, but I don't think that's showing in my
 concerns ...
 
I really think that if we want to make progress on this one, we need
code and someone that owns it.  Nick's work was impressive, but it was
mostly there for getting rid of buffer heads.  If we have a device that
needs it and someone working to enable that device, we'll go forward
much faster.
   
   Do we even need to do that (eliminate buffer heads)?  We cope with 4k
   sector only devices just fine today because the bh mechanisms now
   operate on top of the page cache and can do the RMW necessary to update
   a bh in the page cache itself which allows us to do only 4k chunked
   writes, so we could keep the bh system and just alter the granularity of
   the page cache.
   
  
  We're likely to have people mixing 4K drives and fill in some other
  size here on the same box.  We could just go with the biggest size and
  use the existing bh code for the sub-pagesized blocks, but I really
  hesitate to change VM fundamentals for this.
 
 If the page cache had a variable granularity per device, that would cope
 with this.  It's the variable granularity that's the VM problem.
 
  From a pure code point of view, it may be less work to change it once in
  the VM.  But from an overall system impact point of view, it's a big
  change in how the system behaves just for filesystem metadata.
 
 Agreed, but only if we don't do RMW in the buffer cache ... which may be
 a good reason to keep it.
 
   The other question is if the drive does RMW between 4k and whatever its
   physical sector size, do we need to do anything to take advantage of
   it ... as in what would altering the granularity of the page cache buy
   us?
  
  The real benefit is when and how the reads get scheduled.  We're able to
  do a much better job pipelining the reads, controlling our caches and
  reducing write latency by having the reads done up in the OS instead of
  the drive.
 
 I agree with all of that, but my question is still can we do this by
 propagating alignment and chunk size information (i.e. the physical
 sector size) like we do today.  If the FS knows the optimal I/O patterns
 and tries to follow them, the odd cockup won't impact performance
 dramatically.  The real question is can the FS make use of this layout
 information *without* changing the page cache granularity?  Only if you
 answer me no to this do I think we need to worry about changing page
 cache granularity.

We already do this today.

The problem is that we are limited by the page cache assumption that
the block device/filesystem never need to manage multiple pages as
an atomic unit of change. Hence we can't use the generic
infrastructure as it stands to handle block/sector sizes larger than
a page size...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 09:21:40AM -0800, James Bottomley wrote:
 On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote:
  On Wed, 2014-01-22 at 15:19 +, Mel Gorman wrote:
   On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote:
On 01/22/2014 09:34 AM, Mel Gorman wrote:
On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote:
On 01/22/2014 04:34 AM, Mel Gorman wrote:
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote:
One topic that has been lurking forever at the edges is the current
4k limitation for file system block sizes. Some devices in
production today and others coming soon have larger sectors and it
would be interesting to see if it is time to poke at this topic
again.

Large block support was proposed years ago by Christoph Lameter
(http://lwn.net/Articles/232757/). I think I was just getting started
in the community at the time so I do not recall any of the details. I
do believe it motivated an alternative by Nick Piggin called fsblock
though (http://lwn.net/Articles/321390/). At the very least it would
be nice to know why neither was ever merged for those of us that were
not around at the time and who may not have the chance to dive through
mailing list archives between now and March.

FWIW, I would expect that a show-stopper for any proposal is requiring
high-order allocations to succeed for the system to behave correctly.

I have a somewhat hazy memory of Andrew warning us that touching
this code takes us into dark and scary places.

That is a light summary. As Andrew tends to reject patches with poor
documentation in case we forget the details in 6 months, I'm going to
guess that he does not remember the details of a discussion from 7ish
years ago. This is where Andrew swoops in with a dazzling display of
his eidetic memory just to prove me wrong.

Ric, are there any storage vendors that are pushing for this right now?
Is someone working on this right now or planning to? If they are, have
they looked into the history of fsblock (Nick) and large block support
(Christoph) to see if they are candidates for forward porting or
reimplementation?
I ask because without that person there is a risk that the discussion
will go as follows

Topic leader: Does anyone have an objection to supporting larger block
   sizes than the page size?
Room: Send patches and we'll talk.


I will have to see if I can get a storage vendor to make a public
statement, but there are vendors hoping to see this land in Linux in
the next few years.
   
   What about the second and third questions -- is someone working on this
   right now or planning to? Have they looked into the history of fsblock
   (Nick) and large block support (Christoph) to see if they are candidates
   for forward porting or reimplementation?
  
  I really think that if we want to make progress on this one, we need
  code and someone that owns it.  Nick's work was impressive, but it was
  mostly there for getting rid of buffer heads.  If we have a device that
  needs it and someone working to enable that device, we'll go forward
  much faster.
 
 Do we even need to do that (eliminate buffer heads)?

No, the reason bufferheads were replaced was that a bufferhead can
only reference a single page. i.e. the structure is that a page can
reference multiple bufferheads (block size <= page size) but a
bufferhead can't reference multiple pages, which is what is needed for
block size > page size. fsblock was designed to handle both cases.
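
To make the asymmetry concrete, the relevant bufferhead fields look
something like this (trimmed from include/linux/buffer_head.h, from
memory, keeping only the fields that matter here):

struct buffer_head {
	struct buffer_head *b_this_page;	/* circular list of the page's buffers */
	struct page *b_page;			/* the one page this bh is mapped to */
	sector_t b_blocknr;			/* start block number */
	size_t b_size;				/* size of the mapping, at most one page */
	char *b_data;				/* pointer to data within that page */
	/* ... */
};

There's one b_page pointer and one b_data pointer into that page, so a
bufferhead simply has nowhere to hang a second page off.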

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Wed, Jan 22, 2014 at 11:50:02AM -0800, Andrew Morton wrote:
 On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
 
  But this, I think, is the fundamental point for debate.  If we can pull
  alignment and other tricks to solve 99% of the problem is there a need
  for radical VM surgery?  Is there anything coming down the pipe in the
  future that may move the devices ahead of the tricks?
 
 I expect it would be relatively simple to get large blocksizes working
 on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
 amounts of work, perhaps someone can do a proof-of-concept on powerpc
 (or ia64) with 64k blocksize.

Reality check: 64k block sizes on 64k page Linux machines have been
used in production on XFS for at least 10 years. It's exactly the
same case as 4k block size on 4k page size - one page, one buffer
head, one filesystem block.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Thu, Jan 23, 2014 at 07:55:50AM -0500, Theodore Ts'o wrote:
 On Thu, Jan 23, 2014 at 07:35:58PM +1100, Dave Chinner wrote:
   
   I expect it would be relatively simple to get large blocksizes working
   on powerpc with 64k PAGE_SIZE.  So before diving in and doing huge
   amounts of work, perhaps someone can do a proof-of-concept on powerpc
   (or ia64) with 64k blocksize.
  
  Reality check: 64k block sizes on 64k page Linux machines have been
  used in production on XFS for at least 10 years. It's exactly the
  same case as 4k block size on 4k page size - one page, one buffer
  head, one filesystem block.
 
 This is true for ext4 as well.  Block size == page size support is
 pretty easy; the hard part is when block size > page size, due to
 assumptions in the VM layer that require the FS to do a lot of extra
 work to fudge around.  So the real problem comes with
 trying to support 64k block sizes on a 4k page architecture, and can
 we do it in a way where every single file system doesn't have to do
 their own specific hacks to work around assumptions made in the VM
 layer.
 
 Some of the problems include handling the case where someone
 dirties a single block in a sparse page, and the FS needs to manually
 fault in the other 56k pages around that single page.  Or the VM not
 understanding that page eviction needs to be done in chunks of 64k so
 we don't have part of the block evicted but not all of it, etc.

Right, this is part of the problem that fsblock tried to handle, and
some of the nastiness it had was that a page fault only resulted in
the individual page being read from the underlying block. This means
that it was entirely possible that the filesystem would need to do
RMW cycles in the writeback path itself to handle things like block
checksums, copy-on-write, unwritten extent conversion, etc. i.e. all
the stuff that the page cache currently handles by doing RMW cycles
at the page level.

The method of using compound pages in the page cache so that the
page cache could do 64k RMW cycles so that a filesystem never had to
deal with new issues like the above was one of the reasons that
approach is so appealing to us filesystem people. ;)
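
To spell out what "the page cache handles the RMW" means here - a
hand-waving sketch only, every name below is made up for illustration,
and today the unit is a single page where we'd want it to be the
filesystem block:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one cached filesystem block (e.g. 64k) */
struct cache_unit {
	char	*data;		/* mapping of the whole unit */
	size_t	 size;		/* unit size */
	bool	 uptodate;	/* whole unit has been read in */
	bool	 dirty;		/* whole unit needs writeback */
};

/*
 * A write covering only part of the unit: read the whole thing in
 * first, modify the written range, and let writeback push the whole
 * unit back out.  If the page cache works on compound pages the
 * filesystem never sees the partial write; otherwise it has to do
 * this itself in its writeback path.
 */
static int write_partial_unit(struct cache_unit *u, size_t off, size_t len,
			      const void *src,
			      int (*read_unit)(struct cache_unit *))
{
	int err;

	if (!u->uptodate) {
		err = read_unit(u);		/* Read */
		if (err)
			return err;
		u->uptodate = true;
	}
	memcpy(u->data + off, src, len);	/* Modify */
	u->dirty = true;			/* Write(back) later */
	return 0;
}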

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes

2014-01-23 Thread Dave Chinner
On Thu, Jan 23, 2014 at 04:44:38PM +, Mel Gorman wrote:
 On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
  On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
   On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote:
  The other question is if the drive does RMW between 4k and whatever its
  physical sector size, do we need to do anything to take advantage of
  it ... as in what would altering the granularity of the page cache buy
  us?
 
 The real benefit is when and how the reads get scheduled.  We're able to
 do a much better job pipelining the reads, controlling our caches and
 reducing write latency by having the reads done up in the OS instead of
 the drive.

I agree with all of that, but my question is still can we do this by
propagating alignment and chunk size information (i.e. the physical
sector size) like we do today.  If the FS knows the optimal I/O patterns
and tries to follow them, the odd cockup won't impact performance
dramatically.  The real question is can the FS make use of this layout
information *without* changing the page cache granularity?  Only if you
answer me no to this do I think we need to worry about changing page
cache granularity.
   
   We already do this today.
   
   The problem is that we are limited by the page cache assumption that
   the block device/filesystem never need to manage multiple pages as
   an atomic unit of change. Hence we can't use the generic
   infrastructure as it stands to handle block/sector sizes larger than
   a page size...
  
  If the compound page infrastructure exists today and is usable for this,
  what else do we need to do? ... because if it's a couple of trivial
  changes and a few minor patches to filesystems to take advantage of it,
  we might as well do it anyway. 
 
 Do not do this as there is no guarantee that a compound allocation will
 succeed. If the allocation fails then it is potentially unrecoverable
 because we can no longer write to storage and then you're hosed.  If you are
 now thinking mempool then the problem becomes that the system will be
 in a state of degraded performance for an unknowable length of time and
 may never recover fully.

We are talking about page cache allocation here, not something deep
down inside the IO path that requires mempools to guarantee IO
completion. IOWs, we have an *existing error path* to return ENOMEM
to userspace when page cache allocation fails.
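
i.e. something like the usual ->write_begin() shape.  This is
simplified from memory rather than lifted from any particular
filesystem, but the error path is the point:

/* Simplified sketch - the point is only the error path */
#include <linux/pagemap.h>

static int example_write_begin(struct address_space *mapping, pgoff_t index,
			       struct page **pagep)
{
	struct page *page;

	page = grab_cache_page_write_begin(mapping, index, 0);
	if (!page)
		return -ENOMEM;		/* propagates back to write(2) */

	*pagep = page;
	return 0;
}

A compound page allocation failing there takes exactly the same
-ENOMEM path back to userspace; it's nothing like an allocation inside
the IO completion path that would need a mempool.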

 64K MMU page size systems get away with this
 because the blocksize is still <= PAGE_SIZE and no core VM changes are
 necessary. Critically, pages like the page table pages are the same size as
 the basic unit of allocation used by the kernel so external fragmentation
 simply is not a severe problem.

Christoph's old patches didn't need 64k MMU page sizes to work.
IIRC, the compound page was mapped into the page cache as
individual 4k pages. Any change of state on the child pages followed
the back pointer to the head of the compound page and changed the
state of that page. On page faults, the individual 4k pages were
mapped to userspace rather than the compound page, so there was no
userspace visible change, either.

The question I had at the time that was never answered was this: if
pages are faulted and mapped individually through their own ptes,
why did the compound pages need to be contiguous? copy-in/out
through read/write was still done a PAGE_SIZE granularity, mmap
mappings were still on PAGE_SIZE granularity, so why can't we build
a compound page for the page cache out of discontiguous pages?

FWIW, XFS has long used discontiguous pages for large block support
in metadata. Some of that is vmapped to make metadata processing
simple. The point of this is that we don't need *contiguous*
compound pages in the page cache if we can map them into userspace
as individual PAGE_SIZE pages. Only the page cache management needs
to handle the groups of pages that make up a filesystem block
as a compound page.
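
For the curious, the guts of that XFS trick is nothing more exotic
than vmap() over a discontiguous page array.  A sketch, not the actual
_xfs_buf_map_pages() code:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Present nr_pages discontiguous pages as one contiguous kernel
 * mapping.  The pages can come from anywhere; only the virtual
 * mapping is contiguous.  Torn down later with vunmap().
 */
static void *map_discontig_block(struct page **pages, unsigned int nr_pages)
{
	return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
}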

  I was only objecting on the grounds that
  the last time we looked at it, it was major VM surgery.  Can someone
  give a summary of how far we are away from being able to do this with
  the VM system today and what extra work is needed (and how big is this
  piece of work)?
  
 
 Offhand no idea. For fsblock, probably a similar amount of work to what
 had to be done in 2007, and I'd expect it would still hit the filesystem
 awareness problems that Dave Chinner pointed out earlier. For large block,
 it'd hit into the same wall that allocations must always succeed. If we
 want to break the connection between the basic unit of memory managed
 by the kernel and the MMU page size then I don't know but it would be a
 fairly large amount of surgery and need a lot of design work.

Here's the patch that Christoph wrote back in 2007 to add PAGE_SIZE
based mmap

Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.

2014-01-20 Thread Dave Chinner
On Mon, Jan 20, 2014 at 05:58:55AM -0800, Christoph Hellwig wrote:
 On Thu, Jan 16, 2014 at 09:07:21AM +1100, Dave Chinner wrote:
  Yes, I think it can be done relatively simply. We'd have to change
  the code in xfs_file_aio_write_checks() to check whether EOF zeroing
  was required rather than always taking an exclusive lock (for block
  aligned IO at EOF sub-block zeroing isn't required),
 
 That's not even required for supporting aio appends, just a further
 optimization for it.

Oh, right, I got an off-by-one when reading the code - the EOF
zeroing only occurs when the offset is beyond EOF, not at or beyond
EOF...

  and then we'd
  have to modify the direct IO code to set the is_async flag
  appropriately. We'd probably need a new flag to tell the DIO
  code that AIO beyond EOF is OK, but that isn't hard to do
 
 Yep, need a flag to allow appending writes and then defer them.
 
  Christoph, are you going to get any time to look at doing this in
  the next few days?
 
 I'll probably need at least another week before I can get to it.  If you
 wanna pick it up before then, feel free.

I'm probably not going to get to it before then, either, so check
back in a week?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3 times slower than Solaris 10 with the same HBA/Storage.

2014-01-15 Thread Dave Chinner
On Tue, Jan 14, 2014 at 03:30:11PM +0200, Sergey Meirovich wrote:
 Hi Cristoph,
 
 On 8 January 2014 16:03, Christoph Hellwig h...@infradead.org wrote:
  On Tue, Jan 07, 2014 at 08:37:23PM +0200, Sergey Meirovich wrote:
  Actually my initial report (14.67Mb/sec  3755.41 Requests/sec) was about ext4.
  However I have tried XFS as well. It was a bit slower than ext4 on all
  occasions.
 
  I wasn't trying to say XFS fixes your problem, but that we could
  implement appending AIO writes in XFS fairly easily.
 
  To verify Jan's theory, can you try to preallocate the file to the full
  size and then run the benchmark by doing a:
 
  # fallocate -l <size> <filename>
 
  and then run it?  If that's indeed the issue I'd be happy to implement
  the real aio append support for you as well.
 
 
 I've resorted to writing a simple wrapper around io_submit() and ran it
 against a preallocated file (exactly to avoid the append AIO scenario).
 Random data was used to avoid XtremIO online deduplication but results
 were still wonderful for 4k sequential AIO write:
 
 744.77 MB/s   190660.17 Req/sec
 
 Clearly Linux lacks real aio append being available for any FS.
 It seems that you are thinking it would be relatively easy to
 implement it for XFS on Linux? If so - I will really appreciate your
 effort.
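
[ For reference, a wrapper like the one described above boils down to
roughly the following - a minimal sketch using libaio against an
already preallocated file.  The file name, the 1GiB size and the queue
depth of 1 are placeholders; a real benchmark would keep many more
iocbs in flight.  Build with "gcc -o seqaio seqaio.c -laio". ]

#define _GNU_SOURCE		/* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BS	4096UL
#define TOTAL	(1UL << 30)	/* 1 GiB, placeholder */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	unsigned long off;
	void *buf;
	int fd;

	fd = open("testfile", O_WRONLY | O_DIRECT);	/* preallocated file */
	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, BS, BS))
		return 1;
	memset(buf, 'x', BS);		/* the real test used random data */

	for (off = 0; off < TOTAL; off += BS) {
		io_prep_pwrite(&cb, fd, buf, BS, off);
		if (io_submit(ctx, 1, cbs) != 1)
			break;
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
			break;
	}
	io_destroy(ctx);
	close(fd);
	return 0;
}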

Yes, I think it can be done relatively simply. We'd have to change
the code in xfs_file_aio_write_checks() to check whether EOF zeroing
was required rather than always taking an exclusive lock (for block
aligned IO at EOF sub-block zeroing isn't required), and then we'd
have to modify the direct IO code to set the is_async flag
appropriately. We'd probably need a new flag to tell the DIO
code that AIO beyond EOF is OK, but that isn't hard to do.
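
The shape of that check is simple enough.  As a sketch only - this is
not the actual xfs_file_aio_write_checks() change, the helper name is
made up and locking is elided:

#include <linux/fs.h>

/*
 * Only a write that starts beyond EOF, where EOF is not block aligned,
 * needs the old EOF block zeroed (and hence the exclusive lock).
 */
static bool write_needs_eof_zeroing(struct inode *inode, loff_t pos,
				    unsigned int blocksize)
{
	loff_t isize = i_size_read(inode);

	if (pos <= isize)
		return false;		/* not extending past EOF */
	if ((isize & (blocksize - 1)) == 0)
		return false;		/* EOF is block aligned, nothing to zero */
	return true;			/* partial block at EOF must be zeroed */
}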

And for those that are wondering about the stale data exposure problem
documented in the aio code:

/*
 * For file extending writes updating i_size before data
 * writeouts complete can expose uninitialized blocks. So
 * even for AIO, we need to wait for i/o to complete before
 * returning in this case.
 */

This is fixed in XFS by removing a single if() check in
xfs_iomap_write_direct(). We already use unwritten extents for DIO
within EOF to avoid races that could expose uninitialised blocks, so
we just need to make that unconditional behaviour.  Hence racing IO
on concurrent appending i_size updates will only ever see a hole
(zeros), an unwritten region (zeros) or the written data.
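
So the completion-side ordering effectively becomes the following.
Again a sketch, not the real xfs_end_io() path, and the conversion
callback is just a placeholder for the unwritten extent conversion:

#include <linux/fs.h>

/*
 * Convert the unwritten extent to written *before* moving i_size, so a
 * racing reader never sees stale blocks past the old EOF.  Locking and
 * error handling elided.
 */
static void dio_append_complete(struct inode *inode, loff_t offset, size_t size,
				int (*convert_unwritten)(struct inode *, loff_t, size_t))
{
	convert_unwritten(inode, offset, size);		/* unwritten -> written */
	if (offset + size > i_size_read(inode))
		i_size_write(inode, offset + size);	/* now safe to expose */
}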

Christoph, are you going to get any time to look at doing this in
the next few days?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com