[patch] sg: clean up gfp_mask in sg_build_indirect

2018-06-18 Thread Jeff Moyer
commit a45b599ad808c ("scsi: sg: allocate with __GFP_ZERO in
sg_build_indirect()") changed the call to alloc_pages to always use
__GFP_ZERO.  Just above that, though, there was this:

   if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
           gfp_mask |= __GFP_ZERO;

And there's only one user of the gfp_mask.  Just OR in the __GFP_ZERO
flag at the top of the function and be done with it.

Signed-off-by: Jeff Moyer 

p.s. I'm not even sure why the original patch was committed -- it does
nothing for security.  I assume that the zeroing isn't done for
CAP_SYS_ADMIN / CAP_SYS_RAWIO because of performance.  (The history of
that zeroing being conditional goes way back to pre-git.)  Was there any
performance measurement done after the commit a45b599ad808c was applied?

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 53ae52dbff84..c926e07aabda 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -1850,7 +1850,7 @@ sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size)
 	int ret_sz = 0, i, k, rem_sz, num, mx_sc_elems;
 	int sg_tablesize = sfp->parentdp->sg_tablesize;
 	int blk_size = buff_size, order;
-	gfp_t gfp_mask = GFP_ATOMIC | __GFP_COMP | __GFP_NOWARN;
+	gfp_t gfp_mask = GFP_ATOMIC | __GFP_COMP | __GFP_NOWARN | __GFP_ZERO;
 	struct sg_device *sdp = sfp->parentdp;
 
 	if (blk_size < 0)
@@ -1880,9 +1880,6 @@ sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size)
 	if (sdp->device->host->unchecked_isa_dma)
 		gfp_mask |= GFP_DMA;
 
-	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
-		gfp_mask |= __GFP_ZERO;
-
 	order = get_order(num);
 retry:
 	ret_sz = 1 << (PAGE_SHIFT + order);
@@ -1893,7 +1890,7 @@ sg_build_indirect(Sg_scatter_hold * schp, Sg_fd * sfp, int buff_size)
 		num = (rem_sz > scatter_elem_sz_prev) ?
 			scatter_elem_sz_prev : rem_sz;
 
-		schp->pages[k] = alloc_pages(gfp_mask | __GFP_ZERO, order);
+		schp->pages[k] = alloc_pages(gfp_mask, order);
 		if (!schp->pages[k])
 			goto out;
 


Re: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Jeff Moyer
Hi, Kashyap,

I'm CC-ing Kent, seeing how this is his code.

Kashyap Desai  writes:

> Objective of this patch is - 
>
> To move code used in bcache module in block layer which is used to
> find IO stream.  Reference code @drivers/md/bcache/request.c
> check_should_bypass().  This is a high level patch for review and
> understand if it is worth to follow ?
>
> As of now bcache module use this logic, but good to have it in block
> layer and expose function for external use.
>
> In this patch, I move logic of sequential IO search in block layer and
> exposed function blk_queue_rq_seq_cutoff.  Low level driver just need
> to call if they want stream detection per request queue.  For my
> testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue,
> 4) megaraid_sas driver.
>  
> In general, code of bcache module was referred and they are doing
> almost same as what we want to do in megaraid_sas driver below patch -
>
> http://marc.info/?l=linux-scsi=148245616108288=2
>  
> bcache implementation use search algorithm (hashed based on bio start
> sector) and detects 128 streams.  wanted those implementation
> to skip sequential IO to be placed on SSD and move it direct to the
> HDD.
>
> Will it be good design to keep this algorithm open at block layer (as
> proposed in patch.) ?

It's almost always a good idea to avoid code duplication, but this patch
definitely needs some work.

I haven't looked terribly closely at the bcache implementation, so do
let me know if I've misinterpreted something.

We should track streams per io_context/queue pair.  We already have a
data structure for that, the io_cq.  Right now that structure is
tailored for use by the I/O schedulers, but I'm sure we could rework
that.  That would also get rid of the tremendous amount of bloat this
patch adds to the request_queue.  It will also allow us to remove the
bcache-specific fields that were added to task_struct.  Overall, it
should be a good simplification, unless I've completely missed the point
(which happens).

I don't like that you put sequential I/O detection into bio_check_eod.
Split it out into its own function.

You've added a member to struct bio that isn't referenced.  It would
have been nice of you to put enough work into this RFC so that we could
at least see how the common code was used by bcache and your driver.

EWMA (exponentially weighted moving average) is not an acronym I keep
handy in my head.  It would be nice to add documentation on the
algorithm and design choices.  More comments in the code would also be
appreciated.  CFQ does some similar things (detecting sequential
vs. seeky I/O) in a much lighter-weight fashion.  Any change to the
algorithm, of course, would have to be verified to still meet bcache's
needs.

A queue flag might be a better way for the driver to request this
functionality.

Coding style will definitely need fixing.

I hope that was helpful.

Cheers,
Jeff

>
> Signed-off-by: Kashyap desai 
> ---
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 14d7c07..2e93d14 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -693,6 +693,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  {
>   struct request_queue *q;
>   int err;
> + struct seq_io_tracker *io;
>  
>   q = kmem_cache_alloc_node(blk_requestq_cachep,
>   gfp_mask | __GFP_ZERO, node_id);
> @@ -761,6 +762,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  
>   if (blkcg_init_queue(q))
>   goto fail_ref;
> + 
> + q->sequential_cutoff = 0;
> + spin_lock_init(&q->io_lock);
> + INIT_LIST_HEAD(&q->io_lru);
> +
> + for (io = q->io; io < q->io + BLK_RECENT_IO; io++) {
> + list_add(&io->lru, &q->io_lru);
> + hlist_add_head(&io->hash, q->io_hash + BLK_RECENT_IO);
> + }
>  
>   return q;
>  
> @@ -1876,6 +1886,26 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
>   return 0;
>  }
>  
> +static void add_sequential(struct task_struct *t)
> +{
> +#define blk_ewma_add(ewma, val, weight, factor) \
> +({  \
> +(ewma) *= (weight) - 1; \
> +(ewma) += (val) << factor;  \
> +(ewma) /= (weight); \
> +(ewma) >> factor;   \
> +})
> +
> + blk_ewma_add(t->sequential_io_avg,
> +  t->sequential_io, 8, 0);
> +
> + t->sequential_io = 0;
> +}
> +static struct hlist_head *blk_iohash(struct request_queue *q, uint64_t k)
> +{
> + return &q->io_hash[hash_64(k, BLK_RECENT_IO_BITS)];
> +}
> +
>  static noinline_for_stack bool
>  generic_make_request_checks(struct bio *bio)
>  {
> @@ -1884,6 +1914,7 @@ 

Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Jeff Moyer
Christoph Hellwig  writes:

> On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
>> (3) Any other idea ?
>
> Do nothing and ignore the problem.  This whole idea is so braindead that
> the person coming up with the T10 language should be shot.  Either a device
> has 512-byte logical sectors or 4k but not this crazy mix.
>
> And make sure no one ships such a piece of crap because we are hell
> not going to support it.

Agreed.  This is insane.

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC] blk-trace update vs API stability

2017-01-04 Thread Jeff Moyer
Hannes Reinecke  writes:

> At LSF I'd like to discuss
> - Do we consider blktrace (and any other tracepoint in eg SCSI) as a
> stable API?

I don't have a strong opinion on this.

> - How do we go about modifying blktrace?

Blktrace has a version number associated with trace events.  Bump the
version for non-backward-compatible changes.

> - Is this even desired?

Yes.

Cheers,
Jeff


Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-19 Thread Jeff Moyer
Shaun Tancheff <shaun.tanch...@seagate.com> writes:

> On Tue, Oct 18, 2016 at 11:58 AM, Jeff Moyer <jmo...@redhat.com> wrote:
>> Damien Le Moal <damien.lem...@wdc.com> writes:
>>
>>> + if (!is_power_of_2(zone_blocks)) {
>>> + if (sdkp->first_scan)
>>> + sd_printk(KERN_NOTICE, sdkp,
>>> +   "Devices with non power of 2 zone "
>>> +   "size are not supported\n");
>>> + return -ENODEV;
>>> + }
>>
>> Are power of 2 zone sizes required by the standard?  I see why you've
>> done this, but I wonder if we're artificially limiting the
>> implementation, and whether there will be valid devices on the market
>> that simply won't work with Linux because of this.
>
> The standard does not require power of 2 zones.
> That said, I am not aware of any current (or planned) devices other
> than a power of 2.
> Common zone sizes I am aware of: 256MiB, 128MiB and 1GiB.
>
> Also note that we are excluding the runt zone from the power of 2 expectation.
>
> So conforming devices should (excluding a runt zone):
>   - Have zones of the same size.
>   - Choose a zone size that is a power of 2.

OK, thanks for following up.  As long as the drive vendors are on board,
I guess we're okay.  :)

-Jeff


Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-19 Thread Jeff Moyer
"Martin K. Petersen" <martin.peter...@oracle.com> writes:

>>>>>> "Jeff" == Jeff Moyer <jmo...@redhat.com> writes:
>
> Jeff,
>
> Jeff> Are power of 2 zone sizes required by the standard?  I see why
> Jeff> you've done this, but I wonder if we're artificially limiting the
> Jeff> implementation, and whether there will be valid devices on the
> Jeff> market that simply won't work with Linux because of this.
>
> Standards are deliberately written to be permissive. But Linux doesn't
> support arbitrary sector sizes either even though the spec allows it. We
> always pick a reasonably sane subset of features to implement and this
> case is no different.
>
> After some discussion we decided to rip out all the complexity that was
> required to facilitate crazy drive layouts. As a result, the code is now
> in a state where we can actually merge it. The hope is that by picking a
> specific configuration subset and widely advertising it we can influence
> the market.
>
> Also, I am not aware of anybody actually asking the drive vendors to
> support crazy zone configurations.

That's fine with me.  I didn't see any mention of this design decision
in the patch logs, though, and given that I'm not intimately involved, I
asked the question.

Thanks,
Jeff


Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-18 Thread Jeff Moyer
Damien Le Moal  writes:

> + if (!is_power_of_2(zone_blocks)) {
> + if (sdkp->first_scan)
> + sd_printk(KERN_NOTICE, sdkp,
> +   "Devices with non power of 2 zone "
> +   "size are not supported\n");
> + return -ENODEV;
> + }

Are power of 2 zone sizes required by the standard?  I see why you've
done this, but I wonder if we're artificially limiting the
implementation, and whether there will be valid devices on the market
that simply won't work with Linux because of this.

-Jeff


Re: [PATCH v8 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes

2016-10-18 Thread Jeff Moyer
Damien Le Moal  writes:

> diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
> index 75a5055..ee2d5cd 100644
> --- a/Documentation/ABI/testing/sysfs-block
> +++ b/Documentation/ABI/testing/sysfs-block
> @@ -251,3 +251,16 @@ Description:
>   since drive-managed zoned block devices do not support
>   zone commands, they will be treated as regular block
>   devices and zoned will report "none".
> +
> +What:		/sys/block/<disk>/queue/chunk_sectors
> +Date:		September 2016
> +Contact:	Hannes Reinecke 
> +Description:
> + chunk_sectors has different meaning depending on the type
> + of the disk. For a RAID device (dm-raid), chunk_sectors
> + indicates the size in 512B sectors of the RAID volume
> + stripe segment. For a zoned block device, either
> + host-aware or host-managed, chunk_sectors indicates the
> + size of 512B sectors of the zones of the device, with
         ^^
         in


Re: HDD Unrecovered readerror issue

2016-07-22 Thread Jeff Moyer
Dmitry Monakhov  writes:

> But once I rewrite this block, problem goes away.
> #xfs_io -c "pwrite -S 0x0 $((80069000/2))k 4k" -d  /dev/sda
>
> Now I can read it w/o any errors and smartctl is happy
> #smartctl -t short /dev/sda
> #smartctl -l selftest /dev/sda
> Num  Test_DescriptionStatus  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline   Completed without error   00%  4683 -
>
> So my disk is not dead right?

Correct.

> Why the hell HDD fail read from very beginning
> Is this because HDD firmware detect internal crcXX sum corruption?

Yes.

> How this can happen? Is this because of power failure?

Could be.  If power was cut in the middle of a write, this can happen.
There are other causes, though (bit rot, for example).

> AFAIK standard guarantees that sector will be updated atomically.

No, the SCSI and ATA standards most certainly do not guarantee that!
NVMe is the only standard I know of that requires Atomic Write Unit
Power Fail to be at least one sector.

> But it happens! Please guide me how to fix such problems in general.

You fixed it.  Overwriting the sector will clear the error.

Cheers,
Jeff


Re: [PATCH 00/42] v7: separate operations from flags in the bio/request structs

2016-05-04 Thread Jeff Moyer
Mike Christie <mchri...@redhat.com> writes:

> On 05/03/2016 03:44 PM, Jeff Moyer wrote:
>> Hi, Mike,
>> 
>> That git tree doesn't seem to exist.  I did manage to apply your patch
>> set on top of next-20160415, though.
>> 
>> So... what testing did you do? ;-) I ran into the following problems
>
> I normally run xfstests and run it on my daily workstation and laptop. I
> did not do this for every FS this time and hit a regression.
>
> What FS were you using?

I'm using xfs, scsi disk, no blk-mq, no dm.

>> - git clone fails
>> - yum segfaults
>
> In v7/v6, I missed a new submit_bio call, so I hit issues like the two
> above. I have this fixed in the next version.

OK, does this mean you're posting another version, or you already did
and I somehow missed it?

>> - many blktrace/blkparse issues, including incorrect cpu recorded in
>>   traces, null task names, and blkparse outputting nothing for a trace
>>   file several gigabytes in size.
>
> I will double check for these issues.

Thanks.  I'll stop digging for now, then.

Cheers,
Jeff


Re: [PATCH 00/42] v7: separate operations from flags in the bio/request structs

2016-05-03 Thread Jeff Moyer
mchri...@redhat.com writes:

> The following patches begin to cleanup the request->cmd_flags and
> bio->bi_rw mess. We currently use cmd_flags to specify the operation,
> attributes and state of the request. For bi_rw we use it for similar
> info and also the priority but then also have another bi_flags field
> for state. At some point, we abused them so much we just made cmd_flags
> 64 bits, so we could add more.
>
> The following patches seperate the operation (read, write discard,
> flush, etc) from cmd_flags/bi_rw.
>
> This patchset was made against linux-next from today April 15
> (git tag next-20160415).
>
> I put a git tree here:
> https://github.com/mikechristie/linux-kernel.git
> The patches are in the op branch.

Hi, Mike,

That git tree doesn't seem to exist.  I did manage to apply your patch
set on top of next-20160415, though.

So... what testing did you do? ;-) I ran into the following problems:
- git clone fails
- yum segfaults
- many blktrace/blkparse issues, including incorrect cpu recorded in
  traces, null task names, and blkparse outputting nothing for a trace
  file several gigabytes in size.

After that, I decided to back out your patches and test the base
linux-next kernel.  That kernel has none of those issues.

So, either I'm missing some dependencies, or I think we've got some
issues to iron out before this thing goes in.  Before I dig any further,
am I missing something?

Cheers,
Jeff


Re: [PATCH 03/12] BUG: Losing bits on request.cmd_flags

2016-04-04 Thread Jeff Moyer
Shaun Tancheff  writes:

> In a few places a temporary value smaller than a cmd_flags
> is used to test for bits and or build up a new cmd_flags.
>
> Change to use explicit u64 values where appropriate.

This is not a bug fix, so please fix your subject.

I'm not against cleaning up the mixing of 32 vs 64 bit variables, but if
you're going to go down that path, then you might as well fix the mixing
of signed vs unsigned use as well.  And now that I look at it,
bio->bi_rw is unsigned long whereas req->cmd_flags is u64.  That could
make for fun bugs in the future.  We should at least add a comment in
the rq_flag_bits enum to the effect of bio flags need to stay in the
bottom 32 bits.

Cheers,
Jeff


Re: [PATCH 2/3] block: Add badblock management for gendisks

2015-11-25 Thread Jeff Moyer
"Verma, Vishal L" <vishal.l.ve...@intel.com> writes:

> On Tue, 2015-11-24 at 14:14 -0500, Jeff Moyer wrote:
>> 
>> I'm not sure whether it makes sense to continue without badblock
>> management for the RAID code.  I was hoping Neil would comment on
>> that.
>> 
>> -Jeff
>
> Not sure I follow? I believe I've kept all the badblocks functionality
> RAID already had..

What I mean to say is that the RAID code had previously embedded the
badblocks structure in one of its other data structures.  As a result,
you would never get an allocation failure for it.

> On a related note, something I observed when testing with md:
>
> md's badblock list is per-device, and can be seen at:
>   /sys/block/md0/md/dev-pmem0/bad_blocks
>
> Now if we have badblocks in the gendisks too, there is also:
>   /sys/block/pmem0/bad_blocks
>
> The two are separate 'accounts' maintained by separate drivers (md for
> the former, and pmem for the latter). This can potentially be
> confusing..
>
> Should we consolidate the two, i.e. make md (re)use the gendisk
> badblocks for its purposes too?

I agree with what Dan said.

Cheers,
Jeff


Re: [PATCH 2/3] block: Add badblock management for gendisks

2015-11-24 Thread Jeff Moyer
"Verma, Vishal L" <vishal.l.ve...@intel.com> writes:

> On Tue, 2015-11-24 at 10:34 -0500, Jeff Moyer wrote:
>> Vishal Verma <vishal.l.ve...@intel.com> writes:
>> 
>> > NVDIMM devices, which can behave more like DRAM rather than block
>> > devices, may develop bad cache lines, or 'poison'. A block device
>> > exposed by the pmem driver can then consume poison via a read (or
>> > write), and cause a machine check. On platforms without machine
>> > check recovery features, this would mean a crash.
>> > 
>> > The block device maintaining a runtime list of all known sectors
>> > that
>> > have poison can directly avoid this, and also provide a path forward
>> > to enable proper handling/recovery for DAX faults on such a device.
>> > 
>> > Use the new badblock management interfaces to add a badblocks list
>> > to
>> > gendisks.
>> 
>> Because disk_alloc_badblocks can fail, you need to check for a NULL
>> disk->bb in all of the utility functions you've defined.
>> 
>
> Thanks, Jeff - I'll fix this. I have a handful of other fixes queued up
> too, will send out a v2 soon.

I'm not sure whether it makes sense to continue without badblock
management for the RAID code.  I was hoping Neil would comment on that.

-Jeff


Re: [PATCH 2/3] block: Add badblock management for gendisks

2015-11-24 Thread Jeff Moyer
Vishal Verma  writes:

> NVDIMM devices, which can behave more like DRAM rather than block
> devices, may develop bad cache lines, or 'poison'. A block device
> exposed by the pmem driver can then consume poison via a read (or
> write), and cause a machine check. On platforms without machine
> check recovery features, this would mean a crash.
>
> The block device maintaining a runtime list of all known sectors that
> have poison can directly avoid this, and also provide a path forward
> to enable proper handling/recovery for DAX faults on such a device.
>
> Use the new badblock management interfaces to add a badblocks list to
> gendisks.

Because disk_alloc_badblocks can fail, you need to check for a NULL
disk->bb in all of the utility functions you've defined.

Cheers,
Jeff

>
> Signed-off-by: Vishal Verma 
> ---
>  block/genhd.c | 64 +++
>  include/linux/genhd.h |  6 +
>  2 files changed, 70 insertions(+)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 0c706f3..4209c32 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "blk.h"
>  
> @@ -505,6 +506,20 @@ static int exact_lock(dev_t devt, void *data)
>   return 0;
>  }
>  
> +static void disk_alloc_badblocks(struct gendisk *disk)
> +{
> + disk->bb = kzalloc(sizeof(disk->bb), GFP_KERNEL);
> + if (!disk->bb) {
> + pr_warn("%s: failed to allocate space for badblocks\n",
> + disk->disk_name);
> + return;
> + }
> +
> + if (badblocks_init(disk->bb, 1))
> + pr_warn("%s: failed to initialize badblocks\n",
> + disk->disk_name);
> +}
> +
>  static void register_disk(struct gendisk *disk)
>  {
>   struct device *ddev = disk_to_dev(disk);
> @@ -609,6 +624,7 @@ void add_disk(struct gendisk *disk)
>   disk->first_minor = MINOR(devt);
>  
>   disk_alloc_events(disk);
> + disk_alloc_badblocks(disk);
>  
>   /* Register BDI before referencing it from bdev */
>   bdi = &disk->queue->backing_dev_info;
> @@ -657,6 +673,9 @@ void del_gendisk(struct gendisk *disk)
>   blk_unregister_queue(disk);
>   blk_unregister_region(disk_devt(disk), disk->minors);
>  
> + badblocks_free(disk->bb);
> + kfree(disk->bb);
> +
>   part_stat_set_all(&disk->part0, 0);
>   disk->part0.stamp = 0;
>  
> @@ -670,6 +689,48 @@ void del_gendisk(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL(del_gendisk);
>  
> +/*
> + * The gendisk usage of badblocks does not track acknowledgements for
> + * badblocks. We always assume they are acknowledged.
> + */
> +int disk_check_badblocks(struct gendisk *disk, sector_t s, int sectors,
> +sector_t *first_bad, int *bad_sectors)
> +{
> + return badblocks_check(disk->bb, s, sectors, first_bad, bad_sectors);
> +}
> +EXPORT_SYMBOL(disk_check_badblocks);
> +
> +int disk_set_badblocks(struct gendisk *disk, sector_t s, int sectors)
> +{
> + return badblocks_set(disk->bb, s, sectors, 1);
> +}
> +EXPORT_SYMBOL(disk_set_badblocks);
> +
> +int disk_clear_badblocks(struct gendisk *disk, sector_t s, int sectors)
> +{
> + return badblocks_clear(disk->bb, s, sectors);
> +}
> +EXPORT_SYMBOL(disk_clear_badblocks);
> +
> +/* sysfs access to bad-blocks list. */
> +static ssize_t disk_badblocks_show(struct device *dev,
> + struct device_attribute *attr,
> + char *page)
> +{
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + return badblocks_show(disk->bb, page, 0);
> +}
> +
> +static ssize_t disk_badblocks_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *page, size_t len)
> +{
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + return badblocks_store(disk->bb, page, len, 0);
> +}
> +
>  /**
>   * get_gendisk - get partitioning information for a given device
>   * @devt: device to get partitioning information for
> @@ -988,6 +1049,8 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
>  static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
>  static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
>  static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
> +static DEVICE_ATTR(badblocks, S_IRUGO | S_IWUSR, disk_badblocks_show,
> + disk_badblocks_store);
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
>  static struct device_attribute dev_attr_fail =
>   __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
> @@ -1009,6 +1072,7 @@ static struct attribute *disk_attrs[] = {
>   &dev_attr_capability.attr,
>   &dev_attr_stat.attr,
>   &dev_attr_inflight.attr,
> + &dev_attr_badblocks.attr,
>  #ifdef CONFIG_FAIL_MAKE_REQUEST
>   &dev_attr_fail.attr,
>  #endif
> diff --git 

Re: [PATCH] scsi: fix regression that accidentally disabled block-based tcq

2014-09-15 Thread Jeff Moyer
Christoph Hellwig h...@infradead.org writes:

 Please try the fix below, looks like the commit broke TCQ for all drivers 
 using block-level tagging.

 ---
 From 865a19b760d2786fe37d3b5c151a4ecea4c0e95e Mon Sep 17 00:00:00 2001
 From: Christoph Hellwig h...@lst.de
 Date: Fri, 12 Sep 2014 16:00:19 -0700
 Subject: scsi: fix regression that accidentally disabled block-based tcq

 The scsi blk-mq support accidentally flipped a conditional, which lead to
 never enabling block based tcq when using the legacy request path.

 Fixes: d285203cf647d7c9 ("scsi: add support for a blk-mq based I/O path.")
 Reported-by: Hans de Goede hdego...@redhat.com
 Signed-off-by: Christoph Hellwig h...@lst.de
 ---
  include/scsi/scsi_tcq.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
 index cdcc90b..e645835 100644
 --- a/include/scsi/scsi_tcq.h
 +++ b/include/scsi/scsi_tcq.h
 @@ -68,7 +68,7 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
  	return;
  
  	if (!shost_use_blk_mq(sdev->host) &&
 -	    blk_queue_tagged(sdev->request_queue))
 +	    !blk_queue_tagged(sdev->request_queue))
  		blk_queue_init_tags(sdev->request_queue, depth,
  				    sdev->host->bqt);

Introduced by d285203:

-   if (!blk_queue_tagged(sdev->request_queue))
+   if (!shost_use_blk_mq(sdev->host) &&
+       blk_queue_tagged(sdev->request_queue))

Seems like an obvious fix.

Reviewed-by: Jeff Moyer jmo...@redhat.com


Re: [PATCH] block: bsg.c: Cleaning up missing null-terminate in conjunction with strncpy

2014-07-28 Thread Jeff Moyer
Rickard Strandqvist rickard_strandqv...@spectrumdigital.se writes:

 Replacing strncpy with strlcpy to avoid strings that lacks null terminate.

 Signed-off-by: Rickard Strandqvist rickard_strandqv...@spectrumdigital.se
 ---
  block/bsg.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/block/bsg.c b/block/bsg.c
 index ff46add..b2688c5 100644
 --- a/block/bsg.c
 +++ b/block/bsg.c
 @@ -790,7 +790,7 @@ static struct bsg_device *bsg_add_device(struct inode *inode,
  	mutex_lock(&bsg_mutex);
  	hlist_add_head(&bd->dev_list, bsg_dev_idx_hash(iminor(inode)));
  
 -	strncpy(bd->name, dev_name(rq->bsg_dev.class_dev), sizeof(bd->name) - 1);
 +	strlcpy(bd->name, dev_name(rq->bsg_dev.class_dev), sizeof(bd->name));
  	dprintk("bound to %s, max queue %d\n",
  		format_dev_t(buf, inode->i_rdev), bd->max_queue);

NACK

The bsg data structure is allocated using kzalloc, so that last byte
will be zero.


Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Benjamin LaHaise b...@kvack.org writes:

 
 [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

 Something is horribly wrong here.  There is no way that value for nr_events 
 should be passed in to ioctx_alloc().  This implies that userland is calling 
 io_setup() with an impossibly large value for nr_events.  Can you post the 
 actual diff for your fs/aio.c relative to linus' tree?


fio does exactly this!  it passes INT_MAX.

Cheers,
Jeff


Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Elliott, Robert (Server Storage) elli...@hp.com writes:

 -Original Message-
 From: Christoph Hellwig [mailto:h...@infradead.org]
 Sent: Thursday, 10 July, 2014 11:15 AM
 To: Elliott, Robert (Server Storage)
 Cc: Jens Axboe; dgilb...@interlog.com; James Bottomley; Bart Van Assche;
 Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: scsi-mq V2
 
 On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
  It's starting to look weird.  I'll prepare another two bisect branches
  around some MM changes, which seems the only other possible candidate.
 
 I've pushed out scsi-mq.3-bisect-3 

 Good.

 and scsi-mq.3-bisect-4 for you.

 Bad.

 Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
 diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
 index df95a2f..11b65d4 100644
 --- a/arch/x86/vdso/vdso2c.h
 +++ b/arch/x86/vdso/vdso2c.h
 @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
  	uint64_t flags = GET_LE(&in->sh_flags);
 
  	bool copy = flags & SHF_ALLOC &&
 +		(GET_LE(&in->sh_size) ||
 +		 (GET_LE(&in->sh_type) != SHT_RELA &&
 +		  GET_LE(&in->sh_type) != SHT_REL)) &&
  		strcmp(name, ".altinstructions") &&
  		strcmp(name, ".altinstr_replacement");

 Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
 an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

Hi, Rob,

Can you get sysrq-t output for me?  I don't know how/why we'd continue
to get io_submits for an exiting process.

Thanks,
Jeff


Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Jeff Moyer jmo...@redhat.com writes:

 Hi, Rob,

 Can you get sysrq-t output for me?  I don't know how/why we'd continue
 to get io_submits for an exiting process.

Also, do you know what sys_io_submit is returning?


Re: scsi-mq V2

2014-07-10 Thread Jeff Moyer
Jens Axboe ax...@kernel.dk writes:

 On 2014-07-10 17:11, Jeff Moyer wrote:
 Benjamin LaHaise b...@kvack.org writes:


 [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
 [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

 Something is horribly wrong here.  There is no way that value for nr_events
 should be passed in to ioctx_alloc().  This implies that userland is calling
 io_setup() with an impossibly large value for nr_events.  Can you post the
 actual diff for your fs/aio.c relative to linus' tree?


 fio does exactly this!  it passes INT_MAX.

 That's correct, I had actually forgotten about this. It was a change
 made a few years back, in correlation with the aio optimizations
 posted then, basically telling aio to ignore that silly (and broken)
 user ring.

I still don't see how you accomplish that.  Making it bigger doesn't get
rid of it.  ;-)

Cheers,
Jeff


Re: [PATCH 2/6] io: define an interface for IO extensions

2014-04-02 Thread Jeff Moyer
Darrick J. Wong darrick.w...@oracle.com writes:

 Define a generic interface to allow userspace to attach metadata to an
 IO operation.  This interface will be used initially to implement
 protection information (PI) pass through, though it ought to be usable
 by anyone else desiring to extend the IO interface.  It should not be
 difficult to modify the non-AIO calls to use this mechanism.

My main issue with this patch is determining what exactly gets returned
to userspace when there is an issue in the teardown_extensions path.
It looks like you'll get the first error propagated from
io_teardown_extensions, others are ignored.  Then, in aio_complete, if
there was no error with the I/O, then you'll get the teardown error
reported in event->res, otherwise you'll get it in event->res2.  So,
what are the valid errors returned by the teardown routine for
extensions?  How is the userspace app supposed to determine where the
error came from, the I/O or a failure in the extension teardown?

I think it may make sense to only use res2 for reporting io extension
teardown failures.  Any new code that will use extensions can certainly
be written to check both res and res2, and this method would prevent the
ambiguity I mentioned.

Finally, I know this is an RFC, but please add some man-page changes to
your patch set, and CC linux-man.  Michael Kerrisk typically has
valuable advice on new APIs.

Cheers,
Jeff


 Signed-off-by: Darrick J. Wong darrick.w...@oracle.com
 ---
  fs/aio.c |  180 
 +-
  include/linux/aio.h  |7 ++
  include/uapi/linux/aio_abi.h |   15 +++-
  3 files changed, 197 insertions(+), 5 deletions(-)


 diff --git a/fs/aio.c b/fs/aio.c
 index 062a5f6..0c40bdc 100644
 --- a/fs/aio.c
 +++ b/fs/aio.c
 @@ -158,6 +158,11 @@ static struct vfsmount *aio_mnt;
  static const struct file_operations aio_ring_fops;
  static const struct address_space_operations aio_ctx_aops;
  
 +static int io_teardown_extensions(struct kiocb *req);
 +static int io_setup_extensions(struct kiocb *req, int is_write,
 +struct io_extension __user *ioext);
 +static int iocb_setup_extensions(struct iocb *iocb, struct kiocb *req);
 +
  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
  {
  struct qstr this = QSTR_INIT("[aio]", 5);
 @@ -916,6 +921,17 @@ void aio_complete(struct kiocb *iocb, long res, long 
 res2)
   struct io_event *ev_page, *event;
   unsigned long   flags;
   unsigned tail, pos;
 + int ret;
 +
 + ret = io_teardown_extensions(iocb);
 + if (ret) {
 + if (!res)
 + res = ret;
 + else if (!res2)
 + res2 = ret;
 + else
 + pr_err("error %d tearing down aio extensions\n", ret);
 + }
  
   /*
* Special case handling for sync iocbs:
 @@ -1350,15 +1366,167 @@ rw_common:
   return 0;
  }
  
 +/* IO extension code */
 +#define REQUIRED_STRUCTURE_SIZE(type, member)\
 + (offsetof(type, member) + sizeof(((type *)NULL)->member))
 +#define IO_EXT_SIZE(member) \
 + REQUIRED_STRUCTURE_SIZE(struct io_extension, member)
 +
 +struct io_extension_type {
 + unsigned int type;
 + unsigned int extension_struct_size;
 + int (*setup_fn)(struct kiocb *, int is_write);
 + int (*destroy_fn)(struct kiocb *);
 +};
 +
 +static struct io_extension_type extensions[] = {
 + {IO_EXT_INVALID, 0, NULL, NULL},
 +};
 +
 +static int is_write_iocb(struct iocb *iocb)
 +{
 + switch (iocb->aio_lio_opcode) {
 + case IOCB_CMD_PWRITE:
 + case IOCB_CMD_PWRITEV:
 + return 1;
 + default:
 + return 0;
 + }
 +}
 +
 +static int io_teardown_extensions(struct kiocb *req)
 +{
 + struct io_extension_type *iet;
 + int ret, ret2;
 +
 + if (req->ki_ioext == NULL)
 + return 0;
 +
 + /* Shut down all the extensions */
 + ret = 0;
 + for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
 + if (!(req->ki_ioext->ke_kern.ie_has & iet->type))
 + continue;
 + ret2 = iet->destroy_fn(req);
 + if (ret2 && !ret)
 + ret = ret2;
 + }
 +
 + /* Copy out return values */
 + if (unlikely(copy_to_user(req->ki_ioext->ke_user,
 +   &req->ki_ioext->ke_kern,
 +   sizeof(struct io_extension)))) {
 + if (!ret)
 + ret = -EFAULT;
 + }
 +
 + kfree(req->ki_ioext);
 + req->ki_ioext = NULL;
 + return ret;
 +}
 +
 +static int io_check_bufsize(__u64 has, __u64 size)
 +{
 + struct io_extension_type *iet;
 + __u64 all_flags = 0;
 +
 + for (iet = extensions; iet->type != IO_EXT_INVALID; iet++) {
 + all_flags |= iet->type;
 + if (!(has & iet->type))
 + continue;
 + if (iet->extension_struct_size > size)
 + 

Re: [RFC PATCH 0/5] userspace PI passthrough via AIO/DIO

2014-03-21 Thread Jeff Moyer
Darrick J. Wong darrick.w...@oracle.com writes:

 This RFC provides a rough implementation of a mechanism to allow
 userspace to attach protection information (e.g. T10 DIF) data to a
 disk write and to receive the information alongside a disk read.  The
 interface is an extension to the AIO interface: two new commands
 (IOCB_CMD_P{READ,WRITE}VM) are provided.  The last struct iovec in the

Sorry for the shallow question, but what does that M stand for?

Cheers,
Jeff


Re: [LSF/MM TOPIC] atomic block device

2014-03-20 Thread Jeff Moyer
James Bottomley james.bottom...@hansenpartnership.com writes:

 OK, so what the Database people are currently fretting about is how the
 Linux cache fights with the WAL.  Pretty much all DBs sit on filesystems
 these days, so the first question is are block operations even relevant

Yes, they are relevant so long as there are users.  Not all databases
run on file systems.  More to the point, I think the spec includes them
for completeness.

 and do the (rather scant) file operations fit their need.  The basic
 problem with the file mode is the granularity.  What does a DB do with
 transactions which go over the limits?

s/transactions/atomic multiwrites?  There are a couple of options,
there.  You could either emulate the multiwrite transparently to the
application, or you could fail it.  I think we'd have to support both,
since some applications will not want the less performant fallback.

 It also looks like the NVM file actions need to work over DIO, so the
 question is how.

DIO just means avoid using the page cache (or, I guess you could make it
more generic by saying avoid double buffering).  If the hardware
supports atomic writes, then this is easy.  If it doesn't, then we can
still do the emulation.  DIO already has fallback to buffered, so it's
not like we always honor that flag.

 (And the other problem is that only a few DBs seem to use DIO).

This is the first time I have ever heard anyone state that not using DIO
was a problem.  :-)  Also, I'm not sure what problem you are thinking
of.  Perhaps you are referring to the interactions of the WAL and the
page cache (that you mentioned in the first paragraph).

Cheers,
Jeff


Re: [patch] block: fix race between request completion and timeout handling

2013-09-10 Thread Jeff Moyer
Ewan Milne emi...@redhat.com writes:

 If the request completes after blk_mark_rq_complete() is called but
 before blk_clear_req_complete() is called, the completion will not be
 processed, and we will have to wait for the request to timeout again.
 Maybe this is not so bad, as it should be extremely rare, but if the
 timeout were a large value for some reason, that could be a problem.

 It seems to me that the issue is really that there are 2 state variables
 (the REQ_ATOM_COMPLETE flag and the req->timeout_list) for the same
 state, and so manipulating both of these without a lock will always have
 a window.

Agreed.  Do you see a clean way of fixing that?

 Clearly it would be better to avoid a panic, so Jeff's fix would help.

 I'm not sure I follow how the issue Jeff is fixing caused this
 particular crash, though.  How did the request get back on the queue?
 The crash occurred when the SCSI EH was flushing the done_q requests.

Right, sorry.  I wrote that after having been away from the problem for
too long.  I left out an important part:

This would only actually explain the coredumps *IF* the request
structure was freed, reallocated *and* queued before the error handler
thread had a chance to process it.  That is possible, but it may make
sense to keep digging for another race.  I think that if this is what
was happening, we would see other instances of this problem showing up
as null pointer or garbage pointer dereferences, for example when the
request structure was not re-used.  It looks like we actually do run
into that situation in other reports.

I don't think this is a smoking gun, but I think the patch should go in
so we can further narrow down the search.

Thanks for looking at it, Ewan.

Cheers,
Jeff


[patch] block: fix race between request completion and timeout handling

2013-08-29 Thread Jeff Moyer
Hi,

We have several reports (against a distro kernel) of panics in
blk_requeue_request that look like this:

kernel BUG at block/blk-core.c:1045!
invalid opcode:  [#1] SMP
last sysfs file: 
/sys/devices/pci:40/:40:03.0/:55:00.0/infiniband_mad/umad0/port
CPU 0
Modules linked in: ipmi_si ipmi_devintf ipmi_msghandler bonding rdma_ucm(U) 
rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) 
ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) 
ib_mad(U) ib_core(U) cdc_ether usbnet mii microcode i2c_i801 i2c_core iTCO_wdt 
iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache 
jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log 
dm_mod [last unloaded: scsi_wait_scan]

Pid: 491, comm: scsi_eh_0 Tainted: GW     
2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
RIP: 0010:[8124e424]  [8124e424] 
blk_requeue_request+0x94/0xa0
RSP: 0018:881057eefd60  EFLAGS: 00010012
RAX: 881d99e3e8a8 RBX: 881d99e3e780 RCX: 881d99e3e8a8
RDX: 881d99e3e8a8 RSI: 881d99e3e780 RDI: 881d99e3e780
RBP: 881057eefd80 R08: 881057eefe90 R09: 
R10:  R11:  R12: 881057f92338
R13:  R14: 881057f92338 R15: 883058188000
FS:  () GS:88004020() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 006d3ec0 CR3: 00302cd7d000 CR4: 000406b0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process scsi_eh_0 (pid: 491, threadinfo 881057eee000, task 881057e29540)
Stack:
 1057 0286 8810275efdc0 881057f16000
0 881057eefdd0 81362323 881057eefe20 8135f393
0 881057e29af8 8810275efdc0 881057eefe78 881057eefe90
Call Trace:
 [81362323] __scsi_queue_insert+0xa3/0x150
 [8135f393] ? scsi_eh_ready_devs+0x5e3/0x850
 [81362a23] scsi_queue_insert+0x13/0x20
 [8135e4d4] scsi_eh_flush_done_q+0x104/0x160
 [8135fb6b] scsi_error_handler+0x35b/0x660
 [8135f810] ? scsi_error_handler+0x0/0x660
 [810908c6] kthread+0x96/0xa0
 [8100c14a] child_rip+0xa/0x20
 [81090830] ? kthread+0x0/0xa0
 [8100c140] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 
48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0f 0b eb fe 0f 1f 84 
00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP  [8124e424] blk_requeue_request+0x94/0xa0
 RSP 881057eefd60

The RIP is this line:
BUG_ON(blk_queued_rq(rq));

After digging through the code, I think there may be a race between the
request completion and the timer handler running.

A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer).  If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:

static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}

and then call blk_rq_timed_out.  The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.

Now, if the request happens to complete while this is going on, what
happens?  Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared.  So, from the above
paragraph, after the call to blk_clear_rq_complete.  If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores).  Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case).  Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless.  The request will be removed from the timeout list,
req_atom_complete will be set, and all will be well.

This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request).  It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race.  I've boot tested this patch, but nothing more.

Signed-off-by: Jeff Moyer jmo...@redhat.com
Acked-by: Hannes Reinecke h...@suse.de

diff --git a/block/blk-core.c b/block/blk-core.c
index 93a18d1..236ae0a 100644
--- a/block/blk-core.c

Re: [patch|rfc] block: fix race between request completion and timeout handling

2013-08-28 Thread Jeff Moyer
Hannes Reinecke h...@suse.de writes:

 So, looked into things a bit more.
 It looks as if you're on the right track, although I doubt your
 patch will fix the issue for me :-(
 
 Thing is, you're right there is a race window between requeuing
 and softirq triggering, which could well be fixed by your patch.
 So for that reason alone I would like to take it.
 
 However, including your patch will end up opening another can of
 worms: the softirq might now be triggering _while the request is
 queued on the request queue_.
 blk_requeue_request will be putting the request back on the request
 queue, where it'll be stuck until being pulled off from
 scsi_request_fn().
 So if the softirq triggers during that condition we'll end up
 calling the BUG_ON(!list_empty(&req->queuelist)) in
 __blk_put_request().
 
 Guess we'd need to fix that one, too ...
 
 Ah. Now I see it.

 We're requeuing from the softirq context, ie after the completion
 has triggered. So the above scenario can't actually happen and the
 patch is valid.

Do you still think it won't solve the issue you're seeing?  What issue
is that, btw?


 So:

 Acked-by: Hannes Reinecke h...@suse.de

Thanks, I guess I'll have to send a properly signed-off patch, now.  ;-)

Cheers,
Jeff


[patch|rfc] block: fix race between request completion and timeout handling

2013-08-27 Thread Jeff Moyer
Hi,

We have several reports (against a distro kernel) of panics in
blk_requeue_request that look like this:

kernel BUG at block/blk-core.c:1045!
invalid opcode:  [#1] SMP
last sysfs file: 
/sys/devices/pci:40/:40:03.0/:55:00.0/infiniband_mad/umad0/port
CPU 0
Modules linked in: ipmi_si ipmi_devintf ipmi_msghandler bonding rdma_ucm(U) 
rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) 
ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) 
ib_mad(U) ib_core(U) cdc_ether usbnet mii microcode i2c_i801 i2c_core iTCO_wdt 
iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache 
jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log 
dm_mod [last unloaded: scsi_wait_scan]

Pid: 491, comm: scsi_eh_0 Tainted: GW     
2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
RIP: 0010:[8124e424]  [8124e424] 
blk_requeue_request+0x94/0xa0
RSP: 0018:881057eefd60  EFLAGS: 00010012
RAX: 881d99e3e8a8 RBX: 881d99e3e780 RCX: 881d99e3e8a8
RDX: 881d99e3e8a8 RSI: 881d99e3e780 RDI: 881d99e3e780
RBP: 881057eefd80 R08: 881057eefe90 R09: 
R10:  R11:  R12: 881057f92338
R13:  R14: 881057f92338 R15: 883058188000
FS:  () GS:88004020() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 006d3ec0 CR3: 00302cd7d000 CR4: 000406b0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process scsi_eh_0 (pid: 491, threadinfo 881057eee000, task 881057e29540)
Stack:
 1057 0286 8810275efdc0 881057f16000
0 881057eefdd0 81362323 881057eefe20 8135f393
0 881057e29af8 8810275efdc0 881057eefe78 881057eefe90
Call Trace:
 [81362323] __scsi_queue_insert+0xa3/0x150
 [8135f393] ? scsi_eh_ready_devs+0x5e3/0x850
 [81362a23] scsi_queue_insert+0x13/0x20
 [8135e4d4] scsi_eh_flush_done_q+0x104/0x160
 [8135fb6b] scsi_error_handler+0x35b/0x660
 [8135f810] ? scsi_error_handler+0x0/0x660
 [810908c6] kthread+0x96/0xa0
 [8100c14a] child_rip+0xa/0x20
 [81090830] ? kthread+0x0/0xa0
 [8100c140] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 
48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0f 0b eb fe 0f 1f 84 
00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP  [8124e424] blk_requeue_request+0x94/0xa0
 RSP 881057eefd60

The RIP is this line:
BUG_ON(blk_queued_rq(rq));

After digging through the code, I think there may be a race between the
request completion and the timer handler running.

A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer).  If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:

static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}

and then call blk_rq_timed_out.  The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.

Now, if the request happens to complete while this is going on, what
happens?  Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared.  So, from the above
paragraph, after the call to blk_clear_rq_complete.  If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores).  Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case).  Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless.  The request will be removed from the timeout list,
req_atom_complete will be set, and all will be well.

This RFC patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request).  It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race.  I've boot tested this patch, but nothing more.

Jens, James, others, what do you think?

Cheers,
Jeff

diff --git a/block/blk-core.c b/block/blk-core.c
index 93a18d1..236ae0a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ 

Re: [RFC/PATCH 4/4] block: Add URGENT request notification support to CFQ scheduler

2013-07-11 Thread Jeff Moyer
Tanya Brokhman tlin...@codeaurora.org writes:

 When the scheduler reports to the block layer that there is an urgent
 request pending, the device driver may decide to stop the transmission
 of the current request in order to handle the urgent one. This is done
 in order to reduce the latency of an urgent request. For example:
 long WRITE may be stopped to handle an urgent READ.

In general, I don't like the approach taken.  I would much rather see
a low-level cancellation method, and layer your urgent request handling
on top of that.  That could actually be used by the aio subsystem as
well (with a lot of work, of course).  That aside, I've provided some
comments below.


 Signed-off-by: Tatyana Brokhman tlin...@codeaurora.org

 diff --git a/block/blk-core.c b/block/blk-core.c
 index 3ab3a62..705f4f9 100644
 --- a/block/blk-core.c
 +++ b/block/blk-core.c
 @@ -2090,7 +2090,10 @@ static void blk_account_io_completion(struct request 
 *req, unsigned int bytes)
  
   cpu = part_stat_lock();
  part = req->part;
 - part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 + if (part == NULL)
 + pr_err("%s: YANIV - BUG START", __func__);
 + else
 + part_stat_add(cpu, part, sectors[rw], bytes >> 9);

This looks like leftover debugging.

   part_stat_unlock();
   }
  }
 @@ -2111,12 +2114,13 @@ static void blk_account_io_done(struct request *req)
   cpu = part_stat_lock();
  part = req->part;
  
 - part_stat_inc(cpu, part, ios[rw]);
 - part_stat_add(cpu, part, ticks[rw], duration);
 - part_round_stats(cpu, part);
 - part_dec_in_flight(part, rw);
 -
 - hd_struct_put(part);
 + if (req->part != NULL) {
 + part_stat_inc(cpu, part, ios[rw]);
 + part_stat_add(cpu, part, ticks[rw], duration);
 + part_round_stats(cpu, part);
 + part_dec_in_flight(part, rw);
 + hd_struct_put(part);
 + }

A comment about why we now expect that req->part might be NULL would be nice.

   part_stat_unlock();
   }
  }
 diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
 index d5bbdcf..f936cb9 100644
 --- a/block/cfq-iosched.c
 +++ b/block/cfq-iosched.c
 @@ -320,6 +320,9 @@ struct cfq_data {
   unsigned long workload_expires;
   struct cfq_group *serving_group;
  
 + unsigned int nr_urgent_pending;
 + unsigned int nr_urgent_in_flight;
 +
   /*
* Each priority tree is sorted by next_request position.  These
* trees are used when determining if two or more queues are
 @@ -2783,6 +2786,14 @@ static void cfq_dispatch_insert(struct request_queue 
 *q, struct request *rq)
  (RQ_CFQG(rq))->dispatched++;
  elv_dispatch_sort(q, rq);
 
 + if (rq->cmd_flags & REQ_URGENT) {
 + if (!cfqd->nr_urgent_pending)
 + WARN_ON(1);
 + else
 + cfqd->nr_urgent_pending--;
 + cfqd->nr_urgent_in_flight++;
 + }
 +

This is a rather ugly construct, and gets repeated later.  I'd be
inclined to just BUG.

  cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
  cfqq->nr_sectors += blk_rq_sectors(rq);
  cfqg_stats_update_dispatch(cfqq->cfqg, blk_rq_bytes(rq), rq->cmd_flags);
 @@ -3909,6 +3920,68 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct 
 cfq_queue *cfqq,
   }
  }
  
 +/*
 + * Called when a request (rq) is reinserted (to cfqq). Check if there's
 + * something we should do about it
 + */
 +static void
 +cfq_rq_requeued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 + struct request *rq)
 +{
 + struct cfq_io_cq *cic = RQ_CIC(rq);
 +
 + cfqd->rq_queued++;
 + if (rq->cmd_flags & REQ_PRIO)
 + cfqq->prio_pending++;
 +
 + cfqq->dispatched--;
 + (RQ_CFQG(rq))->dispatched--;
 +
 + cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
 +
 + cfq_update_io_thinktime(cfqd, cfqq, cic);
 + cfq_update_io_seektime(cfqd, cfqq, rq);
 + cfq_update_idle_window(cfqd, cfqq, cic);
 +
 + cfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
 +
 + if (cfqq == cfqd->active_queue) {
 + if (cfq_cfqq_wait_request(cfqq)) {
 + if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 + cfqd->busy_queues > 1) {
 + cfq_del_timer(cfqd, cfqq);
 + cfq_clear_cfqq_wait_request(cfqq);
 + } else {
 + cfqg_stats_update_idle_time(cfqq->cfqg);
 + cfq_mark_cfqq_must_dispatch(cfqq);
 + }
 + }
 + } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
 + cfq_preempt_queue(cfqd, cfqq);
 + }
 +}

Huge cut-n-paste of cfq_rq_enqueued.  Please factor the code out.

 +static int cfq_reinsert_request(struct 

[patch] scsi_dh_hp_sw.c: return DEV_OFFLINED when blk_get_request fails

2013-03-28 Thread Jeff Moyer
Hi,

If blk_get_request fails here, it means that the queue is dead.  It seems
better to return a DEV_OFFLINED error code than the misleading
TEMP_UNAVAIL.  Comments?

Signed-off-by: Jeff Moyer jmo...@redhat.com

diff --git a/drivers/scsi/device_handler/scsi_dh_hp_sw.c 
b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
index 084062b..eec24d3 100644
--- a/drivers/scsi/device_handler/scsi_dh_hp_sw.c
+++ b/drivers/scsi/device_handler/scsi_dh_hp_sw.c
@@ -118,7 +118,7 @@ static int hp_sw_tur(struct scsi_device *sdev, struct 
hp_sw_dh_data *h)
 retry:
req = blk_get_request(sdev->request_queue, WRITE, GFP_NOIO);
if (!req)
-   return SCSI_DH_RES_TEMP_UNAVAIL;
+   return SCSI_DH_DEV_OFFLINED;
 
req->cmd_type = REQ_TYPE_BLOCK_PC;
req->cmd_flags |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT |


Re: [LSF/MM TOPIC][ATTEND] protection information and userspace

2013-02-07 Thread Jeff Moyer
Boaz Harrosh bharr...@panasas.com writes:

 I also pondered simply adding a new io_prep_* function + IO_CMD_ code to 
 libaio
 and all the other plumbing necessary to make that happen...

 void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
int iovcnt, long long offset, const void *pi,
size_t pi_count);

 This is also what I've envisioned.
 Updating io_prep / async I/O is reasonably easy as its been using a 
 separate structure for passing in the I/O details.
 
 Normal read/write calls don't really map as you simply don't have 
 enough parameter to feed PI information into the kernel.
 So for that you'd need to invent a new interface / syscall.
 
 For aio we just need to add additional fields to an existing structure.
 
 So yeah, I'd be interested in that discussion as well.

Sure, it's easy to start there, but then you eventually end up having to
add a non-aio interface as well.  Let's not take the latter off the
table.

 Me too, in multiple fronts. It's part of my general concern about
things we would like for user-mode servers

 I think that the current aio and libaio Interface is broken for a long
 time, for multitude of reasons. For instance the nested structure definitions
 are COMPAT broken

News to me.  I run the libaio test harness built with -m32 on 64 bit
regularly.  What, exactly, is broken?

 , and lots of missing pieces. (For example search in archives
 for why bsg does not support sg-lists.)

 And there are all these additions that everyone wants on top, that call for
 a new interface anyway.

What was proposed above does not require a new interface.  It's just an
additional IO_CMD_*.  I'm not saying there aren't reasons for a new
interface, it's just I didn't see any in this thread.

 So I would like to see a deep fixup of this interface, with an aio version2
 that can take into considerations, all of future needs including these
 above. Kernel code will be very happy to be implemented with the new, 
 interface
 and a COMPAT layer could be put in place for the old interface.

 All interested parties should bring to the table what is the extension/changes
 they need. And we can try and union all of them together.

 (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy
  I know that qemu was wanting this for a while as well as the multitude of
  user-mode servers)

I'm not sure how that's directly related to aio, but ok.  If we're going
to rewrite the aio code, I think Zach's acall would be a good start, at
least on the API front:
  http://lwn.net/Articles/316806/

Cheers,
Jeff


Re: [patch,v3,repost 00/10] make I/O path allocations more numa-friendly

2012-12-14 Thread Jeff Moyer
James Bottomley james.bottom...@hansenpartnership.com writes:

 On Mon, 2012-12-10 at 12:59 -0500, Jeff Moyer wrote:
 Jeff Moyer jmo...@redhat.com writes:
 
  Hi,
 
  This patch set makes memory allocations for data structures used in
  the I/O path more numa friendly by allocating them from the same numa
  node as the storage device.  I've only converted a handful of drivers
  at this point.  My testing is limited by the hardware I have on hand.
  Using these patches, I was able to max out the bandwidth of the storage
  controller when issuing I/O from any node on my 4 node system.  Without
  the patch, I/O from nodes remote to the storage device would suffer a
  penalty ranging from 6-12%.  Given my relatively low-end setup[1], I
  wouldn't be surprised if others could show a more significant performance
  advantage.
 
  This is a repost of the last posting.  The only changes are additional
  reviewed-by/acked-by tags.  I think this version is ready for inclusion.
  James, would you mind taking a look?
 
 James?  Do you have any objections to including this for 3.8?

 Probably for 3.9 since the 3.8 merge window is upon us.

OK.

 Do we actually have any performance numbers from the big system people?
 That's really a must for this type of work.

I'm working on getting some numbers.

 It's missing acks from the affected drivers; that's not a show stopper
 but it would be better to have them.

Well, we could trim the driver list to those that have ACKs, but the
driver changes are trivial.

Thanks,
Jeff


Re: [patch,v3,repost 00/10] make I/O path allocations more numa-friendly

2012-12-10 Thread Jeff Moyer
Jeff Moyer jmo...@redhat.com writes:

 Hi,

 This patch set makes memory allocations for data structures used in
 the I/O path more numa friendly by allocating them from the same numa
 node as the storage device.  I've only converted a handful of drivers
 at this point.  My testing is limited by the hardware I have on hand.
 Using these patches, I was able to max out the bandwidth of the storage
 controller when issuing I/O from any node on my 4 node system.  Without
 the patch, I/O from nodes remote to the storage device would suffer a
 penalty ranging from 6-12%.  Given my relatively low-end setup[1], I
 wouldn't be surprised if others could show a more significant performance
 advantage.

 This is a repost of the last posting.  The only changes are additional
 reviewed-by/acked-by tags.  I think this version is ready for inclusion.
 James, would you mind taking a look?

James?  Do you have any objections to including this for 3.8?

-Jeff


[patch,v3,repost 00/10] make I/O path allocations more numa-friendly

2012-11-27 Thread Jeff Moyer
Hi,

This patch set makes memory allocations for data structures used in
the I/O path more numa friendly by allocating them from the same numa
node as the storage device.  I've only converted a handful of drivers
at this point.  My testing is limited by the hardware I have on hand.
Using these patches, I was able to max out the bandwidth of the storage
controller when issuing I/O from any node on my 4 node system.  Without
the patch, I/O from nodes remote to the storage device would suffer a
penalty ranging from 6-12%.  Given my relatively low-end setup[1], I
wouldn't be surprised if others could show a more significant performance
advantage.

This is a repost of the last posting.  The only changes are additional
reviewed-by/acked-by tags.  I think this version is ready for inclusion.
James, would you mind taking a look?

Cheers,
Jeff

[1] LSI Megaraid SAS controller with 1GB battery-backed cache,
fronting a RAID 6 10+2.  The workload I used was tuned to not
have to hit disk.  Fio file attached.

--
changes from v2->v3:
- Made the numa_node Scsi_Host structure member dependent on CONFIG_NUMA
- Got rid of a GFP_ZERO I added accidentally
changes from v1->v2:
- got rid of the vfs patch, as Al pointed out some fundamental
  problems with it
- credited Bart van Assche properly


Jeff Moyer (10):
  scsi: add scsi_host_alloc_node
  scsi: make __scsi_alloc_queue numa-aware
  scsi: make scsi_alloc_sdev numa-aware
  scsi: allocate scsi_cmnd-s from the device's local numa node
  sd: use alloc_disk_node
  ata: use scsi_host_alloc_node
  megaraid_sas: use scsi_host_alloc_node
  mpt2sas: use scsi_host_alloc_node
  lpfc: use scsi_host_alloc_node
  cciss: use blk_init_queue_node

 drivers/ata/libata-scsi.c |3 ++-
 drivers/block/cciss.c |3 ++-
 drivers/scsi/hosts.c  |   13 +++--
 drivers/scsi/lpfc/lpfc_init.c |   10 ++
 drivers/scsi/megaraid/megaraid_sas_base.c |5 +++--
 drivers/scsi/mpt2sas/mpt2sas_scsih.c  |4 ++--
 drivers/scsi/scsi.c   |   16 ++--
 drivers/scsi/scsi_lib.c   |3 ++-
 drivers/scsi/scsi_scan.c  |4 ++--
 drivers/scsi/sd.c |2 +-
 include/scsi/scsi_host.h  |   28 
 11 files changed, 69 insertions(+), 22 deletions(-)


[patch,v3,repost 06/10] ata: use scsi_host_alloc_node

2012-11-27 Thread Jeff Moyer
Acked-by: Jeff Garzik jgar...@redhat.com
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/ata/libata-scsi.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e3bda07..9d5dd09 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3586,7 +3586,8 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht)
struct Scsi_Host *shost;
 
rc = -ENOMEM;
-   shost = scsi_host_alloc(sht, sizeof(struct ata_port *));
+   shost = scsi_host_alloc_node(sht, sizeof(struct ata_port *),
+                                    dev_to_node(&host->dev));
if (!shost)
goto err_alloc;
 
-- 
1.7.1



[patch,v3,repost 01/10] scsi: add scsi_host_alloc_node

2012-11-27 Thread Jeff Moyer
Allow an LLD to specify on which numa node to allocate scsi data
structures.  Thanks to Bart Van Assche for the suggestion.

Reviewed-by: Bart Van Assche bvanass...@acm.org
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/hosts.c |   13 +++--
 include/scsi/scsi_host.h |   28 
 2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 593085a..06ce602 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
  **/
 struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 {
+   return scsi_host_alloc_node(sht, privsize, NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(scsi_host_alloc);
+
+struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *sht,
+  int privsize, int node)
+{
struct Scsi_Host *shost;
gfp_t gfp_mask = GFP_KERNEL;
 
	if (sht->unchecked_isa_dma && privsize)
gfp_mask |= __GFP_DMA;
 
-   shost = kzalloc(sizeof(struct Scsi_Host) + privsize, gfp_mask);
+   shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize,
+gfp_mask, node);
if (!shost)
return NULL;
 
+   scsi_host_set_numa_node(shost, node);
	shost->host_lock = &shost->default_lock;
	spin_lock_init(shost->host_lock);
	shost->shost_state = SHOST_CREATED;
@@ -443,7 +452,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
kfree(shost);
return NULL;
 }
-EXPORT_SYMBOL(scsi_host_alloc);
+EXPORT_SYMBOL(scsi_host_alloc_node);
 
 struct Scsi_Host *scsi_register(struct scsi_host_template *sht, int privsize)
 {
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 4908480..438856d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -732,6 +732,14 @@ struct Scsi_Host {
 */
struct device *dma_dev;
 
+#ifdef CONFIG_NUMA
+   /*
+* Numa node this device is closest to, used for allocating
+* data structures locally.
+*/
+   int numa_node;
+#endif
+
/*
 * We should ensure that this is aligned, both for better performance
 * and also because some compilers (m68k) don't automatically force
@@ -776,6 +784,8 @@ extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
 extern struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *, int);
+extern struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *,
+ int, int);
 extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
   struct device *,
   struct device *);
@@ -919,6 +929,24 @@ static inline unsigned char scsi_host_get_guard(struct Scsi_Host *shost)
	return shost->prot_guard_type;
 }
 
+#ifdef CONFIG_NUMA
+static inline int scsi_host_get_numa_node(struct Scsi_Host *shost)
+{
+   return shost->numa_node;
+}
+
+static inline void scsi_host_set_numa_node(struct Scsi_Host *shost, int node)
+{
+   shost->numa_node = node;
+}
+#else /* CONFIG_NUMA */
+static inline int scsi_host_get_numa_node(struct Scsi_Host *shost)
+{
+   return NUMA_NO_NODE;
+}
+static inline void scsi_host_set_numa_node(struct Scsi_Host *shost, int node) {}
+#endif
+
 /* legacy interfaces */
 extern struct Scsi_Host *scsi_register(struct scsi_host_template *, int);
 extern void scsi_unregister(struct Scsi_Host *);
-- 
1.7.1



[patch,v3,repost 03/10] scsi: make scsi_alloc_sdev numa-aware

2012-11-27 Thread Jeff Moyer
Use the numa node id set in the Scsi_Host to allocate the sdev structure
on the device-local numa node.

Reviewed-by: Bart Van Assche bvanass...@acm.org
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_scan.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3e58b22..d91749d 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -232,8 +232,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
extern void scsi_evt_thread(struct work_struct *work);
extern void scsi_requeue_run_queue(struct work_struct *work);
 
-   sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-  GFP_ATOMIC);
+   sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+   GFP_ATOMIC, scsi_host_get_numa_node(shost));
if (!sdev)
goto out;
 
-- 
1.7.1



[patch,v3,repost 07/10] megaraid_sas: use scsi_host_alloc_node

2012-11-27 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/megaraid/megaraid_sas_base.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d2c5366..707a6cd 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4020,8 +4020,9 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
if (megasas_set_dma_mask(pdev))
goto fail_set_dma_mask;
 
-   host = scsi_host_alloc(&megasas_template,
-  sizeof(struct megasas_instance));
+   host = scsi_host_alloc_node(&megasas_template,
+   sizeof(struct megasas_instance),
+   dev_to_node(&pdev->dev));
 
if (!host) {
printk(KERN_DEBUG "megasas: scsi_host_alloc failed\n");
-- 
1.7.1



[patch,v3,repost 10/10] cciss: use blk_init_queue_node

2012-11-27 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/block/cciss.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index b0f553b..5fe5546 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1930,7 +1930,8 @@ static void cciss_get_serial_no(ctlr_info_t *h, int logvol,
 static int cciss_add_disk(ctlr_info_t *h, struct gendisk *disk,
int drv_index)
 {
-   disk->queue = blk_init_queue(do_cciss_request, &h->lock);
+   disk->queue = blk_init_queue_node(do_cciss_request, &h->lock,
+ dev_to_node(&h->dev));
	if (!disk->queue)
		goto init_queue_failure;
	sprintf(disk->disk_name, "cciss/c%dd%d", h->ctlr, drv_index);
-- 
1.7.1



[patch,v3,repost 09/10] lpfc: use scsi_host_alloc_node

2012-11-27 Thread Jeff Moyer
Acked-By: James Smart  james.sm...@emulex.com
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/lpfc/lpfc_init.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..65956d3 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
int error = 0;
 
	if (dev != &phba->pcidev->dev)
-		shost = scsi_host_alloc(&lpfc_vport_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_vport_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
	else
-		shost = scsi_host_alloc(&lpfc_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
if (!shost)
goto out;
 
-- 
1.7.1



[patch,v3,repost 08/10] mpt2sas: use scsi_host_alloc_node

2012-11-27 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/mpt2sas/mpt2sas_scsih.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index af4e6c4..a4d6b36 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -8011,8 +8011,8 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
struct MPT2SAS_ADAPTER *ioc;
struct Scsi_Host *shost;
 
-   shost = scsi_host_alloc(&scsih_driver_template,
-   sizeof(struct MPT2SAS_ADAPTER));
+   shost = scsi_host_alloc_node(&scsih_driver_template,
+   sizeof(struct MPT2SAS_ADAPTER), dev_to_node(&pdev->dev));
if (!shost)
return -ENODEV;
 
-- 
1.7.1



[patch,v3,repost 05/10] sd: use alloc_disk_node

2012-11-27 Thread Jeff Moyer
Reviewed-by: Bart Van Assche bvanass...@acm.org
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/sd.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..a5dae6b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
-   gd = alloc_disk(SD_MINORS);
+   gd = alloc_disk_node(SD_MINORS, scsi_host_get_numa_node(sdp->host));
if (!gd)
goto out_free;
 
-- 
1.7.1



[patch,v3,repost 02/10] scsi: make __scsi_alloc_queue numa-aware

2012-11-27 Thread Jeff Moyer
Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
in order to keep all allocations local to the numa node the device is
closest to.

Reviewed-by: Bart Van Assche bvanass...@acm.org
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_lib.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..ebad5e8 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
struct request_queue *q;
struct device *dev = shost->dma_dev;
 
-   q = blk_init_queue(request_fn, NULL);
+   q = blk_init_queue_node(request_fn, NULL,
+   scsi_host_get_numa_node(shost));
if (!q)
return NULL;
 
-- 
1.7.1



[patch,v3,repost 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node

2012-11-27 Thread Jeff Moyer
Reviewed-by: Bart Van Assche bvanass...@acm.org
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi.c |   16 ++--
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2936b44..1750702 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -173,16 +173,19 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
  * NULL on failure
  */
 static struct scsi_cmnd *
-scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
+scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
+   int node)
 {
struct scsi_cmnd *cmd;
 
-   cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+   cmd = kmem_cache_alloc_node(pool->cmd_slab,
+   gfp_mask | pool->gfp_mask | __GFP_ZERO,
+   node);
if (!cmd)
return NULL;
 
-   cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
-gfp_mask | pool->gfp_mask);
+   cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
+   gfp_mask | pool->gfp_mask, node);
	if (!cmd->sense_buffer) {
		kmem_cache_free(pool->cmd_slab, cmd);
return NULL;
@@ -223,7 +226,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
 {
struct scsi_cmnd *cmd;
 
-   cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask);
+   cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask,
+ scsi_host_get_numa_node(shost));
if (!cmd)
return NULL;
 
@@ -435,7 +439,7 @@ struct scsi_cmnd *scsi_allocate_command(gfp_t gfp_mask)
if (!pool)
return NULL;
 
-   return scsi_pool_alloc_command(pool, gfp_mask);
+   return scsi_pool_alloc_command(pool, gfp_mask, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(scsi_allocate_command);
 
-- 
1.7.1



Re: [patch,v2 00/10] make I/O path allocations more numa-friendly

2012-11-13 Thread Jeff Moyer
Elliott, Robert (Server Storage) elli...@hp.com writes:

 What do these commands report about the NUMA and non-uniform IO topology on 
 the test system?

This is a DELL PowerEdge R715.  See chapter 7 of this document for
details on how the I/O bridges are connected:
  
http://www.dell.com/downloads/global/products/pedge/en/Poweredge-r715-technicalguide.pdf

   numactl --hardware

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 2 4 6
node 0 size: 8182 MB
node 0 free: 7856 MB
node 1 cpus: 8 10 12 14
node 1 size: 8192 MB
node 1 free: 8008 MB
node 2 cpus: 9 11 13 15
node 2 size: 8192 MB
node 2 free: 7994 MB
node 3 cpus: 1 3 5 7
node 3 size: 8192 MB
node 3 free: 7982 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 

   lspci -t

# lspci -vt
-+-[:20]-+-00.0  ATI Technologies Inc RD890 Northbridge only dual slot 
(2x8) PCI-e GFX Hydra part
 |   +-02.0-[21]--
 |   +-03.0-[22]00.0  LSI Logic / Symbios Logic MegaRAID SAS 2108 
[Liberator]
 |   \-0b.0-[23]--+-00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ 
Network Connection
 |\-00.1  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ 
Network Connection
 \-[:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external 
gfx0 port A)
 +-02.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 
Gigabit Ethernet
 |\-00.1  Broadcom Corporation NetXtreme II BCM5709 
Gigabit Ethernet
 +-03.0-[02]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 
Gigabit Ethernet
 |\-00.1  Broadcom Corporation NetXtreme II BCM5709 
Gigabit Ethernet
 +-04.0-[03-08]00.0-[04-08]--+-00.0-[05]00.0  LSI Logic / 
Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
 +-09.0-[09]--
 +-12.0  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
 +-12.1  ATI Technologies Inc SB7x0 USB OHCI1 Controller
 +-12.2  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
 +-13.0  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
 +-13.1  ATI Technologies Inc SB7x0 USB OHCI1 Controller
 +-13.2  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
 +-14.0  ATI Technologies Inc SBx00 SMBus Controller
 +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
 +-14.4-[0a]03.0  Matrox Graphics, Inc. MGA G200eW WPCM450
 +-18.0  Advanced Micro Devices [AMD] Family 10h Processor 
HyperTransport Configuration
 +-18.1  Advanced Micro Devices [AMD] Family 10h Processor Address 
Map
 +-18.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM 
Controller
 +-18.3  Advanced Micro Devices [AMD] Family 10h Processor 
Miscellaneous Control
 +-18.4  Advanced Micro Devices [AMD] Family 10h Processor Link 
Control
 +-19.0  Advanced Micro Devices [AMD] Family 10h Processor 
HyperTransport Configuration
 +-19.1  Advanced Micro Devices [AMD] Family 10h Processor Address 
Map
 +-19.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM 
Controller
 +-19.3  Advanced Micro Devices [AMD] Family 10h Processor 
Miscellaneous Control
 +-19.4  Advanced Micro Devices [AMD] Family 10h Processor Link 
Control
 +-1a.0  Advanced Micro Devices [AMD] Family 10h Processor 
HyperTransport Configuration
 +-1a.1  Advanced Micro Devices [AMD] Family 10h Processor Address 
Map
 +-1a.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM 
Controller
 +-1a.3  Advanced Micro Devices [AMD] Family 10h Processor 
Miscellaneous Control
 +-1a.4  Advanced Micro Devices [AMD] Family 10h Processor Link 
Control
 +-1b.0  Advanced Micro Devices [AMD] Family 10h Processor 
HyperTransport Configuration
 +-1b.1  Advanced Micro Devices [AMD] Family 10h Processor Address 
Map
 +-1b.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM 
Controller
 +-1b.3  Advanced Micro Devices [AMD] Family 10h Processor 
Miscellaneous Control
 \-1b.4  Advanced Micro Devices [AMD] Family 10h Processor Link 
Control

# cat /sys/bus/pci/devices/\:20\:03.0/\:22\:00.0/numa_node 
3

-Jeff



 -Original Message-
 From: Jeff Moyer [mailto:jmo...@redhat.com]
 Sent: Monday, 12 November, 2012 3:27 PM
 To: Bart Van Assche
 Cc: Elliott, Robert (Server Storage); linux-scsi@vger.kernel.org
 Subject: Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
 
 Bart Van Assche bvanass...@acm.org writes:
 
  On 11/09/12 21:46, Jeff Moyer wrote:
  On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
  It's certainly better to tie them all to one node then let them be
  randomly scattered across nodes

Re: [patch,v2 00/10] make I/O path allocations more numa-friendly

2012-11-12 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

 On 11/09/12 21:46, Jeff Moyer wrote:
 On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
 It's certainly better to tie them all to one node then let them be
 randomly scattered across nodes; your 6% observation may simply be
 from that.

 How do you think these compare, though (for structures that are per-IO)?
 - tying the structures to the node hosting the storage device
 - tying the structures to the node running the application

 This is a great question, thanks for asking it!  I went ahead and
 modified the megaraid_sas driver to take a module parameter that
 specifies on which node to allocate the scsi_host data structure (and
 all other structures on top that are tied to that).  I then booted the
 system 4 times, specifying a different node each time.  Here are the
 results as compared to a vanilla kernel:

[snip]
 Which NUMA node was processing the megaraid_sas interrupts in these
 tests ? Was irqbalance running during these tests or were interrupts
 manually pinned to a specific CPU core ?

irqbalanced was indeed running, so I can't say for sure what node the
irq was pinned to during my tests (I didn't record that information).

I re-ran the tests, this time turning off irqbalance (well, I set it to
one-shot), and pinning the irq to the node running the benchmark.
In this configuration, I saw no regressions in performance.

As a reminder:

 The first number is the percent gain (or loss) w.r.t. the vanilla
 kernel.  The second number is the standard deviation as a percent of the
 bandwidth.  So, when data structures are tied to node 0, we see an
 increase in performance for nodes 0-3.  However, on node 3, which is the
 node the megaraid_sas controller is attached to, we see no gain in
 performance, and we see an increase in the run to run variation.  The
 standard deviation for the vanilla kernel was 1% across all nodes.

Here are the updated numbers:

data structures tied to node 0

application tied to:
node 0:  0 +/-4%
node 1:  9 +/-1%
node 2: 10 +/-2%
node 3:  0 +/-2%

data structures tied to node 1

application tied to:
node 0:  5 +/-2%
node 1:  6 +/-8%
node 2: 10 +/-1%
node 3:  0 +/-3%

data structures tied to node 2

application tied to:
node 0:  6 +/-2%
node 1:  9 +/-2%
node 2:  7 +/-6%
node 3:  0 +/-3%

data structures tied to node 3

application tied to:
node 0:  0 +/-4%
node 1: 10 +/-2%
node 2: 11 +/-1%
node 3:  0 +/-5%

Now, the above is apples to oranges, since the vanilla kernel was run
w/o any tuning of irqs.  So, I went ahead and booted with
numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.

When we compare a vanilla kernel with and without irq binding, we get
this:

node 0:  0 +/-3%
node 1:  9 +/-1%
node 2:  8 +/-3%
node 3:  0 +/-1%

As you can see, binding irqs helps nodes 1 and 2 quite substantially.
What this boils down to, when you compare a patched kernel with the
vanilla kernel, where they are both tying irqs to the node hosting the
application, is a net gain of zero, but an increase in standard
deviation.

Let me try to make that more readable.  The patch set does not appear
to help at all with my benchmark configuration.  ;-)  One other
conclusion I can draw from this data is that irqbalance could do a
better job.

An interesting (to me) tidbit about this hardware is that, while it has
4 numa nodes, it only has 2 sockets.  Based on the numbers above, I'd
guess nodes 0 and 3 are in the same socket, likewise for 1 and 2.

Cheers,
Jeff


[patch,v3 00/10] make I/O path allocations more numa-friendly

2012-11-09 Thread Jeff Moyer
Hi,

This patch set makes memory allocations for data structures used in
the I/O path more numa friendly by allocating them from the same numa
node as the storage device.  I've only converted a handful of drivers
at this point.  My testing showed that, for workloads where the I/O
processes were not tied to the numa node housing the device, a speedup
of around 6% was observed.  When the I/O processes were tied to the
numa node of the device, there was no measurable difference in my test
setup.  Given my relatively low-end setup[1], I wouldn't be surprised
if others could show a more significant performance advantage.

Comments would be greatly appreciated.

Cheers,
Jeff

[1] LSI Megaraid SAS controller with 1GB battery-backed cache,
fronting a RAID 6 10+2.  The workload I used was tuned to not
have to hit disk.  Fio file attached.

--
changes from v2->v3:
- Made the numa_node Scsi_Host structure member dependent on CONFIG_NUMA
- Got rid of a GFP_ZERO I added accidentally
changes from v1->v2:
- got rid of the vfs patch, as Al pointed out some fundamental
  problems with it
- credited Bart van Assche properly


Jeff Moyer (10):
  scsi: add scsi_host_alloc_node
  scsi: make __scsi_alloc_queue numa-aware
  scsi: make scsi_alloc_sdev numa-aware
  scsi: allocate scsi_cmnd-s from the device's local numa node
  sd: use alloc_disk_node
  ata: use scsi_host_alloc_node
  megaraid_sas: use scsi_host_alloc_node
  mpt2sas: use scsi_host_alloc_node
  lpfc: use scsi_host_alloc_node
  cciss: use blk_init_queue_node

 drivers/ata/libata-scsi.c |3 ++-
 drivers/block/cciss.c |3 ++-
 drivers/scsi/hosts.c  |   13 +++--
 drivers/scsi/lpfc/lpfc_init.c |   10 ++
 drivers/scsi/megaraid/megaraid_sas_base.c |5 +++--
 drivers/scsi/mpt2sas/mpt2sas_scsih.c  |4 ++--
 drivers/scsi/scsi.c   |   16 ++--
 drivers/scsi/scsi_lib.c   |3 ++-
 drivers/scsi/scsi_scan.c  |4 ++--
 drivers/scsi/sd.c |2 +-
 include/scsi/scsi_host.h  |   28 
 11 files changed, 69 insertions(+), 22 deletions(-)



[patch,v3 02/10] scsi: make __scsi_alloc_queue numa-aware

2012-11-09 Thread Jeff Moyer
Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
in order to keep all allocations local to the numa node the device is
closest to.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_lib.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..ebad5e8 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
struct request_queue *q;
struct device *dev = shost->dma_dev;
 
-   q = blk_init_queue(request_fn, NULL);
+   q = blk_init_queue_node(request_fn, NULL,
+   scsi_host_get_numa_node(shost));
if (!q)
return NULL;
 
-- 
1.7.1



[patch,v3 09/10] lpfc: use scsi_host_alloc_node

2012-11-09 Thread Jeff Moyer
Acked-By: James Smart  james.sm...@emulex.com
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/lpfc/lpfc_init.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..65956d3 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
int error = 0;
 
	if (dev != &phba->pcidev->dev)
-		shost = scsi_host_alloc(&lpfc_vport_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_vport_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
	else
-		shost = scsi_host_alloc(&lpfc_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
if (!shost)
goto out;
 
-- 
1.7.1



[patch,v3 10/10] cciss: use blk_init_queue_node

2012-11-09 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/block/cciss.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index b0f553b..5fe5546 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1930,7 +1930,8 @@ static void cciss_get_serial_no(ctlr_info_t *h, int logvol,
 static int cciss_add_disk(ctlr_info_t *h, struct gendisk *disk,
int drv_index)
 {
-   disk->queue = blk_init_queue(do_cciss_request, &h->lock);
+   disk->queue = blk_init_queue_node(do_cciss_request, &h->lock,
+ dev_to_node(&h->dev));
	if (!disk->queue)
		goto init_queue_failure;
	sprintf(disk->disk_name, "cciss/c%dd%d", h->ctlr, drv_index);
-- 
1.7.1



[patch,v3 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node

2012-11-09 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi.c |   16 ++--
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2936b44..1750702 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -173,16 +173,19 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
  * NULL on failure
  */
 static struct scsi_cmnd *
-scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
+scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
+   int node)
 {
struct scsi_cmnd *cmd;
 
-   cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+   cmd = kmem_cache_alloc_node(pool->cmd_slab,
+   gfp_mask | pool->gfp_mask | __GFP_ZERO,
+   node);
if (!cmd)
return NULL;
 
-   cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
-gfp_mask | pool->gfp_mask);
+   cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
+   gfp_mask | pool->gfp_mask, node);
	if (!cmd->sense_buffer) {
		kmem_cache_free(pool->cmd_slab, cmd);
return NULL;
@@ -223,7 +226,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
 {
struct scsi_cmnd *cmd;
 
-   cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask);
+   cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask,
+ scsi_host_get_numa_node(shost));
if (!cmd)
return NULL;
 
@@ -435,7 +439,7 @@ struct scsi_cmnd *scsi_allocate_command(gfp_t gfp_mask)
if (!pool)
return NULL;
 
-   return scsi_pool_alloc_command(pool, gfp_mask);
+   return scsi_pool_alloc_command(pool, gfp_mask, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(scsi_allocate_command);
 
-- 
1.7.1



[patch,v3 05/10] sd: use alloc_disk_node

2012-11-09 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/sd.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..a5dae6b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
-	gd = alloc_disk(SD_MINORS);
+	gd = alloc_disk_node(SD_MINORS, scsi_host_get_numa_node(sdp->host));
if (!gd)
goto out_free;
 
-- 
1.7.1



[patch,v3 03/10] scsi: make scsi_alloc_sdev numa-aware

2012-11-09 Thread Jeff Moyer
Use the numa node id set in the Scsi_Host to allocate the sdev structure
on the device-local numa node.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_scan.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3e58b22..d91749d 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -232,8 +232,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	extern void scsi_evt_thread(struct work_struct *work);
 	extern void scsi_requeue_run_queue(struct work_struct *work);
 
-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_ATOMIC);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+			    GFP_ATOMIC, scsi_host_get_numa_node(shost));
 	if (!sdev)
 		goto out;
 
-- 
1.7.1



[patch,v3 01/10] scsi: add scsi_host_alloc_node

2012-11-09 Thread Jeff Moyer
Allow an LLD to specify on which numa node to allocate scsi data
structures.  Thanks to Bart Van Assche for the suggestion.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/hosts.c |   13 +++--
 include/scsi/scsi_host.h |   28 
 2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 593085a..06ce602 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
  **/
 struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 {
+   return scsi_host_alloc_node(sht, privsize, NUMA_NO_NODE);
+}
+EXPORT_SYMBOL(scsi_host_alloc);
+
+struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *sht,
+  int privsize, int node)
+{
struct Scsi_Host *shost;
gfp_t gfp_mask = GFP_KERNEL;
 
 	if (sht->unchecked_isa_dma && privsize)
 		gfp_mask |= __GFP_DMA;
 
-	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, gfp_mask);
+	shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize,
+			     gfp_mask, node);
 	if (!shost)
 		return NULL;
 
+	scsi_host_set_numa_node(shost, node);
 	shost->host_lock = &shost->default_lock;
 	spin_lock_init(shost->host_lock);
 	shost->shost_state = SHOST_CREATED;
@@ -443,7 +452,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	kfree(shost);
 	return NULL;
 }
-EXPORT_SYMBOL(scsi_host_alloc);
+EXPORT_SYMBOL(scsi_host_alloc_node);
 
 struct Scsi_Host *scsi_register(struct scsi_host_template *sht, int privsize)
 {
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 4908480..438856d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -732,6 +732,14 @@ struct Scsi_Host {
 */
struct device *dma_dev;
 
+#ifdef CONFIG_NUMA
+   /*
+* Numa node this device is closest to, used for allocating
+* data structures locally.
+*/
+   int numa_node;
+#endif
+
/*
 * We should ensure that this is aligned, both for better performance
 * and also because some compilers (m68k) don't automatically force
@@ -776,6 +784,8 @@ extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
 extern struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *, int);
+extern struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *,
+					      int, int);
 extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
   struct device *,
   struct device *);
@@ -919,6 +929,24 @@ static inline unsigned char scsi_host_get_guard(struct Scsi_Host *shost)
 	return shost->prot_guard_type;
 }
 
+#ifdef CONFIG_NUMA
+static inline int scsi_host_get_numa_node(struct Scsi_Host *shost)
+{
+	return shost->numa_node;
+}
+
+static inline void scsi_host_set_numa_node(struct Scsi_Host *shost, int node)
+{
+	shost->numa_node = node;
+}
+#else /* CONFIG_NUMA */
+static inline int scsi_host_get_numa_node(struct Scsi_Host *shost)
+{
+	return NUMA_NO_NODE;
+}
+static inline void scsi_host_set_numa_node(struct Scsi_Host *shost, int node) {}
+#endif
+
 /* legacy interfaces */
 extern struct Scsi_Host *scsi_register(struct scsi_host_template *, int);
 extern void scsi_unregister(struct Scsi_Host *);
-- 
1.7.1



[patch,v3 08/10] mpt2sas: use scsi_host_alloc_node

2012-11-09 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/mpt2sas/mpt2sas_scsih.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index af4e6c4..a4d6b36 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -8011,8 +8011,8 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct MPT2SAS_ADAPTER *ioc;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(&scsih_driver_template,
-	    sizeof(struct MPT2SAS_ADAPTER));
+	shost = scsi_host_alloc_node(&scsih_driver_template,
+	    sizeof(struct MPT2SAS_ADAPTER), dev_to_node(&pdev->dev));
 	if (!shost)
 		return -ENODEV;
 
-- 
1.7.1



[patch,v3 06/10] ata: use scsi_host_alloc_node

2012-11-09 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/ata/libata-scsi.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e3bda07..9d5dd09 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3586,7 +3586,8 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht)
 		struct Scsi_Host *shost;
 
 		rc = -ENOMEM;
-		shost = scsi_host_alloc(sht, sizeof(struct ata_port *));
+		shost = scsi_host_alloc_node(sht, sizeof(struct ata_port *),
+					     dev_to_node(host->dev));
 		if (!shost)
 			goto err_alloc;
 
-- 
1.7.1



[patch,v3 07/10] megaraid_sas: use scsi_host_alloc_node

2012-11-09 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/megaraid/megaraid_sas_base.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d2c5366..707a6cd 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4020,8 +4020,9 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (megasas_set_dma_mask(pdev))
 		goto fail_set_dma_mask;
 
-	host = scsi_host_alloc(&megasas_template,
-			       sizeof(struct megasas_instance));
+	host = scsi_host_alloc_node(&megasas_template,
+				    sizeof(struct megasas_instance),
+				    dev_to_node(&pdev->dev));
 
 	if (!host) {
 		printk(KERN_DEBUG "megasas: scsi_host_alloc failed\n");
-- 
1.7.1



Re: [patch,v2 00/10] make I/O path allocations more numa-friendly

2012-11-09 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>> It's certainly better to tie them all to one node than let them be
>> randomly scattered across nodes; your 6% observation may simply be
>> from that.
>>
>> How do you think these compare, though (for structures that are per-IO)?
>> - tying the structures to the node hosting the storage device
>> - tying the structures to the node running the application

This is a great question, thanks for asking it!  I went ahead and
modified the megaraid_sas driver to take a module parameter that
specifies on which node to allocate the scsi_host data structure (and
all other structures on top that are tied to that).  I then booted the
system 4 times, specifying a different node each time.  Here are the
results as compared to a vanilla kernel:

data structures tied to node 0

application tied to:
node 0:  +6% +/-1%
node 1:  +9% +/-2%
node 2:  +10% +/-3%
node 3:  +0% +/-4%

The first number is the percent gain (or loss) w.r.t. the vanilla
kernel.  The second number is the standard deviation as a percent of the
bandwidth.  So, when data structures are tied to node 0, we see an
increase in performance for nodes 0-2.  However, on node 3, which is the
node the megaraid_sas controller is attached to, we see no gain in
performance, and we see an increase in the run to run variation.  The
standard deviation for the vanilla kernel was 1% across all nodes.

Given that the results are mixed, depending on which node the workload
is running, I can't really draw any conclusions from this.  The node 3
number is really throwing me for a loop.  If it were positive, I'd do
some handwaving about all data structures getting allocated on node 0
at boot, and the addition of getting the scsi_cmnd structure on the same
node is what resulted in the net gain.

data structures tied to node 1

application tied to:
node 0:  +6% +/-1%
node 1:  +0% +/-2%
node 2:  +0% +/-6%
node 3:  -7% +/-13%

Now this is interesting!  Tying data structures to node 1 results in a
performance boost for node 0?  That would seem to validate your question
of whether it just helps out to have everything come from the same node,
as opposed to allocated close to the storage controller.  However, node
3 sees a decrease in performance, and a huge standard deviation.  Node 2
also sees an increased standard deviation.  That leaves me wondering why
node 0 didn't also experience an increase in variation.

data structures tied to node 2

application tied to:
node 0:  +5% +/-3%
node 1:  +0% +/-5%
node 2:  +0% +/-4%
node 3:  +0% +/-5%

Here, we *mostly* just see an increase in standard deviation, with no
appreciable change in application performance.

data structures tied to node 3

application tied to:
node 0:  +0% +/-6%
node 1:  +6% +/-4%
node 2:  +7% +/-4%
node 3:  +0% +/-4%

Now, this is the case where I'd expect to see the best performance,
since the HBA is on node 3.  However, that's not what we get!  Instead,
we get maybe a couple percent improvement on nodes 1 and 2, and an
increased run-to-run variation for nodes 0 and 3.

Overall, I'd say that my testing is inconclusive, and I may just pull
the patch set until I can get some reasonable results.


And now to address your question from a completely theoretical point of
view (since empirical data has only succeeded in baffling me).  You have
to keep in mind that some of these data structures are long-lived.
Things like the Scsi_Host and request_queue will be around as long as
the device is present (and the module is not removed).  So, it doesn't
make sense to try to allocate these data structures on the node running
the application, unless you are pinning the application to a single node
that is not the node hosting the storage (which would be weird).  So, I
think it does make sense to pin these data structures to a single node,
that node being the one closest to the storage.  We do have to keep in
mind that there are architectures for which there could be multiple
nodes equidistant to the storage.

>> The latter means that PCI Express traffic must spend more time winding
>> its way through the CPU complex. For example, the Memory Writes to the
>> OQ and to deliver the MSI-X interrupt take longer to reach the destination
>> CPU memory, snooping the other CPUs along the way. Once there, though,
>> application reads should be faster.

I'm using direct I/O in my testing, which means that the DMA is going to
whatever node the memory allocation (for application buffers) was
satisfied.  For buffered I/O, you're going to end up dma-ing from the
page cache, and that will also likely come from the node on which the
application was running at the time of the read/write.  So, what I'm
getting at is you're very likely to have a split between the data being
transferred and the data structures used to manage the transfer.

>> We're trying to design the SCSI Express standards (SOP and PQI) to be
>> non-uniform memory and non-uniform I/O friendly.  Some concepts
>> we've

Re: [patch,v2 01/10] scsi: add scsi_host_alloc_node

2012-11-05 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

> On 11/02/12 22:45, Jeff Moyer wrote:
>> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
>> index 593085a..7d7ad8b 100644
>> --- a/drivers/scsi/hosts.c
>> +++ b/drivers/scsi/hosts.c
>> @@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
>>   **/
>>  struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>>  {
>> +	return scsi_host_alloc_node(sht, privsize, -1);
>
> Using NUMA_NO_NODE here might improve readability.

Agreed, I'll fix that.

>> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
>> index 4908480..a1b5c8e 100644
>> --- a/include/scsi/scsi_host.h
>> +++ b/include/scsi/scsi_host.h
>> @@ -733,6 +733,12 @@ struct Scsi_Host {
>>  	struct device *dma_dev;
>>
>>  	/*
>> +	 * Numa node this device is closest to, used for allocating
>> +	 * data structures locally.
>> +	 */
>> +	int numa_node;
>
> Have you considered using #ifdef CONFIG_NUMA / #endif here ? I've
> noticed that all other numa_node members in structures under include/
> have this.

That was an oversight, thanks for pointing it out.  I'll fix it up.

Cheers,
Jeff


Re: [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node

2012-11-05 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

> On 11/02/12 22:45, Jeff Moyer wrote:
>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
>> index 2936b44..4db6973 100644
>> --- a/drivers/scsi/scsi.c
>> +++ b/drivers/scsi/scsi.c
>> @@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
>>    * NULL on failure
>>    */
>>  static struct scsi_cmnd *
>> -scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
>> +scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
>> +			int node)
>>  {
>>  	struct scsi_cmnd *cmd;
>>
>> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
>> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
>> +				    gfp_mask | pool->gfp_mask | __GFP_ZERO,
>> +				    node);
>>  	if (!cmd)
>>  		return NULL;
>>
>> -	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
>> -					     gfp_mask | pool->gfp_mask);
>> +	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
>> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
>> +			node);
>
> It's not clear to me why __GFP_ZERO is added to the allocation flags ?

Hmm, seems I thought this was another case of kmem_cache_zalloc.  I'll
fix it up.

Cheers,
Jeff


Re: [patch,v2 05/10] sd: use alloc_disk_node

2012-11-05 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

> On 11/02/12 22:45, Jeff Moyer wrote:
>> Signed-off-by: Jeff Moyer jmo...@redhat.com
>> ---
>>  drivers/scsi/sd.c |2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 12f6fdf..8deb915 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>>  	if (!sdkp)
>>  		goto out;
>>
>> -	gd = alloc_disk(SD_MINORS);
>> +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>>  	if (!gd)
>>  		goto out_free;
>
> shost->numa_node can be another NUMA node than dev_to_node(dev). Have
> you considered using shost->numa_node here ?

It can?  How?

Just so I'm clear, you're suggesting I use the scsi_device's host
pointer to get to the Scsi_Host, and that *will* be filled in at this
point, right?

Cheers,
Jeff


Re: [patch 01/10] scsi: make __scsi_alloc_queue numa-aware

2012-11-02 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

> On 10/30/12 21:14, Jeff Moyer wrote:
>> Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
>> in order to keep all allocations local to the numa node the device is
>> closest to.
>>
>> Signed-off-by: Jeff Moyer jmo...@redhat.com
>> ---
>>  drivers/scsi/scsi_lib.c |2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index da36a3a..8662a09 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>>  	struct request_queue *q;
>>  	struct device *dev = shost->dma_dev;
>>
>> -	q = blk_init_queue(request_fn, NULL);
>> +	q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
>>  	if (!q)
>>  		return NULL;
>
> Hello Jeff,
>
> I haven't seen the patch that introduces numa_node in struct Scsi_Host
> nor the cover letter of this patch series ? Have these been posted on
> the linux-scsi mailing list ?

Hi, Bart,

Wow, looks like I left out the first patch!  The cover letter I think
only went to lkml.  I have to do a repost, so I'll be sure to send the
cover to linux-scsi as well, and CC you (and credit you for the idea,
which I totally forgot to do).  I'll send a repost out today.

Cheers,
Jeff


[patch,v2 09/10] lpfc: use scsi_host_alloc_node

2012-11-02 Thread Jeff Moyer
Acked-by: James Smart james.sm...@emulex.com
Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/lpfc/lpfc_init.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..65956d3 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
 	int error = 0;
 
 	if (dev != &phba->pcidev->dev)
-		shost = scsi_host_alloc(&lpfc_vport_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_vport_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
 	else
-		shost = scsi_host_alloc(&lpfc_template,
-					sizeof(struct lpfc_vport));
+		shost = scsi_host_alloc_node(&lpfc_template,
+					     sizeof(struct lpfc_vport),
+					     dev_to_node(&phba->pcidev->dev));
 	if (!shost)
 		goto out;
 
-- 
1.7.1



[patch,v2 07/10] megaraid_sas: use scsi_host_alloc_node

2012-11-02 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/megaraid/megaraid_sas_base.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d2c5366..707a6cd 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4020,8 +4020,9 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (megasas_set_dma_mask(pdev))
 		goto fail_set_dma_mask;
 
-	host = scsi_host_alloc(&megasas_template,
-			       sizeof(struct megasas_instance));
+	host = scsi_host_alloc_node(&megasas_template,
+				    sizeof(struct megasas_instance),
+				    dev_to_node(&pdev->dev));
 
 	if (!host) {
 		printk(KERN_DEBUG "megasas: scsi_host_alloc failed\n");
-- 
1.7.1



[patch,v2 06/10] ata: use scsi_host_alloc_node

2012-11-02 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/ata/libata-scsi.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e3bda07..9d5dd09 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3586,7 +3586,8 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht)
 		struct Scsi_Host *shost;
 
 		rc = -ENOMEM;
-		shost = scsi_host_alloc(sht, sizeof(struct ata_port *));
+		shost = scsi_host_alloc_node(sht, sizeof(struct ata_port *),
+					     dev_to_node(host->dev));
 		if (!shost)
 			goto err_alloc;
 
-- 
1.7.1



[patch,v2 01/10] scsi: add scsi_host_alloc_node

2012-11-02 Thread Jeff Moyer
Allow an LLD to specify on which numa node to allocate scsi data
structures.  Thanks to Bart Van Assche for the suggestion.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/hosts.c |   13 +++--
 include/scsi/scsi_host.h |8 
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 593085a..7d7ad8b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
  **/
 struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 {
+   return scsi_host_alloc_node(sht, privsize, -1);
+}
+EXPORT_SYMBOL(scsi_host_alloc);
+
+struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *sht,
+  int privsize, int node)
+{
struct Scsi_Host *shost;
gfp_t gfp_mask = GFP_KERNEL;
 
 	if (sht->unchecked_isa_dma && privsize)
 		gfp_mask |= __GFP_DMA;
 
-	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, gfp_mask);
+	shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize,
+			     gfp_mask, node);
 	if (!shost)
 		return NULL;
 
+	shost->numa_node = node;
 	shost->host_lock = &shost->default_lock;
 	spin_lock_init(shost->host_lock);
 	shost->shost_state = SHOST_CREATED;
@@ -443,7 +452,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
kfree(shost);
return NULL;
 }
-EXPORT_SYMBOL(scsi_host_alloc);
+EXPORT_SYMBOL(scsi_host_alloc_node);
 
 struct Scsi_Host *scsi_register(struct scsi_host_template *sht, int privsize)
 {
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 4908480..a1b5c8e 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -733,6 +733,12 @@ struct Scsi_Host {
struct device *dma_dev;
 
/*
+* Numa node this device is closest to, used for allocating
+* data structures locally.
+*/
+   int numa_node;
+
+   /*
 * We should ensure that this is aligned, both for better performance
 * and also because some compilers (m68k) don't automatically force
 * alignment to a long boundary.
@@ -776,6 +782,8 @@ extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
 extern struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *, int);
+extern struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *,
+					      int, int);
 extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
   struct device *,
   struct device *);
-- 
1.7.1



[patch,v2 02/10] scsi: make __scsi_alloc_queue numa-aware

2012-11-02 Thread Jeff Moyer
Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
in order to keep all allocations local to the numa node the device is
closest to.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_lib.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..8662a09 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	struct request_queue *q;
 	struct device *dev = shost->dma_dev;
 
-	q = blk_init_queue(request_fn, NULL);
+	q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
 	if (!q)
 		return NULL;
 
-- 
1.7.1



[patch,v2 05/10] sd: use alloc_disk_node

2012-11-02 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/sd.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..8deb915 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
-   gd = alloc_disk(SD_MINORS);
+   gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
if (!gd)
goto out_free;
 
-- 
1.7.1



[patch,v2 08/10] mpt2sas: use scsi_host_alloc_node

2012-11-02 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/mpt2sas/mpt2sas_scsih.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index af4e6c4..a4d6b36 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -8011,8 +8011,8 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct MPT2SAS_ADAPTER *ioc;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(&scsih_driver_template,
-	    sizeof(struct MPT2SAS_ADAPTER));
+	shost = scsi_host_alloc_node(&scsih_driver_template,
+	    sizeof(struct MPT2SAS_ADAPTER), dev_to_node(&pdev->dev));
 	if (!shost)
 		return -ENODEV;
 
-- 
1.7.1



[patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node

2012-11-02 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi.c |   17 +++--
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2936b44..4db6973 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
  * NULL on failure
  */
 static struct scsi_cmnd *
-scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
+scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
+   int node)
 {
struct scsi_cmnd *cmd;
 
-	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+	cmd = kmem_cache_alloc_node(pool->cmd_slab,
+				    gfp_mask | pool->gfp_mask | __GFP_ZERO,
+				    node);
 	if (!cmd)
 		return NULL;
 
-	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
-					     gfp_mask | pool->gfp_mask);
+	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
+			gfp_mask | pool->gfp_mask | __GFP_ZERO,
+			node);
 	if (!cmd->sense_buffer) {
 		kmem_cache_free(pool->cmd_slab, cmd);
 		return NULL;
@@ -223,7 +227,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
 {
 	struct scsi_cmnd *cmd;
 
-	cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask);
+	cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask,
+				      shost->numa_node);
if (!cmd)
return NULL;
 
@@ -435,7 +440,7 @@ struct scsi_cmnd *scsi_allocate_command(gfp_t gfp_mask)
if (!pool)
return NULL;
 
-   return scsi_pool_alloc_command(pool, gfp_mask);
+   return scsi_pool_alloc_command(pool, gfp_mask, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(scsi_allocate_command);
 
-- 
1.7.1



[patch,v2 03/10] scsi: make scsi_alloc_sdev numa-aware

2012-11-02 Thread Jeff Moyer
Use the numa node id set in the Scsi_Host to allocate the sdev structure
on the device-local numa node.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_scan.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 3e58b22..f10a308 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -232,8 +232,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	extern void scsi_evt_thread(struct work_struct *work);
 	extern void scsi_requeue_run_queue(struct work_struct *work);
 
-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_ATOMIC);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+			    GFP_ATOMIC, shost->numa_node);
 	if (!sdev)
 		goto out;
 
-- 
1.7.1



[patch,v2 10/10] cciss: use blk_init_queue_node

2012-11-02 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/block/cciss.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index b0f553b..5fe5546 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1930,7 +1930,8 @@ static void cciss_get_serial_no(ctlr_info_t *h, int logvol,
 static int cciss_add_disk(ctlr_info_t *h, struct gendisk *disk,
			   int drv_index)
 {
-	disk->queue = blk_init_queue(do_cciss_request, &h->lock);
+	disk->queue = blk_init_queue_node(do_cciss_request, &h->lock,
+					  dev_to_node(&h->dev));
 	if (!disk->queue)
 		goto init_queue_failure;
 	sprintf(disk->disk_name, "cciss/c%dd%d", h->ctlr, drv_index);
-- 
1.7.1



[patch 09/10] lpfc: use scsi_host_alloc_node

2012-10-30 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/lpfc/lpfc_init.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..65956d3 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -3051,11 +3051,13 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
int error = 0;
 
	if (dev != &phba->pcidev->dev)
-   shost = scsi_host_alloc(&lpfc_vport_template,
-   sizeof(struct lpfc_vport));
+   shost = scsi_host_alloc_node(&lpfc_vport_template,
+sizeof(struct lpfc_vport),
+dev_to_node(&phba->pcidev->dev));
	else
-   shost = scsi_host_alloc(&lpfc_template,
-   sizeof(struct lpfc_vport));
+   shost = scsi_host_alloc_node(&lpfc_template,
+sizeof(struct lpfc_vport),
+dev_to_node(&phba->pcidev->dev));
if (!shost)
goto out;
 
-- 
1.7.1



[patch 07/10] megaraid_sas: use scsi_host_alloc_node

2012-10-30 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/megaraid/megaraid_sas_base.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d2c5366..707a6cd 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4020,8 +4020,9 @@ megasas_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
if (megasas_set_dma_mask(pdev))
goto fail_set_dma_mask;
 
-   host = scsi_host_alloc(&megasas_template,
-  sizeof(struct megasas_instance));
+   host = scsi_host_alloc_node(&megasas_template,
+   sizeof(struct megasas_instance),
+   dev_to_node(&pdev->dev));
 
if (!host) {
		printk(KERN_DEBUG "megasas: scsi_host_alloc failed\n");
-- 
1.7.1



[patch 01/10] scsi: make __scsi_alloc_queue numa-aware

2012-10-30 Thread Jeff Moyer
Pass the numa node id set in the Scsi_Host on to blk_init_queue_node
in order to keep all allocations local to the numa node the device is
closest to.

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi_lib.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..8662a09 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
struct request_queue *q;
	struct device *dev = shost->dma_dev;
 
-   q = blk_init_queue(request_fn, NULL);
+   q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
if (!q)
return NULL;
 
-- 
1.7.1



[patch 03/10] scsi: allocate scsi_cmnd-s from the device's local numa node

2012-10-30 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/scsi.c |   17 +++--
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2936b44..4db6973 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
  * NULL on failure
  */
 static struct scsi_cmnd *
-scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
+scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
+   int node)
 {
struct scsi_cmnd *cmd;
 
-   cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
+   cmd = kmem_cache_alloc_node(pool->cmd_slab,
+   gfp_mask | pool->gfp_mask | __GFP_ZERO,
+   node);
if (!cmd)
return NULL;
 
-   cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
-gfp_mask | pool->gfp_mask);
+   cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
+   gfp_mask | pool->gfp_mask | __GFP_ZERO,
+   node);
	if (!cmd->sense_buffer) {
		kmem_cache_free(pool->cmd_slab, cmd);
return NULL;
@@ -223,7 +227,8 @@ scsi_host_alloc_command(struct Scsi_Host *shost, gfp_t gfp_mask)
 {
struct scsi_cmnd *cmd;
 
-   cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask);
+   cmd = scsi_pool_alloc_command(shost->cmd_pool, gfp_mask,
+ shost->numa_node);
if (!cmd)
return NULL;
 
@@ -435,7 +440,7 @@ struct scsi_cmnd *scsi_allocate_command(gfp_t gfp_mask)
if (!pool)
return NULL;
 
-   return scsi_pool_alloc_command(pool, gfp_mask);
+   return scsi_pool_alloc_command(pool, gfp_mask, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(scsi_allocate_command);
 
-- 
1.7.1



[patch 04/10] sd: use alloc_disk_node

2012-10-30 Thread Jeff Moyer

Signed-off-by: Jeff Moyer jmo...@redhat.com
---
 drivers/scsi/sd.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..8deb915 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
if (!sdkp)
goto out;
 
-   gd = alloc_disk(SD_MINORS);
+   gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
if (!gd)
goto out_free;
 
-- 
1.7.1



Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node

2012-10-23 Thread Jeff Moyer
Bart Van Assche bvanass...@acm.org writes:

 On 10/22/12 21:01, Jeff Moyer wrote:
 All of the infrastructure is available to allocate a request_queue on a
 particular numa node, but it isn't being utilized at all.  Wire up the
 sd driver to allocate the request_queue on the HBA's local numa node.

 This is a request for comments and testing (I've built and booted it,
 nothing more).  I believe that this should be a performance win, but I
 have no numbers to back it up as yet.  Suggestions for workloads to test
 are welcome.

 Cheers,
 Jeff

 Signed-off-by: Jeff Moyer jmo...@redhat.com

 diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
 index da36a3a..7986483 100644
 --- a/drivers/scsi/scsi_lib.c
 +++ b/drivers/scsi/scsi_lib.c
 @@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
  struct request_queue *q;
  struct device *dev = shost->dma_dev;

 -q = blk_init_queue(request_fn, NULL);
 +q = blk_init_queue_node(request_fn, NULL,
 +dev_to_node(&shost->shost_dev));
  if (!q)
  return NULL;

 Are you sure this approach will always result in the queue being
 allocated on the same NUMA node as the HCA ? If e.g. a user triggers
 LUN scanning via sysfs the above code may be invoked on another NUMA
 node than the node to which the HCA is connected.

shost->shost_dev should inherit the numa node from the pci bus to which
it is attached.  So long as that works, there should be no concern over
which numa node the probe code is running on.

 Also, if you have a look at e.g. scsi_request_fn() or
 scsi_device_unbusy() you will see that in order to avoid inter-node
 traffic it's important to allocate the sdev and shost data structures
 on the same NUMA node.

Yes, good point.

 How about the following approach ?
 - Add a variant of scsi_host_alloc() that allows to specify on which
   NUMA node to allocate the shost structure and also that stores the
   identity of that node in the shost structure.
 - Modify __scsi_alloc_queue() such that it allocates the sdev structure
   on the same NUMA node as the shost structure.
 - Modify the SCSI LLD of your choice such that it uses the new
   scsi_host_alloc() call. According to what is appropriate the NUMA node
   on which to allocate the shost could be specified by the user or could
   be identical to the NUMA node of the HCA controlled by the SCSI LLD
   (see e.g. /sys/devices/pci*/*/numa_node). Please keep in mind that a
   single PCIe bus may have a minimal distance to more than one NUMA
   node. See e.g. the diagram at the top of page 8 in

 http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
   for a system diagram of a NUMA system where each PCIe bus has a
   minimal distance to two different NUMA nodes.

That's an interesting configuration.  I wonder what the numa_node sysfs
file contains for such systems--do you know?  I'm not sure how we could
allow this to be user-controlled at probe time.  Did you have a specific
mechanism in mind?  Module parameters?  Something else?

Thanks for your input, Bart.

Cheers,
Jeff


[patch/rfc/rft] sd: allocate request_queue on device's local numa node

2012-10-22 Thread Jeff Moyer
Hi,

All of the infrastructure is available to allocate a request_queue on a
particular numa node, but it isn't being utilized at all.  Wire up the
sd driver to allocate the request_queue on the HBA's local numa node.

This is a request for comments and testing (I've built and booted it,
nothing more).  I believe that this should be a performance win, but I
have no numbers to back it up as yet.  Suggestions for workloads to test
are welcome.

Cheers,
Jeff

Signed-off-by: Jeff Moyer jmo...@redhat.com

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index da36a3a..7986483 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
struct request_queue *q;
	struct device *dev = shost->dma_dev;
 
-   q = blk_init_queue(request_fn, NULL);
+   q = blk_init_queue_node(request_fn, NULL,
+   dev_to_node(&shost->shost_dev));
if (!q)
return NULL;
 