Re: [PATCH 2/8] powerpc: remove CONFIG_PCI_QSPAN

2018-10-17 Thread Benjamin Herrenschmidt
On Wed, 2018-10-17 at 10:01 +0200, Christoph Hellwig wrote:
> This option isn't actually used anywhere.

Oh my, that's ancient. Probably didn't make the cut from arch/ppc to
arch/powerpc

> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/Kconfig | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a80669209155..e8c8970248bc 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -955,7 +955,6 @@ config PCI
>   bool "PCI support" if PPC_PCI_CHOICE
>   default y if !40x && !CPM2 && !PPC_8xx && !PPC_83xx \
>   && !PPC_85xx && !PPC_86xx && !GAMECUBE_COMMON
> - default PCI_QSPAN if PPC_8xx
>   select GENERIC_PCI_IOMAP
>   help
> Find out whether your system includes a PCI bus. PCI is the name of
> @@ -969,14 +968,6 @@ config PCI_DOMAINS
>  config PCI_SYSCALL
>   def_bool PCI
>  
> -config PCI_QSPAN
> - bool "QSpan PCI"
> - depends on PPC_8xx
> - select PPC_I8259
> - help
> -   Say Y here if you have a system based on a Motorola 8xx-series
> -   embedded processor with a QSPAN PCI interface, otherwise say N.
> -
>  config PCI_8260
>   bool
>   depends on PCI && 8260



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote:
> Basically, all this list processing is a huge overhead compared to
> just putting a helper call in the existing sg iteration loop of the
> actual op.  Particularly if the actual op is a no-op like no-mmu x86
> would use.

Yes, I'm leaning toward that approach too.

The helper itself could hang off the devmap though.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)
> 
> Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
> 
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
> 
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return a SGL with three entries, two
> pointing into the range and one to the p2p.
> 
> It is creating each range which tends to be expensive, so creating
> two
> ranges (or worse, if every SGL created a range it would be 255) is
> very undesired.

I think it's easier to get us started to just use a helper and
stick it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
the can of worms.

As Jerome mentioned, you can end up with I/O ops containing an sglist
that is a collection of memory and GPU pages for example.

Cheers,
Ben.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > > I think this opens an even bigger can of worms..
> > 
> > No, I don't think it does. You'd only shim when the target page is
> > backed by a device, not host memory, and you can figure this out by
> > a
> > is_zone_device_page()-style lookup.
> 
> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> What does the p2p provider do when it detects a p2p page?

Yeah I think we don't really want to stack dma_ops... thinking more
about it.

As I just wrote, it looks like we might need a more specialised hook
in the devmap to be used by the main dma_op, on a per-page basis.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote:
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
> 
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
> 
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?

Good point. We only know on a per-page basis ... ugh.

So we really need to change the arch main dma_ops. I'm not opposed to
that. What we then need to do is have that main arch dma_map_sg,
when it encounters a "device" page, call into a helper attached to
the devmap to handle *that page*, providing sufficient context.

That helper wouldn't perform the actual iommu mapping. It would simply
return something along the lines of:

 - "use that alternate bus address and don't map in the iommu"
 - "use that alternate bus address and do map in the iommu"
 - "proceed as normal"
 - "fail"

What do you think ?

Cheers,
Ben.
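
For illustration, here is a minimal sketch of what such a per-page devmap
hook could look like. All names (p2p_map_decision, dev_pagemap_p2p_ops)
are hypothetical -- nothing like this exists in the kernel -- and the
sketch only captures the four outcomes listed above:

enum p2p_map_decision {
	P2P_MAP_BUS_ADDR,	/* use the alternate bus address, don't map in the iommu */
	P2P_MAP_IOMMU,		/* use the alternate bus address, do map in the iommu */
	P2P_MAP_NORMAL,		/* proceed as for ordinary system memory */
	P2P_MAP_FAIL,		/* mapping between these two devices is not possible */
};

struct dev_pagemap_p2p_ops {
	/* called by the arch dma_map_* path for each "device" page it hits */
	enum p2p_map_decision (*map_page)(struct device *initiator,
					  struct page *page,
					  dma_addr_t *bus_addr);
};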



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 14:48 -0600, Logan Gunthorpe wrote:
> > ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?
> 
> The main conflict is in dma_map_sg which only does get_dma_ops once but
> the sg may contain memory of different types.

We can handle that in our "overridden" dma ops.

It's a bit tricky but it *could* break it down into segments and
forward portions back to the original dma ops.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 12:00 -0600, Jason Gunthorpe wrote:
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)

We also need to be careful whether P2P is enabled in the switch
or not.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. A 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other". The "helper" is the
> dma api redirecting through a software-iommu that handles bus address
> translation differently than it would handle host memory dma mapping.

Do we even need that function ? The dma_ops have a dma_supported()
call...

If we have those override ops built into the "dma_target" object,
then these things can make that decision knowing both the source
and target device.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Mon, 2017-04-17 at 23:43 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 03:11 PM, Benjamin Herrenschmidt wrote:
> > Is it ? Again, you create a "concept" the user may have no idea about,
> > "p2pmem memory". So now any kind of memory buffer on a device can could
> > be use for p2p but also potentially a bunch of other things becomes
> > special and called "p2pmem" ...
> 
> The user is going to have to have an idea about it if they are designing
> systems to make use of it. I've said it before many times: this is an
> optimization with significant trade-offs so the user does have to make
> decisions regarding when to enable it.

Not necessarily. There are many cases where the "end user" won't have
any idea. In any case, I think the story comes down to those two
points: conflating the allocator with the peer-to-peer DMA, and the
lack of generality in the approach to solving the peer-to-peer DMA
problem.

> > But what do you have in p2pmem that somebody benefits from ? Again I
> > don't understand what that "p2pmem" device buys you in terms of
> > functionality vs. having the device just instantiate the pages.
> 
> Well thanks for just taking a big shit on all of our work without even
> reading the patches. Bravo.

Now now now, calm down. We are being civil here. I'm not shitting
on anything, I'm asking what seems to be a reasonable question in terms
of benefits of the approach you have chosen. Using that sort of
language will not get you anywhere. 

> > Now having some kind of way to override the dma_ops, yes I do get that,
> > and it could be that this "p2pmem" is typically the way to do it, but
> > at the moment you don't even have that. So I'm a bit at a loss here.
> 
> Yes, we've already said many times that this is something we will need
> to add.
> 
> > But it doesn't *have* to be. Again, take my GPU example. The fact that
> > a NIC might be able to DMA into it doesn't make it specifically "p2p
> > memory".
> 
> Just because you use it for other things doesn't mean it can't also
> provide the service of a "p2pmem" device.

But there is no such thing as a "p2pmem" device.. that's what I'm
trying to tell you...

As both Jerome and I tried to explain, there are many reasons why one
may want to do peer DMA into some device memory; that doesn't make that
memory some kind of "p2pmem". It's trying to stick a generic label onto
something that isn't.

That's why I'm suggesting we disconnect the two aspects. On one hand
the problem of handling p2p DMA, whether the target is some memory,
some MMIO registers, etc...

On the other hand, some generic "utility" that can optionally be used
by drivers to manage a pool of DMA memory in the device, essentially a
simple allocator.

The two things are completely orthogonal.
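
As a rough idea of what that stand-alone allocator side could look like,
here is a sketch only, assuming a driver-owned BAR; genalloc is a real
kernel facility, but the helper name and structure are made up:

#include <linux/genalloc.h>

static struct gen_pool *create_bar_pool(struct pci_dev *pdev, int bar)
{
	struct gen_pool *pool;
	resource_size_t start = pci_resource_start(pdev, bar);
	resource_size_t size  = pci_resource_len(pdev, bar);

	pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
	if (!pool)
		return NULL;

	/* track the BAR's CPU physical range; DMA/bus address translation
	 * is a separate problem handled elsewhere */
	if (gen_pool_add(pool, start, size, dev_to_node(&pdev->dev))) {
		gen_pool_destroy(pool);
		return NULL;
	}
	return pool;
}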

> > So now your "p2pmem" device needs to also be laid out on top of those
> > MMIO registers ? It's becoming weird.
> 
> Yes, Max Gurtovoy has also expressed an interest in expanding this work
> to cover things other than memory. He's suggested simply calling it a
> p2p device, but until we figure out what exactly that all means we can't
> really finalize a name.

Possibly. In any case, I think it should be separate from the
allocation.

> > See, basically, doing peer 2 peer between devices has 3 main challenges
> > today: The DMA API needing struct pages, the MMIO translation issues
> > and the IOMMU translation issues.
> > 
> > You seem to create that added device as some kind of "owner" for the
> > struct pages, solving #1, but leave #2 and #3 alone.
> 
> Well there are other challenges too. Like figuring out when it's
> appropriate to use, tying together the device that provides the memory
> with the driver trying to use it in DMA transactions, etc, etc. Our patch
> set tackles these latter issues.

But it tries to conflate the allocation, which is basically the fact
that this is some kind of "memory pool", with the problem of doing peer
DMA.

I'm advocating for separating the concepts.

> > If we go down that path, though, rather than calling it p2pmem I would
> > call it something like dma_target which I find much clearer especially
> > since it doesn't have to be just memory.
> 
> I'm not set on the name. My arguments have been specifically for the
> existence of an independent struct device. But I'm not really interested
> in getting into bike shedding arguments over what to call it at this
> time when we don't even really know what it's going to end up doing in
> 

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Mon, 2017-04-17 at 10:52 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> > But is it ? For example take a GPU, does it, in your scheme, need an
> > additional "p2pmem" child ? Why can't the GPU driver just use some
> > helper to instantiate the necessary struct pages ? What does having an
> > actual "struct device" child buys you ?
> 
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

Is it ? Again, you create a "concept" the user may have no idea about,
"p2pmem memory". So now any kind of memory buffer on a device can could
be use for p2p but also potentially a bunch of other things becomes
special and called "p2pmem" ...

> > > 2) In order to create the struct pages we use the ZONE_DEVICE
> > > infrastructure which requires a struct device. (See
> > > devm_memremap_pages.)
> > 
> > Yup, but you already have one in the actual pci_dev ... What is the
> > benefit of adding a second one ?
> 
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device.

But what do you have in p2pmem that somebody benefits from ? Again I
don't understand what that "p2pmem" device buys you in terms of
functionality vs. having the device just instantiate the pages.

Now having some kind of way to override the dma_ops, yes I do get that,
and it could be that this "p2pmem" is typically the way to do it, but
at the moment you don't even have that. So I'm a bit at a loss here.
 
>  Having a specific class for this makes it very
> clear how this memory would be handled.

But it doesn't *have* to be. Again, take my GPU example. The fact that
a NIC might be able to DMA into it doesn't make it specifically "p2p
memory".

Essentially you are saying that any device that happens to have a piece
of mappable "memory" (or something that behaves like it) and can be
DMA'ed into should now have that "p2pmem" thing attached to it.

Now take an example where that becomes really awkward (it's also a real
example of something people want to do). I have a NIC and a GPU, the
NIC DMA's data to/from the GPU, but they also want to poke at each
other's doorbells, the GPU to kick the NIC into action when data is ready
to send, the NIC to poke the GPU when data has been received.

Those doorbells are MMIO registers.

So now your "p2pmem" device needs to also be laid out on top of those
MMIO registers ? It's becoming weird.

See, basically, doing peer 2 peer between devices has 3 main challenges
today: The DMA API needing struct pages, the MMIO translation issues
and the IOMMU translation issues.

You seem to create that added device as some kind of "owner" for the
struct pages, solving #1, but leave #2 and #3 alone.

Now, as I said, it could very well be that having the devmap pointer
point to some specific device-type with a well known structure to
provide solutions for #2 and #3 such as dma_ops overrides, is indeed
the right way to solve these problems.

If we go down that path, though, rather than calling it p2pmem I would
call it something like dma_target which I find much clearer especially
since it doesn't have to be just memory.

For the sole case of creating struct pages, however, I fail to see the
point.

>  For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

If we are going to create some sort of struct dma_target, HMM could
potentially just look for the parent if it needs the PCI device.

> > >  This amazingly gets us the get_dev_pagemap
> > > architecture which also uses a struct device. So by using a p2pmem
> > > device we can go from struct page to struct device to p2pmem device
> > > quickly and effortlessly.
> > 
> > Which isn't terribly useful in itself right ? What you care about is
> > the "enclosing" pci_dev no ? Or am I missing something ?
> 
> Sure it is. What if we want to someday support p2pmem that's on another bus?

But why not directly use that other bus' device in that case ?

> > > 3) You wouldn't want to use the pci's struct device because it doesn't
> > > r

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 23:13 -0600, Logan Gunthorpe wrote:
> 
> > 
> > I'm still not 100% sure why you need a "p2pmem device", mind you ...
> 
> Well, you don't "need" it but it is a design choice that I think makes a
> lot of sense for the following reasons:
> 
> 1) p2pmem is in fact a device on the pci bus. A pci driver will need to
> set it up and create the device and thus it will have a natural parent
> pci device. Instantiating a struct device for it means it will appear in
> the device hierarchy and one can use that to reason about its position
> in the topology.

But is it ? For example take a GPU, does it, in your scheme, need an
additional "p2pmem" child ? Why can't the GPU driver just use some
helper to instantiate the necessary struct pages ? What does having an
actual "struct device" child buys you ?

> 2) In order to create the struct pages we use the ZONE_DEVICE
> infrastructure which requires a struct device. (See
> devm_memremap_pages.)

Yup, but you already have one in the actual pci_dev ... What is the
benefit of adding a second one ?

>  This amazingly gets us the get_dev_pagemap
> architecture which also uses a struct device. So by using a p2pmem
> device we can go from struct page to struct device to p2pmem device
> quickly and effortlessly.

Which isn't terribly useful in itself right ? What you care about is
the "enclosing" pci_dev no ? Or am I missing something ?

> 3) You wouldn't want to use the pci's struct device because it doesn't
> really describe what's going on. For example, there may be multiple
> devices on the pci device in question: eg. an NVME card and some p2pmem.

What is "some p2pmem" ?

> Or it could be a NIC with some p2pmem.

Again what is "some p2pmem" ?

That a device might have some memory-like buffer space is all well and
good but does it need to be specifically distinguished at the device
level ? It could be inherent to what the device is... for example again
take the GPU example, why would you call the FB memory "p2pmem" ? 

>  Or it could just be p2pmem by itself. And the logic to figure out what
>  memory is available and where
> the address is will be non-standard so it's really straightforward to
> have any pci driver just instantiate a p2pmem device.

Again I'm not sure why it needs to "instantiate a p2pmem" device. Maybe
it's the term "p2pmem" that puts me off. If p2pmem provided a
standard way to look up the various offsets etc... I mentioned earlier,
then yes, it would make sense to have it as a staging point. As-is, I
don't know. 

> It is probably worth you reading the RFC patches at this point to get a
> better feel for this.

Yup, I'll have another look a bit more in depth.

Cheers,
Ben.


> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 10:34 -0600, Logan Gunthorpe wrote:
> 
> On 16/04/17 09:53 AM, Dan Williams wrote:
> > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> > context about the physical address in question. I'm thinking you can
> > hang bus address translation data off of that structure. This seems
> > vaguely similar to what HMM is doing.
> 
> Thanks! I didn't realize you had the infrastructure to look up a device
> from a pfn/page. That would really come in handy for us.

It does indeed. I won't be able to play with that much for a few weeks
(see my other email) so if you're going to tackle this while I'm away,
can you work with Jerome to make sure you don't conflict with HMM ?

I really want a way for HMM to be able to lay out struct pages over the
GPU BARs rather than in "allocated free space" for the case where the
BAR is big enough to cover all of the GPU memory.

In general, I'd like a simple & generic way for any driver to ask the
core to lay out DMA'able struct pages over BAR space. I am not convinced
this requires a "p2pmem device" to be created on top of this, but
that's a different discussion.

Of course the actual ability to perform the DMA mapping will be subject
to various restrictions that will have to be implemented in the actual
"dma_ops override" backend. We can have generic code to handle the case
where devices reside on the same domain, which can deal with switch
configuration etc... we will need to have iommu specific code to handle
the case going through the fabric. 

Virtualization is a separate can of worms due to how qemu completely
fakes the MMIO space, we can look into that later.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 10:47 -0600, Logan Gunthorpe wrote:
> > I think you need to give other archs a chance to support this with a
> > design that considers the offset case as a first class citizen rather
> > than an afterthought.
> 
> I'll consider this. Given the fact I can use your existing
> get_dev_pagemap infrastructure to look up the p2pmem device this
> probably isn't as hard as I thought it would be anyway (we probably
> don't even need a page flag). We'd just have to look up the dev_pagemap,
> test if it's a p2pmem device, and if so, call a p2pmem_dma_map function
> which could apply the offset or do any other arch specific logic (if
> necessary).

I'm still not 100% sure why you need a "p2pmem device", mind you ...

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 08:53 -0700, Dan Williams wrote:
> > Just thinking out loud ... I don't have a firm idea or a design. But
> > peer to peer is definitely a problem we need to tackle generically, the
> > demand for it keeps coming up.
> 
> ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> context about the physical address in question. I'm thinking you can
> hang bus address translation data off of that structure. This seems
> vaguely similar to what HMM is doing.

Ok, that's interesting. That would be a way to handle the "lookup" I
was mentioning in the email I sent a few minutes ago.

We would probably need to put some "structure" to that context.
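
As a strawman for that structure (hypothetical names throughout -- only
get_dev_pagemap() and is_zone_device_page() exist today, the p2p fields
would be the new part):

struct p2p_context {
	struct pci_dev *provider;	/* device whose BAR backs these pages */
	u64 bus_offset;			/* PCI bus address minus CPU physical address */
};

static struct p2p_context *page_to_p2p_context(struct page *page)
{
	struct dev_pagemap *pgmap;

	if (!is_zone_device_page(page))
		return NULL;

	/* reference counting on the pgmap elided for brevity */
	pgmap = get_dev_pagemap(page_to_pfn(page), NULL);
	return pgmap ? pgmap->p2p_ctx : NULL;	/* p2p_ctx is the hypothetical new field */
}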

I'm very short on time to look into the details of this for at least
a month (I'm taking about 3 weeks off for personal reasons next week),
but I'm happy to dive more into this when I'm back and sort out with
Jerome how to make it all co-habitate nicely with HMM.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 08:44 -0700, Dan Williams wrote:
> The difference is that there was nothing fundamental in the core
> design of pmem + DAX that prevented other archs from growing pmem
> support.

Indeed. In fact we have work in progress support for pmem on power
using experimental HW.

> THP and memory hotplug existed on other architectures and
> they just need to plug in their arch-specific enabling. p2p support
> needs the same starting point of something more than one architecture
> can plug into, and handling the bus address offset case needs to be
> incorporated into the design.
> 
> pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

The more I think about it, the more I tend toward something along the
lines of having the arch DMA ops be able to quickly differentiate
between "normal" memory (which includes non-PCI pmem in some cases,
it's an architecture choice I suppose) and "special device" (page flag
? pfn bit ? ... there are options).

From there, we keep our existing fast path for the normal case.

For the special case, we need to provide a fast lookup mechanism
(assuming we can't stash enough stuff in struct page or the pfn)
to get back to a struct of some sort that provides the necessary
information to resolve the translation.

This *could* be something like a struct p2mem device that carries
a special set of DMA ops, though we probably shouldn't make the generic
structure PCI specific.

This is a slightly slower path, but that "stub" structure allows the
special DMA ops to provide the necessary bus-specific knowledge, which
for PCI for example, can check whether the devices are on the same
segment, whether the switches are configured to allow p2p, etc...

What form should that fast lookup take ? It's not completely clear to
me at that point. We could start with a simple linear lookup I suppose
and improve in a second stage.
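
Roughly, that fast path / slow path split inside an arch dma_map_sg()
might look something like the sketch below; arch_map_normal_page() and
p2p_map_page() are placeholders for the existing arch path and the new
hook, not existing functions:

static int arch_dma_map_sg(struct device *dev, struct scatterlist *sgl,
			   int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		struct page *page = sg_page(sg);

		if (likely(!is_zone_device_page(page))) {
			/* existing fast path for normal RAM */
			sg->dma_address = arch_map_normal_page(dev, page,
						sg->offset, sg->length, dir);
			continue;
		}

		/* slow path: ask the stub structure behind this page how
		 * (and whether) the initiator may reach it */
		if (p2p_map_page(dev, page, sg, dir) < 0)
			return 0;
	}
	return nents;
}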

Of course this pipes into the old discussion about disconnecting
the DMA ops from struct page. If we keep struct page, any device that
wants to be a potential DMA target will need to do something "special"
to create those struct pages etc.. though we could make that a simple
pci helper that pops the necessary bits and pieces for a given BAR &
range.

If we don't need struct page, then it might be possible to hide it all
in the PCI infrastructure.
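
That "simple pci helper" for the struct page case might be little more
than a wrapper around the existing ZONE_DEVICE machinery. A sketch, with
a made-up name, assuming devm_memremap_pages() as it exists at the time
of this thread:

static void *pci_p2p_remap_bar(struct pci_dev *pdev, int bar,
			       size_t offset, size_t size,
			       struct percpu_ref *ref)
{
	struct resource *res;

	res = devm_kzalloc(&pdev->dev, sizeof(*res), GFP_KERNEL);
	if (!res)
		return ERR_PTR(-ENOMEM);

	res->start = pci_resource_start(pdev, bar) + offset;
	res->end   = res->start + size - 1;
	res->flags = IORESOURCE_MEM;

	/* creates ZONE_DEVICE struct pages covering this BAR range */
	return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}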

> > Virtualization specifically would be a _lot_ more difficult than simply
> > supporting offsets. The actual topology of the bus will probably be lost
> > on the guest OS and it would therefor have a difficult time figuring out
> > when it's acceptable to use p2pmem. I also have a difficult time seeing
> > a use case for it and thus I have a hard time with the argument that we
> > can't support use cases that do want it because use cases that don't
> > want it (perhaps yet) won't work.
> > 
> > > This is an interesting experiment to look at I suppose, but if you
> > > ever want this upstream I would like at least for you to develop a
> > > strategy to support the wider case, if not an actual implementation.
> > 
> > I think there are plenty of avenues forward to support offsets, etc.
> > It's just work. Nothing we'd be proposing would be incompatible with it.
> > We just don't want to have to do it all upfront especially when no one
> > really knows how well various architecture's hardware supports this or
> > if anyone even wants to run it on systems such as those. (Keep in mind
> > this is a pretty specific optimization that mostly helps systems
> > designed in specific ways -- not a general "everybody gets faster" type
> > situation.) Get the cases working we know will work, can easily support
> > and people actually want.  Then expand it to support others as people
> > come around with hardware to test and use cases for it.
> 
> I think you need to give other archs a chance to support this with a
> design that considers the offset case as a first class citizen rather
> than an afterthought.

Thanks :-) There's a reason why I'm insisting on this. We have constant
requests for this today. We have hacks in the GPU drivers to do it for
GPUs behind a switch, but those are just that, ad-hoc hacks in the
drivers. We have similar grossness around the corner with some CAPI
NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
to whack NVMe devices.

I'm very interested in a more generic solution to deal with the problem
of P2P between devices. I'm happy to contribute with code to handle the
powerpc bits but we need to agree on the design first :)

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-15 Thread Benjamin Herrenschmidt
On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
> I'm wondering, since this is limited to support behind a single
> switch, if you could have a software-iommu hanging off that switch
> device object that knows how to catch and translate the non-zero
> offset bus address case. We have something like this with VMD driver,
> and I toyed with a soft pci bridge when trying to support AHCI+NVME
> bar remapping. When the dma api looks up the iommu for its device it
> hits this soft-iommu and that driver checks if the page is host memory
> or device memory to do the dma translation. You wouldn't need a bit in
> struct page, just a lookup to the hosting struct dev_pagemap in the
> is_zone_device_page() case and that can point you to p2p details.

I was thinking about a hook in the arch DMA ops but that kind of
wrapper might work instead indeed. However I'm not sure what's the best
way to "instantiate" it.

The main issue is that the DMA ops are a function of the initiator,
not the target (since the target is supposed to be memory) so things
are a bit awkward.

One (user ?) would have to know that a given device "intends" to DMA
directly to another device.

This is awkward because in the ideal scenario, this isn't something the
device knows. For example, one could want to have an existing NIC DMA
directly to/from NVME pages or GPU pages.

The NIC itself doesn't know the characteristic of these pages, but
*something* needs to insert itself in the DMA ops of that bridge to
make it possible.

That's why I wonder if it's the struct page of the target that should
be "marked" in such a way that the arch dma'ops can immediately catch
that they belong to a device and might require "wrapped" operations.

Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
flag ?)
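
As it happens, they are identifiable from the struct page alone -- when
ZONE_DEVICE is configured in, the check is on the zone number rather
than a page flag, roughly:

static inline bool is_zone_device_page(const struct page *page)
{
	return page_zonenum(page) == ZONE_DEVICE;
}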

That would allow us to keep a fast path for normal memory targets, but
also have some kind of way to handle the special cases of such peer 2
peer (or also handle other type of peer to peer that don't necessarily
involve PCI address wrangling but could require additional iommu bits).

Just thinking out loud ... I don't have a firm idea or a design. But
peer to peer is definitely a problem we need to tackle generically, the
demand for it keeps coming up.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-15 Thread Benjamin Herrenschmidt
On Sat, 2017-04-15 at 11:41 -0600, Logan Gunthorpe wrote:
> Thanks, Benjamin, for the summary of some of the issues.
> 
> On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote
> > So I assume the p2p code provides a way to address that too via special
> > dma_ops ? Or wrappers ?
> 
> Not at this time. We will probably need a way to ensure the iommus do
> not attempt to remap these addresses.

You can't. If the iommu is on, everything is remapped. Or do you mean
to have dma_map_* not do a remapping ?

That's the problem again, same as before, for that to work, the
dma_map_* ops would have to do something special that depends on *both*
the source and target device.

The current DMA infrastructure doesn't have anything like that. It's a
rather fundamental issue to your design that you need to address.

The dma_ops today are architecture specific and have no way to
differentiate between normal and those special P2P DMA pages.

> Though if it does, I'd expect
> everything would still work you just wouldn't get the performance or
> traffic flow you are looking for. We've been testing with the software
> iommu which doesn't have this problem.

So first, no, it's more than "you wouldn't get the performance". On
some systems it may also just not work. Also what do you mean by "the
SW iommu doesn't have this problem" ? It catches the fact that
addresses don't point to RAM and maps differently ?

> > The problem is that the latter while seemingly easier, is also slower
> > and not supported by all platforms and architectures (for example,
> > POWER currently won't allow it, or rather only allows a store-only
> > subset of it under special circumstances).
> 
> Yes, I think situations where we have to cross host bridges will remain
> unsupported by this work for a long time. There are too many cases where
> it just doesn't work or it performs too poorly to be useful.

And the situation where you don't cross bridges is the one where you
need to also take into account the offsets.

*both* cases mean that you need to somehow intervene at the dma_ops
level to handle this. Which means having a way to identify your special
struct pages or PFNs to allow the arch to add a special case to the
dma_ops.

> > I don't fully understand how p2pmem "solves" that by creating struct
> > pages. The offset problem is one issue. But there's the iommu issue as
> > well, the driver cannot just use the normal dma_map ops.
> 
> We are not using a proper iommu and we are dealing with systems that
> have zero offset. This case is also easily supported. I expect fixing
> the iommus to not map these addresses would also be reasonably achievable.

So you are designing something that is built from scratch to only work
on a specific limited category of systems and is also incompatible with
virtualization.

This is an interesting experiment to look at I suppose, but if you
ever want this upstream I would like at least for you to develop a
strategy to support the wider case, if not an actual implementation.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Fri, 2017-04-14 at 14:04 -0500, Bjorn Helgaas wrote:
> I'm a little hesitant about excluding offset support, so I'd like to
> hear more about this.
> 
> Is the issue related to PCI BARs that are not completely addressable
> by the CPU?  If so, that sounds like a first-class issue that should
> be resolved up front because I don't think the PCI core in general
> would deal well with that.
> 
> If all the PCI memory of interest is in fact addressable by the CPU,
> I would think it would be pretty straightforward to support offsets
> --
> everywhere you currently use a PCI bus address, you just use the
> corresponding CPU physical address instead.

It's not *that* easy sadly. The reason is that normal dma map APIs
assume the "target" of the DMAs are system memory, there is no way to
pass it "another device", in a way that allows it to understand the
offsets if needed.

That said, dma_map will already be problematic when doing p2p behind
the same bridge due to the fact that the iommu is not in the way so you
can't use the arch standard ops there.

So I assume the p2p code provides a way to address that too via special
dma_ops ? Or wrappers ?

Basically there are two very different ways you can do p2p, either
behind the same host bridge or across two host bridges:

 - Behind the same host bridge, you don't go through the iommu, which
means that even if your target looks like a struct page, you can't just
use dma_map_* on it because what you'll get back from that is an iommu
token, not a sibling BAR offset. Additionally, if you use the target
struct resource address, you will need to offset it to get back to the
actual BAR value on the PCIe bus.

 - Behind different host bridges, then you go through the iommu and the
host remapping. IE. that's actually the easy case. You can probably
just use the normal iommu path and normal dma mapping ops, you just
need to have your struct page representing the target device BAR *in
CPU space* this time. So no offsetting required either.

The problem is that the latter while seemingly easier, is also slower
and not supported by all platforms and architectures (for example,
POWER currently won't allow it, or rather only allows a store-only
subset of it under special circumstances).

So what people practically want to do is to have 2 devices behind a
switch DMA'ing to/from each other.

But that brings the 2 problems above.

I don't fully understand how p2pmem "solves" that by creating struct
pages. The offset problem is one issue. But there's the iommu issue as
well, the driver cannot just use the normal dma_map ops.

I haven't had a chance to look at the details of the patches but it's
not clear from the description in patch 0 how that is solved.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> 
> On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > I'd suggest just detecting if there is any translation in bus
> > addresses anywhere and just hard disabling P2P on such systems.
> 
> That's a fantastic suggestion. It simplifies things significantly.
> Unless there are any significant objections I think I will plan on
> doing
> that.

I object.

> > On modern hardware with 64 bit BARs there is very little reason to
> > have translation, so I think this is a legacy feature.
> 
> Yes, p2pmem users are likely to be designing systems around it (ie
> JBOFs) and not trying to shoehorn it onto legacy architectures.
> 
> At the very least, it makes sense to leave it out and if someone
> comes
> along who cares they can put in the effort to support the address
> translation.
> 
> Thanks,
> 
> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Fri, 2017-04-14 at 21:37 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> > 
> > On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > > I'd suggest just detecting if there is any translation in bus
> > > addresses anywhere and just hard disabling P2P on such systems.
> > 
> > That's a fantastic suggestion. It simplifies things significantly.
> > Unless there are any significant objections I think I will plan on
> > doing
> > that.
> 
> I object.

Note: It would also make your stuff fundamentally incompatible with
KVM guest pass-through since KVM plays with remapping BARs all over
the place.

Ben.

> > > On modern hardware with 64 bit BARs there is very little reason
> > > to
> > > have translation, so I think this is a legacy feature.
> > 
> > Yes, p2pmem users are likely to be designing systems around it (ie
> > JBOFs) and not trying to shoehorn it onto legacy architectures.
> > 
> > At the very least, it makes sense to leave it out and if someone
> > comes
> > along who cares they can put in the effort to support the address
> > translation.
> > 
> > Thanks,
> > 
> > Logan



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 22:16 -0600, Jason Gunthorpe wrote:
> > Any caller of pci_add_resource_offset() uses CPU addresses different from
> > the PCI bus addresses (unless the offset is zero, of course).  All ACPI
> > platforms also support this translation (see "translation_offset"), though
> > in most x86 systems the offset is zero.  I'm aware of one x86 system that
> > was tested with a non-zero offset but I don't think it was shipped that
> > way.
> 
> I'd suggest just detecting if there is any translation in bus
> addresses anywhere and just hard disabling P2P on such systems.

I object to designing a subsystem that by design cannot work on whole
categories of architectures out there.

> On modern hardware with 64 bit BARs there is very little reason to
> have translation, so I think this is a legacy feature.

No it's not.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-13 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 15:22 -0600, Logan Gunthorpe wrote:
> 
> On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote:
> > Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
> > will perform the conversion between the struct resource content (CPU
> > physical address) and the actual PCI bus side address.
> 
> Ah, thanks for the tip! On my system, this translation returns the same
> address so it was not necessary. And, yes, that means this would have to
> find its way into the dma mapping routine somehow. This means we'll
> eventually need a way to look-up the p2pmem device from the struct page.
> Which means we will likely need a new flag bit in the struct page or
> something. The big difficulty I see is testing. Do you know what
> architectures or in what circumstances are these translations used?

I think a bunch of non-x86 architectures but I don't know which ones
outside of powerpc.

> > When behind the same switch you need to use PCI addresses. If one tries
> > later to do P2P between host bridges (via the CPU fabric) things get
> > more complex and one will have to use either CPU addresses or something
> > else altogether (probably would have to teach the arch DMA mapping
> > routines to work with those struct pages you create and return the
> > right thing).
> 
> Probably for starters we'd want to explicitly deny cases between host
> bridges and add that later if someone wants to do the testing.

Cheers,
Ben.

> Thanks,
> 
> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-12 Thread Benjamin Herrenschmidt
On Wed, 2017-04-12 at 11:09 -0600, Logan Gunthorpe wrote:
> 
> > Do you handle funky address translation too ? IE. the fact that the PCI
> > addresses aren't the same as the CPU physical addresses for a BAR ?
> 
> No, we use the CPU physical address of the BAR. If it's not mapped that
> way we can't use it.

Ok, you need to fix that or a bunch of architectures won't work. 

Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
will perform the conversion between the struct resource content (CPU
physical address) and the actual PCI bus side address.

When behind the same switch you need to use PCI addresses. If one tries
later to do P2P between host bridges (via the CPU fabric) things get
more complex and one will have to use either CPU addresses or something
else altogether (probably would have to teach the arch DMA mapping
routines to work with those struct pages you create and return the
right thing).
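
For the simple same-switch case, the conversion being described boils
down to something like the following; the wrapper name is made up, but
pcibios_resource_to_bus() is the real interface:

static pci_bus_addr_t bar_cpu_to_bus(struct pci_dev *pdev, int bar)
{
	struct pci_bus_region region;

	/* translate the BAR's CPU-physical resource into the address
	 * the devices actually see on the PCI bus */
	pcibios_resource_to_bus(pdev->bus, &region, &pdev->resource[bar]);
	return region.start;
}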

> > > This will mean many setups that could likely
> > > work well will not be supported so that we can be more confident it
> > > will work and not place any responsibility on the user to understand
> > > their topology. (We've chosen to go this route based on feedback we
> > > received at LSF).
> > > 
> > > In order to enable this functionality we introduce a new p2pmem device
> > > which can be instantiated by PCI drivers. The device will register some
> > > PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
> > > users of these devices to get buffers.
> > 
> > I don't completely understand this. This is actual memory on the PCI
> > bus ? Where does it come from ? Or are you just trying to create struct
> > pages that cover your PCIe DMA target ?
> 
> Yes, the memory is on the PCI bus in a BAR. For now we have a special
> PCI card for this, but in the future it would likely be the CMB in an
> NVMe card. These patches create struct pages to map these BAR addresses
> using ZONE_DEVICE.

Ok.

So ideally we'd want things like dma_map_* to be able to be fed those
struct pages and do the right thing which is ... tricky, especially
with the address translation I mentioned since the address will be
different whether the initiator is on the same host bridge as the
target or not.

> > So correct me if I'm wrong, you are trying to create struct page's that
> > map a PCIe BAR right ? I'm trying to understand how that interacts with
> > what Jerome is doing for HMM.
> 
> Yes, well we are using ZONE_DEVICE in the exact same way as the dax code
> is. These patches use the existing API with no modifications. As I
> understand it, HMM was using ZONE_DEVICE in a way that was quite
> different to how it was originally designed.

Sort-of. I don't see why there would be a conflict with the struct
pages use though. Jerome can you chime in ? Jerome: It looks like they
are just laying out struct page over a BAR which is the same thing I
think you should do when the BAR is "large enough" for the GPU memory.

The case where HMM uses "holes" in the address space for its struct
page is somewhat orthogonal but I also see no conflict here.

> > The reason is that the HMM currently creates the struct pages with
> > "fake" PFNs pointing to a hole in the address space rather than
> > covering the actual PCIe memory of the GPU. He does that to deal with
> > the fact that some GPUs have a smaller aperture on PCIe than their
> > total memory.
> 
> I'm aware of what HMM is trying to do and although I'm not familiar with
> the intimate details, I saw it as fairly orthogonal to what we are
> attempting to do.

Right.

> > However, I have asked him to only apply that policy if the aperture is
> > indeed smaller, and if not, create struct pages that directly cover the
> > PCIe BAR of the GPU instead, which will work better on systems or
> > architecture that don't have a "pinhole" window limitation.
> > However he was under the impression that this was going to collide with
> > what you guys are doing, so I'm trying to understand how. 
> 
> I'm not sure I understand how either. However, I suspect if you collide
> with these patches then you'd also be breaking dax too.

Possibly but as I said, I don't see why so I'll let Jerome chime in
since he was under the impression that there was a conflict here :-)

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-11 Thread Benjamin Herrenschmidt
On Thu, 2017-03-30 at 16:12 -0600, Logan Gunthorpe wrote:
> Hello,
> 
> As discussed at LSF/MM we'd like to present our work to enable
> copy offload support in NVMe fabrics RDMA targets. We'd appreciate
> some review and feedback from the community on our direction.
> This series is not intended to go upstream at this point.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, the trade-off is
> currently a reduction in overall throughput (largely due to hardware
> issues that would certainly improve in the future).

Another issue of course is that not all systems support P2P
between host bridges :-) (Though almost all switches can enable it).

> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch.

Ok. I suppose that's a reasonable starting point. I haven't looked
at the patches in detail yet, but it would be nice if that policy was in
a well isolated component so it can potentially be affected by
arch/platform code.

Do you handle funky address translation too ? IE. the fact that the PCI
addresses aren't the same as the CPU physical addresses for a BAR ?

> This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We've chosen to go this route based on feedback we
> received at LSF).
> 
> In order to enable this functionality we introduce a new p2pmem device
> which can be instantiated by PCI drivers. The device will register some
> PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
> users of these devices to get buffers.

I don't completely understand this. This is actual memory on the PCI
bus ? Where does it come from ? Or are you just trying to create struct
pages that cover your PCIe DMA target ?
 
> We give an example of enabling
> p2p memory with the cxgb4 driver, however currently these devices have
> some hardware issues that prevent their use so we will likely be
> dropping this patch in the future. Ideally, we'd want to enable this
> functionality with NVME CMB buffers, however we don't have any hardware
> with this feature at this time.

So correct me if I'm wrong, you are trying to create struct page's that
map a PCIe BAR right ? I'm trying to understand how that interacts with
what Jerome is doing for HMM.

The reason is that the HMM currently creates the struct pages with
"fake" PFNs pointing to a hole in the address space rather than
covering the actual PCIe memory of the GPU. He does that to deal with
the fact that some GPUs have a smaller aperture on PCIe than their
total memory.

However, I have asked him to only apply that policy if the aperture is
indeed smaller, and if not, create struct pages that directly cover the
PCIe BAR of the GPU instead, which will work better on systems or
architecture that don't have a "pinhole" window limitation.

However he was under the impression that this was going to collide with
what you guys are doing, so I'm trying to understand how. 

> In nvmet-rdma, we attempt to get an appropriate p2pmem device at
> queue creation time and if a suitable one is found we will use it for
> all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
> attribute is also created which is required to be set before any p2pmem
> is attempted.
> 
> This patchset also includes a more controversial patch which provides an
> interface for userspace to obtain p2pmem buffers through an mmap call on
> a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
> O_DIRECT interfaces. However, the user would be entirely responsible for
> knowing what they're doing and inspecting sysfs to understand the pci
> topology and only using it in sane situations.
> 
> Thanks,
> 
> Logan
> 
> 
> Logan Gunthorpe (6):
>   Introduce Peer-to-Peer memory (p2pmem) device
>   nvmet: Use p2pmem in nvme target
>   scatterlist: Modify SG copy functions to support io memory.
>   nvmet: Be careful about using iomem accesses when dealing with p2pmem
>   p2pmem: Support device removal
>   p2pmem: Added char device user interface
> 
> Steve Wise (2):
>   cxgb4: setup pcie memory window 4 and create p2pmem region
>   p2pmem: Add debugfs "stats" file
> 
>  drivers/memory/Kconfig  |   5 +
>  drivers/memory/Makefile |   2 +
>  drivers/memory/p2pmem.c | 697 
> 
>  drivers/net/ethe

Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-03-06 Thread Benjamin Herrenschmidt
On Mon, 2017-03-06 at 22:46 -0500, Martin K. Petersen wrote:
> > > > > > "Mauricio" == Mauricio Faria de Oliveira  > > > > > et.ibm.com> writes:
> 
> Mauricio> On 02/12/2017 07:49 PM, Anton Blanchard wrote:
> > > We see lpfc devices regularly fail during kexec. Fix this by
> > > adding a
> > > shutdown method which mirrors the remove method.
> 
> Mauricio> @mkp, @jejb: may you please flag this patch for
> stable?  Thank
> Mauricio> you.
> 
> I don't recall a consensus being reached on this patch.

What would be the opposition ? Without it kexec breaks. With it, it
works ...

Now we all seem to agree that kexec should be overhauled to not use
shutdown but instead unbind drivers, but that's a more long term
project that nobody yet had a chance to tackle.

In the meantime, some systems need a functioning kexec to boot.

Cheers,
Ben.



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-13 Thread Benjamin Herrenschmidt
On Tue, 2017-02-14 at 15:45 +1300, Eric W. Biederman wrote:
> The only difference ever that should exist between shutdown and remove
> is do you clean up kernel data structures.  The shutdown method is
> allowed to skip the cleanup up kernel data structures that the remove
> method needs to make.
> 
> Assuming the kernel does not have corrupted data structures I don't see
> any practical reason for distinguishing between the two.  There may be
> some real world gotchas we have to deal with but semantically I don't
> see any concerns.

We had historical additions to shutdown actually shutting things down,
like spinning platters down to park disk heads, switching devices
into D3 on some systems, etc... that remove never had (and we don't
want for kexec).

Cheers,
Ben.


Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-13 Thread Benjamin Herrenschmidt
On Mon, 2017-02-13 at 15:57 -0600, Brian King wrote:
> If we do transition to use remove rather than shutdown, I think we want
> some way for a device driver to know whether we are doing kexec or not.
> A RAID adapter with a write cache is going to want to flush its write
> cache on a PCI hotplug remove, but for a kexec, it's going to want to
> skip that so the kexec is faster. Today, since kexec looks like a reboot
> rather than a shutdown, we can skip the flush on a reboot, since it's
> technically not needed there either.

What happens if a non-flushed adapter gets a PERST ?

I wouldn't trust that 'don't have to flush' magic ...

Cheers,
Ben.



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-12 Thread Benjamin Herrenschmidt
On Mon, 2017-02-13 at 13:21 +1300, Eric W. Biederman wrote:
> > Good point, at the very least we should call remove if shutdown doesn't
> > exist. Eric: could we make the changes Ben suggests?
> 
> Definitely.  That was the original design of the kexec interface
> but people were worried about calling remove during reboot might
> introduce regressions on the reboot functionality.  So shutdown was
> added to be remove without the cleaning up the kernel data structures.

Right. Part of the problem was also that remove was dependent on
CONFIG_HOTPLUG, though that's no longer the case.

The problem is that a bunch of drivers either don't have a shutdown or
worse, have one that actually shuts the HW down rather than "idle" it
which puts it in a state that the new kernel can't deal with.

> I am all for changing the core to call remove.  That seems to be a more
> tested code path (because remove tends to be part of the development
> path for modules) and thus much more likely to work in practice.
> 
> The challenge with changes in this area is that when the infrastructure
> is changed for everyone someone needs to baby sit it until all of the
> unexpected issues are resolved.  I was hoping a couple of years ago that
> Ben could be that person.

Correct. And I lack time. But since we are a platform that uses kexec
daily as our main booting mechanism we should probably be the ones
driving that.

I had a patch at least to fall back to remove in the absence of shutdown,
which I can try to dig out.

We can add a config option to make it do the other way around that
we should start testing internally. I'm sure we will find 1 or 2
regressions as drivers do weird things.

Cheers,
Ben.
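
Shaped very roughly, such a fallback at the PCI bus level could look like
the sketch below (an illustration only, not the patch being referred to):

static void pci_device_shutdown_with_fallback(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	struct pci_driver *drv = pci_dev->driver;

	if (!drv)
		return;

	if (drv->shutdown)
		drv->shutdown(pci_dev);
	else if (drv->remove)
		drv->remove(pci_dev);	/* quiesce via remove() when no shutdown() exists */
}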



Re: [PATCH] scsi: lpfc: Add shutdown method for kexec

2017-02-12 Thread Benjamin Herrenschmidt
On Mon, 2017-02-13 at 08:49 +1100, Anton Blanchard wrote:
> From: Anton Blanchard 
> 
> We see lpfc devices regularly fail during kexec. Fix this by adding
> a shutdown method which mirrors the remove method.

Or instead finally do what I've been advocating for years (and even
sent patches for) which is to have kexec call remove instead of
shutdown.

Shutdown is and has *always* been the wrong thing to do.

> Signed-off-by: Anton Blanchard 
> ---
>  drivers/scsi/lpfc/lpfc_init.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_init.c
> b/drivers/scsi/lpfc/lpfc_init.c
> index 4776fd8..10f75ad 100644
> --- a/drivers/scsi/lpfc/lpfc_init.c
> +++ b/drivers/scsi/lpfc/lpfc_init.c
> @@ -11447,6 +11447,7 @@ static struct pci_driver lpfc_driver = {
>   .id_table   = lpfc_id_table,
>   .probe  = lpfc_pci_probe_one,
>   .remove = lpfc_pci_remove_one,
> + .shutdown   = lpfc_pci_remove_one,
>   .suspend= lpfc_pci_suspend_one,
>   .resume = lpfc_pci_resume_one,
>   .err_handler= &lpfc_err_handler,


Re: [PATCH] ibmvscsi: add write memory barrier to CRQ processing

2016-12-09 Thread Benjamin Herrenschmidt
On Wed, 2016-12-07 at 17:31 -0600, Tyrel Datwyler wrote:
> The first byte of each CRQ entry is used to indicate whether an entry is
> a valid response or free for the VIOS to use. After processing a
> response the driver sets the valid byte to zero to indicate the entry is
> now free to be reused. Add a memory barrier after this write to ensure
> no other stores are reordered when updating the valid byte.

Which "other stores" specifically ? This smells fishy without that
precision. It's important to always understand what exactly barriers
order with.

Cheers,
Ben.

> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index d9534ee..2f5b07e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -232,6 +232,7 @@ static void ibmvscsi_task(void *data)
> >     while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
> >     ibmvscsi_handle_crq(crq, hostdata);
> >     crq->valid = VIOSRP_CRQ_FREE;
> > +   wmb();
> >     }
>  
> >     vio_enable_interrupts(vdev);
> @@ -240,6 +241,7 @@ static void ibmvscsi_task(void *data)
> >     vio_disable_interrupts(vdev);
> >     ibmvscsi_handle_crq(crq, hostdata);
> >     crq->valid = VIOSRP_CRQ_FREE;
> > +   wmb();
> >     } else {
> >     done = 1;
> >     }



[PATCH] scsi/ipr: Fix runaway IRQs when falling back from MSI to LSI

2016-11-23 Thread Benjamin Herrenschmidt
LSIs must be ack'ed with an MMIO, otherwise they remain asserted
forever. This is controlled by the "clear_isr" flag.

While we set that flag properly when deciding initially whether
to use LSIs or MSIs, we fail to set it if we first choose MSIs,
the test fails, and we then fall back to LSIs.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/scsi/ipr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index 5324741..5dd3194 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -10213,6 +10213,7 @@ static int ipr_probe_ioa(struct pci_dev *pdev,
}
 
ioa_cfg->intr_flag = IPR_USE_LSI;
+   ioa_cfg->clear_isr = 1;
ioa_cfg->nvectors = 1;
}
else if (rc)



Re: [PATCH v4 2/3] cxlflash: Superpipe support

2015-08-10 Thread Benjamin Herrenschmidt
On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote:
> Add superpipe supporting infrastructure to device driver for the IBM CXL
> Flash adapter. This patch allows userspace applications to take advantage
> of the accelerated I/O features that this adapter provides and bypass the
> traditional filesystem stack.

 So in a similar vein to the previous review, I am missing a lot of
context here, but a few things did spring to my eye:

> +/**
> + * lookup_local() - find a local LUN information structure by WWID
> + * @cfg: Internal structure associated with the host.
> + * @wwid:WWID associated with LUN.
> + *
> + * Return: Found local lun_info structure on success, NULL on failure
> + */
> +static struct llun_info *lookup_local(struct cxlflash_cfg *cfg, u8 *wwid)
> +{
> + struct llun_info *lli, *temp;
> + ulong lock_flags;
> +
> + spin_lock_irqsave(&cfg->slock, lock_flags);
> +
> + list_for_each_entry_safe(lli, temp, &cfg->lluns, list)
> + if (!memcmp(lli->wwid, wwid, DK_CXLFLASH_MANAGE_LUN_WWID_LEN)) {
> + lli->newly_created = false;

This is weird ... a lookup effectively changes the state of the object
looked up... what for ? There is something oddball here.

It might be legit but in that case, you should really add a comment
explaining the logic around that 'newly_created' field.

Also you drop the lock right below but you have no refcounting, are
these objects ever disposed of ?

In general, can you provide a clearer explanation of what are "global"
vs "local" LUNs ?

> + spin_unlock_irqrestore(&cfg->slock, lock_flags);
> + return lli;
> + }
> +
> + spin_unlock_irqrestore(&cfg->slock, lock_flags);
> + return NULL;
> +}
> +
> +/**
> + * lookup_global() - find a global LUN information structure by WWID
> + * @wwid:WWID associated with LUN.
> + *
> + * Return: Found global lun_info structure on success, NULL on failure
> + */
> +static struct glun_info *lookup_global(u8 *wwid)
> +{
> + struct glun_info *gli, *temp;
> + ulong lock_flags;
> +
> + spin_lock_irqsave(&global.slock, lock_flags);
> +
> + list_for_each_entry_safe(gli, temp, &global.gluns, list)
> + if (!memcmp(gli->wwid, wwid, DK_CXLFLASH_MANAGE_LUN_WWID_LEN)) {
> + spin_unlock_irqrestore(&global.slock, lock_flags);
> + return gli;
> + }
> +
> + spin_unlock_irqrestore(&global.slock, lock_flags);
> + return NULL;
> +}

Same ...

> +/**
> + * lookup_lun() - find or create a local LUN information structure
> + * @sdev:SCSI device associated with LUN.
> + * @wwid:WWID associated with LUN.
> + *
> + * When a local LUN is not found and a global LUN is also not found, both
> + * a global LUN and local LUN are created. The global LUN is added to the
> + * global list and the local LUN is returned.

Make the function name more explicit: find_or_create_lun() for example.
I very much dislike a function called "lookup" that has side effects.

> + * Return: Found/Allocated local lun_info structure on success, NULL on 
> failure
> + */
> +static struct llun_info *lookup_lun(struct scsi_device *sdev, u8 *wwid)
> +{
> + struct llun_info *lli = NULL;
> + struct glun_info *gli = NULL;
> + struct Scsi_Host *shost = sdev->host;
> + struct cxlflash_cfg *cfg = shost_priv(shost);
> + ulong lock_flags;
> +
> + if (unlikely(!wwid))
> + goto out;
> +
> + lli = lookup_local(cfg, wwid);
> + if (lli)
> + goto out;
> +
> + lli = create_local(sdev, wwid);
> + if (unlikely(!lli))
> + goto out;

Similar question to earlier, you have no refcounting, should I assume
these things never get removed ?

> + gli = lookup_global(wwid);
> + if (gli) {
> + lli->parent = gli;
> + spin_lock_irqsave(&cfg->slock, lock_flags);
> + list_add(&lli->list, &cfg->lluns);
> + spin_unlock_irqrestore(&cfg->slock, lock_flags);
> + goto out;
> + }
> +
> + gli = create_global(sdev, wwid);
> + if (unlikely(!gli)) {
> + kfree(lli);
> + lli = NULL;
> + goto out;
> + }
> +
> + lli->parent = gli;
> + spin_lock_irqsave(&cfg->slock, lock_flags);
> + list_add(&lli->list, &cfg->lluns);
> + spin_unlock_irqrestore(&cfg->slock, lock_flags);
> +
> + spin_lock_irqsave(&global.slock, lock_flags);
> + list_add(&gli->list, &global.gluns);
> + spin_unlock_irqrestore(&global.slock, lock_flags);

Your locks are extremely fine grained... too much ? Any reason why you
don't have a simple/single lock handling all these ? IE, do you expect
frequent accesses ?

Also, a function called "lookup_something" that has the side effect of
adding that something to two lists doesn't look great to me. You may
want to review the function naming a bit.

Finally, what happens if two processes call this trying 

Re: [PATCH v4 1/3] cxlflash: Base error recovery support

2015-08-10 Thread Benjamin Herrenschmidt
On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote:
> Introduce support for enhanced I/O error handling.
> 
> Signed-off-by: Matthew R. Ochs 
> Signed-off-by: Manoj N. Kumar 
> ---

So I'm not necessarily very qualified to review SCSI bits as I haven't
done anything close to the Linux SCSI code for almost a decade but I
have a couple of questions and nits...

>   wait_queue_head_t tmf_waitq;
>   bool tmf_active;
> - u8 err_recovery_active:1;
> + wait_queue_head_t limbo_waitq;
> + enum cxlflash_state state;
>  };
>  
>  struct afu_cmd {
> diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
> index 76a7286..18359d4 100644
> --- a/drivers/scsi/cxlflash/main.c
> +++ b/drivers/scsi/cxlflash/main.c
> @@ -380,6 +380,18 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, 
> struct scsi_cmnd *scp)
>   }
>   spin_unlock_irqrestore(&cfg->tmf_waitq.lock, lock_flags);
>  
> + switch (cfg->state) {
> + case STATE_LIMBO:
> + pr_debug_ratelimited("%s: device in limbo!\n", __func__);
> + rc = SCSI_MLQUEUE_HOST_BUSY;
> + goto out;
> + case STATE_FAILTERM:
> + pr_debug_ratelimited("%s: device has failed!\n", __func__);
> + goto error;
> + default:
> + break;
> + }

I noticed that you mostly read and write that new state outside of your
tmf_waitq.lock. Is there any other lock or mutex protecting you ? In
this specific case ?

I know in the old days queuecommand was called with a host lock, is that
still the case ?

Or you just don't care about an occasional spurious
SCSI_MLQUEUE_HOST_BUSY ?

>   cmd = cxlflash_cmd_checkout(afu);
>   if (unlikely(!cmd)) {
>   pr_err("%s: could not get a free command\n", __func__);
> @@ -428,6 +440,10 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, 
> struct scsi_cmnd *scp)
>  
>  out:
>   return rc;
> +error:
> + scp->result = (DID_NO_CONNECT << 16);
> + scp->scsi_done(scp);
> + return 0;
>  }
>  
>  /**
> @@ -455,9 +471,21 @@ static int cxlflash_eh_device_reset_handler(struct 
> scsi_cmnd *scp)
>get_unaligned_be32(&((u32 *)scp->cmnd)[2]),
>get_unaligned_be32(&((u32 *)scp->cmnd)[3]));
>  
> - rcr = send_tmf(afu, scp, TMF_LUN_RESET);
> - if (unlikely(rcr))
> + switch (cfg->state) {
> + case STATE_NORMAL:
> + rcr = send_tmf(afu, scp, TMF_LUN_RESET);
> + if (unlikely(rcr))
> + rc = FAILED;
> + break;
> + case STATE_LIMBO:
> + wait_event(cfg->limbo_waitq, cfg->state != STATE_LIMBO);
> + if (cfg->state == STATE_NORMAL)
> + break;
> + /* fall through */
> + default:
>   rc = FAILED;
> + break;
> + }

Same here, since you are doing wait_event, I assume no lock is held
(unless it's a mutex); the same goes for subsequent places I am not listing.

As I said, it could well be that it all works fine but it would be worth
mentioning in this case because it's not obvious from simply reading the
code.

Cheers,
Ben.





Re: Concerns about "mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support"

2015-04-01 Thread Benjamin Herrenschmidt
On Thu, 2015-02-19 at 21:45 -0800, James Bottomley wrote:

> Ben, this is legal by design.  It was specifically designed for the
> aic79xx SCSI card, but can be used for a variety of other reasons.  The
> aic79xx hardware problem was that the DMA engine could address the whole
> of memory (it had two address modes, a 39 bit one and a 64 bit one) but
> the script engine that runs the mailboxes only had a 32 bit activation
> register (the activating write points at the physical address of the
> script to begin executing).  This meant that the scripts that run in
> memory had to be in the first 4GB of physical memory, hence the split
> mask.  The DMA mask specifies that the card can transfer from anywhere
> in physical memory, but the consistent_dma_mask says that the consistent
> allocation used to get scripts memory must come from the lower 4GB.

So looking at that again...

This is interesting ... basically any driver using different streaming and
consistent masks has been broken on powerpc basically forever. The whole
concept was poorly designed; for example, the set_consistent_mask isn't a
dma_map_ops hook, unlike everything else.

In some cases, what we want is to convey base+offset information to
drivers, but we can't do that.

This stuff cannot work with setups like a lot of our iommus where we
have a remapped region at the bottom of the DMA address space and a
bypass (direct map) region high up.

Basically, we can fix it, at least for most platforms, but it will be
hard, invasive, *and* will need to go to stable. Grmbl.

We'll have to replace our "direct" DMA ops (which just apply an offset),
which we use for devices that set a 64-bit mask on platforms that support
a bypass window, with some smart-ass hybrid variant that selectively
shoots stuff up to the bypass window or down via the iommu remapping, based
on the applicable mask for a given operation.
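
Purely as a sketch of the shape such a hybrid could take (every helper
name below is invented; the real thing would have to be wired into the
powerpc dma_map_ops):

/* Hypothetical hybrid map_page: use the high bypass window whenever the
 * mask applicable to this operation can reach it, otherwise fall back to
 * the iommu-remapped window.  bypass_reachable(), bypass_window_offset()
 * and iommu_remap_page() are made-up names.
 */
static dma_addr_t hybrid_dma_map_page(struct device *dev, struct page *page,
				      unsigned long offset, size_t size,
				      enum dma_data_direction dir)
{
	u64 mask = dma_get_mask(dev);	/* coherent allocs would check the other mask */
	phys_addr_t phys = page_to_phys(page) + offset;

	if (bypass_reachable(dev, phys, size, mask))
		return phys + bypass_window_offset(dev);	/* direct map, no iommu entry */

	return iommu_remap_page(dev, page, offset, size, dir);	/* classic remapped window */
}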

It would be nice if we could come up with a way to inform the driver
that we support that sort of "bypass" region with an offset. That would
allow drivers that have that 64-bit base + 32-bit offset scheme to work
much more efficiently for us. The driver could configure the base to be
our "bypass window offset", and we could use ZONE_DMA32 for consistent
DMAs.

It would also help us with things like some GPUs that can only do 40-bit
DMA (which won't allow them to reach our bypass region normally) but do
have a way to configure the generated top bits of all DMA addresses in
some fixed register.

Any idea what such an interface might look like ?
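
Not a proposal, but to make the question concrete, one hypothetical shape
for it could be (all identifiers below are invented):

/* Hypothetical query: does the platform provide a direct-map ("bypass")
 * window, and where does it sit in the device's DMA address space ?
 */
struct dma_bypass_region {
	dma_addr_t	dma_base;	/* device-side base address of the window */
	u64		size;		/* amount of CPU memory it covers */
};

int dma_get_bypass_region(struct device *dev, struct dma_bypass_region *r);

/* A "64-bit base + 32-bit offset" card could then program r.dma_base into
 * its fixed upper-bits register and address everything below it with
 * 32-bit offsets, while consistent allocations come from ZONE_DMA32.
 */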

Cheers,
Ben.




Re: Concerns about "mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support"

2015-02-19 Thread Benjamin Herrenschmidt
On Thu, 2015-02-19 at 21:45 -0800, James Bottomley wrote:
> Ben, this is legal by design.  It was specifically designed for the
> aic79xx SCSI card, but can be used for a variety of other reasons.  The
> aic79xx hardware problem was that the DMA engine could address the whole
> of memory (it had two address modes, a 39 bit one and a 64 bit one) but
> the script engine that runs the mailboxes only had a 32 bit activation
> register (the activating write points at the physical address of the
> script to begin executing).  This meant that the scripts that run in
> memory had to be in the first 4GB of physical memory, hence the split
> mask.  The DMA mask specifies that the card can transfer from anywhere
> in physical memory, but the consistent_dma_mask says that the consistent
> allocation used to get scripts memory must come from the lower 4GB.

Right, ok, it looks like it's easy enough to support with ZONE_DMA32, I'm
testing patches to create it unconditionally on ppc64 (it used to depend
on us using swiotlb on embedded platforms) and I'll shoot that upstream
if it passes.

Ben.




Re: Concerns about "mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support"

2015-02-19 Thread Benjamin Herrenschmidt
On Fri, 2015-02-20 at 16:22 +1100, Benjamin Herrenschmidt wrote:

> Looking a bit more closely, you basically do
> 
>  - set_dma_mask(64-bit)
>  - set_consistent_dma_mask(32-bit)
> 
> Now, I don't know how x86 will react to the conflicting masks, but on
> ppc64, I'm pretty sure the second one will barf. IE, the first one will
> establish a set of direct mapping ops which give you a bypass of the
> iommu to all of memory. The second one will then do a
> dma_supported(mask) call which will hit the direct ops, and they will
> fail since a 32-bit mask cannot address the bypass completely.
> 
> Are architectures really required to support such mismatching dma_mask
> and consistent_dma_mask ? what a bloody trainwreck ... :-(

Oh well, looks like x86 supports it and it won't be too hard to support
it on ppc64 as well. We even had some code along those lines for FSL
platforms with an ifdef, due to the number of drivers that used to fail
setting the consistent mask properly, but that seems to be mostly fixed
nowadays.

Ben.




Re: Concerns about "mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support"

2015-02-19 Thread Benjamin Herrenschmidt
On Fri, 2015-02-20 at 16:06 +1100, Benjamin Herrenschmidt wrote:

> Note that even on powerpc platforms where it would work because we
> maintain both 32-bit and 64-bit bypass windows in the device address
> space simultaneously, you will leak iommu entries unless you also switch
> back to 32-bit when freeing the 32-bit mappings... (and you would
> probably crash if you tried to free a 64-bit mapping while in 32-bit
> mode).
> 
> The iommu APIs weren't designed with that "switching mask" facility in
> mind...

Looking a bit more closely, you basically do

 - set_dma_mask(64-bit)
 - set_consistent_dma_mask(32-bit)
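
In driver terms, that is roughly the following sequence (a sketch of the
pattern, not the actual driver code):

if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)))			/* streaming DMA: anywhere */
	goto out_disable;
if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)))	/* coherent allocs: below 4GB */
	goto out_disable;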

Now, I don't know how x86 will react to the conflicting masks, but on
ppc64, I'm pretty sure the second one will barf. IE, the first one will
establish a set of direct mapping ops which give you a bypass of the
iommu to all of memory. The second one will then do a
dma_supported(mask) call which will hit the direct ops, and they will
fail since a 32-bit mask cannot address the bypass completely.

Are architectures really required to support such mismatching dma_mask
and consistent_dma_mask ? what a bloody trainwreck ... :-(

Cheers,
Ben.




Re: Concerns about "mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support"

2015-02-19 Thread Benjamin Herrenschmidt
On Fri, 2015-02-20 at 16:01 +1100, Benjamin Herrenschmidt wrote:
> Hi Sreekanth !
> 
> While looking at some (unrelated) issue where mpt2sas seems to be using
> 32-bit DMA instead of 64-bit DMA on some POWER platforms, I noticed this
> patch which was merged as 5fb1bf8aaa832e1e9ca3198de7bbecb8eff7db9c.
> 
> Can you confirm my understanding that you are:
> 
>  - Setting the DMA mask to 32-bit
> 
>  - Mapping pages for DMA
> 
>  - Changing the DMA mask to 64-bit
> 
> ?
> 
> If yes, then I don't think this is a legal thing to do and definitely
> not something supported by all architectures. It might work by accident,
> but there is no telling that any translation/DMA mapping provided before
> a call to set_dma_mask() is still valid after that call.
> 
> The architecture might have to completely reconfigure the iommu, for
> example on some PowerPC platforms, we switch from a remapped mapping to
> a direct linear map of all memory, all translations established before
> the switch might be lost (it depends on the specific implementation).
> 
> How does it work on x86 with DMAR ?

Note that even on powerpc platforms where it would work because we
maintain both 32-bit and 64-bit bypass windows in the device address
space simultaneously, you will leak iommu entries unless you also switch
back to 32-bit when freeing the 32-bit mappings... (and you would
probably crash if you tried to free a 64-bit mapping while in 32-bit
mode).

The iommu APIs weren't designed with that "switching mask" facility in
mind...

Ben.




Concerns about "mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support"

2015-02-19 Thread Benjamin Herrenschmidt
Hi Sreekanth !

While looking at some (unrelated) issue where mpt2sas seems to be using
32-bit DMA instead of 64-bit DMA on some POWER platforms, I noticed this
patch which was merged as 5fb1bf8aaa832e1e9ca3198de7bbecb8eff7db9c.

Can you confirm my understanding that you are:

 - Setting the DMA mask to 32-bit

 - Mapping pages for DMA

 - Changing the DMA mask to 64-bit

?
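
Schematically, i.e. something like this (an illustration of the pattern,
not the actual driver code):

dma_addr_t dma;

pci_set_dma_mask(pdev, DMA_BIT_MASK(32));			/* 1. start with a 32-bit mask */
dma = pci_map_single(pdev, buf, len, PCI_DMA_BIDIRECTIONAL);	/* 2. map while 32-bit */
pci_set_dma_mask(pdev, DMA_BIT_MASK(64));			/* 3. then widen to 64-bit */
/* ... and keep using 'dma', which may no longer be valid once the
 * architecture reconfigures the iommu for the new mask. */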

If yes, then I don't think this is a legal thing to do and definitely
not something supported by all architectures. It might work by accident,
but there is no telling that any translation/DMA mapping provided before
a call to set_dma_mask() is still valid after that call.

The architecture might have to completely reconfigure the iommu, for
example on some PowerPC platforms, we switch from a remapped mapping to
a direct linear map of all memory, all translations established before
the switch might be lost (it depends on the specific implementation).

How does it work on x86 with DMAR ?

Cheers,
Ben.




Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig

2014-03-23 Thread Benjamin Herrenschmidt
On Sun, 2014-03-23 at 00:03 -0700, Christoph Hellwig wrote:
> On Sun, Mar 23, 2014 at 02:04:46PM +1100, Benjamin Herrenschmidt wrote:
> > 
> > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > > index 1594945..8122294 100644
> > > --- a/arch/arm/Kconfig
> > > +++ b/arch/arm/Kconfig
> > > @@ -82,6 +82,7 @@ config ARM
> > > <http://www.arm.linux.org.uk/>.
> > >  
> > >  config ARM_HAS_SG_CHAIN
> > > + select ARCH_HAS_SG_CHAIN
> > >   bool
> > >  
> > 
> > Heh, a self-selecting config option... I didn't know that trick !
> 
> ARM vs ARCH.  However the arm variant of the variable should probably be
> consolidated into the ARCH one as a follow up.

Oh right... sorry for the noise.

Ben.




Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig

2014-03-22 Thread Benjamin Herrenschmidt
On Sat, 2014-03-22 at 11:13 -0700, Laura Abbott wrote:
> Rather than have architectures #define ARCH_HAS_SG_CHAIN in an architecture
> specific scatterlist.h, make it a proper Kconfig option and use that
> instead. At the same time, remove the header files that are now mostly
> useless and just include asm-generic/scatterlist.h.
> 
> Cc: Russell King 
> Cc: Tony Luck 
> Cc: Fenghua Yu 

For powerpc

Acked-by: Benjamin Herrenschmidt 

> Cc: Paul Mackerras 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: "James E.J. Bottomley" 
> Cc: Fenghua Yu 
> Cc: Tony Luck 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Martin Schwidefsky 
> Cc: Heiko Carstens 
> Cc: Andrew Morton 
> Signed-off-by: Laura Abbott 




Re: [RESEND][PATCH 1/2] lib/scatterlist: Make ARCH_HAS_SG_CHAIN an actual Kconfig

2014-03-22 Thread Benjamin Herrenschmidt

> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 1594945..8122294 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -82,6 +82,7 @@ config ARM
> .
>  
>  config ARM_HAS_SG_CHAIN
> + select ARCH_HAS_SG_CHAIN
>   bool
>  

Heh, a self-selecting config option... I didn't know that trick !

Ben.




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Benjamin Herrenschmidt
On Tue, 2013-10-08 at 20:55 -0700, H. Peter Anvin wrote:
> Why not add a minimum number to pci_enable_msix(), i.e.:
> 
> pci_enable_msix(pdev, msix_entries, nvec, minvec)
> 
> ... which means "nvec" is the number of interrupts *requested*, and
> "minvec" is the minimum acceptable number (otherwise fail).

Which is exactly what Ben (the other Ben :-) suggested and that I
support...
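
For the record, driver-side use would look something like the sketch
below; note the four-argument form is the *proposed* signature, not the
existing three-argument pci_enable_msix(), and the return convention is
just one plausible choice:

int rc, nvecs;

rc = pci_enable_msix(pdev, msix_entries, want_vecs, min_vecs);	/* proposed 4-arg form */
if (rc < 0)
	return rc;	/* could not even get min_vecs vectors */
nvecs = rc;		/* somewhere in [min_vecs, want_vecs], no retry loop needed */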

Cheers,
Ben.




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-07 Thread Benjamin Herrenschmidt
On Mon, 2013-10-07 at 14:01 -0400, Tejun Heo wrote:
> I don't think the same race condition would happen with the loop.  The
> problem case is where multiple msi(x) allocation fails completely
> because the global limit went down before inquiry and allocation.  In
> the loop based interface, it'd retry with the lower number.
> 
> As long as the number of drivers which need this sort of adaptive
> allocation isn't too high and the common cases can be made simple, I
> don't think the "complex" part of interface is all that important.
> Maybe we can have reserve / cancel type interface or just keep the
> loop with more explicit function names (ie. try_enable or something
> like that).

I'm thinking a better API overall might just have been to request
individual MSI-X one by one :-)

We want to be able to request an MSI-X at runtime anyway ... if I want
to dynamically add a queue to my network interface, I want it to be able
to pop a new arbitrary MSI-X.

And we don't want to lock drivers into contiguous MSI-X sets either.

And for the cleanup ... well that's what the "pcim" functions are for,
we can just make MSI-X variants.
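
Hand-waving a bit, the kind of thing I mean (every function name here is
hypothetical):

int irq, rc;

irq = pcim_msix_alloc_one(pdev);	/* made-up API: grab one more vector, managed */
if (irq < 0)
	return irq;
rc = request_irq(irq, my_queue_handler, 0, "mydev-newqueue", queue);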

Ben.




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-05 Thread Benjamin Herrenschmidt
On Sun, 2013-10-06 at 08:02 +0200, Alexander Gordeev wrote:
> On Sun, Oct 06, 2013 at 08:46:26AM +1100, Benjamin Herrenschmidt wrote:
> > On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote:
> > > So my point is - drivers should first obtain a number of MSIs they *can*
> > > get, then *derive* a number of MSIs the device is fine with and only then
> > > request that number. Not terribly different from memory or any other type
> > > of resource allocation ;)
> > 
> > What if the limit is for a group of devices ? Your interface is racy in
> > that case, another driver could have eaten into the limit in between the
> > calls.
> 
> Well, the other driver has had better karma ;) But seriously, the
> current scheme with a loop is not race-safe wrt any other type of
> resource which might exhaust. What makes the quota so special so we
> should care about it and should not care i.e. about lack of msi_desc's?

I'm not saying the current scheme is better but I prefer the option of
passing a min,max to the request function.

> Yeah, I know the quota might hit more likely. But why it is not addressed
> right now then? Not a single function in chains...
>   rtas_msi_check_device() -> msi_quota_for_device() -> traverse_pci_devices()
>   rtas_setup_msi_irqs() -> msi_quota_for_device() -> traverse_pci_devices()
> ...is race-safe. So if it has not been bothering anyone until now then 
> no reason to start worrying now :)
>
> In fact, in the current design to address the quota race decently the
> drivers would have to protect the *loop* to prevent the quota change
> between a pci_enable_msix() returned a positive number and the the next
> call to pci_enable_msix() with that number. Is it doable?

I am not advocating for the current design, simply saying that your
proposal doesn't address this issue while Ben's does.

Cheers,
Ben.

> > Ben.
> > 
> > 
> 




Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-05 Thread Benjamin Herrenschmidt
On Sat, 2013-10-05 at 16:20 +0200, Alexander Gordeev wrote:
> So my point is - drivers should first obtain a number of MSIs they *can*
> get, then *derive* a number of MSIs the device is fine with and only then
> request that number. Not terribly different from memory or any other type
> of resource allocation ;)

What if the limit is for a group of devices ? Your interface is racy in
that case, another driver could have eaten into the limit in between the
calls.

Ben.




Re: [PATCH v2, part 1 3/9] PCI: Convert alloc_pci_dev(void) to pci_alloc_dev(bus) instead

2013-05-15 Thread Benjamin Herrenschmidt
On Wed, 2013-05-15 at 22:46 +0800, Liu Jiang wrote:
> I don't know any OF experts, could you please help to CC
> some OF experts?

I wrote that code I think. Sorry, I've missed the beginning of the
thread, what is the problem ?

Cheers,
Ben.




Re: [PATCH] scsi/ibmvscsi: add module alias for ibmvscsic

2012-07-30 Thread Benjamin Herrenschmidt
On Mon, 2012-07-30 at 21:06 +0200, Olaf Hering wrote:
> > So while this would work, I do wonder however whether we could
> > instead
> > fix it by simplifying the whole thing as follow since iSeries is now
> > gone and so we don't need split backends anymore:
> > 
> > scsi/ibmvscsi: Remove backend abstraction
> 
> I can't test these things myself anymore.

Brian, can somebody from your side own these ?

Cheers,
Ben.




Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information

2012-07-29 Thread Benjamin Herrenschmidt
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
> From: Linda Xie 
> 
> Expected result:
> It should show something like this:
> x1521p4:~ # cat /sys/class/scsi_host/host1/config
> PARTITIONNAME='x1521p4'
> NWSDNAME='X1521P4'
> HOSTNAME='X1521P4'
> DOMAINNAME='RCHLAND.IBM.COM'
> NAMESERVERS='9.10.244.100 9.10.244.200'
> 
> Actual result:
> x1521p4:~ # cat /sys/class/scsi_host/host0/config
> x1521p4:~ #
> 
> This patch changes the size of the buffer used for transferring config
> data to 4K. It was tested against 2.6.19-rc2 tree.
> 
> Reported by IBM during SLES11 beta testing:

So this patch just seems to blindly replace all occurrences of PAGE_SIZE
with HOST_BUFFER_SIZE, which is utterly wrong. Only one of those needs to
be changed, the one passed to ibmvscsi_do_host_config(), which is what's
visible to the server; all the rest are just sysfs attributes and should
remain as-is.

Additionally (not even mentioning that there is no explanation as to
what the real problem is anywhere in the changeset), I don't like the
fix. The root of the problem is that the MAD header has a 16-bit length
field, so writing 0x10000 (our 64K PAGE_SIZE) into it doesn't quite work.

So in addition to a better comment, I would suggest a fix more like
this:

scsi/ibmvscsi: Fix host config length field overflow

The length field in the host config packet is only 16 bits long, so
passing it 0x10000 (64K, which is our standard PAGE_SIZE) doesn't
work and results in an empty config from the server.

Signed-off-by: Benjamin Herrenschmidt 
CC: 
---

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 3a6c474..337e8b3 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -1541,6 +1541,9 @@ static int ibmvscsi_do_host_config(struct 
ibmvscsi_host_data *hostdata,
 
host_config = &evt_struct->iu.mad.host_config;
 
+   /* The transport length field is only 16-bit */
+   length = min(0xffff, length);
+
/* Set up a lun reset SRP command */
memset(host_config, 0x00, sizeof(*host_config));
host_config->common.type = VIOSRP_HOST_CONFIG_TYPE;




Re: [PATCH] scsi/ibmvscsi: add module alias for ibmvscsic

2012-07-29 Thread Benjamin Herrenschmidt
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
> From: Olaf Hering 
> 
> The driver is named ibmvscsic, but at runtime its name is advertised as
> ibmvscsi. For this reason mkinitrd won't pick up the driver properly.
> Reported by IBM during SLES11 beta testing:
> 
> https://bugzilla.novell.com/show_bug.cgi?id=459933
> LTC50724

So while this would work, I do wonder however whether we could instead
fix it by simplifying the whole thing as follow since iSeries is now
gone and so we don't need split backends anymore:

scsi/ibmvscsi: Remove backend abstraction

Now that the iSeries code is gone the backend abstraction
in this driver is no longer necessary, which allows us to
consolidate the driver in one file.

The side effect is that the module name is now ibmvscsi.ko
which matches the driver hotplug name and fixes auto-load
issues.

Signed-off-by: Benjamin Herrenschmidt 
---
---
 drivers/scsi/ibmvscsi/Makefile|6 +-
 drivers/scsi/ibmvscsi/ibmvscsi.c  |  348 +--
 drivers/scsi/ibmvscsi/ibmvscsi.h  |   22 ---
 drivers/scsi/ibmvscsi/rpa_vscsi.c |  368 -
 4 files changed, 330 insertions(+), 414 deletions(-)
 delete mode 100644 drivers/scsi/ibmvscsi/rpa_vscsi.c

diff --git a/drivers/scsi/ibmvscsi/Makefile b/drivers/scsi/ibmvscsi/Makefile
index ff5b5c5..cb150d1 100644
--- a/drivers/scsi/ibmvscsi/Makefile
+++ b/drivers/scsi/ibmvscsi/Makefile
@@ -1,7 +1,3 @@
-obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsic.o
-
-ibmvscsic-y+= ibmvscsi.o
-ibmvscsic-$(CONFIG_PPC_PSERIES)+= rpa_vscsi.o 
-
+obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsi.o
 obj-$(CONFIG_SCSI_IBMVSCSIS)   += ibmvstgt.o
 obj-$(CONFIG_SCSI_IBMVFC)  += ibmvfc.o
diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 3a6c474..d2bd2c0 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -93,13 +93,13 @@ static int max_requests = IBMVSCSI_MAX_REQUESTS_DEFAULT;
 static int max_events = IBMVSCSI_MAX_REQUESTS_DEFAULT + 2;
 static int fast_fail = 1;
 static int client_reserve = 1;
+static char partition_name[97] = "UNKNOWN";
+static unsigned int partition_number = -1;
 
 static struct scsi_transport_template *ibmvscsi_transport_template;
 
 #define IBMVSCSI_VERSION "1.5.9"
 
-static struct ibmvscsi_ops *ibmvscsi_ops;
-
 MODULE_DESCRIPTION("IBM Virtual SCSI");
 MODULE_AUTHOR("Dave Boutcher");
 MODULE_LICENSE("GPL");
@@ -118,6 +118,315 @@ MODULE_PARM_DESC(fast_fail, "Enable fast fail. 
[Default=1]");
 module_param_named(client_reserve, client_reserve, int, S_IRUGO );
 MODULE_PARM_DESC(client_reserve, "Attempt client managed reserve/release");
 
+static void ibmvscsi_handle_crq(struct viosrp_crq *crq,
+   struct ibmvscsi_host_data *hostdata);
+
+/* 
+ * Routines for managing the command/response queue
+ */
+/**
+ * ibmvscsi_handle_event: - Interrupt handler for crq events
+ * @irq:   number of irq to handle, not used
+ * @dev_instance: ibmvscsi_host_data of host that received interrupt
+ *
+ * Disables interrupts and schedules srp_task
+ * Always returns IRQ_HANDLED
+ */
+static irqreturn_t ibmvscsi_handle_event(int irq, void *dev_instance)
+{
+   struct ibmvscsi_host_data *hostdata =
+   (struct ibmvscsi_host_data *)dev_instance;
+   vio_disable_interrupts(to_vio_dev(hostdata->dev));
+   tasklet_schedule(&hostdata->srp_task);
+   return IRQ_HANDLED;
+}
+
+/**
+ * release_crq_queue: - Deallocates data and unregisters CRQ
+ * @queue: crq_queue to initialize and register
+ * @host_data: ibmvscsi_host_data of host
+ *
+ * Frees irq, deallocates a page for messages, unmaps dma, and unregisters
+ * the crq with the hypervisor.
+ */
+static void ibmvscsi_release_crq_queue(struct crq_queue *queue,
+  struct ibmvscsi_host_data *hostdata,
+  int max_requests)
+{
+   long rc = 0;
+   struct vio_dev *vdev = to_vio_dev(hostdata->dev);
+   free_irq(vdev->irq, (void *)hostdata);
+   tasklet_kill(&hostdata->srp_task);
+   do {
+   if (rc)
+   msleep(100);
+   rc = plpar_hcall_norets(H_FREE_CRQ, vdev->unit_address);
+   } while ((rc == H_BUSY) || (H_IS_LONG_BUSY(rc)));
+   dma_unmap_single(hostdata->dev,
+queue->msg_token,
+queue->size * sizeof(*queue->msgs), DMA_BIDIRECTIONAL);
+   free_page((unsigned long)queue->msgs);
+}
+
+/**
+ * crq_queue_next_crq: - Returns the next entry in message queue
+ * @queue: crq_queue to use
+ *
+ * Returns pointer to next entry in queue, or NULL if there are no new
+ * entries in the CRQ.
+ */
+static struct viosrp_crq *crq_queu

Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information

2012-07-29 Thread Benjamin Herrenschmidt
On Fri, 2012-07-27 at 07:56 +0100, James Bottomley wrote:
> On Fri, 2012-07-27 at 15:19 +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
> > > From: Linda Xie 
> > 
> > James, can I assume you're picking up those two ?
> 
> If they get acked by the maintiners ...

I don't think we have a specific upstream maintainer for those drivers,
so let's assume it's my job... NAK on the sysfs one, the other one is
ok, I'll reply to the respective patches.

Cheers,
Ben.




Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information

2012-07-26 Thread Benjamin Herrenschmidt
On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
> From: Linda Xie 

James, can I assume you're picking up those two ?

Cheers,
Ben.

> Expected result:
> It should show something like this:
> x1521p4:~ # cat /sys/class/scsi_host/host1/config
> PARTITIONNAME='x1521p4'
> NWSDNAME='X1521P4'
> HOSTNAME='X1521P4'
> DOMAINNAME='RCHLAND.IBM.COM'
> NAMESERVERS='9.10.244.100 9.10.244.200'
> 
> Actual result:
> x1521p4:~ # cat /sys/class/scsi_host/host0/config
> x1521p4:~ #
> 
> This patch changes the size of the buffer used for transferring config
> data to 4K. It was tested against 2.6.19-rc2 tree.
> 
> Reported by IBM during SLES11 beta testing:
> 
> https://bugzilla.novell.com/show_bug.cgi?id=439970
> LTC49349
> 
> Signed-off-by: Olaf Hering 
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index e580aa4..1513ca8 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -93,6 +93,8 @@ static int max_requests = IBMVSCSI_MAX_REQUESTS_DEFAULT;
>  static int max_events = IBMVSCSI_MAX_REQUESTS_DEFAULT + 2;
>  static int fast_fail = 1;
>  static int client_reserve = 1;
> +/* host data buffer size */
> +#define HOST_BUFFER_SIZE 4096
>  
>  static struct scsi_transport_template *ibmvscsi_transport_template;
>  
> @@ -1666,7 +1668,7 @@ static ssize_t show_host_srp_version(struct device *dev,
>   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
>   int len;
>  
> - len = snprintf(buf, PAGE_SIZE, "%s\n",
> +   len = snprintf(buf, HOST_BUFFER_SIZE, "%s\n",
>  hostdata->madapter_info.srp_version);
>   return len;
>  }
> @@ -1687,7 +1689,7 @@ static ssize_t show_host_partition_name(struct device 
> *dev,
>   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
>   int len;
>  
> - len = snprintf(buf, PAGE_SIZE, "%s\n",
> +   len = snprintf(buf, HOST_BUFFER_SIZE, "%s\n",
>  hostdata->madapter_info.partition_name);
>   return len;
>  }
> @@ -1708,7 +1710,7 @@ static ssize_t show_host_partition_number(struct device 
> *dev,
>   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
>   int len;
>  
> - len = snprintf(buf, PAGE_SIZE, "%d\n",
> +   len = snprintf(buf, HOST_BUFFER_SIZE, "%d\n",
>  hostdata->madapter_info.partition_number);
>   return len;
>  }
> @@ -1728,7 +1730,7 @@ static ssize_t show_host_mad_version(struct device *dev,
>   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
>   int len;
>  
> - len = snprintf(buf, PAGE_SIZE, "%d\n",
> +   len = snprintf(buf, HOST_BUFFER_SIZE, "%d\n",
>  hostdata->madapter_info.mad_version);
>   return len;
>  }
> @@ -1748,7 +1750,7 @@ static ssize_t show_host_os_type(struct device *dev,
>   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
>   int len;
>  
> - len = snprintf(buf, PAGE_SIZE, "%d\n", hostdata->madapter_info.os_type);
> +   len = snprintf(buf, HOST_BUFFER_SIZE, "%d\n", 
> hostdata->madapter_info.os_type);
>   return len;
>  }
>  
> @@ -1767,7 +1769,7 @@ static ssize_t show_host_config(struct device *dev,
>   struct ibmvscsi_host_data *hostdata = shost_priv(shost);
>  
>   /* returns null-terminated host config data */
> - if (ibmvscsi_do_host_config(hostdata, buf, PAGE_SIZE) == 0)
> +   if (ibmvscsi_do_host_config(hostdata, buf, HOST_BUFFER_SIZE) == 0)
>   return strlen(buf);
>   else
>   return 0;




Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-23 Thread Benjamin Herrenschmidt

> This has the potential of leaving a big fat ugly hole in the middle of
> scsi_cmnd. I would suggest *just* moving the sense_buffer array to be
> the *first member* of struct scsi_cmnd. The command itself is already cache
> aligned, allocated by the proper flags to its slab. And put a fat comment
> near its definition.

The hole would only be present on non-coherent architectures though. We
also still need the padding after the sense_buffer. But yeah, moving the
sense buffer to the first position would diminish the size of the hole I
suppose.

> This is until a proper fix is sent. I have in my Q a proposition for a
> more thorough solution, which I will send next month. It is motivated by
> shortcomings in the sense handling and optimizations, but should definitely
> cover this problem.
> 
> The code should have time to be discussed and tested, so it is only 2.6.26
> material. For the duration of the 2.6.25 kernel we can live with a reorder
> of scsi_cmnd members, if it solves such a grave bug for some ARCHs.

Re-ordering isn't enough. We need both the re-order and the __dma_buffer
annotation to ensure proper padding. Some archs have a 64 or even 128
byte cache line size.
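
A minimal sketch of the effect, hard-coding a 128-byte line purely for
illustration:

/* Even with the buffer first, whatever follows it would share its last
 * cache line unless a zero-size, line-aligned pad pushes the next member
 * onto a fresh line:
 */
struct cmd_like {
	unsigned char sense_buffer[96] __attribute__((aligned(128)));
	char __pad[0] __attribute__((aligned(128)));	/* next member starts on a new line */
	int next_field;					/* no longer shares the buffer's line */
};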

I agree that in the long run, a better solution should be done involving
reworking the way the sense buffer is allocated. I started modifying the
EH code to kmalloc it, but it gets very messy with the current interface
between drivers and EH. I still have some patches around I can send,
though some drivers need fixing (error handling on kmalloc failure).
James seems to have a better idea of pre-allocating sense buffers per
host based on how many they can have in flight, which I haven't looked
at.

Ben.

> Boaz
> 
> [RFD below]
> My proposed solution will be has follows:
> 
>  1. Since the scsi protocol mandates an immediate REQUEST_SENSE after an error
> in effect the Q is frozen until the REQUEST_SENSE command returns.
> 
>  2. The scsi-ml needs the sense buffer for its normal operation, independent
> of the ULD's request for the sense_buffer or not at request->sense. But
> in effect, 90% of all scsi-requests come with ULD's allocated buffer for
> sense, that is copied to, on command completion.
> 
>  3. 99% of all commands complete successfully, so if an optimization is
> proposed for the successful case, sacrificing a few cycles for the error
> case, then that's a good thing.
> 
>  My suggestion is to have a per Q, driver-overridable, sense buffer that is 
>  DMAed/written to by drivers. At the end of the REQUEST_SENSE command one
>  of 2 things will be done. Either copy the sense to the ULD's supplied buffer,
>  or if not available, allocate one from a dedicated mem_cache pool.

I think that's pretty much was James was proposing indeed.

>  So we are completely saving 92 bytes from scsi_cmnd by replacing the buffer
>  with a pointer. We can always read the sense into a per-Q buffer. And 10% of
>  1% of the time we will need to allocate a sense buffer from a dedicated
> mem_cache.
>  I would say that's a nice optimization.
> 
>  The changes to scsi_error/scsi_cmnd and friends are pretty straightforward.
> But
>  it depends on a conversion of 4/5 drivers to the new scsi_eh API for
>  REQUEST_SENSE. I have only converted those drivers that interfered with the
> accessors
>  effort + 1 other place. But there are a few more places that are not
> converted.
>  Once done, the implementation can easily change with no effect on drivers.
> 
>  The reason I've started with this work is because the SCSI standard permits
> up
>  to 260 bytes of sense, which, you guessed right, is needed by the OSD command
> set.
> 
> Boaz



Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-21 Thread Benjamin Herrenschmidt

On Fri, 2007-12-21 at 10:33 +, Alan Cox wrote:
> On Fri, 21 Dec 2007 13:30:08 +1100
> Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:
> 
> > The sense buffer in scsi_cmnd can nowadays be DMA'ed into directly
> > by some low level drivers (that typically happens with USB mass
> > storage).
> 
> Should that not be fixed in USB storage by using pci_alloc_coherent on the
> PCI device of the hub not peeing directly into kernel space ?

All "dumb" SCSI drivers have this problem, USB storage is just one of
them. This would also allow to remove bounce buffering that some
non-dumb ones are doing in fact.

There's another solution jejb was talking about involving reworking the
allocation of the sense buffer to make it always under driver control
etc... but that's the kind of SCSI surgery that I'm not prepared to do
especially not for .25 and without much HW to test with.

> It's also incomplete as a fix because I don't see what guarantees the
> buffer size will always exceed cache line size

How is that a problem ? The annotation will make sure the buffer doesn't
share cache lines. It forces alignment before and pads after.

Ben.




Re: [PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-21 Thread Benjamin Herrenschmidt

On Fri, 2007-12-21 at 06:16 -0700, Matthew Wilcox wrote:
> On Fri, Dec 21, 2007 at 10:33:26AM +, Alan Cox wrote:
> > On Fri, 21 Dec 2007 13:30:08 +1100
> > Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:
> > 
> > > The sense buffer in scsi_cmnd can nowadays be DMA'ed into directly
> > > by some low level drivers (that typically happens with USB mass
> > > storage).
> > 
> > Should that not be fixed in USB storage by using pci_alloc_coherent on the
> > PCI device of the hub not peeing directly into kernel space ?
> 
> That's what I said, but Ben seems fixated on this particular fix.

That means more understanding of the SCSI code than I have right now
and a _lot_ more time than I have right now. It's not only USB storage,
it's anything that does DMA and uses the generic error handling.

Ben.




Re: [PATCH 1/2] DMA buffer alignment annotations

2007-12-21 Thread Benjamin Herrenschmidt

On Fri, 2007-12-21 at 09:39 +, Russell King wrote:

> > +#ifndef ARCH_MIN_DMA_ALIGNMENT
> > +#define __dma_aligned
> > +#define __dma_buffer
> > +#else
> > +#define __dma_aligned  
> > __attribute__((aligned(ARCH_MIN_DMA_ALIGNMENT)))
> > +#define __dma_buffer   __dma_buffer_line(__LINE__)
> > +#define __dma_buffer_line(line)__dma_aligned;\
> > +   char __dma_pad_##line[0] __dma_aligned
> 
> You introduce __dma_buffer_line() if ARCH_MIN_DMA_ALIGNMENT is set but
> not if it isn't...

Yup, it's not meant to be used outside of __dma_buffer...

 .../...

> > +then dev->buffer will be safe for DMA on all architectures.  On a
> > +cache-coherent architecture the members of dev will be aligned exactly
> > +as they would have been without __dma_buffer; on a non-cache-coherent
> > +architecture buffer and field2 will be aligned so that buffer does not
> > +share a cache line with any other data.
> > +
> 
> ... but it's not described.  What's the purpose of it, and why would it
> only be used on CPUs with ARCH_MIN_DMA_ALIGNMENT set?

Hrm, I'm not the best at writing explanations, care to send a patch ? :-)

Cheers,
Ben




[PATCH 1/2] DMA buffer alignment annotations

2007-12-20 Thread Benjamin Herrenschmidt
This patch, based on some earlier work by Roland Dreier, introduces
a pair of annotations that can be used to enforce alignment of
objects that can be DMA'ed into, and to enforce that a DMA'able
object within a structure isn't sharing a cache line with some
other object.

Such sharing of a data structure between DMA and non-DMA objects
isn't a recommended practice, but it does happen and in some cases
might even make sense, so we now have a way to make it work
properly.

The current patch only enables such alignment for some PowerPC
platforms that do not have coherent caches. Other platforms such
as ARM, MIPS, etc... can define ARCH_MIN_DMA_ALIGNMENT if they
want to benefit from this, I don't know them well enough to do
it myself.

The initial issue I'm fixing (in a second patch) by using these
is the SCSI sense buffer which is currently part of the scsi
command structure and can be DMA'ed to. On non-coherent platforms,
this causes various corruptions as this cache line is shared with
various other fields of the scsi_cmnd data structure.

Signed-off-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
---

 Documentation/DMA-mapping.txt |   32 
 include/asm-generic/page.h|   10 ++
 include/asm-powerpc/page.h|8 
 3 files changed, 50 insertions(+)

--- linux-merge.orig/include/asm-generic/page.h 2007-07-27 13:44:45.0 
+1000
+++ linux-merge/include/asm-generic/page.h  2007-12-21 13:07:28.0 
+1100
@@ -20,6 +20,16 @@ static __inline__ __attribute_const__ in
return order;
 }
 
+#ifndef ARCH_MIN_DMA_ALIGNMENT
+#define __dma_aligned
+#define __dma_buffer
+#else
+#define __dma_aligned  __attribute__((aligned(ARCH_MIN_DMA_ALIGNMENT)))
+#define __dma_buffer   __dma_buffer_line(__LINE__)
+#define __dma_buffer_line(line)__dma_aligned;\
+   char __dma_pad_##line[0] __dma_aligned
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 
Index: linux-merge/include/asm-powerpc/page.h
===
--- linux-merge.orig/include/asm-powerpc/page.h 2007-09-28 11:42:10.0 
+1000
+++ linux-merge/include/asm-powerpc/page.h  2007-12-21 13:15:02.0 
+1100
@@ -77,6 +77,14 @@
 #define VM_DATA_DEFAULT_FLAGS64(VM_READ | VM_WRITE | \
 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
+/*
+ * On non cache coherent platforms, we enforce cache aligned DMA
+ * buffers inside of structures
+ */
+#ifdef CONFIG_NOT_COHERENT_CACHE
+#define ARCH_MIN_DMA_ALIGNMENT L1_CACHE_BYTES
+#endif
+
 #ifdef __powerpc64__
 #include 
 #else
Index: linux-merge/Documentation/DMA-mapping.txt
===
--- linux-merge.orig/Documentation/DMA-mapping.txt  2007-12-21 
13:17:14.0 +1100
+++ linux-merge/Documentation/DMA-mapping.txt   2007-12-21 13:20:00.0 
+1100
@@ -75,6 +75,38 @@ What about block I/O and networking buff
 networking subsystems make sure that the buffers they use are valid
 for you to DMA from/to.
 
+Note that on non-cache-coherent architectures, having a DMA buffer
+that shares a cache line with other data can lead to memory
+corruption.
+
+The __dma_buffer macro exists to allow safe DMA buffers to be declared
+easily and portably as part of larger structures without causing bloat
+on cache-coherent architectures. To get this macro, architectures have
+to define ARCH_MIN_DMA_ALIGNMENT to the requested alignment value in
+their asm/page.h before including asm-generic/page.h
+
+Of course these structures must be contained in memory that can be
+used for DMA as described above.
+
+To use __dma_buffer, just declare a struct like:
+
+   struct mydevice {
+   int field1;
+   char buffer[BUFFER_SIZE] __dma_buffer;
+   int field2;
+   };
+
+If this is used in code like:
+
+   struct mydevice *dev;
+   dev = kmalloc(sizeof *dev, GFP_KERNEL);
+
+then dev->buffer will be safe for DMA on all architectures.  On a
+cache-coherent architecture the members of dev will be aligned exactly
+as they would have been without __dma_buffer; on a non-cache-coherent
+architecture buffer and field2 will be aligned so that buffer does not
+share a cache line with any other data.
+
DMA addressing limitations
 
 Does your device have any DMA addressing limitations?  For example, is


[PATCH 2/2] scsi: Use new __dma_buffer to align sense buffer in scsi_cmnd

2007-12-20 Thread Benjamin Herrenschmidt
The sense buffer in scsi_cmnd can nowadays be DMA'ed into directly
by some low level drivers (that typically happens with USB mass
storage).

This is a problem on non cache coherent architectures such as
embedded PowerPCs where the sense buffer can share cache lines with
other structure members, which leads to various forms of corruption.

This uses the newly defined __dma_buffer annotation to enforce that
on such platforms, the sense_buffer is contained within its own
cache line. This has no effect on cache coherent architectures.

Signed-off-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
---

 include/scsi/scsi_cmnd.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-merge.orig/include/scsi/scsi_cmnd.h   2007-12-21 13:07:14.0 
+1100
+++ linux-merge/include/scsi/scsi_cmnd.h2007-12-21 13:07:29.0 
+1100
@@ -88,7 +88,7 @@ struct scsi_cmnd {
   working on */
 
 #define SCSI_SENSE_BUFFERSIZE  96
-   unsigned char sense_buffer[SCSI_SENSE_BUFFERSIZE];
+   unsigned char sense_buffer[SCSI_SENSE_BUFFERSIZE] __dma_buffer;
/* obtained by REQUEST SENSE when
 * CHECK CONDITION is received on original
 * command (auto-sense) */


Re: SCSI breakage on non-cache coherent architectures

2007-11-20 Thread Benjamin Herrenschmidt

On Tue, 2007-11-20 at 15:10 -0600, James Bottomley wrote:
> We're talking about trying to fix this for 2.6.24, which is already at
> -rc3 ... Is an entire arch change for dma alignment really a merge
> candidate at this stage?

Well, as I said before... it's a matter of what seems to be the least
likely to break something, right ?

On one side, I'm doing surgery on code I barely know, the scsi error
handling, and now it seems I also have to fix up a handful of drivers
that aren't the most obvious pieces of code around.

On the other side, Roland's proposal is basically just adding a macro that
can be empty for everybody but a handful of archs, and sticking it onto one
field in one structure...

The latter has about zero chance to actually break something or cause a
regression. I wouldn't say that about the former.

Now, I will see if I manage to fixup the NCR drivers to pass a
pre-allocated buffer (USB storage I think can pass NULL as it's not
calling prep in atomic context). But then, it complicates the matter
because that means "restore" will have to know whether prep allocated
the buffer or not, thus more fields to add to the save struct, it's
getting messy, unless we decide -all- callers are responsible for the
buffer allocation (hrm... maybe the best approach).

Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt
FYI, Here's what I have for the SCSI change. I haven't updated drivers
to care for the new return code though, help appreciated with that as I
don't know much about these drivers.

Index: linux-work/drivers/scsi/scsi_error.c
===
--- linux-work.orig/drivers/scsi/scsi_error.c   2007-11-20 13:26:18.0 
+1100
+++ linux-work/drivers/scsi/scsi_error.c2007-11-20 13:43:05.0 
+1100
@@ -602,8 +602,9 @@ static void scsi_abort_eh_cmnd(struct sc
  * @cmnd is ignored and this functions sets up a REQUEST_SENSE command
  * and cmnd buffers to read @sense_bytes into @scmd->sense_buffer.
  **/
-void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
-   unsigned char *cmnd, int cmnd_size, unsigned 
sense_bytes)
+int scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses,
+ unsigned char *cmnd, int cmnd_size, unsigned sense_bytes,
+ gfp_t gfp_mask)
 {
struct scsi_device *sdev = scmd->device;
 
@@ -622,12 +623,20 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd 
ses->use_sg = scmd->use_sg;
ses->resid = scmd->resid;
ses->result = scmd->result;
+   sg_init_table(&ses->sense_sgl, 1);
 
if (sense_bytes) {
+   struct page *pg;
+
+   if (sdev->host->hostt->unchecked_isa_dma)
+   gfp_mask |= __GFP_DMA;
scmd->request_bufflen = min_t(unsigned,
   sizeof(scmd->sense_buffer), sense_bytes);
-   sg_init_one(&ses->sense_sgl, scmd->sense_buffer,
-  scmd->request_bufflen);
+   pg = alloc_page(gfp_mask);
+   if (!pg)
+   return FAILED;
+   memset(page_address(pg), 0, scmd->request_bufflen);
+   sg_set_page(&ses->sense_sgl, pg, scmd->request_bufflen, 0);
scmd->request_buffer = &ses->sense_sgl;
scmd->sc_data_direction = DMA_FROM_DEVICE;
scmd->use_sg = 1;
@@ -658,6 +667,8 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd 
 * untransferred sense data should be interpreted as being zero.
 */
memset(scmd->sense_buffer, 0, sizeof(scmd->sense_buffer));
+
+   return SUCCESS;
 }
 EXPORT_SYMBOL(scsi_eh_prep_cmnd);
 
@@ -670,9 +681,17 @@ EXPORT_SYMBOL(scsi_eh_prep_cmnd);
  **/
 void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd, struct scsi_eh_save *ses)
 {
+   struct page *pg;
+
/*
 * Restore original data
 */
+   pg = sg_page(&ses->sense_sgl);
+   if (pg) {
+   memcpy(scmd->sense_buffer, page_address(pg),
+  scmd->request_bufflen);
+   __free_page(pg);
+   }
scmd->cmd_len = ses->cmd_len;
memcpy(scmd->cmnd, ses->cmnd, sizeof(scmd->cmnd));
scmd->sc_data_direction = ses->data_direction;
@@ -709,7 +728,10 @@ static int scsi_send_eh_cmnd(struct scsi
struct scsi_eh_save ses;
int rtn;
 
-   scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes);
+   if (scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes,
+ GFP_KERNEL) != SUCCESS)
+   return FAILED;
+
shost->eh_action = &done;
 
spin_lock_irqsave(shost->host_lock, flags);
Index: linux-work/include/scsi/scsi_eh.h
===
--- linux-work.orig/include/scsi/scsi_eh.h  2007-11-20 13:36:44.0 +1100
+++ linux-work/include/scsi/scsi_eh.h   2007-11-20 13:42:49.0 +1100
@@ -81,9 +81,10 @@ struct scsi_eh_save {
struct scatterlist sense_sgl;
 };
 
-extern void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd,
+extern int scsi_eh_prep_cmnd(struct scsi_cmnd *scmd,
struct scsi_eh_save *ses, unsigned char *cmnd,
-   int cmnd_size, unsigned sense_bytes);
+   int cmnd_size, unsigned sense_bytes,
+   gfp_t gfp_mask);
 
 extern void scsi_eh_restore_cmnd(struct scsi_cmnd* scmd,
struct scsi_eh_save *ses);
Index: linux-work/drivers/scsi/NCR5380.c
===
--- linux-work.orig/drivers/scsi/NCR5380.c  2007-11-20 13:46:41.0 +1100
+++ linux-work/drivers/scsi/NCR5380.c   2007-11-20 13:46:47.0 +1100
@@ -2283,7 +2283,7 @@ static void NCR5380_information_transfer
}
 
		if ((cmd->cmnd[0] != REQUEST_SENSE) && (status_byte(cmd->SCp.Status) == CHECK_CONDITION)) {
-			scsi_eh_prep_cmnd(cmd, &hostdata->ses, NULL, 0, ~0);
+			scsi_eh_prep_cmnd(cmd, &hostdata->ses, NULL, 0, ~0, GFP_ATOMIC);
 
dprintk(NDEBUG_AUTOSENSE, 

Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 18:10 -0800, Roland Dreier wrote:
> 
> I wrapped this ugliness up inside the macro back in what I posted in
> 2002 (http://lkml.org/lkml/2002/6/12/234):
> 
> #define __dma_buffer __dma_buffer_line(__LINE__)
> #define __dma_buffer_line(line) __dma_buffer_expand_line(line)
> #define __dma_buffer_expand_line(line) \
> __attribute__ ((aligned(L1_CACHE_BYTES))); \
> char __dma_pad_ ## line [0] __attribute__ ((aligned(L1_CACHE_BYTES)))
> 
> then you just need to tag the actual member like:
> 
> char foo[3] __dma_buffer;

That's actually not too bad ... 
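
Spelled out for, say, line 42, the declaration "char foo[3] __dma_buffer;"
from your macro becomes (L1_CACHE_BYTES comes from the arch):

	char foo[3] __attribute__ ((aligned(L1_CACHE_BYTES)));
	char __dma_pad_42 [0] __attribute__ ((aligned(L1_CACHE_BYTES)));

i.e. the buffer starts on a cache line boundary and the zero-sized pad
pushes whatever member follows onto the next line, so nothing else in the
structure can share a line with the DMA'd buffer.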

I'm having a problem with reverting SCSI to use an allocation for the
sense buffer, because it can fail and the new scsi_eh_prep_cmnd() isn't
supposed to return a failure. I've changed that, but now I get into
trying to fix the various drivers that use it without checking the
result code, and it's becoming much more complicated than initially
thought.

So I may do the above instead and revive your patch.

Any objection ? James ? David ?

Cheers,
Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 16:46 -0800, David Miller wrote:
> 
> 1) Require that entire buffers are commited by call sites,
>and thus "embedding" DMA'd within non-DMA stuff isn't allowed
> 
> 2) Add the __dma_cacheline_aligned tag.
> 
> But note that with #2 it could get quite ugly because the
> alignment and size both have a minimum that needs to be
> enforced, not just the alignment alone.  So either:

Yup.

> struct foo {
> unsigned int other_unrelated_stuff;
> 
> struct object dma_thing __dma_cacheline_aligned;
> 
> unsigned int more_nondma_stuff __dma_cacheline_aligned;
> };

In my tests, I had used a "fuckton_t" object defined to be an empty
thing with an alignment constraint; seemed to work :-) But I'd rather
require #1.
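
For the record, that empty-aligned-object trick looks roughly like this
(the pad type name is made up; "struct object" is from your example above):

	struct dma_pad { } __attribute__ ((aligned(L1_CACHE_BYTES)));

	struct foo {
		unsigned int other_unrelated_stuff;

		struct dma_pad __pad_before;
		struct object dma_thing;
		struct dma_pad __pad_after;

		unsigned int more_nondma_stuff;
	};

The pads are zero-sized (empty structs are a GNU C extension, fine for the
kernel) but force dma_thing onto its own cache line(s) and push the next
member onto a fresh line, which covers both the alignment and the size
requirement.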

BTW. What is the status nowadays with skb's ?

Cheers,
Ben.



Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 14:31 -0800, David Miller wrote:
> From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
> Date: Tue, 20 Nov 2007 06:51:14 +1100
> 
> > On Mon, 2007-11-19 at 00:38 -0800, David Miller wrote:
> > > From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
> > > Date: Mon, 19 Nov 2007 16:35:23 +1100
> > > 
> > > You could make a dma_cacheline_aligned and use that.
> > > It seems pretty reasonable.
> > 
> > I was thinking about that. What archs would need it ? arm, mips, what
> > else ?
> 
> The sparc32 port would need it too.

James' preference seems to be a revert of the patch that removed the
kmalloc of the buffer instead. That definitely sounds like an easier plan
for .24 (and maybe even a backport to stable).

I'll produce a patch for that later today or tomorrow.

Do you still think we should introduce this __dma_cacheline_aligned ? Do
you see other cases of drivers where it would be useful ? I tend to
agree with your earlier statement that drivers doing that are broken and
should be using a separate allocator for DMA'ble objects (in fact, on
non-cache coherent archs, kmalloc is just fine).

Cheers,
Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 13:43 -0800, Roland Dreier wrote:
> > I've been debugging various issues on the PowerPC 44x embedded
>  > architecture which happens to have non-coherent PCI DMA.
>  > 
>  > One of the problem I'm hitting is that one really need to enforce
>  > kmalloc alignement to cache lines or bad things will happen (among
>  > others with USB), for some reasons, powerpc failed to do so, I fixed it.
> 
> Heh... I hit the same problem literally 5 years ago:
> http://lwn.net/Articles/1783/
> 
> I implemented the __dma_buffer annotation:
> http://lwn.net/Articles/2269/
> 
> But DaveM said we should just use the PCI pool code instead:
> http://lwn.net/Articles/2270/

Heh, well... 

In this case, DaveM just proposed something akin to your
__dma_buffer :-)

On the other hand, after discussing with James, it looks like we'll just
be reverting the patch that removed the kmalloc of the sense buffer
since non cache-coherent archs are supposed to enforce kmalloc alignment
to cache lines.

__dma_buffer still seems like a good thing to have if too many other
drivers are hitting that but for this specific problem, it's not the
approach that we'll be taking.

Cheers,
Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

> I'd like to be rid of it inside the command for various reasons:  every
> command has one of these, and they're expensive in the allocation (at 96
> bytes).  There's no reason we have to allocate and free that amount of
> space with every command.  In theory, the number of these is bounded at
> the queue depth, in practice, there's usually only one, and this DMA
> alignment issue does requires most drivers to double copy.

Do you have a plan for the short term here ? I'd like something fixed
for .25, so I may have to introduce a __dma_aligned macro of some sort
to deal with that in the meantime...

Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 09:09 -0600, James Bottomley wrote:
> > What other drivers do is DMA to their own allocation and then memcpy to
> > the sense buffer.
> > 
> > There is a movement to allocate the sense data as its own sg list, but
> > I don't think that patch has even been posted yet.
> 
> I'd like to be rid of it inside the command for various reasons:  every
> command has one of these, and they're expensive in the allocation (at 96
> bytes).  There's no reason we have to allocate and free that amount of
> space with every command.  In theory, the number of these is bounded at
> the queue depth, in practice, there's usually only one, and this DMA
> alignment issue does requires most drivers to double copy.

And most drivers don't, and break. Take USB storage: I -think- (the code
paths aren't trivial to follow) it just gets the sglist cooked up by the
code in scsi_error.c, no ? That just points to the buffer in scsi_cmnd. It
then passes that for DMA to the USB stack.
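
Concretely, the sglist in question is built by scsi_error.c like this
(this is the code the patch earlier in the thread replaces):

	sg_init_one(&ses->sense_sgl, scmd->sense_buffer, scmd->request_bufflen);
	scmd->request_buffer = &ses->sense_sgl;

so the USB stack ends up DMA-mapping an sg entry that points straight into
the middle of struct scsi_cmnd.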

Ben.




Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 05:32 -0700, Matthew Wilcox wrote:
> On Mon, Nov 19, 2007 at 04:35:23PM +1100, Benjamin Herrenschmidt wrote:
> > The other one I'm hitting now is that the SCSI layer nowadays embeds the
> 
> 'nowadays'?  It has always been so.

Wasn't it kmalloc'ed at one point ?

> > sense_buffer inside the scsi_cmnd structure without any kind of
> > alignment whatsoever. I've been hitting irregulary is a crash on SCSI 
> > command
> > completion that seems to be related to corruption of the "request"
> > pointer in struct scsi_cmnd and I think it might be the cause.
> > I'm now trying to setup a proper repro-case.
> 
> What other drivers do is DMA to their own allocation and then memcpy to
> the sense buffer.

What "other drivers" ? Those architectures use the same drivers as
everything else.

> There is a movement to allocate the sense data as its own sg list, but
> I don't think that patch has even been posted yet.

I've seen code creating an sglist from the scsi_cmnd->sense_buffer and
passing that to drivers. That breaks.

Ben.



Re: SCSI breakage on non-cache coherent architectures

2007-11-19 Thread Benjamin Herrenschmidt

On Mon, 2007-11-19 at 00:38 -0800, David Miller wrote:
> From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
> Date: Mon, 19 Nov 2007 16:35:23 +1100
> 
> > I'm not sure what is the best way to fix that. Internally, I've done
> > some test whacking some cacheline_aligned in the scsi_cmnd data
> > structure to verify I no longer get random SLAB corruption when using my
> > USB but that significantly bloats the size of the structure on archs
> > such as ppc64 that don't need it and have a large cache line size.
> > 
> > Unfortunately, I don't think there's any existing Kconfig symbol or arch
> > provided #define to tell us that we are on a non-coherent arch afaik
> > that could be used to make that conditional.
> > 
> > Another option would be to kmalloc the buffer (wasn't it the case before
> > btw ?) but I suppose some people will scream at the idea due to how the
> > command pools are done...
> 
> You could make a dma_cacheline_aligned and use that.
> It seems pretty reasonable.

I was thinking about that. What archs would need it ? arm, mips, what
else ?

Cheers,
Ben.



SCSI breakage on non-cache coherent architectures

2007-11-18 Thread Benjamin Herrenschmidt
Hi James !

(Please CC me on replies as I'm not subscribed to linux-scsi)

I've been debugging various issues on the PowerPC 44x embedded
architecture which happens to have non-coherent PCI DMA.

One of the problems I'm hitting is that one really needs to enforce
kmalloc alignment to cache lines or bad things will happen (among
others with USB); for some reason, powerpc failed to do so, I fixed it.
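
(The usual way to enforce that is to raise the arch's minimum kmalloc
alignment to the cache line size, roughly:

	#define ARCH_KMALLOC_MINALIGN	L1_CACHE_BYTES

in the arch headers -- shown here only as a sketch, not necessarily the
exact change that went in.)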

The other one I'm hitting now is that the SCSI layer nowadays embeds the
sense_buffer inside the scsi_cmnd structure without any kind of
alignment whatsoever. What I've been hitting irregularly is a crash on SCSI
command completion that seems to be related to corruption of the "request"
pointer in struct scsi_cmnd, and I think this might be the cause.
I'm now trying to set up a proper repro-case.

The sense buffer is something that is written to by the device, thus it
gets cache invalidated when the DMA happens, potentially losing whatever
was sharing the cache line, which includes, among other things, that
"request" pointer field.

There are other issues as well: if any of the fields sharing the cache
line happens to be read while the DMA is in progress, it would
populate the cache with memory content prior to the DMA being completed.
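
To illustrate with a schematic layout (not the real struct scsi_cmnd):

	struct cmd_like {
		struct request *request;	/* owned and updated by the CPU */
		unsigned char sense_buffer[96];	/* DMA'd into by the device    */
	};

If "request" and the start of "sense_buffer" fall into the same cache line,
the invalidate done for the sense DMA throws away a dirty update of
"request" (or a CPU read pulls the whole line back in while the device is
still writing it), and you get exactly the kind of corrupted pointer
described above.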

It's fairly important on such architectures not to share cache lines
between objects being DMA'd to/from and the rest of the system. If the
DMA is strictly outgoing, it's generally ok, but not if the DMA is
bidirectional or the device is writing to memory.

I'm not sure what the best way to fix that is. Internally, I've done
some tests whacking some cacheline_aligned into the scsi_cmnd data
structure to verify I no longer get random SLAB corruption when using my
USB device, but that significantly bloats the size of the structure on
archs such as ppc64 that don't need it and have a large cache line size.

Unfortunately, I don't think there's any existing Kconfig symbol or
arch-provided #define telling us that we are on a non-coherent arch
that could be used to make that conditional.

Another option would be to kmalloc the buffer (wasn't it the case before
btw ?) but I suppose some people will scream at the idea due to how the
command pools are done...

What do you suggest ?

Cheers,
Ben.




Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-16 Thread Benjamin Herrenschmidt
On Mon, 2007-07-16 at 17:03 -0500, James Bottomley wrote:
> On Tue, 2007-07-17 at 07:49 +1000, Benjamin Herrenschmidt wrote:
> > > No ... that was the point of flush_kernel_dcache_page().  The page in
> > > question is page cache backed and contains user mappings.  However, the
> > > block layer has already done a flush_dcache_page() in get_user_pages()
> > > and the user shouldn't be touching memory under I/O (unless they want
> > > self induced aliasing problems) so we're free to assume all the user
> > > cachelines are purged, hence all we have to do is flush the kernel alias
> > > to bring the page up to date and make the users see it correctly.
> > 
> > The block layer will have done that even in the swap-out path ? (Just
> > asking... I'm not very familiar with the block layer)
> 
> Er ... not really, this is the I/O path for user initiated I/O.  The
> page out path, by definition, can't have any extant user mappings.  For
> page out, the relevant page is flushed before its mapping is detached,
> and then it can be paged to the backing store (or for anonymous pages to
> the swap device) when no mappings remain.

Ok, thanks.

Ben.




Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-16 Thread Benjamin Herrenschmidt

> No ... that was the point of flush_kernel_dcache_page().  The page in
> question is page cache backed and contains user mappings.  However, the
> block layer has already done a flush_dcache_page() in get_user_pages()
> and the user shouldn't be touching memory under I/O (unless they want
> self induced aliasing problems) so we're free to assume all the user
> cachelines are purged, hence all we have to do is flush the kernel alias
> to bring the page up to date and make the users see it correctly.

The block layer will have done that even in the swap-out path ? (Just
asking... I'm not very familiar with the block layer)

Ben.



Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-16 Thread Benjamin Herrenschmidt
On Mon, 2007-07-16 at 08:47 -0500, James Bottomley wrote:
> 
> No ... that was the point of flush_kernel_dcache_page().  The page in
> question is page cache backed and contains user mappings.  However,
> the
> block layer has already done a flush_dcache_page() in get_user_pages()
> and the user shouldn't be touching memory under I/O (unless they want
> self induced aliasing problems) so we're free to assume all the user
> cachelines are purged, hence all we have to do is flush the kernel
> alias
> to bring the page up to date and make the users see it correctly. 

Ok. In our case the problem is not aliases though, it's the coherency
between instruction and data caches for pages that may be executed from
(such as swapped out text pages).

Ben.




Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-16 Thread Benjamin Herrenschmidt

> Upon closer look, while flush_kernel_dcache_page() is a no-op on ppc64,
> flush_dcache_page() isn't. So I'd prefer to not call it if not really needed.
> 
> And according to James, flush_kernel_dcache_page() should be sufficient...
> 
> So I'm getting puzzled again...

flush_dcache_page() handles icache vs. dcache coherency by clearing the
PG_arch_1 bit in the struct page that indicates that the page is cache
clean.

You -must- call it if you're going to use any form of CPU access to
write to the page (basically dirtying the data cache) and that page can
ever be mapped into user space and executed from.
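
The pattern, for reference (just a sketch, using the current
kmap_atomic/KM_USER0 API):

	void *kaddr = kmap_atomic(page, KM_USER0);
	memcpy(kaddr + offset, buf, len);
	kunmap_atomic(kaddr, KM_USER0);
	flush_dcache_page(page);	/* clears PG_arch_1: page no longer icache-clean */

Skip the flush_dcache_page() and a later execution from that page can hit
stale icache contents.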

I don't know what flush_kernel_dcache_page() does and whether it needs a
similar treatment; it's a new interface, so maybe Jens and/or James can
tell me more about it...

Cheers,
Ben.



Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-13 Thread Benjamin Herrenschmidt
On Fri, 2007-07-13 at 16:19 +0200, Arnd Bergmann wrote:
> I'm pretty sure that no ppc64 machine needs alias resolution in the kernel,
> although some are VIPT. Last time we discussed this, Segher explained it
> to me, but I don't remember which way Cell does it. IIRC, it automatically
> flushes cache lines that are accessed through aliases. 

Ah yes, I remember reading about this automatic flushing thing. I don't
know how the caches actually work on most of our PPC's, but the fact is,
we don't have aliasing issues, so I can safely ignore it for a bit
longer :-)

There are some aliasing issues with the instruction cache specifically
on some 4xx models but that's irrelevant to this discussion (and I think
we handle them elsewhere anyway).

Ben.




Re: [patch 5/6] ps3: BD/DVD/CD-ROM Storage Driver

2007-07-13 Thread Benjamin Herrenschmidt
On Fri, 2007-07-13 at 09:02 -0400, James Bottomley wrote:
> On Wed, 2007-07-04 at 15:22 +0200, Geert Uytterhoeven wrote:
> > +   kaddr = kmap_atomic(sgpnt->page, KM_USER0);
> > +   if (!kaddr)
> > +   return -1;
> > +   len = sgpnt->length;
> > +   if ((req_len + len) > buflen) {
> > +   active = 0;
> > +   len = buflen - req_len;
> > +   }
> +   memcpy(kaddr + sgpnt->offset, buf + req_len, len);
> > +   kunmap_atomic(kaddr, KM_USER0);
> 
> This isn't a SCSI objection, but this sequence appears several times in
> this driver.  It's wrong for a non-PIPT architecture (and I believe the
> PS3 is VIPT) because you copy into the kernel alias for the page, which
> dirties the line in the cache of that alias (the user alias cache line
> was already invalidated).  However, unless you flush the kernel alias to
> main memory, the user could read stale data.  The way this is supposed
> to be done is to do a 

Nah, we have no cache aliasing on ppc, at least none that matters here, and
not on Cell.
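
(For completeness, the sequence James is describing, on an arch that does
have the aliasing problem, would be roughly the quoted loop plus one flush
after the memcpy:

	kaddr = kmap_atomic(sgpnt->page, KM_USER0);
	memcpy(kaddr + sgpnt->offset, buf + req_len, len);
	flush_kernel_dcache_page(sgpnt->page);	/* push the kernel-alias lines out */
	kunmap_atomic(kaddr, KM_USER0);

just not needed here.)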

Ben.



Re: [patch 1/6] ps3: Preallocate bootmem memory for the PS3 FLASH ROM storage driver

2007-06-15 Thread Benjamin Herrenschmidt
On Fri, 2007-06-15 at 13:39 +0200, Geert Uytterhoeven wrote:
> plain text document attachment (ps3-stable)
> Preallocate 256 KiB of bootmem memory for the PS3 FLASH ROM storage driver.

I still very much dislike the #ifdef xxx_MODULE in main kernel code.

At the end of the day, is it realistic to ever use a PS3 without the
storage driver ? I would suggest just allocating those unconditionally.

Ben.

> Signed-off-by: Geert Uytterhoeven <[EMAIL PROTECTED]>
> Signed-off-by: Geoff Levand <[EMAIL PROTECTED]>
> ---
>  arch/powerpc/platforms/ps3/setup.c |   19 ++-
>  include/asm-powerpc/ps3.h  |1 +
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> --- a/arch/powerpc/platforms/ps3/setup.c
> +++ b/arch/powerpc/platforms/ps3/setup.c
> @@ -107,7 +107,8 @@ static void ps3_panic(char *str)
>   while(1);
>  }
>  
> -#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE)
> +#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE) || \
> +defined(CONFIG_PS3_FLASH) || defined(CONFIG_PS3_FLASH_MODULE)
>  static void prealloc(struct ps3_prealloc *p)
>  {
>   if (!p->size)
> @@ -123,7 +124,9 @@ static void prealloc(struct ps3_prealloc
>   printk(KERN_INFO "%s: %lu bytes at %p\n", p->name, p->size,
>  p->address);
>  }
> +#endif
>  
> +#if defined(CONFIG_FB_PS3) || defined(CONFIG_FB_PS3_MODULE)
>  struct ps3_prealloc ps3fb_videomemory = {
>   .name = "ps3fb videomemory",
>   .size = CONFIG_FB_PS3_DEFAULT_SIZE_M*1024*1024,
> @@ -146,6 +149,18 @@ early_param("ps3fb", early_parse_ps3fb);
>  #define prealloc_ps3fb_videomemory() do { } while (0)
>  #endif
>  
> +#if defined(CONFIG_PS3_FLASH) || defined(CONFIG_PS3_FLASH_MODULE)
> +struct ps3_prealloc ps3flash_bounce_buffer = {
> + .name = "ps3flash bounce buffer",
> + .size = 256*1024,
> + .align = 256*1024
> +};
> +EXPORT_SYMBOL_GPL(ps3flash_bounce_buffer);
> +#define prealloc_ps3flash_bounce_buffer() prealloc(&ps3flash_bounce_buffer)
> +#else
> +#define prealloc_ps3flash_bounce_buffer() do { } while (0)
> +#endif
> +
>  static int ps3_set_dabr(u64 dabr)
>  {
>   enum {DABR_USER = 1, DABR_KERNEL = 2,};
> @@ -175,6 +190,8 @@ static void __init ps3_setup_arch(void)
>  #endif
>  
>   prealloc_ps3fb_videomemory();
> + prealloc_ps3flash_bounce_buffer();
> +
>   ppc_md.power_save = ps3_power_save;
>  
>   DBG(" <- %s:%d\n", __func__, __LINE__);
> --- a/include/asm-powerpc/ps3.h
> +++ b/include/asm-powerpc/ps3.h
> @@ -427,6 +427,7 @@ struct ps3_prealloc {
>  };
>  
>  extern struct ps3_prealloc ps3fb_videomemory;
> +extern struct ps3_prealloc ps3flash_bounce_buffer;
>  
> 
>  #endif
> 



Re: [patch 6/7] ps3: ROM Storage Driver

2007-05-30 Thread Benjamin Herrenschmidt
On Wed, 2007-05-30 at 12:13 +0200, Christoph Hellwig wrote:
> 
> For any sane hypervisor or hardware the copy should be worth
> than that.  Then again a sane hardware or hypervisor would support
> SG requests..

Agreed... Sony should fix that, it's a bit ridiculous.

Ben.




Re: [patch 6/7] ps3: ROM Storage Driver

2007-05-29 Thread Benjamin Herrenschmidt
On Tue, 2007-05-29 at 13:11 +0200, Geert Uytterhoeven wrote:
> > This looks very inefficient.  Just set sg_tablesize of your driver
> > to 1 to avoid getting mutiple segments.
> 
> The disadvantage of setting sg_tablesize = 1 is that the driver will
> get small
> requests (PAGE_SIZE) most of the time, which is very bad for
> performance.

And the joke is that not only can the HW do scatter & gather, but you
also have an iommu ...

Ben.




Re: qla_wxyz pci_set_mwi question

2007-04-12 Thread Benjamin Herrenschmidt

> Willy was referring to this from include/asm-powerpc/pci.h:
> 
> #ifdef CONFIG_PPC64
> 
> /*
>  * We want to avoid touching the cacheline size or MWI bit.
>  * pSeries firmware sets the cacheline size (which is not the cpu cacheline
>  * size in all cases) and hardware treats MWI the same as memory write.
>  */
> #define PCI_DISABLE_MWI
> 
> 
> which makes pci_set_mwi() do nothing other than return 0;

Interesting... I think I missed that we've had that bit for some time :-)

Well, I suppose that on pSeries and probably pmac too, the firmware will
set the MWI bit for us anyway, but it's a bit dodgy to apply that to
all ppc64... they aren't all pSeries.

I'll have to look into that again one of these days.

Cheers,
Ben.




Re: qla_wxyz pci_set_mwi question

2007-04-12 Thread Benjamin Herrenschmidt
On Thu, 2007-04-12 at 14:04 -0600, Matthew Wilcox wrote:
> On Thu, Apr 12, 2007 at 12:37:13PM -0700, Andrew Vasquez wrote:
> > On Thu, 12 Apr 2007, Matthew Wilcox wrote:
> > > Why should it fail?  If there's a platform which can't support a
> > > cacheline size that the qla2xyz card can handle, it should be able to
> > > happily fall back to doing plain writes instead of MWIs.  IMO, it should
> > > just call pci_set_mwi() and ignore the result.
> > 
> > I believe there were some erratas on some ISP2xxx chips where MWI
> > needed to be set for proper operation.  I'll go back, verify and
> > update the patch accordingly.
> 
> Hmm.  The thing is that pci_set_mwi() returns success on machines where
> MWI is disabled (currently only PPC64).  Perhaps it needs to fail
> instead.

MWI isn't disabled on ppc64... or did I miss something ?

Ben.



Re: [PATCH 35/59] sysctl: C99 convert ctl_tables in arch/powerpc/kernel/idle.c

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> This was partially done already and there was no ABI breakage what
> a relief.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  arch/powerpc/kernel/idle.c |   11 ---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c
> index 8994af3..8b27bb1 100644
> --- a/arch/powerpc/kernel/idle.c
> +++ b/arch/powerpc/kernel/idle.c
> @@ -110,11 +110,16 @@ static ctl_table powersave_nap_ctl_table[]={
>   .mode   = 0644,
>   .proc_handler   = &proc_dointvec,
>   },
> - { 0, },
> + {}
>  };
>  static ctl_table powersave_nap_sysctl_root[] = {
> - { 1, "kernel", NULL, 0, 0755, powersave_nap_ctl_table, },
> - { 0,},
> + {
> + .ctl_name   = CTL_KERN,
> + .procname   = "kernel",
> + .mode   = 0755,
> + .child  = powersave_nap_ctl_table,
> + },
> + {}
>  };
>  
>  static int __init



Re: [PATCH 36/59] sysctl: C99 convert ctl_tables entries in arch/ppc/kernel/ppc_htab.c

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> And make the mode of the kernel directory 0555 no one is allowed
> to write to sysctl directories.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  arch/ppc/kernel/ppc_htab.c |   11 ---
>  1 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/ppc/kernel/ppc_htab.c b/arch/ppc/kernel/ppc_htab.c
> index bd129d3..77b20ff 100644
> --- a/arch/ppc/kernel/ppc_htab.c
> +++ b/arch/ppc/kernel/ppc_htab.c
> @@ -442,11 +442,16 @@ static ctl_table htab_ctl_table[]={
>   .mode   = 0644,
>   .proc_handler   = &proc_dol2crvec,
>   },
> - { 0, },
> + {}
>  };
>  static ctl_table htab_sysctl_root[] = {
> - { 1, "kernel", NULL, 0, 0755, htab_ctl_table, },
> - { 0,},
> + {
> + .ctl_name   = CTL_KERN,
> + .procname   = "kernel",
> + .mode   = 0555,
> + .child  = htab_ctl_table,
> + },
> + {}
>  };
>  
>  static int __init



Re: [PATCH 18/59] sysctl: ipmi remove unnecessary insert_at_head flag

2007-01-16 Thread Benjamin Herrenschmidt
On Tue, 2007-01-16 at 09:39 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <[EMAIL PROTECTED]> - unquoted
> 
> With unique sysctl binary numbers setting insert_at_head is pointless.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>

Acked-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>

> ---
>  drivers/char/ipmi/ipmi_poweroff.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/char/ipmi/ipmi_poweroff.c b/drivers/char/ipmi/ipmi_poweroff.c
> index 9d23136..b3ae65e 100644
> --- a/drivers/char/ipmi/ipmi_poweroff.c
> +++ b/drivers/char/ipmi/ipmi_poweroff.c
> @@ -686,7 +686,7 @@ static int ipmi_poweroff_init (void)
>   printk(KERN_INFO PFX "Power cycle is enabled.\n");
>  
>  #ifdef CONFIG_PROC_FS
> - ipmi_table_header = register_sysctl_table(ipmi_root_table, 1);
> + ipmi_table_header = register_sysctl_table(ipmi_root_table, 0);
>   if (!ipmi_table_header) {
>   printk(KERN_ERR PFX "Unable to register powercycle sysctl\n");
>   rv = -ENOMEM;



Re: [PATCH 2.6.15.4 rel.2 1/1] libata: add hotswap to sata_svw

2006-11-28 Thread Benjamin Herrenschmidt
On Tue, 2006-11-28 at 23:22 +, David Woodhouse wrote:
> On Thu, 2006-02-16 at 16:09 +0100, Martin Devera wrote:
> > From: Martin Devera <[EMAIL PROTECTED]>
> > 
> > Add hotswap capability to Serverworks/BroadCom SATA controlers. The
> > controler has SIM register and it selects which bits in SATA_ERROR
> > register fires interrupt.
> > The solution hooks on COMWAKE (plug), PHYRDY change and 10B8B decode 
> > error (unplug) and calls into Lukasz's hotswap framework.
> > The code got one day testing on dual core Athlon64 H8SSL Supermicro 
> > MoBo with HT-1000 SATA, SMP kernel and two CaviarRE SATA HDDs in
> > hotswap bays.
> > 
> > Signed-off-by: Martin Devera <[EMAIL PROTECTED]>
> 
> What became of this?

I might be to blame for not testing it... The Xserve I had on my desk
was too noisy for most of my co-workers, so I kept delaying and forgot
about it.

Also the Xserve I have only has one disk, which makes hotplug testing a
bit harder :-)

Ben.




Re: iomapping a big endian area

2005-04-04 Thread Benjamin Herrenschmidt
On Mon, 2005-04-04 at 08:59 -0500, James Bottomley wrote:
> On Mon, 2005-04-04 at 17:50 +1000, Benjamin Herrenschmidt wrote:
> > I disagree. The driver will never "know" ...
> 
> ? the driver has to know.  Look at the 53c700 to see exactly how awful
> it is.  This beast has byte and word registers.  When used BE, all the
> byte registers alter their position (to both inb and readb).

What I mean is that the driver doesn't have to "know" whether the CPU/bus
combo endianness will require it to use "native" or "swapped" access.
What the driver knows is whether the device it tries to access needs BE or
LE accessors.

> > I don't think it's sane. You know that your device is BE or LE and use
> > the appropriate interface. "native" doesn't make sense to me in this
> > context.
> 
> Well ... it's like this. Native means "pass through without swapping"
> and has an easy implementation on both BE and LE platforms.  Logically
> io{read,write}{16,32}be would have to do byte swaps on LE platforms.
> Being lazy, I'm opposed to doing the work if there's no actual use for
> it, so can you provide an example of a BE bus (or device) used on a LE
> platform that would actually benefit from this abstraction?

I don't think drivers will benefit from "native" vs. "swapping". That
means that pretty much all drivers that care will end up with #ifdef
crap to pick the right one based on the CPU endianness, and some will get
it wrong of course. No, I _do_ think that this should be hidden in the
implementation, and we should provide "le" and "be" accessors (le being
the current ones, be new ones). Which ones swap or not is in the
implementation of those accessors for the architecture, but the driver
shouldn't have to care.
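
To sketch what I mean, the implementation side boils down to something
like this (illustrative only, not actual kernel code):

	#ifdef __BIG_ENDIAN
	#define ioread32be(addr)	__raw_readl(addr)		/* pass through */
	#else
	#define ioread32be(addr)	swab32(__raw_readl(addr))	/* swap on LE CPUs */
	#endif

and the driver just picks ioread32() vs. ioread32be() based on what the
device register is, never on what the CPU is.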

Ben.




Re: iomapping a big endian area

2005-04-04 Thread Benjamin Herrenschmidt

> > Well ... it's like this. Native means "pass through without swapping"
> > and has an easy implementation on both BE and LE platforms.  Logically
> > io{read,write}{16,32}be would have to do byte swaps on LE platforms.
> > Being lazy, I'm opposed to doing the work if there's no actual use for
> > it, so can you provide an example of a BE bus (or device) used on a LE
> > platform that would actually benefit from this abstraction?
> 
> I would probably spell "native" as "noswap".
> "native" just doesn't convey enough specific meaning...

But that implies that the driver has to know that the bus and the device
and the CPU have the same byte endianness, etc.; that is rather specific,
and if the driver knows all that, it can also just use the correct "be" or
"le" accessor ... I really see no point in a "native" abstraction.

Ben.




Re: iomapping a big endian area

2005-04-04 Thread Benjamin Herrenschmidt
On Sat, 2005-04-02 at 22:27 -0600, James Bottomley wrote:
> On Sat, 2005-04-02 at 20:08 -0800, David S. Miller wrote:
> > > Did anyone have a preference for the API?  I was thinking
> > > ioread32_native, but ioread32be is fine too.
> > 
> > I think doing foo{be,le}{8,16,32}() would be consistent with
> > our byteorder.h interface names.
> 
> Thinking about this some more, I know of no case of a BE bus connected
> to a LE system, nor do I think anyone would ever create such a beast, so
> our only missing interface is for a BE bus on a BE system.

It's more a matter of the device than the bus imho... 

> Thus, I think io{read,write}{16,32}_native are better interfaces ...

I disagree. The driver will never "know" ...

> they basically mean pass memory operations without byte swaps, so
> they're well defined on both BE and LE systems and correspond exactly to
> our existing _raw_{read,write}{w,l} calls (principle of least surprise).

I don't think it's sane. You know that your device is BE or LE and use
the appropriate interface. "native" doesn't make sense to me in this
context.

Ben.




Re: iomapping a big endian area

2005-04-04 Thread Benjamin Herrenschmidt
On Sat, 2005-04-02 at 21:40 -0600, James Bottomley wrote:

> Actually, ioread8be is unnecessary, but I was planning to add
> ioread16/ioread32 and iowritexx be on be variants (equivalent to
> _raw_readw et al.)
> 
> After all, the driver must know the card is BE, so the routines that
> make use of the feature are easily coded into the card, so there's no
> real need to add it to the iomem cookie.
> 
> Did anyone have a preference for the API?  I was thinking
> ioread32_native, but ioread32be is fine too.

I think we want "be" since obviously, the "ioread32" one is "le", that
makes sense to provide both and let the implementation bother with what
has to swap or not. With "native", x86 would do what ? swap or not
swap ? unclear ... with "be", at least it's clear.

The problem ? Hehe, well ... there is at least one little problem: The
way iomap "fixes" the issue of mem vs. io space in a driver could have
been used to also fix le vs. be issues. For example, a USB OHCI
controller is normally little endian. But some too-stupid-for-words HW
folks tried to be too smart on a number of embedded chips, and put a big
endian version of it. Thus the driver ends up having to support both.
Most embedded vendors just butcher the driver with #ifdef's, which is
fine by me ... until you end up having _also_ a PCI bus with an
EHCI/OHCI pair on it on the same board... then you are toast.

But, I wouldn't bother too much about this case. The driver has other
issues than just IO to deal with (the DMA data structures in memory are
also endian-swapped), so I suppose the entire driver needs to be somewhat
#included from a wrapper and compiled twice for different endians to get
that right...

Ben.

