Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-27 Thread Roland Dreier
On our cell blade + PCI-e Mellanox.
  
  I don't see anything in arch/powerpc that looks like
  dma_alloc_coherent() will do anything other than allocate some memory
  and map it with DMA_BIDIRECTIONAL.  So how does this altix fix help in
  your situation?  Am I misreading the Cell IOMMU code?

Shirley, can you clarify why doing dma_alloc_coherent() in the kernel
helps on your Cell blade?  It really seems that dma_alloc_coherent()
just allocates some memory and then does dma_map(DMA_BIDIRECTIONAL),
which would be exactly the same as allocating the CQ buffer in
userspace and using ib_umem_get() to map it into the kernel.

I'm looking at a possibly cleaner solution to the Altix issue, so I
would like to make sure it fixes whatever the bug on Cell is as well.
So any details you can provide about the problem you see on Cell would
help a lot.

Thanks...

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-27 Thread Shirley Ma

Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 01:40:36 PM:

 Shirley, can you clarify why doing dma_alloc_coherent() in the kernel
 helps on your Cell blade?  It really seems that dma_alloc_coherent()
 just allocates some memory and then does dma_map(DMA_BIDIRECTIONAL),
 which would be exactly the same as allocating the CQ buffer in
 userspace and using ib_umem_get() to map it into the kernel.

 I'm looking at a possibly cleaner solution to the Altix issue, so I
 would like to make sure it fixes whatever the bug on Cell is as well.
 So any details you can provide about the problem you see on Cell would
 help a lot.

 Thanks...

Thanks, Roland. After reviewing the whole thread, I think the failure on
Cell is different from the Altix issue, so this fix might not help Cell.
The problem I have might be related to multiple DMAs mapping to the same
CQ. The sync might be getting lost somewhere else.

Thanks
Shirley Ma

Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-26 Thread Roland Dreier
  That would be great. We hit a similar problem in our cluster test -- data
  corruption because of this race.

On what platform?

 - R.




Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-26 Thread Shirley Ma

 Hmm, OK.  Then I will do my best to make sure we get a fix for this
 into 2.6.22.

That would be great. We hit a similar problem in our cluster test -- data
corruption because of this race.

Thanks
Shirley Ma

Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-26 Thread Shirley Ma

Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:09:48 PM:
   That would be great. We hit a similar problem in our cluster test --
   data corruption because of this race.

 On what platform?

  - R.

On our cell blade + PCI-e Mellanox.

Thanks
Shirley Ma

Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-26 Thread Roland Dreier
  On our cell blade + PCI-e Mellanox.

I don't see anything in arch/powerpc that looks like
dma_alloc_coherent() will do anything other than allocate some memory
and map it with DMA_BIDIRECTIONAL.  So how does this altix fix help in
your situation?  Am I misreading the Cell IOMMU code?

 - R.




Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-22 Thread Roland Dreier
  A first-cut at a patch was sent out, some very reasonable
  objections were raised, and the thread fizzled out.

Sorry, I meant to respond again, but I never got around to it.

  The biggest concern with the earlier patch seemed to be
  backward compatibility. There was a stab at addressing
  that in http://tinyurl.com/2x3s52, but no commentary.
  (Too ugly for words?)

I think you went off into the weeds there, but I'll respond to that
earlier email in detail.

  Any suggestions as to how to proceed? Should I just code
  something up in order to have a concrete target to discuss?
  Or are there any new thoughts based on the previous emails?

I actually have a vague plan for a somewhat cleaner way to get this
fix.  For a variety of reasons, I am planning on changing the way the
kernel handles memory registration so that low-level drivers have more
control over what happens.  This would allow us to follow Gleb's
suggestion to use register MR to create and map the kernel's buffer
and avoid some of the error path ugliness.  So I would prefer to map
the coherent memory that way.

However this will take a while to come to fruition, since it is kind
of a background task for me.  How severe is this issue?  In other
words, when you produced the problem, was it a synthetic test, or a
workload that someone might actually want to run?

 - R.




Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-22 Thread akepner
On Thu, Feb 22, 2007 at 10:34:16AM -0800, Roland Dreier wrote:
 
 I actually have a vague plan for a somewhat cleaner way to get this
 fix.  For a variety of reasons, I am planning on changing the way the
 kernel handles memory registration so that low-level drivers have more
 control over what happens.  This would allow us to follow Gleb's
 suggestion to use register MR to create and map the kernel's buffer
 and avoid some of the error path ugliness.  So I would prefer to map
 the coherent memory that way.

OK, I look forward to seeing what you have in mind.

 
 However this will take a while to come to fruition, since it is kind
 of a background task for me.  How severe is this issue?  In other
 words, when you produced the problem, was it a synthetic test, or a
 workload that someone might actually want to run?
 

We found this accidentally, running a normal MPI job, on a
normally sized machine (i.e., tens, not hundreds of
processors.) It appears to be more easily produced than
we'd expected, and we consider it to be a severe problem.

-- 
Arthur





Re: [openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-22 Thread Roland Dreier
  We found this accidentally, running a normal MPI job, on a
  normally sized machine (i.e., tens, not hundreds of
  processors.) It appears to be more easily produced than
  we'd expected, and we consider it to be a severe problem.

Hmm, OK.  Then I will do my best to make sure we get a fix for this
into 2.6.22.

 - R.




[openib-general] [RFC/BUG] DMA vs. CQ race

2007-02-21 Thread akepner

In:

http://openib.org/pipermail/openib-general/2006-December/030251.html

I described a potential race between DMA and CQ updates on
Altix systems. At that time the bug hadn't been observed,
but was expected to be possible on large NUMA systems.

A first-cut at a patch was sent out, some very reasonable
objections were raised, and the thread fizzled out.

Since that time we've been able to produce the bug, and show
that the patch I sent fixes the problem. (OK, the patch I sent
with the addition of a small but important patchlet.)

The biggest concern with the earlier patch seemed to be
backward compatibility. There was a stab at addressing
that in http://tinyurl.com/2x3s52, but no commentary.
(Too ugly for words?)

Any suggestions as to how to proceed? Should I just code
something up in order to have a concrete target to discuss?
Or are there any new thoughts based on the previous emails?

-- 
Arthur

