Re: [openib-general] [RFC/BUG] DMA vs. CQ race
> > On our cell blade + PCI-e Mellanox.
>
> I don't see anything in arch/powerpc that looks like dma_alloc_coherent() will do anything other than allocate some memory and map it with DMA_BIDIRECTIONAL. So how does this altix fix help in your situation? Am I misreading the Cell IOMMU code?

Shirley, can you clarify why doing dma_alloc_coherent() in the kernel helps on your Cell blade? It really seems that dma_alloc_coherent() just allocates some memory and then does dma_map(DMA_BIDIRECTIONAL), which would be exactly the same as allocating the CQ buffer in userspace and using ib_umem_get() to map it into the kernel.

I'm looking at a possibly cleaner solution to the Altix issue, so I would like to make sure it fixes whatever the bug on Cell is as well. So any details you can provide about the problem you see on Cell would help a lot.

Thanks...

___
openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Roland Dreier [EMAIL PROTECTED] wrote on 02/27/2007 01:40:36 PM:

> Shirley, can you clarify why doing dma_alloc_coherent() in the kernel helps on your Cell blade? It really seems that dma_alloc_coherent() just allocates some memory and then does dma_map(DMA_BIDIRECTIONAL), which would be exactly the same as allocating the CQ buffer in userspace and using ib_umem_get() to map it into the kernel.
>
> I'm looking at a possibly cleaner solution to the Altix issue, so I would like to make sure it fixes whatever the bug on Cell is as well. So any details you can provide about the problem you see on Cell would help a lot.
>
> Thanks...

Thanks, Roland. After reviewing the whole thread, I see that the failure on Cell is different from the Altix issue, so this fix might not help Cell. The problem I have might be related to multiple DMAs mapping to the same CQ, or the sync might be lost somewhere else.

Thanks
Shirley Ma
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
> That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.

On what platform?

- R.
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
> Hmm, OK. Then I will do my best to make sure we get a fix for this into 2.6.22.

That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.

Thanks
Shirley Ma
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
Roland Dreier [EMAIL PROTECTED] wrote on 02/26/2007 02:09:48 PM:

> > That would be great. We hit a similar problem in our cluster test -- data corruption because of this race.
>
> On what platform?
>
> - R.

On our cell blade + PCI-e Mellanox.

Thanks
Shirley Ma
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
> On our cell blade + PCI-e Mellanox.

I don't see anything in arch/powerpc that looks like dma_alloc_coherent() will do anything other than allocate some memory and map it with DMA_BIDIRECTIONAL. So how does this altix fix help in your situation? Am I misreading the Cell IOMMU code?

- R.
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
> A first-cut at a patch was sent out, some very reasonable objections were raised, and the thread fizzled out.

Sorry, I meant to respond again, but I never got around to it.

> The biggest concern with the earlier patch seemed to be backward compatibility. There was a stab at addressing that in http://tinyurl.com/2x3s52, but no commentary. (Too ugly for words?)

I think you went off into the weeds there, but I'll respond to that earlier email in detail.

> Any suggestions as to how to proceed? Should I just code something up in order to have a concrete target to discuss? Or are there any new thoughts based on the previous emails?

I actually have a vague plan for a somewhat cleaner way to get this fix. For a variety of reasons, I am planning on changing the way the kernel handles memory registration so that low-level drivers have more control over what happens. This would allow us to follow Gleb's suggestion to use register MR to create and map the kernel's buffer and avoid some of the error path ugliness. So I would prefer to map the coherent memory that way.

However, this will take a while to come to fruition, since it is kind of a background task for me. How severe is this issue? In other words, when you produced the problem, was it a synthetic test, or a workload that someone might actually want to run?

- R.
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
On Thu, Feb 22, 2007 at 10:34:16AM -0800, Roland Dreier wrote:

> I actually have a vague plan for a somewhat cleaner way to get this fix. For a variety of reasons, I am planning on changing the way the kernel handles memory registration so that low-level drivers have more control over what happens. This would allow us to follow Gleb's suggestion to use register MR to create and map the kernel's buffer and avoid some of the error path ugliness. So I would prefer to map the coherent memory that way.

OK, I look forward to seeing what you have in mind.

> However this will take a while to come to fruition, since it is kind of a background task for me. How severe is this issue? In other words, when you produced the problem, was it a synthetic test, or a workload that someone might actually want to run?

We found this accidentally, running a normal MPI job, on a normally sized machine (i.e., tens, not hundreds of processors). It appears to be more easily produced than we'd expected, and we consider it to be a severe problem.

-- Arthur
Re: [openib-general] [RFC/BUG] DMA vs. CQ race
> We found this accidentally, running a normal MPI job, on a normally sized machine (i.e., tens, not hundreds of processors). It appears to be more easily produced than we'd expected, and we consider it to be a severe problem.

Hmm, OK. Then I will do my best to make sure we get a fix for this into 2.6.22.

- R.
[openib-general] [RFC/BUG] DMA vs. CQ race
In:

  http://openib.org/pipermail/openib-general/2006-December/030251.html

I described a potential race between DMA and CQ updates on Altix systems. At that time the bug hadn't been observed, but was expected to be possible on large NUMA systems. A first-cut at a patch was sent out, some very reasonable objections were raised, and the thread fizzled out.

Since that time we've been able to produce the bug, and show that the patch I sent fixes the problem. (OK, the patch I sent with the addition of a small but important patchlet.)

The biggest concern with the earlier patch seemed to be backward compatibility. There was a stab at addressing that in http://tinyurl.com/2x3s52, but no commentary. (Too ugly for words?)

Any suggestions as to how to proceed? Should I just code something up in order to have a concrete target to discuss? Or are there any new thoughts based on the previous emails?

-- Arthur
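
[Archive note, not part of the original message.] For readers new to the thread, the kind of race under discussion can be modeled in a few lines of userspace C. This is only a sketch under assumptions the thread does not spell out: that the payload write and the CQ entry write travel to memory by paths with different latency, and that a write to specially mapped ("barrier") memory forces all earlier in-flight writes to land first. The names `fabric`, `dma_write`, `poll_and_read`, and the tick values are all hypothetical and do not correspond to any kernel or driver API.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_PENDING 4

/* One DMA write still in flight between the device and memory. */
struct pending_write {
    int *dst;
    int value;
    int arrival;   /* hypothetical tick at which the write reaches memory */
};

/* Toy model of the interconnect: a small set of in-flight writes. */
struct fabric {
    struct pending_write q[MAX_PENDING];
    int n;
};

/* Land every in-flight write whose arrival tick is <= now. */
static void fabric_deliver(struct fabric *f, int now)
{
    int kept = 0;
    for (int i = 0; i < f->n; i++) {
        if (f->q[i].arrival <= now)
            *f->q[i].dst = f->q[i].value;
        else
            f->q[kept++] = f->q[i];
    }
    f->n = kept;
}

/* Issue a DMA write. A barrier write (assumed behavior) first forces
 * all earlier in-flight writes to land, then travels normally. */
static void dma_write(struct fabric *f, int *dst, int value,
                      int arrival, bool barrier)
{
    if (barrier)
        fabric_deliver(f, INT_MAX);
    assert(f->n < MAX_PENDING);
    f->q[f->n++] = (struct pending_write){ dst, value, arrival };
}

/* Device writes payload, then the completion entry. The consumer polls
 * the completion and returns whatever payload it sees at that moment. */
static int poll_and_read(bool cq_barrier)
{
    int payload = 0;   /* 0 stands for stale buffer contents */
    int cqe = 0;       /* 0 stands for "no completion yet" */
    struct fabric f = { .n = 0 };

    dma_write(&f, &payload, 42, 5, false);   /* data takes the slow path */
    dma_write(&f, &cqe, 1, 1, cq_barrier);   /* completion takes the fast path */

    for (int now = 0; now <= 5; now++) {
        fabric_deliver(&f, now);
        if (cqe)
            return payload;
    }
    return -1;   /* completion never seen; not expected in this model */
}
```

In this model, `poll_and_read(false)` returns the stale value 0 because the completion overtakes the payload, while `poll_and_read(true)` returns 42. Whether a given platform's dma_alloc_coherent() provides such ordering is exactly the question raised in the replies.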