RE: [PATCH v2 00/14] RDS: connection scalability and performance improvements
From: Santosh Shilimkar
> Sent: 30 September 2015 18:24
...
> This is being addressed by simply using per bucket rw lock which makes the
> locking simple and very efficient. The hash table size is still an issue and
> I plan to address it by using re-sizable hash tables as suggested on the list.

If the hash chains are short, do you need the expense of a rw lock for
each chain?
A simple spinlock may be faster.

If you use the hash chain lock for the reference count on the hashed
objects, you should be able to release the lock before locking the
object itself.

	David
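A minimal sketch of the pattern David suggests - one plain spinlock per
bucket, with the object's reference count taken while the bucket lock is
held so the bucket lock can be dropped before the object's own lock is
acquired. All names here are illustrative, not taken from the RDS patches:

	struct bucket {
		spinlock_t lock;
		struct hlist_head chain;
	};

	struct conn {
		struct hlist_node node;
		atomic_t refcnt;
		u32 key;
	};

	static struct conn *conn_lookup(struct bucket *b, u32 key)
	{
		struct conn *c;

		spin_lock(&b->lock);
		hlist_for_each_entry(c, &b->chain, node) {
			if (c->key == key) {
				/* Pin the object under the bucket lock... */
				atomic_inc(&c->refcnt);
				spin_unlock(&b->lock);
				/* ...so its own lock can be taken later,
				 * without holding the bucket lock. */
				return c;
			}
		}
		spin_unlock(&b->lock);
		return NULL;
	}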
RE: [PATCH 0/10] use safer test on the result of find_first_zero_bit
From: Julia Lawall
> On Wed, 4 Jun 2014, Geert Uytterhoeven wrote:
> > Hi Julia,
> >
> > On Wed, Jun 4, 2014 at 1:00 PM, Julia Lawall wrote:
> > > OK, thanks. I was only looking at the C code.
> > >
> > > But the C code contains a loop that is followed by:
> > >
> > >	if (!size)
> > >		return result;
> > >	tmp = *p;
> > >
> > > found_first:
> > >	tmp |= ~0UL << size;
> > >	if (tmp == ~0UL)	/* Are any bits zero? */
> > >		return result + size;	/* Nope. */
> > >
> > > In the first return, it would seem that result == size. Could the second
> > > one be changed to just return size? It should not hurt performance.
> >
> > "size" may have been changed between function entry and this line.
> > So you have to store it in a temporary.
>
> Sorry, after reflection it seems that indeed size + result is always the
> original size, so it is actually all of the code that uses >= that is
> doing something unnecessary. == for the failure test is fine.

There is nothing wrong with defensive coding.

The 'tmp |= ~0UL << size' ensures that the return value is 'correct' when
there are no zero bits.
The function could have been defined so that this wasn't needed.

If you assume that 'no zero bits' is unlikely, then checking the return
value from ffz() could well be slightly faster.
Not that anything is likely to notice.

	David
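Putting the quoted fragment in context, the tail of the classic generic C
implementation looks like this (reconstructed from the lib/ code being
discussed; comments added):

	found_first:
		tmp |= ~0UL << size;		/* force out-of-range bits to 1 */
		if (tmp == ~0UL)		/* Are any bits zero? */
			return result + size;	/* Nope: == the original size */
	found_middle:
		return result + ffz(tmp);	/* index of the first zero bit */

The masking is exactly the defensive step David describes: it makes ffz()
ignore bits beyond the caller's size, and makes the no-zero-bit return
value equal the original size.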
RE: [PATCH 0/10] use safer test on the result of find_first_zero_bit
From: Julia Lawall
> On Wed, 4 Jun 2014, Geert Uytterhoeven wrote:
> > Hi Julia,
> >
> > On Wed, Jun 4, 2014 at 11:07 AM, Julia Lawall wrote:
> > > Find_first_zero_bit considers BITS_PER_LONG bits at a time, and thus may
> > > return a larger number than the maximum position argument if that position
> > > is not a multiple of BITS_PER_LONG.
> >
> > Shouldn't this be fixed in find_first_zero_bit() instead?
>
> OK, I could do that as well. Most of the callers currently test with >=.
> Should they be left as is, or changed to use ==?

Do we want to add an extra test to find_first_zero_bit() and effectively
slow down all the calls - especially those where the length is a multiple
of BITS_PER_LONG (probably the most common case)?

Maybe the documented return code should be changed to allow for the
existing behaviour.

	David
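A caller-side sketch of the behaviour under discussion (the size of 70 is
arbitrary, chosen because it is not a multiple of BITS_PER_LONG):

	DECLARE_BITMAP(map, 70);
	unsigned long n;

	bitmap_fill(map, 70);			/* no zero bit in the first 70 */
	n = find_first_zero_bit(map, 70);
	/* Implementations scan BITS_PER_LONG bits at a time, so n is not
	 * guaranteed to be exactly 70 here; it may be larger.  Hence the
	 * defensive caller test: */
	if (n >= 70)
		pr_debug("no free bit\n");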
RE: [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext()
From: Yann Droneaud
> Hi,
>
> Le vendredi 28 mars 2014 11:24 +0300, Dan Carpenter a écrit :
> > The c4iw_alloc_ucontext_resp struct has a 4 byte hole after the last
> > member and we should clear it before passing it to the user.
> >
> > Fixes: 05eb23893c2c ('cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes')
> > Signed-off-by: Dan Carpenter
>
> It's not the proper fix for this issue: an explicit padding has to be
> added (and initialized), see "Re: [PATCH net-next 2/2] cxgb4/iw_cxgb4:
> Doorbell Drop Avoidance Bug Fixes"
> http://marc.info/?i=1395848977.3297.15.camel@localhost.localdomain
>
> In its current form, the c4iw_alloc_ucontext_resp structure does not
> require padding on i386, so a 32-bit userspace program using this
> structure against an x86_64 kernel will make the kernel overflow a
> buffer in userspace, likely on the stack, when answering a GET_CONTEXT
> request:
...
> struct c4iw_alloc_ucontext_resp {
>	struct ibv_get_context_resp ibv_resp;
>	__u64 status_page_key;
>	__u32 status_page_size;
> };

Or add __attribute__((aligned(4))) to the 64-bit fields.
And maybe a compile-time assert on the length of the structure.
Since it is part of an ABI it must not change.

	David
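The explicit-padding fix Yann points to would look something like the
following sketch (the field name 'reserved' is an assumption; the key
points are that the hole becomes a named member and that the kernel
zeroes it before copying to userspace):

	struct c4iw_alloc_ucontext_resp {
		struct ibv_get_context_resp ibv_resp;
		__u64 status_page_key;
		__u32 status_page_size;
		__u32 reserved;	/* explicit padding; must be zeroed */
	};

With the hole named, the structure has the same size on i386 and x86_64,
so a 32-bit userspace buffer is large enough for the 64-bit kernel's
reply.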
RE: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
> > It seems to me that a more useful interface would take a minimum and
> > maximum number of vectors from the driver. This wouldn't allow the
> > driver to specify that it could only accept, say, any even number within
> > a certain range, but you could still leave the current functions
> > available for any driver that needs that.
>
> Mmmm.. I am not sure I am getting it. Could you please rephrase?

One possibility is for drivers that can use a lot of interrupts to
request a minimum number initially and then request the additional ones
much later on.
That would make it less likely that none will be available for
devices/drivers that need them but are initialised later.

	David
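For reference, the min/max style of interface discussed above is roughly
what later appeared in the PCI core; a usage sketch along those lines
(the helper shown postdates this thread, and setup_queues() is a
hypothetical driver function):

	int nvec;

	/* Ask for anywhere between 4 and 32 MSI-X vectors. */
	nvec = pci_alloc_irq_vectors(pdev, 4, 32, PCI_IRQ_MSIX);
	if (nvec < 0)
		return nvec;	/* not even the minimum was available */

	/* Size the driver's queues to whatever was actually granted. */
	setup_queues(adapter, nvec);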
RE: [PATCH 0/2] fs: supply inode uid/gid setting interface
> Subject: Re: [PATCH 0/2] fs: supply inode uid/gid setting interface
>
> On 2013/8/23 12:10, Greg KH wrote:
> > On Fri, Aug 23, 2013 at 10:48:36AM +0800, Rui Xiang wrote:
> >> This patchset implements accessor functions to set uid/gid
> >> in the inode struct. Just finish code clean up.
> >
> > Why?
>
> It can introduce a new function to reduce some code.
> Just clean up.

In what sense is it a 'cleanup'?

To my mind it just means that anyone reading the code has to go and look
at another file in order to see what the function does (it might be
expected to be more complex).

It also adds run-time cost. While probably not directly measurable, I
suspect it more than doubles the execution time of that code fragment -
do that everywhere and the system will run like a sick pig.

The only real use for accessor functions is when you don't want the
structure offset to be public. In this case that is obviously not the
case, since all the drivers are directly accessing other members of the
structure.

	David
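To make the objection concrete: the sort of accessor being proposed
(names hypothetical, not taken from the patchset) turns a single store
into a function call:

	/* Proposed style: an out-of-line setter wrapping one store. */
	void inode_set_uid(struct inode *inode, kuid_t uid)
	{
		inode->i_uid = uid;
	}

	/* The caller's single store...		*/
	inode->i_uid = uid;
	/* ...becomes a call, with its overhead:	*/
	inode_set_uid(inode, uid);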
RE: [PATCH v2 net-next 05/22] cxgb4: Add T5 write combining support
> This patch implements a low latency Write Combining (aka Write Coalescing)
> work request path. PCIe maps User Space Doorbell BAR2 region writes to the
> new interface to SGE. SGE pulls a new message from the PCIe new interface
> and, if it is a coalesced write work request, pushes it for processing.
> This patch copies the coalesced work request to memory-mapped BAR2 space.
...
> +	} else {
> +		memset(data, 0, 2 * sizeof(u64));
> +		*data += 2;
> +	}

Using memset is overkill (or rather a big overhead if it isn't detected
by the compiler).
Nothing wrong with:

		(*data)[0] = 0;
		(*data)[1] = 0;
		*data += 2;

Actually, typing that, I realise that you should probably have read *data
into a local variable, and then updated it when finished.
Otherwise some of the accesses might be ones which force the compiler to
reload the value.

>  static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
>  {
> +	unsigned int *wr, index;
> +
>  	wmb();	/* write descriptors before telling HW */
>  	spin_lock(&q->db_lock);
>  	if (!q->db_disabled) {
> -		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
> -			     QID(q->cntxt_id) | PIDX(n));
> +		if (is_t4(adap->chip)) {
> +			t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
> +				     QID(q->cntxt_id) | PIDX(n));
> +		} else {
> +			if (n == 1) {
> +				index = q->pidx ? (q->pidx - 1) : (q->size - 1);
> +				wr = (unsigned int *)&q->desc[index];
> +				cxgb_pio_copy((u64 __iomem *)
> +					      (adap->bar2 + q->udb + 64),
> +					      (u64 *)wr);

Why all the casts on 'wr'?

> +			} else
> +				writel(n, adap->bar2 + q->udb + 8);
> +			wmb();

Since you actually need memory barriers here on x86, you definitely need
a comment saying so, and it would (IMHO) be better to use a different
define in the source (even if it is currently converted to wmb() in a
local header file).

Thinking further, for portability you might want to find some way of
abstracting the multi-word writes.
For example, some of the ppc parts have a dma engine associated with the
PCIe master interface that can be used to generate large TLPs.
The code would still want to spin waiting for the dma to complete (since
the transfer would be far faster than any interrupt path).

	David
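Spelling out the local-variable version David suggests (assuming 'data'
is a u64 **, as his suggested replacement implies):

	u64 *p = *data;		/* read the pointer through *data once */

	p[0] = 0;
	p[1] = 0;
	*data = p + 2;		/* write the advanced pointer back */

With the local copy, the compiler is free to keep 'p' in a register
instead of possibly reloading '*data' between accesses.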
RE: [PATCH net-next 05/22] cxgb4: Add T5 write combining support
> > > > +	writel(n, adap->bar2 + q->udb + 8);
> > > > +#if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
> > > > +	asm volatile("sfence" : : : "memory");
> > > > +#endif
> > >
> > > There is absolutely no way I'm letting anyone put crap like this
> > > into a driver.
> > >
> > > Use a portable interface, and if one does not exist create one.
> >
> > I guess you'll have to add a wc_wmb() function for all the hw platforms
> > supported by the kernel. I see libibverbs defines this for the user
> > side in include/infiniband/arch.h, and that could be used as the meat of
> > the hw platform-specific implementations.
>
> I see that the normal wmb() code for the x86_64 architecture is the same
> as what the above #ifdef condition is doing. To be more clear, I looked
> at the assembly code for wmb and found that it is converted into an
> 'sfence' instruction. So I think the above code should be replaced with
> a wmb() call, which will also take care of portability on different
> architectures. I will submit the series again soon.

From my recollection of the x86 architecture, the memory barriers are
hardly ever needed, certainly not in the places where, for example, a ppc
needs them.
I'd actually suspect that the normal wmb() for x86 should be a nop.

About the only place where any of the fence instructions are needed is in
relation to write-combining accesses.
In particular, I don't believe they are ever needed to synchronise
uncached accesses with each other, or with cached accesses (which are
snooped).

	David
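A sketch of the portable wrapper being asked for - the name wc_wmb() is
borrowed from libibverbs as mentioned above, and the per-architecture
definitions are illustrative only:

	/* Barrier ordering write-combining stores, per architecture. */
	#if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
	#define wc_wmb()	asm volatile("sfence" ::: "memory")
	#else
	#define wc_wmb()	wmb()	/* conservative fallback */
	#endif

This keeps the driver source free of raw asm and #ifdefs, while letting
each architecture supply the cheapest barrier that orders WC writes.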
RE: [PATCH 2/9] ocrdma: Driver for Emulex OneConnect RDMA adapter
> > struct {
> >	u64 fielda;
> >	u32 field;
> > };
>
> In this case: On 64 bit, the u64 is aligned to 8 and the u32 is aligned
> to 4, so the structure is aligned to 8. A pad is inserted at the end of
> the struct to bring it out. On 32 bit, the u64 is aligned to 4, so the
> struct is aligned to 4, so no pad is added.

That is true for 32-bit i386; some 32-bit ABIs may require 64-bit
alignment for 64-bit items.

...

> The aligned attribute overrides the automatic determination of the
> alignment based on the contents and just forces it.
>
> So, as an example, if you have this hardware layout:
>
> struct {
>	u32 fielda;
>	u64 fieldb;
> } __attribute__((aligned(4)));
>
> David is saying you will get a 12 byte struct and fieldb will be
> unaligned. Since 12 is aligned to 4 no padding is added.

I was actually suggesting defining:

	typedef u64 u64_aligned_8 __attribute__((aligned(8)));

Then you can define:

	struct foo {
		u32 fielda;
		u64_aligned_8 fieldb;
	};

and know that the alignment should be the same on all systems.
(But I'd still add the explicit pad in the above.)

This is actually more useful when you need to process structures that
have misaligned 64-bit items - as happens when doing 'compat 32' code in
a 64-bit kernel, or when trying to map the x86 disk MBR.

> For hardware facing structures I'd combine this with a static assert
> to verify structure size at compile time.

Also useful if you need to ensure a structure doesn't accidentally change
size when you need to preserve a userspace interface.

> So..
>
> 1) Avoid using attributes unless the structure has unaligned members.
> 2) Avoid creating structures with unaligned members (eg for userspace
>    communication)
> 3) Frown at hardware/firmware developers who make communication
>    structures with unaligned members :)
> 4) Be explicit about padding in your layout for 64/32 compatibility.

Agreed.

	David
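Combining the two suggestions - the aligned typedef, the explicit pad,
and the compile-time size check - gives something like this (the struct
name and the check function are illustrative):

	typedef u64 u64_aligned_8 __attribute__((aligned(8)));

	struct foo {
		u32 fielda;
		u32 pad;		/* the explicit pad David mentions */
		u64_aligned_8 fieldb;	/* offset 8 on every ABI */
	};

	static void foo_check_abi(void)
	{
		/* Catch accidental ABI size changes at compile time. */
		BUILD_BUG_ON(sizeof(struct foo) != 16);
	}

Because the typedef forces 8-byte alignment even where the base ABI would
only require 4, the layout (and sizeof) is 16 bytes on both 32-bit and
64-bit builds.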
RE: [PATCH 3/9] ocrdma: Driver for Emulex OneConnect RDMA adapter
> > +#define is_cqe_wr_imm(cqe) \
> > +	((le32_to_cpu(cqe->flags_status_srcqpn) & OCRDMA_CQE_WRITE_IMM) ? 1 : 0)
>
> ...similar comment about using readable typesafe inline functions
> instead of macros...

And if you are using #defines, you need to enclose every reference to the
parameters in ().

	David
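The two fixes side by side - the macro with its parameter parenthesised
as David asks, or the type-safe inline the earlier reviewer preferred
(the struct type name is an assumption; a real driver would pick one of
the two):

	#define is_cqe_wr_imm(cqe) \
		((le32_to_cpu((cqe)->flags_status_srcqpn) & OCRDMA_CQE_WRITE_IMM) ? 1 : 0)

	/* or, type-safely: */
	static inline int is_cqe_wr_imm(const struct ocrdma_cqe *cqe)
	{
		return (le32_to_cpu(cqe->flags_status_srcqpn) &
			OCRDMA_CQE_WRITE_IMM) ? 1 : 0;
	}

Without the extra parentheses, an argument such as 'p + 1' would expand
to 'p + 1->flags_status_srcqpn' and fail to compile (or worse, compile to
the wrong thing for other operators).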
RE: [PATCH 2/9] ocrdma: Driver for Emulex OneConnect RDMA adapter
> > - Header file for userspace library and kernel driver interface.
>
> > +struct ocrdma_alloc_ucontext_resp {
> > +	u32 dev_id;
> > +	u32 wqe_size;
> > +	u32 max_inline_data;
> > +	u32 dpp_wqe_size;
> > +	u64 ah_tbl_page;
> > +	u32 ah_tbl_len;
> > +	u32 rsvd;
> > +	u8 fw_ver[32];
> > +	u32 rqe_size;
> > +	u64 rsvd1;
> > +} __packed;
>
> If I'm reading this correctly, you have the 8-byte rsvd1 member at an
> offset only aligned to 4 bytes, because of the __packed directive. It
> would be much better to have these structures laid out so they are
> naturally the same on both 32-bit and 64-bit ABIs, and get rid of the
> __packed directive, which wrecks gcc code generation in some cases.

gcc also supports defining types that have non-standard alignment
constraints; these can be used to force the same alignment for 64-bit
fields between i386 and amd64 - probably __attribute__((aligned(n))) or
similar.

This can be used to force 32-bit alignment in amd64 code in order to
match definitions in 32-bit userspace.
For new things it would make sense to force 64-bit alignment of 64-bit
fields for 32-bit code.

Adding __packed (rather than 32-bit alignment) forces the compiler to
generate byte-by-byte accesses for all the fields on systems that can't
do misaligned accesses in hardware (eg sparc).

	David
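A layout along the lines David suggests, with the padding made explicit
so __packed can be dropped and every offset matches on i386 and x86_64
(the 'rsvd2' pad name is ours; the kernel later standardised an
8-byte-aligned u64 for exactly this purpose as __aligned_u64):

	struct ocrdma_alloc_ucontext_resp {
		u32 dev_id;		/* offset  0 */
		u32 wqe_size;		/* offset  4 */
		u32 max_inline_data;	/* offset  8 */
		u32 dpp_wqe_size;	/* offset 12 */
		u64 ah_tbl_page;	/* offset 16 - naturally 8-byte aligned */
		u32 ah_tbl_len;		/* offset 24 */
		u32 rsvd;		/* offset 28 */
		u8  fw_ver[32];		/* offset 32 */
		u32 rqe_size;		/* offset 64 */
		u32 rsvd2;		/* offset 68 - explicit pad, not the
					 * compiler's implicit one */
		u64 rsvd1;		/* offset 72; sizeof == 80 everywhere */
	};

Because every 64-bit field now sits at an 8-byte-aligned offset and every
hole is a named member, the compiler inserts no padding of its own, so
__packed (and its byte-by-byte code generation on strict-alignment
machines) is no longer needed.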