RE: [PATCH v3 06/10] net/macb: clean up ring buffer logic
Instead of masking head and tail every time we increment them, just let them wrap through UINT_MAX and mask them when subscripting. Add simple accessor functions to do the subscripting properly to minimize the chances of messing this up. ... +static unsigned int macb_tx_ring_avail(struct macb *bp) +{ + return TX_RING_SIZE - (bp-tx_head - bp-tx_tail); +} That one doesn't look quite right to me. Surely it should be masking with 'TX_RING_SIZE - 1' David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 06/10] net/macb: clean up ring buffer logic
On Tue, Oct 30, 2012 at 4:12 AM, David Laight david.lai...@aculab.com wrote: Instead of masking head and tail every time we increment them, just let them wrap through UINT_MAX and mask them when subscripting. Add simple accessor functions to do the subscripting properly to minimize the chances of messing this up. ... +static unsigned int macb_tx_ring_avail(struct macb *bp) +{ + return TX_RING_SIZE - (bp-tx_head - bp-tx_tail); +} That one doesn't look quite right to me. Surely it should be masking with 'TX_RING_SIZE - 1' Why is that? head and tail can never be more than TX_RING_SIZE apart, so it shouldn't make any difference. It's a ring buffer (I presume) the pointers can be in either order. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 06/10] net/macb: clean up ring buffer logic
return (TX_RING_SIZE - (bp-tx_head - bp-tx_tail) (TX_RING_SIZE - 1)); Is equivalent to: return (bp-tx_tail - bp-tx_head) (TX_RING_SIZE - 1)); David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v8 01/16] hashtable: introduce a small and naive hashtable
On Tue, Oct 30, 2012 at 02:45:57PM -0400, Sasha Levin wrote: +/* Use hash_32 when possible to allow for fast 32bit hashing in 64bit kernels. */ +#define hash_min(val, bits) \ +({ \ + sizeof(val) = 4 ? \ + hash_32(val, bits) : \ + hash_long(val, bits); \ +}) Doesn't the above fit in 80 column. Why is it broken into multiple lines? Also, you probably want () around at least @val. In general, it's a good idea to add () around any macro argument to avoid nasty surprises. It was broken to multiple lines because it looks nicer that way (IMO). If we wrap it with () it's going to go over 80, so it's going to stay broken down either way :) ({ \ sizeof(val) = 4 ? hash_32(val, bits) : hash_long(val, bits); \ }) Is the better way to go. We are C programmers, we like to see the ?: on a single line if possible. The way you have it, looks like three statements run consecutively. To add some more colour (not color): In any case, this is a normal C #define, it doesn't need the {}. So it can just be: # define hash_min(val, bits) \ (sizeof(val) = 4 ? hash_32(val, bits) : hash_long(val, bits)) I don't think that s/val/(val)/g and s/bits/(bits)/g are needed because the tokens are already ',' separated. I do actually wonder how many of these hash lists should be replaced with some kind of tree structure in order to get O(log(n)) searches. After all hashing is still O(n). (apologies if I mean o(n) not O(n) - it's a long time since I did my maths degree!) David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3/9] net: xfrm: use this_cpu_ptr per-cpu helper
this_cpu_read |-_this_cpu_generic_read #define _this_cpu_generic_read(pcp) \ ({ typeof(pcp) ret__; \ preempt_disable(); \ ret__ = *this_cpu_ptr((pcp)); \ preempt_enable(); \ ret__; \ }) this_cpu_read operations locate per-cpu variable with preemption safe, not disable interrupts. why is it atomic vs interrupts? Hmmm... what effect do those preemt_dis/enable() actually have? Since a pre-empt can happen either side of them, the value the caller sees can be for the wrong cpu anyway. The only time I could see them being necessary is if *this_cpu_ptr() itself needs mutex protection in order to function correctly - and that is likely to be port specific. On i386/amd64 where (I guess) it is an access offset by fs/gs this isn't necessary and just wastes cpu cycles. If the caller cares which cpu the value comes from (eg to increment a counter) then the caller would need to disable pre-emption across the whole operation. David
RE: [PATCH v4] lxt PHY: Support for the buggy LXT973 rev A2
This patch adds proper handling of the buggy revision A2 of LXT973 phy, adding precautions linked to ERRATA Item 4: Revision A2 of LXT973 chip randomly returns the contents of the previous even register when you read a odd register regularly Does reading the PHY registers involve bit-banging an MII interface? If so this code is likely to stall the system for significant periods (ready phy registers at all can be a problem). I know some ethernet mac have hardware blocks for reading MII and even polling one MII register for changes. Maybe some of this code ought to be using async software bit-bang - especially when just polling for link status change. I'm sure it ought to be possible to do one bit-bang action per clock tick instead of spinning for the required delays. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v6] hashtable: introduce a small and naive hashtable
Amazing how something simple gets lots of comments and versions :-) ... + * This has to be a macro since HASH_BITS() will not work on pointers since + * it calculates the size during preprocessing. + */ +#define hash_empty(hashtable) \ +({ \ + int __i; \ + bool __ret = true; \ + \ + for (__i = 0; __i HASH_SIZE(hashtable); __i++) \ + if (!hlist_empty(hashtable[__i])) \ + __ret = false; \ + \ + __ret; \ +}) Actually you could have a #define that calls a function passing in the address and size. Also, should the loop have a 'break' in it? David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v6] hashtable: introduce a small and naive hashtable
And even then, if we would do: for (i = 0; i HASH_SIZE(hashtable); i++) if (!hlist_empty(hashtable[i])) break; return i = HASH_SIZE(hashtable); What happens if the last entry of the table is non-empty ? It still works, as 'i' is not incremented due to the break. And i will still be less than HASH_SIZE(hashtable). Did you have *your* cup of coffee today? ;-) Ahh, right! Actually I had it already ;-) I tend to dislike the repeated test, gcc might be able to optimise it away, but the code is cleaner written as: for (i = 0; i HASH_SIZE(hashtable); i++) if (!hlist_empty(hashtable[i])) return false; return true; Agreed that the flags should be removed. Moving to define + static inline is still important though. Not sure I'd bother making the function inline. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v6] hashtable: introduce a small and naive hashtable
Moreover, if your thinking is that we do not need a static inline function replicated at every caller, maybe we should introduce a lib/hashtable.c that implements those 2 functions. That was my thought... Given their nature, I'd guess they aren't critical path. Probably not worth adding an entire source file though. With regard to 'inline', when I say it, I really mean it! Unfortunately some people seem to just type it anyway (rather like 'register' used to get used) - so the compilers start ignoring the request. At least some versions of gcc will usually inline static functions that are only called once - but even then it can change its mind for almost no reason. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2 00/11] introduce random32_get_bytes() and random32_get_bytes_state()
On Sun, 04 Nov 2012 00:43:31 +0900, Akinobu Mita said: This patchset introduces new functions into random32 library for getting the requested number of pseudo-random bytes. Before introducing these new functions into random32 library, prandom32() and prandom32_seed() with prandom32 prefix are renamed to random32_state() and srandom32_state() respectively. The purpose of this renaming is to prevent some kernel developers from assuming that prandom32() and random32() might imply that only prandom32() was the one using a pseudo-random number generator by prandom32's p, and the result may be a very embarassing security exposure. Out of curiosity, why the '32'? I'm just waiting for some kernel developer to do something stupid with this on a 64-bit arch because they think it's a 32-bit API. ;) Should we bite the bullet and lose the 32, as long as we're churning the code *anyhow*? Also why remove the 'pseudo' part of the name? It is an important part of the name. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [RFC PATCH v2 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
Looks the problem is worse than above, not only bitfields are affected, the adjacent fields might be involved too, see: http://lwn.net/Articles/478657/ Not mentioned in there is that even with x86/amd64 given a struct with the following adjacent fields: char a; char b; char c; then foo-b |= 0x80; might do a 32bit RMW cycle. This will (well might - but probably does) happen if compiled to a 'BTS' instruction. The x86 instruction set docs are actually unclear as to whether the 32bit cycle might even be misaligned! amd64 might do a 64bit cycle (not checked the docs). David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH net-next?] pktgen: Use simpler test for non-zero ipv6 address
- if (pkt_dev-min_in6_daddr.s6_addr32[0] == 0 - pkt_dev-min_in6_daddr.s6_addr32[1] == 0 - pkt_dev-min_in6_daddr.s6_addr32[2] == 0 - pkt_dev-min_in6_daddr.s6_addr32[3] == 0) ; - else { + if (pkt_dev-min_in6_daddr.s6_addr32[0] | + pkt_dev-min_in6_daddr.s6_addr32[1] | + pkt_dev-min_in6_daddr.s6_addr32[2] | + pkt_dev-min_in6_daddr.s6_addr32[3]) { Given the likely values, it might be worth using: if (pkt_dev-min_in6_daddr.s6_addr32[0] == 0 pkt_dev-min_in6_daddr.s6_addr32[1] | pkt_dev-min_in6_daddr.s6_addr32[2] | pkt_dev-min_in6_daddr.s6_addr32[3]) { David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [RFC net-next] treewide: s/ipv4_is_foo()/ipv4_addr_foo/
ipv4 and ipv6 use different styles for these tests. ipv4_is_foo(__be32) ipv6_addr_foo(struct in6_addr *) I presume there is a 'const' in there ... Perhaps it'd be good to convert the ipv4 tests to the ipv6 style. You don't want to force an IPv4 address (which might be in a register) be written out to stack. Taking the address also has implications for the optimiser. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] SUNRPC: Prevent kernel stack corruption on long values of flush
... long is always the same or bigger then a pointer (A pointer must always fit in a long) ... Linux may make that assumption, but it doesn't have to be true. 64bit windows still has 32bit long. C99 inttypes.h defines [u]intptr_t to be an integral type that is large enough to hold a pointer to any data item. (That in itself is problematic for implementations that encode multiple characters into a machine word and need to use 'fat' pointers in order to encode the offset.) David
RE: [PATCH BUGFIX 2/6] pkt_sched: fix the update of eligible-group sets
Between two invocations of make_eligible, the system virtual time may happen to grow enough that, in its binary representation, a bit with higher order than 31 flips. This happens especially with TSO/GSO. Before this fix, the mask used in make_eligible was computed as (1ULindex_of_last_flipped_bit)-1, whose value is well defined on a 64-bit architecture, because index_of_flipped_bit = 63, but is in general undefined on a 32-bit architecture if index_of_flipped_bit 31. The fix just replaces 1UL with 1ULL. ... - unsigned long mask = (1UL fls(vslot ^ old_vslot)) - 1; + unsigned long mask = (1ULL fls(vslot ^ old_vslot)) - 1; I'm not sure you really want to be doing 64bit shifts on 32 bit systems just to generate ~0ul. probably worth a conditional. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
Allocate regular pages to use as backing for the RX ring and use the DMA API to sync the caches. This should give a bit better performance since it allows the CPU to do burst transfers from memory. It is also a necessary step on the way to reduce the amount of copying done by the driver. I've not tried to understand the patches, but you have to be very careful using non-snooped memory for descriptor rings. No amount of DMA API calls can sort out some of the issues. Basically you must not dirty a cache line that contains data that the MAC unit might still write to. For the receive ring this means that you must not setup new rx buffers for ring entries until the MAC unit has filled all the ring entries in the same cache line. This probably means only adding rx buffers in blocks of 8 or 16 (or even more if there are large cache lines). I can't see any code in the patch that does this. Doing the same for the tx ring is more difficult, especially if you can't stop the MAC unit polling the TX ring on a timer basis. Basically you can only give the MAX tx packets if either it is idle, or if the tx ring containing the new entries starts on a cache line. If the MAC unit is polling the ring, then to give it multiple items you may need to update the 'owner' bit in the first ring entry last - just in case the cache line gets written out before you've finished. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
On 12/03/2012 01:43 PM, David Laight : Allocate regular pages to use as backing for the RX ring and use the DMA API to sync the caches. This should give a bit better performance since it allows the CPU to do burst transfers from memory. It is also a necessary step on the way to reduce the amount of copying done by the driver. I've not tried to understand the patches, but you have to be very careful using non-snooped memory for descriptor rings. No amount of DMA API calls can sort out some of the issues. David, Maybe I have not described the patch properly but the non-coherent memory is not used for descriptor rings. It is used for DMA buffers pointed out by descriptors (that are allocated as coherent memory). As buffers are filled up by the interface DMA and then, afterwards, used by the driver to pass data to the net layer, it seems to me that the use of non-coherent memory is sensible. Ah, ok - difficult to actually determine from a fast read of the code. So you invalidate (I think that is the right term) all the cache lines that are part of each rx buffer before giving it back to the MAC unit. (Maybe that first time, and just those cache lines that might have been written to after reception - I'd worry about whether the CRC is written into the rx buffer!) I was wondering if the code needs to do per page allocations? Perhaps that is necessary to avoid needing a large block of contiguous physical memory (and virtual addresses)? I know from some experiments done many years ago that a data copy in the MAC tx and rx path isn't necessarily as bad as people may think - especially if it removes complicated 'buffer loaning' schemes and/or iommu setup (or bounce buffers due to limited hardware memory addressing). The rx copy can usually be made to be a 'whole word' copy (ie you copy the two bytes of garbage that (mis)align the destination MAC address, and some bytes after the CRC. With some hardware I believe it is possible for the cache controller to do cache-line aligned copies very quickly! (Some very new x86 cpus might be doing this for 'rep movsd'.) The copy in the rx path is also better for short packets the can end up queued for userspace (although a copy in the socket code would solve that one. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
If I understand well, you mean that the call to: dma_sync_single_range_for_device(bp-pdev-dev, phys, pg_offset, frag_len, DMA_FROM_DEVICE); in the rx path after having copied the data to skb is not needed? That is also the conclusion that I found after having thinking about this again... I will check this. You need to make sure that the memory isn't in the data cache when you give the rx buffer back to the MAC. (and ensure the cpu doesn't read it until the rx is complete.) I've NFI what that dma_sync call does - you need to invalidate the cache lines. For the CRC, my driver is not using the CRC offloading feature for the moment. So no CRC is written by the device. I was thinking it would matter if the MAC wrote the CRC into the buffer (even though it was excluded from the length). It doesn't - you only need to worry about data you've read. I was wondering if the code needs to do per page allocations? Perhaps that is necessary to avoid needing a large block of contiguous physical memory (and virtual addresses)? The page management seems interesting for future management of RX buffers as skb fragments: that will allow to avoid copying received data. Dunno - the complexities of such buffer loaning schemes often exceed the gain of avoiding the data copy. Using buffers allocated to the skb is a bit different - since you completely forget about the memory once you pass the skb upstream. Some quick sums indicate you might want to allocate 8k memory blocks and split into 5 buffers. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2] net/macb: Use non-coherent memory for rx buffers
Well, for the 10/100 MACB interface, I am stuck with 128 Bytes buffers! So this use of pages seems sensible. If you have dma coherent memory you can make the rx buffer space be an array of short buffers referenced by adjacent ring entries (possibly with the last one slightly short to allow for the 2 byte offset). Then, when a big frame ends up in multiple buffers, you need a maximum of two copies to extract the data. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH, resubmit] ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver
+struct ax88179_rx_pkt_header { + u8 l4_csum_err:1, + l3_csum_err:1, + l4_type:3, + l3_type:2, + ce:1; + + u8 vlan_ind:3, + rx_ok:1, + pri:3, + bmc:1; + + u16 len:13, + crc:1, + mii:1, + drop:1; +} __packed; This won't work on both big-endian systems (assuming this works on x86). You apparently try to convert the structure in-place in ax88179_rx_fixup() by calling le32_to_cpus(); that may work if you define all the bitfields to be part of a u32 but it won't work with the current definition. Trying to use bit fields to map hardware registers (etc) isn't a good idea at all. The C standard says absolutely nothing about the order in which the bits are allocated to the field names. (There are at least 4 possible orederings.) It is much better to define constants for the bit values and explicitly mask them as required. David
Re: [PATCH v3 0/7] Add O_DENY* support for VFS and CIFS/NFS
On Thu, Feb 28, 2013 at 01:53:25PM -0800, Andy Lutomirski wrote: O_DENYMAND - to switch on/off three flags above. O_DENYMAND doesn't deny anything. Would a name like O_RESPECT_DENY be better? Possibly rename to O_CHECK_DENY ? David -- David Laight: da...@l8s.co.uk -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] vhost-net: fall back to vmalloc if high-order allocation fails
+ n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN); + if (!n) + n = vmalloc(sizeof *n); I'm slightly confused by this construct. I thought kmalloc(... GFP_KERNEL) would sleep waiting for memory (rather than return NULL). I realise that (for multi-page sizes) that kmalloc() and vmalloc() both need to find a contiguous block of kernel virtual addresses - in different address ranges, and that vmalloc() then has to find physical memory pages (which will not be contiguous). I think this means that kmalloc() is likely to be faster (if it doesn't have to sleep), but that vmalloc() is allocating from a much larger resource. This make me that that large allocations that are not temporary should probably be allocated with vmalloc(). Is there a 'NO_SLEEP' flag to kmalloc()? is that all GFP_ATOMIC requests? If so you might try a non-sleeping kmalloc() with a vmalloc() if it fails. This all looks as though there should be a GFP_NONCONTIG flag (or similar) so that kmalloc() can make a decision itself. Of at least a wrapper - like the one for free(). David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] vhost-net: fall back to vmalloc if high-order allocation fails
I think this means that kmalloc() is likely to be faster (if it doesn't have to sleep), but that vmalloc() is allocating from a much larger resource. This make me that that large allocations that are not temporary should probably be allocated with vmalloc(). vmalloc has some issues for example afaik it's not backed by a huge page so I think its use puts more stress on the TLB cache. Thinks further ... 64bit systems are likely to have enough kernel VA to be able to map all of physical memory into contiguous VA. I don't know if Linux does that, but I know code to map it was added to NetBSD amd64 in order to speed up kernel accesses to 'random' pages - it might have been partially backed out due to bugs! If physical memory is mapped like that then kmalloc() requests can any of physical memory and be unlikely to fail - since user pages can be moved in order to generate contiguous free blocks. Doesn't help with 32bit systems - they had to stop mapping all of physical memory into kernel space a long time ago. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [patch v2] b43: N-PHY: fix gain in b43_nphy_get_gain_ctl_workaround_ent()
+ int gain_data[] = {0x0062, 0x0064, 0x006a, 0x106a, 0x106c, +0x1074, 0x107c, 0x207c}; static const int ... David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v6 08/46] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
I wouldn't go that far... ;-) Unfairness is not a show-stopper right? IMHO, the warning/documentation should suffice for anybody wanting to try out this locking scheme for other use-cases. I presume that by 'fairness' you mean 'write preference'? I'd not sure how difficult it would be, but maybe have two functions for acquiring the lock for read, one blocks if there is a writer waiting, the other doesn't. That way you can change the individual call sites separately. The other place I can imagine a per-cpu rwlock being used is to allow a driver to disable 'sleep' or software controlled hardware removal while it performs a sequence of operations. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Disable IPv4-mapped - enforce IPV6_V6ONLY
A proper solution would be to either return false if net.ipv6.bindv6only is true and optval is false (which would break downward compatibility because it wouldn't just be a default and setsockopt might return an error) or to introduce a new sysctl variable like net.ipv6.bindv6only_enforced_silently. (silently because setsockopt() wouldn't return an error if net.ipv6.bindv6only is true and optval (v6only in the example above) is false.) I would volunteer to write a patch which introduces something like net.ipv6.bindv6only_enforced_silently if some maintainer would give me his ok. If so, the question remains if systemctl net.ipv6.bindv6only_enforced_silently = 1 should set systemctl.net.ipv6.bindv6only too or if an error should be returned if net.ipv6.bindv6only is false. I am not convinced why you need this, and I am not in favor of enfocing IPV6_V6ONLY, but... some points: - We should allow system-admin to enforce IPV6_V6ONLY to 0 as well. - CAP_NET_ADMIN users should always be able to use both modes (They can do sysctl anyway.) - setsockopt should fail w/ EPERM if user tries to override. I can imagine that some programs will always try to clear IPV6_V6ONLY (maybe for portability with other OS which default to setting it for security reasons) and will error-exit if it fails. So non-silent enforcing could be a PITA. OTOH there might be programs/systems where silent failure is wrong. You really don't want to (globally) stop an application setting IPV6_V6ONLY, such a program may well be creating separate IPv4 and IPv6 sockets. Some of this needs to be part of some application wide 'security' framework - that probably doesn't exist! Should there also be similar controls for the use of IPv4 mapped addresses in actual on-the-wire IPv6 packets - eg those destined for a remote gateway on an IPv6 only system? David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
My solution to making 'break' work in the iterator is: for (bkt = 0, node = NULL; bkt HASH_SIZE(name) node == NULL; bkt++) hlist_for_each_entry(obj, node, name[bkt], member) I'd take a look at the generated code. Might come out a bit better if the condition is changed to: node == NULL bkt HASH_SIZE(name) you might find the compiler always optimises out the node == NULL comparison. (It might anyway, but switching the order gives it a better chance.) David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] netprio_cgroup: Optimize the priomap copy loop slightly
- for (i = 0; - old_priomap (i old_priomap-priomap_len); - i++) - new_priomap-priomap[i] = old_priomap-priomap[i]; + if (old_priomap) { + old_len = old_priomap-priomap_len; + + for (i = 0; i old_len; i++) + new_priomap-priomap[i] = old_priomap-priomap[i]; + } Or: memcpy(new_priomap-priomap, old_priomap-priomap, old_priomap-priomap_len * sizeof old_priomap-priomap[0]); David N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
RE: [PATCH 9/9] drivers/isdn/gigaset/common.c: Remove useless kfree
Remove useless kfree() and clean up code related to the removal. ... diff --git a/drivers/isdn/gigaset/common.c b/drivers/isdn/gigaset/common.c index aa41485..30a6b17 100644 --- a/drivers/isdn/gigaset/common.c +++ b/drivers/isdn/gigaset/common.c @@ -1123,7 +1123,6 @@ struct gigaset_driver *gigaset_initdriver(unsigned minor, unsigned minors, return drv; error: - kfree(drv-cs); kfree(drv); return NULL; } Seems to me that (assuming kfree(NULL) is ok) the kfree() is best left in - just in case some other error path is added after drv-cs is assigned. Better safe than a memory leak. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] sata_fsl: add workaround for data length mismatch on freescale V2 controller
+ /* Read command completed register */ + done_mask = ioread32(hcr_base + CC); + + if (host_priv-quirks SATA_FSL_QUIRK_V2_ERRATA) { + if (unlikely(hstatus INT_ON_DATA_LENGTH_MISMATCH)) { + for (tag = 0; tag ATA_MAX_QUEUE; tag++) { + qc = ata_qc_from_tag(ap, tag); + if (qc ata_is_atapi(qc-tf.protocol)) { + atapi_flag = 1; + break; + } + } + } + } + + /* Workaround for data length mismatch errata */ + if (atapi_flag) { Seems to me like the conditionals for this code are all in the wrong order - adding code to the normal path. The whole lot should probably be inside: if (unlikely(hstatus INT_ON_DATA_LENGTH_MISMATCH)) { and the 'atapi_flag' boolean removed. I also wonder it this is worthy of an actual quirk? Might be worth doing anyway. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 02/16] user_ns: use new hashtable implementation
Yes hash_32 seems reasonable for the uid hash. With those long hash chains I wouldn't like to be on a machine with 10,000 processes with each with a different uid, and a processes calling setuid in the fast path. The uid hash that we are playing with is one that I sort of wish that the hash table could grow in size, so that we could scale up better. Since uids are likely to be allocated in dense blocks, maybe an unhashed multi-level lookup scheme might be appropriate. Index an array with the low 8 (say) bits of the uid. Each item can be either: 1) NULL = free entry. 2) a pointer to a uid structure (check uid value). 3) a pointer to an array to index with the next 8 bits. (2) and (3) can be differentiated by the low address bit. I think that is updateable with cmpxchg. Clearly this is a bad algorithm if uids are all multiples of 2^24 but that is true or any hash function. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/1] tcp: Wrong timeout for SYN segments
I would suggest to increase TCP_SYN_RETRIES from 5 to 6. 180 secs is eternity, but 31 secs is too small. Wasn't the intention of the long delay to allow a system acting as a router to reboot? I suspect that is why it (and some other TCP timers) are in minutes. David
RE: [PATCH v2 3/3] pppoatm: protect against freeing of vcc
On Tue, 27 Nov 2012 18:02:29 + David Woodhouse dw...@infradead.org wrote: In solos-pci at least, the ops-close() function doesn't flush all pending skbs for this vcc before returning. So can be a tasklet somewhere which has loaded the address of the vcc-pop function from one of them, and is going to call it in some unspecified amount of time. Should we make the device's -close function wait for all TX and RX skbs for this vcc to complete? the driver's close routine should wait for any of the pending tx and rx to complete. take a look at the he.c in driver/atm I'm not sure that sleeping for long periods in close() is always a good idea. If the process is event driven it will be unable to handle events on other fd until the close completes. This may be known not to be true in this case, but is more generally a problem. In this case the close should probably (IMHO at least) only sleep while pending tx and rx are aborted/discarded. Even when it might make sense to sleep in close until tx drains there needs to be a finite timeout before it become abortive. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] Add O_DENY* flags to fcntl and cifs
On Sat, Dec 08, 2012 at 12:43:14AM +0400, Pavel Shilovsky wrote: The problem is the possibility of denial-of-service attacks here. We can try to prevent them by: FWIW I already see a DoS 'attack'. I have some filestore shared using NFS (to Linux and Solaris) and using samba (to Windows). I use it for release builds of a product to ensure the versions built for the different operating systems match, and because some files have to be built on an 'alien' system (eg gcc targetted at embedded card). I can't run the windows build at the same time as the others because the windows C compiler manages to obtain exclusive access to the source files - stopping the other systems from reading them. David -- David Laight: da...@l8s.co.uk -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 0/2] fs: supply inode uid/gid setting interface
Subject: Re: [PATCH 0/2] fs: supply inode uid/gid setting interface On 2013/8/23 12:10, Greg KH wrote: On Fri, Aug 23, 2013 at 10:48:36AM +0800, Rui Xiang wrote: This patchset implements an accessor functions to set uid/gid in inode struct. Just finish code clean up. Why? It can introduce a new function to reduce some codes. Just clean up. In what sense is it a 'cleanup' ? To my mind it just means that anyone reading the code has to go and look at another file in order to see what the function does (it might be expected to be more complex). It also adds run time cost, while probably not directly measurable I suspect it more than doubles the execution time of that code fragment - do that everywhere and the system will run like a sick pig. The only real use for accessor functions is when you don't want the structure offset to be public. In this case that is obviously not the case since all the drivers are directly accessing other members of the structure. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [net-next rfc 1/3] net: avoid high order memory allocation for queues by using flex array
+static void netif_free_tx_queues(struct net_device *dev) +{ + if (is_vmalloc_addr(dev-_tx)) + vfree(dev-_tx); + else + kfree(dev-_tx); +} + static int netif_alloc_netdev_queues(struct net_device *dev) { unsigned int count = dev-num_tx_queues; @@ -5261,9 +5270,11 @@ static int netif_alloc_netdev_queues(struct net_device *dev) BUG_ON(count 1); tx = kcalloc(count, sizeof(struct netdev_queue), GFP_KERNEL); - if (!tx) - return -ENOMEM; - + if (!tx) { + tx = vzalloc(count * sizeof(struct netdev_queue)); + if (!tx) + return -ENOMEM; + } dev-_tx = tx; Given the number of places I've seen this code added, why not put it in a general helper? I also thought that malloc() with GFP_KERNEL would sleep. Under what conditions does it fail instead? David
RE: [PATCH 2/2] perf tools: Make Power7 events available for perf
I think we should be able to do something better using the C preprocessor, this is exactly the sort of thing it's good at. What I mean is something like we do with arch/powerpc/include/asm/systbl.h, where we define the list of syscalls once, and then include it in multiple places, using different macro definitions to get different outputs. There is a 'neat' trick - you can pass a #define macro the name of another #define - which is then expanded after the initial expansion. A description I happen to have is pasted below. David Consider what happens when #define foo(x) x(args) is expanded: foo(bar) clearly becomes bar(args). If we also have #define bar(args) then bar() is expanded AFTER foo() allowing us to generate any text including args. So we have passed the name of one #define as a parameter to a different #define. If we replace the definition of foo with #define foo(x) x(args1) x(args2) then foo(bar) is equivalent to bar(args1) bar(args2). This is useful because foo(baz) expands to baz(args1) baz(args2) allowing us to feed the same set of arguments to more than one #define. A simple example: #define lights(x) x(red) x(orange) x(green) #define xx(colour) LIGHT_##colour, enum { lights(xx) NUM_LIGHTS }; #undef xx #define xx(colour) #colour, static const char light_names[] = { lights(xx) }; #undef xx This expands to: enum { LIGHT_red, LIGHT_orange, LIGHT_green, NUM_LIGHTS }; static const char light_names[] = { ”red”, ”orange”, ”green”, }; (We needed to add NUM_LIGHTS because a trailing comma isn’t valid in a C++ enum.) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v7] ethernet/arc/arc_emac - Add new driver
So it should be declared dma_addr_t then, + addr = dma_map_single(ndev-dev, (void *)rx_buff-skb-data, + buflen, DMA_FROM_DEVICE); + if (dma_mapping_error(ndev-dev, addr)) { + if (net_ratelimit()) + netdev_err(ndev, cannot dma map\n); + dev_kfree_skb(rx_buff-skb); + stats-rx_errors++; + continue; + } + dma_unmap_addr_set(rx_buff, mapping, addr); + dma_unmap_len_set(rx_buff, len, buflen); + + rxbd-data = (unsigned char *)cpu_to_le32(rx_buff-skb-data); the 'addr' returned by dma_map_single is what the device really needs, although it is the same as rx_buff-skb-data with the trivial definition of dma_map_single. The last line here needs to be rxbd-data = cpu_to_le32(addr); which fixes the bug, and has no dependency on a 32 bit CPU. It still has a dependency on dma_addr_t size being 32 bit No - just a dependency on memory being mapped to a 32bit physical address accessible from the device. Many 64 bit systems have devices that can only access 32bit address space, either the memory has to be allocated from the correct arena, of DMA bounce buffers are used. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [RFC PATCH] vsnprintf: Remove use of %n and convert existing uses
On Wed, Sep 11, 2013 at 05:04:17PM -0700, Joe Perches wrote: On Thu, 2013-09-12 at 08:40 +0900, Tetsuo Handa wrote: Joe Perches wrote: - seq_printf(m, %s%d%n, con-name, con-index, len); + len = seq_printf(m, %s%d, con-name, con-index); Isn't len always 0 or -1 ? Right. Well you're no fun... These uses would seem broken anyway because the seq_printf isn't itself tested for correctness. Hmm. Also, there's a large amount of code that appears to do calculations with pos or len like: pos += seq_printf(handle, fmt. ...) ... and most of that code proceeds to ignore pos completely. Note that -show() is *NOT* supposed to return the number of characters it has/would like to have produced. Just return 0 and be done with that; overflows are dealt with just fine. The large amount, BTW, is below 100 lines, AFAICS, in rather few files. There are very few that seem to use it correctly like netfilter. Suggestions? Change the return type of seq_printf() to void and require that code use access functions/macros to find the length and error status. Possibly save the length of the last call separately. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 1/3] phy: Add new Exynos USB PHY driver
I just did a diff of registers in exynos 4210 and 4212 PHY drivers [1] and couldn't find that big a difference in register layout. Of course there are a few changes in HSIC bit fields and PHYFSEL but that's only minimal and could well be handled in a single driver. [1] - http://diffchecker.com/py3tp68m This is quite a lot of differences, especially including shifted bitfields... In addition there is another set of available PHYs and inter-dependencies between them. Might be worth feeding both files through sed -e 's/_421[02]_/_421x_/' prior to doing the diff. And maybe changing the order of some definitions so they match. Then the actual differences will be more obvious. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
It seems to me that a more useful interface would take a minimum and maximum number of vectors from the driver. This wouldn't allow the driver to specify that it could only accept, say, any even number within a certain range, but you could still leave the current functions available for any driver that needs that. Mmmm.. I am not sure I am getting it. Could you please rephrase? One possibility is for drivers than can use a lot of interrupts to request a minimum number initially and then request the additional ones much later on. That would make it less likely that none will be available for devices/drivers that need them but are initialised later. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 02/10][v6] powerpc/Power7: detect load/store instructions
Implement instr_is_load_store_2_06() to detect whether a given instruction is one of the fixed-point or floating-point load/store instructions in the POWER Instruction Set Architecture v2.06. ... +int instr_is_load_store_2_06(const unsigned int *instr) +{ + unsigned int op, upper, lower; + + op = instr_opcode(*instr); + + if ((op = 32 op = 58) || (op == 61 || op == 62)) + return true; + + if (op != 31) + return false; + + upper = op 5; + lower = op 0x1f; + + /* Short circuit as many misses as we can */ + if (lower 3 || lower 23) + return false; + + if (lower == 3) { + if (upper = 16) + return true; + + return false; + } + + if (lower == 7 || lower == 12) + return true; + + if (lower = 20) /* lower = 23 (implicit) */ + return true; + + return false; +} I can't help feeling the code could do with some comments about which actual instructions are selected where. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] X.25: Fix address field length calculation
On Tue, 2013-10-15 at 14:29 +, Kelleter, Günther wrote: Addresses are BCD encoded, not ASCII. x25_addr_ntoa got it right. [] Wrong length calculation leads to rejection of CALL ACCEPT packets. [] diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c [] @@ -98,7 +98,7 @@ int x25_parse_address_block(struct sk_buff *skb, } len = *skb-data; - needed = 1 + (len 4) + (len 0x0f); + needed = 1 + ((len 4) + (len 0x0f) + 1) / 2; This calculation looks odd. Looks correct to me... In X.25 the lengths (in digits) of the called and calling addresses are encoded in the high and low nibbles of one byte and then followed by both addresses with a digit in each nibble. If the length of the first address is odd, the second one isn't byte aligned. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC v2 29/29] vmxnet3: Make use of pcim_enable_msix_range() interface
Subject: [PATCH RFC v2 29/29] vmxnet3: Make use of pcim_enable_msix_range() interface ... diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c index d33802c..e552d2b 100644 --- a/drivers/net/vmxnet3/vmxnet3_drv.c +++ b/drivers/net/vmxnet3/vmxnet3_drv.c @@ -2735,39 +2735,19 @@ vmxnet3_read_mac_addr(struct vmxnet3_adapter *adapter, u8 *mac) */ static int -vmxnet3_acquire_msix_vectors(struct vmxnet3_adapter *adapter, - int vectors) +vmxnet3_acquire_msix_vectors(struct vmxnet3_adapter *adapter, int vectors) { - int err = -EINVAL, vector_threshold; - vector_threshold = VMXNET3_LINUX_MIN_MSIX_VECT; - - while (vectors = vector_threshold) { - err = pci_enable_msix(adapter-pdev, adapter-intr.msix_entries, - vectors); - if (!err) { - adapter-intr.num_intrs = vectors; - return 0; - } else if (err 0) { - dev_err(adapter-netdev-dev, -Failed to enable MSI-X, error: %d\n, err); - return err; - } else if (err vector_threshold) { - dev_info(adapter-pdev-dev, - Number of MSI-Xs which can be allocated - is lower than min threshold required.\n); - return -ENOSPC; - } else { - /* If fails to enable required number of MSI-x vectors - * try enabling minimum number of vectors required. - */ - dev_err(adapter-netdev-dev, - Failed to enable %d MSI-X, trying %d instead\n, - vectors, vector_threshold); - vectors = vector_threshold; - } + vectors = pcim_enable_msix_range(adapter-pdev, + adapter-intr.msix_entries, vectors, + VMXNET3_LINUX_MIN_MSIX_VECT); + if (vectors 0) { + dev_err(adapter-netdev-dev, + Failed to enable MSI-X, error: %d\n, vectors); + return vectors; } - return err; + adapter-intr.num_intrs = vectors; + return 0; } AFAICT the old code either used the requested number or the minimum number. The new code seems to claim an intermediate number of interrupts - but probably only uses the minimum number. This wastes the last few MSI-X interrupts. The code (especially the calling code) would be easier to read if the 'vectors' value wasn't explicitly passed. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/1] mfd: omap-usb-host: Fix USB device detection problems on OMAP4 Panda
From: Of Roger Quadros With u-boot 2013.10, USB devices are sometimes not detected on OMAP4 Panda. To make us independent of what bootloader does with the USB Host module, we must RESET it to get it to a known good state. This patch Soft RESETs the USB Host module. ... +++ b/drivers/mfd/omap-usb-host.c @@ -43,14 +43,18 @@ /* UHH Register Set */ #define OMAP_UHH_REVISION (0x00) #define OMAP_UHH_SYSCONFIG (0x10) -#define OMAP_UHH_SYSCONFIG_MIDLEMODE(1 12) +#define OMAP_UHH_SYSCONFIG_MIDLEMASK(3 12) +#define OMAP_UHH_SYSCONFIG_MIDLESHIFT(12) (tab/space issue) Wouldn't it be clearer to define these in the opposite order with: +#defineOMAP_UHH_SYSCONFIG_MIDLEMASK(3 OMAP_UHH_SYSCONFIG_MIDLESHIFT) ... +static void omap_usbhs_rev1_reset(struct device *dev) +{ + struct usbhs_hcd_omap *omap = dev_get_drvdata(dev); + u32 reg; + unsigned long timeout; + + reg = usbhs_read(omap-uhh_base, OMAP_UHH_SYSCONFIG); + + /* Soft Reset */ + usbhs_write(omap-uhh_base, OMAP_UHH_SYSCONFIG, + reg | OMAP_UHH_SYSCONFIG_SOFTRESET); + + timeout = jiffies + msecs_to_jiffies(100); + while (!(usbhs_read(omap-uhh_base, OMAP_UHH_SYSSTATUS) + OMAP_UHH_SYSSTATUS_RESETDONE)) { + cpu_relax(); + + if (time_after(jiffies, timeout)) { + dev_err(dev, Soft RESET operation timed out\n); + break; + } + } + + /* Set No-Standby */ + reg = ~OMAP_UHH_SYSCONFIG_MIDLEMASK; + reg |= OMAP_UHH_SYSCONFIG_MIDLE_NOSTANDBY + OMAP_UHH_SYSCONFIG_MIDLESHIFT; + + /* Set No-Idle */ + reg = ~OMAP_UHH_SYSCONFIG_SIDLEMASK; + reg |= OMAP_UHH_SYSCONFIG_SIDLE_NOIDLE + OMAP_UHH_SYSCONFIG_SIDLESHIFT; Why not pass in the mask and value and avoid replicating the entire function. I can't see any other significant differences, the udelay(2) won't be significant. I'm not sure of the context this code runs in, but if the reset is likely to take a significant portion of the 100ms timeout period, why not just sleep for a few ms between status polls. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/4] pch_gbe: Fix transmit queue management
From: David Miller According to Documentation/networking/driver.txt the ndo_start_xmit method should not return NETDEV_TX_BUSY under normal circumstances. Stop the transmit queue when tx_ring is full. ... You should be instead preventing the transmit method from being invoked when it might be possible that a request cannot be satisfied. This means that at the end of a transmit request, you must stop the queue if a packet with the maximum number of possible descriptors cannot be satisfied. Then it is impossible for the transmit function to be invoked in a situation where it would need to fail for lack of available transmit descriptors. I know you like ethernet drivers to work this way, but requiring enough descriptors for a maximally fragmented packet requires a difficult calculation and will cause the tx queue to be stopped unnecessarily. IIRC a normal skb can have a maximum of 17 fragments, if these end up being transmitted over USB using xhci they might need 34 descriptors (because descriptors can't cross 64k boundaries). However a typical 64k TCP packet just needs 4. xhci has a further constraint that the ring end can only be at specific alignments, in effect thus means that the descriptors for a single packet cannot cross the end of a ring segment. So if the available space in the xhci ring was used to stop the ethernet tx queue (which it isn't at the moment) it would have to be stopped very early. Isn't there also a new skb structure that allows lists of fragments? That might need even more descriptors for a worst case transmit. I would have thought it would make sense for the ethernet driver to stop the queue when there is a reasonable expectation that there won't be enough descriptors for the next packet, but have a reasonable code path when a packet with a large number of fragments won't fit in the tx ring. This either requires the driver itself to hold the skb, or to return NETDEV_TX_BUSY. It might make sense to monitor recent traffic to find the 'expected maximum number of fragments'. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/1] mfd: omap-usb-host: Fix USB device detection problems on OMAP4 Panda
From: Roger Quadros [mailto:rog...@ti.com] On 11/29/2013 03:17 PM, David Laight wrote: ... + timeout = jiffies + msecs_to_jiffies(100); + while (!(usbhs_read(omap-uhh_base, OMAP_UHH_SYSSTATUS) + OMAP_UHH_SYSSTATUS_RESETDONE)) { + cpu_relax(); You mean use msleep(1) here instead of cpu_relax()? Shouldn't be a problem IMO, but can you please tell me why that is better as the reset seems to complete usually in the first iteration. If it doesn't finish in the first iteration you don't want to spin the cpu for 100ms. If it hasn't finished in the first millisecond, you probably expect it to actually time out - so you might as well look (say) every 10ms. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] ipvs: Remove unused variable ret from sync_thread_master()
@@ -1637,7 +1637,7 @@ static int sync_thread_master(void *data) continue; } while (ip_vs_send_sync_msg(tinfo-sock, sb-mesg) 0) { - int ret = __wait_event_interruptible(*sk_sleep(sk), So ideally there's be a comment here why we're using interruptible but then ignore interruptions. Julian said ( http://lkml.kernel.org/r/alpine.lfd.2.00.1310012245020.1...@ja.ssi.bg ): Yes, your patch looks ok to me. In the past we used ssleep() but IPVS users were confused why IPVS threads increase the load average. So, we switched to _interruptible calls and later the socket polling was added. I've done this in the past so that the code sleeps interruptibly unless there is a signal pending - which would cause it to return early. /* Tell scheduler we are going to sleep... */ if (signal_pending(current)) /* We don't want waking immediately (again) */ sleep_state = TASK_UNINTERRUPTIBLE; else sleep_state = TASK_INTERRUPTIBLE; set_current_state(sleep_state); Shame there isn't a process flag to indicate that the process will sleep uninterruptibly and that it doesn't matter. So don't count to the load average and don't emit a warning if it has been sleeping for a long time. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] ipvs: Remove unused variable ret from sync_thread_master()
I've done this in the past so that the code sleeps interruptibly unless there is a signal pending - which would cause it to return early. /* Tell scheduler we are going to sleep... */ if (signal_pending(current)) /* We don't want waking immediately (again) */ sleep_state = TASK_UNINTERRUPTIBLE; else sleep_state = TASK_INTERRUPTIBLE; set_current_state(sleep_state); If this is for kernel threads, I think you can wipe the pending state; not entirely sure how signals interact with kthreads; Oleg will know. I did take me a while to manage to use kthreads, and we probably still have customers who insist on using pre 2.6.18 kernels! (In which case the processes are children of a daemon that only return to userspace in order to exit.) We also need to send signals (to get the process out of blocking socket calls), so I have to use force_sig() to get the signal noticed. The other issue was tidyup - I had to find module_put_and_exit(). David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
Sure, I modified the code so that we only prefetched 2 cache lines ahead, but only if the overall length of the input buffer is more than 2 cache lines. Below are the results (all counts are the average of 100 iterations of the csum operation, as previous tests were, I just omitted that column). Hmmm averaging over 10 iterations means that all the code is in the i-cache and the branch predictor will be correctly primed. For short checksum requests I'd guess that the relevant data has just been written and is already in the cpu cache (unless there has been a process and cpu switch). So prefetch is likely to be unnecessary. If you assume that the checksum code isn't in the i-cache then small requests are likely to be dominated by the code size. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
I'm not sure, whats the typical capacity for the branch predictors ability to remember code paths? ... For such simple single-target branches it goes near or over a thousand for recent Intel and AMD microarchitectures. Thousands for really recent CPUs. IIRC the x86 can also correctly predict simple sequences - like a branch in a loop that is taken every other iteration, or only after a previous branch is taken. Much simpler cpus may use a much simpler strategy. I think one I've used (a fpga soft-core cpu) just uses the low bits of the instruction address to index a single bit table. This means that branches alias each other. In order to get the consistent cycle counts in order to minimise the worst case code path we had to disable the dynamic prediction. For the checksum code the loop branch isn't a problem. Tests on entry to the function might get mispredicted. So if you have conditional prefetch when the buffer is long then time a short buffer after a 100 long ones you'll almost certainly see the mispredition penalty. FWIW I remember speeding up a copy (I think) loop on a strongarm by adding an extra instruction to fetch a word from later in the buffer into a register I never otherwise used. (That was an unpaged system so I knew it couldn't fault.) David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: net/usb/ax88179_178a driver broken in linux-3.12
From: Eric Dumazet [mailto:eric.duma...@gmail.com] On Tue, 2013-11-19 at 09:02 -0500, Mark Lord wrote: On 13-11-19 05:04 AM, David Laight wrote: From: Mark Lord .. except the ax88179_178a driver still does not work in linux-3.12, whereas it works fine in all earlier kernels. I've seen lost packets in IIRC 3.2 That's a regression. And a simple revert (earlier in this thread) fixes it. So.. let's revert it for now, until a proper xhci compatible patch is produced. ... There is a patch to xhci-ring.c that should fix the SG problem. http://www.spinics.net/lists/linux-usb/msg97176.html I think it should apply to the 3.12 sources. I am running with that patch here now (thanks), and it too appears to prevent the lockups. But is this patch upstream already? If yes, then it needs to get pushed out to -stable for 3.12 at least. Having someone else confirm that there is a bug, and that the patch fixes it should help it get pushed to -stable. If not upstream, then the revert is probably safest for -stable, rather than new code that has never been upstream before. Both patches are attached to this email. One or the other is required for the USB 3.0 network adapters to function in 3.12. I do not see any error in commit f27070158d6754765f2 (ax88179_178a: avoid copy of tx tcp packets) Quite the contrary in fact... I suspect a TSO bug, and would rather disable TSO for this nic. Have you tried to revert 3804fad45411b482 (USBNET: ax88179_178a: enable tso if usb host supports sg dma) It isn't directly a TSO problem. There has always been a bug in the xhci driver for fragmented buffers. TSO just means it is given a lot of fragmented buffers. As well as user-supplied fragmented buffers, the bug affects internal fragmentation that happens whenever a buffer crosses a 64k byte boundary (please hw engineers - stop doing this!) I'm not sure whether usbnet would ever pass buffers that cross 64k boundaries. I've not seen one - even with TSO. But the rx buffers are 20k (doesn't seem ideal!) and could also be problematical. USB mass storage has used SG for ages, the buffers must all be adequately aligned for the hardware - they won't meet the constraint for USB3 itself, but the documented restriction may be more severe than the actual one. David
RE: net/usb/ax88179_178a driver broken in linux-3.12
From: Eric Dumazet [mailto:eric.duma...@gmail.com] On Tue, 2013-11-19 at 14:43 +, David Laight wrote: It isn't directly a TSO problem. There has always been a bug in the xhci driver for fragmented buffers. TSO just means it is given a lot of fragmented buffers. As well as user-supplied fragmented buffers, the bug affects internal fragmentation that happens whenever a buffer crosses a 64k byte boundary (please hw engineers - stop doing this!) TCP stack uses order-3 allocations. This means 32KB for x86 (PAGE_SIZE=4096) What is PAGE_SIZE for your arches ? If there is a 6KB limit, we might adapt TCP to make sure we do not cross a 64KB boundary. Other strategy would be to detect this case in the driver and split a problematic segment into two parts. The xhci code does all the checks for buffer fragments crossing 64k boundaries. I've posted a patch that uses a conservative upper limit for the number of fragments: 2 * number_of_sg_buffs + xfer_len/65536. This saves scanning the sg list twice. For those not reading linux-usb: The xhci 'bug' is that an SG list may only cross the end of a ring segment at an aligned length. For USB2 devices this will be 512 bytes. For USB3 the documented alignment could be 16k (depends on a burst size), but might only be 1k. (This is another place where the hw engineers haven't made life easy.) The only simple solution is to ensure that a SG list doesn't cross the end of a ring segment. David
RE: Supporting 4 way connections in LKSCTP
In normal operation, IP-A sends INIT to IP-X, IP-X returns INIT_ACK to IP-A. IP-A then sends HB to IP-X, IP-X then returns HB_ACK to IP-A. In the meantime, IP-B sends HB to IP-Y and IPY returns HB_ACK. In case of the path between IP-A and IP-X is broken, IP-B sends INIT to IP-X, NODE-B uses IP-Y to return INIT_ACK to IP-B. Then IP-B sends HB to IP-X, and IP-Y returns HB_ACK to IP-B. In the meantime, the HB communication between IP-B and IP-Y follows the normal flow. Can I confirm, is it really valid? As long as NODE-B knows about both IP-A and IP-B, and NODE-A knows about both IP-X and IP-Y (meaning all the addresses were exchanged inside INIT and INIT-ACK), then this situation is perfectly valid. In fact, this has been tested an multiple interops. There are some network configurations that do cause problems. Consider 4 systems with 3 LAN segments: A) 10.10.10.1 on LAN X and 192.168.1.1 on LAN Y. B) 10.10.10.2 on LAN X and 192.168.1.2 on LAN Y. C) 10.10.10.3 on LAN X. D) 10.10.10.4 on LAN X and 192.168.1.2 on LAN Z. There are no routers between the networks (and none of the systems are running IP forwarding). If A connects to B everything is fine - traffic can use either LAN. Connections from A to C are problematic if C tries to send anything (except a HB) to 192.168.1.1 before receiving a HB response. One of the SCTP stacks we've used did send messages to an inappropriate address, but I've forgotten which one. Connections between A and D fail unless the HB errors A receives for 192.168.1.2 are ignored. Of course the application could explicitly bind to only the 10.x address but that requires the application know the exact network topology and may be difficult for incoming calls. David
RE: Supporting 4 way connections in LKSCTP
There are some network configurations that do cause problems. Consider 4 systems with 3 LAN segments: A) 10.10.10.1 on LAN X and 192.168.1.1 on LAN Y. B) 10.10.10.2 on LAN X and 192.168.1.2 on LAN Y. C) 10.10.10.3 on LAN X. D) 10.10.10.4 on LAN X and 192.168.1.2 on LAN Z. There are no routers between the networks (and none of the systems are running IP forwarding). If A connects to B everything is fine - traffic can use either LAN. Connections from A to C are problematic if C tries to send anything (except a HB) to 192.168.1.1 before receiving a HB response. One of the SCTP stacks we've used did send messages to an inappropriate address, but I've forgotten which one. I guess that would be problematic if A can not receive traffic for 192.168.1.1 on the interface connected to LAN X. I shouldn't technically be a problem for C as it should mark the path to 192.168.1.1 as down. For A, as long as it doesn't decide to ABORT the association, it shouldn't be a problem either. It would be interesting to know more about what problems you've observed. It was a long time ago, we don't actually do much SCTP testing even though our product (SS7 M3UA) uses it. Mostly because we don't want to find bugs that are hard to fix! What we saw was C using the 192.x address (as supplied in the INIT) and these not being routable to A (and not getting a response from a 3rd system). So the application layer immediately timed everything out. What I can't remember is whether this was Linux, Solaris or the SCTP stack we took from freebsd and rammed into the windows kernel. David
RE: Supporting 4 way connections in LKSCTP
The point is that address scoping should be used. When sending an INIT from 10.10.10.1 to 10.10.10.4 you should not list 192.168.1.1, since you are transmitting an address to a node which might or might not be in the same scope. You might have two machines that are connected via the public internet and also via a private network. The two sets of cabling being completely separate giving you resilience to network failure. In which case you definitely don't want address scoping. While you may not want the SCTP traffic on the public network itself, it could easily be routed separately. We have systems that 'sort of' designate one interface for SIP/RTP and the other for 'management'. They might run M3UA/SCTP but no one has really thought enough about which interface(s) the M3UA traffic should use. (Think of an ISUP/SIP gateway using M3UA for ISUP signalling.) David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Supporting 4 way connections in LKSCTP
the configured addresses could be: System A) 10.0.0.1 on Lan X, 10.10.0.1 on Lan Y System B) 10.0.0.2 on Lan X, 10.10.0.2 on Lan Y System C) 10.0.0.3 on Lan X, 10.10.0.2 on Lan Z Same problem will occur. ... With that, Sys A talking to Sys C will get an abort from Sys B when trying to talk to 10.10.0.2. With /8, it'll be even worse since SysB and SysC will have duplicate addresses within the subnet. :) The point is that you don't always know that the same private subnet is in reality 2 different subnets with duplicate addresses. I've had to debug an actual production issue similar to this where customer had a very similar configuration to above, and their associations kept getting aborted. When I tried accessing the system that kept sending aborts, I found it was some windows server and not a Diameter station they were expecting. Does seem that the addresses listed in INIT and INIT_ACK chunks should be ignored until a valid HB response has been received. If an abort is received before a valid HB response then the address should be ignored rather than the connection aborted. Then it wouldn't matter anywhere near as much if addresses are advertised that are not reachable from the remote system. All this should have been thought about when the original RFC was written. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH net-next v2 1/2] r8152: add RTL8152_EARLY_AGG_TIMEOUT_SUPER
From: Hayes Wang For slow CPU, the frequent bulk transfer would cause poor throughput. One solution is to increase the timeout of the aggregation. It let the hw could complete the bulk transfer later and fill more packets into the buffer. Besides, it could reduce the frequency of the bulk transfer efficiently and improve the performance. However, the optimization value of the timeout depends on the capability of the hardware, especially the CPU. For example, according to the experiment, the timeout 164 us is better than the default value for the chromebook with the ARM CPU. The best value probably depends on the workload as well as the cpu speed. I wonder if a sane algorithm for dynamically setting this value exists. It really needs a CBU (crystal ball unit), but working ones are difficult to come by. Passing the buck to the 'user' is rather a cop-out (but we all do it). Now add RTL8152_EARLY_AGG_TIMEOUT_SUPER to let someone could choose desired timeout value if he wants to get the best performance. ... /* USB_RX_EARLY_AGG */ -#define EARLY_AGG_SUPPER 0x0e832981 +#define EARLY_AGG_SUPER rx_buf_sz - 1522) / 4) 16) | \ + (u32)(CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER = 0 ? 0x2981 : \ + ((CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER * 125) 0x ? \ + CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER * 125 : 0x))) The 0x2981 would be better written as (85 * 125). But maybe replace with something like: min((CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER = 0 ? 85 : CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER) * 125u, 0xu) David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH net-next v3 1/2] r8152: add RTL8152_EARLY_AGG_TIMEOUT_SUPER
From: Hayes Wang ... I should have spotted this before. /* USB_RX_EARLY_AGG */ -#define EARLY_AGG_SUPPER 0x0e832981 +#define EARLY_AGG_SUPER rx_buf_sz - 1522) / 4) 16) | \ + (u32)(CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER = 0 ? 85 * 125 : \ + min(CONFIG_RTL8152_EARLY_AGG_TIMEOUT_SUPER * 125, 0x))) #define EARLY_AGG_HIGH 0x0e837a12 #define EARLY_AGG_SLOW 0x0e83 @@ -1978,7 +1980,7 @@ static void r8153_set_rx_agg(struct r8152 *tp) ocp_write_dword(tp, MCU_TYPE_USB, USB_RX_BUF_TH, RX_THR_SUPPER); ocp_write_dword(tp, MCU_TYPE_USB, USB_RX_EARLY_AGG, - EARLY_AGG_SUPPER); + EARLY_AGG_SUPER); } else { ocp_write_dword(tp, MCU_TYPE_USB, USB_RX_BUF_TH, RX_THR_HIGH); It looks as though rx_buf_sz should be a parameter to EARLY_AGG_SUPER. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 1/4] net: add name_assign_type netdev attribute
From: David Herrmann The name_assign_type attribute gives hints where the interface name of a given net-device comes from. Three different values are currently defined: NET_NAME_ENUM: This is the default. The ifname is provided by the kernel with an enumerated suffix. Names may be reused and unstable. NET_NAME_USER: The ifname was provided by user-space during net-device setup. NET_NAME_RENAMED: The net-device has been renamed via RTNL. Once this type is set, it cannot change again. ... diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b8d8c80..6698e87 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1248,6 +1248,7 @@ struct net_device { * of the interface. */ charname[IFNAMSIZ]; + unsigned char name_assign_type; /* name assignment type */ /* device name hash chain, please keep it close to name[] */ struct hlist_node name_hlist; Do you really need to add 7 byte of padding here? There seems to be some padding lurking elsewhere that really ought to be mergable. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2] net: netfilter: LLVMLinux: vlais-netfilter
From: beh...@converseincode.com From: Mark Charlebois charl...@gmail.com Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in xt_repldata.h with a C99 compliant flexible array member and then calculated offsets to the other struct members. These other members aren't referenced by name in this code, however this patch maintains the same memory layout and padding as was previously accomplished using VLAIS. Had the original structure been ordered differently, with the entries VLA at the end, then it could have been a flexible member, and this patch would have been a lot simpler. However since the data stored in this structure is ultimately exported to userspace, the order of this structure can't be changed. Why not just remove the last element and allocate space for it after the structure? That would reduce the complexity of the patch and the unreadability of the new code. I realise that the alignment of type##_error is 'tricky' to determine. David N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
RE: [PATCH v2] net: netfilter: LLVMLinux: vlais-netfilter
From: Behan Webster On 03/18/14 02:41, David Laight wrote: From: beh...@converseincode.com From: Mark Charlebois charl...@gmail.com Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in xt_repldata.h with a C99 compliant flexible array member and then calculated offsets to the other struct members. These other members aren't referenced by name in this code, however this patch maintains the same memory layout and padding as was previously accomplished using VLAIS. Had the original structure been ordered differently, with the entries VLA at the end, then it could have been a flexible member, and this patch would have been a lot simpler. However since the data stored in this structure is ultimately exported to userspace, the order of this structure can't be changed. Why not just remove the last element and allocate space for it after the structure? Because that would still be employing VLAIS to solve the problem. The last element may be a zero-length array (a flexible member), not a VLA. Sadly both the last 2 elements in the struct need to be manually calculated, which is what we've done. So make the last element a 'flexible member' and then work out where the final field goes. Something like: struct p { struct a a; struct b b[]; } p = malloc(sizeof *p + n * sizeof (struct b) + alignof (struct c) + sizeof (struct c); struct c *c = (void *)p-b[n] + (-offsetof(struct p, b[n]) (alignof(struct c) - 1); David
RE: sisusb: Use static const, fix typo
From: Joe Perches Reduce text a bit by using static const. If you want to save a few bytes remove the pointers. (and the fixed RAM text to get below 7 chars). eg: - const char *ramtypetext2[] = { SDR SDRAM, SDR SGRAM, - DDR SDRAM, DDR SGRAM }; static const char ramtypetext2[8][] = { SDR SD, SDR SG, DDR SD, DDR SG ... - dev_info(sisusb-sisusb_dev-dev, %dMB %s %s, bus width %d\n, (sisusb-vramsize 20), ramtypetext1, - ramtypetext2[ramtype], bw); dev_info(sisusb-sisusb_dev-dev, %dMB %s %sRAM, bus width %d\n, sisusb-vramsize 20, ramtypetext1, ramtypetext2[ramtype], bw); David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: sisusb: Use static const, fix typo
From: Joe Perches [ On Mon, 2014-02-24 at 10:26 +, David Laight wrote: From: Joe Perches Reduce text a bit by using static const. If you want to save a few bytes remove the pointers. (and the fixed RAM text to get below 7 chars). Hi David. eg: - const char *ramtypetext2[] = { SDR SDRAM, SDR SGRAM, - DDR SDRAM, DDR SGRAM }; The idea was use static to avoid the array reload on each function entry. static const char ramtypetext2[8][] = { SDR SD, SDR SG, DDR SD, DDR SG 8 is an odd number here. Brain fade - not enough coffee. I meant 'char ramtypetext2[][8]' so that you avoid the pointers as well. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] net/usb: Add Lenovo ThinkPad OneLink GigaLAN USB ID to ax88179 driver
From: Keith Packard The Lenovo OneLink dock includes a USB ethernet adapter using the AX88179 chip, but with a different USB ID. Add this new USB id to the driver so that it will autodetect the adapter correctly. Signed-off-by: Keith Packard kei...@keithp.com Tested-by: Carl Worth cwo...@cworth.org --- drivers/net/usb/ax88179_178a.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c index 8e8d0fc..dcf974f 100644 --- a/drivers/net/usb/ax88179_178a.c +++ b/drivers/net/usb/ax88179_178a.c @@ -1418,6 +1418,19 @@ static const struct driver_info samsung_info = { .tx_fixup = ax88179_tx_fixup, }; +static const struct driver_info lenovo_info = { + .description = ThinkPad OneLink Dock USB GigaLAN, + .bind = ax88179_bind, + .unbind = ax88179_unbind, + .status = ax88179_status, + .link_reset = ax88179_link_reset, + .reset = ax88179_reset, + .stop = ax88179_stop, + .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .rx_fixup = ax88179_rx_fixup, + .tx_fixup = ax88179_tx_fixup, +}; + static const struct usb_device_id products[] = { { /* ASIX AX88179 10/100/1000 */ @@ -1435,6 +1448,10 @@ static const struct usb_device_id products[] = { /* Samsung USB Ethernet Adapter */ USB_DEVICE(0x04e8, 0xa100), .driver_info = (unsigned long)samsung_info, +}, { + /* Lenovo ThinkPad OneLink GigaLAN */ + USB_DEVICE(0x17ef, 0x304b), + .driver_info = (unsigned long)samsung_info, I think you meant lenovo_info. Actually it looks like the initialiser should be factored somehow to that the list of functions and flags doesn't need repeating for every clone. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH net-next 08/12] r8152: support TSO
From: Hayes Wang Support scatter gather and TSO. Adjust the tx checksum function and set the max gso size to fix the size of the tx aggregation buffer. There is little point supporting TSO unless the usb host controller supports arbitrary aligned scatter-gather. All you do is require that a large skb be allocated (that may well fail), and add it another data copy. The xhci controller is almost capable of arbitrary scatter-gather, but it is currently disabled because the data must be aligned at the end of the transfer ring (the hardware designers have made it almost impossible to write efficient software). Note that the various xhci controllers behave subtly differently. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH net-next 08/12] r8152: support TSO
From: hayeswang David Laight [mailto:david.lai...@aculab.com] Sent: Tuesday, March 04, 2014 8:12 PM To: 'Hayes Wang'; net...@vger.kernel.org Cc: nic_s...@realtek.com; linux-kernel@vger.kernel.org; linux-...@vger.kernel.org Subject: RE: [PATCH net-next 08/12] r8152: support TSO From: Hayes Wang Support scatter gather and TSO. Adjust the tx checksum function and set the max gso size to fix the size of the tx aggregation buffer. There is little point supporting TSO unless the usb host controller supports arbitrary aligned scatter-gather. All you do is require that a large skb be allocated (that may well fail), and add it another data copy. I think I have done it. For also working for EHCI usb host controller, I allocate 16 KB continuous buffer and copy the fragmented packets to it and bulk out the buffer. Does that mean you are splitting the 64k 'ethernet packet' from TCP is software? I've looked at the ax88179 where the hardware can do it. Is there really a gain doing segmentation here if you have to do the extra data copy? It might be worth packing multiple short packets into a single USB bulk data packet, but that might require a CBU (crystal ball unit). I did measure a maximum transmit ethernet frame rate of (IIRC) 25 frames/sec for the ax88179 - probably limited by the USB3 frame rate. Exceeding that would probably require putting multiple tx packets into a single URB. OTOH that limit probably doesn't matter What might be more useful is to set the rx side up to receive into page-aligned 2k (or 4k) buffers and then separate out the ethernet frames into the required skb - probably as page list fragments (I'm not entirely sure how such skb can be created). I don't know what the r8152 does, but the usbnet code encourages it to allocate long skb, pass them to the usb stack to fill, and then clone them if they contain multiple frames. This isn't really good use of memory or cpu cycles. The ax88179 driver also lies badly about the skb's truesize. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH net-next 08/12] r8152: support TSO
From: Eric Dumazet On Tue, 2014-03-04 at 14:35 +, David Laight wrote: Does that mean you are splitting the 64k 'ethernet packet' from TCP is software? I've looked at the ax88179 where the hardware can do it. Is there really a gain doing segmentation here if you have to do the extra data copy? There is no 'extra' copy. There is _already_ a copy, so what's the deal of doing this copy in a SG enabled way ? Ok. But there is one more copy than for a normal ethernet chipset which can use multiple ring entries to send the MAC+IP+TCP headers from a different buffer to the TCP userdata. David N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
RE: [PATCH] phy: fix compiler array bounds warning on settings[]
From: Bjorn Helgaas With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the settings[] array subscript is above array bounds, I think because idx is a signed integer and if the caller supplied idx 0, we pass the guard but still reference out of bounds. Not rejecting the patch but... Just indexing an array with 'int' shouldn't cause this warning, so somewhere a caller must actually be passing an idx 0. While changing the type to unsigned will make the comparison against the array bound reject the -1, I suspect that the specific call path didn't really intend passing a hard-coded -1. It might be worth trying to locate the call site that passes -1. David
RE: [PATCH] phy: fix compiler array bounds warning on settings[]
From: Bjorn Helgaas On Wed, Mar 5, 2014 at 2:10 AM, David Laight david.lai...@aculab.com wrote: From: Bjorn Helgaas With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the settings[] array subscript is above array bounds, I think because idx is a signed integer and if the caller supplied idx 0, we pass the guard but still reference out of bounds. Not rejecting the patch but... Just indexing an array with 'int' shouldn't cause this warning, so somewhere a caller must actually be passing an idx 0. While changing the type to unsigned will make the comparison against the array bound reject the -1, I suspect that the specific call path didn't really intend passing a hard-coded -1. It might be worth trying to locate the call site that passes -1. I agree 100%. If that's the case, we definitely should find that caller rather than apply this patch. I'll look more today. You might want to apply the patch as well :-) David
RE: [PATCH] phy: fix compiler array bounds warning on settings[]
From: Bjorn Helgaas I'm stumped. phy_find_valid() is static and only called from one place. The 'idx' argument is always the result of phy_find_setting(), which should always return something between 0 and ARRAY_SIZE(settings), so I don't see any way idx can be 0. I stripped this down as far as I could; the resulting test code is at http://pastebin.com/pp1zMEWu if anybody else wants to look at it. I'm using gcc 4.8.x 20131105 (prerelease), with -Warray-bounds -O2 flags. I hesitate to suspect a compiler bug, but it is very strange. For example, in my test code, replacing MAX_NUM_SETTINGS with 2 gets rid of the warnings. MAX_NUM_SETTINGS is known to be 2 at compile-time, so I don't know why this should make a difference. I can't get an array bounds error from the gcc 4.8.1-10ubuntu9 at all. Not even when I index the array with a constant 3. I wonder if they compiled it out! David N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
RE: [PATCH 3/3] perf: Use 64-bit value when comparing sample_regs
From: Sukadev Bhattiprolu When checking whether a bit representing a register is set in sample_regs, a 64-bit mask, use 64-bit value (1LL). Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com --- tools/perf/util/unwind.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/unwind.c b/tools/perf/util/unwind.c index 742f23b..2b888c6 100644 --- a/tools/perf/util/unwind.c +++ b/tools/perf/util/unwind.c @@ -396,11 +396,11 @@ static int reg_value(unw_word_t *valp, struct regs_dump *regs, int id, { int i, idx = 0; - if (!(sample_regs (1 id))) + if (!(sample_regs (1LL id))) return -EINVAL; for (i = 0; i id; i++) { - if (sample_regs (1 i)) + if (sample_regs (1LL i)) idx++; } There are much faster ways to count the number of set bits, especially if you might need to check a significant number of bits. There might even be a function defined somewhere to do it. Basically you just add up the bits, for 16 bit it would be: val = (val 0x) + (val 1) 0x; val = (val 0x) + (val 2) 0x; val = (val 0x0f0f) + (val 4) 0x0f0f; val = (val 0x00ff) + (val 8) 0x00ff; As the size of the work increases the improvement is more significant. (Some of the later masking can probably be proven unnecessary.) David
RE: [PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma
From: Mathias Nyman This reverts commit 3804fad45411b48233b48003e33a78f290d227c8. This commit, together with commit 247bf557273dd775505fb9240d2d152f4f20d304 xhci 1.0: Limit arbitrarily-aligned scatter gather. were origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a devices working together with scatter gather. xHCI 1.0 hosts pose some requirement on how transfer buffers are aligned, setting this requirement for 1.0 hosts caused USB 3.0 mass storage devices to fail more frequently. This patch doesn't need to be reverted. Provided the xhci driver doesn't set the flag to say that arbitrary scatter gather is supported (ie usb_device_no_sg_constraint(dev-udev)) is false) the ax88179_178a driver won't request transmits that need fragmenting. There is a separate issue in that it does request receives that need fragmenting. However if the fragmented receives end up being split by a LINK TRB the device driver recovers. The ax88179 hardware doesn't recover when a tx is split. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2] Revert xhci 1.0: Limit arbitrarily-aligned scatter gather.
From: Mathias Nyman This reverts commit 247bf557273dd775505fb9240d2d152f4f20d304. You need to revert further. Just don’t set hcd-self.no_sg_constraint ever - since it just doesn't work. That will stop the ax88179_178a driver sending fragmented packets. With the check for aligned non-terminal fragments removed the code will be as bad as the earlier releases. This commit, together with commit 3804fad45411b48233b48003e33a78f290d227c8 USBNET: ax88179_178a: enable tso if usb host supports sg dma were origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a devices working together with scatter gather. xHCI 1.0 hosts pose some requirement on how transfer buffers are aligned, setting this requirement for 1.0 hosts caused USB 3.0 mass storage devices to fail more frequently. USB 3.0 mass storage devices used to work before 3.14-rc1. Theoretically, the TD fragment rules could have caused an occasional disk glitch. Now the devices *will* fail, instead of theoretically failing. From a user perspective, this looks like a regression; the USB device obviously fails on 3.14-rc1, and may sometimes silently fail on prior kernels. The proper soluition is to implement the TD fragment rules required, but for now this patch needs to be reverted to get USB 3.0 mass storage devices working at the level they used to. Signed-off-by: Mathias Nyman mathias.ny...@linux.intel.com Cc: stable sta...@vger.kernel.org --- drivers/usb/host/xhci.c | 14 +++--- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index 6fe577d..924a6cc 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -4733,6 +4733,9 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks) /* Accept arbitrarily long scatter-gather lists */ hcd-self.sg_tablesize = ~0; + /* support to build packet from discontinuous buffers */ + hcd-self.no_sg_constraint = 1; + Don't add the above - it only looked at by the usbnet code and in particular the ax88179_178a driver /* XHCI controllers don't stop the ep queue on short packets :| */ hcd-self.no_stop_on_short = 1; @@ -4757,14 +4760,6 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks) /* xHCI private pointer was set in xhci_pci_probe for the second * registered roothub. */ - xhci = hcd_to_xhci(hcd); - /* - * Support arbitrarily aligned sg-list entries on hosts without - * TD fragment rules (which are currently unsupported). - */ - if (xhci-hci_version 0x100) - hcd-self.no_sg_constraint = 1; - return 0; } @@ -4793,9 +4788,6 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks) if (xhci-hci_version 0x96) xhci-quirks |= XHCI_SPURIOUS_SUCCESS; - if (xhci-hci_version 0x100) - hcd-self.no_sg_constraint = 1; - /* Make sure the HC is halted. */ retval = xhci_halt(xhci); if (retval) -- 1.8.1.2 -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma
From: Alan Stern On Fri, 7 Mar 2014, David Laight wrote: From: Mathias Nyman This reverts commit 3804fad45411b48233b48003e33a78f290d227c8. This commit, together with commit 247bf557273dd775505fb9240d2d152f4f20d304 xhci 1.0: Limit arbitrarily-aligned scatter gather. were origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a devices working together with scatter gather. xHCI 1.0 hosts pose some requirement on how transfer buffers are aligned, setting this requirement for 1.0 hosts caused USB 3.0 mass storage devices to fail more frequently. This patch doesn't need to be reverted. Yes, it does. Provided the xhci driver doesn't set the flag to say that arbitrary scatter gather is supported (ie usb_device_no_sg_constraint(dev-udev)) is false) the ax88179_178a driver won't request transmits that need fragmenting. True. But xhci-hcd _will_ set the flag, because of patch 1 in this series. In other words, patch 1 makes patch 2 necessary. I was reading patch 2 first. It would seem to be better to modify patch 1 - so it doesn't set the flag. Then the changes are limited to the usb tree, and the change to net wouldn't need to be reapplied in order to test scatter-gather when it is properly fixed. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2] Revert xhci 1.0: Limit arbitrarily-aligned scatter gather.
From: Alan Stern On Fri, 7 Mar 2014, David Laight wrote: From: Mathias Nyman This reverts commit 247bf557273dd775505fb9240d2d152f4f20d304. You need to revert further. Just don?t set hcd-self.no_sg_constraint ever - since it just doesn't work. No; it does work most of the time. If hcd-self.no_sg_constraint wasn't sent, then certain commands would fail always, instead of failing occasionally. The point is that the no_sg_constraint was added in order to allow the ax88179_178a driver to send arbitrarily aligned transfers. Nothing else looks at it (well didn't when I scanned the tree a while back). In effect all the other transfer requests were assumed to be suitably aligned. The check that caused things to fail was the one added to xhci relatively recently that verified the alignment of the fragments. David
RE: [PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma
From: On Fri, 7 Mar 2014, David Laight wrote: From: Alan Stern On Fri, 7 Mar 2014, David Laight wrote: From: Mathias Nyman This reverts commit 3804fad45411b48233b48003e33a78f290d227c8. This commit, together with commit 247bf557273dd775505fb9240d2d152f4f20d304 xhci 1.0: Limit arbitrarily-aligned scatter gather. were origially added to get xHCI 1.0 hosts and usb ethernet ax88179_178a devices working together with scatter gather. xHCI 1.0 hosts pose some requirement on how transfer buffers are aligned, setting this requirement for 1.0 hosts caused USB 3.0 mass storage devices to fail more frequently. This patch doesn't need to be reverted. Yes, it does. Provided the xhci driver doesn't set the flag to say that arbitrary scatter gather is supported (ie usb_device_no_sg_constraint(dev-udev)) is false) the ax88179_178a driver won't request transmits that need fragmenting. True. But xhci-hcd _will_ set the flag, because of patch 1 in this series. In other words, patch 1 makes patch 2 necessary. I was reading patch 2 first. It would seem to be better to modify patch 1 - so it doesn't set the flag. Then the changes are limited to the usb tree, and the change to net wouldn't need to be reapplied in order to test scatter-gather when it is properly fixed. The point is that the no_sg_constraint was added in order to allow the ax88179_178a driver to send arbitrarily aligned transfers. Nothing else looks at it (well didn't when I scanned the tree a while back). In effect all the other transfer requests were assumed to be suitably aligned. The check that caused things to fail was the one added to xhci relatively recently that verified the alignment of the fragments. (Actually the check was added to usbcore, not to xhci-hcd.) The _real_ problem here seems to be that no_sg_constraint is ambiguous. Originally it was meant to refer to the constraint that all SG elements except the last must be a multiple of the maxpacket size. For that purpose, the check added to usbcore was entirely appropriate, as was setting the flag in xhci-hcd. Probably true - but the only code that looked at it was in usbnet. The check in usbcore is very recent. But now it turns out that the ax88179 driver is violating a _different_ constraint: that Link TRBs must occur only at 1-KB boundaries. The no_sg_constraint flag was never meant to describe this. In other words, the flag issue is separate from the problem facing ax88179. Really that means that the xhci controller couldn't actually support the alignments it said it could - rather than the ax88179 driver sending packets that didn't meet that the rules. The appropriate way to address this new problem for the future is to remove the second constraint by adding correct support for TD fragments into xhci-hcd. Indeed. The appropriate way to address the problem right now in the -stable kernels is to prevent ax88179 from using SG -- and not by abusing the no_sg_constraint flag, which is not directly related to the TD fragment problem. I'd say that if the no_sg_constraint flag is set the ax88179 driver could reasonably expect to send its fragment lists. So it is best not to set the no_sg_constraint flag, and then remove the checks against it that are needed to allow the transfers from the disk system not be rejected - even though we know that some of them might not actually work. Otherwise you'll need to add yet another flag when the sg support is fixed. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: rfc: checkpatch logical line continuations (was IBM Akebono: Add support for a new PHY interface to the IBM emac driver)
From: j...@joshtriplett.org On Fri, Mar 07, 2014 at 01:02:44PM -0800, Joe Perches wrote: On Fri, 2014-03-07 at 15:41 -0500, David Miller wrote: From: Alistair Popple alist...@popple.id.au Date: Thu, 6 Mar 2014 14:52:25 +1100 + out_be32(dev-reg, in_be32(dev-reg) | WKUP_ETH_RGMIIEN +| WKUP_ETH_TX_OE | WKUP_ETH_RX_IE); When an expression spans multiple lines, the lines should end with operators rather than begin with them. That's not in CodingStyle currently. It's also not even remotely consistent across existing kernel code, and it isn't obvious that there's a general developer consensus on the right way to write it. My personal preference (which counts for nothing here) is to put the operators at the start of the continuation like in order to make it more obvious that it is a continuation. The netdev rules are particularly problematical for code like: if (tst(foo, foo2, foo3, ...) ... tst2(..) tst3()) { baz(); where a scan read of the LHS gives the wrong logic. At least we don't have a coding style that allows very long lnes an puts } and { on their own lines - leading to: ... } while (foo(...) bar(...) . /* very long line falls off screen */ { int x; Is that the top or bottom of a loop? David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v8 net-next 1/3] filter: add Extended BPF interpreter and converter
From: David Miller From: Linus Torvalds torva...@linux-foundation.org Date: Mon, 10 Mar 2014 19:02:18 -0700 I would generally suggest that people only use bool for function return types, and absolutely nothing else. Seriously. I think it makes sense for function arguments too. 'bool' doesn't necessarily even make sense for function arguments and results. It might require the same additional masking of high bits that are required for 'short' and 'char' parameters. Also, IIRC, one of the arm ABIs used 32bit 'bool'. Even the extra code that gets added to ensure that the 'bool' values are either 0 or 1 can be a problem. I don't know whether the recent compilers (or standards) say whether the reader or writer is expected to enforce the domain. But if the compiler uses a bit-wise AND for bool_a bool_b there is definitely scope for unexpected behaviour, and if the compiler does anything else you get unwanted instructions and possibly branches. Some old versions of gcc made a 'pigs breakfast' of bool_a |= bool_b. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC] netfilter: cacheline aligning in struct netns_ct
From: Zhouyi Zhou not frequently changing components should share same cachelines Signed-off-by: Zhouyi Zhou yizhouz...@ict.ac.cn --- include/net/netns/conntrack.h | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index fbcc7fa..69d2d58 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -65,6 +65,12 @@ struct nf_ip_net { struct netns_ct { atomic_tcount; unsigned intexpect_count; + struct hlist_nulls_head unconfirmed; + struct hlist_nulls_head dying; + struct hlist_nulls_head tmpl; + + /* not frequently changing components should share same cachelines */ + struct nf_ip_netnf_ct_proto cacheline_aligned_in_smp; I'm not sure this has the effect you are trying to achieve. Adding an __attribute__((aligned(n))) to a structure member not only adds padding before the member, but also pads the member to a multiple of the specified size. With large cache lines you probably want neither. IIRC one of the ppc systems has something like 256 byte cache lines (and if it doesn't now, it might have soon). So arbitrarily cache-line aligning structure members may not be a good idea. In any case reducing the number of cache lines accessed, and the number dirtied is probably more important than putting 'readonly' data in its own cache lines. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH -next] qlcnic: fix compiler warning
From: Shahed Shaikh Adding netdev. -Original Message- From: Martin Kaiser,,, [mailto:mar...@reykholt.kaiser.cx] On Behalf Of Martin Kaiser Sent: Thursday, January 09, 2014 9:29 PM To: Himanshu Madhani; Rajesh Borundia Cc: linux-kernel; triv...@kernel.org Subject: [PATCH -next] qlcnic: fix compiler warning Add an explicit cast to fix the following warning (seen on Debian Wheezy, gcc 4.7.2) CC [M] drivers/net/wireless/rtlwifi/rtl8192ce/trx.o drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c: In function qlcnic_send_filter: drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:349:3: warning: passing argument 2 of ether_addr_equal from incompatible pointer type [enabled by default] In file included from include/linux/if_vlan.h:16:0, from drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c:9: include/linux/etherdevice.h:244:20: note: expected const u8 * but argument is of type u64 * If I am not wrong, this patch should go to David's net-next tree. Signed-off-by: Martin Kaiser mar...@kaiser.cx Acked-by: Shahed Shaikh shahed.sha...@qlogic.com Thanks, Shahed --- drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c index 6373f60..3557154 100644 --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c @@ -346,7 +346,7 @@ static void qlcnic_send_filter(struct qlcnic_adapter *adapter, head = (adapter-fhash.fhead[hindex]); hlist_for_each_entry_safe(tmp_fil, n, head, fnode) { - if (ether_addr_equal(tmp_fil-faddr, src_addr) + if (ether_addr_equal(tmp_fil-faddr, (const u8 *)src_addr) tmp_fil-vlan_id == vlan_id) { if (jiffies (QLCNIC_READD_AGE * HZ + tmp_fil- ftime)) qlcnic_change_filter(adapter, src_addr, -- 1.7.10.4 A quick look at the code seems to imply that this code is somewhat buggy. 'src_addr' is defined a u64 even though it is a 6 byte mac address. On big-endian systems the wrong bytes get accessed all over the place. Also when ether_addr_equal() in inlined the code ends up accessing src_addr as 32bit and 16bit data - which don't have to be synchronised against and writes to the 64bit item. If might be that ether_addr_equal() (etc) should have a gcc asm statement with a memory constraint against the mac address buffers. David
RE: [PATCH v2 3/4] net: make tcp_cleanup_rbuf private
From: David Miller ... On Thu, Jan 9, 2014 at 12:26 PM, Neal Cardwell ncardw...@google.com wrote: On Thu, Jan 9, 2014 at 3:16 PM, Dan Williams dan.j.willi...@intel.com wrote: net_dma was the only external user so this can become local to tcp.c again. ... -void tcp_cleanup_rbuf(struct sock *sk, int copied) +static void cleanup_rbuf(struct sock *sk, int copied) I would vote to keep the tcp_ prefix. In the TCP code base that is the more common idiom, even for internal/static TCP functions, and personally I find it easier to read and work with in stack traces, etc. My 2 cents. Ok. It was cleanup_rbuf() in a former life, but one vote for leaving the name as is is enough for me. You can make that two votes :) I suspect DM adds 2000 votes :-) Keeping the prefix makes it easier to grep for, and makes it more obvious that it isn't a generic function (when being called). David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
From: walt In the meantime, try this patch, which is something of a long shot. No difference. But I notice the code enables the TRB quirk only if the xhci_version is specifically 0x95. My debug messages claim that xHCI doesn't need link TRB QUIRK so I'm wondering if adding my asmedia device to the quirks list really doesn't change anything unless it's xhci 0.95. I think the intention was that the quirk would be applied for your hardware even though it claims to be version 0.96. That was gust a hopeful guess that the same change would help. Does lspci provide that information? I'm not seeing anything obvious. I've only seen it in the diagnostic prints (that aren't normally output). David
RE: [RFC 00/10] xhci: re-work command queue management
From: Mathias Nyman This is an attempt to re-work and solve the issues in xhci command queue management that Sarah has descibed earlier: Right now, the command management in the xHCI driver is rather ad-hock. Different parts of the driver all submit commands, including interrupt handling routines, functions called from the USB core (with or without the bus bandwidth mutex held). Some times they need to wait for the command to complete, and sometimes they just issue the command and don't care about the result of the command. ... The Implementation: --- First step is to create a list of the commands submitted to the command queue. To accomplish this each command is required to be submitted with a properly filled command structure containing completion, status variable and a pointer to the command TRB that will be used. The first 7 patches are all about creating these command structures and submitting them when we queue commands. The command structures are allocated on the fly, the commands that are submitted in interrupt context are allocated with GFP_ATOMIC. Next, the global command queue is introduced. Commands are added to the queue when trb's are queued, and remove when the commad completes. Also switch to use the status variable and completion in the command struct. ... IMHO the xhci driver is already far too complicated, and this probably just makes it even worse. The fact that you are having to allocate memory ion an ISR ought also to be ringing alarm bells. Have you considered adding a 'software command ring' (indexed with the same value as the hardware one) and using it to hold additional parameters? It might even be worth only putting a single command into the hardware ring! That might simplify the timer code. This still has a fixed constraint on the number of queued commands, but I suspect that is bounded anyway (a few per device?). If not you can almost certainly arrange to grow the soft-ring before the isr code can run out of entries. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: walt On 01/09/2014 03:50 PM, Sarah Sharp wrote: On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: I've wondered if my xhci problems might be caused by hardware quirks, and wondering why I seem to be the only one who has this problem. Maybe I could take one for the team by buying new hardware toys that I don't really need but I could use to test the xhci driver? (I do enjoy buying new toys, necessary, or, um, maybe not :) It would be appreciated if you could see if your device causes other host controllers to fail. Who am I to keep a geek from new toys? ;) Sarah, I just fixed my xhci bug for US$19.99 :) #lspci | tail -1 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03) Do you know which version of the xhci spec it conforms to? (Will probably be 0.96 or 1.00). Of course, ASMedia will probably change the silicon they are using without changing anything on the packaging. David
RE: [RFC 00/10] xhci: re-work command queue management
From: Mathias Nyman ... The fact that you are having to allocate memory ion an ISR ought also to be ringing alarm bells. It did. Other options are as you said to use a 'software command ring'. Either just pre-allocate a full command ring (64 trbs * sizeof(struct xhci_command)), roughly 2300 bytes when driver loads, or then a smaller pool of commands just used for ISR submitted commands. (We then need to either either expand pool, or start allocating in isr if pool is empty) If you pre-allocate mapping 1-1 with the command ring entries you don't need many of the fields of xhci_command at all. So the allocated size will be much smaller. Also not that all the rings have been increased to 256 entries, not that there is any requirement for them to match. These solutions require some ring management and possibly expansion code, and increases the complexity again. I selected this simpler approach as I understood that the frequency of commands allocated in ISR is quite low. The frequency isn't the problem - the fact that it is allowed is a problem. This also feels like trying to optimize before we get the main changes working. I would call it 'not realising the full scope of the problem' and thus leaving the 'difficult bits' for last. Sorting out how to solve the difficult bits up front can lead to a nicer solution. Have you considered adding a 'software command ring' (indexed with the same value as the hardware one) and using it to hold additional parameters? It might even be worth only putting a single command into the hardware ring! That might simplify the timer code. This is something I haven't thought about. Could be one possibility, but feels like artificially restricting something the HW is designed to care of. It might resolve any issues about the fixed finite size of the hardware command ring. OTOH you probably don't need that many entries in the hardware command ring for 'normal' processing - provided you have a software ring/queue that can buffer all the requests. Looking at the number of fields in xci_command, why do you need the caller to pass a structure at all? Just make the fields parameters to the 'send a command' function along with a parameter that says whether it can sleep (in kmalloc() or if the ring is full depending on the implementation). ... This still has a fixed constraint on the number of queued commands, but I suspect that is bounded anyway (a few per device?). 64 commands fit on the command ring segment, I'm not aware of any per-device limits? I was wondering about how many commands the software might try to queue, not the number that the hardware allowed to be queued. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] usbnet: remove generic hard_header_len check
From: Emil Goode int rndis_rx_fixup(struct usbnet *dev, struct sk_buff *skb) { + /* This check is no longer done by usbnet */ + if (skb-len dev-net-hard_header_len) + return 0; + Wouldn't it be better to test against ETH_HLEN, since that is a constant and obviously correct in this case? Some minidrivers change the default hard_header_len value so using it guarantees that the patch will not make any change to how the code is currently working. Using ETH_HLEN could be more informative about what the minidriver should check before passing skbs to usbnet_skb_return(). Then I think the comment should be changed as well. My intention was to not make any changes that affect how the code works for devices I cannot test, but I think either way is fine and if you insist on changing it let me know. I think that test is to ensure that the data passed to the mini-driver contains the ethernet frame encapsulation header (this typically contains the actual frame length and some flags) so that the minidriver won't read off the end of the usb data. Any check for stupidly short ethernet frames would be later on. IIRC the absolute minimum 802.3 ethernet frame is 17 bytes (after frame padding has been stripped). David
RE: [PATCH v2] usbnet: remove generic hard_header_len check
From: Of Emil Goode This patch removes a generic hard_header_len check from the usbnet module that is causing dropped packages under certain circumstances for devices that send rx packets that cross urb boundaries. One example is the AX88772B which occasionally send rx packets that cross urb boundaries where the remaining partial packet is sent with no hardware header. When the buffer with a partial packet is of less number of octets than the value of hard_header_len the buffer is discarded by the usbnet module. ... diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c index d6f64da..955df81 100644 --- a/drivers/net/usb/ax88179_178a.c +++ b/drivers/net/usb/ax88179_178a.c @@ -1118,6 +1118,10 @@ static int ax88179_rx_fixup(struct usbnet *dev, struct sk_buff *skb) u16 hdr_off; u32 *pkt_hdr; + /* This check is no longer done by usbnet */ + if (skb-len dev-net-hard_header_len) + return 0; + The ax88179 driver can also receive ethernet frames that cross the end of rx URB. It should have the code to save the last fragment (of a urb) until the next data arrives, and then correctly merge the fragments. It is likely that any sub-driver that sets the receive urb length to a multiple of 1k (rather than leaving it at hard_hdr+mtu) can defrag rx data that crosses urb boundaries. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC] sctp: Update HEARTBEAT timer immediately after user changed HB.interval
+ + /* Update the heartbeat timer immediately. */ + if (!mod_timer(trans-hb_timer, + sctp_transport_timeout(trans))) + sctp_transport_hold(trans); This is causes a rather inconsistent behavior in that it doesn't account of the amount of time left on the hb timer. Thus it doesn't always do what commit log implies. If the remaining time is shorter then the new timeout value, it will take longer till the next HB the application expects. Being able to stop heartbeats being sent by repeatedly configuring the interval is certainly not a good idea! If the application has very strict timing requirement on the HBs, it should trigger the HB manually. We could rework the code to allow the combination of SPP_HB_DEMAND | SPP_HB_ENABLE to work as follows: 1) update the timer to the new value 2) Send an immediate HB a) Update the hb timeout. 2a should probably be done regardless since it can cause 2 HB to be outstanding at the same time. Sending a heartbeat 'on demand' might be problematic if one has also just been sent (from the timer). I'd have thought that you wouldn't want to send a heartbeat while still expecting a response from the previous one (this might require splitting the time interval in two). David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2 v2] usbnet: fix bad header length bug
From: Oliver Neukum censored. Oh well. But how about merging it with FLAG_MULTI_PACKET? I really don't want to add more flags. There is a point where enough flags make absurd having a common code. We are closing in on that point. Any sub-driver that supports multi-packet either has to use stupidly long URB and/or set the rx_urb_size to a multiple of the usb transfer size. It will also have to detect illegal short headers. Actually it might be worth double-checking the encapsulations used. IIRC the ax88179_178a uses different headers for tx and rx. So there might be some that support multi-packet but still have a short(ish) limit on the bulk receive size (before the short fragment). I'm sat at the wrong desk to look at the code... David
RE: [PATCH 05/14] net: axienet: Service completion interrupts ASAP
From: Michal Simek From: Peter Crosthwaite peter.crosthwa...@xilinx.com The packet completion interrupts for TX and RX should be serviced before the packets are consumed. This ensures against the degenerate case when a new completion interrupt is raised after the handler has exited but before the interrupts are cleared. In this case its possible for the ISR to clear an unhandled interrupt (leading to potential deadlock). I would clear the IRQ after processing the last packet, and then do a final check for another packet. That reduces the number of interrupts you take and then find there is no work (because it was done on the previous interrupt). There is a slight 'gotcha' in that the write to clear the IRQ can easily get delayed enough the that cpu exits the ISR before the IRQ line actually drops - leading the unwanted and unclaimed interrupts. A posted write over PCIe could easily take long enough. Maybe a hybrid scheme where the IRQ is cleared when the next entry is still owned by the device would work. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3] net: netfilter: LLVMLinux: vlais-netfilter
From: beh...@converseincode.com From: Mark Charlebois charl...@gmail.com Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in xt_repldata.h with a C99 compliant flexible array member and then calculated offsets to the other struct members. These other members aren't referenced by name in this code, however this patch maintains the same memory layout and padding as was previously accomplished using VLAIS. Had the original structure been ordered differently, with the entries VLA at the end, then it could have been a flexible member, and this patch would have been a lot simpler. However since the data stored in this structure is ultimately exported to userspace, the order of this structure can't be changed. This patch makes no attempt to change the existing behavior, merely the way in which the current layout is accomplished using standard C99 constructs. As such the code can now be compiled with either gcc or clang. Author: Mark Charlebois charl...@gmail.com Signed-off-by: Mark Charlebois charl...@gmail.com Signed-off-by: Behan Webster beh...@converseincode.com Signed-off-by: Vinícius Tinti viniciusti...@gmail.com --- net/netfilter/xt_repldata.h | 27 ++- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/net/netfilter/xt_repldata.h b/net/netfilter/xt_repldata.h index 6efe4e5..343599e 100644 --- a/net/netfilter/xt_repldata.h +++ b/net/netfilter/xt_repldata.h @@ -5,23 +5,40 @@ * they serve as the hanging-off data accessed through repl.data[]. */ +/* tbl has the following structure equivalent, but is C99 compliant: + * struct { + * struct type##_replace repl; + * struct type##_standard entries[nhooks]; + * struct type##_error term; + * } *tbl; + */ + #define xt_alloc_initial_table(type, typ2) ({ \ unsigned int hook_mask = info-valid_hooks; \ unsigned int nhooks = hweight32(hook_mask); \ unsigned int bytes = 0, hooknum = 0, i = 0; \ struct { \ struct type##_replace repl; \ - struct type##_standard entries[nhooks]; \ - struct type##_error term; \ - } *tbl = kzalloc(sizeof(*tbl), GFP_KERNEL); \ + struct type##_standard entries[]; \ + } *tbl; \ + struct type##_error *term; \ + size_t entries_end = offsetof(typeof(*tbl), \ + entries[nhooks-1]) + sizeof(tbl-entries[0]); \ Is the compiler complaining about: offsetof(typeof(*tbl), entries[nhooks]) If it does it is a PITA. + size_t term_offset = (entries_end + __alignof__(*term) - 1) \ + ~(__alignof__(*term) - 1); \ You've not tested this - the () are in the wrong places. + size_t term_end = term_offset + sizeof(*term); \ + size_t tbl_sz = (term_end + __alignof__(tbl-repl) - 1) \ + ~(__alignof__(tbl-repl) - 1); \ + tbl = kzalloc(tbl_sz, GFP_KERNEL); \ The number of temporary variables make the above hard to read. I'm not at all sure you need to worry about the trailing alignment. It rather depends on how the final data is used. If the combined buffer is copied to userspace you may not be copying all of the required data. It might be easier to call copytouser() twice. if (tbl == NULL) \ return NULL; \ + term = (struct type##_error *)(((char *)tbl)[term_offset]); \ strncpy(tbl-repl.name, info-name, sizeof(tbl-repl.name)); \ - tbl-term = (struct type##_error)typ2##_ERROR_INIT; \ + *term = (struct type##_error)typ2##_ERROR_INIT; \ David
RE: [PATCH v3] mac80211: LLVMLinux: Remove VLAIS usage from mac80211
From: beh...@converseincode.com Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99 compliant equivalent. This is the original VLAIS struct. struct { struct aead_request req; u8 priv[crypto_aead_reqsize(tfm)]; } aead_req; This patch instead allocates the appropriate amount of memory using an char array. The new code can be compiled with both gcc and clang. Signed-off-by: Jan-Simon Mller dl...@gmx.de Signed-off-by: Behan Webster beh...@converseincode.com Signed-off-by: Vincius Tinti viniciusti...@gmail.com Signed-off-by: Mark Charlebois charl...@gmail.com --- net/mac80211/aes_ccm.c | 40 ++-- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c index 7c7df47..20521f9 100644 --- a/net/mac80211/aes_ccm.c +++ b/net/mac80211/aes_ccm.c @@ -23,12 +23,14 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad, u8 *data, size_t data_len, u8 *mic) { struct scatterlist assoc, pt, ct[2]; - struct { - struct aead_request req; - u8 priv[crypto_aead_reqsize(tfm)]; - } aead_req; - memset(aead_req, 0, sizeof(aead_req)); + char aead_req_data[sizeof(struct aead_request) + + crypto_aead_reqsize(tfm)] + __aligned(__alignof__(struct aead_request)); + + struct aead_request *aead_req = (void *) aead_req_data; + + memset(aead_req, 0, sizeof(aead_req_data)); It seems to me that the underlying function(s) ought to be changed so that this isn't needed. Passing an extra parameter would probably cost very little. David
RE: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040
From: Kevin Hao Sent: 20 March 2014 11:48 To: Scott Wood Cc: linuxppc-...@lists.ozlabs.org; Chenhui Zhao; jason@freescale.com; linux-kernel@vger.kernel.org Subject: Re: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040 On Tue, Mar 18, 2014 at 06:18:54PM -0500, Scott Wood wrote: The sequence write, readback, sync will guarantee this according to the manual. If you're referring to the section you quoted above, the manual does not say that. It only talks about when accesses to memory regions affected by the configuration register write can be safely made. To guarantee that the results of any sequence of writes to configuration registers are in effect, the final configuration register write should be immediately followed by a read of the same register, and that should be followed by a SYNC instruction. Then accesses can safely be made to memory regions affected by the configuration register write. That sort of sequence is need to force the operations through any external bus - after the cpu itself has issued the bus cycles. Mostly required because writes are often 'posted' (ie address and data latched, and then performed synchronously). According to the above description in t4240 RM manual (2.6.1 Accessing CCSR Memory from the Local Processor), that the writing to configuration register takes effect is a prerequisite for the memory access to the affected regions. ... OK, so the intention of 'twi, isync' following the load is not to order the following storage access, but order the following delay loop instructions, right I tried to work out what the 'twi, isync' instructions were for (in in_le32()). The best I could come up with was to ensure a synchronous bus-fault. But bus faults are probably only expected during device probing - not normal operation, and the instructions will have a significant cost. Additionally in_le32() and out_le32() both start with a 'sync' instruction. In many cases that isn't needed either - an explicit iosync() can be used after groups of instructions. David
RE: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040
From: Scott Wood [mailto:scottw...@freescale.com] On Thu, 2014-03-20 at 11:59 +, David Laight wrote: I tried to work out what the 'twi, isync' instructions were for (in in_le32()). The best I could come up with was to ensure a synchronous bus-fault. But bus faults are probably only expected during device probing - not normal operation, and the instructions will have a significant cost. Additionally in_le32() and out_le32() both start with a 'sync' instruction. In many cases that isn't needed either - an explicit iosync() can be used after groups of instructions. The idea is that it's better to be maximally safe by default, and let performance critical sections be optimized using raw accessors and explicit synchronization if needed, than to have hard-to-debug bugs due to missing/wrong sync. A lot of I/O is slow enough that the performance impact doesn't really matter, but the brain-time cost of getting the sync right is still there. Hmmm That might be an excuse for the 'sync', but not the twi and isync. I was setting up a dma request (for the ppc 83xx PCIe bridge) and was doing back to back little-endian writes into memory. I had difficulty finding and including header files containing the definitions for byteswapped accesses I needed. arch/powerpc/include/asm/swab.h contains some - but I couldn't work out how to get it included (apart from giving the full path). In any case you need to understand when synchronisation is required - otherwise you will get it wrong. Especially since non-byteswapped accesses are done by direct access. David
RE: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote: Eric Dumazet eric.duma...@gmail.com writes: I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that is insane... Couldn't it just be the cache miss? Or the fact that we mix 16 bit stores and 32bit loads ? BTW, any idea why ip_fast_csum() on x86 contains a memory constraint ? The correct constraint would be one that told gcc that it accesses the 20 bytes from the source pointer. Without it gcc won't necessarily write out the values before the asm instructions execute. David
RE: [PATCH 07/21] netfilter: nf_nat: remove inline marking of EXPORT_SYMBOL functions
EXPORT_SYMBOL and inline directives are contradictory to each other. The patch fixes this inconsistency. ... -inline const struct nf_nat_l4proto * +const struct nf_nat_l4proto * __nf_nat_l4proto_find(u8 family, u8 protonum) { return rcu_dereference(nf_nat_l4protos[family][protonum]); If it makes sense to inline the local calls (ie the cost of the call is significant) then possibly add an inlined (or inlinable) static function that is called locally and by the exported one? I'm not sure that gcc is allowed to make the assumption that the local exported function will be called - and thus inline it. David -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces
Looks like the ptype_base[] should be per 'dev'? Or just put entries where ptype-dev != null_or_dev on a per-interface list and do two searches? Yes, but then we would have two searches instead of one in fast path. Usually it would be empty - so the search would be very quick! David
RE: [PATCH v9 net-next 5/7] net: simple poll/select low latency socket poll
I am a bit uneasy with this one, because an applicatio polling() on one thousand file descriptors using select()/poll(), will call sk_poll_ll() one thousand times. Anything calling poll() on 1000 fds probably has performance issues already! Which is why kevent schemes have been added. At least the Linux code doesn't use a linked list for the fd - 'struct file' map which made poll() O(n^2), and getting to that number of open fds O(n^3) on some versions of SVR4. David
RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces
I have a Linux router with a lot of interfaces (hundreds or thousands of VLANs) and an application that creates AF_PACKET socket per interface and bind()s sockets to interfaces. ... I noticed that box has strange performance problems with most of the CPU time spent in __netif_receive_skb: 86.15% [k] __netif_receive_skb 1.41% [k] _raw_spin_lock 1.09% [k] fib_table_lookup 0.99% [k] local_bh_enable_ip ... This corresponds to: net/core/dev.c: type = skb-protocol; list_for_each_entry_rcu(ptype, ptype_base[ntohs(type) PTYPE_HASH_MASK], list) { if (ptype-type == type (ptype-dev == null_or_dev || ptype-dev == skb-dev || ptype-dev == orig_dev)) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } } Which works perfectly OK until there are a lot of AF_PACKET sockets, since the socket adds a protocol to ptype list: Presumably the 'ethertype' is the same for all the sockets? (And probably the ' PTYPE_HASH_MASH' doesn't separate it from 0800 or 0806 (IIRC IP and ICMP)) How often is that deliver_skb() inside the loop called? If the code could be arranged so that the scan loop didn't contain a function call then the loop code would be a lot faster since the compiler can cache values in registers. While that woukd speed the code up somewhat, there would still be a significant cost to iterate 1000+ times. Looks like the ptype_base[] should be per 'dev'? Or just put entries where ptype-dev != null_or_dev on a per-interface list and do two searches? David