RE: Re: [PATCH v2] mm/page_owner: ignore everything below the IRQ entry point

2018-03-26 Thread Vaneet Narang
Hi Dmitry,

>Every user of stack_depot should filter out irq frames, without that
>stack_depot will run out of memory sooner or later. so this is a
>change in the right direction.
> 
>Do we need to define empty version of in_irqentry_text? Shouldn't only
>filter_irq_stacks be used by kernel code?

We thought about this, but since we were adding both APIs, filter_irq_stacks() and
in_irqentry_text(), to the header file, we defined empty versions for both, as both
APIs are accessible to any module that includes the header.
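
For context, the header pattern in question would look roughly like the sketch below;
the CONFIG_STACKDEPOT guard and the exact signatures are assumptions for illustration,
not copied from the patch.

#ifdef CONFIG_STACKDEPOT
bool in_irqentry_text(unsigned long ptr);
void filter_irq_stacks(struct stack_trace *trace);
#else
static inline bool in_irqentry_text(unsigned long ptr)
{
	/* no stackdepot: treat nothing as IRQ-entry text */
	return false;
}

static inline void filter_irq_stacks(struct stack_trace *trace)
{
	/* no stackdepot: leave the trace untouched */
}
#endif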

If you think the empty definition of in_irqentry_text() is not required, we will
modify and resend the patch.

Thanks & Regards,
Vaneet Narang


Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

2018-03-23 Thread Vaneet Narang

Hi Nick,

Thanks for your comments; please check my replies to a few of them below.
I will share benchmarking figures separately.

> 
>> +if (curr_offset > 127) {
>> +curr_offset = (curr_offset << 1) | DYN_BIT;
>> +LZ4_writeLE16(op, (U16)curr_offset);
>> +op += 2;
>> +} else {
>> +curr_offset = curr_offset << 1;
>> +*op = (BYTE)curr_offset;
>> +op++;
>> +}
> 
>The standard way to do variable sized integers is to use the high-bit as
>the control bit, not the low-bit. Do you have benchmarks to show that using
>the low-bit is faster than using the high-bit? If so, lets comment in the
>code, if not lets use the high-bit.
> 
We are not sure about the performance difference between using the low or the high bit.
However, since the LZ4 specification requires the offset to be stored in little-endian
format, keeping the low bit as the control bit makes it easier to retrieve the offset
when it is spread across two bytes:

offset = LZ4_readLE16(ip);
if (offset & 0x1) {
	offset = offset >> 1;            // just one right shift
	ip += 2;
} else {
	offset = (offset & 0xff) >> 1;   // only two operations for a one-byte offset
	ip += 1;
}

So we are not sure that using the high bit would make much difference, as the same
number of operations (or more) would be needed to extract the offset in that case.

>>  /* copy literals */
>>  cpy = op + length;
>>  if (((endOnInput) && ((cpy > (partialDecoding ? oexit : oend - MFLIMIT))
>> -|| (ip + length > iend - (2 + 1 + LASTLITERALS
>> -|| ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) {
>> +|| (ip + length > iend - (2 + LASTLITERALS
>> +|| ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH - 1))) {
>>  if (partialDecoding) {
>>  if (cpy > oend) {
>>  /*
>> @@ -188,13 +190,31 @@ static FORCE_INLINE int LZ4_decompress_generic(
>>  break;
>>  }
>>
>> -LZ4_wildCopy(op, ip, cpy);
>> +if (dynOffset && length < 4)
>> +LZ4_copy4(op, ip);
>> +else
>> +LZ4_wildCopy(op, ip, cpy);
>> +
> 
>The LZ4 format enforces that the last 5 bytes are literals so that
>LZ4_wildCopy() can be used here. I suspect that having this extra branch
>here for `dynOffset` mode hurts decompression speed.
> 
This check was added to handle a one-byte read overflow when decompressing the last
sequence; LZ4_wildCopy() blindly copies 8 bytes at a time.

Issue case: a zero-length literal followed by a one-byte offset, with the stream
ending in a one-byte token and the 5-byte last literals.

With this combination only 7 bytes (1-byte offset + 1-byte token + 5-byte literals)
remain at the end of the input buffer, so reading 8 bytes from the input buffer
results in a one-byte overflow.

Since a one-byte offset does not exist in the original LZ4, this issue is not possible
there. To avoid the overhead of this check, we plan to use 6 bytes as the last-literal
length:

-#define LASTLITERALS 5
+#define LASTLITERALS 6

>>
>>  int LZ4_decompress_safe(const char *source, char *dest,
>> -int compressedSize, int maxDecompressedSize)
>> +int compressedSize, int maxDecompressedSize, bool dynOffset)
>>  {
>>  return LZ4_decompress_generic(source, dest, compressedSize,
>>  maxDecompressedSize, endOnInputSize, full, 0,
>> -noDict, (BYTE *)dest, NULL, 0);
>> +noDict, (BYTE *)dest, NULL, 0, dynOffset);
> 
>You'll need to use the same trick that LZ4_compress_fast() uses, by hard
>coding `dynOffset`. We want the compiler to generate two versions of
>LZ4_decompress_generic(), one with `dynOffset` and one with `!dynOffset`.
>That way the tight loop won't have the branches that check `dynOffset`.
> 
>if (dynOffset)
>return LZ4_decompress_generic(..., true);
>else
>return LZ4_decompress_generic(..., false);
> 
>Without this trick, I expect that this patch causes a regression to both
>LZ4 and LZ4_DYN decompression speed.

Since our solution is not backward compatible with the original LZ4, we are a bit
unsure whether to make it a separate API (to avoid the overhead of dynOffset checks
everywhere in the code and to avoid changing the prototypes of the exported
LZ4_decompress/LZ4_compress functions), or to keep these checks and avoid duplicating
code. Kindly suggest.
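
For reference, the hard-coded dispatch Nick describes would look roughly like the
sketch below; the argument list is copied from the quoted diff, so treat it as
illustrative rather than final.

/* Sketch: hard-code dynOffset so the compiler emits two specialised
 * copies of LZ4_decompress_generic() and the hot loop carries no
 * dynOffset branches.
 */
int LZ4_decompress_safe(const char *source, char *dest,
	int compressedSize, int maxDecompressedSize, bool dynOffset)
{
	if (dynOffset)
		return LZ4_decompress_generic(source, dest, compressedSize,
			maxDecompressedSize, endOnInputSize, full, 0,
			noDict, (BYTE *)dest, NULL, 0, true);

	return LZ4_decompress_generic(source, dest, compressedSize,
		maxDecompressedSize, endOnInputSize, full, 0,
		noDict, (BYTE *)dest, NULL, 0, false);
}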

Thanks & Regards,
Vaneet Narang


Re: [PATCH 0/1] cover-letter/lz4: Implement lz4 with dynamic offset length.

2018-03-23 Thread Vaneet Narang
Hi Nick / Sergey,


We have compared LZ4 Dyn with the original LZ4 using some samples of real-time
application data (4 KB) compressed/decompressed by ZRAM. For the comparison we used
lzbench (https://github.com/inikep/lzbench). We implemented a dedicated LZ4 Dyn API
and kept the last-literal length at 6 to avoid the overhead of the extra checks.
In the average case there is a saving of 3~4% in compression ratio with almost the
same compression speed and a minor loss in decompression speed (~50 MB/s) compared
with LZ4.

A comparison of LZ4 Dyn with LZO1x is also included, as LZO1x is the default
compressor for ZRAM.

Original LZ4:
sh-3.2# ./lzbench  -r  -elz4  data/
lzbench 1.7.3 (32-bit Linux)   Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size  Ratio Filename
memcpy   2205 MB/s  2217 MB/s4096 100.00 data//data_1
lz4 1.8.0 216 MB/s   761 MB/s2433  59.40 data//data_1
lz4 1.8.0 269 MB/s   877 MB/s1873  45.73 data//data_2
lz4 1.8.0 238 MB/s   575 MB/s2060  50.29 data//data_3
lz4 1.8.0 321 MB/s  1015 MB/s1464  35.74 data//data_4
lz4 1.8.0 464 MB/s  1090 MB/s 713  17.41 data//data_5
lz4 1.8.0 296 MB/s   956 MB/s1597  38.99 data//data_6
lz4 1.8.0 338 MB/s   994 MB/s2238  54.64 data//data_7
lz4 1.8.0 705 MB/s  1172 MB/s 193   4.71 data//data_8
lz4 1.8.0 404 MB/s  1150 MB/s1097  26.78 data//data_9
lz4 1.8.0 216 MB/s   921 MB/s3183  77.71 data//data_10
lz4 1.8.0 456 MB/s  1101 MB/s1011  24.68 data//data_11
lz4 1.8.0 867 MB/s  1202 MB/s  37   0.90 data//data_12


LZ4 Dynamic Offset:
sh-3.2# ./lzbench  -r  -elz4_dyn  data/
lzbench 1.7.3 (32-bit Linux)   Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size  Ratio Filename
memcpy   2203 MB/s  2218 MB/s4096 100.00 data//data_1
lz4 1.8.0 218 MB/s   693 MB/s2228  54.39 data//data_1
lz4 1.8.0 273 MB/s   851 MB/s1739  42.46 data//data_2
lz4 1.8.0 230 MB/s   526 MB/s1800  43.95 data//data_3
lz4 1.8.0 321 MB/s   952 MB/s1357  33.13 data//data_4
lz4 1.8.0 470 MB/s  1075 MB/s 664  16.21 data//data_5
lz4 1.8.0 303 MB/s   964 MB/s1455  35.52 data//data_6
lz4 1.8.0 345 MB/s   951 MB/s2126  51.90 data//data_7
lz4 1.8.0 744 MB/s  1163 MB/s 177   4.32 data//data_8
lz4 1.8.0 409 MB/s  1257 MB/s1033  25.22 data//data_9
lz4 1.8.0 220 MB/s   857 MB/s3049  74.44 data//data_10
lz4 1.8.0 464 MB/s  1105 MB/s 934  22.80 data//data_11
lz4 1.8.0 874 MB/s  1194 MB/s  36   0.88 data//data_12


LZ4 Dynamic Offset with 32K data:
sh-3.2# ./lzbench -elz4_dyn data/data32k
lzbench 1.7.3 (32-bit Linux)   Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size  Ratio Filename
memcpy   5285 MB/s  5283 MB/s   32768 100.00 data/data32k
lz4 1.8.0 274 MB/s   995 MB/s   13435  41.00 data/data32k
done... (cIters=1 dIters=1 cTime=1.0 dTime=2.0 chunkSize=1706MB cSpeed=0MB)

Original LZ4 with 32K data:
sh-3.2# ./lzbench_orig -elz4 data/data32k
lzbench 1.7.3 (32-bit Linux)   Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size  Ratio Filename
memcpy   4918 MB/s  5108 MB/s   32768 100.00 data/data32k
lz4 1.8.0 276 MB/s  1045 MB/s   14492  44.23 data/data32k

LZO1x with 32K data (Default Compressor for ZRAM): 
sh-3.2# ./lzbench -elzo1x,1 data/data32k
lzbench 1.7.3 (32-bit Linux)   Assembled by P.Skibinski
Compressor name Compress. Decompress. Compr. size  Ratio Filename
memcpy   5273 MB/s  5320 MB/s   32768 100.00 data/data32k
lzo1x 2.09 -1 283 MB/s   465 MB/s   14292  43.62 data/data32k

Regards,
Vaneet Narang


Re: [PATCH 2/2] printk: make sure to print log on console.

2018-06-01 Thread Vaneet Narang
Hi Sergey,

>> but the problem is real.
>
>Yep, could be. But not exactly the way it is described in the commit
>messages and the patch does not fully address the problem.
>
>The patch assumes that all those events happen sequentially. While
>in reality they can happen in parallel on different CPUs.
>
>Example:
>
> CPU0 CPU1
>
> set console verbose
>
> dump_backtrace()
> {
>  // for (;;)  print frames
>  printk("%pS\n", frame0);
>  printk("%pS\n", frame1);
>  printk("%pS\n", frame2);
>  printk("%pS\n", frame3);
>  ...console_loglevel = CONSOLE_LOGLEVEL_SILENT;
>  printk("%pS\n", frame12);
>  printk("%pS\n", frame13);
> }
>

This is not a printk issue; it is a printk usage issue. The user needs to handle this
case with some protection of their own.

What we highlighted is an issue within printk itself, where the usage is correct
but printk can still miss some logs due to its asynchronous printing design.


>Part of backtrace or the entire backtrace will be missed, because
>we read the global console_loglevel. The problem is still there.
>
>> The console handling is asynchronous even without the kthread.
>> The current console_lock owner is responsible for handling messages
>> also from other CPUs.
>
>A nitpick: that's another thing to improve in the commit message, because
>   I don't see that console_silent() race in the upstream kernel. We even
>   don't have any users of console_silent() function. The only thing that
>   ever sets console_loglevel to CONSOLE_LOGLEVEL_SILENT is
>   kernel/debug/kdb/kdb_io.c
>

The issue is not related to console_silent() users; the issue is with dynamic console
log level handling, which can also be done using dmesg. For example:

dmesg -n 7
"Run some scenario to capture all the kernel logs"
dmesg -n 1

I may end up not getting all the logs if my console flush is slow and "dmesg -n 1"
gets executed before the flush.


Thanks & Regards,
Vaneet Narang


RE: [PATCH v2] arm/stacktrace: stop unwinding after an invalid address.

2018-04-03 Thread Vaneet Narang
 
Hi Russell,


>__dabt_usr+0x44/0x60
>0xb6748ea4
>

>--- a/arch/arm/kernel/stacktrace.c
>+++ b/arch/arm/kernel/stacktrace.c
>@@ -92,6 +92,9 @@ static int save_trace(struct stackframe *frame, void *d)
> 
>   regs = (struct pt_regs *)frame->sp;
>
>+  if (!__kernel_text_address(regs->ARM_pc))
>+  return 1;
>+
>   trace->entries[trace->nr_entries++] = regs->ARM_pc;
 
 
Any inputs or comments on this patch, which avoids storing user-space entries
during unwind?
 
Regards,
Vaneet Narang



RE: Re: [PATCH 1/1] lz4: Implement lz4 with dynamic offset length.

2018-04-03 Thread Vaneet Narang
Hi Sergey,

>You shrink a 2 bytes offset down to a 1 byte offset, thus you enforce that
The 2-byte offset is not shrunk to 1 byte; only 1 bit is reserved out of the 16 offset
bits, so 15 bits can be used to store the offset value.

>'page should be less than 32KB', which I'm sure will be confusing. 
lz4_dyn will work on bigger data lengths (> 32K), but in that case the compression
ratio may not be better than LZ4. This is the same situation as LZ4 compressing data
larger than 64K (16 bits): LZ4 cannot store an offset larger than 64K, and similarly
LZ4 dyn cannot store an offset larger than 32K.

There is handling in the LZ4 code for this, and similar handling was added for LZ4
dyn (max_distance is 32K for lz4_dyn and 64K for LZ4):

int max_distance = dynOffset ? MAX_DISTANCE_DYN : MAX_DISTANCE;

>And you
>rely on lz4_dyn users to do the right thing - namely, to use that 'nice'
>`#if (PAGE_SIZE < (32 * KB))'. 
Users don't need to add this code; they just need to choose the compression algorithm
that fits their requirement. If the source length is less than 32K, lz4_dyn gives a
better compression ratio than LZ4.

Considering ZRAM as a user of LZ4 dyn, we added this check on PAGE_SIZE, which is the
source length there. The check adds lz4_dyn to the preferred list of compression
algorithms when the page size is less than 32K.

>Apart from that, lz4_dyn supports only data
>in up to page_size chunks. Suppose my system has page_size of less than 32K,
>so I legitimately can enable lz4_dyn, but suppose that I will use it
>somewhere where I don't work with page_size-d chunks. Will I able to just
>do tfm->compress(src, sz) on random buffers? The whole thing looks to be
>quite fragile.
No, that's not true: lz4_dyn works on random buffers, and they need not be page-size
chunks. There is no difference in how LZ4 and LZ4 dyn operate.

The only difference is that LZ4 dyn does not use a fixed offset size. This concept is
already used in LZO, which uses a dynamic metadata size based on the match length and
match offset, with different markers defining the length of the metadata.

lzodefs.h:

#define M1_MAX_OFFSET   0x0400
#define M2_MAX_OFFSET   0x0800
#define M3_MAX_OFFSET   0x4000
#define M4_MAX_OFFSET   0xbfff

#define M1_MIN_LEN  2
#define M1_MAX_LEN  2
#define M2_MIN_LEN  3
#define M2_MAX_LEN  8
#define M3_MIN_LEN  3
#define M3_MAX_LEN  33
#define M4_MIN_LEN  3
#define M4_MAX_LEN  9

#define M1_MARKER   0
#define M2_MARKER   64
#define M3_MARKER   32
#define M4_MARKER   16

Similarly, for LZ4 dyn we have used 1 bit as a marker to determine the offset length.

Thanks & Regards,
Vaneet Narang



Re: [PATCH] af_packet: Raw socket destruction warning fix

2016-02-10 Thread Vaneet Narang
Hi,

>What driver are you using (is that in-tree)? Can you reproduce the same issue
>with a latest -net kernel, for example (or, a 'reasonably' recent one like 4.3 
>or
>4.4)? There has been quite a bit of changes in err queue handling (which also
>accounts rmem) as well. How reliably can you trigger the issue? Does it trigger
>with a completely different in-tree network driver as well with your tests? 
>Would
>be useful to track/debug sk_rmem_alloc increases/decreases to see from which 
>path
>new rmem is being charged in the time between packet_release() and 
>packet_sock_destruct()
>for that socket ...
>
It looks like a race condition between packet_rcv() and packet_close() to us. We tried
to reproduce the issue by adding a delay in skb_set_owner_r(), and it then reproduces
quite frequently. We added some traces, and on analysing them we realised the following
race is possible:

packet_rcv                                       packet_close
----------                                       ------------
skb_set_owner_r(skb, sk);

                                                 skb_queue_purge(&sk->sk_receive_queue);

spin_lock(&sk->sk_receive_queue.lock);
__skb_queue_tail(&sk->sk_receive_queue, skb);
spin_unlock(&sk->sk_receive_queue.lock);

Since the packet has not yet been added to the receive queue, the receive queue purge
has no effect on it: it will not free an sk_buff that is only queued afterwards. To
fix this, we made sure skb_set_owner_r(skb, sk) and
__skb_queue_tail(&sk->sk_receive_queue, skb) are both called under the receive queue
lock, and we moved the receive queue purge from packet_release() to
packet_sock_destruct().
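
A rough sketch of what the enqueue path looks like with that ordering (illustrative
only, not the actual patch; error handling and the rest of packet_rcv() are omitted):

	/* Charge rmem and enqueue atomically under the receive queue lock,
	 * so a concurrent purge in packet_close() cannot run in between.
	 */
	spin_lock(&sk->sk_receive_queue.lock);
	skb_set_owner_r(skb, sk);
	__skb_queue_tail(&sk->sk_receive_queue, skb);
	spin_unlock(&sk->sk_receive_queue.lock);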


We added some traces in skb_set_owner_r() & packet_sock_destruct(). This is what we got:

CPU 0
 sk = db6d17c0 sk->sk_flags = 0x320 Size = 1984
Backtrace:
 (dump_backtrace+0x0/0x128) from (show_stack+0x20/0x28) 
 (show_stack+0x0/0x28) from (dump_stack+0x24/0x28)
 (dump_stack+0x0/0x28) from (packet_rcv+0x480/0x488)
 (packet_rcv+0x0/0x488) from (__netif_receive_skb_core+0x53c/0x674)
 (__netif_receive_skb_core+0x0/0x674) from (__netif_receive_skb+0x20/0x74)
 (__netif_receive_skb+0x0/0x74) from (netif_receive_skb+0x2c/0xbc)
 (netif_receive_skb+0x0/0xbc) from (napi_gro_receive+0x90/0xc0)
..
 (net_rx_action+0x0/0x300) from(__do_softirq+0x160/0x340)
 (__do_softirq+0x0/0x340) from (do_softirq+0xc4/0xe0)
 (do_softirq+0x0/0xe0) from (irq_exit+0xc4/0xf8)
 (irq_exit+0x0/0xf8) from (handle_IRQ+0x88/0x10c)
 (handle_IRQ+0x0/0x10c) from  (gic_handle_irq+0x64/0xac)

 CPU 1
Backtrace:
sk = db6d17c0 sk->sk_rmem_alloc=1984  sk->sk_flags = 0x141
Receive Queue Empty = "Yes"  Error queue empty = "Yes"
 
 (packet_sock_destruct+0x0/0x1f4) from (__sk_free+0x28/0x18c)
 (__sk_free+0x0/0x18c) from (sk_free+0x40/0x48)
 (sk_free+0x0/0x48) from (packet_release+0x29c/0x310)
 (packet_release+0x0/0x310) from  (sock_release+0x30/0xb8)
 (sock_release+0x0/0xb8) from (sock_close+0x1c/0x28)
 (sock_close+0x0/0x28) from (__fput+0x9c/0x2b4)
 (__fput+0x0/0x2b4) from (fput+0x18/0x20)
 (fput+0x0/0x20) from (task_work_run+0xc0/0xfc)
 (task_work_run+0x0/0xfc) from (do_work_pending+0x108/0x114)
 (do_work_pending+0x0/0x114) from (work_pending+0xc/0x20)

From this it appears packet_rcv() was called before packet_release() had completed,
as sk_flags = 0x320 (SOCK_DEAD is not set), and packet_sock_destruct() was called
after sk_rmem_alloc had been increased but before the packet was added to the receive
queue. The pending sk_rmem_alloc is the same as the size of the last packet received
on the socket.
 
Kindly comment on the fix shared at the following link:
http://www.spinics.net/lists/kernel/msg2184815.html

Thanks & Regards,
Vaneet Narang

Re: [EDT][PATCH 1/1] hw_breakpoint.c :cpu hotplug handling

2015-05-17 Thread Vaneet Narang
EP-2DAD0AFA905A4ACB804C4F82A001242F

On Wed, May 13, 2015 at 06:24:06AM +0100, Maninder Singh wrote:
>> EP-2DAD0AFA905A4ACB804C4F82A001242F
>> 
>> Subject: [PATCH 1/1] hw_breakpoint.c :cpu hotplug handling
>> 
>> This patch adds support for CPU hotplug: it re-installs all installed
>> watchpoints and breakpoints back on the H/W in case of CPU hotplug.

>Not sure why this is needed -- the scheduler should reinstall the
>breakpoints when the debugged task gets scheduled in via
>arch_install_hw_breakpoint.
>
>Will

I agree with you that this reinstalling has to be taken care of either by the scheduler
or by the debug tool. In the current implementation we clear the H/W registers, but we
don't clear the slots (wp_on_reg / bp_on_reg) for either watchpoints or breakpoints.
This makes it mandatory for the debug tool to uninstall breakpoints when a CPU goes
offline (which I think should not be required), because if it doesn't, then when the
CPU comes back online we will not be able to reinstall them: there will be no free
slots, even though the H/W registers themselves are free.
Logically we should either clear these slots or reinstall the breakpoints; if you think
reinstalling has to be taken care of by the scheduler or the debugger, then at least we
should clear these slots.

Regards,
Vaneet Narang


Re: [EDT] [PATCH 1/1] Fix: hw watchpoint continually triggers callback

2015-05-18 Thread Vaneet Narang

 EP-2DAD0AFA905A4ACB804C4F82A001242F
>
>Ok, I have to ask: what on Earth is this number and what does [EDT] mean?

This is auto-generated by our editor. Kindly ignore it; it's not relevant.
Please find my reply inline.

>> >On Tue, May 12, 2015 at 02:12:54PM +0100, Vaneet Narang wrote:
>> >> On Tue, May 12, 2015 at 12:48:13PM +0100, Maninder Singh wrote:
>> >> This fix is given for kernel developers who wants to use perf interface by
>> >> registering callback using register_wide_hw_breakpoint API.  On every
>> >> callback trigger they have to unregister watchpoints otherwise callback
>> >> gets called in a loop and now issue is "when to register watch point back
>> >> ?". 
>> 
>> >If you want to solve this, I think we need a better way to expose software
>> >single-step/emulation to the overflow handler. If we try to do this in
>> >the hw_breakpoint code itself, we run into problems:
>> 
>> >  - What if another thread hits the same instruction whilst we are trying
>> >   to step it?
>> 
>> >  - What if there are two breakpoints or a breakpoint + watchpoint
>> >   triggered by the same instruction?
>> 
>> Thanks for you input. I am not sure, issues which you have mentioned with
>> this implementation will actually come.
>> To address the issues you have raised, I need to brief about kprobe.
>> Kprobe follows 3 steps for breakpoint (BP) handling.
>> 1. Decode BP instruction.
>> 2. Replace BP instruction with new instruction that will generate SWI.
>> 3. Execute instruction & move PC to next instruction.
>> Kprobe follows step 1 & step 2 on addition of BP and 3rd step is followed
>> when SWI gets triggered.
>> 
>> For this fix we have used step 1 & step 3, we have skipped step 2. As we
>> don't know the caller of watch point & also HW generates interrupt so step
>> 2 is not required. The only difference is since we don't know the caller
>> we can't decode instruction in advance. We have to follow step 1 and step
>> 3 when HWI gets triggered.  Since we are not replacing instruction from
>> memory and I assume kprobe implementation for execution of instruction in
>> interrupt context is tested and stable, so it shouldn't  produce any of
>> the above race condition issues.

>
>Ok, so my first point shouldn't be a problem if we're just emulating the
>instruction. However, I still think there are corner cases. For example,
>imagine hitting a breakpoint on a ldr pc, [&foo] instruction where we've
>also got a watchpoint on foo. Even with emulation, it's going to be
>difficult to ensure that watchpoint is safely delivered.
>
>As I say, I'd really rather have a kprobes-agnostic way of stepping an
>instruction and let the debugger decide whether it wants to use that or
>not.
>

Two breakpoints will not be an issue, but watchpoint + breakpoint is an interesting
scenario with an ldr pc, [&foo] instruction in place.
How would ARM behave in this case without kprobes? I think it would keep generating
the watchpoint interrupt only.

With kprobes, the watchpoint interrupt gets triggered first, and as soon as we execute
ldr pc, [&foo] via kprobes it triggers the breakpoint interrupt.
This could be handled with special treatment for instructions that change the PC. Can
you please suggest what the correct behaviour should be in this case?
Is this scenario possible with any other instruction? I cannot think of one; is it
possible with push or pop?

>> >  - What if the debugger didn't want to execute the instruction at all?
>> 
>> if debugger doesn't want to execute instruction then debugger should use
>> single step implementation without overflow handler. 
>
>But the debugger might want to use the overflow handler to regain control
>on the exception (like ptrace does for userspace).
>
I have tried the same kernel module on x86 (Linux 3.5); the behaviour on x86 is to
execute the instruction.
I am not sure how ptrace works on x86, i.e. whether the instruction gets executed
without any control from the overflow handler.
The behaviour on ARM should be the same as on x86, since the perf API interface is
the same on ARM and x86.

>Will

Regards,
Vaneet Narang

Fwd: Re: Re: [EDT] [PATCH 1/1] Fix: hw watchpoint continually triggers callback

2015-05-12 Thread Vaneet Narang
On Tue, May 12, 2015 at 12:48:13PM +0100, Maninder Singh wrote:
>> EP-2DAD0AFA905A4ACB804C4F82A001242F
>> 
>> On ARM, when a watchpoint is registered using register_wide_hw_breakpoint, 
>> the callback handler endlessly runs until the watchpoint is unregistered.
>> The reason for this issue is debug interrupts gets raised before executing 
>> the instruction,
>> and after interrupt handling ARM tries to execute the same instruction again 
>> , which results
>> in interrupt getting raised again.
>> 
>> This patch fixes this issue by using KPROBES (getting the instruction 
>> executed and incrementing PC
>> to next instruction).
>> 
>> Signed-off-by: Vaneet Narang 
>> Signed-off-by: Maninder Singh 
>> Reviewed-by: Amit Arora 
>> Reviewed-by: Ajeet Yadav 
>> ---
>>  arch/arm/kernel/hw_breakpoint.c |   18 ++
>>  1 files changed, 18 insertions(+), 0 deletions(-)
>> 
>> diff --git a/arch/arm/kernel/hw_breakpoint.c 
>> b/arch/arm/kernel/hw_breakpoint.c
>> index dc7d0a9..ec72f86 100644
>> --- a/arch/arm/kernel/hw_breakpoint.c
>> +++ b/arch/arm/kernel/hw_breakpoint.c
>> @@ -37,6 +37,9 @@
>>  #include 
>>  #include 
>>  #include 
>> +#ifdef CONFIG_KPROBES
>> +#include 
>> +#endif
>>  
>>  /* Breakpoint currently in use for each BRP. */
>>  static DEFINE_PER_CPU(struct perf_event *, bp_on_reg[ARM_MAX_BRP]);
>> @@ -757,6 +760,21 @@ static void watchpoint_handler(unsigned long addr, 
>> unsigned int fsr,
>>   */
>>  if (!wp->overflow_handler)
>>  enable_single_step(wp, instruction_pointer(regs));
>> +#ifdef CONFIG_KPROBES
>> +else {
>> +struct kprobe kp;
>> +unsigned long flags;
>> +
>> +arch_uninstall_hw_breakpoint(wp);
>> +kp.addr = (kprobe_opcode_t *)instruction_pointer(regs);
>> +if (!arch_prepare_kprobe(&kp)) {
>> +local_irq_save(flags);
>> +kp.ainsn.insn_singlestep(&kp, regs);
>> +local_irq_restore(flags);
>> +}
>> +arch_install_hw_breakpoint(wp);
>> +}
>> +#endif

>I don't think this is the right thing to do at all; the kernel already
>handles step exceptions using mismatched breakpoints when there is no
>overflow handler specified (e.g. using perf mem events). If you register a
>handler (e.g. gdb via ptrace) then you have to handle the step yourself.

>Will

EP-2DAD0AFA905A4ACB804C4F82A001242F

Hi Will, 

This fix is intended for kernel developers who want to use the perf interface by
registering a callback using the register_wide_hw_breakpoint API.
On every callback trigger they have to unregister the watchpoint, otherwise the
callback gets called in a loop, and then the question becomes "when do we register
the watchpoint back?".
With this issue in place the perf interface is unusable. We did not face this issue
on x86.
For verification, we have used test code available at
samples/hw_breakpoint/data_breakpoint.c
 
Thanks & Regards,
Vaneet Narang

Re: [EDT] [PATCH 1/1] Fix: hw watchpoint continually triggers callback

2015-05-12 Thread Vaneet Narang
EP-2DAD0AFA905A4ACB804C4F82A001242F

>On Tue, May 12, 2015 at 02:12:54PM +0100, Vaneet Narang wrote:
>> On Tue, May 12, 2015 at 12:48:13PM +0100, Maninder Singh wrote:
>> >> On ARM, when a watchpoint is registered using 
>> >> register_wide_hw_breakpoint, 
>> >> the callback handler endlessly runs until the watchpoint is unregistered.
>> >> The reason for this issue is debug interrupts gets raised before
>> >> executing the instruction, and after interrupt handling ARM tries to
>> >> execute the same instruction again , which results in interrupt getting
>> >> raised again.
>> >> 
>> >> This patch fixes this issue by using KPROBES (getting the instruction
>> >> executed and incrementing PC to next instruction).
>> >> 
>> >> Signed-off-by: Vaneet Narang 
>> >> Signed-off-by: Maninder Singh 
>> >> Reviewed-by: Amit Arora 
>> >> Reviewed-by: Ajeet Yadav 
>> >> ---
>> >>  arch/arm/kernel/hw_breakpoint.c |   18 ++
>> >>  1 files changed, 18 insertions(+), 0 deletions(-)
>> >> 
>> >> diff --git a/arch/arm/kernel/hw_breakpoint.c 
>> >> b/arch/arm/kernel/hw_breakpoint.c
>> >> index dc7d0a9..ec72f86 100644
>> >> --- a/arch/arm/kernel/hw_breakpoint.c
>> >> +++ b/arch/arm/kernel/hw_breakpoint.c
>> >> @@ -37,6 +37,9 @@
>> >>  #include 
>> >>  #include 
>> >>  #include 
>> >> +#ifdef CONFIG_KPROBES
>> >> +#include 
>> >> +#endif
>> >>  
>> >>  /* Breakpoint currently in use for each BRP. */
>> >>  static DEFINE_PER_CPU(struct perf_event *, bp_on_reg[ARM_MAX_BRP]);
>> >> @@ -757,6 +760,21 @@ static void watchpoint_handler(unsigned long addr, 
>> >> unsigned int fsr,
>> >>*/
>> >>   if (!wp->overflow_handler)
>> >>   enable_single_step(wp, instruction_pointer(regs));
>> >> +#ifdef CONFIG_KPROBES
>> >> + else {
>> >> + struct kprobe kp;
>> >> + unsigned long flags;
>> >> +
>> >> + arch_uninstall_hw_breakpoint(wp);
>> >> + kp.addr = (kprobe_opcode_t *)instruction_pointer(regs);
>> >> + if (!arch_prepare_kprobe(&kp)) {
>> >> + local_irq_save(flags);
>> >> + kp.ainsn.insn_singlestep(&kp, regs);
>> >> + local_irq_restore(flags);
>> >> + }
>> >> + arch_install_hw_breakpoint(wp);
>> >> + }
>> >> +#endif
>> 
>> >I don't think this is the right thing to do at all; the kernel already
>> >handles step exceptions using mismatched breakpoints when there is no
>> >overflow handler specified (e.g. using perf mem events). If you register a
>> >handler (e.g. gdb via ptrace) then you have to handle the step yourself.
>> 
>> This fix is given for kernel developers who wants to use perf interface by
>> registering callback using register_wide_hw_breakpoint API.  On every
>> callback trigger they have to unregister watchpoints otherwise callback
>> gets called in a loop and now issue is "when to register watch point back
>> ?". 

>If you want to solve this, I think we need a better way to expose software
>single-step/emulation to the overflow handler. If we try to do this in
>the hw_breakpoint code itself, we run into problems:

>  - What if another thread hits the same instruction whilst we are trying
>   to step it?

>  - What if there are two breakpoints or a breakpoint + watchpoint
>   triggered by the same instruction?

Thanks for your input. I am not sure the issues you have mentioned will actually occur
with this implementation.
To address the issues you have raised, I need to explain kprobes briefly.
Kprobes follows 3 steps for breakpoint (BP) handling:
1. Decode the BP instruction.
2. Replace the BP instruction with a new instruction that will generate an SWI.
3. Execute the instruction & move the PC to the next instruction.
Kprobes follows steps 1 & 2 when a BP is added, and step 3 when the SWI gets triggered.

For this fix we have used steps 1 & 3 and skipped step 2. As we don't know the caller
of the watchpoint, and the HW generates the interrupt anyway, step 2 is not required.
The only difference is that since we don't know the caller, we can't decode the
instruction in advance; we have to follow steps 1 and 3 when the HW interrupt gets
triggered.

RE:(2) [PATCH 1/1] mm/vmscan.c: change prototype for shrink_page_list

2020-04-29 Thread Vaneet Narang
Hi Michal, 

>> >
>> >Acked-by: Michal Hocko 
>> >
>> >Is there any reason to move declarations here?
>> >
>> 
>> "unsigned int ret" was changed mistakenely, sending V2.
>> and "unsigned int nr_reclaimed" is changed to remove hole.
 
>Could you be more specific? Have you seen a better stack allocation
>footprint?

We didn't check the stack allocation footprint; we made the change just by looking at
the code. We thought changing the type of nr_reclaimed from long to int on a 64-bit
platform would add a 4-byte hole between the two stack variables nr_reclaimed &
nr_taken.

So we tried to remove that hole by packing it next to the bool.

	unsigned long nr_scanned;          --> size and alignment of 8 bytes for long
-	unsigned long nr_reclaimed = 0;    --> changing to int makes its size 4
	unsigned long nr_taken;            --> nr_taken needs 8-byte alignment, so a hole is added
	struct reclaim_stat stat;
	int file = is_file_lru(lru);
	enum vm_event_item item;
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	unsigned int nr_reclaimed = 0;     --> so move it here to pack it with the bool
	bool stalled = false;


The overall stack footprint might not change, as the compiler keeps the function stack
pointer 16-byte aligned, but we did this because we normally follow this convention
when defining structures or stack variables.

Thanks & Regards,
Vaneet Narang
 
 

 


Re: [PATCH 1/1] arm/module: maximum utilization of module area.

2016-12-12 Thread Vaneet Narang
Hi Ard,

>> +#ifndef CONFIG_THUMB2_KERNEL
>> +#define MODULES_START  ALIGN((unsigned long)_etext - SZ_32M, PAGE_SIZE)
>
>On a multi_v7_defconfig kernel, the text size is >8 MB, which means
>you are only adding ~7 MB to the module area, while consuming 16 MB of
>additional address space.

I am not sure the 16MB of virtual address space will make any difference on embedded
systems, where physical memory is already smaller than the virtual address space.
If required, the address space wastage can be reduced by defining TASK_SIZE as
PAGE_OFFSET - 24MB, like below:

#define MODULES_VADDR  (PAGE_OFFSET - SZ_24M)
#define TASK_SIZE      (UL(CONFIG_PAGE_OFFSET) - UL(SZ_24M))


> Given that 20 MB modules are very uncommon,

The combined size of all loaded modules can reach 20MB; that seems to be a normal scenario.

>I think it is better to enable CONFIG_ARM_MODULE_PLTS instead. That

CONFIG_ARM_MODULE_PLTS has function call overhead, as it goes through the PLT when
calling kernel functions, and the size of the modules also increases a bit.
Using short calls from modules into the kernel is faster.
These changes try to make the best use of the available space for kernel modules so
that short calls can be used; CONFIG_ARM_MODULE_PLTS is then not required as long as
the modules can be accommodated within 20MB.

>way, there is no need to update these defaults for everyone.
>

>> +#else
>> +#define MODULES_START MODULES_VADDR

Thanks & Regards,
Vaneet Narang

RE: Re: [PATCH 1/1] kasan: Support for r/w instrumentation control

2016-12-12 Thread Vaneet Narang
 Hi,

> Do you actually hit an issue with image size? In what context?
> Do you use inline/outline instrumentation? Does switching to the other
> option help?

Memory accesses in a KASAN-enabled image have CPU execution overhead. Sometimes we are
not able to reproduce race-condition issues with this overhead in place, so the user
should have control at least over read instrumentation.

> Does it make sense to ever disable writes? I assume that you are

Write instrumentation control is mainly kept to be in line with ASAN for user-space
applications.
It is also sometimes useful when the uImage is already sanitized and some corruption
is caused by kernel modules doing direct memory accesses; in that case both read and
write instrumentation of the uImage can be avoided.

> disabling reads, right?
> Disabling both certainly does not make sense.

Regards,
Vaneet Narang


Re: Re: [PATCH 1/1] kasan: Support for r/w instrumentation control

2016-12-12 Thread Vaneet Narang
Hi,

>>> Do you actually hit an issue with image size? In what context?
>>> Do you use inline/outline instrumentation? Does switching to the other
>>> option help?
>>
>> Memory access with KASAN enabled Image has overhead in terms of cpu 
>> execution.
>> Sometimes we are not able to reproduce race condition issues with these 
>> overhead in
>> place. So user should have control atleast over read instrumentation.
>
>Don't you want to disable KASAN entirely in such case?

Hmmm, but we need KASAN to detect corruption issues, so the overhead can be reduced by
switching off read instrumentation. Reads are generally much more frequent than writes:
the latest arm64 kernel has ~224000 read and ~62300 write instrumentation sites, which
is almost 3.5 times more reads. So I think this control is required.

 
>>> Does it make sense to ever disable writes? I assume that you are
>>
>> Write instrumentation control is majorly kept to be inline with ASAN for 
>> user space
>> applications.
>> Also write is sometimes useful when uImage is already sanitized and some 
>> corruption
>> is done by kernel modules by doing some direct memory access then both read 
>> / write sanity of uImage
>> can be avoided.
>
>But then you don't need KASAN at all.

KASAN support is still required in this case to detect module issues.
KASAN provides the asan_load / asan_store definitions; these calls are inserted by the
compiler in modules before every memory access, and they are the functions that do the
address sanity checks and detect errors.

Regards,
Vaneet Narang

RE: Re: [PATCH 1/1] arm/module: maximum utilization of module area.

2016-12-12 Thread Vaneet Narang
Hi,

>A PC24 relocation has a range of +/-32MB.  This means that where-ever
>the module is placed, it must be capable of reaching any function
>within the kernel text, which may itself be quite large (eg, 8MB, or
>possibly larger).  The module area exists to allow modules to be
>located in an area where PC24 relocations are able to reach all of the
>kernel text on sensibly configured kernels, thereby allowing for
>optimal performance.
>
>If you wish to load large modules, then enable ARM_MODULE_PLTS, which
>will use the less efficient PLT method (which is basically an indirect
>function call) for relocations that PC24 can't handle, and will allow
>the module to be loaded into the vmalloc area.
>
>Growing the module area so that smaller modules also get penalised by
>the PLT indirection is not sane.

This is exactly what I am saying. These changes make it possible to accommodate 22MB
of modules without enabling ARM_MODULE_PLTS.
Without these changes, ARM_MODULE_PLTS needs to be enabled to insert more than 14MB of
modules; with them, 22MB of modules (assuming an 8MB uImage) can be inserted without
enabling ARM_MODULE_PLTS.

So up to 22MB, modules are not penalised, whereas without these changes PLTs become
mandatory once the total module size exceeds 14MB.

Regards,
Vaneet Narang

Re: [PATCH] af_packet: Raw socket destruction warning fix

2016-02-04 Thread Vaneet Narang
Hi,

>>
>> Issue is coming for 3.10.58.
>

Sorry for the late reply; we were trying to reproduce the issue, but it is not that
frequent.

>What driver are you using (is that in-tree)?

We are facing the issue with a wireless driver. Is it possible for the wireless driver
to cause this, given that rmem accounting is done by the kernel?

>Can you reproduce the same issue
>with a latest -net kernel, for example (or, a 'reasonably' recent one like 4.3 
>or
>4.4)? There has been quite a bit of changes in err queue handling (which also
>accounts rmem) as well.

It is difficult to port to 4.3 as of now, but can you suggest some patches that could
be helpful? One more thing: can the change at https://lkml.org/lkml/2015/9/15/634
cause this issue? Just an observation (could be incorrect), but after including this
change we started facing the issue.

>How reliably can you trigger the issue? Does it trigger
>with a completely different in-tree network driver as well with your tests?

This issue is not easily reproducible; it only shows up in long-term testing. And yes,
the wireless network driver is not in-tree.

>Would be useful to track/debug sk_rmem_alloc increases/decreases to see from 
>which path
>new rmem is being charged in the time between packet_release() and 
>packet_sock_destruct()
>for that socket ...

As I mentioned, it is not easily reproducible, but we added a debug patch in
packet_sock_destruct() to check the value of sk_rmem_alloc.
As per the logs, at the entry point rmem_alloc was 0, but after the error queue purge
it became 576 (it seems a new packet was added), and we are not sure to which queue.
Is it possible for the kernel to still add packets to the receive queue when the
socket is already closed? Can you point to the kernel code where this is prevented,
or is there any way this packet gets added to the error queue?
As per my understanding, rmem_alloc is increased only when packets are added to the
receive queue or the error queue. Is there any other queue that also changes
rmem_alloc?

Thanks,
Vaneet Narang

RE: [PATCH 1/1] arm64: add support for PAGE_SIZE aligned kernel stack

2020-08-03 Thread Vaneet Narang
Hi Mark,

>> currently THREAD_SIZE is always in power of 2, which will waste
>> memory in cases there is need to increase of stack size.
>
>If you are seeing issues with the current stack size, can you please
>explain that in more detail? Where are you seeing problems? Which
>configuration options do you have selected?
>
>I'm not keen on making kernel stack sizes configurable as it's not
>currently possible for the person building the kernel to figure out a
>safe size (and if this were possible, it's be better to handle this
>automatically).
>
>If the stack size is too small in some configurations I think we need to
>ensure that it is appropriately sized regardless of whether the person
>building the kernel believes they can identify a reasonable size.

The motivation behind these changes is saving memory on our system.
Our system runs around 2500 threads concurrently, so saving 4KB per thread saves
about 10MB of memory.

To ensure 12KB is sufficient for our system, we used the stack tracer and observed
that the maximum stack usage is not more than 9KB (via
/sys/kernel/tracing/stack_max_size).

This tracing interface, defined by the kernel to track maximum stack depth, can be
used by others to decide an appropriate stack size.

>> Thus adding support for PAGE_SIZE(not power of 2) stacks for arm64.
>> User can decide any value 12KB, 16KB, 20 KB etc. based on value
>> of THREAD_SHIFT. User can set any value which is PAGE_SIZE aligned for
>> PAGE_ALIGNED_STACK_SIZE config.
>> 
>> Value of THREAD_SIZE is defined as 12KB for now, since with irq stacks
>> it is enough and it will save 4KB per thread.
>
>How are you certain of this?

Previously ARM64 used a 16KB stack, with the IRQ stack and the thread stack sharing
the same kernel stack.
Now that these are stored separately, the maximum kernel stack requirement is also
reduced. ARM still uses an 8KB stack for both SVC and IRQ mode, so ARM64 shouldn't
need more than double that compared to ARM.
Considering these points, we concluded that a 12KB stack might be sufficient for our
system, and analysing the stack with the stack tracer confirmed the maximum
requirement.

This is still configurable: if some system has a higher stack requirement, the user
can go with 16KB, but there should be an option to change it.

 
Regards,
Vaneet Narang


RE: [PATCH 2/3] arm: introduce IRQ stacks

2020-10-21 Thread Vaneet Narang
Hi Russell & Arnd,

> > > - define 'current' as 'this_cpu_read_stable(current_task);'
> > > - convert to CONFIG_THREAD_INFO_IN_TASK
> 
unlike ARM64 ARM doesn't suport dedicated register like "sp_el0", so 
arm still stores thread info on stack and fetches using current task using
stack pointer (SP) so CONFIG_THREAD_INFO_IN_TASK is not supported on ARM. 
To implement solution inline with current ARM design, we have used indirection.

> That means we need to also code that up in assembly - remember, we
> need to access thread_info from assembly code.

> Note also that there is a circular dependency involved. If you make
> thread_info accessible via per-cpu, then:

> "We don't do it because we don't have a separate register to be able
> to store the thread_info pointer, and copying that lump between the
> SVC and IRQ stack will add massively to IRQ latency, especially for
> older machines."

We tried to minimise the latency of switching between the SVC and IRQ stacks by
copying only the extra thread_info pointer to the IRQ stack. Apart from that, most of
the SVC-to-IRQ switching code is similar to the ARM64 implementation.

We also tried to minimise the latency of get_current() by introducing a self pointer
on the SVC stack, to avoid the check for whether current is called from SVC or IRQ
context. This way we always do one extra indirection in get_current(), irrespective of
whether it is called from SVC or IRQ context.

Yes, we agree there will still be minor additional latency in accessing the current
task and in SVC-to-IRQ switching, which might be significant on older machines; that
is why we kept this solution under a Kconfig option. The solution gives kernel
developers extra stack for further development, instead of increasing the stack size
to 16KB or spending time debugging stack overflow issues.

PS: We have debugged and resolved our stack overflow issue, but we still implemented
this solution to avoid debugging such issues in the future; it also gives us
flexibility for kernel enhancement on ARM.

Thanks & Regards,
Vaneet Narang


RE: [PATCH 1/2] zstd: pass pointer rathen than structure to functions

2019-05-30 Thread Vaneet Narang
[Reminder] Any updates ?

> currently params structure is passed in all functions, which increases
> stack usage in all the function and lead to stack overflow on target like
> ARM with kernel stack size of 8 KB so better to pass pointer.
 
> Checked for ARM:
 

> (ZSTD_compressContinue_internal)  136 -> 88
> (ZSTD_compressCCtx)               192 -> 64
> (zstd_compress)                    144 -> 96

Regards,
Vaneet Narang


RE:(2) [PATCH 3/4] scripts/checkstack.pl: add arm push handling for stack usage

2020-05-07 Thread Vaneet Narang
Hi Masahiro, 

>> To count stack usage of push {*, fp, ip, lr, pc} instruction in ARM,
>> if FRAME POINTER is enabled.
>> e.g. c01f0d48: e92ddff0 push {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
>>
>> c01f0d50 :
>> c01f0d44:   e1a0c00d    mov   ip, sp
>> c01f0d48:   e92ddff0    push  {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
>> c01f0d4c:   e24cb004    sub   fp, ip, #4
>> c01f0d50:   e24dd094    sub   sp, sp, #448   ; 0x1C0
>>
>> $ cat dump | scripts/checkstack.pl arm
>> 0xc01f0d50 Y []:448
>>
>> added subroutine frame work for this.
>> After change:
>> 0xc01f0d500 Y []:   492
  
 
> Do you know CONFIG_FRAME_WARN?
Yes, we know about it, and we use it to get a compile-time warning when some function
uses too much stack. That config reports the issue at compilation time.
 
>I know checkstack.pl dumps the stack size
>of functions, which is different from what
>-Wframe-larger-than does, but the goal is
>quite similar, I think.
> 
>I just wondered if we need both.
 
We feel the purpose of this patch is different from CONFIG_FRAME_WARN.
This patch is specific to ARM and fixes a bug in checkstack.pl's stack usage
calculation.

We were comparing the stack usage of ARM with ARM64 and found a big gap, and realised
ARM's usage is not being calculated properly.
The script only counts the stack used by local variables; it does not count the stack
used to save the register context at the start of a function.
This is not the case for ARM64, where it seems both are counted.

We found that even when the stack variables are the same size on both targets, the
reported arm64 stack usage is higher.

Considering the assembly below, the actual stack usage is 492 bytes, but the current
script reports 448. The push instruction uses 44 bytes of stack to save the registers
(11 registers x 4 bytes), as per the ARM calling convention.

c01f0d44:   e1a0c00d    mov   ip, sp
c01f0d48:   e92ddff0    push  {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr, pc}
c01f0d4c:   e24cb004    sub   fp, ip, #4
c01f0d50:   e24dd094    sub   sp, sp, #448   ; 0x1C0

Thanks & Regards,
Vaneet Narang
 
  


Re: [PATCH v2] module: check if memory leak by module.

2017-03-29 Thread Vaneet Narang
Hi,

>> Hmm, how can you track _all_ vmalloc allocations done on behalf of the
>> module? It is quite some time since I've checked kernel/module.c but
>> from my vague understading your check is basically only about statically
>> vmalloced areas by module loader. Is that correct? If yes then is this
>> actually useful? Were there any bugs in the loader code recently? What
>> led you to prepare this patch? All this should be part of the changelog!

First of all, there is no issue in kernel/module.c. This patch adds functionality to
detect the scenario where a kernel module allocates memory but gets unloaded without
freeing it with vfree(). For example:
#include <linux/module.h>
#include <linux/vmalloc.h>

static int kernel_init(void)
{
	char *ptr = vmalloc(400 * 1024);	/* never freed: the leak this patch detects */

	return ptr ? 0 : -ENOMEM;
}

static void kernel_exit(void)
{
	/* vfree() intentionally missing */
}

module_init(kernel_init);
module_exit(kernel_exit);

Now in this case, if we do rmmod, the memory allocated by kernel_init() will not be
freed; this patch will detect such bugs in kernel module code.

We have also seen bugs where a kernel module allocates some memory and is removed
without freeing it; if a new module is then loaded in place of the removed one,
/proc/vmallocinfo shows wrong information, attributing those pages to the new module.
So these logs will help in detecting such issues.

> >  static void free_module(struct module *mod)
> >  {
> > +   check_memory_leak(mod);
> > +

>Of course, vfree() has not been called yet. It is the beginning of 
>free_module(). vfree() is one of the last things you need to do. See 
>module_memfree(). If I am not missing something, you get pr_err() 
>everytime a module is unloaded.

This patch is not meant to detect memory allocated by the kernel itself.
module_memfree() frees the memory the kernel allocated for the module, but our intent
is to detect memory allocated directly by kernel modules that never gets freed.

Regards,
Vaneet Narang

RE: [PATCH v2] stackdepot: ignore junk last entry in case of switch from user mode.

2017-10-12 Thread Vaneet Narang
Hi Dmitry,

>diff --git a/lib/stackdepot.c b/lib/stackdepot.c
>index f87d138..a045748 100644
>--- a/lib/stackdepot.c
>+++ b/lib/stackdepot.c
>@@ -214,6 +214,13 @@ depot_stack_handle_t depot_save_stack(struct stack_trace 
>*trace,
>   if (unlikely(trace->nr_entries == 0))
>   goto fast_exit;
> 
>+  /* 
>+   * Ignore last entry if it belongs to user space
>+   * in case of switch from user mode.
>+   */
>+  if (!kernel_text_address(trace->entries[trace->nr_entries - 1]))

Can we also add this check here:

if (!kernel_text_address(trace->entries[trace->nr_entries - 1]) ||
    in_interrupt())

Because on further checking, the last frame in the SVC-to-interrupt case is some
random core kernel / kernel module address, like below:

irq_exit+0xe4/0x140
__handle_domain_irq+0x9c/0x130
gic_handle_irq+0x40/0x80
__irq_svc+0x44/0x7c
hdmi_ocm_rx_get_scdc_sts+0x160/0x170 [sdp_hdmi]   --> do we need this frame?

This generates a mismatch with existing entries, so new entries keep getting created
and depot_index keeps increasing slowly until it reaches the maximum value
(STACK_ALLOC_SLABS_CAP). With this change, depot_index settles at a stable value after
some time. (Tested on ARM.)

>+  trace->nr_entries--;
>+
>   hash = hash_stack(trace->entries, trace->nr_entries);
>   bucket = &stack_table[hash & STACK_HASH_MASK];
> 
>-- 
>1.9.1

Regards,
Vaneet Narang


RE: Re: [PATCH 1/1] mpi: check for shift exponent greater than 31.

2017-11-09 Thread Vaneet Narang
 
Hi,

>> This patch check for shift exponent greater than 31,

Yes, this should be "check for shift exponent greater than BITS_PER_LONG".

>Firstly, isn't it 63 on 64-bit machines?

The patch description is specific to 32-bit machines, but the patch was made with
64-bit in mind as well, and that is precisely the reason we compare against
BITS_PER_MPI_LIMB, as BITS_PER_MPI_LIMB equals BITS_PER_LONG.

>Secondly, this is the wrong way to do things.  The banner comment on
>mpihelp_lshift(), for example, says that the function has the following
>argument constraints:

>   0 < CNT < BITS_PER_MP_LIMB

>so sh1 and sh2 must both be in the range 1-31 or 1-63, assuming cnt is
>within its constraints.

You are right, there is already a comment stating the range, but in our case the
caller was mpi_powm() itself, so to fix the UBSAN warning we preferred handling it in
mpihelp_lshift()/mpihelp_rshift().

Call Sequence:
[0-1.1677] [] (ubsan_epilogue) from [] (__ubsan_handle_shift_out_of_bounds+0xf4/0x13c)
[0-1.1677] [] (__ubsan_handle_shift_out_of_bounds) from [] (mpihelp_lshift+0xf0/0x160)
[0-1.1677] [] (mpihelp_lshift) from [] (mpi_powm+0x308/0xc7c)

Scenario: the MPI mod passed to mpi_powm() has 0 as its last valid limb.

count_leading_zeros() returns 32/64 when the last valid limb of mod is 0:

 * If @x is 0 then the result is COUNT_LEADING_ZEROS_0.
 */
static inline int count_leading_zeros(unsigned long x)
#define COUNT_LEADING_ZEROS_0 BITS_PER_LONG

int mpi_powm(MPI res, MPI base, MPI exp, MPI mod)
{
	mod_shift_cnt = count_leading_zeros(mod->d[msize - 1]);  --> count_leading_zeros can return 32/64
	if (mod_shift_cnt)
		mpihelp_lshift(mp, mod->d, msize, mod_shift_cnt); --> 32/64 can be passed to mpihelp_lshift

	/* Remove any leading zero words from the result. */
	if (mod_shift_cnt)
		mpihelp_rshift(rp, rp, rsize, mod_shift_cnt);     --> similarly 32/64 can be passed to mpihelp_rshift
}


>Therefore if it needs a checking, you only need to check cnt on entry to the
>function, rather than checking sh1 and sh2 inside the loop.  Further, you
>should use pr_err() so that we know that this has gone wrong and return an
>error to the caller (there are security implications).

Since we were not sure about checking cnt and returning some error value (we thought
it might break mpi_powm()), we preferred to handle the invalid input cleanly without
changing any behaviour of mpihelp_lshift()/mpihelp_rshift().
Please suggest how the overflow can be handled without breaking mpi_powm().

>Further, have you checked the caller to see if they do ever violate the
>constraints?

From the caller side, the only issue I can think of is passing 0 as the last valid
limb of mod. Is that considered a constraint violation?

Regards,
Vaneet Narang


RE: Re: [PATCH 1/1] stackdepot: interface to check entries and size of stackdepot.

2017-11-24 Thread Vaneet Narang
Hi Michal,


>> 5) To check number of entries in stackdepot to decide stackdepot hash size 
>> for different systems.
>>For fewer entries hash table size can be reduced from 4MB.
>
> What are you going to do with that information. It is not like you can
> reduce the memory footprint or somehow optimize anything during the
> runtime.

On low-memory systems where page_owner entries are in the range of 3k ~ 4k, it is a
waste to keep a 4MB hash table; it could be reduced to something like 128KB to save
stackdepot's memory footprint. So the stackdepot entry count is important.

> OK, so debugging a debugging facility... I do not think we want to
> introduce a lot of code for something like that.

We enabled stackdepot on our system and realised that, in the long run, stackdepot
consumes more runtime memory than it actually needs; we used the shared patch to debug
this issue.
Page allocations done in interrupt context generate unique stack traces, so stackdepot
ends up storing entries like the following two:

Entry 1:
 __alloc_pages_nodemask+0xec/0x200
 page_frag_alloc+0x84/0x140
 __napi_alloc_skb+0x83/0xe0
 rtl8169_poll+0x1e5/0x670
 net_rx_action+0x122/0x380  
 __do_softirq+0xce/0x298
 irq_exit+0xa3/0xb0
 ---
 do_IRQ+0x72/0xc0
 ret_from_intr+0x0/0x14
 rw_copy_check_uvector+0x8a/0x100
 import_iovec+0x27/0xc0
 copy_msghdr_from_user+0xc0/0x120
 ___sys_recvmsg+0x76/0x210
 __sys_recvmsg+0x39/0x70
 entry_SYSCALL_64_fastpath+0x13/

 Entry 2:
  __alloc_pages_nodemask+0xec/0x200
 page_frag_alloc+0x84/0x140
 __napi_alloc_skb+0x83/0xe0
 rtl8169_poll+0x1e5/0x670
 net_rx_action+0x122/0x380
 __do_softirq+0xce/0x298
 irq_exit+0xa3/0xb0
 ---
 smp_apic_timer_interrupt+0x5b/0x110
 apic_timer_interrupt+0x89/0x90
 cpuidle_enter_state+0x95/0x2c0
 do_idle+0x163/0x1a0
 cpu_startup_entry+0x14/0x20
 secondary_startup_64+0xa5/0xb0

 The actual allocation path is:
 __alloc_pages_nodemask+0xec/0x200
 page_frag_alloc+0x84/0x140
 __napi_alloc_skb+0x83/0xe0
 rtl8169_poll+0x1e5/0x670
 net_rx_action+0x122/0x380  
 __do_softirq+0xce/0x298
 irq_exit+0xa3/0xb0

We keep getting many similar entries, and eventually stackdepot reaches its maximum
capacity. We found this interface useful in debugging the stackdepot issue, so we
shared it with the community.

Regards,
Vaneet Narang


RE: Re: Re: [PATCH 1/1] stackdepot: interface to check entries and size of stackdepot.

2017-11-24 Thread Vaneet Narang
Hi Michal,

>> We have been getting similar kind of such entries and eventually
>> stackdepot reaches Max Cap. So we found this interface useful in debugging
>> stackdepot issue so shared in community.
 
>Then use it for internal debugging and provide a code which would scale
>better on smaller systems. We do not need this in the kernel IMHO. We do
>not merge all the debugging patches we use for internal development.
Not just debugging: this information can also be used to profile and tune stackdepot.
Getting the count of stack entries helps in deciding the hash table size and the page
order used by stackdepot.

For fewer entries, the big hash table and higher-order slabs that stackdepot maintains
might not be required. As I already mentioned, a smaller hash table can be chosen and,
similarly, lower-order pages can be used for the slabs.

If you think it is useful, we can share a scalable patch to configure the two values
below dynamically, based on the number of stack entries.

#define STACK_ALLOC_ORDER 2 
#define STACK_HASH_SIZE (1L << STACK_HASH_ORDER)


Regards,
Vaneet Narang