Hi all,

While running hashpipe with the intention of debugging it under gdb as
suggested, I failed to reproduce my segfault. On one hand, it should work
given what I understand about the packet socket implementation and the way
I wrote the code; on the other hand, I don't know why it works now and not
before, because I made no code changes between runs. The only differences,
and it's a stretch, were a few reboots and some improved cable organization
within the rack.

I'm noting the following change for documentation purposes. It's not the
cause of my issue, so feel free to ignore or comment on it; the change was
in place both before and after I observed the segfault. To flush the
packets already queued on the port before the thread runs, I am using
"p_frame = hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport)"
instead of "p_frame = hashpipe_pktsock_recv_frame_nonblock(p_ps, bindport)"
in the while loop. Otherwise the loop never terminates, because packets of
other protocols are constantly being captured on the port.
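
For reference, here is roughly what that flush loop looks like now. This is
just a sketch assuming the usual hashpipe_pktsock helpers
(hashpipe_pktsock_recv_udp_frame_nonblock() and
hashpipe_pktsock_release_frame()); p_ps and bindport are the socket struct
and destination port from my net thread:

    #include "hashpipe_pktsock.h"

    /* Drain any UDP frames already queued for dst_port before capture starts.
     * p_ps must already have been opened with hashpipe_pktsock_open(). */
    static void flush_udp_frames(struct hashpipe_pktsock *p_ps, int dst_port)
    {
        unsigned char *p_frame;
        /* The UDP-filtered variant returns NULL once no UDP frames remain,
         * so this terminates even while non-UDP traffic keeps arriving,
         * which the unfiltered variant does not. */
        while ((p_frame = hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, dst_port))) {
            hashpipe_pktsock_release_frame(p_frame); /* return frame to the kernel ring */
        }
    }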

I'm hoping to figure out what changed as I debug the rest of this, but for
now the specific segfault I was having is no longer an issue. It's
unsatisfying, and I'll come back to it if I don't figure it out as I go,
but for now I'm moving on.

Okay, so now I'm still experiencing dropped packets. Given a kernel page
size of 4096 bytes and a frame size of 16384 bytes, I have tried packet
socket ring parameters ranging from 480 to 128000 total frames and from 60
to 1000 blocks, respectively. This improved throughput in one instance, but
not in the other three that I have running. The instance that improved, at
the upper end of that range, exceeds the number of packets expected in a
hashpipe shared memory buffer block (the ring buffers between threads), but
only for about the first four blocks of a scan; no packets are dropped for
the rest of the scan. The other instances, with no recognizable
improvement, drop packets throughout the scan, with one of them dropping
significantly more than the other two.
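
For concreteness, I set the ring parameters more or less like this before
opening the socket (a rough sketch of the usual hashpipe net-thread
pattern; the field names and the PACKET_RX_RING constraints should be
double-checked against hashpipe_pktsock.h, and the numbers below are just
the upper end of the range I tried):

    #include <linux/if_packet.h>   /* PACKET_RX_RING */
    #include "hashpipe_pktsock.h"

    /* Upper end of the range I tried: 16384-byte frames (a multiple of the
     * 4096-byte page size), 128000 frames grouped into 1000 blocks,
     * i.e. 128 frames (2 MiB) per block. */
    #define PKTSOCK_BYTES_PER_FRAME 16384
    #define PKTSOCK_NFRAMES         128000
    #define PKTSOCK_NBLOCKS         1000

    /* Open a PACKET_RX_RING on the capture interface (ifname is a placeholder). */
    static int open_capture_socket(struct hashpipe_pktsock *p_ps, const char *ifname)
    {
        p_ps->frame_size = PKTSOCK_BYTES_PER_FRAME;
        p_ps->nframes    = PKTSOCK_NFRAMES;
        p_ps->nblocks    = PKTSOCK_NBLOCKS;
        return hashpipe_pktsock_open(p_ps, ifname, PACKET_RX_RING);
    }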

I'm currently trying a few things to debug this, but I figured that I would
ask sooner rather than later. Is there a configuration or step that I may
have missed in the implementation of packet sockets? My understanding is
that it should handle my current data rates with no problem. So with
multiple instances running (four in my case), I should be able to capture
data with 0 dropped packets (100% data throughput).
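
One thing I'm checking as part of this is whether the drops happen in the
kernel's packet socket ring or further downstream in the hashpipe shared
memory buffers. Here is a minimal sketch using the standard
PACKET_STATISTICS getsockopt (this assumes direct access to the socket file
descriptor; hashpipe may already expose an equivalent stats helper):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_packet.h>

    /* Print kernel-side packet socket statistics for socket descriptor fd.
     * tp_drops counts packets the kernel dropped because the RX ring was
     * full; both counters reset each time they are read. */
    static void print_pktsock_stats(int fd)
    {
        struct tpacket_stats stats;
        socklen_t len = sizeof(stats);
        if (getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len) == 0) {
            printf("pktsock: %u packets, %u dropped by kernel ring\n",
                   stats.tp_packets, stats.tp_drops);
        }
    }

If tp_drops stays at zero while the pipeline still loses data, the loss is
happening after the packet socket rather than in it.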

Just a note: with a packet size of 8168 bytes and a frame size of 8192
bytes, hashpipe was crashing, but in a way completely unrelated to how it
did before. It was *not* a segfault after capturing exactly as many packets
as there are frames in the packet socket ring buffer, as I described in
previous emails. The crashes were more inconsistent, and I think it's
because the frame size needs to be considerably larger than the packet
size; a factor of two seemed to be enough. I currently have the frame size
set to 16384 bytes (also a multiple of the kernel page size) and no longer
see hashpipe crash.
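
I suspect the 8192-byte case went wrong because each ring frame has to hold
the tpacket header and the Ethernet/IP/UDP headers in addition to the
payload, so an 8168-byte payload doesn't quite fit. As a sanity check I've
started verifying that at startup; a rough sketch, with the IPv4 (no
options) and UDP header sizes hard-coded as assumptions:

    #include <assert.h>
    #include <linux/if_packet.h>   /* TPACKET_HDRLEN, TPACKET_ALIGN */
    #include <net/ethernet.h>      /* ETHER_HDR_LEN */

    #define UDP_PAYLOAD_SIZE        8168   /* my UDP payload size */
    #define PKTSOCK_BYTES_PER_FRAME 16384  /* frame size in the kernel ring */

    /* Per-frame overhead: tpacket header (plus sockaddr_ll) and the
     * Ethernet (14), IPv4 (20, no options) and UDP (8) headers. */
    #define FRAME_OVERHEAD (TPACKET_ALIGN(TPACKET_HDRLEN) + ETHER_HDR_LEN + 20 + 8)

    static void check_frame_size(void)
    {
        /* Frame must hold the payload plus overhead... */
        assert(PKTSOCK_BYTES_PER_FRAME >= UDP_PAYLOAD_SIZE + FRAME_OVERHEAD);
        /* ...and be a multiple of the 4096-byte kernel page size. */
        assert(PKTSOCK_BYTES_PER_FRAME % 4096 == 0);
    }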

Let me know if you have any thoughts and suggestions. I really appreciate
the help.

Thanks,

Mark Ruzindana

On Thu, Dec 3, 2020 at 11:16 AM Mark Ruzindana <ruziem...@gmail.com> wrote:

> Thanks for the suggestion David!
>
> I was starting hashpipe in the debugger. I'll use gdb and the core file,
> and let you know what I find. If I still can't figure out the problem, I
> will send you a minimum non-working example. I definitely think it's some
> sort of pointer arithmetic error as well, I just can't see it yet. I really
> appreciate the help.
>
> Thanks again,
>
> Mark
>
> On Thu, Dec 3, 2020 at 1:30 AM David MacMahon <dav...@berkeley.edu> wrote:
>
>> Hi, Mark,
>>
>> Sorry to hear you're still getting a segfault.  It sounds like you made
>> some progress with gdb, but the fact that you ended up with a different
>> sort of error suggests that you were starting hashpipe in the debugger.  To
>> debug your initial segfault problem, you can run hashpipe without the
>> debugger, let it segfault and generate a core file, then use gdb and the
>> core file (and hashpipe) to examine the state of the program when the
>> segfault occurred.  The tricky part is getting the core file to be
>> generated on a segfault.  You typically have to increase the core file size
>> limit using "ulimit -c unlimited" and (because hashpipe is typically
>> installed with the suid bit set) you have to let the kernel know it's OK to
>> dump core files for suid programs using "sudo sysctl -w fs.suid_dumpable=1"
>> (or maybe 2 if 1 doesn't quite do it).  You can read more about these steps
>> with "help ulimit" (ulimit is a bash builtin) and "man 5 proc".
>>
>> Once you have the core file (typically named "core" but it may have a
>> numeric extension from the PID of the crashing process) you can debug
>> things with "gdb /path/to/hashpipe /path/to/core/file".  Note that the core
>> file may be created with permissions that only let root read it, so you
>> might have to "sudo chmod a+r core" or similar to get read access to it.
>> This starts the debugger in a sort of forensic mode using the core file
>> as a snapshot of the process and its memory space at the time of the
>> segfault.  You can use "info threads" to see which threads existed, "thread
>> N" to switch between threads (N is a thread number as shown by "info
>> threads"), "bt" to see the function call backtrace fo the current thread,
>> and "frame N" to switch to a specific frame in the function call
>> backtrace.  Once you zero in on which part of your code was executing when
>> the segfault occurred you can examine variables to see what exactly caused
>> the segfault to occur.  You might find that the "interesting" or "relevant"
>> variables have been optimized away, so you may want/need to recompile with
>> a lower optimization level (e.g. -O1 or maybe even -O0?) to prevent that
>> from happening.
>>
>> Because this happens when you reach the end of your data buffer, I have
>> to think it's a pointer arithmetic error of some sort.  If you can't figure
>> out the problem from the core file, please create a "minimum working
>> example" (well, in this case I guess a minimum non-working example),
>> including a dummy packet generator script that creates suitable packets,
>> and I'll see if I can recreate the problem.
>>
>> HTH,
>> Dave
>>
>> On Nov 30, 2020, at 14:45, Mark Ruzindana <ruziem...@gmail.com> wrote:
>>
>> I'm currently using gdb to debug and it either tells me that I have a
>> segmentation fault at the memcpy() in process_packet() or something very
>> strange happens where the starting mcnt of a block greatly exceeds the mcnt
>> corresponding to the packet being processed and there's no segmentation
>> fault because the mcnt distance becomes negative so the memcpy() is
>> skipped. Hopefully that wasn't too hard to track. Very strange problem that
>> only occurs with gdb and not when I run hashpipe without it. Without gdb, I
>> get the same segmentation fault at the end of the circular buffer as
>> mentioned above.
>>
>>
>
