Re: [casper] Dropped packets during HASHPIPE data acquisition

Mark Ruzindana Tue, 15 Dec 2020 21:02:43 -0800

Also, I tried to condense/summarize the issue so if you would like
additional details, please feel free to ask and I'll provide them.


Thanks again,

Mark Ruzindana

On Tue, Dec 15, 2020 at 10:00 PM Mark Ruzindana <ruziem...@gmail.com> wrote:

> Hi all,
>
> While running hashpipe with the intention of debugging using gdb as
> suggested, I failed to replicate my segfault issue. On one hand, it should
> have been working given what I understand about the packet socket
> implementation and the way that I wrote the code, but on the other, I don't
> know why it works now, and not before because I didn't make any changes
> between runs. It's a stretch, but there were a few reboots and improvements
> in cable organization within the rack, but that's about it.
>
> I'm taking note of the following change for documentation purposes. It's
> not the reason for my issue. Feel free to ignore or comment on it. This
> change was made before and remained after I observed the segfault issue. To
> flush the packets in the port before the thread is run, I am using
> "p_frame=hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport)" instead
> of "p_frame=hashpipe_pktsock_recv_frame_nonblock(p_ps, bindport)" in the
> while loop. Otherwise, there's an infinite loop because packets with
> other protocols are constantly being captured by the port.
>
> I'm hoping I figure out what change was made as I am debugging the rest of
> this, but for now the specific segfault that I was having is no longer an
> issue. It's unsatisfying and I'll come back to it if I don't figure it out
> as I go, but for now, I'm moving on.
>
> Okay, so now, I'm still experiencing dropped packets. Given a kernel page
> size of 4096 bytes and a frame size of 16384 bytes, I have tried buffer
> parameters ranging from 480 to 128000 total frames and from 60 to 1000
> blocks, respectively. Throughput improved in one instance, but not in the
> other three that I have running. At the upper end of that range, the
> instance that improved exceeds the number of packets expected in a
> hashpipe shared memory buffer block (the ring buffers in between
> threads), but only for about the first four blocks of a scan, and it
> drops no packets for the rest of the scan. The other instances show no
> recognizable improvement and drop packets throughout the scan, with one
> of them dropping significantly more than the other two.
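
The arithmetic behind these buffer parameters can be sketched as follows.
This is only an illustration, not hashpipe code: the struct and function
names are hypothetical, and it assumes each block is filled exactly with
whole frames. The checks mirror the documented PACKET_RX_RING constraints
(block size a multiple of the page size, frame size a multiple of 16, total
frames equal to frames per block times the number of blocks).

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper type holding the same four quantities as the
 * kernel's struct tpacket_req. */
struct ring_geometry {
    unsigned int block_size;  /* bytes per block */
    unsigned int block_nr;    /* number of blocks */
    unsigned int frame_size;  /* bytes per frame */
    unsigned int frame_nr;    /* total number of frames */
};

/* Derive the ring geometry from the frame size, total frame count and
 * block count. Returns 0 if the combination is consistent, -1 if not. */
int ring_geometry_init(struct ring_geometry *g, unsigned int page_size,
                       unsigned int frame_size, unsigned int nframes,
                       unsigned int nblocks)
{
    if (page_size == 0 || frame_size == 0 || nblocks == 0)
        return -1;
    if (frame_size % 16 != 0)      /* TPACKET_ALIGNMENT is 16 bytes */
        return -1;
    if (nframes % nblocks != 0)    /* frames must divide evenly into blocks */
        return -1;

    unsigned int frames_per_block = nframes / nblocks;
    if (frames_per_block == 0)
        return -1;

    unsigned long long block_size =
        (unsigned long long)frames_per_block * frame_size;
    if (block_size % page_size != 0 || block_size > UINT_MAX)
        return -1;

    g->block_size = (unsigned int)block_size;
    g->block_nr = nblocks;
    g->frame_size = frame_size;
    g->frame_nr = nframes;
    return 0;
}

/* Total memory occupied by the ring, in bytes. */
unsigned long long ring_total_bytes(const struct ring_geometry *g)
{
    return (unsigned long long)g->block_size * g->block_nr;
}
```

With a 4096-byte page and 16384-byte frames, 480 frames in 60 blocks gives
8 frames per block and a 7.5 MiB ring, while 128000 frames in 1000 blocks
gives 128 frames per block and a ring of roughly 1.95 GiB per instance.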
>
> I'm currently trying a few things to debug this, but I figured that I
> would ask sooner rather than later. Is there a configuration or step that I
> may have missed in the implementation of packet sockets? My understanding
> is that it should handle my current data rates with no problem. So with
> multiple instances running (four in my case), I should be able to capture
> data with 0 dropped packets (100% data throughput).
>
> Just a note, with a packet size of 8168 bytes, and a frame size of 8192
> bytes, hashpipe was crashing, but in a completely unrelated way to how it
> did before. It was *not* a segfault after capturing the exact number of
> packets that correspond to the number of frames in the packet socket ring
> buffer as I described in previous emails. The crashes were more
> inconsistent and I think it's because the frame size needs to be
> considerably larger than the packet size. A factor of 2 seemed to be
> enough. I currently have the frame size set to 16384 (also a multiple of
> the kernel page size), and do not have an issue with hashpipe crashing.
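
A quick way to see why an 8192-byte frame is tight is to subtract the
protocol headers and the payload from the frame size. The sketch below is
an illustration under stated assumptions: it treats 8168 bytes as the UDP
payload size (the message does not say whether that figure includes
headers), assumes plain Ethernet II and IPv4 without a VLAN tag or IP
options, and uses a hypothetical function name. Each ring frame also begins
with kernel metadata (struct tpacket_hdr plus padding), which is not
computed here and has to fit in whatever headroom remains.

```c
#include <assert.h>

enum {
    ETH_HEADER_BYTES  = 14, /* Ethernet II header, no VLAN tag (assumed) */
    IPV4_HEADER_BYTES = 20, /* IPv4 header, no options (assumed) */
    UDP_HEADER_BYTES  = 8   /* UDP header */
};

/* Bytes left in one ring frame for the kernel's per-frame metadata after
 * storing the Ethernet, IPv4 and UDP headers plus the UDP payload.
 * A negative result means the packet cannot fit in the frame at all. */
long frame_headroom(long frame_size, long udp_payload_bytes)
{
    long headers = ETH_HEADER_BYTES + IPV4_HEADER_BYTES + UDP_HEADER_BYTES;
    return frame_size - headers - udp_payload_bytes;
}
```

Under these assumptions frame_headroom(8192, 8168) is negative, while
frame_headroom(16384, 8168) leaves more than 8000 bytes, so the requirement
may be "payload plus all headers" rather than a fixed factor of 2.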
>
> Let me know if you have any thoughts and suggestions. I really appreciate
> the help.
>
> Thanks,
>
> Mark Ruzindana
>
> On Thu, Dec 3, 2020 at 11:16 AM Mark Ruzindana <ruziem...@gmail.com>
> wrote:
>
>> Thanks for the suggestion David!
>>
>> I was starting hashpipe in the debugger. I'll use gdb and the core file,
>> and let you know what I find. If I still can't figure out the problem, I
>> will send you a minimum non-working example. I definitely think it's some
>> sort of pointer arithmetic error as well, I just can't see it yet. I really
>> appreciate the help.
>>
>> Thanks again,
>>
>> Mark
>>
>> On Thu, Dec 3, 2020 at 1:30 AM David MacMahon <dav...@berkeley.edu>
>> wrote:
>>
>>> Hi, Mark,
>>>
>>> Sorry to hear you're still getting a segfault.  It sounds like you made
>>> some progress with gdb, but the fact that you ended up with a different
>>> sort of error suggests that you were starting hashpipe in the debugger.  To
>>> debug your initial segfault problem, you can run hashpipe without the
>>> debugger, let it segfault and generate a core file, then use gdb and the
>>> core file (and hashpipe) to examine the state of the program when the
>>> segfault occurred.  The tricky part is getting the core file to be
>>> generated on a segfault.  You typically have to increase the core file size
>>> limit using "ulimit -c unlimited" and (because hashpipe is typically
>>> installed with the suid bit set) you have to let the kernel know it's OK to
>>> dump core files for suid programs using "sudo sysctl -w fs.suid_dumpable=1"
>>> (or maybe 2 if 1 doesn't quite do it).  You can read more about these steps
>>> with "help ulimit" (ulimit is a bash builtin) and "man 5 proc".
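
As an alternative to setting the limit in the shell, a process can raise
its own core file size limit at startup. The sketch below is a minimal
illustration using the POSIX getrlimit()/setrlimit() calls; the function
name is hypothetical and this is not something hashpipe is known to do. It
only raises the soft limit to the existing hard limit, so it is weaker than
"ulimit -c unlimited" when the hard limit is itself low, and it does not
replace the fs.suid_dumpable setting for suid programs.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the current hard limit.
 * Returns 0 on success, -1 if either call fails (errno is left set by
 * getrlimit() or setrlimit()). */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;  /* an unprivileged process may go this far */

    return setrlimit(RLIMIT_CORE, &rl) == 0 ? 0 : -1;
}
```

Calling raise_core_limit() early in main() would have the same effect as
the soft-limit part of the ulimit command, for that process only.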
>>>
>>> Once you have the core file (typically named "core" but it may have a
>>> numeric extension from the PID of the crashing process) you can debug
>>> things with "gdb /path/to/hashpipe /path/to/core/file".  Note that the core
>>> file may be created with permissions that only let root read it, so you
>>> might have to "sudo chmod a+r core" or similar to get read access to it.
>>> This starts the debugger in a sort of forensic mode using the core file
>>> as a snapshot of the process and its memory space at the time of the
>>> segfault.  You can use "info threads" to see which threads existed, "thread
>>> N" to switch between threads (N is a thread number as shown by "info
>>> threads"), "bt" to see the function call backtrace of the current thread,
>>> and "frame N" to switch to a specific frame in the function call
>>> backtrace.  Once you zero in on which part of your code was executing when
>>> the segfault occurred you can examine variables to see what exactly caused
>>> the segfault to occur.  You might find that the "interesting" or "relevant"
>>> variables have been optimized away, so you may want/need to recompile with
>>> a lower optimization level (e.g. -O1 or maybe even -O0?) to prevent that
>>> from happening.
>>>
>>> Because this happens when you reach the end of your data buffer, I have
>>> to think it's a pointer arithmetic error of some sort.  If you can't figure
>>> out the problem from the core file, please create a "minimum working
>>> example" (well, in this case I guess a minimum non-working example),
>>> including a dummy packet generator script that creates suitable packets,
>>> and I'll see if I can recreate the problem.
>>>
>>> HTH,
>>> Dave
>>>
>>> On Nov 30, 2020, at 14:45, Mark Ruzindana <ruziem...@gmail.com> wrote:
>>>
>>> I'm currently using gdb to debug and it either tells me that I have a
>>> segmentation fault at the memcpy() in process_packet() or something very
>>> strange happens where the starting mcnt of a block greatly exceeds the mcnt
>>> corresponding to the packet being processed and there's no segmentation
>>> fault because the mcnt distance becomes negative so the memcpy() is
>>> skipped. Hopefully that wasn't too hard to track. Very strange problem that
>>> only occurs with gdb and not when I run hashpipe without it. Without gdb, I
>>> get the same segmentation fault at the end of the circular buffer as
>>> mentioned above.
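
One way to guard a memcpy() like the one described here is to validate the
packet's position in the block before computing the destination pointer.
The sketch below is not the process_packet() from this thread; the function
name, parameters, and block layout (a contiguous array of
mcnts_per_block payloads, each payload_bytes long) are assumptions made
for illustration. It rejects packets that fall before the block start as
well as packets at or beyond the block end, which is where an unchecked
offset would run past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one packet payload into its slot in a block.
 * Returns 0 if the payload was copied, -1 if pkt_mcnt lies outside
 * [block_start_mcnt, block_start_mcnt + mcnts_per_block). */
int copy_packet_checked(uint8_t *block, uint64_t block_start_mcnt,
                        uint64_t mcnts_per_block, size_t payload_bytes,
                        uint64_t pkt_mcnt, const uint8_t *payload)
{
    /* Late packet: its mcnt is before the start of this block. */
    if (pkt_mcnt < block_start_mcnt)
        return -1;

    /* Unsigned distance is safe now that the ordering is known. */
    uint64_t dist = pkt_mcnt - block_start_mcnt;

    /* Early packet: it belongs to a later block, so copying here would
     * write past the end of this one. */
    if (dist >= mcnts_per_block)
        return -1;

    memcpy(block + (size_t)dist * payload_bytes, payload, payload_bytes);
    return 0;
}
```

A caller would treat -1 as "advance to another block or drop the packet"
instead of copying, so the end of the buffer is never overrun.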
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "casper@lists.berkeley.edu" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to casper+unsubscr...@lists.berkeley.edu.
>>> To view this discussion on the web visit
>>> https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/AC9534AD-390F-44D8-ABFE-8AE76F059957%40berkeley.edu
>>> <https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/AC9534AD-390F-44D8-ABFE-8AE76F059957%40berkeley.edu?utm_medium=email&utm_source=footer>
>>> .
>>>
>>

