Also, I tried to condense/summarize the issue so if you would like additional details, please feel free to ask and I'll provide them.
Thanks again,

Mark Ruzindana

On Tue, Dec 15, 2020 at 10:00 PM Mark Ruzindana <ruziem...@gmail.com> wrote:

> Hi all,
>
> While running hashpipe with the intention of debugging using gdb as
> suggested, I failed to replicate my segfault issue. On one hand, it should
> have been working given what I understand about the packet socket
> implementation and the way that I wrote the code, but on the other, I don't
> know why it works now and not before, because I didn't make any changes
> between runs. It's a stretch, but there were a few reboots and improvements
> in cable organization within the rack, and that's about it.
>
> I'm taking note of the following change for documentation purposes. It's
> not the reason for my issue. Feel free to ignore or comment on it. This
> change was made before and remained after I observed the segfault issue. To
> flush the packets in the port before the thread is run, I am using
> "p_frame=hashpipe_pktsock_recv_udp_frame_nonblock(p_ps, bindport)" instead
> of "p_frame=hashpipe_pktsock_recv_frame_nonblock(p_ps, bindport)" in the
> while loop; otherwise, there's an infinite loop because there are packets
> with other protocols constantly being captured by the port.
>
> I'm hoping I figure out what change was made as I am debugging the rest of
> this, but for now the specific segfault that I was having is no longer an
> issue. It's unsatisfying and I'll come back to it if I don't figure it out
> as I go, but for now, I'm moving on.
>
> Okay, so now, I'm still experiencing dropped packets. Given a kernel page
> size of 4096 bytes and a frame size of 16384 bytes, I have tried buffer
> parameters ranging from 480 to 128000 total frames and from 60 to 1000
> blocks, respectively, with improvements in throughput in one instance,
> but not in the other three that I have running.
> The one instance with improvements, on the upper end of that range,
> exceeds the number of packets expected in a hashpipe shared memory buffer
> block (the ring buffers in between threads), but only for about four or so
> of them at the very beginning of a scan. There are no dropped packets for
> the rest of the scan. The other instances, with no recognizable
> improvements, drop packets throughout the scan, with one of them dropping
> significantly more than the other two.
>
> I'm currently trying a few things to debug this, but I figured that I
> would ask sooner rather than later. Is there a configuration or step that I
> may have missed in the implementation of packet sockets? My understanding
> is that it should handle my current data rates with no problem. So with
> multiple instances running (four in my case), I should be able to capture
> data with 0 dropped packets (100% data throughput).
>
> Just a note, with a packet size of 8168 bytes and a frame size of 8192
> bytes, hashpipe was crashing, but in a completely unrelated way to how it
> did before. It was *not* a segfault after capturing the exact number of
> packets that correspond to the number of frames in the packet socket ring
> buffer as I described in previous emails. The crashes were more
> inconsistent and I think it's because the frame size needs to be
> considerably larger than the packet size. A factor of 2 seemed to be
> enough. I currently have the frame size set to 16384 (also a multiple of
> the kernel page size), and do not have an issue with hashpipe crashing.
>
> Let me know if you have any thoughts and suggestions. I really appreciate
> the help.
>
> Thanks,
>
> Mark Ruzindana
>
> On Thu, Dec 3, 2020 at 11:16 AM Mark Ruzindana <ruziem...@gmail.com>
> wrote:
>
>> Thanks for the suggestion David!
>>
>> I was starting hashpipe in the debugger. I'll use gdb and the core file,
>> and let you know what I find.
>> If I still can't figure out the problem, I will send you a minimum
>> non-working example. I definitely think it's some sort of pointer
>> arithmetic error as well, I just can't see it yet. I really appreciate
>> the help.
>>
>> Thanks again,
>>
>> Mark
>>
>> On Thu, Dec 3, 2020 at 1:30 AM David MacMahon <dav...@berkeley.edu>
>> wrote:
>>
>>> Hi, Mark,
>>>
>>> Sorry to hear you're still getting a segfault. It sounds like you made
>>> some progress with gdb, but the fact that you ended up with a different
>>> sort of error suggests that you were starting hashpipe in the debugger. To
>>> debug your initial segfault problem, you can run hashpipe without the
>>> debugger, let it segfault and generate a core file, then use gdb and the
>>> core file (and hashpipe) to examine the state of the program when the
>>> segfault occurred. The tricky part is getting the core file to be
>>> generated on a segfault. You typically have to increase the core file size
>>> limit using "ulimit -c unlimited" and (because hashpipe is typically
>>> installed with the suid bit set) you have to let the kernel know it's OK to
>>> dump core files for suid programs using "sudo sysctl -w fs.suid_dumpable=1"
>>> (or maybe 2 if 1 doesn't quite do it). You can read more about these steps
>>> with "help ulimit" (ulimit is a bash builtin) and "man 5 proc".
>>>
>>> Once you have the core file (typically named "core" but it may have a
>>> numeric extension from the PID of the crashing process) you can debug
>>> things with "gdb /path/to/hashpipe /path/to/core/file". Note that the core
>>> file may be created with permissions that only let root read it, so you
>>> might have to "sudo chmod a+r core" or similar to get read access to it.
>>> This starts the debugger in a sort of forensic mode using the core file
>>> as a snapshot of the process and its memory space at the time of the
>>> segfault.
>>> You can use "info threads" to see which threads existed, "thread N" to
>>> switch between threads (N is a thread number as shown by "info
>>> threads"), "bt" to see the function call backtrace of the current thread,
>>> and "frame N" to switch to a specific frame in the function call
>>> backtrace. Once you zero in on which part of your code was executing when
>>> the segfault occurred you can examine variables to see what exactly caused
>>> the segfault to occur. You might find that the "interesting" or "relevant"
>>> variables have been optimized away, so you may want/need to recompile with
>>> a lower optimization level (e.g. -O1 or maybe even -O0?) to prevent that
>>> from happening.
>>>
>>> Because this happens when you reach the end of your data buffer, I have
>>> to think it's a pointer arithmetic error of some sort. If you can't figure
>>> out the problem from the core file, please create a "minimum working
>>> example" (well, in this case I guess a minimum non-working example),
>>> including a dummy packet generator script that creates suitable packets,
>>> and I'll see if I can recreate the problem.
>>>
>>> HTH,
>>> Dave
>>>
>>> On Nov 30, 2020, at 14:45, Mark Ruzindana <ruziem...@gmail.com> wrote:
>>>
>>> I'm currently using gdb to debug and it either tells me that I have a
>>> segmentation fault at the memcpy() in process_packet() or something very
>>> strange happens where the starting mcnt of a block greatly exceeds the mcnt
>>> corresponding to the packet being processed and there's no segmentation
>>> fault because the mcnt distance becomes negative so the memcpy() is
>>> skipped. Hopefully that wasn't too hard to track. Very strange problem that
>>> only occurs with gdb and not when I run hashpipe without it. Without gdb, I
>>> get the same segmentation fault at the end of the circular buffer as
>>> mentioned above.
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "casper@lists.berkeley.edu" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to casper+unsubscr...@lists.berkeley.edu.
>>> To view this discussion on the web visit
>>> https://groups.google.com/a/lists.berkeley.edu/d/msgid/casper/AC9534AD-390F-44D8-ABFE-8AE76F059957%40berkeley.edu
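
A note on the ring buffer numbers quoted in this thread. The sketch below only does the arithmetic for the packet socket ring geometry using the values Mark mentions (4096-byte pages, 16384-byte frames, 480 to 128000 frames, 60 to 1000 blocks). It rests on assumptions that are not confirmed anywhere in the thread: that the block size is derived as frame_size * nframes / nblocks, that the frame size must be a multiple of 16 bytes (the kernel's TPACKET_ALIGNMENT), and that the block size must be a multiple of the page size. The function names ring_geometry and frame_slack are hypothetical and are not part of hashpipe.

```python
PAGE_SIZE = 4096        # kernel page size quoted in the thread
FRAME_ALIGNMENT = 16    # assumption: TPACKET_ALIGNMENT for packet-mmap rings


def ring_geometry(frame_size, nframes, nblocks):
    """Return (frames_per_block, block_size, total_bytes) or raise ValueError.

    Hypothetical helper; it checks the assumed constraints listed above.
    """
    if frame_size % FRAME_ALIGNMENT != 0:
        raise ValueError("frame_size must be a multiple of %d" % FRAME_ALIGNMENT)
    if nframes % nblocks != 0:
        raise ValueError("nframes must be a multiple of nblocks")
    frames_per_block = nframes // nblocks
    block_size = frames_per_block * frame_size
    if block_size % PAGE_SIZE != 0:
        raise ValueError("block size must be a multiple of the page size")
    return frames_per_block, block_size, block_size * nblocks


def frame_slack(frame_size, packet_size):
    """Bytes left in a frame beyond the packet, for any per-frame metadata."""
    return frame_size - packet_size


if __name__ == "__main__":
    # The two ends of the parameter range mentioned in the thread.
    for nframes, nblocks in ((480, 60), (128000, 1000)):
        print(nframes, nblocks, ring_geometry(16384, nframes, nblocks))
    # The two frame sizes tried with 8168-byte packets.
    for frame_size in (8192, 16384):
        print(frame_size, frame_slack(frame_size, 8168))
```

With these values, an 8192-byte frame leaves only 24 bytes beyond an 8168-byte packet. The kernel stores a metadata header in each frame ahead of the packet data, so that slack is probably too small, which would be consistent with the crashes going away at 16384 bytes. If that is the cause, the frame may not need to be twice the packet size; it needs room for the packet plus the per-frame overhead.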