Ok so, I'm currently beating my head against the impenetrable wall of anti-clue that is JTAC (yes I know what you're asking, when am I not? :P), and I've apparently reached an impasse where I need to solicit some external assistance to help get the point across.
The other day we discovered a neat little issue on the EX8200 (all available code): there is a hard-coded resource limit set by RPD (not even in the usual places, like login.conf class settings, that you can hack around) that limits the size of the data segment to 512MB. When you try to exceed that limit, rpd coredumps like so:

    Process (55002,rpd) attempted to exceed RLIMIT_DATA: attempted 524412 KB Max 524288 KB
    pid 55002 (rpd), uid 0: exited on signal 6 (core dumped)

Now, while sane and rational people might see this as a pretty big problem, Juniper actually believes that this is working as designed and a perfectly good thing. Here is the response I got back from Advanced JTAC:

> As per my communication with the engineering, the current limitation
> of the memory allocation for "RPD" process is sufficient enough handle
> 500k+ routes in EX switch, so theoretically we should not see any
> memory usage issue here. But, there could be other issues such memory
> leak etc. which can cause process to hog more memory. It is important
> to analyze core dump of "rpd" so that we can look into root cause of
> the issue.

I of course tried to explain the concept of multiple paths learned from multiple neighbors in the RIB vs. the routes exported to the FIB, and that my 512MB of rpd utilization was perfectly normal considering the number of BGP paths we had (which for us is actually pretty darn small; most of our MX960 routers are doing closer to 1GB in rpd):

    Groups: 15 Peers: 14 Down peers: 1
    Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
    inet.0            933817     332257          0          0          0          0
    inetflow.0            50         25          0          0          0          0
    inet6.0             4774       2545          0          0          0          0

But they've flatly refused to believe me that this is normal and that a 512MB cap is very, very broken, and they continue to try to "find the source of the memory leak". That I'm still having this argument with them, and that EX engineering doesn't understand that 512MB doesn't support that many paths, frankly boggles the mind.
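As a rough sanity check, the figures from the RLIMIT_DATA log line and the BGP summary can be worked through in a few lines. This is a back-of-the-envelope sketch, not a measurement of how rpd actually lays out memory: the "bytes per path" figure assumes (hypothetically) that the entire 512MB data segment goes to path storage, so it is only a loose upper bound, since rpd presumably keeps plenty of other state in the same segment.

```python
# Back-of-the-envelope arithmetic using numbers quoted in the post.
KB = 1024

limit_kb = 524288      # "Max 524288 KB" from the RLIMIT_DATA message
attempted_kb = 524412  # "attempted 524412 KB"

limit_mb = limit_kb // KB                # the cap expressed in MB
overshoot_kb = attempted_kb - limit_kb   # how far past the cap rpd tried to grow

# Paths across all tables in the BGP summary (inet.0, inetflow.0, inet6.0).
total_paths = 933817 + 50 + 4774
active_paths = 332257 + 25 + 2545

# ASSUMPTION: pretend the whole data segment holds nothing but paths.
# Real rpd memory also holds other structures, so this is an upper bound.
bytes_per_path = limit_kb * KB / total_paths

print(f"limit: {limit_mb} MB, overshoot: {overshoot_kb} KB")
print(f"paths: {total_paths} total, {active_paths} active")
print(f"upper bound per path: {bytes_per_path:.0f} bytes")
```

In other words, the cap works out to exactly 512MB, rpd died only 124 KB past it, and a "500k+ routes" sizing that counts active routes leaves well under 600 bytes per path once roughly 939k total paths are in the RIB.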
I sort of understand why they think they need to cap the memory usage of rpd. One of the problems with the EX platform is that they don't ship any real storage on the RE; for example, this 8208 RE has only 2GB TOTAL, with very little free space:

    Filesystem     Size    Used   Avail Capacity  Mounted on
    /dev/da0s1a    366M    123M    214M    37%    /
    /dev/da0s1f    244M     20M    205M     9%    /var
    /dev/da0s3d    630M    612K    579M     0%    /var/tmp
    /dev/da0s3e    111M    1.8M    100M     2%    /config

How they plan to handle writing 2GB dumps to disk when the kernel panics is beyond me; this available space (after I removed EVERYTHING possible) wasn't even enough for me to untar the rpd coredump and gdb it locally. But the other consequence of no real storage is no swap, so when the router does run out of memory things are going to go south in a hurry. That said, at the point rpd is crashing there is almost 1GB of RAM left in the free state, so clearly 512MB is far too low of a limit for practical use.

The problem itself is bad enough, but the bigger problem here is that these guys really don't seem to understand why this is a bad thing. So, can somebody at Juniper please go break the glass on the emergency cluebat, go find the EX guys, and beat them upside the head with it until they get detached retinas? Pretty please? :)

--
Richard A Steenbergen <r...@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C  53AF 4C41 5ECA F8B1 2CBC)
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp