Hi Frank

I know that remote troubleshooting is annoying; that's why I was trying to 
provide as many details as possible. Anyways, with your last hint we're getting 
there! The while loop you mentioned is not exited, we can see that in the stack 
traces. Actually, the DefaultUDPTransportMapping thread remains forever in this 
state:

"DefaultUDPTransportMapping_0.0.0.0/0" #35 daemon prio=5 os_prio=0 
tid=0x00007f6829984800 nid=0x2a6d in Object.wait() [0x00007f67936fd000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.snmp4j.util.ThreadPool.execute(ThreadPool.java:103)
        - locked <0x0000000088818548> (a org.snmp4j.util.ThreadPool)
        at 
org.snmp4j.util.MultiThreadedMessageDispatcher.processMessage(MultiThreadedMessageDispatcher.java:162)
        at 
org.snmp4j.transport.AbstractTransportMapping.fireProcessMessage(AbstractTransportMapping.java:76)
        at 
org.snmp4j.transport.DefaultUdpTransportMapping$ListenThread.run(DefaultUdpTransportMapping.java:430)
        at java.lang.Thread.run(Thread.java:745)


It is waiting for a notify that never comes:

      synchronized (this) {
        try {
          wait(); // line 103
        } catch (InterruptedException ex) {
          handleInterruptedExceptionOnExecute(ex, task);
        }


I guess, here's a timing problem. Suppose, in this piece of code,

  public void execute(WorkerTask task) {
    while (true) {
      for (int i=0; i<taskManagers.size(); i++) {
        TaskManager tm = taskManagers.get(i);
        if ((respawnThreads) && (!tm.isAlive())) {
          tm = new TaskManager(getTaskManagerName(name, i));
          taskManagers.set(i, tm);
        }
        if (tm.isIdle()) {
          try {
            tm.execute(task);
            return;
          }
          catch (IllegalStateException isex) {
            // ignore
          }
        }
      }
      synchronized (this) {
        try {
          wait();
        } catch (InterruptedException ex) {
          handleInterruptedExceptionOnExecute(ex, task);
        }
      }
    }
  }


tm.isIdle() returns false for all task managers, because they are all working. 
In that case, the thread would start waiting for a notify by any of the task 
managers. Suppose moreover, the notify actually came from all running task 
managers right before this thread entered waiting status. execute() wouldn't 
return, in fact it would stay in line 103 (SNMP4J-2.5.7), and that's exactly 
what we can see in the stacktrace. In version 2.5.6, notify could not happen 
between tm.isIdle() and wait() because execute() was synchronized.


Best regards,
Steffen Brüntjen



-----Original Message-----
From: Frank Fock [mailto:f...@agentpp.com] 
Sent: Montag, 9. April 2018 20:59
To: Steffen Brüntjen <steffen.bruent...@macmon.eu>
Cc: snmp4j@agentpp.org
Subject: Re: [SNMP4J] Timeouts in version 2.5.7+

Hi Steffen,

I understood the problem very well, but I am trying to find a possible cause.
In past, users have chosen a very low timeout because it worked - the system(s) 
answered very fast and SNMP4J timeout mechanism was not fast enough to timeout 
the request although the timeout value was lower than the effective response 
time. 

This behaviour then changed by some other timing side effect and all answers 
were timed out then. 
I just wanted to sort this out, but you did not provide your timeout value 
though. 

As supporter it is very difficult to operate with too few facts. 

As you can see from the following code of the DefaultUdpTransportMapping class, 
the “Received…” log message must be printed out if something was received from 
the UDP port:

public void run() {
  DatagramSocket socketCopy = socket;
  if (socketCopy != null) {
    try {
      socketCopy.setSoTimeout(getSocketTimeout());
      if (receiveBufferSize > 0) {
        socketCopy.setReceiveBufferSize(Math.max(receiveBufferSize,
                maxInboundMessageSize));
      }
      if (logger.isDebugEnabled()) {
        logger.debug("UDP receive buffer size for socket " +
                getAddress() + " is set to: " +

                          socketCopy.getReceiveBufferSize());
      }
    } catch (SocketException ex) {
      logger.error(ex);
      setSocketTimeout(0);
    }
  }
  while (!stop) {
    DatagramPacket packet = new DatagramPacket(buf, buf.length,
                                               udpAddress.getInetAddress(),
                                               udpAddress.getPort());
    try {
      socketCopy = socket;
      ByteBuffer bis;
      TransportStateReference stateReference =
              new TransportStateReference(DefaultUdpTransportMapping.this, 
udpAddress, null,
                      SecurityLevel.undefined, SecurityLevel.undefined,
                      false, socketCopy);
      try {
        if (socketCopy == null) {
          stop = true;
          continue;
        }
        try {
          socketCopy.receive(packet);
        }
        catch (SocketTimeoutException ste) {
          continue;
        }
        bis = prepareInPacket(packet, buf, stateReference);
      }
      catch (InterruptedIOException iiox) {
        if (iiox.bytesTransferred <= 0) {
          continue;
        }
        bis = prepareInPacket(packet, buf, stateReference);
      }
      if (logger.isDebugEnabled()) {
        logger.debug("Received message from "+packet.getAddress()+"/"+
                     packet.getPort()+
                     " with length "+packet.getLength()+": "+
                     new OctetString(packet.getData(), 0,
                                     packet.getLength()).toHexString());
      }
      if (bis != null) {
        fireProcessMessage(new UdpAddress(packet.getAddress(),
                packet.getPort()), bis, stateReference);
      }
    }
    catch (SocketTimeoutException stex) {
      // ignore
    }
    catch (PortUnreachableException purex) {
      synchronized (DefaultUdpTransportMapping.this) {
        listener = null;
      }
      logger.error(purex);
      if (logger.isDebugEnabled()) {
        purex.printStackTrace();
      }
      if (SNMP4JSettings.isForwardRuntimeExceptions()) {
        throw new RuntimeException(purex);
      }
      break;
    }
    catch (SocketException soex) {
      if (!stop) {
        logger.warn("Socket for transport mapping " + toString() + " error: " + 
soex.getMessage());
      }
      if (SNMP4JSettings.isForwardRuntimeExceptions()) {
        stop = true;
        throw new RuntimeException(soex);
      }
      else if (!stop) {
        try {
          DatagramSocket newSocket = renewSocketAfterException(soex, 
socketCopy);
          if (newSocket == null) {
            throw soex;
          }
          socket = newSocket;
        } catch (SocketException e) {
          stop = true;
          socket = null;
          logger.error("Socket renewal for transport mapping " + toString() +
              " failed with: " + e.getMessage(), e);

        }
      }
    }
    catch (IOException iox) {
      logger.warn(iox);
      if (logger.isDebugEnabled()) {
        iox.printStackTrace();
      }
      if (SNMP4JSettings.isForwardRuntimeExceptions()) {
        throw new RuntimeException(iox);
      }
    }
  }
If that is not the case, 

(a) either Java is not able to receive UDP packets from that socket anymore 
(then none of the possible reasons is caused by the SNMP4J stack or a specific 
version of it)
(b) The while loop is exited. 

Possible reasons for (b) could be an API call or an uncaught exception within 
the while loop.

Can you reproduce the issue with a simple MIB walk or an application where you 
ca share the log - at least the snippet between last successful answer 
processing and the first failure?

Best regards,
Frank



On 9. Apr 2018, at 15:19, Steffen Brüntjen <steffen.bruent...@macmon.eu> wrote:

Hi!


Is your timeout value big enough?

It seems I couldn't make the problem clear, so once again from the beginning. 
I've been sending millions of millions SNMP queries over a long time, like 
months, with no problems. Then I replaced SNMP4J version 2.5.6 with version 
2.5.7 (it can be any other newer release) and restarted the program. The 
program starts running fine again, but only for some comparatively short amount 
of time (sometimes 10 minutes, sometimes 2 days). From the moment on, when the 
problem arises, NO more SNMP packets are received at all. There's not a single 
SNMP request that results in something else but a timeout. I made no 
configuration changes like timeouts/retries, no changes in the network, none of 
the 200 devices change and no one has access to the machine. Consequently, when 
I restart the program, it runs fine again. So I bet a too low timeout can't be 
the problem. As an example here's my last test. I extracted all data from 
SNMP4J log messages:

2018-04-05 10:43 Program starts
2018-04-05 10:43 First outgoing SNMP request
2018-04-05 22:23 Last incoming message
2018-04-09 ..... Program is still running (until today)

4,994,621 sent messages from 2018-04-05 10:43 to 22:23
4,992,120 received msgs from 2018-04-05 10:43 to 22:23
  53,323 sent messages from 2018-04-05 22:23 to now
       0 received msgs from 2018-04-05 22:23 to now

(The different number of outgoing requests per amount of time comes from the 
fact I wrote in my first mail: The program queries the sysObjectId and then - 
if successful - more OIDs. If the sysObjectId can't be queried, the rest of 
communication doesn't take place.)



I do not see any differences in the stacktraces, do I miss something?

In your last mail you wrote, that "From the stack trace it seems that there is 
no idle TaskManager in your ThreadPool for the MultiThreadMessageDispatcher 
left.". So with the two identical stack traces I was trying to point out that 
there's no indication of a missing TaskManager thread.


Best regards and thanks,
Steffen Brüntjen




-----Original Message-----
From: Frank Fock [mailto:f...@agentpp.com] 
Sent: Freitag, 6. April 2018 15:40
To: Steffen Brüntjen <steffen.bruent...@macmon.eu>
Cc: snmp4j@agentpp.org
Subject: Re: [SNMP4J] Timeouts in version 2.5.7+

Is your timeout value big enough?
I do not see any differences in the stacktraces, do I miss something?



Am 06.04.2018 um 14:15 schrieb Steffen Brüntjen <steffen.bruent...@macmon.eu>:

Hi Frank

Mmh, I don't think it's because of uncaught exceptions (at least not outside of 
SNMP4J). SNMP packets are continued to be sent out normally, the responses are 
received by kernel but not "recognized" by SNMP4J (see tcpdump output). I get 
normal timeouts as if the devices wouldn't respond any more (but they do). I 
reproduced the problem once again with TRACE logging enabled. The SNMP4J logs 
appear to be very normal, except that there are no more "Received message from 
... with length ...: ..." messages any more 
(DefaultUdpTransportMapping.java:409). Here are the different types of log 
messages after problem occurs (in random order):

Sending message to .../161 with length 43: ... | 
(DefaultUdpTransportMapping.java:112)
Removed pending request with handle: PduHandle[...] | (Snmp.java:995)
Request timed out: ... | (Snmp.java:1900)
Running pending sync request with handle PduHandle[...] and retry count left 1 
| (Snmp.java:1808)


The stacktraces look the same when everything is running fine:

"snmp.3" #39 prio=5 os_prio=0 tid=0x00007f682a81a000 nid=0x2a71 in 
Object.wait() [0x00007f67932f9000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x0000000088861810> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.2" #38 prio=5 os_prio=0 tid=0x00007f682a818000 nid=0x2a70 in 
Object.wait() [0x00007f67933fa000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x0000000088861a20> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.1" #37 prio=5 os_prio=0 tid=0x00007f6829986800 nid=0x2a6f in 
Object.wait() [0x00007f67934fb000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x0000000088861c30> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.0" #36 prio=5 os_prio=0 tid=0x00007f6829988800 nid=0x2a6e in 
Object.wait() [0x00007f67935fc000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x0000000088861e40> (a org.snmp4j.util.ThreadPool$TaskManager)


Here are the stacktraces after problem starts:

"snmp.3" #39 prio=5 os_prio=0 tid=0x00007f682a81a000 nid=0x2a71 in 
Object.wait() [0x00007f67932f9000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000888185a0> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.2" #38 prio=5 os_prio=0 tid=0x00007f682a818000 nid=0x2a70 in 
Object.wait() [0x00007f67933fa000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000888187b0> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.1" #37 prio=5 os_prio=0 tid=0x00007f6829986800 nid=0x2a6f in 
Object.wait() [0x00007f67934fb000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000888189c0> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.0" #36 prio=5 os_prio=0 tid=0x00007f6829988800 nid=0x2a6e in 
Object.wait() [0x00007f67935fc000]
 java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x0000000088818bd0> (a org.snmp4j.util.ThreadPool$TaskManager)



Best regards
Steffen Brüntjen


-----Original Message-----
From: Frank Fock [mailto:f...@agentpp.com] 
Sent: Mittwoch, 4. April 2018 00:54
To: Steffen Brüntjen <steffen.bruent...@macmon.eu>
Cc: snmp4j@agentpp.org
Subject: Re: [SNMP4J] Timeouts in version 2.5.7+

Hi Steffen,

From the stack trace it seems that there is no idle TaskManager in your 
ThreadPool for the MultiThreadMessageDispatcher left.
Maybe all TaskManagers died because of exceptions?

You can simply use the DefaultMessageDispatcher in your case. I do not expect 
advantages from the multi-threaded variant. Maybe you can find the root cause 
easier with the simple message dispatcher.

Best regards,
Frank 



On 3. Apr 2018, at 13:26, Steffen Brüntjen <steffen.bruent...@macmon.eu> wrote:

Hi Frank


Thanks for your reply. I don't reuse PDUs. In case of SNMPv2c, the effective 
statements would be:

PDU pdu = new PDU();
pdu.setType(PDU.GET);
pdu.add(new VariableBinding(...));
ResponseEvent responseEvent = snmp.send(pdu, target); // timeout

I believe the PDU request IDs will be coming from this code in 
MessageDispatcherImpl:

protected PduHandle createPduHandle() {
 return new PduHandle(getNextRequestID());
}


There's only one static snmp instance for all targets, I read somewhere that 
this is the preferred implementation.


I'am also able to remote debug the program, but I wouldn't know where to look 
at. Also, it's a bit of extra work and it will take a while, but it's possible.


Best regards,
Steffen


-----Original Message-----
From: Frank Fock [mailto:f...@agentpp.com] 
Sent: Sonntag, 1. April 2018 08:23
To: Steffen Brüntjen <steffen.bruent...@macmon.eu>
Cc: snmp4j@agentpp.org
Subject: Re: [SNMP4J] Timeouts in version 2.5.7+

Hi Steffen,

There was no change on the DefaultUdpTransportMapping between 2.5.6 and 2.5.7, 
thus I assume that behaviour change is a side effect of some other changes. 
SNMP4J is no a bit faster.
Have you checked your code for race conditions?
When you use Snmp.send, how do you create the request IDs of the PDUs? 

Also reusing the PDU objects before the request is finished might cause this 
issue. 

Best regards,
Frank  


On 16. Mar 2018, at 12:14, Steffen Brüntjen <steffen.bruent...@macmon.eu> wrote:

Hi!

I'm using SNMP4J for monitoring around 200 devices. It has been running stably 
for months. Now I found a problem in the latest releases: When my process runs 
for a couple of hours (sometimes days), after some time SNMP4J returns nothing 
but timeouts. My program initially uses Snmp.send() and TableUtils (I think, 
this doesn't matter). After the problem shows up once, all my program 
effectively does is querying the SysObjectIds (because this already results in 
timeouts):

org.snmp4j.Snmp.send(get(SnmpV2Mib.sysObjectID), target)


Some of the targets are community targets, some are user targets, but I get 
timeouts for all devices. I repeated testing with only community targets 
resulting in the same behavior. I tracked down the problem to release 2.5.7. 
The previous version 2.5.6 works without problems, all releases starting with 
2.5.7 show the problem. I also analyzed the SNMP packages using tcpdump. The 
devices send their answers quickly, the packets don't look any special. It 
seems that SNMP4J doesn't receive, recognize or process the UDP packets.

Can you reproduce the problem? Does anyone see problems in source code changes 
from 2.5.6 to 2.5.7?

Best regards
Steffen Brüntjen



Here are some stacktraces at the time the problem exists:

"DefaultUDPTransportMapping_0.0.0.0/162" #72 daemon prio=5 os_prio=0 
tid=0x00007fa0ac00a000 nid=0x4433 runnable [0x00007fa08b4f3000]
java.lang.Thread.State: RUNNABLE
  at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  - locked <0x0000000089977c28> (a java.net.PlainDatagramSocketImpl)
  at 
java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143)
  - locked <0x0000000089977c28> (a java.net.PlainDatagramSocketImpl)
  at java.net.DatagramSocket.receive(DatagramSocket.java:812)
  - locked <0x0000000089b2b3a8> (a java.net.DatagramPacket)
  - locked <0x0000000089977c78> (a java.net.DatagramSocket)
  at 
org.snmp4j.transport.DefaultUdpTransportMapping$ListenThread.run(DefaultUdpTransportMapping.java:397)
  at java.lang.Thread.run(Thread.java:745)

"DefaultUDPTransportMapping_0.0.0.0/0" #33 daemon prio=5 os_prio=0 
tid=0x00007fa126c9d800 nid=0x440c in Object.wait() [0x00007fa0b60f1000]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool.execute(ThreadPool.java:103)
  - locked <0x00000000883cde50> (a org.snmp4j.util.ThreadPool)
  at 
org.snmp4j.util.MultiThreadedMessageDispatcher.processMessage(MultiThreadedMessageDispatcher.java:162)
  at 
org.snmp4j.transport.AbstractTransportMapping.fireProcessMessage(AbstractTransportMapping.java:76)
  at 
org.snmp4j.transport.DefaultUdpTransportMapping$ListenThread.run(DefaultUdpTransportMapping.java:430)
  at java.lang.Thread.run(Thread.java:745)

"snmp.3" #37 prio=5 os_prio=0 tid=0x00007fa125a38000 nid=0x4410 in 
Object.wait() [0x00007fa0b5ced000]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000883ddeb8> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.2" #36 prio=5 os_prio=0 tid=0x00007fa126ca1800 nid=0x440f in 
Object.wait() [0x00007fa0b5dee000]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000883de0c8> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.1" #35 prio=5 os_prio=0 tid=0x00007fa126ca0800 nid=0x440e in 
Object.wait() [0x00007fa0b5eef000]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000883de2d8> (a org.snmp4j.util.ThreadPool$TaskManager)

"snmp.0" #34 prio=5 os_prio=0 tid=0x00007fa126ca0000 nid=0x440d in 
Object.wait() [0x00007fa0b5ff0000]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x00000000883de4e8> (a org.snmp4j.util.ThreadPool$TaskManager)

"trap.0" #71 prio=5 os_prio=0 tid=0x00007fa0ac004000 nid=0x4432 in 
Object.wait() [0x00007fa08b5f4000]
java.lang.Thread.State: WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at java.lang.Object.wait(Object.java:502)
  at org.snmp4j.util.ThreadPool$TaskManager.run(ThreadPool.java:275)
  - locked <0x0000000089979080> (a org.snmp4j.util.ThreadPool$TaskManager)


And then there are some monitoring threads that send SNMP queries:

"mon-001" #43 prio=5 os_prio=0 tid=0x00007fa0c000c000 nid=0x4416 in 
Object.wait() [0x00007fa0b52e7000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at org.snmp4j.Snmp.send(Snmp.java:991)
  - locked <0x00000000fdc0c6e0> (a org.snmp4j.Snmp$SyncResponseListener)
  at org.snmp4j.Snmp.send(Snmp.java:963)
  at org.snmp4j.Snmp.send(Snmp.java:928)
 at [...]



And here's some non-binary tcpdump output:

12:05:49.820810 IP 192.168.101.211.57522 > 10.100.7.204.snmp:  GetRequest(28)  
system.sysObjectID.0
E..G..@.@.@...e.
d.......38.0).....public..../.........0.0...+.........
12:05:49.821321 IP 10.100.7.204.snmp > 192.168.101.211.57522:  GetResponse(37)  
system.sysObjectID.0=E:cisco.1.614
E..P..@.@.SW
d....e......<..02.....public.%../.........0.0...+........       +....   ..f
^C
_______________________________________________
SNMP4J mailing list
SNMP4J@agentpp.org
https://oosnmp.net/mailman/listinfo/snmp4j




_______________________________________________
SNMP4J mailing list
SNMP4J@agentpp.org
https://oosnmp.net/mailman/listinfo/snmp4j

Reply via email to