Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-29 Thread Tris Mabbs
Hiya Andrew,

Many thanks for the typically helpful and comprehensive reply :-)

 I think that's probably the right track :-)

 The code here is triggered when poll() indicates that the socket is
readable.
 This socket should only be readable when a new connection is being made,
and accept() should succeed.
 ...
 So, my only conclusion is that your box momentarily does not have the
resources to accept the connection,
 and because there isn't the sleep() in the source3 code, it prints this in
a loop until the resources become available.

Absolutely, and on any normal Unix implementation I'd agree entirely.  That
sort of poll()/accept()/... code is perfectly normal and exactly what
you'd expect - I've written plenty of very similar code myself over the
years ...
However this is Solaris :-(

Caught in the act:

...
16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1
16327:  fd=39 ev=POLLIN|POLLHUP rev=0
16327:  fd=38 ev=POLLIN|POLLHUP rev=0
16327:  fd=34 ev=POLLIN|POLLHUP rev=0
16327:  fd=36 ev=POLLIN|POLLHUP rev=0
16327:  fd=37 ev=POLLIN|POLLHUP rev=POLLIN
16327:  fd=35 ev=POLLIN|POLLHUP rev=0
16327:  fd=33 ev=POLLIN|POLLHUP rev=0
16327:  fd=6  ev=POLLIN|POLLHUP rev=0
16327:  timeout: 59.99900 sec
16327: accept(37, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) = 41
16327:  AF_INET  name = X.X.X.X  port = 28986
16327: forkx(0)= 26942
16327: lwp_sigmask(SIG_SETMASK, 0x00011080, 0x, 0x,
0x) = 0xFFBFFEFF [0x]
16327: close(41)   = 0
16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1
16327:  fd=39 ev=POLLIN|POLLHUP rev=0
16327:  fd=38 ev=POLLIN|POLLHUP rev=0
16327:  fd=34 ev=POLLIN|POLLHUP rev=0
16327:  fd=36 ev=POLLIN|POLLHUP rev=0
16327:  fd=35 ev=POLLIN|POLLHUP rev=POLLIN
16327:  fd=33 ev=POLLIN|POLLHUP rev=0
16327:  fd=6  ev=POLLIN|POLLHUP rev=0
16327:  fd=37 ev=POLLIN|POLLHUP rev=0
16327:  timeout: 44.69600 sec
16327: accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130
ECONNABORTED
...

So there's nothing odd about the poll().  Typically Solaris will flag
POLLERR in revents if it's out of resources, and POLLHUP if the remote end
closed the connection before it was fully established (remote NAKed, or
ignored, the connection SYN; terminally low on resources at t'other end of
the socket; ...).  Neither is happening here which would suggest things are
proceeding as normal for the connection establishment.

The server darn' well shouldn't be out of any resources either.  In terms of
physical resources, at the point that occurred the CPUs were at 99.9% idle,
there was 15Gb of free RAM (so not out of kernel memory then ...) and only a
total of about 400 sockets (TCP, Unix, ...) in use across the entire system,
as reported by netstat -na | wc -l - well below peak levels seen on this
system.

So it's going to be that hypothetical Solaris specific
SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, isn't it :-)

So could I request please, that in the source3 code, either:
a. The same sleep() is added as in the source4 code; -and/or-
b. If errno == ECONNABORTED then only log the error if the debug
level is (substantially?) higher than zero.

I think it's probably safe to assume that ECONNABORTED is generally
ignoreable; for whatever reason, Solaris seems to return this at the drop of
a metaphorical hat (and ignoring it on other OS' isn't going to be a problem
either).  Maybe the same with EAGAIN (and possibly EWOULDBLOCK), as other
Ignore this unless the user REALLY wants a lot of debug output type
errors?

This would also seem to be common practice - a quick Google for accept()
ignore ECONNABORTED comes back with a lot of results, mainly showing other
open source code having been modified specifically to ignore ECONNABORTED.

Cheers!

Tris.

-Original Message-
From: Andrew Bartlett [mailto:abart...@samba.org] 
Sent: 29 August 2013 00:41
To: Tris Mabbs
Cc: samba@lists.samba.org; samba-techni...@samba.org
Subject: Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only
using client) behaviour #2 - accept: Software caused connection abort.

On Sun, 2013-08-25 at 18:50 +0100, Tris Mabbs wrote:
 Probably should have posted this to samba-technical 
 in the first place, so re-posting in case anyone has any useful ideas .
 
  
 
 From: Tris Mabbs
 
 Sent: 12 August 2013 23:08
 To: 'samba@lists.samba.org'
 Subject: Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using 
 client) behaviour #2 - accept: Software caused connection abort.
 
  
 
 Good day oh technical ones .
 
  
 
 I was running Samba 4 (client only, not using it as a 
 DC so effectively running Samba 3 code from the Samba 4 tree) and, 
 other than a little Gotcha! regarding decoding Kerberos PACs, it was 
 all

Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-29 Thread Jeremy Allison
On Thu, Aug 29, 2013 at 10:10:38AM +0100, Tris Mabbs wrote:
 Hiya Andrew,
 
 Many thanks for the typically helpful and comprehensive reply :-)
 
  I think that's probably the right track :-)
 
  The code here is triggered when poll() indicates that the socket is
 readable.
  This socket should only be readable when a new connection is being made,
 and accept() should succeed.
  ...
  So, my only conclusion is that your box momentarily does not have the
 resources to accept the connection,
  and because there isn't the sleep() in the source3 code, it prints this in
 a loop until the resources become available.
 
 Absolutely, and on any normal Unix implementation I'd agree entirely.  That
 sort of poll()/accept()/... code is perfectly normal and exactly what
 you'd expect - I've written plenty of very similar code myself over the
 years ...
 However this is Solaris :-(
 
 Caught in the act:
 
 ...
 16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1
 16327:  fd=39 ev=POLLIN|POLLHUP rev=0
 16327:  fd=38 ev=POLLIN|POLLHUP rev=0
 16327:  fd=34 ev=POLLIN|POLLHUP rev=0
 16327:  fd=36 ev=POLLIN|POLLHUP rev=0
 16327:  fd=37 ev=POLLIN|POLLHUP rev=POLLIN
 16327:  fd=35 ev=POLLIN|POLLHUP rev=0
 16327:  fd=33 ev=POLLIN|POLLHUP rev=0
 16327:  fd=6  ev=POLLIN|POLLHUP rev=0
 16327:  timeout: 59.99900 sec
 16327: accept(37, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) = 41
 16327:  AF_INET  name = X.X.X.X  port = 28986
 16327: forkx(0)= 26942
 16327: lwp_sigmask(SIG_SETMASK, 0x00011080, 0x, 0x,
 0x) = 0xFFBFFEFF [0x]
 16327: close(41)   = 0
 16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1
 16327:  fd=39 ev=POLLIN|POLLHUP rev=0
 16327:  fd=38 ev=POLLIN|POLLHUP rev=0
 16327:  fd=34 ev=POLLIN|POLLHUP rev=0
 16327:  fd=36 ev=POLLIN|POLLHUP rev=0
 16327:  fd=35 ev=POLLIN|POLLHUP rev=POLLIN
 16327:  fd=33 ev=POLLIN|POLLHUP rev=0
 16327:  fd=6  ev=POLLIN|POLLHUP rev=0
 16327:  fd=37 ev=POLLIN|POLLHUP rev=0
 16327:  timeout: 44.69600 sec
 16327: accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130
 ECONNABORTED
 ...
 
 So there's nothing odd about the poll().  Typically Solaris will flag
 POLLERR in revents if it's out of resources, and POLLHUP if the remote end
 closed the connection before it was fully established (remote NAKed, or
 ignored, the connection SYN; terminally low on resources at t'other end of
 the socket; ...).  Neither is happening here which would suggest things are
 proceeding as normal for the connection establishment.
 
 The server darn' well shouldn't be out of any resources either.  In terms of
 physical resources, at the point that occurred the CPUs were at 99.9% idle,
 there was 15Gb of free RAM (so not out of kernel memory then ...) and only a
 total of about 400 sockets (TCP, Unix, ...) in use across the entire system,
 as reported by netstat -na | wc -l - well below peak levels seen on this
 system.
 
 So it's going to be that hypothetical Solaris specific
 SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, isn't it :-)
 
 So could I request please, that in the source3 code, either:
   a. The same sleep() is added as in the source4 code; -and/or-
   b. If errno == ECONNABORTED then only log the error if the debug
 level is (substantially?) higher than zero.

So your problem is the debug statement being triggered repeatedly ?

Adding a sleep is (IMHO) the wrong thing to do. Once the accept()
has failed the 'POLLIN' event should not be triggered repeatedly
on the polled socket. Your truss trace doesn't show enough. Does
a subsequent pollsys() keep returning fd=35 ev=POLLIN|POLLHUP rev=POLLIN
after the:

 accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130  ECONNABORTED

?

Jeremy.
-- 
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba


Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-29 Thread Tris Mabbs
Hiya Jeremy,

 So your problem is the debug statement being triggered repeatedly ?

Yup.

 Adding a sleep is (IMHO) the wrong thing to do. 

It has the advantage of pretty much guaranteeing the problem will go away;
it has the disadvantage of blocking the thread/process.
However it is what the Samba4 client code does (so a similar change to the
Samba3 would be consistent; of course, so would a different change to both
codebases ...).

 Once the accept() has failed the 'POLLIN' event should not be triggered
repeatedly on the polled socket.
 Your truss trace doesn't show enough. Does a subsequent pollsys() keep
returning fd=35 ev=POLLIN|POLLHUP rev=POLLIN after the:

 accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130 ECONNABORTED

Now that's a very interesting question ...

OK, a quick dig around later and we get (abridged):

pollsys(0x080849F0, 8, 0xFEFFDF58, 0x)  = 1
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=31 ev=POLLIN|POLLHUP rev=0
fd=33 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=POLLIN
fd=6  ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
timeout: 32.54700 sec
accept(32, 0xFEFFDE0C, 0xFEFFDDF8, SOV_DEFAULT) Err#130 ECONNABORTED
...
write(8,  a c c e p t :   S o.., 43)  = 43
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x)  = 1
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=31 ev=POLLIN|POLLHUP rev=POLLIN
fd=33 ev=POLLIN|POLLHUP rev=0
fd=6  ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=0
timeout: 32.54600 sec
accept(31, 0xFEFFDE0C, 0xFEFFDDF8, SOV_DEFAULT) = 38
AF_INET  name = X.X.X.X  port = 55935
forkx(0)= 10502
...
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x)  = 1
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=33 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=POLLIN
fd=31 ev=POLLIN|POLLHUP rev=0
fd=6  ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
timeout: 31.03400 sec
accept(32, 0xFEFFDE0C, 0xFEFFDDF8, SOV_DEFAULT) Err#130 ECONNABORTED
...
write(8,  a c c e p t :   S o.., 43)  = 43
Received signal #18, SIGCLD, in pollsys() [caught]
  siginfo: SIGCLD CLD_EXITED pid=10504 status=0x
pollsys(0x080849F0, 8, 0xFEFFDF58, 0x)  Err#4 EINTR
fd=36 ev=POLLIN|POLLHUP rev=0
fd=35 ev=POLLIN|POLLHUP rev=0
fd=34 ev=POLLIN|POLLHUP rev=0
fd=33 ev=POLLIN|POLLHUP rev=0
fd=31 ev=POLLIN|POLLHUP rev=0
fd=6  ev=POLLIN|POLLHUP rev=0
fd=30 ev=POLLIN|POLLHUP rev=0
fd=32 ev=POLLIN|POLLHUP rev=0
timeout: 31.03200 sec

So that would be a no - next poll() and there's no revent flagged on that
same socket.
Which would confirm your thought that sleep() is perhaps not the way to
go.  However I don't know the Samba code (at all!) nearly well enough to
comment - that sleep() may be serving some other vital purpose under
different circumstances?

Either way, it would appear that my second suggestion would still be valid -
only log this (and possibly a couple of other error conditions) when more
debugging is enabled?

Another passing thought ...
That truss only captured 2 ECONNABORTED incidents - typical that nothing
much happens when you're specifically looking at it.
However, is it likely to be a coincidence that both were on the same socket?
FD#32 happens to be bound to port 445 on one specific interface of the
machine; tomorrow I might try a more extended test and poke lots of traffic
at that interface (and/or might stick the socket descriptor number into the
debug message) - if anything interesting presents itself (E.g., it's always
the same port, or interface, ... where the problem occurs) I'll post an
update saying so.
Probably doesn't affect the solution, but possibly technically interesting
anyway ...

Many thanks, and regards,

Tris.

-Original Message-
From: Jeremy Allison [mailto:j...@samba.org] 
Sent: 29 August 2013 20:52
To: Tris Mabbs
Cc: 'Andrew Bartlett'; samba@lists.samba.org; samba-techni...@samba.org
Subject: Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only
using client) behaviour #2 - accept: Software caused connection abort.

On Thu, Aug 29, 2013 at 10:10:38AM +0100, Tris Mabbs wrote:
 Hiya Andrew,
 
 Many thanks for the typically helpful and comprehensive reply :-)
 
  I think that's probably the right track :-)
 
  The code here is triggered when poll() indicates that the socket is
 readable.
  This socket should only be readable when a new connection is being 
  made,
 and accept() should succeed.
  ...
  So, my only conclusion is that your box momentarily does not have 
  the
 resources

Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-28 Thread Andrew Bartlett
On Sun, 2013-08-25 at 18:50 +0100, Tris Mabbs wrote:
 Probably should have posted this to samba-technical in the
 first place, so re-posting in case anyone has any useful ideas .
 
  
 
 From: Tris Mabbs
 
 Sent: 12 August 2013 23:08
 To: 'samba@lists.samba.org'
 Subject: Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client)
 behaviour #2 - accept: Software caused connection abort.
 
  
 
 Good day oh technical ones .
 
  
 
 I was running Samba 4 (client only, not using it as a DC so
 effectively running Samba 3 code from the Samba 4 tree) and, other than a
 little Gotcha! regarding decoding Kerberos PACs, it was all working
 perfectly.
 
 Then recently I had to upgrade, to 4.2.0pre1-GIT-b505111
 (I had to upgrade the OS on the server running Samba - 'twas OpenSolaris
 and is now Solaris 11.1) so I recompiled it all up and installed afresh
 (so no .tdbs from the previous installation or anything).
 
  
 
 But here's a funny thing (#2).  The log file gets absolutely
 ridiculous numbers of messages thus:
 
  
 
 Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
 22:45:01.731562,  0] ../source3/smbd/server.c:556(smbd_accept_connection)
 
 Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
 Software caused connection abort
 
 Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
 22:45:03.556423,  0] ../source3/smbd/server.c:556(smbd_accept_connection)
 
 Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
 Software caused connection abort
 
 Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
 22:45:03.556688,  0] ../source3/smbd/server.c:556(smbd_accept_connection)
 
 Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
 Software caused connection abort
 
  
 
 And so on.  These will come in spurts; there won't be any
 such messages for several minutes then a whole load will come along all at
 once.  Rather like busses .

 
 I will catch smbd in the act at some point though, and
 when I do I'll follow-up with a system call trace to show exactly what is
 happening when this message gets triggered.  It will, of course, be
 something bizarrely Solaris specific (you didn't set the
 SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, did you?  Tsk tsk tsk
 .).

I think that's probably the right track :-)

The code here is triggered when poll() indicates that the socket is
reaable.  This socket should only be readable when a new connection is
being made, and accept() should succeed.

In the source4/smbd/process_single.c code equivalent to this, there is
this helpful hint:
/* accept an incoming connection. */
status = socket_accept(listen_socket, connected_socket);
if (!NT_STATUS_IS_OK(status)) {
DEBUG(0,(single_accept_connection: accept: %s\n,
nt_errstr(status)));
/* this looks strange, but is correct. 

   We can only be here if woken up from select, due to
   an incoming connection.

   We need to throttle things until the system clears
   enough resources to handle this new socket. 

   If we don't then we will spin filling the log and
   causing more problems. We don't panic as this is
   probably a temporary resource constraint */
sleep(1);
return;
}

So, my only conclusion is that your box momentarily does not have the
resources to accept the connection, and because there isn't the sleep()
in the source3 code, it prints this in a loop until the resources become
available. 

Andrew Bartlett
-- 
Andrew Bartlett
http://samba.org/~abartlet/
Authentication Developer, Samba Team   http://samba.org
Samba Developer, Catalyst IT   http://catalyst.net.nz


-- 
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba


Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-25 Thread Tris Mabbs
Probably should have posted this to samba-technical in the
first place, so re-posting in case anyone has any useful ideas .

 

From: Tris Mabbs

Sent: 12 August 2013 23:08
To: 'samba@lists.samba.org'
Subject: Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client)
behaviour #2 - accept: Software caused connection abort.

 

Good day oh technical ones .

 

I was running Samba 4 (client only, not using it as a DC so
effectively running Samba 3 code from the Samba 4 tree) and, other than a
little Gotcha! regarding decoding Kerberos PACs, it was all working
perfectly.

Then recently I had to upgrade, to 4.2.0pre1-GIT-b505111
(I had to upgrade the OS on the server running Samba - 'twas OpenSolaris
and is now Solaris 11.1) so I recompiled it all up and installed afresh
(so no .tdbs from the previous installation or anything).

 

But here's a funny thing (#2).  The log file gets absolutely
ridiculous numbers of messages thus:

 

Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
22:45:01.731562,  0] ../source3/smbd/server.c:556(smbd_accept_connection)

Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
Software caused connection abort

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
22:45:03.556423,  0] ../source3/smbd/server.c:556(smbd_accept_connection)

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
Software caused connection abort

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
22:45:03.556688,  0] ../source3/smbd/server.c:556(smbd_accept_connection)

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
Software caused connection abort

 

And so on.  These will come in spurts; there won't be any
such messages for several minutes then a whole load will come along all at
once.  Rather like busses .

 

It doesn't seem to be affecting the operation so far as any
client is concerned.  Or rather it evidently will be having some effect, but
it's not a noticeable one.

 

However it is really irritating having the system log
getting filled up with all these messages!

 

Murphy's law, of course, states that trying to catch one of
these messages being created, so I can include a suitable system call trace
in this message, will be impossible - there will be no such messages logged
until the instant I click Send (at which point probably about half-a-dozen
will be logged all at once).  That does indeed seem to be the case - I've
now been trying to persuade one of these, normally very regularly occurring,
messages to be logged for about 20 minutes and still, stubbornly, nothing
continues to happen.

I will catch smbd in the act at some point though, and
when I do I'll follow-up with a system call trace to show exactly what is
happening when this message gets triggered.  It will, of course, be
something bizarrely Solaris specific (you didn't set the
SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, did you?  Tsk tsk tsk
.).

 

Cheers,

 

Tris.

-- 
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba


Re: [Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-13 Thread Tris Mabbs
Caught it in the act!  As predicted, almost immediately
after sending my original message (!) .

 

However unfortunately, it's not particularly useful .

 

...

16327: write(7, \0, 1)= 1

16327: setcontext(0xFEFFD690)

16327: waitid(P_ALL, 0, 0xFEFFDD70, WEXITED|WTRAPPED|WNOHANG) = 0

16327:   siginfo: SIGCLD CLD_EXITED pid=26937 status=0x

16327: waitid(P_ALL, 0, 0xFEFFDD70, WEXITED|WTRAPPED|WNOHANG) = 0

16327:   siginfo: SIG#0

16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1

16327:  fd=39 ev=POLLIN|POLLHUP rev=0

16327:  fd=38 ev=POLLIN|POLLHUP rev=0

16327:  fd=34 ev=POLLIN|POLLHUP rev=0

16327:  fd=36 ev=POLLIN|POLLHUP rev=0

16327:  fd=37 ev=POLLIN|POLLHUP rev=0

16327:  fd=35 ev=POLLIN|POLLHUP rev=0

16327:  fd=33 ev=POLLIN|POLLHUP rev=0

16327:  fd=6  ev=POLLIN|POLLHUP rev=POLLIN

16327:  timeout: 59.99900 sec

16327: read(6, \0, 16)= 1

16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x) (sleeping...)

16327:  fd=39 ev=POLLIN|POLLHUP rev=0

16327:  fd=38 ev=POLLIN|POLLHUP rev=0

16327:  fd=34 ev=POLLIN|POLLHUP rev=0

16327:  fd=36 ev=POLLIN|POLLHUP rev=0

16327:  fd=37 ev=POLLIN|POLLHUP rev=0

16327:  fd=35 ev=POLLIN|POLLHUP rev=0

16327:  fd=33 ev=POLLIN|POLLHUP rev=0

16327:  fd=6  ev=POLLIN|POLLHUP rev=0

16327:  timeout: 59.99900 sec

16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1

16327:  fd=39 ev=POLLIN|POLLHUP rev=0

16327:  fd=38 ev=POLLIN|POLLHUP rev=0

16327:  fd=34 ev=POLLIN|POLLHUP rev=0

16327:  fd=36 ev=POLLIN|POLLHUP rev=0

16327:  fd=37 ev=POLLIN|POLLHUP rev=POLLIN

16327:  fd=35 ev=POLLIN|POLLHUP rev=0

16327:  fd=33 ev=POLLIN|POLLHUP rev=0

16327:  fd=6  ev=POLLIN|POLLHUP rev=0

16327:  timeout: 59.99900 sec

16327: accept(37, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) = 41

16327:  AF_INET  name = X.X.X.X  port = 28986

16327: forkx(0)= 26942

16327: lwp_sigmask(SIG_SETMASK, 0x00011080, 0x, 0x,
0x) = 0xFFBFFEFF [0x]

16327: close(41)   = 0

16327: pollsys(0x0809B4D0, 8, 0xFEFFDF18, 0x)  = 1

16327:  fd=39 ev=POLLIN|POLLHUP rev=0

16327:  fd=38 ev=POLLIN|POLLHUP rev=0

16327:  fd=34 ev=POLLIN|POLLHUP rev=0

16327:  fd=36 ev=POLLIN|POLLHUP rev=0

16327:  fd=35 ev=POLLIN|POLLHUP rev=POLLIN

16327:  fd=33 ev=POLLIN|POLLHUP rev=0

16327:  fd=6  ev=POLLIN|POLLHUP rev=0

16327:  fd=37 ev=POLLIN|POLLHUP rev=0

16327:  timeout: 44.69600 sec

16327: accept(35, 0xFEFFDDCC, 0xFEFFDDB8, SOV_DEFAULT) Err#130
ECONNABORTED

16327: fstat(40, 0xFEFFD1A0)= 0

16327: d=0x0888 i=22079 m=002 l=0  u=0 g=0
rdev=0x02480248

16327:  at = Aug 12 17:30:31 BST 2013  [ 1376325031.0 ]

16327:  mt = Aug 12 17:30:31 BST 2013  [ 1376325031.0 ]

16327:  ct = Aug 12 17:30:31 BST 2013  [ 1376325031.0 ]

16327: bsz=8192  blks=0 fs=BADVFS

16327: time()   = 1376345330

16327: getpid()= 16327 [1]

16327: putmsg(40, 0xFEFFDB80, 0xFEFFDB90, 0)  = 0

16327:  ctl:  maxlen=24   len=24   buf=0xFEFFD230: 98 0
tFE\0\010\0..

16327:   98 0 tFE\0\010\0 XD2FFFE \CF iFE HFA\b\b1B\0\0\0

16327:  dat:  maxlen=1280 len=151  buf=0xFEFFD250:  A u g   1 2
2..

16327:A u g   1 2   2 3 : 0 8 : 5 0   s m b d [ 1 6 3 2 7 ] :   [ I
D

16327:  7 0 2 9 1 1   F A C I L I T Y _ A N D _ P R I O R I T Y ]
[

16327:2 0 1 3 / 0 8 / 1 2   2 3 : 0 8 : 5 0 . 0 4 3 5 8 2 , 0 ]


16327:. . / s o u r c e 3 / s m b d / s e r v e r . c : 5 5 6 ( s m
b

16327:d _ a c c e p t _ c o n n e c t i o n )\n\n\0

...

 

However hopefully that might be of use to someone -
unfortunately it does just look like a Solaris specific nasty.  There should
be no reason for the abort; everything is directly connected on uncongested
gigabit Ethernet, so it's unlikely to be TCP timeout related and I can't see
a client aborting the connection before even it's established (that might
happen if the clients, or possibly the server were desperately short of IP
resources but that isn't going to be the case either).  It doesn't appear to
be related to the number of connections either, so I can't immediately think
of any sensible reason for the connection to abort.

 

What might also be interesting to investigate (though
completely unrelated) is why there's an fstat(40, 0xFEFFD1A0) when writing
the message - descriptor 40 is the socket for the syslog() connection, so

[Samba] Odd Samba 4 (4.2.0pre1-GIT-b505111; actually only using client) behaviour #2 - accept: Software caused connection abort.

2013-08-12 Thread Tris Mabbs
Good day oh technical ones .

 

I was running Samba 4 (client only, not using it as a DC so
effectively running Samba 3 code from the Samba 4 tree) and, other than a
little Gotcha! regarding decoding Kerberos PACs, it was all working
perfectly.

Then recently I had to upgrade, to 4.2.0pre1-GIT-b505111
(I had to upgrade the OS on the server running Samba - 'twas OpenSolaris
and is now Solaris 11.1) so I recompiled it all up and installed afresh
(so no .tdbs from the previous installation or anything).

 

But here's a funny thing (#2).  The log file gets absolutely
ridiculous numbers of messages thus:

 

Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
22:45:01.731562,  0] ../source3/smbd/server.c:556(smbd_accept_connection)

Aug 12 22:45:01 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
Software caused connection abort

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
22:45:03.556423,  0] ../source3/smbd/server.c:556(smbd_accept_connection)

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
Software caused connection abort

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error] [2013/08/12
22:45:03.556688,  0] ../source3/smbd/server.c:556(smbd_accept_connection)

Aug 12 22:45:03 Gateway smbd[16327]: [ID 702911 daemon.error]   accept:
Software caused connection abort

 

And so on.  These will come in spurts; there won't be any
such messages for several minutes then a whole load will come along all at
once.  Rather like busses .

 

It doesn't seem to be affecting the operation so far as any
client is concerned.  Or rather it evidently will be having some effect, but
it's not a noticeable one.

 

However it is really irritating having the system log
getting filled up with all these messages!

 

Murphy's law, of course, states that trying to catch one of
these messages being created, so I can include a suitable system call trace
in this message, will be impossible - there will be no such messages logged
until the instant I click Send (at which point probably about half-a-dozen
will be logged all at once).  That does indeed seem to be the case - I've
now been trying to persuade one of these, normally very regularly occurring,
messages to be logged for about 20 minutes and still, stubbornly, nothing
continues to happen.

I will catch smbd in the act at some point though, and
when I do I'll follow-up with a system call trace to show exactly what is
happening when this message gets triggered.  It will, of course, be
something bizarrely Solaris specific (you didn't set the
SO_DONT_RANDOMLY_ABORT_CONNECTIONS socket() option, did you?  Tsk tsk tsk
.).

 

Cheers,

 

Tris.

-- 
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba