eh... didn't mean to send this with no subject :)

> Our software generates a whole lot of concurrent LDAP traffic right now
> and we started running into an issue where our child processes would hang
> forever waiting around for LDAP operations that had apparently hung on the
> server end.  This would happen with start_tls(), search(), and bind()
> calls alike.  I noticed that Net::LDAP passes the Timeout parameter passed
> to new() into the socket object, so I presume that the intention is for
> operations other than only connecting to time out (e.g. start_SSL()
> inherits the timeout, so $ldap->start_tls() would presumably time out in
> the same time that the initial connection would have).  However this was
> not happening.  We dug around debugged for days (repro was difficult) and
> found a couple of changes that would make things work a bit better.
>
> First, and maybe least controversially, the root issue we found was that
> process() winds up doing a can_read() call that potentially blocks
> forever.  Changing this is pretty simple:
>
> @@ -824,10 +824,11 @@ sub process {
>   my $ldap = shift;
>   my $what = shift;
>   my $sock = $ldap->socket or return LDAP_SERVER_DOWN;
> +  my $timeout = $sock->timeout;
>   my $sel = IO::Select->new($sock);
>   my $ready;
>
> -  for( $ready = 1 ; $ready ; $ready = $sel->can_read(0)) {
> +  for( $ready = 1 ; $ready ; $ready = $sel->can_read( $timeout )) {
>     my $pdu;
>     asn_read($sock, $pdu)
>       or return _drop_conn($ldap, LDAP_OPERATIONS_ERROR, "Communications
> Error");
>
>
> We tested that patch and it helped to mitigate this issue -- although not
> perfectly, given that multiple process() calls can happen for a given LDAP
> operation, so the effective timeout can be nudged upward if a lot of
> process() calls hang for a long time but don't quite reach the timeout.
> Still, it's better than nothing.  If one were to be a bit pedantic, one
> could set some hash key for the life of an external call to indicate how
> long we've been waiting, and then in process(), do something like:
>
> ( $ldap->{waited} < $timeout ? $sel->can_read( $timeout - $ldap->{waited}
> ) : () )
>
> That would probably require more robust testing though.
>
> Something that actually worked a bit better for us, but which it seems may
> be less portable, was to set SO_SNDTIMEO and SO_RCVTIMEO on the underlying
> socket right after establishing the connection:
>
> @@ -117,6 +117,15 @@ sub new {
>
>    return undef unless $obj->{net_ldap_socket};
>
> +  if ( $arg->{timeout} and IO::Socket->can('SO_SNDTIMEO') ) {
> +    my $timeval = pack 'L!L!', $ldap_opts->{timeout}, 0;
> +    eval {
> +      $obj->{net_ldap_socket}->sockopt( SO_SNDTIMEO, $timeval ) or die
> $!;
> +      $obj->{net_ldap_socket}->sockopt( SO_RCVTIMEO, $timeval ) or die
> $!;
> +    };
> +    warn "Couldn't set socket timeout:$@\n";
> +  }
> +
>    $obj->{net_ldap_resp}    = {};
>    $obj->{net_ldap_version} = $arg->{version} || $LDAP_VERSION;
>    $obj->{net_ldap_async}   = $arg->{async} ? 1 : 0;
>
> I don't have any platforms readily available to test on that don't support
> SO_[SND|RCV]TIMEO, so the can() call there is only a guess at how to make
> sure we only bother with this on platforms that support it.  I can say
> from my testing that it looks like even if the sockopt() call fails,
> Net::LDAP soldiers on anyway, so the can() thing may not be necessary at
> all.  At any rate, our temporary solution for this issue, since we didn't
> want to run a locally modified LDAP.pm everywhere, was to do that
> sockopt() magic on $ldap->socket right after calling Net::LDAP->new.  That
> is well tested, but I have not tested the patch above, though it is very
> similar.  This also has the advantage of guarding against timeouts in
> syswrite() calls -- I actually wrote it when I still thought syswrite()
> was the culprit and not can_read().
>
> For more information on the sockopt() business, see, eh, google, and
> especially: http://perldesignpatterns.com/?SocketProgramming
>
> Finally, another change that partially mitigated the issue before we found
> our solution, and we found to be a good idea in general, was to add the
> 'timelimit' param equal to our timeout to the parameters we passed to
> search().  Another possibly controversial improvement to Net::LDAP
> timeouts would be to always add this parameter automatically from within
> search(), if a timeout is set and the parameter was not already passed.
> After all, why let the remote server potentially continue churning their
> answer if we've already given up on them?
>
> We've solved this problem for our own software but would love to see any
> or all of our fixes incorporated upstream, so I would welcome any feedback
> or suggestions on submitting these more suitably as useful patches.
>
> -Jared
>
>


Reply via email to