Hi Guozhang,

Thanks for the clarification. Regarding point 2, I think the key issue is not
how long users can wait at most inside their code, but whether the network
client can detect that a connection attempt will never complete and move on to
a healthy node. In the producer's Sender, even though selector.poll can time
out, the next poll still does not close the pending connection attempt and try
another healthy node.


In our test environment, QA shut down one of the leader nodes. The producer's
requests time out, so it closes that node's connection and then requests
metadata. But sometimes the node chosen for the metadata request is the very
node that was shut down. While connecting to that node to fetch metadata, the
connection is in the connecting phase: the network client marks the node's
state as CONNECTING, but because the node is down, the socket cannot detect
that the connection attempt is broken. Although selector.poll has a timeout
parameter, it does not close the connection, so the next time
"networkclient.maybeUpdate" runs, it checks isAnyNodeConnecting and, because a
node is still connecting, does not connect to any healthy node to fetch the
metadata. It takes several minutes before the client notices that the
connection attempt has timed out and tries another node.
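The stall described above can be illustrated with a deliberately simplified,
hypothetical model. MetadataUpdaterModel and NodeState are illustrative names,
not Kafka classes, and the FAILED state only loosely stands in for a node in
reconnect backoff. While any node is CONNECTING, the model refuses to start a
connect to another node, so one connect that never completes blocks the
metadata refresh.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, heavily simplified model of the behaviour described above.
// It is NOT Kafka's NetworkClient; all names here are illustrative.
enum NodeState { DISCONNECTED, CONNECTING, READY, FAILED }

class MetadataUpdaterModel {
    // Iteration order stands in for the client's "least loaded node" choice.
    private final Map<Integer, NodeState> nodes = new LinkedHashMap<>();

    void addNode(int id) { nodes.put(id, NodeState.DISCONNECTED); }
    void setState(int id, NodeState s) { nodes.put(id, s); }
    NodeState state(int id) { return nodes.get(id); }

    boolean isAnyNodeConnecting() {
        return nodes.containsValue(NodeState.CONNECTING);
    }

    // Returns the node a metadata request could be sent to right now, if any.
    Optional<Integer> maybeUpdate() {
        for (Map.Entry<Integer, NodeState> e : nodes.entrySet()) {
            if (e.getValue() == NodeState.READY) return Optional.of(e.getKey());
        }
        // The stall: one pending connect (possibly to a dead broker)
        // prevents any attempt to reach a different node.
        if (isAnyNodeConnecting()) return Optional.empty();
        for (Map.Entry<Integer, NodeState> e : nodes.entrySet()) {
            if (e.getValue() == NodeState.DISCONNECTED) {
                e.setValue(NodeState.CONNECTING); // start a connect; send on a later poll
                break;
            }
        }
        return Optional.empty();
    }
}
```

In this model, node 2 is only tried after node 1 leaves the CONNECTING state,
which is exactly what a connect timeout would force.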


So I want to add a connect.timeout parameter, so that the selector can detect
that a connection attempt has timed out and close the connection. It seems
that the timeout value currently passed to `selector.poll()` cannot do this.
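One possible shape for the proposed timeout, sketched under assumptions:
ConnectTimeoutTracker and its methods are hypothetical, and connectTimeoutMs
only mirrors the connect.timeout setting proposed in KIP-148, not an existing
Kafka config. The client records when each non-blocking connect starts; on
every poll it collects attempts older than the timeout so they can be closed,
and it caps the poll timeout at the nearest connect deadline so that
selector.poll() wakes up in time to enforce it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the proposed connect timeout.
// "connectTimeoutMs" mirrors the KIP's proposal; it is not an existing Kafka API.
class ConnectTimeoutTracker {
    private final long connectTimeoutMs;
    // node id -> time (ms) at which the non-blocking connect was started
    private final Map<String, Long> connectStartMs = new HashMap<>();

    ConnectTimeoutTracker(long connectTimeoutMs) {
        if (connectTimeoutMs <= 0) throw new IllegalArgumentException("timeout must be positive");
        this.connectTimeoutMs = connectTimeoutMs;
    }

    // Call right after a non-blocking connect has been issued but not finished.
    void connecting(String nodeId, long nowMs) { connectStartMs.put(nodeId, nowMs); }

    // Call when the connect finishes or the connection is closed for any reason.
    void done(String nodeId) { connectStartMs.remove(nodeId); }

    // Called on every poll: returns (and forgets) nodes whose connect attempt
    // has reached the timeout, so the caller can close them and try another node.
    List<String> expired(long nowMs) {
        List<String> result = new ArrayList<>();
        connectStartMs.entrySet().removeIf(e -> {
            boolean timedOut = nowMs - e.getValue() >= connectTimeoutMs;
            if (timedOut) result.add(e.getKey());
            return timedOut;
        });
        return result;
    }

    // How long poll() may block before the next connect deadline, capped at maxPollMs.
    long pollTimeout(long nowMs, long maxPollMs) {
        long timeout = maxPollMs;
        for (long start : connectStartMs.values()) {
            timeout = Math.min(timeout, Math.max(0, start + connectTimeoutMs - nowMs));
        }
        return timeout;
    }
}
```

In this sketch the caller is still responsible for closing the expired
channels and marking those nodes as disconnected before picking another node.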


Thanks,
David






------------------ Original Message ------------------
From: "Guozhang Wang";<wangg...@gmail.com>;
Sent: Tuesday, May 16, 2017 1:51
To: "dev@kafka.apache.org"<dev@kafka.apache.org>;

Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client



Hi David,

I may have been a bit confused before; let me clarify a few things:

1. As you mentioned, a client always first establishes a connection with a
broker node before it sends any request to it. After the connection is
established, it will either continuously send many requests (e.g. produce)
or just a single request (e.g. metadata) to the broker, so these two phases
are indeed different.

2. In the connected phase, connections.max.idle.ms is used to automatically
disconnect the socket if no requests have been sent or received during that
period. In the connecting phase, we always try to create the socket via
"socketChannel.connect" in a non-blocking call and then check whether the
connection has been established. All the callers of this function (in either
the producer or the consumer) have a timeout parameter, as in
`selector.poll()`; that timeout is either calculated from
metadata.expiration.time and the backoff for producer#sender, or passed in
directly from consumer#poll(timeout). So although there is no config that
controls this directly, users can still control the maximum time to wait
inside the code.
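To make point 2 concrete, here is a minimal sketch of the connecting phase
using plain java.nio rather than Kafka's own Selector/NetworkClient classes;
the class name NonBlockingConnect and the connectTimeoutMs parameter are
hypothetical. It shows that Selector.select(timeout) returning with no ready
keys does not abort a pending connect by itself: some caller-side deadline has
to close the channel explicitly.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;

// Minimal sketch with plain java.nio (not Kafka's Selector class).
class NonBlockingConnect {
    // Returns a connected channel, or null if the deadline passed first.
    static SocketChannel connect(InetSocketAddress addr, long connectTimeoutMs) throws IOException {
        SocketChannel ch = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            ch.configureBlocking(false);
            if (ch.connect(addr)) {
                return ch; // connected immediately (possible on loopback)
            }
            ch.register(selector, SelectionKey.OP_CONNECT);
            long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(connectTimeoutMs);
            while (true) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
                if (remainingMs <= 0) {
                    ch.close(); // nothing else closes the pending connect for us
                    return null;
                }
                selector.select(remainingMs); // may return 0 on timeout; the connect stays pending
                selector.selectedKeys().clear();
                if (ch.finishConnect()) { // throws IOException if the connect failed
                    return ch;
                }
            }
        } catch (IOException e) {
            ch.close();
            throw e;
        }
    }
}
```

A caller would treat a null return as a signal to try the next node.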

I originally thought your scenario was more about the connected phase, but now
I feel you are talking about the connecting phase. For that case, I still feel
that the current timeout value passed to `selector.poll()`, which is
controllable from user code, should be sufficient?


Guozhang




On Sun, May 14, 2017 at 2:37 AM, David <254479...@qq.com> wrote:

> Hi Guozhang,
>
>
> Sorry for the delay, and thanks for the question. These seem like two
> different parameters to me:
> connect.timeout.ms: only applies to the connecting phase; once the
> connection is established, this parameter is no longer used.
> connections.max.idle.ms: currently does not apply to the connecting phase
> (a connection is only added to the expiry manager when select returns
> readyKeys > 0); after the connection is established, it closes the
> connection if it stays idle for that period.
>
>
> Even if we changed connections.max.idle.ms to cover the connecting phase
> as well, we could not set it to a small value such as 5 seconds: the
> client may be busy sending messages to other nodes, and an otherwise idle
> connection would be dropped after 5 seconds. That is why the default
> value of connections.max.idle.ms is much larger. We should have two
> parameters, one for the connecting-phase behavior and one for the
> connected-phase behavior. What do you think?
>
>
> Thanks,
>
>
> David
>
>
>
>
> ------------------ Original Message ------------------
> From: "Guozhang Wang";<wangg...@gmail.com>;
> Sent: Saturday, May 6, 2017 7:52
> To: "dev@kafka.apache.org"<dev@kafka.apache.org>;
>
> Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
>
>
>
> Hello David,
>
> Thanks for the KIP. For the described issue, I'm wondering if it can be
> resolved by tuning CONNECTIONS_MAX_IDLE_MS_CONFIG
> (connections.max.idle.ms) on the client side? The default is 9 minutes.
>
>
> Guozhang
>
> On Tue, May 2, 2017 at 8:22 AM, David <254479...@qq.com> wrote:
>
> > Hi all,
> >
> > Currently, in our test environment, we found that after one of the
> > broker nodes crashes (reboot or OS crash), the client may still be
> > connecting to the crashed node to send a metadata request or another
> > request, and it takes several minutes to notice that the connection has
> > timed out and then try another node to send the request. As a result,
> > the client may remain unaware of the metadata change for several
> > minutes.
> >
> >
> > So I want to add a connect timeout on the client; please take a look at:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-148%3A+Add+a+connect+timeout+for+client
> >
> > Regards,
> >
> > David
>
>
>
>
> --
> -- Guozhang
>



-- 
-- Guozhang
