Thanks, Steve, for your tips.

Yes, we found many SACKs in the packet sequences of the problematic connections 
and observed intermittent network jitter in between. That explains the 
behavior seen in our setup.
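
In case it helps anyone hitting the same symptom, here is a minimal sketch (not 
what we actually ran) for counting segments that carry SACK blocks in a plain-text 
tcpdump capture. The file name is a placeholder, and it assumes the capture was 
printed with TCP options visible (tcpdump shows them as e.g. "sack 1 {...:...}"):

# count_sacks.py - rough sketch: count tcpdump text lines that mention a SACK option
import sys

def count_sack_lines(path):
    total, sacked = 0, 0
    with open(path, "r", errors="replace") as f:
        for line in f:
            total += 1
            # tcpdump prints SACK blocks inside the bracketed TCP options field
            if "sack" in line.lower():
                sacked += 1
    return total, sacked

if __name__ == "__main__":
    total, sacked = count_sack_lines(sys.argv[1])
    print(f"{sacked} of {total} lines mention SACK")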

Regards,
Jeff                                        

On 12/7/17, 7:45 AM, "Steve Miller" <st...@idrathernotsay.com> wrote:

    This kind of sounds to me like there’s packet loss somewhere and TCP is 
closing the window to try to limit congestion.  But from the snippets you 
posted, I didn’t see any sacks in the tcpdump output.  If there *are* sacks, 
that’d be a strong indicator of loss somewhere, whether it’s in the network or 
it’s in some host that’s being overwhelmed.
    
    I didn’t have a chance to do the header math to see if TCP’s advertising a 
small window in the lossy case you posted.  But I figured I’d mention this just 
in case it’s useful.
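
For reference, the "header math" here is just the raw 16-bit window field shifted 
left by the window-scale value negotiated in the SYN/SYN-ACK. A tiny sketch, with 
made-up field values purely for illustration:

# effective receive window = raw window field * 2^window_scale
raw_window = 501      # the "win" value tcpdump prints on a data/ack packet (example value)
window_scale = 9      # the "wscale" advertised in the SYN/SYN-ACK options (example value)
effective_window = raw_window << window_scale
print(f"advertised window ~ {effective_window} bytes ({effective_window / 1024:.1f} KiB)")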
    
        -Steve
    
    > On Dec 6, 2017, at 5:27 PM, tao xiao <xiaotao...@gmail.com> wrote:
    > 
    > MirrorMaker is placed close to the target cluster, and the send/receive
    > buffer size is set to 10MB, which is the result of the bandwidth-delay
    > product. The OS-level TCP buffer maximum has also been increased to 16MB.
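
For what it's worth, a quick sanity check of that bandwidth-delay product; the 
link speed and RTT below are assumptions for illustration, not measured values 
from this setup:

# bandwidth-delay product: bytes that must be "in flight" to keep the pipe full
link_mbps = 800   # assumed cross-DC bandwidth available to one connection, in Mbit/s
rtt_ms = 100      # assumed round-trip time between datacenters, in ms
bdp_bytes = (link_mbps * 1_000_000 / 8) * (rtt_ms / 1000)
print(f"BDP ~ {bdp_bytes / (1024 * 1024):.1f} MiB")  # roughly the 10MB buffer mentioned above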
    > 
    >> On Wed, 6 Dec 2017 at 15:19 Jan Filipiak <jan.filip...@trivago.com> 
wrote:
    >> 
    >> Hi,
    >> 
    >> Two questions: Is your MirrorMaker colocated with the source or the
    >> target? What are the send and receive buffer sizes on the connections
    >> that span across the WAN?
    >> 
    >> Hope we can get you some help.
    >> 
    >> Best jan
    >> 
    >> 
    >> 
    >>> On 06.12.2017 14:36, Xu, Zhaohui wrote:
    >>> Any update on this issue?
    >>> 
    >>> We also ran into a similar situation recently. MirrorMaker is used to
    >>> replicate messages between clusters in different datacenters, but
    >>> sometimes a portion of the partitions show high consumer lag, and
    >>> tcpdump shows a similar packet delivery pattern. The behavior is odd
    >>> and not self-explanatory. Could it have anything to do with the number
    >>> of consumers being too large? In our case, we have around 100 consumer
    >>> connections per broker.
    >>> 
    >>> Regards,
    >>> Jeff
    >>> 
    >>> On 12/5/17, 10:14 AM, "tao xiao" <xiaotao...@gmail.com> wrote:
    >>> 
    >>>     Hi,
    >>> 
    >>>     Any pointers would be highly appreciated.
    >>> 
    >>>>     On Thu, 30 Nov 2017 at 14:56 tao xiao <xiaotao...@gmail.com> wrote:
    >>>> 
    >>>> Hi There,
    >>>> 
    >>>> 
    >>>> 
    >>>> We are running into a weird situation when using MirrorMaker to
    >>>> replicate messages between Kafka clusters across datacenters, and we
    >>>> are reaching out in case you have encountered this kind of problem
    >>>> before or have some insight into this kind of issue.
    >>>> 
    >>>> 
    >>>> 
    >>>> Here is the scenario. We have set up a deployment where we run 30
    >>>> MirrorMaker instances on 30 different nodes. Each MirrorMaker instance
    >>>> is configured with num.streams=1, so only one consumer runs. The
    >>>> topics to replicate are configured with 100 partitions, and data is
    >>>> almost evenly distributed across all partitions. After running for a
    >>>> period of time, something weird happened: some of the MirrorMaker
    >>>> instances seemed to slow down and consume at a relatively slow speed
    >>>> from the source Kafka cluster. The output of tcptrack shows the
    >>>> consume rate of the problematic instances dropped to ~1MB/s, while the
    >>>> other healthy instances consume at a rate of ~3MB/s. As a result, the
    >>>> consumer lag for the corresponding partitions keeps increasing.
    >>>> 
    >>>> 
    >>>> 
    >>>> 
    >>>> After capturing traffic with tcpdump, we noticed the traffic pattern
    >>>> on the TCP connections of the problematic MirrorMaker instances is
    >>>> very different from the others. Packets flowing on the problematic TCP
    >>>> connections are relatively small, and seq and ack packets basically
    >>>> alternate one after another. On the other hand, the packets on the
    >>>> healthy TCP connections follow a different pattern: several seq
    >>>> packets arrive per ack packet. The screenshots below show the
    >>>> situation; both captures were taken on the same MirrorMaker node.
    >>>> 
    >>>> 
    >>>> 
    >>>> Problematic connection (10.kfk.kfk.kfk is the Kafka broker,
    >>>> 10.mm.mm.mm is the MirrorMaker node):
    >>>> 
    >>>> https://imgur.com/Z3odjjT
    >>>> 
    >>>> 
    >>>> Healthy connection:
    >>>> 
    >>>> https://imgur.com/w0A6qHT
    >>>> 
    >>>> 
    >>>> If we stop the problematic MirrorMaker instance and other instances
    >>>> take over the lagged partitions, they consume messages quickly and
    >>>> catch up on the lag soon, so the broker in the source Kafka cluster
    >>>> appears to be fine. But if MirrorMaker itself causes the issue, how
    >>>> can one TCP connection be good while others are problematic, given
    >>>> that the connections are all established in the same manner by the
    >>>> Kafka client library?
    >>>> 
    >>>> 
    >>>> 
    >>>> The consumer configuration for the MirrorMaker instances is as below
    >>>> (a rough sketch of the same settings follows the list).
    >>>> 
    >>>> auto.offset.reset=earliest
    >>>> 
    >>>> 
    >>>> 
    >>>> partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
    >>>> 
    >>>> heartbeat.interval.ms=10000
    >>>> 
    >>>> session.timeout.ms=120000
    >>>> 
    >>>> request.timeout.ms=150000
    >>>> 
    >>>> receive.buffer.bytes=1048576
    >>>> 
    >>>> max.partition.fetch.bytes=2097152
    >>>> 
    >>>> fetch.min.bytes=1048576
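
For anyone who wants to reproduce the consumer side in isolation, here is a rough 
sketch of the same settings using kafka-python, purely for illustration (MirrorMaker 
itself is a JVM process); the topic name and bootstrap address are placeholders, and 
the round-robin assignor is omitted because kafka-python configures it differently:

from kafka import KafkaConsumer

# Mirrors the MirrorMaker consumer settings listed above.
consumer = KafkaConsumer(
    "replicated-topic",                     # placeholder topic name
    bootstrap_servers="source-kafka:9092",  # placeholder broker address
    auto_offset_reset="earliest",
    heartbeat_interval_ms=10000,
    session_timeout_ms=120000,
    request_timeout_ms=150000,
    receive_buffer_bytes=1048576,           # 1 MiB socket receive buffer
    max_partition_fetch_bytes=2097152,      # 2 MiB per partition per fetch
    fetch_min_bytes=1048576,                # broker holds the fetch until ~1 MiB is available
)

for record in consumer:
    pass  # process / replicate the record here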
    >>>> 
    >>>> 
    >>>> 
    >>>> The Kafka version is 0.10.0.0, and we run Kafka and MirrorMaker on
    >>>> Ubuntu 14.04.
    >>>> 
    >>>> 
    >>>> 
    >>>> Any response is appreciated.
    >>>> 
    >>>> Regards,
    >>>> 
    >>>> Tao
    >>>> 
    >>> 
    >>> 
    >> 
    >> 
    
    