To pile on here, having the nodes of a single Solr collection
spread across widely separated data centers is currently
an anti-pattern. As Shawn says, there are some things
that can help, but there is a lot of cross-chatter
amongst Solr nodes. The basic pattern is:

> A packet of documents comes in to a Solr node
> The node forwards the docs to the appropriate leader
    (in packets of 10, as Shawn mentioned)
> The leader forwards the docs to each replica in that shard
> The replicas ack back to the leader
> The leader acks back to the originating node
> The originating node acks back to the client

You can see that there are a ton of messages flying back
and forth, some of them across the slow pipe.
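
For reference, the client side of that flow in SolrJ looks
something like the sketch below (the URL, collection, and field
names are made up). The single add() call sets off the whole
forward/ack sequence above before it returns:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SendBatch {
    public static void main(String[] args) throws Exception {
        // Point at any node in the cluster; that node becomes the
        // "originating node" in the sequence above. URL is a placeholder.
        HttpSolrClient client =
            new HttpSolrClient("http://uk-solr-1:8983/solr/mycollection");

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            batch.add(doc);
        }

        // One client call, but internally the node re-packetizes,
        // forwards to the leader(s), and waits for every replica's
        // ack before this add() returns.
        client.add(batch);
        client.commit();
        client.close();
    }
}

(CloudSolrClient would route the docs straight to the shard leaders
and skip the first hop, but the leader -> replica -> ack portion
still happens on the cluster side.)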

There is some work being done on "rack awareness" that will
help ameliorate these issues, but it won't help with the
latency across the ocean.

Currently, one approach is to run two completely distinct
SolrClouds that don't know about each other at all. Then
either

1> the system-of-record ensures replication across the DCs,
and a separate ingestion process in each DC indexes from it

or

2> the client sends the original indexing request to both DCs
(sketched below)
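
A minimal sketch of 2> with SolrJ, assuming two independent
ZooKeeper ensembles; the addresses and collection name are made
up, and error handling is left out:

import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualDcIndexer {
    private final SolrClient ukCloud;
    private final SolrClient auCloud;

    public DualDcIndexer() {
        // Two completely independent SolrClouds, each with its own
        // ZooKeeper ensemble. Addresses and collection are placeholders.
        CloudSolrClient uk = new CloudSolrClient("uk-zk1:2181,uk-zk2:2181,uk-zk3:2181");
        uk.setDefaultCollection("mycollection");
        CloudSolrClient au = new CloudSolrClient("au-zk1:2181,au-zk2:2181,au-zk3:2181");
        au.setDefaultCollection("mycollection");
        this.ukCloud = uk;
        this.auCloud = au;
    }

    public void index(List<SolrInputDocument> batch) throws Exception {
        // Send the same batch to both clusters. A real setup needs
        // per-DC failure handling (queue and replay if one DC is down),
        // which is exactly what the system-of-record in 1> buys you.
        ukCloud.add(batch);
        auCloud.add(batch);
    }
}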

Best,
Erick

On Mon, Aug 31, 2015 at 4:26 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 8/31/2015 7:23 AM, Maulin Rathod wrote:
>> We are using SolrCloud 5.2 with 1 shard (in a UK data center) and 1 replica
>> (in an Australia data center). We have observed that data inserted/updated in
>> the shard (UK data center) replicates very slowly to the replica in the
>> Australia data center, due to the high latency between the UK and Australia.
>> We are looking to improve the speed of data replication from the shard to the
>> replica. Can we use some sort of compression before sending data to the
>> replica? Is any other alternative available to improve replication speed from
>> the shard to the replica?
>
> SolrCloud replicates data differently than many people expect,
> especially if they are familiar with how replication worked prior to
> SolrCloud's introduction in Solr 4.0.  The original document is sent to
> all replicas and each one indexes it independently.  This is HTTP
> traffic, containing the document data after the initial update
> processors are finished with it.
>
> TCP connections across international lines, and oceans in particular,
> are slow because of the high latency.  The time it takes light to cover
> the physical distance is one problem, but international links usually
> involve a number of additional routers, which also slow things down.  My
> employer has been dealing with this problem for years when copying files
> from one location to another.  One of the things available to help with
> it is a modern TCP stack that scales the TCP window effectively,
> so fewer acknowledgements are required.
>
> If you are running Solr on Linux machines with any recent kernel
> version (2.6 definitely qualifies, but I think 2.4 does as well),
> and you haven't turned on SYN cookies or explicitly disabled the
> scaling, you should be automatically scaling your TCP window.  If you
> are on Windows Server 2008 or Windows 7 (or versions later than these)
> and haven't poked around in the TCP tuning options, then you would also
> be OK.  If either end of the communication is Windows XP, Server 2003,
> or an older version of Windows, you're out of luck and will need to
> upgrade the operating system.
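> 
> If you want to double-check a Linux box, the scaling switch shows
> up as a proc entry; a quick way to read it from Java (the path is
> the standard Linux one, and "1" means window scaling is enabled):
> 
> import java.nio.file.Files;
> import java.nio.file.Paths;
> 
> public class CheckWindowScaling {
>     public static void main(String[] args) throws Exception {
>         // Linux-only: mirrors the net.ipv4.tcp_window_scaling sysctl.
>         String value = new String(Files.readAllBytes(
>                 Paths.get("/proc/sys/net/ipv4/tcp_window_scaling"))).trim();
>         System.out.println("net.ipv4.tcp_window_scaling = " + value);
>     }
> }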
>
> The requests involved in SolrCloud indexing may be too short-lived to
> benefit much from scaling, though.  Window scaling typically only helps
> when the TCP connection lives for more than a few seconds, like an FTP
> data transfer.  Each individual inter-server indexing request is likely
> only transmitting 10 documents.
>
> Even when TCP window scaling is present, if there is *ANY* packet loss
> anywhere in a high-latency path, transfer speed will drop dramatically.
> In the lab, I built a setup emulating our connection to our UK
> office.  Even with 130 milliseconds of round-trip latency added by the
> Linux router impersonating the Internet, transfer speeds of photo-sized
> files on a modern TCP stack were good ... until I also introduced packet
> loss.  Transfer speeds were BADLY affected by even one tenth of one
> percent packet loss, which is the lowest amount I tested.
>
> SolrCloud is highly optimized for the way it is usually installed -- on
> multiple machines connected together with one or more LAN switches.
> This is why it uses lots of little connections.  The new cross-data
> center replication (CDCR) feature is an attempt to better utilize
> high-latency WAN links.
>
> In Solr 5.x, the web server is more firmly under the control of the Solr
> development team, so compression and other improvements may be possible,
> but latency is the major problem here, not a lack of features.  I'm not
> sure whether the number of documents per update (currently defaulting to
> 10) is configurable, but with a modern TCP stack, increasing that number
> could make the transfer more efficient, assuming the communication link
> is clean.
>
> Thanks,
> Shawn
>
