Ok, thanks for the answer. The leader election and update notification sound like they should work using ZooKeeper (the leader election recipe and a normal watch), but I guess there are details that make things more complicated.

On 09.12.2017 20:19, Erick Erickson wrote:
This has been bandied about on a number of occasions, it boils down to
nobody has stepped up to make it happen. It turns out there are a
number of tricky issues:

- how does leadership change if the leader goes down?
- the raw complexity of getting it right; getting it wrong corrupts indexes
- how do you resolve leadership in the first place so only the leader writes to the index?
- how would performance be affected if N replicas were autowarming at the same time, thus all reading from HDFS?
- how do the read-only replicas know to open a new searcher?

I'm sure there are a bunch more.
So this is one of those things that everyone agrees is interesting, but that nobody is willing to code, and it's not actually clear that it makes sense in the Solr context. It'd be a pity to put in all the work and then discover that performance issues prohibit using it.

If you can _guarantee_ that the index doesn't change, there's the NoLockFactory you could specify. That would allow you to share a common index; woe be unto you if you start updating the index, though.
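For reference, the lock factory is chosen via the <lockType> setting in solrconfig.xml, where a value of "none" maps to NoLockFactory. A minimal config sketch (the surrounding indexConfig element is standard; use only with a guaranteed read-only index, per the caveat above):

```xml
<!-- solrconfig.xml: lockType "none" selects NoLockFactory.
     Only safe if nothing ever writes to this index. -->
<indexConfig>
  <lockType>none</lockType>
</indexConfig>
```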

Best,
Erick

On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp <hendrik.hadd...@gmx.net> wrote:
Hi,

for the HDFS case, wouldn't it be nice if there was a mode in which the replicas just read the same index files as the leader? After all, the data is already on a shared, readable file system, so why would one even need to replicate the transaction log files?

regards,
Hendrik


On 08.12.2017 21:07, Erick Erickson wrote:
bq: Will TLOG replicas use less network bandwidth?

No, probably more bandwidth. TLOG replicas work like this:
1> the raw docs are forwarded
2> the old-style master/slave replication is used

So what you do save is CPU processing on the TLOG replica in exchange
for increased bandwidth.

Since the only thing forwarded to NRT replicas (outside of recovery) is the raw documents, I expect that TLOG replicas would _increase_ network usage. The deal is that TLOG replicas can take over leadership if the leader goes down, so they must have an up-to-date-after-last-index-sync set of tlogs.
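For context (not from this thread): since Solr 7.0 the replica type is selected through the Collections API. A sketch of the two relevant calls; the host, collection, and shard names are placeholders, and tlogReplicas / type=TLOG are the parameters that matter:

```
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&tlogReplicas=2&nrtReplicas=0

http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&type=TLOG
```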

At least that's my current understanding...

Best,
Erick

On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Anyone have any thoughts on this?  Will TLOG replicas use less network
bandwidth?

-Joe


On 12/4/2017 12:54 PM, Joe Obernberger wrote:
Hi All - this same problem happened again, and I think I partially understand what is going on.  The part I don't know is what caused any of the replicas to go into full recovery in the first place, but once they do, they cause network interfaces on servers to become fully utilized in both the in and out directions.  It appears that when a Solr replica needs to recover, it calls on the leader for all the data.  In HDFS, the data from the leader's point of view goes:

HDFS --> Solr Leader Process --> Network --> Replica Solr Process --> HDFS

Do I have this correct?  That poor network in the middle becomes a bottleneck and causes other replicas to go into recovery, which causes more network traffic.  Perhaps going to TLOG replicas with 7.1 would be better with HDFS?  Would it be possible for the leader to send a message to the replica telling it to get the data straight from HDFS instead of going from one Solr process to another?  HDFS would be better able to spread the load across the cluster, since each block has 3x replicas.  Perhaps there is a better way to handle replicas with a shared file system.

Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG replicas.  Good idea?  Thank you!

-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:
Hmm. This is quite possible. Any time things take "too long" it can be a problem. For instance, if the leader sends docs to a replica and the request times out, the leader throws the follower into "Leader Initiated Recovery". The smoking gun here is that there are no errors on the follower, just the notification that the leader put it into recovery.

There are other variations on the theme; it all boils down to this: when communications fall apart, replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Hi Shawn - thank you for your reply.  The index is 29.9 TBytes as reported by:

hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9 TBytes is due to HDFS having 3x replication.  There are about 1.1 billion documents indexed and we index about 2.5 million documents per day.  Assuming an even distribution, each node is handling about 680 GBytes of index, so our cache size is about 1.4%.  Perhaps 'relatively small block cache' was an understatement!  This is why we split the largest collection into two, where one is data going back 30 days, and the other is all the data.  Most of our searches go no further than 30 days back.  The 30-day index is 2.6 TBytes total.  I don't know how the HDFS block cache splits between collections, but the 30-day index performs acceptably for our specific application.

If we wanted to cache 50% of the index, each of our 45 nodes would need a block cache of about 350 GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was that one of our servers died.  This caused HDFS to start replicating blocks across the cluster, which produced a lot of network activity.  When this happened, I believe there was high network contention for specific nodes in the cluster; their network interfaces became pegged and requests for HDFS blocks timed out.  When that happened, SolrCloud went into recovery, which caused more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:
On 11/22/2017 6:44 AM, Joe Obernberger wrote:
Right now, we have a relatively small block cache due to the requirement that the servers run other software.  We tried to find the best balance between block cache size and RAM for programs, while still giving enough for the local FS cache.  This came out to be 84 128M blocks - or about 10G for the cache per node (45 nodes total).
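For readers wondering where those numbers live: the HDFS block cache is sized in solrconfig.xml as slab.count slabs of 128 MB each (a slab being blocksperbank 8 KB blocks). A sketch matching the setup described above; the solr.hdfs.home path is a placeholder:

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <!-- 84 slabs x 128 MB = ~10.5 GB of block cache per node -->
  <int name="solr.hdfs.blockcache.slab.count">84</int>
  <!-- 16384 blocks x 8 KB = one 128 MB slab -->
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
</directoryFactory>
```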
How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is *enormous*.  You have also said that the index takes 93TB of disk space.  If the data is distributed somewhat evenly, then the answer to my question would be that each of those 45 Solr servers would be handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the cache and Solr must actually wait for disk and/or network I/O, the resulting performance usually isn't very good.  In most cases you don't need to have enough memory to fully cache the index data ... but less than half a percent is not going to be enough.
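The "less than half a percent" figure follows from the replicated on-disk size; checking the arithmetic (numbers from this message, Python as a calculator):

```python
# Check of the per-node and cache-ratio figures above.
disk_tb = 93.0     # on-disk size including HDFS 3x replication
nodes = 45
cache_gb = 10.0    # block cache per node

per_node_tb = disk_tb / nodes              # ~2.07 TB per server ("over 2TB")
ratio = cache_gb / (per_node_tb * 1024)    # ~0.47% -- "less than half a percent"

print(f"per node: {per_node_tb:.2f} TB, cache covers {ratio:.2%}")
```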

Thanks,
Shawn



