Hello,
We're seeing the following error in riak/yokozuna:
2016-04-11 19:36:18.803 [error]
<0.23120.8>@yz_pb_search:maybe_process:84 "Failed to determine Solr port
for all nodes in search plan"
[{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,448}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},{lager_trunc_io,print,3,[{file,"src/lager_trunc_io.erl"},{line,168}]}]
This is a 7-node cluster running the 2.1.3 RPM on CentOS 7, in Google
Cloud, on 16-CPU/60GB-RAM VMs. The nodes are configured with LevelDB,
using a 500GB SSD for the first four tiers and a 2TB magnetic disk for
the remainder. IOPS/throughput are not an issue for our application.
A uWSGI-based REST service that contains all of the application logic
sits in front of riak. The testing suite (locust) loads binary data
files that the uWSGI service processes and inserts into riak; as part of
that processing, yokozuna indexes get searched.
We find that roughly 40 minutes to an hour into load testing we start
seeing the above error logged (leading to 500s from locust's
perspective). It corresponds with Search Query Fail Count, which we
graph with zabbix. The count grows steadily over time, and after about
an hour of load testing it starts to curve upward sharply.
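For reference, this is roughly how we scrape that counter for zabbix (a
minimal sketch; it assumes the stat shows up as search_query_fail_count
in `riak-admin status` output, in the usual "name : value" form):

```python
import re

def search_stats(status_text):
    """Extract search_query_* counters from `riak-admin status` output."""
    stats = {}
    for line in status_text.splitlines():
        # Lines look like: "search_query_fail_count : 17"
        m = re.match(r"(search_query_\w+)\s*:\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

sample = """search_query_fail_count : 17
search_query_throughput_count : 5200"""
print(search_stats(sample))
```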
In riak.conf we have:
search = on
search.solr.start_timeout = 120s
search.solr.port = 8093
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache -XX:+UseCompressedOops
and we are using java-1.7.0-openjdk-1.7.0.99-2.6.5.0.el7_2.x86_64 from
the CentOS repos. I've been graphing JMX stats with zabbix and nothing
looks untoward: the heap gradually climbs but never skyrockets, and it
certainly doesn't come close to the 16GB cap (it barely gets above 3GB
before things really go south). jconsole shows the same numbers, with a
gradually increasing cumulative garbage-collection time (the last
recorded was "23.751 seconds on PS Scavenge (640 collections)"),
although it's hard to tell from that whether there are any long GC
pauses.
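One thing I may try next is turning on GC logging for the Solr JVM so
pauses show up explicitly; something like this in riak.conf (standard
HotSpot flags on OpenJDK 7, appended to our existing options; the log
path is just where I'd put it):

  search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/riak/solr_gc.log

-XX:+PrintGCApplicationStoppedTime in particular should make any long
stop-the-world pauses obvious in the log.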
We graph a bunch of additional stats in zabbix, and the boxes in the
cluster never get close to capping out CPU or running out of RAM.
I googled around and couldn't find any reference to the logged error.
Does it mean Solr had a problem contacting other nodes in the cluster,
or is it some kind of node-lookup issue?
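If it helps to rule out basic connectivity, is something like this the
right way to verify Solr is answering on each node? (The hostnames and
the index name INDEX are placeholders, and the /internal_solr path is my
assumption from the yokozuna docs.)

```shell
# Placeholder hostnames -- substitute the seven cluster nodes.
for host in riak-01 riak-02 riak-03; do
  # Run a zero-row query against Solr; print the HTTP status per host.
  curl -s -o /dev/null -w "%{http_code} ${host}\n" \
    "http://${host}:8093/internal_solr/INDEX/select?q=*:*&rows=0&wt=json"
done
```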
--
Jim Raney
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com