Hello,

We're seeing the following error in Riak/Yokozuna:

2016-04-11 19:36:18.803 [error] <0.23120.8>@yz_pb_search:maybe_process:84 "Failed to determine Solr port for all nodes in search plan" [{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,448}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},{lager_trunc_io,print,3,[{file,"src/lager_trunc_io.erl"},{line,168}]}]

This is a 7-node cluster running the 2.1.3 RPM on CentOS 7 in Google Cloud, on 16-CPU/60GB RAM VMs. The nodes are configured with LevelDB tiered storage: a 500GB SSD for the first four tiers and a 2TB magnetic disk for the remainder. IOPS/throughput are not an issue for our application.
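
For reference, the tiered storage portion of riak.conf looks roughly like this (a sketch from memory; the paths are placeholders for our actual mount points):

storage_backend = leveldb
leveldb.tiered = 4
leveldb.tiered.path.fast = /mnt/ssd
leveldb.tiered.path.slow = /mnt/slow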

A uWSGI-based REST service sits in front of Riak and contains all of the application logic. The load-testing suite (Locust) feeds it binary data files, which the service processes and inserts into Riak; as part of that processing, Yokozuna indexes get searched.
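
For context, the write+search path in the service boils down to roughly the following (a minimal sketch using the official Riak Python client; the bucket, key, and index names here are placeholders, not our real ones):

import riak

# Minimal sketch, assuming the bucket already has a Yokozuna index attached.
client = riak.RiakClient(protocol='pbc', pb_port=8087)

bucket = client.bucket('ingest')            # placeholder bucket
obj = bucket.new('some-key', data={'name_s': 'example'})
obj.store()

# Search goes over protobuf, which I believe is the yz_pb_search path
# the error above is coming from.
results = client.fulltext_search('ingest_index', 'name_s:example')
print(results['num_found'])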

We find that about 40 minutes to an hour into load testing we start seeing the above error logged (which Locust sees as 500s). It correlates with the Search Query Fail Count stat, which we graph in Zabbix: the count keeps growing, and after roughly an hour of load testing it starts to curve upward sharply.

In riak.conf we have:

search = on
search.solr.start_timeout = 120s
search.solr.port = 8093
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache -XX:+UseCompressedOops

and we are using java-1.7.0-openjdk-1.7.0.99-2.6.5.0.el7_2.x86_64 from the CentOS repos. I've been graphing JMX stats with Zabbix and nothing looks untoward: the heap gradually climbs but never skyrockets, and it certainly doesn't come close to the 16GB cap (it barely gets above 3GB before things really go south). jconsole shows the same numbers, with gradually increasing cumulative garbage collection time (last recorded was "23.751 seconds on PS Scavenge (640 collections)"), although it's hard to tell from that whether there are any long GC pauses.

We graph a number of additional stats in Zabbix, and the boxes in the cluster never come close to maxing out CPU or running out of RAM.

I googled around and couldn't find any reference to the logged error. Does it have to do with Solr having a problem contacting other nodes in the cluster, or is it some kind of node lookup issue?
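
For what it's worth, when the errors start I've been probing each node's Solr directly on search.solr.port to see whether any of them stops responding, with something like the sketch below (hostnames and the index name are placeholders, and I'm assuming the internal_solr path is right for Yokozuna's embedded Solr):

import requests

NODES = ['riak1', 'riak2', 'riak3', 'riak4', 'riak5', 'riak6', 'riak7']  # placeholders
INDEX = 'ingest_index'  # placeholder

for host in NODES:
    # Hit Solr on the search.solr.port from riak.conf (8093) for each node.
    url = 'http://{0}:8093/internal_solr/{1}/select'.format(host, INDEX)
    try:
        r = requests.get(url, params={'q': '*:*', 'rows': 0, 'wt': 'json'}, timeout=5)
        print('{0}: HTTP {1} in {2:.3f}s'.format(host, r.status_code, r.elapsed.total_seconds()))
    except requests.RequestException as exc:
        print('{0}: FAILED ({1})'.format(host, exc))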

--
Jim Raney

