Re: Solr error message

2016-04-11 Thread Jim Raney



Fred,

Thanks for the quick response.  After you basically verified that it was a 
Solr timeout issue, I rebuilt the cluster with 14 nodes to see what would 
happen.  The amount of time it took for the query failures (and associated log 
entries) to appear basically doubled as well.

I -could- try increasing the hard-coded timeout, but I don't think that's the 
route we want to go, as this system is likely to have that much data or more 
being pushed into it, and long query times won't work.  I imagine there is 
probably some Solr tuning we can do - any ideas on what we could look at that 
we could pass through the Riak config?

I'm going to try the Oracle 1.8 JDK with it later and see if any GC tuning helps, 
in case there are long GC pauses.
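
I'll probably start by turning on GC logging through the existing 
search.solr.jvm_options setting so I can see individual pause times rather 
than just the cumulative JMX numbers - something along these lines (the added 
flags are standard HotSpot GC-logging options on Java 7, nothing Riak-specific, 
and the log path is just an example):

search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache -XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/riak/solr_gc.log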

--
Jim Raney
jim.ra...@physiq.com



Re: Solr error message

2016-04-11 Thread Fred Dushin

Hi Jim,

Interesting problem.

That error is occurring here:

https://github.com/basho/yokozuna/blob/2.1.2/src/yz_cover.erl#L275

because length(Mapping) and length(UniqNodes) are unequal:

https://github.com/basho/yokozuna/blob/2.1.2/src/yz_cover.erl#L262
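
Roughly speaking, the check amounts to something like this (a sketch, not the 
literal yz_cover.erl code, though the variable names match the linked lines):

%% If any node's Solr port lookup fails or times out, Mapping comes up
%% short relative to UniqNodes and the plan is rejected with the error
%% you are seeing.
case length(Mapping) =:= length(UniqNodes) of
    true  -> {ok, Mapping};
    false -> {error, "Failed to determine Solr port for all nodes in search plan"}
end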

This might be because you are getting timeouts trying to query the port on 
remote nodes:

https://github.com/basho/yokozuna/blob/2.1.2/src/yz_solr.erl#L324

As you can see, there is a hard-wired 1-second timeout on that RPC call, which 
could account for why you are seeing this failure partway into a load run.
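
In other words, the cover plan does something along these lines for each node 
(this is a sketch rather than the literal yokozuna code, and yz_solr:port/0 as 
the remote target is my reading of the link above):

-define(PORT_RPC_TIMEOUT, 1000).  %% the hard-wired 1 second, in milliseconds

get_solr_port(Node) ->
    %% rpc:call/5 returns {badrpc, timeout} if the remote node does not
    %% answer in time, which is what leaves the plan with fewer port
    %% mappings than unique nodes.
    case rpc:call(Node, yz_solr, port, [], ?PORT_RPC_TIMEOUT) of
        {badrpc, Reason} -> {error, Reason};
        Port             -> {ok, Port}
    end.

As a quick check during a load run, you could run 
rpc:multicall(nodes(), yz_solr, port, [], 1000). from the Riak console on one 
node and see whether any node shows up in the bad-nodes list.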

You might try rebuilding a version of this module with an increased timeout, to 
see if that gets you over the hump, or consider making the timeout configurable.
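
If you go the configurable route, one simple approach (purely a sketch - there 
is no such setting in yokozuna today, so the key name here is invented) is to 
read the timeout from the application environment, with the current value as 
the default:

%% Falls back to the existing 1-second timeout if the key is unset.
solr_port_rpc_timeout() ->
    case application:get_env(yokozuna, solr_port_rpc_timeout) of
        {ok, Timeout} -> Timeout;
        undefined     -> 1000
    end.

That kind of key could then be set per node through advanced.config.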

Riak 2.1.3 ships with Yokozuna 2.1.2, whose Git SHA is 
3520d11ec21ee08b7c18478fbbe1b61d7e3d8e0f, so you'd want to branch from that 
point in the tree, if you care to experiment.

If you rebuild the module, you can place the generated .beam file in the 
lib/basho-patches directory of each of your Riak installs and restart Riak (or 
manually reload the module on each node via the Riak console, if you need to 
keep your Riak nodes up and running).
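
For the hot-reload path, the console steps look roughly like this (assuming the 
rebuilt module is yz_solr, the node name is illustrative, and lib/basho-patches 
- which Riak puts on the code path at startup - takes precedence over the stock 
beam):

%% from `riak attach` on each node, after copying the new .beam into lib/basho-patches
(riak@node1)1> code:purge(yz_solr).
(riak@node1)2> code:load_file(yz_solr).
(riak@node1)3> code:which(yz_solr).  %% should now point at basho-patches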

Let us know what you find or if you need more assistance.

-Fred

> On Apr 11, 2016, at 4:11 PM, Jim Raney wrote:
> 
> Failed to determine Solr port for all nodes in search plan



Solr error message

2016-04-11 Thread Jim Raney

Hello,

We're seeing the following error in Riak/Yokozuna:

2016-04-11 19:36:18.803 [error] <0.23120.8>@yz_pb_search:maybe_process:84
"Failed to determine Solr port for all nodes in search plan"
[{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,448}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},
 {lager_trunc_io,print,3,[{file,"src/lager_trunc_io.erl"},{line,168}]}]


This is a 7-node cluster running the 2.1.3 RPM on CentOS 7, in Google 
Cloud, on 16-CPU/60GB RAM VMs.  They are configured with LevelDB, with 
a 500GB SSD disk for the first four tiers and a 2TB magnetic disk for the 
remainder.  IOPS/throughput are not an issue for our application.
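
For reference, the tiering is done with eleveldb's tiered storage settings in 
riak.conf, along these lines (the paths here are illustrative rather than our 
exact mount points, and the key names are worth double-checking against the 
tiered storage docs):

leveldb.tiered = 4
leveldb.tiered.path.fast = /mnt/ssd/riak
leveldb.tiered.path.slow = /mnt/hdd/riak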


There is a uWSGI-based REST service that sits in front of Riak and 
contains all of the application logic.  The testing suite (Locust) loads 
binary data files that the uWSGI service processes and inserts into 
Riak.  As part of that processing, Yokozuna indexes get searched.


We find that roughly 40 minutes to an hour into load testing we start seeing 
the above error logged (leading to 500s from Locust's perspective).  It 
corresponds with a rise in the Search Query Fail Count stat, which we graph 
with Zabbix.  Over time the number gets larger and larger, and after about an 
hour of load testing it starts to curve upwards sharply.


In riak.conf we have:

search = on
search.solr.start_timeout = 120s
search.solr.port = 8093
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache 
-XX:+UseCompressedOops


and we are using java-1.7.0-openjdk-1.7.0.99-2.6.5.0.el7_2.x86_64 from 
the CentOS repos.  I've been graphing JMX stats with Zabbix and nothing 
looks untoward: the heap gradually climbs in size but never skyrockets, 
and it certainly doesn't come close to the 16GB cap (it barely gets 
above 3GB before things really go south).  With jconsole I see the same 
numbers, with a gradually increasing cumulative time spent in garbage 
collection (last recorded was "23.751 seconds on PS Scavenge (640 
collections)"), although it's hard to tell whether there are any large 
pauses from GC.
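
That works out to an average of about 37 ms per collection (23.751 s / 640), 
so the cumulative number by itself won't show whether a handful of individual 
collections ran long.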


We graph a bunch of additional stats in Zabbix, and the boxes in the 
cluster never come close to maxing out CPU or running out of RAM.


I googled around and couldn't find any reference to the logged error.  
Does it have to do with Solr having a problem contacting other nodes in 
the cluster?  Or is it some kind of node lookup issue?


--
Jim Raney


___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com