The Solr application I'm working on has many concurrently active cores - on
the order of thousands at a time.

The management application depends on being able to query Solr for the
current set of live cores, a requirement I've been satisfying using the
STATUS action of the CoreAdmin handler.
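
Concretely, the query I'm issuing looks something like the following; the
host name and port are placeholders for my setup:

    import urllib.request

    # Ask Solr's CoreAdmin handler for the status of every core.
    # "solr-host" and 8983 stand in for my actual server details.
    url = "http://solr-host:8983/solr/admin/cores?action=STATUS"
    status_xml = urllib.request.urlopen(url).read()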

However, once the number of active cores reaches a particular threshold
(which I haven't determined exactly), the response to the STATUS method is
truncated, resulting in malformed XML.

My debugging so far has revealed (a sketch of the check I'm running follows
this list):

   - when doing STATUS queries from the local machine, they succeed,
   untruncated, >90% of the time
   - when local STATUS queries do fail, they are always truncated to the
   same length: 73685 bytes in my case
   - when doing STATUS queries from a remote machine, they fail due to
   truncation every time
   - remote STATUS queries are always truncated to the same length: 24704
   bytes in my case
   - the failing STATUS queries take visibly longer to complete on the
   client - a few seconds for a truncated result versus <1 second for an
   untruncated result
   - all STATUS queries return a successful 200 HTTP code
   - all STATUS queries are logged as returning in ~700ms in Solr's info log
   - during failing (truncated) responses, Solr's CPU usage spikes to
   saturation
   - behaviour seems the same whatever client I use: wget, curl, Python, ...
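
For reference, a Python version of the check I'm running looks roughly like
this; host and port are again placeholders, and the stdlib calls stand in
for the various clients I tried:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder host/port; same CoreAdmin STATUS URL as above.
    URL = "http://solr-host:8983/solr/admin/cores?action=STATUS"

    for _ in range(20):
        with urllib.request.urlopen(URL, timeout=60) as resp:
            code = resp.status
            body = resp.read()
        try:
            ET.fromstring(body)  # raises ParseError if truncated/malformed
            verdict = "well-formed"
        except ET.ParseError:
            verdict = "TRUNCATED"
        # e.g. "200 73685 bytes TRUNCATED" for a failing local response
        print(code, len(body), "bytes", verdict)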

Using Solr 1.3.0 (build 694707) with Jetty 6.1.3.

At the moment, the main puzzle for me is why the local and remote behaviour
differs so much. That points towards network transmission speed, but the
response really isn't that big (~1MB untruncated), and the CPU spike makes
me suspect that serialising the core information is taking too long and
triggering a timeout somewhere.

Any suggestions on settings to tweak, ways to get extra debug information,
or other ways to ascertain the active core list would be much appreciated!

James
