On 12/13/2019 11:01 AM, Kojo wrote:
> We had already changed the OS configuration before the last crash, so I
> think that the problem is not there.
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 257683
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 65535
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
Are you running this ulimit command as the same user that is running
your Solr process? It must be the same user to learn anything useful.
This output indicates that the user that's running the ulimit command is
allowed to start 64K processes, which I would think should be enough.
Best guess here is that the actual user that's running Solr does *NOT*
have its limits increased. It may be a different user than you're using
to run the ulimit command.
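One way to take the guesswork out of this on Linux is to read the limits of the running process itself from /proc/<pid>/limits, rather than relying on ulimit in a login shell. Here is a minimal Python sketch, assuming a Linux system and that you already know the PID of the Solr process. The function names are my own for illustration and are not part of Solr or any library.

```python
import re
from pathlib import Path


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are padded with runs of spaces; limit names themselves
        # contain only single spaces, so split on two or more.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 3:
            limits[parts[0]] = (parts[1], parts[2])
    return limits


def read_process_limits(pid):
    """Read the limits that actually apply to a running process (Linux only)."""
    return parse_limits(Path(f"/proc/{pid}/limits").read_text())
```

With a hypothetical Solr PID of 12345, `read_process_limits(12345)["Max processes"]` returns the soft and hard values that apply to that process. If they are lower than what ulimit showed, the limits were raised for a different user or session than the one running Solr.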
> When Solr tries to delete a znode? I'm sorry, but I understand nothing
> about this process, and it is the only point that seems suspicious to me.
> Do you think that it can cause an inconsistency leading to the OOM problem?
OOME isn't caused by inconsistencies at the application level. It's a
low-level problem, an indication that Java tried to do something
required to run the program that it couldn't do.
I assume that it's Solr trying to delete the znode, because the node
path has solr in it. It will be the ZK client running inside Solr
that's actually trying to do the work, but Solr code probably initiated it.
> Just after the INFO message above, the ZK log starts to log thousands of
> blocks of lines like the one below. It seems that ZK creates and closes
> thousands of sessions.
I responded to this thread because I have some knowledge about Solr. I
really have no idea what these additional ZK server logs might mean.
The one that you quoted before was pretty straightforward, so I was able
to understand it.
Anything that gets logged after an OOME is suspect and may be useless.
The execution of a Java program after OOME is unpredictable, because
whatever was being run when the OOME was thrown did NOT successfully
execute.
Thanks,
Shawn