I have tried to do some testing, but I cannot get a definitive answer about what
works and what doesn't, because there are so many variables. Also, it doesn't fail
right away but after 1h to 1.5h, so I've actually done fewer than a dozen tests.
I'm sorry that I can't be more precise.
The two PCs that I've got are an i3 4th-gen 2C4T with 8GB RAM, and an i7
4th-gen 2C4T with 16GB RAM. The database is stored in a 1TB USB3.0 SSD (which I
move between the two PCs). Either way, the only component that seems to make a
difference is the amount of RAM. They are physical machines (bare metal) and
not containers.
I'm running "fuseki-server" binary, downloaded from the binaries distribution,
"./fuseki-server --loc=database --port=7000 --localhost /query".
I query Fuseki (4.8) over HTTP from a script that runs on the same PC. The
script uses <100MB of RAM. All queries are "read" (no writes) and actually very
basic: either a "DESCRIBE <node>", or a "SELECT" that picks a node and follows a
couple of links, 3 or 4 levels deep at most. Fuseki answers them very quickly,
<5ms, but occasionally a query takes 50-100ms or, rarely, a couple of seconds
(probably because of garbage collection?). The queries are always the same, run
over and over across the dataset. The dataset contains approximately 200K
nodes, 2.5M triples, 4GB disk size. Most nodes have, among their properties, 3
that contain long strings, approximately 20KB-50KB combined, per node. I don't
query those properties directly, but when I "DESCRIBE" a node they are
retrieved. Fuseki is very fast, but these strings may contribute to the high
memory load (I'm only guessing). I cannot identify any particular query as the
offender, since it's the same bunch of queries run over and over at max speed
(i.e. one after the other, with no wait time); a sketch of the sort of requests
is below. It's pretty much the same amount of work for every query, and it works
perfectly fine until it consumes all the RAM and swap.
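To give an idea, each request is equivalent to something like the following (the
node and property URIs here are made-up placeholders, and my script doesn't
actually use curl, but the pattern is the same):

# a DESCRIBE of a single node (hypothetical URI)
curl -s 'http://localhost:7000/query' \
     --data-urlencode 'query=DESCRIBE <http://example.org/node/12345>'

# a SELECT that follows a couple of links from a node (hypothetical URIs)
curl -s 'http://localhost:7000/query' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=
       SELECT ?x ?y WHERE {
         <http://example.org/node/12345> <http://example.org/link1> ?x .
         ?x <http://example.org/link2> ?y .
       }'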
The only configuration that worked for me (i.e. it completed the job) is -Xmx4G
and no parallelism at all (one request after the other, in series) on 16GB of
RAM (it used up all the RAM available). It seems strange to me that it needs so
much RAM even when all the requests are serialized. Querying in parallel with
16 or 32 threads doesn't seem to make much of a difference to Fuseki, other
than even higher memory consumption (Fuseki answers all queries very quickly,
in milliseconds, until it runs out of memory).
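To be concrete, the launch with the 4GB heap looks something like this (the
fuseki-server script from the binary distribution picks up the JVM_ARGS
environment variable):

JVM_ARGS="-Xmx4G" ./fuseki-server --loc=database --port=7000 --localhost /query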
The memory growth is not instantaneous, and it is not linear. I can see RAM usage
fluctuate over a 3-4GB range (for example between 3GB and 7GB), but the trend is
to use more and more memory. For example, before crashing it would fluctuate
between 10GB and 15GB.
If I increase -Xmx above 4GB, Fuseki is eventually OOM-killed by the kernel.
Below 4GB, Fuseki crashes with a heap error like this (in both cases it fails
well after 1h of work):
10:03:21 WARN QueuedThreadPool :: Job failed
java.lang.OutOfMemoryError: Java heap space
10:03:21 WARN Fuseki :: [152378] RC = 500 : Java heap space: failed reallocation of scalar replaced objects
java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
10:03:21 INFO Fuseki :: [152378] 500 Server Error (48.115 s)
10:03:23 WARN AbstractConnector :: Accept Failure
java.lang.OutOfMemoryError: Java heap space
10:04:08 WARN QueuedThreadPool :: Job failed
java.lang.OutOfMemoryError: Java heap space
10:04:08 WARN QueuedThreadPool :: Job failed
java.lang.OutOfMemoryError: Java heap space
Exception in thread "HttpClient-2-SelectorManager" java.lang.OutOfMemoryError:
Java heap space
On 8GB of RAM it always fails for me: with -Xmx4G or more it is OOM-killed,
whereas with less it ends with a heap error.
The output of "java -XX:+PrintFlagsFinal -version | grep -i "M..HeapSize"" is
size_t MaxHeapSize = 4175429632 {product} {ergonomic}
size_t ShenandoahSoftMaxHeapSize = 0 {manageable} {default}
openjdk version "11.0.18" 2023-01-17
OpenJDK Runtime Environment (build 11.0.18+10-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Debian-1deb11u1, mixed mode, sharing)
I've also tried with OpenJDK 17, with the same results.
I tried Fuseki 3.17 too, but I was getting other JSON-LD errors (probably
related to an old JSON-LD library), so I didn't test it further.
I know that I don't have the latest and greatest hardware, but I think my
database is very small, and I feel like Fuseki should not be using 16GB of RAM
when running a lot of simple queries in series (not in parallel).
One thing that I want to try, but so far haven't, is restarting Fuseki halfway
through the job.
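If I do, it would be something naive along these lines (run-first-half.sh and
run-second-half.sh are just placeholders for splitting my script in two):

# hypothetical sketch: split the job and restart Fuseki in between
JVM_ARGS="-Xmx4G" ./fuseki-server --loc=database --port=7000 --localhost /query &
FUSEKI_PID=$!
sleep 10                      # give Fuseki time to start
./run-first-half.sh           # placeholder: first half of the queries
kill $FUSEKI_PID; wait $FUSEKI_PID
JVM_ARGS="-Xmx4G" ./fuseki-server --loc=database --port=7000 --localhost /query &
FUSEKI_PID=$!
sleep 10
./run-second-half.sh          # placeholder: second half of the queries
kill $FUSEKI_PID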
> Sent: Monday, July 10, 2023 at 1:18 PM
> From: "Andy Seaborne" <[email protected]>
> To: [email protected]
> Subject: Re: OOM Killed
>
> Laura, Dave,
>
> This doesn't sound like the same issue but let's see.
>
> Dave - your situation isn't under high load is it?
>
> - Is it in a container? If so:
> Is it the container being killed OOM or
> Java throwing an OOM exception?
> How much RAM does the container get? How many threads?
>
> - If not a container, how many CPU Threads are there? How many cores?
>
> - Which form of Fuseki are you using?
>
> what does
> java -XX:+PrintFlagsFinal -version \
> | grep -i 'M..HeapSize'
>
> say?
>
> How are you sending the queries to the server?
>
> On 09/07/2023 20:33, Laura Morales wrote:
> > I'm running a job that is submitting a lot of queries to a Fuseki server,
> > in parallel. My problem is that Fuseki is OOM-killed and I don't know how
> > to fix this. Some details:
> >
> > - Fuseki is queried as fast as possible. Queries take around 50-100ms to
> > complete so I think it's serving 10s of queries each second
>
> Are all the queries about the same amount of work, or are some going to
> cause significantly more memory use?
>
> It is quite possible to send queries faster than the server can process
> them - there is little point sending in parallel more than there are
> real CPU threads to service them.
>
> They will interfere and the machine can end up going slower (in terms of
> queries per second).
>
> I don't know exactly the impact on the GC, but I think the JVM delays
> minor GCs when very busy, which pushes it to do major ones earlier.
>
> A thing to try is use less parallelism.
>
> > - Fuseki 4.8. OS is Debian 12 (minimal installation with only OS, Fuseki,
> > no desktop environments, uses only ~100MB of RAM)
> > - all the queries are read queries. No updates, inserts, or other write
> > queries
> > - all the queries are over HTTP to the Fuseki endpoint
> > - database is TDB2 (created with tdb2.tdbloader)
> > - database contains around 2.5M triples
> > - the machine has 8GB RAM. I've tried on another PC with 16GB and it
> > completes the job. On 8GB though, it won't
> > - with -Xmx6G it's killed earlier. With -Xmx2G it's killed later. Either
> > way it's always killed.
>
> Is it getting OOM at random, or do certain queries tend to push it over
> the edge?
>
> Is it that the machine (container) has 8G RAM and there is no -Xmx setting?
> In that case, the default setting applies, which is 25% of RAM.
>
> A heap dump to know where the memory is going would be useful.
>
> > Is there anything that I can tweak to avoid Fuseki getting killed?
> > Something that isn't "just buy more RAM".
> > Thank you
>