Hi
I ran into the issue again with Tika/Java using more and more CPU, up to 200+% CPU.
The scenario is that I have 3-4 long-running processes calling the Tika server
(version 1.24), and occasionally 3-4 additional shorter processes (2-3 hours)
start up and call the Tika server as well.
The scenario has been running for a couple of days, extracting text from various
types of documents.
The Tika server is running locally.
Top shows this:
----------------------------------------------------------------------------
top - 16:21:17 up 5 days, 8:12, 6 users, load average: 2,64, 2,63, 2,61
Tasks: 145 total, 1 running, 144 sleeping, 0 stopped, 0 zombie
%Cpu(s): 50,8 us, 0,3 sy, 0,0 ni, 48,8 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 4032128 total, 129052 free, 2702236 used, 1200840 buff/cache
KiB Swap: 4192252 total, 2968864 free, 1223388 used. 1040340 avail Mem
  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
  911 root      20   0 4578604 1,229g  8024 S 204,3 32,0 859:11.22 java
  743 root      20   0  196596   5772   920 S   0,7  0,1  35:28.02 wizit_rest
34637 elastic+  20   0 21,346g 883808 30616 S   0,3 21,9   1250:04 java
    1 root      20   0  204620   3440  2376 S   0,0  0,1   0:14.99 systemd
    2 root      20   0       0      0     0 S   0,0  0,0   0:00.15 kthreadd
    3 root      20   0       0      0     0 S   0,0  0,0   1:46.20 ksoftirqd+
    5 root       0 -20       0      0     0 S   0,0  0,0   0:00.00 kworker/0+
    7 root      20   0       0      0     0 S   0,0  0,0   4:59.14 rcu_sched
    8 root      20   0       0      0     0 S   0,0  0,0   0:00.00 rcu_bh
    9 root      rt   0       0      0     0 S   0,0  0,0   0:03.83 migration+
----------------------------------------------------------------------------
At first I ran jstackseries.sh:
----------------------------------------------------------------------------
more jstack.911.202904.163848252
Attaching to process ID 911, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Deadlock Detection:
Can't print deadlocks:Unable to deduce type of thread from address 0x00007f30bc02d800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
----------------------------------------------------------------------------
It also froze the system (systemd logged "systemd[1]: Freezing execution.").
In the end I did get a thread dump via jstack, which I attach. I also attach
the tika-config file in case that is useful.
I hope this helps in analyzing the issue.
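One way to narrow down which threads in the dump are burning the CPU is to compare per-thread CPU from `top -H -b -n 1 -p 911` with the `nid=0x...` field in the jstack thread headers, since `nid` is the Linux thread ID in hex. The following is only a minimal sketch in Python under some assumptions: the function names are hypothetical, the column layout is the default `top` layout shown above (with decimal commas), and the dump is in the regular jstack format that prints `nid=`, not the `jstack -F` format.

```python
import re


def tid_to_nid(tid: int) -> str:
    """Format a Linux thread ID the way jstack prints it in nid=0x..."""
    return hex(tid)


def parse_top_threads(top_output: str, min_cpu: float = 50.0) -> list:
    """Return (tid, cpu) pairs for rows of `top -H` output at or above min_cpu.

    Assumes the default 12-column layout where column 9 is %CPU.
    """
    busy = []
    for line in top_output.splitlines():
        fields = line.split()
        # Thread rows start with a numeric ID; summary and header rows do not.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        try:
            # Accept decimal commas such as "204,3".
            cpu = float(fields[8].replace(",", "."))
        except ValueError:
            continue
        if cpu >= min_cpu:
            busy.append((int(fields[0]), cpu))
    return busy


def find_thread_headers(jstack_output: str, nids: set) -> list:
    """Return the jstack thread header lines whose nid is in nids."""
    pattern = re.compile(r"nid=(0x[0-9a-f]+)")
    hits = []
    for line in jstack_output.splitlines():
        match = pattern.search(line)
        if match and match.group(1) in nids:
            hits.append(line)
    return hits
```

Feeding the output of `top -H` to `parse_top_threads`, converting each thread ID with `tid_to_nid`, and passing the resulting set to `find_thread_headers` would list the headers of the busy threads, which shows whether they are parser threads or JVM-internal ones such as GC.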
Kind regards
Hans
-----Original Message-----
From: Nick Burch <[email protected]>
Sent: 16 April 2020 15:40
To: [email protected]
Cc: [email protected]
Subject: Re: Issue with > 200% CPU after bulk usage
On Wed, 15 Apr 2020, [email protected] wrote:
> I have encountered an issue with Tika running locally on a box that
> the Java runtime goes up to over 200% CPU, after running a bulk load
> of documents over a couple of days, it is more than 3 million documents.
Can you do a thread dump to show what the JVM is doing?
https://access.redhat.com/solutions/18178
Nick
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- NOTE: tika-batch is still an experimental feature.
The configuration file will likely change and be backward incompatible
with new versions of Tika. Please stay tuned.
-->
<tika-batch-config
maxAliveTimeSeconds="-1"
pauseOnEarlyTerminationMillis="10000"
timeoutThresholdMillis="300000"
timeoutCheckPulseMillis="1000"
maxQueueSize="10000"
numConsumers="default"> <!-- numConsumers = number of file consumers, "default" = number of processors -1 -->
<!-- options to allow on the commandline -->
<commandline>
<option opt="c" longOpt="tika-config" hasArg="true"
description="TikaConfig file"/>
<option opt="bc" longOpt="batch-config" hasArg="true"
description="xml batch config file"/>
<!-- We needed sorted for testing. We added random for performance.
Where crawling a directory is slow, it might be beneficial to
go randomly so that the parsers are triggered earlier. The
default is operating system's choice ("os") which means whatever order
the os returns files in .listFiles(). -->
<option opt="crawlOrder" hasArg="true"
description="how does the crawler sort the directories and files:
(random|sorted|os)"/>
<option opt="numConsumers" hasArg="true"
description="number of fileConsumers threads"/>
<option opt="maxFileSizeBytes" hasArg="true"
description="maximum file size to process; do not process files larger than this"/>
<option opt="maxQueueSize" hasArg="true"
description="maximum queue size for FileResources"/>
<option opt="fileList" hasArg="true"
description="file that contains a list of files (relative to inputDir) to process"/>
<option opt="fileListEncoding" hasArg="true"
description="encoding for fileList"/>
<option opt="inputDir" hasArg="true"
description="root directory for the files to be processed"/>
<option opt="startDir" hasArg="true"
description="directory (under inputDir) at which to start crawling"/>
<option opt="outputDir" hasArg="true"
description="output directory for output"/> <!-- do we want to make this mandatory -->
<option opt="recursiveParserWrapper"
description="use the RecursiveParserWrapper or not (default = false)"/>
<option opt="streamOut" description="stream the output of the RecursiveParserWrapper (default = false)"/>
<option opt="handleExisting" hasArg="true"
description="if an output file already exists, do you want to: overwrite, rename or skip"/>
<option opt="basicHandlerType" hasArg="true"
description="what type of content handler: xml, text, html, body"/>
<option opt="outputSuffix" hasArg="true"
description="suffix to add to the end of the output file name"/>
<option opt="timeoutThresholdMillis" hasArg="true"
description="how long to wait before determining that a consumer is stale"/>
<option opt="includeFilePat" hasArg="true"
description="regex that specifies which files to process"/>
<option opt="excludeFilePat" hasArg="true"
description="regex that specifies which files to avoid processing"/>
<option opt="reporterSleepMillis" hasArg="true"
description="millisecond between reports by the reporter"/>
<option opt="digest" hasArg="true"
description="which digest(s) to use, e.g. 'md5,sha512'"/>
<option opt="digestMarkLimit" hasArg="true"
description="max bytes to read for digest"/>
</commandline>
<!-- can specify inputDir="input", but the default config should not include this -->
<!-- can also specify startDir="input/someDir" to specify which child directory
to start processing -->
<crawler builderClass="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
crawlOrder="random"
maxFilesToAdd="-1"
maxFilesToConsider="-1"
includeFilePat=""
excludeFilePat=""
maxFileSizeBytes="-1"
/>
<!--
This is an example of a crawler that reads a list of files to be processed from a
file. This assumes that the files in the list are relative to inputDir.
<crawler builderClass="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
fileList="files.txt"
fileListEncoding="UTF-8"
maxFilesToAdd="-1"
maxFilesToConsider="-1"
includeFilePat="(?i)\.pdf$"
excludeFilePat="(?i)\.msg$"
maxFileSizeBytes="-1"
inputDir="input"
/>
-->
<!--
To wrap parser in RecursiveParserWrapper (tika-app's -J or tika-server's /rmeta),
add attribute recursiveParserWrapper="true" to consumers element.
To wrap parser with DigestingParser add attributes e.g.:
digest="md5,sha256" digestMarkLimit="10000000"
-->
<consumers builderClass="org.apache.tika.batch.fs.builders.BasicTikaFSConsumersBuilder"
recursiveParserWrapper="false" consumersManagerMaxMillis="60000">
<parser builderClass="org.apache.tika.batch.builders.AppParserFactoryBuilder"
class="org.apache.tika.batch.DigestingAutoDetectParserFactory"
parseRecursively="true"
digest="md5" digestMarkLimit="1000000"/>
<contenthandler builderClass="org.apache.tika.batch.builders.DefaultContentHandlerFactoryBuilder"
basicHandlerType="xml" writeLimit="-1"/>
<!-- can specify custom output file suffix with:
suffix=".mysuffix"
if no suffix is specified, BasicTikaFSConsumersBuilder does its best to guess -->
<!-- can specify compression with
compression="bzip2|gzip|zip" -->
<outputstream class="FSOutputStreamFactory" encoding="UTF-8"/>
</consumers>
<!-- reporter and interrupter are optional -->
<reporter builderClass="org.apache.tika.batch.builders.SimpleLogReporterBuilder" reporterSleepMillis="1000"
reporterStaleThresholdMillis="60000"/>
<interrupter builderClass="org.apache.tika.batch.builders.InterrupterBuilder"/>
</tika-batch-config>