[
https://issues.apache.org/jira/browse/TIKA-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930733#comment-17930733
]
Tim Allison commented on TIKA-4388:
-----------------------------------
Are you using a tika-config.xml file?
> Performance degradation observed in Tika 3.1.0
> ----------------------------------------------
>
> Key: TIKA-4388
> URL: https://issues.apache.org/jira/browse/TIKA-4388
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.1.0
> Reporter: Sandeep Kulkarni
> Assignee: Tim Allison
> Priority: Major
>
> We are using Tika as a library, and after upgrading to 3.1.0 we started
> observing a degradation in the time taken for text extraction. We see the
> degradation for many file types, but one specific case is HTML files.
> I used the
> [https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip]
> dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.
> On a test machine with 12 cores, I am getting many warnings like the one below:
> {noformat}
> [XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the
> XMLReaderUtils.POOL_SIZE{noformat}
> Then I set the pool size to the number of available cores with a call to
> XMLReaderUtils.setPoolSize(). But that had an even worse effect on
> performance: the time taken increased to 2x what it was earlier. I also
> started getting another warning, and that one even more frequently.
> {noformat}
> [XMLReaderUtils] SAXParser not taken back into pool. If you haven't resized
> the pool this could be a sign that there are more calls to 'acquire' than to
> 'release'{noformat}
> It looks like the changes made in commit
> [https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
> are somehow causing this behaviour.
> With Tika 3.0.0, which we are currently using, I don't see any warnings and
> performance is also good.
>
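For readers unfamiliar with the 'acquire'/'release' wording in the second warning, here is a minimal sketch of a bounded SAXParser pool built only on the JDK. It is not Tika's XMLReaderUtils implementation; the class name ParserPoolSketch, its methods, and the timeout values are all hypothetical and chosen for illustration. It only shows why a caller that times out waiting for a parser indicates contention, and why every acquire needs a matching release.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical illustration only; this is NOT how Tika implements its pool.
public class ParserPoolSketch {
    private final BlockingQueue<SAXParser> pool;

    public ParserPoolSketch(int size) throws Exception {
        pool = new ArrayBlockingQueue<>(size);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        for (int i = 0; i < size; i++) {
            pool.add(factory.newSAXParser());
        }
    }

    // Waits up to timeoutMillis for a free parser; returns null on timeout,
    // which is the situation a "contention" warning would describe.
    public SAXParser acquire(long timeoutMillis) throws InterruptedException {
        return pool.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Returns false if the pool is already full, i.e. the parser could not
    // be taken back into the pool.
    public boolean release(SAXParser parser) {
        return pool.offer(parser);
    }

    public int available() {
        return pool.size();
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ParserPoolSketch pool = new ParserPoolSketch(cores);
        SAXParser parser = pool.acquire(100);
        try {
            // parse a document with the parser here
        } finally {
            if (parser != null) {
                pool.release(parser); // always pair acquire with release
            }
        }
        System.out.println("parsers available: " + pool.available());
    }
}
```

In this sketch, a pool smaller than the number of parsing threads makes acquire() time out under load, and a release() with no matching acquire() is rejected once the pool is full.
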
--
This message was sent by Atlassian Jira
(v8.20.10#820010)