[
https://issues.apache.org/jira/browse/TIKA-4388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930680#comment-17930680
]
Tim Allison commented on TIKA-4388:
-----------------------------------
With tika-app in batch mode ({{java -jar tika-app-3.1.0.jar -J -t -i input -o
output}}), I'm not getting the warning with 15 workers (16 cores) on that data
set, and the run takes roughly 4 seconds for both 3.1.0 and 3.0.0. When I
drop the pool size down to 2, I do get the warning, but with only a trivial hit
on performance (4.4 seconds for 3.1.0 and 4.2 seconds for 3.0.0).
I'm not doubting your findings! I need more info to be able to replicate. Thank
you, again.
> Performance degradation observed in Tika 3.1.0
> ----------------------------------------------
>
> Key: TIKA-4388
> URL: https://issues.apache.org/jira/browse/TIKA-4388
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.1.0
> Reporter: Sandeep Kulkarni
> Assignee: Tim Allison
> Priority: Major
>
> We are using Tika as a library, and after upgrading to 3.1.0 we started
> observing degradation in the time taken for text extraction. We are observing
> degradation for many file types, but one specific case is HTML files.
> I used the
> [https://www.cs.cornell.edu/people/pabo/movie-review-data/polarity_html.zip]
> dataset from https://www.cs.cornell.edu/people/pabo/movie-review-data/.
> On a test machine with 12 cores, I am getting many warnings like the one below:
> {noformat}
> [XMLReaderUtils] Contention waiting for a SAXParser. Consider increasing the
> XMLReaderUtils.POOL_SIZE{noformat}
> Then I set the pool size equal to the number of available cores using a
> call to XMLReaderUtils.setPoolSize(). But that had an even worse effect on
> performance: the time taken increased to 2x what it was earlier. I also
> started getting another warning, and even more frequently.
> {noformat}
> [XMLReaderUtils] SAXParser not taken back into pool. If you haven't resized
> the pool this could be a sign that there are more calls to 'acquire' than to
> 'release'{noformat}
> It looks like the changes made in commit
> [https://github.com/apache/tika/commit/6305da41756e59dcf19e92acf70657624581cfe3]
> are somehow causing this behaviour.
> With Tika 3.0.0, which we are currently using, I don't see any warnings, and
> performance is also good.
>
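For anyone trying to replicate this outside Tika: both warnings quoted above refer to a bounded pool of SAXParsers with 'acquire' and 'release' calls. The following is a minimal, JDK-only sketch of that general pattern, meant for experimenting with pool size versus worker count. It is not Tika's XMLReaderUtils code; the names {{SaxParserPoolSketch}}, {{ParserPool}}, {{acquire}}, {{release}}, and {{run}} are hypothetical, and the 1 ms wait and the tiny inline document are arbitrary assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SaxParserPoolSketch {

    /** Hypothetical fixed-size pool; an illustration only, NOT Tika's implementation. */
    static final class ParserPool {
        private final BlockingQueue<SAXParser> idle;
        final AtomicInteger contended = new AtomicInteger();

        ParserPool(int size) throws Exception {
            idle = new ArrayBlockingQueue<>(size);
            SAXParserFactory factory = SAXParserFactory.newInstance();
            for (int i = 0; i < size; i++) {
                idle.add(factory.newSAXParser());
            }
        }

        /** Waits up to timeoutMillis for an idle parser; returns null and counts the miss on timeout. */
        SAXParser acquire(long timeoutMillis) throws InterruptedException {
            SAXParser p = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (p == null) {
                contended.incrementAndGet();
            }
            return p;
        }

        /** Hands a parser back; false means the pool was already full and the parser was dropped. */
        boolean release(SAXParser p) {
            return idle.offer(p);
        }

        int idleCount() {
            return idle.size();
        }
    }

    /** Returns {documents parsed, timed-out acquires, parsers idle at the end}. */
    static int[] run(int poolSize, int workers, int docsPerWorker) throws Exception {
        ParserPool pool = new ParserPool(poolSize);
        byte[] xml = "<html><body><p>review</p></body></html>".getBytes(StandardCharsets.UTF_8);
        AtomicInteger parsed = new AtomicInteger();
        ExecutorService exec = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            exec.submit(() -> {
                for (int d = 0; d < docsPerWorker; d++) {
                    SAXParser p = null;
                    try {
                        while (p == null) {
                            p = pool.acquire(1); // retry until another worker releases one
                        }
                        p.parse(new ByteArrayInputStream(xml), new DefaultHandler());
                        parsed.incrementAndGet();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    } finally {
                        if (p != null) {
                            pool.release(p); // every successful acquire is paired with a release
                        }
                    }
                }
            });
        }
        exec.shutdown();
        exec.awaitTermination(1, TimeUnit.MINUTES);
        return new int[] {parsed.get(), pool.contended.get(), pool.idleCount()};
    }

    public static void main(String[] args) throws Exception {
        // Compare a pool smaller than the worker count with one that matches it.
        for (int poolSize : new int[] {2, 8}) {
            int[] r = run(poolSize, 8, 200);
            System.out.printf("pool=%d parsed=%d contended=%d idle=%d%n",
                    poolSize, r[0], r[1], r[2]);
        }
    }
}
```

The contended count varies from run to run and from machine to machine, so the sketch only helps compare configurations. It does not say anything about where the time goes in 3.1.0.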