[ https://issues.apache.org/jira/browse/TIKA-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778483#action_12778483 ]
Erik Hetzner commented on TIKA-320:
-----------------------------------
Wonderful, thanks!
> Allow disabling language detection in AutoDetectParser
> ------------------------------------------------------
>
> Key: TIKA-320
> URL: https://issues.apache.org/jira/browse/TIKA-320
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 0.5
> Reporter: Erik Hetzner
> Assignee: Jukka Zitting
> Fix For: 0.5
>
>
> It should be possible to disable language detection in the AutoDetectParser.
> Between 0.4 and the current trunk, the time Tika spent parsing my test data
> (100MB of compressed web crawl data, mixed HTML, images, etc.) increased
> considerably. After profiling, I determined that most of the time was spent
> in language detection.
>
> time results of indexing my test data with Lucene using AutoDetectParser:
>
>   real 15m21.020s
>   user 6m31.344s
>   sys  0m4.556s
>
> time results on the same test data using the same code as AutoDetectParser,
> but with language detection disabled:
>
>   real 4m48.856s
>   user 2m9.416s
>   sys  0m3.484s
>
> Obviously these numbers are worthless in their particulars, but I think they
> demonstrate that one ought to be able to turn off language detection, as it
> can massively slow down parsing.
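
To put the two runs quoted above side by side, here is a small sketch that converts the reported durations to seconds and computes the slowdown factor for each row. The helper names (to_seconds, slowdown) are made up for this illustration and are not part of Tika. As the report itself says, the figures come from a single informal run, so the ratios are only indicative.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a duration such as '15m21.020s' (as printed by `time`) to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied verbatim from the issue description.
with_detection = {"real": "15m21.020s", "user": "6m31.344s", "sys": "0m4.556s"}
without_detection = {"real": "4m48.856s", "user": "2m9.416s", "sys": "0m3.484s"}


def slowdown(kind: str) -> float:
    """Ratio of the run with language detection to the run without it."""
    return to_seconds(with_detection[kind]) / to_seconds(without_detection[kind])


if __name__ == "__main__":
    for kind in ("real", "user", "sys"):
        print(f"{kind}: {slowdown(kind):.2f}x slower with language detection")
```

On these numbers, both wall-clock and user CPU time come out at roughly three times higher with language detection enabled, while system time changes much less.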
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.