[jira] [Commented] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request

ASF GitHub Bot (JIRA) Fri, 25 May 2018 18:30:24 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491449#comment-16491449 ]


ASF GitHub Bot commented on TIKA-2520:
--------------------------------------

chrismattmann commented on issue #237: TIKA-2520 optimize OptimaizeLangDetector default loadModel()
URL: https://github.com/apache/tika/pull/237#issuecomment-392225569
 
 
   yep @tballison that fixed it:
   
   for 2.x/master:
   
   ```
   [INFO] --- forbiddenapis:2.5:testCheck (default) @ tika ---
   [INFO] Skipping execution for packaging "pom"
   [INFO] 
   [INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
   [INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/2.0.0-SNAPSHOT/tika-2.0.0-SNAPSHOT.pom
   [INFO] ------------------------------------------------------------------------
   [INFO] Reactor Summary:
   [INFO] 
   [INFO] Apache Tika parent ................................. SUCCESS [  2.659 s]
   [INFO] Apache Tika core ................................... SUCCESS [ 32.386 s]
   [INFO] Apache Tika parsers ................................ SUCCESS [05:29 min]
   [INFO] Apache Tika XMP .................................... SUCCESS [  2.089 s]
   [INFO] Apache Tika serialization .......................... SUCCESS [  1.455 s]
   [INFO] Apache Tika batch .................................. SUCCESS [01:54 min]
   [INFO] Apache Tika language detection ..................... SUCCESS [  2.796 s]
   [INFO] Apache Tika application ............................ SUCCESS [ 56.919 s]
   [INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 30.672 s]
   [INFO] Apache Tika translate .............................. SUCCESS [  2.907 s]
   [INFO] Apache Tika server ................................. SUCCESS [ 23.061 s]
   [INFO] Apache Tika examples ............................... SUCCESS [ 10.833 s]
   [INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.372 s]
   [INFO] Apache Tika eval ................................... SUCCESS [ 29.789 s]
   [INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [01:01 min]
   [INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 23.639 s]
   [INFO] Apache Tika ........................................ SUCCESS [  0.018 s]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time: 12:10 min
   [INFO] Finished at: 2018-05-25T16:48:59-07:00
   [INFO] Final Memory: 174M/1661M
   [INFO] ------------------------------------------------------------------------
   nonas:tika2.0.0 mattmann$ tesseract
   Usage:
     tesseract --help | --help-psm | --help-oem | --version
     tesseract --list-langs [--tessdata-dir PATH]
     tesseract --print-parameters [options...] [configfile...]
     tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
   
   OCR options:
     --tessdata-dir PATH   Specify the location of tessdata path.
     --user-words PATH     Specify the location of user words file.
     --user-patterns PATH  Specify the location of user patterns file.
     -l LANG[+LANG]        Specify language(s) used for OCR.
     -c VAR=VALUE          Set value for config variables.
                           Multiple -c arguments are allowed.
     --psm NUM             Specify page segmentation mode.
     --oem NUM             Specify OCR Engine mode.
   NOTE: These options must occur before any configfile.
   
   Page segmentation modes:
     0    Orientation and script detection (OSD) only.
     1    Automatic page segmentation with OSD.
     2    Automatic page segmentation, but no OSD, or OCR.
     3    Fully automatic page segmentation, but no OSD. (Default)
     4    Assume a single column of text of variable sizes.
     5    Assume a single uniform block of vertically aligned text.
     6    Assume a single uniform block of text.
     7    Treat the image as a single text line.
     8    Treat the image as a single word.
     9    Treat the image as a single word in a circle.
    10    Treat the image as a single character.
    11    Sparse text. Find as much text as possible in no particular order.
    12    Sparse text with OSD.
    13    Raw line. Treat the image as a single text line,
                        bypassing hacks that are Tesseract-specific.
   OCR Engine modes:
     0    Original Tesseract only.
     1    Cube only.
     2    Tesseract + cube.
     3    Default, based on what is available.
   
   Single options:
     -h, --help            Show this help message.
     --help-psm            Show page segmentation modes.
     --help-oem            Show OCR Engine modes.
     -v, --version         Show version information.
     --list-langs          List available languages for tesseract engine.
     --print-parameters    Print tesseract parameters to stdout.
   nonas:tika2.0.0 mattmann$ 
   
   ```
   
   And also for branch_1x:
   
   ```
   [INFO] --- forbiddenapis:2.5:testCheck (default) @ tika ---
   [INFO] Skipping execution for packaging "pom"
   [INFO] 
   [INFO] --- maven-install-plugin:2.5.2:install (default-install) @ tika ---
   [INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.19-SNAPSHOT/tika-1.19-SNAPSHOT.pom
   [INFO] ------------------------------------------------------------------------
   [INFO] Reactor Summary:
   [INFO] 
   [INFO] Apache Tika parent ................................. SUCCESS [  1.662 s]
   [INFO] Apache Tika core ................................... SUCCESS [ 29.080 s]
   [INFO] Apache Tika parsers ................................ SUCCESS [05:30 min]
   [INFO] Apache Tika XMP .................................... SUCCESS [  2.053 s]
   [INFO] Apache Tika serialization .......................... SUCCESS [  1.455 s]
   [INFO] Apache Tika batch .................................. SUCCESS [02:02 min]
   [INFO] Apache Tika language detection ..................... SUCCESS [  3.046 s]
   [INFO] Apache Tika application ............................ SUCCESS [01:30 min]
   [INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 52.497 s]
   [INFO] Apache Tika translate .............................. SUCCESS [  4.999 s]
   [INFO] Apache Tika server ................................. SUCCESS [ 36.182 s]
   [INFO] Apache Tika examples ............................... SUCCESS [ 16.681 s]
   [INFO] Apache Tika Java-7 Components ...................... SUCCESS [  4.317 s]
   [INFO] Apache Tika eval ................................... SUCCESS [ 42.570 s]
   [INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [02:05 min]
   [INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 35.008 s]
   [INFO] Apache Tika ........................................ SUCCESS [  0.036 s]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time: 15:00 min
   [INFO] Finished at: 2018-05-25T17:38:01-07:00
   [INFO] Final Memory: 173M/1589M
   [INFO] ------------------------------------------------------------------------
   nonas:tika2.0.0 mattmann$ git branch
     TIKA-1988
     TIKA-1988-new
     TIKA-2016
     TIKA-2298
   * branch_1x
     gsoc17
     master
     merge-mattmann-TIKA-1988
   nonas:tika2.0.0 mattmann$ 
   ```
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2520
>                 URL: https://issues.apache.org/jira/browse/TIKA-2520
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.16
>            Reporter: Vincent van Donselaar
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: performance
>             Fix For: 1.19
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Tika REST server's `/language` resource invokes the relatively heavy 
> `loadModels` operation for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
>       LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
>       String detectedLang = language.getLanguage();
>       LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
>       return detectedLang;
> }
> {code}
> This could be optimized by loading the models only once (perhaps lazily) and 
> keeping them in memory. I assume the `LanguageDetector` is not thread-safe, so 
> I expect this requires an ExecutorService with language detectors.
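
For illustration only, here is a minimal sketch of the "load once, keep in memory" idea from the issue description, using the initialization-on-demand holder idiom. `StubDetector` is a hypothetical stand-in, not Tika's `OptimaizeLangDetector`, and its `detect` logic is a placeholder. The `synchronized` on `detect` is a conservative assumption following the reporter's guess that the detector may not be thread-safe. This is not necessarily how PR #237 addresses the issue.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LazyDetectorDemo {

    /** Hypothetical stand-in for a detector whose model loading is expensive. */
    static final class StubDetector {
        static final AtomicInteger LOADS = new AtomicInteger();

        StubDetector loadModels() {
            LOADS.incrementAndGet(); // stands in for the heavy model load
            return this;
        }

        // Serialized as a conservative assumption that detection is not thread-safe.
        synchronized String detect(String text) {
            return text.contains("the ") ? "en" : "unknown"; // placeholder logic
        }
    }

    /** The JVM initializes this class once, on first access, in a thread-safe way. */
    private static final class Holder {
        static final StubDetector INSTANCE = new StubDetector().loadModels();
    }

    static StubDetector shared() {
        return Holder.INSTANCE;
    }

    /** Per-request entry point: reuses the shared detector instead of reloading. */
    static String detect(String text) {
        return shared().detect(text);
    }

    static int loadCount() {
        return StubDetector.LOADS.get();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(detect("the quick brown fox"));
        }
        System.out.println("loads=" + loadCount());
    }
}
```

In this sketch, repeated calls to `detect` trigger `loadModels()` only once. If serializing calls proves too slow under load, a pool of detectors (as the reporter suggests) or one detector per thread would be alternatives to evaluate.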



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
