[ 
https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491175#comment-16491175
 ] 

ASF GitHub Bot commented on TIKA-2520:
--------------------------------------

tballison commented on issue #237: TIKA-2520 optimize OptimaizeLangDetector default loadModel()
URL: https://github.com/apache/tika/pull/237#issuecomment-392153435
 
 
   Sorry. How about now?
   
   On Fri, May 25, 2018 at 11:50 AM Chris Mattmann <notificati...@github.com>
   wrote:
   
   > didn't fix it for branch_1x either :( @tballison
   > <https://github.com/tballison>
   >
   >
   > Results :
   >
   > Failed tests:
   >   PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103 pdf_haystack not found in:
   > <html xmlns="http://www.w3.org/1999/xhtml">
   > <head>
   > <meta name="date" content="2013-05-23T18:30:00Z" />
   > <meta name="cp:revision" content="1" />
   > <meta name="extended-properties:AppVersion" content="14.0000" />
   > <meta name="meta:paragraph-count" content="1" />
   > <meta name="meta:word-count" content="16" />
   > <meta name="extended-properties:Company" content="" />
   > <meta name="Word-Count" content="16" />
   > <meta name="dcterms:created" content="2013-05-23T18:30:00Z" />
   > <meta name="meta:line-count" content="1" />
   > <meta name="Last-Modified" content="2013-05-23T18:30:00Z" />
   > <meta name="dcterms:modified" content="2013-05-23T18:30:00Z" />
   > <meta name="Last-Save-Date" content="2013-05-23T18:30:00Z" />
   > <meta name="meta:character-count" content="96" />
   > <meta name="Template" content="Normal.dotm" />
   > <meta name="Line-Count" content="1" />
   > <meta name="Paragraph-Count" content="1" />
   > <meta name="meta:save-date" content="2013-05-23T18:30:00Z" />
   > <meta name="meta:character-count-with-spaces" content="111" />
   > <meta name="Application-Name" content="Microsoft Office Word" />
   > <meta name="modified" content="2013-05-23T18:30:00Z" />
   > <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
   > <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
   > <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
   > <meta name="meta:creation-date" content="2013-05-23T18:30:00Z" />
   > <meta name="extended-properties:Application" content="Microsoft Office Word" />
   > <meta name="Creation-Date" content="2013-05-23T18:30:00Z" />
   > <meta name="xmpTPg:NPages" content="1" />
   > <meta name="Character-Count-With-Spaces" content="111" />
   > <meta name="Character Count" content="96" />
   > <meta name="Page-Count" content="1" />
   > <meta name="Revision-Number" content="1" />
   > <meta name="Application-Version" content="14.0000" />
   > <meta name="extended-properties:Template" content="Normal.dotm" />
   > <meta name="publisher" content="" />
   > <meta name="meta:page-count" content="1" />
   > <meta name="dc:publisher" content="" />
   > <title></title>
   > </head>
   > <body><p class="header" />
   > <p class="header" />
   > <p class="header" />
   > <p>Outer_haystack</p>
   > <p>Outer_haystack</p>
   > <p><div class="embedded" id="rId8" />
   > </p>
   > <p>Outer_haystack</p>
   > <p />
   > <p>Outer_haystack</p>
   > <p />
   > <p>Outer_haystack</p>
   > <p><a name="_GoBack" /></p>
   > <p class="footer" />
   > <p class="footer" />
   > <p class="footer" />
   > <p>attached.pdf</p>
   > <div class="page"><div class="ocr">dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'
   >
   > </div>
   > </div>
   > <p class="header" />
   >
   > <p class="header" />
   >
   > <p class="header" />
   >
   > <p>Haystack</p>
   >
   > <p>Needle</p>
   >
   > <p>Haystack</p>
   >
   > <p><a name="_GoBack" /></p>
   >
   > <p class="footer" />
   >
   > <p class="footer" />
   >
   > <p class="footer" />
   >
   > <div source="attachment" class="embedded" id="Test.docx" />
   > </body></html>
   >
   > Tests run: 1009, Failures: 1, Errors: 0, Skipped: 30
   >
   > [INFO] ------------------------------------------------------------------------
   > [INFO] Reactor Summary:
   > [INFO]
   > [INFO] Apache Tika parent ................................. SUCCESS [  2.496 s]
   > [INFO] Apache Tika core ................................... SUCCESS [ 35.187 s]
   > [INFO] Apache Tika parsers ................................ FAILURE [07:03 min]
   > [INFO] Apache Tika XMP .................................... SKIPPED
   > [INFO] Apache Tika serialization .......................... SKIPPED
   > [INFO] Apache Tika batch .................................. SKIPPED
   > [INFO] Apache Tika language detection ..................... SKIPPED
   > [INFO] Apache Tika application ............................ SKIPPED
   > [INFO] Apache Tika OSGi bundle ............................ SKIPPED
   > [INFO] Apache Tika translate .............................. SKIPPED
   > [INFO] Apache Tika server ................................. SKIPPED
   > [INFO] Apache Tika examples ............................... SKIPPED
   > [INFO] Apache Tika Java-7 Components ...................... SKIPPED
   > [INFO] Apache Tika eval ................................... SKIPPED
   > [INFO] Apache Tika Deep Learning (powered by DL4J) ........ SKIPPED
   > [INFO] Apache Tika Natural Language Processing ............ SKIPPED
   > [INFO] Apache Tika ........................................ SKIPPED
   > [INFO] ------------------------------------------------------------------------
   > [INFO] BUILD FAILURE
   > [INFO] ------------------------------------------------------------------------
   > [INFO] Total time: 07:42 min
   > [INFO] Finished at: 2018-05-25T08:45:25-07:00
   > [INFO] Final Memory: 66M/751M
   > [INFO] ------------------------------------------------------------------------
   > [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project tika-parsers: There are test failures.
   > [ERROR]
   > [ERROR] Please refer to /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the individual test results.
   > [ERROR] -> [Help 1]
   > [ERROR]
   > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
   > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   > [ERROR]
   > [ERROR] For more information about the errors and possible solutions, please read the following articles:
   > [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
   > [ERROR]
   > [ERROR] After correcting the problems, you can resume the build with the command
   > [ERROR]   mvn <goals> -rf :tika-parsers
   > nonas:tika2.0.0 mattmann$ git branch
   >   TIKA-1988
   >   TIKA-1988-new
   >   TIKA-2016
   >   TIKA-2298
   > * branch_1x
   >   gsoc17
   >   master
   >   merge-mattmann-TIKA-1988
   > nonas:tika2.0.0 mattmann$
   >
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2520
>                 URL: https://issues.apache.org/jira/browse/TIKA-2520
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.16
>            Reporter: Vincent van Donselaar
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: performance
>             Fix For: 1.19
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Tika REST server's `/language` resource invokes the relatively heavy 
> `loadModels` operation for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
>       LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
>       String detectedLang = language.getLanguage();
>       LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
>       return detectedLang;
> }
> {code}
> This could be optimized by loading the models only once (lazily?) and keeping
> them in memory. I assume the `LanguageDetector` is not thread safe, so I expect
> this requires an ExecutorService with language detectors.
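
For illustration only, here is a minimal sketch of the "load once, keep in memory" idea from the description above. It is not the change made in the pull request. `StubDetector` and `LazyDetectorSketch` are hypothetical stand-ins so the example runs without Tika on the classpath. Because the reporter only assumes the detector is not thread safe, the sketch takes the conservative route of one lazily loaded instance per thread via `ThreadLocal`, rather than the ExecutorService he mentions.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyDetectorSketch {

    // Hypothetical stand-in for a detector with an expensive loadModels() step.
    static final class StubDetector {
        static final AtomicInteger LOAD_COUNT = new AtomicInteger();

        StubDetector loadModels() {
            LOAD_COUNT.incrementAndGet(); // stands in for the costly model loading
            return this;
        }

        String detect(String text) {
            // Toy rule, only so the sketch returns something.
            return text.contains("the ") ? "en" : "unknown";
        }
    }

    // One detector per thread, created and loaded on that thread's first call.
    private static final ThreadLocal<StubDetector> DETECTOR =
            ThreadLocal.withInitial(() -> new StubDetector().loadModels());

    static String detect(String text) {
        return DETECTOR.get().detect(text);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            detect("the quick brown fox");
        }
        // All calls ran on the main thread, so the models were loaded once.
        System.out.println("loads: " + StubDetector.LOAD_COUNT.get());
    }
}
```

The trade-off is memory: each worker thread keeps its own copy of the models. If the real detector turns out to be safe for concurrent use, a single shared instance would be cheaper.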



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
