[ https://issues.apache.org/jira/browse/ANY23-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416159#comment-16416159 ]
Lewis John McGibbney commented on ANY23-336:
--------------------------------------------

Results from the CLI
{code}
lmcgibbn@LMC-056430 /usr/local/any23(master) $ ./cli/target/appassembler/bin/any23 rover -l run.log -f turtle -o result.ttl -s "https://www.guthriegreen.com/"
------------------------------------------------------------------------
Apache Any23 :: rover
------------------------------------------------------------------------
Mar 27, 2018 12:58:34 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Mar 27, 2018 12:58:34 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
0 [main] INFO org.apache.any23.rdf.PopularPrefixes - Loading prefixes from /org/apache/any23/prefixes/prefixes.properties
1220 [main] INFO org.apache.any23.extractor.SingleDocumentExtraction - Processing https://www.guthriegreen.com/
>Summary:
   -total calls: 6
   -total triples: 115
   -total runtime: 38 ms!
   -tripls/ms: 3
   -ms/calls: 6
>Extractor: html-head-meta
   -total calls: 1
   -total triples: 12
   -total runtime: 0 ms!
   -ms/calls: 0
>Extractor: consolidation-extractor
   -total calls: 1
   -total triples: 0
   -total runtime: 0 ms!
   -ms/calls: 0
>Extractor: html-scraper
   -total calls: 1
   -total triples: 4
   -total runtime: 3 ms!
   -tripls/ms: 1
   -ms/calls: 3
>Extractor: html-embedded-jsonld
   -total calls: 1
   -total triples: 63
   -total runtime: 20 ms!
   -tripls/ms: 3
   -ms/calls: 20
>Extractor: html-head-title
   -total calls: 1
   -total triples: 1
   -total runtime: 0 ms!
   -ms/calls: 0
>Extractor: html-rdfa11
   -total calls: 1
   -total triples: 35
   -total runtime: 15 ms!
   -tripls/ms: 2
   -ms/calls: 15
29751 [main] INFO org.apache.any23.cli.Rover - Extractors used: [html-head-meta, html-scraper, html-embedded-jsonld, html-head-title, html-rdfa11]
29751 [main] INFO org.apache.any23.cli.Rover - 115 triples, 29725ms
------------------------------------------------------------------------
Apache Any23 SUCCESS
Total time: 30s
Finished at: Tue Mar 27 12:59:04 PDT 2018
Final Memory: 26M/212M
------------------------------------------------------------------------
{code}

> Parsing json-ld content takes prohibitively long time
> -----------------------------------------------------
>
>                 Key: ANY23-336
>                 URL: https://issues.apache.org/jira/browse/ANY23-336
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core, extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: Screen Shot 2018-03-27 at 2.52.15 PM.png, Screen Shot
> 2018-03-27 at 2.54.43 PM.png
>
>
> Using the page [https://www.guthriegreen.com|https://www.guthriegreen.com/]
> as a benchmark, a page fetch took about 100 ms, while simply *parsing* the
> json-ld content on that page took a *staggering 27400 ms*. For reference, I'm
> using Java 8, build 162, on a Macbook Pro (early 2015).
> The bad news is that this is not our fault.
> I've profiled this behavior down to the
> {{com.github.jsonldjava.utils.JsonUtils.fromURL(URL, CloseableHttpClient)}}
> function. 94% of the parsing time is spent there. This function is called
> when trying to load remote json-ld contexts.
> In order to avoid loading remote contexts repeatedly, this function tries to
> *cache* them by using a {{CachingHttpClient}} from the httpclient-osgi
> library.
> Unfortunately, that strategy is *not* working, as I have recorded exactly
> *zero* cache hits, meaning that *every* retrieval is a cache miss and a
> remote context is re-fetched via http every single time it's accessed.
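
For illustration, the behavior the description expects from the cache (each remote context fetched at most once, with every later access counted as a hit) can be sketched as below. This is a minimal sketch under stated assumptions, not the jsonld-java or Any23 implementation: {{ContextCache}}, {{load}}, {{hits}}, {{misses}} and the {{fetcher}} function are hypothetical names, and the fetcher merely stands in for the HTTP retrieval done by {{JsonUtils.fromURL}}.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical in-memory cache for remote JSON-LD contexts (illustration only). */
final class ContextCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> fetcher; // stand-in for the remote HTTP fetch
    private final AtomicInteger hits = new AtomicInteger();
    private final AtomicInteger misses = new AtomicInteger();

    ContextCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the context body for the URL, calling the fetcher only on a miss. */
    String load(String url) {
        boolean[] fetched = {false};
        String body = cache.computeIfAbsent(url, u -> {
            fetched[0] = true;
            misses.incrementAndGet();
            return fetcher.apply(u); // a null result is not stored, so it is retried next time
        });
        if (!fetched[0]) {
            hits.incrementAndGet();
        }
        return body;
    }

    int hits() { return hits.get(); }

    int misses() { return misses.get(); }

    public static void main(String[] args) {
        AtomicInteger remoteFetches = new AtomicInteger();
        // Fake fetcher: counts calls instead of going over the network.
        ContextCache contexts = new ContextCache(url -> {
            remoteFetches.incrementAndGet();
            return "{\"@context\": {}}";
        });
        for (int i = 0; i < 5; i++) {
            contexts.load("http://schema.org/");
        }
        // prints: fetches=1 hits=4 misses=1
        System.out.println("fetches=" + remoteFetches.get()
                + " hits=" + contexts.hits()
                + " misses=" + contexts.misses());
    }
}
```

With a working cache, five accesses to the same context URL cause a single fetch. The report above describes the opposite pattern: zero hits, so every access pays for a remote fetch.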