[ https://issues.apache.org/jira/browse/ANY23-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416545#comment-16416545 ]
Hans Brende commented on ANY23-336:
-----------------------------------

[~p_ansell] Thank you! With your {{setSharedCache(false)}} suggestion and one tweak to {{JarCacheStorage}}, I've managed to increase the parsing speed *by an order of magnitude*. I'll submit a pull request shortly.

One question before I proceed: in {{JarCacheStorage.getEntry(String key)}}, is there a reason that you check the delegate cache _after_ you scan the classpath for jar files, rather than before? Shouldn't it be the other way around: check the delegate cache, and if the delegate cache returns a null value, _then_ check the classpath?

Since the delegate cache seems to always be implemented as {{BasicHttpCacheStorage}} (i.e., backed by a {{LinkedHashMap}}), it will be much quicker to check that one first, being a constant-time operation. (Also, in the case of Any23, we're not even using the classpath option at all, so our *only* cached values will be found in the delegate cache. It doesn't seem to make much sense to turn what could be a constant-time operation into an operation proportional to the size of the classpath.)

Thoughts?

> Parsing json-ld content takes prohibitively long time
> -----------------------------------------------------
>
> Key: ANY23-336
> URL: https://issues.apache.org/jira/browse/ANY23-336
> Project: Apache Any23
> Issue Type: Bug
> Components: core, extractors
> Affects Versions: 2.2
> Reporter: Hans Brende
> Assignee: Peter Ansell
> Priority: Critical
> Fix For: 2.3
>
> Attachments: Screen Shot 2018-03-27 at 2.52.15 PM.png, Screen Shot 2018-03-27 at 2.54.43 PM.png
>
>
> Using the page [https://www.guthriegreen.com|https://www.guthriegreen.com/] as a benchmark, a page fetch took about 100 ms, while simply *parsing* the json-ld content on that page took a *staggering 27400 ms*. For reference, I'm using Java 8, build 162, on a Macbook Pro (early 2015).
> The bad news is that this is not our fault.
> I've profiled this behavior down to the {{com.github.jsonldjava.utils.JsonUtils.fromURL(URL, CloseableHttpClient)}} function. 94% of the parsing time is spent there. This function is called when trying to load remote json-ld contexts.
> In order to avoid loading remote contexts repeatedly, this function tries to *cache* them by using a {{CachingHttpClient}} from the httpclient-osgi library.
> Unfortunately, that strategy is *not* working, as I have recorded exactly *zero* cache hits, meaning that *every* retrieval is a cache miss and a remote context is re-fetched via http every single time it's accessed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
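
For illustration only, the lookup order proposed in the comment (check the in-memory delegate first, and fall back to the classpath scan only on a miss) might look roughly like the sketch below. This is not the actual {{JarCacheStorage}} code from jsonld-java: the class {{DelegateFirstLookup}}, its methods, the {{String}} entry type, the scan counter, and the example URL are all hypothetical stand-ins, and the "classpath scan" is modeled as an arbitrary function so the example runs without any HTTP client dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a cache storage that wraps a fast in-memory
// delegate and a slow fallback (e.g. a scan over jar files on the classpath).
class DelegateFirstLookup {
    // Fast path: a plain LinkedHashMap, so a lookup is expected constant time.
    private final Map<String, String> delegate = new LinkedHashMap<>();
    // Slow path: assumed to cost time proportional to the size of the classpath.
    private final Function<String, String> slowScan;
    // Counts how often the slow path was actually taken.
    private int scanCount = 0;

    DelegateFirstLookup(Function<String, String> slowScan) {
        this.slowScan = slowScan;
    }

    void putEntry(String key, String value) {
        delegate.put(key, value);
    }

    String getEntry(String key) {
        // 1. Check the delegate cache first.
        String cached = delegate.get(key);
        if (cached != null) {
            return cached;
        }
        // 2. Only on a miss, fall back to the expensive scan.
        scanCount++;
        return slowScan.apply(key);
    }

    int getScanCount() {
        return scanCount;
    }

    public static void main(String[] args) {
        // No classpath resources are configured here, so the scan finds nothing.
        DelegateFirstLookup storage = new DelegateFirstLookup(key -> null);
        // Placeholder URL, not taken from the issue.
        storage.putEntry("http://example.org/context.jsonld", "{\"@context\": {}}");

        System.out.println(storage.getEntry("http://example.org/context.jsonld"));
        System.out.println("slow scans performed: " + storage.getScanCount());
    }
}
```

With this ordering, a hit in the delegate never triggers the scan, so the scan counter stays at zero for entries that are already cached. Whether a result found by the scan should also be written back into the delegate is a separate design question that the sketch leaves open.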