[
https://issues.apache.org/jira/browse/ANY23-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416826#comment-16416826
]
Hans Brende commented on ANY23-336:
-----------------------------------
[~p_ansell] I created two different branches for you--both do the trick of
speeding up the requests by several orders of magnitude, and both have exactly
the same test case added. The only differences between them are the changes I
had to make to the {{JarCacheStorage}} class. Let me know which one you like
best:
First: https://github.com/HansBrende/jsonld-java/tree/ANY23-336
Second: https://github.com/HansBrende/jsonld-java/tree/ANY23-336-alt
Regarding the entry keys to the cache, the problem is that you are [returning a
*null value*|https://github.com/jsonld-java/jsonld-java/blob/fd4b95f6451586f10705682a88e68b571ecee610/core/src/main/java/com/github/jsonldjava/utils/JarCacheStorage.java#L98]
for any key that is not a valid URI, even if that key IS cached in your
delegating cache. So whenever the {{CachingHttpClient}} gives you a key-value
pair to store in which the key _is not_ a URI, that value is effectively
removed from the cache immediately, which in turn forces the
{{CachingHttpClient}} to perform superfluous HTTP requests.
This is easily seen by reverting [this
line|https://github.com/HansBrende/jsonld-java/blob/d24fdf6afcf763cd9203fd8814419da4807c8bc3/core/src/main/java/com/github/jsonldjava/utils/JarCacheStorage.java#L98]
in my _second_ branch back to the previous behavior of returning null, and
then running my test case. It will fail.
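To make the failure mode concrete, here is a minimal, self-contained sketch of the lookup behavior described above. It is a simplified model, not the actual {{JarCacheStorage}} code: the {{SimpleCacheStorage}} interface, the three classes, and the example key are all hypothetical stand-ins, and values are plain strings rather than real cache entries. The key is only assumed to be representative of a non-URI key the caching client might hand over.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a cache storage (string keys and values). */
interface SimpleCacheStorage {
    void putEntry(String key, String value);
    String getEntry(String key);
}

/** Plain in-memory storage playing the role of the delegating cache. */
class MapStorage implements SimpleCacheStorage {
    private final Map<String, String> entries = new HashMap<>();
    public void putEntry(String key, String value) { entries.put(key, value); }
    public String getEntry(String key) { return entries.get(key); }
}

/** Models the old behavior: any key that is not a valid URI yields null. */
class NullOnNonUriStorage implements SimpleCacheStorage {
    private final SimpleCacheStorage delegate;
    NullOnNonUriStorage(SimpleCacheStorage delegate) { this.delegate = delegate; }
    public void putEntry(String key, String value) { delegate.putEntry(key, value); }
    public String getEntry(String key) {
        try {
            new URI(key);
        } catch (URISyntaxException e) {
            // The entry may exist in the delegate, but it is never returned.
            return null;
        }
        return delegate.getEntry(key);
    }
}

/** Models one possible fix: fall back to the delegate for non-URI keys. */
class DelegateOnNonUriStorage implements SimpleCacheStorage {
    private final SimpleCacheStorage delegate;
    DelegateOnNonUriStorage(SimpleCacheStorage delegate) { this.delegate = delegate; }
    public void putEntry(String key, String value) { delegate.putEntry(key, value); }
    public String getEntry(String key) {
        try {
            new URI(key); // the real class would use the URI for its own lookup here
        } catch (URISyntaxException e) {
            return delegate.getEntry(key); // do not hide what the delegate has cached
        }
        return delegate.getEntry(key);
    }
}

class NonUriKeyDemo {
    public static void main(String[] args) {
        // Assumed example of a key that does not parse as a URI ('{' is illegal).
        String key = "{Accept-Encoding=gzip}http://example.org/context.jsonld";

        SimpleCacheStorage buggy = new NullOnNonUriStorage(new MapStorage());
        buggy.putEntry(key, "cached context");
        System.out.println("old behavior: " + buggy.getEntry(key));

        SimpleCacheStorage fixed = new DelegateOnNonUriStorage(new MapStorage());
        fixed.putEntry(key, "cached context");
        System.out.println("with fallback: " + fixed.getEntry(key));
    }
}
```

In this model, the first lookup misses even though the value was just stored, which is the situation that would force a caching client to go back to the network on every access; the second lookup returns the stored value.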
> Parsing json-ld content takes prohibitively long time
> -----------------------------------------------------
>
> Key: ANY23-336
> URL: https://issues.apache.org/jira/browse/ANY23-336
> Project: Apache Any23
> Issue Type: Bug
> Components: core, extractors
> Affects Versions: 2.2
> Reporter: Hans Brende
> Assignee: Peter Ansell
> Priority: Critical
> Fix For: 2.3
>
> Attachments: Screen Shot 2018-03-27 at 2.52.15 PM.png, Screen Shot
> 2018-03-27 at 2.54.43 PM.png
>
>
> Using the page [https://www.guthriegreen.com|https://www.guthriegreen.com/]
> as a benchmark, a page fetch took about 100 ms, while simply *parsing* the
> json-ld content on that page took a *staggering 27400 ms*. For reference, I'm
> using Java 8, build 162, on a MacBook Pro (early 2015).
> The bad news is that this is not our fault.
> I've profiled this behavior down to the
> {{com.github.jsonldjava.utils.JsonUtils.fromURL(URL, CloseableHttpClient)}}
> function. 94% of the parsing time is spent there. This function is called
> when trying to load remote json-ld contexts.
> In order to avoid loading remote contexts repeatedly, this function tries to
> *cache* them by using a {{CachingHttpClient}} from the httpclient-osgi
> library.
> Unfortunately, that strategy is *not* working, as I have recorded exactly
> *zero* cache hits, meaning that *every* retrieval is a cache miss and a
> remote context is re-fetched via http every single time it's accessed.
>