[ 
https://issues.apache.org/jira/browse/ANY23-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416561#comment-16416561
 ] 

Hans Brende commented on ANY23-336:
-----------------------------------

The other reason I ask is that the {{CachingHttpClient}} does not store 
*only URLs* as keys in the delegate cache (via {{putEntry(String key, 
HttpCacheEntry value)}}).

It also stores keys which look like this: 
{code}{Accept=application%2Fld%2Bjson%2C+application%2Fjson%3Bq%3D0.9%2C+application%2Fjavascript%3Bq%3D0.5%2C+text%2Fjavascript%3Bq%3D0.5%2C+text%2Fplain%3Bq%3D0.2%2C+*%2F*%3Bq%3D0.1&Accept-Encoding=gzip%2Cdeflate}http://schema.org:80/{code}
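To make the shape of such a key concrete, here is a minimal sketch that splits it into its URL-encoded header prefix and the trailing URL. It assumes every such key follows the {{ {headers}url }} layout of the example above; the class and method names are hypothetical and are not part of httpclient or jsonld-java.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: assumes variant keys look like "{url-encoded headers}url",
// as in the example key above.
class VariantKeyDemo {

    /** Returns the URL portion of a cache key, dropping a leading "{...}" prefix if present. */
    static String urlPart(String key) {
        if (key.startsWith("{")) {
            int end = key.indexOf('}');
            if (end > 0) {
                return key.substring(end + 1);
            }
        }
        return key;
    }

    /** Returns the decoded header prefix, or an empty string for a plain URL key. */
    static String variantPart(String key) {
        if (!key.startsWith("{")) {
            return "";
        }
        int end = key.indexOf('}');
        if (end < 0) {
            return "";
        }
        try {
            return URLDecoder.decode(key.substring(1, end), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String key = "{Accept=application%2Fld%2Bjson&Accept-Encoding=gzip%2Cdeflate}"
                + "http://schema.org:80/";
        System.out.println(variantPart(key));
        System.out.println(urlPart(key));
    }
}
```

The point is only that the part before the closing brace carries request headers, so the key as a whole cannot be treated as a URL.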

Keys such as these will *never* have a corresponding match in the classpath, so 
for them it only makes sense to check the delegate cache first. 
Currently, your solution simply returns null when such keys are requested 
again (since they aren't valid URIs), which means that the 
{{CachingHttpClient}} has to make extra HTTP requests to retrieve their 
corresponding values.
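
As a rough sketch of the lookup order I have in mind (not the actual patch): keys that do not parse as URIs go straight to the delegate cache, and only plain URL keys consult the classpath. The two maps below are hypothetical stand-ins for the delegate cache and the bundled classpath resources, and the class name is made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the delegate cache and the
// classpath-bundled contexts; the real storage holds HttpCacheEntry values.
class LookupOrderDemo {
    final Map<String, String> delegate = new HashMap<>();
    final Map<String, String> classpath = new HashMap<>();

    /** True if the key parses as a URI; keys with a "{...}" prefix do not. */
    static boolean isPlainUri(String key) {
        try {
            new URI(key);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Non-URI keys use the delegate only; URI keys try the classpath, then the delegate. */
    String get(String key) {
        if (!isPlainUri(key)) {
            return delegate.get(key);
        }
        String bundled = classpath.get(key);
        return bundled != null ? bundled : delegate.get(key);
    }

    public static void main(String[] args) {
        LookupOrderDemo cache = new LookupOrderDemo();
        String variantKey = "{Accept-Encoding=gzip%2Cdeflate}http://schema.org:80/";
        cache.delegate.put(variantKey, "cached variant entry");
        cache.classpath.put("http://schema.org/", "bundled context");
        System.out.println(cache.get(variantKey));
        System.out.println(cache.get("http://schema.org/"));
    }
}
```

With this ordering, a previously stored non-URI key is returned from the delegate instead of null, so no extra HTTP request is needed for it.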

> Parsing json-ld content takes prohibitively long time
> -----------------------------------------------------
>
>                 Key: ANY23-336
>                 URL: https://issues.apache.org/jira/browse/ANY23-336
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core, extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Assignee: Peter Ansell
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: Screen Shot 2018-03-27 at 2.52.15 PM.png, Screen Shot 
> 2018-03-27 at 2.54.43 PM.png
>
>
> Using the page [https://www.guthriegreen.com|https://www.guthriegreen.com/] 
> as a benchmark, a page fetch took about 100 ms, while simply *parsing* the 
> json-ld content on that page took a *staggering 27400 ms*. For reference, I'm 
> using Java 8, build 162, on a Macbook Pro (early 2015).
> The bad news is that this is not our fault.
> I've profiled this behavior down to the 
> {{com.github.jsonldjava.utils.JsonUtils.fromURL(URL, CloseableHttpClient)}} 
> function. 94% of the parsing time is spent there. This function is called 
> when trying to load remote json-ld contexts. 
> In order to avoid loading remote contexts repeatedly, this function tries to 
> *cache* them by using a {{CachingHttpClient}} from the httpclient-osgi 
> library.
> Unfortunately, that strategy is *not* working, as I have recorded exactly 
> *zero* cache hits, meaning that *every* retrieval is a cache miss and a 
> remote context is re-fetched via http every single time it's accessed.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
