[ 
https://issues.apache.org/jira/browse/ANY23-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416368#comment-16416368
 ] 

Hans Brende commented on ANY23-336:
-----------------------------------

[~p_ansell] I agree! That would be not just useful, but *crisis-averting*!

With the way it's set up now, someone could, for example, *utterly sabotage* a 
crawler that uses Any23 simply by creating a large number of json-ld elements 
whose {{@context}} values all point to an arbitrary domain that doesn't send 
the right caching headers. We would have no way of knowing beforehand which 
contexts to add to the classpath, and we would, essentially, be DoS'ed.
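
To make the scenario above concrete, here is a minimal sketch of the kind of 
safeguard I have in mind: a small, bounded in-memory cache that remembers each 
context by URL regardless of what caching headers the remote server sends. 
Everything in it is hypothetical ({{ContextCache}}, the {{fetcher}} function, 
the size limit); it is not part of Any23 or jsonld-java, and the actual HTTP 
fetch is left to whatever function the caller passes in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical helper (not an Any23 or jsonld-java class): caches remote
 * JSON-LD context documents by URL, ignoring HTTP caching headers entirely.
 */
final class ContextCache {
    private final int maxEntries;
    private final Function<String, String> fetcher;
    private final Map<String, String> cache;
    private int fetchCount = 0;

    /**
     * @param maxEntries upper bound on cached contexts (assumed tuning knob)
     * @param fetcher    caller-supplied function that downloads a context
     */
    ContextCache(int maxEntries, Function<String, String> fetcher) {
        this.maxEntries = maxEntries;
        this.fetcher = fetcher;
        // Access-ordered map: the least recently used entry is evicted
        // once the cache grows past maxEntries.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > ContextCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached document, fetching it only on a cache miss. */
    synchronized String get(String url) {
        String doc = cache.get(url);
        if (doc == null) {
            doc = fetcher.apply(url);
            fetchCount++;
            if (doc != null) {
                cache.put(url, doc); // failed (null) fetches are not cached
            }
        }
        return doc;
    }

    /** Number of times the fetcher was actually invoked. */
    synchronized int fetchCount() {
        return fetchCount;
    }

    /** Number of contexts currently held in memory. */
    synchronized int size() {
        return cache.size();
    }
}
```

With something like this in front of the loader, a page that repeats the same 
{{@context}} URL many times would cost one fetch instead of one per element. 
It does not help against a page that references many *distinct* context URLs; 
that case would still need a cap on remote fetches per document or similar.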

> Parsing json-ld content takes prohibitively long time
> -----------------------------------------------------
>
>                 Key: ANY23-336
>                 URL: https://issues.apache.org/jira/browse/ANY23-336
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: core, extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Assignee: Peter Ansell
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: Screen Shot 2018-03-27 at 2.52.15 PM.png, Screen Shot 
> 2018-03-27 at 2.54.43 PM.png
>
>
> Using the page [https://www.guthriegreen.com|https://www.guthriegreen.com/] 
> as a benchmark, a page fetch took about 100 ms, while simply *parsing* the 
> json-ld content on that page took a *staggering 27400 ms*. For reference, I'm 
> using Java 8, build 162, on a MacBook Pro (early 2015).
> The bad news is that this is not our fault.
> I've profiled this behavior down to the 
> {{com.github.jsonldjava.utils.JsonUtils.fromURL(URL, CloseableHttpClient)}} 
> function. 94% of the parsing time is spent there. This function is called 
> when trying to load remote json-ld contexts. 
> In order to avoid loading remote contexts repeatedly, this function tries to 
> *cache* them by using a {{CachingHttpClient}} from the httpclient-osgi 
> library.
> Unfortunately, that strategy is *not* working: I have recorded exactly 
> *zero* cache hits, meaning that *every* retrieval is a cache miss and a 
> remote context is re-fetched over HTTP every single time it's accessed.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
