[ 
https://issues.apache.org/jira/browse/JENA-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140987#comment-15140987
 ] 

Rob Vesse commented on JENA-1138:
---------------------------------

A quick review of the offending code suggests the likely problem: when parsing 
data into a {{DatasetGraph}}, each triple/quad read results in an 
{{add(Node, Node, Node, Node)}} call on the {{DatasetGraph}} instance.  With 
the new in-memory dataset, each such call runs in its own transaction, which is 
likely why loading takes so long.  The implementation of these transactions 
also creates some temporary objects, which are likely what causes the GC 
thrashing.

My first thought is to change the parser-to-dataset bridge 
({{StreamRDFLib.ParseOutputDataset}}) to batch up insertions in transactions 
when the underlying dataset supports transactions.
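
To make the batching idea concrete, here is a minimal, self-contained sketch.  
It does not use Jena's API: {{TxnStore}}, {{CountingStore}} and 
{{BatchingSink}} are hypothetical names invented for illustration, and the 
batch size of 1,000 is an arbitrary assumption.  It only shows how a sink 
between the parser and the store could commit once per batch rather than once 
per triple/quad.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingSketch {

    /** Hypothetical stand-in for a transactional store; NOT Jena's API. */
    interface TxnStore<T> {
        void begin();
        void add(T item);
        void commit();
    }

    /** Toy store that counts commits so the two strategies can be compared. */
    static final class CountingStore<T> implements TxnStore<T> {
        final List<T> items = new ArrayList<>();
        int commits = 0;
        private boolean inTxn = false;

        public void begin() {
            if (inTxn) throw new IllegalStateException("already in a transaction");
            inTxn = true;
        }

        public void add(T item) {
            if (!inTxn) throw new IllegalStateException("add outside a transaction");
            items.add(item);
        }

        public void commit() {
            if (!inTxn) throw new IllegalStateException("commit outside a transaction");
            inTxn = false;
            commits++;
        }
    }

    /** Hypothetical sink that opens one transaction per batch of items. */
    static final class BatchingSink<T> {
        private final TxnStore<T> store;
        private final int batchSize;
        private int pending = 0;

        BatchingSink(TxnStore<T> store, int batchSize) {
            if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
            this.store = store;
            this.batchSize = batchSize;
        }

        /** Called once per parsed item; commits when the batch is full. */
        void accept(T item) {
            if (pending == 0) store.begin();
            store.add(item);
            pending++;
            if (pending >= batchSize) flush();
        }

        /** Called at end of parsing to commit any partial batch. */
        void finish() {
            if (pending > 0) flush();
        }

        private void flush() {
            store.commit();
            pending = 0;
        }
    }

    public static void main(String[] args) {
        // Strategy 1: one transaction per item (the suspected current behaviour).
        CountingStore<String> perItem = new CountingStore<>();
        for (int i = 0; i < 10_000; i++) {
            perItem.begin();
            perItem.add("t" + i);
            perItem.commit();
        }

        // Strategy 2: one transaction per batch of 1,000 items.
        CountingStore<String> batched = new CountingStore<>();
        BatchingSink<String> sink = new BatchingSink<>(batched, 1_000);
        for (int i = 0; i < 10_000; i++) {
            sink.accept("t" + i);
        }
        sink.finish();

        // Prints: per-item commits: 10000, batched commits: 10
        System.out.println("per-item commits: " + perItem.commits
                + ", batched commits: " + batched.commits);
    }
}
```

A real change would also have to decide what happens when parsing fails 
part-way through a batch (abort versus commit what has been read so far), 
which this sketch does not address.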

However, this would likely be only a partial fix, and I'm not familiar enough 
with the new implementation to understand why it performs so much worse here.  
This likely needs [~ajs6f] and/or [~andy.seaborne] to take a more in-depth 
look, and having the sample data would be extremely helpful for that.

> java.lang.OutOfMemoryError: GC overhead limit exceeded
> ------------------------------------------------------
>
>                 Key: JENA-1138
>                 URL: https://issues.apache.org/jira/browse/JENA-1138
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: Cmd line tools
>    Affects Versions: Jena 3.0.1
>         Environment: Oracle JDK 1.8.0, Windows 7 64bit
>            Reporter: Giovanni Mels
>              Labels: performance
>         Attachments: sample-data.zip
>
>
> Since 3.0.1 we get {{java.lang.OutOfMemoryError: GC overhead limit exceeded}} 
> exceptions when using the {{sparql}} command line tool, even on relatively 
> small datasets (~1.6 million triples).
> The issue occurs when the dataset is loaded in memory, so before the actual 
> query execution. 
> {code}
> sparql --query empty.rq --data sample-data.ttl
> {code}
> Where {{empty.rq}} contains:
> {noformat}
> SELECT * WHERE {}
> {noformat}
> This query takes ~20 seconds using Jena 2.13.0 and Jena 3.0.0; it fails with 
> 3.0.1 after ~4 minutes with {{java.lang.OutOfMemoryError: GC overhead limit 
> exceeded}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
