[jira] [Comment Edited] (JENA-2302) RowSetReaderJSON is not streaming

Claus Stadler (Jira) Wed, 09 Mar 2022 11:01:06 -0800


    [ https://issues.apache.org/jira/browse/JENA-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503788#comment-17503788 ]


Claus Stadler edited comment on JENA-2302 at 3/9/22, 7:00 PM:
--------------------------------------------------------------

Here are the performance results (millisecond granularity) for larger data (200MB) 
created with this [benchmark runner|https://github.com/Aklakan/jena/blob/c698e61b59b8e8e7ebb0ae5341c4fed57b4b4676/jena-arq/src/main/java/org/apache/jena/riot/rowset/rw/RowSetJSONStreamingBenchmark.java#L61] 
(I will remove it from the repo when done).

Assuming I didn't mess anything up, the results suggest that the streaming 
approach ("actual") can process roughly 3x as much data in a given time frame 
as the non-streaming one ("expected"):

{code:bash}
Time taken for iteration0:expected:setup: 7.793s
Time taken for iteration0:expected:consumption: 0.061000004s
Time taken for iteration0:actual:setup: 0.15300001s
Time taken for iteration0:actual:consumption: 2.094s
Result sets are equal - items seen: 219441

Time taken for iteration1:expected:setup: 6.5320005s
Time taken for iteration1:expected:consumption: 0.022000002s
Time taken for iteration1:actual:setup: 0.0s
Time taken for iteration1:actual:consumption: 1.6650001s
Result sets are equal - items seen: 219441
...
Time taken for iteration20:expected:setup: 6.2060003s
Time taken for iteration20:expected:consumption: 0.012s
Time taken for iteration20:actual:setup: 0.0s
Time taken for iteration20:actual:consumption: 2.137s
Result sets are equal - items seen: 219441
...
Time taken for iteration29:expected:setup: 6.2460003s
Time taken for iteration29:expected:consumption: 0.012s
Time taken for iteration29:actual:setup: 0.0s
Time taken for iteration29:actual:consumption: 2.2870002s
Result sets are equal - items seen: 219441
{code}
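
To make the "roughly 3x" estimate easier to check, here is a small sketch that adds setup and consumption time per approach for the four iterations shown above and prints the ratio. The class name `SpeedupEstimate` and the helper `ratio` are made up for illustration and are not part of the benchmark runner; the timings are copied from the log, rounded to milliseconds.

```java
public class SpeedupEstimate {
    /** Total time of the non-streaming run divided by total time of the streaming run. */
    static double ratio(double expectedSetup, double expectedConsumption,
                        double actualSetup, double actualConsumption) {
        return (expectedSetup + expectedConsumption) / (actualSetup + actualConsumption);
    }

    public static void main(String[] args) {
        // Iteration numbers as they appear in the log above.
        int[] iterations = {0, 1, 20, 29};
        // Per iteration, in seconds:
        // {expected setup, expected consumption, actual setup, actual consumption}
        double[][] timings = {
            {7.793, 0.061, 0.153, 2.094},
            {6.532, 0.022, 0.000, 1.665},
            {6.206, 0.012, 0.000, 2.137},
            {6.246, 0.012, 0.000, 2.287},
        };
        for (int i = 0; i < iterations.length; i++) {
            double[] t = timings[i];
            System.out.printf("iteration%d: %.2fx%n", iterations[i], ratio(t[0], t[1], t[2], t[3]));
        }
    }
}
```

For these four iterations the ratio lands between about 2.7x and 3.9x. Note that the non-streaming reader spends almost all of its time in setup, while the streaming reader spends it during consumption, so only the sum of both phases is comparable.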


> RowSetReaderJSON is not streaming
> ---------------------------------
>
>                 Key: JENA-2302
>                 URL: https://issues.apache.org/jira/browse/JENA-2302
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 4.5.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> Retrieving all data from our TDB2 endpoint with Jena 4.5.0-SNAPSHOT is no 
> longer streaming for the JSON format. I tracked the issue to RowSetReaderJSON, 
> which reads everything into memory (and only then checks whether it is a 
> SPARQL ASK result):
> {code:java}
> public class RowSetReaderJSON {
>     private void parse(InputStream in) {
>         // JSON.parse builds the complete JSON object before returning
>         JsonObject obj = JSON.parse(in); // !!! Loads everything !!!
>         // Boolean (SPARQL ASK) result?
>         if ( obj.hasKey(kBoolean) ) { ... }
>     }
> }
> {code}
> Streaming works when switching to RS_XML in the example below:
> {code:java}
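> import org.apache.jena.query.QueryExecution;
> import org.apache.jena.riot.resultset.ResultSetLang;
> import org.apache.jena.sparql.exec.http.QueryExecutionHTTP;
>
> public class Main {
>     public static void main(String[] args) {
>         System.out.println("Test Started");
>         try (QueryExecution qe = QueryExecutionHTTP.create()
>                 .acceptHeader(ResultSetLang.RS_JSON.getContentType().getContentTypeStr())
>                 .endpoint("http://moin.aksw.org/sparql")
>                 .queryString("SELECT * { ?s ?p ?o }")
>                 .build()) {
>             // Rows print as they arrive only if the result reader streams
>             qe.execSelect().forEachRemaining(System.out::println);
>         }
>         System.out.println("Done");
>     }
> }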
> {code}
> For completeness, I can rule out any problem with TDB2 because streaming of 
> JSON works just fine with: 
> {code:bash}
> curl --data-urlencode "query=select * { ?s ?p ?o }" "http://moin.aksw.org/sparql"
> {code}
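
The curl check can also be approximated from Java without Jena, which helps separate server-side streaming from client-side parsing. Below is a rough sketch using the JDK's `java.net.http` client (Java 11+). The class name `RawStreamCheck`, the helper `formBody`, the explicit `Accept` header, and the time-to-first-line measurement are my own illustration and not taken from the report; the endpoint URL is the one quoted above and may not be reachable.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RawStreamCheck {
    /** Form-encodes the query as the value of a "query" parameter for a POST body. */
    static String formBody(String sparql) {
        return "query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://moin.aksw.org/sparql"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                // Assumption: ask explicitly for SPARQL JSON results
                .header("Accept", "application/sparql-results+json")
                .POST(HttpRequest.BodyPublishers.ofString(formBody("select * { ?s ?p ?o }")))
                .build();

        long start = System.nanoTime();
        // ofInputStream() exposes the body as it arrives instead of buffering all of it
        HttpResponse<InputStream> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofInputStream());

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(response.body(), StandardCharsets.UTF_8))) {
            long lines = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (lines == 0) {
                    // A small delay here, long before the end, indicates the server streams
                    long firstMs = (System.nanoTime() - start) / 1_000_000;
                    System.out.println("first line after " + firstMs + " ms");
                }
                lines++;
            }
            long totalMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(lines + " lines after " + totalMs + " ms");
        }
    }
}
```

If the first line shows up long before the last one, the server is streaming and any remaining delay on the Jena side would come from the result reader.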



