[
https://issues.apache.org/jira/browse/JENA-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503012#comment-17503012
]
Claus Stadler commented on JENA-2302:
-------------------------------------
* Is the proposed code only using the parser with no data mapping?
The parser only relies on Gson's default configuration (new Gson()). I do,
however, work with JsonObject.class, List.class and String.class, as in the
line below, which I suppose uses Gson's internal mapping machinery:
{code:java}
case "vars":
    List<String> varNames = gson.fromJson(reader, new TypeToken<List<String>>() {}.getType());
{code}
I am not using any custom TypeAdapters (which seemed a bit like overkill). A fuller, self-contained sketch of the streaming approach is below.
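For illustration, here is a minimal sketch of how the header can be read with Gson's streaming JsonReader - the class and structure are made up for this example, only the Gson calls are the real API. In the actual reader the "results" branch of course streams bindings one by one instead of skipping them:
{code:java}
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import com.google.gson.stream.JsonReader;

public class HeadParseSketch {
    /** Streams over {"head": {"vars": [...]}, ...} and returns the vars. */
    public static List<String> parseVars(InputStream in) throws Exception {
        Gson gson = new Gson();
        JsonReader reader = new JsonReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.beginObject();
        while (reader.hasNext()) {
            switch (reader.nextName()) {
                case "head":
                    reader.beginObject();
                    while (reader.hasNext()) {
                        if ("vars".equals(reader.nextName()))
                            // Small, fixed-size element: Gson's default mapping
                            // suffices, no custom TypeAdapter needed.
                            return gson.fromJson(reader, new TypeToken<List<String>>() {}.getType());
                        reader.skipValue();
                    }
                    reader.endObject();
                    break;
                default:
                    reader.skipValue(); // e.g. "results" appearing before "head"
            }
        }
        reader.endObject();
        return null; // no head/vars present
    }
}
{code}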
* Performance measures
I did a quick comparison yesterday with curl on the remote data, and the
limiting factor was clearly the bandwidth - times varied between 14 and 20
seconds for both tools. I can add a "test case" that measures streaming time
with the existing implementation, along the lines of the sketch below.
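The measurement I have in mind separates time-to-first-row from total time - with a streaming reader the former should stay small regardless of the result size. A rough sketch using plain System.nanoTime() and the endpoint from the issue:
{code:java}
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.sparql.exec.http.QueryExecutionHTTP;

public class StreamTiming {
    public static void main(String[] args) {
        long start = System.nanoTime();
        long firstRow = -1;
        long rows = 0;
        try (QueryExecution qe = QueryExecutionHTTP.create()
                .endpoint("http://moin.aksw.org/sparql")
                .queryString("SELECT * { ?s ?p ?o }")
                .build()) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                rs.next();
                if (firstRow < 0)
                    firstRow = System.nanoTime() - start; // small if streaming
                rows++;
            }
        }
        System.out.printf("rows=%d firstRow=%dms total=%dms%n",
                rows, firstRow / 1_000_000, (System.nanoTime() - start) / 1_000_000);
    }
}
{code}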
* Is results-then-head tested?
I need to add a test case for that - but the code is written with that in mind.
Right now the streaming parser might be too lenient: if there are multiple
"bindings" keys it will just keep streaming those, and unexpected JSON elements
are simply skipped. I'd add support for setting an ErrorHandler that controls
whether this only logs a warning or raises an exception - roughly as sketched
below.
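Roughly like this - the class and method names are placeholders, not existing Jena API:
{code:java}
// Hypothetical sketch - names are placeholders, not existing Jena API.
public class StreamingErrorHandler {
    public enum Policy { WARN, ERROR }

    private final Policy policy;

    public StreamingErrorHandler(Policy policy) {
        this.policy = policy;
    }

    /** Called by the streaming parser whenever it hits an unexpected JSON element. */
    public void unexpectedElement(String path) {
        String msg = "Unexpected element in result stream: " + path;
        if (policy == Policy.ERROR)
            throw new IllegalStateException(msg);
        System.err.println("WARN: " + msg); // real code would use the logger
    }
}
{code}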
* I already migrated the code to my Jena fork and added the ASK result handling
there - I will create the PR once the issues are resolved.
* Formally DataBag is not order preserving (IIRC), and when does it spill?
Hm, the documentation of DefaultDataBag states that it is backed by an
ArrayList, so insertion order is preserved. The spill can happen as soon as
getResultVars() is called while the variables have not yet been seen in the
stream.
I could extend the check so that if the result vars are still null,
getResultVars() first calls hasNext(), which might just read the header - and
thus avoid a needless spill when the header is the first thing in the stream
(see the sketch after this bullet).
Because DefaultDataBag uses an in-memory list until the number of items reaches
a configurable threshold, it seems to be the right tool for the buffering.
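To make the deferred-spill idea concrete (the methods below are stand-ins for the real parser internals, not the actual patch):
{code:java}
// Illustrative only - hasNext()/bufferUntilHeadSeen() stand in for the
// real parser internals.
import java.util.List;

public abstract class RowSetBufferingSketch {
    protected List<String> resultVars; // set as a side effect of parsing "head"

    /** Advances the JSON stream by one step; parses "head" when it appears. */
    protected abstract boolean hasNext();

    /** Buffers bindings into the DefaultDataBag until "head" shows up. */
    protected abstract void bufferUntilHeadSeen();

    public List<String> getResultVars() {
        if (resultVars == null)
            // Cheap first attempt: if "head" precedes "results", this reads
            // the header without buffering a single binding ...
            hasNext();
        if (resultVars == null)
            // ... otherwise it is results-then-head, so buffer (and possibly
            // spill) until the trailing "head" arrives.
            bufferUntilHeadSeen();
        return resultVars;
    }
}
{code}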
* Missing kTypedLiteral
Good point.
* gson file sizes:
gson 432,0 KiB (total size of the .m2 folder) of which
244,0 KiB [##########] gson-2.9.0.jar
156,0 KiB [######] gson-2.9.0-sources.jar
gson-parent 20,0 KiB (whole folder)
No further dependencies.
> RowSetReaderJSON is not streaming
> ---------------------------------
>
> Key: JENA-2302
> URL: https://issues.apache.org/jira/browse/JENA-2302
> Project: Apache Jena
> Issue Type: Bug
> Components: ARQ
> Affects Versions: Jena 4.5.0
> Reporter: Claus Stadler
> Priority: Major
>
> Retrieving all data from our TDB2 endpoint with Jena 4.5.0-SNAPSHOT is no
> longer streaming for the JSON format. I tracked the issue down to
> RowSetReaderJson, which reads everything into memory (and then checks whether
> it is a SPARQL ASK result):
> {code:java}
> public class RowSetReaderJson {
>     private void parse(InputStream in) {
>         JsonObject obj = JSON.parse(in); // !!! Loads everything !!!
>         // Boolean?
>         if ( obj.hasKey(kBoolean) ) { ... }
>     }
> }
> {code}
> Streaming works when switching to RS_XML in the example below:
> {code:java}
> public class Main {
>     public static void main(String[] args) {
>         System.out.println("Test Started");
>         try (QueryExecution qe = QueryExecutionHTTP.create()
>                 .acceptHeader(ResultSetLang.RS_JSON.getContentType().getContentTypeStr())
>                 .endpoint("http://moin.aksw.org/sparql")
>                 .queryString("SELECT * { ?s ?p ?o }")
>                 .build()) {
>             qe.execSelect().forEachRemaining(System.out::println);
>         }
>         System.out.println("Done");
>     }
> }
> {code}
> For completeness, I can rule out any problem with TDB2 because streaming of
> JSON works just fine with:
> {code:bash}
> curl --data-urlencode "query=select * { ?s ?p ?o }" "http://moin.aksw.org/sparql"
> {code}