[
https://issues.apache.org/jira/browse/CLEREZZA-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135900#comment-13135900
]
Daniel Spicar commented on CLEREZZA-643:
----------------------------------------
Thank you for your contribution, Rupert.
I inspected the patch you submitted as well as the current version of the code.
An improved RDF-JSON serializer is something we would really like to have. In
the real-world applications we use Clerezza for, one major problem is poor
performance and/or excessive memory consumption; we are dealing with huge
graphs there. Your contribution therefore sounds really promising. Since I am
not the author of the original code, I reviewed it as well with respect to the
above scenario. My focus was on determining whether the implementations scale
to large graphs.
Comments on the Serializer:
- As I understand the rdf-json specification, sorting of the output is not
required, only grouping by subject and predicate. Therefore I don't think the
more expensive subject-predicate sort (which you commented out but left in the
patch) is necessary. Or am I missing something? Can this part be safely
removed?
- The original code (unpatched) does NOT properly stream the serialization.
This is a concern when the source graph contains a large number of unique
subjects/predicates/objects, because all the generated JSONObjects/JSONArrays
are held in memory before being written to the output stream. This is
especially concerning when many BLOBs are stored in the graph.
- The patch does stream the serialization correctly, but it loads the entire
source graph into memory for sorting (the toArray call at line 99). Again,
this may easily exceed the available memory. The original code does not load
the entire source graph into memory, as it uses filter (when the underlying
graph is backed by a TripleStore); the iterators returned by filter access
only one triple at a time, upon each call to next(). The sketch after this
list contrasts the two access patterns.
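To make the contrast concrete, here is a minimal sketch of the two access
patterns, assuming the org.apache.clerezza.rdf.core API (the method bodies and
the comparator are mine, not taken from either implementation):

import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;

import org.apache.clerezza.rdf.core.NonLiteral;
import org.apache.clerezza.rdf.core.Triple;
import org.apache.clerezza.rdf.core.TripleCollection;

public class AccessPatternSketch {

    // Unpatched serializer: filter() returns a lazy iterator. When the
    // graph is backed by a TripleStore, each next() call fetches a single
    // triple, so the graph itself is never materialized in memory (only
    // the JSON objects built from it are, which is the streaming problem
    // noted above).
    static void lazyAccess(TripleCollection tc, NonLiteral subject) {
        Iterator<Triple> it = tc.filter(subject, null, null);
        while (it.hasNext()) {
            Triple t = it.next();
            // ... emit output for t.getPredicate() / t.getObject() ...
        }
    }

    // Patched serializer: the whole graph is copied into an array and
    // sorted, so memory use grows with the size of the graph.
    static Triple[] sortedAccess(TripleCollection tc) {
        Triple[] all = tc.toArray(new Triple[tc.size()]);
        Arrays.sort(all, new Comparator<Triple>() {
            public int compare(Triple a, Triple b) {
                int c = a.getSubject().toString()
                        .compareTo(b.getSubject().toString());
                return c != 0 ? c : a.getPredicate().getUnicodeString()
                        .compareTo(b.getPredicate().getUnicodeString());
            }
        });
        return all;
    }
}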
Conclusion:
I think neither solution can support graphs that exceed the available memory.
I assume the unpatched version can deal with slightly larger graphs than your
solution, but that is irrelevant. We need a solution that works reliably with
graphs larger than memory. As you mentioned, an optimal solution would exploit
a sorted (or at least grouped) iterator provided by the underlying
TripleCollection. I think that is the approach we need to take to solve this
issue in a scalable manner. A sketch of what such a contract could look like
follows below.
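Just to make that idea concrete, the contract could look roughly like this
(the interface and method names are hypothetical; nothing like this exists in
the current API):

import java.util.Iterator;
import org.apache.clerezza.rdf.core.Triple;

// Hypothetical contract, only to illustrate the idea; neither this
// interface nor the method exists in Clerezza today.
interface GroupedTripleCollection {

    /**
     * Returns an iterator whose triples arrive grouped (not necessarily
     * globally sorted) by subject, then by predicate. A serializer can
     * emit each subject/predicate block as soon as its group ends and
     * only ever needs to keep the current group in memory.
     */
    Iterator<Triple> groupedIterator();
}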
Now there is the question of whether to accept your patch for Clerezza until
we implement a better solution. I am not sure. Your solution is a significant
improvement in serialization speed, but the original code is easier to
quick-fix so that the results are streamed properly to the output stream (I
think exploiting json-simple's streaming support may do the trick; see the
sketch below). So the question seems to be what is more important: a solution
that, while possibly very slow, will not exceed the available memory, or a
solution that significantly improves serialization performance.
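The quick fix I have in mind goes along these lines: write the JSON
punctuation directly to the writer and use json-simple only for escaping. A
rough sketch (it ignores bnodes, datatypes and language tags, and the method
name is mine):

import java.io.IOException;
import java.io.Writer;

import org.json.simple.JSONValue;

public class StreamingQuickFixSketch {

    // Writes one "predicate": [ {...}, ... ] entry straight to the writer
    // instead of first accumulating JSONObjects/JSONArrays in memory.
    static void writePredicate(Writer out, String predicate,
            Iterable<String> literalValues, boolean first) throws IOException {
        if (!first) {
            out.write(',');
        }
        out.write('"');
        out.write(JSONValue.escape(predicate));
        out.write("\":[");
        boolean firstValue = true;
        for (String value : literalValues) {
            if (!firstValue) {
                out.write(',');
            }
            out.write("{\"type\":\"literal\",\"value\":\"");
            out.write(JSONValue.escape(value));
            out.write("\"}");
            firstValue = false;
        }
        out.write(']');
    }
}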
My opinion is that, since we have lived so far with a solution that cannot
deal with very large graphs anyway, the speed improvement may be more
valuable. However, we need to start working on a better solution as described
above.
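As a side note on the encoding part of this issue: the UTF-8 fix itself is
trivial; it amounts to passing the charset explicitly instead of relying on
the platform default. Roughly (a before/after sketch, not the actual patched
code):

import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class Utf8WriterSketch {

    // Before: the platform default encoding is used, so the output
    // differs between systems.
    static Writer platformDependent(OutputStream out) {
        return new OutputStreamWriter(out);
    }

    // After: the encoding is fixed to UTF-8, as the patch does.
    static Writer utf8(OutputStream out) {
        return new OutputStreamWriter(out, Charset.forName("UTF-8"));
    }
}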
I think we should raise this issue on the mailing list for discussion.
> Weak Performance of "application/json+rdf" serializer on big
> TripleCollections and Serializer/Parser using Platform encoding instead of
> UTF-8
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CLEREZZA-643
> URL: https://issues.apache.org/jira/browse/CLEREZZA-643
> Project: Clerezza
> Issue Type: Improvement
> Reporter: Rupert Westenthaler
> Assignee: Daniel Spicar
> Attachments: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch
>
>
> Both the "application/json+rdf" serializer and parser use platform-specific
> encodings instead of UTF-8.
> In addition, the serializer suffers from very poor performance on big graphs
> (at least when using SimpleMGraph).
> After some digging in the code I came to the conclusion that this is because
> of the use of multiple TripleCollection.filter(..) calls, first to filter all
> predicates for a subject and then all objects for each subject/predicate
> combination. An attempt to serialize a graph with 50k triples ended in
> several minutes at 100% CPU.
> With the next comment I will provide a patch with an implementation based on
> a sorted array of the triples. With this method one can serialize graphs
> with 100k triples in about 1 second. This patch also changes the encoding to
> UTF-8.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira