[
https://issues.apache.org/jira/browse/CLEREZZA-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135900#comment-13135900
]
Daniel Spicar commented on CLEREZZA-643:
----------------------------------------
Thank you for your contribution, Rupert.
I inspected the patch you submitted as well as the current version of the code.
An improved RDF-JSON serializer is something we would really like to have. In
the real-world applications we use Clerezza for, one major problem is poor
performance and/or excessive memory consumption; we are dealing with huge
graphs there. Your contribution therefore sounds really promising. Since I am
not the author of the original code, I reviewed it as well with respect to the
above scenario. My focus was on determining whether the implementations scale
to large graphs.
Comments on the Serializer:
- As I understand the rdf-json specification, sorting of the output is not
required, only grouping by subject and predicate. Therefore I don't think the
more expensive subject-predicate sort (which you commented out but left in the
patch) is necessary. Or am I missing something? Can this part be safely
removed?
- The original code (unpatched) does NOT properly stream the serialization.
This is a concern when the source graph contains a large number of unique
subjects/predicates/objects, because all the generated JSONObjects/JSONArrays
are held in memory before being written to the output stream. This is
especially concerning when many BLOBs are stored in the graph.
- The patch does stream the serialization correctly, but it loads the entire
source graph into memory for sorting (the toArray call at line 99). Again,
this may easily exceed the available memory. The original code does not load
the entire source graph into memory, as it uses filter (when the underlying
graph is backed by a TripleStore); the iterators returned by filter access
only one triple at a time, upon each call to next(). The sketch after this
list contrasts the two access patterns.
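To make the contrast concrete, here is a minimal sketch of the two access
patterns, assuming the org.apache.clerezza.rdf.core API (the method bodies and
the comparator are mine, not taken from either implementation):

import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;

import org.apache.clerezza.rdf.core.NonLiteral;
import org.apache.clerezza.rdf.core.Triple;
import org.apache.clerezza.rdf.core.TripleCollection;

public class AccessPatternSketch {

    // Unpatched serializer: filter() returns a lazy iterator. When the
    // graph is backed by a TripleStore, each next() call fetches a single
    // triple, so the graph itself is never materialized in memory (only
    // the JSON objects built from it are, which is the streaming problem
    // noted above).
    static void lazyAccess(TripleCollection tc, NonLiteral subject) {
        Iterator<Triple> it = tc.filter(subject, null, null);
        while (it.hasNext()) {
            Triple t = it.next();
            // ... emit output for t.getPredicate() / t.getObject() ...
        }
    }

    // Patched serializer: the whole graph is copied into an array and
    // sorted, so memory use grows with the size of the graph.
    static Triple[] sortedAccess(TripleCollection tc) {
        Triple[] all = tc.toArray(new Triple[tc.size()]);
        Arrays.sort(all, new Comparator<Triple>() {
            public int compare(Triple a, Triple b) {
                int c = a.getSubject().toString()
                        .compareTo(b.getSubject().toString());
                return c != 0 ? c : a.getPredicate().getUnicodeString()
                        .compareTo(b.getPredicate().getUnicodeString());
            }
        });
        return all;
    }
}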
Conclusion:
I think neither solution can support graphs that exceed the available memory.
I assume the unpatched version can deal with slightly larger graphs than your
solution, but that is irrelevant. We need a solution that works reliably with
graphs larger than memory. As you mentioned, an optimal solution would exploit
a sorted (or at least grouped) iterator provided by the underlying
TripleCollection. I think that is the approach we need to take to solve this
issue in a scalable manner. A sketch of what such a contract could look like
follows below.
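Just to make that idea concrete, the contract could look roughly like this
(the interface and method names are hypothetical; nothing like this exists in
the current API):

import java.util.Iterator;
import org.apache.clerezza.rdf.core.Triple;

// Hypothetical contract, only to illustrate the idea; neither this
// interface nor the method exists in Clerezza today.
interface GroupedTripleCollection {

    /**
     * Returns an iterator whose triples arrive grouped (not necessarily
     * globally sorted) by subject, then by predicate. A serializer can
     * emit each subject/predicate block as soon as its group ends and
     * only ever needs to keep the current group in memory.
     */
    Iterator<Triple> groupedIterator();
}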
Now there is the question of whether to accept your patch for Clerezza until
we implement a better solution. I am not sure. Your solution is a significant
improvement in serialization speed, but the original code is easier to
quick-fix so that the results are streamed properly to the output stream (I
think exploiting json-simple's streaming support may do the trick; see the
sketch below). So the question seems to be what is more important: a solution
that, while possibly very slow, will not exceed the available memory, or a
solution that significantly improves serialization performance.
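The quick fix I have in mind goes along these lines: write the JSON
punctuation directly to the writer and use json-simple only for escaping. A
rough sketch (it ignores bnodes, datatypes and language tags, and the method
name is mine):

import java.io.IOException;
import java.io.Writer;

import org.json.simple.JSONValue;

public class StreamingQuickFixSketch {

    // Writes one "predicate": [ {...}, ... ] entry straight to the writer
    // instead of first accumulating JSONObjects/JSONArrays in memory.
    static void writePredicate(Writer out, String predicate,
            Iterable<String> literalValues, boolean first) throws IOException {
        if (!first) {
            out.write(',');
        }
        out.write('"');
        out.write(JSONValue.escape(predicate));
        out.write("\":[");
        boolean firstValue = true;
        for (String value : literalValues) {
            if (!firstValue) {
                out.write(',');
            }
            out.write("{\"type\":\"literal\",\"value\":\"");
            out.write(JSONValue.escape(value));
            out.write("\"}");
            firstValue = false;
        }
        out.write(']');
    }
}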
My opinion is that, since we have lived so far with a solution that cannot
deal with very large graphs anyway, the speed improvement may be more
valuable. However, we need to start working on a better solution as described
above.
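As a side note on the encoding part of this issue: the UTF-8 fix itself is
trivial; it amounts to passing the charset explicitly instead of relying on
the platform default. Roughly (a before/after sketch, not the actual patched
code):

import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class Utf8WriterSketch {

    // Before: the platform default encoding is used, so the output
    // differs between systems.
    static Writer platformDependent(OutputStream out) {
        return new OutputStreamWriter(out);
    }

    // After: the encoding is fixed to UTF-8, as the patch does.
    static Writer utf8(OutputStream out) {
        return new OutputStreamWriter(out, Charset.forName("UTF-8"));
    }
}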
I think we should raise this issue on the mailing list for discussion.
> Weak Performance of "application/json+rdf" serializer on big
> TripleCollections and Serializer/Parser using Platform encoding instead of
> UTF-8
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CLEREZZA-643
> URL: https://issues.apache.org/jira/browse/CLEREZZA-643
> Project: Clerezza
> Issue Type: Improvement
> Reporter: Rupert Westenthaler
> Assignee: Daniel Spicar
> Attachments: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch
>
>
> Both the "application/json+rdf" serializer and parser use platform-specific
> encodings instead of UTF-8.
> In addition, the serializer suffers from very poor performance on big graphs
> (at least when using SimpleMGraph).
> After some digging in the code I came to the conclusion that this is because
> of the use of multiple TripleCollection.filter(..) calls, first to filter all
> predicates for a subject and then all objects for each subject/predicate
> combination. An attempt to serialize a graph with 50k triples ended in
> several minutes at 100% CPU.
> With the next comment I will provide a patch with an implementation based on
> a sorted array of the triples. With this method one can serialize graphs
> with 100k triples in about 1 second. This patch also changes the encoding to
> UTF-8.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira