Rupert provided a patch to improve serialization performance (thanks for the effort!). I reviewed his patch and wrote my comments on the JIRA page, but I think we need to discuss the issues I raised there. In summary:
- neither the patch nor the current implementation works reliably with very large graphs (larger than memory)
- the patch is significantly faster than the current implementation
- the current implementation is easier to quick-fix for very large graphs (but also very slow)

There is a sketch of a better solution that should allow us to be faster and not limited by memory size. It is based on sorted iterators. However, these iterators need to be supplied by the underlying TripleCollections, and that will require more changes to the core of Clerezza.

Because neither the current implementation nor the patch really works on a "big" TripleCollection (where big means really, really big), the question we should discuss is:

a) keep everything as it is and solve the problem properly (possibly as described in the issue)
b) quick-fix the current implementation (slow performance) + schedule a proper solution
c) apply the patch (fast, but graphs limited to available memory size) + schedule a proper solution

My favorite is c). What do you think?
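
P.S. To make the sorted-iterator idea a bit more concrete, here is a minimal sketch of why it would remove the memory limit. It is only an illustration, not the design from the issue: the class `SortedStreamSketch`, the `T` stand-in for a triple and the `write` method are hypothetical names, and I simply assume the TripleCollection could hand out an iterator that is already sorted by subject. With that assumption the serializer only has to remember the previous subject, never the whole graph.

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SortedStreamSketch {

    /** Hypothetical stand-in for a triple; NOT Clerezza's Triple interface. */
    static final class T {
        final String s, p, o;
        T(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /**
     * Writes triples grouped by subject in a Turtle-like form.
     * Assumption: the iterator is already sorted by subject, so the only
     * state kept between triples is the previous subject.
     */
    static void write(Iterator<T> sortedBySubject, PrintWriter out) {
        String current = null;
        while (sortedBySubject.hasNext()) {
            T t = sortedBySubject.next();
            if (!t.s.equals(current)) {
                // New subject: close the previous group, start a new one.
                if (current != null) out.print(" .\n");
                out.print(t.s);
                current = t.s;
            } else {
                // Same subject: continue the predicate/object list.
                out.print(" ;\n   ");
            }
            out.print(" " + t.p + " " + t.o);
        }
        if (current != null) out.print(" .\n");
        out.flush();
    }

    public static void main(String[] args) {
        // A small in-memory list stands in for the sorted iterator here.
        List<T> triples = Arrays.asList(
            new T("<a>", "<name>", "\"A\""),
            new T("<a>", "<knows>", "<b>"),
            new T("<b>", "<name>", "\"B\""));
        write(triples.iterator(), new PrintWriter(System.out));
    }
}
```

The hard part is of course not this loop but getting a sorted iterator out of every TripleCollection, which is where the changes to the core would come in.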
