[
https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paolo Castagna updated JENA-117:
--------------------------------
Priority: Minor (was: Major)
> A pure Java version of tdbloader2
> ---------------------------------
>
> Key: JENA-117
> URL: https://issues.apache.org/jira/browse/JENA-117
> Project: Jena
> Issue Type: Improvement
> Components: TDB
> Reporter: Paolo Castagna
> Assignee: Paolo Castagna
> Priority: Minor
> Labels: performance, tdbloader2
> Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in
> replacing the UNIX sort over text files with an external sorting pure Java
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
> ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
> SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
> Comparator<Tuple<Long>> comparator = new TupleComparator();
> SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is
> trivial.
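
To illustrate the binary tuple encoding and the comparator described above, here is a minimal sketch. It is not the code from the attached patch: the class name LongTupleCodec is hypothetical, a plain long[] stands in for Tuple<Long>, and the length-prefixed layout is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for TupleSerializationFactory + TupleComparator.
public class LongTupleCodec {

    // Lexicographic order on long values; shorter tuples sort first on a tie.
    public static final Comparator<long[]> COMPARATOR = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    // Assumed layout: a 4-byte length followed by 8 bytes per component.
    public static void write(DataOutputStream out, long[] tuple) throws IOException {
        out.writeInt(tuple.length);
        for (long v : tuple) out.writeLong(v);
    }

    public static long[] read(DataInputStream in) throws IOException {
        int n = in.readInt();
        long[] tuple = new long[n];
        for (int i = 0; i < n; i++) tuple[i] = in.readLong();
        return tuple;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            write(out, new long[] {1L, 2L, 3L});
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(Arrays.toString(read(in)));
    }
}
```

With this layout a triple of node ids takes 28 bytes regardless of the magnitude of the ids, and ordering needs only Long.compare on each component.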
> Preliminary results seem promising and show that the Java implementation can
> be faster than UNIX sort, since it uses smaller binary files (instead of text
> files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag against UNIX sort, is
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort is
> that we could stream the sorted results directly into the BPlusTreeRewriter,
> rather than writing them to disk and then reading them back from disk.
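
The streaming idea above can be sketched as a spill-and-merge external sort whose merge phase pushes tuples straight into a consumer. This is an illustration only, not the SortedDataBag implementation: ExternalLongSortSketch, the fixed tuple width of 3, and the java.util.function.Consumer standing in for the BPlusTreeRewriter input are all assumptions.

```java
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.function.Consumer;

// Hypothetical sketch: sorted runs are spilled as binary files, then merged
// and handed to a sink without writing a final sorted file.
public class ExternalLongSortSketch {
    static final int WIDTH = 3; // assumed fixed tuple width (e.g. S, P, O ids)

    static int compare(long[] a, long[] b) {
        for (int i = 0; i < WIDTH; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Sort one in-memory chunk and write it out as a binary run file.
    static Path spill(List<long[]> chunk) throws IOException {
        chunk.sort(ExternalLongSortSketch::compare);
        Path run = Files.createTempFile("run", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(run)))) {
            for (long[] t : chunk) for (long v : t) out.writeLong(v);
        }
        return run;
    }

    // Cursor over one run file; head is null once the run is exhausted.
    static final class Run implements Closeable {
        final DataInputStream in;
        long[] head;
        Run(Path p) throws IOException {
            in = new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
            advance();
        }
        void advance() throws IOException {
            long[] t = new long[WIDTH];
            try {
                for (int i = 0; i < WIDTH; i++) t[i] = in.readLong();
                head = t;
            } catch (EOFException e) {
                head = null;
            }
        }
        public void close() throws IOException { in.close(); }
    }

    public static void sort(Iterator<long[]> input, int threshold, Consumer<long[]> sink)
            throws IOException {
        List<Path> runs = new ArrayList<>();
        List<long[]> chunk = new ArrayList<>();
        while (input.hasNext()) {
            chunk.add(input.next());
            if (chunk.size() >= threshold) { runs.add(spill(chunk)); chunk = new ArrayList<>(); }
        }
        if (!chunk.isEmpty()) runs.add(spill(chunk));

        // K-way merge: the heap is ordered by the current head of each run.
        PriorityQueue<Run> heap = new PriorityQueue<>((x, y) -> compare(x.head, y.head));
        try {
            for (Path p : runs) {
                Run r = new Run(p);
                if (r.head != null) heap.add(r); else r.close();
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                sink.accept(r.head); // streamed to the consumer, not to disk
                r.advance();
                if (r.head != null) heap.add(r); else r.close();
            }
        } finally {
            for (Run r : heap) r.close();
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }
}
```

A caller would pass something like `tuple -> rewriter.add(tuple)` as the sink, where `rewriter` is a placeholder for whatever component consumes the sorted tuples.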
> I've not done an experiment yet to see if this is actually a significant
> improvement.
> Using compression for intermediate files might help, but more experiments are
> necessary to establish if it is worthwhile or not.
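
One way to run that experiment is to make compression of the intermediate files a switch, so both variants can be timed on the same data. The sketch below uses the JDK's GZIP streams; CompressedRunSketch and its method names are hypothetical, and GZIP is only one candidate codec, chosen here because it needs no extra dependency.

```java
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: the same binary run format, optionally wrapped in GZIP.
public class CompressedRunSketch {

    static DataOutputStream openForWrite(Path p, boolean compress) throws IOException {
        OutputStream out = new BufferedOutputStream(Files.newOutputStream(p));
        if (compress) out = new GZIPOutputStream(out);
        return new DataOutputStream(out);
    }

    static DataInputStream openForRead(Path p, boolean compress) throws IOException {
        InputStream in = new BufferedInputStream(Files.newInputStream(p));
        if (compress) in = new GZIPInputStream(in);
        return new DataInputStream(in);
    }

    // Writes the values and returns the resulting file size in bytes.
    public static long writeRun(Path p, long[] values, boolean compress) throws IOException {
        try (DataOutputStream out = openForWrite(p, compress)) {
            for (long v : values) out.writeLong(v);
        }
        return Files.size(p);
    }

    // Reads back exactly count values from a run written by writeRun.
    public static long[] readRun(Path p, int count, boolean compress) throws IOException {
        long[] values = new long[count];
        try (DataInputStream in = openForRead(p, compress)) {
            for (int i = 0; i < count; i++) values[i] = in.readLong();
        }
        return values;
    }
}
```

Whether the smaller files pay for the extra CPU time depends on the data and the disk, which is what the experiments would need to establish.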