[ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114666#comment-13114666 ]
Paolo Castagna edited comment on JENA-117 at 2/21/12 7:55 PM:
--------------------------------------------------------------
An experimental, pure Java version of tdbloader2, named tdbloader3, is here:
https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader3/trunk/
It's a pure Java program, and it uses a two-pass algorithm even to build the
node table.
It should (more testing is needed!) scale better on machines with little RAM
available.
Here is how you can check it out and package it:
cd /tmp
svn co https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader3/trunk/ tdbloader3
cd /tmp/tdbloader3
mvn package
To run it:
java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
-server -d64 -Xmx6144M cmd.tdbloader3 --no-stats --compression \
--spill-size 1500000 --loc /tmp/tdb /path/to/your/rdfdata.nt.gz
Use -h to see the options available:
cmd.tdbloader3 --loc=DIR FILE ...
General
-v --verbose        Verbose
-q --quiet          Run with minimal output
--debug             Output information for debugging
--help
--version           Version information
--loc               Location
--compression       Use compression for intermediate files
--buffer-size       The size of buffers for IO in bytes
--gzip-outside      GZIP...(Buffered...())
--spill-size        The size of spillable segments in tuples|records
--no-stats          Do not generate the stats file
--no-buffer         Do not use Buffered{Input|Output}Stream
--max-merge-files   Specify the maximum number of files to merge at
                    the same time (default: 100)
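To make the --spill-size and --max-merge-files options above more concrete, here is a minimal sketch of the general technique they suggest: an external sort that spills sorted runs to disk and then merges a bounded number of files at a time. It is an illustration only, not the tdbloader3 code. The class and method names are hypothetical, it sorts plain long values rather than tuples, and it ignores compression and buffer sizing.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch; none of these names come from tdbloader3.
class SpillSortSketch {

    /** Cuts the input into sorted runs of at most spillSize values, one temp file per run. */
    static List<Path> writeRuns(Iterator<Long> input, int spillSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<Long> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= spillSize) {
                runs.add(spill(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) runs.add(spill(buffer));
        return runs;
    }

    /** Sorts the buffer in memory and writes it as raw 8-byte longs. */
    static Path spill(List<Long> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("run", ".bin");
        try (DataOutputStream out = open(file)) {
            for (long v : buffer) out.writeLong(v);
        }
        return file;
    }

    static DataOutputStream open(Path file) throws IOException {
        return new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(file)));
    }

    /** Merges sorted runs into one sorted file, then deletes the inputs. */
    static Path merge(List<Path> runs) throws IOException {
        Path merged = Files.createTempFile("merged", ".bin");
        List<DataInputStream> ins = new ArrayList<>();
        // Each queue entry is {next value of a run, index of that run}.
        PriorityQueue<long[]> heads = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        try (DataOutputStream out = open(merged)) {
            for (Path run : runs) {
                ins.add(new DataInputStream(new BufferedInputStream(Files.newInputStream(run))));
                advance(ins, ins.size() - 1, heads);
            }
            while (!heads.isEmpty()) {
                long[] head = heads.poll();
                out.writeLong(head[0]);
                advance(ins, (int) head[1], heads);
            }
        } finally {
            for (DataInputStream in : ins) in.close();
        }
        for (Path run : runs) Files.delete(run);
        return merged;
    }

    /** Pushes the next value of run number index, if that run has one left. */
    static void advance(List<DataInputStream> ins, int index, PriorityQueue<long[]> heads)
            throws IOException {
        try {
            heads.add(new long[] { ins.get(index).readLong(), index });
        } catch (EOFException endOfRun) {
            // Run exhausted: nothing to push.
        }
    }

    /** Merges in passes so that no more than maxMergeFiles input files are open at once. */
    static Path mergeAll(List<Path> runs, int maxMergeFiles) throws IOException {
        if (maxMergeFiles < 2) throw new IllegalArgumentException("maxMergeFiles must be at least 2");
        if (runs.isEmpty()) return Files.createTempFile("empty", ".bin");
        List<Path> pending = new ArrayList<>(runs);
        while (pending.size() > 1) {
            List<Path> next = new ArrayList<>();
            for (int i = 0; i < pending.size(); i += maxMergeFiles) {
                int end = Math.min(i + maxMergeFiles, pending.size());
                next.add(merge(pending.subList(i, end)));
            }
            pending = next;
        }
        return pending.get(0);
    }

    /** Reads a whole file of raw longs back into memory (for checking small outputs). */
    static List<Long> readAll(Path file) throws IOException {
        List<Long> values = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) values.add(in.readLong());
        } catch (EOFException endOfFile) {
            return values;
        }
    }

    public static void main(String[] args) throws IOException {
        List<Long> input = Arrays.asList(42L, 7L, 19L, 3L, 88L, 1L, 56L);
        List<Path> runs = writeRuns(input.iterator(), 2);   // 4 runs of sizes 2, 2, 2, 1
        Path sorted = mergeAll(runs, 3);                    // needs two merge passes
        System.out.println(readAll(sorted));                // [1, 3, 7, 19, 42, 56, 88]
        Files.delete(sorted);
    }
}
```

In this sketch a smaller spill size lowers peak memory at the cost of more run files, and a smaller merge limit lowers the number of open files at the cost of extra merge passes. Whether tdbloader3 makes the same trade-offs is an assumption.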
was (Author: castagna):
An experimental version of tdbloader2 is here:
https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/
It's a pure Java program and it uses a two pass algorithm even to build the
node table.
It should (more testing is needed!) have better scalability properties on
machines with not much RAM available.
Here is how you can check it out and package it:
cd /tmp
svn co https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/ tdbloader2
cd /tmp/tdbloader2
mvn package
To run it:
java -cp target/tdbloader2-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
-server -d64 -Xmx6144M cmd.tdbloader2 --no-stats --compression \
--spill-size 1500000 --loc /tmp/tdb /path/to/your/rdfdata.nt.gz
Use -h to see the options available:
cmd.tdbloader2 --loc=DIR FILE ...
General
-v --verbose        Verbose
-q --quiet          Run with minimal output
--debug             Output information for debugging
--help
--version           Version information
--loc               Location
--compression       Use compression for intermediate files
--buffer-size       The size of buffers for IO in bytes
--gzip-outside      GZIP...(Buffered...())
--spill-size        The size of spillable segments in tuples|records
--no-stats          Do not generate the stats file
--no-buffer         Do not use Buffered{Input|Output}Stream
--max-merge-files   Specify the maximum number of files to merge at
                    the same time (default: 100)
> A pure Java version of tdbloader2
> ---------------------------------
>
> Key: JENA-117
> URL: https://issues.apache.org/jira/browse/JENA-117
> Project: Apache Jena
> Issue Type: Improvement
> Components: TDB
> Reporter: Paolo Castagna
> Assignee: Paolo Castagna
> Priority: Minor
> Labels: performance, tdbloader2
> Attachments: TDB_JENA-117_r1171714.patch
>
>
> Replacing the UNIX sort over text files with a pure Java external sort
> would probably improve tdbloader2 performance significantly.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
> ThresholdPolicyCount<Tuple<Long>> policy =
>     new ThresholdPolicyCount<Tuple<Long>>(1000000);
> SerializationFactory<Tuple<Long>> serializerFactory =
>     new TupleSerializationFactory();
> Comparator<Tuple<Long>> comparator = new TupleComparator();
> SortedDataBag<Tuple<Long>> sortedDataBag =
>     new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is
> trivial.
> Preliminary results seem promising and show that the Java implementation can
> be faster than UNIX sort, since it uses smaller binary files (instead of text
> files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag with UNIX sort, is
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of sorting in Java rather than with UNIX sort is that we
> could stream results directly into the BPlusTreeRewriter, rather than writing
> them to disk and then reading them back into the BPlusTreeRewriter.
> I have not yet run an experiment to see whether this is actually a
> significant improvement.
> Using compression for intermediate files might help, but more experiments are
> necessary to establish whether it is worthwhile.
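As an illustration of the points quoted above about binary files and comparing long values, here is a minimal sketch of a tuple comparator and a DataInputStream|DataOutputStream round trip. It is not the TupleComparator or TupleSerializationFactory from the patch. The class name, the use of long[] in place of Tuple<Long>, and the length-prefixed layout are all assumptions made for the example.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch; long[] stands in for Tuple<Long>.
class LongTupleSketch {

    /** Orders tuples slot by slot, comparing the slots as numbers. */
    static final Comparator<long[]> TUPLE_ORDER = (a, b) -> {
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    /** Writes each tuple as a 4-byte length followed by 8 bytes per slot (assumed layout). */
    static byte[] encode(List<long[]> tuples) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (long[] tuple : tuples) {
                out.writeInt(tuple.length);
                for (long slot : tuple) out.writeLong(slot);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads back tuples written by encode. */
    static List<long[]> decode(byte[] data) throws IOException {
        List<long[]> tuples = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                long[] tuple = new long[in.readInt()];
                for (int i = 0; i < tuple.length; i++) tuple[i] = in.readLong();
                tuples.add(tuple);
            }
        }
        return tuples;
    }

    public static void main(String[] args) throws IOException {
        List<long[]> tuples = Arrays.asList(
            new long[] {10, 2, 3}, new long[] {9, 8, 7}, new long[] {9, 1, 5});
        byte[] data = encode(tuples);
        System.out.println(data.length + " bytes");   // 3 * (4 + 3 * 8) = 84 bytes
        List<long[]> sorted = decode(data);
        sorted.sort(TUPLE_ORDER);
        for (long[] tuple : sorted) System.out.println(Arrays.toString(tuple));
        // Prints [9, 1, 5], then [9, 8, 7], then [10, 2, 3].
        // Sorting the decimal strings "10" and "9" as text would put "10" first.
    }
}
```

The sketch makes no claim about how tdbloader2 formats its text files for UNIX sort. It only shows that a numeric comparison needs no parsing or padding and that each slot costs a fixed 8 bytes.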