[
https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221074#comment-13221074
]
Paolo Castagna commented on JENA-117:
-------------------------------------
Hi Sarven, here are a few answers to your questions.
> --compression Use compression for intermediate files
> --gzip-outside GZIP...(Buffered...())
> --buffer-size The size of buffers for IO in bytes
> --no-buffer Do not use Buffered{Input|Output}Stream
These options control how the DataOutputStream/DataInputStream instances used
during processing are constructed. In DataStreamFactory.java you can find this:
if ( ! buffered ) {
    return new DataOutputStream( compression ? new GZIPOutputStream(out) : out ) ;
} else {
    if ( gzip_outside ) {
        return new DataOutputStream( compression
            ? new GZIPOutputStream(new BufferedOutputStream(out, buffer_size))
            : new BufferedOutputStream(out, buffer_size) ) ;
    } else {
        return new DataOutputStream( compression
            ? new BufferedOutputStream(new GZIPOutputStream(out, buffer_size))
            : new BufferedOutputStream(out, buffer_size) ) ;
    }
}
This is me experimenting to find the best combination. I still do not have an
answer; I welcome suggestions and results from experiments. That is the reason
why I put those configuration parameters on the command line. Ideally, once we
find what works best, we should use that as the default and either eliminate
the parameters or leave them in for advanced users only. The buffer size is
8192 bytes by default.
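To compare the combinations empirically, one could time writes through each
stream stack. This is a hypothetical stand-alone micro-benchmark sketch, not
part of tdbloader3; the class name, constants and the `wrap` helper are my own,
mirroring the logic quoted above:

```java
import java.io.*;
import java.util.zip.GZIPOutputStream;

public class StreamStackBench {
    static final int BUFFER_SIZE = 8192;

    // Build one of the stream stacks that DataStreamFactory chooses between.
    static DataOutputStream wrap(OutputStream out, boolean buffered,
                                 boolean compression, boolean gzipOutside)
            throws IOException {
        if (!buffered)
            return new DataOutputStream(compression ? new GZIPOutputStream(out) : out);
        if (gzipOutside)
            return new DataOutputStream(compression
                ? new GZIPOutputStream(new BufferedOutputStream(out, BUFFER_SIZE))
                : new BufferedOutputStream(out, BUFFER_SIZE));
        return new DataOutputStream(compression
            ? new BufferedOutputStream(new GZIPOutputStream(out, BUFFER_SIZE))
            : new BufferedOutputStream(out, BUFFER_SIZE));
    }

    public static void main(String[] args) throws IOException {
        for (boolean gzipOutside : new boolean[] { true, false }) {
            long start = System.nanoTime();
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (DataOutputStream dos = wrap(sink, true, true, gzipOutside)) {
                for (long i = 0; i < 1_000_000; i++)
                    dos.writeLong(i); // stand-in for writing tuples
            }
            System.out.printf("gzip_outside=%b: %d bytes, %d ms%n", gzipOutside,
                sink.size(), (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```

Timing a single in-memory run like this only gives a rough signal; a real
experiment would write to disk and repeat the runs to warm up the JIT.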
> --spill-size The size of spillable segments in tuples|records
> --spill-size-auto Automatically set the size of spillable segments
> --max-merge-files Specify the maximum number of files to merge at the same
> time (default: 100)
These are two more advanced-users-only parameters to allow experiments and find
out what works best. tdbloader3 uses 'data bags' which spill data to disk,
because we cannot assume that the data at any stage fits into RAM, and we want
to avoid disk seeks. So, for example, if we want to sort some data which does
not fit in RAM, we sort it in RAM in chunks, dump each sorted chunk to disk,
process another chunk, etc.; at the end we sort-merge all the chunks.
The --spill-size parameter controls how many tuples are kept in RAM before
spilling to disk. This is not easy to know: it depends on how many bytes each
tuple takes, and tuples have different sizes at different stages of the
computation. Ideally, users should not even think about this. This is why I
tried to have an adaptive strategy (i.e. --spill-size-auto). With
--spill-size-auto, tdbloader3 constantly monitors the RAM available in the JVM
and triggers the spill to disk when the available RAM approaches a certain
threshold. Things are more complicated if you have multiple threads, and I am
still unsure whether this is a good strategy or not. The aim is to have
autotuning on by default, so that users do not have to think about spill sizes
(see also: JENA-126 and JENA-157).
--max-merge-files specifies the maximum number of files/chunks to sort-merge
after each chunk has been sorted and spilled to disk. So, for example, if you
end up with 10000 temporary files, the sort-merge happens in two rounds: the
first round generates 100 new files (sort-merging 100 files at a time), and a
last round sort-merges those 100 newly generated files. This is because reading
from too many files at the same time does not work well. Why 100? The Hadoop
source code says that they found 100 works best for them when doing a very
similar thing. This is another area where more experiments would help in
finding a reasonable default value.
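The spill-and-merge scheme can be sketched as follows. This is an illustrative
stand-alone example, not tdbloader3's code: it sorts plain longs instead of
tuples, and the names (spill, mergeAll, maxMergeFiles) are mine, merely echoing
the command-line options:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSortSketch {

    // Sort one in-RAM chunk (at most --spill-size values) and spill it to disk.
    static Path spill(List<Long> chunk) throws IOException {
        Collections.sort(chunk);
        Path f = Files.createTempFile("spill-", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(f)))) {
            for (long v : chunk) out.writeLong(v);
        }
        return f;
    }

    // Merge at most maxMergeFiles runs at a time, in rounds, until one is left.
    static Path mergeAll(List<Path> runs, int maxMergeFiles) throws IOException {
        while (runs.size() > 1) {
            List<Path> next = new ArrayList<>();
            for (int i = 0; i < runs.size(); i += maxMergeFiles)
                next.add(merge(runs.subList(i, Math.min(i + maxMergeFiles, runs.size()))));
            runs = next;
        }
        return runs.get(0);
    }

    // k-way merge of sorted runs using a priority queue of {value, fileIndex}.
    static Path merge(List<Path> runs) throws IOException {
        Path f = Files.createTempFile("merge-", ".bin");
        PriorityQueue<long[]> pq =
            new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[0]));
        List<DataInputStream> ins = new ArrayList<>();
        for (int i = 0; i < runs.size(); i++) {
            DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(runs.get(i))));
            ins.add(in);
            if (in.available() > 0) pq.add(new long[] { in.readLong(), i });
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(f)))) {
            while (!pq.isEmpty()) {
                long[] head = pq.poll();
                out.writeLong(head[0]);
                DataInputStream in = ins.get((int) head[1]);
                if (in.available() > 0) pq.add(new long[] { in.readLong(), head[1] });
            }
        }
        for (DataInputStream in : ins) in.close();
        return f;
    }
}
```

With 10000 runs and maxMergeFiles = 100 the while loop does exactly the two
rounds described above.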
> --no-stats Do not generate the stats file
This is easy: by default tdbloader3 generates the stats.opt file (see "Choosing
the optimizer strategy" section here:
http://incubator.apache.org/jena/documentation/tdb/optimizer.html). You can
ignore that option, stats.opt file can be generated later via TDB's tdbstats
command line.
Now, to your errors:
> $ java -cp
> target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar
> -server -d64 -Xmx2000M cmd.tdbloader3 --no-stats --compression --spill-size
> 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/indicators.tar.gz
> INFO Load: /tmp/indicators.tar.gz -- 2012/03/02 10:49:39 EST
> ERROR [line: 1, col: 13] Unknown char: (0)
I think this is because you are trying to load a .gz which contains a tar
archive with multiple files. tdbloader3 does not support that.
My advice is to convert and validate all your files from whatever format you
have into N-Triples or N-Quads.
Concatenate all the N-Triples or N-Quads files into a single .nt or .nq file
and gzip it so that you end up with a single filename.nt.gz (which contains a
single file).
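The concatenate-and-gzip step can be done with the cat and gzip command-line
tools, or with a few lines of plain Java. This is only a sketch (the class name
and file names are placeholders); it works because N-Triples and N-Quads are
line-oriented, so plain concatenation of valid files yields a valid file:

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;

public class ConcatGzip {
    // Concatenate the given N-Triples/N-Quads files into one gzipped file.
    // Each source file is assumed to end with a newline.
    public static void concat(Path target, Path... sources) throws IOException {
        try (OutputStream out = new GZIPOutputStream(
                new BufferedOutputStream(Files.newOutputStream(target)))) {
            for (Path src : sources)
                Files.copy(src, out);
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. java ConcatGzip filename.nt.gz part1.nt part2.nt
        Path[] sources = new Path[args.length - 1];
        for (int i = 1; i < args.length; i++) sources[i - 1] = Paths.get(args[i]);
        concat(Paths.get(args[0]), sources);
    }
}
```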
Try loading that using tdbloader2 on a 64-bit machine with as much RAM as you
have, and use -Xmx2048m for the JVM.
If you try tdbloader3 as well on the same machine, give the JVM as much RAM as
you can via -Xmx..., since tdbloader3 does not use memory-mapped files.
> $ java tdb.tdbquery --desc=/usr/lib/fuseki/tdb2.worldbank.ttl 'SELECT * WHERE
> { ?s ?p ?o . } LIMIT 100'
> 10:56:30 WARN ModTDBDataset :: Unexpected: Not a TDB dataset for type
> DatasetTDB
Please double-check that your tdb2.worldbank.ttl is pointing at the right directory.
> One final thing I'd like to know how to do is assigning graph names. --graph
> is not available as it was in tdbloader.
Right. One way to work around this would be to use files in the N-Quads format
(http://sw.deri.org/2008/07/n-quads/) instead of N-Triples.
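For example, an N-Quads statement is an N-Triples statement with a fourth term
naming the graph it belongs to (the URIs here are made up for illustration):

```
# N-Triples: subject, predicate, object
<http://example.org/s> <http://example.org/p> "o" .
# N-Quads: the fourth term names the graph
<http://example.org/s> <http://example.org/p> "o" <http://example.org/graph1> .
```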
I have worked on tdbloader3 only "out-of-band", but things might change (if
there are people interested).
You are not the only one needing some patience when dealing with datasets of
more than 500 million triples. :-)
One dataset I want to experiment with is Freebase (i.e. ~ 600 million triples)
and I have only 8 GB of RAM on my desktop. This certainly is a good experiment
for tdbloader3.
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
> Key: JENA-117
> URL: https://issues.apache.org/jira/browse/JENA-117
> Project: Apache Jena
> Issue Type: Improvement
> Components: TDB
> Reporter: Paolo Castagna
> Assignee: Paolo Castagna
> Priority: Minor
> Labels: performance, tdbloader2
> Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in
> replacing the UNIX sort over text files with an external sorting pure Java
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
> ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
> SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
> Comparator<Tuple<Long>> comparator = new TupleComparator();
> SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream which
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is
> trivial.
> Preliminary results seem promising and show that the Java implementation can
> be faster than UNIX sort, since it uses smaller binary files (instead of text
> files) and compares long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage in doing the sorting with Java rather than UNIX sort is
> that we could stream results directly into the BPlusTreeRewriter rather than
> on disk and then reading them from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant
> improvement.
> Using compression for intermediate files might help, but more experiments are
> necessary to establish if it is worthwhile or not.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira