[jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3

Sarven Capadisli (Commented) (JIRA) Fri, 02 Mar 2012 11:56:24 -0800

    [ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221213#comment-13221213 ]


Sarven Capadisli commented on JENA-117:
---------------------------------------

Thank you Paolo for that reply. I'll focus on my own issues for the time being 
and go with the defaults until I have better control over the situation.

I have the following in my tdb2.worldbank.ttl:
     tdb:location "/usr/lib/fuseki/DB/WorldBank" ;

When I look at that directory after running:

     java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
          -server -d64 -Xmx2048M cmd.tdbloader3 --no-stats --compression \
          --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/indicators.gz

Almost all of the files are exactly 8388608 bytes (8 MiB) in size.

Adding the trailing slash at the end didn't make a difference either.

Importing either /tmp/indicators.nt or /tmp/indicators.gz gives me the same 
results. When I do a query, I get no results.

I don't understand the point of providing a single compressed gzip file. I 
understand that the file is smaller but that's under the assumption that the 
dump is usually compressed. In my case, I already have the plain N-Triples 
file, so this requires me to gzip it, and then have it gunzipped by tdbloader3. 
Why not allow the N-Triples file?

Bulk loading here appears to be about a single file. If this tool can deal with 
multiple files in a compressed file, and have me provide a graph name for it, 
that'd be ideal. I have several graphs that I need to create, and as you 
rightfully suggest, N-Quads is the way to go; provided that I extend N-Triples 
to N-Quads, and merge all of my files into one large file. While this appears 
to be the way to go, it introduces too many steps for preparing the dumps that 
are to be imported. If I have to fix something in my N-Triples dump, I'd have 
to regenerate everything, and then either import once more or do a SPARQL 
Update.
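
To make the extra step concrete, here is a minimal sketch of the N-Triples to 
N-Quads conversion I mean. It is my own illustration, not part of tdbloader3 
or Jena. It assumes one triple per line, each terminated by " ." with no 
trailing comment, and the class name and graph IRI are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns N-Triples lines into N-Quads lines by
// appending a graph IRI before the terminating dot.
final class NTriplesToNQuads {

    // Assumes the line holds exactly one triple ending in "." and no
    // trailing comment. Blank lines and comment lines are returned trimmed.
    static String toQuad(String line, String graphIri) {
        String t = line.trim();
        if (t.isEmpty() || t.startsWith("#")) {
            return t;
        }
        if (!t.endsWith(".")) {
            throw new IllegalArgumentException("Not a complete triple: " + line);
        }
        // Drop the terminating dot, then re-add it after the graph term.
        String body = t.substring(0, t.length() - 1).trim();
        return body + " <" + graphIri + "> .";
    }

    // Reads N-Triples from stdin and writes N-Quads to stdout.
    // args[0] is the graph IRI to attach to every triple.
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NTriplesToNQuads <graph-iri>");
            System.exit(1);
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        PrintStream out = new PrintStream(System.out, false, "UTF-8");
        String line;
        while ((line = in.readLine()) != null) {
            out.println(toQuad(line, args[0]));
        }
        out.flush();
    }
}
```

Running it once per dump with a different graph IRI, for example 
"java NTriplesToNQuads http://example.org/graph/indicators < /tmp/indicators.nt", 
and concatenating the outputs would give the single N-Quads file. If one dump 
changes, every step has to be repeated, which is the overhead I'd like to avoid.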

Given my preferences, do you reckon that tdbloader2 is more suitable?

Thanks again!
                
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in 
> replacing the UNIX sort over text files with an external sorting pure Java 
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream which 
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is 
> trivial.
> Preliminary results seem promising and show that the Java implementation can 
> be faster than UNIX sort since it uses smaller binary files (instead of text 
> files) and it does comparisons of long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is 
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage in doing the sorting with Java rather than UNIX sort is 
> that we could stream results directly into the BPlusTreeRewriter rather than 
> writing them to disk and then reading them back into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant 
> improvement.
> Using compression for intermediate files might help, but more experiments are 
> necessary to establish if it is worthwhile or not.

