[jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3

Paolo Castagna (Commented) (JIRA) Tue, 06 Mar 2012 01:03:27 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223110#comment-13223110
 ]


Paolo Castagna commented on JENA-117:
-------------------------------------

>      tdb:unionDefaultGraph true ; 

This explains why you had no results at query time.

You were loading data into the default graph, but using tdb:unionDefaultGraph 
true. With that setting, the default graph seen by your query is the union of 
the named graphs (which does not include the default graph where you loaded 
your data). 
You can find the documentation here: 
http://incubator.apache.org/jena/documentation/tdb/assembler.html
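
To make the effect concrete, here is a toy model in plain Java. It is not the 
Jena/TDB API: the class, its fields and queryDefaultGraph are hypothetical 
names, and triples are just strings. It only illustrates which data a query 
sees as its "default graph" with and without the union setting.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of a dataset: one default graph plus any number of named graphs.
// NOT the Jena/TDB API; every name here is hypothetical.
class UnionDefaultGraphSketch {
    final Set<String> defaultGraph = new HashSet<>();
    final Map<String, Set<String>> namedGraphs = new HashMap<>();

    // The graph a query sees as its default graph, depending on the setting.
    Set<String> queryDefaultGraph(boolean unionDefaultGraph) {
        if (!unionDefaultGraph) {
            return defaultGraph;
        }
        // Union of the named graphs only; the real default graph is left out.
        Set<String> union = new HashSet<>();
        for (Set<String> graph : namedGraphs.values()) {
            union.addAll(graph);
        }
        return union;
    }

    public static void main(String[] args) {
        UnionDefaultGraphSketch ds = new UnionDefaultGraphSketch();
        ds.defaultGraph.add("<s> <p> <o> .");  // data loaded as plain triples

        System.out.println(ds.queryDefaultGraph(false).size()); // prints 1
        System.out.println(ds.queryDefaultGraph(true).size());  // prints 0
    }
}
```

With the data only in the default graph and no named graphs, the union is 
empty, which matches the "no results" you saw.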

> BTW, loading all the files in the directory works as well

Yep. The problem is that if you have a lot of files, the shell's * expansion 
will stop working (the expanded argument list gets too long).

> I suppose now I have to decide whether to take the N-Quads route or use 
> tdbloader2. 

I saw you managed to load your data using tdbloader, good.

Considering the N-Quads format the next time you do a big data conversion is, 
IMHO, good advice.
N-Triples and N-Quads are the formats which will load faster, and they give 
you different options (they are also MapReduce friendly, i.e. easy to split).
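
"Easy to split" follows from both formats putting one complete statement on 
each line, so a file can be cut at any line boundary. A minimal sketch, 
assuming the lines are already in memory (LineSplitSketch and split are 
hypothetical names, and a real job would split by byte ranges instead):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: each N-Triples/N-Quads line is a complete statement,
// so every chunk produced here is itself a valid document.
class LineSplitSketch {
    // Cut the lines into consecutive chunks of at most chunkSize lines.
    static List<List<String>> split(List<String> lines, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, lines.size());
            chunks.add(new ArrayList<>(lines.subList(i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<a> <p> <b> .",
            "<b> <p> <c> .",
            "<c> <p> <d> .");
        // Each chunk could be handed to a separate worker.
        for (List<String> chunk : split(lines, 2)) {
            System.out.println(chunk);
        }
    }
}
```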
                
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in 
> replacing the UNIX sort over text files with an external sorting pure Java 
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy =
>         new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory =
>         new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag =
>         new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which 
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is 
> trivial.
> Preliminary results seem promising and show that the Java implementation can 
> be faster than UNIX sort, since it uses smaller binary files (instead of text 
> files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag with UNIX sort is 
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort 
> is that we could stream the sorted results directly into the 
> BPlusTreeRewriter, rather than writing them to disk and then reading them 
> back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant 
> improvement.
> Using compression for intermediate files might help, but more experiments are 
> necessary to establish if it is worthwhile or not.
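
For readers who want the shape of the idea in the description above, here is 
a rough sketch of an external sort over tuples of longs. It is not the 
SortedDataBag implementation: the class name, the fixed width of three longs, 
spill, read and the Consumer sink are all assumptions made for illustration. 
It spills sorted runs to binary temp files once a count threshold is reached, 
then merges the runs and streams the result to the sink.

```java
import java.io.*;
import java.util.*;
import java.util.function.Consumer;

// Hypothetical stand-in for the idea behind SortedDataBag; not the Jena class.
class ExternalLongTupleSortSketch {
    static final int WIDTH = 3; // assumed tuple width, e.g. three node ids

    // Compare tuples slot by slot as long values (no string comparison).
    static final Comparator<long[]> CMP = (a, b) -> {
        for (int i = 0; i < WIDTH; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    static final class Head { // next tuple of a run, plus the run's index
        final long[] tuple; final int run;
        Head(long[] tuple, int run) { this.tuple = tuple; this.run = run; }
    }

    // Sort the in-memory buffer, write it as one binary run, clear the buffer.
    static File spill(List<long[]> buffer) throws IOException {
        buffer.sort(CMP);
        File run = File.createTempFile("run", ".bin");
        run.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(run)))) {
            for (long[] t : buffer) for (long v : t) out.writeLong(v);
        }
        buffer.clear();
        return run;
    }

    // Read the next tuple from a run, or return null at end of file.
    static long[] read(DataInputStream in) throws IOException {
        long[] t = new long[WIDTH];
        try { t[0] = in.readLong(); } catch (EOFException e) { return null; }
        for (int i = 1; i < WIDTH; i++) t[i] = in.readLong();
        return t;
    }

    // Sort the input, spilling every `threshold` tuples; stream output to sink.
    static void sort(Iterable<long[]> input, int threshold, Consumer<long[]> sink) {
        try {
            sortChecked(input, threshold, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void sortChecked(Iterable<long[]> input, int threshold,
                            Consumer<long[]> sink) throws IOException {
        List<File> runs = new ArrayList<>();
        List<long[]> buffer = new ArrayList<>();
        for (long[] t : input) {
            buffer.add(t);
            if (buffer.size() >= threshold) runs.add(spill(buffer));
        }
        if (!buffer.isEmpty()) runs.add(spill(buffer));

        // k-way merge: the heap always holds the smallest unread tuple per run.
        List<DataInputStream> ins = new ArrayList<>();
        PriorityQueue<Head> heap =
            new PriorityQueue<>((x, y) -> CMP.compare(x.tuple, y.tuple));
        try {
            for (File run : runs) {
                DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(run)));
                ins.add(in);
                long[] first = read(in);
                if (first != null) heap.add(new Head(first, ins.size() - 1));
            }
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                sink.accept(h.tuple); // streamed out, never written back to disk
                long[] next = read(ins.get(h.run));
                if (next != null) heap.add(new Head(next, h.run));
            }
        } finally {
            for (DataInputStream in : ins) in.close();
            for (File run : runs) run.delete();
        }
    }

    public static void main(String[] args) {
        List<long[]> input = Arrays.asList(
            new long[]{3, 1, 1}, new long[]{1, 2, 3}, new long[]{2, 0, 0});
        sort(input, 2, t -> System.out.println(Arrays.toString(t)));
    }
}
```

The sink plays the role that the BPlusTreeRewriter would play in the loader: 
it receives tuples in sorted order without an extra pass over the disk.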
