Hi Andy

Andy Seaborne wrote:
> Paolo,
>
>> Both tdbloader3 [1] and tdbloader4 [2] are (should be?) correct,
>> I've been testing them with datasets in the 500-700 million triples
>> range but I consider them (still) *experimental*.
>
> Is now the right time to incorporate tdbloader3 into the main TDB
> codebase as "tdbloader3"?

Yes. I'll do it, soon after TDB is released and the [VOTE] closes.

It has no additional dependencies, other than TDB. :-)

The tests are logging at INFO level; I need to double check that and
make them silent. There are just 6 and they run in ~10 seconds.

I also want to check I am using all the new stuff to create TDB stuff...
but this, again, isn't necessarily something which needs to be done
before we incorporate it.

> It does not disturb anything else (does it?) and makes it more
> accessible to users to try out.

Correct, it does not disturb anything else and it will be easier for
others to try out (and, eventually, use).

The big advantage is that it should scale better on machines with
limited RAM. The external sort is pure Java and it's faster than UNIX
sort because we can use binary files instead of text files to sort our
64-bit node ids.

The drawback is that the first phase, which builds the node table and
the corresponding index (i.e. nodes.data, node2id.idx and node2id.dat),
is done in multiple passes.

> Or ... what does it take for it not to be "experimental"?

I'd like to run a couple more tests with datasets of ~1 billion triples,
but this can happen after tdbloader3 has been incorporated into TDB.

... and, last but not least, similar tests for tdbloader4 (i.e. the
MapReduce implementation). :-)

Next? Anyone into jCUDA? We all have hundreds of cores in our GPUs
sitting idle most of the time. Maybe sorting stuff there is faster, even
if I don't believe it is going to make much of a difference for the
first phase.

I also want to continue looking into hash values as node ids...

Cheers,
Paolo

>
> Andy
>
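
To illustrate the point about sorting binary files of 64-bit node ids, here is a minimal sketch of an external merge sort over a file of raw longs. It is not the tdbloader3 code: the class name BinaryExternalSort, the sort method and the chunkSize parameter are all made up for this example, and a real loader would sort tuples of ids rather than single values.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical illustration only; not the tdbloader3 implementation.
public class BinaryExternalSort {

    /** One sorted run on disk, plus the value most recently read from it. */
    private static final class Run implements Closeable {
        final DataInputStream in;
        long head;

        Run(Path p) throws IOException {
            in = new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
        }

        /** Reads the next id into head; returns false at end of run. */
        boolean advance() throws IOException {
            try {
                head = in.readLong();
                return true;
            } catch (EOFException e) {
                return false;
            }
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    /**
     * Sorts a file of big-endian 64-bit values (as written by
     * DataOutputStream.writeLong), holding at most chunkSize values in
     * memory at once. Values are compared as signed longs.
     */
    public static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();

        // Phase 1: read fixed-size chunks, sort each in memory, spill to a run file.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(input)))) {
            long[] buf = new long[chunkSize];
            while (true) {
                int n = 0;
                try {
                    while (n < chunkSize) {
                        buf[n] = in.readLong();
                        n++;
                    }
                } catch (EOFException e) {
                    // End of input: the last chunk may be short.
                }
                if (n == 0) break;
                Arrays.sort(buf, 0, n);
                Path run = Files.createTempFile("run", ".bin");
                try (DataOutputStream out = new DataOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(run)))) {
                    for (int i = 0; i < n; i++) out.writeLong(buf[i]);
                }
                runs.add(run);
                if (n < chunkSize) break;
            }
        }

        // Phase 2: k-way merge of the runs, using a heap keyed on each run's head.
        PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> Long.compare(a.head, b.head));
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(output)))) {
            for (Path p : runs) {
                Run r = new Run(p);
                if (r.advance()) heap.add(r); else r.close();
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                out.writeLong(r.head);
                if (r.advance()) heap.add(r); else r.close();
            }
        } finally {
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }
}
```

Each id costs a fixed 8 bytes on disk and is compared as a long, so there is no text parsing or string comparison anywhere in the sort. That is the kind of saving over text-based UNIX sort referred to above.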
