Paolo Castagna wrote:
> Paolo Castagna wrote:
>> TODO:
>>
>>  - Add MiniMRCluster so that it is easy for developers to run tests
>> with multiple reducers on a laptop.
> 
> Done.
> 
>>  - Split the first MapReduce job into two: one to produce offset
>> values for each partition, the other to generate data files with
>> correct ids for subsequent jobs.
> 
> Done.
> 
>>  - Build the node table concatenating output files from the MapReduce
>> jobs above.
> 
> Done.
> 
> All the changes are in a branch, here:
> https://github.com/castagna/tdbloader3/tree/hadoop-0.20.203.0
> 
> There is only one final step which is currently not done using MapReduce:
> the node2id.dat|idn files (i.e. the B+Tree index mapping 128-bit RDF node
> hashes to RDF node ids (68 bits)) are built from the nodes.dat file at the
> end of all the MapReduce jobs:
> 
> // Sequential scan of the objects file (nodes.dat): each entry is a unique RDF node.
> Iterator<Pair<Long,ByteBuffer>> iter = objects.all();
> while ( iter.hasNext() ) {
>     Pair<Long, ByteBuffer> pair = iter.next();
>     long id = pair.getLeft() ;                        // the node's id in the objects file
>     Node node = NodeLib.fetchDecode(id, objects) ;
>     Hash hash = new Hash(recordFactory.keyLength()) ; // 128-bit hash of the node
>     setHash(hash, node) ;
>     byte k[] = hash.getBytes() ;
>     Record record = recordFactory.create(k) ;         // key = hash
>     Bytes.setLong(id, record.getValue(), 0) ;         // value = node id
>     nodeToId.add(record);                             // no find(): input is unique
> }
> 
> I need to run a few experiments, but this saves a find() to check whether
> a record is already in the index: we know the objects file contains only
> unique RDF node values.
> 
> Indeed, while I was doing this I looked back at tdbloader2 and I think we
> could use the BPlusTreeRewriter 'trick' for the node table as well. I
> cannot reuse BPlusTreeRewriter as it is, since it was written for the SPO,
> GSPO, etc. indexes, where records have 3 or 4 slots of constant size (64
> bits each).
> 
> In the case of the node table, records have only two slots: 128 bits for
> the hash and 68 bits for the node id.
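>
> Something along these lines is what I have in mind. This is only a rough
> sketch: the package names, the RecordFactory sizes and packSortedRecords()
> are my assumptions, not the actual BPlusTreeRewriter API.
>
> import java.util.Iterator ;
>
> import org.openjena.atlas.lib.Bytes ;
> import org.openjena.atlas.lib.Pair ;
>
> import com.hp.hpl.jena.tdb.base.record.Record ;
> import com.hp.hpl.jena.tdb.base.record.RecordFactory ;
>
> public class NodeTableRewriterSketch
> {
>     // key = 128-bit hash, value = one 8-byte slot holding the node id
>     static final RecordFactory nodeToIdRecordFactory = new RecordFactory(16, 8) ;
>
>     // 'sortedHashIdPairs' must already be sorted by hash (e.g. by a previous
>     // MapReduce job or an external sort).
>     public static void build(final Iterator<Pair<byte[], Long>> sortedHashIdPairs)
>     {
>         Iterator<Record> records = new Iterator<Record>()
>         {
>             public boolean hasNext() { return sortedHashIdPairs.hasNext() ; }
>             public Record next()
>             {
>                 Pair<byte[], Long> pair = sortedHashIdPairs.next() ;
>                 Record r = nodeToIdRecordFactory.create(pair.getLeft()) ; // key = hash
>                 Bytes.setLong(pair.getRight(), r.getValue(), 0) ;         // value = id
>                 return r ;
>             }
>             public void remove() { throw new UnsupportedOperationException() ; }
>         } ;
>         packSortedRecords(records) ;
>     }
>
>     // Placeholder for a 2-slot-aware variant of BPlusTreeRewriter: write the
>     // leaves sequentially from the sorted stream, then build the internal
>     // B+Tree levels bottom-up.
>     static void packSortedRecords(Iterator<Record> sortedRecords)
>     { /* not implemented here */ }
> }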
> 
> I am keen to try to improve the first phase of tdbloader2, since I expect
> it could further improve performance and scalability (in particular when
> the node table indexes no longer fit in RAM).

In the case of tdbloader2 we would need to sort the RDF node values in
order to guarantee uniqueness (which is normally done via the B+Tree
index), so this is not an improvement there.

In the case of tdbloader3, a previous MapReduce job already produces the
nodes.dat file, which contains unique RDF node values. One or two more
MapReduce jobs would be necessary to produce a total sort over the
hash|id values, so I am not sure it's worth it. However, I'd like to try.
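
Concretely, for the total sort I had in mind something along these lines.
It is a sketch only, using the old 'mapred' API with the default identity
mapper/reducer and TotalOrderPartitioner; the paths, the number of reducers
and the sampling parameters are placeholders, and the input is assumed to
be SequenceFiles of <hash, id> pairs.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalSortSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TotalSortSketch.class);
        conf.setJobName("tdbloader3-total-sort-sketch");

        // Identity map and reduce (the old API defaults): the job only
        // partitions and sorts the <hash, id> pairs.
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputKeyClass(BytesWritable.class);   // 128-bit hash
        conf.setOutputValueClass(LongWritable.class);  // node id
        conf.setNumReduceTasks(9);                     // one sorted file per reducer

        FileInputFormat.setInputPaths(conf, new Path("hash-id"));          // placeholder
        FileOutputFormat.setOutputPath(conf, new Path("hash-id-sorted"));  // placeholder

        // Sample the input keys to compute the reducer boundaries, so that the
        // concatenation of the reducer outputs is globally sorted by hash.
        conf.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(conf, new Path("hash-id-partitions"));
        InputSampler.Sampler<BytesWritable, LongWritable> sampler =
            new InputSampler.RandomSampler<BytesWritable, LongWritable>(0.01, 10000, 10);
        InputSampler.writePartitionFile(conf, sampler);

        JobClient.runJob(conf);
    }
}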

I have also been comparing the performance of UNIX sort, which works on
text files, with an external sort (using SortedDataBag) over binary
files: the pure Java external sort seems faster (probably because the
files are smaller and it is sorting longs instead of strings).
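
Roughly, the binary external sort boils down to the following. This is a
plain-Java sketch of the idea, not the actual SortedDataBag code, and the
run size is arbitrary.

import java.io.* ;
import java.util.* ;

public class ExternalLongSortSketch
{
    static final int RUN_SIZE = 1000000 ;  // longs per in-memory run

    // Phase 1: read longs, sort them in memory a chunk at a time, spill each
    // sorted chunk ("run") to a temporary file.
    static List<File> writeRuns(DataInputStream in) throws IOException
    {
        List<File> runs = new ArrayList<File>() ;
        long[] buffer = new long[RUN_SIZE] ;
        boolean eof = false ;
        while ( !eof )
        {
            int n = 0 ;
            try { while ( n < RUN_SIZE ) { buffer[n] = in.readLong() ; n++ ; } }
            catch (EOFException e) { eof = true ; }
            if ( n == 0 ) break ;
            Arrays.sort(buffer, 0, n) ;
            File run = File.createTempFile("run", ".dat") ;
            DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(run))) ;
            for ( int i = 0 ; i < n ; i++ ) out.writeLong(buffer[i]) ;
            out.close() ;
            runs.add(run) ;
        }
        return runs ;
    }

    // Phase 2: k-way merge of the sorted runs using a priority queue.
    static void merge(List<File> runs, DataOutputStream out) throws IOException
    {
        PriorityQueue<Head> heap = new PriorityQueue<Head>() ;
        for ( File run : runs )
        {
            DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(run))) ;
            try { heap.add(new Head(in.readLong(), in)) ; }
            catch (EOFException e) { in.close() ; }
        }
        while ( !heap.isEmpty() )
        {
            Head head = heap.poll() ;
            out.writeLong(head.value) ;
            try { heap.add(new Head(head.in.readLong(), head.in)) ; }
            catch (EOFException e) { head.in.close() ; }
        }
    }

    static class Head implements Comparable<Head>
    {
        final long value ;
        final DataInputStream in ;
        Head(long value, DataInputStream in) { this.value = value ; this.in = in ; }
        public int compareTo(Head other)
        { return value < other.value ? -1 : (value > other.value ? 1 : 0) ; }
    }
}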

Paolo

> 
> @Andy, does this idea make sense?
> 
> >>  - Test on a cluster with a large (> 1B) dataset.
> 
> Soon...
> 
> Paolo
