Hi Miles,

I'm not following you.
If I save only a hash or a bit vector of the associated data, how can I
then quickly access the elements themselves afterwards? (The file with the
data might be >100 GB and sits on the DFS.)
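
As far as I can tell, a Bloom filter only answers membership questions.
Here is a minimal sketch of what I understand you to mean (plain Java; the
sizing and the double-hashing scheme are just illustrative, not tuned):

import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits;
    private final int size;       // number of bits
    private final int numHashes;  // number of hash functions

    public BloomSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // derive numHashes hash functions from two via double hashing
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = h1 >>> 16;
        return ((h1 + i * h2) & 0x7fffffff) % size;
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
    }

    // true means "probably present", false means "definitely absent"
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(key, i))) return false;
        return true;
    }
}

This tells me whether a key is (probably) in the set, but it cannot give me
back the associated data itself, which is what I need in the reducer.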

Alternatively, I could save the offset of each entry in the data file as
the reference, and then have each reducer read that big file only once. As
all the keys are sorted, I can fetch all the needed values in one big
sequential read, skipping the entries I don't need; see the sketch below.
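
Concretely, something like this is what I have in mind (a rough sketch; the
length-prefixed record layout and the class and method names are made up,
but FSDataInputStream does support seeking):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffsetReader {
    private final FSDataInputStream in;

    public OffsetReader(Configuration conf, Path dataFile) throws IOException {
        FileSystem fs = dataFile.getFileSystem(conf);
        this.in = fs.open(dataFile);  // opened once per reducer
    }

    // Offsets must arrive in ascending order (they do, since the keys are
    // sorted), so every seek moves forward and the whole reduce phase is
    // effectively one sequential pass over the file.
    public byte[] readRecordAt(long offset) throws IOException {
        in.seek(offset);
        int length = in.readInt();       // assumed: 4-byte length prefix
        byte[] value = new byte[length];
        in.readFully(value);
        return value;
    }

    public void close() throws IOException {
        in.close();
    }
}

That way the map output only carries (key, offset) pairs instead of the
full values.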


Thibaut



Miles Osborne wrote:
> 
> Just re-represent the associated data as a bit vector and a set of hash
> functions. You then just copy this around, rather than the raw items
> themselves.
> 
> Miles
> 
> 2009/2/18 Thibaut_ <tbr...@blue.lu>:
>>
>> Hi,
>>
>> The Bloom filter solution works great, but I still have to copy the
>> data around sometimes.
>>
>> I'm still wondering if I can reduce the data associated with the keys
>> to a reference or something similarly small (the >100 KB values are
>> very big), so that I can fetch the actual data later, in the reduce
>> step.
>>
>> In the past I used HBase to store the associated data (but
>> unfortunately HBase proved to be very unreliable in my case). I will
>> probably also start compressing the data in the value store, which
>> should speed up sorting (the data there is presumably stored
>> uncompressed).
>> Is there something else I could do to speed this process up?
>>
>> Thanks,
>> Thibaut
> 
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
