Just re-represent the associated data as a bit vector plus a set of hash
functions; you then copy that around, rather than the raw items
themselves.
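As a rough sketch of what I mean (names and parameters here are illustrative, not from any particular library), a minimal Bloom filter is just a BitSet and k hash functions derived from the key's hash:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: membership in the small subset is encoded
// as a bit vector plus k hash functions, so only the bit vector needs to
// be copied around, not the raw items.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit index from the key via double hashing.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
    }

    // May return false positives, but never false negatives.
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(key, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter(1 << 16, 4);
        f.add("key-in-small-subset");
        System.out.println(f.mightContain("key-in-small-subset")); // true
    }
}
```

Since false positives are possible, you'd still verify the hits against the real keys in the reduce step, but the filter prunes the vast majority of the big dataset up front.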

Miles

2009/2/18 Thibaut_ <tbr...@blue.lu>:
>
> Hi,
>
> The bloomfilter solution works great, but I still have to copy the data
> around sometimes.
>
> I'm still wondering if I can reduce the data associated with the keys to a
> reference or something small (the >100 KB of data per key is very big), with
> which I could then fetch the actual data later in the reduce step.
>
> In the past I used HBase to store the associated data (but
> unfortunately HBase proved to be very unreliable in my case). I will
> probably also start compressing the data in the value store, which should
> speed up sorting (as the data there is probably stored
> uncompressed).
> Is there something else I could do to speed this process up?
>
> Thanks,
> Thibaut
> --
> View this message in context: 
> http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
