Re: Finding small subset in very large dataset

2009-02-18 Thread Thibaut_

Hi,

The bloomfilter solution works great, but I still have to copy the data
around sometimes.

I'm still wondering whether I can reduce the data associated with the keys
to a reference or something similarly small (the >100 KB of data per key is
very big), with which I can then fetch the full data later in the reduce step.

In the past I used HBase to store the associated data (but unfortunately
HBase proved to be very unreliable in my case). I will probably also start
compressing the data in the value store, which should speed up sorting (the
data there is probably stored uncompressed); a sketch of the settings I have
in mind is below.
Is there something else I could do to speed this process up?
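
For the compression part, this is roughly what I have in mind (a sketch
against the old JobConf API; the class name and codec choice are just
placeholders):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Sketch: compress intermediate map output (less data to sort and shuffle)
// and block-compress the SequenceFile job output (smaller value store on disk).
public class CompressionSettings {
  public static JobConf apply(JobConf conf) {
    conf.setCompressMapOutput(true);                     // compress map output before the sort
    conf.setMapOutputCompressorClass(GzipCodec.class);   // or LZO, if installed
    FileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressionType(conf,
        SequenceFile.CompressionType.BLOCK);
    return conf;
  }
}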

Thanks,
Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-18 Thread Miles Osborne
Just re-represent the associated data as a bit vector and a set of hash
functions. You then just copy this around, rather than the raw items
themselves.

Miles


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: Finding small subset in very large dataset

2009-02-18 Thread Thibaut_

Hi Miles,

I'm not following you.
If I save an associated hash or bit vector, how can I then quickly
access the elements afterwards (the file with the data might be 100 GB
and is on the DFS)?

I could also directly save the offset of the data in the data file as the
reference, and then have each reducer read that big file only once. As all
the keys are sorted, I can get all the needed values in one sequential read
(skipping the entries I don't need).
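
Roughly what I have in mind, assuming the big file is a SequenceFile and the
map output carries each record's byte offset (class and type names are made
up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: fetch records from the big SequenceFile by stored byte offset.
// Offsets must be record boundaries, e.g. Writer.getLength() noted just before
// each append (or Reader.getPosition() before each next()); reading them in
// sorted order turns this into one forward pass over the file.
public class OffsetFetcher {
  public static void fetch(Configuration conf, Path bigFile, long[] sortedOffsets)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, bigFile, conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    try {
      for (long offset : sortedOffsets) {
        reader.seek(offset);            // jump to the record boundary noted earlier
        if (reader.next(key, value)) {
          // ... hand (key, value) to the real computation ...
        }
      }
    } finally {
      reader.close();
    }
  }
}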


Thibaut




-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-18 Thread Miles Osborne
If I remember correctly you have two sets of data:

-- set A, which is very big
-- set B, which is small

and you want to find all elements of A which are in B. Right?

Represent A using a variant of a Bloom Filter which supports key-value
pairs; a Bloomier Filter will do this for you.

Each mapper then loads up A (represented using the Bloomier Filter)
and works over B. Whenever an item of B is present in the representation
of A, you look up the associated value and emit it.

If even with a Bloomier Filter you still need too much memory, then you
could store it once using Hypertable.

See here for an explanation of Bloomier Filters applied to the task of
storing lots of (string, probability) pairs:

Randomized Language Models via Perfect Hash Functions
http://aclweb.org/anthology-new/P/P08/P08-1058.pdf

Miles


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: Finding small subset in very large dataset

2009-02-12 Thread Thibaut_

Thanks,

I didn't think about the bloom filter variant. That's the solution I was
looking for :-)

Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-12 Thread Miles Osborne
Bloom Filters are one of the greatest things ever, so it is nice to
see another application.

Remember that your filter may make mistakes: it will sometimes report items
as being in the set when they are not. Also, instead of setting a single bit
per item (in the A set), set k distinct bits.

You can analytically work out the best k for a given number of items
and for some amount of memory. In practice, this usually boils down
to k being 3 or so for a reasonable error rate.
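
For reference, the usual sizing math as a tiny stand-alone sketch (the n and
p values are only examples):

// Standard Bloom filter sizing: for n items and a target false-positive
// rate p, the optimal bit-array size and number of hash functions are
//   m = -n * ln(p) / (ln 2)^2      and      k = (m / n) * ln 2.
public class BloomSizing {
  public static void main(String[] args) {
    long n = 100000;      // expected number of keys (e.g. the A set)
    double p = 0.01;      // acceptable false-positive rate
    double m = -n * Math.log(p) / (Math.log(2) * Math.log(2));
    long k = Math.round((m / n) * Math.log(2));
    System.out.printf("bits = %.0f (about %.0f KB), hash functions = %d%n",
        m, m / 8 / 1024, k);
  }
}

At roughly 1.2 bytes per key that works out to about a 1% false-positive
rate; with less memory per key the optimal k drops toward the 3 mentioned
above.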

Happy hunting

Miles


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Finding small subset in very large dataset

2009-02-11 Thread Thibaut_

Hi,

Let's say the smaller subset has name A. It is a relatively small collection
of < 100 000 entries (could also be only 100), with nearly no payload as value.
Collection B is a big collection with 10 000 000 entries (each key of A
also exists in collection B), where the value for each key is relatively
big (> 100 KB).

For all the keys in A, I need to get the corresponding value from B and
collect it in the output.


- I can do this by reading in both files and, in the reduce step, doing my
computations and collecting only those keys which are in both A and B. The map
phase, however, will take very long, as all the key/value pairs of collection B
need to be sorted (and each key's value is > 100 KB) at the end of the map
phase, which is overkill if A is very small.
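
As a sketch, the reduce side of that naive join would look roughly like the
following (old mapred API; the "A"/"B" value tags are made up for
illustration):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the naive reduce-side join: mappers tag each value with "A\t" or
// "B\t"; the reducer emits B's payload only if an A marker showed up for the
// same key. Everything from B still gets shuffled and sorted first, which is
// the expensive part described above.
public class NaiveJoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    boolean inA = false;
    Text payload = null;
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("A\t")) {
        inA = true;
      } else if (v.startsWith("B\t")) {
        payload = new Text(v.substring(2));
      }
    }
    if (inA && payload != null) {
      output.collect(key, payload);   // key was in both A and B
    }
  }
}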

What I would need is a way to somehow compute the intersection first (a
mapper over the keys only, then a reduce function based only on the keys and
not the corresponding values, which collects the keys I want to keep), and
then run over the map input again, filtering the output collector or the
input based on the results of that reduce phase.

Or is there another, faster way? Collection A could be so big that it doesn't
fit into memory. I could split collection A up into multiple smaller
collections, but that would make it more complicated, so I want to avoid
that route. (This is similar to the approach I described above, just done
manually.)

Thanks,
Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-11 Thread Amit Chandel
Are the keys in collection B unique?

If so, I would like to try this approach:
For each key/value pair of collection B, make a file out of it, with the file
name given by the MD5 hash of the key and the value as its content, and then
store all these files into a HAR archive.
The HAR archive will create an index over the keys for you.
Now you can iterate over collection A, compute the MD5 hash of each key, and
look the corresponding file up in the archive (to get the value).
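
Deriving the file name is straightforward, something along these lines
(plain JDK; the hex formatting is just one choice):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: turn a record key into a stable MD5-based file name so the
// HAR index can be probed with the same name later.
public class KeyNames {
  public static String md5Name(String key) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(key.getBytes(java.nio.charset.Charset.forName("UTF-8")));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));   // lowercase hex, 32 chars total
    }
    return hex.toString();
  }
}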



Re: Finding small subset in very large dataset

2009-02-11 Thread Aaron Kimball
I don't see why a HAR archive needs to be involved. You can use a MapFile to
create a scannable index over a SequenceFile and do lookups that way.
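
The lookup side would be roughly this (a sketch; the directory name and the
Text/BytesWritable types are placeholders for whatever B actually uses):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch: random lookups against a MapFile (a sorted SequenceFile plus an
// index). Returns the value bytes for a key, or null if the key is absent.
public class MapFileLookup {
  public static byte[] lookup(Configuration conf, String dir, String k)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      BytesWritable value = new BytesWritable();
      if (reader.get(new Text(k), value) != null) {   // null means not found
        byte[] copy = new byte[value.getLength()];
        System.arraycopy(value.getBytes(), 0, copy, 0, value.getLength());
        return copy;
      }
      return null;
    } finally {
      reader.close();
    }
  }
}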

But if A is small enough to fit in RAM, then there is a much simpler way:
write it out to a file and disseminate it to all mappers via the
DistributedCache. They then each read the entire A set into a HashSet or
other data structure during configure(), before they scan through their
slices of B. They then emit only the B values which hit in A. This is called
a map-side join. If you don't care about sorted ordering of your results,
you can then disable the reducers entirely.
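
A rough sketch of such a mapper (old mapred API; the cache-file handling is
simplified and Text/BytesWritable stand in for B's real types):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the map-side join: load the small A key set from the
// DistributedCache into a HashSet in configure(), then emit only those B
// records whose key is in A. With reducers disabled, none of the big values
// are ever sorted or shuffled.
public class MapSideJoinMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, BytesWritable> {

  private final Set<String> aKeys = new HashSet<String>();

  @Override
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        aKeys.add(line.trim());          // one A key per line, by assumption
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load A key set", e);
    }
  }

  public void map(Text key, BytesWritable value,
      OutputCollector<Text, BytesWritable> output, Reporter reporter)
      throws IOException {
    if (aKeys.contains(key.toString())) {
      output.collect(key, value);        // key hit in A: keep B's big value
    }
  }
}

On the client side you would add the A file with
DistributedCache.addCacheFile(uri, conf) and skip the sort entirely with
conf.setNumReduceTasks(0).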

Hive already supports this behavior; but you have to explicitly tell it to
enable map-side joins for each query because only you know that one data set
is small enough ahead of time.

If your A set doesn't fit in RAM, you'll need to get more creative. One
possibility is to do the same thing as above, but instead of reading all of
A into memory, use a hash function to squash the keys from A into some
bounded amount of RAM. For example, allocate yourself a 256 MB bitvector;
for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for
each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1,
then it may match an A key; if it's 0 then it definitely does not match an A
key. For each potential match, send it to the reducer. Send all the A keys
to the reducer as well, where the  precise joining will occur. (Note: this
is effectively the same thing as a Bloom Filter.)
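
In isolation, that bitvector trick is just this (a sketch; java.util.BitSet
stands in for the raw array, and a single hash rather than several):

import java.util.BitSet;

// Sketch of the hashed-bitvector pre-filter: one pass over A sets bits, a
// pass over B keeps only keys whose bit is set. False positives get resolved
// later at the reducer; a miss is guaranteed not to be in A.
public class BitVectorFilter {
  private final BitSet bits;
  private final int size;

  public BitVectorFilter(int numBits) {   // numBits up to Integer.MAX_VALUE (~256 MB of bits)
    this.size = numBits;
    this.bits = new BitSet(numBits);
  }

  private int slot(String key) {
    return (key.hashCode() & 0x7fffffff) % size;   // non-negative slot index
  }

  public void addAKey(String aKey)       { bits.set(slot(aKey)); }
  public boolean mightMatch(String bKey) { return bits.get(slot(bKey)); }
}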

This will send much less data to each reducer and should give much better
throughput.

- Aaron

