Re: Finding small subset in very large dataset
Hi,

The Bloom filter solution works great, but I still have to copy the data around sometimes. I'm still wondering whether I can reduce the data associated with each key to a reference or something similarly small (the ~100 KB values are very big), which I could then use to fetch the actual data in the reduce step. In the past I used HBase to store the associated data, but unfortunately HBase proved very unreliable in my case.

I will probably also start compressing the data in the value store, which should speed up sorting (the data there is probably uncompressed at the moment). Is there anything else I could do to speed this process up?

Thanks,
Thibaut
Re: Finding small subset in very large dataset
Just re-represent the associated data as a bit vector and a set of hash functions. You then copy this compact representation around, rather than the raw items themselves.

Miles
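For concreteness, here is a minimal sketch of the bit-vector-plus-hash-functions structure Miles means, i.e. a plain Bloom filter. The class name and the simplistic double-hashing scheme are illustrative assumptions, not code from the thread:

import java.util.Arrays;
import java.util.BitSet;

/**
 * Minimal Bloom filter sketch: each key is hashed k times into a shared
 * bit vector. Membership tests can return false positives but never
 * false negatives.
 */
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th hash from two base hashes (double hashing).
    private int indexFor(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | (h1 << 16);  // a cheap second hash
        return Math.abs((h1 + i * h2) % size);
    }

    public void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(indexFor(key, i));
        }
    }

    public boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(indexFor(key, i))) {
                return false;  // definitely not in the set
            }
        }
        return true;  // probably in the set
    }
}

add() is called once per key when building the filter; mightContain() then answers membership queries from the compact bit vector alone, so only the filter needs to be copied around.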
Re: Finding small subset in very large dataset
Hi Miles,

I'm not following you. If I save only an associated hash or bit vector, how can I quickly access the elements afterwards? The file with the data might be 100 GB and lives on the DFS.

I could also directly save the offset of each value in the data file as the reference, and then have each reducer read that big file only once. Since all the keys are sorted, I can fetch all the needed values in one big sequential read, skipping the entries I don't need.

Thibaut
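A minimal sketch of that offset-as-reference idea, assuming each reference is an (offset, length) pair and the references arrive already sorted in file order; the class and record layout are assumptions, not from the thread:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Fetches values from a big HDFS data file in a single forward pass,
 * given (offset, length) references sorted in file order.
 */
public class OffsetFetcher {
    public static class Ref {
        final long offset;
        final int length;
        Ref(long offset, int length) { this.offset = offset; this.length = length; }
    }

    public static void fetchAll(Configuration conf, Path dataFile,
                                List<Ref> sortedRefs) throws IOException {
        FileSystem fs = dataFile.getFileSystem(conf);
        FSDataInputStream in = fs.open(dataFile);
        try {
            for (Ref ref : sortedRefs) {
                byte[] value = new byte[ref.length];
                in.seek(ref.offset);   // forward-only seeks: one pass over the file
                in.readFully(value);
                process(value);        // application-specific handling
            }
        } finally {
            in.close();
        }
    }

    private static void process(byte[] value) { /* ... */ }
}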
Re: Finding small subset in very large dataset
If I remember correctly you have two sets of data:

- set A, which is very big
- set B, which is small

and you want to find all elements of A which are in B, right?

Represent A using a variant of a Bloom filter that supports key-value pairs; a Bloomier filter will do this for you. Each mapper then loads up A (represented as the Bloomier filter) and works over B. Whenever an element of B is present in the representation of A, you look up the associated value and emit it.

If even with a Bloomier filter you still need too much memory, you could store the data once using Hypertable.

See here for an explanation of Bloomier filters applied to the task of storing lots of (string, probability) pairs: Randomized Language Models via Perfect Hash Functions, http://aclweb.org/anthology-new/P/P08/P08-1058.pdf

Miles
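A sketch of what that mapper might look like against the old (0.x) mapred API. KeyValueFilter is a hypothetical stand-in for a real Bloomier-filter implementation, and loadFilter() is assumed to deserialize a pre-built filter; neither is from the thread:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Mapper that streams over the small set and probes a compact filter. */
public class FilterJoinMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, BytesWritable> {

    interface KeyValueFilter {
        // Returns the stored value, or null if the key is (probably) absent.
        byte[] lookup(byte[] key);
    }

    private KeyValueFilter filter;

    @Override
    public void configure(JobConf job) {
        filter = loadFilter(job);  // hypothetical helper: read the pre-built filter
    }

    @Override
    public void map(Text key, BytesWritable value,
                    OutputCollector<Text, BytesWritable> out,
                    Reporter reporter) throws IOException {
        byte[] stored = filter.lookup(key.getBytes());
        if (stored != null) {
            // Key hit in the compact representation: emit the associated value.
            out.collect(key, new BytesWritable(stored));
        }
    }

    private static KeyValueFilter loadFilter(JobConf job) {
        throw new UnsupportedOperationException("filter loading omitted in sketch");
    }
}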
Re: Finding small subset in very large dataset
Thanks, I didn't think about the Bloom filter variant. That's the solution I was looking for :-)

Thibaut
Re: Finding small subset in very large dataset
Bloom filters are one of the greatest things ever, so it is nice to see another application.

Remember that your filter can make mistakes: it may report items as present that are not actually in the set (false positives), though it will never miss an item that really is in the set. Also, instead of setting a single bit per item of the A set, set k distinct bits. You can analytically work out the best k for a given number of items and a given amount of memory; in practice this usually boils down to k being 3 or so for a reasonable error rate.

Happy hunting,
Miles
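The thread doesn't spell the formulas out, but the standard ones are k = (m/n) ln 2 for m bits and n items, with an expected false-positive rate of roughly (1 - e^(-kn/m))^k. A small sketch:

/**
 * Optimal hash count k and expected false-positive rate for a Bloom
 * filter of m bits holding n items (standard textbook formulas).
 */
public class BloomParams {
    public static int optimalK(long mBits, long nItems) {
        return Math.max(1, (int) Math.round((double) mBits / nItems * Math.log(2)));
    }

    public static double falsePositiveRate(long mBits, long nItems, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * nItems / mBits), k);
    }

    public static void main(String[] args) {
        long n = 10000000L;   // items in the big set
        long m = 5L * n;      // budget of about 5 bits per item (~6 MB)
        int k = optimalK(m, n);
        System.out.printf("k = %d, fp rate = %.4f%n", k, falsePositiveRate(m, n, k));
    }
}

With about 5 bits per item this yields k = 3 and a false-positive rate near 9%, consistent with Miles's rule of thumb.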
Finding small subset in very large dataset
Hi,

Let's say the smaller subset is named A. It is a relatively small collection of 100 000 entries (could also be only 100), with nearly no payload as value. Collection B is a big collection with 10 000 000 entries (each key of A also exists in collection B), where the value for each key is relatively big (around 100 KB).

For all the keys in A, I need to get the corresponding value from B and collect it in the output.

I can do this by reading in both files and, in the reduce step, doing my computations and collecting only those keys which appear in both A and B. The map phase, however, will take very long, as all the key/value pairs of collection B need to be sorted at the end of the map phase (and each key's value is around 100 KB), which is overkill if A is very small.

What I would need is a way to compute the intersection first (a mapper over keys only, then a reduce function based only on keys, without the corresponding values, which collects the keys I want to keep), and then to run over the map input again, filtering the output collector or the input based on the results from that reduce phase. Or is there another, faster way?

Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make things more complicated, so I want to avoid that route. (This is similar to the approach I described above, just done manually.)

Thanks,
Thibaut
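For reference, the straightforward approach described above is a reduce-side join. A minimal sketch of the reducer against the old mapred API; the one-byte source tag ('A' or 'B' prepended to each value by the mappers) is an assumed convention, not something from the thread:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Emits B's payload only for keys that also appeared in A. */
public class IntersectReducer extends MapReduceBase
        implements Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    public void reduce(Text key, Iterator<BytesWritable> values,
                       OutputCollector<Text, BytesWritable> out,
                       Reporter reporter) throws IOException {
        boolean seenInA = false;
        BytesWritable payload = null;
        while (values.hasNext()) {
            BytesWritable v = values.next();
            byte tag = v.getBytes()[0];
            if (tag == 'A') {
                seenInA = true;            // the key exists in the small set
            } else {                        // tag == 'B': strip tag, keep payload
                payload = new BytesWritable();
                payload.set(v.getBytes(), 1, v.getLength() - 1);
            }
        }
        if (seenInA && payload != null) {
            out.collect(key, payload);      // key is in both A and B
        }
    }
}

The cost Thibaut complains about is visible here: every ~100 KB B value travels through the shuffle and sort just to reach this reducer, even though most keys will be discarded.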
Re: Finding small subset in very large dataset
Are the keys in collection B unique? If so, I would try this approach: for each (key, value) pair of collection B, make a file out of it, with the file name given by the MD5 hash of the key and the value as its content, and then store all these files in a HAR archive. The HAR archive will create an index for you over the keys. Now you can iterate over collection A, compute the MD5 hash of each key, and look up the corresponding file (and thus the value) in the archive.
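A small sketch of the key-to-filename mapping (hex-encoded MD5); the HAR packing and lookup code is omitted:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Maps a record key to a fixed-width hex file name via MD5. */
public class KeyNaming {
    public static String fileNameFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Left-pad to 32 hex chars so every file name has the same width.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is guaranteed by the JDK", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("some-key"));  // prints the 32-char hex digest
    }
}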
Re: Finding small subset in very large dataset
I don't see why a HAR archive needs to be involved. You can use a MapFile to create a scannable index over a SequenceFile and do lookups that way.

But if A is small enough to fit in RAM, there is a much simpler way: write it out to a file and disseminate it to all mappers via the DistributedCache. During configure(), before they scan their slices of B, they each read the entire A set into a HashSet or another suitable data structure. They then emit only the B values whose keys hit in A. This is called a map-side join. If you don't care about sorted ordering of your results, you can then disable the reducers entirely. Hive already supports this behavior, but you have to explicitly tell it to enable map-side joins for each query, because only you know ahead of time that one data set is small enough.

If your A set doesn't fit in RAM, you'll need to get more creative. One possibility is to do the same thing as above, but instead of reading all of A into memory, use a hash function to squash the keys from A into some bounded amount of RAM. For example, allocate yourself a 256 MB bit vector; for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1, it may match an A key; if it's 0, it definitely does not. Send each potential match to the reducer. Send all the A keys to the reducer as well, where the precise joining will occur. (Note: this is effectively the same thing as a Bloom filter.) This will send much less data to the reducers and should see better throughput.

- Aaron
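A minimal sketch of the map-side join Aaron describes, against the 2009-era mapred API. The one-key-per-line format of the cached file and all names here are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Loads the small A set into memory, then filters B records in the map phase. */
public class MapSideJoinMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, BytesWritable> {

    private final Set<String> aKeys = new HashSet<String>();

    @Override
    public void configure(JobConf job) {
        try {
            // Local copies of files the driver added to the DistributedCache.
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            BufferedReader in = new BufferedReader(
                    new FileReader(cached[0].toString()));
            String line;
            while ((line = in.readLine()) != null) {
                aKeys.add(line.trim());   // read the entire A set into memory
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("failed to load A-key file", e);
        }
    }

    @Override
    public void map(Text key, BytesWritable value,
                    OutputCollector<Text, BytesWritable> out,
                    Reporter reporter) throws IOException {
        // Emit only the B records whose key hits in the small A set.
        if (aKeys.contains(key.toString())) {
            out.collect(key, value);
        }
    }
}

Because the filtering happens before the shuffle, none of the ~100 KB values for non-matching keys are ever sorted or sent over the network, which is exactly what the original question was after.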