Bloom filter based scanner/filter
I am building a data cube on top of HBase. All access to the data is through map/reduce jobs. I want to build a scanner whose first matching criterion is based on the set intersection of Bloom filters, followed by additional matching criteria specified in the current filter architecture. First, I run a map/reduce job on table A. For every row I match in table A, I add the row key to a Bloom filter. I then run a map/reduce job on table B, where the row keys are over the same domain as table A. I want to build a scanner that can use the built-in Bloom filters in HBase. When the scanner goes to get the block of data to which a row-key-based Bloom filter is attached, it does a set intersection with the table A Bloom filter to see if any of the keys from table A are in the block. If so, the block is read in and the scanner does additional matching on the rows according to the filter. This is a simplification of my problem. I am trying to find out what the complexity of implementing such a feature would be in HBase.
- Sincerely, David G. Boney Chair, Austin ACM SIGKDD ch...@austin-acm-sigkdd.org http://www.meetup.com/Austin-ACM-SIGKDD/ http://tech.groups.yahoo.com/group/austinsigkdd/
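A minimal sketch of the intersection test described above, written in Python rather than HBase's Java. It assumes both filters use the same bit-array size and the same hash functions, which may not hold for HBase's built-in filters. `BloomFilter` and `may_intersect` are hypothetical names for illustration, not HBase APIs. The check is conservative: an all-zero AND proves the two key sets are disjoint, while a nonzero AND only means the block might hold a table A key and has to be read.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def may_intersect(a: BloomFilter, b: BloomFilter) -> bool:
    """False means the key sets are certainly disjoint; True means 'maybe'."""
    if (a.num_bits, a.num_hashes) != (b.num_bits, b.num_hashes):
        raise ValueError("filters must share size and hash functions")
    # A key present in both sets has all of its bits set in both filters,
    # so a shared key always leaves at least one bit in the AND.
    return (a.bits & b.bits) != 0


# Hypothetical data: keys matched in table A, and keys stored in one block of table B.
table_a = BloomFilter()
for key in (b"row-1", b"row-2"):
    table_a.add(key)

block = BloomFilter()
for key in (b"row-2", b"row-9"):
    block.add(key)

read_block = may_intersect(table_a, block)  # True: row-2 is in both sets
```

Because unrelated keys can also share bits, the AND test gives more false positives than probing each table A key against the block's filter with `might_contain`, so it is only a cheap first pass.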
Re: Bloom Filter
Very good explanation (and food for thought) about using Bloom filters in HBase in the answers here: http://www.quora.com/How-are-bloom-filters-used-in-HBase. Should we link to it from the Apache HBase book (ref guide)?
Alex Baranau -- Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
On Thu, Jul 26, 2012 at 8:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! I have time-series data, so I am thinking I should enable Bloom filters for rows only.
Re: Bloom Filter
On Fri, Jul 27, 2012 at 4:25 PM, Alex Baranau alex.barano...@gmail.com wrote: Should we put the link to it from Apache HBase book (ref guide)?
I added the link. It will show up the next time we push the site.
St.Ack
Re: Bloom Filter
On Fri, Jul 27, 2012 at 7:25 AM, Alex Baranau alex.barano...@gmail.com wrote: Very good explanation (and food for thinking) about using bloom filters in HBase in answers here: http://www.quora.com/How-are-bloom-filters-used-in-HBase. Should we put the link to it from Apache HBase book (ref guide)?
Thanks, this is helpful.
Re: Bloom Filter
On Thu, Jul 26, 2012 at 1:52 PM, Minh Duc Nguyen mdngu...@gmail.com wrote:
Mohit, According to HBase: The Definitive Guide, The row+column Bloom filter is useful when you cannot batch updates for a specific row, and end up with store files which all contain parts of the row. The more specific row+column filter can then identify which of the files contain the data you are requesting. Obviously, if you always load the entire row, this filter is once again hardly useful, as the region server will need to load the matching block out of each file anyway. Since the row+column filter will require more storage, you need to do the math to determine whether it is worth the extra resources.
~ Minh
Thanks! I have time-series data, so I am thinking I should enable Bloom filters for rows only.
On Thu, Jul 26, 2012 at 4:30 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
Is it advisable to enable Bloom filters on the column family? Also, why is it called the global kill switch?
Bloom Filter Configuration
2.9.1. io.hfile.bloom.enabled global kill switch
io.hfile.bloom.enabled in Configuration serves as the kill switch in case something goes wrong. Default = true.
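To make the ROW versus ROWCOL trade-off from the quoted passage concrete, here is a small sketch in Python. It is a toy model, not HBase code: exact Python sets stand in for Bloom filters (so there are no false positives), and the file names, row keys, and the `files_to_read` helper are all made up for illustration.

```python
# Hypothetical store files: row1 was updated at different times,
# so its columns ended up spread across three files.
store_files = {
    "file1": [("row1", "colA")],
    "file2": [("row1", "colB")],
    "file3": [("row1", "colC"), ("row2", "colA")],
}


def files_to_read(bloom_type, row, qualifier=None):
    """Return the store files a lookup cannot rule out."""
    hits = []
    for name, cells in store_files.items():
        if bloom_type == "ROW":
            # A ROW filter only records which row keys occur in the file.
            index = {r for r, _ in cells}
            probe = row
        else:
            # A ROWCOL filter records each (row, qualifier) pair.
            index = set(cells)
            probe = (row, qualifier)
        if probe in index:
            hits.append(name)
    return hits


row_hits = files_to_read("ROW", "row1")                 # all three files
rowcol_hits = files_to_read("ROWCOL", "row1", "colB")   # only file2
```

When the whole row is requested there is no single (row, qualifier) pair to probe, which is why the passage calls the row+column filter "hardly useful" for full-row reads.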
Re: Scans and Bloom Filter
Bryan,
Currently, ROW and ROWCOL Bloom Filters are only checked for explicit, single-row 'Get' scans. ROWCOL BFs are only checked when you're querying for explicit column qualifiers (vs. getting the entire row). This is because multi-row scans and full-row scans are implicit queries. To clarify: with a multi-row scan, the next row after 0x0001 is NOT necessarily 0x0002. HBase only knows that the next row is greater than 0x0001. The next row could be 0x00010 or 0x0003. However, when you call HTable.get(row=0x0001), HBase knows that you explicitly want that row and don't want 0x00010.
Nicolas
On 2/15/12 9:18 PM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: Hello, We are looking at Bloom Filters and wondering if they are helpful when doing a sequential read (multi-row scan) or only when doing a Get for a single row. It logically makes sense that it would only affect (or affect to a greater degree) getting a single row, since it is a way of determining whether you have to read a whole store file when fetching a key. But we are told that Scan and Get are essentially the same code on the backend, so I imagine both will check the Blooms if they exist. Also, would a ROWCOL bloom be more effective if you are often doing multi-row scans but always specifying only a subset of columns in those rows? Thanks, Bryan
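The distinction Nicolas draws can be sketched in a few lines of Python. This is only an illustration under assumptions: a Python set stands in for the Bloom filter, the three row keys are invented, and `get` and `next_row_after` are hypothetical helpers, not HBase calls. A membership structure can answer "is this exact key present?" but not "what is the next key?", and with byte-ordered row keys the next key after 00 01 can be a longer key such as 00 01 00.

```python
import bisect

# Hypothetical row keys, kept in lexicographic byte order as HBase orders rows.
rows = sorted([b"\x00\x01", b"\x00\x01\x00", b"\x00\x03"])

# Exact stand-in for a ROW Bloom filter: answers membership questions only.
row_set = set(rows)


def get(row: bytes) -> bool:
    """Explicit single-row lookup: a membership test is enough."""
    return row in row_set


def next_row_after(row: bytes):
    """Scan step: needs the sorted keys; a membership test cannot answer this."""
    i = bisect.bisect_right(rows, row)
    return rows[i] if i < len(rows) else None


successor = next_row_after(b"\x00\x01")  # the longer key, not b"\x00\x02"
```

A scan would have to probe every possible successor key to use the filter, which is why only explicit lookups benefit from it.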
Re: Scans and Bloom Filter
Good stuff, Nicolas. I'll add this to the book.
On 2/16/12 3:52 PM, Nicolas Spiegelberg nspiegelb...@fb.com wrote: Bryan, Currently, ROW and ROWCOL Bloom Filters are only checked for explicit, single-row 'Get' scans. ROWCOL BFs are only checked when you're querying for explicit column qualifiers (vs. getting the entire row). This is because multi-row scans and full-row scans are implicit queries. To clarify: with a multi-row scan, the next row after 0x0001 is NOT necessarily 0x0002. HBase only knows that the next row is greater than 0x0001. The next row could be 0x00010 or 0x0003. However, when you call HTable.get(row=0x0001), HBase knows that you explicitly want that row and don't want 0x00010. Nicolas
Scans and Bloom Filter
Hello, We are looking at Bloom Filters and wondering if they are helpful when doing a sequential read (multi-row scan) or only when doing a Get for a single row. It logically makes sense that it would only affect (or affect to a greater degree) getting a single row, since it is a way of determining whether you have to read a whole store file when fetching a key. But we are told that Scan and Get are essentially the same code on the backend, so I imagine both will check the Blooms if they exist. Also, would a ROWCOL bloom be more effective if you are often doing multi-row scans but always specifying only a subset of columns in those rows? Thanks, Bryan
Re: Setting up Bloom filter on created/populated table
The syntax you used was for table properties. The following worked for me:
alter 'table', NAME => 'colfam', BLOOMFILTER => 'ROWCOL'
where table is the name of the table and colfam is the name of the column family.
Cheers
On Sun, Feb 12, 2012 at 1:33 PM, Ben Snively bsniv...@gmail.com wrote:
I am using the version right before that (I think -- somewhat new to hbase) -- it's the cloudera distro: hbase-0.90.4-cdh3u2. Figured bloom filters were an option, since it lists BLOOMFILTER='NONE' as the attribute now. This is the command I'm trying to do (after disabling the table):
hbase(main):013:0> alter 'ExistenceTable', METHOD => 'table_att',
hbase(main):014:0* BLOOMFILTER => 'ROW'
0 row(s) in 0.1360 seconds
hbase(main):015:0> describe 'ExistenceTable'
DESCRIPTION ENABLED
{NAME => 'ExistenceTable', FAMILIES => [{NAME => 'I', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'O', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} false
1 row(s) in 0.3480 seconds
Unfortunately, it still says BLOOMFILTER => 'NONE'.
Any thoughts? Thanks, Ben
Re: Setting up Bloom filter on created/populated table
- On Fri, Feb 10, 2012 at 4:49 AM, bsnively bsnively@... wrote:
- - I am trying to test out a POC using HBase -- and am trying to add a bloom filter to a table that already exists.
- - The way I'm trying to add it seems to keep complaining in the hbase shell -- and I can't find any detailed steps of what I'm doing wrong.
- - I was trying to do alter 'eventTable', {BLOOMFILTER => 'ROW'} and different variations of that.
- - Any help on how to create a bloom filter on a table that already exists -- with a large amount of data in it?
- Paste the exception.
- You should be using 0.92.0 if you want to mess w/ blooms I'd say.
- St.Ack
I am using the version right before that (I think -- somewhat new to hbase) -- it's the cloudera distro: hbase-0.90.4-cdh3u2. Figured bloom filters were an option, since it lists BLOOMFILTER='NONE' as the attribute now. This is the command I'm trying to do (after disabling the table):
hbase(main):013:0> alter 'ExistenceTable', METHOD => 'table_att',
hbase(main):014:0* BLOOMFILTER => 'ROW'
0 row(s) in 0.1360 seconds
hbase(main):015:0> describe 'ExistenceTable'
DESCRIPTION ENABLED
{NAME => 'ExistenceTable', FAMILIES => [{NAME => 'I', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'O', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} false
1 row(s) in 0.3480 seconds
Unfortunately, it still says BLOOMFILTER => 'NONE'. Any thoughts? Thanks, Ben
Setting up Bloom filter on created/populated table
I am trying to test out a POC using HBase -- and am trying to add a bloom filter to a table that already exists. The way I'm trying to add it seems to keep complaining in the hbase shell -- and I can't find any detailed steps of what I'm doing wrong. I was trying to do alter 'eventTable', {BLOOMFILTER => 'ROW'} and different variations of that. Any help on how to create a bloom filter on a table that already exists -- with a large amount of data in it? Thanks, Ben -- View this message in context: http://old.nabble.com/Setting-up-Bloom-filter-on-created-populated-table-tp33300050p33300050.html Sent from the HBase User mailing list archive at Nabble.com.
Re: Setting up Bloom filter on created/populated table
On Fri, Feb 10, 2012 at 4:49 AM, bsnively bsniv...@gmail.com wrote: I am trying to test out a POC using HBase -- and am trying to add a bloom filter to a table that already exists. The way I'm trying to add it seems to keep complaining in the hbase shell -- and I can't find any detailed steps of what I'm doing wrong. I was trying to do alter 'eventTable', {BLOOMFILTER => 'ROW'} and different variations of that. Any help on how to create a bloom filter on a table that already exists -- with a large amount of data in it?
Paste the exception.
You should be using 0.92.0 if you want to mess w/ blooms I'd say.
St.Ack
Adding bloom filter using alter.
We have a column family, say CF. CF was not going to be accessed frequently (only through MR), so it was created without a Bloom filter and with in-memory set to false. However, we now have 35 million rows for this CF. It is small enough, and we now want to access it frequently. I would like to enable the Bloom filter and the in-memory flag. The question is: will the alter statement create Bloom files for the CF? -Sagar
Re: Adding bloom filter using alter.
On Thu, Nov 10, 2011 at 10:56 AM, Sagar Attributor sn...@attributor.com wrote: We have a column family, say CF. CF was not going to be accessed frequently (only through MR), so it was created without a Bloom filter and with in-memory set to false. However, we now have 35 million rows for this CF. It is small enough, and we now want to access it frequently. I would like to enable the Bloom filter and the in-memory flag. The question is: will the alter statement create Bloom files for the CF?
Yes. Enable the bloom filter attribute. Thereafter, any files written either from flush or compaction will have Blooms associated.
St.Ack
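A toy model of the behaviour described in the answer above, in Python. Nothing here is HBase API: `StoreFile`, `ColumnFamily`, `flush`, and `compact` are hypothetical names, and `compact` is assumed to rewrite every existing file into one new file, as a compaction covering all files would. The point it illustrates is that enabling the attribute does not touch files already on disk; only files written afterwards carry a Bloom filter.

```python
from dataclasses import dataclass


@dataclass
class StoreFile:
    name: str
    has_bloom: bool


class ColumnFamily:
    """Toy column family that tracks which store files carry a Bloom filter."""

    def __init__(self):
        self.bloom_enabled = False
        self.files = []
        self._seq = 0

    def _new_file(self) -> StoreFile:
        # A file gets a Bloom filter only if the attribute is on when it is written.
        self._seq += 1
        return StoreFile(f"hfile-{self._seq}", self.bloom_enabled)

    def flush(self):
        self.files.append(self._new_file())

    def compact(self):
        # Assumption: the compaction rewrites all existing files into one.
        self.files = [self._new_file()]


cf = ColumnFamily()
cf.flush()
cf.flush()                # two files written before the alter: no Blooms
cf.bloom_enabled = True   # stands in for the alter statement
cf.flush()                # new file: has a Bloom
before_compaction = [f.has_bloom for f in cf.files]
cf.compact()              # rewritten data: has a Bloom
after_compaction = [f.has_bloom for f in cf.files]
```

Under this model, older data only gains a Bloom filter once a compaction rewrites it.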
Re: Adding bloom filter using alter.
Awesome. Thanks, Stack. -Sagar
On Thu, Nov 10, 2011 at 11:11 AM, Stack st...@duboce.net wrote: Yes. Enable the bloom filter attribute. Thereafter, any files written either from flush or compaction will have Blooms associated. St.Ack