Bloom filter based scanner/filter

2013-01-15 Thread David G. Boney
I am building a data cube on top of HBase. All access to the data is by 
map/reduce jobs. I want to build a scanner whose first matching criterion is 
based on the set intersection of Bloom filters, followed by the additional 
matching criteria specified in the current filter architecture. First, I run a 
map/reduce job on table A. For every row I match in table A, I add the row key 
to a Bloom filter. I then run a map/reduce job on table B, whose row keys 
are over the same domain as table A. I want to build a scanner that can use the 
built-in Bloom filters in HBase. When the scanner goes to get a block of data 
to which a row-key-based Bloom filter is attached, it does a set intersection 
with the table A Bloom filter to see if any of the keys from table A are in the 
block. If so, the block is read in and the scanner does additional matching 
on the rows according to the filter.

This is a simplification of my problem. I am trying to find out what the 
complexity of implementing such a feature would be in HBase.
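
The intersection test described above can be sketched outside HBase. This is a minimal illustration, not HBase code: the BloomFilter class, its SHA-256 based hashing, and the sizes are all assumptions made for the example. It also assumes both filters use the same bit-array size and hash functions, which may not hold for HBase's built-in blooms.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_intersect(self, other: "BloomFilter") -> bool:
        # Only meaningful when both filters share size and hash functions.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters are not comparable")
        # A shared key sets the same bits in both filters, so an empty
        # bitwise AND proves the key sets are disjoint. A non-empty AND
        # only means "maybe".
        return (self.bits & other.bits) != 0


# Filter built from the rows matched in table A (hypothetical keys).
table_a = BloomFilter()
for key in (b"row1", b"row2"):
    table_a.add(key)

# Filter attached to one block of table B (hypothetical keys).
block = BloomFilter()
for key in (b"row2", b"row3"):
    block.add(key)

print(table_a.may_intersect(block))          # True: b"row2" is in both
print(table_a.may_intersect(BloomFilter()))  # False: an empty filter has no bits set
```

Note that a non-empty AND is a weak signal: as either filter fills up, unrelated keys share bits more and more often, so the test passes for most blocks. Probing the block's filter with each table A key is stricter, but needs the keys themselves rather than only their filter.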
-
Sincerely,
David G. Boney
Chair, Austin ACM SIGKDD
ch...@austin-acm-sigkdd.org
http://www.meetup.com/Austin-ACM-SIGKDD/
http://tech.groups.yahoo.com/group/austinsigkdd/



Re: Bloom Filter

2012-07-27 Thread Alex Baranau
Very good explanation (and food for thought) about using bloom filters in
HBase in the answers here:
http://www.quora.com/How-are-bloom-filters-used-in-HBase.

Should we link to it from the Apache HBase book (ref guide)?

Alex Baranau
--
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr



Re: Bloom Filter

2012-07-27 Thread Stack
On Fri, Jul 27, 2012 at 4:25 PM, Alex Baranau alex.barano...@gmail.com wrote:
 Should we put the link to it from Apache HBase book (ref guide)?


I added the link.  It will show up the next time we push the site.
St.Ack


Re: Bloom Filter

2012-07-27 Thread Mohit Anchlia
On Fri, Jul 27, 2012 at 7:25 AM, Alex Baranau alex.barano...@gmail.com wrote:

 Very good explanation (and food for thinking) about using bloom filters in
 HBase in answers here:
 http://www.quora.com/How-are-bloom-filters-used-in-HBase.

 Should we put the link to it from Apache HBase book (ref guide)?


Thanks, this is helpful.





Re: Bloom Filter

2012-07-26 Thread Mohit Anchlia
On Thu, Jul 26, 2012 at 1:52 PM, Minh Duc Nguyen mdngu...@gmail.com wrote:

 Mohit,

 According to HBase: The Definitive Guide,

 The row+column Bloom filter is useful when you cannot batch updates for a
 specific row, and end up with store files which all contain parts of the
 row. The more specific row+column filter can then identify which of the
 files contain the data you are requesting. Obviously, if you always load
 the entire row, this filter is once again hardly useful, as the region
 server will need to load the matching block out of each file anyway.  Since
 the row+column filter will require more storage, you need to do the math to
 determine whether it is worth the extra resources.


Thanks! I have time-series data, so I am thinking I should enable Bloom
filters for rows only.



~ Minh

 On Thu, Jul 26, 2012 at 4:30 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Is it advisable to enable bloom filters on the column family?
 
  Also, why is it called global kill switch?
 
  Bloom Filter Configuration
  2.9.1. io.hfile.bloom.enabled global kill switch
 
  io.hfile.bloom.enabled in Configuration serves as the kill switch in case
  something goes wrong. Default = true.
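
The difference between a row Bloom filter and a row+column Bloom filter, as described in the quoted passage, can be sketched as follows. This is a toy model under stated assumptions: the BloomFilter class, the "/" separator used to build the row+column key, and the sample keys are all hypothetical and do not reflect HBase's actual on-disk format.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter for illustration; not HBase's implementation."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# One hypothetical store file holding two cells of the same row.
cells = [(b"sensor42", b"temp"), (b"sensor42", b"humidity")]

row_bloom = BloomFilter()     # keyed on the row only
rowcol_bloom = BloomFilter()  # keyed on row + column qualifier
for row, qualifier in cells:
    row_bloom.add(row)
    rowcol_bloom.add(row + b"/" + qualifier)

# The file holds the row, so the row Bloom cannot rule the file out.
print(row_bloom.might_contain(b"sensor42"))
# The file does not hold the "pressure" column; the row+column Bloom
# will almost certainly report a miss, so the file can be skipped.
print(rowcol_bloom.might_contain(b"sensor42/pressure"))
```

The row+column filter holds one entry per cell rather than one per row, which is the extra storage the passage refers to.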
 



Re: Scans and Bloom Filter

2012-02-16 Thread Nicolas Spiegelberg
Bryan,

Currently, ROW and ROWCOL Bloom Filters are only checked for explicit,
single-row 'Get' scans.  ROWCOL BFs are only checked when you're querying
for explicit column qualifiers (vs getting the entire row).  This is
because multi-row scans and full-row scans are implicit queries.  To
clarify: 

With a multi-row scan, the next row after 0x0001 is NOT 0x0002.  HBase only
knows that the next row is > 0x0001.  The next row could be 0x00010 or
0x0003.  However, when you call HTable.get(row=0x0001), HBase knows that
you explicitly want that row and don't want 0x00010.

Nicolas
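
A small sketch of the "next row" point above, assuming row keys are compared lexicographically byte by byte (as HBase does) and reading 0x00010 as the three bytes 00 01 00, which is an interpretation made for this example. The next_row helper is hypothetical.

```python
import bisect

# Row keys in sorted order: lexicographic, byte by byte.
rows = sorted([b"\x00\x03", b"\x00\x01", b"\x00\x01\x00"])


def next_row(sorted_rows, current):
    """Return the smallest stored key strictly greater than current, or None."""
    i = bisect.bisect_right(sorted_rows, current)
    return sorted_rows[i] if i < len(sorted_rows) else None


# A scan positioned at 0x0001 moves to whatever key sorts next.
print(next_row(rows, b"\x00\x01"))  # b'\x00\x01\x00', not b'\x00\x02'
```

A Bloom filter can only answer "might this exact key be present?". It cannot answer "what is the smallest key greater than this one?", which is the question a scan asks at every step.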




Re: Scans and Bloom Filter

2012-02-16 Thread Doug Meil

Good stuff Nicolas, I'll add this to the book.











Scans and Bloom Filter

2012-02-15 Thread Bryan Beaudreault
Hello,

We are looking at Bloom Filters and wondering if they are helpful when
doing a sequential read (multi-row scan) or only when doing a Get for a
single row.  It logically makes sense that they would only affect (or have a
greater effect on) getting a single row, since a Bloom filter is a way to
determine whether you have to read a whole store file when fetching a key.
But we are told that Scan and Get are essentially the same code on the
backend, so I imagine both will check the Blooms if they exist.

Also, would a ROWCOL bloom be more effective if you are often doing
multi-row scans but always specifying only a subset of columns in
those rows?

Thanks,

Bryan
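
To make the single-row case in the question concrete, here is a rough sketch of a Get consulting a per-file Bloom filter before reading each store file. The BloomFilter and StoreFile classes are hypothetical stand-ins written for this example and do not mirror HBase's internal classes.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter for illustration; not HBase's implementation."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class StoreFile:
    """Hypothetical stand-in for a store file with a row Bloom attached."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for row in self.rows:
            self.bloom.add(row)
        self.reads = 0  # counts how often the file body is actually read

    def get(self, row):
        if not self.bloom.might_contain(row):
            return None  # definite miss: skip reading the file
        self.reads += 1
        return self.rows.get(row)


files = [StoreFile({b"a": 1}), StoreFile({b"b": 2}), StoreFile({b"c": 3})]

hits = []
for store_file in files:
    value = store_file.get(b"b")
    if value is not None:
        hits.append(value)

print(hits)                      # [2]
print([f.reads for f in files])  # files without b"b" are skipped unless the Bloom gives a false positive
```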


Re: Setting up Bloom filter on created/populated table

2012-02-13 Thread Ted Yu
The syntax you used was for table properties.

The following worked for me:
alter 'table', NAME => 'colfam', BLOOMFILTER => 'ROWCOL'

where table is name of table and colfam is name of column family.

Cheers




Re: Setting up Bloom filter on created/populated table

2012-02-12 Thread Ben Snively
 Paste the exception.

 You should be using 0.92.0 if you want to mess w/ blooms I'd say.

 St.Ack

I am using the version right before that (I think -- somewhat new to
hbase) -- it's the cloudera distro: hbase-0.90.4-cdh3u2.

Figured bloom filters were an option, since it lists
BLOOMFILTER='NONE' as the attribute now.  This is the command I'm
trying to do (after disabling the table):

hbase(main):013:0> alter 'ExistenceTable', METHOD => 'table_att',
hbase(main):014:0*  BLOOMFILTER => 'ROW'
0 row(s) in 0.1360 seconds

hbase(main):015:0> describe 'ExistenceTable'
DESCRIPTION                                                          ENABLED
 {NAME => 'ExistenceTable', FAMILIES => [                            false
  {NAME => 'I', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
   COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
  {NAME => 'O', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
   COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.3480 seconds


Unfortunately, it still says BLOOMFILTER => 'NONE'.


Any thoughts?

Thanks,
Ben


Setting up Bloom filter on created/populated table

2012-02-10 Thread bsnively

I am trying to test out a POC using HBase -- and am trying to add a bloom
filter to a table that already exists.

The way I'm trying to add it seems to keep complaining in the hbase shell --
and I can't find any detailed steps showing what I'm doing wrong.

I was trying to do alter 'eventTable', {BLOOMFILTER => 'ROW'} and different
variations of that.

Any help on how to create a bloom filter on a table that already exists --
with a large amount of data in it?

Thanks,
Ben



Re: Setting up Bloom filter on created/populated table

2012-02-10 Thread Stack

Paste the exception.

You should be using 0.92.0 if you want to mess w/ blooms I'd say.

St.Ack


Adding bloom filter using alter.

2011-11-10 Thread Sagar Attributor
We have a column family, say CF. 
CF was not going to be accessed frequently (only through MR), 
so it has no bloom filter and in-memory is set to false. 

However, we now have 35 million rows for this CF, which is small enough, and 
we want to access it frequently.  I would like to enable the bloom filter and 
the in-memory flag. The question is: will the alter statement create bloom 
files for the CF?

-Sagar

Re: Adding bloom filter using alter.

2011-11-10 Thread Stack

Yes.  Enable the bloom filter attribute.  Thereafter, any files
written either from flush or compaction will have Blooms associated.

St.Ack


Re: Adding bloom filter using alter.

2011-11-10 Thread sagar naik
Awesome Thanks

Thanks Stack



-Sagar
