Sorry, I missed your earlier email about packing 500 operations together. In that case you might actually be better off with a batch of Gets than with individual exists calls; it is worth comparing the two.

Best,
Mohamed

On Sat, Jan 5, 2013 at 8:29 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Hum, very interesting!
>
> Now, what's the best option? An array of Gets, which will retrieve more
> information? Or multiple HTable.exists calls, one by one?
>
> The best would have been to have an array of Gets passed to
> exists... I will see how big it is to add that...
>
> JM
>
> 2013/1/4, Mohamed Ibrahim <m0b...@gmail.com>:
> > What about HTable.exists ??
> >
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
> >
> > I think that should work if the Get has only the row key.
> >
> > Mohamed
> >
> >
> > On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
> > <adrien.moge...@gmail.com> wrote:
> >
> >> On every Get, the BloomFilter acts as a filter (!) on top of each HFile
> >> and makes it possible to check whether a key is absent from the HFile.
> >> So yes, you will benefit from these filters.
> >>
> >>
> >> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari
> >> <jean-m...@spaggiari.org> wrote:
> >>
> >> > Is KeyOnlyFilter using the BloomFilters too?
> >> >
> >> > Here is, with more details, what I'm doing.
> >> >
> >> > A few questions.
> >> > - Can I create one single KeyOnlyFilter and give the same filter to
> >> > all the gets?
> >> > - Will bloom filters help in such a scenario? My key is small. Let's
> >> > say 128 bytes on average.
> >> >
> >> > The goal here is to check about 500 entries at a time to validate
> >> > whether they already exist or not.
> >> >
> >> > In my MR, I'm starting when I have more than 100K lines to handle, and
> >> > each line can have up to 1K entries. So it can result in up to 100M
> >> > gets... The job initially took 500 minutes to complete. I have added a
> >> > few pretty good nodes and it's now taking less than 300 minutes. But I
> >> > would like to get under 100 minutes if I can...
> >> >
> >> > Thanks,
> >> >
> >> > JM
> >> >
> >> > // Build one key-only Get per entry
> >> > Vector<Get> gets_entry_exist = new Vector<Get>();
> >> > for (Entry entry : entries.getEntries())
> >> > {
> >> >   Get entry_exist = new Get(entry.toKey());
> >> >   entry_exist.setFilter(new KeyOnlyFilter());
> >> >   gets_entry_exist.add(entry_exist);
> >> > }
> >> >
> >> > // Send all the Gets as a single batch
> >> > Result[] result_entry_exist =
> >> >     table_entry.get(gets_entry_exist);
> >> >
> >> > // Results come back in the same order as the Gets
> >> > int index = 0;
> >> > for (Entry entry : entries.getEntries())
> >> > {
> >> >   boolean isEmpty =
> >> >       result_entry_exist[index++].isEmpty();
> >> >   if (isEmpty)
> >> >   {
> >> >     // Process here
> >> >   }
> >> > }
> >> >
> >> >
> >> > 2013/1/4, Damien Hardy <dha...@viadeoteam.com>:
> >> > > Hello Jean-Marc,
> >> > >
> >> > > BloomFilters are designed for just that.
> >> > >
> >> > > But they can only tell you that a row doesn't exist, based on a hash
> >> > > of the key (not the opposite: 2 rowkeys could have the same hash
> >> > > result).
> >> > >
> >> > > If you want to be sure the rowkey exists, you have to search for it
> >> > > in the HFile (the whole mechanism is transparent with the get()).
> >> > >
> >> > > There is also a KeyOnlyFilter
> >> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
> >> > > which avoids returning all the columns of the existing key
> >> > > (which could be heavy).
> >> > >
> >> > > Cheers,
> >> > >
> >> > > --
> >> > > Damien
> >> > >
> >> > >
> >> > > 2013/1/4 Jean-Marc Spaggiari <jean-m...@spaggiari.org>
> >> > >
> >> > >> Hi,
> >> > >>
> >> > >> What's the fastest way to know if a row exists?
> >> > >>
> >> > >> Today I'm doing this:
> >> > >>
> >> > >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
> >> > >> Result entry_exist = table_entry.get(get_entry_exist);
> >> > >>
> >> > >> But should this be faster?
> >> > >> Get get_entry_exist = new Get(key);
> >> > >> Result entry_exist = table_entry.get(get_entry_exist);
> >> > >>
> >> > >> There is only one CF and one C on my table.
> >> > >>
> >> > >> Or is there an even faster way?
> >> > >>
> >> > >> Also, is there a way to make that even faster? I think BloomFilters
> >> > >> can help, right?
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> JM
> >> > >>
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Adrien Mogenet
> >> 06.59.16.64.22
> >> http://www.mogenet.me
> >>
> >
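
To make the Bloom filter behaviour discussed in this thread concrete, here is a minimal Java sketch. It is a toy filter, not HBase's implementation: the class name ToyBloomFilter, the FNV-1a-style hashing and the sizes used are all assumptions chosen for illustration. It is meant to show the two properties mentioned in the thread: a negative answer is definitive, while a positive answer may only be a hash collision and still requires a real lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Toy Bloom filter for illustration only (hypothetical class, NOT HBase code).
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of probes per key

    ToyBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a-style hash, perturbed by a per-probe seed (an arbitrary choice).
    private int index(byte[] key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, size);
    }

    // Record a row key by setting one bit per probe.
    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false: the key was definitely never added.
    // true:  the key MAY have been added (or its bits collide with other keys).
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In terms of the row-existence check, a false from mightContain corresponds to skipping a file entirely, whereas a true only means the file has to be read to confirm the key. A deliberately undersized filter, such as new ToyBloomFilter(1, 1), answers true for every key once anything has been added, which is the collision case Damien describes.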