Re: Fastest way to find is a row exist?

Mohamed Ibrahim Sat, 05 Jan 2013 19:46:18 -0800

Sorry, I didn't notice your email about packing 500 operations before.

You might actually benefit from comparing a batch of Gets against individual
exists calls.
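
To set up that comparison, the keys first have to be cut into groups of 500. A minimal sketch, using only the Java standard library: GetBatcher and partition are hypothetical names, not part of the HBase client API, and the keys are plain integers standing in for row keys. Each chunk would then go to one batched get call, like table_entry.get(gets_entry_exist) in the code quoted below.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the HBase client API.
final class GetBatcher {
    // Splits keys into consecutive chunks of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keys.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(keys.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Integers stand in for row keys here.
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1200; i++) {
            keys.add(i);
        }
        List<List<Integer>> batches = partition(keys, 500);
        System.out.println(batches.size() + " batches, last one holds "
                + batches.get(batches.size() - 1).size() + " keys");
    }
}
```

Timing the same chunk both ways (one batched call vs one exists call per key) on the real table is what would actually settle the question.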


Best,
Mohamed


On Sat, Jan 5, 2013 at 8:29 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org
> wrote:

> Hum, very interesting!
>
> Now, what's the best option? An array of Gets, which will retrieve more
> information? Or multiple HTable.exists calls, one by one?
>
> The best would have been to have an array of Gets passed to
> exists... I will see how big a change it is to add that...
>
> JM
>
> 2013/1/4, Mohamed Ibrahim <m0b...@gmail.com>:
> > What about HTable.exists?
> >
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#exists(org.apache.hadoop.hbase.client.Get)
> >
> > I think that should work if the Get has only the row key.
> >
> > Mohamed
> >
> >
> > On Fri, Jan 4, 2013 at 3:17 PM, Adrien Mogenet
> > <adrien.moge...@gmail.com>wrote:
> >
> >> On every Get, the BloomFilter acts as a filter (!) on top of each HFile
> >> and makes it possible to check whether a key is absent from that HFile. So
> >> yes, you will benefit from these filters.
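
For readers new to the idea, here is a toy Bloom filter showing the one-sided answer described above: "false" means the key was definitely never added, "true" only means it may have been. This is a sketch for illustration only. ToyBloomFilter, its sizes and its hashing scheme are assumptions made up for this example, and HBase's own implementation differs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Toy Bloom filter for illustration only; HBase's implementation differs.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added.
    // true: the key may have been added (false positives are possible).
    boolean mightContain(byte[] key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        filter.add("row-1".getBytes(StandardCharsets.UTF_8));
        filter.add("row-2".getBytes(StandardCharsets.UTF_8));
        // Always true for a key that was added: there are no false negatives.
        System.out.println(filter.mightContain("row-1".getBytes(StandardCharsets.UTF_8)));
        // Usually false for an absent key, but a hash collision can make it true.
        System.out.println(filter.mightContain("row-999".getBytes(StandardCharsets.UTF_8)));
    }
}
```

So a negative answer lets a file be skipped, while a positive answer still requires looking the key up in that file.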
> >>
> >>
> >> On Fri, Jan 4, 2013 at 8:58 PM, Jean-Marc Spaggiari <
> >> jean-m...@spaggiari.org
> >> > wrote:
> >>
> >> > Is KeyOnlyFilter using the BloomFilters too?
> >> >
> >> > Here is, with more details, what I'm doing.
> >> >
> >> > A few questions.
> >> > - Can I create one single KeyOnlyFilter and give the same filter to
> >> > all the gets?
> >> > - Will bloom filters help in such a scenario? My key is small. Let's
> >> > say 128 bytes on average.
> >> >
> >> > The goal here is to check about 500 entries at a time to validate if
> >> > they already exist or not.
> >> >
> >> > In my MR, I'm starting when I have more than 100K lines to handle, and
> >> > each line can have up to 1K entries. So it can result in up to 100M
> >> > gets... The job initially took 500 minutes to complete. I have added a
> >> > few pretty good nodes and it's now taking less than 300 minutes. But I
> >> > would like to get under 100 minutes if I can...
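
For scale, the figures in this paragraph work out as follows: 100K lines times up to 1K entries is at most 100M gets, which is 200,000 batches of 500. Finishing in 300 minutes means averaging about 5,600 gets per second across the cluster, and the 100-minute target needs about 16,700. The sketch below only redoes that arithmetic. WorkloadEstimate and its method names are hypothetical, and the worst case of 1K entries on every line is an assumption.

```java
// Back-of-the-envelope figures taken from the message above.
final class WorkloadEstimate {
    // Number of batches needed to cover totalGets, rounded up.
    static long batches(long totalGets, long batchSize) {
        return (totalGets + batchSize - 1) / batchSize;
    }

    // Average gets per second needed to finish totalGets within the given minutes.
    static double getsPerSecond(long totalGets, long minutes) {
        return totalGets / (minutes * 60.0);
    }

    public static void main(String[] args) {
        // Worst case assumed: 100K lines, each with the full 1K entries.
        long totalGets = 100_000L * 1_000L;
        System.out.println("batches of 500: " + batches(totalGets, 500));
        System.out.printf("300 minutes: %.0f gets/s%n", getsPerSecond(totalGets, 300));
        System.out.printf("100 minutes: %.0f gets/s%n", getsPerSecond(totalGets, 100));
    }
}
```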
> >> >
> >> > Thanks,
> >> >
> >> > JM
> >> >
> >> >         // Build one key-only Get per entry.
> >> >         Vector<Get> gets_entry_exist = new Vector<Get>();
> >> >         for (Entry entry : entries.getEntries())
> >> >         {
> >> >                 Get entry_exist = new Get(entry.toKey());
> >> >                 entry_exist.setFilter(new KeyOnlyFilter());
> >> >                 gets_entry_exist.add(entry_exist);
> >> >         }
> >> >
> >> >         // One batched call for all the Gets.
> >> >         Result[] result_entry_exist = table_entry.get(gets_entry_exist);
> >> >
> >> >         // Results are matched to entries by position.
> >> >         int index = 0;
> >> >         for (Entry entry : entries.getEntries())
> >> >         {
> >> >                 boolean isEmpty = result_entry_exist[index++].isEmpty();
> >> >                 if (isEmpty)
> >> >                 {
> >> >                         // Process here
> >> >                 }
> >> >         }
> >> >
> >> >
> >> > 2013/1/4, Damien Hardy <dha...@viadeoteam.com>:
> >> > > Hello Jean-Marc,
> >> > >
> >> > > BloomFilters are just designed for that.
> >> > >
> >> > > But they can only tell you that a row doesn't exist, based on a hash of
> >> > > the key (not the opposite: 2 rowkeys could have the same hash result).
> >> > >
> >> > > If you want to be sure the rowkey exists, you have to search for it in
> >> > > the HFile (the whole mechanism is transparent with the get()).
> >> > >
> >> > > There is also a KeyOnlyFilter
> >> > >
> >> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html
> >> > >
> >> > > which prevents the whole columns of the existing key from being
> >> > > returned (which could be heavy).
> >> > >
> >> > > Cheers,
> >> > >
> >> > > --
> >> > > Damien
> >> > >
> >> > >
> >> > > 2013/1/4 Jean-Marc Spaggiari <jean-m...@spaggiari.org>
> >> > >
> >> > >> Hi,
> >> > >>
> >> > >> What's the fastest way to know if a row exists?
> >> > >>
> >> > >> Today I'm doing this:
> >> > >>
> >> > >> Get get_entry_exist = new Get(key).addColumn(CF_DATA, C_DATA);
> >> > >> Result entry_exist = table_entry.get(get_entry_exist);
> >> > >>
> >> > >> But should this be faster?
> >> > >> Get get_entry_exist = new Get(key);
> >> > >> Result entry_exist = table_entry.get(get_entry_exist);
> >> > >>
> >> > >> There is only one CF and one C on my table.
> >> > >>
> >> > >> Or is there an even faster way?
> >> > >>
> >> > >> Also, is there a way to make that even faster? I think BloomFilters
> >> > >> can help, right?
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> JM
> >> > >>
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Adrien Mogenet
> >> 06.59.16.64.22
> >> http://www.mogenet.me
> >>
> >
>
