Thanks for explaining, Gary. I got it.

On Tue, Aug 18, 2009 at 12:02 AM, Gary Helmling <ghelml...@gmail.com> wrote:

> Hi Bharath,
>
> If you're using the default key generator
> (org.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator),
> it actually appends the base table row key for you.  So even though
> the column value may be the same for multiple rows, the secondary
> index table will still have 1 row for each row with the value in the
> original table.  Here is the relevant method from SimpleIndexKeyGenerator:
>
>  public byte[] createIndexKey(byte[] rowKey, Map<byte[], byte[]> columns) {
>    return Bytes.add(columns.get(column), rowKey);
>  }
>
> So, say you have a table "mytable", with the columns:
>    info:keycol       (say this is the one you want to index)
>    info:col2
>    info:col3
>
> If you define your table with the index specification -- new
> IndexSpecification("keycol", Bytes.toBytes("info:keycol")) -- then
> HBase will create the secondary index table named "mytable-by_keycol".
>
> Then, say you add the following rows to "mytable":
>
> "row1":  info:keycol="one", info:col2="abc", info:col3="def"
> "row2":  info:keycol="one", info:col2="ghi", info:col3="jkl"
>
> At this point, your index table ("mytable-by_keycol") will have the
> following rows:
>
> "onerow1": info:keycol="one", __INDEX__:ROW="row1"
> "onerow2": info:keycol="one", __INDEX__:ROW="row2"
>
> So you wind up with 2 rows in the index table (with unique row keys)
> pointing back at the original table rows, even though we've only
> stored a single distinct value for info:keycol.
>
> To access the rows by the secondary index, create a scanner using
> IndexedTable.getIndexedScanner(...).  I don't think there's support
> for using the indexes when performing a random read with
> HTable.getRow()/HTable.get().  (But maybe I'm wrong?)
>
> As Travis mentions, you could always use an alternate approach to
> implement your own indexing (use the index value as the row key for
> your own index table and store the original table row keys as
> individual columns).  I'm using the same approach for one access
> pattern and so far it seems to work very well.
>
> But as far as I know, the built-in secondary indexing assumes 1
> secondary index table row -> 1 original table row.
>
> Sorry if this got a bit long-winded.  It gets a little complicated to
> explain in text...
>
> --gh
>
>
> On Mon, Aug 17, 2009 at 1:46 PM, bharath
> vissapragada<bharathvissapragada1...@gmail.com> wrote:
> > Thanks for your explanation, Gary.
> >
> > Consider my case, where values can repeat. Are you saying that I should
> > edit the IndexKeyGenerator so that, instead of storing (column -> rowkey),
> > it stores (column -> rowkey1, rowkey2) as different timestamps? If so, is
> > that a good way?
> >
> > On Mon, Aug 17, 2009 at 10:53 PM, Gary Helmling <ghelml...@gmail.com> wrote:
> >
> >> When defining the IndexSpecification for your table, you can pass your
> >> own implementation of
> >> org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator.
> >>
> >> This allows you to control how the row keys are generated for the
> >> secondary index table.  For example, you could append the original
> >> table's row key to the indexed value to ensure uniqueness in
> >> referencing the original rows.
> >>
> >> When you create an indexed scanner, the secondary index code opens and
> >> wraps a scanner on the secondary index table, based on the start row
> >> you specify (the indexed value you're looking up).  It applies any
> >> filter passed to rows on the secondary index table, so make sure
> >> anything you want to filter on is listed in the "indexed columns" in
> >> your IndexSpecification.
> >>
> >> For any rows returned by the wrapped scanner, the client code then
> >> does a get for the original table record (the original row key is
> >> stored in the "__INDEX__" column family I think).
> >>
> >> So in total, when using secondary indexes, you wind up with 1 scan + N
> >> gets to look at N rows.
> >>
> >> At least, this was my understanding of how things worked as of 0.19.
> >> I'm actually moving indexing into my app layer as I update to 0.20.
> >>
> >> Hope this helps.
> >>
> >> --gh
> >>
> >>
> >> On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray<jl...@streamy.com> wrote:
> >> > I'm actually unsure about that.  Look at the code or experiment.
> >> >
> >> > Seems to me that there would be a uniqueness requirement; otherwise,
> >> > what do you expect the behavior to be?  A get can only return a single
> >> > row, so multiple index hits don't really make sense.
> >> >
> >> > Clint?  You out there? :)
> >> >
> >> > JG
> >> >
> >> > bharath vissapragada wrote:
> >> >>
> >> >> I got it. I think this is definitely useful in my app, because I am
> >> >> performing a full table scan every time to select the row keys based
> >> >> on some column values.
> >> >>
> >> >> But we can have more than one row key for the same column value. Can
> >> >> you please tell me how they are stored?
> >> >>
> >> >> Thanks in advance
> >> >>
> >> >> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray <jl...@streamy.com> wrote:
> >> >>
> >> >>> It's not an actual hash or btree index, but rather secondary indexes
> >> >>> in HBase are implemented by creating an additional HBase table.
> >> >>>
> >> >>> If I have a table "users" (row key is userid) with family "data" and
> >> >>> column "email", and I want to index the value in that column...
> >> >>>
> >> >>> I can create a table "users_email" where the row key is the email
> >> >>> address (value from the column in "users" table) and a single column
> >> >>> that contains the userid.
> >> >>>
> >> >>> Doing an "index lookup" would mean doing a get on "users_email" and
> >> >>> then using that userid to do a lookup on the "users" table.
> >> >>>
> >> >>> IndexedTable does this transparently, but still does require two
> >> >>> queries.  So it's slower than a single query, but certainly faster
> >> >>> than a full table scan.
> >> >>>
> >> >>> If you need hash-level performance on the index lookup, there are
> >> >>> lots of solutions outside of HBase that would work... In-memory Java
> >> >>> HashMap, Tokyo Cabinet on-disk HashMaps, BerkeleyDB, etc... If you
> >> >>> need full-text indexing, you can use Lucene or the like.
> >> >>>
> >> >>> Make sense?
> >> >>>
> >> >>> JG
> >> >>>
> >> >>>
> >> >>> bharath vissapragada wrote:
> >> >>>
> >> >>>> But I have read somewhere that secondary indexes are somewhat slow
> >> >>>> compared to normal HBase tables. Does that affect performance?
> >> >>>>
> >> >>>> Also, do you know the type of index created on the column (I mean
> >> >>>> hash, B-tree, etc.)?
> >> >>>>
> >> >>>> On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov <e2...@yahoo.com> wrote:
> >> >>>>
> >> >>>>> Hi!
> >> >>>>>
> >> >>>>> As far as I understand, you are talking about the secondary indexes.
> >> >>>>> Yes, they can be used to quickly get the row key by a value in the
> >> >>>>> indexed column.
> >> >>>>>
> >> >>>>> --Kirill
> >> >>>>>
> >> >>>>>
> >> >>>>> bharath vissapragada wrote:
> >> >>>>>
> >> >>>>>> Hi all,
> >> >>>>>>
> >> >>>>>> I have gone through the IndexedTableAdmin classes in the HBase
> >> >>>>>> 0.19.3 API. I have seen some methods used to create an indexed
> >> >>>>>> table (on some column). I have some doubts regarding them:
> >> >>>>>>
> >> >>>>>> 1) Are these somewhat similar to hash indexes (in an RDBMS), where
> >> >>>>>> I can easily look up a column value and find its corresponding
> >> >>>>>> row key(s)?
> >> >>>>>> 2) Is there any performance gain when I use IndexedTable to search
> >> >>>>>> for a particular column value, instead of scanning an entire
> >> >>>>>> normal HTable?
> >> >>>>>>
> >> >>>>>> Kindly clarify my doubts
> >> >>>>>>
> >> >>>>>> Thanks in advance
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>
> >> >
> >>
> >
>
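
To make the quoted explanation concrete, here is a minimal sketch of how
keys of the form "indexed value + base row key" behave. It uses sorted
in-memory maps as stand-ins for the base and index tables, not the HBase
client API. The class IndexKeySketch, its put and lookup methods, and the
use of Strings in place of byte arrays are all assumptions made for
illustration, and updates and deletes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model; none of these names come from HBase.
class IndexKeySketch {
    /** One index-table row: a copy of the indexed value plus a pointer to the base row. */
    static final class IndexRow {
        final String value;   // stands in for info:keycol
        final String baseRow; // stands in for __INDEX__:ROW
        IndexRow(String value, String baseRow) {
            this.value = value;
            this.baseRow = baseRow;
        }
    }

    // Sorted maps stand in for tables, which keep rows ordered by row key.
    final TreeMap<String, String> baseTable = new TreeMap<>();    // row key -> info:keycol
    final TreeMap<String, IndexRow> indexTable = new TreeMap<>(); // index key -> index row

    // Same shape as the quoted createIndexKey: column value first, base row key appended.
    static String createIndexKey(String columnValue, String rowKey) {
        return columnValue + rowKey;
    }

    void put(String rowKey, String keycol) {
        baseTable.put(rowKey, keycol);
        indexTable.put(createIndexKey(keycol, rowKey), new IndexRow(keycol, rowKey));
    }

    // One scan over the index starting at the looked-up value, then one get per hit.
    List<String> lookup(String value) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, IndexRow> e : indexTable.tailMap(value, true).entrySet()) {
            if (!e.getKey().startsWith(value)) {
                break; // scanned past every key that begins with this value
            }
            IndexRow hit = e.getValue();
            if (!hit.value.equals(value)) {
                continue; // a longer indexed value that merely shares the prefix
            }
            if (baseTable.containsKey(hit.baseRow)) {
                rows.add(hit.baseRow); // the "get" against the base table
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        IndexKeySketch t = new IndexKeySketch();
        t.put("row1", "one");
        t.put("row2", "one");
        t.put("row3", "two");
        System.out.println(t.indexTable.keySet());
        System.out.println(t.lookup("one"));
    }
}
```

With "row1" and "row2" both holding "one", the index ends up with the keys
"onerow1" and "onerow2", matching the example in Gary's message, and a
lookup for "one" returns both base row keys.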

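The alternate approach mentioned in the thread (index value as the row key
of your own index table, original row keys stored as individual columns)
might look roughly like this. It is an in-memory sketch under assumed names
(ManualIndexSketch, addToIndex, removeFromIndex, rowKeysFor), not HBase
code; in a real application each method would become a put, delete, or get
against the index table.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical application-level index; none of these names come from HBase.
class ManualIndexSketch {
    // Index row key = indexed value; each "column" in that row is one base-table row key.
    final TreeMap<String, TreeSet<String>> indexTable = new TreeMap<>();

    // Call when a base row is written with this value in the indexed column.
    void addToIndex(String value, String baseRowKey) {
        indexTable.computeIfAbsent(value, k -> new TreeSet<>()).add(baseRowKey);
    }

    // Call when a base row is deleted or its indexed value changes.
    void removeFromIndex(String value, String baseRowKey) {
        TreeSet<String> columns = indexTable.get(value);
        if (columns == null) {
            return;
        }
        columns.remove(baseRowKey);
        if (columns.isEmpty()) {
            indexTable.remove(value); // drop the index row once it has no columns
        }
    }

    // A single get on the index row returns every base row key for the value.
    Set<String> rowKeysFor(String value) {
        TreeSet<String> columns = indexTable.get(value);
        if (columns == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(columns);
    }

    public static void main(String[] args) {
        ManualIndexSketch idx = new ManualIndexSketch();
        idx.addToIndex("one", "row1");
        idx.addToIndex("one", "row2");
        System.out.println(idx.rowKeysFor("one"));
        idx.removeFromIndex("one", "row1");
        System.out.println(idx.rowKeysFor("one"));
    }
}
```

The trade-off against the built-in layout is that one index row can point
at many base rows, so a lookup is one get on the index plus one get per
matching base row, and the application has to keep the index in step with
the base table itself.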