Thanks for the responses guys - sounds like multiple relationship columns are
the way to go.
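
Before I go ahead, here is a minimal sketch of what I am planning, just to
check I have understood Jonathan's design. It is untested and written against
the 0.20-style client API his pseudo code uses (on my 0.19.3 install the
writes would go through BatchUpdate instead); the addContains helper and the
empty cell value are my own inventions, not anything from his mail:

=== (untested sketch)

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LocationRelationships {

  // One family per relationship direction, as Jonathan suggested.
  static final byte[] CONTAINS     = Bytes.toBytes("CONTAINS");
  static final byte[] CONTAINED_IN = Bytes.toBytes("CONTAINED_IN");

  // Record "parent CONTAINS child" on both rows, so either side of the
  // relationship can be read back with a single Get, at the cost of a
  // dual write.
  static void addContains(HTable locations, byte[] parentId, byte[] childId)
      throws IOException {
    Put parentPut = new Put(parentId);
    // The qualifier is the related row's id; the value is unused here.
    parentPut.add(CONTAINS, childId, Bytes.toBytes(""));
    locations.put(parentPut);

    Put childPut = new Put(childId);
    childPut.add(CONTAINED_IN, parentId, Bytes.toBytes(""));
    locations.put(childPut);
  }
}

===

The PARENTS/PARENT_OF and SURROUNDS/SURROUNDED_BY pairs would get the same
treatment, and since our inverse relationships are not always set I can simply
skip the second Put when it does not apply.
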
However, I am still curious about the earlier comments about secondary indices
not being available - does this mean I should avoid using IndexedTable and
IndexConfiguration completely? I am using 0.19.3.

Thanks,
Ishaaq

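P.S. To be sure I follow the full delete path, here is how I read Jonathan's
"do that for all 6 relationships" advice - again untested, against the same
0.20-style API, and the FAMILY_PAIRS table and deleteLocation helper are names
I have made up:

=== (untested sketch)

import java.io.IOException;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LocationDeletes {

  // Each relationship family paired with its inverse.
  static final byte[][][] FAMILY_PAIRS = {
    { Bytes.toBytes("CONTAINS"),      Bytes.toBytes("CONTAINED_IN") },
    { Bytes.toBytes("CONTAINED_IN"),  Bytes.toBytes("CONTAINS") },
    { Bytes.toBytes("PARENTS"),       Bytes.toBytes("PARENT_OF") },
    { Bytes.toBytes("PARENT_OF"),     Bytes.toBytes("PARENTS") },
    { Bytes.toBytes("SURROUNDS"),     Bytes.toBytes("SURROUNDED_BY") },
    { Bytes.toBytes("SURROUNDED_BY"), Bytes.toBytes("SURROUNDS") },
  };

  // Delete a location: first remove the back references held by every
  // related row (the 1+n writes from my original mail), then the row.
  static void deleteLocation(HTable locations, byte[] id) throws IOException {
    Result row = locations.get(new Get(id));
    for (byte[][] pair : FAMILY_PAIRS) {
      NavigableMap<byte[], byte[]> related = row.getFamilyMap(pair[0]);
      if (related == null) {
        continue; // no relationships of this type on this row
      }
      for (byte[] otherId : related.keySet()) {
        Delete d = new Delete(otherId);
        d.deleteColumns(pair[1], id); // drop the back reference
        locations.delete(d);
      }
    }
    locations.delete(new Delete(id)); // finally remove the row itself
  }
}

===

If I have read his mail right, the Get fetches the whole fat row once, so the
n back-reference deletes are the only extra round trips.
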
Jonathan Gray-2 wrote:
> 
> Ishaaq,
> 
> I think Michael has the right idea. You can do this in a single table
> with two families per relationship.
> 
> The row key is the location uuid, and you could have 6 families:
> 
> CONTAINS, CONTAINED_IN, PARENTS, PARENT_OF, SURROUNDS, SURROUNDED_BY
> 
> You would duplicate all your data, but random writes are quite
> fast/cheap in HBase, so updates to this big fat row would be very fast;
> this is a big advantage with HBase.
> 
> So to delete France, you would grab all the families from its row and
> iterate down them, deleting the other side of each relationship.
> 
> === (unverified pseudo code)
> 
> import java.util.NavigableMap;
> import org.apache.hadoop.hbase.client.*; // HTable, Get, Result, Delete
> import org.apache.hadoop.hbase.util.Bytes;
> 
> HTable ht = new HTable("locations");
> 
> // Fetch France's row and read off every row id that contains it.
> Result result = ht.get(new Get(france_id));
> NavigableMap<byte[], byte[]> containedIn =
>     result.getFamilyMap(Bytes.toBytes("CONTAINED_IN"));
> 
> // Remove the CONTAINS:france_id cell from each containing row.
> for (byte[] containedIn_id : containedIn.keySet()) {
>   Delete delete = new Delete(containedIn_id);
>   delete.deleteColumns(Bytes.toBytes("CONTAINS"), france_id);
>   ht.delete(delete);
> }
> 
> ===
> 
> You'd have to do that for all 6 relationships, so it doubles your delete
> work as well.
> 
> How you design this really depends on what your requirements are. Many
> times I'll start with a design like this, which basically allows any kind
> of operation relatively quickly at the expense of doubling the data.
> 
> Most other designs are going to require some type of scanning.
> 
> JG
> 
> 
> Michael Robellard wrote:
>> Assuming you always know the location you want to start from:
>>
>> Can't you have a table with a column family called contains: which holds
>> all the places that place contains,
>>
>> another column family for the contained-in relationship,
>>
>> and a third column family for the surrounding relationship?
>>
>> The key for each column would be the row key of the related location,
>> and if you had information that you used all the time you could store
>> it in the column's value so you don't have to do a second table lookup
>> every time.
>>
>> Ishaaq Chandy wrote:
>>> No, it doesn't sound 'raw', 'painful' or 'error prone' to me - I am well
>>> aware of the reasons to use HBase over a traditional RDBMS - so I am not
>>> complaining about this.
>>>
>>> No, I was asking the question because I was not sure what the best
>>> approach would be.
>>>
>>> By the way, I did not convey the whole story - there is actually a third
>>> type of relationship as well - SURROUNDING - i.e. adjacent geographical
>>> locations SURROUND each other (again, for business reasons, this
>>> relationship is not necessarily always symmetric - though it usually is).
>>>
>>> So, when you say HBase doesn't provide declarative secondary indices you
>>> lost me - what are these? How are they different from the ones available
>>> via IndexedTable and IndexSpecification?
>>>
>>> Hmm, I was hoping that by using sparse values in a column family labelled
>>> by the location ids I would just have to search for rows which had a
>>> non-empty value for the CONTAIN:France column to retrieve the values for
>>> that example query I mentioned. I understand that would make the CONTAIN
>>> column family (and the PARENT and SURROUNDING families too) quite wide,
>>> but I remember reading somewhere that that is quite acceptable for HBase.
>>>
>>> Further, I was hoping, since the column labels themselves contain the
>>> data I am searching for, that there would be an efficient way to do this
>>> (I don't know why or how - I was just hoping).
>>>
>>> Anyway, if it means that the only way to do this efficiently in HBase is
>>> to use four tables - one for the locations and one for each of the three
>>> types of relationship - then so be it, that is what I'll have to do. I
>>> was just hoping for a simpler alternative with my idea of column families
>>> labelled by the location ids.
>>>
>>> Ishaaq
>>>
>>>
>>> Ryan Rawson wrote:
>>>
>>>> Hey,
>>>>
>>>> HBase doesn't provide declarative secondary indexes. Your app code
>>>> needs to maintain them, writing into 2 tables with dual writes. You
>>>> don't have to duplicate data - you can just use the secondary index as
>>>> a pointer into the main table, at the cost of having to chase down
>>>> potentially thousands of extra RPCs. There are no HBase transactions
>>>> when you are modifying multiple tables, but that isn't as big a
>>>> problem as it seems.
>>>>
>>>> If all this sounds very 'raw' and 'painful' and 'error prone', let me
>>>> remind you what HBase is for, and perhaps you can make a better
>>>> choice.
>>>>
>>>> HBase is for when you hit the limits of what you can do with MySQL.
>>>> When you work to scale MySQL you end up removing the following
>>>> features:
>>>> - no transactions
>>>> - no secondary indexes (slow on MySQL/InnoDB)
>>>> - separate multiple-table indexes on different databases
>>>> - sharding (last step)
>>>>
>>>> Once you hit the magical 300-500GB size and you have hit the end of
>>>> where master-slave replication scaling can take you, you need to move
>>>> on to different techniques and technology. This is where HBase picks
>>>> up.
>>>>
>>>> So all the things you list below as 'negatives' are the reality on the
>>>> ground when you scale, no matter what technology you use. If they
>>>> sound too ugly for you, perhaps you really need MySQL?
>>>>
>>>>
>>>> On Fri, Jul 3, 2009 at 12:37 AM, tim
>>>> robertson<timrobertson...@gmail.com> wrote:
>>>>
>>>>> Those 2 tables could be collapsed into 1 table with 2 columns, of
>>>>> course...
>>>>>
>>>>> On Fri, Jul 3, 2009 at 9:24 AM, tim
>>>>> robertson<timrobertson...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Disclaimer: I am a newbie, so this is just one option, and I am
>>>>>> basing it on my understanding that secondary indexes are not yet
>>>>>> working on HBase...
>>>>>>
>>>>>> So since HBase has very fast "get by primary key", but is *still* (?)
>>>>>> without working secondary indexes, you would need to do scans to find
>>>>>> the records. A workaround would be to have 2 more tables,
>>>>>> "Country_Contains" and "Country_Contained_In", where in each table
>>>>>> the primary key is the unique ID of the country and the payload is
>>>>>> the keys of the rows in the main table. Basically this is creating 2
>>>>>> tables to act as the index manually. This is a duplication of data,
>>>>>> and would require management of 3 tables wrapped in a transaction
>>>>>> when doing CRUD, but it would allow for lookup of the rows to modify
>>>>>> without the need for scanning.
>>>>>>
>>>>>> Just one idea...
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<ish...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I am pretty new to HBase, so forgive me if this seems like a silly
>>>>>>> question.
>>>>>>>
>>>>>>> Each row in my HBase table is a geographical location that is
>>>>>>> related to other locations. For example, one relationship is the
>>>>>>> CONTAIN relationship: Europe CONTAINs England, France, Spain, etc.
>>>>>>> There is an inverse relationship as well, called PARENT, so England
>>>>>>> has a PARENT called Europe. However, note that, for various business
>>>>>>> reasons not pertinent to this discussion, the inverse relationship
>>>>>>> need not always be set, i.e. we may not store France with a PARENT
>>>>>>> value of Europe even though Europe CONTAINs France.
>>>>>>>
>>>>>>> So, I store each location as a row with an id, and the payload data
>>>>>>> for that location as a separate data column. This data column
>>>>>>> includes the sets of ids of the related locations.
>>>>>>>
>>>>>>> Now, I want to be able to update/delete locations consistently. So,
>>>>>>> in my example above, I might want to delete France, in which case I
>>>>>>> also want to make sure that I delete the CONTAINs relationship that
>>>>>>> Europe has with France, as that is now obsolete. What is the most
>>>>>>> efficient way to do this? I want to minimise the number of writes I
>>>>>>> would have to do - on the other hand, optimising read performance is
>>>>>>> more important, as writes do not happen that often (this is
>>>>>>> geographic data, after all).
>>>>>>>
>>>>>>> My thoughts are: I will have to do 1+n writes to do a delete - i.e.
>>>>>>> 1 write operation to delete France and n write operations to delete
>>>>>>> the relationships that n other locations may have to France. In the
>>>>>>> case of a root location like Europe, which may have a large number
>>>>>>> of locations that relate to it, this may be expensive, but I see no
>>>>>>> other way.
>>>>>>>
>>>>>>> So, I was wondering how to index this to speed it up as far as
>>>>>>> possible. Given the location Europe, what are the fields I should
>>>>>>> include in its row, and how do I index them? I could create a column
>>>>>>> family for each relationship type with a label - the label being the
>>>>>>> id of the location this location is related to. So, for example, the
>>>>>>> Europe row would have a column called CONTAIN:England (assuming
>>>>>>> "England" is the id for the England row - in reality it would be a
>>>>>>> UUID). I would then have as many labels under the CONTAIN family for
>>>>>>> Europe as locations that Europe contains.
>>>>>>>
>>>>>>> How would I index this and ensure that, when deleting France, the
>>>>>>> query "list all locations that CONTAIN France" returns Europe (and
>>>>>>> whatever else) as quickly as possible?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ishaaq

-- 
View this message in context: http://www.nabble.com/indexing-question-tp24318679p24340332.html
Sent from the HBase User mailing list archive at Nabble.com.