No, it doesn't sound 'raw', 'painful' or 'error prone' to me - I am
well aware of the reasons for using HBase over a traditional RDBMS, so
I am not complaining about this.
No, I was asking the question because I was not sure what the best
approach would be. By the way, I did not convey the whole story - there
is actually a third type of relationship as well, SURROUNDING, i.e.
adjacent geographical locations SURROUND each other (again, for
business reasons, this relationship is not necessarily symmetric,
though it usually is).

When you say HBase doesn't provide declarative secondary indexes you
lost me - what are these? How are they different from the ones
available via IndexedTable and IndexSpecification?

I was hoping that, by using sparse values in a column family labelled
by the location ids, I would just have to search for rows with a
non-empty value in the CONTAIN:France column to answer the example
query I mentioned. I understand that this would make the CONTAIN column
family (and the PARENT and SURROUNDING families too) quite wide, but I
remember reading somewhere that that is quite acceptable for HBase.
Further, since the column labels themselves contain the data I am
searching for, I was hoping there would be an efficient way to do this
(I don't know why or how - I was just hoping).

Anyway, if the only way to do this efficiently in HBase is to use four
tables - one for the locations and one for each of the three types of
relationships - then so be it, that is what I'll have to do. I was just
hoping for a simpler alternative with my idea of using column families
labelled by the location ids. I've put a couple of rough sketches below
to make sure we are talking about the same thing.
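First, here is roughly what I was hoping the wide CONTAIN family would
let me do for the "delete France" case. This is only a sketch - the
table and family names are just the ones from my example, and I have
written it against the client API from memory, so the exact class and
method names may well not match the version we are running. As far as
I can tell the scan below still walks the whole table, which I suspect
is exactly the problem you are pointing at:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteFranceSketch {

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    byte[] contain = Bytes.toBytes("CONTAIN");
    byte[] franceId = Bytes.toBytes("France"); // in reality a UUID

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table locations = conn.getTable(TableName.valueOf("locations"))) {

      // "list all locations that CONTAIN France": restricting the scan
      // to the CONTAIN:France column means only rows that actually have
      // that cell come back - but there is no index behind it, so the
      // scan still visits every row in the table.
      Scan scan = new Scan();
      scan.addColumn(contain, franceId);

      try (ResultScanner scanner = locations.getScanner(scan)) {
        for (Result r : scanner) {
          // drop the now-obsolete CONTAIN:France cell from each row found
          Delete d = new Delete(r.getRow());
          d.addColumns(contain, franceId);
          locations.delete(d);
        }
      }

      // finally delete France's own row (the "1" of my 1+n writes)
      locations.delete(new Delete(franceId));
    }
  }
}

(The PARENT and SURROUNDING families would presumably need the same
treatment before the France row itself can go.)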
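And second, my reading of the manual secondary-index approach you and
Tim describe below: one index table per relationship, keyed by the
related location's id, with the referencing row keys stored as column
qualifiers, so each index row is just a pointer back into the main
table. The contains_idx table name and the REF family are placeholders
I have made up for the sketch, and again the API calls are from memory:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LocationIndexSketch {

  private static final byte[] CONTAIN = Bytes.toBytes("CONTAIN");
  private static final byte[] REF = Bytes.toBytes("REF");

  // Record "parent CONTAINs child": one write to the main table and one
  // to the index table - the dual write, with no transaction spanning
  // the two tables.
  static void addContains(Table locations, Table containsIdx,
                          byte[] parentId, byte[] childId) throws IOException {
    Put main = new Put(parentId);
    main.addColumn(CONTAIN, childId, Bytes.toBytes(true));
    locations.put(main);

    // index row key = the contained location, qualifiers = who contains it
    Put idx = new Put(childId);
    idx.addColumn(REF, parentId, Bytes.toBytes(true));
    containsIdx.put(idx);
  }

  // Delete a location: read its index row to find every row that points
  // at it, remove those CONTAIN cells, then drop the index row and the
  // location itself - still my 1+n writes, but with no scanning.
  static void deleteLocation(Table locations, Table containsIdx,
                             byte[] locId) throws IOException {
    Result refs = containsIdx.get(new Get(locId).addFamily(REF));
    if (!refs.isEmpty()) {
      for (byte[] parentId : refs.getFamilyMap(REF).keySet()) {
        Delete d = new Delete(parentId);
        d.addColumns(CONTAIN, locId);
        locations.delete(d);
      }
    }
    containsIdx.delete(new Delete(locId));
    locations.delete(new Delete(locId));
  }
}

If that is roughly right, I can see how the index row makes the lookup
cheap at the cost of the dual writes, plus the risk of the two tables
drifting apart if a write fails half way - which I take to be the "no
transactions across tables" point.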
Ishaaq

Ryan Rawson wrote:
>
> Hey,
>
> HBase doesn't provide declarative secondary indexes. Your app code
> needs to maintain them, writing into 2 tables with dual writes. You
> don't have to duplicate data, you can just use the secondary index as
> a pointer into the main table, causing you to have to chase down
> potentially thousands of extra RPCs. There are no HBase transactions
> when you are modifying multiple tables, but that isn't as big of a
> problem as it seems.
>
> If all this sounds very 'raw' and 'painful' and 'error prone', let me
> remind you what HBase is for, and perhaps you can make a better
> choice.
>
> HBase is for when you hit the limits of what you can do with MySQL.
> When you work to scale MySQL you end up removing the following
> features:
> - no transactions
> - no secondary indexes (slow on MySQL/InnoDB)
> - separate multiple table indexes on different databases
> - sharding (last step)
>
> Once you hit the magical 300-500GB size and you have hit the end of
> where master-slave replication scaling can take you, you need to move
> on to different techniques and technology. This is where HBase picks
> up.
>
> So all the things you list below as 'negatives' are the reality on
> the ground when you scale no matter what technology you use. If they
> sound too ugly for you, perhaps you really need MySQL?
>
>
> On Fri, Jul 3, 2009 at 12:37 AM, tim robertson<timrobertson...@gmail.com>
> wrote:
>> Those 2 tables could be collapsed into 1 table with 2 columns of
>> course...
>>
>> On Fri, Jul 3, 2009 at 9:24 AM, tim robertson<timrobertson...@gmail.com>
>> wrote:
>>> Hi,
>>>
>>> Disclaimer: I am a newbie, so this is just one option, and I am
>>> basing this on my understanding that secondary indexes are not yet
>>> working on HBase...
>>>
>>> Since HBase has very fast "get by primary key" but is *still* (?)
>>> without working secondary indexes, you would need to do scans to
>>> find the records. A workaround would be to have 2 more tables,
>>> "Country_Contains" and "Country_Contained_In", where in each table
>>> the primary key is the unique ID of the country and the payload is
>>> the keys of the rows in the main table. Basically this is creating
>>> 2 tables to act as the index manually. This is a duplication of
>>> data, and would require management of 3 tables wrapped in a
>>> transaction when doing CRUD, but it would allow lookup of the rows
>>> to modify without the need for scanning.
>>>
>>> Just one idea...
>>>
>>> Cheers,
>>>
>>> Tim
>>>
>>>
>>> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<ish...@gmail.com> wrote:
>>>> Hi all,
>>>> I am pretty new to HBase so forgive me if this seems like a silly
>>>> question.
>>>>
>>>> Each row in my HBase table is a geographical location that is
>>>> related to other locations. For example, one relationship is the
>>>> CONTAIN relationship. So, Europe CONTAINs England, France, Spain
>>>> etc. There is an inverse relationship as well called PARENT, so
>>>> England has a PARENT called Europe. However, note that, for various
>>>> business reasons not pertinent to this discussion, the inverse
>>>> relationship need not always be set, i.e. we may not store France
>>>> with a PARENT value of Europe, even though Europe CONTAINs France.
>>>>
>>>> So, I store each location as a row with an id and the payload data
>>>> for that location as a separate data column. This data column
>>>> includes the sets of ids of the related locations.
>>>>
>>>> Now, I want to be able to update/delete locations consistently. So,
>>>> in my example above, I might want to delete France, in which case I
>>>> also want to make sure that I delete the CONTAINs relationship that
>>>> Europe has with France, as that is now obsolete. What is the most
>>>> efficient way to do this? I want to minimise the number of writes I
>>>> would have to do - on the other hand, optimising read performance
>>>> is more important, as writes do not happen that often (this is
>>>> geographic data after all).
>>>>
>>>> My thoughts are: I will have to do 1+n writes to do a delete - i.e.
>>>> 1 write operation to delete France and n write operations to delete
>>>> the relationships that n other locations may have to France. In the
>>>> case of a root location like Europe that may have a large number of
>>>> locations relating to it this may be expensive, but I see no other
>>>> way.
>>>>
>>>> So, I was wondering, how do I index this to speed it up as far as
>>>> possible? Given the location Europe, what are the fields I should
>>>> include in its row and how should I index them? I could create a
>>>> column family for each relationship type with a label, the label
>>>> being the id of the location this location is related to. So, for
>>>> example, the Europe row would have a column called CONTAIN:England
>>>> (assuming "England" is the id for the England row - in reality it
>>>> would be a UUID). I would then have as many labels under the
>>>> CONTAIN family for Europe as there are locations that Europe
>>>> contains.
>>>>
>>>> How would I index this and ensure that, when deleting France, the
>>>> query "list all locations that CONTAIN France" returns Europe (and
>>>> whatever else) as quickly as possible?
>>>>
>>>> Thanks,
>>>> Ishaaq
>>>>
>>>
>>
>