Thanks a lot :) ...... Imran
On Fri, Nov 13, 2009 at 9:19 AM, Ryan Rawson <[email protected]> wrote:
> HBase is semi-column oriented. Column families are the storage model:
> everything in a column family is stored linearly in a file in HDFS.
> That means accessing data from a single column family is really cheap
> and easy. Adding more column families adds more files; it has the
> performance profile of adding new tables, except you don't actually
> have additional tables, so the conceptual complexity stays low.
>
> Data is stored at the "intersection" of the row id and the column
> family + qualifier. This is sometimes called a "Cell" and contains a
> timestamp as well. You can have multiple versions, all timestamped.
> The timestamp is by default the int64 Java system time in
> milliseconds. I have to recommend against setting the timestamp
> explicitly if you can avoid it. So when you retrieve a row, you can
> get everything, a list of column qualifiers, a list of families, or
> any combination (e.g. a list of these qualifiers out of family A and
> everything from family B).
>
> One problem with MySQL is that large values tend to be pushed aside
> and require double block seeks to get them. HBase can help you avoid
> this by giving you column families, e.g. store the giant data in one
> family and the smaller metadata in a second.
>
> The terms to use are:
> - Column family (or just family): the unit of locality in HBase.
>   Everything in a family is stored in one file (or a set of files). A
>   table is a name and a list of families with attributes for those
>   families (e.g. compression). A family name is a string.
> - Column qualifier (or just qualifier): allows you to store multiple
>   values for the same row in one family. This value is a byte array
>   and can be anything. The API converts null => new byte[0]. This is
>   the tricky bit, since most people don't think of "column names" as
>   being dynamic.
> - Cell: the old name for a value + timestamp.
> The new API (see class Result) doesn't use this term; instead it
> provides a different path to read data.
>
> You can use HBase as a normal datastore and use static names for the
> qualifiers, and that is just fine. But if you need something special
> to get past the lack of relations, you can start to do fun things
> with the qualifier as data, such as building a secondary index. The
> row key would be the secondary value (e.g. city), the qualifier would
> be the primary key (e.g. userid), and the value would be a
> placeholder to indicate that the value exists.
>
> On Thu, Nov 12, 2009 at 6:04 PM, Imran M Yousuf <[email protected]> wrote:
>> On Fri, Nov 13, 2009 at 9:00 AM, Ryan Rawson <[email protected]> wrote:
>>> HBase does at least 3 things that traditional databases have a hard
>>> time with:
>>>
>>> - Large blobs of data. MySQL is particularly guilty of not handling
>>>   this well.
>>> - Tables that grow to be larger than reasonably priced single
>>>   machines.
>>> - Write loads that are not compatible with master-slave replication.
>>>
>>> The 2nd and 3rd are very interesting, since you either have to pay
>>> for something like Oracle RAC, or start sharding.
>>>
>> Exactly, and since the contents will be blob data and my experience
>> with RDBMS blobs suggests that scaling is proportional to *BIG*
>> money, I am eager to take the HBase path. I was actually praying and
>> hoping you would join this thread :). Can you please elaborate on
>> Column Family, Column and Cell and their basic use cases?
>>
>> Thanks a lot,
>>
>> Imran
>>
>>> On Thu, Nov 12, 2009 at 5:58 PM, Imran M Yousuf <[email protected]> wrote:
>>>> On Thu, Nov 12, 2009 at 10:50 PM, Chris Bates
>>>> <[email protected]> wrote:
>>>>> Hi Imran,
>>>>>
>>>>> I'm a new user as well. I found these presentations helpful in
>>>>> answering most of your questions:
>>>>> http://wiki.apache.org/hadoop/HBase/HBasePresentations
>>>>>
>>>>> There are HBase schema designs in there.
>>>>
>>>> I read them, but without the speaker's explanation the schema parts
>>>> remain unexplained for a newbie like me. I was looking for more
>>>> concrete definitions of column family, column, cell etc. and their
>>>> use cases. I guess I will have to learn them by experimenting.
>>>>
>>>>> You might also want to read the original BigTable paper and the
>>>>> chapter on HBase in O'Reilly's Hadoop book.
>>>>>
>>>>> But to answer one of your questions: "Big Data" usually refers to
>>>>> a dataset that is millions to billions of rows in length. But "Big
>>>>> Data" doesn't mean you have to use a tool like HBase. We have some
>>>>> MySQL tables that are 100 million rows and work fine. You have to
>>>>> identify what works best for your use and use the most appropriate
>>>>> tool.
>>>>
>>>> Thanks. IMHO, I am sure that HBase is more suitable than MySQL
>>>> simply because of the complexity and cost of scaling an application
>>>> with blob data.
>>>>
>>>> Thanks a lot,
>>>>
>>>> Imran
>>>>
>>>>> On Thu, Nov 12, 2009 at 9:13 AM, Imran M Yousuf <[email protected]> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I am absolutely new to HBase. All I have done is read the
>>>>>> documentation and presentations and get a single instance up and
>>>>>> running. I am starting on a Content Management System which will
>>>>>> be used as a backend for multiple web applications of different
>>>>>> natures. In the CMS:
>>>>>> * Users can define their own content, known as a content type.
>>>>>> * Content can have one-to-many, one-to-one and many-to-many
>>>>>>   relationships with other content.
>>>>>> * Content fields should be versioned.
>>>>>> * Content types can change at runtime, i.e. fields (a.k.a.
>>>>>>   columns in HBase) can be added; removal will not be allowed
>>>>>>   just yet.
>>>>>> * Every content type will have a corresponding grammar to
>>>>>>   validate content of its type.
>>>>>> * It will have authentication and authorization.
>>>>>> * It will have full text search based on Lucene/Katta.
>>>>>>
>>>>>> Based on these requirements I have the following questions that I
>>>>>> would like feedback on:
>>>>>> * Reading articles and presentations, HBase looks to be a perfect
>>>>>>   match as it supports multi-dimensional rows, versioned cells,
>>>>>>   and dynamic schema modification. But I could not understand the
>>>>>>   definition of "Big Data" - that is, if a content size is
>>>>>>   roughly 1~100 kB (field/cell size 0~100 kB), is HBase meant for
>>>>>>   such uses?
>>>>>> * Since I am not sure how much load the site will have, I am
>>>>>>   planning to set up DN+RS on Rackspace cloud instances with
>>>>>>   2 GB/80 GB HDD, with a view that, as revenue and pageviews
>>>>>>   increase, more moderate "commodity" hardware can be added
>>>>>>   progressively. Any comments/suggestions on this strategy?
>>>>>> * Where can I read up on or check out sample RDBMS schemas
>>>>>>   converted to HBase schemas? Basically, I want to read up on
>>>>>>   efficient schema design for different cardinal relationships
>>>>>>   between objects.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> --
>>>>>> Imran M Yousuf
>>>>>> Entrepreneur & Software Engineer
>>>>>> Smart IT Engineering
>>>>>> Dhaka, Bangladesh
>>>>>> Email: [email protected]
>>>>>> Blog: http://imyousuf-tech.blogs.smartitengineering.com/
>>>>>> Mobile: +880-1711402557

--
Imran M Yousuf
Entrepreneur & Software Engineer
Smart IT Engineering
Dhaka, Bangladesh
Email: [email protected]
Blog: http://imyousuf-tech.blogs.smartitengineering.com/
Mobile: +880-1711402557
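[Editor's note] Ryan's description above - a value lives at the intersection of row key, column family, and qualifier, versioned by an int64 millisecond timestamp, with reads returning the newest version by default - can be sketched as a toy in-memory model. This is plain Python for illustration only, not the real HBase client API; all names here are made up:

```python
import time
from collections import defaultdict

# Toy model of HBase's logical layout (illustrative, not the real API):
# (row, family, qualifier) -> {timestamp: value}, multiple versions per cell.
table = defaultdict(dict)

def put(row, family, qualifier, value, ts=None):
    # Default timestamp is int64 system millis, as Ryan describes.
    ts = ts if ts is not None else int(time.time() * 1000)
    table[(row, family, qualifier)][ts] = value
    return ts

def get(row, family, qualifier):
    """Return the newest version of a cell, mirroring the default read."""
    versions = table.get((row, family, qualifier), {})
    return versions[max(versions)] if versions else None

# Two timestamped versions of the same cell; a plain read sees the latest.
put("user1", "meta", "city", b"Dhaka", ts=1)
put("user1", "meta", "city", b"Chittagong", ts=2)
print(get("user1", "meta", "city"))  # b'Chittagong'
```

The point of the model is the coordinate system: the "column name" (qualifier) is just part of the key, which is why it can be dynamic data rather than a fixed schema element.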
