At a high level the design is similar to the one described at <http://wiki.apache.org/hadoop/Hive/HBaseIntegration>. It requires adding a new Hypertable "Row" object as well as new input/output formats to read/write Rows via Hadoop.
*Row:* The fundamental unit of operation in Hive is a row, as opposed to a Cell in Hypertable (for instance, 'select col2, col3 from table where col1=foo' makes no sense when you operate on cells). Hence Hypertable needs to expose a row as a collection of cells to Hive. A Hive column can be mapped either to an entire column family (using the Hive Map data type) or to a specific qualified column. Since a Hypertable row might be sparse, the Row object needs to provide a way to access a Cell (or set of Cells) given the column family (CF), as well as a way to access a cell given column family + qualifier (CQ). For simplicity the row will contain only the latest version of each cell; multiple versions can be added as required.

*Implementation:* A row could contain a ByteBuffer holding the cells, along with two maps storing 'CF' -> set of cell buffer offsets and 'CF:CQ' -> cell buffer offset. Alternatively, since the maximum number of CFs is 256, an array of size 256 could store indices into an array that stores the cell buffer offsets.

*LazyHTRow:* Hive supports types and Hypertable currently doesn't. LazyHTRow minimizes the amount of deserialization, allowing Hive to deserialize only the Cell contents it needs.

*HypertableSerDe:* Deserializes rows into, and serializes them from, LazyHTRow objects.

*Row Input/Output Formats:* We need input and output formats so Hadoop can read and write the Row objects above. These can use the get_/set_serialized_cells API or get_/set_serialized_row; either way they'll need to convert between Row objects and collections of cells.

*MetaHook:* The Hive metastore has to store the mapping between Hive columns and Hypertable columns, as well as the Hypertable column family names and column family ids (i.e. the HT table schema). The table id and schema generation number will have to be passed into the input/output formats to detect table alteration; if the table has been altered, the Hive table needs to be dropped and recreated.

For concreteness, I've appended rough Java sketches of each of these pieces below.

Thoughts?

-Sanjit
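First, a rough sketch of the Row object, implementing the first variant above: one ByteBuffer of serialized cells plus a 'CF' -> offsets map and a 'CF:CQ' -> offset map. All names here are placeholders, and the cell layout assumed by parse() (an int cell length, then length-prefixed family and qualifier, then the value) is a simplified stand-in for Hypertable's real serialized-cells format, not the actual wire format.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HypertableRow {
  private static final Charset UTF8 = Charset.forName("UTF-8");

  /** Position of one cell's value inside the shared buffer. */
  private static final class CellRef {
    final int offset, length;
    CellRef(int offset, int length) { this.offset = offset; this.length = length; }
  }

  private final byte[] rowKey;
  private final ByteBuffer cells;  // serialized cells, back to back; never copied
  private final Map<String, List<CellRef>> byFamily = new HashMap<String, List<CellRef>>();
  private final Map<String, CellRef> byQualified = new HashMap<String, CellRef>();

  public HypertableRow(byte[] rowKey, ByteBuffer cells) {
    this.rowKey = rowKey;
    this.cells = cells;
  }

  public byte[] getKey() { return rowKey; }

  /** All cell values in a column family; empty if this sparse row lacks it. */
  public List<ByteBuffer> getCells(String family) {
    List<ByteBuffer> result = new ArrayList<ByteBuffer>();
    List<CellRef> refs = byFamily.get(family);
    if (refs != null)
      for (CellRef ref : refs)
        result.add(slice(ref));
    return result;
  }

  /** The cell value at cf:cq, or null if this sparse row lacks it. */
  public ByteBuffer getCell(String family, String qualifier) {
    CellRef ref = byQualified.get(family + ":" + qualifier);
    return ref == null ? null : slice(ref);
  }

  /** Build a row from a buffer of cells, recording offsets instead of copying. */
  public static HypertableRow parse(byte[] rowKey, ByteBuffer buf) {
    HypertableRow row = new HypertableRow(rowKey, buf);
    ByteBuffer cur = buf.duplicate();
    while (cur.hasRemaining()) {
      int end = cur.getInt() + cur.position();  // length prefix covers fam+qual+value
      String family = readString(cur);
      String qualifier = readString(cur);
      CellRef ref = new CellRef(cur.position(), end - cur.position());
      List<CellRef> refs = row.byFamily.get(family);
      if (refs == null) row.byFamily.put(family, refs = new ArrayList<CellRef>());
      refs.add(ref);
      row.byQualified.put(family + ":" + qualifier, ref);  // latest version wins
      cur.position(end);
    }
    return row;
  }

  /** Zero-copy view of one cell's value. */
  private ByteBuffer slice(CellRef ref) {
    ByteBuffer dup = cells.duplicate();
    dup.position(ref.offset);
    dup.limit(ref.offset + ref.length);
    return dup.slice();
  }

  private static String readString(ByteBuffer buf) {
    byte[] b = new byte[buf.get() & 0xff];  // one-byte length prefix
    buf.get(b);
    return new String(b, UTF8);
  }
}

The 256-entry-array variant would replace the two HashMaps with an index keyed by column family id, trading generality for cheaper lookups.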
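On top of that, a sketch of LazyHTRow. Cells that no Hive expression touches are never deserialized; the first access decodes and caches. The typing here (everything decoded as a UTF-8 string) is a placeholder; a real version would decode per the Hive column types.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class LazyHTRow {
  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final HypertableRow row;
  private final Map<String, Object> cache = new HashMap<String, Object>();

  public LazyHTRow(HypertableRow row) { this.row = row; }

  /** Deserialize cf:cq on first access only; repeated reads hit the cache. */
  public Object getField(String family, String qualifier) {
    String key = family + ":" + qualifier;
    Object value = cache.get(key);
    if (value == null && !cache.containsKey(key)) {
      ByteBuffer raw = row.getCell(family, qualifier);  // untouched cells stay serialized
      value = (raw == null) ? null : decode(raw);
      cache.put(key, value);
    }
    return value;
  }

  private Object decode(ByteBuffer raw) {
    byte[] bytes = new byte[raw.remaining()];
    raw.get(bytes);
    return new String(bytes, UTF8);  // stand-in for per-column Hive type decoding
  }
}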
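A skeletal HypertableSerDe against Hive's current org.apache.hadoop.hive.serde2.SerDe interface (later Hive versions may add methods). The 'hypertable.columns.mapping' table property is hypothetical, modeled on the HBase integration's hbase.columns.mapping, and HypertableRowWritable is a Writable wrapper sketched with the record reader further down. Only 'cf:cq' mappings are handled; a bare 'cf' would map to a Hive Map and is elided here, as is the write path.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Writable;

public class HypertableSerDe implements SerDe {
  private List<String> hiveColumns;
  private List<String> htColumns;  // parallel "cf:cq" mapping for each Hive column
  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    hiveColumns = Arrays.asList(tbl.getProperty("columns").split(","));
    // hypothetical table property, modeled on HBase's hbase.columns.mapping
    htColumns = Arrays.asList(tbl.getProperty("hypertable.columns.mapping").split(","));
    List<ObjectInspector> fields = new ArrayList<ObjectInspector>();
    for (int i = 0; i < hiveColumns.size(); i++)  // strings only in this sketch
      fields.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(hiveColumns, fields);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    HypertableRowWritable w = (HypertableRowWritable) blob;
    LazyHTRow row = new LazyHTRow(HypertableRow.parse(w.getKey(), ByteBuffer.wrap(w.getCells())));
    List<Object> fields = new ArrayList<Object>(htColumns.size());
    for (String mapping : htColumns) {
      // only "cf:cq" mappings handled; a bare "cf" would become a Hive Map
      int colon = mapping.indexOf(':');
      fields.add(row.getField(mapping.substring(0, colon), mapping.substring(colon + 1)));
    }
    return fields;  // a real impl would stay lazy via custom object inspectors
  }

  @Override
  public ObjectInspector getObjectInspector() { return inspector; }

  @Override
  public Class<? extends Writable> getSerializedClass() { return HypertableRowWritable.class; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    // write path: walk the struct via oi and pack cells into a row writable (elided)
    throw new UnsupportedOperationException("write path not sketched");
  }
}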
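For the input format side, a sketch of the Writable wrapper and record reader against the org.apache.hadoop.mapred API that Hive uses. The RowScanner interface is a made-up shim for whichever Thrift call we settle on (get_serialized_cells vs get_serialized_row); getSplits, which would presumably produce one split per table range, and the matching output format are elided.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.RecordReader;

/** Hypothetical Writable carrying one serialized row (key + cell buffer)
 *  between the record reader and the SerDe. */
class HypertableRowWritable implements Writable {
  private byte[] key = new byte[0];
  private byte[] cells = new byte[0];

  public void set(byte[] key, byte[] cells) { this.key = key; this.cells = cells; }
  public byte[] getKey() { return key; }
  public byte[] getCells() { return cells; }

  @Override public void write(DataOutput out) throws IOException {
    out.writeInt(key.length);   out.write(key);
    out.writeInt(cells.length); out.write(cells);
  }

  @Override public void readFields(DataInput in) throws IOException {
    key = new byte[in.readInt()];   in.readFully(key);
    cells = new byte[in.readInt()]; in.readFully(cells);
  }
}

/** Made-up shim over the Hypertable Thrift client's row-at-a-time scan. */
interface RowScanner {
  byte[] nextRowKey() throws IOException;  // null at end of the split's range
  byte[] currentCells() throws IOException;
  void close() throws IOException;
}

/** Hands Hive one Row object per call; cells stay serialized until the SerDe. */
class HypertableRowRecordReader implements RecordReader<Text, HypertableRowWritable> {
  private final RowScanner scanner;
  private long rows = 0;

  HypertableRowRecordReader(RowScanner scanner) { this.scanner = scanner; }

  @Override public boolean next(Text key, HypertableRowWritable value) throws IOException {
    byte[] rowKey = scanner.nextRowKey();
    if (rowKey == null) return false;  // range exhausted
    key.set(rowKey);
    value.set(rowKey, scanner.currentCells());
    rows++;
    return true;
  }

  @Override public Text createKey() { return new Text(); }
  @Override public HypertableRowWritable createValue() { return new HypertableRowWritable(); }
  @Override public long getPos() { return rows; }
  @Override public float getProgress() { return 0f; }  // range sizes unknown in this sketch
  @Override public void close() throws IOException { scanner.close(); }
}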
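Finally, a sketch of the metastore hook using Hive's org.apache.hadoop.hive.metastore.HiveMetaHook interface: pin the table id and schema generation as table parameters at create time, and have the input/output formats call checkGeneration() to detect alteration. The parameter keys and the Thrift lookups are placeholders.

import org.apache.hadoop.hive.metastore.HiveMetaHook;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.Table;

public class HypertableMetaHook implements HiveMetaHook {
  static final String PARAM_TABLE_ID = "hypertable.table.id";
  static final String PARAM_SCHEMA_GENERATION = "hypertable.schema.generation";

  @Override
  public void preCreateTable(Table tbl) throws MetaException {
    // pin the HT table id and schema generation in the metastore at create time
    tbl.getParameters().put(PARAM_TABLE_ID, lookupTableId(tbl));
    tbl.getParameters().put(PARAM_SCHEMA_GENERATION, lookupSchemaGeneration(tbl));
  }

  @Override public void commitCreateTable(Table tbl) throws MetaException { }
  @Override public void rollbackCreateTable(Table tbl) throws MetaException { }
  @Override public void preDropTable(Table tbl) throws MetaException { }
  @Override public void commitDropTable(Table tbl, boolean deleteData) throws MetaException { }
  @Override public void rollbackDropTable(Table tbl) throws MetaException { }

  /** Called by the input/output formats: fail fast when the live schema
   *  generation differs from the pinned one, forcing the Hive table to be
   *  dropped and recreated. */
  public static void checkGeneration(Table tbl, String liveGeneration) throws MetaException {
    String pinned = tbl.getParameters().get(PARAM_SCHEMA_GENERATION);
    if (pinned != null && !pinned.equals(liveGeneration))
      throw new MetaException("Hypertable schema changed (generation " + pinned
          + " -> " + liveGeneration + "); drop and recreate the Hive table");
  }

  // placeholders: a real hook would ask the Hypertable master via the Thrift client
  private String lookupTableId(Table tbl) { return ""; }
  private String lookupSchemaGeneration(Table tbl) { return ""; }
}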
