Ok, now I have a good picture of your situation (took me a moment). I think that even if it's concurrent it won't be much of a problem. Keeping the maximum number of versions at 1 will ensure that even if 3 mappers insert the history of the same entity, the overlapping data will still end up in your "event:" family and the extra copies will be discarded. Your biggest concern will be the efficiency of reading data from HBase, so your mappers should keep a local cache.
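To make the cache idea concrete, here's a rough, untested sketch of what I mean by a per-mapper cache. The class and method names are made up for the example, and the actual HBase lookup is left as a stub since the exact client calls depend on your version:

import java.util.HashSet;
import java.util.Set;

// Rough sketch of a per-mapper cache: each entity ID is checked
// against HBase at most once per map task, and entities the task
// inserted itself are answered locally afterwards.
public class EntityCache {

    private final Set<String> known = new HashSet<String>();

    // Returns true if the entity already has a row in the table.
    public boolean isKnown(String entityId) {
        if (known.contains(entityId)) {
            return true; // answered locally, no HBase round trip
        }
        boolean exists = existsInHBase(entityId);
        if (exists) {
            known.add(entityId);
        }
        return exists;
    }

    // Call this right after inserting a new entity so that later
    // records in the same input split skip the HBase lookup entirely.
    public void markInserted(String entityId) {
        known.add(entityId);
    }

    // Stub: in the real mapper this would be the HTable get on the
    // createdAt: column described in the thread below.
    private boolean existsInHBase(String entityId) {
        return false;
    }
}

Keep in mind the cache only helps within a single map task; two concurrent mappers can still both miss and both insert the same entity, which is exactly the case that keeping max versions at 1 takes care of.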
Hope this helps,

J-D

On Sat, Jul 19, 2008 at 5:22 PM, imbmay <[EMAIL PROTECTED]> wrote:
>
> The table was created with two column families: createdAt and event. The
> former is the timestamp, so one entry per entity, and the latter is a
> collection of events whose entries take the form event:1524, event:1207,
> etc.; for the time being I'm storing only the event time. The input is a
> set of text files generated at a rate of about 600 an hour with up to
> 50,000 entries per file. Each line in a text file contains a unique
> entity ID, a timestamp of the first time it was seen, an event code and a
> history of the last 100 event codes. In cases where I haven't seen an
> entity before I want to add everything in the history; when the entity
> has been seen previously I just want to add the last event. I'm keeping
> the table design simple to start with while I'm getting familiar with
> HBase.
>
> The principal area of concern I have is regarding the reading of the data
> from the HBase table during the map/reduce process to determine if an
> entity already exists. If I'm running the map/reduce on a single machine
> then it's pretty easy to keep track of previously unknown entities; but
> if I'm running in a cluster a new entity may show up in the inputs to
> several concurrent mappers.
>
> [EMAIL PROTECTED]
>
>
> Jean-Daniel Cryans wrote:
> >
> > Brian (guessing it's your name from your email address),
> >
> > Please be more specific about your table design. For example, a
> > "column" in HBase is a very vague word since it may refer to a column
> > family or a column key inside a column family. Also, what kind of load
> > do you expect to have?
> >
> > Maybe answering this will also help you understand HBase.
> >
> > Thx,
> >
> > J-D
> >
> > On Fri, Jul 18, 2008 at 4:41 PM, imbmay <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> I want to use HBase to maintain a very large dataset which needs to be
> >> updated pretty much continuously. I'm creating a record for each
> >> entity and including a creation timestamp column as well as between 10
> >> and 1000 additional columns named for distinct events related to the
> >> record entity. Being new to HBase, the approach I've taken is to
> >> create a map/reduce app that, for each input record:
> >>
> >> Does a lookup in the table using HTable get(row, column) on the
> >> timestamp column to determine if there is an existing row for the
> >> entity.
> >> If there is no existing record for the entity, adds the event history
> >> for the entity to the table with one column added per unique event ID.
> >> If there is an existing record for the entity, just adds the most
> >> recent event to the table.
> >>
> >> I'd like feedback as to whether this is a reasonable approach in terms
> >> of general performance and reliability, or if there is a different
> >> pattern better suited to HBase with map/reduce, or if I should even be
> >> using map/reduce for this.
> >>
> >> Thanks in advance.
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18537368.html
> >> Sent from the HBase User mailing list archive at Nabble.com.
> >>
> >
>
> --
> View this message in context:
> http://www.nabble.com/Table-Updates-with-Map-Reduce-tp18537368p18548888.html
> Sent from the HBase User mailing list archive at Nabble.com.
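P.S. (below the quote): for reference, the lookup-then-insert flow Brian describes could look roughly like this. I'm writing it from memory of the 0.1-era client (startUpdate/put/commit), untested, and these calls have been changing between releases, so check every signature against the version you actually run:

import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTable;
import org.apache.hadoop.io.Text;

// Rough, untested sketch of the lookup-then-insert flow from the
// thread, based on the 0.1-style client API -- verify every call
// against the HBase release you run. "entities" is a made-up table
// name for this example.
public class EntityUpdater {

    private final HTable table;

    public EntityUpdater() throws Exception {
        table = new HTable(new HBaseConfiguration(), new Text("entities"));
    }

    // history maps event code -> event time for the last 100 events;
    // latestCode/latestTime describe the newest event on the input line.
    public void process(String entityId, String createdAt,
                        Map<String, String> history,
                        String latestCode, String latestTime) throws Exception {
        Text row = new Text(entityId);

        // Existence check on the createdAt: column, as described above.
        byte[] cell = table.get(row, new Text("createdAt:"));

        long lockid = table.startUpdate(row);
        try {
            if (cell == null) {
                // New entity: write the timestamp and the whole history.
                table.put(lockid, new Text("createdAt:"), createdAt.getBytes());
                for (Map.Entry<String, String> entry : history.entrySet()) {
                    table.put(lockid, new Text("event:" + entry.getKey()),
                              entry.getValue().getBytes());
                }
            } else {
                // Known entity: add only the most recent event.
                table.put(lockid, new Text("event:" + latestCode),
                          latestTime.getBytes());
            }
            table.commit(lockid);
        } catch (Exception ex) {
            table.abort(lockid);
            throw ex;
        }
    }
}

Note that with the maximum number of versions at 1, two mappers racing through the cell == null branch for the same entity just write the same history twice and HBase keeps one copy, which is why the concurrent inserts are harmless.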
