Depending on the speed requirements associated with retrieving bunches of messages, hbase may have a real edge here. This is a special problem in that there are common query patterns that allow contiguous reads of lots of data. That gives a huge advantage to systems like hbase that store data organized by key.
You might view it as the karmic opposite of the common hot-spotting problem due to storing elements by time-stamp. On Fri, Apr 1, 2011 at 8:10 PM, Kevin Apte <technicalarchitect2...@gmail.com > wrote: > Have you considered using Cassandra? > > Kevin > > > On Fri, Apr 1, 2011 at 11:01 PM, M. C. Srivas <mcsri...@gmail.com> wrote: > > > Ted, this is a pretty clever idea. > > > > On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning <tdunn...@maprtech.com> > > wrote: > > > > > Solr/Elastic search is a fine solution, but probably won't be quite as > > fast > > > as a well-tuned hbase solution. > > > > > > One key assumption you seem to be making is that you will store > messages > > > only once. If you are willing to make multiple updates to tables, then > > you > > > can arrange the natural ordering of the table to get what you want. > For > > > instance, you could keep the most recent messages (say the last 10 from > > > each > > > of the 1000 most recently updated threads) in an in memory table. Then > > you > > > could store messages in a thread table indexed by thread:timestamp. > > > Finally > > > you could store messages in a table indexed by user:thread or > > > user:timestamp. This would allow you to display the most recent > messages > > > or > > > thread in near zero time, to display all or the most recent messages > from > > a > > > particular thread with only one retrieval and all of the messages from > a > > > particular user in time order in one retrieval. > > > > > > On Thu, Mar 31, 2011 at 5:56 PM, Mark Jarecki <mjare...@bigpond.net.au > > > >wrote: > > > > > > > Hi all, > > > > > > > > I'm modelling a schema for storing and retrieving threaded messages, > > > where, > > > > for planning purposes: > > > > > > > > - there are many millions of users. > > > > - a user might have up to 1000 threads. > > > > - each thread might have up to 50000 messages (with some > threads > > > > being sparse with only a few messages). > > > > - the Stargate REST interface is used. > > > > > > > > I want to be able to execute the following queries: > > > > > > > > - retrieve x latest active threads, with the latest message. > > > > - retrieve x latest active threads, with the latest message, > > > offset > > > > by y. > > > > - retrieve x latest messages from a thread. > > > > - retrieve x latest messages from a thread, offset by y. > > > > > > > > I've come up with a few possible methods for modelling this. But any > > > > insights would be greatly appreciated. > > > > > > > > Thanks in advance, > > > > > > > > Mark > > > > > > > > > > > > Possible solution 1: > > > > > > > > TABLE: threads > > > > KEY: userID : threadID > > > > COLUMN: latest_message > > > > > > > > TABLE: messages > > > > KEY: userID : threadID : timestamp > > > > COLUMN: message > > > > > > > > Messages are first written to the messages table, and then the > threads > > > > table's thread is updated with the latest message. > > > > > > > > To fetch the latest x active threads, with the latest message: > > > > > > > > - I retrieve all threads and then sort and reduce the results > on > > > the > > > > client. > > > > > > > > A concern with this is the fetching of all threads to sort on each > > > request. > > > > This could be unwieldy! > > > > > > > > > > > > Possible solution 2: > > > > > > > > TABLE: threads > > > > KEY: userID : timestamp : threadID > > > > COLUMN: latest_message > > > > > > > > TABLE: messages > > > > KEY: userID : threadID : timestamp > > > > COLUMN: message > > > > > > > > Messages are first written to the messages table, and then the > threads > > > > table's is updated with the latest message. The previous latest > message > > > is > > > > then deleted from the threads table. > > > > > > > > To fetch the latest x active threads, with the latest message: > > > > > > > > - I scan the threads table until I get x unique threads. > > > > > > > > A concern with this could be the issue of keeping the threads table > in > > > sync > > > > with the messages table - especially with the deletion of old latest > > > > messages. > > > > > > > > > > > > Possible solution 3: > > > > > > > > TABLE: messages > > > > KEY: userID : timestamp : threadID > > > > COLUMN: message > > > > > > > > To fetch the latest x active threads, with the latest message: > > > > > > > > - I scan the messages table until I get x unique threads. > > > > > > > > One of my concerns with this method is that some threads will be > busier > > > > than others, forcing a scan through nearly all of a user's messages. > > And > > > > there will be an ever increasing number of messages. A periodic > > archiving > > > > process - moving older messages to another table - might alleviate > > things > > > > here. > > > > > > > > > > > > Possible solution 4: > > > > > > > > Use SOLR/Elastic search or equivalent. > > > > > > > > > > > > > > > > > > > > > > > > > >