Re: Modelling threaded messages

Ted Dunning Fri, 01 Apr 2011 21:59:44 -0700

Depending on the speed requirements associated with retrieving bunches of
messages, hbase may have a real edge here.  This is a special problem in
that there are common query patterns that allow contiguous reads of lots of
data.  That gives a huge advantage to systems like hbase that store data
organized by key.


You might view it as the karmic opposite of the common hot-spotting problem
due to storing elements by time-stamp.

On Fri, Apr 1, 2011 at 8:10 PM, Kevin Apte <technicalarchitect2...@gmail.com
> wrote:

> Have you considered using Cassandra?
>
> Kevin
>
>
> On Fri, Apr 1, 2011 at 11:01 PM, M. C. Srivas <mcsri...@gmail.com> wrote:
>
> > Ted, this is a pretty clever idea.
> >
> > On Thu, Mar 31, 2011 at 9:27 PM, Ted Dunning <tdunn...@maprtech.com>
> > wrote:
> >
> > > Solr/Elastic search is a fine solution, but probably won't be quite as
> > fast
> > > as a well-tuned hbase solution.
> > >
> > > One key assumption you seem to be making is that you will store
> messages
> > > only once.  If you are willing to make multiple updates to tables, then
> > you
> > > can arrange the natural ordering of the table to get what you want.
>  For
> > > instance, you could keep the most recent messages (say the last 10 from
> > > each
> > > of the 1000 most recently updated threads) in an in memory table.  Then
> > you
> > > could store messages in a thread table indexed by thread:timestamp.
> > >  Finally
> > > you could store messages in a table indexed by user:thread or
> > > user:timestamp.  This would allow you to display the most recent
> messages
> > > or
> > > thread in near zero time, to display all or the most recent messages
> from
> > a
> > > particular thread with only one retrieval and all of the messages from
> a
> > > particular user in time order in one retrieval.
> > >
> > > On Thu, Mar 31, 2011 at 5:56 PM, Mark Jarecki <mjare...@bigpond.net.au
> > > >wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm modelling a schema for storing and retrieving threaded messages,
> > > where,
> > > > for planning purposes:
> > > >
> > > >        - there are many millions of users.
> > > >        - a user might have up to 1000 threads.
> > > >        - each thread might have up to 50000 messages (with some
> threads
> > > > being sparse with only a few messages).
> > > >        - the Stargate REST interface is used.
> > > >
> > > > I want to be able to execute the following queries:
> > > >
> > > >        - retrieve x latest active threads, with the latest message.
> > > >        - retrieve x latest active threads, with the latest message,
> > > offset
> > > > by y.
> > > >        - retrieve x latest messages from a thread.
> > > >        - retrieve x latest messages from a thread, offset by y.
> > > >
> > > > I've come up with a few possible methods for modelling this. But any
> > > > insights would be greatly appreciated.
> > > >
> > > > Thanks in advance,
> > > >
> > > > Mark
> > > >
> > > >
> > > > Possible solution 1:
> > > >
> > > > TABLE:          threads
> > > > KEY:            userID : threadID
> > > > COLUMN:         latest_message
> > > >
> > > > TABLE:          messages
> > > > KEY:            userID : threadID : timestamp
> > > > COLUMN:         message
> > > >
> > > > Messages are first written to the messages table, and then the
> threads
> > > > table's thread is updated with the latest message.
> > > >
> > > > To fetch the latest x active threads, with the latest message:
> > > >
> > > >        - I retrieve all threads and then sort and reduce the results
> on
> > > the
> > > > client.
> > > >
> > > > A concern with this is the fetching of all threads to sort on each
> > > request.
> > > > This could be unwieldy!
> > > >
> > > >
> > > > Possible solution 2:
> > > >
> > > > TABLE:          threads
> > > > KEY:            userID : timestamp : threadID
> > > > COLUMN:         latest_message
> > > >
> > > > TABLE:          messages
> > > > KEY:            userID : threadID : timestamp
> > > > COLUMN:         message
> > > >
> > > > Messages are first written to the messages table, and then the
> threads
> > > > table's is updated with the latest message. The previous latest
> message
> > > is
> > > > then deleted from the threads table.
> > > >
> > > > To fetch the latest x active threads, with the latest message:
> > > >
> > > >        - I scan the threads table until I get x unique threads.
> > > >
> > > > A concern with this could be the issue of keeping the threads table
> in
> > > sync
> > > > with the messages table - especially with the deletion of old latest
> > > > messages.
> > > >
> > > >
> > > > Possible solution 3:
> > > >
> > > > TABLE:          messages
> > > > KEY:            userID : timestamp : threadID
> > > > COLUMN:         message
> > > >
> > > > To fetch the latest x active threads, with the latest message:
> > > >
> > > >        - I scan the messages table until I get x unique threads.
> > > >
> > > > One of my concerns with this method is that some threads will be
> busier
> > > > than others, forcing a scan through nearly all of a user's messages.
> > And
> > > > there will be an ever increasing number of messages. A periodic
> > archiving
> > > > process - moving older messages to another table - might alleviate
> > things
> > > > here.
> > > >
> > > >
> > > > Possible solution 4:
> > > >
> > > > Use SOLR/Elastic search or equivalent.
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Re: Modelling threaded messages

Reply via email to