Hi,

I have a design question and I'm kind of stuck. I can't find an easy solution, but I think there is one.
The problem: consider an application where users can "open" an object. They can then perform an operation on that object, or go on to another object. I want to compute some statistics over all users and all objects, e.g. how long one object is "viewed" by all users, or how long on average a user looks at one type of object before an operation occurs (going on to another object or doing something on the current one).

I can log an "open event" and an "operation event", but not a "close event". There are around 1M objects and around 100M operations on them, perhaps more, and both numbers can grow fast.

The result I want looks like this, for all objects:

    object1 => users = { "userX" : 2s , "userY" : 60s , ... } , operationby = "userX" , objecttype = "foobar"

From that I can calculate the average time one user spent on typeX etc. So far I plan to dump the above structure into another HBase table and run a mapred job there to get my averages. But this could change if one of you comes up with a better plan. I actually only need the averaged and aggregated data.

As I use one HBase table as a logging dump, I naively came up with a structure like

    rowkey = <timestamp> , data:user => <username> , data:object => <object ref> , data:type => <object type> , data:operation => <(open|do something)> ....

But now I'm stuck on how to write a clever mapred job to get my information. E.g. if I want to know how long an object is viewed by a particular user, I would have to find an "open" operation and then scan further for either an "open" operation on another object or a "task" operation on the same object.

Thus I came up with

    rowkey = <object-ref>-<timestamp> , data:user => <username> , data:type => <type> , data:operation => <(open|do something)>

With this I could scan over the object-ref and calculate all the information for all users for ONE object. However, I have to do that for all objects, not just a specific one.
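To make the rule concrete, here is a rough sketch in plain Python (no HBase, no mapred) of what I want to compute. The tuple layout and the names view_durations and average_by_type are made up for illustration only. It groups the events per user and sorts them by timestamp, since the end of a view only shows up in the same user's next event. It also assumes that repeated views of the same object by the same user are summed, and that an open without any following event is dropped because its duration is unknown.

```python
from collections import defaultdict

# Hypothetical log rows: (timestamp_seconds, user, object_ref, object_type, operation).
# operation is "open"; anything else counts as an operation on the object.

def _close(result, user, current, end_ts, operated):
    """Record one finished view of `current` = (object_ref, start_ts, object_type)."""
    obj, start_ts, otype = current
    entry = result.setdefault(
        obj, {"users": {}, "operationby": None, "objecttype": otype})
    # Assumption: several views of one object by one user are summed.
    entry["users"][user] = entry["users"].get(user, 0) + (end_ts - start_ts)
    if operated:
        entry["operationby"] = user

def view_durations(events):
    """Build {object_ref: {"users": {user: seconds}, "operationby", "objecttype"}}."""
    by_user = defaultdict(list)
    for ev in events:
        by_user[ev[1]].append(ev)

    result = {}
    for user, evs in by_user.items():
        evs.sort(key=lambda e: e[0])      # time order within one user
        current = None                    # the object this user is viewing
        for ts, _, obj, otype, op in evs:
            if op == "open":
                if current is None:
                    current = (obj, ts, otype)
                elif current[0] != obj:   # "go further": open of another object
                    _close(result, user, current, ts, operated=False)
                    current = (obj, ts, otype)
                # re-opening the same object keeps the running view
            elif current is not None and current[0] == obj:
                _close(result, user, current, ts, operated=True)
                current = None
        # a trailing open has no end event, so it is dropped
    return result

def average_by_type(per_object):
    """Average view time per object type over all (user, object) pairs."""
    totals = defaultdict(lambda: [0, 0])  # type -> [sum_seconds, count]
    for entry in per_object.values():
        for secs in entry["users"].values():
            totals[entry["objecttype"]][0] += secs
            totals[entry["objecttype"]][1] += 1
    return {t: s / n for t, (s, n) in totals.items()}

events = [
    (0, "userX", "object1", "foobar", "open"),
    (2, "userX", "object1", "foobar", "do something"),
    (10, "userY", "object1", "foobar", "open"),
    (70, "userY", "object2", "foobar", "open"),
]
per_object = view_durations(events)
averages = average_by_type(per_object)
```

With these four made-up events, object1 ends up with userX at 2s and userY at 60s, and operationby is userX, which is the shape of the result described above.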
How can I run over that table in a mapred fashion? My first idea was to mapred over the objects and then do an appropriate scan in the log table. But this way I cannot detect a "go further", as the next open event is in another scan range.

My next idea was a rowkey design of <user>-<timestamp>. But then I would have to map over the users and make a full scan of each user's complete log, which would kill the advantage of mapred (as the number of users is << the number of objects), and each map task would be huge. This would be more of a parallel scan. Or is this a good idea? But if so ... how do I get the information from above out of it?

Thus ... I'm stuck :(. Any ideas how to achieve my goal?

Best wishes
Wilm

ps: It would be possible to create a "close event" if there is no other solution. But I'd rather not do that.