Re: how to model data based on "time bucket"

2013-01-31 Thread Rodrigo Ribeiro
Yes, you are correct, event3 never emits for the time "10:07". The proper result table is, as you mention:

event1 | event2
event2 | event3
event3 |

I guess I was thinking about the old example (T=7). :) On Thu, Jan 31, 2013 at 12:39 PM, Oleg Ruchovets wrote: > Hi Rodrigo
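The corrected table above can be reproduced with a small map/reduce simulation in plain Python. This is only a sketch of the idea discussed in this thread: each event emits "after" markers for the T-1 minutes before it (exclusive of t-T, which is exactly why event3 never reaches 10:07), and the reduce step pairs each event's own minute with the markers of later events. The helper names are my own, not from the thread.

```python
from collections import defaultdict

EVENTS = {"event1": "10:07", "event2": "10:10", "event3": "10:12"}
T = 5  # window size in minutes, from the example in this thread

def to_min(t):
    h, m = map(int, t.split(":"))
    return h * 60 + m

def fmt(m):
    return f"{m // 60:02d}:{m % 60:02d}"

# Map: each event emits itself at its own minute ("start"), plus an
# "after" marker for each of the T-1 preceding minutes (exclusive of
# t-T, so event3 at 10:12 reaches back only to 10:08).
buckets = defaultdict(lambda: {"start": [], "after": []})
for eid, t in EVENTS.items():
    buckets[t]["start"].append(eid)
    for b in range(to_min(t) - 1, to_min(t) - T, -1):
        buckets[fmt(b)]["after"].append(eid)

# Reduce: at each minute, every "start" event collects the "after"
# markers of the later events whose window reaches back to it.
result = {eid: [] for eid in EVENTS}
for t, groups in buckets.items():
    for eid in groups["start"]:
        result[eid].extend(e for e in groups["after"] if e != eid)

for eid in sorted(result):
    print(eid, "|", ", ".join(sorted(result[eid])))
# event1 | event2
# event2 | event3
# event3 |
```

With T=5 this yields exactly the corrected table: event2 (10:10) marks 10:09..10:06 and so reaches event1's minute, while event3 (10:12) marks only 10:11..10:08.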

Re: how to model data based on "time bucket"

2013-01-31 Thread Oleg Ruchovets
Hi Rodrigo, that is just a GREAT idea :-) !!! But how did you get the final result:

event1 | event2, event3
event2 | event3
event3 |

I tried to simulate it and didn't get event1 | event2, event3:

(10:03, [*after*, event1])
(10:04, [*after*, event1])
(10:05, [*after*

Re: how to model data based on "time bucket"

2013-01-31 Thread Rodrigo Ribeiro
Hi, the Map and Reduce steps that you mention are the same as what I thought.

> How should I work with this table? Should I have to scan the main table row by
> row, and for every row get the event time and, based on that time, query the second
> table?
>
> In case I do so, I still need to execute 50 milli

Re: how to model data based on "time bucket"

2013-01-31 Thread Oleg Ruchovets
Hi Rodrigo, as usual you have a very interesting idea! :-) I am not sure that I understand exactly what you mean, so I tried to simulate it. Suppose we have these events in the MAIN table:

event1 | 10:07
event2 | 10:10
event3 | 10:12

Time window T=5 minutes. ===

Re: how to model data based on "time bucket"

2013-01-30 Thread Rodrigo Ribeiro
There is another option: you could do a MapReduce job that, for each row of the main table, emits every time for which that row would still be inside the window. For example, "event1" would emit {"10:06": event1}, {"10:05": event1} ... {"10:00": event1}. (Also for "10:07" if you want to know those who happen i
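The map step described here can be sketched as a small Python function. It follows the example in this message (an event at 10:07 emitting buckets 10:06 down to 10:00, which implies T=7); note that later in the thread the window boundary is tightened so that an event does not emit for t-T. The function name and shape are my own.

```python
def emit_buckets(event_id, time_str, window):
    """Map step: for an event at minute t, emit (bucket_time, event_id)
    pairs for t-1 down to t-window, matching the T=7 example
    ("event1" at 10:07 emits 10:06, 10:05, ..., 10:00)."""
    h, m = map(int, time_str.split(":"))
    t = h * 60 + m
    return [
        (f"{b // 60:02d}:{b % 60:02d}", event_id)
        for b in range(t - 1, t - window - 1, -1)
    ]

print(emit_buckets("event1", "10:07", 7))
# [('10:06', 'event1'), ('10:05', 'event1'), ..., ('10:00', 'event1')]
```

A later reduce step groups these pairs by bucket time, so that looking up any minute immediately yields every event whose window covers it.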

Re: how to model data based on "time bucket"

2013-01-30 Thread Oleg Ruchovets
Hi Rodrigo. Using the solution with 2 tables, one main and one as an index: I have ~50 million records; in my case I would need to scan the whole table, and as a result I will have 50 million scans, which will kill performance. Is there any other approach to model my use case using HBase? Thanks, Oleg. On Mo

Re: how to model data based on "time bucket"

2013-01-28 Thread Oleg Ruchovets
I think I didn't explain it correctly. I want to read from 2 tables in the context of 1 MapReduce job. I mean, I want to read one key from the main table and scan a range from another in the same MapReduce job. I only found MultiTableOutputFormat; there is no MultiTableInputFormat. Is there any workaround to

Re: how to model data based on "time bucket"

2013-01-28 Thread Rodrigo Ribeiro
Yes, it's possible. Check this solution: http://stackoverflow.com/questions/11353911/extending-hadoops-tableinputformat-to-scan-with-a-prefix-used-for-distribution On Mon, Jan 28, 2013 at 2:07 PM, Oleg Ruchovets wrote: > Yes. > This is a very interesting approach. > > Is it possible to read

Re: how to model data based on "time bucket"

2013-01-28 Thread Oleg Ruchovets
Yes, this is a very interesting approach. Is it possible to read one key from the main table and scan from another using map/reduce? I don't want to read from a single client. I use HBase version 0.94.2.21. Thanks, Oleg. On Mon, Jan 28, 2013 at 6:27 PM, Rodrigo Ribeiro < rodrigui...@jusbrasil.com.br> wrot

Re: how to model data based on "time bucket"

2013-01-28 Thread Rodrigo Ribeiro
In the approach that I mentioned, you would need a table to retrieve the time of a certain event (if this information can be retrieved in another way, you may ignore this table). It would be like you posted:

event_id | time
=
event1 | 10:07
event2 | 10:10
event3 | 10:12
event4 | 10:20

And a
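The two-table flow described here (look up the event's time in the main table, then range-scan the index table) can be simulated in plain Python; a sorted list with binary search stands in for HBase's lexicographically ordered rowkeys. The function names and the window parameter are illustrative assumptions, not from the thread.

```python
import bisect

# Main table: event_id -> time (rows from the post above).
main_table = {"event1": "10:07", "event2": "10:10",
              "event3": "10:12", "event4": "10:20"}

# Index table: sorted rowkeys of the form "{time}:{event_id}".
index_table = sorted(f"{t}:{e}" for e, t in main_table.items())

def add_minutes(time_str, delta):
    h, m = map(int, time_str.split(":"))
    total = h * 60 + m + delta
    return f"{total // 60:02d}:{total % 60:02d}"

def followers(event_id, window):
    """Look up the event's time in the main table, then range-scan the
    index table [start, start+window) for the events that follow it."""
    start = main_table[event_id]
    stop = add_minutes(start, window)
    lo = bisect.bisect_left(index_table, start)
    hi = bisect.bisect_left(index_table, stop)
    result = []
    for row in index_table[lo:hi]:
        eid = row.split(":", 2)[2]
        if eid != event_id:  # skip the starting event itself
            result.append(eid)
    return result

print(followers("event1", 8))  # ['event2', 'event3']
```

In real HBase the two `bisect` calls correspond to a single `Scan` with start and stop rows, so the cost is one point `Get` plus one short range scan per starting event.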

Re: how to model data based on "time bucket"

2013-01-28 Thread Oleg Ruchovets
Yes, I agree that using only a timestamp will cause a hotspot. I can create pre-splitting for the regions. I saw the TSDB video and presentation and their data model; I think it is not suitable for my case. I looked through Google a lot and, to my surprise, there isn't any post about such a classic problem. It

Re: how to model data based on "time bucket"

2013-01-28 Thread Michel Segel
Tough one, in that if your events are keyed on time alone, you will hit a hot spot on write. Reads, not so much... TSDB would be a good start... You may not need 'buckets' but just a time stamp, and set up start and stop key values. Sent from a remote device. Please excuse any typos... Mike

Re: how to model data based on "time bucket"

2013-01-28 Thread Oleg Ruchovets
Hi Rodrigo. Can you please explain your solution in more detail? You said that I will have another table. How many tables will I have? Will I have 2 tables? What will the schema of the tables be? Let me explain what I am trying to achieve: I have ~50 million records like {time|event}. I want to pu

Re: how to model data based on "time bucket"

2013-01-28 Thread Rodrigo Ribeiro
You can use another table as an index, using a rowkey like '{time}:{event_id}', and then scan the range ["10:07", "10:15"). On Mon, Jan 28, 2013 at 10:06 AM, Oleg Ruchovets wrote:

> Hi,
>
> I have such a row data structure:
>
> event_id | time
> =
> event1 | 10:07
> event2 | 10:10
>
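The index-table scan suggested here can be simulated in a few lines of Python: HBase keeps rows sorted lexicographically by rowkey, so a `Scan` with a start and stop key behaves like a slice over a sorted list. The rows below are from the thread's example; the helper is a sketch, not HBase API.

```python
import bisect

# Simulated index table: rowkeys of the form "{time}:{event_id}",
# kept sorted the way HBase sorts rowkeys (lexicographically).
index_rows = sorted([
    "10:07:event1",
    "10:10:event2",
    "10:12:event3",
    "10:20:event4",
])

def scan(rows, start_key, stop_key):
    """Return rows in [start_key, stop_key), like an HBase Scan
    with a start row (inclusive) and stop row (exclusive)."""
    lo = bisect.bisect_left(rows, start_key)
    hi = bisect.bisect_left(rows, stop_key)
    return rows[lo:hi]

print(scan(index_rows, "10:07", "10:15"))
# ['10:07:event1', '10:10:event2', '10:12:event3']
```

Because the time is the key prefix, scanning ["10:07", "10:15") picks up exactly the events in that window without touching the rest of the table.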

how to model data based on "time bucket"

2013-01-28 Thread Oleg Ruchovets
Hi, I have the following row data structure:

event_id | time
=
event1 | 10:07
event2 | 10:10
event3 | 10:12
event4 | 10:20
event5 | 10:23
event6 | 10:25

The number of records is 50-100 million. Question: I need to find the group of events starting from eventX that enter the time window buck
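To pin down what is being asked, the requirement can be stated as a brute-force baseline over the sample rows above: for a starting event, collect every event that occurs within the next T minutes. The message is truncated before the window size is stated, so T is a parameter here; the window boundaries (strictly after the start, strictly less than T minutes later) are an assumption consistent with the examples later in the thread.

```python
# Sample rows from the post.
EVENTS = {
    "event1": "10:07",
    "event2": "10:10",
    "event3": "10:12",
    "event4": "10:20",
    "event5": "10:23",
    "event6": "10:25",
}

def minutes(time_str):
    h, m = time_str.split(":")
    return int(h) * 60 + int(m)

def events_in_window(events, start_event, window):
    """All events strictly after start_event but less than `window`
    minutes later -- a brute-force pass over every row."""
    t0 = minutes(events[start_event])
    return sorted(
        e for e, t in events.items()
        if 0 < minutes(t) - t0 < window
    )

print(events_in_window(EVENTS, "event1", 8))  # ['event2', 'event3']
```

This O(n) pass per query is fine for 6 rows but not for 50-100 million, which is exactly why the thread converges on an index table with time-prefixed rowkeys.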