yangz created KUDU-2673:
---------------------------

             Summary: Event timestamp support with kudu.
                 Key: KUDU-2673
                 URL: https://issues.apache.org/jira/browse/KUDU-2673
             Project: Kudu
          Issue Type: Improvement
          Components: java, spark, tserver
    Affects Versions: 1.8.0
            Reporter: yangz
             Fix For: 1.8.0


Kudu can read historical data, but snapshot reads are based on the timestamp 
produced by Kudu's transaction and MVCC system. Relying on that internal 
timestamp greatly limits usability.

In our use case, we write data to Kudu from a data stream and use range 
partitioning by day.

We want to read hourly versions of the data from Kudu, so we need to read 
historical data from Kudu.

Historical data is produced from the undo files. But when a user supplies a 
timestamp, they mean the time at which the event associated with the data 
happened, not the timestamp that Kudu generated. So we need a way to pass an 
event timestamp to Kudu.

We eventually found a way to solve this problem.

But our solution has two limitations.
 # We only update the table row by row, and each row carries its own 
timestamp.
 # To get the right historical version of the data, the data stream must send 
data in event-time order.
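
To illustrate why the second limitation matters, here is a minimal sketch of a 
read "as of" an event time. It is a toy in-memory model, not Kudu code and not 
our actual patch: the class EventTimeHistory and its methods are hypothetical 
names, and rejecting out-of-order rows with an exception is a choice made only 
for this sketch.

```python
import bisect


class EventTimeHistory:
    """Toy per-row version history indexed by event time (not Kudu code)."""

    def __init__(self):
        self._times = {}   # key -> event timestamps, in ascending order
        self._values = {}  # key -> values, parallel to self._times[key]

    def upsert(self, key, event_ts, value):
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        # Assumption for this sketch: rows must arrive in event-time order,
        # so an older event time than the last one seen is rejected.
        if times and event_ts < times[-1]:
            raise ValueError("row arrived out of event-time order")
        times.append(event_ts)
        values.append(value)

    def read_as_of(self, key, event_ts):
        """Return the latest value with event time <= event_ts, or None."""
        times = self._times.get(key, [])
        idx = bisect.bisect_right(times, event_ts)
        return self._values[key][idx - 1] if idx else None
```

Because the versions are appended in event-time order, a binary search is 
enough to find the version visible at a given hour. If rows could arrive out 
of order, the history would have to be re-sorted or rewritten instead.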

Despite these limitations, it meets our current business needs.

 

Our implementation also partly solves the problem of out-of-order event times 
when you only need the newest data, which does not require reading the undo 
files.

For data sent to Kudu with event times t1 < t2:

t1 upsert -> t2 upsert      ->    newest will be the t2 value

t2 upsert -> t1 upsert      ->    current Kudu implementation: t1; our 
implementation: t2.
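
The difference between the two behaviours can be sketched with a toy 
in-memory model. This is not Kudu code and not our actual patch: 
ArrivalOrderStore and EventTimeStore are hypothetical names used only to show 
"last arrival wins" versus "latest event time wins".

```python
class ArrivalOrderStore:
    """The last upsert to arrive wins, regardless of its event time."""

    def __init__(self):
        self.rows = {}  # key -> (event_ts, value)

    def upsert(self, key, event_ts, value):
        self.rows[key] = (event_ts, value)


class EventTimeStore:
    """The upsert with the greatest event time wins."""

    def __init__(self):
        self.rows = {}  # key -> (event_ts, value)

    def upsert(self, key, event_ts, value):
        current = self.rows.get(key)
        # Ignore an upsert whose event time is older than the stored one.
        if current is None or event_ts >= current[0]:
            self.rows[key] = (event_ts, value)


t1, t2 = 1, 2  # event times, t1 < t2
arrival, by_event = ArrivalOrderStore(), EventTimeStore()
for store in (arrival, by_event):
    store.upsert("row", t2, "t2 value")  # t2 arrives first
    store.upsert("row", t1, "t1 value")  # t1 arrives late
```

After the late t1 upsert, arrival.rows["row"] holds the t1 value, while 
by_event.rows["row"] still holds the t2 value.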

 

Our solution may not be the best one for this problem, but I think Kudu 
snapshot reads should support event time.

Our solution does not cover all use cases, but I hope it will be useful to 
the community in some cases.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
