Re: Avoiding duplicate writes
Peter:

Normally java.lang.System.nanoTime() is used for measuring elapsed time, not wall-clock time, so it isn't suitable as a cell timestamp. See also https://www.javacodegeeks.com/2012/02/what-is-behind-systemnanotime.html

> the prePut co-processor is executed inside a record lock

The prePut hook is called with a read lock held on the underlying region.

Have you heard of HLC (hybrid logical clocks)? See HBASE-14070. The work hasn't been active recently.

FYI

On Thu, Jan 11, 2018 at 2:16 AM, Peter Marron wrote:
> [quoted original message trimmed; see the full post "Avoiding duplicate writes" below]
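One way to get unique timestamps without nanoTime (whose values are not anchored to wall-clock time) is a monotonic allocator in the spirit of the hybrid logical clocks mentioned above. A minimal single-JVM sketch, with a hypothetical class name, is:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: hands out strictly increasing millisecond timestamps
// so two Puts issued in the same millisecond never share a cell timestamp.
// This only guarantees uniqueness within one JVM, not across clients --
// cross-client coordination is what the server-side HLC work targets.
public final class MonotonicTimestamps {
    private static final AtomicLong last = new AtomicLong(0L);

    // Returns max(wall-clock millis, previous + 1), atomically.
    public static long next() {
        long now = System.currentTimeMillis();
        while (true) {
            long prev = last.get();
            long candidate = Math.max(now, prev + 1);
            if (last.compareAndSet(prev, candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        long a = next();
        long b = next();
        System.out.println(a < b); // strictly increasing even within one ms
    }
}
```

Each value would then be set explicitly on the Put. If the allocator outruns the wall clock it simply borrows a few future milliseconds and catches up later.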
Re: Avoiding duplicate writes
Hello Peter,

You can add a random number to the row key to avoid row-key collisions. Even when the timestamps fall within the same millisecond, the random component keeps the rows unique.

On Thu, Jan 11, 2018 at 3:46 PM, Peter Marron wrote:
> [quoted original message trimmed; see the full post "Avoiding duplicate writes" below]

--
Regards,
Lalit Jadhav
Network Component Private Limited.
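The random-suffix suggestion can be sketched as a small rowkey builder. The class name and suffix scheme below are assumptions; here a 16-byte random UUID is appended to the logical key:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical rowkey builder: appends a random 16-byte UUID suffix so two
// records written for the same logical key in the same millisecond land in
// distinct rows instead of overwriting one cell.
public final class UniqueRowKey {
    public static byte[] build(byte[] logicalKey) {
        UUID u = UUID.randomUUID();
        ByteBuffer buf = ByteBuffer.allocate(logicalKey.length + 16);
        buf.put(logicalKey);                       // logical key first, so
        buf.putLong(u.getMostSignificantBits());   // prefix scans still work
        buf.putLong(u.getLeastSignificantBits());
        return buf.array();
    }
}
```

The trade-off: with a random suffix, a plain Get by the logical key no longer finds the row; readers have to scan the logical-key prefix instead.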
Avoiding duplicate writes
Hi,

We have a problem when we are writing lots of records to HBase. We are not specifying timestamps explicitly, and so the situation arises where multiple records are written in the same millisecond. Unfortunately, when the timestamps are the same, later writes are treated as updates of the previous records rather than as separate records (which is what we want). So we want to be able to guarantee that records are not treated as overwrites (unless we explicitly make them so).

As I understand it there are (at least) two different ways to proceed.

The first approach is to increase the resolution of the timestamp. So we could use something like java.lang.System.nanoTime(). However, although this seems to ameliorate the problem, it seems to introduce other problems. Also, ideally we would like something that guarantees we don't lose writes, rather than just making lost writes less likely.

The second approach is to write a prePut co-processor. In the prePut I can do a read using the same rowkey, column family, and column qualifier, omitting the timestamp. As I understand it this will return me the latest timestamp. Then I can adjust the timestamp I am going to write, if necessary, to make sure that it is always unique. In this way I can guarantee that none of my writes are accidentally turned into updates.

However, this approach seems expensive. I have to do a read before each write, and although (I believe) it will be on the same region server, it's still going to slow things down a lot. Also, I am assuming that the prePut co-processor is executed inside a record lock, so that I don't have to worry about synchronization. Is this true?

Is there a better way? Maybe there is some implementation of this already that I can pick up? Maybe there is some way that I can implement this more efficiently?

It seems to me that this might be better handled at compaction. Shouldn't there be some way that I can mark writes with some special timestamp value meaning that the write should never be considered an update, but always a separate write?

Any advice gratefully received.

Peter Marron
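The overwrite behaviour described above follows from how HBase addresses cells: a cell is identified by (row, family, qualifier, timestamp), so two Puts with identical coordinates resolve to the same logical cell. A toy, HBase-free model of that identity rule (all names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of HBase cell identity. A cell is addressed by the tuple
// (row, family, qualifier, timestamp); writes with identical coordinates
// are one logical cell, so the later write replaces the earlier one.
public final class CellOverwriteDemo {
    static String key(String row, String fam, String qual, long ts) {
        return row + "/" + fam + ":" + qual + "@" + ts;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        long ts = 1515628800000L;              // both writes land in the same ms
        store.put(key("r1", "cf", "q", ts), "first");
        store.put(key("r1", "cf", "q", ts), "second"); // replaces "first"
        System.out.println(store.size());      // 1 cell, not 2
    }
}
```

This is why every remedy in the thread works by making one coordinate differ: the timestamp (monotonic allocation in prePut or client-side) or the row (a random suffix).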