Re: Avoiding duplicate writes
Peter:

Normally java.lang.System.nanoTime() is used for measuring elapsed time, not wall-clock time, so it isn't suitable as a cell timestamp. See also https://www.javacodegeeks.com/2012/02/what-is-behind-systemnanotime.html

> the prePut co-processor is executed inside a record lock

The prePut hook is called with a read lock held on the underlying region.

Have you heard of HLC (hybrid logical clocks)? See HBASE-14070. The work hasn't been active recently.

FYI

On Thu, Jan 11, 2018 at 2:16 AM, Peter Marron wrote:
> [quoted original message trimmed; see the full post "Avoiding duplicate writes" below]
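One way to get unique timestamps without nanoTime (whose values are not anchored to wall-clock time) is a monotonic allocator in the spirit of the hybrid logical clocks mentioned above. A minimal single-JVM sketch, with a hypothetical class name, is:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: hands out strictly increasing millisecond timestamps
// so two Puts issued in the same millisecond never share a cell timestamp.
// This only guarantees uniqueness within one JVM, not across clients --
// cross-client coordination is what the server-side HLC work targets.
public final class MonotonicTimestamps {
    private static final AtomicLong last = new AtomicLong(0L);

    // Returns max(wall-clock millis, previous + 1), atomically.
    public static long next() {
        long now = System.currentTimeMillis();
        while (true) {
            long prev = last.get();
            long candidate = Math.max(now, prev + 1);
            if (last.compareAndSet(prev, candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        long a = next();
        long b = next();
        System.out.println(a < b); // strictly increasing even within one ms
    }
}
```

Each value would then be set explicitly on the Put. If the allocator outruns the wall clock it simply borrows a few future milliseconds and catches up later.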
Re: Avoiding duplicate writes
Hello Peter,

You can add a random number to the row key to avoid row-key collisions. Even when the timestamps fall within the same millisecond, the random component keeps the rows unique.

On Thu, Jan 11, 2018 at 3:46 PM, Peter Marron wrote:
> [quoted original message trimmed; see the full post "Avoiding duplicate writes" below]

--
Regards,
Lalit Jadhav
Network Component Private Limited.
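The random-suffix suggestion can be sketched as a small rowkey builder. The class name and suffix scheme below are assumptions; here a 16-byte random UUID is appended to the logical key:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical rowkey builder: appends a random 16-byte UUID suffix so two
// records written for the same logical key in the same millisecond land in
// distinct rows instead of overwriting one cell.
public final class UniqueRowKey {
    public static byte[] build(byte[] logicalKey) {
        UUID u = UUID.randomUUID();
        ByteBuffer buf = ByteBuffer.allocate(logicalKey.length + 16);
        buf.put(logicalKey);                       // logical key first, so
        buf.putLong(u.getMostSignificantBits());   // prefix scans still work
        buf.putLong(u.getLeastSignificantBits());
        return buf.array();
    }
}
```

The trade-off: with a random suffix, a plain Get by the logical key no longer finds the row; readers have to scan the logical-key prefix instead.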
Avoiding duplicate writes
Hi,

We have a problem when we are writing lots of records to HBase. We are not specifying timestamps explicitly, and so the situation arises where multiple records are written in the same millisecond. Unfortunately, when the timestamps are the same, later writes are treated as updates of the previous records rather than as separate records (which is what we want). So we want to be able to guarantee that records are not treated as overwrites (unless we explicitly make them so).

As I understand it there are (at least) two different ways to proceed.

The first approach is to increase the resolution of the timestamp. So we could use something like java.lang.System.nanoTime(). However, although this seems to ameliorate the problem, it seems to introduce other problems. Also, ideally we would like something that guarantees we don't lose writes, rather than just making lost writes less likely.

The second approach is to write a prePut co-processor. In the prePut I can do a read using the same rowkey, column family, and column qualifier, omitting the timestamp. As I understand it this will return me the latest timestamp. Then I can adjust the timestamp I am going to write, if necessary, to make sure that it is always unique. In this way I can guarantee that none of my writes are accidentally turned into updates.

However, this approach seems expensive. I have to do a read before each write, and although (I believe) it will be on the same region server, it's still going to slow things down a lot. Also, I am assuming that the prePut co-processor is executed inside a record lock, so that I don't have to worry about synchronization. Is this true?

Is there a better way? Maybe there is some implementation of this already that I can pick up? Maybe there is some way that I can implement this more efficiently?

It seems to me that this might be better handled at compaction. Shouldn't there be some way that I can mark writes with some special timestamp value meaning that the write should never be considered an update, but always a separate write?

Any advice gratefully received.

Peter Marron
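The overwrite behaviour described above follows from how HBase addresses cells: a cell is identified by (row, family, qualifier, timestamp), so two Puts with identical coordinates resolve to the same logical cell. A toy, HBase-free model of that identity rule (all names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of HBase cell identity. A cell is addressed by the tuple
// (row, family, qualifier, timestamp); writes with identical coordinates
// are one logical cell, so the later write replaces the earlier one.
public final class CellOverwriteDemo {
    static String key(String row, String fam, String qual, long ts) {
        return row + "/" + fam + ":" + qual + "@" + ts;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        long ts = 1515628800000L;              // both writes land in the same ms
        store.put(key("r1", "cf", "q", ts), "first");
        store.put(key("r1", "cf", "q", ts), "second"); // replaces "first"
        System.out.println(store.size());      // 1 cell, not 2
    }
}
```

This is why every remedy in the thread works by making one coordinate differ: the timestamp (monotonic allocation in prePut or client-side) or the row (a random suffix).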