[ 
https://issues.apache.org/jira/browse/PHOENIX-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531762#comment-14531762
 ] 

Lars Hofhansl commented on PHOENIX-1954:
----------------------------------------

It's several issues that come together:
# We have OLTP-type processes trickle-inserting data into a table, generating 
ids (from sequences).
# We need to add bulk loads of hundreds of millions or billions of rows via M/R. 
These need to share the _same_ id space, i.e. no duplicate ids.
# The M/R job needs to be idempotent, i.e. a mapper can restart itself upon 
failure. In that case some data will have been inserted into HBase already. 
Reading the same input, it _must_ produce the same output (including the same 
ids), i.e. the M/R job must never re-insert the same data under a different id.

With current sequences we cannot satisfy all three. But it's so close: all we 
need is the ability to reserve a range of ids. We can then use the reserved id 
space to assign ids in the various mapper jobs.
We could, of course, bypass sequences and reserve ids directly with 
HBase Increments, but then we would go around Phoenix and we'd have to use that 
other mechanism everywhere we insert ids.
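
To illustrate how a reserved range could make the mappers idempotent, here is a minimal sketch in plain Java. It is not Phoenix or Hadoop code: the names ReservedRangeIds, Range, partition and idAt are hypothetical, and it assumes the row count of each input split is known before the job starts. Each mapper gets a fixed sub-range, and a row's id depends only on its position within the split, so a restarted mapper reproduces the same ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Phoenix or Hadoop.
class ReservedRangeIds {

    /** Half-open id range [start, start + count). */
    static final class Range {
        final long start;
        final long count;

        Range(long start, long count) {
            this.start = start;
            this.count = count;
        }

        /** Id for the row at the given zero-based offset within this range. */
        long idAt(long offset) {
            if (offset < 0 || offset >= count) {
                throw new IllegalArgumentException("offset outside range: " + offset);
            }
            return start + offset;
        }
    }

    /**
     * Cuts the reserved range into one sub-range per input split.
     * Assumption: rowsPerSplit is known up front and is the same on every run.
     */
    static List<Range> partition(Range reserved, long[] rowsPerSplit) {
        List<Range> out = new ArrayList<>();
        long offset = 0;
        for (long rows : rowsPerSplit) {
            if (rows < 0 || rows > reserved.count - offset) {
                throw new IllegalArgumentException("reserved range too small");
            }
            out.add(new Range(reserved.start + offset, rows));
            offset += rows;
        }
        return out;
    }

    public static void main(String[] args) {
        // Pretend the sequence handed us [1000, 1060) for three splits.
        Range reserved = new Range(1000L, 60L);
        List<Range> perMapper = partition(reserved, new long[] {10, 20, 30});

        // Mapper 1 processes its sixth row (offset 5); a restart yields the same id.
        long firstAttempt = perMapper.get(1).idAt(5);
        long afterRestart = perMapper.get(1).idAt(5);
        System.out.println(firstAttempt + " " + afterRestart); // prints "1015 1015"
    }
}
```

If per-split row counts are not known in advance, a fixed stride per split (at the cost of unused ids) would be one alternative.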


> Reserve chunks of numbers for a sequence
> ----------------------------------------
>
>                 Key: PHOENIX-1954
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1954
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>
> In order to be able to generate many ids in bulk (for example in map reduce 
> jobs) we need a way to generate or reserve large sets of ids. We also need to 
> mix ids reserved with incrementally generated ids from other clients. 
> For this we need to atomically increment the sequence and return the value it 
> had when the increment happened.
> If we're OK with throwing the current cached set of values away, we can do
> {{NEXT VALUE FOR <seq>(,<N>)}}, which needs to increment the value and return 
> the value it incremented from (i.e. it has to throw the current cache away, and 
> return the next value it found at the server).
> Or we can invent a new syntax {{RESERVE VALUES FOR <seq>, <N>}} that does the 
> same, but does not invalidate the cache.
> Note that in either case we won't retrieve the reserved set of values via 
> {{NEXT VALUE FOR}}, because we need to be idempotent. In our case, all we 
> need to guarantee is that after a call to {{RESERVE VALUES FOR <seq>, <N>}} 
> that returns a value <M>, the range [M, M+N) won't be used by any 
> other user of the sequence. We might need to reserve 1bn ids this way ahead of a 
> map reduce run.
> Any better ideas?
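
The guarantee described in the issue (a reserve call returns <M>, and [M, M+N) is never handed to anyone else) is essentially an atomic fetch-and-add. The sketch below models that in memory with java.util.concurrent.atomic.AtomicLong. It is only an illustration of the semantics: SequenceModel, nextValue and reserve are hypothetical names, and it ignores Phoenix's client-side caching and INCREMENT BY settings.

```java
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for a sequence; not the Phoenix implementation.
class SequenceModel {
    // The next value the "server" would hand out.
    private final AtomicLong next;

    SequenceModel(long startWith) {
        this.next = new AtomicLong(startWith);
    }

    /** Models NEXT VALUE FOR: hands out a single id. */
    long nextValue() {
        return next.getAndIncrement();
    }

    /**
     * Models the proposed RESERVE VALUES FOR <seq>, <N>:
     * returns M, after which [M, M + n) belongs to the caller.
     */
    long reserve(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // getAndAdd returns the value before the increment, atomically.
        return next.getAndAdd(n);
    }

    public static void main(String[] args) {
        SequenceModel seq = new SequenceModel(1L);
        long before = seq.nextValue();          // OLTP client takes id 1
        long m = seq.reserve(1_000_000_000L);   // bulk job reserves [2, 1000000002)
        long after = seq.nextValue();           // OLTP client continues past the range
        System.out.println(before + " " + m + " " + after); // prints "1 2 1000000002"
    }
}
```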



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
