Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

Ron Hitchens Wed, 04 Jun 2014 15:19:34 -0700

   That holds unless your unique-uri() function is running in a non-update 
query, in which case it runs lock-free at a timestamp.  If you're using the 
pattern of main code as a query and updates delegated to invoked/eval'ed 
transactions, you could get bitten by this.  It would work fine the vast 
majority of the time, but you wouldn't be protected from someone else's update 
happening between your check in the query and the execution of your invoked 
update.
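
   A minimal sketch of one way to close that window, assuming the check and the 
insert can live in the same update statement: take the write lock on the URI 
with xdmp:lock-for-update before testing for the document.  The function name 
local:insert-if-absent and the boolean return convention are illustrative 
assumptions, not anything from Rob's code.

```xquery
xquery version "1.0-ml";

(: Sketch only: local:insert-if-absent is a hypothetical helper name.
   Because the body calls xdmp:document-insert, a statement calling this
   function is analyzed as an update and takes locks rather than running
   lock-free at a timestamp. :)
declare function local:insert-if-absent(
  $uri as xs:string,
  $doc as node()
) as xs:boolean
{
  (
    (: take the write lock first, so no other transaction can create
       the document between the check and the insert :)
    xdmp:lock-for-update($uri),
    if (fn:doc-available($uri))
    then fn:false()
    else (xdmp:document-insert($uri, $doc), fn:true())
  )
};

local:insert-if-absent("/doc/example.xml", <doc/>)
```

   A caller that gets false back would pick another URI and try again.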


   DOIs are a perfect example of what I'm talking about.  Or account numbers, 
or patient record IDs, or aircraft tail numbers, etc.  The impact of non-unique 
record identifiers can range from annoying all the way to legally/financially 
costly or even life-threatening if you're managing medication records, for 
example.

---
Ron Hitchens {r...@overstory.co.uk}  +44 7879 358212

On Jun 4, 2014, at 8:49 PM, "Whitby, Rob" <rob.whi...@springer.com> wrote:

> I thought 2 simultaneous transactions would both get read locks on the uri, 
> then one would get a write lock and the other would fail and retry. Maybe I'm 
> missing something though.
> 
> But anyway, I agree unique indexes would be a handy feature. For example, our 
> docs have a DOI element which *should* be unique but occasionally isn't; it 
> would be nice to enforce that rather than have to code defensively.
> 
> Rob
> ________________________________________
> From: general-boun...@developer.marklogic.com 
> [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens 
> [r...@ronsoft.com]
> Sent: 04 June 2014 19:31
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
> 
> Rob,
> 
>   I believe there is a race condition here.  A document may not exist as of 
> the timestamp when this request starts running, but some other request could 
> create one while it's running.  This request would then overwrite that 
> document.
> 
>   I'm actually more concerned about element values inside documents than 
> generating unique document URIs.  It's easy to generate document URIs with 
> 64-bit random numbers that are very unlikely to collide.  But I want to 
> guarantee that some meaningful value inside a document is unique across all 
> documents.
> 
>   In my case, the naming space is actually quite small because I want the IDs 
> to be meaningful but unique.  For example "images:cats:fluffy:XX.png", where 
> XX can increment or be set randomly until the ID is unique.  One way to check 
> for uniqueness is to make the document URI from this ID, then test for an 
> existing document.
> 
>   But this doesn't solve the general problem.  I could conceivably have 
> multiple elements in the document that I want to be unique.  To check for 
> unique element values it's necessary to run a cts query against the 
> element(s).  And I'm not sure if you can completely close the race window 
> between checking for an existing instance and inserting a new one if the 
> query comes back empty.
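> 
>   A rough sketch of how that window might be narrowed in user code, assuming 
> every writer goes through the same function: take a write lock on a synthetic 
> URI derived from the value before running the cts query, so that writers of 
> the same value are serialized.  The element name "id", the "/locks/id/" 
> prefix, and local:insert-with-unique-id are illustrative assumptions.
> 
> ```xquery
> xquery version "1.0-ml";
> 
> (: Sketch only: hypothetical helper; it protects nothing if some other
>    code path inserts documents without taking the same lock. :)
> declare function local:insert-with-unique-id(
>   $uri as xs:string,
>   $id  as xs:string,
>   $doc as element()
> ) as xs:boolean
> {
>   (
>     (: serialize all writers of this id on a lock URI; no document
>        has to exist at that URI :)
>     xdmp:lock-for-update("/locks/id/" || $id),
>     if (fn:exists(cts:search(fn:doc(),
>           cts:element-value-query(xs:QName("id"), $id, "exact"))))
>     then fn:false()
>     else (xdmp:document-insert($uri, $doc), fn:true())
>   )
> };
> ```
> 
>   This only covers a single element; each additional unique element would 
> need its own lock URI and its own query.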
> 
>   Someone from ML pointed out privately that checking for uniqueness in the 
> index would require cross-cluster communication.  I'm sure that's true, but 
> I'm also pretty sure that any user-level code solution is going to be far 
> less efficient.  I'd be happy to pay that ingestion time penalty for the 
> guarantee that indexed element values are unique.  At query time, such a 
> unique value index should perform like any other range index.
> 
> ---
> Ron Hitchens {r...@overstory.co.uk}  +44 7879 358212
> 
> On Jun 4, 2014, at 6:59 PM, "Whitby, Rob" <rob.whi...@springer.com> wrote:
> 
>> How about something like this?
>> 
>> declare function local:unique-uri() as xs:string {
>> let $uri := "/doc/" || xdmp:random() || ".xml"
>> return if (fn:not(fn:doc-available($uri))) then $uri else local:unique-uri()
>> };
>> 
>> I guess because indexes are distributed across forests, ensuring uniqueness 
>> is not that easy?
>> 
>> Rob
>> ________________________________________
>> From: general-boun...@developer.marklogic.com 
>> [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens 
>> [r...@ronsoft.com]
>> Sent: 04 June 2014 18:01
>> To: MarkLogic Developer Discussion
>> Subject: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
>> 
>>  I'm working on a project, one aspect of which requires minting unique IDs 
>> and assuring that no two documents with the same ID wind up in the database. 
>>  I know how to accomplish this using locks (I'm pretty sure) but any such 
>> implementation is awkward and prone to subtle edge case errors, and can be 
>> difficult to test.
>> 
>>  It seems to me that this is something that MarkLogic could do much more 
>> reliably and quickly than any user-level code.  The thought that occurred to 
>> me is a variation on range indexes which only allow a single instance of any 
>> given value.
>> 
>>  Conventional range indexes work by creating term lists that look like this 
>> (see Jason Hunter's ML Architecture paper), where each term list contains an 
>> element (or attribute) value and a list of fragment IDs where that term 
>> exists.
>> 
>> aardvark | 23, 135, 469, 611
>> ant      | 23, 469, 558, 611, 750
>> baboon   | 53, 97, 469, 621
>> etc...
>> 
>>  By making a range index like this but which only allows a single fragment 
>> ID in the list, that would ensure that no two documents in the database 
>> contain a given element with the same value.  That is, attempting to add a 
>> second document with the same element or attribute value would cause an 
>> exception.  And being a range index, it would provide a fast lexicon of all 
>> the current unique values in the DB.
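>> 
>>  For comparison, a sketch of what an ordinary element range index already 
>> allows, assuming a string range index (default collation) is configured on a 
>> hypothetical "id" element: the lexicon can be read with cts:element-values, 
>> and cts:frequency can flag values that occur in more than one fragment.  
>> This detects duplicates after the fact; it does not prevent them.
>> 
>> ```xquery
>> xquery version "1.0-ml";
>> 
>> (: Sketch only: assumes an element range index on <id>.
>>    Returns the values of <id> that appear in more than one fragment. :)
>> for $value in cts:element-values(xs:QName("id"))
>> where cts:frequency($value) gt 1
>> return $value
>> ```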
>> 
>>  Such an index would look something like this:
>> 
>> abc3vk34 | 17
>> bkx46lkd | 52
>> bz1d34nm | 37
>> etc...
>> 
>>  Usage could be something like this:
>> 
>> declare function local:create-new-id-doc ($id-root as xs:string) as xs:string
>> {
>>   try {
>>       let $id := $id-root || "-" || mylib:random-string(8)
>>       let $uri := "/idregistry/id-" || $id
>>       let $_ :=
>>           xdmp:document-insert ($uri,
>>               <registered-id>
>>                   <id>{ $id }</id>
>>                   <created>{ fn:current-dateTime() }</created>
>>               </registered-id>)
>>       return $id
>>   } catch ($e) {
>>       (: hypothetical: the unique index rejected a duplicate id, so retry :)
>>       local:create-new-id-doc ($id-root)
>>   }
>> };
>> 
>>  This doesn't require that I write any (possibly buggy) mutual exclusion 
>> code and I can be confident that once the xdmp:document-insert succeeds that 
>> the ID is unique in the database and that the type (as configured for the 
>> range index) is correct.
>> 
>>  Any love for Unique Value Range Indexes in the next version of MarkLogic?
>> 
>> ---
>> Ron Hitchens {r...@overstory.co.uk}  +44 7879 358212
>> 
>> _______________________________________________
>> General mailing list
>> General@developer.marklogic.com
>> http://developer.marklogic.com/mailman/listinfo/general
