Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
I'm sure you will find lots of mentions of this on MarkMail if you search for "unique id" and "random". MarkLogic has used that same method internally for creating its own objects. The idea is indeed that you run it in update mode, in the same transaction in which you plan to do the insert. The lookahead takes a read lock, which causes writes from other transactions to wait and retry if necessary.

Cheers,
Geert

-----Original Message-----
From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Ron Hitchens
Sent: Thursday 5 June 2014 00:19
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

Unless your unique-uri() function is running in a non-update query, in which case it runs lock-free at a timestamp. If you're using the pattern of main code as a query and updates delegated to invoked/eval'ed transactions, you could get bitten by this. It would work fine the vast majority of the time, but you wouldn't be protected from someone else's update happening between your check in the query and the execution of your invoked update.

DOIs are a perfect example of what I'm talking about. Or account numbers, or patient record IDs, or aircraft tail numbers, etc. The impact of non-unique record identifiers can range from annoying all the way to legally or financially costly, or even life-threatening if you're managing medication records, for example.

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212
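To make the "same transaction" point concrete, here is a minimal sketch of an update module that both picks the URI and performs the insert, so the check cannot go stale between a query-mode caller and the update. The external variable $doc and the local: function name are assumptions for illustration. The explicit xdmp:lock-for-update() call is optional extra caution: as described above, fn:doc-available() already takes a read lock when it runs inside an update transaction.

```xquery
xquery version "1.0-ml";

(: Hypothetical update module, e.g. invoked from a query-mode caller.
   $doc is an assumed external variable; the /doc/ URI scheme follows the thread. :)
declare variable $doc as element() external;

declare function local:unique-uri() as xs:string
{
  let $uri := "/doc/" || xdmp:random() || ".xml"
  (: take the write lock before looking, so a competing insert has to wait :)
  let $_ := xdmp:lock-for-update($uri)
  return
    if (fn:doc-available($uri))
    then local:unique-uri()
    else $uri
};

(: the check and the insert commit together, in this one transaction :)
let $uri := local:unique-uri()
return (xdmp:document-insert($uri, $doc), $uri)
```

Because the module calls xdmp:document-insert(), MarkLogic treats it as an update statement, which is what makes the locks apply.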
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
The general topic of generating unique IDs is even a lot older. I like the idea of the database being able to impose a uniqueness constraint on anything stored in it. It is much more difficult to guarantee that code behaves correctly than to impose such an assertion. Interesting thought to use (range) indexes for that; I hadn't heard that one before!

Cheers,
Geert

From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Wayne Feick
Sent: Thursday 5 June 2014 00:12
To: general@developer.marklogic.com
Subject: Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

Fair points, Ron. We have RFE 2322, filed back in Feb 2012, to track this. I'll add a note indicating your interest as well.

Wayne.

On 06/04/2014 03:00 PM, Ron Hitchens wrote:

> Wayne,
>
> Thanks for this. It's a useful code pattern for this sort of thing and I will probably use it for the specific requirement I have at the moment (I was planning to do something similar anyway).
>
> But this code, or any user-level code, does not fully implement the uniqueness guarantee I'd like to have and that I think a specialized range index could easily provide. This will work, but as you say it would be necessary to always use this code convention. It would not prevent creation of duplicate values by code that doesn't follow the convention.
>
> If uniqueness were enforced by the index, then I could be confident that uniqueness is absolutely guaranteed and I don't need to trust anyone (including my future self) to always follow the same locking protocol.
>
> ---
> Ron Hitchens {r...@overstory.co.uk} +44 7879 358212
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Rob,

I believe there is a race condition here. A document may not exist as of the timestamp when this request starts running, but some other request could create one while it's running. This request would then overwrite that document.

I'm actually more concerned about element values inside documents than generating unique document URIs. It's easy to generate document URIs with 64-bit random numbers that are very unlikely to collide. But I want to guarantee that some meaningful value inside a document is unique across all documents.

In my case, the naming space is actually quite small because I want the IDs to be meaningful but unique. For example images:cats:fluffy:XX.png, where XX can increment or be set randomly until the ID is unique.

One way to check for uniqueness is to make the document URI from this ID, then test for an existing document. But this doesn't solve the general problem. I could conceivably have multiple elements in the document that I want to be unique.

To check for unique element values it's necessary to run a cts query against the element(s). And I'm not sure if you can completely close the race window between checking for an existing instance and inserting a new one if the query comes back empty.

Someone from ML pointed out privately that checking for uniqueness in the index would require cross-cluster communication. I'm sure that's true, but I'm also pretty sure that any user-level code solution is going to be far less efficient. I'd be happy to pay that ingestion-time penalty for the guarantee that indexed element values are unique. At query time, such a unique value index should perform like any other range index.

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

On Jun 4, 2014, at 6:59 PM, Whitby, Rob <rob.whi...@springer.com> wrote:

> How about something like this?
>
>     declare function unique-uri() {
>         let $uri := "/doc/" || xdmp:random() || ".xml"
>         return
>             if (fn:not(fn:doc-available($uri))) then $uri
>             else unique-uri()
>     };
>
> I guess because indexes are distributed across forests, ensuring uniqueness is not that easy?
>
> Rob
>
> From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens [r...@ronsoft.com]
> Sent: 04 June 2014 18:01
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
>
> I'm working on a project, one aspect of which requires minting unique IDs and assuring that no two documents with the same ID wind up in the database. I know how to accomplish this using locks (I'm pretty sure) but any such implementation is awkward and prone to subtle edge-case errors, and can be difficult to test.
>
> It seems to me that this is something that MarkLogic could do much more reliably and quickly than any user-level code. The thought that occurred to me is a variation on range indexes which only allow a single instance of any given value.
>
> Conventional range indexes work by creating term lists that look like this (see Jason Hunter's ML Architecture paper), where each term list contains an element (or attribute) value and a list of fragment IDs where that term exists.
>
>     aardvark | 23, 135, 469, 611
>     ant      | 23, 469, 558, 611, 750
>     baboon   | 53, 97, 469, 621
>     etc...
>
> By making a range index like this but which only allows a single fragment ID in the list, that would ensure that no two documents in the database contain a given element with the same value. That is, attempting to add a second document with the same element or attribute value would cause an exception. And being a range index, it would provide a fast lexicon of all the current unique values in the DB.
>
> Such an index would look something like this:
>
>     abc3vk34 | 17
>     bkx46lkd | 52
>     bz1d34nm | 37
>     etc...
>
> Usage could be something like this:
>
>     declare function create-new-id-doc ($id-root as xs:string) as xs:string
>     {
>         try {
>             let $id := $id-root || "-" || mylib:random-string(8)
>             let $uri := "/idregistry/id-" || $id
>             let $_ := xdmp:document-insert ($uri,
>                 <registered-id>
>                     <id>{ $id }</id>
>                     <created>{ fn:current-dateTime() }</created>
>                 </registered-id>)
>             return $id
>         } catch ($e) {
>             create-new-id-doc ($id-root)
>         }
>     };
>
> This doesn't require that I write any (possibly buggy) mutual exclusion code and I can be confident that once the xdmp:document-insert succeeds that the ID is unique in the database and that the type (as configured for the range index) is correct.
>
> Any love for Unique Value Range Indexes in the next version of MarkLogic?
>
> ---
> Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

___
General mailing list
General@developer.marklogic.com
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Maybe you could consider using sem:uuid() in MarkLogic 7? You are much better off with a statistically unique ID than actually taking the time and massive concurrency reduction to check uniqueness.

John

--
John Snelson, Lead Engineer  http://twitter.com/jpcs
MarkLogic Corporation        http://www.marklogic.com
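As a rough illustration of the statistically-unique approach John suggests, the sketch below mints an ID with sem:uuid-string() (the string-returning sibling of sem:uuid()) and inserts a registry document without any uniqueness check. The /idregistry/ URI scheme and the registered-id document shape are borrowed from Ron's example and are assumptions here, not an established convention.

```xquery
xquery version "1.0-ml";

import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Sketch only: no lookup and no extra locking, relying on the UUID
   being statistically unique. :)
let $id  := sem:uuid-string()
let $uri := "/idregistry/id-" || $id || ".xml"
return (
  xdmp:document-insert($uri,
    <registered-id>
      <id>{ $id }</id>
      <created>{ fn:current-dateTime() }</created>
    </registered-id>),
  $id
)
```

The trade-off is the one discussed in the thread: this avoids the concurrency cost of checking, but it yields opaque IDs rather than the small, meaningful ones Ron wants.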
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
On 04/06/2014 19:31, Ron Hitchens wrote:

> In my case, the naming space is actually quite small because I want the IDs to be meaningful but unique. For example images:cats:fluffy:XX.png, where XX can increment or be set randomly until the ID is unique.

Make XX a random number. Or two or more random numbers, until the statistical likelihood of a collision is small enough that you don't care about checking uniqueness anymore.

John

--
John Snelson, Lead Engineer  http://twitter.com/jpcs
MarkLogic Corporation        http://www.marklogic.com
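How large the random part needs to be can be estimated with the standard birthday-problem approximation. The numbers below are illustrative assumptions, not figures from the thread: $N$ is the number of equally likely random values and $n$ is the number of IDs minted under one prefix.

```latex
% Probability that at least two of n uniformly random IDs collide,
% when each ID is drawn from N possible values (birthday approximation):
\[
  p(n, N) \;\approx\; 1 - \exp\!\left(-\frac{n(n-1)}{2N}\right)
\]
% Assumed example 1: XX is two decimal digits, so N = 100, and n = 12:
\[
  p(12, 100) \;\approx\; 1 - e^{-132/200} \;=\; 1 - e^{-0.66} \;\approx\; 0.48
\]
% Assumed example 2: a 64-bit random number, N = 2^{64} \approx 1.8 \times 10^{19},
% and n = 10^{6} documents:
\[
  p(10^{6}, 2^{64}) \;\approx\; \frac{n^{2}}{2N}
  \;\approx\; \frac{10^{12}}{3.7 \times 10^{19}} \;\approx\; 2.7 \times 10^{-8}
\]
```

Under these assumptions a two-digit suffix collides about half the time after only a dozen IDs, which is why a small, meaningful naming space still needs either a check or a much wider random component.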
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
I thought 2 simultaneous transactions would both get read locks on the uri, then one would get a write lock and the other would fail and retry. Maybe I'm missing something though.

But anyway, I agree unique indexes would be a handy feature. e.g. our docs have a DOI element which *should* be unique but occasionally aren't, would be nice to enforce that rather than have to code defensively.

Rob
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
The simplest is to have the document URI correspond to the element value, and if you can use a random value it's good for concurrency.

If you can't do that, but you want to ensure only one document can have a particular value for an element, I think it's pretty easy using xdmp:lock-for-update() on a URI that corresponds to the element value. You don't actually need to create a document at that URI, just use it to serialize transactions.

Here's one way to do it.

    declare function lock-element-value($qn as xs:QName, $v as item())
    {
        xdmp:lock-for-update(
            "http://acme.com/"
            || xdmp:hash64(fn:namespace-uri-from-QName($qn))
            || "/"
            || xdmp:hash64(fn:local-name-from-QName($qn))
            || "/"
            (: include the value, so only transactions on the same value serialize :)
            || xdmp:hash64(fn:string($v)))
    };

You'd then do something like the following.

    let $lock := lock-element-value($qn, $v)
    let $existing := cts:search(fn:collection(),
        cts:element-range-query($qn, "=", $v), "unfiltered")
    return
        if (fn:exists($existing)) then
            ... do whatever you need to do with the existing document
        else
            ... create a new document, safe from a race with another transaction

You'd want to use lock-element-value() in any updates that could affect a change in the element value (insert, update, delete). I think you could get away with ignoring deletes since those would automatically serialize with any transaction that would modify the existing document.

We use this sort of pattern internally to ensure uniqueness of IDs.

Wayne.
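Applied to the DOI case raised earlier in the thread, Wayne's lock-then-search pattern might look roughly like the module below. Everything specific in it is an assumption for illustration: a string element range index on a no-namespace doi element, the /articles/ URI scheme, the DUPLICATE-DOI error name, and the choice to hash the value into the lock URI so that only transactions on the same DOI serialize.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes a string element range index on <doi>. :)
declare variable $doi as xs:string external;

declare function local:lock-element-value($qn as xs:QName, $v as item())
{
  (: no document needs to exist at this URI; it only serializes transactions :)
  xdmp:lock-for-update(
    "http://acme.com/"
    || xdmp:hash64(fn:namespace-uri-from-QName($qn)) || "/"
    || xdmp:hash64(fn:local-name-from-QName($qn)) || "/"
    || xdmp:hash64(fn:string($v)))
};

let $qn := fn:QName("", "doi")
let $_ := local:lock-element-value($qn, $doi)
let $existing := cts:search(fn:collection(),
                   cts:element-range-query($qn, "=", $doi), "unfiltered")
return
  if (fn:exists($existing))
  then fn:error(xs:QName("DUPLICATE-DOI"), "DOI already registered: " || $doi)
  else xdmp:document-insert(
         "/articles/" || xdmp:hash64($doi) || ".xml",
         <article><doi>{ $doi }</doi></article>)
```

As Ron notes elsewhere in the thread, this only protects against duplicates created by code that follows the same locking convention.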
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Hi guys,

How can I unsubscribe from this mailing list?

Kind regards,

Johan van den Brink
Consultant
Analyze That - Analytics | Data Integration | Reporting | Process Mining
Kerkewijk 8
3901 EG Veenendaal
T: (06) 49 92 30 30
T: (0318) 52 55 87
M: jo...@analyzethat.nl
W: www.analyzethat.nl
L: http://nl.linkedin.com/in/brinkjohanvanden
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Hi. I believe you can do that here: http://developer.marklogic.com/mailman/listinfo/general

Kind Regards,
David Ennis

On 4 June 2014 23:09, Analyze That | Johan van den Brink jo...@analyzethat.nl wrote:

Hi guys,

How can I unsubscribe from this mailing list?

Kind regards,
Johan van den Brink
Consultant
Analyze That - Analytics | Data Integration | Reporting | Process Mining
Kerkewijk 8, 3901 EG Veenendaal
T: (06) 49 92 30 30
T: (0318) 52 55 87
M: jo...@analyzethat.nl
W: www.analyzethat.nl
L: http://nl.linkedin.com/in/brinkjohanvanden

-----Original Message-----
From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Whitby, Rob
Sent: Wednesday, 4 June 2014 19:59
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

How about something like this?

    declare function unique-uri() {
      let $uri := "/doc/" || xdmp:random() || ".xml"
      return
        if (fn:not(fn:doc-available($uri)))
        then $uri
        else unique-uri()
    };

I guess because indexes are distributed across forests, ensuring uniqueness is not that easy?

Rob

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens [r...@ronsoft.com]
Sent: 04 June 2014 18:01
To: MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

I'm working on a project, one aspect of which requires minting unique IDs and assuring that no two documents with the same ID wind up in the database. I know how to accomplish this using locks (I'm pretty sure), but any such implementation is awkward, prone to subtle edge-case errors, and can be difficult to test.

It seems to me that this is something that MarkLogic could do much more reliably and quickly than any user-level code. The thought that occurred to me is a variation on range indexes which only allows a single instance of any given value.

Conventional range indexes work by creating term lists that look like this (see Jason Hunter's ML Architecture paper), where each term list contains an element (or attribute) value and a list of fragment IDs where that term exists.

    aardvark | 23, 135, 469, 611
    ant      | 23, 469, 558, 611, 750
    baboon   | 53, 97, 469, 621
    etc...

Making a range index like this, but one that allows only a single fragment ID in each list, would ensure that no two documents in the database contain a given element with the same value. That is, attempting to add a second document with the same element or attribute value would cause an exception. And being a range index, it would provide a fast lexicon of all the current unique values in the DB.

Such an index would look something like this:

    abc3vk34 | 17
    bkx46lkd | 52
    bz1d34nm | 37
    etc...

Usage could be something like this:

    declare function create-new-id-doc ($id-root as xs:string) as xs:string
    {
      try {
        let $id := $id-root || "-" || mylib:random-string(8)
        let $uri := "/idregistry/id-" || $id
        let $_ :=
          xdmp:document-insert ($uri,
            <registered-id>
              <id>{ $id }</id>
              <created>{ fn:current-dateTime() }</created>
            </registered-id>)
        return $id
      } catch ($e) {
        (: the insert was rejected as a duplicate: try again with a new random suffix :)
        create-new-id-doc ($id-root)
      }
    };

This doesn't require that I write any (possibly buggy) mutual exclusion code, and I can be confident that once the xdmp:document-insert succeeds the ID is unique in the database and that the type (as configured for the range index) is correct.

Any love for Unique Value Range Indexes in the next version of MarkLogic?

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general
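For comparison with the proposed index, the URI-based check suggested in this thread can be sketched with today's primitives. This is only a minimal sketch, not anything MarkLogic ships: the function name local:register-id is hypothetical, the /idregistry/ prefix is borrowed from the usage example above, and the module is assumed to run as an update so that xdmp:lock-for-update takes a write lock on the registry URI before the existence check.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: claims $id by creating a registry document whose
   URI is derived from the ID. Returns true if this call claimed the ID,
   false if a registry document for it already exists. :)
declare function local:register-id($id as xs:string) as xs:boolean
{
  let $uri := "/idregistry/id-" || $id
  (: Take the write lock up front so concurrent callers for the same ID
     serialize here instead of racing past the existence check. :)
  let $_ := xdmp:lock-for-update($uri)
  return
    if (fn:doc-available($uri))
    then fn:false()
    else (
      xdmp:document-insert($uri,
        <registered-id>
          <id>{ $id }</id>
          <created>{ fn:current-dateTime() }</created>
        </registered-id>),
      fn:true()
    )
};

local:register-id("abc3vk34")
```

Unlike an index-level constraint, this protects only the writers that go through the helper, and it covers one value per registry URI rather than arbitrary elements inside documents.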
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Thanks David.

Kind regards,
Johan van den Brink
Consultant
Analyze That - Analytics | Data Integration | Reporting | Process Mining
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Wayne,

Thanks for this. It's a useful code pattern for this sort of thing and I will probably use it for the specific requirement I have at the moment (I was planning to do something similar anyway).

But this code, or any user-level code, does not fully implement the uniqueness guarantee I'd like to have and that I think a specialized range index could easily provide. This will work, but as you say it would be necessary to always use this code convention. It would not prevent creation of duplicate values by code that doesn't follow the convention.

If uniqueness were enforced by the index, then I could be confident that uniqueness is absolutely guaranteed and I wouldn't need to trust anyone (including my future self) to always follow the same locking protocol.

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

On Jun 4, 2014, at 9:19 PM, Wayne Feick wayne.fe...@marklogic.com wrote:

The simplest approach is to have the document URI correspond to the element value, and if you can use a random value it's good for concurrency.

If you can't do that, but you want to ensure only one document can have a particular value for an element, I think it's pretty easy using xdmp:lock-for-update() on a URI that corresponds to the element value. You don't actually need to create a document at that URI, just use it to serialize transactions.

Here's one way to do it.

    declare function lock-element-value($qn as xs:QName, $v as item())
    {
      xdmp:lock-for-update(
        "http://acme.com/"
        || xdmp:hash64(fn:namespace-uri-from-QName($qn))
        || "/"
        || xdmp:hash64(fn:local-name-from-QName($qn)))
    };

You'd then do something like the following.

    let $lock := lock-element-value($qn, $v)
    let $existing :=
      cts:search(fn:collection(),
        cts:element-range-query($qn, "=", $v), "unfiltered")
    return
      if (fn:exists($existing))
      then ... do whatever you need to do with the existing document
      else ... create a new document, safe from a race with another transaction

You'd want to use lock-element-value() in any updates that could change the element value (insert, update, delete). I think you could get away with ignoring deletes, since those would automatically serialize with any transaction that would modify the existing document.

We use this sort of pattern internally to ensure uniqueness of IDs.

Wayne.

On 06/04/2014 12:49 PM, Whitby, Rob wrote:

I thought 2 simultaneous transactions would both get read locks on the URI, then one would get a write lock and the other would fail and retry. Maybe I'm missing something though.

But anyway, I agree unique indexes would be a handy feature. For example, our docs have a DOI element which *should* be unique but occasionally isn't; it would be nice to enforce that rather than have to code defensively.

Rob

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens [r...@ronsoft.com]
Sent: 04 June 2014 19:31
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

Rob,

I believe there is a race condition here. A document may not exist as of the timestamp when this request starts running, but some other request could create one while it's running. This request would then overwrite that document.

I'm actually more concerned about element values inside documents than generating unique document URIs. It's easy to generate document URIs with 64-bit random numbers that are very unlikely to collide. But I want to guarantee that some meaningful value inside a document is unique across all documents.

In my case, the naming space is actually quite small because I want the IDs to be meaningful but unique. For example images:cats:fluffy:XX.png, where XX can increment or be set randomly until the ID is unique. One way to check for uniqueness is to make the document URI from this ID, then test for an existing document.

But this doesn't solve the general problem. I could conceivably have multiple elements in the document that I want to be unique. To check for unique element values it's necessary to run a cts query against the element(s). And I'm not sure if you can completely close the race window between checking for an existing instance and inserting a new one if the query comes back empty.

Someone from ML pointed out privately that checking for uniqueness in the index would require cross-cluster communication. I'm sure that's true, but I'm also pretty sure that any user-level code solution is going to be far less efficient. I'd be happy to pay that ingestion time penalty for the guarantee that indexed element values are unique. At query time, such a unique value index should perform like any other range index.

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

On Jun 4, 2014
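As posted, lock-element-value accepts $v but builds the lock URI from the QName alone, so every writer of that element would queue on a single lock. If the intent is one lock per element value, as the surrounding prose suggests, a variant could fold a hash of the value into the URI. The following is a sketch under that assumption rather than the code MarkLogic uses internally: the local: function names, the http://acme.com/ prefix, and the example document are illustrative, and an element range index of matching type is assumed to exist on the element.

```xquery
xquery version "1.0-ml";

(: Variant of the lock function that also hashes the value into the lock
   URI. Including the value is an assumption about the intent, not the
   function as posted. No document is ever created at this URI. :)
declare function local:lock-element-value(
  $qn as xs:QName,
  $v as xs:anyAtomicType
) as empty-sequence()
{
  xdmp:lock-for-update(
    "http://acme.com/"
    || xdmp:hash64(fn:namespace-uri-from-QName($qn))
    || "/" || xdmp:hash64(fn:local-name-from-QName($qn))
    || "/" || xdmp:hash64(fn:string($v)))
};

(: Inserts $doc at $uri only if no document already holds $v in element $qn.
   Returns true when the insert was made, false when the value was taken. :)
declare function local:insert-if-unique(
  $uri as xs:string,
  $doc as node(),
  $qn as xs:QName,
  $v as xs:anyAtomicType
) as xs:boolean
{
  let $_ := local:lock-element-value($qn, $v)
  let $existing :=
    cts:search(fn:collection(),
      cts:element-range-query($qn, "=", $v), "unfiltered")
  return
    if (fn:exists($existing))
    then fn:false()
    else (xdmp:document-insert($uri, $doc), fn:true())
};

local:insert-if-unique(
  "/articles/example.xml",
  <article><doi>10.1000/example</doi></article>,
  xs:QName("doi"),
  "10.1000/example")
```

Hashing the value keeps unrelated values from contending with each other. The caveat raised in the reply still applies: this only helps when every writer goes through the same function.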
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Fair points, Ron. We have RFE 2322, filed back in Feb 2012, to track this. I'll add a note indicating your interest as well.

Wayne.
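Since code that skips the locking convention can still introduce duplicates, it may be worth auditing for them after the fact. A minimal sketch, assuming an element range index already exists on the element in question; the doi element name is only an example taken from the thread:

```xquery
xquery version "1.0-ml";

(: List every indexed value of <doi> that occurs in more than one fragment.
   Requires an element range index on doi; the element name is illustrative. :)
for $value in cts:element-values(xs:QName("doi"), (), "fragment-frequency")
where cts:frequency($value) gt 1
return $value
```

This reads only the range index lexicon, so it can run as a plain query, but it detects duplicates rather than preventing them.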
Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes
Thanks Wayne.

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

On Jun 4, 2014, at 11:12 PM, Wayne Feick wayne.fe...@marklogic.com wrote:

Fair points, Ron. We have RFE 2322, filed back in Feb 2012, to track this. I'll add a note indicating your interest as well.

Wayne.

On 06/04/2014 03:00 PM, Ron Hitchens wrote:

Wayne,

Thanks for this. It's a useful code pattern for this sort of thing and I will probably use it for the specific requirement I have at the moment (I was planning to do something similar anyway).

But this code, or any user-level code, does not fully implement the uniqueness guarantee I'd like to have and that I think a specialized range index could easily provide. This will work, but as you say it would be necessary to always use this code convention. It would not prevent creation of duplicate values by code that doesn't follow the convention.

If uniqueness were enforced by the index, then I could be confident that uniqueness is absolutely guaranteed and I wouldn't need to trust anyone (including my future self) to always follow the same locking protocol.

---
Ron Hitchens {r...@overstory.co.uk} +44 7879 358212

On Jun 4, 2014, at 9:19 PM, Wayne Feick wayne.fe...@marklogic.com wrote:

The simplest approach is to have the document URI correspond to the element value, and if you can use a random value it's good for concurrency.

If you can't do that, but you want to ensure only one document can have a particular value for an element, I think it's pretty easy using xdmp:lock-for-update() on a URI that corresponds to the element value. You don't actually need to create a document at that URI, just use it to serialize transactions.

Here's one way to do it.

    declare function lock-element-value($qn as xs:QName, $v as item())
    {
      xdmp:lock-for-update(
        "http://acme.com/"
        || xdmp:hash64(fn:namespace-uri-from-QName($qn))
        || "/"
        || xdmp:hash64(fn:local-name-from-QName($qn)))
    };

You'd then do something like the following.

    let $lock := lock-element-value($qn, $v)
    let $existing :=
      cts:search(fn:collection(),
        cts:element-range-query($qn, "=", $v), "unfiltered")
    return
      if (fn:exists($existing))
      then ... do whatever you need to do with the existing document
      else ... create a new document, safe from a race with another transaction

You'd want to use lock-element-value() in any updates that could change the element value (insert, update, delete). I think you could get away with ignoring deletes, since those would automatically serialize with any transaction that would modify the existing document.

We use this sort of pattern internally to ensure uniqueness of IDs.

Wayne.

On 06/04/2014 12:49 PM, Whitby, Rob wrote:

I thought 2 simultaneous transactions would both get read locks on the URI, then one would get a write lock and the other would fail and retry. Maybe I'm missing something though.

But anyway, I agree unique indexes would be a handy feature. For example, our docs have a DOI element which *should* be unique but occasionally isn't; it would be nice to enforce that rather than have to code defensively.

Rob

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens [r...@ronsoft.com]
Sent: 04 June 2014 19:31
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] New Feature Request: Unique Value Range Indexes

Rob,

I believe there is a race condition here. A document may not exist as of the timestamp when this request starts running, but some other request could create one while it's running. This request would then overwrite that document.

I'm actually more concerned about element values inside documents than generating unique document URIs. It's easy to generate document URIs with 64-bit random numbers that are very unlikely to collide. But I want to guarantee that some meaningful value inside a document is unique across all documents.

In my case, the naming space is actually quite small because I want the IDs to be meaningful but unique. For example images:cats:fluffy:XX.png, where XX can increment or be set randomly until the ID is unique. One way to check for uniqueness is to make the document URI from this ID, then test for an existing document.

But this doesn't solve the general problem. I could conceivably have multiple elements in the document that I want to be unique. To check for unique element values it's necessary to run a cts query against the element(s). And I'm not sure if you can completely close the race window between checking for an existing instance and inserting a new one if the query comes back empty.

Someone from ML pointed out privately that checking for uniqueness in the index would require cross-cluster communication. I'm sure that's true, but I'm also