Re: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention
Actually, I never sent an earlier mail, so my “earlier mail” statement is wrong (that is, I was wrong about being wrong here). Anyway, the original sample doc was (is) valid, and the injection can be done if you have access to the ML server’s file system, ML has read access to a directory you can write to, and you can run XQuery to load the file from the server’s file system. But short of disallowing entity reference resolution, I don’t see how this is something ML itself can prevent. I think it’s an application and server security issue.

I tried the same test but using an HTTP URL that is resolvable: https://github.com/dita-community/dita-ot-project-docker/raw/master/README.md"; ]> &xxe;

I verified that the URL is resolvable by using Oxygen’s open-URL function to get the file. Using xdmp:load() ML reports:

[1.0-ml] SVC-FILOPN: xdmp:eval("xquery version "1.0-ml"; let $source := xdmp:...", (), 13776655024510127060...) -- File open error: open 'https://github.com/dita-community/dita-ot-project-docker/raw/master/README.md': No such file or directory

So injection of HTTP URLs appears not to work in this case, and I think the injection can only happen from ML-executed XQuery reading files from the server’s file system. But in a normal server, file system access should be tightly controlled. Documents loaded using facilities outside ML, e.g., Java code that parses source XML to pass to XCC or something, are outside ML’s ability to control, and that is thus an application issue, not an ML issue. It does not appear to be possible to do this injection using documents supplied via, say, an HTTP response: even if you could provide an XML document with a DOCTYPE declaration, ML would not be able to resolve any entity references that were not to files in the local ML database.
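The list archive has stripped the DOCTYPE markup from the sample above, leaving only the `]> &xxe;` tail. The HTTP-URL variant being tested would have looked roughly like this (the root element name `doc` is a placeholder, not from the original mail):

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- external general entity with an HTTP system identifier -->
  <!ENTITY xxe SYSTEM
    "https://github.com/dita-community/dita-ot-project-docker/raw/master/README.md">
]>
<doc>&xxe;</doc>
```

As the SVC-FILOPN error shows, xdmp:load() treated the system identifier as a local file path rather than fetching it over HTTP, which is why the HTTP injection failed.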
Unquote did not handle the entity reference—it clearly tried to resolve it before doing the unquoting, resulting in an “illegal entity reference” message, and escaping the “&” resulted in a single “;” where the entity reference was (meaning the escaped “&amp;xxe;” was not converted back to “&xxe;” and then resolved—I'm not actually sure what the unquote processor is doing in that case).

Cheers, E. -- Eliot Kimber http://contrext.com

From: on behalf of Eliot Kimber
Reply-To: MarkLogic Developer Discussion
Date: Wednesday, March 14, 2018 at 2:49 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention

My earlier mail was wrong. I was able to replicate the behavior in ML 9 using Query Console. Here is my source doc:

]> &xxe;

And test.xml:

---
This is a text file injected.
---

Here’s my query:

xquery version "1.0-ml";
let $source := xdmp:load("/ekimber/test/injection-test.xml")
let $result := doc("/ekimber/test/injection-test.xml")
return $result

Result:

This is a text file injected.

So ML’s built-in parser will resolve entity references to files on the file system when loaded from the file system using xdmp:load(). But I’m not sure this is something ML can or should prevent. I think the security presumption is that if you have access to the server (the machine running ML) and rights to run XQuery on ML, you can do anything. I think it is up to a MarkLogic application to impose rules on what documents are allowed to be loaded and what the constraints on entity resolution are.

Cheers, Eliot -- Eliot Kimber http://contrext.com

From: on behalf of Keith Breinholt
Reply-To: MarkLogic Developer Discussion
Date: Wednesday, March 14, 2018 at 12:07 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention

Perhaps you could show the code that you used to insert the document into the database. I, personally, cannot get your code to work for a number of reasons.
1) Having both an XML processing instruction and an HTML doctype is invalid.
2) Trying to assign the “document” to a variable throws an error because of #1.
3) If I try to put the “document” below into a file on the file system and load it, I cannot use xdmp:document-insert() to insert the “document” into the database because there isn’t a valid node.

There may be something I have overlooked, so please share the code you used to insert this document into a database.

-Keith

From: general-boun...@developer.marklogic.com On Behalf Of Marcel de Kleine
Sent: Wednesday, March 14, 2018 6:43 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention

Hello, We have noticed MarkLogic is vulnerable to XXE (entity expansion) and XML bomb attacks. When loading a malicious document using xdmp:document-insert it won’t catch these, causing either loading of unwanted external documents (XXE) or lockup of the system (XML bomb).
Re: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention
My earlier mail was wrong. I was able to replicate the behavior in ML 9 using Query Console. Here is my source doc:

]> &xxe;

And test.xml:

---
This is a text file injected.
---

Here’s my query:

xquery version "1.0-ml";
let $source := xdmp:load("/ekimber/test/injection-test.xml")
let $result := doc("/ekimber/test/injection-test.xml")
return $result

Result:

This is a text file injected.

So ML’s built-in parser will resolve entity references to files on the file system when loaded from the file system using xdmp:load(). But I’m not sure this is something ML can or should prevent. I think the security presumption is that if you have access to the server (the machine running ML) and rights to run XQuery on ML, you can do anything. I think it is up to a MarkLogic application to impose rules on what documents are allowed to be loaded and what the constraints on entity resolution are.

Cheers, Eliot -- Eliot Kimber http://contrext.com

From: on behalf of Keith Breinholt
Reply-To: MarkLogic Developer Discussion
Date: Wednesday, March 14, 2018 at 12:07 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention

Perhaps you could show the code that you used to insert the document into the database. I, personally, cannot get your code to work for a number of reasons.

1) Having both an XML processing instruction and an HTML doctype is invalid.
2) Trying to assign the “document” to a variable throws an error because of #1.
3) If I try to put the “document” below into a file on the file system and load it, I cannot use xdmp:document-insert() to insert the “document” into the database because there isn’t a valid node.

There may be something I have overlooked, so please share the code you used to insert this document into a database.
-Keith

From: general-boun...@developer.marklogic.com On Behalf Of Marcel de Kleine
Sent: Wednesday, March 14, 2018 6:43 AM
To: general@developer.marklogic.com
Subject: [MarkLogic Dev General] Marklogic XXE and XML Bomb prevention

Hello,

We have noticed MarkLogic is vulnerable to XXE (entity expansion) and XML bomb attacks. When loading a malicious document using xdmp:document-insert it won’t catch these, causing either loading of unwanted external documents (XXE) or lockup of the system (XML bomb). For example, if I load this document:

]> &xxe;

the file test.xml gets nicely added to the XML document. See OWASP and others for examples. This is clearly an XML processing issue, so the question is: can we disable this? And if so, on what levels would this be possible? Best would be system-wide. (And if you cannot disable this, I think this is something ML should address immediately.)

Thank you in advance,

Marcel de Kleine, EPAM
Senior Software Engineer
Office: +31 20 241 6134 x 30530
Cell: +31 6 14806016
Email: marcel_de_kle...@epam.com
Delft, Netherlands
epam.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
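The “XML bomb” attack mentioned above refers to entity expansion, e.g. the classic “billion laughs” document described by OWASP. A minimal illustration (not from the original mail) looks like this; the file itself is tiny, but a parser that expands the entities naively produces an enormous string:

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!-- each level expands to ten copies of the previous one; with
       around nine levels the expansion is roughly a billion strings,
       which is the lockup scenario described -->
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>
```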
[MarkLogic Dev General] How to Reliably Tell if FlexRep Push is Done
I have a process that has to sync up a database between a master server and a number of remote servers before submitting tasks to those servers. In some cases the sync may be a complete copy of the database, which can take an hour or more. In many cases the sync will either be unnecessary or a quick sync of a few updated docs.

My task submission process cycles through the pool of remote servers, checking their general status and submitting tasks to them. However, I don't want to submit any tasks until any flexrep syncs are done. I've worked out this function:

declare function er:is-flexrep-in-progress(
  $server
) {
  let $domain-ids := flexrep:configuration-domain-ids()
  let $missing-counts as xs:integer* :=
    for $domain-id in $domain-ids
    let $cfg := flexrep:configuration-get($domain-id)
    let $targets := flexrep:configuration-targets($cfg)
    return
      for $target in $targets
      let $target-id as xs:unsignedLong := $target/flexrep:target-id
      let $status := flexrep:target-status($domain-id, $target-id)
      let $target-name as xs:string := $status/flexrep:target-name
      let $missing-count := xs:integer($status/flexrep:missing-count)
      return
        if (contains($target-name, $server) and $missing-count gt 0)
        then $missing-count
        else 0
  let $result := sum($missing-counts) gt 0
  return $result
};

But it's kind of slow--in my initial profiling through qconsole it could take as much as 800 ms and typically took about 500 ms. The time could be an unavoidable side effect of having 100s of 1000s of fragments in this db--my profiling showed that most of the time came from doing 1000s of small operations down in the flexrep code. I'm wondering if there's a better or more efficient way to determine if flexrep is still in progress?

Thanks, E. -- Eliot Kimber http://contrext.com
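One small restructuring worth trying (a sketch, not tested against a live cluster): use a `some … satisfies` quantifier so evaluation can stop at the first target with a non-zero missing count, rather than computing and summing a value for every target. The flexrep:* calls are the same ones used in the function above; whether this meaningfully helps depends on how much of the cost is inside flexrep:target-status itself.

```xquery
declare function er:is-flexrep-in-progress(
  $server as xs:string
) as xs:boolean {
  (: true as soon as any matching target still has missing documents;
     a quantified expression may stop at the first satisfying item :)
  some $domain-id in flexrep:configuration-domain-ids(),
       $target in flexrep:configuration-targets(
                    flexrep:configuration-get($domain-id))
  satisfies
    let $status :=
      flexrep:target-status($domain-id,
        xs:unsignedLong($target/flexrep:target-id))
    return
      contains($status/flexrep:target-name, $server)
        and xs:integer($status/flexrep:missing-count) gt 0
};
```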
[MarkLogic Dev General] Rsync-Like DB Contents Comparison and Update?
ML 9. I have a system of servers where a master server gets new remote servers allocated to it more or less randomly and dynamically. The remote servers need to have a correct copy of a database on the master server, but the database is pretty big (the previously mentioned 380K-doc, 3GB database). I can of course sync it with FlexRep, but when a new server comes available I don't know the current state of its local copy of the database (if it has one at all), so I'm forced to recreate my master server's replication targets and do a full push, which takes an hour or two.

In the case where the remote server already has a copy of the database, I would like to be able to compare its contents to the master's, determine what the deltas are, if any, and only handle those, which usually would be only a few docs out of the total set. Does there exist this kind of rsync- or git-like comparison mechanism, either out of the box or as a public project? I'm thinking of something comparable to what git does, which is to create hashes of each file and then compare hashes. I could do this in XQuery, but I suspect something more efficient could be done at the forest level, if one knew what one was doing.

Thanks, Eliot -- Eliot Kimber http://contrext.com
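There doesn't appear to be an out-of-the-box forest-level diff, but a coarse version of the git-style comparison can be sketched in XQuery: build a URI-to-hash manifest on each side using xdmp:md5() over the serialized documents, then diff the two manifests. This sketch assumes the URI lexicon is enabled (for cts:uris) and glosses over chunking for a 380K-doc database; note that hashing every document is itself a full scan, so this only wins when the transfer is the expensive part.

```xquery
xquery version "1.0-ml";

(: manifest of uri/hash pairs for every document in the database :)
declare function local:manifest() as element(manifest) {
  <manifest>{
    for $uri in cts:uris()
    return
      <doc uri="{$uri}"
           hash="{xdmp:md5(xdmp:quote(fn:doc($uri)))}"/>
  }</manifest>
};

(: URIs whose hash differs from, or is absent in, the other manifest :)
declare function local:deltas(
  $mine as element(manifest),
  $other as element(manifest)
) as xs:string* {
  for $doc in $mine/doc
  where fn:not($other/doc[@uri eq $doc/@uri and @hash eq $doc/@hash])
  return fn:string($doc/@uri)
};
```

The master would run local:manifest(), ship the result to the remote (or vice versa), and replicate only the URIs local:deltas() reports.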
[MarkLogic Dev General] Why Would FlexRep Pull be Dramatically Slower Than Push for Same Database and Server Pair?
I have a pair of ML 9 servers. On the master server I have a domain with a target configured with a docs-per-batch of 100, for a database with about 380K docs coming in at about 3GB as reported by the ML status page. When I use FlexRep push to another server with an empty database, the push takes about one to two hours depending on time of day (and thus overall network traffic). When I use FlexRep pull to pull from the master to the remote, it takes about 9 hours.

What would account for this time difference? I'm guessing it's that the pull process doesn't use the docs/batch setting (manually setting it to 1 for a push also results in about 9 hours). As it happens, I don't need to use pull, as I can use push just as easily, but I was curious about the time difference and whether it's an inherent aspect of FlexRep pull, indicates a bug, or could be some configuration error on my part (though I don't think so, since the target configuration is the same in both cases--the only variable is pull vs. push).

Cheers, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Good Way to Automatically Install CPF
I've taken the Admin install CPF script and reworked it as a function library, removing the code related to default domains (I don't want default domains in any case). What's left includes code to set up the pipelines and triggers, as described in the CPF Configuration chapter. *But* it also includes the loading of schemas that are used by the pipelines, and I didn't see anything that does that (or mentions it) in the CPF API. So unless I'm missing something (which is quite possible), I still need to do the schema loading.

What I've extracted from the Admin code seems like a convenient way to just get CPF in place so that you can then set up your custom domains. It could be optimized for the needs of FlexRep, namely only bothering to install the change and FlexRep pipelines, but it seems likely that other servers needing automatic conversion might use other pipelines, and there's no particular harm in having unused pipelines lying about.

Note my requirement is simply to have CPF available so that I can then configure FlexRep. The lack of a quick-and-easy way to programmatically install CPF is simply a roadblock to the real configuration I need to do, namely configuring FlexRep, which is otherwise easy enough (once one has understood how all the FlexRep parts fit together, which was a little harder than it should have been, but I think I've already commented on the FlexRep docs...). So I was really looking for a "call this one function to get CPF installed so you can continue on with your real task of getting FlexRep configured via a script", and I'm not seeing that out of the box.

Or said more directly: there's a one-button task in the Admin UI to get CPF installed for a database. There should be a corresponding single-call function to do it programmatically, and the FlexRep docs should make reference to that function at the same time they refer to the manual CPF installation process.

Cheers, E.
-- Eliot Kimber http://contrext.com

On 1/22/18, 5:32 PM, "general-boun...@developer.marklogic.com on behalf of Mary Holstege" wrote:

There isn't a single API that orchestrates all the pieces, but there are APIs to do all the necessary parts in the pipelines and domains modules. These should be executed against your triggers database. If you share a triggers database, you don't need to do it all over again.

- p:insert to put a pipeline XML into the right collections etc.
- dom:configuration-create to create the overall configuration object that defines your restart user etc. You need to do this before you create domains or things will go horribly wrong.
- dom:create to define your domains
- dom:add-pipeline to attach pipelines if you didn't put them in the domain in dom:create

All default pipelines are in the Installer directory. The thing in the admin GUI makes some default assumptions about some of this that aren't always the appropriate thing to do. I'd suggest making a script that creates the domains you want and loads and attaches the appropriate pipelines. //Mary

On Mon, 22 Jan 2018 14:09:23 -0800, Eliot Kimber wrote:

> I'm putting together a script that will do all the configuration for a
> server all the way through defining a FlexRep app server, domains, and
> targets. The requirement is to avoid the need for any manual intervention
> once the configuration is started.
>
> The one fly in this ointment is the CPF--since I'm creating new
> databases they of course won't have CPF installed, so I need to install
> the CPF into those that are involved in FlexRep.
>
> As far as I can tell there is no API for doing this (there should
> be), so I'm going to attempt to simply call the
> Admin/database-cpf-admin-go.xqy module, which seems simple enough (I
> only need to specify the database name as far as I can tell).
> > But calling an Admin module like this feels a little dirty and has some > risk since it's not a published API and there's no guarantee it will not > change without warning in the future (although the risk seems pretty > small since it's a module that hasn't changed in ages and it's only > called in one place in my code). > > Is there a better way to automate installation of the CPF than doing > what the "confirm CPF installation" UI form does? > > This is in the context of setting up new servers on demand, e.g., in a > Docker environment where this server has a very narrow use. > > Thanks, > > Eliot > -- > Eliot Kimber >
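Mary's steps can be sketched as a single script run against the triggers database. This is only an outline: the p:insert and dom:* function names come from her mail, but the argument lists, paths, and names shown here (restart user, modules database, pipeline file path, scope) are assumptions that should be checked against the pipelines.xqy and domains.xqy module documentation before use.

```xquery
xquery version "1.0-ml";
(: Run against the triggers database of the content database. :)
import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";
import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";

(: 1. Load a pipeline definition; p:insert returns the pipeline id.
   The file path is a placeholder. :)
let $pipeline-id :=
  p:insert(xdmp:document-get("/path/to/flexible-replication.xml")/*)

(: Evaluation context for CPF actions; "Modules" and "/" are
   assumptions for a default install. :)
let $eval-context :=
  dom:evaluation-context(xdmp:database("Modules"), "/")

(: 2. Overall CPF configuration -- must exist before any domain.
   Argument list is from memory; verify against the dom module docs. :)
let $config :=
  dom:configuration-create("cpf-restart-user", $eval-context,
                           0, xdmp:default-permissions())

(: 3. A domain over the whole database with the pipeline attached. :)
return
  dom:create("flexrep-domain", "Domain for FlexRep change tracking",
             dom:domain-scope("directory", "/", "infinity"),
             $eval-context, $pipeline-id, xdmp:default-permissions())
```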
Re: [MarkLogic Dev General] Good Way to Automatically Install CPF
I was pointed to the Scripting Content Processing Framework (CPF) Configuration chapter, which does seem to have the guidance I seek. I was focused on scripting FlexRep configuration and didn't fully appreciate the underlying requirement on CPF. I just wanted an Easy button.

In looking more closely at the Admin code that does the CPF default installation, I see that it's not actually callable as a module anyway, as it expects to get values from HTTP request parameters (insert snarky comment about separation of concerns in code here). So it looks like the answer is "cut and paste what's in the Admin installer code, fix it to be callable as functions, and use that".

Cheers, E. -- Eliot Kimber http://contrext.com

On 1/22/18, 4:15 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

Correct subject line for this thread. Cheers, E. -- Eliot Kimber http://contrext.com

On 1/22/18, 4:09 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I'm putting together a script that will do all the configuration for a server all the way through defining a FlexRep app server, domains, and targets. The requirement is to avoid the need for any manual intervention once the configuration is started.

The one fly in this ointment is the CPF--since I'm creating new databases they of course won't have CPF installed, so I need to install the CPF into those that are involved in FlexRep.

As far as I can tell there is no API for doing this (there should be), so I'm going to attempt to simply call the Admin/database-cpf-admin-go.xqy module, which seems simple enough (I only need to specify the database name as far as I can tell).

But calling an Admin module like this feels a little dirty and has some risk since it's not a published API and there's no guarantee it will not change without warning in the future (although the risk seems pretty small since it's a module that hasn't changed in ages and it's only called in one place in my code).
Is there a better way to automate installation of the CPF than doing what the "confirm CPF installation" UI form does? This is in the context of setting up new servers on demand, e.g., in a Docker environment where this server has a very narrow use. Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Good Way to Automatically Install CPF
Correct subject line for this thread. Cheers, E. -- Eliot Kimber http://contrext.com

On 1/22/18, 4:09 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I'm putting together a script that will do all the configuration for a server all the way through defining a FlexRep app server, domains, and targets. The requirement is to avoid the need for any manual intervention once the configuration is started.

The one fly in this ointment is the CPF--since I'm creating new databases they of course won't have CPF installed, so I need to install the CPF into those that are involved in FlexRep.

As far as I can tell there is no API for doing this (there should be), so I'm going to attempt to simply call the Admin/database-cpf-admin-go.xqy module, which seems simple enough (I only need to specify the database name as far as I can tell).

But calling an Admin module like this feels a little dirty and has some risk since it's not a published API and there's no guarantee it will not change without warning in the future (although the risk seems pretty small since it's a module that hasn't changed in ages and it's only called in one place in my code).

Is there a better way to automate installation of the CPF than doing what the "confirm CPF installation" UI form does?

This is in the context of setting up new servers on demand, e.g., in a Docker environment where this server has a very narrow use.

Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Good Way to Automatically Install CPF
I'm putting together a script that will do all the configuration for a server all the way through defining a FlexRep app server, domains, and targets. The requirement is to avoid the need for any manual intervention once the configuration is started.

The one fly in this ointment is the CPF--since I'm creating new databases they of course won't have CPF installed, so I need to install the CPF into those that are involved in FlexRep.

As far as I can tell there is no API for doing this (there should be), so I'm going to attempt to simply call the Admin/database-cpf-admin-go.xqy module, which seems simple enough (I only need to specify the database name as far as I can tell).

But calling an Admin module like this feels a little dirty and has some risk since it's not a published API and there's no guarantee it will not change without warning in the future (although the risk seems pretty small since it's a module that hasn't changed in ages and it's only called in one place in my code).

Is there a better way to automate installation of the CPF than doing what the "confirm CPF installation" UI form does?

This is in the context of setting up new servers on demand, e.g., in a Docker environment where this server has a very narrow use.

Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] General Digest, Vol 162, Issue 17
OK, good to know it should work except for the point-in-time stuff. Any idea why the forests I was trying to restore would have been greyed out in the restore-from-backup panel? Would that reflect an issue with there being point-in-time data in the ML4 data being restored, or would it be because of something else?

I checked the usual suspects, such as the forest names not matching or not having permissions on the files, and those all seemed to be correct, but of course there's still a strong chance it's my user error. I haven't had much time to work on this, so when it didn't work immediately I set it aside to pursue other avenues (in particular, I was able to make FlexRep push from ML4 to ML 9 work at appropriate speed, so that's good).

Thanks, E. -- Eliot Kimber http://contrext.com

From: on behalf of Rajesh Kumar
Reply-To: MarkLogic Developer Discussion
Date: Thursday, December 21, 2017 at 12:55 AM
To: "general@developer.marklogic.com"
Subject: Re: [MarkLogic Dev General] General Digest, Vol 162, Issue 17

Hi Eliot, You can restore the content in ML 9 from ML 4, but point-in-time data will not work. In ML 4 the timestamp was defined based on the transaction, but in later versions (from 6, I guess) it was a long. So point-in-time restore will not work. Regards, Rajesh

On Thu, Dec 21, 2017 at 1:30 AM, wrote:
Message: 1
Date: Tue, 19 Dec 2017 14:54:39 -0600
From: Eliot Kimber
Subject: Re: [MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9
To: MarkLogic Developer Discussion

I did not consider using the API. I will try that. Thanks, Eliot -- Eliot Kimber http://contrext.com

From: on behalf of Arthur Tsoi
Reply-To: MarkLogic Developer Discussion
Date: Tuesday, December 19, 2017 at 1:21 PM
To: "general@developer.marklogic.com"
Subject: Re: [MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9

On ML 9, can you try using the API xdmp:database-restore instead of the UI? Alternatively, since you can restore into an ML 8 server, you can do another backup from the ML 8 server and then restore that into an ML 9 server. Arthur
Re: [MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9
I did not consider using the API. I will try that. Thanks, Eliot -- Eliot Kimber http://contrext.com

From: on behalf of Arthur Tsoi
Reply-To: MarkLogic Developer Discussion
Date: Tuesday, December 19, 2017 at 1:21 PM
To: "general@developer.marklogic.com"
Subject: Re: [MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9

On ML 9, can you try using the API xdmp:database-restore instead of the UI? Alternatively, since you can restore into an ML 8 server, you can do another backup from the ML 8 server and then restore that into an ML 9 server. Arthur
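Arthur's suggestion can be sketched as below. xdmp:database-restore takes the forest IDs to restore and the backup directory path; the database name and path here are placeholders, and the optional arguments (journal archiving, etc.) are omitted.

```xquery
xquery version "1.0-ml";

(: kick off a restore of every forest of the target database from a
   backup directory; returns a restore job ID that can be checked
   with xdmp:database-restore-status() :)
let $forest-ids := xdmp:database-forests(xdmp:database("mydatabase"))
return xdmp:database-restore($forest-ids, "/marklogic/backup/20171214-1")
```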
Re: [MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9
Just pinging this issue again, as being able to do this would make a number of things easier than they would otherwise be. Thanks, Eliot -- Eliot Kimber http://contrext.com

On 12/14/17, 3:42 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I also made sure the files are world readable and writeable, since I thought at first it might be a permissions problem. Cheers, Eliot -- Eliot Kimber http://contrext.com

On 12/14/17, 3:41 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I have successfully restored an ML 4 backup into an ML 8 server. I'm now trying to restore an ML 4 backup into an ML 9 server and not having any luck.

If I try to do a full database restore, the forest directories in the backup are listed but their check boxes are not selectable. I have verified that the forest names match.

If I try to restore each forest individually, I can start the restore but it fails with this error:

Error: Forest label does not exist: /marklogic/backup/20171214-1/Forests/mydatabase-2/Label

There is a Label file:

# cat mydatabase-2/Label
??*h1?5
#

So I'm assuming this is a version incompatibility. Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Possible to Flexrep Pull From ML9 to ML4?
I am able to FlexRep push local forests from ML4 to ML 9, so that will let me do what I want by dynamically managing the FlexRep config on the master host. Cheers, Eliot -- Eliot Kimber http://contrext.com

On 12/14/17, 12:42 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I am trying to set up FlexRep pull where an ML 9 server is pulling from an ML 4 server. Is this actually possible? It looks like the code is different in ML 4 from what the ML 9 pull code expects. In particular, the ML 9 pull code calls the module pollBinaryChunk.xqy on the pulled-from server, but as far as I can tell no such module exists in the ML 4 code. Is there a workaround for this?

Because of the way my systems are set up it's problematic to use command-line tools like mlcp or corb, so I'm trying to keep everything in XQuery. My specific requirement is to sync part of a much larger database to a set of Docker-based servers whenever those servers are started up. Because the servers are not persistent I cannot use normal push FlexRep; because I only want part of the database, backup and restore is not ideal; and I also want to avoid the reindexing cost if possible.

Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9
I also made sure the files are world readable and writeable, since I thought at first it might be a permissions problem. Cheers, Eliot -- Eliot Kimber http://contrext.com

On 12/14/17, 3:41 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I have successfully restored an ML 4 backup into an ML 8 server. I'm now trying to restore an ML 4 backup into an ML 9 server and not having any luck.

If I try to do a full database restore, the forest directories in the backup are listed but their check boxes are not selectable. I have verified that the forest names match.

If I try to restore each forest individually, I can start the restore but it fails with this error:

Error: Forest label does not exist: /marklogic/backup/20171214-1/Forests/mydatabase-2/Label

There is a Label file:

# cat mydatabase-2/Label
??*h1?5
#

So I'm assuming this is a version incompatibility. Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Possible to Restore ML 4 Backup To ML 9
I have successfully restored an ML 4 backup into an ML 8 server. I'm now trying to restore an ML 4 backup into an ML 9 server and not having any luck. If I try to do a full database restore the forest directories in the backup are listed but their check boxes are not selectable. I have verified that the forest names match. If I try to restore each forest individually, I can start the backup but it fails with this error: Error: Forest label does not exist: /marklogic/backup/20171214-1/Forests/mydatabase-2/Label There is a Label file: # cat mydatabase-2/Label ??*h1?5 # _ So I'm assuming this is a version incompatibility. Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Possible to Flexrep Pull From ML9 to ML4?
I am trying to set up flexrep pull where an ML 9 server is pulling from an ML 4 server. Is this actually possible? It looks like the code is different in ML 4 from what the ML 9 pull code expects. In particular, the ML 9 pull code is calling the module pollBinaryChunk.xqy on the pulled-from server but as far as I can tell no such module exists in the ML 4 code. Is there a workaround for this? Because of the way my systems are set up it's problematic to use command-line tools like mlcp or corb, so I'm trying to keep everything in XQuery. My specific requirement is to sync part of a much larger database to a set of Docker-based servers whenever those servers are started up. Because the servers are not persistent I cannot use normal push flexrep and because I only want part of the database backup and restore is not ideal and I also want to avoid the reindexing cost if possible. Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction
I think I've solved my problem by once again being more careful about holding elements in memory. By replacing global reads of my job doc with on-demand reads through xdmp:eval() I seem to have resolved my issue with changes to the job doc not being seen within the same separate transaction (e.g., my read loop). I seem to be unable to let go of my procedural language brain damage. Still, it seems like having a general, cross-application field or shared memory mechanism would be useful for this type of application where one app (e.g., my Web UI) spawns tasks that do the work and need a way to dynamically communicate within the scope of a single long-running transaction. At least that's the way I would go about building this type of application in a different environment. Cheers, E. -- Eliot Kimber http://contrext.com On 12/7/17, 10:48 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I don't think server fields are going to work because they are per application server and I have different application servers at work. There is an HTTP server that gets the pause/resume request and then spawned tasks running the TaskServer that need to read the field. My experiments show that, per the docs, a field changed by one app is not seen by a different app. Cheers, Eliot -- Eliot Kimber http://contrext.com On 12/7/17, 10:13 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I had not considered server fields--I'll check it out. Cheers, E. -- Eliot Kimber http://contrext.com On 12/7/17, 10:11 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: Have you considered a server field -- where any code that changes the status also updates the server field and the iterator checks the server field? The server fields are local to the host, so there's no concern about a separate iterator running on a different host. 
If multiple iterators run on the same host, each would need to distinguish its status by an id, which the iterator could generate from a random id when it starts. Hoping that helps, Erik Hennum From: general-boun...@developer.marklogic.com on behalf of Eliot Kimber Sent: Thursday, December 7, 2017 7:48:44 AM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction In the context of my remote processing management system, where my client server is sending many tasks to a set of remote servers through a set of spawned tasks running in parallel, I need to be able to pause the client so that it stops sending new tasks to the remote servers. So far I've been using a single document stored in ML as my mechanism for indicating that a job is in progress and capturing the job details (job ID, start time, servers in use, etc.). This works fine because it was only updated at the start and end of the job. But for the pause/resume use case I need to have a flag that indicates that the job is paused and have other processes (e.g., my task-submission code) immediately respond to a change. For example, if I'm looping over 100 tasks to load up a remote task queue and the job is paused, I want that loop to end immediately. So basically, in this loop, for every iteration, check the "is paused" status, which requires reading the job doc to see if a @paused attribute is present (the @paused attribute captures the time the pause was requested and serves as the "is paused" flag). However, because the loop is a single transaction, it will see the same version of the job doc for every iteration, even if it's changed. I tried using xdmp:eval() to read the job doc but that didn't seem to change the behavior. E.g., doing this in query console: return (er:is-job-paused(), er:pause-job(), er:is-job-paused()) Results in (false, false) So this isn't going to work. 
So my question: what's the best way to manage this kind of dynamic flag in ML? I could use file system files instead of docs in the database, which would avoid the ML transaction behavior but that seems a little hackier than I'd like. What I'd really like is some kind of "shared memory" mechanism where I can set and reset variables at will across different modules running in parallel but I haven't seen anything like that in my study of the ML API. Is there such a mechanism that I've missed? Or am I just thinking about the problem the wrong way? Thanks, Eliot -- Eliot Kimber http://contrext.com
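[Editorial sketch] The on-demand read Eliot describes (replacing global reads of the job doc with reads through xdmp:eval()) could look roughly like the following. This is a hypothetical illustration, not code from Eliot's application: the document URI, the `job` element, and the function name are invented; only the @paused attribute convention comes from the thread.

```xquery
xquery version "1.0-ml";

(: Hypothetical sketch: read the job doc in its own transaction so each
   check sees the latest committed version, rather than the version at
   the enclosing transaction's timestamp. URI and job/@paused shape are
   assumptions for illustration. :)
declare function local:is-job-paused-now($job-uri as xs:string) as xs:boolean
{
  xdmp:eval(
    'declare variable $uri as xs:string external;
     fn:exists(fn:doc($uri)/job/@paused)',
    (xs:QName("uri"), $job-uri),
    <options xmlns="xdmp:eval">
      <isolation>different-transaction</isolation>
    </options>)
};

local:is-job-paused-now("/jobs/job-123.xml")
```

The key detail is the different-transaction isolation option, which forces the eval to run at a fresh timestamp; without it, an eval in the same transaction sees the same snapshot as the caller.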
Re: [MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction
I don't think server fields are going to work because they are per application server and I have different application servers at work. There is an HTTP server that gets the pause/resume request and then spawned tasks running the TaskServer that need to read the field. My experiments show that, per the docs, a field changed by one app is not seen by a different app. Cheers, Eliot -- Eliot Kimber http://contrext.com On 12/7/17, 10:13 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I had not considered server fields--I'll check it out. Cheers, E. -- Eliot Kimber http://contrext.com On 12/7/17, 10:11 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: Have you considered a server field -- where any code that changes the status also updates the server field and the iterator checks the server field? The server fields are local to the host, so there's no concern about a separate iterator running on a different host. If multiple iterators run on the same host, each would need to distinguish its status by an id, which the iterator could generate from a random id when it starts. Hoping that helps, Erik Hennum From: general-boun...@developer.marklogic.com on behalf of Eliot Kimber Sent: Thursday, December 7, 2017 7:48:44 AM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction In the context of my remote processing management system, where my client server is sending many tasks to a set of remote servers through a set of spawned tasks running in parallel, I need to be able to pause the client so that it stops sending new tasks to the remote servers. So far I've been using a single document stored in ML as my mechanism for indicating that a job is in progress and capturing the job details (job ID, start time, servers in use, etc.). This works fine because it was only updated at the start and end of the job. 
But for the pause/resume use case I need to have a flag that indicates that the job is paused and have other processes (e.g., my task-submission code) immediately respond to a change. For example, if I'm looping over 100 tasks to load up a remote task queue and the job is paused, I want that loop to end immediately. So basically, in this loop, for every iteration, check the "is paused" status, which requires reading the job doc to see if a @paused attribute is present (the @paused attribute captures the time the pause was requested and serves as the "is paused" flag). However, because the loop is a single transaction, it will see the same version of the job doc for every iteration, even if it's changed. I tried using xdmp:eval() to read the job doc but that didn't seem to change the behavior. E.g., doing this in query console: return (er:is-job-paused(), er:pause-job(), er:is-job-paused()) Results in (false, false) So this isn't going to work. So my question: what's the best way to manage this kind of dynamic flag in ML? I could use file system files instead of docs in the database, which would avoid the ML transaction behavior but that seems a little hackier than I'd like. What I'd really like is some kind of "shared memory" mechanism where I can set and reset variables at will across different modules running in parallel but I haven't seen anything like that in my study of the ML API. Is there such a mechanism that I've missed? Or am I just thinking about the problem the wrong way? Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction
I had not considered server fields--I'll check it out. Cheers, E. -- Eliot Kimber http://contrext.com On 12/7/17, 10:11 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: Have you considered a server field -- where any code that changes the status also updates the server field and the iterator checks the server field? The server fields are local to the host, so there's no concern about a separate iterator running on a different host. If multiple iterators run on the same host, each would need to distinguish its status by an id, which the iterator could generate from a random id when it starts. Hoping that helps, Erik Hennum From: general-boun...@developer.marklogic.com on behalf of Eliot Kimber Sent: Thursday, December 7, 2017 7:48:44 AM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction In the context of my remote processing management system, where my client server is sending many tasks to a set of remote servers through a set of spawned tasks running in parallel, I need to be able to pause the client so that it stops sending new tasks to the remote servers. So far I've been using a single document stored in ML as my mechanism for indicating that a job is in progress and capturing the job details (job ID, start time, servers in use, etc.). This works fine because it was only updated at the start and end of the job. But for the pause/resume use case I need to have a flag that indicates that the job is paused and have other processes (e.g., my task-submission code) immediately respond to a change. For example, if I'm looping over 100 tasks to load up a remote task queue and the job is paused, I want that loop to end immediately. 
So basically, in this loop, for every iteration, check the "is paused" status, which requires reading the job doc to see if a @paused attribute is present (the @paused attribute captures the time the pause was requested and serves as the "is paused" flag). However, because the loop is a single transaction, it will see the same version of the job doc for every iteration, even if it's changed. I tried using xdmp:eval() to read the job doc but that didn't seem to change the behavior. E.g., doing this in query console: return (er:is-job-paused(), er:pause-job(), er:is-job-paused()) Results in (false, false) So this isn't going to work. So my question: what's the best way to manage this kind of dynamic flag in ML? I could use file system files instead of docs in the database, which would avoid the ML transaction behavior but that seems a little hackier than I'd like. What I'd really like is some kind of "shared memory" mechanism where I can set and reset variables at will across different modules running in parallel but I haven't seen anything like that in my study of the ML API. Is there such a mechanism that I've missed? Or am I just thinking about the problem the wrong way? Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Best Approach to Manage "Flags" That Might Change Within a Single Transaction
In the context of my remote processing management system, where my client server is sending many tasks to a set of remote servers through a set of spawned tasks running in parallel, I need to be able to pause the client so that it stops sending new tasks to the remote servers. So far I've been using a single document stored in ML as my mechanism for indicating that a job is in progress and capturing the job details (job ID, start time, servers in use, etc.). This works fine because it was only updated at the start and end of the job. But for the pause/resume use case I need to have a flag that indicates that the job is paused and have other processes (e.g., my task-submission code) immediately respond to a change. For example, if I'm looping over 100 tasks to load up a remote task queue and the job is paused, I want that loop to end immediately. So basically, in this loop, for every iteration, check the "is paused" status, which requires reading the job doc to see if a @paused attribute is present (the @paused attribute captures the time the pause was requested and serves as the "is paused" flag). However, because the loop is a single transaction, it will see the same version of the job doc for every iteration, even if it's changed. I tried using xdmp:eval() to read the job doc but that didn't seem to change the behavior. E.g., doing this in query console: return (er:is-job-paused(), er:pause-job(), er:is-job-paused()) Results in (false, false) So this isn't going to work. So my question: what's the best way to manage this kind of dynamic flag in ML? I could use file system files instead of docs in the database, which would avoid the ML transaction behavior but that seems a little hackier than I'd like. What I'd really like is some kind of "shared memory" mechanism where I can set and reset variables at will across different modules running in parallel but I haven't seen anything like that in my study of the ML API. Is there such a mechanism that I've missed? 
Or am I just thinking about the problem the wrong way? Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
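[Editorial sketch] The server-field mechanism Erik Hennum suggests elsewhere in this thread is the closest thing ML offers to the "shared memory" Eliot is asking for. A minimal hypothetical sketch follows; the field name is invented, and note Eliot's later finding that a field set in one app server was not visible from another in his tests, so this only helps when the pausing request and the submission loop run in the same scope.

```xquery
xquery version "1.0-ml";

(: Hypothetical sketch of the server-field approach: the pause request
   sets a named host-local field, and the submission loop polls it on
   each iteration. Field name and values are illustrative only. :)
let $_ := xdmp:set-server-field("job-42-paused", fn:true())
return
  if (xdmp:get-server-field("job-42-paused"))
  then "paused: stop submitting tasks"
  else "keep going"
```

Unlike database documents, server fields are not subject to transaction snapshot isolation, which is why they can act as a mutable flag within a long-running request.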
[MarkLogic Dev General] Error in Typeswitch Syntax Diagram
In the topic on typeswitch in this doc (ML 9 version): http://docs.marklogic.com/guide/xquery/langoverview#id_75915 The syntax diagram shows a “,” separator for the repeat line for “case” clauses. The comma should not be there. Cheers, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
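[Editorial sketch] To illustrate the corrected grammar: case clauses in a typeswitch are simply juxtaposed, with no comma between them. The element name below is invented for the example.

```xquery
xquery version "1.0-ml";

(: Case clauses follow one another with no "," separator. :)
declare function local:kind($node as node()) as xs:string {
  typeswitch ($node)
    case element(chapter) return "chapter element"
    case text() return "text node"
    case comment() return "comment"
    default return "something else"
};

local:kind(<chapter/>)  (: returns "chapter element" :)
```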
Re: [MarkLogic Dev General] Bug in XSLT and XQuery Reference Guide
Here’s the code I have: for $map in $details order by map:get($map, 'active') ascending, map:get($map, 'queued') ascending return $map Cheers, E. -- Eliot Kimber http://contrext.com On 11/29/17, 11:46 AM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote: Thanks, looks like you are right. Can you elaborate on the multiple expressions? Cheers, Geert On 11/29/17, 5:30 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >I didn’t see a place to submit comments in the guide like you can in the >reference topics so I’m posting here. > >In http://docs.marklogic.com/guide/xquery/langoverview#id_11626, in the >section on the order-by clause, the syntax diagram shows the repeat >returning to before the “order by” keyword. > >The correct syntax should have the repeat returning *after* the “order >by” keyword and before the $varExpr > >That is, order by is: > >order by expression1, expression2 > >not order by expression1, order by expression2 > >I also didn’t see any examples of order-by clauses with multiple >expressions—that would be useful to have. > >Cheers, > >E. > >-- >Eliot Kimber >http://contrext.com > > > > >___ >General mailing list >General@developer.marklogic.com >Manage your subscription at: >http://developer.marklogic.com/mailman/listinfo/general ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Bug in XSLT and XQuery Reference Guide
I didn’t see a place to submit comments in the guide like you can in the reference topics so I’m posting here. In http://docs.marklogic.com/guide/xquery/langoverview#id_11626, in the section on the order-by clause, the syntax diagram shows the repeat returning to before the “order by” keyword. The correct syntax should have the repeat returning *after* the “order by” keyword and before the $varExpr That is, order by is: order by expression1, expression2 not order by expression1, order by expression2 I also didn’t see any examples of order-by clauses with multiple expressions—that would be useful to have. Cheers, E. -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
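[Editorial sketch] A self-contained example of an order-by clause with multiple order specs, which follow a single order by keyword separated by commas. The data is invented for illustration.

```xquery
xquery version "1.0-ml";

(: Multiple order specs after one "order by" keyword: sort primarily by
   @p, then by @n within equal @p values. :)
for $item in (<t n="b" p="2"/>, <t n="a" p="2"/>, <t n="c" p="1"/>)
order by xs:integer($item/@p) ascending,
         string($item/@n) ascending
return string($item/@n)
(: returns "c", "a", "b" :)
```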
Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?
I can see how this form of recursive function using xdmp:set() is better than for 1 to 1,000,000. I was trying to find a pure XQuery solution, that is, one that didn’t rely on xdmp:set(), so my recursion was: declare function local:handle-task($task, $tasks) { if (empty($task)) then () else let $do-submit := local:submit-job($task) return local:handle-task(head($tasks), tail($tasks)) } (I could of course just pass $tasks but I think the function signature is clearer if the thing that is acted on is passed as a separate parameter.) But beyond that I have another loop or recursion that polls the remote servers until a server becomes available. That recursion I think was where I was getting out of memory even though tail recursion optimization should have prevented it. So in my mind there’s still a question of whether or not a pure XQuery solution is possible with ML 9 or do I have to use xdmp:set()? Of course I have to do whatever will solve the problem but I like to avoid proprietary extensions whenever possible (otherwise, what’s the point of using standards). Cheers, Eliot -- Eliot Kimber http://contrext.com On 11/28/17, 10:39 AM, "general-boun...@developer.marklogic.com on behalf of John Snelson" wrote: You should use recursive functions for this kind of thing: declare function local:while($test, $body) { if($test()) then ($body(), local:while($test,$body)) else () }; let $tasks := ... return local:while( function() { exists($tasks) }, function() { submit-task(head($tasks)), xdmp:set($tasks,tail($tasks)) } ) MarkLogic's tail call optimization will mean that the local:while() function will use a constant amount of stack space. However in your specific example you really just want to execute a function on each member of a sequence. In that specific case you can use fn:map: fn:map(submit-task#1,$tasks) John On 27/11/17 16:56, Eliot Kimber wrote: > I have a client-server system where the client is spawning 100s of 1000s of jobs on the client. 
The client polls the servers to see when each server’s task queue is ready for more jobs. This all works fine. > > Logically this polling is a while-true() loop that will continue until either all the servers are offline or all the tasks to be submitted are consumed. > > In a procedural language this is trivial, but in XQuery 2 I’m not finding a way to do it that works. In XQuery 3 I could use the new iterate operator but that doesn’t seem to be available in MarkLogic 9. > > My first attempt was to use a recursive process, relying on tail recursion optimization to avoid blowing the stack buffer. That worked logically but I still ran into out-of-memory on the server at some point (around 200K jobs submitted) and it seems likely that it was runaway recursion doing it. > > So I tried using a simple loop with xdmp:set() to iterate over the tasks and use an exception to break out when all the tasks are done: > > try { > for $i in 1 to 100 (: i.e., loop forever :) > return if (empty($tasks)) > then error() > else (submit-task(head($tasks)), > xdmp:set($tasks, tail($tasks))) > } catch ($e) { > (: We’re done. :) > } > > Is there a better way to do this kind of looping forever? > > I’m also having a very strange behavior where in my new looping code I’m getting what I think must be a pending commit deadlock that I didn’t get in my recursive version of the code. I can trace the code to the xdmp:eval() that would commit an update to the task and that code never returns. > > Each task is a document that I update to reflect the details of the task’s status (start and end times, current processing status, etc.). Those updates are all done either in separately-run modules or via xdmp:eval(), so as far as I can tell there shouldn’t be any issues with uncommitted updates. I didn’t change anything in the logic that updates the task documents, only the loop that iterates over the tasks. 
> > Could it be that the use of xdmp:set() to modify the $tasks variable (a sequence of elements) would be causing some kind of commit lock? > > Thanks, > > Eliot > > -- > Eliot Kimber > http://contrext.com > > > > > ___ > General mailing list > General@developer.marklogic.com > Manage your subscription at: > http://developer.marklogic.com/mailman/listinfo/general --
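[Editorial sketch] John's while-combinator pattern from this exchange, filled out into a self-contained form. The local:submit-task function is a stand-in stub for the thread's real task-submission code; everything else mirrors the pattern as posted.

```xquery
xquery version "1.0-ml";

(: John Snelson's while combinator: the recursive call is in tail
   position, so MarkLogic's tail-call optimization keeps stack use
   constant regardless of iteration count. :)
declare function local:while(
  $test as function(*),
  $body as function(*)
) as item()* {
  if ($test()) then ($body(), local:while($test, $body)) else ()
};

(: Stand-in for the thread's real task submission. :)
declare function local:submit-task($task) {
  xdmp:log("submitting task " || xs:string($task))
};

let $tasks := 1 to 5
return
  local:while(
    function() { fn:exists($tasks) },
    function() {
      local:submit-task(fn:head($tasks)),
      xdmp:set($tasks, fn:tail($tasks))
    })
```

The closures capture $tasks, and xdmp:set() mutates it between iterations; that mutation is what makes the loop terminate, which is exactly the proprietary-extension tradeoff Eliot notes above.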
Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?
My error was using the list of elements as the control for my loop rather than some lookup key. This was causing read locks on the elements that then blocked my subsequent update attempts. I rewrote my looping code to limit reads of the elements to just those places where it was doing an update so that no read lock was held outside the updating code. My process is now almost completed for its full 500K+ task set, so memory issues resolved. Now I just need to fix a bug in my queue loading logic and it might just work for production… Cheers, E. -- Eliot Kimber http://contrext.com On 11/27/17, 2:50 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I’ve tried turning on the lock tracing and I do see deadlocks both with the recursive version and the loop-with-set version, but with the loop-with-set version basically every task doc is locked (which is what I would expect), while with the recursive version I get little bursts of five or six deadlocks at a time but the code generally runs. But this does suggest that my code is creating unexpected locks that I need to resolve. Cheers, E. -- Eliot Kimber http://contrext.com On 11/27/17, 11:45 AM, "general-boun...@developer.marklogic.com on behalf of Will Thompson" wrote: Eliot, Is the controller/while-loop transaction read-only (i.e.: is xdmp:request-timestamp() nonempty)? If it is, then I think you can be sure it's not holding locks. Otherwise, I would restructure that part of the application so that any transaction responsible for dispatching jobs doesn't make any updates. Generally, you don't want a long-running update transaction to touch lots of documents. If you turn on debug logging, ML will report to the error log when it detects a deadlock (and randomly kills and retries one of the deadlocking transactions). 
There is also a lock trace event you can enable to get detailed output about which transactions are holding locks and which ones are waiting on them (See: https://help.marklogic.com/knowledgebase/article/View/387/0/understanding-the-lock-trace-diagnostic-trace-event). All of the reporting IIRC is based on transaction IDs, so you generally have to do your own logging elsewhere to identify which IDs are associated with which transactions. As you might expect, this can get kind of hairy. In the past I have used a task server job to check for a condition, and if it hasn't been met, sleep for a few hundred ms and respawn. Similar behavior could also be accomplished with triggers or with CPF, but both are probably overkill for your case. -Will > On Nov 27, 2017, at 10:59 AM, William Sawyer wrote: > > You could recursively spawn or setup a schedule task to run every minute or faster if needed. > > -Will > > On Mon, Nov 27, 2017 at 9:56 AM, Eliot Kimber wrote: > I have a client-server system where the client is spawning 100s of 1000s of jobs on the client. The client polls the servers to see when each server’s task queue is ready for more jobs. This all works fine. > > Logically this polling is a while-true() loop that will continue until either all the servers are offline or all the tasks to be submitted are consumed. > > In a procedural language this is trivial, but in XQuery 2 I’m not finding a way to do it that works. In XQuery 3 I could use the new iterate operator but that doesn’t seem to be available in MarkLogic 9. > > My first attempt was to use a recursive process, relying on tail recursion optimization to avoid blowing the stack buffer. That worked logically but I still ran into out-of-memory on the server at some point (around 200K jobs submitted) and it seems likely that it was runaway recursion doing it. 
> > So I tried using a simple loop with xdmp:set() to iterate over the tasks and use an exception to break out when all the tasks are done: > > try { > for $i in 1 to 100 (: i.e., loop forever :) > if (empty($tasks)) > then error() > else submit-task(head($tasks)) > xdmp:set($tasks, tail($tasks)) > } catch ($e) { > (: We’re done. ( > } > > Is there a better way to do this kind of looping forever? > > I’m also having a very strange behavior where in my new looping code I’m getting what I think must be a pending commit deadlock that I didn’t get in my recursive version of the code. I can trace the code to the xdmp:eval() that would commit an update to the
Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?
I’ve tried turning on the lock tracing and I do see deadlocks both with the recursive version and the loop-with-set version, but with the loop-with-set version basically every task doc is locked (which is what I would expect), while with the recursive version I get little bursts of five or six deadlocks at a time but the code generally runs. But this does suggest that my code is creating unexpected locks that I need to resolve. Cheers, E. -- Eliot Kimber http://contrext.com On 11/27/17, 11:45 AM, "general-boun...@developer.marklogic.com on behalf of Will Thompson" wrote: Eliot, Is the controller/while-loop transaction read-only (i.e.: is xdmp:request-timestamp() nonempty)? If it is, then I think you can be sure it's not holding locks. Otherwise, I would restructure that part of the application so that any transaction responsible for dispatching jobs doesn't make any updates. Generally, you don't want a long-running update transaction to touch lots of documents. If you turn on debug logging, ML will report to the error log when it detects a deadlock (and randomly kills and retries one of the deadlocking transactions). There is also a lock trace event you can enable to get detailed output about which transactions are holding locks and which ones are waiting on them (See: https://help.marklogic.com/knowledgebase/article/View/387/0/understanding-the-lock-trace-diagnostic-trace-event). All of the reporting IIRC is based on transaction IDs, so you generally have to do your own logging elsewhere to identify which IDs are associated with which transactions. As you might expect, this can get kind of hairy. In the past I have used a task server job to check for a condition, and if it hasn't been met, sleep for a few hundred ms and respawn. Similar behavior could also be accomplished with triggers or with CPF, but both are probably overkill for your case. 
-Will > On Nov 27, 2017, at 10:59 AM, William Sawyer wrote: > > You could recursively spawn or setup a schedule task to run every minute or faster if needed. > > -Will > > On Mon, Nov 27, 2017 at 9:56 AM, Eliot Kimber wrote: > I have a client-server system where the client is spawning 100s of 1000s of jobs on the client. The client polls the servers to see when each server’s task queue is ready for more jobs. This all works fine. > > Logically this polling is a while-true() loop that will continue until either all the servers are offline or all the tasks to be submitted are consumed. > > In a procedural language this is trivial, but in XQuery 2 I’m not finding a way to do it that works. In XQuery 3 I could use the new iterate operator but that doesn’t seem to be available in MarkLogic 9. > > My first attempt was to use a recursive process, relying on tail recursion optimization to avoid blowing the stack buffer. That worked logically but I still ran into out-of-memory on the server at some point (around 200K jobs submitted) and it seems likely that it was runaway recursion doing it. > > So I tried using a simple loop with xdmp:set() to iterate over the tasks and use an exception to break out when all the tasks are done: > > try { > for $i in 1 to 100 (: i.e., loop forever :) > if (empty($tasks)) > then error() > else submit-task(head($tasks)) > xdmp:set($tasks, tail($tasks)) > } catch ($e) { > (: We’re done. ( > } > > Is there a better way to do this kind of looping forever? > > I’m also having a very strange behavior where in my new looping code I’m getting what I think must be a pending commit deadlock that I didn’t get in my recursive version of the code. I can trace the code to the xdmp:eval() that would commit an update to the task and that code never returns. > > Each task is a document that I update to reflect the details of the task’s status (start and end times, current processing status, etc.). 
Those updates are all done either in separately-run modules or via xdmp:eval(), so as far as I can tell there shouldn’t be any issues with uncommitted updates. I didn’t change anything in the logic that updates the task documents, only the loop that iterates over the tasks. > > Could it be that the use of xdmp:set() to modify the $tasks variable (a sequence of elements) would be causing some kind of commit lock? > > Thanks, > > Eliot > > -- > Eliot Kimber > http://contrext.com > > > > > ___ > General mailing list > General@developer.marklogic.com >
Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?
It looks like my deadlock issue is, not surprisingly, my own bug somewhere—the use of xdmp:set on a sequence of elements appears to be a red herring.

Cheers,

E.
--
Eliot Kimber
http://contrext.com

From: on behalf of Eliot Kimber
Reply-To: MarkLogic Developer Discussion
Date: Monday, November 27, 2017 at 11:24 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?

It does appear that using xdmp:set() on my sequence of elements leads to the apparent commit deadlock. I reworked my code to iterate over a sequence of task IDs and then in each iteration I fetch the task element using that ID. With that change, my loop succeeds.

Using a scheduled job in this case doesn’t really work because I need to load up the client with as many job-submission tasks as there are available threads in the task server to make sure the servers are fully loaded (most jobs complete in a few seconds). So my approach is to divide the total jobs among N tasks in the task server. I do have a scheduled job to recreate these tasks, for example, in the case of a server restart.

I could probably adjust my system so that the tasks only do a subset of the total jobs and I depend on the scheduled job to create a new batch of job-submission tasks, but that seems unnecessarily complicated.

Cheers,

E.
--
Eliot Kimber
http://contrext.com

From: on behalf of William Sawyer
Reply-To: MarkLogic Developer Discussion
Date: Monday, November 27, 2017 at 10:59 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?

You could recursively spawn or set up a scheduled task to run every minute or faster if needed.

-Will

On Mon, Nov 27, 2017 at 9:56 AM, Eliot Kimber wrote:

I have a client-server system where the client is spawning 100s of 1000s of jobs on the client. The client polls the servers to see when each server’s task queue is ready for more jobs. This all works fine.
Logically this polling is a while-true() loop that will continue until either all the servers are offline or all the tasks to be submitted are consumed.

In a procedural language this is trivial, but in XQuery 2 I’m not finding a way to do it that works. In XQuery 3 I could use the new iterate operator but that doesn’t seem to be available in MarkLogic 9.

My first attempt was to use a recursive process, relying on tail recursion optimization to avoid blowing the stack buffer. That worked logically but I still ran into out-of-memory on the server at some point (around 200K jobs submitted) and it seems likely that it was runaway recursion doing it.

So I tried using a simple loop with xdmp:set() to iterate over the tasks and use an exception to break out when all the tasks are done:

    try {
      for $i in 1 to 100 (: i.e., loop forever :)
      return
        if (empty($tasks))
        then error()
        else (
          submit-task(head($tasks)),
          xdmp:set($tasks, tail($tasks))
        )
    } catch ($e) {
      (: We’re done. :)
    }

Is there a better way to do this kind of looping forever?

I’m also having a very strange behavior where in my new looping code I’m getting what I think must be a pending commit deadlock that I didn’t get in my recursive version of the code. I can trace the code to the xdmp:eval() that would commit an update to the task and that code never returns.

Each task is a document that I update to reflect the details of the task’s status (start and end times, current processing status, etc.). Those updates are all done either in separately-run modules or via xdmp:eval(), so as far as I can tell there shouldn’t be any issues with uncommitted updates. I didn’t change anything in the logic that updates the task documents, only the loop that iterates over the tasks.

Could it be that the use of xdmp:set() to modify the $tasks variable (a sequence of elements) would be causing some kind of commit lock?
Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
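For the archive: the ID-based rework described in this thread (iterate over lightweight task IDs and fetch each task document fresh inside the loop, instead of carrying a sequence of task elements across xdmp:set() calls) might look roughly like the following sketch. The "tasks" collection and the local:submit-task function are hypothetical stand-ins, and cts:uris() assumes the URI lexicon is enabled:

```xquery
xquery version "1.0-ml";

(: Hypothetical stand-in for the real job-submission logic. :)
declare function local:submit-task($task as element(task)) as empty-sequence()
{
  xdmp:log(fn:concat("submitting ", xdmp:node-uri($task)))
};

(: Iterate over lightweight URIs; re-fetch each task document inside the
   iteration rather than holding node references across iterations. :)
for $uri in cts:uris((), (), cts:collection-query("tasks"))
return local:submit-task(fn:doc($uri)/task)
```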
Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?
It does appear that using xdmp:set() on my sequence of elements leads to the apparent commit deadlock. I reworked my code to iterate over a sequence of task IDs and then in each iteration I fetch the task element using that ID. With that change, my loop succeeds.

Using a scheduled job in this case doesn’t really work because I need to load up the client with as many job-submission tasks as there are available threads in the task server to make sure the servers are fully loaded (most jobs complete in a few seconds). So my approach is to divide the total jobs among N tasks in the task server. I do have a scheduled job to recreate these tasks, for example, in the case of a server restart.

I could probably adjust my system so that the tasks only do a subset of the total jobs and I depend on the scheduled job to create a new batch of job-submission tasks, but that seems unnecessarily complicated.

Cheers,

E.
--
Eliot Kimber
http://contrext.com

From: on behalf of William Sawyer
Reply-To: MarkLogic Developer Discussion
Date: Monday, November 27, 2017 at 10:59 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?

You could recursively spawn or set up a scheduled task to run every minute or faster if needed.

-Will

On Mon, Nov 27, 2017 at 9:56 AM, Eliot Kimber wrote:

I have a client-server system where the client is spawning 100s of 1000s of jobs on the client. The client polls the servers to see when each server’s task queue is ready for more jobs. This all works fine.

Logically this polling is a while-true() loop that will continue until either all the servers are offline or all the tasks to be submitted are consumed.

In a procedural language this is trivial, but in XQuery 2 I’m not finding a way to do it that works. In XQuery 3 I could use the new iterate operator but that doesn’t seem to be available in MarkLogic 9.
My first attempt was to use a recursive process, relying on tail recursion optimization to avoid blowing the stack buffer. That worked logically but I still ran into out-of-memory on the server at some point (around 200K jobs submitted) and it seems likely that it was runaway recursion doing it.

So I tried using a simple loop with xdmp:set() to iterate over the tasks and use an exception to break out when all the tasks are done:

    try {
      for $i in 1 to 100 (: i.e., loop forever :)
      return
        if (empty($tasks))
        then error()
        else (
          submit-task(head($tasks)),
          xdmp:set($tasks, tail($tasks))
        )
    } catch ($e) {
      (: We’re done. :)
    }

Is there a better way to do this kind of looping forever?

I’m also having a very strange behavior where in my new looping code I’m getting what I think must be a pending commit deadlock that I didn’t get in my recursive version of the code. I can trace the code to the xdmp:eval() that would commit an update to the task and that code never returns.

Each task is a document that I update to reflect the details of the task’s status (start and end times, current processing status, etc.). Those updates are all done either in separately-run modules or via xdmp:eval(), so as far as I can tell there shouldn’t be any issues with uncommitted updates. I didn’t change anything in the logic that updates the task documents, only the loop that iterates over the tasks.

Could it be that the use of xdmp:set() to modify the $tasks variable (a sequence of elements) would be causing some kind of commit lock?

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] How to Do Equivalent of While true() Loop In ML?
I have a client-server system where the client is spawning 100s of 1000s of jobs on the client. The client polls the servers to see when each server’s task queue is ready for more jobs. This all works fine.

Logically this polling is a while-true() loop that will continue until either all the servers are offline or all the tasks to be submitted are consumed.

In a procedural language this is trivial, but in XQuery 2 I’m not finding a way to do it that works. In XQuery 3 I could use the new iterate operator but that doesn’t seem to be available in MarkLogic 9.

My first attempt was to use a recursive process, relying on tail recursion optimization to avoid blowing the stack buffer. That worked logically but I still ran into out-of-memory on the server at some point (around 200K jobs submitted) and it seems likely that it was runaway recursion doing it.

So I tried using a simple loop with xdmp:set() to iterate over the tasks and use an exception to break out when all the tasks are done:

    try {
      for $i in 1 to 100 (: i.e., loop forever :)
      return
        if (empty($tasks))
        then error()
        else (
          submit-task(head($tasks)),
          xdmp:set($tasks, tail($tasks))
        )
    } catch ($e) {
      (: We’re done. :)
    }

Is there a better way to do this kind of looping forever?

I’m also having a very strange behavior where in my new looping code I’m getting what I think must be a pending commit deadlock that I didn’t get in my recursive version of the code. I can trace the code to the xdmp:eval() that would commit an update to the task and that code never returns.

Each task is a document that I update to reflect the details of the task’s status (start and end times, current processing status, etc.). Those updates are all done either in separately-run modules or via xdmp:eval(), so as far as I can tell there shouldn’t be any issues with uncommitted updates. I didn’t change anything in the logic that updates the task documents, only the loop that iterates over the tasks.
Could it be that the use of xdmp:set() to modify the $tasks variable (a sequence of elements) would be causing some kind of commit lock? Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Ensuring That Tail Recursion Optimization Will Be Applied
In ML 9, I’m using recursive functions to process an arbitrarily large set of items where in a procedural language I would use a while true() loop. The number of items can be large, so tail recursion optimization has to be in place or I’ll eventually blow the call stack.

My question: How do I ensure that my functions are constructed so that tail recursion optimization will be applied? A search on “recursion” or “tail recursion” didn’t reveal anything on the ML docs site.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
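As far as I know, MarkLogic’s documentation doesn’t spell out when tail-call optimization applies, but the usual rule of thumb for XQuery engines that optimize tail calls is that the recursive call must be the entire value of the final branch of the function body, with no work left to do after it returns. A sketch of the two shapes (illustrative only; whether the engine actually optimizes the first form is an assumption):

```xquery
xquery version "1.0-ml";

(: Tail-recursive shape: the recursive call is the whole result of the
   else branch, so nothing remains to compute after it returns. :)
declare function local:count-items($items as item()*, $acc as xs:integer) as xs:integer
{
  if (fn:empty($items))
  then $acc                                          (: base case :)
  else local:count-items(fn:tail($items), $acc + 1)  (: call in tail position :)
};

(: NOT tail-recursive: the "1 +" forces the engine to keep each stack
   frame alive until the recursion bottoms out. :)
declare function local:count-items-bad($items as item()*) as xs:integer
{
  if (fn:empty($items))
  then 0
  else 1 + local:count-items-bad(fn:tail($items))
};

local:count-items(1 to 5, 0)  (: 5 :)
```

The general technique is to thread any pending work through an accumulator parameter, as $acc does here, so the recursive call can be the last thing the function does.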
Re: [MarkLogic Dev General] [RESOLVED] Spawned Task Appears to Block Other Threads
I moved my response handler to a different web app than the web app that starts the job-submission task, but I was still having the issue where the response handler didn’t handle responses (or didn’t get responses; it’s hard for me to tell which) until the job-submission task was canceled.

I realized that my problem was that the initial job submission was updating the job record for each job, but I was doing the update as part of the main processing, rather than using eval(), so the commit wasn’t done until the task ended. The response handler also wants to update the job record, but because there is a pending commit, it is blocked. By having the job submission do the update via eval(), the commit is done immediately and everything works.

So the lesson is: you have to really understand the implications of updates and commits or you will go astray. Or maybe “concurrency is always hard”.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

On 11/10/17, 9:39 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

Yes, the client process is started from a Web app, so I think your analysis is correct. I will move the response handling to a separate Web app—probably should have done that from the start.

Thanks,
Eliot
--
Eliot Kimber
http://contrext.com

On 11/9/17, 11:46 PM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote:

Hi Eliot,

I think you kicked off your watcher job with an HTTP request, and it keeps the port open until it finishes. Only one thread can use the port at the same time. Use a different port for task response traffic, or consider running your watcher as a scheduled task.

Not super robust, and probably not used in production, but I did write an alternative queue for MarkLogic. It might give you some ideas...
https://github.com/grtjn/ml-queue

Cheers,
Geert

On 11/10/17, 1:06 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I have a system where I have a “client” ML server that submits jobs to a
>set of remote ML servers, checking their task queues and keeping each
>server’s queue at a max of 100 queued items (the remote servers could go
>away without notice so the client needs to be able to restart tasks and
>not have too many things queued up that would just have to be resubmitted).
>
>The remote tasks then talk back to the client to report status and return
>their final results.
>
>My job submission code uses recursive functions to iterate over the set of
>tasks to be submitted, checking for free remote queue slots via the ML
>REST API and submitting jobs as the queues empty. This code is spawned
>into a separate task in the task server. It uses xdmp:sleep(1000) to
>pause between checking the job queues.
>
>This all works fine, in that my jobs are submitted correctly and the
>remote queues fill up.
>
>However, as long as the job-submission task in the task server is
>running, the HTTP app that handles the REST calls from the remote servers
>is blocked (which blocks the remote jobs, which are of course waiting for
>responses from the client).
>
>If I kill the task server task, then the remote responses are handled as
>I would expect.
>
>My question: Why would the task server task block the other app? There
>must be something I’m doing or not doing but I have no idea what it might
>be.
>
>Thanks,
>
>Eliot
>--
>Eliot Kimber
>http://contrext.com
>
>___
>General mailing list
>General@developer.marklogic.com
>Manage your subscription at:
>http://developer.marklogic.com/mailman/listinfo/general

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
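The fix described in this thread (committing each job-record update in its own transaction so the long-running task isn’t holding a pending commit that blocks other requests) is conventionally done with xdmp:eval in different-transaction isolation. A sketch, with a hypothetical document URI and element structure:

```xquery
xquery version "1.0-ml";

(: Run the job-record update in its own transaction so it commits
   immediately, instead of at the end of the long-running task. :)
xdmp:eval(
  'declare variable $uri external;
   declare variable $status external;
   xdmp:node-replace(fn:doc($uri)/job/status,
                     element status { $status })',
  (xs:QName("uri"), "/jobs/job-123.xml",   (: hypothetical URI :)
   xs:QName("status"), "submitted"),
  <options xmlns="xdmp:eval">
    <isolation>different-transaction</isolation>
    <prevent-deadlocks>true</prevent-deadlocks>
  </options>)
```

Note that prevent-deadlocks makes the eval fail fast if the calling transaction already holds an update lock on the same document, which is exactly the self-deadlock scenario described above.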
Re: [MarkLogic Dev General] Spawned Task Appears to Block Other Threads
Yes, the client process is started from a Web app, so I think your analysis is correct. I will move the response handling to a separate Web app—probably should have done that from the start.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

On 11/9/17, 11:46 PM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote:

Hi Eliot,

I think you kicked off your watcher job with an HTTP request, and it keeps the port open until it finishes. Only one thread can use the port at the same time. Use a different port for task response traffic, or consider running your watcher as a scheduled task.

Not super robust, and probably not used in production, but I did write an alternative queue for MarkLogic. It might give you some ideas...

https://github.com/grtjn/ml-queue

Cheers,
Geert

On 11/10/17, 1:06 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I have a system where I have a “client” ML server that submits jobs to a
>set of remote ML servers, checking their task queues and keeping each
>server’s queue at a max of 100 queued items (the remote servers could go
>away without notice so the client needs to be able to restart tasks and
>not have too many things queued up that would just have to be resubmitted).
>
>The remote tasks then talk back to the client to report status and return
>their final results.
>
>My job submission code uses recursive functions to iterate over the set of
>tasks to be submitted, checking for free remote queue slots via the ML
>REST API and submitting jobs as the queues empty. This code is spawned
>into a separate task in the task server. It uses xdmp:sleep(1000) to
>pause between checking the job queues.
>
>This all works fine, in that my jobs are submitted correctly and the
>remote queues fill up.
>
>However, as long as the job-submission task in the task server is
>running, the HTTP app that handles the REST calls from the remote servers
>is blocked (which blocks the remote jobs, which are of course waiting for
>responses from the client).
>
>If I kill the task server task, then the remote responses are handled as
>I would expect.
>
>My question: Why would the task server task block the other app? There
>must be something I’m doing or not doing but I have no idea what it might
>be.
>
>Thanks,
>
>Eliot
>--
>Eliot Kimber
>http://contrext.com
>
>___
>General mailing list
>General@developer.marklogic.com
>Manage your subscription at:
>http://developer.marklogic.com/mailman/listinfo/general

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Spawned Task Appears to Block Other Threads
I have a system where I have a “client” ML server that submits jobs to a set of remote ML servers, checking their task queues and keeping each server’s queue at a max of 100 queued items (the remote servers could go away without notice, so the client needs to be able to restart tasks and not have too many things queued up that would just have to be resubmitted).

The remote tasks then talk back to the client to report status and return their final results.

My job submission code uses recursive functions to iterate over the set of tasks to be submitted, checking for free remote queue slots via the ML REST API and submitting jobs as the queues empty. This code is spawned into a separate task in the task server. It uses xdmp:sleep(1000) to pause between checking the job queues.

This all works fine, in that my jobs are submitted correctly and the remote queues fill up.

However, as long as the job-submission task in the task server is running, the HTTP app that handles the REST calls from the remote servers is blocked (which blocks the remote jobs, which are of course waiting for responses from the client).

If I kill the task server task, then the remote responses are handled as I would expect.

My question: Why would the task server task block the other app? There must be something I’m doing or not doing but I have no idea what it might be.

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com

___
General mailing list
General@developer.marklogic.com
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] [resolved] What Might Cause Documents to Silently Not Be Created?
I’ve found my bug: a bad assumption about the data. While the URIs of all my input documents are unique, their filenames are not, and I was using just the filename as the basis for my task record URIs.

I was in the process of posting my code, and that led me to verify that my assumptions were correct (because I knew somebody would challenge them) and, what do you know, they weren’t. So the lesson for the day is: always double-check your assumptions about the data. But I also learned something about uncatchable exceptions, so that’s good too.

Thanks for everyone’s help.

Cheers,

Eliot
--
Eliot Kimber
http://contrext.com

On 11/9/17, 10:11 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I’m actually not doing anything with the HTTP response. I get the response but currently don’t examine it (in fact I have a FIXME in the code to add handling of non-200 response codes, but for now it’s basically fire and forget—the request to the remote server ultimately spawns a task on that server, so the only non-success response would be one where the xdmp:spawn() on the remote server failed, which is unlikely to happen under normal operating conditions). I’m also careful to always turn off auto mapping, which is as evil as evil can be.

There were no relevant errors in the ErrorLog.txt. The code is running on the task server and I do see all my expected (success) messages there.

Looking at the uncatchable exceptions article, the only possible issue would be failures during commit that result in uncatchable exceptions. Per the article, I’m now using eval() to do the document-insert—that should allow any commit-time failure exception to now be caught.
Otherwise, none of the conditions that you suggested could hold: document URIs should be unique (because they reflect the URIs of the source items, each of which is the root node of its own document), there are no permissions in effect, I’m only creating a few 100 tasks in the task queue (each task then processes a 1000 input items, so 500K items means 500 tasks), I’m not spawning in update mode (but if that was the problem then it should fail for all attempts, not just a few of them). Cheers, E. Eliot Kimber On 11/9/17, 10:00 AM, "general-boun...@developer.marklogic.com on behalf of Will Thompson" wrote: Eliot, When you make the remote HTTP call, are you using one of the xdmp:http-XYZ functions? Since those functions return a payload describing the response condition and don't throw exceptions for most errors, is it possible that an HTTP response error condition is not being handled, resulting in inserting an empty sequence instead of a document? In the default case where function mapping is turned on, inserting an empty sequence will result in not calling xdmp:document-insert at all. You could test to see if that's happening by disabling function mapping, which would cause an exception to be raised instead. -Will > On Nov 8, 2017, at 5:25 PM, Eliot Kimber wrote: > > Using ML 9: > > I have a process that quickly creates a large number of small documents, one for each item in a set of input items. > > My code is basically: > > 1. Log that I’m about to act on the input item > 2. Act on the input item (send the input item to a remote HTTP end point) > 3. Create a new doc reflecting the input item I just acted on > > This code is within a try/catch and I log the exception, so I should know if there are any exceptions during this process by examining the log. > > I’m processing about 500K input items, with the processing spread over the 16 threads of my task server. So there are 16 tasks quickly writing these docs concurrently. 
> I know the exact count of the input items and I get that count in the log, so I know that I’m actually processing all the items I should be.
>
> However, if I subsequently count the documents created in step 3 I’m short by about 1500, meaning that not all the docs got created, which should not be able to happen unless there was an exception between the log message and the document-insert() call, but I’m not finding any exceptions or other errors reported in the log.
>
> My question: is there anything that would cause docs to silently not get created under this kind of heavy load? I would hope not but just wanted to make sure.
>
> I’m assuming this issue is my bug somewhere, but the code is pretty simple and I’m not seeing any obvious way the documents could not get created without a corresponding exception report.
Re: [MarkLogic Dev General] What Might Cause Documents to Silently Not Be Created?
I’m actually not doing anything with the HTTP response. I get the response but currently don’t examine it (in fact I have a FIXME in the code to add handling of non-200 response codes, but for now it’s basically fire and forget—the request to the remote server ultimately spawns a task on that server, so the only non-success response would be one where the xdmp:spawn() on the remote server failed, which is unlikely to happen under normal operating conditions). I’m also careful to always turn off auto mapping, which is as evil as evil can be.

There were no relevant errors in the ErrorLog.txt. The code is running on the task server and I do see all my expected (success) messages there.

Looking at the uncatchable exceptions article, the only possible issue would be failures during commit that result in uncatchable exceptions. Per the article, I’m now using eval() to do the document-insert—that should allow any commit-time failure exception to now be caught.

Otherwise, none of the conditions that you suggested could hold: document URIs should be unique (because they reflect the URIs of the source items, each of which is the root node of its own document), there are no permissions in effect, I’m only creating a few 100 tasks in the task queue (each task then processes a 1000 input items, so 500K items means 500 tasks), and I’m not spawning in update mode (but if that was the problem then it should fail for all attempts, not just a few of them).

Cheers,

E.

Eliot Kimber

On 11/9/17, 10:00 AM, "general-boun...@developer.marklogic.com on behalf of Will Thompson" wrote:

Eliot,

When you make the remote HTTP call, are you using one of the xdmp:http-XYZ functions? Since those functions return a payload describing the response condition and don't throw exceptions for most errors, is it possible that an HTTP response error condition is not being handled, resulting in inserting an empty sequence instead of a document?
In the default case where function mapping is turned on, inserting an empty sequence will result in not calling xdmp:document-insert at all. You could test to see if that's happening by disabling function mapping, which would cause an exception to be raised instead. -Will > On Nov 8, 2017, at 5:25 PM, Eliot Kimber wrote: > > Using ML 9: > > I have a process that quickly creates a large number of small documents, one for each item in a set of input items. > > My code is basically: > > 1. Log that I’m about to act on the input item > 2. Act on the input item (send the input item to a remote HTTP end point) > 3. Create a new doc reflecting the input item I just acted on > > This code is within a try/catch and I log the exception, so I should know if there are any exceptions during this process by examining the log. > > I’m processing about 500K input items, with the processing spread over the 16 threads of my task server. So there are 16 tasks quickly writing these docs concurrently. > > I know the exact count of the input items and I get that count in the log, so I know that I’m actually processing all the items I should be. > > However, if I subsequently count the documents created in step 3 I’m short by about 1500, meaning that not all the docs got created, which should not be able to happen unless there was an exception between the log message and the document-insert() call, but I’m not finding any exceptions or other errors reported in the log. > > My question: is there anything that would cause docs to silently not get created under this kind of heavy-load? I would hope not but just wanted to make sure. > > I’m assuming this issue is my bug somewhere, but the code is pretty simple and I’m not seeing any obvious way the documents could not get created without a corresponding exception report. 
> Thanks,
>
> Eliot
> --
> Eliot Kimber
> http://contrext.com
>
> ___
> General mailing list
> General@developer.marklogic.com
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
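Will’s function-mapping point can be demonstrated in isolation. With function mapping on (the 1.0-ml default), passing an empty sequence to a parameter declared as a single item maps the call over zero items, so the function body, and any document-insert inside it, silently never runs; disabling mapping turns the same call into an XDMP-AS type error. A minimal sketch (the local:save function is illustrative, not from the original code):

```xquery
xquery version "1.0-ml";

(: Uncomment the next line to get an XDMP-AS type error instead of a
   silent no-op when $maybe-doc is empty: :)
(: declare option xdmp:mapping "false"; :)

declare function local:save($doc as element()) as empty-sequence()
{
  (: In real code this would be an xdmp:document-insert() call. :)
  xdmp:log(fn:concat("inserting ", xdmp:describe($doc)))
};

(: e.g., an unhandled HTTP error condition left us with nothing :)
let $maybe-doc := ()
return local:save($maybe-doc)
(: With mapping on, the call executes zero times: no insert, no error. :)
```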
[MarkLogic Dev General] What Might Cause Documents to Silently Not Be Created?
Using ML 9: I have a process that quickly creates a large number of small documents, one for each item in a set of input items. My code is basically: 1. Log that I’m about to act on the input item 2. Act on the input item (send the input item to a remote HTTP end point) 3. Create a new doc reflecting the input item I just acted on This code is within a try/catch and I log the exception, so I should know if there are any exceptions during this process by examining the log. I’m processing about 500K input items, with the processing spread over the 16 threads of my task server. So there are 16 tasks quickly writing these docs concurrently. I know the exact count of the input items and I get that count in the log, so I know that I’m actually processing all the items I should be. However, if I subsequently count the documents created in step 3 I’m short by about 1500, meaning that not all the docs got created, which should not be able to happen unless there was an exception between the log message and the document-insert() call, but I’m not finding any exceptions or other errors reported in the log. My question: is there anything that would cause docs to silently not get created under this kind of heavy-load? I would hope not but just wanted to make sure. I’m assuming this issue is my bug somewhere, but the code is pretty simple and I’m not seeing any obvious way the documents could not get created without a corresponding exception report. Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] exact match using regex
The anchors mean that the expression must match the entirety of the input string, so “^re$” can only match the input string “re”. If you want to match only the blank-delimited token “re” then you would want something like:

    "(^|\s)re(\s|$)"

That is, match “re” preceded by the start of the input or whitespace, and followed by whitespace or the end of the input.

Or you could tokenize the input and then use the equals operator:

    tokenize("I am learning regex", " ") = ("re")

Remember that the “=” operator is a sequence comparison, so if any member of the left-hand sequence equals any member of the right-hand sequence, it resolves to true().

Cheers,

Eliot

Eliot Kimber
http://contrext.com

On 10/13/17, 2:41 PM, "general-boun...@developer.marklogic.com on behalf of vikas.sin...@cognizant.com" wrote:

Thanks for your reply, but adding the metacharacters is not working as I expected. This statement returns false (expected): fn:matches("I am learning regex","^re$") But this statement also returns false (unexpected): fn:matches("I am learning re","^re$") I am expecting true for that one.

-Original Message-
From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Christopher Hamlin
Sent: Friday, October 13, 2017 3:26 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] exact match using regex

You can use anchors, as fn:matches("I am learning regex","^re$")

Section 7.6.2 fn:matches of https://www.w3.org/TR/xpath-functions/#flags says:

Unless the metacharacters ^ and $ are used as anchors, the string is considered to match the pattern if any substring matches the pattern. But if anchors are used, the anchors must match the start/end of the string (in string mode), or the start/end of a line (in multiline mode).

Note: This is different from the behavior of patterns in [XML Schema Part 2: Datatypes Second Edition], where regular expressions are implicitly anchored.
On Fri, Oct 13, 2017 at 3:19 PM, wrote: > Hi All, > > > > How do I match an exact word using fn:matches and a regex? > > > > Example: fn:matches("I am learning regex", "re") > > The above statement returns true as it matches "regex", but I want it to match only when the string contains the exact keyword. > > > > Regards, > > Vikas Singh > > > > This e-mail and any files transmitted with it are for the sole use of > the intended recipient(s) and may contain confidential and privileged > information. If you are not the intended recipient(s), please reply to > the sender and destroy all copies of the original message. Any > unauthorized review, use, disclosure, dissemination, forwarding, > printing or copying of this email, and/or any action taken in reliance > on the contents of this e-mail is strictly prohibited and may be > unlawful. Where permitted by applicable law, this e-mail and other > e-mail communications sent to and from Cognizant e-mail addresses may be monitored.
Re: [MarkLogic Dev General] How To Detect Task Time Limit Exceeded Failures?
In fact, now that I look for it, that’s already happening in my code; I just didn’t realize it. So problem solved. Cheers, E. -- Eliot Kimber http://contrext.com On 10/7/17, 8:18 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I can certainly experiment with that. Cheers, E. -- Eliot Kimber http://contrext.com On 10/7/17, 7:41 AM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote: Hi Eliot, I heard the other day that it should be possible to capture such timeouts with a try/catch within the code itself. That gives an extra 10-second delay, which might be sufficient to send out an alert email or raise some other flag. After those few extra seconds, the timeout gets rethrown if you don’t finish in time. Might be worth investigating? Cheers, Geert On 10/7/17, 12:10 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >Using current ML 9: > >I’ve set up a little client-server application where the client spawns a >large number of tasks on a remote cluster. Each remote task reports its >status back to the client via HTTP. > >However, if one of the tasks times out in the Task Server there’s no way >for it to report its own failure and there doesn’t seem to be anything >else other than the task server that can detect the failure and report it. > >Is there any built-in mechanism by which a task time limit exceeded >failure can be detected in a way that would allow me to report back >to the calling client? For example, something that gets the task’s >current call stack at the time of failure, which would give me the info I >need to report back to the calling client.
> >Unfortunately, the code I’m running in these tasks is pre-existing >processing that I’m building this remote processing around so I can’t >easily do something like provide a heartbeat signal for each running task >that a separate process could poll in order to detect terminated >processes, although I’m guessing that’s the most likely solution now that >I think about it. > >I do report to the client when each task starts so I guess I could >presume that if a task hasn’t finished some time after the configured max >time limit that it is presumed to have failed. > >Thanks, > >Eliot >-- >Eliot Kimber >http://contrext.com
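Geert's try/catch suggestion might be sketched like this; the error codes tested for, the callback endpoint, and the placement inside the task body are my assumptions, not verified behavior:

```xquery
xquery version "1.0-ml";
declare namespace error = "http://marklogic.com/xdmp/error";

try {
  (: ... the pre-existing long-running task body goes here ... :)
  xdmp:log("work finished")
} catch ($e) {
  if ($e/error:code = ("XDMP-EXTIME", "SVC-CANCELED"))
  then (
    (: Use the grace period to tell the client this task timed out.
       The endpoint URL is hypothetical. :)
    xdmp:http-post("http://client.example.com/task-status",
      (), text { "TIMEOUT" }),
    xdmp:rethrow()
  )
  else xdmp:rethrow()
}
```

Rethrowing after the notification preserves the normal timeout handling; the catch only buys a window in which to report.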
Re: [MarkLogic Dev General] How To Detect Task Time Limit Exceeded Failures?
I can certainly experiment with that. Cheers, E. -- Eliot Kimber http://contrext.com On 10/7/17, 7:41 AM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote: Hi Eliot, I heard the other day that it should be possible to capture such timeouts with a try/catch within the code itself. That gives an extra 10-second delay, which might be sufficient to send out an alert email or raise some other flag. After those few extra seconds, the timeout gets rethrown if you don’t finish in time. Might be worth investigating? Cheers, Geert On 10/7/17, 12:10 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >Using current ML 9: > >I’ve set up a little client-server application where the client spawns a >large number of tasks on a remote cluster. Each remote task reports its >status back to the client via HTTP. > >However, if one of the tasks times out in the Task Server there’s no way >for it to report its own failure and there doesn’t seem to be anything >else other than the task server that can detect the failure and report it. > >Is there any built-in mechanism by which a task time limit exceeded >failure can be detected in a way that would allow me to report back >to the calling client? For example, something that gets the task’s >current call stack at the time of failure, which would give me the info I >need to report back to the calling client. > >Unfortunately, the code I’m running in these tasks is pre-existing >processing that I’m building this remote processing around so I can’t >easily do something like provide a heartbeat signal for each running task >that a separate process could poll in order to detect terminated >processes, although I’m guessing that’s the most likely solution now that >I think about it. > >I do report to the client when each task starts so I guess I could >presume that if a task hasn’t finished some time after the configured max >time limit that it is presumed to have failed.
> >Thanks, > >Eliot >-- >Eliot Kimber >http://contrext.com
[MarkLogic Dev General] How To Detect Task Time Limit Exceeded Failures?
Using current ML 9: I’ve set up a little client-server application where the client spawns a large number of tasks on a remote cluster. Each remote task reports its status back to the client via HTTP. However, if one of the tasks times out in the Task Server there’s no way for it to report its own failure, and there doesn’t seem to be anything else other than the task server that can detect the failure and report it. Is there any built-in mechanism by which a task time limit exceeded failure can be detected in a way that would allow me to report back to the calling client? For example, something that gets the task’s current call stack at the time of failure, which would give me the info I need to report back to the calling client. Unfortunately, the code I’m running in these tasks is pre-existing processing that I’m building this remote processing around, so I can’t easily do something like provide a heartbeat signal for each running task that a separate process could poll in order to detect terminated processes, although I’m guessing that’s the most likely solution now that I think about it. I do report to the client when each task starts, so I guess I could presume that a task has failed if it hasn’t finished some time after the configured max time limit. Thanks, Eliot -- Eliot Kimber http://contrext.com
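The heartbeat idea mentioned above could be sketched roughly as follows; the directory, element names, and time limit are all invented for illustration:

```xquery
xquery version "1.0-ml";

(: Monitor query. Assumes each spawned task periodically writes
   /heartbeats/<task-id>.xml containing, e.g.,
   <heartbeat task="..." at="2017-10-07T00:10:00Z"/>.
   This flags tasks whose last heartbeat is older than the limit. :)
let $limit := xs:dayTimeDuration("PT10M")  (: assumed task time limit :)
for $hb in cts:search(fn:collection(), cts:directory-query("/heartbeats/"))/heartbeat
where fn:current-dateTime() - xs:dateTime($hb/@at) gt $limit
return xdmp:log("Task " || $hb/@task || " presumed timed out", "warning")
```

A scheduled task could run this monitor every minute or so and notify the client for any stale entries.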
Re: [MarkLogic Dev General] Trouble with syntax... trying to return XML tree
Your code would be easier to read if you used literal result elements rather than computed element constructors, e.g.:

<policy_num>{$policy_num}</policy_num>
<number_of_records>{$num_recs}</number_of_records>
<records>{ local:recordsByPolicy($policy_num) }</records>

As a general rule, it’s only necessary (or useful) to use computed element constructors when the element type is dynamically determined. Cheers, Eliot -- Eliot Kimber http://contrext.com From: on behalf of Matt Moody Reply-To: MarkLogic Developer Discussion Date: Wednesday, October 4, 2017 at 7:08 PM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Trouble with syntax... trying to return XML tree I am getting the error XDMP-UNEXPECTED: (err:XPST0003) Unexpected token syntax error, unexpected $end from the below query, and cannot figure out why. Any ideas would be appreciated! The idea here is that an Email may show up in multiple Policy Records, that same Email may be linked to multiple Policy Numbers, and each Policy Number may show up in multiple Records. I want to use this query to return all Emails with multiple Policies, return each Policy number linked to the Email, and then show each Record (Document) where that Policy number is contained.
xquery version "1.0-ml";

declare variable $coll := "insurance-policies ";

declare function local:recordsByPolicy($policyNum as xs:string)
{(
  for $doc in fn:collection($coll)//policy[policy_num/text() = $policyNum]
  return element document {
    element doc_uri { xdmp:node-uri($doc) }
  }
)};

declare function local:policiesByEmail($sourceEmail as xs:string)
{(
  for $policy_num in fn:distinct-values(fn:collection($coll)//policy[insured_email/text() = $email]/policy_num/text())
  let $num_recs := fn:count(fn:collection($coll)//policy[policy_num/text() = $policy_num])
  order by $num_recs descending
  return element policy {
    element policy_num {$policy_num},
    element number_of_records {$num_recs},
    element records {( local:recordsByPolicy($policy_num) )}
  }
)};

let $emails_with_multiple_policies :=
  for $em at $i in fn:distinct-values(fn:collection($coll)//insured_email/text())
  let $policies := fn:count(fn:distinct-values(fn:collection($coll)//policy[insured_email/text() = $em]/policy_num/text()))
  where $policies > 1
return (
  element total_source_records {(fn:count(fn:collection($coll)))},
  element unique_source_emails {(fn:count($unique_source_emails))},
  element emails_w_mpolicies {(fn:count($emails_with_multiple_policies))},
  element results {(
    for $email in fn:distinct-values(fn:collection($coll)//insured_email/text())
    let $num_policies := fn:count(fn:distinct-values(fn:collection($coll)//policy[insured_email/text() = $email]/policy_num/text()))
    where $num_policies > 1
    order by $num_policies descending
    return element result {
      element email {$email},
      element number_of_policies_found {$num_policies},
      element policies {( local:policiesByEmail($email) )}
    }
  )}
)

Matt Moody Sales Engineer MarkLogic Corporation matt.mo...@marklogic.com Mobile: +61 (0)415 564 355 This e-mail and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed.
Any review, disclosure, copying, distribution, or use of this e-mail communication by others is strictly prohibited. If you are not the intended recipient, please notify us immediately by returning this message to the sender and delete all copies. Thank you for your cooperation.
[MarkLogic Dev General] How To Reflect Specific Timezone in Formatted Date Time?
I’m trying to produce a formatted date that reflects a specific time zone name, rather than the "GMT-07:00" form:

format-dateTime($time, "[Y0001]-[M01]-[D01] at [H01]:[m01]:[s01] [ZN]")

where $time = 2017-09-29T08:01:54.216992-07:00 returns:

2017-09-29 at 08:01:54 GMT-07:00

(running on a server in the Pacific time zone). What I’d like is:

2017-09-29 at 08:01:54 PDT

I’ve tried setting the $place parameter to different values but nothing I’ve tried gives a different result, except adding a prefix before the date indicating the location. I also tried different values for the time zone pattern with no change (or simple failure due to a bad pattern). The W3C docs suggest that "[ZN]" should result in just the time zone name, but those specs are difficult enough to understand that I’m never sure I’m reading them correctly. Thanks, Eliot -- Eliot Kimber http://contrext.com
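For what it's worth, the five-argument form of fn:format-dateTime in the XPath 3.0 spec allows $place to be an IANA timezone name, which is what [ZN] is supposed to resolve against; whether MarkLogic's implementation actually maps this to a name like PDT is something I have not verified:

```xquery
fn:format-dateTime(
  xs:dateTime("2017-09-29T08:01:54.216992-07:00"),
  "[Y0001]-[M01]-[D01] at [H01]:[m01]:[s01] [ZN]",
  "en", (), "America/Los_Angeles")
```

If the implementation has no name data for the zone, the spec permits it to fall back to the numeric offset form, which would explain the GMT-07:00 output.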
Re: [MarkLogic Dev General] Apparent Memory Leak in Profiler
I can verify that ML 8.0-7 resolves the memory leak in the profiler. I can now profile hundreds of thousands of tasks, no problem. Cheers, E. -- Eliot Kimber http://contrext.com On 8/28/17, 1:41 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: Thanks—I should be able to test with latest ML 8 in a couple of days. Cheers, E. -- Eliot Kimber http://contrext.com On 8/28/17, 12:37 PM, "general-boun...@developer.marklogic.com on behalf of Christopher Hamlin" wrote: There was a bug where, under certain circumstances, the profiler will result in a query deadlock &/or a resource leak (#45569). It could be that this is what you are seeing. It was noticed in 8.0-2 and is fixed in the latest release (8.0-7). On Mon, Aug 28, 2017 at 1:11 PM, Eliot Kimber wrote: > I reported earlier that my profiling application was causing MarkLogic to restart after handling about 20,000 tasks. Turns out it was an out-of-memory issue on the server itself (currently configured with 256GB of RAM). We could see a distinct spike in memory usage, at which point the server restarted MarkLogic. I tried different input data sets so it doesn’t appear to be an issue with a particular input document (my data set has a few outliers that are much larger than typical but only a few). > > Subsequent testing determined that it was the use of the MarkLogic profiler that was causing the memory spike: if I turned off the profiler then memory usage was flat and all the tasks completed as expected. > > This is ML 8.03. I’m still working on getting my server upgraded to a newer version of MarkLogic so I can see if this is an issue that has already been fixed. > > So it looks like there’s some kind of memory leak related to the profiler and I’d like to understand what that issue is and either understand how to avoid it or report it formally. > > If it’s a general potential problem with large-scale processing I would like to understand how to avoid it or plan for it.
If it’s a problem specific to the profiler then I need to report it formally and provide appropriate diagnostics. > > So my questions: > > 1. Is this a known issue with profiling? I’m guessing not, in that I’m probably doing something out-of-the-ordinary vis-à-vis profiling, something that nobody would see in typical single-instance ad-hoc profiling. > 2. What types of MarkLogic processing would cause this kind of memory spike that lasts across the execution of multiple tasks? I would expect the memory required for a given task to be released as soon as the task is complete, so I’m guessing it must be an issue with caches or something? > > Thanks, > > Eliot > -- > Eliot Kimber > http://contrext.com
Re: [MarkLogic Dev General] Apparent Memory Leak in Profiler
Thanks—I should be able to test with latest ML 8 in a couple of days. Cheers, E. -- Eliot Kimber http://contrext.com On 8/28/17, 12:37 PM, "general-boun...@developer.marklogic.com on behalf of Christopher Hamlin" wrote: There was a bug where, under certain circumstances, the profiler will result in a query deadlock &/or a resource leak (#45569). It could be that this is what you are seeing. It was noticed in 8.0-2 and is fixed in the latest release (8.0-7). On Mon, Aug 28, 2017 at 1:11 PM, Eliot Kimber wrote: > I reported earlier that my profiling application was causing MarkLogic to restart after handling about 20,000 tasks. Turns out it was an out-of-memory issue on the server itself (currently configured with 256GB of RAM). We could see a distinct spike in memory usage, at which point the server restarted MarkLogic. I tried different input data sets so it doesn’t appear to be an issue with a particular input document (my data set has a few outliers that are much larger than typical but only a few). > > Subsequent testing determined that it was the use of the MarkLogic profiler that was causing the memory spike: if I turned off the profiler then memory usage was flat and all the tasks completed as expected. > > This is ML 8.03. I’m still working on getting my server upgraded to a newer version of MarkLogic so I can see if this is an issue that has already been fixed. > > So it looks like there’s some kind of memory leak related to the profiler and I’d like to understand what that issue is and either understand how to avoid it or report it formally. > > If it’s a general potential problem with large-scale processing I would like to understand how to avoid it or plan for it. If it’s a problem specific to the profiler then I need to report it formally and provide appropriate diagnostics. > > So my questions: > > 1. Is this a known issue with profiling?
I’m guessing not in that I’m probably doing something out-of-the-ordinary vis-à-vis profiling and is something that nobody would see in typical single-instance ad-hoc profiling. > 2. What types of MarkLogic processing would cause this kind of memory spike that lasts across the execution of multiple tasks? I would expect the memory required for a given task to be released as soon as the task is complete so I’m guessing it must be an issue with caches or something? > > Thanks, > > Eliot > -- > Eliot Kimber > http://contrext.com
[MarkLogic Dev General] Apparent Memory Leak in Profiler
I reported earlier that my profiling application was causing MarkLogic to restart after handling about 20,000 tasks. Turns out it was an out-of-memory issue on the server itself (currently configured with 256GB of RAM). We could see a distinct spike in memory usage, at which point the server restarted MarkLogic. I tried different input data sets, so it doesn’t appear to be an issue with a particular input document (my data set has a few outliers that are much larger than typical, but only a few). Subsequent testing determined that it was the use of the MarkLogic profiler that was causing the memory spike: if I turned off the profiler then memory usage was flat and all the tasks completed as expected. This is ML 8.03. I’m still working on getting my server upgraded to a newer version of MarkLogic so I can see if this is an issue that has already been fixed. So it looks like there’s some kind of memory leak related to the profiler, and I’d like to understand what that issue is and either understand how to avoid it or report it formally. If it’s a general potential problem with large-scale processing I would like to understand how to avoid it or plan for it. If it’s a problem specific to the profiler then I need to report it formally and provide appropriate diagnostics. So my questions: 1. Is this a known issue with profiling? I’m guessing not, in that I’m probably doing something out-of-the-ordinary vis-à-vis profiling, something that nobody would see in typical single-instance ad-hoc profiling. 2. What types of MarkLogic processing would cause this kind of memory spike that lasts across the execution of multiple tasks? I would expect the memory required for a given task to be released as soon as the task is complete, so I’m guessing it must be an issue with caches or something? Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Why No cts:median-aggregate() Function?
I’m upgrading my profiling system to use the cts aggregate functions for doing math on large numbers of durations—this speeds things up tremendously, of course. However, there doesn’t appear to be a median-aggregate() function in ML 8 or ML 9, only cts:median(), which operates on a sequence of doubles. For example, for a range index over xs:dayTimeDuration values I can do:

let $average := cts:avg-aggregate(
  cts:element-reference(xs:QName("prof:overall-elapsed")),
  ("item-frequency"),
  cts:collection-query(epf:get-trial-collection($trial-number)))

But to get the equivalent median the only solution I’m seeing is to convert all the durations to doubles and then take the median, which is very slow. At least in my data set the median is a better measure of overall performance than the average, because I have a small number of very slow outliers, so I really need both median and average. This seems like an obvious oversight in the cts aggregate functions—am I missing a solution? Thanks, Eliot -- Eliot Kimber http://contrext.com
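The fallback described above (pull the duration values from the range index, convert to numbers, take the median) might look roughly like this sketch; the collection name stands in for epf:get-trial-collection(), and dividing two xs:dayTimeDurations yields a decimal, so dividing by PT1S gives seconds:

```xquery
xquery version "1.0-ml";
declare namespace prof = "http://marklogic.com/xdmp/profile";
declare variable $trial-collection := "trial-058";  (: assumed collection name :)

let $ref  := cts:element-reference(xs:QName("prof:overall-elapsed"))
let $durs := cts:values($ref, (), ("item-frequency"),
               cts:collection-query($trial-collection))
(: cts:values returns distinct values, so repeat each one by its
   frequency before taking the median :)
let $secs :=
  for $d in $durs
  for $i in 1 to cts:frequency($d)
  return $d div xs:dayTimeDuration("PT1S")
return math:median($secs)
```

This at least keeps the value retrieval in the lexicon rather than touching documents, though the median itself is still computed in memory.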
Re: [MarkLogic Dev General] Noob query question..
I just went through an exercise similar to this with my profiling application. In my case I’m capturing the output from the profiler for a large number of processing instances (potentially millions). I’m measuring both raw performance and also at-scale performance for processing of a large corpus, so it’s not sufficient to just profile a few cases. We know there is wide variation in performance for different input documents and we also want to see trends, both within the data and over time as the data, code, and servers evolve. So I’m measuring everything. I want to know which expressions take the most time across all the instances and get a count. For example, in the longest instances one particular expression is always the top one, but is it the top one in the faster instances? The information is in the profiler histogram output, but it is not ordered by shallow time (the value I’m interested in), so it’s not as easy as just getting the first expression for each histogram. The solution approach, developed for me by Evan Lenz, is to use a trick with co-occurrence queries where attributes on the same element have a proximity of zero. If you construct an index over two attributes you can then use cts:value-co-occurrences() to get all the pairs and then select the ones you want. This approach also requires that each set be in a separate document (so that you can limit each call to cts:value-co-occurrences() to a single document using cts:document-query()). If there were multiple profiling results in a single document there would be no index-based way to limit value-co-occurrences() to a single profiling instance. To enable this I had to post-process the output of the MarkLogic profiler to add attributes to the prof:expression elements with the shallow-time and expr-source values, which are otherwise within subelements, the result being elements like <prof:expression shallow-time="…" expr-source="…">…</prof:expression>. (I used a simple XSLT transform for this part of the processing, applied as I store my profiling results.)
I then defined attribute range indexes with word positions turned on for the @shallow-time and @expr-source attributes. With that I could then do this to find the longest for each profiling instance (where each profiling instance is stored as a separate document):

let $maps :=
  for $uri in cts:uris((), (), cts:collection-query($collection))
  let $expression-index := cts:element-attribute-reference(xs:QName("prof:expression"), xs:QName("expr-source"))
  let $shallow-time-index := cts:element-attribute-reference(xs:QName("prof:expression"), xs:QName("shallow-time"))
  let $max := cts:max($shallow-time-index, (), cts:document-query($uri))
  let $co-occurrences :=
    cts:value-co-occurrences(
      $expression-index,
      $shallow-time-index,
      ("proximity=0", "map"),
      cts:document-query($uri)
    )
  let $max-co-occurrence :=
    for $map in $co-occurrences
    let $keys := map:keys($map)
    for $key in $keys
    return
      if ($max eq xs:dayTimeDuration(map:get($map, $key)[1]))
      then map:entry($key, map:get($map, $key))
      else ()
  return $max-co-occurrence

-- Eliot Kimber http://contrext.com From: on behalf of "Ladner, Eric (Eric.Ladner)" Reply-To: MarkLogic Developer Discussion Date: Thursday, August 24, 2017 at 4:30 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Noob query question.. Thank you. I will play with this in my development environment tomorrow. I don’t quite see how it’s getting the counts per subject, though. For reference.. the structure is similar to this: Test Subject 2017-04-01T15:32:00 Blah, blah There would be many notes, obviously, and the output would ideally be something like (not married to that output, but some output showing the counts for each subject over that time range): Test Subject 2 Subject 2 4 ...
Eric Ladner Systems Analyst eric.lad...@chevron.com From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Sam Mefford Sent: August 24, 2017 15:59 To: MarkLogic Developer Discussion Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Noob query question.. I should point out that this is not the fastest way to do it. A faster way would be to index "date-taken" as a dateTime element range index and use cts:search with cts:element-range-query. Sam Mefford Senior Engineer MarkLogic Corporation sam.meff...@marklogic.com Cell: +1 801 706 9731 www.marklogic.com
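Sam's faster, index-backed approach might look like the following sketch; the element names (date-taken, subject) come from the thread, the date range is invented, and it assumes range indexes exist on both elements (a dateTime index on date-taken and a string index on subject):

```xquery
xquery version "1.0-ml";

let $in-range :=
  cts:and-query((
    cts:element-range-query(xs:QName("date-taken"), ">=",
      xs:dateTime("2017-04-01T00:00:00")),
    cts:element-range-query(xs:QName("date-taken"), "<",
      xs:dateTime("2017-05-01T00:00:00"))))
(: Counts per subject straight from the lexicons, no document retrieval :)
for $subject in cts:element-values(xs:QName("subject"), (), ("item-frequency"), $in-range)
order by cts:frequency($subject) descending
return <subject name="{$subject}" count="{cts:frequency($subject)}"/>
```

cts:frequency() gives the count of notes per subject within the constraining query, which is where the per-subject counts come from without touching the documents themselves.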
Re: [MarkLogic Dev General] Where is General Documentation for the Task Server App?
I’ll see if I can get MarkLogic server upgraded—it does make sense to be on the latest version of ML 8. Cheers, Eliot -- Eliot Kimber http://contrext.com On 8/23/17, 11:25 AM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote: Hi Eliot, You could be hitting a bug in MarkLogic. It might be worth upgrading to 8.0-7, and seeing if it still happens with that version. A lot of patches and performance improvements have been made since 8.0-3.2. Cheers, Geert On 8/23/17, 5:47 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >Yes, I checked the log and the messages are: > >2017-08-22 21:43:42.287 Info: TaskServer: profiling-task.xqy >[4136613570697343302]: Starting, start: 32361, group size: 10, >outdir="/profiling/trial-058/group-3237/ >2017-08-22 21:43:42.287 Info: TaskServer: " >2017-08-22 21:43:42.405 Info: Saving /marklogic/Forests/Meters/0a02 >2017-08-22 21:44:34.572 Notice: Starting MarkLogic Server 8.0-3.2 x86_64 >in /opt/MarkLogic with data in /marklogic >2017-08-22 21:44:34.617 Info: Host running Linux >3.10.0-327.18.2.el7.x86_64 (Red Hat Enterprise Linux Server release 6.8 >(Santiago)) >2017-08-22 21:44:34.690 Info: SSL FIPS mode has been enabled > >The first message is from my task indicating that the 3237th (out of >50,000 in the queue) is starting. > >Then the MarkLogic start message for no obvious reason. It’s not a time >at which a scheduled server restart would have likely happened and nobody >was (or should have been) awake at that hour and I’m the only person who >should be doing anything with this server anyway. > >What’s interesting is that I’m getting this restart consistently at about >the 3200th task, so it feels like either a time out or a resource >exhaustion that then triggers a restart, but there are no messages about >any kind of failure, out of memory condition, etc.
> >I’m pretty sure it’s an issue with the configuration of the underlying >linux server but I wanted to know if there were any conditions under >which the Task Server or ML server itself would spontaneously restart. > >Thanks, > >Eliot > >-- >Eliot Kimber >http://contrext.com > > > >On 8/23/17, 10:25 AM, "general-boun...@developer.marklogic.com on behalf >of Dave Cassel" david.cas...@marklogic.com> wrote: > >I don't believe there's any reason why the Task Server would be >triggering >a restart (although some configuration changes affecting the Task >Server >would). I'd look elsewhere for an error. Specifically, I'd check >ErrorLog.txt, find the time when a restart happened, and look to see >if >anything interesting was logged just before. (Perhaps you've already >done >that.) > >-- >Dave Cassel, @dmcassel <https://twitter.com/dmcassel> >Technical Community Manager >MarkLogic Corporation <http://www.marklogic.com/> > >http://developer.marklogic.com/ > > > > >On 8/23/17, 11:18 AM, "general-boun...@developer.marklogic.com on >behalf >of Eliot Kimber" ekim...@contrext.com> wrote: > >>I’m trying to understand the Task Server (and in my case, why it is >>consistently restarting after satisfying a subset of its queue). >> >>Going through the ML 8 docs I’m not finding any general discussion >of the >>Task Server, only references to it from elsewhere (e.g., in the docs >for >>xdmp:spawn() and in discussion of scheduling tasks). >> >>But not finding anything that would appear to provide insight into >why >>the server would perform an uncommanded restart (or information >>indicating that it would never do that and thus the problem must be >>elsewhere). >> >>Have I missed it? Given that the Task Server is a built-in and >prominent >>part of MarkLogic it seems odd that there’s no general documentation >for >>it, which makes me think I must have missed it. But I both searched >the >>doc set and ToC and scanned the entire Guide ToC and didn’t find >anything.
>> >>Thanks, >> >>Eliot >>-- >>Eliot Kimber >
Re: [MarkLogic Dev General] Where is General Documentation for the Task Server App?
Yes, I checked the log and the messages are:

2017-08-22 21:43:42.287 Info: TaskServer: profiling-task.xqy [4136613570697343302]: Starting, start: 32361, group size: 10, outdir="/profiling/trial-058/group-3237/
2017-08-22 21:43:42.287 Info: TaskServer: "
2017-08-22 21:43:42.405 Info: Saving /marklogic/Forests/Meters/0a02
2017-08-22 21:44:34.572 Notice: Starting MarkLogic Server 8.0-3.2 x86_64 in /opt/MarkLogic with data in /marklogic
2017-08-22 21:44:34.617 Info: Host running Linux 3.10.0-327.18.2.el7.x86_64 (Red Hat Enterprise Linux Server release 6.8 (Santiago))
2017-08-22 21:44:34.690 Info: SSL FIPS mode has been enabled

The first message is from my task indicating that the 3237th task (out of 50,000 in the queue) is starting. Then the MarkLogic start message appears for no obvious reason. It’s not a time at which a scheduled server restart would likely have happened, nobody was (or should have been) awake at that hour, and I’m the only person who should be doing anything with this server anyway. What’s interesting is that I’m getting this restart consistently at about the 3200th task, so it feels like either a timeout or a resource exhaustion that then triggers a restart, but there are no messages about any kind of failure, out-of-memory condition, etc. I’m pretty sure it’s an issue with the configuration of the underlying Linux server, but I wanted to know if there were any conditions under which the Task Server or the ML server itself would spontaneously restart. Thanks, Eliot -- Eliot Kimber http://contrext.com On 8/23/17, 10:25 AM, "general-boun...@developer.marklogic.com on behalf of Dave Cassel" wrote: I don't believe there's any reason why the Task Server would be triggering a restart (although some configuration changes affecting the Task Server would). I'd look elsewhere for an error. Specifically, I'd check ErrorLog.txt, find the time when a restart happened, and look to see if anything interesting was logged just before. (Perhaps you've already done that.) 
-- Dave Cassel, @dmcassel <https://twitter.com/dmcassel> Technical Community Manager MarkLogic Corporation <http://www.marklogic.com/> http://developer.marklogic.com/ On 8/23/17, 11:18 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >I’m trying to understand the Task Server (and in my case, why it is >consistently restarting after satisfying a subset of its queue). > >Going through the ML 8 docs I’m not finding any general discussion of the >Task Server, only references to it from elsewhere (e.g., in the docs for >xdmp:spawn() and in discussion of scheduling tasks). > >But not finding anything that would appear to provide insight into why >the server would perform an uncommanded restart (or information >indicating that it would never do that and thus the problem must be >elsewhere). > >Have I missed it? Given that the Task Server is a built-in and prominent >part of MarkLogic it seems odd that there’s no general documentation for >it, which makes me think I must have missed it. But I both searched the >doc set and ToC and scanned the entire Guide ToC and didn’t find anything. > >Thanks, > >Eliot >-- >Eliot Kimber >http://contrext.com > > > > >___ >General mailing list >General@developer.marklogic.com >Manage your subscription at: >http://developer.marklogic.com/mailman/listinfo/general ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Where is General Documentation for the Task Server App?
I’m trying to understand the Task Server (and in my case, why it is consistently restarting after satisfying a subset of its queue). Going through the ML 8 docs I’m not finding any general discussion of the Task Server, only references to it from elsewhere (e.g., in the docs for xdmp:spawn() and in discussion of scheduling tasks). But not finding anything that would appear to provide insight into why the server would perform an uncommanded restart (or information indicating that it would never do that and thus the problem must be elsewhere). Have I missed it? Given that the Task Server is a built-in and prominent part of MarkLogic it seems odd that there’s no general documentation for it, which makes me think I must have missed it. But I both searched the doc set and ToC and scanned the entire Guide ToC and didn’t find anything. Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Getting Impossible Value from count()--why?
Yes, I added “item-frequency” to my cts:element-values() call and now all the numbers appear to be correct. I haven’t circled back to my original issue with the ML-provided search buckets not being the right size but if I have time I’ll see if the issue was failing to specify item-frequency at some point. Cheers, E. -- On 8/23/17, 2:37 AM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote: Hi Eliot, Keep in mind that you pass in item-frequency in cts:element-values, but the default for range constraints is likely fragment-frequency. Did you pass in an item-frequency facet-option in there too? Kind regards, Geert On 8/22/17, 10:47 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >If I sum the counts of each bucket calculated using cts:frequency() it >matches the total calculated using the initial result from the >element-values() query, so I guess the 10,000 count is a side effect of >some internal lexicon implementation magic. > >Cheers, > >E. > >-- >Eliot Kimber >http://contrext.com > > > >On 8/22/17, 3:25 PM, "general-boun...@developer.marklogic.com on behalf >of Eliot Kimber" ekim...@contrext.com> wrote: > >I think this is again my weak understanding of lexicons and frequency >counting. > >If I change my code to sum the frequencies of the durations in each >range then I get more sensible numbers, e.g.: > >let $count := sum(for $dur in $durations[. lt $upper-bound][. ge >$lower-bound] return cts:frequency($dur)) > >Having updated get-enrichment-durations() to: > >cts:element-values(xs:QName("prof:overall-elapsed"), (), >("descending", "item-frequency"), > cts:collection-query($collection)) > >It still seems odd that the pure lexicon check returns exactly 10,000 >*values*--that still seems suspect, but then using those 10,000 values to >calculate the total frequency does result in a more likely number. I >guess I can do some brute-force querying to see if it’s accurate. 
> >Cheers, > >Eliot > -- >Eliot Kimber >http://contrext.com > > > >On 8/22/17, 2:52 PM, "general-boun...@developer.marklogic.com on >behalf of Eliot Kimber" ekim...@contrext.com> wrote: > >Using ML 8.0-3.2 > >As part of my profiling application I run a large number of >profiles, storing the profiler results back to the database. I’m then >extracting the times from the profiling data to create histograms and do >other analysis. > >My first attempt to do this with buckets ran into the problem >that the index-based buckets were not returning accurate numbers, so I >reimplemented it to construct the buckets manually from a list of the >actual duration values. > >My code is: > >let $durations as xs:dayTimeDuration* := >epf:get-enrichment-durations($collection) >let $search-range := epf:construct-search-range() >let $facets := >for $bucket in $search-range/search:bucket >let $upper-bound := if ($bucket/@lt) then >xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S") >let $lower-bound := xs:dayTimeDuration($bucket/@ge) >let $count := count($durations[. lt $upper-bound][. ge >$lower-bound]) >return if ($count gt 0) > then <facet-value count="{$count}">{epf:format-day-time-duration($upper-bound)}</facet-value> > else () > >The get-enrichment-durations() function does this: > > cts:element-values(xs:QName("prof:overall-elapsed"), (), >"descending", > cts:collection-query($collection)) > >This works nicely and seems to provide correct numbers except >when the number of durations within a particular set of bounds exceeds >10,000, at which point count() returns 10,000, which is an impossible >number--the chance of there being exactly 10,000 instances within a given >range is basically zero. But I’m getting 10,000 twice, which is >absolutely impossible. > >Here’s the results I get from running
Re: [MarkLogic Dev General] Getting Impossible Value from count()--why?
If I sum the counts of each bucket calculated using cts:frequency() it matches the total calculated using the initial result from the element-values() query, so I guess the 10,000 count is a side effect of some internal lexicon implementation magic. Cheers, E. -- Eliot Kimber http://contrext.com On 8/22/17, 3:25 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I think this is again my weak understanding of lexicons and frequency counting. If I change my code to sum the frequencies of the durations in each range then I get more sensible numbers, e.g.: let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound] return cts:frequency($dur)) Having updated get-enrichment-durations() to: cts:element-values(xs:QName("prof:overall-elapsed"), (), ("descending", "item-frequency"), cts:collection-query($collection)) It still seems odd that the pure lexicon check returns exactly 10,000 *values*--that still seems suspect, but then using those 10,000 values to calculate the total frequency does result in a more likely number. I guess I can do some brute-force querying to see if it’s accurate. Cheers, Eliot -- Eliot Kimber http://contrext.com On 8/22/17, 2:52 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: Using ML 8.0-3.2 As part of my profiling application I run a large number of profiles, storing the profiler results back to the database. I’m then extracting the times from the profiling data to create histograms and do other analysis. My first attempt to do this with buckets ran into the problem that the index-based buckets were not returning accurate numbers, so I reimplemented it to construct the buckets manually from a list of the actual duration values. 
My code is:

let $durations as xs:dayTimeDuration* := epf:get-enrichment-durations($collection)
let $search-range := epf:construct-search-range()
let $facets :=
  for $bucket in $search-range/search:bucket
  let $upper-bound := if ($bucket/@lt) then xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
  let $lower-bound := xs:dayTimeDuration($bucket/@ge)
  let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
  return if ($count gt 0)
         then <facet-value count="{$count}">{epf:format-day-time-duration($upper-bound)}</facet-value>
         else ()

The get-enrichment-durations() function does this:

cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
  cts:collection-query($collection))

This works nicely and seems to provide correct numbers except when the number of durations within a particular set of bounds exceeds 10,000, at which point count() returns 10,000, which is an impossible number—the chance of there being exactly 10,000 instances within a given range is basically zero. But I’m getting 10,000 twice, which is absolutely impossible. Here’s the results I get from running this in the query console:

75778
0.01 seconds
0.02 seconds
0.03 seconds
0.04 seconds
0.05 seconds
…

There are 75,778 actual duration values and the count values for the 3rd and 4th ranges are exactly 10,000. 
The search range I’m constructing is normal ML-defined markup for defining a search range (in the http://marklogic.com/appservices/search namespace), e.g.:

0.001 Second
0.002 Second
0.003 Second
0.004 Second
0.005 Second
…

Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Large job processing question.
The Task Manager will queue the jobs. It will only process as many at once as there are threads configured for the Task Manager. In my profiling application I’m queueing 10s of 1000s of 10-doc tasks. My Task Manager has a maximum queue of 100,000 tasks. If I do a small number of large tasks then I quickly exhaust RAM. Cheers, E. -- Eliot Kimber http://contrext.com From: on behalf of "Ladner, Eric (Eric.Ladner)" Reply-To: MarkLogic Developer Discussion Date: Tuesday, August 22, 2017 at 3:33 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Large job processing question. Is it smart enough not to spawn 100,000 jobs at once and swamp the system? Eric Ladner Systems Analyst eric.lad...@chevron.com From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Geert Josten Sent: August 22, 2017 13:59 To: MarkLogic Developer Discussion Subject: [**EXTERNAL**] Re: [MarkLogic Dev General] Large job processing question. Hi Eric, Personally, I would probably let go of the all-docs-at-once approach, and spawn processes for each input (sub)folder, and potentially for batches or individual files in any folder as well. Same for the existing documents, spawn a process for batches or individual docs that check if they still exist. If you make them append logs to the documents or their properties, you can gather reports about changes afterwards if needed. Cheers, Geert From: on behalf of "Ladner, Eric (Eric.Ladner)" Reply-To: MarkLogic Developer Discussion Date: Tuesday, August 22, 2017 at 4:36 PM To: "general@developer.marklogic.com" Subject: [MarkLogic Dev General] Large job processing question. We have some large jobs (ingestion and validation of unstructured documents) that have timeout issues. The way the jobs are structured is that the first job checks that all the existing documents are valid (still exist on the file system). 
It does this in two steps: 1) gather all documents to be validated from the DB, 2) check that list against the file system. The second job is: 1) the filesystem is traversed to find any new documents (or ones that have been modified in the last X days), 2) those new/modified documents are ingested. The problem in the second job is that there could be tens of thousands of documents in a hundred thousand folders (don’t ask). The job will typically time out after an hour during the “go find all the new documents” phase. I’m trying to find out if there’s a way to re-structure the job so that it runs faster and doesn’t time out, or maybe breaks up the task into different parts that run in parallel or something. Any thoughts welcome. Eric Ladner Systems Analyst eric.lad...@chevron.com
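Geert’s batching suggestion can be sketched in XQuery. This is only a sketch, assuming the URI lexicon is enabled; the task module path (/tasks/validate-batch.xqy), batch size, and URI-list encoding are illustrative, not part of the original thread:

```xquery
xquery version "1.0-ml";

(: Sketch: split the candidate URIs into batches and spawn one Task
   Server task per batch, so no single transaction has to traverse the
   whole document set. Assumes the URI lexicon is enabled; the module
   path and batch size are hypothetical. :)
let $uris := cts:uris((), ())
let $batch-size := 1000
for $batch in 1 to xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
let $batch-uris :=
  fn:subsequence($uris, ($batch - 1) * $batch-size + 1, $batch-size)
return
  xdmp:spawn("/tasks/validate-batch.xqy",
             (xs:QName("uris"), fn:string-join($batch-uris, "|")))
```

Each spawned task then splits the "uris" external variable back apart and validates or ingests only its own batch, which keeps every transaction short enough to avoid the timeout.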
Re: [MarkLogic Dev General] Getting Impossible Value from count()--why?
I think this is again my weak understanding of lexicons and frequency counting. If I change my code to sum the frequencies of the durations in each range then I get more sensible numbers, e.g.: let $count := sum(for $dur in $durations[. lt $upper-bound][. ge $lower-bound] return cts:frequency($dur)) Having updated get-enrichment-durations() to: cts:element-values(xs:QName("prof:overall-elapsed"), (), ("descending", "item-frequency"), cts:collection-query($collection)) It still seems odd that the pure lexicon check returns exactly 10,000 *values*--that still seems suspect, but then using those 10,000 values to calculate the total frequency does result in a more likely number. I guess I can do some brute-force querying to see if it’s accurate. Cheers, Eliot -- Eliot Kimber http://contrext.com On 8/22/17, 2:52 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: Using ML 8.0-3.2 As part of my profiling application I run a large number of profiles, storing the profiler results back to the database. I’m then extracting the times from the profiling data to create histograms and do other analysis. My first attempt to do this with buckets ran into the problem that the index-based buckets were not returning accurate numbers, so I reimplemented it to construct the buckets manually from a list of the actual duration values. My code is: let $durations as xs:dayTimeDuration* := epf:get-enrichment-durations($collection) let $search-range := epf:construct-search-range() let $facets := for $bucket in $search-range/search:bucket let $upper-bound := if ($bucket/@lt) then xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S") let $lower-bound := xs:dayTimeDuration($bucket/@ge) let $count := count($durations[. lt $upper-bound][. 
ge $lower-bound])
  return if ($count gt 0)
         then <facet-value count="{$count}">{epf:format-day-time-duration($upper-bound)}</facet-value>
         else ()

The get-enrichment-durations() function does this:

cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
  cts:collection-query($collection))

This works nicely and seems to provide correct numbers except when the number of durations within a particular set of bounds exceeds 10,000, at which point count() returns 10,000, which is an impossible number—the chance of there being exactly 10,000 instances within a given range is basically zero. But I’m getting 10,000 twice, which is absolutely impossible. Here’s the results I get from running this in the query console:

75778
0.01 seconds
0.02 seconds
0.03 seconds
0.04 seconds
0.05 seconds
…

There are 75,778 actual duration values and the count values for the 3rd and 4th ranges are exactly 10,000. If I change the let $count := expression to only test the upper or lower bound then I get numbers greater than 10,000. I also tried changing the order of the predicates and using a single predicate with “and”. The problem only seems to be related to using both predicates when the resulting sequence would have more than 10K items. Is there an explanation for why count() gives me exactly 10,000 in this case? Is there a workaround for this behavior? 
The search range I’m constructing is normal ML-defined markup for defining a search range (in the http://marklogic.com/appservices/search namespace), e.g.:

0.001 Second
0.002 Second
0.003 Second
0.004 Second
0.005 Second
…

Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Count of cts:element-values() not equal to number of element instances--what's going on?
A closer reading of the manual reveals my mistake: I needed to specify "item-frequency" in the element-values() query. Without it I was getting the count of *fragments* with the value, not the total number of occurrences. When I add the “item-frequency” option to element-values() then I get the correct count from the sum of cts:frequency(). Cheers, E. -- Eliot Kimber http://contrext.com On 8/14/17, 2:58 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: Using both cts:frequency and cts:count-aggregate I get numbers that are closer to the correct count but are short by about 200. What would account for the difference? Queries:

let $profiles := collection($collection)/enrprof:profiling-instance/enrprof:enrichment/enrprof:evalResult/prof:*
let $histograms := $profiles/prof:histogram
let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed
let $durations := cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
  cts:collection-query($collection))
let $count-frequency := sum(for $dur in $durations return cts:frequency($dur))
let $overall-elapsed-ref := cts:element-reference(fn:QName("http://marklogic.com/xdmp/profile","overall-elapsed"),("type=dayTimeDuration"))
let $count-frequency := sum(for $dur in $durations return cts:frequency($dur))
let $count-aggregate := cts:count-aggregate($overall-elapsed-ref,(), cts:collection-query($collection))

Results:

47539
47539
47539
47371
47371
21219

Cheers, E. -- Eliot Kimber http://contrext.com On 8/14/17, 1:53 PM, "general-boun...@developer.marklogic.com on behalf of Mary Holstege" wrote: That is overkill. The results you get out of cts:element-values have a frequency (accessible via cts:frequency). The cts: aggregates (e.g. cts:count, cts:sum) take the frequency into account. 
//Mary On Mon, 14 Aug 2017 11:42:07 -0700, Oleksii Segeda wrote: > Eliot, > > You can do something like this: > cts:element-value-co-occurrences(xs:QName("prof:overall-elapsed"),xs:QName("xdmp:document")) > if you have only one element per document. > > Best, > > Oleksii Segeda > IT Analyst > Information and Technology Solutions > www.worldbank.org > > > -Original Message- > From: general-boun...@developer.marklogic.com > [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot > Kimber > Sent: Monday, August 14, 2017 2:31 PM > To: MarkLogic Developer Discussion > Subject: [MarkLogic Dev General] Count of cts:element-values() not equal > to number of element instances--what's going on? > > I have this query: > > let $durations := cts:element-values(xs:QName("prof:overall-elapsed"), > (), "descending", > cts:collection-query($collection)) > > And this query: > > let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed > > Where there an element range index for prof:overall-elapsed. > > Comparing the two results I get very different numbers when I expected > them to be equal: > > 47539 > 21219 > > Doing this: > > count(distinct-values($overall-elapsed ! xs:dayTimeDuration(.)) > > Returns 21219, making it clear that the range index is returning > distinct values, not all values. It makes sense in terms of how I would > expect a range index to be structured (a one-to-many mapping for values > to elements) but doesn’t make sense as the return for a function named > “element-values” (and not element-distinct-values). > > I didn’t see this behavior mentioned in the docs (although the > introduction to the Lexicon reference section does describe lexicons as > sets of unique values). > > My requirement is to *quickly* get a list of the durations for all > prof:expression elements (which I use for both counting and for > bucketing, so I need all values, not just all distinct values). > > Is there a way to do what I want using only indexes? 
> > Thanks, > >
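The fragment-frequency vs. item-frequency distinction resolved in this thread can be sketched as follows. This is an illustrative sketch, not the poster's exact code; it assumes an element range index of type dayTimeDuration on prof:overall-elapsed:

```xquery
xquery version "1.0-ml";
declare namespace prof = "http://marklogic.com/xdmp/profile";

(: With "fragment-frequency" (the default), cts:frequency() reports how
   many fragments contain each value; with "item-frequency" it reports
   how many times each value occurs. Only the latter sums to the total
   number of element instances. :)
let $by-fragment :=
  sum(for $d in cts:element-values(xs:QName("prof:overall-elapsed"),
                                   (), "fragment-frequency")
      return cts:frequency($d))
let $by-item :=
  sum(for $d in cts:element-values(xs:QName("prof:overall-elapsed"),
                                   (), "item-frequency")
      return cts:frequency($d))
return ($by-fragment, $by-item)
```

In the numbers quoted above, the fragment-based sum came out about 200 short (47371 vs. 47539) because some fragments hold more than one element with the same value.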
[MarkLogic Dev General] Getting Impossible Value from count()--why?
Using ML 8.0-3.2 As part of my profiling application I run a large number of profiles, storing the profiler results back to the database. I’m then extracting the times from the profiling data to create histograms and do other analysis. My first attempt to do this with buckets ran into the problem that the index-based buckets were not returning accurate numbers, so I reimplemented it to construct the buckets manually from a list of the actual duration values. My code is:

let $durations as xs:dayTimeDuration* := epf:get-enrichment-durations($collection)
let $search-range := epf:construct-search-range()
let $facets :=
  for $bucket in $search-range/search:bucket
  let $upper-bound := if ($bucket/@lt) then xs:dayTimeDuration($bucket/@lt) else xs:dayTimeDuration("PT0S")
  let $lower-bound := xs:dayTimeDuration($bucket/@ge)
  let $count := count($durations[. lt $upper-bound][. ge $lower-bound])
  return if ($count gt 0)
         then <facet-value count="{$count}">{epf:format-day-time-duration($upper-bound)}</facet-value>
         else ()

The get-enrichment-durations() function does this:

cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
  cts:collection-query($collection))

This works nicely and seems to provide correct numbers except when the number of durations within a particular set of bounds exceeds 10,000, at which point count() returns 10,000, which is an impossible number—the chance of there being exactly 10,000 instances within a given range is basically zero. But I’m getting 10,000 twice, which is absolutely impossible. Here’s the results I get from running this in the query console:

75778
0.01 seconds
0.02 seconds
0.03 seconds
0.04 seconds
0.05 seconds
…

There are 75,778 actual duration values and the count values for the 3rd and 4th ranges are exactly 10,000. 
If I change the let $count := expression to only test the upper or lower bound then I get numbers greater than 10,000. I also tried changing the order of the predicates and using a single predicate with “and”. The problem only seems to be related to using both predicates when the resulting sequence would have more than 10K items. Is there an explanation for why count() gives me exactly 10,000 in this case? Is there a workaround for this behavior? The search range I’m constructing is normal ML-defined markup for defining a search range (in the http://marklogic.com/appservices/search namespace), e.g.:

0.001 Second
0.002 Second
0.003 Second
0.004 Second
0.005 Second
…

Thanks, Eliot -- Eliot Kimber http://contrext.com
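The workaround that emerges later in this thread (summing cts:frequency() per bucket instead of counting the deduplicated lexicon values) can be sketched like this, with hypothetical bucket bounds standing in for the search:bucket values:

```xquery
xquery version "1.0-ml";
declare namespace prof = "http://marklogic.com/xdmp/profile";

let $lower := xs:dayTimeDuration("PT0.03S")  (: hypothetical bucket bounds :)
let $upper := xs:dayTimeDuration("PT0.04S")
let $durations :=
  cts:element-values(xs:QName("prof:overall-elapsed"), (),
                     ("descending", "item-frequency"))
(: The lexicon yields each distinct value once, so count() on the
   filtered sequence undercounts; weighting each value by its
   cts:frequency() gives the true number of instances in the bucket. :)
return
  sum(for $d in $durations[. ge $lower][. lt $upper]
      return cts:frequency($d))
```

cts:frequency() must be called on values obtained directly from the lexicon call in the same query, which is why the filtering is done on the lexicon sequence itself rather than on copies of the values.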
Re: [MarkLogic Dev General] Count of cts:element-values() not equal to number of element instances--what's going on?
That would make sense but since these elements are generated by the ML profiler I don’t think it’s possible for them to ever be empty. This query returns zero: let $overall-elapsed := collection($collection)/enrprof:profiling-instance/enrprof:enrichment/enrprof:evalResult/prof:report/prof:metadata/prof:overall-elapsed count($overall-elapsed[normalize-space(xs:string(.)) eq '']) Cheers, E. -- Eliot Kimber http://contrext.com On 8/15/17, 2:09 AM, "general-boun...@developer.marklogic.com on behalf of Geert Josten" wrote: Wild guess.. Empty prof:overall-elapsed elements, that are ignored/rejected by the range index? Cheers On 8/14/17, 9:58 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: >Using both cts:frequency and cts:count-aggregate I get numbers that are >closer to the correct count but are short by about 200. What would >account for the difference? > >Queries: > >let $profiles := >collection($collection)/enrprof:profiling-instance/enrprof:enrichment/enrprof:evalResult/prof:* >let $histograms := $profiles/prof:histogram >let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed >let $durations := cts:element-values(xs:QName("prof:overall-elapsed"), >(), "descending", > cts:collection-query($collection)) >let $count-frequency := sum(for $dur in $durations return >cts:frequency($dur)) >let $overall-elapsed-ref := >cts:element-reference(fn:QName("http://marklogic.com/xdmp/profile","overall-elapsed"),("type=dayTimeDuration")) > >let $count-frequency := sum(for $dur in $durations return >cts:frequency($dur)) >let $count-aggregate := cts:count-aggregate($overall-elapsed-ref,(), >cts:collection-query($collection)) > >Results: > >47539 >47539 >47539 >47371 >47371 >21219 > >Cheers, > >E. >-- >Eliot Kimber >http://contrext.com > > > > >On 8/14/17, 1:53 PM, "general-boun...@developer.marklogic.com on behalf >of Mary Holstege" mary.holst...@marklogic.com> wrote: > >That is overkill. 
The results you get out of cts:element-values have >a >frequency (accessible via cts:frequency). The cts: aggregates (e.g. >cts:count, cts:sum) take the frequency into account. > >//Mary > >On Mon, 14 Aug 2017 11:42:07 -0700, Oleksii Segeda > wrote: > >> Eliot, >> >> You can do something like this: >> > cts:element-value-co-occurrences(xs:QName("prof:overall-elapsed"),xs:QName("xdmp:document")) >> if you have only one element per document. >> >> Best, >> >> Oleksii Segeda >> IT Analyst >> Information and Technology Solutions >> www.worldbank.org >> >> >> -Original Message- >> From: general-boun...@developer.marklogic.com >> [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot > >> Kimber >> Sent: Monday, August 14, 2017 2:31 PM >> To: MarkLogic Developer Discussion >> Subject: [MarkLogic Dev General] Count of cts:element-values() not >equal >> to number of element instances--what's going on? >> >> I have this query: >> >> let $durations := >cts:element-values(xs:QName("prof:overall-elapsed"), >> (), "descending", >> cts:collection-query($collection)) >> >> And this query: >> >> let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed >> >> Where there is an element range index for prof:overall-elapsed. >> >> Comparing the two results I get very different numbers when I >expected >> them to be equal: >> >> 47539 >> 21219 >> >> Doing this: >> >> count(distinct-values($overall-elapsed ! xs:dayTimeDuration(.))) >> >> Returns 21219, making it clear that the range index is returning >> distinct values, not all values. It makes sense in terms of how I >would >> expect a range index
Re: [MarkLogic Dev General] Count of cts:element-values() not equal to number of element instances--what's going on?
Using both cts:frequency and cts:count-aggregate I get numbers that are closer to the correct count but are short by about 200. What would account for the difference? Queries:

let $profiles := collection($collection)/enrprof:profiling-instance/enrprof:enrichment/enrprof:evalResult/prof:*
let $histograms := $profiles/prof:histogram
let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed
let $durations := cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending",
  cts:collection-query($collection))
let $count-frequency := sum(for $dur in $durations return cts:frequency($dur))
let $overall-elapsed-ref := cts:element-reference(fn:QName("http://marklogic.com/xdmp/profile","overall-elapsed"),("type=dayTimeDuration"))
let $count-frequency := sum(for $dur in $durations return cts:frequency($dur))
let $count-aggregate := cts:count-aggregate($overall-elapsed-ref,(), cts:collection-query($collection))

Results:

47539
47539
47539
47371
47371
21219

Cheers, E. -- Eliot Kimber http://contrext.com On 8/14/17, 1:53 PM, "general-boun...@developer.marklogic.com on behalf of Mary Holstege" wrote: That is overkill. The results you get out of cts:element-values have a frequency (accessible via cts:frequency). The cts: aggregates (e.g. cts:count, cts:sum) take the frequency into account. //Mary On Mon, 14 Aug 2017 11:42:07 -0700, Oleksii Segeda wrote: > Eliot, > > You can do something like this: > cts:element-value-co-occurrences(xs:QName("prof:overall-elapsed"),xs:QName("xdmp:document")) > if you have only one element per document. 
> > Best, > > Oleksii Segeda > IT Analyst > Information and Technology Solutions > www.worldbank.org > > > -Original Message----- > From: general-boun...@developer.marklogic.com > [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot > Kimber > Sent: Monday, August 14, 2017 2:31 PM > To: MarkLogic Developer Discussion > Subject: [MarkLogic Dev General] Count of cts:element-values() not equal > to number of element instances--what's going on? > > I have this query: > > let $durations := cts:element-values(xs:QName("prof:overall-elapsed"), > (), "descending", > cts:collection-query($collection)) > > And this query: > > let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed > > Where there an element range index for prof:overall-elapsed. > > Comparing the two results I get very different numbers when I expected > them to be equal: > > 47539 > 21219 > > Doing this: > > count(distinct-values($overall-elapsed ! xs:dayTimeDuration(.)) > > Returns 21219, making it clear that the range index is returning > distinct values, not all values. It makes sense in terms of how I would > expect a range index to be structured (a one-to-many mapping for values > to elements) but doesn’t make sense as the return for a function named > “element-values” (and not element-distinct-values). > > I didn’t see this behavior mentioned in the docs (although the > introduction to the Lexicon reference section does describe lexicons as > sets of unique values). > > My requirement is to *quickly* get a list of the durations for all > prof:expression elements (which I use for both counting and for > bucketing, so I need all values, not just all distinct values). > > Is there a way to do what I want using only indexes? > > Thanks, > > E. 
> -- > Eliot Kimber > http://contrext.com -- Using Opera's revolutionary email client: http://www.opera.com/mail/ ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
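[Editor's note: a possible explanation for the remaining gap of about 200, sketched in XQuery. Lexicon frequencies default to fragment granularity, so repeated values inside one fragment count only once; the "item-frequency" option requests per-occurrence counts instead. The collection URI below is hypothetical, and this sketch is untested.]

```xquery
xquery version "1.0-ml";
declare namespace prof = "http://marklogic.com/xdmp/profile";

let $collection := "/profiles/"  (: hypothetical collection URI :)
let $values := cts:element-values(
                 xs:QName("prof:overall-elapsed"),
                 (),
                 ("item-frequency"),  (: count occurrences, not fragments :)
                 cts:collection-query($collection))
(: sum of per-value frequencies should approach the true instance count :)
return sum($values ! cts:frequency(.))
```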
[MarkLogic Dev General] Count of cts:element-values() not equal to number of element instances--what's going on?
I have this query: let $durations := cts:element-values(xs:QName("prof:overall-elapsed"), (), "descending", cts:collection-query($collection)) And this query: let $overall-elapsed := $profiles/prof:metadata/prof:overall-elapsed Where there is an element range index for prof:overall-elapsed. Comparing the two results I get very different numbers when I expected them to be equal: 47539 21219 Doing this: count(distinct-values($overall-elapsed ! xs:dayTimeDuration(.))) Returns 21219, making it clear that the range index is returning distinct values, not all values. It makes sense in terms of how I would expect a range index to be structured (a one-to-many mapping for values to elements) but doesn’t make sense as the return for a function named “element-values” (and not element-distinct-values). I didn’t see this behavior mentioned in the docs (although the introduction to the Lexicon reference section does describe lexicons as sets of unique values). My requirement is to *quickly* get a list of the durations for all prof:expression elements (which I use for both counting and for bucketing, so I need all values, not just all distinct values). Is there a way to do what I want using only indexes? Thanks, E. -- Eliot Kimber http://contrext.com
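[Editor's note: since the lexicon stores distinct values, one hedged way to recover the full multiset for bucketing, purely from the index, is to repeat each distinct value by its frequency. This is a sketch, not tested; the collection URI is hypothetical, and "item-frequency" makes the counts per occurrence rather than per fragment.]

```xquery
xquery version "1.0-ml";
declare namespace prof = "http://marklogic.com/xdmp/profile";

let $collection := "/profiles/"  (: hypothetical collection URI :)
for $dur in cts:element-values(xs:QName("prof:overall-elapsed"), (),
              ("item-frequency"), cts:collection-query($collection))
(: emit each distinct value once per occurrence :)
for $i in 1 to cts:frequency($dur)
return $dur
```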
Re: [MarkLogic Dev General] Tracking Spawned Tasks?
That’s a good point about comments. I’ll try to add comments for things I found lacking and subsequently discover answers to. Cheers, E. -- Eliot Kimber http://contrext.com From: on behalf of Evan Lenz Reply-To: MarkLogic Developer Discussion Date: Monday, August 14, 2017 at 12:27 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Tracking Spawned Tasks? Hi Eliot, One nice thing about the MarkLogic documentation for functions is that you can add helpful comments yourself. I've seen others write "See also" comments and have done so myself from time to time. In general I've found the comments very helpful, even if it's a user getting their newbie question answered. Evan Evan Lenz President, Lenz Consulting Group, Inc. http://lenzconsulting.com On Mon, Aug 14, 2017 at 9:57 AM, Erik Hennum wrote: Hi, Eliot and Ron: The return option is explained with the rest of the options in the eval article: http://docs.marklogic.com/xdmp:eval The second example under spawn uses the promise: http://docs.marklogic.com/xdmp:spawn As Ron notes, the server field is only useful if the polling requests go back to the same host. To allow for restarts, the polling logic should check for the persisted final status document if the server field is empty. (That's the motivation for persisting a final status document even when using server fields.) Thanks for the feedback on the documentation -- I'll pass that along. Erik Hennum From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Ron Hitchens [r...@ronsoft.com] Sent: Monday, August 14, 2017 7:55 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Tracking Spawned Tasks? Proceed with caution when using server fields. They exist only on a single machine, they are not propagated across nodes in a cluster. 
If you have a cluster behind a load balancer (as most are) and you stash something in a server field to be checked later, the next request may be vectored to a different cluster node, where your stashed value will not be present. Likewise, if you put something in a field to be picked up by a spawned task, the spawned task may run on a different node. Ron Hitchens r...@overstory.co.uk, +44 7879 358212 On August 14, 2017 at 3:24:32 PM, Eliot Kimber (ekim...@contrext.com) wrote: I like using set-server-field: my requirement feels like just what server fields were intended for. Cheers, E. -- Eliot Kimber http://contrext.com On 8/14/17, 8:32 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: xdmp:spawn() doesn't return an identifier because, if it is used as a future via the result option, it is obligated to return the result. The approach you sketch below -- passing in an identifier and writing tickets to a status database -- is pretty much what InfoStudio did. One refinement would be to log status in a server field via xdmp:set-server-field() and, on completion, write final status to a database (for durability in the case of a restart). Hoping that helps, Erik Hennum From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Eliot Kimber [ekim...@contrext.com] Sent: Saturday, August 12, 2017 10:15 AM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Tracking Spawned Tasks? Using ML 8 I’m refining a profiling application that spawns a number of tasks and then, eventually, reports on the results once all the tasks have completed. Right now I just fire off the tasks and then refresh my app, which looks for results. 
It would be nice to be able to show the status of the spawned tasks but it looks like xdmp:spawn() doesn’t return anything (sort of expected to get some sort of task ID or something) and so there’s no obvious way to track spawned tasks from the spawning application. I could do something like generate private task IDs and pass those as parameters to the spawned tasks and then maintain a set of task status docs, but I was hoping there was something easier. It seems like it would be a common requirement but I couldn’t find anything useful in the ML 8 docs or searching the web. Thanks, Eliot -- Eliot Kimber http://contrext.com
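[Editor's note: Erik's server-field refinement might look something like this sketch. The field name, status URI, and external variable are hypothetical, and Ron's caveat still applies: server fields are host-local, so only the persisted document is reliable across a cluster or restart.]

```xquery
xquery version "1.0-ml";
(: Inside the spawned module: :)
declare variable $task-id as xs:string external;

(: cheap, host-local progress marker :)
xdmp:set-server-field(fn:concat("task-status-", $task-id), "running"),

(: ... do the real work here ... :)

(: durable final status, visible cluster-wide and after a restart :)
xdmp:document-insert(
  fn:concat("/task-status/", $task-id, ".xml"),
  <task-status id="{$task-id}" state="complete"/>)
```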
Re: [MarkLogic Dev General] Tracking Spawned Tasks?
I like using set-server-field: my requirement feels like just what server fields were intended for. Cheers, E. -- Eliot Kimber http://contrext.com On 8/14/17, 8:32 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: xdmp:spawn() doesn't return an identifier because, if it is used as a future via the result option, it is obligated to return the result. The approach you sketch below -- passing in an identifier and writing tickets to a status database -- is pretty much what InfoStudio did. One refinement would be to log status in a server field via xdmp:set-server-field() and, on completion, write final status to a database (for durability in the case of a restart). Hoping that helps, Erik Hennum From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Eliot Kimber [ekim...@contrext.com] Sent: Saturday, August 12, 2017 10:15 AM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Tracking Spawned Tasks? Using ML 8 I’m refining a profiling application that spawns a number of tasks and then, eventually, reports on the results once all the tasks have completed. Right now I just fire off the tasks and then refresh my app, which looks for results. It would be nice to be able to show the status of the spawned tasks but it looks like xdmp:spawn() doesn’t return anything (sort of expected to get some sort of task ID or something) and so there’s no obvious way to track spawned tasks from the spawning application. I could do something like generate private task IDs and pass those as parameters to the spawned tasks and then maintain a set of task status docs, but I was hoping there was something easier. It seems like it would be a common requirement but I couldn’t find anything useful in the ML 8 docs or searching the web. 
Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Tracking Spawned Tasks?
Can you expand on this statement: “if it is used as a future via the result option, it is obligated to return the result.” I didn’t see anything about this in the ML 8 docs for xdmp:spawn() and it seems pretty important. One general comment I’ll make about the ML docs is that it seems to assume/require a fairly encyclopedic knowledge of many subtle details. I’ve now read pretty much all the guides at least once and spent a lot of time in the reference docs, and while it’s all easy to access and very useful, it’s still a challenge to fully understand the implications of many things. For example, the documentation for various options for ranges only shows the syntax but has no guidance about semantics or what values are actually allowed, which would be very useful. This is documentation where a few well-placed see-alsos or a bit more usage guidance would go a long way. As a professional technical writer I know how challenging this aspect of docs is but it would help a lot. Cheers, Eliot On 8/14/17, 8:32 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: xdmp:spawn() doesn't return an identifier because, if it is used as a future via the result option, it is obligated to return the result. The approach you sketch below -- passing in an identifier and writing tickets to a status database -- is pretty much what InfoStudio did. One refinement would be to log status in a server field via xdmp:set-server-field() and, on completion, write final status to a database (for durability in the case of a restart). Hoping that helps, Erik Hennum From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Eliot Kimber [ekim...@contrext.com] Sent: Saturday, August 12, 2017 10:15 AM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Tracking Spawned Tasks? 
Using ML 8 I’m refining a profiling application that spawns a number of tasks and then, eventually, reports on the results once all the tasks have completed. Right now I just fire off the tasks and then refresh my app, which looks for results. It would be nice to be able to show the status of the spawned tasks but it looks like xdmp:spawn() doesn’t return anything (sort of expected to get some sort of task ID or something) and so there’s no obvious way to track spawned tasks from the spawning application. I could do something like generate private task IDs and pass those as parameters to the spawned tasks and then maintain a set of task status docs, but I was hoping there was something easier. It seems like it would be a common requirement but I couldn’t find anything useful in the ML 8 docs or searching the web. Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Tracking Spawned Tasks?
Using ML 8 I’m refining a profiling application that spawns a number of tasks and then, eventually, reports on the results once all the tasks have completed. Right now I just fire off the tasks and then refresh my app, which looks for results. It would be nice to be able to show the status of the spawned tasks but it looks like xdmp:spawn() doesn’t return anything (sort of expected to get some sort of task ID or something) and so there’s no obvious way to track spawned tasks from the spawning application. I could do something like generate private task IDs and pass those as parameters to the spawned tasks and then maintain a set of task status docs, but I was hoping there was something easier. It seems like it would be a common requirement but I couldn’t find anything useful in the ML 8 docs or searching the web. Thanks, Eliot -- Eliot Kimber http://contrext.com
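[Editor's note: one way to sketch the private-task-ID idea from the message above. The module path and variable name are made up for illustration; this is not a built-in ML mechanism. The ID is passed to the spawned module as an external variable and handed back to the caller so it can poll a status document later.]

```xquery
xquery version "1.0-ml";
(: "/tasks/process-task.xqy" and the task-id variable are hypothetical :)
let $task-id := xs:string(xdmp:random())
return (
  xdmp:spawn("/tasks/process-task.xqy",
             (xs:QName("task-id"), $task-id)),
  (: hand the ID back so the caller can poll the task's status doc :)
  $task-id
)
```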
Re: [MarkLogic Dev General] Unexpected Failure Doing Math on Durations (ML 8.0-3.2
Hmm. Apparently I have to cast $div to xs:double. That still seems unnecessary and feels like a bug. Cheers, E. Eliot Kimber On 8/11/17, 10:45 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: (ML 8.0-3.2) In my xquery I’m doing this: let $total := xs:dayTimeDuration("PT6M38.33S") let $fastest-three := xs:dayTimeDuration("PT4M6.258784S") let $div := ($fastest-three div $total) return ($div) Which returns: 0.6182280621595159793 If I then try to multiply $div by 100 (to get a percent) I get decimal overflow: [1.0-ml] XDMP-DECOVRFLW: (err:FOAR0002) $div * 100 -- Decimal overflow Which is not at all expected. I also noticed that ML 8 does not appear to support the XQuery 3.x two-argument round() function, e.g., round($div, 2). Why am I getting a decimal overflow here? As a test I tried doing the same thing with Saxon 9.7 and XSLT 3: <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="3.1"> div: $div * 100: round($div, 2) = Which produces: div: 0.61822806215951597921 $div * 100: 61.822806215951597921 round($div, 2) = 0.62 Thanks, Eliot -- Eliot Kimber http://contrext.com
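[Editor's note: the workaround spelled out as a sketch. Dividing two durations yields a full-precision xs:decimal, and multiplying that by 100 can exceed MarkLogic's fixed-precision decimal arithmetic; casting to xs:double first avoids the overflow, and the missing two-argument round() can be emulated.]

```xquery
xquery version "1.0-ml";
let $total := xs:dayTimeDuration("PT6M38.33S")
let $fastest-three := xs:dayTimeDuration("PT4M6.258784S")
(: cast to double so the later arithmetic is floating-point :)
let $div := xs:double($fastest-three div $total)
(: emulate round($x, 2), which ML 8's XQuery lacks :)
let $pct := fn:round($div * 100 * 100) div 100
return $pct  (: roughly 61.82 :)
```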
[MarkLogic Dev General] Unexpected Failure Doing Math on Durations (ML 8.0-3.2
(ML 8.0-3.2) In my xquery I’m doing this: let $total := xs:dayTimeDuration("PT6M38.33S") let $fastest-three := xs:dayTimeDuration("PT4M6.258784S") let $div := ($fastest-three div $total) return ($div) Which returns: 0.6182280621595159793 If I then try to multiply $div by 100 (to get a percent) I get decimal overflow: [1.0-ml] XDMP-DECOVRFLW: (err:FOAR0002) $div * 100 -- Decimal overflow Which is not at all expected. I also noticed that ML 8 does not appear to support the XQuery 3.x two-argument round() function, e.g., round($div, 2). Why am I getting a decimal overflow here? As a test I tried doing the same thing with Saxon 9.7 and XSLT 3: <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="3.1"> div: $div * 100: round($div, 2) = Which produces: div: 0.61822806215951597921 $div * 100: 61.822806215951597921 round($div, 2) = 0.62 Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Making Collection Facet Work with search:search() (Resolved)
My element was not in the search namespace. This was a side effect of cutting and pasting from various examples where a default namespace had been set. Hmph. Cheers, E. -- Eliot Kimber http://contrext.com On 8/7/17, 9:55 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: This is ML 8.0-3.2 Cheers, E. -- Eliot Kimber http://contrext.com On 8/7/17, 9:45 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I’m trying to do a search:search() with two constraints: one for collection and one for a bucketed facet. Here is my search definition: search:search(("install"), limit=5 100th of a second 200th of a Second … (bunch of buckets omitted) More than 2 seconds http://marklogic.com/xdmp/query-meters" name="elapsed-time"/> ) I can’t see any problem with this definition and if I run it as shown it works and my bucket constraint result is good (very nice, by the way). However, if I try to specify a value for the named constraint “trial:”, e.g.: search:search(("install and trial:trial-001"), …) Then I get this failure: [1.0-ml] XDMP-AS: (err:XPTY0004) $constraint-elem as element() -- Invalid coercion: () as element() Stack Trace In /MarkLogic/appservices/search/ast.xqy on line 305 In ast:joiner-constraint(map:map(http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:map="http://marklogic.com/xdmp/map">6<..XDMP-ATOMIZEFUNC: (err:FOTY0013) Functions cannot be atomized...), http://marklogic.com/appservices/search">trial) (and lots more stack trace items). What is causing this failure and what do I do to resolve it? 
Thanks, Eliot -- Eliot Kimber http://contrext.com
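[Editor's note: given the resolution above (the constraint element was not in the search namespace), a correctly namespaced options node would look roughly like this sketch; the collection prefix value is hypothetical.]

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <search:options>
    <!-- constraint and its children must be in the search: namespace;
         an un-namespaced constraint element is invisible to the options
         parser, which then fails with the element() coercion error -->
    <search:constraint name="trial">
      <search:collection prefix="trial-"/>
    </search:constraint>
  </search:options>
return search:search("install and trial:trial-001", $options)
```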
Re: [MarkLogic Dev General] Making Collection Facet Work with search:search()
This is ML 8.0-3.2 Cheers, E. -- Eliot Kimber http://contrext.com On 8/7/17, 9:45 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I’m trying to do a search:search() with two constraints: one for collection and one for a bucketed facet. Here is my search definition: search:search(("install"), limit=5 100th of a second 200th of a Second … (bunch of buckets omitted) More than 2 seconds http://marklogic.com/xdmp/query-meters" name="elapsed-time"/> ) I can’t see any problem with this definition and if I run it as shown it works and my bucket constraint result is good (very nice, by the way). However, if I try to specify a value for the named constraint “trial:”, e.g.: search:search(("install and trial:trial-001"), …) Then I get this failure: [1.0-ml] XDMP-AS: (err:XPTY0004) $constraint-elem as element() -- Invalid coercion: () as element() Stack Trace In /MarkLogic/appservices/search/ast.xqy on line 305 In ast:joiner-constraint(map:map(http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:map="http://marklogic.com/xdmp/map">6<..XDMP-ATOMIZEFUNC: (err:FOTY0013) Functions cannot be atomized...), http://marklogic.com/appservices/search">trial) (and lots more stack trace items). What is causing this failure and what do I do to resolve it? Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Making Collection Facet Work with search:search()
I’m trying to do a search:search() with two constraints: one for collection and one for a bucketed facet. Here is my search definition: search:search(("install"), limit=5 100th of a second 200th of a Second … (bunch of buckets omitted) More than 2 seconds http://marklogic.com/xdmp/query-meters" name="elapsed-time"/> ) I can’t see any problem with this definition and if I run it as shown it works and my bucket constraint result is good (very nice, by the way). However, if I try to specify a value for the named constraint “trial:”, e.g.: search:search(("install and trial:trial-001"), …) Then I get this failure: [1.0-ml] XDMP-AS: (err:XPTY0004) $constraint-elem as element() -- Invalid coercion: () as element() Stack Trace In /MarkLogic/appservices/search/ast.xqy on line 305 In ast:joiner-constraint(map:map(http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:map="http://marklogic.com/xdmp/map">6<..XDMP-ATOMIZEFUNC: (err:FOTY0013) Functions cannot be atomized...), http://marklogic.com/appservices/search">trial) (and lots more stack trace items). What is causing this failure and what do I do to resolve it? Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Trying to Get ML Data Visualization Widgets Working
I figured out my n00b error: The initialization script needs to be at the end of the HTML doc. Once I did that (and also set the constraintType configuration property on the chart configuration), I’m now getting a visualization widget on my page. Now I just need to fill it with data…. Cheers, E. Eliot Kimber Doer of Things Nobody Else Has Time For On 8/7/17, 3:19 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote: I was looking at the Google visualization stuff and then found Dave Lee’s paper on using it with MarkLogic, which then led me to the ML 8 docs. Note that I’m not really using the application builder, just using code that comes with it. I suspect my issue is just a basic JavaScript problem. I’ll take a look at vis.js—I need easy for this project… Cheers, E. Eliot Kimber Doer of Things Nobody Else Has Time For On 8/7/17, 3:08 PM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: The AppBuilder has been superseded by initiatives in the JavaScript ecosystem and is deprecated in MarkLogic 8 and removed in 9. I've heard good things about the D3 (versatile) and vis.js (easy) Open Source JavaScript visualization libraries. Hoping that's useful, Erik Hennum From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Eliot Kimber [ekim...@contrext.com] Sent: Monday, August 07, 2017 12:24 PM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Trying to Get ML Data Visualization Widgets Working Using ML 8, I’m setting up a little profiling web application and I need to do visualization on the recorded data, e.g., durations reported by query meters for a large number of operations. I’m following the guidance in the Search Developer's Guide — Chapter 31, Data Visualization Widgets, in the context of my own simple Web app (that is, I did not use the application builder to initially create my app, I just created a simple HTTP app from scratch). 
I’m generating an HTML page that includes all the Javascript for visualization:

var durationBarChartConfig = {
  title: "Duration Distributions",
  dataLabel: "Durations",
  dataType: "int"
}
ML.controller.init();
ML.chartWidget('duration-bar-chart-1', 'bar', durationBarChartConfig);
ML.chartWidget('duration-bar-chart-2', 'bar', durationBarChartConfig);
ML.chartWidget('duration-bar-chart-3', 'bar', durationBarChartConfig);
ML.controller.loadData();

And in the main HTML I’m generating the corresponding widget-containing divs: However, when I load the page I get this result in the console: chart.js:82 Uncaught Chart widget container ID "duration-bar-chart-1" does not exist The element exists and there are no other errors in the JS console. I assume I must be missing something basic here but as I’m not at all versed in JavaScript I’m hoping someone can point me in the right direction. I didn’t see anything in the ML guide or the underlying JavaScript code that suggested I’m missing some setup. Thanks, Eliot -- Eliot Kimber http://contrext.com
Re: [MarkLogic Dev General] Trying to Get ML Data Visualization Widgets Working
I was looking at the Google visualization stuff and then found Dave Lee’s paper on using it with MarkLogic, which then led me to the ML 8 docs. Note that I’m not really using the application builder, just using code that comes with it. I suspect my issue is just a basic JavaScript problem. I’ll take a look at vis.js—I need easy for this project… Cheers, E. Eliot Kimber Doer of Things Nobody Else Has Time For On 8/7/17, 3:08 PM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote: Hi, Eliot: The AppBuilder has been superseded by initiatives in the JavaScript ecosystem and is deprecated in MarkLogic 8 and removed in 9. I've heard good things about the D3 (versatile) and vis.js (easy) Open Source JavaScript visualization libraries. Hoping that's useful, Erik Hennum From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Eliot Kimber [ekim...@contrext.com] Sent: Monday, August 07, 2017 12:24 PM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Trying to Get ML Data Visualization Widgets Working Using ML 8, I’m setting up a little profiling web application and I need to do visualization on the recorded data, e.g., durations reported by query meters for a large number of operations. I’m following the guidance in the Search Developer's Guide — Chapter 31, Data Visualization Widgets, in the context of my own simple Web app (that is, I did not use the application builder to initially create my app, I just created a simple HTTP app from scratch). 
I’m generating an HTML page that includes all the Javascript for visualization:

var durationBarChartConfig = {
  title: "Duration Distributions",
  dataLabel: "Durations",
  dataType: "int"
}
ML.controller.init();
ML.chartWidget('duration-bar-chart-1', 'bar', durationBarChartConfig);
ML.chartWidget('duration-bar-chart-2', 'bar', durationBarChartConfig);
ML.chartWidget('duration-bar-chart-3', 'bar', durationBarChartConfig);
ML.controller.loadData();

And in the main HTML I’m generating the corresponding widget-containing divs: However, when I load the page I get this result in the console: chart.js:82 Uncaught Chart widget container ID "duration-bar-chart-1" does not exist The element exists and there are no other errors in the JS console. I assume I must be missing something basic here but as I’m not at all versed in JavaScript I’m hoping someone can point me in the right direction. I didn’t see anything in the ML guide or the underlying JavaScript code that suggested I’m missing some setup. Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Trying to Get ML Data Visualization Widgets Working
Using ML 8, I’m setting up a little profiling web application and I need to do visualization on the recorded data, e.g., durations reported by query meters for a large number of operations. I’m following the guidance in the Search Developer's Guide — Chapter 31, Data Visualization Widgets, in the context of my own simple Web app (that is, I did not use the application builder to initially create my app, I just created a simple HTTP app from scratch). I’m generating an HTML page that includes all the Javascript for visualization:

var durationBarChartConfig = {
  title: "Duration Distributions",
  dataLabel: "Durations",
  dataType: "int"
}
ML.controller.init();
ML.chartWidget('duration-bar-chart-1', 'bar', durationBarChartConfig);
ML.chartWidget('duration-bar-chart-2', 'bar', durationBarChartConfig);
ML.chartWidget('duration-bar-chart-3', 'bar', durationBarChartConfig);
ML.controller.loadData();

And in the main HTML I’m generating the corresponding widget-containing divs: However, when I load the page I get this result in the console: chart.js:82 Uncaught Chart widget container ID "duration-bar-chart-1" does not exist The element exists and there are no other errors in the JS console. I assume I must be missing something basic here but as I’m not at all versed in JavaScript I’m hoping someone can point me in the right direction. I didn’t see anything in the ML guide or the underlying JavaScript code that suggested I’m missing some setup. Thanks, Eliot -- Eliot Kimber http://contrext.com
[MarkLogic Dev General] Possible to Create Multi-Tagname Range Indexes Using admin Functions?
Using ML 8 I’m setting up a script to configure a large number of range indexes. If I do it manually via the UI I can create a single range index configuration that lists multiple tag names. However, this doesn’t appear to be possible with the admin API using admin:database-add-range-element-index(). I’m passing multiple range index definitions to a single call of admin:database-add-range-element-index() but the result in the UI is still one range index configuration per element name:

declare function local:configure-range-element-indexes(
  $config as element(),
  $db-id as xs:integer,
  $datatype as xs:string,
  $namespace as xs:string?,
  $tagnames as xs:string*
) as element()
{
  let $indexes as element()* :=
    for $tagname in $tagnames
    return admin:database-range-element-index(
      $datatype, $namespace, $tagname,
      "http://marklogic.com/collation/", false(), "reject")
  let $new-config := admin:database-add-range-element-index($config, $db-id, $indexes)
  return $new-config
};

Some of these tag name lists are quite long, reflecting logical groupings of element types, so it would be nice to have them grouped under single definitions in the UI for the benefit of people inspecting the configuration. Is what I want to do possible? Thanks, Eliot -- Eliot Kimber http://contrext.com
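[Editor's note: one thing to try, as a sketch rather than a verified answer. The Admin UI stores a multi-name index as a single spec whose localname is a space-separated list, so passing the joined names as one localname to admin:database-range-element-index may reproduce the grouping; the database name and tag names below are illustrative, and whether the API accepts this form should be checked against your ML version.]

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $db-id := xdmp:database("Documents")  (: substitute your database :)
let $tagnames := ("title", "subtitle", "shortdesc")
(: one index spec covering all names, if the API accepts the
   space-separated localname form the UI produces :)
let $index := admin:database-range-element-index(
    "string", "", fn:string-join($tagnames, " "),
    "http://marklogic.com/collation/", fn:false(), "reject")
return admin:save-configuration(
    admin:database-add-range-element-index($config, $db-id, $index))
```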
Re: [MarkLogic Dev General] Make string XML "safe" in xquery
If you have the string in one variable then the earlier answer should do what you want: let $str := "This is <not> xml" let $elem as element() := {$str} return $elem Result is: This is <not> xml Cheers, E. Eliot Kimber From: on behalf of Steven Anderson Reply-To: MarkLogic Developer Discussion Date: Saturday, July 29, 2017 at 2:59 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Make string XML "safe" in xquery I know that the text string between the start and end tags contains no XML, only text with no markup. My plan is to put the text string between the open and close title tags as part of a larger text string that I want to convert to an XML document via xdmp:unquote. Sounds like I'll be whipping up my own function based on fn:replace. Steve On Jul 29, 2017, at 12:39 AM, Jason Hunter wrote: Why do you have malformed xml like that? How do you reliably know what's tags and what's string? Your plan is to preprocess the text to make it well-formed xml so it can be unquoted? Sent from my iPhone On Jul 29, 2017, at 14:08, Steven Anderson wrote: Within the larger context, I have a string like this: A title for the product that I'm then converting into an xml document node using xdmp:unquote. That makes xdmp:unquote barf, but if I do a fn:replace on the specific characters it works. As I said, I can whip something up, but I assumed there was an obvious function for this. Steve On Jul 28, 2017, at 10:28 PM, Jason Hunter wrote: In normal XQuery you don't need to do this. Are you sure you do? Maybe you just need: { $value } The value will be properly escaped on output. Sent from my iPhone On Jul 29, 2017, at 13:01, Steven Anderson wrote: I could do that, but I just figured there'd be an xquery function to do it for all three special XML characters. It's easy enough to write one, but I just assumed that someone else would have needed it. On Jul 28, 2017, at 9:12 PM, Indrajeet Verma wrote: Steve - Did you try using fn:replace? e.g. 
fn:replace(fn:replace($title, "<", "&lt;"), ">", "&gt;")

On Sat, Jul 29, 2017 at 5:32 AM, Steve Anderson wrote:

I have a string like this:

A title for <placeholder> the product

and I'd like to replace it with

A title for &lt;placeholder&gt; the product

Basically, I want to make the string a valid XML text node, fixing greater-than, less-than, and ampersands. I thought I could make xdmp:quote do that but, perhaps because it's Friday afternoon, I can't find the right options to make it work. Is there any easy solution I can't find? Steve

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
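A note on the nested fn:replace approach: it escapes only < and >, so an ampersand in the title would still break xdmp:unquote, and & must be escaped first or already-escaped text gets double-escaped. A minimal sketch of the same escaping in Python (standing in for the XQuery; the stdlib xml.sax.saxutils.escape handles all three characters in the right order):

```python
from xml.sax.saxutils import escape

def make_text_safe(s):
    """Escape &, <, and > so the string is safe inside an XML text node."""
    return escape(s)  # escapes & first, then < and >

make_text_safe('A title for <placeholder> & the product')
# -> 'A title for &lt;placeholder&gt; &amp; the product'
```

The equivalent XQuery would chain three fn:replace calls, replacing "&" with "&amp;" before touching "<" and ">".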
Re: [MarkLogic Dev General] Failure Trying to Restore ML 4.2 Backup to ML 8
It looks like the issue was related to permissions on the file system that held the backup directories. We relocated them and ensured that they were owned by the ML server’s user, and now I’m able to restore into ML 8 (or at least the restore has started). I was also able to restore into ML 4, which of course should work without problem. Cheers, E. -- Eliot Kimber http://contrext.com

On 7/27/17, 1:41 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I have a backup from an ML 4.2 database that I’m trying to restore to an ML 8 server. The database configurations are (or should be) the same between both servers, and both are running on Linux servers. If I have the same set of forests defined on the ML 8 server as in the backup, then when I go to the restore screen it lists all the forests in the backup, but all their check boxes are unchecked and greyed out. If I delete some of the forests from the ML 8 server, then those forests are selected and not greyed out. But when I proceed with the restore I consistently get failures like this:

Operation failed with error message: XDMP-NOFOREST: xdmp:database-restore((xs:unsignedLong("5211046837612715608"), xs:unsignedLong("5138674030818805002")), "/marklogic/backup/rsuite/20170726-1", (), fn:false(), (), fn:false(), ()) -- No forest with identifier 5211046837612715608. Check server logs.

I’m not seeing any other errors in the ErrorLog.txt log, and I didn’t see anything in the ML 8 backup and restore docs that suggested what the issue might be. The 5211046837612715608 value comes from databases.xml, and I see a mapping for this ID in the assignments.xml file: rsuite02 true 14194071972761628339 /somedir/MarkLogic all false 5211046837612715608. And there is a directory Forests/rsuite02 in the backup. Any idea what would be causing this failure?
Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Failure Trying to Restore ML 4.2 Backup to ML 8
I have a backup from an ML 4.2 database that I’m trying to restore to an ML 8 server. The database configurations are (or should be) the same between both servers, and both are running on Linux servers. If I have the same set of forests defined on the ML 8 server as in the backup, then when I go to the restore screen it lists all the forests in the backup, but all their check boxes are unchecked and greyed out. If I delete some of the forests from the ML 8 server, then those forests are selected and not greyed out. But when I proceed with the restore I consistently get failures like this:

Operation failed with error message: XDMP-NOFOREST: xdmp:database-restore((xs:unsignedLong("5211046837612715608"), xs:unsignedLong("5138674030818805002")), "/marklogic/backup/rsuite/20170726-1", (), fn:false(), (), fn:false(), ()) -- No forest with identifier 5211046837612715608. Check server logs.

I’m not seeing any other errors in the ErrorLog.txt log, and I didn’t see anything in the ML 8 backup and restore docs that suggested what the issue might be. The 5211046837612715608 value comes from databases.xml, and I see a mapping for this ID in the assignments.xml file: rsuite02 true 14194071972761628339 /somedir/MarkLogic all false 5211046837612715608. And there is a directory Forests/rsuite02 in the backup. Any idea what would be causing this failure? Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Using CURL to Test ML HTTP Processing
OK, I think I got it sorted, although I’m not sure I understand why it needs to be this way. On my curl command I added:

-H "Content-Type: application/text"

Along with:

--data-binary "@testfile.txt"

And then in my XQuery I use:

xdmp:get-request-body("text")

And get the response I expected (and wanted). Cheers, E. -- Eliot Kimber http://contrext.com

On 6/23/17, 9:18 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote:

Hi, Eliot: Try specifying the content-type. I believe that, if a POST request doesn't specify the content-type, curl defaults the content-type to application/x-www-form-urlencoded. (This convenience may or may not be seen as a feature.) Regards, Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Eliot Kimber [ekim...@contrext.com] Sent: Thursday, June 22, 2017 3:02 PM To: MarkLogic Developer Discussion Subject: [MarkLogic Dev General] Using CURL to Test ML HTTP Processing

I’m trying to understand the ML support for handling HTTP requests, and I’m trying to use curl to test things just to learn. I’m getting an odd behavior and I haven’t been able to figure out what I’m doing wrong from either the curl info I can find or from the relevant ML docs. Here’s my module:

xquery version "1.0-ml";
let $type := xdmp:get-request-header('Content-Type')
let $field-names := xdmp:get-request-field-names()
return
This is test remote access {$type}
{ for $name in $field-names return {$name} }
{ for $name in $field-names return xdmp:get-request-field($name) }

I’m trying to use POST to send the data in a file to this module using the --data-binary parameter:

curl -X POST --data-binary "file=@testfile.txt" --user ekimber:ekimber http://anglia.corp.mitchellrepair.com:11984/test-remote-access.xqy

However, the response I get is:

This is test remote accessapplication/x-www-form-urlencodedfile@testfile.txt

Note that the field value is the string “@testfile.txt”, not the content of the file.
This is the form of call that appears to be the correct way to associate a field name with the data from a file. If I leave off “file=” then the contents of testfile.txt become the field name:

curl -X POST --data-binary "@testfile.txt" --user ekimber:ekimber http://anglia.corp.mitchellrepair.com:11984/test-remote-access.xqy

This is test remote accessapplication/x-www-form-urlencodedThis is the test File. More text.

Which also seems wrong. I must be doing something wrong, either on the curl side or on the ML side, but I can’t figure out what it is. All the examples I could find in the ML docs use direct form submission rather than curl. Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
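Erik's diagnosis can be checked without a server: when no Content-Type is set, the body curl sends is parsed as application/x-www-form-urlencoded. A sketch using Python's stdlib form parser to show what ML sees in each case (the literal strings mirror the curl invocations above):

```python
from urllib.parse import parse_qs

# With --data-binary "file=@testfile.txt", curl sends this literal body
# (the @ is only special when the whole data argument starts with it):
body = "file=@testfile.txt"
fields = parse_qs(body)  # what a form-urlencoded parser sees
# fields == {'file': ['@testfile.txt']}

# With --data-binary "@testfile.txt", curl does read the file, and the
# file's text, parsed as a form body with no '=', becomes a field *name*:
file_text = "This is the test File. More text."
name_only = parse_qs(file_text, keep_blank_values=True)
# name_only == {'This is the test File. More text.': ['']}
```

Both "odd" responses in the thread follow directly from this parsing, which is why setting an explicit Content-Type and reading the raw body with xdmp:get-request-body fixed it.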
[MarkLogic Dev General] Using CURL to Test ML HTTP Processing
I’m trying to understand the ML support for handling HTTP requests, and I’m trying to use curl to test things just to learn. I’m getting an odd behavior and I haven’t been able to figure out what I’m doing wrong from either the curl info I can find or from the relevant ML docs. Here’s my module:

xquery version "1.0-ml";
let $type := xdmp:get-request-header('Content-Type')
let $field-names := xdmp:get-request-field-names()
return
This is test remote access {$type}
{ for $name in $field-names return {$name} }
{ for $name in $field-names return xdmp:get-request-field($name) }

I’m trying to use POST to send the data in a file to this module using the --data-binary parameter:

curl -X POST --data-binary "file=@testfile.txt" --user ekimber:ekimber http://anglia.corp.mitchellrepair.com:11984/test-remote-access.xqy

However, the response I get is:

This is test remote accessapplication/x-www-form-urlencodedfile@testfile.txt

Note that the field value is the string “@testfile.txt”, not the content of the file. This is the form of call that appears to be the correct way to associate a field name with the data from a file. If I leave off “file=” then the contents of testfile.txt become the field name:

curl -X POST --data-binary "@testfile.txt" --user ekimber:ekimber http://anglia.corp.mitchellrepair.com:11984/test-remote-access.xqy

This is test remote accessapplication/x-www-form-urlencodedThis is the test File. More text.

Which also seems wrong. I must be doing something wrong, either on the curl side or on the ML side, but I can’t figure out what it is. All the examples I could find in the ML docs use direct form submission rather than curl. Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
Thanks, I’ll take a look. Cheers, E. -- Eliot Kimber http://contrext.com

From: on behalf of Gary Vidal Reply-To: MarkLogic Developer Discussion Date: Thursday, May 25, 2017 at 5:37 AM To: Subject: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Eliot, I will share some code I wrote using Apache Flink, which does exactly what you want to do for MarkLogic on a client machine. The problem is that with such an old version of ML you are forced to pull every document out and perform the analysis externally. In a previous life I wrote a version that runs on MarkLogic using spawn and parallel tasks; I'm not sure it would work on 4.2, but I'll share it for the sake of others. Feel free to contact me directly for any additional help. https://github.com/garyvidal/ml-libraries/tree/master/task-spawner

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
I got what I needed by creating a simple groovy script that uses the XCC library to submit queries. The script is below. My main discovery was that I need to create a new session for every iteration to avoid connection timeouts. With this I was able to process several hundred thousand docs and accumulate the results on my local machine. My command line is:

groovy -cp lib/xcc.jar GetArticleMetadataDetails.groovy

I chose groovy because it supports Java libraries directly and makes it easy to script things. Groovy script:

#!/usr/bin/env groovy
/*
 * Use XCC jar to run enrichment jobs and collect the results.
 */
import com.marklogic.xcc.*;
import com.marklogic.xcc.types.*;

ContentSource source = ContentSourceFactory.newContentSource("myserver", 1984, "user", "pw");
RequestOptions options = new RequestOptions();
options.setRequestTimeLimit(3600)

moduleUrl = "rq-metadata-analysis.xqy"
println "Running module ${moduleUrl}..."
println new Date()

File outfile = new File("query-result.xml")
outfile.write "\n";

(36..56).each { index ->
    Session session = source.newSession();
    ModuleInvoke request = session.newModuleInvoke(moduleUrl)
    println "Group number: ${index}, ${new Date()}"
    request.setNewIntegerVariable("", "groupNum", index);
    request.setNewIntegerVariable("", "length", 1);
    request.setOptions(options);
    ResultSequence rs = session.submitRequest(request);
    ResultItem item = rs.next();
    XdmItem xdmItem = item.getItem();
    InputStream is = item.asInputStream();
    is.eachLine { line ->
        outfile.append line
        outfile.append "\n"
    }
    session.close();
}
outfile.append "";
println "Done."
// End of script.

-- Eliot Kimber http://contrext.com

On 5/22/17, 10:43 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I haven’t yet seen anything in the docs that directly address what I’m trying to do and suspect I’m simply missing some ML basics or just going about things the wrong way.
I have a corpus of several hundred thousand docs (but could be millions, of course), where each doc is an average of 200K and several thousand elements. I want to analyze the corpus to get details about the number of specific subelements within each document, e.g.:

for $article in cts:search(/Article, cts:directory-query("/Default/", "infinity"))[$start to $end] return

I’m running this as a query from Oxygen (so I can capture the results locally so I can do other stuff with them). On the server I’m using, I blow the expanded tree cache if I try to request more than about 20,000 docs. Is there a way to do this kind of processing over an arbitrarily large set *and* get the results back from a single query request? I think the only solution is to write the results back to the database and then fetch that as the last thing, but I was hoping there was something simpler. Have I missed an obvious solution? Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
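The [$start to $end] windowing in the query, like the (36..56) group loop in the Groovy script, is plain range batching to keep each request under the expanded-tree-cache limit. A generic sketch of the pattern (Python, with a hypothetical helper name):

```python
def batches(total, size):
    """Yield 1-based inclusive (start, end) ranges covering `total` items."""
    for start in range(1, total + 1, size):
        yield start, min(start + size - 1, total)

list(batches(10, 4))  # [(1, 4), (5, 8), (9, 10)]
```

Each (start, end) pair would drive one query request (and, as discovered above, one fresh session), so no single request has to hold more than `size` documents in memory.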
Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
What is TDE? I’m not conversant with ML 9 features yet. Also, I’m currently working against an ML 4.2 server (don’t ask). TaskBot looks like just what I need, but the docs say it requires ML 7+, though it could possibly be made to work with earlier releases. If someone can point me in the right direction I can take a stab at making it work with ML 4. Thanks, Eliot -- Eliot Kimber http://contrext.com

On 5/23/17, 8:56 AM, "general-boun...@developer.marklogic.com on behalf of Erik Hennum" wrote:

Hi, Eliot: On reflection, let me retract the range index suggestion. I wasn't considering the domain implied by the element names -- it would never make sense to blow out a range index with the value of all of the paragraphs. The TDE suggestion for MarkLogic 9 would still work, however, because you could have an xs:short column with a value of 1 for every paragraph. Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Erik Hennum [erik.hen...@marklogic.com] Sent: Tuesday, May 23, 2017 6:21 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi, Eliot: One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group.
Hoping that's useful (and salutations in passing), Erik Hennum

From: general-boun...@developer.marklogic.com [general-boun...@developer.marklogic.com] on behalf of Geert Josten [geert.jos...@marklogic.com] Sent: Tuesday, May 23, 2017 12:53 AM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I’d consider using taskbot (http://registry.demo.marklogic.com/package/taskbot), and using that in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn’t scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily.

Cheers, Geert

On 5/23/17, 5:43 AM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

>I haven’t yet seen anything in the docs that directly address what I’m trying to do and suspect I’m simply missing some ML basics or just going about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but could be millions, of course), where each doc is an average of 200K and several thousand elements.
>
>I want to analyze the corpus to get details about the number of specific subelements within each document, e.g.:
>
>for $article in cts:search(/Article, cts:directory-query("/Default/", "infinity"))[$start to $end]
> return paras="{count($article//p)}"/>
>
>I’m running this as a query from Oxygen (so I can capture the results locally so I can do other stuff with them).
>
>On the server I’m using I blow the expanded tree cache if I try to request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database and then fetch that as the last thing but I was hoping there was something simpler.
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
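Erik's tuple-iteration suggestion (iterate the cts:value-tuples() results, incrementing a per-id counter in a map) reduces to a single counting pass. A Python sketch, with hypothetical tuples standing in for what the two range indexes would yield:

```python
from collections import Counter

# Hypothetical (article-id, element-name) tuples, standing in for what a
# cts:value-tuples() call over the Article/@id and p range indexes might yield:
tuples = [("a1", "p"), ("a1", "p"), ("a2", "p"), ("a1", "p")]

# one pass over the tuples, incrementing a per-article counter in a map
para_counts = Counter(article_id for article_id, _elem in tuples)

para_counts["a1"]  # 3
```

The whole computation runs against index data, which is why it avoids the expanded-tree-cache problem of fetching each document.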
[MarkLogic Dev General] Processing Large Number of Docs to Get Statistics
I haven’t yet seen anything in the docs that directly address what I’m trying to do and suspect I’m simply missing some ML basics or just going about things the wrong way. I have a corpus of several hundred thousand docs (but could be millions, of course), where each doc is an average of 200K and several thousand elements. I want to analyze the corpus to get details about the number of specific subelements within each document, e.g.:

for $article in cts:search(/Article, cts:directory-query("/Default/", "infinity"))[$start to $end] return

I’m running this as a query from Oxygen (so I can capture the results locally so I can do other stuff with them). On the server I’m using, I blow the expanded tree cache if I try to request more than about 20,000 docs. Is there a way to do this kind of processing over an arbitrarily large set *and* get the results back from a single query request? I think the only solution is to write the results back to the database and then fetch that as the last thing, but I was hoping there was something simpler. Have I missed an obvious solution? Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Docker & Amazon Cloud Support / Marklogic 9
I run MarkLogic in a container in order to quickly manage development environments where ML is part of a larger environment, e.g., RSuite CMS setups where I have ML, RSuite, and MySQL each running in a separate container, all managed via docker-compose. Cheers, E. -- Eliot Kimber http://contrext.com

From: on behalf of Dave Cassel Reply-To: MarkLogic Developer Discussion Date: Wednesday, May 17, 2017 at 12:54 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Docker & Amazon Cloud Support / Marklogic 9

I asked about Docker — Docker is not currently a supported environment, though we're happy to collect feedback about use cases. -- Dave Cassel, @dmcassel Technical Community Manager MarkLogic Corporation http://developer.marklogic.com/

From: on behalf of Dave Cassel Reply-To: MarkLogic Developer Discussion Date: Tuesday, May 16, 2017 at 9:00 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Docker & Amazon Cloud Support / Marklogic 9

AMIs are in the approval process and should be available soon. -- Dave Cassel, @dmcassel Technical Community Manager MarkLogic Corporation http://developer.marklogic.com/

From: on behalf of Andreas Felix Reply-To: MarkLogic Developer Discussion Date: Tuesday, May 16, 2017 at 10:47 AM To: "general@developer.marklogic.com" Subject: [MarkLogic Dev General] Docker & Amazon Cloud Support / Marklogic 9

Hi, is there already support for Amazon Cloud and Docker in MarkLogic 9? I found nothing in the Amazon Marketplace and no information about the announced Docker support in the Release Notes. regards andreas -- Mit freundlichen Grüßen / Kind regards Ing. Andreas Felix Senior IT Consultant EBCONT enterprise technologies GmbH Millennium Tower Handelskai 94-96 1200 Wien Mobil: +43 664 606 51 747 Fax: +43 2772 812 69-9 Email: andreas.fe...@ebcont.com Web: http://www.ebcont-et.com/ OUR TEAM IS YOUR SUCCESS HG St.
Pölten - FN 293731h UID: ATU63444589 ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Understanding My Profiling Results
OK, I’ve done some more timing analysis and determined that, as expected, the reverse query takes the bulk of the time. So it looks like about 0.025 seconds is the time we can expect for this reverse query, and any performance improvement will come from improved and/or more hardware. Cheers, Eliot -- Eliot Kimber http://contrext.com

On 5/16/17, 10:14 AM, "Eliot Kimber" wrote:

Apparently I’m an idiot—it was pointed out that the time for the cts:and-query() is just the query constructor, which of course takes no time, so the reported time is the time for the search itself. I can do more testing to see how much time each part of the search takes. Cheers, E. -- Eliot Kimber http://contrext.com

On 5/16/17, 9:33 AM, "Eliot Kimber" wrote:

Some more background: there are about 2.5 million MatchingQuery documents in the database I’m testing with. The reverse query index is of course turned on. Cheers, E. -- Eliot Kimber http://contrext.com

On 5/15/17, 7:44 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I’m getting a raw profiling report outside the context of CQ and trying to do some analysis on it (I’m running the same operation on several hundred input objects and collecting the profiling for each instance in order to try to get better trend data). I’ve identified one expression that takes the bulk of the processing time, but the profiling details aren’t adding up, so I’m wondering what I’m missing. Here’s the expression that is reported in the histogram:

cts:search(fn:collection()/MatchingQuery, cts:and-query((func:func-returns-boolean($some-param), cts:collection-query("collection-name"), cts:reverse-query($node))), "unfiltered")

The intent of this search is to find MatchingQuery documents that match the node in $node. The deep time for this is PT0.023642S and the shallow time is PT0.023289S, which is what I would expect (shallow and deep almost the same). So the question is, which of these terms is contributing to this time?
If I search for histogram entries for the individual terms I get a deep time of “0.000342” for the cts:and-query, which is obviously a small fraction of the total time of 0.023 seconds. Does that mean that the “fn:collection()/MatchingQuery” term accounts for the remaining time (the bulk of the 0.023 seconds)? If not, what accounts for the remaining time?

I’m also capturing the query meters and the only cache misses I’m seeing are value cache misses (17 misses, 1 hit). I’m not sure what aspect of this query (if any) would hit the value cache.

So my question: what are these times telling me about this particular search expression?

Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Understanding My Profiling Results
Apparently I’m an idiot—it was pointed out that the time for the cts:and-query() is just the query constructor, which of course takes no time, so the reported time is the time for the search itself. I can do more testing to see how much time each part of the search takes. Cheers, E. -- Eliot Kimber http://contrext.com

On 5/16/17, 9:33 AM, "Eliot Kimber" wrote:

Some more background: there are about 2.5 million MatchingQuery documents in the database I’m testing with. The reverse query index is of course turned on. Cheers, E. -- Eliot Kimber http://contrext.com

On 5/15/17, 7:44 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I’m getting a raw profiling report outside the context of CQ and trying to do some analysis on it (I’m running the same operation on several hundred input objects and collecting the profiling for each instance in order to try to get better trend data). I’ve identified one expression that takes the bulk of the processing time, but the profiling details aren’t adding up, so I’m wondering what I’m missing. Here’s the expression that is reported in the histogram:

cts:search(fn:collection()/MatchingQuery, cts:and-query((func:func-returns-boolean($some-param), cts:collection-query("collection-name"), cts:reverse-query($node))), "unfiltered")

The intent of this search is to find MatchingQuery documents that match the node in $node. The deep time for this is PT0.023642S and the shallow time is PT0.023289S, which is what I would expect (shallow and deep almost the same). So the question is, which of these terms is contributing to this time? If I search for histogram entries for the individual terms I get a deep time of “0.000342” for the cts:and-query, which is obviously a small fraction of the total time of 0.023 seconds. Does that mean that the “fn:collection()/MatchingQuery” term accounts for the remaining time (the bulk of the 0.023 seconds)? If not, what accounts for the remaining time?
I’m also capturing the query meters and the only cache misses I’m seeing are value cache misses (17 misses, 1 hit). I’m not sure what aspect of this query (if any) would hit the value cache. So my question: what are these times telling me about this particular search expression? Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Understanding My Profiling Results
Some more background: there are about 2.5 million MatchingQuery documents in the database I’m testing with. The reverse query index is of course turned on. Cheers, E. -- Eliot Kimber http://contrext.com

On 5/15/17, 7:44 PM, "general-boun...@developer.marklogic.com on behalf of Eliot Kimber" wrote:

I’m getting a raw profiling report outside the context of CQ and trying to do some analysis on it (I’m running the same operation on several hundred input objects and collecting the profiling for each instance in order to try to get better trend data). I’ve identified one expression that takes the bulk of the processing time, but the profiling details aren’t adding up, so I’m wondering what I’m missing. Here’s the expression that is reported in the histogram:

cts:search(fn:collection()/MatchingQuery, cts:and-query((func:func-returns-boolean($some-param), cts:collection-query("collection-name"), cts:reverse-query($node))), "unfiltered")

The intent of this search is to find MatchingQuery documents that match the node in $node. The deep time for this is PT0.023642S and the shallow time is PT0.023289S, which is what I would expect (shallow and deep almost the same). So the question is, which of these terms is contributing to this time? If I search for histogram entries for the individual terms I get a deep time of “0.000342” for the cts:and-query, which is obviously a small fraction of the total time of 0.023 seconds. Does that mean that the “fn:collection()/MatchingQuery” term accounts for the remaining time (the bulk of the 0.023 seconds)? If not, what accounts for the remaining time? I’m also capturing the query meters and the only cache misses I’m seeing are value cache misses (17 misses, 1 hit). I’m not sure what aspect of this query (if any) would hit the value cache. So my question: what are these times telling me about this particular search expression?
Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Understanding My Profiling Results
I’m getting a raw profiling report outside the context of CQ and trying to do some analysis on it (I’m running the same operation on several hundred input objects and collecting the profiling for each instance in order to try to get better trend data). I’ve identified one expression that takes the bulk of the processing time, but the profiling details aren’t adding up, so I’m wondering what I’m missing. Here’s the expression that is reported in the histogram:

cts:search(fn:collection()/MatchingQuery, cts:and-query((func:func-returns-boolean($some-param), cts:collection-query("collection-name"), cts:reverse-query($node))), "unfiltered")

The intent of this search is to find MatchingQuery documents that match the node in $node. The deep time for this is PT0.023642S and the shallow time is PT0.023289S, which is what I would expect (shallow and deep almost the same). So the question is, which of these terms is contributing to this time? If I search for histogram entries for the individual terms I get a deep time of “0.000342” for the cts:and-query, which is obviously a small fraction of the total time of 0.023 seconds. Does that mean that the “fn:collection()/MatchingQuery” term accounts for the remaining time (the bulk of the 0.023 seconds)? If not, what accounts for the remaining time? I’m also capturing the query meters and the only cache misses I’m seeing are value cache misses (17 misses, 1 hit). I’m not sure what aspect of this query (if any) would hit the value cache. So my question: what are these times telling me about this particular search expression? Thanks, Eliot -- Eliot Kimber http://contrext.com

___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
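For reading reports like this: under the usual profiler convention (deep time includes nested expressions, shallow time excludes them), subtracting the children's deep times from a parent's deep time approximates its shallow time. A sketch using the numbers from this report, assuming the cts:and-query is the only profiled child of the cts:search:

```python
def shallow_time(deep, child_deep_times):
    """Shallow time: an expression's own time, excluding nested expressions."""
    return deep - sum(child_deep_times)

round(shallow_time(0.023642, [0.000342]), 6)  # 0.0233
```

The result, 0.0233, is close to the reported shallow time of 0.023289, consistent with the later conclusion in the thread: the search itself, not the query construction, dominates.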
Re: [MarkLogic Dev General] Optimizing Reverse Queries
This is a process that is performed almost constantly as new material is added to the corpus or the classification details are refined, both of which happen all the time. One of the required features of the system is to produce a report of what the new classification would be if the full classification process were applied to the current content, so that those responsible for the classification can evaluate its correctness. That process takes so long that it risks delaying publishing of updated content in the time required by the business process this system serves.

I think I have enough to go on now to explore a few possible avenues, as well as gather more precise profiling and performance info.

Cheers, E. -- Eliot Kimber http://contrext.com

On 5/2/17, 1:04 AM, "Jason Hunter" wrote:

> By “which query” I mean which of the 125,000 separate query docs actually matched for a given cts:reverse-query() call.

cts:search( doc(), cts:reverse-query(doc("newdoc.xml")) )

This will return all the docs containing any serialized queries which would match newdoc.xml.

> I guess my question is: in the case where the reverse query is applied to an element that is not a full document, does the “brute force” have to be applied for every candidate query or only for those that match the containing document of the input element?

In general I avoid putting any xpath in the first arg. In the JavaScript API it's not even possible, because it gives a false sense of optimization.

> If the brute force cost is applied to each query then doing a two-phase search would be faster: determine which reverse queries apply to the input document and then use those to find the elements within the input document that actually matched. But if the brute force cost only applies to those queries that match the containing doc then ML internally must produce the result faster than doing it in my own code.
> > But as you say, that calls into question the use of reverse queries at all: why not simply run the 125,000 forward queries and update each element matched as appropriate?

Yep. If it's a one-time batch job and you're trying to minimize the time then this would be faster, I bet.

> Or it may simply be that we need to do some horizontal scaling and invest in additional D-nodes.

You're going to do this often?

-jh- ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
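[Editor's note: the two-phase search discussed in this thread can be sketched as below. This is a sketch only; the collection names, element name, and query-document shape are assumptions, not the poster's actual code.]

```xquery
xquery version "1.0-ml";
(: Phase 1: find stored query docs whose reverse query matches the
   whole input document. Phase 2: rebuild each matching query with
   cts:query() and test the individual elements of interest.
   Collection and element names are illustrative. :)
let $doc := doc("/input/newdoc.xml")
let $matched-query-docs :=
  cts:search(fn:collection("reverse-queries"),
             cts:reverse-query($doc), "unfiltered")
for $qdoc in $matched-query-docs
let $query := cts:query($qdoc/*)  (: assumes one serialized query per doc, at the root :)
for $e in $doc//title             (: an element to classify; name illustrative :)
where cts:contains($e, $query)
return <match query-uri="{xdmp:node-uri($qdoc)}">{$e}</match>
```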
Re: [MarkLogic Dev General] Optimizing Reverse Queries
By “which query” I mean which of the 125,000 separate query docs actually matched for a given cts:reverse-query() call.

I guess my question is: in the case where the reverse query is applied to an element that is not a full document, does the “brute force” have to be applied for every candidate query or only for those that match the containing document of the input element?

If the brute force cost is applied to each query then doing a two-phase search would be faster: determine which reverse queries apply to the input document and then use those to find the elements within the input document that actually matched. But if the brute force cost only applies to those queries that match the containing doc then ML internally must produce the result faster than doing it in my own code.

But as you say, that calls into question the use of reverse queries at all: why not simply run the 125,000 forward queries and update each element matched as appropriate? Or it may simply be that we need to do some horizontal scaling and invest in additional D-nodes.

Cheers, E. -- Eliot Kimber http://contrext.com

On 5/1/17, 10:26 PM, "Jason Hunter" wrote:

> Another question: having gotten a result from a reverse search at the full document level, is there a way to know *which* queries matched? If so then it would be easy enough to apply those queries to the relevant elements to do additional filtering (although I suppose that might get us back to the same place).

I'm a little confused. You're putting multiple serialized queries into each document? If you have just one serialized query in a document it's going to be obvious which query was the reverse match -- it was that one.

> In particular, if I have 125,000 reverse queries applied to a single document (assuming that total database volume doesn’t affect query speed in this case) on a modern fast server with appropriate indexes in place, how fast should I expect that query to take? 1ms? 10ms? 100ms? 1 second?
If you have 125,000 documents each with a serialized query in it and you do a reverse query for one document against those serialized queries and there's no hits, it should be extremely fast. More hits will slow things a little bit because hits involve a little work. The IMLS paper explains what the algorithm has to do. I suspect (but haven't measured) that it's a lot like forward queries in that the timing depends a lot on number of matches.

> Our corpus has about 25 million elements that would be fragments per the advice above (about 1.5 million full documents).

If you have 25 million elements you want to run against 125,000 serialized queries, wouldn't forward queries be faster? You'd only have to do 125,000 search calls instead of 25,000,000. :)

> I’ve never done much with fragments in MarkLogic so I’m not sure what the full implication of making these subelements into fragments would be for other processing.

Yeah, fragmentation is not to be done lightly.

-jh- ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
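[Editor's note: the forward-query alternative Jason suggests — 125,000 search calls instead of 25 million reverse-query calls — might look roughly like this. A sketch; the collection names are assumptions:]

```xquery
xquery version "1.0-ml";
(: Forward-query alternative: iterate the stored queries once and run
   each against the corpus, rather than reverse-querying every element.
   Collection names are illustrative. :)
for $qdoc in fn:collection("reverse-queries")
let $query := cts:query($qdoc/*)
return
  for $hit in cts:search(fn:collection("corpus"), $query, "unfiltered")
  return <hit query="{xdmp:node-uri($qdoc)}" doc="{xdmp:node-uri($hit)}"/>
```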
Re: [MarkLogic Dev General] Optimizing Reverse Queries
I think the key bit is here: “MarkLogic indexes work at the fragment/document level. So doing a reverse query 20 times against different subparts of a document is going to involve brute force filtering to see if the match was in the needed part or not.”

That suggests that our general approach to using reverse queries is flawed for this reason and would explain the apparent poor performance. It’s not possible to break the current docs into smaller docs but it might be possible to configure fragmentation at a level where each fragment would only have one element we need to match on (e.g., titles).

Another question: having gotten a result from a reverse search at the full document level, is there a way to know *which* queries matched? If so then it would be easy enough to apply those queries to the relevant elements to do additional filtering (although I suppose that might get us back to the same place).

Unfortunately my current performance metrics are “it takes way too long now and needs to take at most ½ as long”. I need to do more work to get some useful measurements and do some calculations to determine what reasonable performance should be (e.g., we have X million cases to check at 100ms per case, so it should take about Y time, but it takes Y*n time—why?). Ultimately I need to try to determine how fast it *should* be for this type of operation. If I can determine that then I can determine whether the throughput requirements can be met by simply achieving that performance with the current server configuration, or determine that it cannot and that we need to scale up, e.g., add additional D-nodes or something.

I realize that nobody can offer me solid numbers based on what little I can share about the project details, other than to suggest some bounds.
In particular, if I have 125,000 reverse queries applied to a single document (assuming that total database volume doesn’t affect query speed in this case) on a modern fast server with appropriate indexes in place, how fast should I expect that query to take? 1ms? 10ms? 100ms? 1 second? Based on my experience with ML and the documentation I would expect something around 10ms.

Our corpus has about 25 million elements that would be fragments per the advice above (about 1.5 million full documents). If we assume 10ms per query per fragment then it would take about 3 days to process all of them. Currently it takes 9, so roughly a 3x slowdown over what I think we could expect, +/- 1 day (there’s other overhead in this 9-day number that may or may not be reducible).

I’ve never done much with fragments in MarkLogic so I’m not sure what the full implication of making these subelements into fragments would be for other processing.

Cheers, Eliot -- Eliot Kimber http://contrext.com

On 5/1/17, 9:43 PM, "Jason Hunter" wrote:

So what's the performance you're seeing? And what do you expect to be able to see? Something to consider: MarkLogic indexes work at the fragment/document level. So doing a reverse query 20 times against different subparts of a document is going to involve brute force filtering to see if the match was in the needed part or not. Might be better to have 20 documents instead of 1. -jh-

> On May 2, 2017, at 01:29, Eliot Kimber wrote: > > Actually, it's expected that every element will be matched by at least one query. This is a classification application and the intent of the application is that every element of interest will be classified. Many, if not most, of the queries depend on word-search features, e.g., stemmed matches, case insensitivity, etc. > > I’m new to this project so it may be that there is a better way to approach the problem in general. This is the system as currently implemented.
> > My overall charge is to improve the throughput performance so my first task is to first understand what the performance bottlenecks are then identify possible solutions. > > It seems unlikely that we’ve done something silly in our queries or ML configuration but I want to eliminate the easy-to-fix before exploring more complicated options. > > Cheers, > > Eliot > > -- > Eliot Kimber > http://contrext.com > > > > On 5/1/17, 12:10 PM, "Jason Hunter" wrote: > >> The processing is, for each document to be processed, examine on the order of 10-20 elements to see if they match the reverse query by getting the node to be looked up and then doing: > >Maybe you can reverse query on the document as a whole instead of running 20 reverse queries per document. Only bother with the enumeration of the 20 if there's a proven hit within the document. > >(I assume the vast majority of the time the
Re: [MarkLogic Dev General] Optimizing Reverse Queries
Actually, it's expected that every element will be matched by at least one query. This is a classification application and the intent of the application is that every element of interest will be classified. Many, if not most, of the queries depend on word-search features, e.g., stemmed matches, case insensitivity, etc.

I’m new to this project so it may be that there is a better way to approach the problem in general. This is the system as currently implemented. My overall charge is to improve the throughput performance, so my first task is to understand what the performance bottlenecks are and then identify possible solutions. It seems unlikely that we’ve done something silly in our queries or ML configuration but I want to eliminate the easy-to-fix before exploring more complicated options.

Cheers, Eliot -- Eliot Kimber http://contrext.com

On 5/1/17, 12:10 PM, "Jason Hunter" wrote:

> The processing is, for each document to be processed, examine on the order of 10-20 elements to see if they match the reverse query by getting the node to be looked up and then doing:

Maybe you can reverse query on the document as a whole instead of running 20 reverse queries per document. Only bother with the enumeration of the 20 if there's a proven hit within the document. (I assume the vast majority of the time there's not going to be hits. If that's true then why not prove that in one pop instead of 20 pops.)

-jh- ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
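[Editor's note: Jason's suggestion — prove a document-level hit "in one pop" before enumerating the 10-20 elements — can be sketched like this. The collection and element names are assumptions, not the actual application code:]

```xquery
xquery version "1.0-ml";
(: Check the whole document against the stored queries first; only on a
   document-level hit enumerate the individual elements of interest.
   Collection and element names are illustrative. :)
let $doc := doc("/input/some-doc.xml")
where xdmp:exists(
        cts:search(fn:collection("reverse-queries"),
                   cts:reverse-query($doc), "unfiltered"))
return
  for $e in $doc//(title | para)  (: the 10-20 elements per document :)
  return cts:search(fn:collection("reverse-queries"),
                    cts:reverse-query($e), "unfiltered")
```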
Re: [MarkLogic Dev General] Optimizing Reverse Queries
Just realized I didn’t show all the relevant query details (the actual query has more terms but I’m not at liberty to show those details). But the cts:search does specify "unfiltered". So a more complete representation is:

cts:search( collection()/MyReverseQueries, cts:and-query(( me:normal-query1(), me:normal-query2(), cts:reverse-query($node))), "unfiltered")

Cheers, E. -- Eliot Kimber http://contrext.com

On 5/1/17, 10:31 AM, "Eliot Kimber" wrote:

Here is a typical reverse query document. Others may be a bit more complex, for example, an OR query matching on text strings or doing a cts word search: [serialized cts query element (namespace http://marklogic.com/cts) with query text "c15fc" and options case-insensitive, diacritic-insensitive, punctuation-insensitive, whitespace-insensitive, unstemmed, and wildcarded; the element markup was lost in archiving]

The processing is, for each document to be processed, examine on the order of 10-20 elements to see if they match the reverse query by getting the node to be looked up and then doing: cts:search(cts:reverse-query($node))

The initial profiling we did was just taking one source document and applying the process that then uses these reverse queries (that is, we haven’t yet had a chance to profile a larger run of documents). I’m just starting my performance analysis here, but I don’t have any experience with reverse queries so I mostly just wanted to make sure that there wasn’t something fairly obvious that I might look for as a source of slowness before digging into things more deeply. I’m pretty sure I’ll have to do deeper profiling to see where the time is really being taken—strong possibility that it’s in our code and not really the reverse queries.

Cheers, Eliot -- Eliot Kimber http://contrext.com

On 5/1/17, 10:00 AM, "Jason Hunter" wrote:

On May 1, 2017, at 20:45, Eliot Kimber wrote: > > Using ML 8 we have an application that relies on reverse queries. The overall application is not performing as well as we need it to and our initial attempts at profiling show that the reverse queries are taking most of the time.
We have about 120,000 separate reverse query documents.

What kind of reverse queries are they? Text? Geo? Simple? Complex?

> The “Inside MarkLogic” document suggests that reverse queries, properly indexed, should be quite fast. I have verified that we have the “fast reverse queries” index turned on. > > My question: What should I look for that might be causing our reverse queries to not be optimized?

What are you doing with them? Looping against 1,000 documents? Sample code will help us all understand. How fast are they running exactly? How fast do you need them to run?

> Are there any other ML settings or server configurations that might affect reverse query performance? Are there particular query patterns that might be suboptimal? Is there a way that I can confirm that the reverse queries are performing as fast as possible?

The xdmp:plan function is your friend.

-jh- ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
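[Editor's note: for readers unfamiliar with xdmp:plan, it takes a searchable expression and returns the plan ML would use to resolve it, which helps confirm whether a search resolves purely from indexes. A sketch; the document URI is illustrative:]

```xquery
xquery version "1.0-ml";
(: Ask the optimizer how it would resolve the search; the returned
   plan shows the index terms consulted. The URI is illustrative. :)
xdmp:plan(
  cts:search(fn:collection(),
             cts:reverse-query(doc("/input/some-doc.xml"))))
```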
Re: [MarkLogic Dev General] Optimizing Reverse Queries
Here is a typical reverse query document. Others may be a bit more complex, for example, an OR query matching on text strings or doing a cts word search: [serialized cts query element (namespace http://marklogic.com/cts) with query text "c15fc" and options case-insensitive, diacritic-insensitive, punctuation-insensitive, whitespace-insensitive, unstemmed, and wildcarded; the element markup was lost in archiving]

The processing is, for each document to be processed, examine on the order of 10-20 elements to see if they match the reverse query by getting the node to be looked up and then doing: cts:search(cts:reverse-query($node))

The initial profiling we did was just taking one source document and applying the process that then uses these reverse queries (that is, we haven’t yet had a chance to profile a larger run of documents). I’m just starting my performance analysis here, but I don’t have any experience with reverse queries so I mostly just wanted to make sure that there wasn’t something fairly obvious that I might look for as a source of slowness before digging into things more deeply. I’m pretty sure I’ll have to do deeper profiling to see where the time is really being taken—strong possibility that it’s in our code and not really the reverse queries.

Cheers, Eliot -- Eliot Kimber http://contrext.com

On 5/1/17, 10:00 AM, "Jason Hunter" wrote:

On May 1, 2017, at 20:45, Eliot Kimber wrote: > > Using ML 8 we have an application that relies on reverse queries. The overall application is not performing as well as we need it to and our initial attempts at profiling show that the reverse queries are taking most of the time. We have about 120,000 separate reverse query documents.

What kind of reverse queries are they? Text? Geo? Simple? Complex?

> The “Inside MarkLogic” document suggests that reverse queries, properly indexed, should be quite fast. I have verified that we have the “fast reverse queries” index turned on. > > My question: What should I look for that might be causing our reverse queries to not be optimized?

What are you doing with them? Looping against 1,000 documents?
Sample code will help us all understand. How fast are they running exactly? How fast do you need them to run? > Are there any other ML settings or server configurations that might affect reverse query performance? Are there particular query patterns that might be suboptimal? Is there a way that I can confirm that the reverse queries are performing as fast as possible? The xdmp:plan function is your friend. -jh- ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Optimizing Reverse Queries
Using ML 8 we have an application that relies on reverse queries. The overall application is not performing as well as we need it to and our initial attempts at profiling show that the reverse queries are taking most of the time. We have about 120,000 separate reverse query documents. The “Inside MarkLogic” document suggests that reverse queries, properly indexed, should be quite fast. I have verified that we have the “fast reverse queries” index turned on. My question: What should I look for that might be causing our reverse queries to not be optimized? Are there any other ML settings or server configurations that might affect reverse query performance? Are there particular query patterns that might be suboptimal? Is there a way that I can confirm that the reverse queries are performing as fast as possible? This is an application that is applied to 100s of 1000s of documents, so even a small performance improvement will be significant. Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1
Thanks to a suggestion from Norm Walsh I got this working. The key was simply adding the /opt/MarkLogic/lib dir to the LD_LIBRARY_PATH environment variable. I now have ML 4 running under an Ubuntu 14-based Docker container, along with another container for RSuite 3.6.3 and a third for MySQL 5.1 (on which RSuite 3.6.3 also depends). I haven’t circled back to see if the same approach would work with CentOS—it probably would.

Part of the reason for insisting on ML4 is that I need to do performance comparisons between the current ML4-based environment and a potential ML7- or 8-based environment (where I presume I will see performance improvements, but you don’t know until you measure). So I need ML4 to establish a baseline.

Even if older versions of a product are not supported, the installers should be provided for situations like this—not everyone can upgrade, and technologies like Docker make it possible to maintain older environments. I understand the need to limit support exposure, but you can do that without severely inconveniencing customers by simply making support policies clear as regards older versions. I was fortunate that I found a packrat who never throws anything away…

Cheers, Eliot -- Eliot Kimber http://contrext.com

On 4/28/17, 6:01 PM, "Eliot Kimber" wrote:

Upgrading this particular RSuite installation is not an option at this time. I’m going to explore installing ML4 on Ubuntu. Cheers, Eliot -- Eliot Kimber http://contrext.com

On 4/28/17, 3:58 PM, "Ganesh Vaideeswaran" wrote:

Eliot, MarkLogic 4 does not support CentOS 6. So, I am not sure if I can offer any more guidance on this other than to say please upgrade RSuite to a version that runs on a supported MarkLogic version. And you are probably looking at that option.
Ganesh -Original Message- From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot Kimber Sent: Friday, April 28, 2017 3:31 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1 I realize that ML4 is not supported. However, the application I’m using (RSuite 3.6.3) requires ML 4 so I need to be able to run it. Cheers, Eliot -- Eliot Kimber http://contrext.com On 4/28/17, 3:29 PM, "Ganesh Vaideeswaran" wrote: Eliot, MarkLogic 4 is not supported anymore. I would encourage you to upgrade to the latest version of MarkLogic 8. With respect to what is the best OS choice, since you are familiar with CentOS, I would suggest CentOS 7 though MarkLogic 8 supports CentOS 6 as well. Also at this time, we do not test MarkLogic running inside a docker. If you deploy MarkLogic inside a container and you need help from our support team, they _may_ request you reproduce the issue in a supported platform. Note that MarkLogic 9 only supports CentOS 7. Good luck with your upgrade. Ganesh -Original Message- From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot Kimber Sent: Friday, April 28, 2017 3:18 PM To: general@developer.marklogic.com Subject: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1 In order to support an ancient version of RSuite CMS I need to run MarkLogic 4.2. I have a 64-bit RPM that I’ve installed into Cento6. 
However, when I run MarkLogic I get this failure: [root@localhost bin]# ./MarkLogic ./MarkLogic: error while loading shared libraries: libbteuclid.so.6.5.1: cannot open shared object file: No such file or directory [root@localhost bin]# The library is present: [root@localhost bin]# ls ../lib libbteuclid.so.6.5.1 libbtrliprofile.so.6.5.1 libbtrlpc.so.6.5.1 libbtunicode.so libbtutils.so.6.5.0 libbtrlijni.solibbtrlpcore.so.6.5 libbtrlpjni.so libbtutiljni.so [root@localhost bin]# So I think it must be a configuration issue, possibly the wrong version of the library. My online research did not reveal any obvious solution and my linux fu is weak. I have the Redhat lsb-base package installed: [root@localhost /]# yum install redhat-lsb Loaded plugins: fastestmirror, refresh-packagekit, security Setting up Install Process Loading mirror speeds from cac
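[Editor's note: the fix Eliot reports at the top of this thread — adding /opt/MarkLogic/lib to LD_LIBRARY_PATH before starting the server — can be sketched as below; the path is the default MarkLogic install location, so adjust for your layout.]

```shell
# Make MarkLogic's bundled libraries (libbteuclid etc.) visible to the
# dynamic loader before starting the server. Prepends the lib dir,
# preserving any existing LD_LIBRARY_PATH value.
export LD_LIBRARY_PATH=/opt/MarkLogic/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```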
Re: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1
Upgrading this particular RSuite installation is not an option at this time. I’m going to explore installing ML4 on Ubuntu. Cheers, Eliot -- Eliot Kimber http://contrext.com On 4/28/17, 3:58 PM, "Ganesh Vaideeswaran" wrote: Eliot, MarkLogic 4 does not support CentOS 6. So, I am not sure if I can offer any more guidance on this other than to say please upgrade RSuite to a version that runs on a supported MarkLogic version. And you are probably looking at that option. Ganesh -Original Message- From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot Kimber Sent: Friday, April 28, 2017 3:31 PM To: MarkLogic Developer Discussion Subject: Re: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1 I realize that ML4 is not supported. However, the application I’m using (RSuite 3.6.3) requires ML 4 so I need to be able to run it. Cheers, Eliot -- Eliot Kimber http://contrext.com On 4/28/17, 3:29 PM, "Ganesh Vaideeswaran" wrote: Eliot, MarkLogic 4 is not supported anymore. I would encourage you to upgrade to the latest version of MarkLogic 8. With respect to what is the best OS choice, since you are familiar with CentOS, I would suggest CentOS 7 though MarkLogic 8 supports CentOS 6 as well. Also at this time, we do not test MarkLogic running inside a docker. If you deploy MarkLogic inside a container and you need help from our support team, they _may_ request you reproduce the issue in a supported platform. Note that MarkLogic 9 only supports CentOS 7. Good luck with your upgrade. Ganesh -Original Message- From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot Kimber Sent: Friday, April 28, 2017 3:18 PM To: general@developer.marklogic.com Subject: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1 In order to support an ancient version of RSuite CMS I need to run MarkLogic 4.2. 
I have a 64-bit RPM that I’ve installed into Cento6. However, when I run MarkLogic I get this failure: [root@localhost bin]# ./MarkLogic ./MarkLogic: error while loading shared libraries: libbteuclid.so.6.5.1: cannot open shared object file: No such file or directory [root@localhost bin]# The library is present: [root@localhost bin]# ls ../lib libbteuclid.so.6.5.1 libbtrliprofile.so.6.5.1 libbtrlpc.so.6.5.1 libbtunicode.so libbtutils.so.6.5.0 libbtrlijni.solibbtrlpcore.so.6.5 libbtrlpjni.so libbtutiljni.so [root@localhost bin]# So I think it must be a configuration issue, possibly the wrong version of the library. My online research did not reveal any obvious solution and my linux fu is weak. I have the Redhat lsb-base package installed: [root@localhost /]# yum install redhat-lsb Loaded plugins: fastestmirror, refresh-packagekit, security Setting up Install Process Loading mirror speeds from cached hostfile * base: mirror.5ninesolutions.com * extras: centos.eecs.wsu.edu * updates: mirror.5ninesolutions.com Package redhat-lsb-4.0-7.el6.centos.x86_64 already installed and latest version Nothing to do So my question: Is it possible to run ML 4.2 under CentOS and if so what do I need to do to resolve this library problem? If not, what is my best OS choice? My ultimate goal is to run ML in a Docker container (along with RSuite and MySQL, on which RSuite depends), so I was using the ML 7+ Dockerfile as a base (thus my use of CentOS). Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1
I realize that ML4 is not supported. However, the application I’m using (RSuite 3.6.3) requires ML 4 so I need to be able to run it. Cheers, Eliot -- Eliot Kimber http://contrext.com On 4/28/17, 3:29 PM, "Ganesh Vaideeswaran" wrote: Eliot, MarkLogic 4 is not supported anymore. I would encourage you to upgrade to the latest version of MarkLogic 8. With respect to what is the best OS choice, since you are familiar with CentOS, I would suggest CentOS 7 though MarkLogic 8 supports CentOS 6 as well. Also at this time, we do not test MarkLogic running inside a docker. If you deploy MarkLogic inside a container and you need help from our support team, they _may_ request you reproduce the issue in a supported platform. Note that MarkLogic 9 only supports CentOS 7. Good luck with your upgrade. Ganesh -Original Message- From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Eliot Kimber Sent: Friday, April 28, 2017 3:18 PM To: general@developer.marklogic.com Subject: [MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1 In order to support an ancient version of RSuite CMS I need to run MarkLogic 4.2. I have a 64-bit RPM that I’ve installed into Cento6. However, when I run MarkLogic I get this failure: [root@localhost bin]# ./MarkLogic ./MarkLogic: error while loading shared libraries: libbteuclid.so.6.5.1: cannot open shared object file: No such file or directory [root@localhost bin]# The library is present: [root@localhost bin]# ls ../lib libbteuclid.so.6.5.1 libbtrliprofile.so.6.5.1 libbtrlpc.so.6.5.1 libbtunicode.so libbtutils.so.6.5.0 libbtrlijni.solibbtrlpcore.so.6.5 libbtrlpjni.so libbtutiljni.so [root@localhost bin]# So I think it must be a configuration issue, possibly the wrong version of the library. My online research did not reveal any obvious solution and my linux fu is weak. 
I have the Redhat lsb-base package installed: [root@localhost /]# yum install redhat-lsb Loaded plugins: fastestmirror, refresh-packagekit, security Setting up Install Process Loading mirror speeds from cached hostfile * base: mirror.5ninesolutions.com * extras: centos.eecs.wsu.edu * updates: mirror.5ninesolutions.com Package redhat-lsb-4.0-7.el6.centos.x86_64 already installed and latest version Nothing to do So my question: Is it possible to run ML 4.2 under CentOS and if so what do I need to do to resolve this library problem? If not, what is my best OS choice? My ultimate goal is to run ML in a Docker container (along with RSuite and MySQL, on which RSuite depends), so I was using the ML 7+ Dockerfile as a base (thus my use of CentOS). Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
[MarkLogic Dev General] Attempting to Run ML 4 Under CentOs: Can't find libbteuclid.so.6.5.1
In order to support an ancient version of RSuite CMS I need to run MarkLogic 4.2. I have a 64-bit RPM that I’ve installed into CentOS 6. However, when I run MarkLogic I get this failure:

[root@localhost bin]# ./MarkLogic ./MarkLogic: error while loading shared libraries: libbteuclid.so.6.5.1: cannot open shared object file: No such file or directory [root@localhost bin]#

The library is present:

[root@localhost bin]# ls ../lib libbteuclid.so.6.5.1 libbtrliprofile.so.6.5.1 libbtrlpc.so.6.5.1 libbtunicode.so libbtutils.so.6.5.0 libbtrlijni.so libbtrlpcore.so.6.5 libbtrlpjni.so libbtutiljni.so [root@localhost bin]#

So I think it must be a configuration issue, possibly the wrong version of the library. My online research did not reveal any obvious solution and my linux fu is weak. I have the Redhat lsb-base package installed:

[root@localhost /]# yum install redhat-lsb Loaded plugins: fastestmirror, refresh-packagekit, security Setting up Install Process Loading mirror speeds from cached hostfile * base: mirror.5ninesolutions.com * extras: centos.eecs.wsu.edu * updates: mirror.5ninesolutions.com Package redhat-lsb-4.0-7.el6.centos.x86_64 already installed and latest version Nothing to do

So my question: Is it possible to run ML 4.2 under CentOS and if so what do I need to do to resolve this library problem? If not, what is my best OS choice? My ultimate goal is to run ML in a Docker container (along with RSuite and MySQL, on which RSuite depends), so I was using the ML 7+ Dockerfile as a base (thus my use of CentOS).

Thanks, Eliot -- Eliot Kimber http://contrext.com ___ General mailing list General@developer.marklogic.com Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
Re: [MarkLogic Dev General] Determining Whether Whitespace is In Data as Stored or A Result of Serialization?
I upgraded to 4.2-7 and verified that the serialization issue was resolved.

Cheers,

E.

On 11/30/11 6:56 AM, "general@developer.marklogic.com" wrote:

> Date: Tue, 29 Nov 2011 16:55:31 -0800
> From: Danny Sokolsky
> Subject: Re: [MarkLogic Dev General] Determining Whether Whitespace is
> In Data as Stored or A Result of Serialization?
> To: General MarkLogic Developer Discussion
>
> Hi Eliot,
>
> There were some changes made in later 4.2 releases to restore the behavior
> from earlier releases. The serialization is about how it is output, not how
> it is stored, so it should be stored correctly.
>
> I recommend trying it on the latest 4.2 release (4.2-7 now, I think). I think
> it will then, by default, behave the same as in 4.1. In 4.2, there are some
> serialization options you can set at the query level to control this. In
> MarkLogic 5, you can also control these options' default values at the App
> Server level.
>
> Here is the 4.2 release note item that describes some of these changes:
>
> http://docs.marklogic.com/4.2doc/docapp.xqy#display.xqy?fname=http://pubs/4.2doc/xml/relnotes/chap4.xml%2340996
>
> -Danny

--
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general
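[Editorial note, not part of the original post.] The query-level serialization control Danny mentions is set in the XQuery prolog via the xdmp:output option. A minimal sketch, reusing the example document URI from the thread below (the exact set of supported option values should be checked against the release notes for your version):

```xquery
xquery version "1.0-ml";

(: ask the serializer not to add indentation whitespace on output :)
declare option xdmp:output "indent = no";

doc("/foo/bar/mynewdoc.xml")
```

Because this is a serialization option, it changes only how results are written out, not what is stored in the database.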
[MarkLogic Dev General] Determining Whether Whitespace is In Data as Stored or A Result of Serialization?
I have determined that content loaded through the XccRunner.load() method has unwanted whitespace, not present in the original XML, when subsequently accessed from MarkLogic. I've tested on 4.2-1. Earlier versions do not seem to have this behavior (although I need to do more testing to confirm--but we certainly would have noticed it if they had, as from our standpoint it constitutes a data corruption issue: the data being returned from ML is different from what was given to ML).

I traced the DOM being loaded right to the call of load() and verified by inspection that there were no whitespace nodes between two particular elements, e.g., the original source was:

texttext

Accessing the loaded document using, e.g.:

doc('/foo/bar/mynewdoc.xml')

results in:

text text

(where there is whitespace before the start tags and before the close tag). I tried various access routes, including CQ, access via our own product's calls to the XccRunner API, OxygenXML via WebDAV, and direct XQuery (via Xcc), and get the same result. Some accesses show more indentation than others, but they all have indentation.

From what I could find, it appears that this is the result of a change in the default serialization options.

My primary question is: how can I determine how the XML is stored in ML without interference from any serialization options? Assuming ML is not literally storing the bytes of the XML, I assume I can't just look inside the forest, but is there a reliable way to see what the original whitespace was? My first task is to prove that the XML is correct as provided to MarkLogic.

My secondary questions:

1. Is there any way that options on the load() method could affect whitespace as stored? I didn't see any, but I could have missed something.

2. If this is in fact a function of serialization options, where would we control that in our Java code that uses Xcc to run XQueries? Is it simply a matter of adding "declare option xdmp:output indent=no;" to our XQuery modules?

3. Is this default serialization behavior changed in ML 5?

Thanks,

Eliot
--
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.reallysi.com
www.rsuitecms.com

___
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general
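[Editorial note, not part of the original post.] One way to answer the primary question--what whitespace is actually stored, independent of output serialization--is to inspect the stored text nodes directly rather than looking at serialized output. A sketch in XQuery, using the example URI from the message above; whitespace-only text nodes either exist in the stored tree or they don't, regardless of whether the serializer later indents:

```xquery
xquery version "1.0-ml";

(: Count text nodes that contain only whitespace. If the document was
   stored without inter-element whitespace, this should be 0 even when
   serialized output appears indented. :)
count(
  doc("/foo/bar/mynewdoc.xml")//text()[normalize-space(.) = ""]
)
```

If this returns 0 but the serialized document shows indentation, the whitespace is a serialization artifact rather than a storage change.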
Re: [MarkLogic Dev General] Trailing Spaces Removed from Attribute Values--Bug or Feature?
Michael Blakeley wrote:

> I can't speak to the product issue, but is it practical to work around this
> behavior by ending your class attributes with a dummy character? For example:
>
> /self::*[contains(@class, ' topic/topic ')] => true
>
> DITA attaches meaning to the leading '-' and '+' characters. Will the
> trailing '-' cause problems? If so, could another character be used?

I've been exploring that question with the DITA Technical Committee, and the answer appears to be that there are in fact tools that would be disrupted by having anything after the initial "-" or "+" that is not a module/type name pair. Adding a trailing character would be an easy fix, but it would still require scrubbing of the data for use by tools outside of MarkLogic.

Cheers,

Eliot
--
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 610.631.6770
www.reallysi.com
www.rsuitecms.com

___
General mailing list
General@developer.marklogic.com
http://xqzone.com/mailman/listinfo/general
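[Editorial note, not part of the original post.] A workaround that avoids depending on leading or trailing spaces surviving storage at all is to tokenize the @class value and test for the module/type pair as a whole token, rather than matching a space-delimited substring. A sketch; the sample element and its class value are illustrative, not taken from the thread:

```xquery
xquery version "1.0-ml";

(: Match a DITA class token without relying on surrounding whitespace.
   The sample element below is hypothetical. :)
let $el := <topic class="- topic/topic concept/concept "/>
return tokenize(normalize-space($el/@class), " ") = "topic/topic"
```

The general comparison against the token sequence is true if any token equals "topic/topic", so stripped or collapsed whitespace in the stored attribute no longer matters. It does not, of course, help tools outside MarkLogic that expect the trailing space to be present.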