Just a general question about regular expression support, and I would welcome
some input from MarkLogic.
Ideally, I think full support for regular expression matching and named groups
in cts:highlight and other searches would be a nice-to-have feature.
I am working on a lot of regex matching to automatically assign markup to XML
elements and to query for patterns using an index.
What I am looking for are specific patterns in the text for legal citations,
so I cannot enumerate all the possible combinations of a pattern. My
workaround is a regex-matching XQuery function that returns the matching and
non-matching text once I have located the document.
I then collect all the matching phrases and create a cts:word-query for each
match, then run cts:highlight over the matches, first to create the boundary
element. After that I iterate over the boundary elements to add metadata to
each one.
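The first step of this workaround (splitting text into matching and
non-matching runs, then wrapping each match in a boundary element) can be
sketched outside MarkLogic. The following is only an illustration in Python,
not the XQuery function described above: the citation pattern, the <cite>
element name, and both function names are assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern for a simple reporter citation such as "410 U.S. 113".
CITATION = re.compile(r"\d+ U\.S\. \d+")

def split_matches(text, pattern=CITATION):
    """Return (is_match, segment) pairs that cover the whole text in order."""
    segments = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            # Text between the previous match and this one.
            segments.append((False, text[pos:m.start()]))
        segments.append((True, m.group(0)))
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last match.
        segments.append((False, text[pos:]))
    return segments

def mark_up(text, pattern=CITATION):
    """Wrap each match in a <cite> boundary element; escape everything else."""
    return "".join(
        "<cite>" + escape(seg) + "</cite>" if is_match else escape(seg)
        for is_match, seg in split_matches(text, pattern)
    )
```

Calling mark_up on a sentence wraps each citation and leaves the surrounding
text in place. The second pass (adding metadata to each boundary element)
would then work on the <cite> elements.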
Here are my limitations:
* I cannot capture named groups (ideally I could use non-capturing groups
and just use replace functions).
* fn:replace only returns positions 1-9, as I read the XQuery spec (again,
non-capturing groups will muddy the regex, or I regex the regex :-) to turn all
non-named groups into non-capturing groups).
* Speed is a concern, and native functions would ideally perform better.
* I would like to use regex in cts:queries when searching for documents.
* I need to build an expression with access to $cts:text and $cts:node, as in
cts:highlight. My own function limits my ability to construct nodes to pass
to the function the way cts:highlight does.
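To make the named-group limitation concrete, here is a small sketch of what
named capture gives you in an engine that supports it. It uses Python rather
than XQuery, so the syntax is (?P<name>...) instead of (?<name>...). The
pattern, the group names, and the function name are hypothetical and cover
only a few reporter abbreviations.

```python
import re

# Hypothetical citation pattern with named groups (volume, reporter, page).
# The inner (?:...) is a non-capturing group, so it adds no numbered position.
CITATION = re.compile(
    r"(?P<volume>\d+) (?P<reporter>(?:U\.S\.|F\.\dd|S\. Ct\.)) (?P<page>\d+)"
)

def citation_metadata(text):
    """Return one dict of named-group values per citation found in text."""
    return [m.groupdict() for m in CITATION.finditer(text)]
```

With named groups the metadata for each match comes back keyed by name, so
there is no need to count positional groups or rewrite the pattern to stay
within a fixed number of back-references.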
Ideally, a function or set of functions to do regex matching on indexes would
be useful, or general-purpose utilities that perform such functions:
Recommendation 1:
A highlight utility, or an enhancement of cts:highlight, to allow for regex:
cts:pattern-highlight($node, $query, $expression)
cts:text : the text captured
cts:group as element(cts:group) (captures named regex groups, e.g. (?<group>expr))
Recommendation 2:
A cts:query that allows for regex patterns:
cts:(regex|pattern)-query($patterns as xs:string*, $options, $weight)
$patterns : a regular expression or a sequence of expressions
$options  : (case-sensitive|case-insensitive|
                 (: i = regex ignore case :)
             whitespace-sensitive|whitespace-insensitive|
                 (: x = whitespace mode :)
             single-mode|multiline|
                 (: s = single-line mode :)
             element-boundary|
                 (: I guess preserve element boundaries :)
             named-capture|indexed-capture)
                 (: captures group names and returns them to cts:group :)
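As a rough illustration of how such option strings could translate into regex
flags, here is a sketch in Python. The option names come from the proposal
above, but the mapping, the OPTION_FLAGS table, and compile_patterns are
assumptions for the example, and Python's flags are only the closest analogs
of the XQuery i, x, s, and m flags.

```python
import re

# Hypothetical mapping from the proposed option names to Python regex flags.
# Options with no flag analog (case-sensitive, whitespace-sensitive,
# element-boundary, named-capture, indexed-capture) are treated as defaults.
OPTION_FLAGS = {
    "case-insensitive": re.IGNORECASE,     # i
    "whitespace-insensitive": re.VERBOSE,  # x (closest analog)
    "single-mode": re.DOTALL,              # s
    "multiline": re.MULTILINE,             # m
}

def compile_patterns(patterns, options=()):
    """Compile every pattern with the flags implied by the option names."""
    flags = 0
    for option in options:
        flags |= OPTION_FLAGS.get(option, 0)
    return [re.compile(p, flags) for p in patterns]
```

For example, compile_patterns([r"roe v\. wade"], ["case-insensitive"]) yields
a compiled pattern that also matches "Roe v. Wade".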
Also, could someone weigh in on the ramifications of regex searches with
MarkLogic indexing, and on whether native regex support for cts:search
(beyond fn:matches, fn:replace, fn:tokenize) is a possibility?