Which very cleverly relies on ($doc/node() instance of text()) failing
for documents with multiple root nodes (maybe a bit too clever as an
example, but works...)
thanks!
Michael Blakeley wrote:
Yes, that's correct. Here's another take on the XQuery:
let $doc := doc($uri)
return
if ($doc/binary()) then 'binary'
else if ($doc/node() instance of text()) then 'text'
else 'xml'
Note that this test will treat empty documents (i.e., the URI does not
exist) as xml. You could test for that case via empty(), or just 'if
($doc)...'
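A minimal sketch of that extra check, for illustration only. The URI
value and the 'missing' label are hypothetical and not from the original
message; binary() is the MarkLogic-specific kind test used above.

```xquery
(: Hypothetical URI, used only to make the sketch self-contained. :)
let $uri := "/example.xml"
let $doc := doc($uri)
return
  (: doc() returns the empty sequence when no document exists at $uri. :)
  if (empty($doc)) then 'missing'
  else if ($doc/binary()) then 'binary'
  (: True only when the document node has exactly one child and it is text. :)
  else if ($doc/node() instance of text()) then 'text'
  else 'xml'
```

Checking for the missing document first keeps the final 'xml' branch from
acting as a catch-all for URIs that do not exist.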
-- Mike
Mike Sokolov wrote:
OK I am pursuing a solution along those general lines. Just out of
curiosity though: does this mean that internally there is no
distinction between xml documents and text documents and binary
documents? It sounds as if text documents are simply documents that
happen to have a single text node (and same for binary) - is that right?
-Mike
Danny Sokolsky wrote:
Mike,
I think your approach is the right idea, only it needs a little more
logic to be more robust. If you took the last() instead of the first in
your node-kind test, that might work most of the time (or more often):
node-kind(doc($uri)/node()[last()])
Here is a similar idea using the instance of operator, performing a
little logic to make a best-guess at the type:
define function doctype($x as node()) as element()
{
  <node>
    <uri>{xdmp:node-uri($x)}</uri>
    <type>{
      if ($x/node() instance of binary())
      then "binary node"
      else if ($x/node() instance of element())
      then "XML node"
      else if ($x/node() instance of text())
      then "text node"
      else "not sure"
    }</type>
  </node>
}
for $x in doc()[1 to 100]
return doctype($x)
I have not found any of my documents that return "not sure" here, but I
can imagine that you might be able to construct one.
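One way such a document could look, as a sketch: a document node with
more than one child. The in-memory document below is a hypothetical
example, not one taken from a real database. Because "instance of
element()" requires exactly one item, a two-item sequence fails every
branch of the test above.

```xquery
(: Hypothetical in-memory document: a comment followed by the root element. :)
let $x := document { comment { "header" }, <root/> }
return (
  (: The document node has two children. :)
  count($x/node()),
  (: False: the sequence has two items, so it is not a single element. :)
  $x/node() instance of element(),
  (: False for the same reason, so doctype() would report "not sure". :)
  $x/node() instance of text(),
  (: True: selecting only element children leaves exactly one item. :)
  $x/* instance of element()
)
```

Testing $x/* rather than $x/node() is one possible way to tolerate
leading comments, processing instructions, or a stray text node.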
-Danny
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Mike Sokolov
Sent: Monday, March 31, 2008 10:34 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] document format
I have been trying to come up with a way to determine the "format"
of a document in MarkLogic. The only api call that seems directly
related is xdmp:document-uri-format, but this seems to operate on
the uri without any reference to the contents of a document.
Instead, I tried testing:
node-kind(doc($uri)/node()[1])
but we just found an XML document for which this returns "text" -
apparently it has a BOM at the start, so the document node has two child
nodes: one text (containing the BOM) and one element (the root element).
Presumably there could be comments there too and processing
instructions, so this strategy is clearly flawed.
Does anybody have a good way to determine whether a document in Mark
Logic is an XML document, a text document or a binary document?
-Mike
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general