Which very cleverly relies on ($doc/node() instance of text()) failing for documents with multiple root nodes (maybe a bit too clever as an example, but works...)

thanks!

Michael Blakeley wrote:
Yes, that's correct. Here's another take on the XQuery:

let $doc := doc($uri)
return
  if ($doc/binary()) then 'binary'
  else if ($doc/node() instance of text()) then 'text'
  else 'xml'

Note that this test will treat empty documents (i.e., the URI does not exist) as xml. You could test for that case via empty(), or just 'if ($doc)...'
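
A minimal sketch of how the empty() check could be folded into the query above. The 'missing' label and the example URI are assumptions added for illustration; they do not come from the thread.

```xquery
(: Hypothetical URI, for illustration only; substitute a real one. :)
let $uri := "/example/doc.xml"
let $doc := doc($uri)
return
  (: doc() returns the empty sequence when nothing is stored at $uri :)
  if (empty($doc)) then 'missing'
  else if ($doc/binary()) then 'binary'
  (: true only when the document node has exactly one child, a text node :)
  else if ($doc/node() instance of text()) then 'text'
  else 'xml'
```

With this ordering, a nonexistent URI is reported separately instead of falling through to 'xml'.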

-- Mike

Mike Sokolov wrote:
OK I am pursuing a solution along those general lines. Just out of curiosity though: does this mean that internally there is no distinction between xml documents and text documents and binary documents? It sounds as if text documents are simply documents that happen to have a single text node (and same for binary) - is that right?

-Mike

Danny Sokolsky wrote:
Mike,

I think your approach is the right idea, only it needs a little more
logic to be more robust. If you took the last() instead of the first in
your node-kind test, that might work most of the time (or more often):

node-kind(doc($uri)/node()[last()])

Here is a similar idea using the instance of operator, performing a
little logic to make a best-guess at the type:

define function doctype($x as node()) as element()
{
<node>
  <uri>{xdmp:node-uri($x)}</uri>
  <type>{
  if ( $x/node() instance of binary() )
  then ("binary node")
  else if ( $x/node() instance of element() )
       then ("XML node")
       else if ( $x/node() instance of text() )
            then  "text node"
            else "not sure"
}</type>
</node>
}

for $x in doc()[1 to 100]
return doctype($x)

I have not found any of my documents that return "not sure" here, but I
can imagine that you might be able to construct one.
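
One case that would likely land in "not sure" is an XML document whose document node has more than one child (for example a BOM text node, comment, or processing instruction alongside the root element), since `$x/node() instance of element()` only matches a single node. A hedged sketch of an alternative that tests for the existence of each kind of child instead; the order of the tests is an assumption about what a "best guess" should prefer:

```xquery
for $x in doc()[1 to 100]
return
  <node>
    <uri>{xdmp:node-uri($x)}</uri>
    <type>{
      (: existence tests: true if at least one child of that kind is present :)
      if ($x/binary()) then "binary node"
      (: checked before text() so a stray BOM text node does not hide the root element :)
      else if ($x/element()) then "XML node"
      else if ($x/text()) then "text node"
      else "not sure"
    }</type>
  </node>
```

Because the element test comes before the text test, a document with both a text child and a root element is reported as XML.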

-Danny

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Mike Sokolov
Sent: Monday, March 31, 2008 10:34 AM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] document format

I have been trying to come up with a way to determine the "format" of a document in MarkLogic. The only api call that seems directly related is xdmp:document-uri-format, but this seems to operate on the uri without any reference to the contents of a document. Instead, I tried testing:

node-kind(doc($uri)/node()[1])


but we just found an XML document for which this returns "text" - apparently it has a BOM at the start, so the document node has two child nodes: one text (containing the BOM) and one element (the root element).

Presumably there could also be comments and processing instructions there, so this strategy is clearly flawed.

Does anybody have a good way to determine whether a document in Mark Logic is an XML document, a text document or a binary document?

-Mike
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general