Re: [MarkLogic Dev General] Schema validation incorrect for no-namespace document

David Lee Wed, 10 Dec 2014 04:04:36 -0800

To answer your first question:
"Schema Agnostic" ...
That is a general term (because there is no 'standard' term for this concept).
What it means more specifically is that MarkLogic supports and uses schemas
but doesn't require them.  It's in line with the W3C XQuery and XML specs
(except for possibly a bug you found and a few places where we have
extensions) ... but the W3C specs don't have a definition more precise than
"Implementation Dependent" and "Schema Aware".  MarkLogic is both and more,
but it's intentionally not 'strict' about schemas unless you ask for it, for
example by doing a validate.

First, the reasoning behind this.
We want to support a multitude of use cases and make the product as easy and
useful as possible for as many people as possible.
So we don't require that you have any schema, a valid schema or a locatable
schema for most things.
Most of the time this is what is desired.  For example, say you load an XML
doc, or a million XML docs, that reference a schema but you forgot to put one
in the database, or maybe 1 document in a million isn't quite 100% right.
Most people don't want the load to abort (say 2/3 of the way in, when you hit
one bad document) ...
That's *usually* not what people want.  So ML ignores schemas, for purposes
of validation, for document inserts.  That is the behavior you are seeing in
the update; it's intentional.
If you want to validate you can, using validate {} (we need to look into the
dereferencing issue ... I suspect there is something else going on).
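
As a rough sketch of opting in to validation at insert time (the URI and
the sample document below are placeholders, not taken from this thread),
one could wrap the validate expression in MarkLogic's try/catch so a bad
document is reported instead of aborting the whole load:

```xquery
xquery version "1.0-ml";

declare namespace error = "http://marklogic.com/xdmp/error";

(: Hypothetical URI and document, for illustration only. :)
let $uri := "/example/person.xml"
let $doc := <person><children>3</children></person>
return
  try {
    (: validate raises an error if $doc does not conform to a schema
       that is in scope, so the insert is never reached for a bad doc :)
    xdmp:document-insert($uri, validate strict { $doc })
  } catch ($e) {
    (: report the problem rather than failing the request :)
    fn:concat("Skipped ", $uri, ": ", $e/error:message/fn:string())
  }
```

Whether you skip, log, or rethrow in the catch branch is the per-application
choice described above.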

Another example is *finding* schemas.  It can be quite hard sometimes to make
sure your schemas are found correctly and that the right ones are found,
especially if you don't use namespaces.
If you have 100 schemas with no namespaces and 10 of them define a <name>
element ...
and you put a million documents in with <name> elements ... most people would
rather that the system keep functioning than stop you from doing anything
because you have accidentally inserted ambiguous schemas.

There are also issues of typing.  W3C XML Schema is a strange thing, as it
attempts to fill many roles but often people only need a few.  A good example
is simple atomic types.  It's very useful to have a basic schema that defines
that <children> is an xs:integer and <birthdate> is an xs:date, so when you
write XQuery, make indexes, transform to JSON etc. you don't have to put casts
and type conversions everywhere.  You can write
        $doc/person/children gt 10   vs  xs:int($doc/person/children) gt 10
and it allows you to write type-safe XQuery like the following without too
much casting.

      declare function local:is-old-enough-to-drink(
                $person as element(person),
                $drinkingAge as xs:yearMonthDuration
      ) as xs:boolean
      {
                (: <birthdate> is typed as xs:date by the schema, so no cast.
                   Old enough once birthdate + drinking age is not in the
                   future. :)
                ( $person/birthdate + $drinkingAge ) le current-date()
      };
  ...
       if ( local:is-old-enough-to-drink( $person,
              $state-laws[state eq my-state()]/minimum-drinking-age ) ) then
              "Serve Beer"
       else
              "Call Cops"

This can be done with an incomplete schema that only defines the bare minimum
of types for a few elements.  The document is allowed to not
validate against the schema unless you call validate {}, but the schema still
gives you the benefit of just enough type information, and you only spend as
much effort as is worth it for your application and development.  You can
choose no schemas, some schemas, partial schemas or fully explicit schemas,
and MarkLogic will *attempt* to make use of them as best it can, part of that
being the usability issue of not failing miserably if it fails to find the
schema, and not validating unnecessarily.
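
A minimal sketch of such a partial schema, loaded with XQuery.  The schema
URI is made up for illustration, and this assumes the query is run against
the schemas database that your content database is configured to use:

```xquery
xquery version "1.0-ml";

(: Assumption: evaluated against the schemas database for your content
   database.  The URI below is a hypothetical placeholder. :)
xdmp:document-insert("/schemas/person-types.xsd",
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- Only the two elements whose types we care about are declared;
         nothing else about <person> is described. -->
    <xs:element name="children" type="xs:integer"/>
    <xs:element name="birthdate" type="xs:date"/>
  </xs:schema>)
```

Documents would not pass validate strict {} against this (there is no
declaration for <person>), which is fine as long as you never ask for that.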

That is the 'tradeoff' or 'balance' between being completely 'schema free'
(like many NoSQL databases) and completely 'Schema Aware' or 'Schema Required'
(like many XML or hybrid databases).
'Schema Agnostic' means that MarkLogic - as a product - doesn't require schemas
unless you ask for them.
But at the same time it makes as much use as possible of the schemas you do
supply without putting an unnecessary burden on you ...

The tradeoff is a minor thing for small projects (small amounts of code,
data, or both), where the effort involved in either totally fixing your data
and schemas or writing your code to do the job a schema would do is fairly
small ...
But when you start working on large projects (say millions of documents from
external sources of various quality, and lots of code) ...
having 'schema support' without it being a chokehold on you is extremely
valuable.
You will see this architecture in many places in MarkLogic where there is a
range of 'strictness' in the APIs and features,
because every degree of required constraint comes with a cost (guaranteeing
that your data and schemas are 100% perfect).
So we leave the choice up to you to decide in what cases that matters and when
it does not.

An example is range indexes: there's an option to ignore or reject documents
with data values that are not valid for the type of the index.
A strict type system (like a typed field in a relational DB) doesn't give you
that choice.
Sometimes you would prefer to load the 1 million documents and simply ignore
the 1 badly formed value - at least to the point of being able to query for it
...
and sometimes it's more important to abort the whole thing and not let ANY of
the data in if a single field is bad.
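
For illustration, a sketch of setting that option through the Admin API.
The database name "Documents" and the <children> element are placeholders,
and this assumes your server version supports the optional invalid-values
argument on admin:database-range-element-index:

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Assumptions: a database named "Documents" and a no-namespace <children>
   element; both are placeholders. :)
let $config := admin:get-configuration()
let $index := admin:database-range-element-index(
    "int",       (: scalar type of the index :)
    "",          (: element namespace (none) :)
    "children",  (: element local name :)
    "",          (: collation; only meaningful for string indexes :)
    fn:false(),  (: range value positions :)
    "ignore")    (: invalid values: "ignore" or "reject" :)
let $config := admin:database-add-range-element-index(
    $config, xdmp:database("Documents"), $index)
return admin:save-configuration($config)
```

With "ignore" the document still loads and the bad value is simply left out
of the index; with "reject" the insert fails.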

That's "Schema Agnostic"

As for the issue with dereferencing docs in validate, it would be useful
if you provided a small complete example of
1)      The full XML file
2)      The full schema
3)      The XQuery used

The snippet of code you show shouldn't be failing in the way you describe, so
either there is a bug or we're not seeing the big picture.
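
One possible way to put both cases side by side in such an example is
xdmp:validate, which returns a summary element of validation errors rather
than raising an error.  This is only a sketch: the URI is a placeholder for
your test document, and xdmp:validate may not follow exactly the same code
path as the validate expression:

```xquery
xquery version "1.0-ml";

(: Placeholder URI; substitute the document from your test case. :)
let $uri := "test-doc.xml"
return (
  (: the stored document node as-is :)
  xdmp:validate(fn:doc($uri), "strict"),
  (: the root element of a freshly constructed copy of the document :)
  xdmp:validate(document { fn:doc($uri) }/*, "strict")
)
```

If the two results differ for the same document and schema, that output
would be worth including alongside the three items above.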

-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
[email protected]
Phone: +1 812-482-5224
Cell:  +1 812-630-7622
www.marklogic.com

From: [email protected] On Behalf Of mohan mohan
Sent: Tuesday, December 09, 2014 11:13 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Schema validation incorrect for 
no-namespace document

Can somebody explain to me what "schema agnostic" means?

On Wed, Dec 10, 2014 at 4:12 AM, Will Thompson <[email protected]> wrote:
Sure, here is the schema:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning"
    vc:minVersion="1.0" vc:maxVersion="1.1">

    <xs:element name="dir">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="doc" type="type-doc" minOccurs="0"
                    maxOccurs="unbounded"/>
            </xs:sequence>
            <xs:attribute name="uri-source" type="xs:string" use="required"/>
            <xs:attribute name="uri-target" type="xs:string" use="required"/>
        </xs:complexType>
    </xs:element>

    <xs:complexType name="type-doc">
        <xs:choice minOccurs="0" maxOccurs="unbounded">
            <xs:element name="unknown" type="type-unknown"/>
            <xs:element name="deleted" type="type-deleted"/>
            <xs:element name="updated" type="type-updated"/>
        </xs:choice>
        <xs:attribute name="uri-source" type="xs:string" use="required"/>
        <xs:attribute name="uri-target" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:complexType name="type-updated">
        <xs:attribute name="id-source" type="xs:string" use="required"/>
        <xs:attribute name="id-target" type="xs:string" use="required" />
        <xs:attribute name="title-target" type="xs:string" use="required" />
        <xs:attribute name="ancestor-title-target" type="xs:string"
            use="required" />
        <xs:attribute name="location-source" type="xs:string" use="required" />
        <xs:attribute name="location-target" type="xs:string" use="required" />
        <xs:attribute name="status" type="type-status" use="required" />
    </xs:complexType>

    <xs:complexType name="type-unknown">
        <xs:attribute name="id-source" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:complexType name="type-deleted">
        <xs:attribute name="id-source" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:simpleType name="type-status">
        <xs:restriction base="xs:string">
            <xs:enumeration value="unknown"/>
            <xs:enumeration value="changed"/>
            <xs:enumeration value="unchanged"/>
        </xs:restriction>
    </xs:simpleType>

</xs:schema>

And here is a simple test:

xdmp:document-insert('test-doc.xml',
<dir uri-source="/books-search/comm/flh/2014/"
  uri-target="/books-search/comm/flh/2015/">
  <doc uri-source="/books-search/comm/flh/2014/FLH_ch02.xml"
    uri-target="/books-search/comm/flh/2015/FLH_ch02.xml">
    <updated status="changed"/>
    <updated id-source="/chapter/subchapter[1]/section[3]/section[3]/p"
      id-target="p0f30b428fdccc"
      title-target="§3.3 Rebutting community-property presumption."
      ancestor-title-target="§3. Establishing Character of Marital Property"
      location-source="2_a_3_3" location-target="2_a_3_3" status="unchanged"/>
  </doc>
</dir>)

Followed by

validate strict { doc('test-doc.xml') }

The first <updated> element is missing the required attributes that the second
one has, so validation should fail. It doesn't for me, but after dereferencing
the doc it does. Since namespacing the docs (there aren't many) and the schema
does work, that's my current workaround (and probably better practice anyway).

Let me know if you can't reproduce it. Thanks for following up!

-Will

> On Dec 9, 2014, at 3:25 PM, Mary Holstege <[email protected]> wrote:
>
> On Tue, 09 Dec 2014 12:42:29 -0800, Will Thompson <[email protected]> wrote:
>
>> I recently ran into some issues validating a no-namespace document. The 
>> schema was updated, which should have caused the document to fail 
>> validation, but it didn't. I have been using 
>> xdmp:expanded-tree-cache-clear() following schema updates, but neither that 
>> nor a server restart had any effect.
>>
>> After doing some more testing, I discovered that dereferencing it before 
>> validation works:
>>
>> validate strict { document { doc($uri) }/* }
>>
>> And if I namespace the document and schema, everything works as expected as 
>> well. Bug, or am I missing something? This is on 7.0-4.1.
>>
>> -Will
>
> Yes, this sounds like a bug, and probably is not related to schema change
> so much as schema processing in general, since you cleared the cache.
> (Correct me if I am wrong.)
>
> Namespaced and no-namespace schemas should work consistently.
> That said, no-namespace schemas are a little trickier because of the
> interaction with elementForm and attributeForm, so I wouldn't be too
> shocked if there is a code path somewhere that doesn't handle things
> properly.
>
> If you have a test case you'd be willing to share, I'd love to see it.
>
> //Mary
>

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
