Re: [CODE4LIB] Formats and its identifiers
Rob is correct on all points. Namespace URIs can, in some cases, be
overloaded to function as schema identifiers. But they absolutely
can't be used blindly in this way for arbitrary formats -- there are
all kinds of potential gotchas. That being so, I think it is wiser
and more explicit _always_ to define a separate identifier for a
format.

 _/|_    ___
/o ) \/  Mike Taylor  http://www.miketaylor.org.uk
)_v__/\  "... currently trading under the name Gently for reasons which
         it would be otiose, for the moment, to rehearse" -- Douglas
         Adams, "Dirk Gently"

Rob Sanderson writes:
> On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:
> > >> A format should be described with a schema (XML Schema, OWL etc.) or at
> > >> least a standard. Mostly this schema already has a namespace or similar
> > >> identifier that can be used for the whole format.
> > >
> > > This is unfortunately not the case.
> >
> > It is mostly the case - but people like to misinterpret schemas and
> > tailor them to their needs.
>
> You're advocating an approach that "mostly" works, as opposed to one
> that works in all cases?
>
> > >> For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML
> > >> Namespace http://www.loc.gov/mods/v3 so this is the best identifier to
> > >> identify MODS.
> > >
> > > And this is a perfect example of why this is not the case.
> > > The same mods schema (let alone namespace) defines TWO formats, mods and
> > > modsCollection.
> >
> > That's your interpretation. According to the schema, the MODS format
> > *is* either a single mods-element or a modsCollection-element.
>
> According to the __schema__ yes. Not according to the namespace. The
> namespace is a collection of names only and says precisely nothing about
> structure.
>
> And, yes, given no definition of "format", my interpretation is that the
> mods schema defines two formats, as it defines two top level elements
> with different contents (eg one may contain the other). This is
> typically how people would define format in this context, I would say.
>
> This is, of course, tangential to the fact that you cannot use the __XML
> Namespace__ as an identifier for the format, no matter how you define
> it.
>
> > That's exactly what you can refer to with the namespace identifier
> > http://www.loc.gov/mods/v3.
>
> No, that's a collection of elements, not a schema.
>
> > If you need to identify the specific element 'mods' of the format only,
> > then you need another identifier.
>
> Correct. I'm glad you agree with me.
>
> Given that namespaces do not specify anything to do with structure, you
> thus need a new identifier for EVERY element in a namespace as they
> could be used as the top level tag of ANY schema.
>
> There isn't a widely accepted identifier system for schemas, only schema
> locations. There are also many methods for defining schemas
> (schematron, relax-ng, DTDs, xml schema) which can all define exactly
> the same "format".
>
> > But if the MODS specification defines that you can refer to any element
> > with an URI fragment identifier, then the right identifier would be
> > http://www.loc.gov/mods/v3#mods
>
> That would be an identifier for the *element*.
>
> > The namespace http://www.loc.gov/mods/v3 of the top level element 'mods'
> > does not identify the top level element but the MODS *format* (in any of
> > the versions 3.0-3.4) itself. This format *includes* the top level
> > element 'mods'.
>
> No, it identifies a collection of names. These names are structured
> according to a schema, which is what we need an identifier for. Beyond
> that, we may also need identifiers for which structure we mean within
> the schema (eg mods vs modsCollection).
>
> Rob
Re: [CODE4LIB] Formats and its identifiers
Ross Singer wrote:
> Agreed. The same is true, of course, of MARC and, by extension,
> MARCXML. Part of the "format" is that it can be one record or
> multiple. I don't think this is a particularly strong argument
> against using the namespace as an identifier.

Actually, the MARC format (not MARCXML) is very much a single-record
format. There is a standard for "tape headers" but no wrapper for MARC
(Z39.2) records, since the MARC format doesn't have a way to do that.

Having worked for way too long with MARC, I had a lot of trouble with
the "collection" concept in MARCXML and MODS, and am still not sure I
see the utility of it beyond what a file of records provides. I'm
assuming its main purpose is to provide valid XML when you have a file
with more than one bibliographic record. However, it seems that the
collection and the records within the collection are part and parcel
of the same schema, making the things we think of as "records"
subordinate to the collection, even if it is a collection of one.

kc

--
---
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
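
To make the record-versus-collection point concrete, here is a minimal Python sketch (not from the thread) of a reader that accepts either a lone MARCXML `record` or a `collection` wrapping several. The sample documents and the function names `records` and `control_numbers` are made up for illustration; the namespace and element names are those of MARCXML.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Illustrative sample documents (invented for this sketch).
ONE = (
    f'<record xmlns="{MARC_NS}">'
    '<controlfield tag="001">rec-1</controlfield></record>'
)
MANY = (
    f'<collection xmlns="{MARC_NS}">'
    '<record><controlfield tag="001">rec-1</controlfield></record>'
    '<record><controlfield tag="001">rec-2</controlfield></record>'
    '</collection>'
)


def records(xml_text):
    """Return the record elements, whether or not a collection wraps them."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{MARC_NS}}}record":
        return [root]
    if root.tag == f"{{{MARC_NS}}}collection":
        return root.findall(f"{{{MARC_NS}}}record")
    raise ValueError(f"unexpected root element: {root.tag}")


def control_numbers(xml_text):
    """Collect the 001 control number of every record in the document."""
    numbers = []
    for record in records(xml_text):
        field = record.find(f"{{{MARC_NS}}}controlfield[@tag='001']")
        numbers.append(field.text if field is not None else None)
    return numbers
```

A consumer written this way treats a single record and a collection of one identically, which is roughly the "file of records" view; a consumer that accepts only one of the two root elements needs to be told which one it is getting.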
Re: [CODE4LIB] Formats and its identifiers
On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:
> >> A format should be described with a schema (XML Schema, OWL etc.) or at
> >> least a standard. Mostly this schema already has a namespace or similar
> >> identifier that can be used for the whole format.
> >
> > This is unfortunately not the case.
>
> It is mostly the case - but people like to misinterpret schemas and
> tailor them to their needs.

You're advocating an approach that "mostly" works, as opposed to one
that works in all cases?

> >> For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML
> >> Namespace http://www.loc.gov/mods/v3 so this is the best identifier to
> >> identify MODS.
> >
> > And this is a perfect example of why this is not the case.
> > The same mods schema (let alone namespace) defines TWO formats, mods and
> > modsCollection.
>
> That's your interpretation. According to the schema, the MODS format
> *is* either a single mods-element or a modsCollection-element.

According to the __schema__ yes. Not according to the namespace. The
namespace is a collection of names only and says precisely nothing about
structure.

And, yes, given no definition of "format", my interpretation is that the
mods schema defines two formats, as it defines two top level elements
with different contents (eg one may contain the other). This is
typically how people would define format in this context, I would say.

This is, of course, tangential to the fact that you cannot use the __XML
Namespace__ as an identifier for the format, no matter how you define
it.

> That's exactly what you can refer to with the namespace identifier
> http://www.loc.gov/mods/v3.

No, that's a collection of elements, not a schema.

> If you need to identify the specific element 'mods' of the format only,
> then you need another identifier.

Correct. I'm glad you agree with me.

Given that namespaces do not specify anything to do with structure, you
thus need a new identifier for EVERY element in a namespace as they
could be used as the top level tag of ANY schema.

There isn't a widely accepted identifier system for schemas, only schema
locations. There are also many methods for defining schemas
(schematron, relax-ng, DTDs, xml schema) which can all define exactly
the same "format".

> But if the MODS specification defines that you can refer to any element
> with an URI fragment identifier, then the right identifier would be
> http://www.loc.gov/mods/v3#mods

That would be an identifier for the *element*.

> The namespace http://www.loc.gov/mods/v3 of the top level element 'mods'
> does not identify the top level element but the MODS *format* (in any of
> the versions 3.0-3.4) itself. This format *includes* the top level
> element 'mods'.

No, it identifies a collection of names. These names are structured
according to a schema, which is what we need an identifier for. Beyond
that, we may also need identifiers for which structure we mean within
the schema (eg mods vs modsCollection).

Rob
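
The namespace-versus-structure argument can be illustrated with a small Python sketch (not part of the original message). Two documents share the MODS namespace but have different root elements, so the namespace URI alone cannot tell a consumer which structure it has received. The sample documents and the helper `root_name` are made up for illustration.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Illustrative sample documents (invented for this sketch).
SINGLE = (
    f'<mods xmlns="{MODS_NS}">'
    "<titleInfo><title>A</title></titleInfo></mods>"
)
COLLECTION = (
    f'<modsCollection xmlns="{MODS_NS}">'
    "<mods><titleInfo><title>A</title></titleInfo></mods>"
    "<mods><titleInfo><title>B</title></titleInfo></mods>"
    "</modsCollection>"
)


def root_name(xml_text):
    """Return (namespace, local name) of the document's root element."""
    tag = ET.fromstring(xml_text).tag  # ElementTree writes '{namespace}local'
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


single_root = root_name(SINGLE)
collection_root = root_name(COLLECTION)

# Same namespace, different top-level structure.
same_namespace = single_root[0] == collection_root[0]
same_structure = single_root[1] == collection_root[1]
```

Here `same_namespace` is true while `same_structure` is false: anything keyed only on the namespace URI treats both documents as the same thing.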
Re: [CODE4LIB] Formats and its identifiers
On Mon, May 11, 2009 at 9:53 AM, Jakob Voss wrote:
> That's your interpretation. According to the schema, the MODS format *is*
> either a single mods-element or a modsCollection-element. That's exactly
> what you can refer to with the namespace identifier
> http://www.loc.gov/mods/v3.

Agreed. The same is true, of course, of MARC and, by extension,
MARCXML. Part of the "format" is that it can be one record or
multiple. I don't think this is a particularly strong argument against
using the namespace as an identifier.

> The namespace http://www.loc.gov/mods/v3 of the top level element 'mods'
> does not identify the top level element but the MODS *format* (in any of the
> versions 3.0-3.4) itself. This format *includes* the top level element
> 'mods'.

I'm not really sure of the changes between MODS v.3.0-3.3 -- are they
basically backwards and forwards compatible? I imagine there are a
lot of cases where the client doesn't care what point release of MODS
the thing is serialized as, just that it's MODS and that it can find
generally what it's looking for in that structure, right?

-Ross.
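
A client of the kind described above might look roughly like the following Python sketch (not from the thread). It pulls titles out of anything in the MODS namespace and never checks the point release. The sample documents and the function name `mods_titles` are invented for illustration, and the sketch assumes that the elements the client reads mean the same thing in every 3.x release, which is exactly the open question.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Illustrative sample documents (invented); they differ only in the
# version attribute on the root element.
DOC_30 = (
    f'<mods xmlns="{MODS_NS}" version="3.0">'
    "<titleInfo><title>Example title</title></titleInfo></mods>"
)
DOC_33 = (
    f'<mods xmlns="{MODS_NS}" version="3.3">'
    "<titleInfo><title>Example title</title></titleInfo></mods>"
)


def mods_titles(xml_text):
    """Return every title in the document, ignoring root element and version."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{MODS_NS}}}title")]
```

Because it searches by qualified element name rather than by position, the same function also works on a `modsCollection`, so for this kind of client the namespace is arguably all the identification it needs.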
[CODE4LIB] Formats and its identifiers
Hi Rob,

You wrote:

>> A format should be described with a schema (XML Schema, OWL etc.) or at
>> least a standard. Mostly this schema already has a namespace or similar
>> identifier that can be used for the whole format.
>
> This is unfortunately not the case.

It is mostly the case - but people like to misinterpret schemas and
tailor them to their needs.

>> For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML
>> Namespace http://www.loc.gov/mods/v3 so this is the best identifier to
>> identify MODS.
>
> And this is a perfect example of why this is not the case.
> The same mods schema (let alone namespace) defines TWO formats, mods and
> modsCollection.

That's your interpretation. According to the schema, the MODS format
*is* either a single mods-element or a modsCollection-element. That's
exactly what you can refer to with the namespace identifier
http://www.loc.gov/mods/v3.

If you need to identify the specific element 'mods' of the format only,
then you need another identifier. Up to now there is no default way to
create an identifier for a specific element in an XML format, see

http://www.w3.org/TR/webarch/#xml-fragids

But if the MODS specification defines that you can refer to any element
with an URI fragment identifier, then the right identifier would be

http://www.loc.gov/mods/v3#mods

You wrote:

> I totally agree that it's an awful design choice. However it's a
> demonstration that XML namespaces _do not identify format_. And
> hence, we need another identifier which is not the namespace of
> the top level element.

The namespace http://www.loc.gov/mods/v3 of the top level element 'mods'
does not identify the top level element but the MODS *format* (in any of
the versions 3.0-3.4) itself. This format *includes* the top level
element 'mods'.

> Also consider the following more hypothetical, but perfectly feasible
> situations:
>
> * One namespace is used to define two _totally_ separate sets of
>   elements. There's no reason why this can't be done.

Ok, let A and B be two formats with two totally separate sets of
elements (and rules how to use them). If you put them into one
namespace, then you get a new format C that is the union of A and B.

> * One namespace defines so many elements that it's meaningless to call
>   it a format at all. Even though the top level tag might be the same,
>   the contents are so varied that you're unable to realistically
>   process it.

Sad but true: The word "format" in the context of library applications
does not make sense anyway in most cases. Technically a format is just a
set of possible instances, defined as a formal language or with any
other type of specification. The problem of library formats is that many
people refer to them without providing a proper specification.

Coming back to the mods example: If the SRU Schema registry lists
"info:srw/schema/1/mods-v3.3" as the identifier for "MODS Schema Version
3.3" with a pointer to the XML Schema
"http://www.loc.gov/standards/mods/v3/mods-3-3.xsd" then *any* XML
document that validates against this schema must be considered to be a
MODS 3.3 document - either with 'mods' or with 'modsCollection' as root
element.

Greetings
Jakob

--
Jakob Voß, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de
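
The registry idea in the last paragraph can be sketched in Python as follows (this is not from the original message). A format identifier maps to a schema location plus the root elements that schema allows. The first entry uses the identifier and schema URL quoted above; the two `example:` identifiers are purely hypothetical and only show what narrower, per-root identifiers might look like. Note that `root_matches` checks the root element only; it is not schema validation, which would need a validating XML library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

MODS_NS = "http://www.loc.gov/mods/v3"
MODS_33_XSD = "http://www.loc.gov/standards/mods/v3/mods-3-3.xsd"


@dataclass(frozen=True)
class FormatSpec:
    schema_location: str
    namespace: str
    roots: frozenset  # local names accepted as the root element


REGISTRY = {
    # Identifier and schema location as quoted in the message above.
    "info:srw/schema/1/mods-v3.3": FormatSpec(
        MODS_33_XSD, MODS_NS, frozenset({"mods", "modsCollection"})
    ),
    # Hypothetical identifiers, invented for this sketch only.
    "example:mods-v3.3/record": FormatSpec(
        MODS_33_XSD, MODS_NS, frozenset({"mods"})
    ),
    "example:mods-v3.3/collection": FormatSpec(
        MODS_33_XSD, MODS_NS, frozenset({"modsCollection"})
    ),
}

# Illustrative sample documents (invented for this sketch).
SINGLE = f'<mods xmlns="{MODS_NS}"/>'
COLLECTION = f'<modsCollection xmlns="{MODS_NS}"><mods/></modsCollection>'


def root_matches(format_id, xml_text):
    """Check only whether the root element is one the format allows."""
    spec = REGISTRY[format_id]
    allowed = {f"{{{spec.namespace}}}{name}" for name in spec.roots}
    return ET.fromstring(xml_text).tag in allowed
```

Under the first identifier both sample documents are acceptable, which matches the reading that either root element is a MODS 3.3 document. The hypothetical narrower identifiers show the alternative: one schema, several identifiers, each naming a structure within it.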