On May 20, 2009, at 19:24, Bruce D'Arcus wrote:

Re: the recent microdata work and the subsequent effort to include
BibTeX in the spec, I summarized my argument against this on my blog:

<http://community.muohio.edu/blogs/darcusb/archives/2009/05/20/on-the-inclusion-of-bibtex-in-html5>

Quoting from the blog post:

On the last use case, he has chosen BibTeX, on the basis that it is widely used and simple to author and process.

Those are good criteria.
• BibTeX is designed for the sciences, which typically cite only secondary academic literature. It is thus neither adequate for, nor widely used in, many fields outside of the sciences, the humanities and law being quite obvious examples. For this reason, BibTeX cannot by default adequately represent even the use cases Ian has identified. For example, there are many citations on Wikipedia that can only be represented using effectively useless types such as “misc” and which require new properties to be invented.

This doesn't mean that BibTeX is a bad basis. The set of types and fields is limited, though.

Since renderings of a bibliography usually don't show the type of the reference, having to use 'misc' for almost everything isn't a practical problem, although it is aesthetically displeasing.

The set of fields is more of an issue, but it can be fixed by inventing more fields--it doesn't mean the whole base solution needs to be discarded. Fortunately, having custom fields in .bib doesn't break existing pre-Web, pre-ISBN bibliography styles. I've used at least these custom fields:

key: Show this citation pseudo-id in rendering instead of the actual id used for matching.
url: The absolute URL of a resource that is on the Web.
refdate: The date when the author made the reference to an ephemeral source such as a Web page.
isbn: The ISBN of a publication.
stdnumber: RFC or ISO number, e.g. "RFC 2397" or "ISO/IEC 10646:2003(E)".

Particularly the 'url' and 'isbn' field names should be obvious and uncontroversial additions.
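
To illustrate, here is a minimal Python sketch of a .bib entry that carries some of the custom fields listed above. The helper name format_bib_entry, the refdate date format and the field values are assumptions made for this example only; nothing here is prescribed by BibTeX itself.

```python
# Sketch only: format_bib_entry is a hypothetical helper, not part of
# BibTeX or of any library.
def format_bib_entry(entry_type, cite_id, fields):
    """Serialize a mapping of field names to values as one .bib entry."""
    lines = ["@%s{%s," % (entry_type, cite_id)]
    for name, value in fields.items():
        # Braces delimit the value; custom field names are written
        # exactly like the standard ones.
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Illustrative values; the refdate format is an assumption.
entry = format_bib_entry("misc", "rfc2397", {
    "key": "RFC2397",
    "author": "L. Masinter",
    "title": 'The "data" URL scheme',
    "year": "1998",
    "stdnumber": "RFC 2397",
    "url": "https://www.rfc-editor.org/rfc/rfc2397",
    "refdate": "2009-05-20",
})
print(entry)
```

A bibliography style that doesn't know about 'stdnumber' or 'refdate' simply never asks for them, which is why such fields don't break older styles.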

• Relatedly, BibTeX cannot represent much of the data in widely used bibliographic applications such as EndNote, RefWorks and Zotero except in very general ways.

Do you have an example? (I've never used the other formats.)

• The BibTeX extensibility model puts a rather large burden on inventing new properties to accommodate data not in the core model. For example, the core model has no way to represent a DOI identifier (this is no surprise, as BibTeX was created before DOIs existed). As a consequence, people have gradually added this to their BibTeX records and styles in a more ad hoc way. This ad hoc approach to extensibility has one of two consequences: either the vocabulary terms are understood as completely uncontrolled strings, or one needs to standardize them. If we assume the first case, we introduce potential interoperability problems.

In practice, those problems have already been introduced. For some reason I don't understand, there's an existing pattern of calling a field 'doi' but putting an absolute URI in the value. (As opposed to using a field name 'url' or a value that contains only the DOI-significant part.)
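
A consumer that wants to cope with both conventions could normalize the value before using it. The following is a rough Python sketch, not an established algorithm: the function name bare_doi and the list of prefixes it strips are assumptions, and the DOI in the usage lines is only an example value.

```python
import re

# Assumed, non-exhaustive set of prefixes seen in front of a DOI:
# the doi.org / dx.doi.org resolver URLs and the "doi:" label.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def bare_doi(value):
    """Return only the DOI-significant part of a 'doi' field value."""
    return _DOI_PREFIX.sub("", value.strip())


# Both spellings of the same identifier collapse to one form.
print(bare_doi("http://dx.doi.org/10.1000/182"))
print(bare_doi("10.1000/182"))
```

This only papers over the interoperability problem on the consuming side; it doesn't settle which form producers should write.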

If we assume the second, we have an organizational and process problem: that the WHATWG and/or the W3C (neither of which has expertise in this domain) become the gate-keepers for such extensions. In either case, we have a rather brittle and anachronistic approach to extension.

Problems of this nature haven't stopped the WHATWG in the past. :-)

• The BibTeX model conflicts with Dublin Core and with vCard, both of which are quite sensibly used elsewhere in the microdata spec to encode information related to the document proper. There seems little justification in having two different ways to represent a document depending on whether it is THIS document or THAT document.

When you are referring to THAT document, you generally want the names of the authors--not their full business cards. Therefore, vCard is overkill, and conversion to .bib is more useful than conversion to vCard for this use case.

My suggestion instead?
• reuse Dublin Core and vCard for the generic data: titles, creators/contributors, publisher, dates, part/version relations, etc., and only add those properties (volume, issue, pages, editors, etc.) that they omit

This would make conversion to and from the dominant bibliography format (.bib) more complex. Furthermore, there's a risk of a GIGO effect where the conversion can't be done algorithmically. (IIRC, you can't algorithmically map a .bib author name to the vCard name structure without a huge dictionary of names.)

• typing should NOT be handled by a bibtex-type property, but the same way everything else is typed in the microdata proposal: a global identifier

Why is typing even needed except for separating articles from compilations?

• make it possible for people to interweave other, richer, vocabularies such as bibo within such item descriptions. In other words, extension properties should be URIs.
• define the mapping to RDF of such an “item” description; can we say, for example, that it constitutes a dct:references link from the document to the described source?

How are these useful for conversions to and from the incumbent format (BibTeX)? (Only BibTeX is supported by all of Google Scholar, the ACM Portal, Stanford Spires, NASA ADS at Harvard and Citebase.org, the last three being databases that arXiv seems to delegate to.)

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/

