> > On Mon, May 14, 2012 at 10:45 AM, Karen Coyle <li...@kcoyle.net> wrote: > > What happened with the MARC format is that when we moved it into actual > databases it turned out that certain things that people expected or wanted > didn't really work well. For example, many librarians expected that you > could *[a]* *replicate a card catalog display* with *[b]* *records* > *displaying > in order by the* *heading that was searched*. That is really hard to do (* > [c]* *and not possible to do efficiently*) using* [d]* *DBMS*functionality, > which is based on > *[e]* *retrieved sets* not *linear ordering*, and* [f] **especially using > keyword searching*. [emphasis and labels added] >
[These are somewhat strong claims, which may require some weakening before they are entirely valid. I will not unpack the term "record" here.] BLUF: Not all DBMS are Relational; it is possible to efficiently retrieve records in order from many different types of DBMS, including Relational databases. [c] and [d] make the claim that it is impossible to retrieve records efficiently in some desired order using DBMS functionality. This is justified by [e] which claims that the source of this necessary inefficiency is that DBMS functionality is based on "retrieved sets" not "linear ordering". It is difficult to work out what the intended reading of [e] is. The use of the term "sets" in "retrieved sets", if interpreted in a mathematical sense, which would indeed make the concept of ordering nonsensical, as the elements within sets are unordered. However, since the claim is made in support of claims about possible efficiency, and since this is an attribute of possible systems, this reading cannot be the intended one. All of the major types of DBMS implementations have some form of ordering, internally, and in the query language. It is trivially true that in SQL based databases, the order in which the results of queries are retrieved is unspecified if you do not specify the order in which you want results to be returned, but even if we restrict ourself to these kinds of databases, this is not sufficient to support the strong claim. However, even though what the order in which the results might be unspecified, results are returned in *some* order, one after another. [e] thus cannot provide support for [c] and [d]. The internal arrangement of database records on disk is generally in some kind of linear order- to a first approximation, the records are stored one after the other in some order. This internal order may be as simple as the order in which the records were added to the database, or it may be an order based on the content of one of the fields of the record. If not otherwise specified, the order in which records are returned is based on this internal order*. For example, the Relational DBMSs Oracle and PostgresSQL both allow for records to be clustered (ordered on disk) so as to make retrieval in that order extremely efficient. Object Oriented DBMSs are usually quite efficient at following links to related records, and many will optimize based on patterns of retrieval order. Some OODBMS-like systems in the NoSQL family are entirely memory resident, making disk access irrelevant. Hierarchical databases like IDMS can be very fast at following the chains around which they are organized. Since we can efficiently retrieve records, in some specified order, using some DBMS, we have a counter-example to [c]. The claim can be weakened to a claim of possible inefficiency, but that is not unexpected in an ILS context :-) Claim [f] may or may not be considered to hold; one can replicate the database record once for each keyword that occurs, which is roughly equivalent to KWIC, which of course, trades space efficiency for time. If records are not replicated, keyword access to an indexed field may require a separate disk access for every record that matches the keyword (when the match is not at the start of the string). Often read requests will be ordered so that disk head movement is minimised; where this happens this is faster than purely random access. If the entire database is memory resident, or if all relevant records are in cache, the overhead of disk access is irrelevant. In this case [f] does not hold. I hope this isn't too confusing, Simon * In some situations involving multiple tables, some systems may return records in a different order if no specific order is requested. This is due to decisions that the DBMS makes on the fastest way of answering the query. Since not asking for results to be returned in a specific order tells the system that you don't care about ordering, the system may choose to use different algorithms when running your query. This extra freedom to optimize is why the order of results is unspecified by default.