[RDA-L] Part 2: Efficiency of DBMS operations Re: [RDA-L] [BIBFRAME] RDA, DBMS and RDF

Simon Spero Mon, 14 May 2012 12:51:14 -0700

>
> On Mon, May 14, 2012 at 10:45 AM, Karen Coyle <li...@kcoyle.net> wrote:
>
>  What happened with the MARC format is that when we moved it into actual
> databases it turned out that certain things that people expected or wanted
> didn't really work well. For example, many librarians expected that you
> could *[a]* *replicate a card catalog display* with *[b]*  *records* 
> *displaying
> in order by the* *heading that was searched*. That is really hard to do (*
> [c]* *and not possible to do efficiently*) using* [d]* *DBMS*functionality, 
> which is based on
> *[e]* *retrieved sets* not *linear ordering*, and* [f] **especially using
> keyword searching*.  [emphasis and labels  added]
>


[These are somewhat strong claims, which may require some weakening before
they are entirely valid.  I will not unpack the term  "record" here.]

BLUF: Not all DBMS  are Relational;  it is possible to efficiently retrieve
records in order from many different types of DBMS, including Relational
databases.

[c] and [d] make the claim that it is impossible to retrieve records
efficiently in some desired order using DBMS functionality.  This is
justified by [e] which claims that the source of this necessary
inefficiency is that DBMS functionality is based on "retrieved sets" not
"linear ordering".

It is difficult to work out what the intended reading of [e] is. The use of
the term "sets" in "retrieved sets", if interpreted in a mathematical
sense, which would indeed make the concept of ordering nonsensical, as the
elements within sets are unordered.  However, since the claim is made in
support of claims about possible efficiency, and since this is an attribute
of possible systems, this reading cannot be the intended one.

All of the major types of DBMS implementations have some form of ordering,
 internally, and in the query language.   It is trivially true that in SQL
based databases, the order in which the results of queries are retrieved is
unspecified if you do not specify the order in which you want results to be
returned, but even if we restrict ourself to these kinds of databases, this
is not sufficient to support the strong claim.  However, even though what
the order in which the results might be unspecified, results are returned
in *some* order, one after another.  [e] thus cannot provide support for
[c] and [d].

The internal arrangement of database records on disk is generally in some
kind of  linear order- to a first approximation, the records are stored one
after the other in some  order. This internal  order may be as simple as
 the order in which the records were added to the database, or it may be an
order based on the content of one of the fields of the record.  If not
otherwise specified, the order in which records are returned is based on
this internal order*.

For example, the Relational DBMSs  Oracle and PostgresSQL both allow for
records to be clustered (ordered on disk) so as to make retrieval in that
order extremely efficient.
Object Oriented DBMSs are usually quite efficient at following links to
related records, and many will optimize based on patterns of
 retrieval order.
Some OODBMS-like systems in the NoSQL family are entirely memory resident,
making disk access irrelevant.
Hierarchical databases like IDMS can be very fast at following the chains
around which they are organized.

Since we can efficiently  retrieve records, in some specified order, using
some DBMS, we have a counter-example to [c].

The claim can be weakened to a claim of possible inefficiency, but that is
 not unexpected in an ILS context :-)

Claim [f] may or may not be considered to hold; one can replicate the
database record once for each keyword that occurs, which is roughly
 equivalent to KWIC, which of course, trades space efficiency for time.  If
records are not replicated, keyword access to an indexed field may require
a separate disk access for every record that matches the keyword (when the
match is not at the start of the string).

Often read requests will be ordered so that disk head movement is
minimised; where this happens this is faster than purely random access.

 If the entire database is memory resident, or if all relevant records are
in cache, the overhead of disk access is irrelevant. In this case [f] does
not hold.


I hope this isn't too confusing,

Simon

* In some situations involving multiple tables, some systems  may return
records in a different order if no specific order is requested.  This is
due to decisions that the DBMS makes on the fastest way of answering the
query.  Since not asking for results to be returned in a specific order
tells the system that you don't care about ordering, the system may choose
to use different algorithms when running your query.  This extra freedom to
optimize is  why the order of results is unspecified by default.

[RDA-L] Part 2: Efficiency of DBMS operations Re: [RDA-L] [BIBFRAME] RDA, DBMS and RDF

Reply via email to