RE: Please allow any indicator in any field

2014-03-30 Thread Wagner, Alexander

Hello Rob!

  CERN may use a subset of Marc21 where most of the indicators are
  __.  Ok, that's the CERN library's decision.

I'm not sure that the /LIBRARY/ was asked as much as it should be. ;)
But this is another issue.

   But if the Invenio
  developers claim that Invenio supports Marc21, you *must* allow
  other indicators there, and consider it valid.

 Then don't say it supports MARC21.  Simple solution.

Frankly, using Marc21 as internal format is /THE/ selling point for
Invenio.

Doesn't sound sensible to drop it. But nobody wants to
anyway, I think.

 The primary goal of invenio should be to meet the needs of its
 original institution(s).

Then you get a solution for a small island only.

Something like this has no future. One should have learned that
from all those large IT projects of the past, especially all those
large ambitious projects that just failed.

The second selling point of invenio is that it doesn't try to be a
solution for an island.

 If marc indicators are not necessary in the database functions of
 the originating institution, then feel free to ignore them.  Avoid
 getting dragged back 40 years to the days of library catalogs by any
 mandates to follow every rule to the letter.  Those rules may have
 made sense in 1970 but they don't always now.

Still, you'll have the largest set of data available in those formats.

You will agree that it can't be the goal to recatalogue all libraries
in the world. And even if it were, I would strongly resist recataloguing
them to whatever standard is currently favoured from the IT point of
view, because those standards usually change every other month. ;)

So, what I want to say is that with the usage of Marc you have access
to a host of data and a huge amount of work accomplished over
decades by probably millions of diligent people.

And you'll agree that data exchange is one of the strong points in
current IT. One wouldn't like to drop that just to do everything again
and again at every other place. This is even worse than reusing
typewriters to produce paper cards. I'm sure you don't want to propose
that.

 And MARC development has been under the control of library
 association committees, made up of librarians, who make decisions
 based on cataloging rules for description of items

Yepp. This is exactly what we do: describe real world items.

Unfortunately, it is not enough to describe things the way they would
look if everything were nicely standardized. We have to describe what we
get and have. Real life is not always simple and rarely ideal.

 (as typed on paper cards)

Well, only the smallest libraries still use paper cards today. Still, it
could be more efficient to use them than to introduce a computer.
(Really...  OSI layer 8 might interfere destructively. Often, layer 8
is /not/ the librarian, but this strange guy called the user.)

 and not contemporary technology.

If we built a library on the contemporary technology of some point t, we
would have to rebuild all libraries every other month or so. Note
that you're talking about 10^6 entries for every average-sized
university library.  Please note also that, though we librarians might be
simple-minded people, quite a bunch of those rules we
invented result from the simple fact that literature is not as
unified as you might wish.

Just two very common examples:

* We have been printing books for some 500 years. To this day, not even
the page formats are unified, even though it sounds much more efficient
to produce paper in only, say, 5 different sizes, right?

BTW: this also causes quite practical problems if you have to shelve
those items...
BTW/2: this unification happened for the paper of your office printer.
So it is possible in principle. Right?

* There are still people who believe an ISBN to be unique.

This is far from true. Almost every librarian knows that, yet I know a
bunch of funny IT solutions that take the ISBN as primary key.
Oh, it works until you open up a real library. For the first 10,000 entries
it might be good enough.

 It is important to remember that marc, was originally a U.S. Library of
 Congress file format designed for large main-frame machines in a day of
 top down programming, and magnetic tape reel storage.

So the point is that it was efficient on a machine with the computing
power of my pocket calculator. Surely this should be able to work on a
current workstation ;)

No, I do not want to emulate a mainframe and tape storage.

 Access was entirely sequential, which explains some of the record
 architecture.  The format was intended to be used to generate paper
 file cards.

Agree. I also agree on its limitations. I also agree that you might
have more efficient means of storage and data handling nowadays.

Still, it should be possible to handle ingestion and dissemination of
what we have seamlessly. Especially with such an advanced
technology.  Come on, if my pocket calculator could handle it... ;)

 Modern computing should have freedom to use marc in any way that
 makes it as suitable 

RE: Please allow any indicator in any field

2014-03-30 Thread Wagner, Alexander
Hello Ferran!

[...]
  I know Marc21 reasonably well, and I can't recall any case
  where different indicators mean something so different that
  the field has to be treated differently.

 Here I would be more careful. Basically, I would treat Marc fields and
 indicators not as 3 digits plus two other funny chars but consider the
 whole bunch as a 5 character wide field designation. I think, here I'm
 in fact a bit more in line with Esteban's approach. At least if I
 understand it correctly. (Though I agree with you that one might not
 come up with a complete bibfield list, but just with a set of most
 common usages.)

But «most common usages» won't cover them all, and so, you cannot load
arbitrary records coming from unknown sources and expect Invenio to do
the expected thing with them.

I'm not sure, but I think it's basically a misunderstanding and we
generally agree. As I said, for indexing/display I perfectly agree with you.
In the definition of the fields as such, i.e. when telling Invenio what,
e.g., an author should look like in the input, one could, and probably
should, be more explicit.

 Be conservative in what you do, be liberal in what you accept from
 others.

Perfectly agree.

I'd add the famous Einstein here: as simple as possible, but not
simpler. ;)

 I share some concerns about this with Ferran and Martin and some
 others, and I'm very sure it's quite a task...

I don't think it is so difficult if the code just accepts 245%% for
title, 100%% for first author, etc.  With a 10% effort we could cover
more than 95% of the cases.
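
For illustration, a minimal Python sketch of the wildcard matching
proposed above. It assumes the notation used in this thread: % matches
any indicator value and _ stands for a blank indicator. The function
name and signature are hypothetical and are not Invenio's actual API.

```python
def matches(spec, tag, ind1, ind2):
    """Return True if a field matches a 5-character spec such as '245%%'.

    Assumed notation: '%' matches any indicator, '_' means blank.
    """
    if len(spec) != 5:
        raise ValueError("spec must be a 3-character tag plus two indicators")
    if spec[:3] != tag:
        return False
    for wanted, actual in zip(spec[3:], (ind1, ind2)):
        # Normalise blank indicators to '_' before comparing.
        actual = "_" if actual in ("", " ") else actual
        if wanted != "%" and wanted != actual:
            return False
    return True


# A title field with indicators "1" and "4" is found by 245%% but not 245__.
print(matches("245%%", "245", "1", "4"))
print(matches("245__", "245", "1", "4"))
```

With such a matcher the default configuration would only need to list
245%% and 100%% instead of enumerating indicator combinations.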

Alexander, would you accept replacing the current Invenio default
behaviour with the default I'm proposing?  Knowing that it would not be
perfect, do you think that it would be better?

I think in general, yes. As said above, I feel we perfectly agree about
how Invenio's default indexing and even display should be set up, and
there %% is in all cases I see better than __.

If one defines a field, let's stick to author, I would however suggest
that the definition says:

- Author should be 1001_ and stored as lastname, firstname
- alternatively 1002_ and sorted firstname lastname (note: deprecated)
- as fallback 100%% is treated as author in the index in case we have
foreign data (note: very deprecated)
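
The three-level definition above could be sketched as an ordered rule
list, most specific first. This is only an illustration: the rule table,
the status notes, and the function name are assumptions and not part of
Invenio.

```python
# Ordered from most to least specific; the first matching rule wins.
# The specs follow the list above; all names here are hypothetical.
AUTHOR_RULES = [
    ("1001_", "preferred"),
    ("1002_", "deprecated"),
    ("100%%", "fallback, very deprecated"),
]


def classify_author(tag, ind1, ind2):
    """Return (spec, note) of the first rule matching the field, else None."""
    # Write blank indicators as '_' so the field reads like a 5-char spec.
    code = tag + "".join("_" if i in ("", " ") else i for i in (ind1, ind2))
    for spec, note in AUTHOR_RULES:
        # '%' matches any character; everything else must match exactly.
        if len(code) == len(spec) and all(
            s in ("%", c) for s, c in zip(spec, code)
        ):
            return spec, note
    return None


print(classify_author("100", "1", " "))  # matched by the preferred rule
print(classify_author("100", "0", " "))  # only the 100%% fallback matches
```

Keeping the fallback last means foreign data is still indexed as author,
while the note makes it visible that the record does not follow the
local convention.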

Your 245 example is quite telling, and people without much
library background might miss the point here a bit, simply because the
title is as such quite a simple field, in the sense that it is only a
string from the IT point of view.

The point missed here, and it is really an important point, is that if
you get foreign data you /never/ get 245__ in your Marc, you'll
/always/ get indicators. So if I pull, e.g., 10,000 records from our
latest ebook package into stock Invenio, they will have no titles. I
consider this indeed a bug, not an inconvenience.

Even our modernists who consider Marc ancient should agree that data
exchange is quite important. And no, I do not know /any/ format that
can transport the richness of Marc in a standardized, accepted manner,
nor do I know any format that is used for such a host of data as
Marc21. (If you consider it ancient, please note that currently all German
library catalogues are being migrated to use it instead of our own
invention MAB.) Hey, journal literature from the sciences is really quite
trivial. Book literature from the humanities is quite a different
story. If you don't believe it, have some fun with
http://gso.gbv.de/DB=2.1/PPNSET?PPN=741186039 and its friends.

Additionally, those indicators contain a strong meaning. I agree that
the first indicator might be considered superfluous in databases (it is
probably /not/ if you consider that you have to create a bibliography,
and this is a /very/ common request).

Disregarding the second indicator is another story. If you have a
large bibliography it isn't too sensible to use a blind string
sort. You'll get so many entries under T that you won't find your
stuff anymore. Yes, I know, these are offline bibliographies. Yes, I
hate to print a database. I feel it is completely senseless. Yes, I
have been working for several years now on converting all our peers to
accept a URL instead of 400 pages of paper.

BUT, IRL they often don't like trees and insist on silly printed
lists. Yes, ignoring "The" in sorting would be simple, but just to
add German: we have 3 definite articles and two indefinite articles, of
course they have different lengths, and if you add other languages,
well.
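
In MARC 21 the second indicator of field 245 gives the number of
nonfiling characters, i.e. how many leading characters to skip when
sorting. A minimal sketch of a sort key that uses it instead of a
per-language article list follows. The sample records and the function
name are made up for illustration.

```python
def filing_key(title, ind2):
    """Sort key that skips the nonfiling characters given by indicator 2."""
    # A blank or non-numeric indicator is treated as "skip nothing".
    skip = int(ind2) if ind2.isdigit() else 0
    return title[skip:].casefold()


# (title, second indicator of 245); illustrative sample data only.
records = [
    ("The Art of Computer Programming", "4"),
    ("Der Zauberberg", "4"),
    ("Eine kurze Geschichte der Zeit", "5"),
    ("Moby-Dick", "0"),
]

for title, ind2 in sorted(records, key=lambda r: filing_key(*r)):
    print(title)
```

The cataloguer has already recorded how much to skip for each title, so
the code needs no knowledge of articles in German or any other language.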
--

Kind regards,

Alexander Wagner
Subject Specialist
Central Library
52425 Juelich

mail : a.wag...@fz-juelich.de
phone: +49 2461 61-1586
Fax  : +49 2461 61-6103
www.fz-juelich.de/zb/DE/zb-fi





Re: RFC Trac to GitHub

2014-03-30 Thread Tibor Simko
On Fri, 28 Mar 2014, Tibor Simko wrote:
 Good idea for those teams who are undecided; but please set deadline
 to today, not to 7 days from now, because we thought of completing the
 first selection round today.

 (And please note that we are doing only first-round voting today, not
 the final selection; so there will be time to reiterate.  Unless some
 top candidate would be agreed upon in the first round already, as in
 any real presidential election.)

 I note that some people from our corridor (Jiri, Lars, and friends)
 also voted.  This is not necessary, because the votes from our CERN-IT
 teams were already expressed.  (I see that the CDS team has voted as
 well.)  If tricider is to express the opinion of the INSPIRE team,
 then these votes are only clouding your picture, as it were.

Some people have been wondering why I proposed to express our opinions
by teams or units in the first selection round of our next ticketing
and merging tool.  So I thought I would briefly explain my reasoning here.

Firstly, the Invenio developer community consists of about 40 people
committing code per year.  We are naturally and organically organised in
teams that work on different topics, say the multimedia team or the
author disambiguation team.  It is natural that people within a given
team talk to each other more frequently than with outside teams, and
that each team probably has their opinion on what tools (merging,
ticketing, etc) would advance their work the best.  I'd call this
intra-team communication.

Secondly, the teams are not living in a vacuum.  We have very fruitful
collaboration between various teams and units, and Invenio would never
look the way it does now without our cooperation.  Examples are
numerous, say the task scheduler improvements brought by INSPIRE
operations, or Solr indexing and ranking improvements brought by CDS
operations.  I'd call this inter-team communication.

In order to coordinate our intra-team and inter-team ideas about our
project, technology, services, and whatnot, we have several channels in
place.  One of them is the Invenio Developer Forum on Mondays, where we
muse about the common greater good.  From time to time we pause and think
of sharpening our saws, as happened a few years ago for the selection
of the Flask framework, later on for the selection of the Twitter Bootstrap
UI framework, then last summer for the selection of a JS MVC framework
(which is still kind of on hold).  Now we are trying to select our
ticketing and merging framework.  Usually we do this by inter-team
discussions and intra-team discussions to see if a consensus emerges.
It would be hard to do otherwise.

Thirdly, the INSPIRE team proposed an (internal) doodling process to
express everyone's opinions in a numerically quantifiable way.  I'm not
sure this is the best way to go about it.  Consider a team of developers
A, B, C, D, and E.  Consider that A does on average 2 pull requests per
month, while B does 10.  What do we do?  Shall we express their opinions
numerically by head-count?  (This is what doodling does.)  Isn't it more
representative to weight B's opinion by a
factor of 5x, since B uses the tool five times more on average than A?
Now consider developer C who also commits 2 times per month, but instead
of committing 500 LOCs on average as A and B do, she commits 2500 LOCs.
Again, shall we apply a correction factor?  Will it be 5x?  How do we
weight the number of pull requests vs the number of lines of code?  Now
consider developer D who reviews code on average 5x more than A-C.  How
do we numerically factor in his own use case?  And what about developer E
who also reviews code a lot, but only twice per month?  Will his opinion
count less?  E works on the most difficult tasks and hence has possibly
5x more review annotations on average than the rest of the team.  Shall
we increase the weight behind his opinion?  And how do we balance coding
amount vs pull request issuing amount vs code reviewing amount against
each other, when choosing a common tool?

This was just to illustrate various difficulties that come along when
expressing votes numerically.  Ditto for people doing triaging more
than others, ditto for people doing documentation more than others, etc.
We could aim for a matrix, but would this be the right approach?
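
To make the A/B/C example concrete, here is a small Python sketch that
tallies the same votes once by head-count and once weighted by pull
requests per month. The pull-request figures come from the example above;
the tool names and who voted for what are invented purely for
illustration.

```python
from collections import Counter


def tally(votes, weights=None):
    """Sum votes per choice; each vote counts 1 unless weights are given."""
    totals = Counter()
    for developer, choice in votes.items():
        totals[choice] += 1 if weights is None else weights[developer]
    return totals


# Hypothetical votes; only the pull-request counts are from the example.
votes = {"A": "tool X", "B": "tool Y", "C": "tool X"}
pull_requests_per_month = {"A": 2, "B": 10, "C": 2}

by_head_count = tally(votes)
by_pull_requests = tally(votes, pull_requests_per_month)

print(by_head_count.most_common(1))
print(by_pull_requests.most_common(1))
```

With these made-up votes the two tallies pick different tools, which is
the difficulty described above: the outcome depends on a weighting that
is itself a matter of opinion.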

So it was this kind of consideration that brought me to propose that we
register our first selection round results by the sub-teams or sub-units
we are organically composed of, simply because this addresses my first
point (=intra-team communication) and also covers to some extent my
third point (because different developer styles, from A to E, are
usually found within a given team).  But I also reckon that such a
results-by-sub-units representation has a major drawback: it does not
cover very nicely my second point (=inter-team communication).  However,
since this communication has been happening organically since autumn in
the corridors, and formally during our Monday fora since