Re: [CODE4LIB] MARC Magic for file

2012-05-24 Thread Ed Summers
On Wed, May 23, 2012 at 6:16 PM, Kyle Banerjee
baner...@orbiscascade.org wrote:
 I'm not sure whether to laugh or cry that it's a sign of progress that a 40
 year old utility designed to identify file types is now just beginning to
 be able to recognize a format that's been around for almost 50 years...

Laugh :-)

//Ed


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Ross Singer
Wow, this is pretty cool.

Kevin, do you have examples of the output?

Does it work for bulk files?

I mean, I could just try this on my Ubuntu machine, but it's all the way 
downstairs...

-Ross.

On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:

 I finally had occasion today (read: remembered) to see if the *nix file 
 command would recognize a MARC record file.  I haven't tested extensively, 
 but it did identify the file as MARC21 Bibliographic record.  It also 
 correctly identified a MARC21 Authority Record.  I'm running the most recent 
 version of Ubuntu (12.04 - precise pangolin).
 
 I write because the inclusion of a file MARC21 specification rule in the 
 magic.db stems from a Code4lib exchange that started in March 2011 [1] (it 
 ends in April if you want to go crawling for the entire thread).
 
 Rgds,
 
 Kevin
 
 [1] 
 https://listserv.nd.edu/cgi-bin/wa?A2=ind1103L=CODE4LIBT=0F=S=P=112728
 
 --
 Kevin Ford
 Network Development and MARC Standards Office
 Library of Congress
 Washington, DC


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Francis Kayiwa
On Wed, May 23, 2012 at 03:28:56PM -0400, Ross Singer wrote:
 Wow, this is pretty cool.
 
 Kevin, do you have examples of the output?
 
 Does it work for bulk files?
 
 I mean, I could just try this on my Ubuntu machine, but it's all the way 
 downstairs...

My OS lists it as `data`

$ cd
$ ls
dev  id_rsa.pub  laflin  marc  orthanc  ssh  updating
$ ftp http://drupal.org/files/issues/5_records_utf8.mrc_.txt
Trying 140.211.166.6...
Requesting http://drupal.org/files/issues/5_records_utf8.mrc_.txt
100% |**|  5965       00:00
5965 bytes received in 0.00 seconds (1.56 MB/s)
$ ls
5_records_utf8.mrc_.txt  id_rsa.pub   marc     ssh
dev                      laflin       orthanc  updating
$ mkdir test
$ mv 5_records_utf8.mrc_.txt test/
$ cd test/
$ mv 5_records_utf8.mrc_.txt 5_records_utf8.mrc
$ ls
5_records_utf8.mrc
$ file 5_records_utf8.mrc
5_records_utf8.mrc: data
$ ls
5_records_utf8.mrc
$ ls -al
total 32
drwxr-xr-x   2 kayiwa  kayiwa   512 May 23 14:34 .
drwxr-xr-x  10 kayiwa  kayiwa   512 May 23 14:34 ..
-rw-r--r--   1 kayiwa  kayiwa  5965 May 23 14:33 5_records_utf8.mrc
$ uname -a
OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386

./fxk

 

-- 
If builders built buildings the way programmers wrote programs,
then the first woodpecker to come along would destroy civilization.


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Ford, Kevin
 Does it work for bulk files?
-- It passed on a file containing 215 MARC Bibs and on a file containing 2,574 
MARC Auth records.  Don't know if you consider these bulk, but there is more 
than 1 record in each file (caveat: file stops after evaluating the first 
line, so of the 2,574 Auth records, the last 2,573 could be invalid).  It 
failed on a file containing all of LC Classification.  I need to figure out 
why.  

 Kevin, do you have examples of the output?
-- I received MARC21 Bibliography and MARC21 Authority respectively.  In 
theory, if Leader 20-23 are not 4500 then (non-conforming) should be 
appended to the identification.  If requested, the mimetype - application/marc 
- should also be outputted.
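
For anyone who wants to see what those identifications hinge on, here is a minimal Python sketch of the same leader checks. It is not the code file(1) runs: the function name identify_leader and the exact label strings are assumptions made for illustration, and the patterns are adapted from the magic rules proposed in the 2011 thread (leader positions 5 and 6 pick the record type, positions 20-21 must be 45, positions 22-23 should be 00).

```python
import re

# Patterns adapted from the magic rules discussed on the list: five length
# digits, then leader/05 (record status) and leader/06 (type of record).
# The label strings are illustrative; file(1) may word them differently.
LEADER_RULES = [
    (re.compile(rb"^[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC21 Bibliographic"),
    (re.compile(rb"^[0-9]{5}[acdnosx]z"), "MARC21 Authority"),
    (re.compile(rb"^[0-9]{5}[cdn][uvxy]"), "MARC21 Holdings"),
    (re.compile(rb"^[0-9]{5}[acdn]w"), "MARC21 Classification"),
    (re.compile(rb"^[0-9]{5}[cdn]q"), "MARC21 Community"),
]


def identify_leader(leader: bytes) -> str:
    """Return a file(1)-style label for a 24-byte leader, or 'data'."""
    # Leader positions 20-21 must be "45" before anything else is tried.
    if len(leader) < 24 or leader[20:22] != b"45":
        return "data"
    for pattern, label in LEADER_RULES:
        if pattern.match(leader):
            # Positions 22-23 should be "00"; flag the record if they are not.
            if leader[22:24] != b"00":
                label += " (non-conforming)"
            return label
    return "data"


if __name__ == "__main__":
    print(identify_leader(b"00714cam a2200205 a 4500"))
    print(identify_leader(b"00000nz  a2200000n  4500"))
```

Like file itself, this only ever looks at the first 24 bytes, so it says nothing about the rest of the record or the rest of the file.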

Rgds,

Kevin






Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Jonathan Rochkind
I have recently become unpleasantly acquainted with the world of Marc 
that is not Marc21, but is ISO 2709.

What'll it do on ISO 2709? I might be able to dig up an example. I 
wonder if it'll claim it's Marc21 (not), or if it's Marc21 
non-conforming (no, it's not quite that either. It's ISO-2709 MARC 
that's not Marc21).

If it just doesn't know anything about it and says 'data', that's just 
fine; it's OK for it to know about Marc21 but not non-Marc21 ISO 2709.






Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread stuart yeates

On 24/05/12 07:14, Ford, Kevin wrote:

I finally had occasion today (read: remembered) to see if the *nix file 
command would recognize a MARC record file.  I haven't tested extensively, but it did 
identify the file as MARC21 Bibliographic record.  It also correctly identified a MARC21 
Authority Record.  I'm running the most recent version of Ubuntu (12.04 - precise 
pangolin).

I write because the inclusion of a file MARC21 specification rule in the 
magic.db stems from a Code4lib exchange that started in March 2011 [1] (it ends in April 
if you want to go crawling for the entire thread).


A couple of warnings about the unix file command:

(a) it only looks at the start of the file. This is great because it 
works fast on big files. This is dreadful because it can't warn you that 
everything after the first 10k of a 2GB file is corrupt or that a 1k MARC 
file is pre-pended to a 400GB astronomy data file.

(b) it is not uncommon for a file to match multiple file types. This can 
cause problems when using file to check whether inputs to a program are 
actually the type the program is expecting.

(c) some platforms have been notoriously slow to add new definitions; 
Ubuntu is not such a platform.
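
To make warning (a) concrete, here is a rough Python sketch of the kind of whole-file check that file does not do. It assumes ISO 2709 framing (a 5-digit record length at the start of each leader and a 0x1D record terminator at the end of each record); the function name check_records is made up for this example, and it only checks framing, not whether the fields inside a record make sense.

```python
from typing import Iterator, Tuple

RECORD_TERMINATOR = 0x1D  # ISO 2709 end-of-record byte


def check_records(data: bytes) -> Iterator[Tuple[int, bool]]:
    """Yield (offset, looks_ok) for each record found in a MARC byte string.

    Walks the data using the 5-digit record length in each leader and stops
    at the first record whose framing cannot be followed.
    """
    offset = 0
    while offset < len(data):
        length_field = data[offset:offset + 5]
        if len(length_field) < 5 or not length_field.isdigit():
            # No usable record length here, so we cannot go any further.
            yield offset, False
            return
        length = int(length_field)
        record = data[offset:offset + length]
        ok = (
            length >= 24                      # at least a full leader
            and len(record) == length         # not truncated
            and record[-1] == RECORD_TERMINATOR
        )
        yield offset, ok
        if not ok:
            return
        offset += length


if __name__ == "__main__":
    record = b"00026" + b"x" * 20 + b"\x1d"
    for offset, ok in check_records(record + record + b"trailing junk"):
        print(offset, "ok" if ok else "BROKEN")
```

A file that file happily calls MARC can still come out BROKEN here at any record after the first, which is exactly the case (a) is warning about.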


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Kevin Ford
Don't know what to say.  Crawling through the source for file at [1], 
the pattern matching code was in place as of Sept 2011.  It could be 
present earlier than Sept 2011, but I stopped hunting for it.  The 
earliest it would have made its way into the magic db would have been 
April 2011.

Perhaps OpenBSD is using some custom branch of file, hasn't updated 
the db, etc.


Yours,

Kevin









Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Kevin Ford

 It failed on a file containing all of LC Classification.  I need to
 figure out why.
-- To reply to myself: Having looked at the file db pattern source 
[1], I see that the file maintainer introduced a typo into the 
matching pattern for correctly identifying Classification records. 
That's why it's failing for Class records.


Over and out,

Kevin

[1] ftp://ftp.astron.com/pub/file/




Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Ross Singer
On May 23, 2012, at 4:22 PM, Kevin Ford wrote:

 Don't know what to say.  Crawling through the source for file at [1], the 
 pattern matching code was in place as of Sept 2011.  It could be present 
 earlier than Sept 2011, but I stopped hunting for it.  The earliest it would 
 have made its way into the magic db would have been April 2011.
 
 Perhaps OpenBSD is using some custom branch of file, hasn't updated the 
 db, etc.

As Stuart pointed out, some implementations are slow to update the db.  OSX, 
for example, also just says data (hence my question on the output).

-Ross.
 


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Francis Kayiwa
On Wed, May 23, 2012 at 04:34:47PM -0400, Ross Singer wrote:
 On May 23, 2012, at 4:22 PM, Kevin Ford wrote:
 
  Don't know what to say.  Crawling through the source for file at [1], the 
  pattern matching code was in place as of Sept 2011.  It could be present 
  earlier than Sept 2011, but I stopped hunting for it.  The earliest it 
  would have made its way into the magic db would have been April 2011.
  
  Perhaps OpenBSD is using some custom branch of file, hasn't updated the 
  db, etc.
 
 As Stuart pointed out, some implementations are slow to update the db.  OSX, 
 for example, also just says data (hence my question on the output).


adding FreeBSD's magicfile from this commit on a user's $HOME

http://lists.freebsd.org/pipermail/svn-src-vendor/2011-October/000851.html

For my next trick I will try to remember that I need to do that.

./fxk




-- 
If builders built buildings the way programmers wrote programs,
then the first woodpecker to come along would destroy civilization.


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Simon Spero
The file magic format changed between versions; I think the
OSX version was not compatible with more up-to-date versions (in the
original thread, this caused me some confusion).

Simon




Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Kyle Banerjee
On Wed, May 23, 2012 at 12:14 PM, Ford, Kevin k...@loc.gov wrote:

 I finally had occasion today (read: remembered) to see if the *nix file
 command would recognize a MARC record file.  I haven't tested extensively,
 but it did identify the file as MARC21 Bibliographic record.  It also
 correctly identified a MARC21 Authority Record.  I'm running the most
 recent version of Ubuntu (12.04 - precise pangolin).


I'm not sure whether to laugh or cry that it's a sign of progress that a 40
year old utility designed to identify file types is now just beginning to
be able to recognize a format that's been around for almost 50 years...

kyle
-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / baner...@orbiscascade.org / 503.999.9787


Re: [CODE4LIB] MARC magic for file

2011-04-08 Thread Sean Hannan
http://i.imgur.com/6WtA0.png

(Sorry, it's Friday. Also, blame dchud for the idea.)

-Sean


On 4/6/11 4:53 PM, Mike Taylor m...@indexdata.com wrote:

 On 6 April 2011 19:53, Jonathan Rochkind rochk...@jhu.edu wrote:
 On 4/6/2011 2:43 PM, William Denton wrote:
 
 Validity does mean something definite ... but Postel's Law is a good
 guideline, especially with the swamp of bad MARC, old MARC, alternate
 MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
 of file and its magic---we can identify technically invalid but still
 usable MARC, that's good.
 
 Hmm, except in the case of Web Browsers, I think general consensus is
 Postel's law was not helpful. These days, most people seem to think that
 having different browsers be tolerant of invalid data in different ways was
 actually harmful rather than helpful to inter-operability (which is
 theoretically the goal of Postel's law), and that's not what people do
 anymore in web browser land, at least not to the extremes they used to do
 it.
 
 But the idea that browsers should be less permissive in what they
 accept is a modern one that we now have the luxury of only because
 adherence to Postel's law in the early days of the Web allowed it to
 become ubiquitous.  Though it's true, as Harvey Thompson has observed,
 that it's difficult to retro-fit correctness, Clay Shirky was also
 very right when he pointed out that "You cannot simultaneously have
 mass adoption and rigor."  If browsers in 1995 had been as pedantic as
 the browsers of 2011 (rightly) are, we wouldn't even have the Web; or
 if it existed at all it would just be a nichey thing that a few
 scientists used to make their publications available to each other.
 
 So while I agree that in the case of HTML we are right to now be
 moving towards more rigorous demands of what to accept (as well, of
 course, as being conservative in what we emit), I don't think we could
 have made the leap from nothing to modern rigour.
 
 -- Mike


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually, you can have records that are MARC21 coming out of vendor databases 
(who sometimes embed control characters into the leader) and still be valid.  
Once you stop looking at just your ILS or OCLC, you probably wouldn't be 
surprised to know that records start looking very different.

--TR



Terry Reese, Associate Professor
Gray Family Chair 
for Innovative Library Services
121 Valley Libraries
Corvallis, Or 97331
tel: 541.737.6384




 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 9:44 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC magic for file
 
 Can't you have a legal MARC file that does NOT have 4500 in those
 leader positions?  It's just not legal Marc21, right?   Other marc
 formats may specify or even allow flexibility in the things these bytes
 specify:
 
 * Length of the length-of-field portion
 * Number of characters in the starting-character-position portion of a
 Directory entry
 * Number of characters in the implementation-defined portion of a Directory
 entry
 
 Or, um, 23, which I guess is left to the specific Marc implementation (ie,
 Marc21 is one such) to use for its own purposes.
 
 I have no idea how that should inform the 'marc magic'.
 
 Is mime-type application/marc defined as specifically Marc21, or as any
 Marc?
 
 Jonathan
 
 On 4/6/2011 12:28 PM, Ford, Kevin wrote:
  Well, this brings us right up against the issue of files that adhere to
  their specifications versus forgiving applications.  Think of browsers and
  HTML.  Suffice it to say, MARC applications are quite likely to be forgiving
  of leader positions 20-23.  In my non-conforming MARC file and in Bill's,
  the leader positions 20-21 (45) seemed constant, but things could fall
  apart for positions 22-23.  So...

  I present the following (in-line and attached, to preserve tabs) in an
  attempt to straddle the two sides of this issue: applications forgiving of
  non-conforming files.  Should the two characters following 45 (at position
  20) *not* be 00, then the identification will be noted as non-conforming.
  We could classify this as reasonable identification but hardly ironclad
  (indeed, simply checking to confirm that part of the first 24 positions
  match the specification hardly constitutes a robust identification, but
  it's something).

  It will also give you a mimetype too, now.

  Would anyone like to test it out more fully on their own files?
 
 
  #
  # MARC 21 Magic  (Third cut)

  # Set at position 0
  0   byte    x

  # leader position 20-21 must be 45
  20  string  45
  # leader starts with 5 digits, followed by codes specific to MARC format
  0   regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
  !:mime  application/marc
  0   regex/1 (^[0-9]{5})[acdnosx][z]         MARC Authority
  !:mime  application/marc
  0   regex/1 (^[0-9]{5})[cdn][uvxy]          MARC Holdings
  !:mime  application/marc
  0   regex/1 (^[0-9]{5})[acdn][w]            MARC Classification
  !:mime  application/marc
  0   regex/1 (^[0-9]{5})[cdn][q]             MARC Community
  !:mime  application/marc

  # leader position 22-23, should be 00 but is it?
  0   regex/1 (^.{21})([^0]{2})               (non-conforming)
  !:mime  application/marc

 
  If this works, I'll see about submitting this copy.  Thanks for all your
  efforts already.
 
  Warmly,
 
  Kevin
 
  --
  Library of Congress
  Network Development and MARC Standards Office
 
 
 
 
 
  
  From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon
  Spero [s...@unc.edu]
  Sent: Sunday, April 03, 2011 14:01
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] MARC magic for file
 
  I am pretty sure that the marc4j standard reader ignores them; the
  tolerant reader definitely does. Otherwise JHU might have about two
  parseable records based on the mangled leaders that J-Rock  gets stuck
  with :-)
 
  An analysis of the ~7M LC bib records from the scriblio.net data files
  (~ Dec 2006) indicated that leader  has less than 8 bits of
  information in it (shannon-weaver definition). This excludes the
  initial length value, which is redundant given the end of record marker.
 
 
  The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
  The final characters of the leader are 450.
 
  Also, I object to the phrase "decent MARC tool".  Any tool capable of
  dealing with MARC as it exists cannot afford the luxury of decency :-)
 
  [ HA: A clear conscience?
BW: Yes, Sir Humphrey.
HA: When did you acquire this taste for luxuries?]
 
  Simon
 
  On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:
 
  I'm sure any decent MARC tool can deal with them, since decent MARC
  tools are certainly going to be forgiving enough to deal with four
  characters

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
I'm not sure what you mean, Terry.  Maybe we have different 
understandings of "valid".

If leader bytes 20-23 are not 4500, I suggest that is _by definition_ 
not a valid Marc21 file. It violates the Marc21 specification.

Now, they may still be _usable_, by software that ignores these bytes 
anyway or works around them. We definitely have a lot of software that 
does that.

Which can end up causing problems that remind me of the very analogous 
problems caused in the early days by web browsers that felt like being 
'tolerant' of bad data. "My html works in every web browser BUT this one, 
why not?" Oh, because that's the only one that actually followed the 
standard, oops.

I actually ran into an example of that problem with this exact issue. 
MOST software just ignores marc leader bytes 20-23, and assumes the 
semantics of 4500---the only legal semantics for Marc21.  But Marc4j 
actually _respected_ them; apparently the author thought that some marc 
in the wild might intentionally set different bytes here (no idea if 
that's true or not). So if the leader bytes 20-23 were invalid 
(according to the spec), Marc4j would suddenly decide that the "length 
of field" portion was NOT 4, but actually BELIEVE whatever was in leader 
byte 20, causing the record to be parsed improperly.  And I had records 
like that coming out of my ILS (not even a vendor database). That was an 
unfun couple of days of debugging to figure out what was going on.
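
To show how a mangled leader can derail a parser that believes it, here is a small Python sketch. It is an illustration only, not Marc4j's actual code: the function name directory_entries and the trust_leader switch are invented for this example. It reads the directory either with the widths taken from leader bytes 20-22 or with the fixed Marc21 widths (4, 5, 0) that most tools simply assume.

```python
from typing import List, Tuple

FIELD_TERMINATOR = 0x1E  # ends the directory (and each field)


def directory_entries(record: bytes, trust_leader: bool) -> List[Tuple[str, int, int]]:
    """Return (tag, field_length, start_position) tuples from the directory.

    With trust_leader=True the entry layout is taken from leader bytes
    20-22; otherwise the fixed Marc21 layout (4, 5, 0) is assumed.
    """
    if trust_leader:
        len_width = int(record[20:21])    # length of the length-of-field portion
        start_width = int(record[21:22])  # length of the starting-position portion
        impl_width = int(record[22:23])   # length of the implementation-defined portion
    else:
        len_width, start_width, impl_width = 4, 5, 0
    entry_width = 3 + len_width + start_width + impl_width

    end = record.index(FIELD_TERMINATOR, 24)  # directory starts after the leader
    directory = record[24:end]

    entries = []
    for i in range(0, len(directory) - entry_width + 1, entry_width):
        entry = directory[i:i + entry_width]
        tag = entry[:3].decode("ascii")
        field_length = int(entry[3:3 + len_width])
        start = int(entry[3 + len_width:3 + len_width + start_width])
        entries.append((tag, field_length, start))
    return entries


if __name__ == "__main__":
    directory = b"001001300000245002000013"
    mangled = b"00000nam a2200000 a 3500" + directory + b"\x1e"
    print(directory_entries(mangled, trust_leader=False))
    print(directory_entries(mangled, trust_leader=True))
```

With the leader ending in 3500 instead of 4500, the forgiving reading still finds tags 001 and 245, while the trusting reading slices the same bytes into 11-byte entries and comes back with nonsense lengths and offsets.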


On 4/6/2011 12:52 PM, Reese, Terry wrote:

Actually, you can have records that are MARC21 coming out of vendor databases 
(who sometime embed control characters into the leader) and still be valid.  
Once you stop looking at just your ILS or OCLC, you probably wouldn't be 
surprised to know that records start looking very different.

--TR



Terry Reese, Associate Professor
Gray Family Chair
for Innovative Library Services
121 Valley Libraries
Corvallis, Or 97331
tel: 541.737.6384





-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Wednesday, April 06, 2011 9:44 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

Can't you have a legal MARC file that does NOT have 4500 in those
leader positions?  It's just not legal Marc21, right?   Other marc
formats may specify or even allow flexibility in the things these bytes
specify:

* Length of the length-of-field portion
* Number of characters in the starting-character-position portion of a
Directory entry
* Number of characters in the implementation-defined portion of a Directory
entry

Or, um, 23, which is I guess is left to the specific Marc implementation (ie,
Marc21 is one such) to use for it's own purposes.

I have no idea how that should inform the 'marc magic'.

Is mime-type application/marc defined as specifically Marc21, or as any
Marc?

Jonathan

On 4/6/2011 12:28 PM, Ford, Kevin wrote:

Well, this brings us right up against the issue of files that adhere to their
specifications versus forgiving applications.  Think of browsers and HTML.
Suffice it to say, MARC applications are quite likely to be forgiving of leader
positions 20-23.  In my non-conforming MARC file and in Bill's, the leader
positions 20-21 (45) seemed constant, but things could fall apart for
positions 22-23.  So...

I present the following (in-line and attached, to preserve tabs) in an
attempt to straddle the two sides of this issue: applications forgiving of
non-conforming files.  Should the two characters following 45 (at position 20)
*not* be 00, then the identification will be noted as non-conforming.  We
could classify this as reasonable identification but hardly ironclad (indeed,
simply checking to confirm that part of the first 24 positions match the
specification hardly constitutes a robust identification, but it's something).

It will also give you a mimetype now.

Would anyone like to test it out more fully on their own files?


#
# MARC 21 Magic  (Third cut)

# Set at position 0
0   bytex

# leader position 20-21 must be 45
20  string  45

# leader starts with 5 digits, followed by codes specific to MARC format
0   regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
!:mime  application/marc
0   regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
!:mime  application/marc
0   regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
!:mime  application/marc
0   regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
!:mime  application/marc
0   regex/1 (^[0-9]{5})[cdn][q] MARC Community
!:mime  application/marc

# leader position 22-23, should be 00 but is it?
0   regex/1 (^.{21})([^0]{2})   (non-conforming)
!:mime  application/marc
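
For anyone who wants to try the same checks without a patched magic file, the rules above can be approximated in Python. This is only a sketch under assumptions: identify is a made-up name, the "data" fallback simply mimics what file prints for unrecognised input, and the non-conforming test follows the intent stated in the comment (positions 22-23 are not 00) rather than the last regex character for character.

```python
import re

# Leader position 5 (record status) and position 6 (type of record),
# taken from the character classes in the magic rules above.
RECORD_TYPES = [
    (re.compile(rb"[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(rb"[0-9]{5}[acdnosx]z"), "MARC Authority"),
    (re.compile(rb"[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(rb"[0-9]{5}[acdn]w"), "MARC Classification"),
    (re.compile(rb"[0-9]{5}[cdn]q"), "MARC Community"),
]


def identify(leader: bytes) -> str:
    """Classify the first 24 bytes of a file the way the rules above intend."""
    # Leader positions 20-21 must be "45".
    if len(leader) < 24 or leader[20:22] != b"45":
        return "data"
    for pattern, label in RECORD_TYPES:
        if pattern.match(leader):
            # Positions 22-23 should be "00"; flag the record if they are not.
            if leader[22:24] != b"00":
                label += " (non-conforming)"
            return label
    return "data"
```

Reading the first 24 bytes of a file (for example open(path, "rb").read(24)) and passing them to identify gives a quick second opinion on what file reports.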


If this works, I'll see about submitting this copy.  Thanks for all your
efforts already.

Warmly,

Kevin

--
Library of Congress
Network Development and MARC Standards

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Prettyman, Timothy
Just as a historical note, this non-standard use of LDR/22 is likely due to 
OCLC's use of the character as a hexadecimal flag from back in the days when 
marc records were mostly schlepped around on tapes.  They referred to it as the 
Transaction type code.  When records were sent to oclc for processing, 
various values of the flag indicated whether a catalog card was to be produced, 
whether the record was an update, whether the user location symbol was to be 
set, etc.  I'm sure others have used it for their own nefarious purposes as 
well.

Tim Prettyman
University of Michigan/LIT

On Apr 6, 2011, at 12:28 PM, Ford, Kevin wrote:

 Well, this brings us right up against the issue of files that adhere to their 
 specifications versus forgiving applications.  Think of browsers and HTML.  
 Suffice it to say, MARC applications are quite likely to be forgiving of 
 leader positions 20-23.  In my non-conforming MARC file and in Bill's, the 
 leader positions 20-21 (45) seemed constant, but things could fall apart 
 for positions 22-23.  So...
 
 I present the following (in-line and attached, to preserve tabs) in an 
 attempt to straddle the two sides of this issue: applications forgiving of 
 non-conforming files.  Should the two characters following 45 (at position 
 20) *not* be 00, then the identification will be noted as non-conforming.  
 We could classify this as reasonable identification but hardly ironclad 
 (indeed, simply checking to confirm that part of the first 24 positions match 
 the specification hardly constitutes a robust identification, but it's 
 something).
 
 It will also give you a mimetype too, now.
 
 Would any like testing it out more fully on their own files?
 
 
 #
 # MARC 21 Magic  (Third cut)
 
 # Set at position 0
 0   bytex
 
 # leader position 20-21 must be 45
 20  string  45
 
 # leader starts with 5 digits, followed by codes specific to MARC format
 0   regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
 !:mime  application/marc
 0   regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
 !:mime  application/marc
 0   regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
 !:mime  application/marc
 0   regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
 !:mime  application/marc
 0   regex/1 (^[0-9]{5})[cdn][q] MARC Community
 !:mime  application/marc
 
 # leader position 22-23, should be 00 but is it?
 0   regex/1 (^.{21})([^0]{2})   (non-conforming)
 !:mime  application/marc
 
 
 If this works, I'll see about submitting this copy.  Thanks to all your 
 efforts already.
 
 Warmly,
 
 Kevin
 
 --
 Library of Congress
 Network Development and MARC Standards Office
 
 
 
 
 
 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
 [s...@unc.edu]
 Sent: Sunday, April 03, 2011 14:01
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MARC magic for file
 
 I am pretty sure that the marc4j standard reader ignores them; the tolerant
 reader definitely does. Otherwise JHU might have about two parseable records
 based on the mangled leaders that J-Rock  gets stuck with :-)
 
 An analysis of the ~7M LC bib records from the scriblio.net data files (~
 Dec 2006) indicated that the leader has less than 8 bits of information in it
 (Shannon-Weaver definition). This excludes the initial length value, which
 is redundant given the end of record marker.
 
 
 The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
 The final characters of the leader are 450.
 
 Also, I object to the phrase "decent MARC tool".  Any tool capable of
 dealing with MARC as it exists cannot afford the luxury of decency :-)
 
 [ HA: A clear conscience?
 BW: Yes, Sir Humphrey.
 HA: When did you acquire this taste for luxuries?]
 
 Simon
 
 On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:
 
 I'm sure any decent MARC tool can deal with them, since decent MARC tools
 are certainly going to be forgiving enough to deal with four characters
 that
 apparently don't even really matter.
 
 You say that, but I'm pretty sure Marc4J throws errors on MARC records where
 these characters are incorrect
 
 Owen
 
 On Fri, Apr 1, 2011 at 3:51 AM, William Denton w...@pobox.com wrote:
 
 On 28 March 2011, Ford, Kevin wrote:
 
 I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
 I
 received line too long errors.  But, since I've been curious about
 this
 for sometime, I figured I'd take a whack at it myself.  Try this:
 
 
 This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
 and it recognized almost all of them.  A few it didn't, so I had a closer
 look, and they're invalid.
 
 For example, the Internet Archive's Binghamton catalogue dump:
 
 http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
 
 $ file -m marc.magic bgm*mrc
 bgm_openlib_final_0-5.mrc: data
 bgm_openlib_final_10

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually -- I'd disagree because that is a very narrow view of the 
specification.  When validating MARC, I'd take the approach to validate 
structure (which allows you to then read any MARC format) -- then use a 
separate process for validating content of fields, which in my opinion, is more 
open to interpretation based on system usage of the data.  For example, 22 and 
23 are undefined values that local systems may very well have a practical need 
to define and use given that there are only so many values in the leader.  This 
is why I sometimes see additional values in the 09 field (which should be a or 
blank) to define different character set types, or additional elements added to 
other fields.  If I want to validate the content of those fields, I'd validate 
it through a different process -- but I separate the process from the 
validation of the structure -- because the two are not exclusive.
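
To make the two-pass idea concrete, here is a minimal Python sketch. It is not MarcEdit's code; check_structure and check_content are hypothetical names, and the structural pass assumes the usual 12-byte directory entries instead of reading the entry map from the leader. The point is only that the first pass never looks at leader 09 or 20-23, and the second pass reports on them without affecting whether the record can be read.

```python
FT = 0x1E  # field terminator
RT = 0x1D  # record terminator


def check_structure(record: bytes) -> list:
    """Pass 1: can the record be walked at all? Ignores leader 09 and 20-23."""
    if len(record) < 25:
        return ["record is shorter than a leader plus a terminator"]
    problems = []
    if record[-1] != RT:
        problems.append("record does not end with the record terminator")
    if not record[12:17].isdigit():
        return problems + ["base address (leader 12-16) is not numeric"]
    base = int(record[12:17])
    if not 25 <= base <= len(record) or record[base - 1] != FT:
        return problems + ["directory is not closed by a field terminator"]
    directory = record[24:base - 1]
    if len(directory) % 12:
        problems.append("directory length is not a multiple of 12")
    for i in range(0, len(directory) - 11, 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii", "replace")
        if not entry[3:12].isdigit():
            problems.append("directory entry for %s is not numeric" % tag)
            continue
        length, start = int(entry[3:7]), int(entry[7:12])
        end = base + start + length
        if length < 1 or end > len(record) or record[end - 1] != FT:
            problems.append("field %s does not end with a field terminator" % tag)
    return problems


def check_content(record: bytes) -> list:
    """Pass 2: MARC 21 content rules, kept apart from readability."""
    notes = []
    leader = record[:24]
    if leader[20:24] != b"4500":
        notes.append("leader 20-23 is not 4500")
    if leader[9:10] not in (b" ", b"a"):
        notes.append("leader 09 is neither blank nor a")
    return notes
```

A record whose leader ends in 45x0 passes the first function with no problems and picks up one note from the second, which is the separation being argued for here.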

--TR

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Wednesday, April 06, 2011 9:59 AM
 To: Code for Libraries
 Cc: Reese, Terry
 Subject: Re: [CODE4LIB] MARC magic for file
 
 I'm not sure what you mean Terry.  Maybe we have different understandings
 of valid.
 
 If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a
 valid Marc21 file. It violates the Marc21 specification.
 
 Now, they may still be _usable_, by software that ignores these bytes
 anyway or works around them. We definitely have a lot of software that
 does that.
 
 Which can end up causing problems that remind me of very analogous
 problems caused by the early days of web browsers that felt like being
 'tolerant' of bad data. My html works in every web browser BUT this one,
 why not? Oh, because that's the only one that actually followed the
 standard, oops.
 
 I actually ran into an example of that problem with this exact issue.
 MOST software just ignores marc leader bytes 20-23, and assumes the
 semantics of 4500---the only legal semantics for Marc21.  But Marc4j
 actually _respected_ them, apparently the author thought that some marc in
 the wild might intentionally set different bytes here (no idea if that's true 
 or
 not). So if the leader bytes 20-23 were invalid
 (according to the spec), Marc4j would suddenly decide that the length of
 field portion was NOT 4, but actually BELIEVE whatever was in leader byte
 20, causing the record to be parsed improperly.  And I had records like that
 coming out of my ILS (not even a vendor database). That was an unfun
 couple days of debugging to figure out what was going on.
 
 On 4/6/2011 12:52 PM, Reese, Terry wrote:
  Actually, you can have records that are MARC21 coming out of vendor
 databases (who sometimes embed control characters into the leader) and still
 be valid.  Once you stop looking at just your ILS or OCLC, you probably
 wouldn't be surprised to know that records start looking very different.
 
  --TR
 
 
  
  Terry Reese, Associate Professor
  Gray Family Chair
  for Innovative Library Services
  121 Valley Libraries
  Corvallis, Or 97331
  tel: 541.737.6384
  
 
 
 
  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
  Of Jonathan Rochkind
  Sent: Wednesday, April 06, 2011 9:44 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] MARC magic for file
 
  Can't you have a legal MARC file that does NOT have 4500 in those
  leader positions?  It's just not legal Marc21, right?   Other marc
  formats may specify or even allow flexibility in the things these
  bytes
  specify:
 
  * Length of the length-of-field portion
  * Number of characters in the starting-character-position portion of
  a Directory entry
  * Number of characters in the implementation-defined portion of a
  Directory entry
 
  Or, um, 23, which I guess is left to the specific Marc
  implementation (ie,
  Marc21 is one such) to use for its own purposes.
 
  I have no idea how that should inform the 'marc magic'.
 
  Is mime-type application/marc defined as specifically Marc21, or as
  any Marc?
 
  Jonathan
 
  On 4/6/2011 12:28 PM, Ford, Kevin wrote:
  Well, this brings us right up against the issue of files that adhere
  to their
  specifications versus forgiving applications.  Think of browsers and HTML.
  Suffice it to say, MARC applications are quite likely to be forgiving
  of leader positions 20-23.  In my non-conforming MARC file and in
  Bill's, the leader positions 20-21 (45) seemed constant, but things
  could fall apart for positions 22-23.  So...
  I present the following (in-line and attached, to preserve tabs) in
  an
  attempt to straddle the two sides of this issue: applications
  forgiving of non- conforming files.  Should the two characters
  following 45 (at position 20)
  *not* be 00, then the identification will be noted as
  non-conforming.  We could classify this as reasonable
  identification but hardly ironclad

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton

On 6 April 2011, Reese, Terry wrote:

Actually -- I'd disagree because that is a very narrow view of the 
specification.  When validating MARC, I'd take the approach to validate 
structure (which allows you to then read any MARC format) -- then use a 
separate process for validating content of fields, which in my opinion, 
is more open to interpretation based on system usage of the data.


What do you think is the best way to recognize MARC files (up to some 
level of validity, given all the MARC you've seen and parsed) that could 
be made to work the way magic is defined?


Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
I'm honestly not familiar with magic.  I can tell you in MarcEdit, the way that 
the process works is there is a very generic function that reads the structure 
of the data not trusting the information in the leader (since I find this data 
very unreliable).  Then, if users want to apply a set of rules to the 
validation -- I apply those as a secondary process.  If you are looking to 
validate specific content within a record, then what you want to do in this 
function may be appropriate -- though you'll find some local systems will 
consistently fail the process.
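
One way to read structure without trusting the leader, sketched in Python. This is a guess at the general approach and not MarcEdit's implementation; split_records is a hypothetical name. It cuts the file on the record terminator (0x1D) and only afterwards compares each record's real size with the length declared in leader 00-04.

```python
RT = b"\x1d"  # record terminator


def split_records(blob: bytes):
    """Split a file on the record terminator instead of the declared length.

    Returns (records, trailing): records is a list of
    (record_bytes, declared_length_matches) pairs, and trailing is whatever
    follows the last terminator (a stray newline, a truncated record, ...).
    """
    pieces = blob.split(RT)
    trailing = pieces.pop()  # bytes after the final terminator, possibly empty
    records = []
    for piece in pieces:
        record = piece + RT
        declared = int(record[:5]) if record[:5].isdigit() else None
        records.append((record, declared == len(record)))
    return records, trailing
```

A record with a wrong or mangled length in its leader still comes out whole, with the mismatch reported as a flag and not as a parse failure.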

--tr


From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of William Denton 
[w...@pobox.com]
Sent: Wednesday, April 06, 2011 10:29 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

On 6 April 2011, Reese, Terry wrote:

 Actually -- I'd disagree because that is a very narrow view of the
 specification.  When validating MARC, I'd take the approach to validate
 structure (which allows you to then read any MARC format) -- then use a
 separate process for validating content of fields, which in my opinion,
 is more open to interpretation based on system usage of the data.

What do you think is the best way to recognize MARC files (up to some
level of validity, given all the MARC you've seen and parsed) that could
be made to work the way magic is defined?

Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
..  Maybe we have different understandings of valid.

 If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not
 a valid Marc21 file. It violates the Marc21 specification.

 Now, they may still be _usable_, by software that ignores these bytes
 anyway or works around them. We definitely have a lot of software that does
 that.

 Which can end up causing problems that remind me of very analogous problems
 caused by the early days of web browsers that felt like being 'tolerant' of
 bad data. My html works in every web browser BUT this one, why not? Oh,
 because that's the only one that actually followed the standard, oops.


There is some question as to what value there is in validating fields that
have no meaning by definition. What benefit does validating an undefined
value have other than create an opportunity to break things and slow the
process down just a little? The entire concept of an invalid entry in an
undefined field (e.g byte 23) is oxymoronic.

I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which are never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
to do with a page because it lacks a document type declaration.

Garbage data is the reality, so having parsers stop when they encounter data
they don't actually need unnecessarily complicates things. That kind of
stuff should generate a warning at worst.

kyle


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

Actually -- I'd disagree because that is a very narrow view of the
specification.  When validating MARC, I'd take the approach to validate
structure (which allows you to then read any MARC format) -- then use a
separate process for validating content of fields, which in my opinion,
is more open to interpretation based on system usage of the data.


Wait, so is there any formal specification of validity that you can 
look at to determine your definition of validity, or it's just well, 
if I can recover it into useful data, using my own algorithms


I think we computer programmers are really better-served by reserving 
the notion of validity for things specified by formal specifications 
-- as we normally do, talking about any other data format.   And the 
only formal specifications I can find for Marc21 say that leader bytes 
20-23 should be 4500. (Not true of Marc in general just Marc21).


Now it may very well be (is!) true that the library community with Marc 
have been in the practice of tolerating working Marc that is NOT valid 
according to any specification.   So, sure, we may need to write 
software to take account of that sordid history. But I think it IS a 
sordid history -- not having a specification to ensure validity makes it 
VERY hard to write any new software that recognizes what you expect it 
to be recognize, because what you expect it to recognize isn't formally 
specified anywhere. It's a problem.  We shouldn't try to hide the 
problem in our discussions by using the word valid to mean something 
different than we use it for any modern data format. valid only has a 
meaning when you're talking about valid according to some specific 
specification.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:02 PM, Kyle Banerjee wrote:

I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which are never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
to do with a page because it lacks a document type declaration.


Well, the problem is when the original Marc4J author took the spec at 
its word, and actually _acted upon_ the '4' and the '5', changing file 
semantics if they were different, and throwing an exception if it was a 
non-digit.


This actually happened, I'm not making this up!  Took me a while to debug.

So do you think he got it wrong?  How was he supposed to know he got it 
wrong, he wrote to the spec and took it at its word. Are you SURE there 
aren't any Marc formats other than Marc21 out there that actually do use 
these bytes with their intended meaning, instead of fixing them? How was 
the Marc4J author supposed to be sure of that, or even guess it might be 
the case, and know he'd be serving users better by ignoring the spec 
here instead of following it?  What documents instead of the actual 
specifications should he have been looking at to determine that he ought 
not to be taking those bytes at their words, but just ignoring them?


To realize that we have so much non-conformant data out there that we're 
better off ignoring these bytes, is something you can really only learn 
through experience -- and something you can then later realize you're 
wrong on too:


Ie: I _thought_ I was writing only for Marc21, but then it turns out 
I've got to accept records from Outer Weirdistan that are a kind of 
legal Marc that actually uses those bytes for their intended meaning -- 
better go back and fix my entire software stack, involving various 
proprietary and open source products from multiple sources, each of 
which has undocumented behavior when it comes to these bytes, maybe they 
follow the spec or maybe they follow Kyle's advice, but they don't tell 
me.  This is a mess.


Maybe this scenario is impossible, maybe there AREN'T and NEVER HAVE BEEN 
any Marc variants that actually use leader bytes 20-22 in this way -- 
how can I determine that?  I've just got to guess and hope for the 
best.  The point of specifications in the first place is for 
inter-operability, so we know that if all software and data conforms to 
the spec, then all software and data will interact in expected ways.  
Once we start guessing at which parts of the spec we really ought to be 
ignoring


Again, I realize in the actual environment we've got, this is not a 
luxury we have. But it's a fault, not a benefit, to have lots of 
software everywhere behaving in non-compliant ways and creating invalid 
(according to the spec!) data.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton

On 6 April 2011, Jonathan Rochkind wrote:

I think we computer programmers are really better-served by reserving the 
notion of validity for things specified by formal specifications -- as we 
normally do, talking about any other data format.   And the only formal 
specifications I can find for Marc21 say that leader bytes 20-23 should be 
4500. (Not true of Marc in general just Marc21).


Validity does mean something definite ... but Postel's Law is a good 
guideline, especially with the swamp of bad MARC, old MARC, alternate 
MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake 
of file and its magic---we can identify technically invalid but still 
usable MARC, that's good.


Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:43 PM, William Denton wrote:


Validity does mean something definite ... but Postel's Law is a good
guideline, especially with the swamp of bad MARC, old MARC, alternate
MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
of file and its magic---we can identify technically invalid but still
usable MARC, that's good.


Hmm, except in the case of Web Browsers, I think general consensus is 
Postel's law was not helpful. These days, most people seem to think that 
having different browsers be tolerant of invalid data in different ways 
was actually harmful rather than helpful to inter-operability (which is 
theoretically the goal of Postel's law), and that's not what people do 
anymore in web browser land, at least not to the extremes they used to 
do it.


So Postel's Law may not be a universal.  Although marc data may or may 
not be analogous to a web browser/html. :)  It doesn't _really_ matter, 
cause we're stuck with the legacy we're stuck with, there's no changing 
it now. But there are real world negative consequences to it, some of 
which I've tried to explain in previous messages. (And still don't call 
it validity if it's not please! But yes, sometimes insisting on strict 
validity is not the appropriate solution).


Also note that assuming that byte 20-21 is 45 even when it's something 
else is possibly not something Postel would accept as an application of 
his law -- unless you document your software specifically as working 
only with Marc21, and not any Marc.


[Postel's Law: Be conservative in what you send; be liberal in what you 
accept. http://en.wikipedia.org/wiki/Robustness_principle  .  That wiki 
page also notes the general category of downside in following Postel's 
law, which is what was encountered with HTML, and which _I've_ 
encountered with MARC:  For example, a defective implementation that 
sends non-conforming messages might be used only with implementations 
that tolerate those deviations from the specification until, possibly 
several years later, it is connected with a less tolerant application 
that rejects its messages. In such a situation, identifying the problem 
is often difficult, and deploying a solution can be costly. 


Yes, identifying the problem and deploying the solution was costly, in 
my MARC case, although it definitely could have been worse. ]


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
 Well, the problem is when the original Marc4J author took the spec at its
 word, and actually _acted upon_ the '4' and the '5', changing file semantics
 if they were different, and throwing an exception if it was a non-digit.

At least the author actually used the values rather than checking to see if
a 4 or 5 were there. I still don't see what the point of looking for a 0 in
an undefined field would be. I'm wondering what kind of nut job would write
this into the standard, but that's not the author's problem.


 Do you think he got it wrong?  How was he supposed to know he got it wrong,
 he wrote to the spec and took it at its word. Are you SURE there aren't any
 Marc formats other than Marc21 out there that actually do use these bytes
 with their intended meaning, instead of fixing them?


I wouldn't call it wrong -- the spec is a logical point of departure. MARC21
derives from an ISO standard that does not use those character positions and
which otherwise requires the same data layout, but the author wouldn't
necessarily know that.

Standards have something in common with laws in that how they are used in
the real world is as or more important than what is actually defined --
what's written and what's done in practice can be very different.

Everyone here who has parsed catalog data who has done an ILS migration
knows better than to just think for a second that fields can be assumed to
be used as defined except for very basic stuff.


 How was the Marc4J author supposed to be sure of that, or even guess it
 might be the case, and know he'd be serving users better by ignoring the
 spec here instead of following it?


There might not have been a good way to know. With data, one thing you
always want to do is ask a bunch of people who work with it all the time
about anomalies in the wild. Many great works of fiction masquerade as
documents which supposedly describe reality.


 Ie: I _thought_ I was writing only for Marc21, but then it turns out I've
 got to accept records from Outer Weirdistan that are a kind of legal Marc
 that actually uses those bytes for their intended meaning


Any such MARC would be noncompliant with the ISO standard from which
MARC21 hails. If working from the MARC21 standard and weird records are in
question, there would be a greater chance of choking on nonnumeric tags as
those are allowed by the ISO standard.

Ignoring that MARC21 would need to be redefined to be able to take on other
values, one can safely conclude that such a redefinition could only be
written by totally deranged individuals. Values lower than 4 and 5
respectively would limit record length to the point little or no data could
be stored, and greater values would be completely nonsensical as the MARC
record length limitation would mean that the extra space allocated by the
digits could only contain zeros.
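
The arithmetic behind that paragraph can be spelled out in a few lines of Python. This is just a worked example: limits is a made-up helper, and the only inputs are the digit widths discussed in this thread (leader 20 and 21) plus the five-digit record length in leader 00-04.

```python
MAX_RECORD = 10**5 - 1  # leader 00-04 holds five digits, so 99999 bytes at most


def limits(length_width, start_width):
    """Largest values a directory entry can express for the given digit widths."""
    return {
        "max_field_length": 10**length_width - 1,
        "max_start_offset": 10**start_width - 1,
        "entry_width": 3 + length_width + start_width,
    }
```

limits(4, 5) gives a 12-byte entry, fields of up to 9999 bytes and offsets of up to 99999, which is exactly MAX_RECORD. Dropping to limits(3, 4) caps fields at 999 bytes and offsets at 9999, and widening the offset to six digits allows values up to 999999 that a 99999-byte record can never reach.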

In any case, MARC is a legacy standard from the 60's. The chances of new
flavors emerging are dismal at best.


 Again, I realize in the actual environment we've got, this is not a luxury
 we have. But it's a fault, not a benefit, to have lots of software
 everywhere behaving in non-compliant ways and creating invalid (according to
 the spec!) data.

Creating is another matter entirely. Since we can control what we create
ourselves, we make things a little better every time we make things
conformant. However, we can't control what others do and being able to read
everything is useful, including stuff created using tools/processes that
aren't up to scratch.

kyle


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Mike Taylor
On 6 April 2011 19:53, Jonathan Rochkind rochk...@jhu.edu wrote:
 On 4/6/2011 2:43 PM, William Denton wrote:

 Validity does mean something definite ... but Postel's Law is a good
 guideline, especially with the swamp of bad MARC, old MARC, alternate
 MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
 of file and its magic---we can identify technically invalid but still
 usable MARC, that's good.

 Hmm, except in the case of Web Browsers, I think general consensus is
 Postel's law was not helpful. These days, most people seem to think that
 having different browsers be tolerant of invalid data in different ways was
 actually harmful rather than helpful to inter-operability (which is
 theoretically the goal of Postel's law), and that's not what people do
 anymore in web browser land, at least not to the extremes they used to do
 it.

But the idea that browsers should be less permissive in what they
accept is a modern one that we now have the luxury of only because
adherence to Postel's law in the early days of the Web allowed it to
become ubiquitous.  Though it's true, as Harvey Thompson has observed,
that "it's difficult to retro-fit correctness", Clay Shirky was also
very right when he pointed out that "You cannot simultaneously have
mass adoption and rigor".  If browsers in 1995 had been as pedantic as
the browsers of 2011 (rightly) are, we wouldn't even have the Web; or
if it existed at all it would just be a nichey thing that a few
scientists used to make their publications available to each other.

So while I agree that in the case of HTML we are right to now be
moving towards more rigorous demands of what to accept (as well, of
course, as being conservative in what we emit), I don't think we could
have made the leap from nothing to modern rigour.

-- Mike


Re: [CODE4LIB] MARC magic for file

2011-04-03 Thread Simon Spero
I am pretty sure that the marc4j standard reader ignores them; the tolerant
reader definitely does. Otherwise JHU might have about two parseable records
based on the mangled leaders that J-Rock  gets stuck with :-)

An analysis of the ~7M LC bib records from the scriblio.net data files (~
Dec 2006) indicated that the leader has less than 8 bits of information in it
(Shannon-Weaver definition). This excludes the initial length value, which
is redundant given the end of record marker.
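
For anyone who wants to repeat that kind of measurement, here is one possible way to compute it in Python. It is a sketch under assumptions and not necessarily the method used for the figure above: leader_entropy_bits is a made-up name, and it treats leader bytes 5-23 (everything after the record length) as a single symbol and takes the Shannon entropy of how often each distinct value occurs.

```python
import math
from collections import Counter


def leader_entropy_bits(leaders):
    """Shannon entropy, in bits, of leader[5:24] across a set of leaders."""
    counts = Counter(leader[5:24] for leader in leaders)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the observed distinct values
    return -sum(n / total * math.log2(n / total) for n in counts.values())
```

If every record carried the same leader the result would be 0 bits, and two equally common leaders would give exactly 1 bit. Whether the base address (leader 12-16) should be counted is a choice this sketch makes by including it.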


The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
 The final characters of the leader are 450.

Also, I object to the phrase "decent MARC tool".  Any tool capable of
dealing with MARC as it exists cannot afford the luxury of decency :-)

[ HA: A clear conscience?
 BW: Yes, Sir Humphrey.
 HA: When did you acquire this taste for luxuries?]

Simon

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:

 I'm sure any decent MARC tool can deal with them, since decent MARC tools
 are certainly going to be forgiving enough to deal with four characters
 that
 apparently don't even really matter.

 You say that, but I'm pretty sure Marc4J throws errors on MARC records where
 these characters are incorrect

 Owen

 On Fri, Apr 1, 2011 at 3:51 AM, William Denton w...@pobox.com wrote:

  On 28 March 2011, Ford, Kevin wrote:
 
   I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
 I
  received line too long errors.  But, since I've been curious about
 this
  for sometime, I figured I'd take a whack at it myself.  Try this:
 
 
  This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
  and it recognized almost all of them.  A few it didn't, so I had a closer
  look, and they're invalid.
 
  For example, the Internet Archive's Binghamton catalogue dump:
 
  http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
 
  $ file -m marc.magic bgm*mrc
  bgm_openlib_final_0-5.mrc: data
  bgm_openlib_final_10-15.mrc:   MARC Bibliographic
  bgm_openlib_final_15-18.mrc:   data
  bgm_openlib_final_5-10.mrc:MARC Bibliographic
 
  But why?  Aha:
 
  $ head -c 25 bgm_openlib_final_*mrc
  == bgm_openlib_final_0-5.mrc ==
  01812cas  2200457   45x00
  == bgm_openlib_final_10-15.mrc ==
  01008nam  2200289ua 45000
  == bgm_openlib_final_15-18.mrc ==
  01614cam00385   45  0
  == bgm_openlib_final_5-10.mrc ==
  00887nam  2200265v  45000
 
  As you say, the leader should end with 4500 (as defined at
  http://www.loc.gov/marc/authority/adleader.html) but two of those files
  don't.  So they're not valid MARC.  I'm sure any decent MARC tool can
 deal
  with them, since decent MARC tools are certainly going to be forgiving
  enough to deal with four characters that apparently don't even really
  matter.
 
  So on the one hand they're usable MARC but file wouldn't say so, and on
 the
  other that's a good indication that the files have failed a basic
 validity
  test.  I wonder if there are similar situations for JPEGs or MP3s.
 
  I think you should definitely submit this for inclusion in the magic
 file.
  It would be very useful for us all!
 
  Bill
 
  P.S. I'd never used head -c (to show a fixed number of bytes) before.
  Always nice to find a new useful option to an old command.
 
 
   #
  # MARC 21 Magic  (Second cut)
 
  # Set at position 0
  0   short   0x
 
  # leader ends with 4500
 
  20  string  4500
 
 
  # leader starts with 5 digits, followed by codes specific to MARC format
 
  0   regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
  0   regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
  0   regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
  0   regex/1 (^[0-9]{5})[acdn][w]MARC Classification
  0   regex/1 (^[0-9]{5})[cdn][q] MARC Community
 
 
 
  --
  William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
 



 --
 Owen Stephens
 Owen Stephens Consulting
 Web: http://www.ostephens.com
 Email: o...@ostephens.com



Re: [CODE4LIB] MARC magic for file

2011-04-01 Thread Owen Stephens
> I'm sure any decent MARC tool can deal with them, since decent MARC tools
> are certainly going to be forgiving enough to deal with four characters that
> apparently don't even really matter.

You say that, but I'm pretty sure Marc4J throws errors on MARC records where
these characters are incorrect.

Owen



-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com


Re: [CODE4LIB] MARC magic for file

2011-03-31 Thread William Denton

On 28 March 2011, Ford, Kevin wrote:

> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
> I received line too long errors.  But, since I've been curious about
> this for sometime, I figured I'd take a whack at it myself.  Try this:


This is very nice!  Thanks.  I tried it on a bunch of MARC files I have, 
and it recognized almost all of them.  A few it didn't, so I had a closer 
look, and they're invalid.


For example, the Internet Archive's Binghamton catalogue dump:

http://ia600307.us.archive.org/6/items/marc_binghamton_univ/

$ file -m marc.magic bgm*mrc
bgm_openlib_final_0-5.mrc: data
bgm_openlib_final_10-15.mrc:   MARC Bibliographic
bgm_openlib_final_15-18.mrc:   data
bgm_openlib_final_5-10.mrc:MARC Bibliographic

But why?  Aha:

$ head -c 25 bgm_openlib_final_*mrc
== bgm_openlib_final_0-5.mrc ==
01812cas  2200457   45x00
== bgm_openlib_final_10-15.mrc ==
01008nam  2200289ua 45000
== bgm_openlib_final_15-18.mrc ==
01614cam00385   45  0
== bgm_openlib_final_5-10.mrc ==
00887nam  2200265v  45000

As you say, the leader should end with 4500 (as defined at 
http://www.loc.gov/marc/authority/adleader.html) but two of those files 
don't.  So they're not valid MARC.  I'm sure any decent MARC tool can deal 
with them, since decent MARC tools are certainly going to be forgiving 
enough to deal with four characters that apparently don't even really 
matter.
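
The check described here is small enough to write down directly. A minimal Python sketch, assuming the leader is simply the first 24 bytes of the file and that positions 20-23 are the only thing being tested; the function name is made up for illustration, and the two sample leaders are the first 24 bytes of the head -c 25 output above:

```python
def leader_ends_with_4500(leader: bytes) -> bool:
    """Return True if leader positions 20-23 hold the ASCII string '4500'."""
    # A MARC 21 leader is 24 bytes long; anything shorter cannot pass.
    if len(leader) < 24:
        return False
    return leader[20:24] == b"4500"


# First 24 bytes of two of the Binghamton files shown above.
samples = {
    "bgm_openlib_final_0-5.mrc": b"01812cas  2200457   45x0",
    "bgm_openlib_final_10-15.mrc": b"01008nam  2200289ua 4500",
}

for name, leader in samples.items():
    print(name, leader_ends_with_4500(leader))
```

The first file fails because positions 20-23 read 45x0; the second passes.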


So on the one hand they're usable MARC but file wouldn't say so, and on 
the other that's a good indication that the files have failed a basic 
validity test.  I wonder if there are similar situations for JPEGs or 
MP3s.


I think you should definitely submit this for inclusion in the magic file. 
It would be very useful for us all!


Bill

P.S. I'd never used head -c (to show a fixed number of bytes) before. 
Always nice to find a new useful option to an old command.
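
The same peek at the leader works without head, for what it's worth. A rough Python equivalent, assuming the leader is the first 24 bytes of the file; read_leader and print_leaders are invented helper names, not part of any MARC library:

```python
def read_leader(path: str, length: int = 24) -> bytes:
    """Return the first `length` bytes of a file (the leader of its first record)."""
    with open(path, "rb") as handle:
        return handle.read(length)


def print_leaders(paths) -> None:
    """Print each file name followed by its leader, loosely like `head -c`."""
    for path in paths:
        print(f"== {path} ==")
        # Leaders should be plain ASCII; replace anything that is not.
        print(read_leader(path).decode("ascii", errors="replace"))
```

Something like print_leaders(glob.glob("bgm_openlib_final_*mrc")) would then list the leaders for the whole dump.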




--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-03-28 Thread Ford, Kevin
I couldn't get Simon's MARC 21 Magic file to work.  Among other issues, I
received "line too long" errors.  But, since I've been curious about this for
some time, I figured I'd take a whack at it myself.  Try this:

#
# MARC 21 Magic  (Second cut)

# Set at position 0
0   short   0x 

# leader ends with 4500
20 string  4500

# leader starts with 5 digits, followed by codes specific to MARC format
0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification
0 regex/1 (^[0-9]{5})[cdn][q] MARC Community

I've also attached it to this email to preserve the tabs.  

In any event, I can confirm it works on MARC Bib, MARC Authority, and MARC 
Classification files I have bumping around my computer.  I've not tested it on 
MARC Holdings and MARC Community.

Do let us/me know if it works for you (and the community generally).  I can see 
about submitting it for formal inclusion in the magic file.
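
For anyone who wants to sanity-check the rules outside of file, the same logic can be mirrored in a few lines of Python. This is only a sketch: it reuses the five regular expressions from the magic above, assumes the regex lines are meant to apply only once the 4500 test has passed, and assumes the leader has already been read as 24 ASCII characters. classify_leader is an invented name. It does not reproduce the short test at offset 0, nor how file itself evaluates regex/1.

```python
import re
from typing import Optional

# (pattern, label) pairs copied from the magic entries, tried in order.
RULES = [
    (re.compile(r"^[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(r"^[0-9]{5}[acdnosx][z]"), "MARC Authority"),
    (re.compile(r"^[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(r"^[0-9]{5}[acdn][w]"), "MARC Classification"),
    (re.compile(r"^[0-9]{5}[cdn][q]"), "MARC Community"),
]


def classify_leader(leader: str) -> Optional[str]:
    """Return the label of the first matching rule, or None if nothing matches."""
    # Same precondition as the magic: leader positions 20-23 must be "4500".
    if leader[20:24] != "4500":
        return None
    for pattern, label in RULES:
        if pattern.match(leader):
            return label
    return None
```

Calling classify_leader("01008nam  2200289ua 4500") returns "MARC Bibliographic", while a leader whose positions 20-23 are not 4500 returns None, which corresponds to file falling back to "data".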

Warmly,

Kevin

--
Library of Congress
Network Development and MARC Standards Office





marc.magic
Description: marc.magic


Re: [CODE4LIB] MARC magic for file

2011-03-24 Thread Simon Spero
Some of the problems in your first cut are:

1. Offsets for regex are given in terms of lines, not bytes.  MARC files
don't have newlines in them, unless you're Millennium, in which case they
can be inserted every 200,000 bytes to keep things interesting.
2. Byte matches match byte values, so "20 byte 4" is looking for the binary
value 4, not the ASCII digit "4" (0x34).
3. Sometimes you need to prime the buffer before you can do a regex match.
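
Point 2 is easy to trip over, so a tiny illustration may help. In Python terms (a sketch of the distinction only, not of how file is implemented), a byte test against 4 compares the raw byte value, while a string test against 4500 compares ASCII characters:

```python
leader = b"01008nam  2200289ua 4500"

# The byte at offset 20 is the ASCII character "4", whose value is 0x34 (52).
byte_at_20 = leader[20]

# What "20 byte 4" asks: is the raw byte value equal to 4?
matches_binary_4 = byte_at_20 == 4

# What "20 string 4500" asks: do the bytes from offset 20 spell "4500"?
matches_string_4500 = leader[20:24] == b"4500"

print(byte_at_20, matches_binary_4, matches_string_4500)  # 52 False True
```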

Is this good enough?


# MARC 21 Magic  (First cut)
#  indicator count must be 2
10 string 2
#  leader must end in 4500
20 string 4500
#  leader must start with five digits, a record status, and a record type
0 regex ^([0-9]{5})[acdnp][acdefgijkmoprt][abcims] MARC Bibliographic
0 regex ^([0-9]{5})[acdnp][z] MARC Authority

Simon


On Wed, Mar 23, 2011 at 8:09 PM, William Denton w...@pobox.com wrote:

 Has anyone figured out the magic necessary for file to recognize MARC
 files?

 If you don't know it, file is a Unix command that tells you what kind of
 file a file is.  For example:

 $ file 101015_001.mp3
 101015_001.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS,
 layer III, v1, 192 kbps, 44.1 kHz, Stereo

 $ file P126.jpg
 P126.jpg: JPEG image data, EXIF standard, comment: AppleMark

 It's a really useful command.  I assume it's on OSX, but I don't know. You
 can get it for Windows with Cygwin.

 The problem is, file doesn't grok MARC:

 $ file catalog.01.mrc
 catalog.01.mrc: data

 I took a stab at getting the magic defined, but it didn't work.  I'll
 include what I used below.  You can put it into a magic.txt file, and then
 use

 file -m magic.txt some_file.mrc

 to test it.  It'll tell you the file is MARC Bibliographic ... but it also
 thinks that PDFs, JPEGs, and text files are MARC.  That's no good.

 It'd be great if the MARC magic got into the central magic database so
 everyone would be able to recognize various MARC file types.

 Bill


 # --- clip'n'test
 # MARC 21 for Bibliographic Data
 # http://www.loc.gov/marc/bibliographic/bdleader.html
 #
 # This doesn't work properly

 0    string  x

 5    regex   [acdnp]
 6    regex   [acdefgijkmoprt]
 7    regex   [abcims]
 8    regex   [\ a]
 9    regex   [\ a]
 10   byte    x
 11   byte    x
 12   string  x
 17   regex   [\ 12345678uz]
 18   regex   [\ aciu]
 19   regex   [\ abc]   MARC Bibliographic

 #20   byte 4
 #21   byte 5
 #22   byte 0
 #23   byte 0   MARC Bibliographic

 # --- end clip'n'test

 --
 William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org