Re: [CODE4LIB] Vote for NE code4lib meetup location
Looking into the space/time issue this week, folks. I promise.

-- Michael B. Klein
Digital Initiatives Technology Librarian
Boston Public Library
(617) 859-2391
[EMAIL PROTECTED]

From: Jay Luker [EMAIL PROTECTED]
Reply-To: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
Date: Wed, 15 Oct 2008 16:48:12 -0400
To: Code for Libraries CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Vote for NE code4lib meetup location

Sorry to leave you all in suspense all day. The results are in:

23 Boston, MA
18 Northampton, MA
14 Concord, NH
11 Portland, ME

Michael Klein has said he will now check when a suitable space will be available at BPL. Then we'll update the WhenIsGood page and hope for some availability intersection goodness.

--jay
Re: [CODE4LIB] OCR PDFs
It's not exactly what you're looking for, but Microsoft Office comes with a scriptable OCR engine that works on TIFFs. I use it to get text from yearbooks we are scanning so people can look for names and such. While I wouldn't put it on par with ABBYY, it does a pretty decent job. I wrote a simple script in VBScript that scans all the TIFF files in a folder and exports a .txt file, with the same name as the image, containing all of the text it finds. If you want it, let me know and I'll send it your way.

Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]

---
This message may contain confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system.

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of James Tuttle
Sent: Friday, October 17, 2008 7:57 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] OCR PDFs

I wonder if any of you might have experience with creating text PDFs from TIFFs. I've been using tiffcp to stitch TIFFs together into a single image and then using tiff2pdf to generate PDFs from the single TIFF. I've had to pass this image-based PDF to someone with Acrobat to use its batch-processing facility to OCR the text and save a text-based PDF. I wonder if anyone has suggestions for software I can integrate into the script (Python on Linux) I'm using.
Thanks, James

--
James Tuttle
Digital Repository Librarian
NCSU Libraries, Box 7111
North Carolina State University
Raleigh, NC 27695-7111
[EMAIL PROTECTED]
(919) 513-0651 Phone
(919) 515-3031 Fax
[CODE4LIB] marc4j 2.4 released
Dear Code4Libbers,

I'm very pleased to announce that for the first time in almost two years there has been a new release of marc4j. Release 2.4 is a minor release in the sense that it shouldn't break any existing code, but it's a major release in the sense that it represents an influx of new people into the development of this project, and a significant improvement in marc4j's ability to handle malformed or mis-encoded marc records.

Release notes are here: http://marc4j.tigris.org/files/documents/220/44060/changes.txt

And the project website, including download links, is here: http://marc4j.tigris.org/

We've been using this new marc4j code in solrmarc since solrmarc started, so if you're using Blacklight or VuFind, you're probably using it already, just in an unreleased form.

Bravo to Bob Haschart, Wayne Graham, and Bas Peters for making these improvements to marc4j and getting this release out the door.

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129, Alderman Library
University of Virginia
Charlottesville, VA 22904
[EMAIL PROTECTED]
(434) 243-2305
[CODE4LIB] Call for Book Reviewers Announcing a New Reviews Editor: Journal of Web Librarianship
Please excuse cross-postings!

The Journal of Web Librarianship is pleased to announce Lisa Ennis and Nicole Mitchell as new co-editors of the reviews section beginning with volume 3, issue 1. Lisa is the Systems Librarian at UAB's Lister Hill Library of the Health Sciences. She received her M.A. in History from Georgia College & State University and her M.S. in Information Sciences from the University of Tennessee. Nicole is a reference librarian at UAB's Lister Hill Library of the Health Sciences. She received her M.A. in History from Georgia College & State University and her M.L.I.S. from the University of Alabama.

We are currently seeking contributors to the reviews section! The Journal of Web Librarianship scope, which extends to the reviews section, includes:

* web page design
* usability testing of library or library-related sites
* cataloging or classification of Web information
* international issues in web librarianship
* scholars' use of the web
* information architecture
* RSS feeds, podcasting, blogs, and other 2.0 technologies
* search engines
* the history of libraries and the web
* emerging and future aspects of web librarianship

New reviewers, including MLIS students and recent graduates, are welcome to apply. If you are interested, please email Lisa and Nicole at [EMAIL PROTECTED] with a copy of your resume, statement of interest, and a writing sample (preferably of a review).

Jody Fagan
Editor, The Journal of Web Librarianship
http://www.lib.jmu.edu/org/jwl
James Madison University
Preferred email: [EMAIL PROTECTED]
Re: [CODE4LIB] 2009 Conference Registration Rates?
We're still working to line up sponsors, but we hope to be able to keep the registration fee the same as last year: $125. Room rate at the conference hotel is $135 plus tax (free internet in guest rooms).

Jean Rainwater
Brown University Library
Providence, RI 02912

On Mon, Oct 20, 2008 at 11:45 AM, John Nowlin [EMAIL PROTECTED] wrote:

Where is the information for the 2009 conference registration fee? I need this to get the travel request completed.

John Nowlin
College Center for Library Automation (cclaflorida.org)
Tallahassee, FL 32310 US - 850.922.6044

"Always vote for principle, though you may vote alone, and you may cherish the sweetest reflection that your vote is never lost." -- John Quincy Adams
[CODE4LIB] Mashed Library UK 2008 - registration is open
I posted a little while ago that I was organising a 'Mashed Libraries' event. Well, registration for the event is now open at http://www.ukoln.ac.uk/events/mashed-library-2008/

There is no charge for the day, thanks to my employer (Imperial College London), sponsorship from UKOLN (http://www.ukoln.ac.uk), and the donation of time and space from Birkbeck College London (esp. thanks to David Flanders for this).

Although the day is intended to be reasonably informal, there is a loose schedule that looks like this:

10am: Start
10-11: Dummies' guide to ... (some short presentations on some of the tech/tools that might be of use during the day - I've got some topics, but post requests/suggestions at http://mashedlibrary.ning.com/forum/topic/show?id=2186716%3ATopic%3A5)
11-4: Mashup - work in teams or individually to do interesting stuff
4-5: Round-up of mashups and close

Food and drink will be supplied throughout the day as necessary (again, no charge).

Registration closes on 14th November. I've had some interest shown in remote participation, and I'm happy to see what we can do to support this, although I'm not quite sure what form this participation should take - if you are interested, please post at http://mashedlibrary.ning.com/forum/topic/show?id=2186716%3ATopic%3A127 and I'll see what I can do (not promising anything at this stage, though!)

Hope to see some of you there.

Best wishes,
Owen

Owen Stephens
Assistant Director: eStrategy and Information Resources
Imperial College London
[EMAIL PROTECTED]
Re: [CODE4LIB] marc4j 2.4 released
Very cool! I noticed that a feature, MarcDirStreamReader, is capable of iterating over all marc record files in a given directory.

Does anyone know of any de-duplicating efforts done with marc4j? For example, libraries that have similar holdings would have their records merged into one record with a location tag somewhere. I know places do it (consortia, etc.) but I haven't been able to find a good open program that handles stuff like that.

Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]
[CODE4LIB] JOB ADVERTISEMENT- Web Applications Developer, VCU Libraries
Web Applications Developer. Virginia Commonwealth University Libraries seeks faculty candidates for advancing the state of the art in the library's Web environment, making it a rich, functional, and highly engaging experience for the VCU community of users. Position reports to the Web Applications Manager. ALA-accredited graduate degree or accredited graduate degree in an appropriate discipline required. Salary commensurate with experience, not less than $48,000. Review of applications will begin November 24, 2008, and will continue until the position is filled. Preferred qualifications, application procedures, and other information are available in the complete position description at http://www.library.vcu.edu/admin/webappdev.html.

VCU is Virginia's largest university and one of the nation's leading research institutions. It is located in historic and dynamic Richmond, Virginia, convenient to the beauty of the Blue Ridge Mountains, the recreation destinations of the Atlantic Ocean and the Chesapeake Bay, and the cultural resources of Washington, D.C. Virginia Commonwealth University is an Equal Opportunity/Affirmative Action Employer. Women, minorities, and persons with disabilities are encouraged to apply.

--
Jimmy Ghaphery
Head, Library Information Systems
VCU Libraries
http://www.library.vcu.edu
Re: [CODE4LIB] marc4j 2.4 released
Hi, Mike. I don't know of any off-the-shelf software that does de-duplication of the kind you're describing, but it would be pretty useful. That would be awesome if someone wanted to build something like that into marc4j.

Has anyone published any good algorithms for de-duping? As I understand it, if you have two records that are 100% identical except for holdings information, that's pretty easy. It gets harder when one record is more complete than the other, and very hard when one record has even slightly different information than the other, to tell whether they are the same record and decide whose information to privilege. Are there any good de-duping guidelines out there? When a library contracts out the de-duping of their catalog, what kind of specific guidelines are they expected to provide? Anyone know?

I remember the Open Library folks were very interested in this question. Any Open Library folks on this list? Did that effort to de-dupe all those contributed marc records ever go anywhere?

Bess
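The "easy case" Bess describes, records identical apart from holdings, can be sketched roughly as follows. The records here are plain Python dicts with hypothetical field names, not marc4j or any library's objects; the normalization is deliberately crude.

```python
from collections import defaultdict

def normalize(value):
    """Crude match normalization: lowercase, drop punctuation and spaces."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def merge_key(record):
    """Match key built from fields that should agree on true duplicates.
    `record` is a plain dict (hypothetical schema), not a MARC object."""
    return (normalize(record.get("title", "")),
            normalize(record.get("author", "")),
            record.get("lccn") or record.get("isbn") or "")

def dedupe(records):
    """Group records by match key; merge holdings from duplicates
    into a single surviving record per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[merge_key(rec)].append(rec)
    merged = []
    for dupes in groups.values():
        keeper = dict(dupes[0])
        keeper["holdings"] = [h for r in dupes for h in r.get("holdings", [])]
        merged.append(keeper)
    return merged
```

The hard cases Bess raises (one record more complete, or slightly different) are exactly where a key like this breaks down and fuzzy matching or human-written merge rules take over.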
Re: [CODE4LIB] marc4j 2.4 released
To me, de-duplication means throwing out some records as duplicates. Are we talking about that, or are we talking about what I call work-set grouping, and others (erroneously, in my opinion) call FRBRization?

If the latter, I don't think there is any mature open source software that addresses that yet. Or, for that matter, any proprietary for-purchase software that you could use as a component in your own tools. Various proprietary software includes a work-set grouping feature in its black box (AquaBrowser, Primo, and, I believe, the VTLS ILS). But I don't know of anything available to do it for you in your own tool.

I've been just starting to give some thought to how to accomplish this, and it's a bit of a tricky problem on several grounds, including computationally (doing it in a way that performs efficiently). One choice is whether you group records at the indexing stage or on demand at the retrieval stage. Both have performance implications: we really don't want to slow down retrieval OR indexing. Usually, if you have the choice, you put the slowdown at indexing, since it only happens once in abstract theory. But in fact, with what we do, indexing that's already been optimized and does not have this feature can take hours or even days with some of our corpuses, and we do re-index from time to time (including 'incremental' addition of new and changed records), so we really don't want to slow down indexing either.

Jonathan

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
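One way to realize the "group at the indexing stage" option Jonathan mentions is to compute and store a grouping key on each record at index time, so that retrieval only has to collapse on a single stored field. A rough sketch, with a normalization recipe that is our own guess rather than any product's algorithm:

```python
import hashlib
import re

def work_key(title, author):
    """Compute a stable work-set grouping key at indexing time, so that
    retrieval-time grouping is a cheap collapse on one stored field.
    The normalization recipe here is illustrative, not a standard."""
    def norm(s):
        s = re.sub(r"[^a-z0-9 ]", "", s.lower())
        # Drop a leading English article so "The Hobbit" and "Hobbit" agree.
        s = re.sub(r"^(the|a|an) ", "", s)
        return " ".join(s.split())
    basis = norm(title) + "|" + norm(author)
    # Hash so the stored key has a fixed, index-friendly size.
    return hashlib.md5(basis.encode("utf-8")).hexdigest()
```

The trade-off Jonathan describes shows up directly: this costs a little per record at index time, but retrieval grouping becomes a plain field-collapse query.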
Re: [CODE4LIB] marc4j 2.4 released
Terry Reese wrote a program called RobertCompare a few years back (http://oregonstate.edu/~reeset/marcedit/html/robertcompare.html) that could compare MARC records and tell you about differences. Perhaps that would be useful.

kyle

--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[EMAIL PROTECTED] / 541.359.9599
[CODE4LIB] de-dupping (was: marc4j 2.4 released)
I've wondered if standard number matching (ISBN, LCCN, OCLC, ISSN ...) would be a big piece. Isn't there such a service from OCLC, and another flavor of something-or-other from LibraryThing?

- Naomi

Naomi Dushay
[EMAIL PROTECTED]
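Standard-number matching of the kind Naomi suggests mostly comes down to normalizing identifiers so variant forms of the same number compare equal; for ISBNs that means converting ISBN-10s to ISBN-13 form (prefix 978, recompute the check digit). A small sketch; the function name is ours:

```python
def isbn13(isbn):
    """Normalize an ISBN-10 or ISBN-13 string to bare ISBN-13 form,
    so both printed forms of the same number work as one match key."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 13:
        return digits
    if len(digits) != 10:
        raise ValueError("not an ISBN: %r" % isbn)
    core = "978" + digits[:9]  # drop the ISBN-10 check digit, add prefix
    # ISBN-13 check digit: alternate weights 1 and 3, mod 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)
```

For example, isbn13("0-306-40615-2") and isbn13("9780306406157") return the same key, so the two records match on it.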
Re: [CODE4LIB] de-dupping (was: marc4j 2.4 released)
Hi all: My student, Yee Fan Tan, and I published a short technical column on record linkage tasks (very similar to the de-dup task discussed here) in February in the Communications of the ACM:

Min-Yen Kan and Yee Fan Tan (2008). Record matching in digital library metadata. Communications of the ACM, Technical Opinion column, pp. 91-94, February. http://doi.acm.org/10.1145/1314215.1314231

We're in the process of releasing a tool/demo for de-dup tasks, as a Java library (jar). If there's sufficient interest, we might try to cater some of our string similarity metrics to MARC or other catalog data.

Cheers, Min

--
Min-Yen KAN (Dr) :: Assistant Professor :: National University of Singapore :: School of Computing, AS6 05-12, Law Link, Singapore 117590 :: 65-6516 1885 (DID) :: 65-6779 4580 (Fax) :: [EMAIL PROTECTED] (E) :: www.comp.nus.edu.sg/~kanmy (W)

Important: This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately; you should not copy or use it for any purpose, nor disclose its contents to any other person. Thank you.
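For readers curious what the string similarity metrics used in record linkage look like in practice, edit distance is the usual starting point. A minimal, illustrative Python sketch (not the authors' released tool):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings,
    keeping only one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Scale edit distance into [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

A record-linkage pipeline would apply a metric like this to normalized title and author fields and accept pairs above a tuned threshold.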
[CODE4LIB] XML Workshop
This is being shamelessly cross-posted; all apologies for full mailboxes!

WEB DEVELOPMENT WITH XML: DESIGN AND APPLICATIONS, JAN. 5-9, 2009, CHAPEL HILL, NC

Washington, DC. The Association of Research Libraries (ARL) is pleased to offer once again an in-depth workshop focused on Web development with XML. Taught by experienced XML developers from the libraries of Brown University, the University of Virginia, and the Virginia Foundation for the Humanities, this five-day workshop will explore XML with a specific focus on fundamentals of design, markup, and use. Participants will use XML and related technologies in the creation of a prototype digital publication. In addition, the University of North Carolina at Chapel Hill Libraries will host a reception and tour of their new Carolina Digital Library and Archive.

Topics to be covered include:

1. XML: What is it? How does it differ from SGML and HTML?
2. Working with content models (primarily XML Schema) and methods of using them when constructing and validating XML
3. Implementing methods of content transformation and delivery (using XSL and XPath) so the XML we build can be delivered, read, and used in a variety of formats
4. Using XML applications such as XQuery and eXist to further utilize XML capabilities and technologies in a Web environment

DATE AND LOCATION
January 5-9, 2009
University of North Carolina at Chapel Hill, 247 Davis Library, Chapel Hill, NC

PRESENTERS
Matthew Gibson, Managing Editor, Encyclopedia Virginia
Christine Ruotolo, Digital Service Manager, University of Virginia Library
Patrick Yott, Director, Center for Digital Initiatives, Brown University

Matthew, Christine, and Patrick have taught XML courses in collaboration with the ARL Statistics and Measurement program since 2002. This will be their seventh collaborative event.

REGISTRATION
Register by December 1, 2008, at http://www.arl.org/stats/statsevents/index.shtml. Members of ARL and TRLN libraries pay a registration fee of $850; non-members pay $1,275.
These prices do not include meals or housing for the event. ARL has reserved a block of rooms at the Carolina Inn, a nearby hotel, until November 20, 2008. The rooms cannot be guaranteed after this date. For reservations, call 800-962-8519 and identify yourself as part of the Association of Research Libraries group.

AUDIENCE
There are no prerequisites for this workshop.

QUESTIONS?
For more information, please contact Kristina Justh, [EMAIL PROTECTED].

--
The Association of Research Libraries (ARL) is a nonprofit organization of 123 research libraries in North America. Its mission is to influence the changing environment of scholarly communication and the public policies that affect research libraries and the diverse communities they serve. ARL pursues this mission by advancing the goals of its member research libraries, providing leadership in public and information policy to the scholarly and higher education communities, fostering the exchange of ideas and expertise, and shaping a future environment that leverages its interests with those of allied organizations. ARL is on the Web at http://www.arl.org/.

Triangle Research Libraries Network (TRLN) is a collaborative organization of Duke University, North Carolina Central University, North Carolina State University, and the University of North Carolina at Chapel Hill, the purpose of which is to marshal the financial, human, and information resources of their research libraries through cooperative efforts in order to create a rich and unparalleled knowledge environment that furthers the universities' teaching, research, and service missions. TRLN is on the Web at http://www.trln.org/.