Re: [CODE4LIB] Lightweight IR infrastructure

2017-10-25 Thread Tom Cramer
Josh, None of those pieces is an IR, but do you think that when taken as a whole they could comprise an IR? Yes. I think it’s very healthy to think of the IR as a set of services, rather than a single software product. And I really like the idea of using your catalog as the discovery environme

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Hannah Calkins
If the issue is that you don't want to actually have to separate out additional columns and/or use some sort of temp table to hold them, you can work around that by pulling out some substrings in the order by. I work primarily in SQL but I'm pretty sure patindex ("pattern index") works in MySQL a

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Will Martin
Wow, options popping out of the woodwork left and right! We'll probably try out several of these and see which suits best. Thanks a lot, everyone! Will On 2017-10-25 16:31, Ray Voelker wrote: Figured I'd chime in with something I spent entirely way too much time on (probably). A JavaScript c

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Ray Voelker
Figured I'd chime in with something I spent entirely way too much time on (probably). A JavaScript class to normalize and sort LC call numbers https://github.com/rayvoelker/js-loc-callnumbers Not sure if that helps you in your particular situation, but it might give you a place to start along wit

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Craig Dietrich
If you're using PHP you could use natsort() on the result, http://php.net/manual/en/function.natsort.php Alternatively, you could try ordering by the length of the VARCHAR field first, ORDER BY LENGTH(field), field Sent from mobile > On Oct 25, 2017, at 2:11 PM, Jodie Gambill wrote: > > Hi

Re: [CODE4LIB] Lightweight IR infrastructure

2017-10-25 Thread Josh Welker
Hi Bryan, I agree that a repository is more than documents, and in this model we would still do metadata, indexing, etc. It would just be handled by a different piece. Instead of having one system that does it all (like DSpace), we'd use the library catalog for metadata and indexing, backup tools

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Ken Irwin
Will -- I use this sortLC PHP script that I ported from Michael Doran's Perl version: PHP: https://github.com/kenirwin/Weeding-Helper/blob/master/sortLC.php Perl: https://rocky.uta.edu/doran/sortlc/ They have their flaws, but I find them to work pretty well. I hope this helps! Ken -Origi

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Jodie Gambill
Hi Will - I had a similar task a few years ago on a small project, though we only used the classMark and classNum (from your example) parts of the call number for what we needed. I implemented it as you outlined above, with two separate fields to enable proper sorting -- classMark as varchar and cl

Re: [CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Benjamin Florin
The best way is probably to normalize the call numbers into a sortable string outside of MySQL, save that string to a sortable_callnumber field in your database, and sort by that. Normalizing call numbers (http://robotlibrarian.billdueber.com/2008/11/normalizing-loc-call-numbers-for-sorting/) turn
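The idea above (build a sortable string outside the database, store it, and ORDER BY it) could look roughly like the following Python sketch. It is not the algorithm from the linked post: the regular expression, the padding widths, and the function name sortable_callnumber are all assumptions made for illustration, and real LC call numbers have more edge cases than this handles.

```python
import re

# Assumed, simplified shape of an LC call number:
# 1-3 class letters, a 1-4 digit class number, an optional decimal part,
# then everything else (cutters, dates, volume info) treated as one tail.
_LC_PATTERN = re.compile(r"^\s*([A-Za-z]{1,3})\s*(\d{1,4})(?:\.(\d+))?\s*(.*)$")


def sortable_callnumber(callnumber: str) -> str:
    """Return a string that sorts as plain text in roughly LC shelf order."""
    match = _LC_PATTERN.match(callnumber)
    if match is None:
        # Not recognisable as LC: fall back to the trimmed, uppercased input.
        return callnumber.strip().upper()
    letters, whole, decimal, rest = match.groups()
    # Pad letters with spaces so "Q" sorts before "QA" (space < "A").
    letters = letters.upper().ljust(3)
    # Zero-pad the class number so 9 sorts before 76.
    whole = whole.zfill(4)
    # Right-pad the decimal part so .73 sorts before .9.
    decimal = (decimal or "").ljust(6, "0")
    # Collapse whitespace in the tail; cutters already compare decimally as text.
    rest = re.sub(r"\s+", " ", rest.strip().upper())
    return f"{letters}{whole}.{decimal} {rest}".rstrip()


if __name__ == "__main__":
    shelf = ["QA76.9 .D3", "QA9 .B5", "Q100 .A1", "QA76.73 .P98"]
    for number in sorted(shelf, key=sortable_callnumber):
        print(number)
```

The returned value is what would be saved to the sortable_callnumber column, so the database only ever has to do an ordinary string sort.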

Re: [CODE4LIB] Lightweight IR infrastructure

2017-10-25 Thread Bryan Brown
Josh, There's nothing wrong with what you are describing if it's all your institution needs, but I would be careful about promoting that as an IR. An IR is much more than a bunch of documents. The metadata modelling, preservation features and indexing that you want to leave out are what make it

[CODE4LIB] Sorting LC call numbers in MySQL

2017-10-25 Thread Will Martin
We have a small web app with a MySQL backend, containing lists of books that have been reported lost and tracking our efforts at locating them. Access Services requested that the list of currently missing books be sorted according to LC call number. So we did, but the results are ordered in t

Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Eric Lease Morgan
Thank you for all the replies. It all makes me feel as if I’m on the right track. Again, thank you. —Eric M.

[CODE4LIB] Lightweight IR infrastructure

2017-10-25 Thread Josh Welker
We're a mid-sized university library (10,000 FTE) trying to get an IR off the ground to showcase student and faculty research. We've had a DSpace instance running for several years, but we use so few of its features that DSpace ends up being more trouble than it is worth. In particular, it's very f

Re: [CODE4LIB] Call For Proposals: Forum on Ethics and Archiving the Web

2017-10-25 Thread Ed Summers
> On Oct 25, 2017, at 6:55 AM, Edward Summers wrote: > > Recognizing that ethics and web archiving is a rapidly evolving field and > that it might not fit directly into your primary work/research interests we > wanted to keep the proposal process simple. We just need 100 words from you > abou

Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Andromeda Yelton
It turns out it's straightforward to reimplement the default fingerprinting algorithm that OpenRefine uses. We did that here to help catch those sorts of trivial spelling differences in user searches in order to provide best-bet suggestions for some of our most popular stuff. Here's my reimplementa
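For readers who have not seen fingerprint keying, here is a minimal Python sketch of the general technique. It is not the reimplementation mentioned above (the link is cut off in this archive view); it follows the steps OpenRefine documents for its key-collision fingerprint method, and the names fingerprint and cluster are assumptions for illustration. It strips only ASCII punctuation.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so trivially different spellings collide."""
    # Trim, lowercase, and decompose accented characters (e.g. "é" -> "e" + accent).
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop the combining accent marks left by the decomposition.
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove ASCII punctuation (an assumption; other punctuation is left alone).
    cleaned = "".join(c for c in unaccented if c not in string.punctuation)
    # Split on whitespace, de-duplicate, sort, and rejoin the tokens.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; return only groups of 2 or more."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    places = ["South Bend, IN", "south bend in", "IN South Bend.", "Notre Dame"]
    print(cluster(places))
```

Because token order and punctuation are discarded, this catches reordered or differently punctuated strings, but not genuine misspellings; those need a distance-based method.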

Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Kyle Banerjee
On Wed, Oct 25, 2017 at 8:57 AM, Eric Lease Morgan wrote: > ...My bibliographic data is fraught with inconsistencies. For example, a > publisher’s name may be recorded one way, another way, or a third way. The > same goes for things like publisher place: South Bend; South Bend, IN; > South Bend,

Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Chad Nelson
Eric, You can actually use OpenRefine programmatically. There are multiple client libraries for it. https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Developers#known-client-libraries-for-refine Trevor Muñoz wrote a great blog post about his work doing just that http://trevormunoz

Re: [CODE4LIB] [EXT] Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Terry Reese
I actually love the approach Mark writes about here. It was partly what inspired me to do this work in MarcEdit -- albeit, in a light-weight way -- so as not to incur any additional dependencies. --tr On Wed, Oct 25, 2017 at 12:23 PM, Phillips, Mark wrote: > Of possible interest is some work we've

Re: [CODE4LIB] [EXT] Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Phillips, Mark
Of possible interest is some work we've done to take the clustering capabilities of OpenRefine and bake them into our metadata editing interface for The Portal to Texas History and the UNT Digital Library. We've focused a bit on interfaces which might be of interest. I've written a bit about i

Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Péter Király
Hi Eric, I am planning to work on detecting such anomalies. So far I have thought about the following approaches: - n-gram analysis - basket analysis - similarity detection of Solr - finite-state automaton The tools I will use: Apache Solr and Apache Spark. I haven't yet started the implement

Re: [CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Terry Reese
Unfortunately -- not in a language you likely would want to use. But I've been working on doing this in MarcEdit 7, and to do it, I found that I got a lot of mileage using the Levenshtein distance algorithm (which I prefer). You can usually find these in a variety of languages. The approach that

[CODE4LIB] clustering techniques for normalizing bibliographic data

2017-10-25 Thread Eric Lease Morgan
Has anybody here played with any clustering techniques for normalizing bibliographic data? My bibliographic data is fraught with inconsistencies. For example, a publisher’s name may be recorded one way, another way, or a third way. The same goes for things like publisher place: South Bend; Sout

Re: [CODE4LIB] Fiscal continuity vote now open [radical idea]

2017-10-25 Thread EDWIN VINCENT SPERR
Once again, this is a periodic reminder of why *formally* organizing (which may or may not involve incorporation in a State) is such a great idea. This kind of thing (Who is a member of the community? How do they vote? How do you determine whether a vote is properly held and binding?) is *preci

[CODE4LIB] Call For Proposals: Forum on Ethics and Archiving the Web

2017-10-25 Thread Edward Summers
Forum on Ethics and Archiving the Web New Museum, New York City, March 22-24 Proposals due by November 14 (funding available) http://rhizome.org/editorial/2017/oct/24/open-call-national-forum-on-ethics-and-archiving-the-web/ The dramatic rise in the public’s use of the web and social media to docu