Okay, looking more carefully at the sortkey Ben pointed out, you really do have 
two different problems affecting your sort.  Sorry for focusing on the smaller 
one!

Everything I advocated earlier still stands, but in the meantime, we do need to 
fix the misplaced padding in the 'no decimal but we have a year' case.
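
To see the padding problem on its own, here is a minimal sketch that sorts three label_sortkey values copied from Paul's query output further down this thread. The %repadded hash and the helper name labels_by_sortkey are illustrative assumptions only: the "repadded" key is a guess at what the sortkey would look like with the zeros in the empty decimal slot rather than after the year, not output from any real normalizer.

```perl
use strict;
use warnings;

# label => label_sortkey, copied from the label_class 2 rows in Paul's
# query output further down this thread.
my %observed = (
    '720 a'          => '720_000000000000000_A',
    '720.1 H74 1979' => '720_100000000000000_H74_1979',
    '720 H74 1979'   => '720_H74_197900000000000',
);

# ASSUMPTION: same data, but with the zero padding moved to the (empty)
# decimal slot instead of the year. A guess at the intended key, not
# output from any real normalizer.
my %repadded = (
    %observed,
    '720 H74 1979' => '720_000000000000000_H74_1979',
);

# Return the labels ordered by a plain string comparison of their sortkeys.
sub labels_by_sortkey {
    my ($key_for) = @_;
    return sort { $key_for->{$a} cmp $key_for->{$b} } keys %$key_for;
}

print join(' | ', labels_by_sortkey(\%observed)), "\n";
print join(' | ', labels_by_sortkey(\%repadded)), "\n";
```

With the observed keys, "720 H74 1979" lands after "720.1 H74 1979", because "H" compares higher than any digit in the position where the decimal padding should be. With the repadded key it falls back in front of the 720.1 entry.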

Dan


Daniel Wells
Library Programmer/Analyst
Hekman Library, Calvin College
616.526.7133

-----Original Message-----
From: open-ils-general-boun...@list.georgialibraries.org 
[mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Dan 
Wells
Sent: Friday, April 25, 2014 10:15 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization

Hello Paul,

You've pretty much nailed your problem as being the extra decimals before your 
Cutters.  While that's normal for an LC call number, I've looked around and 
found nothing to make me believe that's common or accepted practice for DDC.

That said, as far as I can tell, the "standard" format for Dewey is:

[Dewey Decimal Number] [Whatever else you want to make it unique]

In my experience, the second part is *usually* the first few letters of the 
author's last name, or a cutter-ized version of the same.  Can anyone point to 
an authoritative source on how to build the non-DDC part of the call number?  
It would be a great help if we could at least reference something and say "this 
is what our normalizer supports."

Naturally, if we can cook up a normalizer that works 100% with our agreed-upon 
form (whatever that might be), yet is also flexible enough to accommodate 
variances, we absolutely should do that.  I also think the code you included 
here is on the right track for being more flexible.  Still, our first step must 
be to establish a canonical supported format before we consider any code to 
handle exceptions.

Thanks,
Dan


Daniel Wells
Library Programmer/Analyst
Hekman Library, Calvin College
616.526.7133

-----Original Message-----
From: open-ils-general-boun...@list.georgialibraries.org 
[mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Paul 
Hoffman
Sent: Friday, April 25, 2014 4:45 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] dewey call number normalization

On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
> This will be a slightly more technical answer that may require some 
> direct database access to ascertain more details.
> 
> You mention that the call numbers are identified as DDC.  So that's 
> label_class of 2, I believe.  We are using Dewey (DDC) for all of our 
> materials by default as well in our consortium.
> 
> I'd be curious to know what the label_sortkey values were for those 
> call numbers you mention.  That field is what actually drives the 
> sorting values for a given set.

Here's what our DB shows (Adam and I work together):

SELECT   label_class, label, label_sortkey
FROM     asset.call_number
WHERE    label_sortkey like '720%'
ORDER BY label_sortkey;

 label_class |      label      |         label_sortkey         
-------------+-----------------+-------------------------------
           1 | 720 H47 1979    | 720 H47 1979
           2 | 720 a           | 720_000000000000000_A
           2 | 720 .H47        | 720_000000000000000__H47
           2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
           2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
           2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
           2 | 720 H74 1979    | 720_H74_197900000000000
           2 | 720 .H47 1980   | 720__H47_198000000000000
           2 | 720 .H47 1980   | 720__H47_198000000000000
           2 | 720 .H74 1979   | 720__H74_197900000000000
(10 rows)

So the problem appears to be caused by the periods that sometimes occur before 
the Cutter number.  I don't know if that's kosher or not, but I can see that it 
occurs plenty in our (Voyager) catalog.

Looking at the function asset.label_normalizer_dewey it seems to me that it can 
be done much more simply and efficiently if you leverage the fact that space 
(ASCII 32) and tilde (ASCII 126) come before and after (respectively) anything 
else meaningful that might be found in a call number.  Except periods, which 
complicate things.  Anyhow, here's a first stab at it:

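use strict;
use warnings;
sub ddcnorm {
    local $_ = uc shift;
    # Strip leading or trailing space and any slashes or apostrophes
    s/^\s+|\s+$|[\/']//g;
    # Insert a space at digit/non-digit boundaries
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    # Replace some punctuation with a space
    tr/-/ /;  # XXX What else?
    # Strip extra junk -- XXX make this work on non-ASCII call numbers
    tr/A-Za-z0-9. //cd;
    # A period with a space on each side is normally the decimal point
    # ("720.1" is "720 . 1" by now); turn it into a tilde so it sorts last
    s/ \. /~/g;
    # Any other period (e.g. the one before a Cutter) becomes a space
    s/ \.|\. / /g;
    # Squeeze runs of spaces down to a single space
    tr/ //s;
    return $_;
}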

When I run our Deweys in the 720s through this, I get what seems to be the 
right order:

           2 | 720 a           | 720 A
           2 | 720 .H47        | 720 H 47
           2 | 720 .H47 1980   | 720 H 47 1980
           2 | 720 .H47 1980   | 720 H 47 1980
           2 | 720 .H74 1979   | 720 H 74 1979
           2 | 720 H74 1979    | 720 H 74 1979
           2 | 720.1 .H47 1980 | 720~1 H 47 1980
           2 | 720.1 .H74 1979 | 720~1 H 74 1979
           2 | 720.1 H74 1979  | 720~1 H 74 1979
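
For anyone who wants to reproduce this ordering, the following is a minimal sketch of how the function might be used as a sort key. ddcnorm is copied from the message above so the example runs on its own; sort_by_ddcnorm is a hypothetical helper name, not anything that exists in Evergreen.

```perl
use strict;
use warnings;

# ddcnorm as posted above, reproduced so this sketch is self-contained.
sub ddcnorm {
    local $_ = uc shift;
    s/^\s+|\s+$|[\/']//g;
    s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
    tr/-/ /;
    tr/A-Za-z0-9. //cd;
    s/ \. /~/g;
    s/ \.|\. / /g;
    tr/ //s;
    return $_;
}

# Hypothetical helper: sort labels by their normalized key, computing each
# key once and falling back to the raw label to break ties.
sub sort_by_ddcnorm {
    my @labels = @_;
    my %key_for;
    for my $label (@labels) {
        $key_for{$label} = ddcnorm($label);
    }
    return sort { $key_for{$a} cmp $key_for{$b} or $a cmp $b } @labels;
}

my @labels = ('720.1 H74 1979', '720 a', '720 .H47 1980', '720 .H47');
printf "%-16s | %s\n", $_, ddcnorm($_) for sort_by_ddcnorm(@labels);
```

The comparison is a plain string cmp, so the ordering relies only on space sorting before, and tilde after, the letters and digits that survive normalization.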

If there's any interest, I'll run our entire set of Deweys through it and see 
if I can make sense of the results.  Hmm... should prefixes like "j" or "C"
(juvenile or Canadian) be ignored?
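
If the answer turns out to be "yes, ignore them", one possible approach is to strip the prefix before normalizing. The sketch below is only an illustration under stated assumptions: strip_ddc_prefix is a hypothetical name, and the rule it applies (a run of letters sitting directly in front of a three-digit class number) is an assumption about what such prefixes look like, not an established convention.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading alphabetic prefix such as "j" or "C"
# when it is followed (optionally after spaces) by a three-digit class number.
# ASSUMPTION: prefixes are letters only and the class number has at least
# three digits; labels that do not fit this shape are returned unchanged.
sub strip_ddc_prefix {
    my ($label) = @_;
    $label =~ s/^\s*[A-Za-z]+\s*(?=[0-9]{3})//;
    return $label;
}

print strip_ddc_prefix($_), "\n" for ('j 720.1 H74', 'C813.54 A1', '720 a');
```

Whether the prefix should be dropped entirely or moved to the end of the key as a tiebreaker is a policy decision this sketch does not try to settle.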

Paul.

> On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <a...@flo.org> wrote:
> 
> > Hi Everyone,
> >
> > We are testing in Evergreen 2.5.2.
> >
> > I'm noticing what I think looks like incorrect behavior when using 
> > the call number browse feature.
> >
> > Doing a call number browse search for 720 results in the following 
> > call number sort order:
> >
> > 720 H47 1979
> > 720 .H47
> > 720.1 H74 1979
> > 720.1 .H47 1980
> > 720.1 .H74 1979
> > 720 H74 1979
> > 720 .H74 1979
> >
> >
> > It looks like the decimal point might be throwing things off. I 
> > think that should be taken care of in a normalizer, but maybe there 
> > is a reason not to. I think the 720.1's should come at the end of 
> > this list, regardless of the decimal point before the cutter.
> >
> > All of the call numbers are identified as DDC.
> >
> > You can probably replicate this here:
> > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
> >
> >
> > I didn't see any bug reports that seemed to address this specific 
> > issue, so I'm wondering if there could be something else causing this 
> > behavior.
> >
> > thanks,
> > Adam
> >
> > --
> >
> > Adam Shire
> > Member Services Librarian
> > Fenway Libraries Online <http://flo.org>
> > 617-442-2384
> >
> 
> 
> 
> --
> Benjamin Shum
> Evergreen Systems Manager
> Bibliomation, Inc.
> 24 Wooster Ave.
> Waterbury, CT 06708
> 203-577-4070, ext. 113

--
Paul Hoffman <p...@flo.org>
Systems Librarian
Fenway Libraries Online
c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 442-2384 (FLO main number)
