Huh, so the lines you shared seem strange and not what I would have expected, especially:
2 | 720 H74 1979 | 720_H74_197900000000000

For the year to be padded with 0's instead of the first group of numbers
(the 720) looks weird. And it surely explains a lot about why sorting seems
off.

I just replicated the same effect on our system using the Dewey class and a
year at the end of the Dewey string, so it's definitely starting to feel
like this is a new bug in the normalizer function,
asset.label_normalizer_dewey().

Looking back in time, I know that
https://bugs.launchpad.net/evergreen/+bug/1150939 was a recent bug. It looks
like the test cases we were looking at back then did not include a year as a
potential part at the end of the call number. We were very focused on
testing cases with or without a leading prefix, followed by a three-digit
Dewey number and then a Cutter.

This issue might affect Koha too, given that we share this normalizer
routine in both of our projects.

I guess it's time to file a new potential bug and take a closer look at the
asset.label_normalizer_dewey function to see what it's doing wrong...

-- Ben

On Fri, Apr 25, 2014 at 4:45 AM, Paul Hoffman <p...@flo.org> wrote:

> On Thu, Apr 24, 2014 at 05:15:35PM -0400, Ben Shum wrote:
> > This will be a slightly more technical answer that may require some
> > direct database access to ascertain more details.
> >
> > You mention that the call numbers are identified as DDC. So that's
> > label_class of 2, I believe. We are using Dewey (DDC) for all of our
> > materials by default as well in our consortium.
> >
> > I'd be curious to know what the label_sortkey values were for those
> > call numbers you mention. That field is what actually drives the
> > sorting values for a given set.
>
> Here's what our DB shows (Adam and I work together):
>
> SELECT label_class, label, label_sortkey
> FROM asset.call_number
> WHERE label_sortkey like '720%'
> ORDER BY label_sortkey;
>
>  label_class |      label      |         label_sortkey
> -------------+-----------------+-------------------------------
>            1 | 720 H47 1979    | 720 H47 1979
>            2 | 720 a           | 720_000000000000000_A
>            2 | 720 .H47        | 720_000000000000000__H47
>            2 | 720.1 H74 1979  | 720_100000000000000_H74_1979
>            2 | 720.1 .H47 1980 | 720_100000000000000__H47_1980
>            2 | 720.1 .H74 1979 | 720_100000000000000__H74_1979
>            2 | 720 H74 1979    | 720_H74_197900000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H47 1980   | 720__H47_198000000000000
>            2 | 720 .H74 1979   | 720__H74_197900000000000
> (10 rows)
>
> So the problem appears to be caused by the periods that sometimes occur
> before the Cutter number. I don't know if that's kosher or not, but I can
> see that it occurs plenty in our (Voyager) catalog.
>
> Looking at the function asset.label_normalizer_dewey it seems to me that
> it can be done much more simply and efficiently if you leverage the fact
> that space (ASCII 32) and tilde (ASCII 126) come before and after
> (respectively) anything else meaningful that might be found in a call
> number. Except periods, which complicate things. Anyhow, here's a first
> stab at it:
>
> use strict;
> use warnings;
> sub ddcnorm {
>     local $_ = uc shift;
>     # Strip leading or trailing space and any slashes or apostrophes
>     s/^\s+|\s+$|[\/']//g;
>     # Insert a space at digit/non-digit boundaries
>     s/(?<=[0-9])(?=[^0-9])|(?<=[^0-9])(?=[0-9])/ /g;
>     # Replace some punctuation with a space
>     tr/-/ /;  # XXX What else?
>     # Strip extra junk -- XXX make this work on non-ASCII call numbers
>     tr/A-Za-z0-9. //cd;
>     # A period with a space on each side (e.g. the decimal point in
>     # 720.1 after the boundary step above) becomes a tilde
>     s/ \. /~/g;
>     # Any other period next to a space (e.g. before a Cutter) becomes
>     # a plain space
>     s/ \.|\. / /g;
>     # Squeeze runs of spaces down to one
>     tr/ //s;
>     return $_;
> }
>
> When I run our Deweys in the 720s through this, I get what seems to be
> the right order:
>
>  2 | 720 a           | 720 A
>  2 | 720 .H47        | 720 H 47
>  2 | 720 .H47 1980   | 720 H 47 1980
>  2 | 720 .H47 1980   | 720 H 47 1980
>  2 | 720 .H74 1979   | 720 H 74 1979
>  2 | 720 H74 1979    | 720 H 74 1979
>  2 | 720.1 .H47 1980 | 720~1 H 47 1980
>  2 | 720.1 .H74 1979 | 720~1 H 74 1979
>  2 | 720.1 H74 1979  | 720~1 H 74 1979
>
> If there's any interest, I'll run our entire set of Deweys through it and
> see if I can make sense of the results. Hmm... should prefixes like "j"
> or "C" (juvenile or Canadian) be ignored?
>
> Paul.
>
> > On Thu, Apr 24, 2014 at 3:32 PM, Adam Shire <a...@flo.org> wrote:
> >
> > > Hi Everyone,
> > >
> > > We are testing in evergreen 2.5.2
> > >
> > > I'm noticing what I think looks like incorrect behavior when using
> > > the call number browse feature.
> > >
> > > Doing a call number browse search for 720 results in the following
> > > call number sort order:
> > >
> > > 720 H47 1979
> > > 720 .H47
> > > 720.1 H74 1979
> > > 720.1 .H47 1980
> > > 720.1 .H74 1979
> > > 720 H74 1979
> > > 720 .H74 1979
> > >
> > > It looks like the decimal point might be throwing things off. I think
> > > that should be taken care of in a normalizer, but maybe there is a
> > > reason not to. I think the 720.1's should come at the end of this
> > > list, regardless of the decimal point before the cutter.
> > >
> > > All of the call numbers are identified as DDC.
> > >
> > > you can probably replicate this here
> > > http://emerson.eg.flo.org/eg/opac/cnbrowse?cn=715&locg=2
> > >
> > > I didn't see any bug reports that seemed to address this specific
> > > issue, so I'm wondering if there could be something else causing this
> > > behavior.
> > >
> > > thanks,
> > > Adam
> > >
> > > --
> > >
> > > Adam Shire
> > > Member Services Librarian
> > > Fenway Libraries Online <http://flo.org>
> > > 617-442-2384
> > >
> >
> > --
> > Benjamin Shum
> > Evergreen Systems Manager
> > Bibliomation, Inc.
> > 24 Wooster Ave.
> > Waterbury, CT 06708
> > 203-577-4070, ext. 113
>
> --
> Paul Hoffman <p...@flo.org>
> Systems Librarian
> Fenway Libraries Online
> c/o Wentworth Institute of Technology
> 550 Huntington Ave.
> Boston, MA 02115
> (617) 442-2384 (FLO main number)
>

--
Benjamin Shum
Evergreen Systems Manager
Bibliomation, Inc.
24 Wooster Ave.
Waterbury, CT 06708
203-577-4070, ext. 113
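
P.S. To make the sorting effect concrete, here is a minimal sketch in Python. It is purely an illustration and not Evergreen's own code. It takes the label_class 2 rows from Paul's query output quoted above and orders them by plain character-by-character comparison of label_sortkey. The names `rows` and `browse_order` are made up for this example, and it assumes the database compares these ASCII keys the same way Python compares strings, which is consistent with the ORDER BY output Paul posted.

```python
# (label, label_sortkey) pairs copied from the label_class = 2 rows in the
# quoted query output; the duplicate "720 .H47 1980" row is listed once.
rows = [
    ("720 a", "720_000000000000000_A"),
    ("720 .H47", "720_000000000000000__H47"),
    ("720.1 H74 1979", "720_100000000000000_H74_1979"),
    ("720.1 .H47 1980", "720_100000000000000__H47_1980"),
    ("720.1 .H74 1979", "720_100000000000000__H74_1979"),
    ("720 H74 1979", "720_H74_197900000000000"),
    ("720 .H47 1980", "720__H47_198000000000000"),
    ("720 .H74 1979", "720__H74_197900000000000"),
]


def browse_order(pairs):
    """Return the labels ordered by plain string comparison of their sortkeys.

    Hypothetical helper for this illustration only.
    """
    return [label for label, key in sorted(pairs, key=lambda pair: pair[1])]


for label in browse_order(rows):
    print(label)
```

In the sortkeys where the year got the zero padding, the character right after "720_" is "H" or "_", and both compare higher than the digit "1" that follows "720_" in the 720.1 keys. That is why "720 H74 1979" and the "720 .H..." entries with years land after every 720.1 entry in the browse list.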