On Wed, Oct 25, 2017 at 8:57 AM, Eric Lease Morgan <emor...@nd.edu> wrote:
> ...My bibliographic data is fraught with inconsistencies. For example, a
> publisher’s name may be recorded one way, another way, or a third way. The
> same goes for things like publisher place: South Bend; South Bend, IN;
> South Bend, Ind. And then there is the ISBD punctuation that is sometimes
> applied and sometimes not. All of these inconsistencies make indexing &
> faceted browsing more difficult than it needs to be.

Effective normalizing is about understanding patterns that represent the same thing and being aware of the patterns associated with specific types of data. For example, in your publisher example, detecting geographic entities and normalizing the states would be easy enough. You'll also see variation in how the publisher names themselves are expressed, but the vast majority of variations follow a small number of patterns.

Don't be afraid to use multifield logic to normalize one or multiple fields. To return to your publisher example, fragments from the publisher name and place may be used to normalize both fields, individually and collectively, more accurately than attempting to normalize each field in isolation.

What is the source of your bib data -- or are there many? You may be able to use info such as byte 18 (descriptive cataloging form) in the Leader, or even the cat date, to figure out which cataloging rules were in play and drive patterns specific to those records. If you have multiple sources of records, the patterns will most likely vary with the source, e.g. there are multiple ways personal names can be expressed, but the number of variations is small.

Depending on what you're working with, other clustering tools may be helpful. However, you may get better and more predictable results using a method tuned for the data you have than a much more sophisticated mechanism created for other uses.

kyle
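
To make the multifield idea above concrete, here is a minimal Python sketch. It is illustrative only: the lookup tables, the function names, and the single publisher entry are assumptions made up for this example, not anything taken from your data. A real table would be built from the patterns you actually observe in your records.

```python
import re

# Hypothetical lookup tables; build real ones from the patterns in your own data.
STATE_FORMS = {"in": "IN", "ind": "IN", "ind.": "IN", "indiana": "IN"}

# Assumed mapping: publisher-name fragment -> (canonical name, canonical place).
PUBLISHER_HINTS = {
    "notre dame": ("University of Notre Dame Press", "Notre Dame, IN"),
}


def strip_isbd(value):
    """Drop trailing ISBD punctuation (' :', ' ;', ',', ' /') and square brackets."""
    value = re.sub(r"\s*[:;,/]\s*$", "", value.strip())
    return value.strip("[]").strip()


def normalize_place(place):
    """Normalize 'City, State' so that known state variants collapse to one form."""
    city, sep, state = strip_isbd(place).partition(",")
    city, state = city.strip(), state.strip()
    if not sep or not state:
        return city
    key = state.lower()
    state_norm = STATE_FORMS.get(key) or STATE_FORMS.get(key.rstrip("."))
    return f"{city}, {state_norm or state}"


def normalize_pair(publisher, place):
    """Use publisher and place together: a name fragment is trusted only
    when the recorded city agrees with the city expected for that publisher."""
    pub = strip_isbd(publisher)
    plc = normalize_place(place)
    city = plc.split(",")[0].lower()
    for fragment, (canon_pub, canon_place) in PUBLISHER_HINTS.items():
        if fragment in pub.lower() and city == canon_place.split(",")[0].lower():
            return canon_pub, canon_place
    return pub, plc


def descriptive_form(leader):
    """Return Leader byte 18 (descriptive cataloging form) of a 24-character MARC leader."""
    return leader[18] if len(leader) == 24 else None


if __name__ == "__main__":
    for raw in ("South Bend, IN", "South Bend, Ind.", "[South Bend, Ind.] :"):
        print(normalize_place(raw))
    print(normalize_pair("Univ. of Notre Dame Press,", "Notre Dame, Ind. :"))
```

Note that a bare "South Bend" stays as it is in `normalize_place`. Filling in the missing state is exactly where a second field (publisher, source, or Leader byte 18 via `descriptive_form`) has to supply the evidence, which is why the pairwise function checks both fields before rewriting either.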