Working with the HathiTrust Research Center data can be fun, and I sincerely
believe it is an under-utilized system, but creating collections without
duplicates is difficult. Has anybody here figured out a “kewl” way to remove
duplicates?

Creating HathiTrust collections is easy: do a search, select items of interest,
and repeat until tired. One can then download a CSV file describing the
collection, but upon closer inspection MANY of the titles are repeated. I know
why this has happened, alas, but how might I automatically/programmatically
resolve this issue? I’ve begun experimenting with OpenRefine. Does anybody else
have other suggestions?
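
To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, in Python. It is only an illustration under stated assumptions: the column names (htid, title, author) and the sample rows are hypothetical, and a real download may use different headers or a different delimiter. The idea is to reduce each title and author to a crude comparison key (lowercase, punctuation dropped, whitespace collapsed) and keep only the first row seen for each key, which is loosely similar in spirit to key-collision clustering in OpenRefine.

```python
import csv
import io
import re


def normalize(text):
    """Build a crude comparison key: lowercase, punctuation dropped, spaces collapsed."""
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())


def dedupe(rows, fields=("title", "author")):
    """Keep the first row seen for each normalized combination of `fields`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(normalize(row.get(field) or "") for field in fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data standing in for a downloaded collection file.
# The column names here are assumptions, not the actual HathiTrust headers.
SAMPLE = """htid,title,author
mdp.001,"Walden; or, Life in the woods","Thoreau, Henry David"
uc1.002,"Walden, or Life in the Woods.","Thoreau, Henry David"
hvd.003,Cape Cod,"Thoreau, Henry David"
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unique = dedupe(rows)
print(len(rows), "rows in,", len(unique), "rows out")
```

A key this crude will also collapse different editions or volumes that share a title and author, so adding something like a publication date to the fields may be necessary, depending on what one counts as a duplicate.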

—
Eric Morgan
