Jonathan, Bill,

Very interesting--thanks for the replies. While I'm not sure I understand what indexing arbitrary XML into solr might look like, this does prompt me to think it would be interesting to look at Trajecting up some EAD (may I use it as a verb?) into solr, for finding aid searchability. It is my impression that most of the effort in making finding aids searchable is in the indexing, and I'm not aware of a general purpose tool / approach for those of us using solr yet, though there have been plenty of successful approaches at individual sites. (Happy to have my ignorance rectified.)
Mike Giarlo is organizing a DLF hackfest for ArchivesSpace / Hydra integration. I wonder if Traject for EAD might be touched on there?

- Tom

On Oct 15, 2013, at 10:28 AM, Jonathan Rochkind wrote:

> Yep, what Bill said, I have had thoughts of extending it to other types of
> input too, it was part of my original design goals.
>
> In particular, I was thinking of extending it to arbitrary XML.
>
> Unlike MARC, there are many other options for indexing XML into Solr
> (assuming that's your end goal), so you may or may not find traject to be
> better than those, although for myself there might be some benefit in using
> the same tool across formats too.
>
> There are a number of built-in 'macros' that are MARC-specific; you wouldn't
> use those. And might need some others that are, say, XML-specific. (Probably
> just a single one, extract_xpath, for XML).
>
> Same could be done for MODS, sure -- or you could handle MODS with a
> (hypothetical) generic XML setup.
>
> But yeah, if you want to take input records, and transform them into
> hash-like data structures -- I was thinking from the start of structuring
> traject to support such use cases, yep. (If you want to go to something other
> than a hash-like data structure, well, it might still be possible, but it's
> straying from traject's target a bit more).
>
> [Oh, and I just made up 'traject'. I was looking for a word (made up or real)
> not already being used for any popular software, and thinking about
> 'projections' in the sense of mathematical transformations; and about
> 'trajectory' in the sense of things sent through outer space, with the
> Solr/Solar connection. I actually had originally decided to call it
> "transject", but then accidentally wrote "traject" when I created the github
> project, and then figured that was easier to pronounce and write anyhow.]
>
> On 10/15/13 1:02 PM, Bill Dueber wrote:
>> 'traject' means "to transmit" (e.g., "trajectory") -- or at least it did,
>> when people still used it, which they don't.
>>
>> The traject workflow is incredibly general: *a reader* sends *a record* to
>> *an indexing routine* which stuffs...stuff...into a context object which is
>> then sent to *a writer*. We have a few different MARC readers, a few useful
>> writers (one of which, obviously, is the solr writer), and a bunch of
>> shipped routines (which we're calling "macros" but are just well-formed
>> ruby lambdas or blocks) for extracting and transforming common MARC data.
>>
>> [see
>> http://robotlibrarian.billdueber.com/announcing-traject-indexing-software/
>> for more explanation and some examples]
>>
>> But there's no reason why a reader couldn't produce a MODS record which
>> would then be worked on. I'm already imagining readers and writers that
>> target databases (RDBMS or NoSQL), or a queueing system like Hornet, etc.
>>
>> If there are people at Stanford that want to talk about how (easy it is) to
>> extend traject, I'd be happy to have that conversation.
>>
>>
>>
>> On Tue, Oct 15, 2013 at 12:28 PM, Tom Cramer <[email protected]> wrote:
>>
>>> ++ Jonathan and Bill.
>>>
>>> 1.) Do you have any thoughts on extending traject to index other types of
>>> data--say MODS--into solr, in the future?
>>>
>>> 2.) What's the etymology of 'traject'?
>>>
>>> - Tom
>>>
>>>
>>> On Oct 14, 2013, at 8:53 AM, Jonathan Rochkind wrote:
>>>
>>>> Jonathan Rochkind (Johns Hopkins) and Bill Dueber (University of
>>>> Michigan) are happy to announce a robust, feature-complete beta release
>>>> of "traject," a tool for indexing MARC data to Solr.
>>>>
>>>> traject, in the vein of solrmarc, allows you to define your indexing
>>>> rules using simple macro and translation files. However, traject runs
>>>> under JRuby and is "ruby all the way down," so you can easily provide
>>>> additional logic by simply requiring ruby files.
>>>>
>>>> There's a sample configuration file to give you a feel for traject[1].
>>>>
>>>> You can view the code[2] on github, and easily install it as a (jruby)
>>>> gem using "gem install traject".
>>>>
>>>> traject is in a beta release hoping for feedback from more testers prior
>>>> to a 1.0.0 release, but it is already being used in production to
>>>> generate the HathiTrust (metadata-lookup) Catalog
>>>> (http://www.hathitrust.org/). traject was developed using a test-driven
>>>> approach and has undergone both continuous integration and an extensive
>>>> benchmarking/profiling period to keep it fast. It is also well covered by
>>>> high-quality documentation.
>>>>
>>>> Feedback is very welcome on all aspects of traject including
>>>> documentation, ease of getting started, features, any problems you have,
>>>> etc.
>>>>
>>>> What we think makes traject great:
>>>>
>>>> * It's all just well-crafted and documented ruby code; easy to program,
>>>>   easy to read, easy to modify (the whole code base is only 6400 lines of
>>>>   code, more than a third of which is tests)
>>>> * Fast. Traject by default indexes using multiple threads, so you can
>>>>   use all your cores!
>>>> * Decoupled from specific readers/writers, so you can use ruby-marc or
>>>>   marc4j to read, and write to solr, a debug file, or anywhere else you'd
>>>>   like with little extra code.
>>>> * Designed so it's easy to test your own code and distribute it as a gem
>>>>
>>>> We're hoping to build up an ecosystem around traject and encourage
>>>> people to ask questions and contribute code (either directly to the
>>>> project or via releasing plug-in gems).
>>>>
>>>> [1]
>>>> https://github.com/traject-project/traject/blob/master/test/test_support/demo_config.rb
>>>> [2] http://github.com/traject-project/traject
>>>
>>
>>
>>
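
For anyone trying to picture the workflow Bill describes in the thread (a reader sends a record to an indexing routine, which stuffs values into a context object that is then sent to a writer), here is a rough plain-Ruby sketch. It does not use traject itself and should not be read as traject's API: MiniIndexer, ArrayReader, MemoryWriter, Context and extract_key are all hypothetical names made up for illustration, and the records are plain hashes standing in for whatever a MARC, MODS or EAD reader might produce.

```ruby
# Hypothetical sketch of a reader -> indexing steps -> writer pipeline.
# None of these class or method names come from traject.

# Carries one input record plus the hash-like output built from it.
Context = Struct.new(:source_record, :output_hash)

# A "reader" is anything that yields records one at a time.
class ArrayReader
  include Enumerable

  def initialize(records)
    @records = records
  end

  def each(&block)
    @records.each(&block)
  end
end

# A "writer" receives each finished context; this one just collects them.
class MemoryWriter
  attr_reader :documents

  def initialize
    @documents = []
  end

  def put(context)
    @documents << context.output_hash
  end
end

class MiniIndexer
  def initialize
    @steps = []
  end

  # Register an indexing step: a field name plus a lambda or block
  # that takes a record and returns zero or more values.
  def to_field(name, &extractor)
    @steps << [name, extractor]
  end

  # Run every step against every record and hand the result to the writer.
  def process(reader, writer)
    reader.each do |record|
      context = Context.new(record, {})
      @steps.each do |name, extractor|
        values = Array(extractor.call(record)).compact
        next if values.empty?
        (context.output_hash[name] ||= []).concat(values)
      end
      writer.put(context)
    end
    writer
  end
end

# A "macro" in this sketch is just a method that returns a lambda.
def extract_key(key)
  lambda { |record| record[key] }
end

indexer = MiniIndexer.new
indexer.to_field("title", &extract_key(:title))
indexer.to_field("subject", &extract_key(:subjects))
indexer.to_field("title_sort") { |record| record[:title].to_s.downcase }

records = [
  { title: "Finding Aid A", subjects: ["Maps", "Letters"] },
  { title: "Finding Aid B" }
]
writer = indexer.process(ArrayReader.new(records), MemoryWriter.new)
writer.documents.each { |doc| puts doc.inspect }
```

Under these assumptions, supporting a new input format would mostly mean writing a different reader and a few format-specific extractors (an XPath-based one for XML, say), while the indexing loop and the writer stay the same.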
