Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

Jonathan Rochkind Fri, 06 Aug 2010 22:19:37 -0700

Huh, since the highlighter only needs to run on the documents in the actual 
returned section of the result set (10-50?), I wouldn't think total number of 
documents would matter much (I certainly could be wrong), but total size of 
each document's stored field definitely has a known performance impact on the 
highlighter. Maybe someday I'll have the time, or a local requirement, to 
investigate; I wonder if there'd be a way to write a custom highlighting 
component optimized for the EAD use case, or for the general case of "identify 
matching section(s) in XML", that would do better.
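
To make "identify matching section(s) in XML" concrete, here is a naive 
client-side sketch in Python. It is not a Solr highlighting component and says 
nothing about how one would perform; it only shows the kind of answer such a 
component would need to return. The sample document, the `c01` tag choice, the 
`id` attributes, and the function name `matching_sections` are all assumptions 
made for illustration, and the sample has no XML namespace (a schema-based EAD 
file would need namespace handling).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD-like document (no namespace).
SAMPLE_EAD = """<ead>
  <archdesc>
    <dsc>
      <c01 id="series1"><did><unittitle>Correspondence</unittitle></did>
        <scopecontent><p>Letters about railroad surveys.</p></scopecontent>
      </c01>
      <c01 id="series2"><did><unittitle>Photographs</unittitle></did>
        <scopecontent><p>Prints of logging camps.</p></scopecontent>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def matching_sections(ead_xml, term, tag="c01"):
    """Return the id attribute of each <tag> element whose text contains term.

    Plain case-insensitive substring matching; no stemming or analysis,
    so this will not agree with Solr's own matching in general.
    """
    root = ET.fromstring(ead_xml)
    needle = term.lower()
    matches = []
    for section in root.iter(tag):
        # itertext() walks all descendant text, so nested elements count.
        text = " ".join(section.itertext()).lower()
        if needle in text:
            matches.append(section.get("id"))
    return matches


print(matching_sections(SAMPLE_EAD, "railroad"))
```

The returned ids are what the search results page would use to link to the 
matching sub-sections under the document hit.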

I'm less nervous about custom components that do not require patches to Solr 
than I am about patches to Solr core that are not (yet?) included in a tagged 
Solr release or trunk.

With some of the stuff I'm working with, RAM sometimes seems to have unexpected 
impacts on performance too. From thinking about what it does, and from looking 
at my cache hit/miss/eviction statistics, I didn't really have reason to think 
that lack of RAM was what was slowing down my StatsComponent use, but adding 
RAM seems to help a lot. I need a hardware upgrade to be able to add enough RAM 
and avoid swap, to be sure that what I think I'm seeing about RAM effects on 
performance is what I'm seeing, but I think so.   Wonder if throwing monster 
amounts of RAM at Solr and increasing certain relevant caches a lot would have 
an impact on highlighter performance. 

I've thought about using the highlighter in that way on MARC documents to 
provide matching snippets a la Google on the hits page -- the fact that MARC 
documents aren't "full text", but are lists of structured (well, you know, they 
try :) ) fields, means that you can't just use the highlighter out of the box 
and get a reasonable snippet to show the user. But if you could use it to 
identify which _fields_ matched the query, and then throw each matching field 
(or the first N) through a display mapper that labels it and formats it 
appropriately (my as-of-yet not publicly released MARC mapping Ruby framework 
could handle that nicely), that could provide a nice "hit snippet" perhaps. A 
large MARC document is probably still smaller than a typical EAD document, so 
it might have a greater chance of success. 
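
A rough sketch of the "which fields matched, then label them" step, in Python 
rather than the Ruby framework mentioned above (which is not shown here). It 
assumes the Solr JSON response carries the usual `highlighting` section keyed 
by document id and then by field name; the field names, the labels in 
`FIELD_LABELS`, the sample response, and the function name `hit_snippets` are 
hypothetical.

```python
# Hypothetical mapping from Solr field names to display labels.
FIELD_LABELS = {
    "title_t": "Title",
    "author_t": "Author",
    "subject_t": "Subject",
    "note_t": "Note",
}


def hit_snippets(solr_response, doc_id, max_fields=2):
    """Return up to max_fields (label, snippet) pairs for one document.

    A field counts as "matching" if the highlighter returned at least one
    fragment for it; only the first fragment per field is kept.
    """
    per_doc = solr_response.get("highlighting", {}).get(doc_id, {})
    snippets = []
    for field, fragments in per_doc.items():
        if not fragments:
            continue  # field was requested but did not match
        label = FIELD_LABELS.get(field, field)  # fall back to the raw name
        snippets.append((label, fragments[0]))
        if len(snippets) == max_fields:
            break
    return snippets


# Made-up response fragment, shaped like Solr's "highlighting" section.
sample_response = {
    "highlighting": {
        "rec1": {
            "title_t": [],
            "subject_t": ["<em>Railroads</em> -- Oregon -- History"],
            "note_t": ["Includes maps of <em>railroad</em> routes"],
        }
    }
}

for label, snippet in hit_snippets(sample_response, "rec1"):
    print(f"{label}: {snippet}")
```

A real display mapper would also format each field properly (subfields, 
punctuation) instead of passing the raw fragment through.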
________________________________________
From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bess Sadler 
[bess.sad...@gmail.com]
Sent: Saturday, August 07, 2010 12:41 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in 
fedora)

On Aug 6, 2010, at 8:07 PM, Jonathan Rochkind wrote:

> I've been brainstorming other weird ways to do this. This one is totally 
> wacky and possibly a bad idea, but I'll throw it out there anyway. What if 
> you only indexed the entire EAD as one document, BUT threw the entire EAD in 
> a stored field, and used Solr highlighting on that field.  NOT to show the 
> highlighter results to the user, but to sort of trick the highlighter, using 
> hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar) to 
> tell you _which_ sub-sections of the EAD matched, and your software could 
> then display the matching sub-sections (possibly with direct links to 
> display) in the search results, under the actual document hit.

Hi, Jonathan. I don't think this is a crazy idea, and in fact it is one of the 
approaches that Matt M. and I tried during our NWDA project. However, we found 
that it wasn't scalable. The highlighter was way too slow with the number of 
documents and fragments we were throwing at it. It wasn't even a huge number of 
documents, so we abandoned that idea. However, it's still a really elegant 
solution if only it were performant. Let me know if you decide to give it a try.

Bess
