I'm testing the default (gap) fragmenter with some simple,
single-word queries on a patched Solr 1.3.0 release populated with
real-world data. I think the primary quirk in my setup is that I'm
using ShingleFilterFactory to put word bigrams (aka shingles) into my
index. I was worried that this might mess up highlighting, but
highlighting is *mostly* working. There are some oddities, though, and
I'm wondering if people have any suggestions for debugging my setup
and/or making a good, reproducible test case.

1. The main weird thing is that the vast majority of the time, the
highlighted term is the last term in the fragment. For example, if I
search for "cat", then almost all my fragments look like this:

fragment 1: "to the *cat*"
fragment 2: "with the *cat*"
fragment 3: "it's what the *cat*"
fragment 4: "Once upon a time the *cat*"

(My actual fragments are longer. The key thing to note is that all of
these examples end in "cat".)

Sometimes "cat" will appear somewhere other than the last position,
but this is rare. My expectation, in contrast, is that "cat" would
tend to be more or less evenly distributed throughout fragment
positions.
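One way to put a number on this impression is to measure where the highlighted token falls in each returned fragment. The following is a minimal sketch, assuming fragments come back as strings with the default <em>...</em> markup; the function names are made up for illustration and are not part of Solr.

```python
def relative_hit_positions(fragment):
    """Position of each highlight start: 0.0 = first token, 1.0 = last token."""
    tokens = fragment.split()
    last = max(len(tokens) - 1, 1)  # avoid dividing by zero for 1-token fragments
    # A token counts as a hit if it carries the opening <em> tag.
    return [i / last for i, tok in enumerate(tokens) if "<em>" in tok]


def share_ending_on_hit(fragments):
    """Fraction of fragments whose final token closes a highlight."""
    if not fragments:
        return 0.0
    ending = sum(
        1 for f in fragments if f.split() and "</em>" in f.split()[-1]
    )
    return ending / len(fragments)
```

If the highlighted term were spread evenly, the values from relative_hit_positions should scatter across the 0.0 to 1.0 range, and share_ending_on_hit over a large sample should be well below 1.0.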

Note: I tried to reproduce this on 1.3.0 with my patches applied but
using the example dataset/schema from the Solr source tree rather than
my own dataset/schema. With the example dataset this didn't seem to be
an issue.
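For a reproducible test case, it may help to pin down the exact request being sent. Here is a small sketch that builds a highlighting query URL; the host, port, handler path and the field name "text" are assumptions for illustration, while hl, hl.fl, hl.fragsize and hl.snippets are the standard highlighting parameters.

```python
from urllib.parse import urlencode


def highlight_query_url(term, field="text", fragsize=100, snippets=3,
                        base="http://localhost:8983/solr/select"):
    """Build a select URL with highlighting switched on.

    `base` and `field` are placeholders; adjust them to the real install.
    """
    params = [
        ("q", f"{field}:{term}"),
        ("hl", "true"),                  # turn highlighting on
        ("hl.fl", field),                # field to highlight
        ("hl.fragsize", str(fragsize)),  # approximate fragment size in characters
        ("hl.snippets", str(snippets)),  # max fragments per field
        ("wt", "json"),                  # assumed response format
    ]
    return base + "?" + urlencode(params)
```

Running the same URL against both the example schema and the real schema would make it easier to tell whether the difference comes from the data or from the analysis chain.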

I've experienced three other highlighting issues, which may or may not
be related:

2. Sometimes, if a term appears multiple times in a fragment, not just
the term but all the words in between the two appearances will get
highlighted too. For example, I searched for "fear", and got this as
one of the snippets:

    SETTLEMENT AGREEMENT This Agreement ("the Agreement") is entered
    into this 18th day of August, 2008, by and between Cape <em>Fear
    Bank Corporation, a North Carolina corporation (the "Company"),
    and Cape Fear</em>

In contrast, I would have expected

    SETTLEMENT AGREEMENT This Agreement ("the Agreement") is entered
    into this 18th day of August, 2008, by and between Cape
    <em>Fear</em> Bank Corporation, a North Carolina corporation
    (the "Company"), and Cape <em>Fear</em>
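To find more instances of this in a large result set, a check along the following lines could flag every highlight span that covers more than one word. This is only a sketch assuming the default <em>...</em> markup; note that with shingles in the index, a two-word span might be a legitimate bigram match, so the flagged spans still need a manual look.

```python
import re

# Non-greedy match so each <em>...</em> pair is captured separately;
# DOTALL because snippets may be wrapped across lines.
EM_SPAN = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def multiword_highlights(snippet):
    """Return the text of every highlight span that contains more than one word."""
    spans = EM_SPAN.findall(snippet)
    return [" ".join(s.split()) for s in spans if len(s.split()) > 1]
```

On the snippet above this returns the single over-wide span, and on the expected output it returns an empty list.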

3. My install seems to have a curiously liberal interpretation of
hl.fragsize. If I set hl.fragsize=0, things are as expected: it
highlights the whole field. It also seems more or less true (as it
should be) that as I increase hl.fragsize, the fragments get longer.
However, I was surprised to see that with hl.fragsize=1 or
hl.fragsize=5, I can get fragments as long as this one:

    addition, we believe the wireless feature for our controller will
    facilitate exceptional customer services and response time."
    About GpsLatitude GpsLatitude, a Montreal-based company, is a
    provider of security solutions and tracking for mobile assets. It
    is also a developer of advanced " Videlocalisation" , a
    cost-effective, integrated mobile digital <em>video</em>

That seems shockingly long for something of size "five".
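Since hl.fragsize is an approximate size in characters, the mismatch can be quantified by comparing it with the visible length of each fragment. A rough sketch follows; the slack factor is an arbitrary assumption, and the helper names are hypothetical.

```python
import re

TAG = re.compile(r"</?em>")


def visible_length(fragment):
    """Character count with <em> tags removed and whitespace runs collapsed."""
    return len(" ".join(TAG.sub("", fragment).split()))


def oversized(fragments, fragsize, slack=2.0):
    """Return (length, fragment) pairs longer than fragsize * slack.

    `slack` is an arbitrary tolerance, since fragsize is only approximate.
    """
    limit = fragsize * slack
    return [(visible_length(f), f) for f in fragments
            if visible_length(f) > limit]
```

Plotting visible_length against several hl.fragsize settings would show whether small values are being ignored outright or merely rounded up to some minimum.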

4. Very rarely I'll get a fragment that doesn't actually contain any
of the search terms. For example, maybe I'll search for "cat", and
I'll get back "three ounces of milk" as a snippet. I need to explore
this more, though the last time this happened I opened the document,
located "three ounces of milk" in the text, and found that the word
"cat" did appear nearby; so maybe the document did contain "three
ounces of milk for the cat".
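The manual check described here could be automated roughly as follows: collect the fragments that carry no highlight markup, then look for the search term within a window around each one in the stored document text. This is a sketch under the assumption that the fragment text appears verbatim in the document; the window size and function names are arbitrary choices for illustration.

```python
def fragments_without_hit(fragments):
    """Fragments that contain no <em> markup at all."""
    return [f for f in fragments if "<em>" not in f]


def term_near_fragment(document, fragment, term, window=40):
    """True if `term` occurs within `window` characters of the fragment.

    Returns False when the fragment cannot be found verbatim in the document.
    """
    idx = document.find(fragment)
    if idx < 0:
        return False
    start = max(0, idx - window)
    end = idx + len(fragment) + window
    return term.lower() in document[start:end].lower()
```

If the term consistently turns up just past the end of such fragments, that would suggest the fragment boundary is being cut one token early, which might tie in with issue 1.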

Obviously I'm not describing my setup in much detail. Let me know what
you think would be helpful to know more about.

Thanks,
Chris
