Take a look at the synonym filter as well. I mean, basically that's exactly what you are doing - adding synonyms at each position.

-- Jack Krupansky

-----Original Message-----
From: Manuel Le Normand
Sent: Wednesday, July 2, 2014 12:57 PM
To: solr-user@lucene.apache.org
Subject: Re: OCR - Saving multi-term position

Thanks for your answers Erick and Michael.

The term confidence level is an OCR output metric that gives, for every
word, the odds that it is the actual scanned term. I would like the OCR
program to output all the "suspected words" whose confidences sum to above
~90%, instead of outputting a single word as it does by default.
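
The selection rule described above could be sketched roughly as follows. This is only an illustration in plain Python: the function name select_variants, the (word, confidence) input format and the sample confidence values are all assumptions, not the output format of any particular OCR program.

```python
def select_variants(candidates, threshold=0.9):
    """Keep the most likely OCR candidates until their confidences reach `threshold`.

    `candidates` is a list of (word, confidence) pairs for ONE scanned word;
    this input format is hypothetical.
    """
    # Consider the most confident candidates first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept, total = [], 0.0
    for word, confidence in ranked:
        kept.append(word)
        total += confidence
        if total >= threshold:
            break  # enough cumulative confidence, drop the remaining tail
    return kept


# Made-up confidences for the scanned word "quick".
print(select_variants([("quick", 0.55), ("quiok", 0.38), ("qulck", 0.07)]))
```

A high-confidence word yields a single term, so only doubtful words add extra terms to the index.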

I'm happy to hear this approach has been used before. I will implement an
analyser that indexes these terms at the same position to enable positional
queries.
I hope it works out well. If it does, I will open a Jira ticket for it.
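
To make "same position" concrete, here is a rough sketch in plain Python (not the Lucene API) that turns the term|position syntax from the original question into (term, position increment) pairs. In Lucene, a token with a position increment of 0 is stacked on the previous token's position, which is what the synonym filter does. The function name to_position_increments is hypothetical, and the sketch assumes positions never decrease.

```python
def to_position_increments(annotated):
    """Convert 'term|position' items into (term, position_increment) pairs.

    An increment of 1 starts a new position; an increment of 0 places the
    term at the same position as the previous one.
    """
    tokens = []
    previous_position = 0
    for item in annotated.split():
        # Split on the LAST '|' so terms containing '|' are still handled.
        term, _, position = item.rpartition("|")
        position = int(position)
        tokens.append((term, position - previous_position))
        previous_position = position
    return tokens


line = "The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4"
for term, increment in to_position_increments(line):
    print(term, increment)
```

A custom token filter would emit the same sequence of increments, so a phrase query for "quick brown" can still match when one of the variants is the correct word.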

If anyone else has had experience with this use case, I'd love to hear about it.

Manuel


On Wed, Jul 2, 2014 at 7:28 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

Problem here is that you wind up with a zillion unique terms in your
index, which may lead to performance issues, but you probably already
know that :).

I've seen situations where running it through a dictionary helps. That
is, does each term in the OCR match some dictionary? Problem here is
that it then de-values terms that don't happen to be in the
dictionary, names for instance.
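
A rough sketch of the dictionary idea, in plain Python. The function name filter_by_dictionary and the tiny word set are made up for illustration, and the fallback of keeping every variant when none is in the dictionary is only one possible way to soften the problem with names, not something I have tested at scale.

```python
def filter_by_dictionary(variants, dictionary):
    """Keep only the OCR variants found in `dictionary` (a set of lowercase words).

    If no variant is a dictionary word (e.g. a name), keep them all rather
    than dropping the term entirely. This fallback is an assumption.
    """
    known = [v for v in variants if v.lower() in dictionary]
    return known if known else list(variants)


words = {"the", "quick", "brown", "fox"}  # stand-in for a real dictionary
print(filter_by_dictionary(["The", "Tlne"], words))
print(filter_by_dictionary(["Krupansky", "Krupanskv"], words))
```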

But to answer your question: No, there really isn't a pre-built
analysis chain that I know of that does this. Root issue is how to
assign "confidence"? No clue for your specific domain.

So payloads seem quite reasonable here. Happens there's a recent
end-to-end example, see:
http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/

Best,
Erick

On Wed, Jul 2, 2014 at 7:58 AM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> I don't have first hand knowledge of how you implement that, but I bet a
> look at the WordDelimiterFilter would help you understand how to emit
> multiple terms with the same positions pretty easily.
>
> I've heard of this "bag of word variants" approach to indexing
> poor-quality OCR output before for findability reasons and I heard it
> works out OK.
>
> Michael Della Bitta
>
> On Wed, Jul 2, 2014 at 10:19 AM, Manuel Le Normand <
> manuel.lenorm...@gmail.com> wrote:
>
>> Hello,
>> Many of our indexed documents are scanned and OCR'ed documents.
>> Unfortunately, for various reasons, we were not able to improve the OCR
>> quality much (less than 80% word accuracy), which badly hurts retrieval
>> quality.
>>
>> As we use an open-source OCR, we are thinking of changing the output for
>> every scanned term to its main possible variations to get a higher level
>> of confidence.
>>
>> Is there any analyser that supports this kind of need, or should I make
>> up a syntax and analyser of my own, i.e. the payload syntax?
>>
>> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>>
>> Thanks,
>> Manuel
>>

