Re: Is there a way to retrieve the a term's position/offset in Solr

Bjarke Buur Mortensen Thu, 30 Mar 2017 08:01:00 -0700

OK, that complicates things a bit.

I would still try to go for a solution where you store the rich text in
Solr, but make sure you tokenize it correctly.


If the format is relatively simple, you could use either a regexp pattern
tokenizer
https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-SimplifiedRegularExpressionPatternTokenizer

or perhaps, before tokenization, use a pattern replace char filter to strip
out the parts of the rich text that should not be indexed
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory
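
To illustrate why the char filter route is attractive, here is a minimal
Python sketch (not Solr code) of the idea: markup is stripped before
tokenization, but each token's offsets still point into the original rich
text. The `{{...}}` markup pattern and the function name are assumptions made
up for the example, since the real format is not known.

```python
import re

# Hypothetical markup: control codes of the form {{...}}. The real format is unknown.
MARKUP = re.compile(r"\{\{.*?\}\}")
TOKEN = re.compile(r"\w+")

def tokenize_with_original_offsets(rich_text):
    """Strip markup, tokenize, and report token offsets in the ORIGINAL text."""
    plain_chars = []
    offset_map = []  # offset_map[i] = index in rich_text of plain character i
    pos = 0
    for m in MARKUP.finditer(rich_text):
        # Copy the text between the previous markup match and this one.
        for i in range(pos, m.start()):
            plain_chars.append(rich_text[i])
            offset_map.append(i)
        pos = m.end()
    # Copy whatever follows the last markup match.
    for i in range(pos, len(rich_text)):
        plain_chars.append(rich_text[i])
        offset_map.append(i)
    plain = "".join(plain_chars)
    tokens = []
    for m in TOKEN.finditer(plain):
        start = offset_map[m.start()]
        end = offset_map[m.end() - 1] + 1
        tokens.append((m.group().lower(), start, end))
    return tokens

rich = "{{b}}Hello{{/b}} world"
tokens = tokenize_with_original_offsets(rich)
```

Slicing `rich` with the reported offsets gives back the original words, which
is the property the highlighter relies on.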

I assume that you have some process for converting the rich text to plain
text before indexing, so if you can replicate that process using Solr's
charfilters, tokenizers and filters then that would allow you to use the
highlighter to get the rich text back.
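
If you need offsets rather than highlighted text, one possible approach is to
ask the highlighter for the whole field with distinctive markers and derive
the offsets on the client. The sketch below assumes a field named `body_rich`
and the markers `[[HL]]` / `[[/HL]]` (both made up for the example), and it
assumes the markers never occur in the stored text. The parameter names are
those of the original highlighter; other highlighters and Solr versions may
expect different ones.

```python
# Request parameters (assumption: original highlighter, hypothetical field name).
params = {
    "q": "body_rich:hello",
    "hl": "true",
    "hl.fl": "body_rich",
    "hl.fragsize": "0",            # 0 = return the whole field as one fragment
    "hl.simple.pre": "[[HL]]",
    "hl.simple.post": "[[/HL]]",
}

def offsets_from_highlight(highlighted, pre, post):
    """Return (start, end) offsets, relative to the text without markers."""
    offsets = []
    consumed = 0   # number of original characters seen so far
    start = None
    i = 0
    while i < len(highlighted):
        if highlighted.startswith(pre, i):
            start = consumed
            i += len(pre)
        elif start is not None and highlighted.startswith(post, i):
            offsets.append((start, consumed))
            start = None
            i += len(post)
        else:
            consumed += 1
            i += 1
    return offsets

highlighted = "{{b}}[[HL]]Hello[[/HL]]{{/b}} world"
hits = offsets_from_highlight(highlighted, "[[HL]]", "[[/HL]]")
```

The resulting offsets can then be handed to the document's own viewer.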

HTH,
Bjarke


2017-03-30 10:39 GMT+02:00 forest_soup <tanglin0...@gmail.com>:

> Unfortunately, the rich text is not HTML, XML, DOC, PDF, or any other
> popular rich text format, and we would like to show the highlighted text in
> the doc's own specific viewer. That's why I really want the offsets.
>
> The /tvrh handler (term vector component) with tv.offsets and tv.positions
> can give us such info, but it returns data for all terms instead of only
> the ones being searched. So we are still looking for a way to filter the
> results.
>
> Any ideas?
>
> Thanks!
