(maybe a better question for solr-user... but)
which offsets are you talking about? The tokens? Are you looking for
something like analysis.jsp or SOLR-477?
Steve Suppe wrote:
Hello,
I'm looking into returning the offsets for various fields I've created
in a JSON object. Is there some
like I'm drinking from a firehose!). In
our case, certain documents will have certain fields attached, and will be
returned based on search criteria. We have specific highlighting
requirements, and I will have to rely on the actual offsets of those
matching fields (as opposed to using the built
This is a possibility, but I was thinking if I
could get SOLR to return that information in the initial JSON, then I
could save a step and speed things up immensely.
nothing off the shelf to do it... you may want to look at implementing a
search component to augment the response with
I appreciate all the help - I think, for now, we'll try and leverage the
analysis.jsp approach, as it appears that different approaches might be in
the works, and I don't want to muck with any of that just yet :)
If I get some time, maybe I'll have better news in the future. Thanks again!
[
https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan McKinley reopened SOLR-234:
TrimFilter should update the start and end offsets
...oh, hmm ... you only want to split on - if it has a space on both sides
huh? does Java regex have a "don't be greedy" option? ... javadocs say
yes (they call it Reluctant vs greedy), so try something like this...
pattern=\s*?(\s-\s|--|,|\(|\))\s*?
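A minimal sketch of the reluctant-vs-greedy difference, using plain java.util.regex rather than Solr's PatternTokenizerFactory. The class name, the GREEDY variant and the sample input are assumptions made for illustration only; the RELUCTANT pattern is the one suggested above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class SplitAndTrimDemo {
    // The pattern from the message above: reluctant \s*? on both sides.
    static final Pattern RELUCTANT =
        Pattern.compile("\\s*?(\\s-\\s|--|,|\\(|\\))\\s*?");

    // Assumed alternative for comparison: greedy \s* on both sides.
    static final Pattern GREEDY =
        Pattern.compile("\\s*(\\s-\\s|--|,|\\(|\\))\\s*");

    // Split the input on the delimiter pattern and return the pieces.
    static List<String> split(Pattern delimiter, String input) {
        return Arrays.asList(delimiter.split(input));
    }

    public static void main(String[] args) {
        String input = "red , well-known - blue";
        System.out.println(split(RELUCTANT, input));
        System.out.println(split(GREEDY, input));
    }
}
```

With this input the reluctant form leaves a leading space on " well-known", because a trailing \s*? is satisfied by matching nothing, while the greedy form returns the three pieces already trimmed. Neither form splits the hyphen inside "well-known", since that hyphen has no whitespace around it.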
it *almost* works, with:
to
the TrimFilter.
By default it will *not* modify the offsets.
Depending on how the Tokenizer+Analyzer stream is configured it may or may not
make sense, so the option seems reasonable.
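A rough sketch of what such an option could look like, written against a hypothetical Tok class rather than the real Lucene Token API. The names Tok, trim and updateOffsets are assumptions for illustration, and the sketch assumes the token text still has the same length as its offset span (that is, it comes straight from the tokenizer).

```java
// Hypothetical demo class; not the actual TrimFilter implementation.
final class TrimOffsetsDemo {
    // Stand-in for a token: text plus offsets into the original field value.
    static final class Tok {
        final String text;
        final int start; // inclusive offset into the original value
        final int end;   // exclusive offset into the original value

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Trim leading and trailing whitespace; optionally move the offsets inward
    // by the number of characters removed on each side.
    static Tok trim(Tok in, boolean updateOffsets) {
        String s = in.text;
        int lead = 0;
        while (lead < s.length() && Character.isWhitespace(s.charAt(lead))) {
            lead++;
        }
        int trail = 0;
        while (trail < s.length() - lead
                && Character.isWhitespace(s.charAt(s.length() - 1 - trail))) {
            trail++;
        }
        String trimmed = s.substring(lead, s.length() - trail);
        if (!updateOffsets) {
            return new Tok(trimmed, in.start, in.end); // default: offsets untouched
        }
        return new Tok(trimmed, in.start + lead, in.end - trail);
    }

    public static void main(String[] args) {
        Tok moved = trim(new Tok(" foo  ", 10, 16), true);
        System.out.println(moved.text + " " + moved.start + " " + moved.end);
    }
}
```

With updateOffsets off, the token " foo  " at offsets 10 to 16 becomes "foo" but still reports 10 to 16. With it on, the offsets shrink to 11 to 14, so a highlighter would mark only the visible text.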
TrimFilter should update the start and end offsets
: After 1/2 hour of regex hacking... I think I'll stick with a two step
: process: split then trim ;)
But regex hacking is FUN!!
I'm 99% certain this does what you want...
tokenizer class=solr.PatternTokenizerFactory
: Incidentally, PatternTokenizerFactory seems to have the annoying limitation
: of assuming there is a token prior to each match -- even if the match
: explicitly matches on the start of the string (so it creates a 0 width
: token) ... that seems like a bug right?
: how would you change it? I
TrimFilter should update the start and end offsets
--
Key: SOLR-234
URL: https://issues.apache.org/jira/browse/SOLR-234
Project: Solr
Issue Type: Improvement
Reporter: Ryan
exactly where in
the original stream of data the source of the token was found ... if the
token is modified in some way (ie: stemmed, trimmed, etc..) the offsets
are supposed to remain the same because regardless of the token text
munging, the original location has not actually changed.
I'll move
[
https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495147
]
Yonik Seeley commented on SOLR-234:
---
Updating the offsets does seem like the right thing to do.
I imagine using
[
https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495153
]
Ryan McKinley commented on SOLR-234:
Updating the offsets does seem like the right thing to do.
My real use
()-endOff,
t.type() );
t.setPositionIncrement( incr ); //+ start ); TODO? what should happen with the offset
}
TrimFilter should update the start and end offsets
with spaces at the end.
It doesn't make any sense.
I'd think that updating the offsets is almost always the right thing to do (and
should be the default?), given that spaces will almost always come from the
field value itself.
-Yonik
TrimFilter should update the start and end offsets
: My real use case is adding the trim filter to the pattern tokenizer.
: the 'correct' answer in my case is to update the offsets.
hmmm... wouldn't the correct thing to do in that case be to change your
pattern so it strips the whitespace when tokenizing? that way the offsets
of your tokens
[
https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495202
]
Hoss Man commented on SOLR-234:
---
I'd think that updating the offsets is almost always the right thing to do
the
token itself ...
I get the basic pattern now: Tokenizers determine the start/end offsets and
Filters just transform the text along the way.
In Ryan's use case he may want his highlighter-esque code to be able to know
...
I am fine with either:
1. leave the TrimFilter as is and do
On 11-May-07, at 5:02 PM, Ryan McKinley wrote:
Chris Hostetter wrote:
: My real use case is adding the trim filter to the pattern tokenizer.
: the 'correct' answer in my case is to update the offsets.
hmmm... wouldn't the correct thing to do in that case be to change your
pattern so
: probably I'm just not very good at regex ;)
:
:pattern=--|,|\s-\s|\(|\)
:
: this will split on "--", ",", " - ", "(", and ")". I can't figure out how to
: build the pattern so it will trim each thing on the way out.
just make sure you match the whitespace in the pattern, you're already
doing that
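One way to picture the advice above (match the surrounding whitespace in the delimiter so the tokenizer itself produces trimmed offsets) is the sketch below. It uses plain java.util.regex, not Solr's PatternTokenizerFactory; the class name, the WITH_WHITESPACE pattern and the choice to skip empty tokens are assumptions for illustration. The ORIGINAL pattern is the one quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class PatternOffsetsDemo {
    // The delimiter pattern quoted in the thread.
    static final Pattern ORIGINAL = Pattern.compile("--|,|\\s-\\s|\\(|\\)");

    // Assumed variant: the same delimiters, also absorbing adjacent whitespace.
    static final Pattern WITH_WHITESPACE =
        Pattern.compile("\\s*(?:--|,|\\s-\\s|\\(|\\))\\s*");

    // Returns {start, end} offsets of the text between delimiter matches.
    // Zero-width tokens are skipped in this sketch.
    static List<int[]> tokenOffsets(Pattern delimiter, String input) {
        List<int[]> offsets = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int tokenStart = 0;
        while (m.find()) {
            if (m.start() > tokenStart) {
                offsets.add(new int[] {tokenStart, m.start()});
            }
            tokenStart = m.end();
        }
        if (tokenStart < input.length()) {
            offsets.add(new int[] {tokenStart, input.length()});
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "foo , bar";
        for (int[] o : tokenOffsets(WITH_WHITESPACE, input)) {
            System.out.println(o[0] + "-" + o[1] + " '" + input.substring(o[0], o[1]) + "'");
        }
    }
}
```

For "foo , bar" the original pattern yields the spans 0-4 and 5-9 ("foo " and " bar"), while the whitespace-absorbing variant yields 0-3 and 6-9 ("foo" and "bar"), so no later trim step or offset adjustment is needed.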
[
https://issues.apache.org/jira/browse/SOLR-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495213
]
Yonik Seeley commented on SOLR-234:
---
offsets point back to the original field value for a particular token
sense to have an option -- i'm just saying that as a general rule
TokenFilters shouldn't be munging offsets ... i don't see a big difference
between TrimFilter and StemmingFilter (where the stem of fooand foo
is foo). so the option should default to off.
TrimFilter should update