OK, so I thought some more concrete evidence might be helpful to make the
case here and did a quick POC. To get access to precise within-token
offsets we do need to make some changes to the public API, but the profile
could be kept small. In the version I worked up, I extracted the character
Given that character transformations do happen in TokenFilters, shouldn't
we strive for an API that supports correct offsets (i.e. highlighting)
for any combination of token filters? Currently we can't do that. For
example, because of the current situation, WordDelimiterGraphFilter,
dec
The problem is not a performance one, it's a complexity thing. Really I
think only the tokenizer should be messing with the offsets...
It is the one actually parsing the original content, so it makes
sense it would produce the pointers back to it.
I know there are some token filters out there that try to keep
correct offsets in the face of manipulations
like replacing ellipses, ligatures (like AE, OE), trademark symbols
(replaced by "tm") and the like, so that we can have the invariant that
correctOffset(OffsetAttribute.startOffset()) + CharTermAttribute.length() ==
correctOffset(OffsetAttribute.endOffset())
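The invariant above can be sketched as a pass-through TokenFilter. This is a hypothetical illustration (class and package names are mine, and as the thread notes, correctOffset() actually lives on Tokenizer, so a filter cannot call it today, which is the point under discussion):

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class OffsetInvariantFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public OffsetInvariantFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // If no upstream filter has inserted or removed characters, the
    // token's offset span should equal its term length. A mismatch
    // means characters were folded or expanded somewhere upstream
    // (ellipses, ligatures, "tm", ...), and only a correction map
    // could repair the offsets.
    int span = offsetAtt.endOffset() - offsetAtt.startOffset();
    if (span != termAtt.length()) {
      // invariant violated; nothing a filter can do about it today
    }
    return true;
  }
}
```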
How would a fixup API work? We would try to provide correctOffset
throughout the full analysis chain?
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote:
> I've run into some difficulties with offsets in some TokenFilters I've been
I think you see it correctly. Currently, only tokenizers can really
safely modify offsets, because only they have access to the correction
logic from the charfilter.
Doing it from a tokenfilter just means you will have bugs...
I've run into some difficulties with offsets in some TokenFilters I've been
writing, and I wonder if anyone can shed any light. Because characters may
be inserted or removed by prior filters (eg ICUFoldingFilter does this with
ellipses), and there is no offset-correcting data structure
>
> The second question is what I should put in place of "???". The API says
> "pass a prior PostingsEnum for possible reuse", but I don't get how to create
> an instance of it.
You can just pass null.
Alan Woodward
www.flax.co.uk
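Putting Alan's answer together with the original question, a minimal sketch of listing each term in a document's term vector with its offsets might look like this. It assumes the field was indexed with term vectors plus offsets, and uses the Lucene 5.x API (names differ slightly in other versions); the index path and doc id are placeholders:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class TermVectorOffsets {
  public static void main(String[] args) throws Exception {
    try (IndexReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      int docId = Integer.parseInt(args[1]);
      Terms vector = reader.getTermVector(docId, "contents");
      if (vector == null) {
        return; // field has no term vector for this doc
      }
      TermsEnum termsEnum = vector.iterator();
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        // null = no prior PostingsEnum to reuse, as Alan says
        PostingsEnum postings = termsEnum.postings(null, PostingsEnum.OFFSETS);
        postings.nextDoc(); // a term vector holds exactly one document
        for (int i = 0; i < postings.freq(); i++) {
          postings.nextPosition();
          System.out.println(term.utf8ToString() + " "
              + postings.startOffset() + "-" + postings.endOffset());
        }
      }
    }
  }
}
```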
> Many thanks!
Hi
Given a document in a lucene index, I would like to get a list of terms
in that document and their offsets. I suppose starting with
IndexReader.getTermVector can get me going with this. I have some code
as below (Lucene 5.3) about which I have some questions
Thanks Mike, this clarifies my understanding as well.
Regds,
Rizwan
On Wed, Aug 14, 2013 at 7:56 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> I think you just need to add fieldType.setStoreTermVectors(true) as well.
I think you just need to add fieldType.setStoreTermVectors(true) as well.
However, I see you are also indexing offsets into the postings, which
is wasteful because now you've indexed offsets twice in your index.
Usually only one place is needed, i.e. if you will use
PostingsHighlighter,
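A sketch of the setup Mike suggests, with the offsets carried by the term vectors and the postings kept at positions only, so offsets are not indexed twice. This uses the Lucene 5.x class locations (in 4.x the enum is FieldInfo.IndexOptions); the helper name is mine:

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;

public class VectorFieldType {
  // Term vectors carry positions and offsets; postings stop at
  // positions, so no duplicate offset storage.
  static FieldType termVectorOffsets() {
    FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setStoreTermVectors(true);
    ft.setStoreTermVectorPositions(true);
    ft.setStoreTermVectorOffsets(true);
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    ft.freeze();
    return ft;
  }
}
```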
Ankit Murarka <ankit.mura...@rancoretech.com> wrote:
> Hello,
> I generally add fields to my document in the following manner. I
> wish to add offsets to this field.
>
> doc.add(new StringField("contents",line,Field.Store.YES));
>
> I wish to also
Hello,
I generally add fields to my document in the following manner.
I wish to add offsets to this field.
doc.add(new StringField("contents",line,Field.Store.YES));
I wish to also store offsets. So, I went through javadoc, and found I
need to use FieldType.
So, I ende
On Wed, May 8, 2013 at 9:03 AM, AarKay wrote:
> Thanks Mike. This is a little bit clearer to me now.
>
> Just to make sure I got it right, do you mean that we need to store just
> the offsets and set IndexOptions to DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
> to be able to use Post
Thanks Mike. This is a little bit clearer to me now.
Just to make sure I got it right, do you mean that we need to store just
the offsets and set IndexOptions to DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
to be able to use PostingsHighlighter?
Also we don't need to store TermVectors and Posi
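For the PostingsHighlighter route the exchange describes, the offsets go into the postings and no term vectors are needed. A minimal sketch, using Lucene 5.x imports (in 4.x the enum is FieldInfo.IndexOptions) and a helper name of my own:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;

public class PostingsOffsetsField {
  // Offsets are indexed into the postings; PostingsHighlighter reads
  // them from there, so term vectors can be left off entirely.
  static Field contentsField(String line) {
    FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    ft.freeze();
    return new Field("contents", line, ft);
  }
}
```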
On Wed, May 8, 2013 at 4:23 AM, AarKay wrote:
> I see that Lucene 4.x has FieldInfo.IndexOptions that can be used to tell
> lucene whether to Index Documents/Frequencies/Positions/Offsets.
>
> We are in the process of upgrading from Lucene 2.9 to Lucene 4.x and I was
> wondering
I see that Lucene 4.x has FieldInfo.IndexOptions that can be used to tell
lucene whether to Index Documents/Frequencies/Positions/Offsets.
We are in the process of upgrading from Lucene 2.9 to Lucene 4.x and I was
wondering if there was a way to tell lucene whether to index
docs/freqs/pos/offsets
My query is: "Running Apple" (a phrase query)
My doc contents are:
name : Running Apple 60 GB iPod with Video Playback Black - Apple
Please let me know what I'm doing wrong.
Thanks.
Hello,
Is there a way to get term offsets of a given term per document without
enabling term vectors?
Is it correct that the Lucene index stores positions but not offsets by
default?
Thanks,
Phani.
Hi all,
Is it possible to find offsets for phrase queries?
Thanks, Vitaly
Hi all,
I use Lucene 4.0.
I try to find offsets for phrase queries.
My code works when I search for one word, but when I call it for a phrase I
don't get offsets.
termsEnum.seekExact returns false for phrase queries.
reader = DirectoryReader.open( mIndexDir );
IndexSea
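The likely cause is that the index stores individual terms, so seekExact on the whole phrase string finds nothing. One approach, sketched here against the Lucene 4.0-era term-vector API (field and term values are placeholders), is to seek each word of the phrase separately, collect its positions and offsets, and then check for adjacent positions yourself:

```java
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class PhraseOffsets {
  static void printPhraseTermOffsets(IndexReader reader, int docId)
      throws Exception {
    Terms vector = reader.getTermVector(docId, "contents");
    TermsEnum te = vector.iterator(null); // Lucene 4.x signature
    for (String word : new String[] {"running", "apple"}) {
      if (te.seekExact(new BytesRef(word))) {
        DocsAndPositionsEnum dpe = te.docsAndPositions(null, null);
        dpe.nextDoc(); // the vector holds one document
        for (int i = 0; i < dpe.freq(); i++) {
          int pos = dpe.nextPosition();
          // Adjacent positions across the two words form the phrase
          System.out.println(word + " pos=" + pos + " "
              + dpe.startOffset() + "-" + dpe.endOffset());
        }
      }
    }
  }
}
```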
{
int termId;
long startOffset;
long endOffset;
float score;
int segId;
long timeStamp;
}
?
On Sun, Nov 18, 2012 at 12:09 PM, wgggfiy wrote:
> I'm now studying lucene 4.0.
> 1, what is the startOffset and endOffset for ? is there a code example ?
These are set by the analyzer, to the start and end character offset
for this token (using the OffsetAttribute). The offsets a
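As a code example of what Mike describes, the start/end offsets can be observed by consuming a TokenStream directly and reading the OffsetAttribute per token. A sketch against the Lucene 5.x+ API (the no-arg StandardAnalyzer constructor; earlier versions require a Version argument):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class PrintOffsets {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("f", "Running Apple iPod")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // startOffset/endOffset are character positions in the input
        System.out.println(term.toString() + " ["
            + offset.startOffset() + "," + offset.endOffset() + ")");
      }
      ts.end();
    }
  }
}
```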
So for Lucene 3.6, is the right way to do this to create a new Document and add
new Fields based on the old Fields (with the settings you want them to have for
term vector offsets and positions, etc.) and then call updateDocument on that
new Document?
Thanks,
Mike
-Original Message
you originally provided. This
would require extra disk seeks.
See https://issues.apache.org/jira/browse/LUCENE-3312 for an effort to
fix this trap for Google Summer of Code.
On Wed, Aug 22, 2012 at 5:23 PM, Mike O'Leary wrote:
> I have one more question about term vector positions and off
I have one more question about term vector positions and offsets being
preserved. My co-worker is working on updating the documents in an index with a
field that contains a numerical value derived from the term frequencies and
inverse document frequencies of terms in the document. His first
field and adjusts the field flags based on the presence/absence of a term
> vector. FieldInfos were not enough to handle some combinations.
>
> * Luke doesn't show the offsets/positions flags in the document view, since
> they are not known in advance. However, the pop-up that sh
On 27/07/2012 00:50, Mike O'Leary wrote:
Hi Robert,
Thanks for your help. This cleared up all of the things I was having trouble
understanding about offsets and positions in term vectors.
Mike
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, July 20, 2012 5:59 PM
To: java-user@lucene.apache.org
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying
> to determine whether the term vectors that are in the index have offsets and
> positions stored.
Right: what
Hi Robert,
I'm not trying to determine whether a document has term vectors, I'm trying to
determine whether the term vectors that are in the index have offsets and
positions stored. Shouldn't the Field instance variables called
storeOffsetWithTermVector and storePositionWithTermV
out.writeEndElement();
> out.writeEndDocument();
> out.flush();
> reader.close();
> }
>
> private void dumpDocument(Document document, XMLStreamWriter out)
> throws XMLStreamException {
>
-user@lucene.apache.org
Subject: RE: Problem with TermVector offsets and positions not being preserved
Hi Robert,
I put together the following two small applications to try to separate the
problem I am having from my own software and any bugs it contains. One of the
applications is c
eld.name());
out.writeAttribute("value", field.stringValue());
out.writeEndElement();
}
out.writeEndElement();
}
}
-
Hi Mike:
I wrote up some tests last night against 3.6 trying to find some way
to reproduce what you are seeing, e.g. adding additional segments with
the field specified without term vectors, without tv offsets, omitting
TF, and merging them and checking everything out. I couldn't find any
problems
I created an index using Lucene 3.6.0 in which I specified that a certain text
field in each document should be indexed, stored, analyzed with no norms, with
term vectors, offsets and positions. Later I looked at that index in Luke, and
it said that term vectors were created for this field, but
Best regards
Karsten
I was
hoping to avoid that because I am also storing other information in the
Payload which makes it feel a bit messy; especially as it seemed
sensible to me to actually make use of the Offsets field as it already
exists. Anyway, the problem is solved so far, thank you very much!
I still wonder what th
-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Positions
Best regards
Karsten
Dear list,
I am working on a search application that depends on retrieving offsets
for each match. Currently (in Lucene 3.6), this seems to be overly
costly, at least in my solution that looks like this:
---
TermPositionVector
On Fri, Jan 13, 2012 at 9:33 PM, Nishad Prakash wrote:
>
> - More generally, I would like to be able to iterate over positions in
> a document, collecting offset information for those positions as I go.
> Is there any way to do this? I didn't find such an iterator, but I
> may not know where to l
rt of thing, and not
worry about spans?
-Mike
On 1/19/2012 9:46 PM, Nishad Prakash wrote:
I'm going to cry. There is no way to retrieve offsets for position,
rather than for term?
On 1/13/2012 6:33 PM, Nishad Prakash wrote:
I'm having a set of issues in trying to use Lucene that are all
connected to the difficulty of retrieving offsets. I need some advice
on how best to proceed, or a pointer if this has been answered
somewhere.
My app requires that I display all portions of the documents where the
search te
"content" and "hits[i].doc" I see are not null.
The problem is in this line:
"TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(hits[i].doc,"contents");"
Does hits[i].doc represent the doc id?
Thanks
On Thu, Nov 24, 2011 at 11:21 AM, starz10de wrote:
Hi,
No, hits are not null; I can print all retrieved documents without problem.
I'm writing a highlighter by using term offsets as follows:
IndexReader reader = IndexReader.open( indexPath );
TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(
hits[i].doc,"contents");
When I run the searcher, I face this error in
Ter
Hi,
I am using 2.9.2 version of lucene.
For my project I need to find the term positions in the document for it to be
highlighted in the display.
For normal queries it works fine. But with wild card queries, there is no
offset info available.
This is my code:
QueryParser qp = new Que
Hello, everyone!
Could anyone please explain how to get offsets for hits? I.e. I have a big text
file and want to find some string in it. As a result of this operation, I need
an array of offsets (in characters) from the beginning of the file for each
occurrence of the string.
As an example
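For a literal string match like this, the character offsets of every occurrence can actually be found without Lucene at all. A minimal pure-Java sketch (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class Occurrences {
  // Returns the character offset of every occurrence of `needle` in
  // `text`, counted from the beginning of the text. Overlapping
  // matches are included because the search resumes one character
  // after each hit.
  public static List<Integer> offsetsOf(String text, String needle) {
    List<Integer> offsets = new ArrayList<>();
    int from = 0;
    int idx;
    while ((idx = text.indexOf(needle, from)) != -1) {
      offsets.add(idx);
      from = idx + 1;
    }
    return offsets;
  }
}
```

Lucene becomes worthwhile when the "string" is really a query (analysis, stemming, phrases), which is where the term-vector and highlighter approaches in the other threads come in.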
-Original Message-
> > From: Karolina Bernat [mailto:karolina.ber...@googlemail.com]
> > Sent: Tuesday, January 25, 2011 1:45 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Preserving original HTML file offsets for highlighting
> >
> > Hi Uwe,
remen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Karolina Bernat [mailto:karolina.ber...@googlemail.com]
> Sent: Tuesday, January 25, 2011 1:45 PM
> To: java-user@lucene.apache.org
> Subject: Re: Preserving original HTML file offsets for
ng of HTML
> files with Lucene.
>
> What I need to do is highlight the hits (terms) in the original HTML file
> (or get the positions of the terms/tokens in the original file).
> This problem has already been described by Fred Toth in this thread in 2005
> (Preserving origina
> Subject: Preserving original HTML file offsets for highlighting
>
> Hi all,
>
> I'm new to Lucene and have a question about indexing/highlighting of HTML
> files with Lucene.
>
> What I need to do is highlight the hits (terms) in the original HTML file
(or get
> the p
th in this thread in 2005
(Preserving original HTML file offsets for highlighting, need
HTMLTokenizer?):
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c6.2.1.2.2.20050530134630.063ae...@fast.synernet.com%3E
I've searched the mailing list archives hoping for an answer,
> Hi Ahmet,
>
> I am using Lucene.NET with C# so I can't test this quickly.
> Will HTMLStripCharFilter maintain the character offsets or does it just
extract
> the plain text?
Yes the CharFilter does this!
Uwe
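A sketch of Uwe's point: HTMLStripCharFilter strips the markup but keeps a correction map, so token offsets can be translated back into the original HTML. The class lived in org.apache.solr.analysis at the time of this thread; the import below is its later home in Lucene's analysis-common module, and the sample markup is mine:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;

public class HtmlOffsets {
  public static void main(String[] args) throws Exception {
    String html = "<b>iron</b> ore";
    CharFilter stripped = new HTMLStripCharFilter(new StringReader(html));
    // A Tokenizer reading from `stripped` sees "iron ore"; when it
    // sets token offsets it calls stripped.correctOffset(off), which
    // maps each offset in the stripped text back to a position in the
    // original markup, so highlighting can target the raw HTML file.
    System.out.println(stripped.correctOffset(0)); // offset of 'i' in the markup
  }
}
```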
Hi Ahmet,
I am using Lucene.NET with C# so I can't test this quickly.
Will HTMLStripCharFilter maintain the character offsets or does it just
extract the plain text?
Hans
> You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to
> add
to write a tokenizer that strips out the HTML
tags but maintains the original offsets within the HTML document so they
can be used for highlighting the original HTML document, not just the
text representation.
Does anybody know any tokenizers that can do this? It seems it's something
other p
t how to get at
it effectively. It would be extremely helpful if someone could please
point out how to access this from (Py)Lucene v3.0.1.
2) If it's not possible to access the character offsets, token
positions would be a vaguely acceptable fallback but would limit the
final capabilities of
Sent: Monday, April 26, 2010 10:55 AM
To: java-user@lucene.apache.org
Subject: Re: Term offsets for highlighting
Stephen Greene wrote:
Hi Koji,
Thank you. I implemented a solution based on the FieldTermStackTest.java
and if I do a search like "iron ore" it matches iron or ore. The same is
true if I specify iron AND ore.
The termSetMap[0].value[0] = ore and termSetMap[0].value[1] = iron.
What am I missing
tIndexReader(), pintDocId,
fieldName);
-Original Message-
From: Koji Sekiguchi [mailto:k...@r.email.ne.jp]
Sent: Saturday, April 24, 2010 5:18 AM
To: java-user@lucene.apache.org
Subject: Re: Term offsets for highlighting
Hi Steve,
> is there a way to access a TermVector containin
Hi Steve,
> is there a way to access a TermVector containing only matched terms,
> or is my previous approach still the
So you want to access FieldTermStack, I understand.
The way to access it, I wrote it at previous mail:
You cannot access FieldTermStack from FVH, but I think you
can create i
ay, April 19, 2010 9:02 PM
To: java-user@lucene.apache.org
Subject: Re: Term offsets for highlighting
Stephen Greene wrote:
> Hi Koji,
>
> An additional question. Is it possible to access the FieldTermStack
from
> the FastVectorHighlighter after the it has been populated with
matching
with returning positional offsets to have
highlighting tags applied to them in a separate process.
Thank you for your insight,
Steve
Hi Steve,
You cannot access FieldTermStack from FVH, but I think you
can create it by your own. To know how to do it, please refer to
FieldTermStackTest.java
Subject: Re: Term offsets for highlighting
Stephen Greene wrote:
> Hi Koji,
>
> Thank you for your reply. I did try the QueryScorer without success,
but
> I was using Lucene 2.4.x
>
Hi Steve,
I thought you were using 2.9 or later because you mentioned
FastVectorHighlighter in you
rning term
offsets before I set out to refactor the existing code. Do you have any
insight as to whether the fast vector highlighter would offer any
benefits in this area over the highlighter package?
I'm not sure FVH offers benefit for you, but yes, FVH can
recognize phrase and highlight t
solution for returning term
offsets before I set out to refactor the existing code. Do you have any
insight as to whether the fast vector highlighter would offer any
benefits in this area over the highlighter package?
Thank you,
Steve
-Original Message-
From: Koji Sekiguchi [mailto:k
Stephen Greene wrote:
Hello,
I am trying to determine begin and end offsets for terms and phrases
matching a query.
Is there a way using either the highlighter or fast vector highlighter
in contrib?
I have already attempted extending the highlighter which would match
terms but would not match phrases.
The
art of speech) and inside the Payload for position
> specific info (eg a relation id, paragraph id or whatever you want :it's
> a byte[]).
>
> With those techniques you can do many things, you have to be inventive but
> with payloads you can do very interesting things.
> Yo
do very interesting things.
You can also store the offsets inside the payload and don't bother with
term vector!
Well there is really hundreds of solutions to deal with linguistic data
inside lucene. What is hard is when you have to deal with relations but
a triplet store should be more ada
I am quite new to Lucene, but I have searched the FAQs and consulted
the mailing list archive. I debugged through the source code as well.
I have written an Analyzer that analyzes a stream by sending it to a
whole pipeline of linguistic processing and uses the internal
representation to construct
> > (like the google cache view)? how can I keep track of the offsets
> > between tika parser and lucene analyzer?
> Currently Tika doesn't expose that information but the Tika Parser API
> was designed for such use in mind, so it will be possible to add the
> offs
On Sep 2, 2009, at 5:40 AM, David Causse wrote:
Hi,
If I use tika for parsing HTML code and inject parsed String to a lucene
analyzer. What about the offset information for KWIC and return to text
(like the google cache view)? how can I keep track of the offsets
between tika parser and lucene analyzer?
What are the solutions/ideas to do a
Mark,
This is exactly what I want and It worked perfectly. Thanks!
I'll post my highlighter to JIRA in a few days (hopefully).
It uses term offsets with positions (WITH_POSITIONS_OFFSETS)
to support PhraseQuery.
Thanks again,
Koji
Mark Miller wrote:
Okay, Koji, hopefully I'
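The indexing side Koji mentions, term vectors with positions and offsets, looks like this in the Lucene 2.x/3.x Field API of that era (the field name and text are placeholders):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexWithOffsets {
  // WITH_POSITIONS_OFFSETS stores both positions and character
  // offsets in the term vector, which a term-vector-based
  // highlighter needs in order to support PhraseQuery.
  static Document makeDoc(String text) {
    Document doc = new Document();
    doc.add(new Field("contents", text,
        Field.Store.YES, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
    return doc;
  }
}
```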
Hello,
I'm writing a highlighter by using term offsets info (yes, I borrowed
the idea
of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
when getting multi-valued field.
For example, if I indexed [" "," bbb "] (multi-valued), I go
Thanks for raising these!
For the 1st issue (KeywordTokenizer fails to set start/end offset on
its token), I think we can add your two lines to fix it. I'll open an
issue for this.
The 2nd issue (if same field name has more than one NOT_ANALYZED
instance in a doc then the offsets are d
the code of KeywordAnalyzer - and have seen that it doesn't set
the offsets in any case. I wrote my own Analyzer based on KeywordAnalyzer
and added the two lines
reusableToken.setStartOffset(0);
reusableToken.setEndOffset(upto);
inside KeywordTokenizer.next(..). It seems to
, you need to give it offsets to highlight
a string, and I would be glad to do so if I could get the offsets!
I will give snippets of code here:
//read the index, store the terms
try {
inread = IndexReader.open(pathToIndex);
terms = inread.
Thanks Erik. For our purposes it seems more generally useful to use the
original start/end offsets.
Antony
Erik Hatcher wrote:
They aren't used implicitly by anything in Lucene, but can be very handy
for efficient highlighting. Where you set the offsets really all
depends on how you
terFactory. It produces
Analyzing "<[EMAIL PROTECTED]>"
1: [EMAIL PROTECTED]:1->31:]
2: [humphrey:1->9:]
3: [bogart:10->16:]
4: [casablanca:17->27:]
5: [com:28->31:]
I set the start/end offset to be the length of the component, but
in the LIA book listing 4.6 sho