Re: offsets

2018-08-04 Thread Michael Sokolov
OK, so I thought some more concrete evidence might be helpful to make the case here and did a quick POC. To get access to precise within-token offsets we do need to make some changes to the public API, but the profile could be kept small. In the version I worked up, I extracted the character

Re: offsets

2018-08-01 Thread Michael Sokolov
Given that character transformations do happen in TokenFilters, shouldn't we strive to have an API that supports correct offsets (ie highlighting) for any combination of token filters? Currently we can't do that. For example because of the current situation, WordDelimiterGraphFilter, dec

Re: offsets

2018-07-31 Thread Robert Muir
The problem is not a performance one, it's a complexity thing. Really I think only the tokenizer should be messing with the offsets... They are the ones actually parsing the original content so it makes sense they would produce the pointers back to them. I know there are some tokenfilters out there

Re: offsets

2018-07-30 Thread Michael Sokolov
correct offsets in the face of manipulations like replacing ellipses, ligatures (like AE, OE), trademark symbols (replaced by "tm") and the like so that we can have the invariant that correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length() == correctOffset(OffsetAttribute

Re: offsets

2018-07-29 Thread Michael McCandless
How would a fixup API work? We would try to provide correctOffset throughout the full analysis chain? Mike McCandless http://blog.mikemccandless.com On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I've run into some difficulties with offsets in some TokenFilters I've b

Re: offsets

2018-07-25 Thread Robert Muir
I think you see it correctly. Currently, only tokenizers can really safely modify offsets, because only they have access to the correction logic from the charfilter. Doing it from a tokenfilter just means you will have bugs... On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I
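The "correction logic" mentioned above can be pictured as a table of cumulative length differences between the filtered text and the original text. The following is only a minimal sketch of that idea under stated assumptions: the class OffsetCorrectionSketch and its methods addCorrection and correctOffset are hypothetical names invented for illustration, not Lucene's API, and the real char filter implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not a Lucene class): maps offsets in filtered text
 * back to offsets in the original text.
 */
class OffsetCorrectionSketch {
    // Each entry is {from, cumulativeDelta}: for filtered offsets >= from,
    // original offset = filtered offset + cumulativeDelta.
    // Entries must be added in increasing order of `from`.
    private final List<int[]> corrections = new ArrayList<>();

    void addCorrection(int from, int cumulativeDelta) {
        corrections.add(new int[] {from, cumulativeDelta});
    }

    int correctOffset(int filteredOffset) {
        int delta = 0;
        for (int[] c : corrections) {
            if (c[0] > filteredOffset) {
                break; // later corrections do not apply yet
            }
            delta = c[1];
        }
        return filteredOffset + delta;
    }

    public static void main(String[] args) {
        // Original text: "wait\u2026 ok" (8 chars). A filter expands the single
        // ellipsis character into "...", giving "wait... ok" (10 chars).
        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        // Everything from filtered offset 7 onward sits 2 chars further right
        // than in the original.
        map.addCorrection(7, -2);
        // "ok" spans [8, 10) in the filtered text and [6, 8) in the original.
        System.out.println(map.correctOffset(8) + "-" + map.correctOffset(10));
    }
}
```

A component that only sees the filtered text and has no access to such a table cannot recover the original offsets, which is the difficulty this thread describes for token filters.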

offsets

2018-07-25 Thread Michael Sokolov
I've run into some difficulties with offsets in some TokenFilters I've been writing, and I wonder if anyone can shed any light. Because characters may be inserted or removed by prior filters (eg ICUFoldingFilter does this with ellipses), and there is no offset-correcting data structure

Re: offsets of a term in a document

2015-09-21 Thread Alan Woodward
> > The second question is what I should put in place of "???". The API says > "pass a prior PostingsEnum for possible reuse", but I don't get how to create > an instance of it. You can just pass null. Alan Woodward www.flax.co.uk > > Many thanks! > > > ---

offsets of a term in a document

2015-09-21 Thread Ziqi Zhang
Hi Given a document in a lucene index, I would like to get a list of terms in that document and their offsets. I suppose starting with IndexReader.getTermVector can get me going with this. I have some code as below (Lucene 5.3) of which I have some questions

Re: Trying to store Offsets. Dont know the exact meaning of some terms.

2013-08-14 Thread rizwan patel
Thanks Mike, this clarifies my understanding as well. Regds, Rizwan On Wed, Aug 14, 2013 at 7:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > I think you just need to add fieldType.setStoreTermVectors(true) as well. > > However, I see you are also indexing o

Re: Trying to store Offsets. Dont know the exact meaning of some terms.

2013-08-14 Thread Michael McCandless
I think you just need to add fieldType.setStoreTermVectors(true) as well. However, I see you are also indexing offsets into the postings, which is wasteful because now you've indexed offsets twice in your index. Usually only one place is needed, i.e. if you will use PostingsHighlighter,

Re: Trying to store Offsets. Dont know the exact meaning of some terms.

2013-08-14 Thread rizwan patel
PM, Ankit Murarka < ankit.mura...@rancoretech.com> wrote: > Hello, > I generally add fields to my document in the following manner. I > wish to add offsets to this field. > > doc.add(new StringField("contents",line,Field.Store.YES)); > > I wish to also

Trying to store Offsets. Dont know the exact meaning of some terms.

2013-08-13 Thread Ankit Murarka
Hello, I generally add fields to my document in the following manner. I wish to add offsets to this field. doc.add(new StringField("contents",line,Field.Store.YES)); I wish to also store offsets. So, I went through javadoc, and found I need to use FieldType. So, I ende

Re: IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS vs storing positions and offsets/

2013-05-08 Thread Michael McCandless
On Wed, May 8, 2013 at 9:03 AM, AarKay wrote: > Thanks Mike. This is a little bit clearer to me now. > > Just to make sure I got it right, do you mean that we need to store just > the offsets and set IndexOptions to DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS > to be able to use Post

Re: IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS vs storing positions and offsets/

2013-05-08 Thread AarKay
Thanks Mike. This is a little bit clearer to me now. Just to make sure I got it right, do you mean that we need to store just the offsets and set IndexOptions to DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS to be able to use PostingsHighlighter? Also we don't need to store TermVectors and Posi

Re: IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS vs storing positions and offsets/

2013-05-08 Thread Michael McCandless
On Wed, May 8, 2013 at 4:23 AM, AarKay wrote: > I see that Lucene 4.x has FieldInfo.IndexOptions that can be used to tell > lucene whether to Index Documents/Frequencies/Positions/Offsets. > > We are in the process of upgrading from Lucene 2.9 to Lucene 4.x and I was > wondering

IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS vs storing positions and offsets/

2013-05-08 Thread AarKay
I see that Lucene 4.x has FieldInfo.IndexOptions that can be used to tell lucene whether to Index Documents/Frequencies/Positions/Offsets. We are in the process of upgrading from Lucene 2.9 to Lucene 4.x and I was wondering if there was a way to tell lucene whether to index docs/freqs/pos/offsets

Re: Token Stream with Offsets (Token Sources class)

2013-04-09 Thread vempap
.nabble.com/Token-Stream-with-Offsets-Token-Sources-class-tp4054383p4054830.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional

Re: Token Stream with Offsets (Token Sources class)

2013-04-08 Thread vempap
this message in context: http://lucene.472066.n3.nabble.com/Token-Stream-with-Offsets-Token-Sources-class-tp4054383p4054514.html

Re: Token Stream with Offsets (Token Sources class)

2013-04-07 Thread Simon Willnauer
query is : "Running Apple" (a phrase query) > my doc contents are : > name : Running Apple 60 GB iPod with Video Playback Black - Apple > > Please let me know if I'm doing anything wrong. > > Thanks. > > >

Token Stream with Offsets (Token Sources class)

2013-04-07 Thread vempap
are : name : Running Apple 60 GB iPod with Video Playback Black - Apple Please let me know if I'm doing anything wrong. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Token-Stream-with-Offsets-Token-Sources-

To get Term Offsets of a term per document

2013-02-20 Thread vempap
Hello, Is there a way to get Term Offsets of a given term per document without enabling the termVectors ? Is it that Lucene index stores the positions but not the offsets by default - is it correct ? Thanks, Phani. -- View this message in context: http://lucene.472066.n3.nabble.com/To-get

Lucene 4.0.0 - find offsets for phrase queries

2012-12-18 Thread Vitaly_Artemov
Hi all, Is it possible to find offsets for phrase queries? Thanks, Vitaly

Lucene 4.0.0 - find offsets for phrase queries

2012-12-17 Thread Vitaly_Artemov
Hi all, I use Lucene 4.0 and am trying to find offsets for phrase queries. My code works when I search for one word, but when I call it for a phrase I don't get offsets. termsEnum.seekExact returns false for phrase queries. reader = DirectoryReader.open( mIndexDir ); IndexSea

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-27 Thread David Causse
float score; int segId; long timeStamp; } ? -- View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-27 Thread Wu, Stephen T., Ph.D.
g endOffset; >> float score; >> int segId; >> long timeStamp; >> } >> ? >> >> >> >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-i

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-23 Thread Ian Lea
q.com > -- > -- > View this message in context: > http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4021981.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > --

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-23 Thread wgggfiy
text: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4021981.html

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-19 Thread Michael McCandless
segId; > long timeStamp; > } > ? > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html > Sent from the Lucene - Java Users mailing list ar

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-18 Thread wgggfiy
{ int termId; long startOffset; long endOffset; float score; int segId; long timeStamp; } ? -- View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html Sent from the

Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-18 Thread Michael McCandless
On Sun, Nov 18, 2012 at 12:09 PM, wgggfiy wrote: > I'm now studying lucene 4.0. > 1, what is the startOffset and endOffset for ? is there a code example ? These are set by the analyzer, to the start and end character offset for this token (using the OffsetAttribute). The offsets a
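To make the meaning of startOffset and endOffset concrete, here is a minimal sketch of a whitespace tokenizer that records, for each token, the character offset where it starts and the offset just past its last character. OffsetTokenizerSketch and Token are hypothetical names written for this illustration in plain Java; they only mimic the kind of values an analyzer would report through OffsetAttribute and are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not a Lucene class): whitespace tokens with character offsets. */
class OffsetTokenizerSketch {
    /** A token plus its character offsets in the input; endOffset is exclusive. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one run of non-whitespace characters.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(text.substring(start, i), start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Running Apple iPod")) {
            System.out.println(t.term + " [" + t.startOffset + ", " + t.endOffset + ")");
        }
    }
}
```

With offsets like these, a highlighter can cut the matching span straight out of the original text using substring(startOffset, endOffset) without re-analyzing it.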

what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-18 Thread wgggfiy
context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933.html

RE: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Mike O'Leary
So for Lucene 3.6, is the right way to do this to create a new Document and add new Fields based on the old Fields (with the settings you want them to have for term vector offsets and positions, etc.) and then call updateDocument on that new Document? Thanks, Mike -Original Message

Re: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Robert Muir
you originally provided. This would require extra disk seeks. See https://issues.apache.org/jira/browse/LUCENE-3312 for an effort to fix this trap for google summer of code. On Wed, Aug 22, 2012 at 5:23 PM, Mike O'Leary wrote: > I have one more question about term vector positions and off

RE: Problem with TermVector offsets and positions not being preserved

2012-08-22 Thread Mike O'Leary
I have one more question about term vector positions and offsets being preserved. My co-worker is working on updating the documents in an index with a field that contains a numerical value derived from the term frequencies and inverse document frequencies of terms in the document. His first

Re: Problem with TermVector offsets and positions not being preserved

2012-07-27 Thread Robert Muir
field and adjusts the field flags based on the presence/absence of a term > vector. FieldInfos were not enough to handle some combinations. > > * Luke doesn't show the offsets/positions flags in the document view, since > they are not known in advance. However, the pop-up that sh

Re: Problem with TermVector offsets and positions not being preserved

2012-07-27 Thread Andrzej Bialecki
On 27/07/2012 00:50, Mike O'Leary wrote: Hi Robert, Thanks for your help. This cleared up all of the things I was having trouble understanding about offsets and positions in term vectors. Mike -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, July 20, 2

RE: Problem with TermVector offsets and positions not being preserved

2012-07-26 Thread Mike O'Leary
Hi Robert, Thanks for your help. This cleared up all of the things I was having trouble understanding about offsets and positions in term vectors. Mike -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, July 20, 2012 5:59 PM To: java-user@lucene.apache.org

Re: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Robert Muir
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote: > Hi Robert, > I'm not trying to determine whether a document has term vectors, I'm trying > to determine whether the term vectors that are in the index have offsets and > positions > stored. Right: what

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
Hi Robert, I'm not trying to determine whether a document has term vectors, I'm trying to determine whether the term vectors that are in the index have offsets and positions stored. Shouldn't the Field instance variables called storeOffsetWithTermVector and storePositionWithTermV

Re: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Robert Muir
out.writeEndElement(); > out.writeEndDocument(); > out.flush(); > reader.close(); > } > > private void dumpDocument(Document document, XMLStreamWriter out) > throws XMLStreamException { >

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
-user@lucene.apache.org Subject: RE: Problem with TermVector offsets and positions not being preserved Hi Robert, I put together the following two small applications to try to separate the problem I am having from my own software and any bugs it contains. One of the applications is c

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
eld.name()); out.writeAttribute("value", field.stringValue()); out.writeEndElement(); } out.writeEndElement(); } } -

Re: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Robert Muir
Hi Mike: I wrote up some tests last night against 3.6 trying to find some way to reproduce what you are seeing, e.g. adding additional segments with the field specified without term vectors, without tv offsets, omitting TF, and merging them and checking everything out. I couldn't find any problems

Problem with TermVector offsets and positions not being preserved

2012-07-19 Thread Mike O'Leary
I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but

Re: Offsets in 3.6/4.0

2012-07-17 Thread karsten-solr
. ) Best regards Karsten in Context: http://lucene.472066.n3.nabble.com/Offsets-in-3-6-4-0-td3994830.html#a3995288

Re: Offsets in 3.6/4.0

2012-07-17 Thread Carsten Schnober
I was hoping to avoid that because I am also storing other information in the Payload which makes it feel a bit messy; especially as it seemed sensible to me to actually make use of the Offsets field as it already exists. Anyway, the problem is solved so far, thank you very much! I still wonder what th

Re: Offsets in 3.6/4.0

2012-07-16 Thread karsten-solr
-ALPHA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Positions Best regards Karsten P.S. in context: http://lucene.472066.n3.nabble.com/Offsets-in-3-6-4-0-td3994830.html

Offsets in 3.6/4.0

2012-07-13 Thread Carsten Schnober
Dear list, I am working on a search application that depends on retrieving offsets for each match. Currently (in Lucene 3.6), this seems to be overly costly, at least in my solution that looks like this: --- TermPositionVector

Re: Retrieving offsets

2012-01-19 Thread Robert Muir
On Fri, Jan 13, 2012 at 9:33 PM, Nishad Prakash wrote: > > - More generally, I would like to be able to iterate over positions in > a document, collecting offset information for those positions as I go. > Is there any way to do this?  I didn't find such an iterator, but I > may not know where to l

Re: Retrieving offsets

2012-01-19 Thread Mike Sokolov
rt of thing, and not worry about spans? -Mike On 1/19/2012 9:46 PM, Nishad Prakash wrote: I'm going to cry. There is no way to retrieve offsets for position, rather than for term? On 1/13/2012 6:33 PM, Nishad Prakash wrote: I'm having a set of issues in trying to use Luce

Re: Retrieving offsets

2012-01-19 Thread Nishad Prakash
I'm going to cry. There is no way to retrieve offsets for position, rather than for term? On 1/13/2012 6:33 PM, Nishad Prakash wrote: I'm having a set of issues in trying to use Lucene that are all connected to the difficulty of retrieving offsets. I need some advice on h

Retrieving offsets

2012-01-13 Thread Nishad Prakash
I'm having a set of issues in trying to use Lucene that are all connected to the difficulty of retrieving offsets. I need some advice on how best to proceed, or a pointer if this has been answered somewhere. My app requires that I display all portions of the documents where the search te

Re: highlighter by using term offsets

2011-11-24 Thread Ian Lea
"content" and "hits[i].doc" I see that are not null. > The problem is in this line > > "TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector > (hits[i].doc,"contents"); " > > Does hits[i].doc represent the doc id? > > Thanks

Re: highlighter by using term offsets

2011-11-24 Thread starz10de
its[i].doc" I see that are not null. The problem is in this line "TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector (hits[i].doc,"contents"); " Does hits[i].doc represent the doc id? Thanks -- View this message in c

Re: highlighter by using term offsets

2011-11-24 Thread Ian Lea
On Thu, Nov 24, 2011 at 11:21 AM, starz10de wrote: > Hi, > > no, hits is not null; I can print all retrieved documents without problem. > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/highlighter-by-using-term-offsets-tp3527712p3533380.html > Sen

Re: highlighter by using term offsets

2011-11-24 Thread starz10de
Hi, no, hits is not null; I can print all retrieved documents without problem. -- View this message in context: http://lucene.472066.n3.nabble.com/highlighter-by-using-term-offsets-tp3527712p3533380.html Sent from the Lucene - Java Users mailing list archive at Nabble.com

Re: highlighter by using term offsets

2011-11-23 Thread Ian Lea
: > >  I'm writing a highlighter by using term offsets as follows: > > > IndexReader reader = IndexReader.open( indexPath ); > TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector( > hits[i].doc,"contents"); > > Whe

highlighter by using term offsets

2011-11-22 Thread starz10de
I'm writing a highlighter by using term offsets as follows: IndexReader reader = IndexReader.open( indexPath ); TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector( hits[i].doc,"contents"); When I run the searcher, I face this error in Ter

How to get the term offsets for wild card queries?

2011-11-10 Thread Vidya Kanigiluppai Sivasubramanian
Hi, I am using 2.9.2 version of lucene. For my project I need to find the term positions in the document for it to be highlighted in the display. For normal queries it works fine. But with wild card queries, there is no offset info available. This is my code: QueryParser qp = new Que

How to get hit offsets?

2011-09-12 Thread Dmitry Savenko
Hello, everyone! Could anyone please explain how to get offsets for hits? I.e. I have a big text file and want to find some string in it. As a result of this operation, I need an array of offsets (in characters) from the beginning of the file for each occurrence of the string. As an example

Re: Preserving original HTML file offsets for highlighting

2011-01-26 Thread Karolina Bernat
-Original Message- > > From: Karolina Bernat [mailto:karolina.ber...@googlemail.com] > > Sent: Tuesday, January 25, 2011 1:45 PM > > To: java-user@lucene.apache.org > > Subject: Re: Preserving original HTML file offsets for highlighting > > > > Hi Uwe, > > >

RE: Preserving original HTML file offsets for highlighting

2011-01-25 Thread Uwe Schindler
remen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Karolina Bernat [mailto:karolina.ber...@googlemail.com] > Sent: Tuesday, January 25, 2011 1:45 PM > To: java-user@lucene.apache.org > Subject: Re: Preserving original HTML file offsets for

Re: Preserving original HTML file offsets for highlighting

2011-01-25 Thread Karolina Bernat
ng of HTML > files with Lucene. > > What I need to do is highlight the hits (terms) in the original HTML file > (or get the positions of the terms/tokens in the original file). > This problem has already been described by Fred Toth in this thread in 2005 > (Preserving origina

RE: Preserving original HTML file offsets for highlighting

2011-01-24 Thread Uwe Schindler
rg > Subject: Preserving original HTML file offsets for highlighting > > Hi all, > > I'm new to Lucene and have a question about indexing/highlighting of HTML > files with Lucene. > > What I need to do is highlight the hits (terms) in the original HTML file (or get > the p

Preserving original HTML file offsets for highlighting

2011-01-24 Thread Karolina Bernat
th in this thread in 2005 (Preserving original HTML file offsets for highlighting, need HTMLTokenizer?): http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c6.2.1.2.2.20050530134630.063ae...@fast.synernet.com%3E I've searched the mailing list archives hoping for an answer,

RE: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-08 Thread Uwe Schindler
> Hi Ahmet, > > I am using Lucene.NET with C# so I can't test this quickly. > Will HTMLStripCharFilter maintain the character offsets or does it just extract > the plain text? Yes the CharFilter does this! Uwe
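The idea of stripping markup while keeping offsets into the original document can be sketched as follows. This is an illustration only, under loud assumptions: HtmlStripOffsetsSketch is a hypothetical class, it treats anything between '<' and '>' as a tag, and it ignores entities, comments and scripts. It is not how HTMLStripCharFilter is implemented; it only shows why a per-character position map is enough to highlight in the original HTML.

```java
import java.util.Arrays;

/** Hypothetical sketch: strip tags but remember where each kept character came from. */
class HtmlStripOffsetsSketch {
    final String text;         // input with tags removed
    final int[] originalPos;   // originalPos[i] = index in the HTML of text.charAt(i)

    HtmlStripOffsetsSketch(String html) {
        StringBuilder sb = new StringBuilder();
        int[] map = new int[html.length() + 1];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;              // naive: every '<' opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[sb.length()] = i;      // remember the source position
                sb.append(c);
            }
        }
        map[sb.length()] = html.length();  // sentinel for the end of the text
        text = sb.toString();
        originalPos = Arrays.copyOf(map, sb.length() + 1);
    }

    /** Original index of the first character of a token starting at strippedOffset. */
    int correctStart(int strippedOffset) {
        return originalPos[strippedOffset];
    }

    /** Original index just past the last character of a token ending at strippedOffset. */
    int correctEnd(int strippedOffset) {
        return strippedOffset == 0 ? originalPos[0] : originalPos[strippedOffset - 1] + 1;
    }

    public static void main(String[] args) {
        String html = "<b>iron</b> ore";
        HtmlStripOffsetsSketch s = new HtmlStripOffsetsSketch(html);
        // "iron" is [0, 4) in the stripped text; map it back to the HTML.
        System.out.println(html.substring(s.correctStart(0), s.correctEnd(4)));
    }
}
```

Mapping the end offset through the last kept character, rather than through the next one, keeps a highlight from swallowing a closing tag that follows the token.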

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-08 Thread Hans Merkl
Hi Ahmet, I am using Lucene.NET with C# so I can't test this quickly. Will HTMLStripCharFilter maintain the character offsets or does it just extract the plain text? Hans > You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to > add

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Ahmet Arslan
ling. > > I think it should be possible to write a tokenizer that > strips out the HTML > tags but maintains the original offsets within the HTML > document so they > can be used for highlighting the original HTML document, > not just the > text representation. > > Does any

Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Hans Merkl
to write a tokenizer that strips out the HTML tags but maintains the original offsets within the HTML document so they can be used for highlighting the original HTML document, not just the text representation. Does anybody know any tokenizers that can do this? It seems it's something other p

Matching term document character offsets (PyLucene 3.0.1)

2010-05-16 Thread Ben Phelan
t how to get at it effectively. It would be extremely helpful if someone could please point out how to access this from (Py)Lucene v3.0.1. 2) If it's not possible to access the character offsets, token positions would be a vaguely acceptable fallback but would limit the final capabilities of

RE: Term offsets for highlighting

2010-04-27 Thread Stephen Greene
: Monday, April 26, 2010 10:55 AM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hi Koji, > > Thank you. I implemented a solution based on the FieldTermStackTest.java > and if I do a search like "iron ore" it matches iron or o

Re: Term offsets for highlighting

2010-04-26 Thread Koji Sekiguchi
Stephen Greene wrote: Hi Koji, Thank you. I implemented a solution based on the FieldTermStackTest.java and if I do a search like "iron ore" it matches iron or ore. The same is true if I specify iron AND ore. The termSetMap[0].value[0] = ore and termSetMap[0].value[1] = iron. What am I missing

RE: Term offsets for highlighting

2010-04-26 Thread Stephen Greene
tIndexReader(), pintDocId, fieldName); -Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Saturday, April 24, 2010 5:18 AM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Hi Steve, > is there a way to access a TermVector containin

Re: Term offsets for highlighting

2010-04-24 Thread Koji Sekiguchi
Hi Steve, > is there a way to access a TermVector containing only matched terms, > or is my previous approach still the So you want to access FieldTermStack, I understand. The way to access it, I wrote it at previous mail: You cannot access FieldTermStack from FVH, but I think you can create i

RE: Term offsets for highlighting

2010-04-22 Thread Stephen Greene
ay, April 19, 2010 9:02 PM To: java-user@lucene.apache.org Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hi Koji, > > An additional question. Is it possible to access the FieldTermStack from > the FastVectorHighlighter after it has been populated with matching

Re: Term offsets for highlighting

2010-04-19 Thread Koji Sekiguchi
with returning positional offsets to have highlighting tags applied to them in a separate process. Thank you for your insight, Steve Hi Steve, You cannot access FieldTermStack from FVH, but I think you can create it by your own. To know how to do it, please refer to FieldTermStackTest.java

RE: Term offsets for highlighting

2010-04-19 Thread Stephen Greene
positional offsets to have highlighting tags applied to them in a separate process. Thank you for your insight, Steve -Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Sunday, April 18, 2010 10:42 AM To: java-user@lucene.apache.org Subject: Re: Term offsets for

RE: Term offsets for highlighting

2010-04-19 Thread Stephen Greene
Subject: Re: Term offsets for highlighting Stephen Greene wrote: > Hi Koji, > > Thank you for your reply. I did try the QueryScorer without success, but > I was using Lucene 2.4.x > Hi Steve, I thought you were using 2.9 or later because you mentioned FastVectorHighlighter in you

Re: Term offsets for highlighting

2010-04-18 Thread Koji Sekiguchi
rning term offsets before I set out to refactor the existing code. Do you have any insight as to whether the fast vector highlighter would offer any benefits in this area over the highlighter package? I'm not sure FVH offers benefit for you, but yes, FVH can recognize phrase and highlight t

RE: Term offsets for highlighting

2010-04-18 Thread Stephen Greene
solution for returning term offsets before I set out to refactor the existing code. Do you have any insight as to whether the fast vector highlighter would offer any benefits in this area over the highlighter package? Thank you, Steve -Original Message- From: Koji Sekiguchi [mailto:k

Re: Term offsets for highlighting

2010-04-16 Thread Koji Sekiguchi
Stephen Greene wrote: Hello, I am trying to determine begin and end offsets for terms and phrases matching a query. Is there a way using either the highlighter or fast vector highlighter in contrib? I have already attempted extending the highlighter which would match terms but would not

Term offsets for highlighting

2010-04-16 Thread Stephen Greene
Hello, I am trying to determine begin and end offsets for terms and phrases matching a query. Is there a way using either the highlighter or fast vector highlighter in contrib? I have already attempted extending the highlighter which would match terms but would not match phrases. The

Re: Getting left and right offsets of term search results

2009-10-12 Thread Till Kolter
art of speech) and inside the Payload for position > specific info (eg a relation id, paragraph id or whatever you want :it's > a byte[]). > > With those techniques you can do many things, you have to be inventive but > with payloads you can do very interesting things. > Yo

Re: Getting left and right offsets of term search results

2009-10-09 Thread David Causse
do very interesting things. You can also store the offsets inside the payload and don't bother with term vector! Well there is really hundreds of solutions to deal with linguistic data inside lucene. What is hard is when you have to deal with relations but a triplet store should be more ada
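One way to picture "storing the offsets inside the payload" is as a fixed-width encoding of the start and end offsets into the payload's byte[]. The sketch below shows only that encoding step. OffsetPayloadSketch, encode and decode are hypothetical names, and the 8-byte big-endian layout is an assumption chosen for simplicity; attaching the bytes to a token and reading them back at search time is left out, since that depends on the Lucene version in use.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: pack a token's start/end offsets into a payload-sized byte[]. */
class OffsetPayloadSketch {
    /** Encodes both offsets as two big-endian 32-bit ints (8 bytes total). */
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8).putInt(startOffset).putInt(endOffset).array();
    }

    /** Decodes a payload produced by encode back into {startOffset, endOffset}. */
    static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] {buf.getInt(), buf.getInt()};
    }

    public static void main(String[] args) {
        byte[] payload = encode(17, 27);
        int[] offsets = decode(payload);
        System.out.println(payload.length + " bytes: " + offsets[0] + "->" + offsets[1]);
    }
}
```

If the payload also carries other per-position data, as a later thread in this listing mentions, the layout would need a field order or tag agreed between the indexing and reading code.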

Getting left and right offsets of term search results

2009-10-09 Thread Till Kolter
I am quite new to Lucene, but I have searched the FAQs and consulted the mailing list archive. I debugged through the source code as well. I have written an Analyzer that analyzes a stream by sending it to a whole pipeline of linguistic processing and uses the internal representation to construct

Re: Use of tika for parsing, offsets questions

2009-09-04 Thread David Causse
to text > > (like the google cache view)? how can I keep track of the offsets > > between tika parser and lucene analyzer? > > Currently Tika doesn't expose that information but the Tika Parser API > was designed for such use in mind, so it will be possible to add the > offs

Re: Use of tika for parsing, offsets questions

2009-09-03 Thread Grant Ingersoll
On Sep 2, 2009, at 5:40 AM, David Causse wrote: Hi, If I use tika for parsing HTML code and inject parsed String to a lucene analyzer. What about the offset information for KWIC and return to text (like the google cache view)? how can I keep track of the offsets between tika parser and

RE: Use of tika for parsing, offsets questions

2009-09-03 Thread Uwe Schindler
ika for parsing, offsets questions > > Hi, > > On Wed, Sep 2, 2009 at 2:40 PM, David Causse wrote: > > If I use tika for parsing HTML code and inject parsed String to a lucene > > analyzer. What about the offset information for KWIC and return to text > > (like the go

Re: Use of tika for parsing, offsets questions

2009-09-03 Thread Jukka Zitting
Hi, On Wed, Sep 2, 2009 at 2:40 PM, David Causse wrote: > If I use tika for parsing HTML code and inject parsed String to a lucene > analyzer. What about the offset information for KWIC and return to text > (like the google cache view)? how can I keep track of the offsets > between

Use of tika for parsing, offsets questions

2009-09-02 Thread David Causse
Hi, If I use tika for parsing HTML code and inject parsed String to a lucene analyzer. What about the offset information for KWIC and return to text (like the google cache view)? how can I keep track of the offsets between tika parser and lucene analyzer? What are the solutions/ideas to do a

Re: term offsets info seems to be wrong...

2009-01-16 Thread Koji Sekiguchi
Mark, This is exactly what I want and it worked perfectly. Thanks! I'll post my highlighter to JIRA in a few days (hopefully). It uses term offsets with positions (WITH_POSITIONS_OFFSETS) to support PhraseQuery. Thanks again, Koji Mark Miller wrote: Okay, Koji, hopefully I'

Re: term offsets info seems to be wrong...

2009-01-16 Thread Mark Miller
> > I'm writing a highlighter by using term offsets info (yes, I borrowed > the idea > of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info > when getting multi-valued field. > > For example, if I indexed [" "," bbb "] (m

term offsets info seems to be wrong...

2009-01-16 Thread Koji Sekiguchi
Hello, I'm writing a highlighter by using term offsets info (yes, I borrowed the idea of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info when getting multi-valued field. For example, if I indexed [" "," bbb "] (multi-valued), I go

Re: term offsets wrong depending on analyzer

2008-11-11 Thread Michael McCandless
the code of KeywordAnalyzer - and have seen that it doesn't set the offsets in any case. I wrote my own Analyzer based on KeywordAnalyzer and added the two lines reusableToken.setStartOffset(0); reusableToken.setEndOffset(upto); inside KeywordTokenizer.next(..). It see

Re: term offsets wrong depending on analyzer

2008-11-07 Thread Michael McCandless
Thanks for raising these! For the 1st issue (KeywordTokenizer fails to set start/end offset on its token), I think we add your two lines to fix it. I'll open an issue for this. The 2nd issue (if same field name has more than one NOT_ANALYZED instance in a doc then the offsets are d

term offsets wrong depending on analyzer

2008-11-07 Thread Christian Reuschling
the code of KeywordAnalyzer - and have seen that it doesn't set the offsets in any case. I wrote my own Analyzer based on KeywordAnalyzer and added the two lines reusableToken.setStartOffset(0); reusableToken.setEndOffset(upto); inside KeywordTokenizer.next(..). It seems to

Offsets-highlight newbie question

2008-02-10 Thread Katya
, you need to give it offsets to highlight a string, and I would be glad to do so if I could get the offsets! I will give snippets of code here: //read the index, store the terms try { inread = IndexReader.open(pathToIndex); terms = inread.

Re: Start/end offsets in analyzers

2007-03-28 Thread Antony Bowesman
Thanks Erik. For our purposes it seems more generally useful to use the original start/end offsets. Antony Erik Hatcher wrote: They aren't used implicitly by anything in Lucene, but can be very handy for efficient highlighting. Where you set the offsets really all depends on how you

Re: Start/end offsets in analyzers

2007-03-28 Thread Erik Hatcher
terFactory. It produces Analyzing "<[EMAIL PROTECTED]>" 1: [EMAIL PROTECTED]:1->31:] 2: [humphrey:1->9:] 3: [bogart:10->16:] 4: [casablanca:17->27:] 5: [com:28->31:] I set the start/end offset to be the length of the component, but in the LIA book listing 4.6 sho
