Our mails are crossing.... Not that I know of. But why don't you just index (or maybe just store) a separate field containing your offset information? Something like title_offset with, say, a comma-separated pair denoting char position and length that you then read in at search time and parse.....
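For concreteness, a rough sketch of that suggestion against the Lucene 2.x-era API used in the code quoted below; the field name title_offset and the two helper methods are made up for illustration, not code from this thread:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: index the title normally, and additionally store the title's
// original character position and length in a stored, non-indexed field that
// can be read straight back off the hit at search time.
Document makeDoc(String titleText, int titleStart, int titleLength) {
    Document doc = new Document();
    doc.add(new Field("title", titleText,
                      Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("title_offset",
                      titleStart + "," + titleLength,      // e.g. "76,124"
                      Field.Store.YES, Field.Index.NO));   // stored, never analyzed
    return doc;
}

// At search time, read the pair back from the matching document and parse it.
int[] parseTitleOffset(Document hitDoc) {
    String[] parts = hitDoc.get("title_offset").split(",");
    return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
}

Because title_offset is never analyzed, it is untouched by whatever tokenization happens on the searchable fields.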
But your tokenizer controls *everything*. Why isn't the Token being returned from your next() method being constructed with the offsets you desire?

Erick

On Fri, Mar 7, 2008 at 2:39 PM, Steve Suppe <[EMAIL PROTECTED]> wrote:
> OK, I think I understand what's going on - it looks like I am able to set
> the token for the full author name (say, "Steve Suppe") with the correct
> offsets, but the analyzer takes it one step further and tokenizes 'Steve'
> and 'Suppe', which is giving me a lot more generated offsets and is
> confusing me.
>
> I like the tokenization, as it allows me to just search for Suppe and get
> results. However, I don't want those "sub-offsets" returned. Is there a
> way to distinguish the 'main' offsets for the whole field?
>
> Thanks again,
> Steve
>
> At 10:38 AM 3/7/2008, you wrote:
> >Hi all,
> >
> >I'm trying to index documents so that a) I have all the documents indexed
> >'normally' (in that I can search for documents that match certain words),
> >and b) parts of the document that I consider important, such as author
> >and title, are ALSO stored in their own indexed fields.
> >
> >I have (a) working fine, and (b) is almost working - however, I'm trying
> >to force the separate field to have the original offsets of where it
> >existed in the text. As in, if the title was at characters 76-200 in the
> >original text, I'd like the field to have that as its information, so
> >when I look at the field I can find the place in the document quickly.
> >
> >I don't seem to be able to do this - I have my own analyzer that finds
> >the tokens and sets the start and end offsets accordingly. However, when
> >I create the new field and write it to the index, it seems like these
> >offsets are ignored? When I pull offsets out later, they start at 0 and
> >move up from there.
> >
> >I am creating the field like:
> >
> >CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
> >analyzer.addAnalyzer(info.indexName, psa);
> >
> >TokenStream ts = psa.tokenStream(info.indexName,
> >                                 new StringReader(info.value));
> >Field stemF = new Field(info.indexName, ts,
> >                        Field.TermVector.WITH_POSITIONS_OFFSETS);
> >d.add(stemF);
> >
> >(d is the document being indexed).
> >
> >I have tried various permutations of creating the field and token stream -
> >does anyone have any insights, please?
> >
> >Thanks in advance,
> >Steve
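To make Erick's question above concrete: in the 2.x-era API, the start/end offsets that the Token carries out of next() are what get recorded for that token. A minimal sketch, with the class name and constructor arguments invented for illustration, of a stream that emits a single token for the whole field value, anchored at its original position in the source text:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical stream returning one token that spans the whole field value,
// with offsets expressed relative to the original document text.
class WholeFieldTokenStream extends TokenStream {
    private final String text;     // the field value, e.g. "Steve Suppe"
    private final int origStart;   // char offset of the field in the source text
    private boolean done = false;

    WholeFieldTokenStream(String text, int origStart) {
        this.text = text;
        this.origStart = origStart;
    }

    public Token next() throws IOException {
        if (done) return null;
        done = true;
        // The offsets set here are the ones written for this token, unless a
        // later analyzer or filter re-tokenizes the text.
        return new Token(text, origStart, origStart + text.length());
    }
}

If the field value is instead handed to an analyzer that splits 'Steve' and 'Suppe' into separate tokens, each sub-token gets its own offsets, which matches the behaviour described above; keeping the whole-field offsets in a separate stored field, as suggested earlier in the thread, sidesteps that.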