Re: getting term offset information for fields with multiple value entries

Grant Ingersoll Thu, 16 Aug 2007 17:54:26 -0700

Hi Christian,

Is there any way you can post a complete, self-contained example, preferably as a JUnit test? I think it would be useful to know more about how you are indexing (i.e. what Analyzer, etc.). The offsets should be taken from whatever is set on the Token during analysis. I, too, am trying to remember where in the code this is taking place.
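
To make the point about Token offsets concrete, here is a rough, self-contained sketch in plain Java. It is not Lucene code: the toy tokenizer (runs of letters or digits), the class and method names, and above all the rule for advancing the running base offset between values (end offset of the last token plus one) are assumptions for illustration that should be checked against the indexing code of the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TokenOffsetSketch {

    /** One token occurrence with offsets relative to the whole multi-valued field. */
    static final class Span {
        final String term;
        final int start;
        final int end;

        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Tokenizes each value into runs of letters/digits and shifts the token
     * offsets by a running base. ASSUMPTION (not confirmed in this thread):
     * after each value the base advances by the end offset of that value's
     * last token plus one, not by the raw string length.
     */
    static List<Span> offsets(List<String> values) {
        List<Span> spans = new ArrayList<Span>();
        int base = 0;
        for (String value : values) {
            int lastEnd = -1; // end offset of the last token found in this value
            int i = 0;
            while (i < value.length()) {
                if (!Character.isLetterOrDigit(value.charAt(i))) {
                    i++; // skip characters that are not part of any token
                    continue;
                }
                int start = i;
                while (i < value.length() && Character.isLetterOrDigit(value.charAt(i))) {
                    i++;
                }
                String term = value.substring(start, i).toLowerCase(Locale.ROOT);
                spans.add(new Span(term, base + start, base + i));
                lastEnd = i;
            }
            if (lastEnd >= 0) {
                base += lastEnd + 1; // the assumed rule described above
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Hypothetical values; the first one ends in a character that is not part of a token.
        List<String> values = Arrays.asList("red apple)", "green pear", "blue plum");
        for (Span s : offsets(values)) {
            System.out.println(s.term + " [" + s.start + "," + s.end + ")");
        }
    }
}
```

Under that assumption, the offsets of later values match neither a plain concatenation of the values nor a concatenation with a space after every value, because the gap depends on where the last token of each value ends.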


Also, what version of Lucene are you using?

-Grant

On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:

Hello,
I have an index with an 'actor' field; for each actor there exists a single field value entry, e.g.
stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition <movie_actors>
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
movie_actors:Miguel Bosé
movie_actors:Anna Lizaran (as Ana Lizaran)
movie_actors:Raquel Sanchís
movie_actors:Angelina Llongueras

I try to get the term offsets, e.g. for 'angelina', with:
TermPositionVector termPositionVector = (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors");
int iTermIndex = termPositionVector.indexOf("angelina");
TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
I get one TermVectorOffsetInfo for the field, with offset numbers that are bigger than the length of one single field entry.
I guessed that Lucene gives the offsets as if all values were concatenated, i.e. for the single (virtual) string:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
This fits in almost no case, so my second guess was that Lucene adds some virtual delimiter between the single field entries for the offset calculation. I added a delimiter, so the result would be:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
(note the ' ' between each actor name)
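
The two guesses so far can be compared with a small, self-contained check in plain Java (no Lucene involved). The class and method names are made up for this sketch, and the name is searched in the raw strings rather than in analyzed tokens, so it only shows where each guess would place the start offset:

```java
import java.util.Arrays;
import java.util.List;

class ConcatenatedOffsets {

    // The five field values from the example above; accented characters are
    // written as Unicode escapes so the source file encoding does not matter.
    static final List<String> ACTORS = Arrays.asList(
            "Mayrata O'Wisiedo (as Mairata O'Wisiedo)",
            "Miguel Bos\u00e9",      // Bosé
            "Anna Lizaran (as Ana Lizaran)",
            "Raquel Sanch\u00eds",   // Sanchís
            "Angelina Llongueras");

    /** Joins the values with the delimiter and returns the start offset of needle, or -1. */
    static int startOffset(List<String> values, String delimiter, String needle) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                joined.append(delimiter);
            }
            joined.append(values.get(i));
        }
        return joined.indexOf(needle);
    }

    public static void main(String[] args) {
        System.out.println("guess 1, no delimiter:    " + startOffset(ACTORS, "", "Angelina"));
        System.out.println("guess 2, space delimiter: " + startOffset(ACTORS, " ", "Angelina"));
    }
}
```

The two results differ by the number of value boundaries in front of the name, which gives the range in which the offset reported by Lucene would have to fall if it inserts at most one character per boundary.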
..this also does not fit every case; there are now too many delimiters. So I further guessed that Lucene doesn't add a delimiter in every case, and I added one only when the last character of an entry is a word character (i.e. matches '\w'), with:
StringBuilder strbAttContent = new StringBuilder();
for (String strAttValue : m_luceneDocument.getValues(strFieldName))
{
   strbAttContent.append(strAttValue);
   if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
      strbAttContent.append(' ');
}

where I get the resulting (virtual) entry:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
This fits for ~96% of all my queries, but it is still not 100% the way Lucene calculates the offset values for fields with multiple value entries.
..maybe the problem is that there are special characters in my database (e.g. the 'é' in 'Bosé') that my '\w' doesn't match. I have looked at this specific case, but accounting for this one character doesn't solve the problem.
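
The '\w' behaviour can be checked in isolation. A minimal sketch, assuming a Java 7 or later runtime for Pattern.UNICODE_CHARACTER_CLASS (on older runtimes Character.isLetterOrDigit is an alternative); the class and method names are illustrative only:

```java
import java.util.regex.Pattern;

class WordCharCheck {

    // By default, \w in java.util.regex covers only [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w");

    // With this flag (Java 7+), \w follows the Unicode notion of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // All three helpers assume a non-empty string.
    static boolean endsWithAsciiWord(String s) {
        return ASCII_WORD.matcher(s.substring(s.length() - 1)).matches();
    }

    static boolean endsWithUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s.substring(s.length() - 1)).matches();
    }

    static boolean endsWithLetterOrDigit(String s) {
        return Character.isLetterOrDigit(s.charAt(s.length() - 1));
    }

    public static void main(String[] args) {
        String[] samples = {
            "Miguel Bos\u00e9",              // ends in 'é'
            "Anna Lizaran (as Ana Lizaran)", // ends in ')'
            "Raquel Sanch\u00eds"            // ends in 's'
        };
        for (String s : samples) {
            System.out.println(s + " -> ascii \\w: " + endsWithAsciiWord(s)
                    + ", unicode \\w: " + endsWithUnicodeWord(s)
                    + ", letterOrDigit: " + endsWithLetterOrDigit(s));
        }
    }
}
```

Only the value ending in 'é' is classified differently by the three checks, which is consistent with this character alone not explaining the remaining mismatches.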
How does Lucene calculate these offsets? I also searched the source code, but couldn't find the right place.
Thanks in advance!

Christian Reuschling





--
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-125
mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/
------------ Legal Company Information Required by German Law ------------
Management: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Chairman)
            Dr. Walter Olthoff
Chairman of the Supervisory Board: Prof. Dr. h.c. Hans A. Aukes
District Court (Amtsgericht) Kaiserslautern, HRB 2313
______________________________________________________________________________

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


