Hello community, dear Grant
I have built a JUnit test case that illustrates the problem - in it, I
try to cut out the right substring using the offset values given by
Lucene - and fail :(
A few remarks:
In this example, the 'é' in 'Bosé' causes the '\w' pattern not to
match - unlike in StandardAnalyzer, it is treated as a delimiter
character.
Analysis: It seems that Lucene calculates the offset values by adding a
virtual delimiter between every field value. But Lucene forgets the
last characters of a field value when these are analyzer-specific
delimiter characters. (I assume this because of DocumentWriter, line
245: 'if (lastToken != null) offset += lastToken.endOffset() + 1;')
With this line of code, only the end offset of the last token is
considered - any trailing delimiter chars that the analyzer trimmed are
forgotten.
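To illustrate, here is a minimal sketch of the accumulation behaviour
as I understand it (this is NOT the actual DocumentWriter code; record()
is a made-up callback, 'analyzer' and 'fieldValues' stand for the
index-time analyzer and field values, and exception handling is
omitted; Lucene 2.x Token/TokenStream API):

int offset = 0;
for (String strValue : fieldValues)
{
    Token lastToken = null;
    TokenStream stream = analyzer.tokenStream("movie_actors", new StringReader(strValue));
    for (Token token = stream.next(); token != null; token = stream.next())
    {
        // the stored offsets are the token offsets shifted by the base offset
        record(token.startOffset() + offset, token.endOffset() + offset);
        lastToken = token;
    }
    // only the last token's end offset is carried over - delimiter chars
    // trimmed after the last token are lost here
    if (lastToken != null)
        offset += lastToken.endOffset() + 1;
}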
Thus, a fix on the application side would be:
1. Add a single delimiter char between the field values
2. Subtract (from the Lucene offset) the count of analyzer-specific
delimiters that are at the end of all field values before the match
(see the sketch below)
For this, one needs to know what counts as a delimiter for a specific
analyzer.
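As an illustration of how point 2 could work, here is a minimal,
untested sketch (it assumes the same analyzer as at index time, that
iStartOffset/iEndOffset hold the offsets from the term vector, and
omits exception handling):

int base = 0;
for (String strValue : doc.getValues("movie_actors"))
{
    Token lastToken = null;
    TokenStream stream = analyzer.tokenStream("movie_actors", new StringReader(strValue));
    for (Token token = stream.next(); token != null; token = stream.next())
        lastToken = token;
    // the next value starts behind the last token's end offset + 1
    int nextBase = (lastToken == null) ? base : base + lastToken.endOffset() + 1;
    if (iStartOffset >= base && iStartOffset < nextBase)
    {
        // the match lies inside this value; its offsets are relative to 'base'
        String strMatch = strValue.substring(iStartOffset - base, iEndOffset - base);
        break;
    }
    base = nextBase;
}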
The other possibility, of course, is to change the behaviour inside
Lucene, because the current offset values are more or less useless /
hard to use as they are (I currently see no way to determine the
analyzer-specific delimiter chars).
For me, this looks like a bug - am I wrong?
Any ideas/hints/remarks? I would be very happy about them :)
Greetings
Christian
Grant Ingersoll wrote:
Hi Christian,
Is there any way you can post a complete, self-contained example,
preferably as a JUnit test? I think it would be useful to know more
about how you are indexing (i.e. which Analyzer, etc.).
The offsets should be taken from whatever is set on the Token during
analysis. I, too, am trying to remember where in the code this takes
place.
Also, what version of Lucene are you using?
-Grant
On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
Hello,
I have an index with an 'actor' field; for each actor there exists a
single field value entry, e.g.
stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition<movie_actors>
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
movie_actors:Miguel Bosé
movie_actors:Anna Lizaran (as Ana Lizaran)
movie_actors:Raquel Sanchís
movie_actors:Angelina Llongueras
I try to get the term offset, e.g. for 'angelina' with
TermPositionVector termPositionVector =
    (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors");
int iTermIndex = termPositionVector.indexOf("angelina");
TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
I get one TermVectorOffsetInfo for the field - with offset numbers that
are bigger than any single Field entry.
I guessed that Lucene gives the offset numbers as if all values were
concatenated, i.e. for the single (virtual) string:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
This fits in nearly no situation, so my second guess was that Lucene
adds a virtual delimiter between the single field entries for offset
calculation. I added a delimiter, so the result would be:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé
Anna
Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
(note the ' ' between each actor name)
...this also does not fit every situation - there are too many
delimiters now. So I further guessed that Lucene does not add a
delimiter in every situation, and added one only when the last
character of an entry was an alphanumeric one, with:
StringBuilder strbAttContent = new StringBuilder();
for (String strAttValue : m_luceneDocument.getValues(strFieldName))
{
    strbAttContent.append(strAttValue);
    if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
        strbAttContent.append(' ');
}
where I get the result (virtual) entry:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
This fits for ~96% of all my queries... but it is still not 100% the
way Lucene calculates the offset values for fields with multiple value
entries.
...maybe the problem is that there are special characters in my
database (e.g. the 'é' in 'Bosé') that my '\w' pattern does not match.
I have looked at this specific situation, but handling this one
character does not solve the problem.
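For illustration: Java's '\w' covers only [a-zA-Z_0-9] by default,
whereas StandardAnalyzer keeps the 'é' inside the token 'bosé':

System.out.println("e".matches("\\w")); // true
System.out.println("é".matches("\\w")); // false - so my code appends no delimiter after 'Bosé'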
How does Lucene calculate these offsets? I also searched the source
code, but could not find the right place.
Thanks in advance!
Christian Reuschling
--
________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer
Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
Phone: +49.631.20575-125
mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/

------------ Legal Company Information Required by German Law ------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender),
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
________________________________________________________________________
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
package org.dynaq.index;
import static org.junit.Assert.assertTrue;
import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class TokenizerTest
{
IndexReader m_indexReader;
Analyzer m_analyzer = new StandardAnalyzer();
@Before
public void createIndex() throws CorruptIndexException,
LockObtainFailedException, IOException
{
Directory ramDirectory = new RAMDirectory();
        // we create a first little set of actor names
LinkedList<String> llActorNames = new LinkedList<String>();
llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
llActorNames.add("Miguel Bosé");
llActorNames.add("Anna Lizaran (as Ana Lizaran)");
llActorNames.add("Raquel Sanchís");
llActorNames.add("Angelina Llongueras");
        // store them into a single document with multiple values for one Field
Document testDoc = new Document();
for (String strActorsName : llActorNames)
{
Field testEntry =
new Field("movie_actors", strActorsName,
Field.Store.YES, Field.Index.TOKENIZED,
Field.TermVector.WITH_POSITIONS_OFFSETS);
testDoc.add(testEntry);
}
// now we write it into the index
IndexWriter indexWriter = new IndexWriter(ramDirectory,
true, m_analyzer, true);
indexWriter.addDocument(testDoc);
indexWriter.close();
m_indexReader = IndexReader.open(ramDirectory);
}
@Test
public void checkOffsetValues() throws ParseException, IOException
{
// first, we search for 'angelina'
String strSearchTerm = "Angelina";
Searcher searcher = new IndexSearcher(m_indexReader);
QueryParser parser = new QueryParser("movie_actors",
m_analyzer);
Query query = parser.parse(strSearchTerm);
Hits hits = searcher.search(query);
Document resultDoc = hits.doc(0);
String[] straValues = resultDoc.getValues("movie_actors");
        // now, we get the field values and build a single string value out of them
StringBuilder strbSimplyConcatenated = new StringBuilder();
StringBuilder strbWithDelimiters = new StringBuilder();
        StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
for (String strActorName : straValues)
{
            // first situation: we simply concatenate all field value entries
strbSimplyConcatenated.append(strActorName);
            // second: we add a single delimiter char between the field values
strbWithDelimiters.append(strActorName).append('$');
            // third try: we add a single delimiter, but only if the last
            // char of the preceding actor entry was an alphanumeric char
strbWithDelimitersAfterAlphaNumChar.append(strActorName);
String strLastChar = strActorName.substring
(strActorName.length() - 1, strActorName.length());
if(strLastChar.matches("\\w"))
strbWithDelimitersAfterAlphaNumChar.append('$');
}
        // this is the offset value from Lucene. It should be the place of
        // 'angelina' in one of the concatenated value Strings above
TermPositionVector termPositionVector =
(TermPositionVector) m_indexReader.getTermFreqVector(0,
"movie_actors");
int iTermIndex = termPositionVector.indexOf
(strSearchTerm.toLowerCase());
TermVectorOffsetInfo[] termOffsets =
termPositionVector.getOffsets(iTermIndex);
int iStartOffset = termOffsets[0].getStartOffset();
int iEndOffset = termOffsets[0].getEndOffset();
        // we create the substrings according to the offset value given by Lucene
String strSubString1 = strbSimplyConcatenated.substring
(iStartOffset, iEndOffset);
String strSubString2 = strbWithDelimiters.substring
(iStartOffset, iEndOffset);
String strSubString3 =
strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset,
iEndOffset);
System.out.println("Offset value: " + iStartOffset + "-" +
iEndOffset);
System.out.println("simply concatenated:");
System.out.println(strbSimplyConcatenated);
System.out.println("SubString for offset: '" +
strSubString1 + "'");
System.out.println();
System.out.println("with delimiters:");
System.out.println(strbWithDelimiters);
System.out.println("SubString for offset: '" +
strSubString2 + "'");
System.out.println();
System.out.println("with delimiter after alphanum
character:");
System.out.println(strbWithDelimitersAfterAlphaNumChar);
System.out.println("SubString for offset: '" +
strSubString3 + "'");
        // is the offset value correct for one of the concatenated strings?
        // this fails for all situations
assertTrue(strSubString1.equals(strSearchTerm) ||
strSubString2.equals(strSearchTerm) || strSubString3.equals
(strSearchTerm));
        /*
         * Comments: In this example, the 'é' in 'Bosé' causes the '\w'
         * pattern not to match - unlike in StandardAnalyzer, it is treated
         * as a delimiter character.
         *
         * Analysis: It seems that Lucene calculates the offset values by
         * adding a virtual delimiter between every field value. But Lucene
         * forgets the last characters of a field value when these are
         * analyzer-specific delimiter characters. (I assume this because of
         * DocumentWriter, line 245:
         * 'if (lastToken != null) offset += lastToken.endOffset() + 1;')
         * With this line of code, only the end offset of the last token is
         * considered - any trailing, trimmed delimiter chars are forgotten.
         *
         * Thus, a fix would be:
         * 1. Add a single delimiter char between the field values
         * 2. Subtract (from the Lucene offset) the count of analyzer-specific
         *    delimiters at the end of all field values before the match
         *
         * For this, one needs to know what counts as a delimiter for a
         * specific analyzer.
         */
}
@After
public void closeIndex() throws CorruptIndexException, IOException
{
m_indexReader.close();
}
}