Hello Grant, dear community

I have written some lines of code that adapt the offset values from Lucene to
the positions where the terms really appear in the concatenated field value entries.

My tests are successful :)

There are two additional methods inside the test:
1. calculateLuceneOffsetDiffs(String strFieldName, Document document2highlight, Analyzer analyzer)
2. adaptLuceneOffset(int luceneOffset, LinkedHashMap<Integer, Integer> hsEndOffset2EndDelimiterCount)

For the calculation, I use the analyzer for the specific field and concatenate the
values with a single (whitespace) delimiter in between.

It really looks like Lucene drops the (possibly trimmed) delimiter chars at the
end of field values when it calculates the offsets. The delimiter chars are
analyzer-specific - thus I need the analyzer to calculate the corrections.

The Lucene offsets are too low because these chars are dropped - so I calculate
the count of delimiter chars that lie before a given Lucene offset, and add it.
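
For illustration, this is how the two methods are meant to be chained - a
condensed sketch of what the attached test does at the end of checkOffsetValues();
the helper method name termAtLuceneOffset is only mine for this example, everything
else is taken from the attached TokenizerTest:

    // condensed usage sketch, using the two helper methods from the attached TokenizerTest
    // (the method name termAtLuceneOffset is just for illustration)
    static String termAtLuceneOffset(String strFieldName, Document doc, Analyzer analyzer,
            int iLuceneStartOffset, int iLuceneEndOffset) throws IOException
    {
        // all field values joined with a single ' ', plus the table endOffset => trailing delimiter count
        Map.Entry<String, LinkedHashMap<Integer, Integer>> entry =
                TokenizerTest.calculateLuceneOffsetDiffs(strFieldName, doc, analyzer);
        String strConcatenatedContent = entry.getKey();
        LinkedHashMap<Integer, Integer> endOffset2endDelimiterCount = entry.getValue();

        // Lucene's offsets are too low by the number of trailing delimiter chars it dropped
        // before the match - adaptLuceneOffset(..) adds that count back
        int iStart = TokenizerTest.adaptLuceneOffset(iLuceneStartOffset, endOffset2endDelimiterCount);
        int iEnd = TokenizerTest.adaptLuceneOffset(iLuceneEndOffset, endOffset2endDelimiterCount);

        return strConcatenatedContent.substring(iStart, iEnd);
    }

Inside calculateLuceneOffsetDiffs(..), the delimiter check itself is simple: the
trailing char of a field value is tokenized on its own with the field's analyzer,
and if no token comes back, it was a delimiter char. In the test data below, for
example, the ')' at the end of the O'Wisiedo and Lizaran entries are exactly the
chars that Lucene drops.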

Hope this helps - maybe this behaviour is already known? I haven't found any
comments on it, but I know that the original highlighter from Lucene-contrib
doesn't deal with multiple field entries (at least the version from last year) -
maybe this behaviour was the reason.

greetings

Christian



--
______________________________________________________________________________
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-125
mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/

------------Legal Company Information Required by German Law------------------
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
                  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
______________________________________________________________________________


Grant Ingersoll wrote:
> What version of Lucene are you using?
> 
> 
> On Aug 17, 2007, at 12:44 PM, [EMAIL PROTECTED] wrote:
> 
> Hello community, dear Grant
> 
> I have built a JUnit test case that illustrates the problem - there, I try
> to cut out the right substring with the offset values given by Lucene - and
> fail :(
> 
> A few remarks:
> 
> In this example, the 'é' from 'Bosé' means that the '\w' pattern doesn't
> match - unlike in StandardAnalyzer, it is recognized as a delimiter character.
> 
> Analysis: It seems that Lucene calculates the offset values by adding a
> virtual delimiter between every field value.
> But Lucene forgets the last characters of a field value when these are
> analyzer-specific delimiter chars. (I assume this because of DocumentWriter,
> line 245: 'if(lastToken != null) offset += lastToken.endOffset() + 1;')
> With this line of code, only the end offset of the last token is considered -
> potential, trimmed delimiter chars are forgotten.
> 
> Thus, a solution would be:
> 1. Add a single delimiter char between the field values
> 2. Subtract (from the Lucene offset) the count of analyzer-specific delimiters
>    that are at the end of all field values before the match
> 
> For this, one needs to know what a delimiter for a specific analyzer is.
> 
> The other possibility, of course, is to change the behaviour inside Lucene,
> because the current offset values are more or less useless / hard to use (I
> currently have no idea how to get the analyzer-specific delimiter chars).
> 
> For me, this looks like a bug - am I wrong?
> 
> Any ideas/hints/remarks? I would be very happy about them :)
> 
> Greetings
> 
> Christian
> 
> 
> 
> Grant Ingersoll wrote:
>>>> Hi Christian,
>>>>
>>>> Is there any way you can post a complete, self-contained example,
>>>> preferably as a JUnit test?  I think it would be useful to know more
>>>> about how you are indexing (i.e. what Analyzer, etc.).
>>>> The offsets should be taken from whatever is set on the Token during
>>>> Analysis.  I, too, am trying to remember where in the code this is
>>>> taking place.
>>>>
>>>> Also, what version of Lucene are you using?
>>>>
>>>> -Grant
>>>>
>>>> On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have an index with an 'actor' field; for each actor there exists a
>>>> single field value entry, e.g.
>>>>
>>>> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition
>>>>
>>>> <movie_actors>
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
>>>> movie_actors:Miguel Bosé
>>>> movie_actors:Anna Lizaran (as Ana Lizaran)
>>>> movie_actors:Raquel Sanchís
>>>> movie_actors:Angelina Llongueras
>>>>
>>>> I try to get the term offset, e.g. for 'angelina' with
>>>>
>>>> termPositionVector = (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors");
>>>> int iTermIndex = termPositionVector.indexOf("angelina");
>>>> TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
>>>>
>>>>
>>>> I get one TermVectorOffsetInfo for the field - with offset numbers that
>>>> are bigger than a single field entry.
>>>> I guessed that Lucene gives the offset numbers as if all values were
>>>> concatenated, which would be the single (virtual) string:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
>>>> Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>>>>
>>>> This fits in nearly no situation, so my second guess was that Lucene
>>>> adds some virtual delimiters between the single field entries for the
>>>> offset calculation. I added a delimiter, so the result would be:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna
>>>> Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
>>>> (note the ' ' between each actor name)
>>>>
>>>> ..this also doesn't fit for every situation - there are too many
>>>> delimiters now, so I further guessed that Lucene doesn't add a delimiter
>>>> in every situation. So I added one only when the last character of an
>>>> entry was an alphanumerical one, with:
>>>> StringBuilder strbAttContent = new StringBuilder();
>>>> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
>>>> {
>>>>    strbAttContent.append(strAttValue);
>>>>    if(strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>>>>       strbAttContent.append(' ');
>>>> }
>>>>
>>>> where I get the resulting (virtual) entry:
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
>>>> Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>>>>
>>>> This fits for ~96% of all my queries....but it is still not 100% the way
>>>> Lucene calculates the offset values for fields with multiple value entries.
>>>>
>>>>
>>>> ..maybe the problem is that there are special characters inside my
>>>> database (e.g. the 'é' in 'Bosé') that my '\w' doesn't match.
>>>> I have looked at this specific situation, but considering this one
>>>> character doesn't solve the problem.
>>>>
>>>>
>>>> How does Lucene calculate these offsets? I also searched inside the
>>>> source code, but can't find the correct place.
>>>>
>>>>
>>>> Thanks in advance!
>>>>
>>>> Christian Reuschling
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> ______________________________________________________________________________
>>>>
>>>>
>>>> Christian Reuschling, Dipl.-Ing.(BA)
>>>> Software Engineer
>>>>
>>>> Knowledge Management Department
>>>> German Research Center for Artificial Intelligence DFKI GmbH
>>>> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>>>>
>>>> Phone: +49.631.20575-125
>>>> mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/
>>>>
>>>> ------------Legal Company Information Required by German Law------------------
>>>> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>                   Dr. Walter Olthoff
>>>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>> ______________________________________________________________________________
>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucene.grantingersoll.com
> 
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ

package org.dynaq.index;



import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;



public class TokenizerTest
{

    IndexReader m_indexReader;

    Analyzer m_analyzer = new StandardAnalyzer();


    @Before
    public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory ramDirectory = new RAMDirectory();

        // we create a first little set of actor names
        LinkedList<String> llActorNames = new LinkedList<String>();
        llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
        llActorNames.add("Miguel Bosé");
        llActorNames.add("Anna Lizaran (as Ana Lizaran)");
        llActorNames.add("Raquel Sanchís");
        llActorNames.add("Angelina Llongueras");


        // store them into a single document with multiple values for one Field
        Document testDoc = new Document();
        for (String strActorsName : llActorNames)
        {
            Field testEntry =
                    new Field("movie_actors", strActorsName, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);

            testDoc.add(testEntry);
        }

        // now we write it into the index
        IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);

        indexWriter.addDocument(testDoc);
        indexWriter.close();

        m_indexReader = IndexReader.open(ramDirectory);
    }



    @Test
    public void checkOffsetValues() throws ParseException, IOException
    {

        // first, we search for 'angelina'
        String strSearchTerm = "Angelina";


        Searcher searcher = new IndexSearcher(m_indexReader);

        QueryParser parser = new QueryParser("movie_actors", m_analyzer);
        Query query = parser.parse(strSearchTerm);

        Hits hits = searcher.search(query);
        Document resultDoc = hits.doc(0);

        String[] straValues = resultDoc.getValues("movie_actors");

        // now, we get the field values and build a single string value out of them
        StringBuilder strbSimplyConcatenated = new StringBuilder();
        StringBuilder strbWithDelimiters = new StringBuilder();
        StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
        for (String strActorName : straValues)
        {
            // first situation: we simply concatenate all field value entries
            strbSimplyConcatenated.append(strActorName);
            // second: we add a single delimiter char between the field values
            strbWithDelimiters.append(strActorName).append('$');
            // third try: we add a single delimiter, but only if the last char of the actor before was an alphanum char
            strbWithDelimitersAfterAlphaNumChar.append(strActorName);
            String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
            if(strLastChar.matches("\\w")) strbWithDelimitersAfterAlphaNumChar.append('$');
        }


        // this is the offset value from Lucene. This should be the place of 'angelina' in one of the concatenated value Strings above
        TermPositionVector termPositionVector = (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
        int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
        TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
        int iStartOffset = termOffsets[0].getStartOffset();
        int iEndOffset = termOffsets[0].getEndOffset();

        // we create the substrings according to the offset value given by Lucene
        String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
        String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
        String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);

        System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
        System.out.println("simply concatenated:");
        System.out.println(strbSimplyConcatenated);
        System.out.println("SubString for offset: '" + strSubString1 + "'");
        System.out.println();
        System.out.println("with delimiters:");
        System.out.println(strbWithDelimiters);
        System.out.println("SubString for offset: '" + strSubString2 + "'");
        System.out.println();
        System.out.println("with delimiter after alphanum character:");
        System.out.println(strbWithDelimitersAfterAlphaNumChar);
        System.out.println("SubString for offset: '" + strSubString3 + "'");


        // is the offset value correct for one of the concatenated strings?

        // this fails for all situations
        assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm) || strSubString3.equals(strSearchTerm));

        /*
         * Comments: In this example, the 'é' from 'Bosé' means that the '\w' pattern doesn't match - unlike in StandardAnalyzer,
         * it is recognized as a delimiter character.
         *
         * Analysis: It seems that Lucene calculates the offset values by adding a virtual delimiter between every field value.
         * But Lucene forgets the last characters of a field value when these are analyzer-specific delimiter chars.
         * (I assume this because of DocumentWriter, line 245: 'if(lastToken != null) offset += lastToken.endOffset() + 1;')
         * With this line of code, only the end offset of the last token is considered - potential, trimmed delimiter
         * chars are forgotten.
         *
         * Thus, a solution would be:
         * 1. Add a single delimiter char between the field values
         * 2. Subtract (from the Lucene offset) the count of analyzer-specific delimiters that are at the end of all field values
         *    before the match
         *
         * For this, one needs to know what a delimiter for a specific analyzer is.
         */


    }



    @After
    public void closeIndex() throws CorruptIndexException, IOException
    {
        m_indexReader.close();
    }


}
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/



package org.dynaq.index;



import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;



public class TokenizerTest
{

    IndexReader m_indexReader;

    Analyzer m_analyzer = new StandardAnalyzer();



    @Before
    public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory ramDirectory = new RAMDirectory();

        // we create a first little set of actor names
        LinkedList<String> llActorNames = new LinkedList<String>();
        llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
        llActorNames.add("Miguel Bosé");
        llActorNames.add("Anna Lizaran (as Ana Lizaran)");
        llActorNames.add("Raquel Sanchís");
        llActorNames.add("Angelina Llongueras");


        // store them into a single document with multiple values for one Field
        Document testDoc = new Document();
        for (String strActorsName : llActorNames)
        {
            Field testEntry =
                    new Field("movie_actors", strActorsName, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);

            testDoc.add(testEntry);
        }

        // now we write it into the index
        IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);

        indexWriter.addDocument(testDoc);
        indexWriter.close();

        m_indexReader = IndexReader.open(ramDirectory);
    }



    /**
     * Determines the differences between the offset values calculated by Lucene and the offsets where the terms really appear inside the concatenated
     * values of multiple field entries.<br>
     * I assume the default behaviour of Lucene is a bug - because the (possibly trimmed) delimiters at the end of a field value are ignored for the
     * offset calculation. Thus, this is a workaround.<br>
     * The method offers all information necessary to provide adequate highlights for multiple field value entries:
     * 
     * <li>All field values concatenated to one String, separated with a single delimiter between the values (these delimiters are still considered
     * by the Lucene offsets).</li>
     * <li>A mapping table with the end offsets of field values that have analyzer-specific delimiters at their end as <b>keys</b> (as they would be
     * calculated by Lucene, to keep them comparable). The <b>values</b> are the related counts of trailing delimiters for this field entry.</li>
     * <br>
     * Provide adequate highlights by following these steps:
     * <li>Calculate the diffs for a specific field with calculateLuceneOffsetDiffs(..)</li>
     * <li>Get the term offset from Lucene (e.g. by a TermVectorOffsetInfo lookup)</li>
     * <li>Adapt the offset value with adaptLuceneOffset(..)</li>
     * <li>Use the content String as returned by calculateLuceneOffsetDiffs(..) for highlighting</li>
     * 
     * @param strFieldName the name of the field whose values should be highlighted
     * @param document2highlight the document with the (multiple) field value entries
     * @param analyzer the analyzer used for this field - needed to decide which chars count as delimiters
     * 
     * @return entry-key: the concatenated field values, with a single whitespace delimiter between them<br>
     *         entry-value: the mapping table endOffset => endDelimiterCount
     * @throws IOException
     */
    static public Map.Entry<String, LinkedHashMap<Integer, Integer>> calculateLuceneOffsetDiffs(String strFieldName, Document document2highlight,
            Analyzer analyzer) throws IOException
    {
        LinkedHashMap<Integer, Integer> hsEndOffset2EndDelimiterCount = new LinkedHashMap<Integer, Integer>();

        StringBuilder strbValuesWithDelimiters = new StringBuilder();

        String[] straValues = document2highlight.getValues(strFieldName);


        int iCurrentValueLengthSum = 0;

        for (String strActorName : straValues)
        {
            // we add a single delimiter char between the field values
            strbValuesWithDelimiters.append(strActorName).append(' ');

            // if the analyzer's tokenizer eliminates the single trailing char, it was a delimiter char
            int iLocalDelimitersAtEnd = 0;
            Token token;
            do
            {
                String strLastChar =
                        strActorName.substring(strActorName.length() - iLocalDelimitersAtEnd - 1, strActorName.length() - iLocalDelimitersAtEnd);
                TokenStream tokenStream = analyzer.tokenStream(strFieldName, new StringReader(strLastChar));
                token = tokenStream.next();
                if(token == null) iLocalDelimitersAtEnd++;
            }
            while (token == null);

            // Lucene forgets these trailing delimiters - so we do the same, to stay comparable with its offsets
            iCurrentValueLengthSum += strActorName.length() + 1 - iLocalDelimitersAtEnd;

            if(iLocalDelimitersAtEnd != 0) hsEndOffset2EndDelimiterCount.put(iCurrentValueLengthSum - 1, iLocalDelimitersAtEnd);
        }


        SimpleEntry<String, LinkedHashMap<Integer, Integer>> entry =
                new SimpleEntry<String, LinkedHashMap<Integer, Integer>>(strbValuesWithDelimiters.toString(), hsEndOffset2EndDelimiterCount);

        return entry;
    }



    /**
     * Modifies an offset value from Lucene according to the previously determined Lucene offset differences for a specific field.
     * 
     * @param luceneOffset the offset value from Lucene that we want to adapt
     * @param hsEndOffset2EndDelimiterCount the mapping table endOffset => endDelimiterCount, as returned by calculateLuceneOffsetDiffs(..)
     * 
     * @return the corrected, corresponding offset value
     */
    static public int adaptLuceneOffset(int luceneOffset, LinkedHashMap<Integer, Integer> hsEndOffset2EndDelimiterCount)
    {
        // we sum up the delimiters before the given Lucene offset value
        int iDelimiterSumBe4Offset = 0;
        for (Entry<Integer, Integer> entry : hsEndOffset2EndDelimiterCount.entrySet())
            if(entry.getKey() <= luceneOffset - 1) iDelimiterSumBe4Offset += entry.getValue();

        return luceneOffset + iDelimiterSumBe4Offset;
    }



    @Test
    public void checkOffsetValues() throws ParseException, IOException
    {

        // first, we search for 'angelina'
        String strSearchTerm = "Angelina";


        Searcher searcher = new IndexSearcher(m_indexReader);

        QueryParser parser = new QueryParser("movie_actors", m_analyzer);
        Query query = parser.parse(strSearchTerm);

        Hits hits = searcher.search(query);
        Document resultDoc = hits.doc(0);

        String[] straValues = resultDoc.getValues("movie_actors");

        // now, we get the field values and build a single string value out of them
        StringBuilder strbSimplyConcatenated = new StringBuilder();
        StringBuilder strbWithDelimiters = new StringBuilder();

        for (String strActorName : straValues)
        {
            // first situation: we simply concatenate all field value entries
            strbSimplyConcatenated.append(strActorName);
            // second: we add a single delimiter char between the field values
            strbWithDelimiters.append(strActorName).append('$');
        }




        // this is the offset value from lucene. This should be the place of 'angelina' in one of the concatenated value Strings above
        TermPositionVector termPositionVector = (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
        int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
        TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
        int iStartOffset = termOffsets[0].getStartOffset();
        int iEndOffset = termOffsets[0].getEndOffset();

        // we create the substrings according to the offset value given from lucene
        String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
        String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);

        System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
        System.out.println("simply concatenated field values:");
        System.out.println(strbSimplyConcatenated);
        System.out.println("SubString for Lucene offset: '" + strSubString1 + "'");
        System.out.println();
        System.out.println("concatenated field values with single delimiter:");
        System.out.println(strbWithDelimiters);
        System.out.println("SubString for Lucene offset: '" + strSubString2 + "'");
        System.out.println();


        //now we try to correct the values
        Map.Entry<String, LinkedHashMap<Integer, Integer>>  entry = TokenizerTest.calculateLuceneOffsetDiffs("movie_actors", resultDoc, m_analyzer);
        String strConcatenatedContent = entry.getKey();
        LinkedHashMap<Integer, Integer> endOffset2endDelimiterCount = entry.getValue();

        iStartOffset = TokenizerTest.adaptLuceneOffset(iStartOffset, endOffset2endDelimiterCount);
        iEndOffset = TokenizerTest.adaptLuceneOffset(iEndOffset, endOffset2endDelimiterCount);

        String strSubString3 = strConcatenatedContent.substring(iStartOffset, iEndOffset);
        System.out.println("SubString for modified offset: '" + strSubString3 + "'");




        // is the offset value correct for one of the concatenated strings?

        // the first two substrings still fail, but the corrected one (strSubString3) should now match
        assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm) || strSubString3.equals(strSearchTerm));

        /*
         * Analysis: It seems that Lucene calculates the offset values by adding a virtual delimiter between every field value. But Lucene forgets the
         * last characters of a field value when these are analyzer-specific delimiter chars. (I assume this because of DocumentWriter, line 245:
         * 'if(lastToken != null) offset += lastToken.endOffset() + 1;') With this line of code, only the end offset of the last token is considered -
         * potential, trimmed delimiter chars are forgotten.
         * 
         * Thus, the solution is:
         * 1. Add a single delimiter char between the field values
         * 2. Add (to the Lucene offset) the count of analyzer-specific delimiters that are at the end of all field values before the match
         * 
         * For this, the analyzer is used.
         */



    }



    @After
    public void closeIndex() throws CorruptIndexException, IOException
    {
        m_indexReader.close();
    }


}

