This is the 2.2.0 release.
Grant Ingersoll wrote:
> What version of Lucene are you using?
>
> On Aug 17, 2007, at 12:44 PM, [EMAIL PROTECTED] wrote:
>
> Hello community, dear Grant,
>
> I have built a JUnit test case that illustrates the problem - there, I try to
> cut out the right substring with the offset values given by Lucene, and fail :(
>
> A few remarks:
>
> In this example, the 'é' in 'Bosé' causes the '\w' pattern not to match -
> unlike in StandardAnalyzer, it is treated as a delimiter character.
>
> Analysis: It seems that Lucene calculates the offset values by adding a
> virtual delimiter between every field value, but it loses the last characters
> of a field value when these are analyzer-specific delimiter characters.
> (I assume this because of DocumentWriter, line 245:
> 'if (lastToken != null) offset += lastToken.endOffset() + 1;'.)
> With this line of code, only the end offset of the last token is considered,
> so any trailing delimiter characters trimmed by the analyzer are forgotten.
>
> Thus, a solution would be:
> 1. Add a single delimiter char between the field values
> 2. Subtract (from the Lucene offset) the number of analyzer-specific
>    delimiters at the end of all field values before the match
>
> For this, one needs to know what counts as a delimiter for a specific analyzer.
>
> The other possibility, of course, is to change the behaviour inside Lucene,
> because the current offset values are more or less useless / hard to use
> (I currently have no idea how to get the analyzer-specific delimiter chars).
>
> For me, this looks like a bug - am I wrong?
>
> Any ideas/hints/remarks? I would be very happy about any :)
>
> Greetings
>
> Christian
>
> Grant Ingersoll wrote:
>>>> Hi Christian,
>>>>
>>>> Is there any way you can post a complete, self-contained example,
>>>> preferably as a JUnit test? I think it would be useful to know more
>>>> about how you are indexing (i.e. what Analyzer, etc.).
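[The accumulation quoted above, 'offset += lastToken.endOffset() + 1', can be simulated in plain Java. The sketch below is only an approximation: the token pattern stands in for StandardAnalyzer and the class/method names are invented for illustration, but it shows how values ending in a non-token character such as ')' make the running offset drift away from any delimiter-joined concatenation.]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simulates the presumed DocumentWriter behaviour: after each field value,
// the running offset advances by lastToken.endOffset() + 1. The TOKEN
// pattern only approximates StandardAnalyzer's token boundaries.
public class OffsetSketch
{
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}']+");

    /** Start offset of each field value under the presumed accumulation. */
    static int[] valueBases(String[] values)
    {
        int[] bases = new int[values.length];
        int offset = 0;
        for (int i = 0; i < values.length; i++)
        {
            bases[i] = offset;
            int lastTokenEnd = 0;
            Matcher m = TOKEN.matcher(values[i]);
            while (m.find())
                lastTokenEnd = m.end();
            // trailing delimiter chars (e.g. a closing ')') after the last
            // token are NOT counted - this is where the drift comes from
            offset += lastTokenEnd + 1;
        }
        return bases;
    }

    public static void main(String[] args)
    {
        String[] values = {
            "Mayrata O'Wisiedo (as Mairata O'Wisiedo)", // ends with ')': lastTokenEnd = 39, length = 40
            "Miguel Bosé",                              // ends with a letter: lastTokenEnd = length = 11
            "Anna Lizaran (as Ana Lizaran)",
            "Raquel Sanchís",
            "Angelina Llongueras"
        };
        int[] bases = valueBases(values);
        for (int i = 0; i < values.length; i++)
            System.out.println("base " + bases[i] + "  \"" + values[i] + "\"");
        // Joining with one delimiter char per value would put "Angelina
        // Llongueras" at 98; the accumulation above yields 96, because the
        // two values ending in ')' each swallowed their virtual delimiter.
    }
}
```

Under these assumptions, 'angelina' starts at 96 rather than 98, matching the two-character mismatch the single-delimiter reconstruction shows.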
>>>> The offsets should be taken from whatever is set on the Token during
>>>> analysis. I, too, am trying to remember where in the code this takes place.
>>>>
>>>> Also, what version of Lucene are you using?
>>>>
>>>> -Grant
>>>>
>>>> On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have an index with a 'movie_actors' field; for each actor there exists a
>>>> single field value entry, e.g.
>>>>
>>>> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition <movie_actors>
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
>>>> movie_actors:Miguel Bosé
>>>> movie_actors:Anna Lizaran (as Ana Lizaran)
>>>> movie_actors:Raquel Sanchís
>>>> movie_actors:Angelina Llongueras
>>>>
>>>> I try to get the term offset, e.g. for 'angelina', with:
>>>>
>>>> termPositionVector = (TermPositionVector)
>>>>     reader.getTermFreqVector(docNumber, "movie_actors");
>>>> int iTermIndex = termPositionVector.indexOf("angelina");
>>>> TermVectorOffsetInfo[] termOffsets =
>>>>     termPositionVector.getOffsets(iTermIndex);
>>>>
>>>> I get one TermVectorOffsetInfo for the field - with offset numbers that
>>>> are bigger than any single field entry. I guessed that Lucene gives the
>>>> offset numbers as if all values were concatenated, which would be the
>>>> single (virtual) string:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>>>>
>>>> This fits in almost no situation, so my second guess was that Lucene adds
>>>> some virtual delimiters between the single field entries for offset
>>>> calculation.
>>>> I added a delimiter, so the result would be:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
>>>> (note the ' ' between each actor name)
>>>>
>>>> This also does not fit in every situation - now there are too many
>>>> delimiters. So I further guessed that Lucene doesn't add a delimiter in
>>>> every situation, and added one only when the last character of an entry
>>>> was alphanumeric:
>>>>
>>>> StringBuilder strbAttContent = new StringBuilder();
>>>> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
>>>> {
>>>>     strbAttContent.append(strAttValue);
>>>>     if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>>>>         strbAttContent.append(' ');
>>>> }
>>>>
>>>> where I get the resulting (virtual) entry:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>>>>
>>>> This fits for ~96% of all my queries... but it is still not 100% the way
>>>> Lucene calculates the offset values for fields with multiple value entries.
>>>>
>>>> Maybe the problem is that there are special characters in my database
>>>> (e.g. the 'é' in 'Bosé') which my '\w' doesn't match. I have looked at
>>>> this specific situation, but handling this one character doesn't solve
>>>> the problem.
>>>>
>>>> How does Lucene calculate these offsets? I also searched the source code,
>>>> but couldn't find the right place.
>>>>
>>>> Thanks in advance!
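[A side note on the '\w' point above: by default, Java's '\w' matches only ASCII word characters, so 'é' does indeed fall through even though it is a letter. A minimal standalone illustration, not part of the original test case:]

```java
// Demonstrates why the '\w' heuristic misclassifies accented letters:
// Java's \w is [a-zA-Z_0-9] by default, while \p{L} is Unicode-aware.
public class WordCharCheck
{
    public static void main(String[] args)
    {
        // 'é' is not an ASCII word character:
        System.out.println("é".matches("\\w"));      // false
        // ... yet it is a letter, which is why StandardAnalyzer keeps it in a token:
        System.out.println(Character.isLetter('é')); // true
        // a Unicode-aware alternative is the \p{L} category (letter in any script):
        System.out.println("é".matches("\\p{L}"));   // true
    }
}
```

Replacing "\\w" with "[\\p{L}\\p{N}]" in the delimiter heuristic would at least align it with how StandardAnalyzer treats accented letters.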
>>>>
>>>> Christian Reuschling
>>>>
>>>> --
>>>> ______________________________________________________________________________
>>>>
>>>> Christian Reuschling, Dipl.-Ing.(BA)
>>>> Software Engineer
>>>>
>>>> Knowledge Management Department
>>>> German Research Center for Artificial Intelligence DFKI GmbH
>>>> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>>>>
>>>> Phone: +49.631.20575-125
>>>> mailto:[EMAIL PROTECTED] http://www.dfki.uni-kl.de/~reuschling/
>>>>
>>>> ------------Legal Company Information Required by German Law------------------
>>>> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>> Dr. Walter Olthoff
>>>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>> ______________________________________________________________________________
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucene.grantingersoll.com
>>>>
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ

package org.dynaq.index;

import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TokenizerTest
{
    IndexReader m_indexReader;
    Analyzer m_analyzer = new StandardAnalyzer();

    @Before
    public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory ramDirectory = new RAMDirectory();

        // we create a first little set of actor names
        LinkedList<String> llActorNames = new LinkedList<String>();
        llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
        llActorNames.add("Miguel Bosé");
        llActorNames.add("Anna Lizaran (as Ana Lizaran)");
        llActorNames.add("Raquel Sanchís");
        llActorNames.add("Angelina Llongueras");

        // store them into a single document with multiple values for one field
        Document testDoc = new Document();
        for (String strActorsName : llActorNames)
        {
            Field testEntry = new Field("movie_actors", strActorsName, Field.Store.YES,
                    Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            testDoc.add(testEntry);
        }

        // now we write it into the index
        IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);
        indexWriter.addDocument(testDoc);
        indexWriter.close();

        m_indexReader = IndexReader.open(ramDirectory);
    }

    @Test
    public void checkOffsetValues() throws ParseException, IOException
    {
        // first, we search for 'angelina'
        String strSearchTerm = "Angelina";

        Searcher searcher = new IndexSearcher(m_indexReader);
        QueryParser parser = new QueryParser("movie_actors", m_analyzer);
        Query query = parser.parse(strSearchTerm);

        Hits hits = searcher.search(query);
        Document resultDoc = hits.doc(0);
        String[] straValues = resultDoc.getValues("movie_actors");

        // now, we get the field values and build a single string value out of them
        StringBuilder strbSimplyConcatenated = new StringBuilder();
        StringBuilder strbWithDelimiters = new StringBuilder();
        StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
        for (String strActorName : straValues)
        {
            // first situation: we simply concatenate all field value entries
            strbSimplyConcatenated.append(strActorName);

            // second: we add a single delimiter char between the field values
            strbWithDelimiters.append(strActorName).append('$');

            // third try: we add a single delimiter, but only if the last char
            // of the preceding actor was an alphanumeric char
            strbWithDelimitersAfterAlphaNumChar.append(strActorName);
            String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
            if (strLastChar.matches("\\w"))
                strbWithDelimitersAfterAlphaNumChar.append('$');
        }

        // this is the offset value from Lucene. This should be the place of
        // 'angelina' in one of the concatenated value strings above
        TermPositionVector termPositionVector =
            (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
        int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
        TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
        int iStartOffset = termOffsets[0].getStartOffset();
        int iEndOffset = termOffsets[0].getEndOffset();

        // we create the substrings according to the offset value given by Lucene
        String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
        String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
        String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);

        System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
        System.out.println("simply concatenated:");
        System.out.println(strbSimplyConcatenated);
        System.out.println("SubString for offset: '" + strSubString1 + "'");
        System.out.println();
        System.out.println("with delimiters:");
        System.out.println(strbWithDelimiters);
        System.out.println("SubString for offset: '" + strSubString2 + "'");
        System.out.println();
        System.out.println("with delimiter after alphanum character:");
        System.out.println(strbWithDelimitersAfterAlphaNumChar);
        System.out.println("SubString for offset: '" + strSubString3 + "'");

        // is the offset value correct for one of the concatenated strings?
        // this fails for all situations
        assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm)
                || strSubString3.equals(strSearchTerm));

        /*
         * Comments: In this example, the 'é' in 'Bosé' causes the '\w' pattern
         * not to match - unlike in StandardAnalyzer, it is treated as a
         * delimiter character.
         *
         * Analysis: It seems that Lucene calculates the offset values by adding
         * a virtual delimiter between every field value, but loses the last
         * characters of a field value when these are analyzer-specific
         * delimiter characters. (I assume this because of DocumentWriter,
         * line 245: 'if (lastToken != null) offset += lastToken.endOffset() + 1;'.)
         * With this line of code, only the end offset of the last token is
         * considered, so trailing trimmed delimiter chars are forgotten.
         *
         * Thus, a solution would be:
         * 1. Add a single delimiter char between the field values
         * 2. Subtract (from the Lucene offset) the number of analyzer-specific
         *    delimiters at the end of all field values before the match
         *
         * For this, one needs to know what counts as a delimiter for a
         * specific analyzer.
         */
    }

    @After
    public void closeIndex() throws CorruptIndexException, IOException
    {
        m_indexReader.close();
    }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
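[Until the offset calculation changes inside Lucene itself, one pragmatic workaround is to skip the concatenated term-vector offsets entirely and locate the term within each stored field value directly. A minimal sketch with invented names; the case-insensitive indexOf is a crude stand-in for proper analyzer-based matching and only handles simple single-token terms:]

```java
import java.util.Locale;

// Hypothetical workaround sketch: rather than mapping Lucene's concatenated
// term-vector offsets back onto the stored values, search each stored value
// directly, yielding per-value offsets that need no delimiter guessing.
public class PerValueOffsetWorkaround
{
    /** Returns "valueIndex:start-end" for the first stored value containing
     *  the term (case-insensitive), or null if no value contains it. */
    static String findTerm(String[] storedValues, String term)
    {
        String needle = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < storedValues.length; i++)
        {
            int start = storedValues[i].toLowerCase(Locale.ROOT).indexOf(needle);
            if (start >= 0)
                return i + ":" + start + "-" + (start + needle.length());
        }
        return null;
    }

    public static void main(String[] args)
    {
        String[] values = {
            "Mayrata O'Wisiedo (as Mairata O'Wisiedo)",
            "Miguel Bosé",
            "Anna Lizaran (as Ana Lizaran)",
            "Raquel Sanchís",
            "Angelina Llongueras"
        };
        // 'angelina' sits in the fifth value (index 4) at characters 0-8
        System.out.println(findTerm(values, "angelina")); // 4:0-8
    }
}
```

In the test above, Document.getValues("movie_actors") already returns exactly such a String[] of stored values, so highlighting can work per value instead of against the virtual concatenation.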