This is the 2.2.0 release.
Grant Ingersoll wrote:
> What version of Lucene are you using?
>
> On Aug 17, 2007, at 12:44 PM, [EMAIL PROTECTED] wrote:
>
> Hello community, dear Grant,
>
> I have built a JUnit test case that illustrates the problem - there, I try to
> cut out the right substring with the offset values given by Lucene, and fail :(
>
> A few remarks:
>
> In this example, the 'é' in 'Bosé' causes the '\w' pattern not to match -
> unlike in StandardAnalyzer, it is treated as a delimiter character.
>
> Analysis: It seems that Lucene calculates the offset values by adding a
> virtual delimiter between every field value, but it loses the last characters
> of a field value when these are analyzer-specific delimiter characters.
> (I assume this because of DocumentWriter, line 245:
> 'if (lastToken != null) offset += lastToken.endOffset() + 1;'.)
> With this line of code, only the end offset of the last token is considered,
> so any trailing delimiter characters trimmed by the analyzer are forgotten.
>
> Thus, a solution would be:
> 1. Add a single delimiter char between the field values
> 2. Subtract (from the Lucene offset) the number of analyzer-specific
>    delimiters at the end of all field values before the match
>
> For this, one needs to know what counts as a delimiter for a specific analyzer.
>
> The other possibility, of course, is to change the behaviour inside Lucene,
> because the current offset values are more or less useless / hard to use
> (I currently have no idea how to get the analyzer-specific delimiter chars).
>
> For me, this looks like a bug - am I wrong?
>
> Any ideas/hints/remarks? I would be very happy about any :)
>
> Greetings
>
> Christian
>
> Grant Ingersoll wrote:
>>>> Hi Christian,
>>>>
>>>> Is there any way you can post a complete, self-contained example,
>>>> preferably as a JUnit test? I think it would be useful to know more
>>>> about how you are indexing (i.e. what Analyzer, etc.).
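[The accumulation quoted above, 'offset += lastToken.endOffset() + 1', can be simulated in plain Java. The sketch below is only an approximation: the token pattern stands in for StandardAnalyzer and the class/method names are invented for illustration, but it shows how values ending in a non-token character such as ')' make the running offset drift away from any delimiter-joined concatenation.]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simulates the presumed DocumentWriter behaviour: after each field value,
// the running offset advances by lastToken.endOffset() + 1. The TOKEN
// pattern only approximates StandardAnalyzer's token boundaries.
public class OffsetSketch
{
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}']+");

    /** Start offset of each field value under the presumed accumulation. */
    static int[] valueBases(String[] values)
    {
        int[] bases = new int[values.length];
        int offset = 0;
        for (int i = 0; i < values.length; i++)
        {
            bases[i] = offset;
            int lastTokenEnd = 0;
            Matcher m = TOKEN.matcher(values[i]);
            while (m.find())
                lastTokenEnd = m.end();
            // trailing delimiter chars (e.g. a closing ')') after the last
            // token are NOT counted - this is where the drift comes from
            offset += lastTokenEnd + 1;
        }
        return bases;
    }

    public static void main(String[] args)
    {
        String[] values = {
            "Mayrata O'Wisiedo (as Mairata O'Wisiedo)", // ends with ')': lastTokenEnd = 39, length = 40
            "Miguel Bosé",                              // ends with a letter: lastTokenEnd = length = 11
            "Anna Lizaran (as Ana Lizaran)",
            "Raquel Sanchís",
            "Angelina Llongueras"
        };
        int[] bases = valueBases(values);
        for (int i = 0; i < values.length; i++)
            System.out.println("base " + bases[i] + "  \"" + values[i] + "\"");
        // Joining with one delimiter char per value would put "Angelina
        // Llongueras" at 98; the accumulation above yields 96, because the
        // two values ending in ')' each swallowed their virtual delimiter.
    }
}
```

Under these assumptions, 'angelina' starts at 96 rather than 98, matching the two-character mismatch the single-delimiter reconstruction shows.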
>>>> The offsets should be taken from whatever is set on the Token during
>>>> analysis. I, too, am trying to remember where in the code this takes place.
>>>>
>>>> Also, what version of Lucene are you using?
>>>>
>>>> -Grant
>>>>
>>>> On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have an index with a 'movie_actors' field; for each actor there exists a
>>>> single field value entry, e.g.
>>>>
>>>> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition <movie_actors>
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
>>>> movie_actors:Miguel Bosé
>>>> movie_actors:Anna Lizaran (as Ana Lizaran)
>>>> movie_actors:Raquel Sanchís
>>>> movie_actors:Angelina Llongueras
>>>>
>>>> I try to get the term offset, e.g. for 'angelina', with:
>>>>
>>>> termPositionVector = (TermPositionVector)
>>>>     reader.getTermFreqVector(docNumber, "movie_actors");
>>>> int iTermIndex = termPositionVector.indexOf("angelina");
>>>> TermVectorOffsetInfo[] termOffsets =
>>>>     termPositionVector.getOffsets(iTermIndex);
>>>>
>>>> I get one TermVectorOffsetInfo for the field - with offset numbers that
>>>> are bigger than any single field entry. I guessed that Lucene gives the
>>>> offset numbers as if all values were concatenated, which would be the
>>>> single (virtual) string:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>>>>
>>>> This fits in almost no situation, so my second guess was that Lucene adds
>>>> some virtual delimiters between the single field entries for offset
>>>> calculation.
>>>> I added a delimiter, so the result would be:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
>>>> (note the ' ' between each actor name)
>>>>
>>>> This also does not fit in every situation - now there are too many
>>>> delimiters. So I further guessed that Lucene doesn't add a delimiter in
>>>> every situation, and added one only when the last character of an entry
>>>> was alphanumeric:
>>>>
>>>> StringBuilder strbAttContent = new StringBuilder();
>>>> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
>>>> {
>>>>     strbAttContent.append(strAttValue);
>>>>     if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>>>>         strbAttContent.append(' ');
>>>> }
>>>>
>>>> where I get the resulting (virtual) entry:
>>>>
>>>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>>>>
>>>> This fits for ~96% of all my queries... but it is still not 100% the way
>>>> Lucene calculates the offset values for fields with multiple value entries.
>>>>
>>>> Maybe the problem is that there are special characters in my database
>>>> (e.g. the 'é' in 'Bosé') which my '\w' doesn't match. I have looked at
>>>> this specific situation, but handling this one character doesn't solve
>>>> the problem.
>>>>
>>>> How does Lucene calculate these offsets? I also searched the source code,
>>>> but couldn't find the right place.
>>>>
>>>> Thanks in advance!
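[A side note on the '\w' point above: by default, Java's '\w' matches only ASCII word characters, so 'é' does indeed fall through even though it is a letter. A minimal standalone illustration, not part of the original test case:]

```java
// Demonstrates why the '\w' heuristic misclassifies accented letters:
// Java's \w is [a-zA-Z_0-9] by default, while \p{L} is Unicode-aware.
public class WordCharCheck
{
    public static void main(String[] args)
    {
        // 'é' is not an ASCII word character:
        System.out.println("é".matches("\\w"));      // false
        // ... yet it is a letter, which is why StandardAnalyzer keeps it in a token:
        System.out.println(Character.isLetter('é')); // true
        // a Unicode-aware alternative is the \p{L} category (letter in any script):
        System.out.println("é".matches("\\p{L}"));   // true
    }
}
```

Replacing "\\w" with "[\\p{L}\\p{N}]" in the delimiter heuristic would at least align it with how StandardAnalyzer treats accented letters.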
>>>>
>>>> Christian Reuschling
>>>>
>>>> --
>>>> ______________________________________________________________________________
>>>>
>>>> Christian Reuschling, Dipl.-Ing.(BA)
>>>> Software Engineer
>>>>
>>>> Knowledge Management Department
>>>> German Research Center for Artificial Intelligence DFKI GmbH
>>>> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>>>>
>>>> Phone: +49.631.20575-125
>>>> mailto:[EMAIL PROTECTED] http://www.dfki.uni-kl.de/~reuschling/
>>>>
>>>> ------------Legal Company Information Required by German Law------------------
>>>> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>> Dr. Walter Olthoff
>>>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>> ______________________________________________________________________________
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucene.grantingersoll.com
>>>>
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ

package org.dynaq.index;

import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TokenizerTest
{
    IndexReader m_indexReader;
    Analyzer m_analyzer = new StandardAnalyzer();

    @Before
    public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory ramDirectory = new RAMDirectory();

        // we create a first little set of actor names
        LinkedList<String> llActorNames = new LinkedList<String>();
        llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
        llActorNames.add("Miguel Bosé");
        llActorNames.add("Anna Lizaran (as Ana Lizaran)");
        llActorNames.add("Raquel Sanchís");
        llActorNames.add("Angelina Llongueras");

        // store them into a single document with multiple values for one field
        Document testDoc = new Document();
        for (String strActorsName : llActorNames)
        {
            Field testEntry = new Field("movie_actors", strActorsName, Field.Store.YES,
                    Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            testDoc.add(testEntry);
        }

        // now we write it into the index
        IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);
        indexWriter.addDocument(testDoc);
        indexWriter.close();

        m_indexReader = IndexReader.open(ramDirectory);
    }

    @Test
    public void checkOffsetValues() throws ParseException, IOException
    {
        // first, we search for 'angelina'
        String strSearchTerm = "Angelina";

        Searcher searcher = new IndexSearcher(m_indexReader);
        QueryParser parser = new QueryParser("movie_actors", m_analyzer);
        Query query = parser.parse(strSearchTerm);

        Hits hits = searcher.search(query);
        Document resultDoc = hits.doc(0);
        String[] straValues = resultDoc.getValues("movie_actors");

        // now, we get the field values and build a single string value out of them
        StringBuilder strbSimplyConcatenated = new StringBuilder();
        StringBuilder strbWithDelimiters = new StringBuilder();
        StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
        for (String strActorName : straValues)
        {
            // first situation: we simply concatenate all field value entries
            strbSimplyConcatenated.append(strActorName);

            // second: we add a single delimiter char between the field values
            strbWithDelimiters.append(strActorName).append('$');

            // third try: we add a single delimiter, but only if the last char
            // of the preceding actor was an alphanumeric char
            strbWithDelimitersAfterAlphaNumChar.append(strActorName);
            String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
            if (strLastChar.matches("\\w"))
                strbWithDelimitersAfterAlphaNumChar.append('$');
        }

        // this is the offset value from Lucene. This should be the place of
        // 'angelina' in one of the concatenated value strings above
        TermPositionVector termPositionVector =
            (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
        int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
        TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
        int iStartOffset = termOffsets[0].getStartOffset();
        int iEndOffset = termOffsets[0].getEndOffset();

        // we create the substrings according to the offset value given by Lucene
        String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
        String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
        String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);

        System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
        System.out.println("simply concatenated:");
        System.out.println(strbSimplyConcatenated);
        System.out.println("SubString for offset: '" + strSubString1 + "'");
        System.out.println();
        System.out.println("with delimiters:");
        System.out.println(strbWithDelimiters);
        System.out.println("SubString for offset: '" + strSubString2 + "'");
        System.out.println();
        System.out.println("with delimiter after alphanum character:");
        System.out.println(strbWithDelimitersAfterAlphaNumChar);
        System.out.println("SubString for offset: '" + strSubString3 + "'");

        // is the offset value correct for one of the concatenated strings?
        // this fails for all situations
        assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm)
                || strSubString3.equals(strSearchTerm));

        /*
         * Comments: In this example, the 'é' in 'Bosé' causes the '\w' pattern
         * not to match - unlike in StandardAnalyzer, it is treated as a
         * delimiter character.
         *
         * Analysis: It seems that Lucene calculates the offset values by adding
         * a virtual delimiter between every field value, but loses the last
         * characters of a field value when these are analyzer-specific
         * delimiter characters. (I assume this because of DocumentWriter,
         * line 245: 'if (lastToken != null) offset += lastToken.endOffset() + 1;'.)
         * With this line of code, only the end offset of the last token is
         * considered, so trailing trimmed delimiter chars are forgotten.
         *
         * Thus, a solution would be:
         * 1. Add a single delimiter char between the field values
         * 2. Subtract (from the Lucene offset) the number of analyzer-specific
         *    delimiters at the end of all field values before the match
         *
         * For this, one needs to know what counts as a delimiter for a
         * specific analyzer.
         */
    }

    @After
    public void closeIndex() throws CorruptIndexException, IOException
    {
        m_indexReader.close();
    }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
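[Until the offset calculation changes inside Lucene itself, one pragmatic workaround is to skip the concatenated term-vector offsets entirely and locate the term within each stored field value directly. A minimal sketch with invented names; the case-insensitive indexOf is a crude stand-in for proper analyzer-based matching and only handles simple single-token terms:]

```java
import java.util.Locale;

// Hypothetical workaround sketch: rather than mapping Lucene's concatenated
// term-vector offsets back onto the stored values, search each stored value
// directly, yielding per-value offsets that need no delimiter guessing.
public class PerValueOffsetWorkaround
{
    /** Returns "valueIndex:start-end" for the first stored value containing
     *  the term (case-insensitive), or null if no value contains it. */
    static String findTerm(String[] storedValues, String term)
    {
        String needle = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < storedValues.length; i++)
        {
            int start = storedValues[i].toLowerCase(Locale.ROOT).indexOf(needle);
            if (start >= 0)
                return i + ":" + start + "-" + (start + needle.length());
        }
        return null;
    }

    public static void main(String[] args)
    {
        String[] values = {
            "Mayrata O'Wisiedo (as Mairata O'Wisiedo)",
            "Miguel Bosé",
            "Anna Lizaran (as Ana Lizaran)",
            "Raquel Sanchís",
            "Angelina Llongueras"
        };
        // 'angelina' sits in the fifth value (index 4) at characters 0-8
        System.out.println(findTerm(values, "angelina")); // 4:0-8
    }
}
```

In the test above, Document.getValues("movie_actors") already returns exactly such a String[] of stored values, so highlighting can work per value instead of against the virtual concatenation.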