Change to MultiReader
There was a message from Kirk Roberts, 18/4/2007 - "MultiSearcher vs MultiReader" - where Grant mentioned the visibility of the readerIndex() method in MultiReader, but nothing seems to have come of it. Is there any reason why the following could not be put into MultiReader? Something like this seems necessary when handling multiple indices to solve the BitSet caching issue I raised on the user thread. It's slightly more efficient for a Filter implementation's bits() method to track these reader numbers in the filter itself (as the doc id always seems to increment) rather than delegating back to the reader to resolve them on each call. Even so, these would be useful utility methods for doing so, and they leave the underlying implementation free to change if that becomes necessary.

Antony

    /** Fetches the IndexReader instance where the specified document exists
     * @param n the MultiReader document number
     * @return the reader index
     */
    public int readerIndex(int n) {
        // find reader for doc n:
        return MultiSegmentReader.readerIndex(n, this.starts, this.subReaders.length);
    }

    /** Fetches the document number in the specified reader for the given document number.
     * @param i the reader index obtained from {@link #readerIndex(int)}
     * @param n the MultiReader document number
     * @return the mapped document number
     */
    public int id(int i, int n) {
        // find true doc for doc n:
        return n - this.starts[i];
    }

    /** Fetches the document number in the specified reader for the given document number.
     * @param n the MultiReader document number
     * @return the mapped document number
     */
    public int id(int n) {
        // find true doc for doc n:
        return n - this.starts[readerIndex(n)];
    }
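For context, a minimal sketch of the kind of Filter these helpers would serve, written against the Lucene 2.x Filter.bits(IndexReader) API. The class name and per-index BitSet cache layout are hypothetical, and readerIndex()/id() are the proposed methods above, not existing MultiReader API:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.Filter;

// Hypothetical filter that maps MultiReader doc numbers onto cached
// per-index BitSets, using the proposed readerIndex()/id() helpers.
public class PerIndexCachedFilter extends Filter {

    private final BitSet[] perIndexBits;    // assumed cached per sub-index

    public PerIndexCachedFilter(BitSet[] perIndexBits) {
        this.perIndexBits = perIndexBits;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        MultiReader multi = (MultiReader) reader;
        BitSet result = new BitSet(multi.maxDoc());
        for (int n = 0; n < multi.maxDoc(); n++) {
            int i = multi.readerIndex(n);               // proposed helper
            if (perIndexBits[i].get(multi.id(i, n))) {  // proposed helper
                result.set(n);
            }
        }
        return result;
    }
}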
[jira] Commented: (LUCENE-1150) The token types of the standard tokenizer is not accessible
[ https://issues.apache.org/jira/browse/LUCENE-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588953#action_12588953 ]

Antony Bowesman commented on LUCENE-1150:
-----------------------------------------

The original tokenImage String array from 2.2 is still not available in this patch; the strings are still in the Impl. These are the values returned from Token.type(), so should they not be visible as well as the static ints?

> The token types of the standard tokenizer is not accessible
> ------------------------------------------------------------
>
> Key: LUCENE-1150
> URL: https://issues.apache.org/jira/browse/LUCENE-1150
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 2.3
> Reporter: Nicolas Lalevée
> Assignee: Michael McCandless
> Fix For: 2.3.2, 2.4
> Attachments: LUCENE-1150.patch, LUCENE-1150.take2.patch
>
> StandardTokenizerImpl not being public, these token types are not accessible:
> {code:java}
> public static final int ALPHANUM    = 0;
> public static final int APOSTROPHE  = 1;
> public static final int ACRONYM     = 2;
> public static final int COMPANY     = 3;
> public static final int EMAIL       = 4;
> public static final int HOST        = 5;
> public static final int NUM         = 6;
> public static final int CJ          = 7;
>
> /**
>  * @deprecated this solves a bug where HOSTs that end with '.' are identified
>  *             as ACRONYMs. It is deprecated and will be removed in the next
>  *             release.
>  */
> public static final int ACRONYM_DEP = 8;
>
> public static final String [] TOKEN_TYPES = new String [] {
>     "<ALPHANUM>",
>     "<APOSTROPHE>",
>     "<ACRONYM>",
>     "<COMPANY>",
>     "<EMAIL>",
>     "<HOST>",
>     "<NUM>",
>     "<CJ>",
>     "<ACRONYM_DEP>"
> };
> {code}
> So no custom TokenFilter can be based on the token type. Actually, even the
> StandardFilter cannot be written outside the
> org.apache.lucene.analysis.standard package.
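For context, a minimal sketch (not from the issue) of the kind of custom TokenFilter that needs these types to be visible. It assumes the ints and type strings end up public on StandardTokenizer, which is exactly what the comment asks for; the class name is made up:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical filter that keeps only ALPHANUM tokens; it can only be
// written outside the .standard package if the type constants are public.
public class AlphanumOnlyFilter extends TokenFilter {

    public AlphanumOnlyFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
            // Token.type() returns a type string such as "<ALPHANUM>"
            if (StandardTokenizer.TOKEN_TYPES[StandardTokenizer.ALPHANUM]
                    .equals(t.type())) {
                return t;
            }
        }
        return null;
    }
}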
Re: StandardTokenizerConstants in 2.3
Thanks Mike/Hoss for the clarification.

Antony

Michael McCandless wrote:
> Chris Hostetter wrote:
> : > But, StandardTokenizer is public? It "exports" those constants for you?
> :
> : Really? Sorry, but I can't find them - in 2.3.1 sources, there are no
> : references to those statics. Javadocs have no reference to them in
> : StandardTokenizer
>
> I think Michael is forgetting that he re-added those constants to the trunk
> after 2.3.1 was released...
> https://issues.apache.org/jira/browse/LUCENE-1150
>
> Woops! I'm sorry Antony -- Hoss is correct. I didn't realize this missed
> 2.3. I'll backport this fix to the 2.3 branch so it'll be included when we
> release 2.3.2 (which I think we should do soon - a lot of little fixes have
> been backported).
>
> Mike
Re: StandardTokenizerConstants in 2.3
> But, StandardTokenizer is public? It "exports" those constants for you?

Really? Sorry, but I can't find them - in the 2.3.1 sources there are no references to those statics. The Javadocs have no reference to them in StandardTokenizer

http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

and I can't see ALPHANUM in the Javadoc index. Eclipse cannot resolve them. Am I missing something?

Antony
Re: StandardTokenizerConstants in 2.3
> But, the constants that are used by StandardTokenizer are still available
> as static ints in the StandardTokenizer class (ie, ALPHANUM, APOSTROPHE,
> etc.). Does that work?

Problem as mentioned below is that StandardTokenizerImpl.java is package private, and even though the ints and string array are declared as public static, they are not visible.

Antony
Re: Sort difference between 2.1 and 2.3
Thanks for the explanation Mike. It's not a big issue - it's just a test case where I needed to ensure ordering, so I'll just use a valid high UTF-16 character. It just seemed odd that the field was showing strangely in Luke. Your explanation gives the reason, thanks.

Antony

Michael McCandless wrote:
> You're right, Lucene changed wrt the 0xffff character: 2.3 now uses this
> character internally as an "end of term" marker when storing term text.
> This was done as part of LUCENE-843 (speeding up indexing). Technically
> that character is an invalid UTF-16 character (for interchange), but it
> looks like a few Lucene users were indeed relying on older Lucene versions
> accepting & preserving it.
>
> You could use 0xfffe instead? Lucene 2.3 will preserve it, though it's also
> invalid for interchange (so future Lucene versions might change wrt that,
> too).
>
> Or ... it looks like your use case is to sort all "last" values after all
> "first" values? In which case one way to do this (without using invalid
> UTF-16 characters) might be to add a new field marking whether you have a
> "last" or a "first" value, then sort first by that field and second by your
> value field?
>
> Mike
>
> Antony Bowesman <[EMAIL PROTECTED]> wrote:
>> Hi, I had a test case that added two documents, each with one untokenized
>> field, and sorted them. The data in each document was
>>     char(0x1) + "First"
>>     char(0xffff) + "Last"
>> With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1,
>> they are not. Looking at the index with Luke shows that the document with
>> "Last" has not been handled correctly, i.e. the text for the "subject"
>> field is empty. [test case snipped - see the original message below]
>> Regards
>> Antony
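For the record, a minimal sketch of Mike's suggested marker-field approach, using the same Lucene 2.x API as the test below; the "position" field name and its "0"/"1" values are illustrative assumptions:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class MarkerFieldSortSketch {

    // Build a document carrying an explicit first/last marker field instead
    // of an invalid UTF-16 prefix on the subject value itself.
    static Document makeDoc(String subject, boolean last) {
        Document doc = new Document();
        doc.add(new Field("position", last ? "1" : "0",
                Field.Store.NO, Field.Index.NO_NORMS));
        doc.add(new Field("subject", subject,
                Field.Store.YES, Field.Index.NO_NORMS));
        return doc;
    }

    // Sort on the marker first, then on the subject itself.
    static Sort makeSort() {
        return new Sort(new SortField[] {
                new SortField("position"), new SortField("subject") });
    }
}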
Sort difference between 2.1 and 2.3
Hi,

I had a test case that added two documents, each with one untokenized field, and sorted them. The data in each document was

    char(0x1) + "First"
    char(0xffff) + "Last"

With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1, they are not. Looking at the index with Luke shows that the document with "Last" has not been handled correctly, i.e. the text for the "subject" field is empty. The test case below shows the problem.

Regards
Antony

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class LastSubjectTest {

    /**
     * Set up two documents whose subjects differ in their first character
     * @throws Exception
     */
    @Before
    public void setUp() throws Exception {
        IndexWriter writer = new IndexWriter("TestDir/", new StandardAnalyzer(), true);
        Document doc = new Document();
        String subject = new StringBuffer(1).append((char) 0xffff).toString() + "Last";
        Field f = new Field("subject", subject, Field.Store.YES, Field.Index.NO_NORMS);
        doc.add(f);
        writer.addDocument(doc);

        doc = new Document();
        subject = new StringBuffer(1).append((char) 0x1).toString() + "First";
        f = new Field("subject", subject, Field.Store.YES, Field.Index.NO_NORMS);
        doc.add(f);
        writer.addDocument(doc);
        writer.close();
    }

    /**
     * @throws Exception
     */
    @After
    public void tearDown() throws Exception {
    }

    /**
     * Tests that the "Last" document sorts after "First" when sorted by subject
     * @throws IOException
     */
    @Test
    public void testSortDateAscending() throws IOException {
        IndexSearcher searcher = new IndexSearcher("TestDir/");
        Query q = new MatchAllDocsQuery();
        Sort sort = new Sort(new SortField("subject"));
        Hits hits = searcher.search(q, sort);
        assertEquals("Hits should match all documents",
                searcher.getIndexReader().maxDoc(), hits.length());
        Document fd = hits.doc(0);
        Document ld = hits.doc(1);
        String fs = fd.get("subject");
        String ls = ld.get("subject");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String subject = doc.get("subject");
            System.out.println("Subject:" + subject);
        }
        assertTrue("Subjects have been sorted incorrectly", fs.compareTo(ls) < 0);
    }
}
StandardTokenizerConstants in 2.3
I'm migrating from 2.1 to 2.3 and found that the public interface StandardTokenizerConstants has gone. It looks like the definitions have disappeared inside the package private class StandardTokenizerImpl. Was this intentional? I was using these to determine the return values from Token.type().

Antony
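To illustrate what broke, here is roughly the kind of 2.1-era code in question - a sketch assuming the old public StandardTokenizerConstants interface with its type ints and tokenImage array; the class and method names here are made up:

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardTokenizerConstants;

public class TypeCheck {

    // Under 2.1, Token.type() for StandardTokenizer tokens matched entries
    // in StandardTokenizerConstants.tokenImage, so callers could test types:
    static boolean isAcronym(Token token) {
        return StandardTokenizerConstants.tokenImage[StandardTokenizerConstants.ACRONYM]
                .equals(token.type());
    }
}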
FieldSortedHitQueue.maxscore
Out of interest, is maxscore supposed to be

  a) the max score of the items inserted into the queue, even though they may
     have dropped out of the final results, or
  b) the max score of the items remaining in the queue (at most size of them)?

Currently it reflects a), but I just wondered whether that was correct.

Regards
Antony
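Not Lucene code, but a small self-contained illustration of where the two definitions diverge: in a queue bounded by a sort key rather than by score, the highest-scoring entry can be evicted, so (a) and (b) give different answers. All names here are made up:

import java.util.Comparator;
import java.util.PriorityQueue;

public class MaxScoreDemo {

    static class Hit {
        final String field; final float score;
        Hit(String field, float score) { this.field = field; this.score = score; }
    }

    public static void main(String[] args) {
        int size = 2;
        // Bounded queue ordered by the sort field; the head is the entry
        // evicted first (largest field value, i.e. worst under an ascending
        // field sort).
        PriorityQueue<Hit> queue = new PriorityQueue<Hit>(size + 1,
                new Comparator<Hit>() {
                    public int compare(Hit a, Hit b) {
                        return b.field.compareTo(a.field); // worst first
                    }
                });

        float maxInserted = Float.NEGATIVE_INFINITY;   // definition (a)
        Hit[] hits = { new Hit("zebra", 9.0f), new Hit("apple", 1.0f),
                       new Hit("mango", 2.0f) };
        for (Hit h : hits) {
            maxInserted = Math.max(maxInserted, h.score);
            queue.offer(h);
            if (queue.size() > size) {
                queue.poll(); // "zebra" (score 9.0) drops out
            }
        }

        float maxRemaining = Float.NEGATIVE_INFINITY;  // definition (b)
        for (Hit h : queue) {
            maxRemaining = Math.max(maxRemaining, h.score);
        }
        System.out.println("(a) max inserted  = " + maxInserted);  // 9.0
        System.out.println("(b) max remaining = " + maxRemaining); // 2.0
    }
}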
FieldSortedHitQueue.fillFields() not visible
I'm implementing a HitCollector to do sorting and will use FieldSortedHitQueue, but for some reason the fillFields() method is package private. Judging from the comments on the method I don't need it, but if I do need it later on I can't use it, unless of course I extend the class and copy the existing code. Was this done on purpose?

Regards
Antony
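For context, roughly what such a collector looks like against the 2.x API - a sketch assuming a FieldSortedHitQueue(IndexReader, SortField[], int) constructor and the PriorityQueue insert()/pop() methods; the class name is hypothetical, and the comment marks where a visible fillFields() would come into play:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.FieldSortedHitQueue;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.SortField;

// Hypothetical sorting collector built on FieldSortedHitQueue.
public class SortingCollector extends HitCollector {

    private final FieldSortedHitQueue queue;

    public SortingCollector(IndexReader reader, SortField[] fields, int size)
            throws IOException {
        this.queue = new FieldSortedHitQueue(reader, fields, size);
    }

    public void collect(int doc, float score) {
        // insert() keeps only the top 'size' entries by the sort fields
        queue.insert(new FieldDoc(doc, score));
    }

    public FieldDoc[] topDocs() {
        FieldDoc[] docs = new FieldDoc[queue.size()];
        for (int i = docs.length - 1; i >= 0; i--) {
            // A visible fillFields() would be called here to populate
            // FieldDoc.fields before handing results back.
            docs[i] = (FieldDoc) queue.pop();
        }
        return docs;
    }
}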
Re: Documentation Brainstorming
Grant Ingersoll wrote:
> Mind you, our docs are an order of magnitude better than this other project

I agree, Lucene is a very well documented project compared to many. In general, and in conjunction with LIA, it's a pretty easy project to get into.

> 3. There is a whole lot of knowledge stored in the email archives, how can
> we leverage it?

This is indeed a key point. HitCollector and surrounding classes are poorly documented, and there have been many replies to questions which recommend using a HitCollector. The search package is generally well described, apart from what are described as 'low level API' or 'expert' methods and classes. I found I needed to get to that level to get the best out of Lucene in a framework that sits on top of it.

Performance is another topic which would really benefit from a 'best practice' guide. The dev and user posts concerning performance always get many responses. Although a challenge to produce, bringing together some kind of recommendations which relate user data to reader/writer usage - e.g. what maxBufferedDocs, maxMergeDocs and mergeFactor to use in a number of different usage scenarios - would be great, although there's no substitute for evaluating that with your own data. A definitive statement about 'optimize', when (not) to use it, and what its relationship with performance is, would also help. I know there's lots about it already, but it's dotted all over the place.

Maybe this sort of information would be better in LIA2...

Antony
Re: IndexWriter shutdown
Doron Cohen wrote:
> Antony Bowesman wrote:
>> Another use this may have is that mini-optimize operations could be done
>> at more regular intervals to reduce the time for a full optimize. I could
>> then schedule mini-optimize to run for a couple of minutes at more
>> frequent intervals.
>
> This seems to assume the proposed feature allows continuing an interrupted
> merge at a later time, from where it was stopped. But if I understood
> correctly, the proposed feature does not work this way - all the
> (uncommitted) work done until shutdown will be "lost", i.e. the next
> merge() would start from scratch.

Yes, it does (wrongly) assume that. For some reason I had thought the optimize operation was a copy+pack operation, but of course it's not, so I can see why this incremental approach is not possible (or at least non-trivial). Still, the shutdown function would be useful on its own.

Antony
Re: IndexWriter shutdown
Michael Busch wrote:
> Hi, if you run Lucene as a service you want to be able to shut it down in a
> certain period of time (usually 1-2 mins). This can be a problem if the
> IndexWriter is in the middle of a merge when the service shutdown request
> is received. My question is if people think that the shutdown feature is
> something we would like to add to the Lucene core? If yes, I can go ahead
> and attach my code to a JIRA issue; if no, I'd like to make the small
> change to IndexWriter (add the protected method
> flushRamSegments(triggerMerge)). My approach seems to work quite well, but
> maybe others (e.g. the IndexWriter "experts") have different/better ideas
> how to implement it.

If these are conditions that also apply during an optimize(), then yes, I would vote for this feature. I have a Lucene based service, and optimisation takes over an hour for a freshly created 18GB index with 1.3M documents. Although optimisation can be scheduled to run at whatever time, it could be necessary to shut down the service during the optimisation, and this presents a problem in how to safely interrupt the optimize process.

Another use this may have is that mini-optimize operations could be done at more regular intervals to reduce the time for a full optimize. I could then schedule mini-optimize to run for a couple of minutes at more frequent intervals.

Antony
[jira] Created: (LUCENE-862) Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor Query, not cloned copy
Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor Query, not cloned copy
-----------------------------------------------------------------------------------------------------

Key: LUCENE-862
URL: https://issues.apache.org/jira/browse/LUCENE-862
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.1
Environment: All
Reporter: Antony Bowesman
Priority: Minor

BoostingQuery sets the boost value on the passed context Query:

    public BoostingQuery(Query match, Query context, float boost) {
        this.match = match;
        this.context = (Query) context.clone();  // clone before boost
        this.boost = boost;
        context.setBoost(0.0f);                  // ignore context-only matches
    }

This should be

    this.context.setBoost(0.0f);                 // ignore context-only matches

Also, a boost value of 0.0 may have the wrong effect - see the discussion at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12243.html
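A small hypothetical program (not from the issue) showing the side effect of the bug: the caller's context query is mutated, because the boost lands on the argument rather than the clone:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostingQuery;
import org.apache.lucene.search.TermQuery;

public class BoostingQueryBugDemo {

    public static void main(String[] args) {
        TermQuery match = new TermQuery(new Term("subject", "java"));
        TermQuery context = new TermQuery(new Term("body", "java"));
        context.setBoost(2.0f);

        new BoostingQuery(match, context, 0.1f);

        // With the bug, the caller's query has been zeroed out:
        // prints 0.0 instead of the original 2.0.
        System.out.println("context boost after ctor: " + context.getBoost());
    }
}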
[jira] Updated: (LUCENE-861) Contrib queries package Query implementations do not override equals()
[ https://issues.apache.org/jira/browse/LUCENE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antony Bowesman updated LUCENE-861:
-----------------------------------

Description:

Query implementations should override equals() so that Query instances can be cached and so that Filters can know if a Query has been used before. See the discussion in this thread:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

The following 3 contrib Query implementations do not override equals():

    org.apache.lucene.search.BoostingQuery
    org.apache.lucene.search.FuzzyLikeThisQuery
    org.apache.lucene.search.similar.MoreLikeThisQuery

The test cases below show the problem. [ContribQueriesEqualsTest snipped - identical to the test case in the issue description below]

was:

Query implementations should override equals() so that Query instances can be cached and so that Filters can know if a Query has been used before. See the discussion in this thread:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

The test cases below show the problem. [same test case, truncated in the archive]
[jira] Created: (LUCENE-861) Contrib queries package Query implementations do not override equals()
Contrib queries package Query implementations do not override equals()
-----------------------------------------------------------------------

Key: LUCENE-861
URL: https://issues.apache.org/jira/browse/LUCENE-861
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.1
Environment: All
Reporter: Antony Bowesman
Priority: Minor

Query implementations should override equals() so that Query instances can be cached and so that Filters can know if a Query has been used before. See the discussion in this thread:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

The test cases below show the problem.

package com.teamware.office.lucene.search;

import static org.junit.Assert.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostingQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similar.MoreLikeThisQuery;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ContribQueriesEqualsTest {

    /**
     * @throws java.lang.Exception
     */
    @Before
    public void setUp() throws Exception {
    }

    /**
     * @throws java.lang.Exception
     */
    @After
    public void tearDown() throws Exception {
    }

    /**
     * Show that the BoostingQuery in the queries contrib package
     * does not implement equals() correctly.
     */
    @Test
    public void testBoostingQueryEquals() {
        TermQuery q1 = new TermQuery(new Term("subject:", "java"));
        TermQuery q2 = new TermQuery(new Term("subject:", "java"));
        assertEquals("Two TermQueries with same attributes should be equal", q1, q2);
        BoostingQuery bq1 = new BoostingQuery(q1, q2, 0.1f);
        BoostingQuery bq2 = new BoostingQuery(q1, q2, 0.1f);
        assertEquals("BoostingQuery with same attributes is not equal", bq1, bq2);
    }

    /**
     * Show that the MoreLikeThisQuery in the queries contrib package
     * does not implement equals() correctly.
     */
    @Test
    public void testMoreLikeThisQueryEquals() {
        String moreLikeFields[] = new String[] {"subject", "body"};
        MoreLikeThisQuery mltq1 = new MoreLikeThisQuery("java", moreLikeFields, new StandardAnalyzer());
        MoreLikeThisQuery mltq2 = new MoreLikeThisQuery("java", moreLikeFields, new StandardAnalyzer());
        assertEquals("MoreLikeThisQuery with same attributes is not equal", mltq1, mltq2);
    }

    /**
     * Show that the FuzzyLikeThisQuery in the queries contrib package
     * does not implement equals() correctly.
     */
    @Test
    public void testFuzzyLikeThisQueryEquals() {
        FuzzyLikeThisQuery fltq1 = new FuzzyLikeThisQuery(10, new StandardAnalyzer());
        fltq1.addTerms("javi", "subject", 0.5f, 2);
        FuzzyLikeThisQuery fltq2 = new FuzzyLikeThisQuery(10, new StandardAnalyzer());
        fltq2.addTerms("javi", "subject", 0.5f, 2);
        assertEquals("FuzzyLikeThisQuery with same attributes is not equal", fltq1, fltq2);
    }
}
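For illustration, a sketch of the sort of override the issue asks for, written as methods to add inside BoostingQuery itself, using its match/context/boost members shown in LUCENE-862. This is illustrative only, not the committed fix:

// Hypothetical equals()/hashCode() pair for BoostingQuery, comparing the
// match query, the (cloned) context query and the boost factor.
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof BoostingQuery)) return false;
    BoostingQuery other = (BoostingQuery) o;
    return this.boost == other.boost
        && this.match.equals(other.match)
        && this.context.equals(other.context);
}

public int hashCode() {
    int result = match.hashCode();
    result = 31 * result + context.hashCode();
    result = 31 * result + Float.floatToIntBits(boost);
    return result;
}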
Re: optimize() method call
Robert Engels wrote:
> I think this is great, and it gave me an idea. What if another thread could
> call a "stop optimize" which would stop the optimize after it came to a
> consistent state (not in the middle of a segment merge). We schedule our
> optimizes for the "lull" time period, but with 24/7 operation this could be
> hard to find. Being able to stop and then resume the optimize seems like a
> great idea.

+1. It would be useful in shutdown cases where immediate shutdown is needed, or to allow a scheduled backup to kick in at a fixed time, rather than having to wait for optimize to complete. Or is there another way to interrupt optimize safely?

Antony
Re: ScoreDocComparator extends Comparator?
Oops - Java 1.5 PriorityQueue.remove(o) would not be useful for ScoreDoc, as it would delete the first object where compare(o1, o2) == 0.

Antony

> Should ScoreDocComparator extend java.util.Comparator? The existing
> compare() method has the Javadoc comment @see java.util.Comparator. It
> would then be useful with Java 1.5's PriorityQueue, and that would be good
> because PriorityQueue has a remove() method which makes it useful for
> manipulating the queue.
ScoreDocComparator extends Comparator?
Should ScoreDocComparator extend java.util.Comparator? The existing compare() method has the Javadoc comment @see java.util.Comparator. It would then be useful with Java 1.5's PriorityQueue, and that would be good because PriorityQueue has a remove() method which makes it useful for manipulating the queue.

Antony
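A sketch of the intended use, with a stand-in Comparator since ScoreDocComparator does not currently extend java.util.Comparator; the class name is made up:

import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.search.ScoreDoc;

public class ScoreDocQueueSketch {

    public static void main(String[] args) {
        // Stand-in comparator; if ScoreDocComparator extended
        // java.util.Comparator, a Lucene comparator could be passed instead.
        Comparator<ScoreDoc> byScore = new Comparator<ScoreDoc>() {
            public int compare(ScoreDoc a, ScoreDoc b) {
                return Float.compare(a.score, b.score);
            }
        };

        PriorityQueue<ScoreDoc> queue = new PriorityQueue<ScoreDoc>(10, byScore);
        ScoreDoc d1 = new ScoreDoc(1, 0.5f);
        ScoreDoc d2 = new ScoreDoc(2, 0.9f);
        queue.offer(d1);
        queue.offer(d2);

        queue.remove(d1);                      // the remove() in question
        System.out.println(queue.peek().doc);  // prints 2
    }
}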
Re: ANN: Luke 0.7 released
Great Andrzej, that fixed it. Thanks.

Antony

Andrzej Bialecki wrote:
> Antony Bowesman wrote:
>> With the luke.jar download, it throws an Exception
>> java.lang.NoClassDefFoundError: org/apache/lucene/index/IndexGate
>
> Fixed - I uploaded an updated jar. Sorry for the problem.
Re: ANN: Luke 0.7 released
Hi Andrzej,

Thanks for this - it's a great tool. With the luke.jar download, it throws an Exception:

java.lang.NoClassDefFoundError: org/apache/lucene/index/IndexGate
        at org.getopt.luke.Luke.getIndexFileNames(Unknown Source)
        at org.getopt.luke.Luke.showFiles(Unknown Source)
        at org.getopt.luke.Luke.initOverview(Unknown Source)
        at org.getopt.luke.Luke.openIndex(Unknown Source)
        at org.getopt.luke.Luke.openOk(Unknown Source)

That class seems to be part of the Luke sources, but is not in luke.jar. It is in lukemin and lukeall. I can't find it in the Lucene source tree.

Cheers
Antony

Andrzej Bialecki wrote:
> Hi all,
>
> I'm happy to announce that a new version of Luke - the Lucene Index Toolbox
> - is now available. As usual, you can get it from:
> http://www.getopt.org/luke
>
> Highlights of this release:
> * support for Lucene 2.1.0 release and earlier
> * pagination of search results
> * support for many new Field flags
> * new plugin for term analysis (contributed by Mark Harwood)
> * many other usability and functionality improvements.
>
> Have fun!
Re: Lucene 2.1, soon
Yonik Seeley wrote:
> Lucene 2.1 has been a long time in coming, but I think we should plan on
> making a release when the file format changes settle down.

Was there any kind of consensus on what 'soon' meant? Is it likely to be days, this month, or sometime later? I'd really like to get lockless commits, but am wary of just taking the latest build for a production environment.

Antony
Re: Analyzer thread safety; Stop words
Yonik Seeley wrote:
> On 11/29/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:
>
> The GreekAnalyzer is just an example of how you can use existing Analyzers
> (as long as they have a default constructor), but it's not the recommended
> approach. TokenFilters are preferred over Analyzers - you can plug them
> together in any way you see fit to solve your analysis problem. For Solr,
> an added bonus of using chains of filters is that Solr can "know" about the
> results after each filter and show you the results on an analysis web page
> (very useful for debugging). If I were to analyze greek text, I might do
> something like this:
>
> [Solr analyzer configuration XML garbled in the archive]
>
> If you try to put everything in Analyzer constructors, you get
> combinatorial explosion.

I guess you would use methods rather than, as you say, getting into constructor hell. Anyway, I'll have a deeper look at the Solr stuff when I get to phase 2. Right now, I've gone as far with analysis as I need to, but I would like to get better configuration than I've currently got. I know it will come back to bite...

Thanks for your comments Yonik

Antony
Re: Analyzer thread safety; Stop words
Yonik Seeley wrote:
> On 11/29/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:
>> That's true, but all the existing Analyzers allow the stop set to be
>> configured via the analyzer constructors, but in different ways.
>
> But you can duplicate most Analyzers (all the ones in Lucene?) with a chain
> of Tokenizers and TokenFilters (since that is how almost all of them are
> implemented). Most Analyzers are simply shortcuts to putting together your
> own.

Something seems confused to me. Although stop words are used by Filters, they are currently exposed via Analyzers, which is the granularity used at the IndexWriter/Parser levels. This is what contributors are writing, not Filters. There are lots of analysis contributions which deal with stop words that are perfectly usable as is. They shouldn't need to be duplicated to be re-used, and if that's needed, it points to a deficiency in the design.

If we all have to put together our own, again, doesn't this argue that there should be a standard way of doing it at the higher Analyzer level? Sure, the Solr way of using configurable filters gives great flexibility, and your solrconfig.xml example shows how the GreekAnalyzer can be deployed, but it also highlights the problem that it does not seem to be possible to make use of the stopword Hashtable available to the GreekAnalyzer constructor.

It seems to me that Lucene would benefit if there was an Analyzer interface (see the sketch below). On the other hand, maybe your TokenFilterFactory stuff would be useful as part of Lucene. Anyway, just my penny's worth.

Antony
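To make the suggestion concrete, a sketch of the kind of interface being argued for - entirely hypothetical, nothing like it exists in Lucene:

import java.io.File;
import java.io.IOException;
import java.util.Set;

// Hypothetical: a common contract analyzers could implement so that stop
// words can be configured generically, whatever the concrete analyzer.
public interface StopwordConfigurable {
    void setStopWords(Set stopWords);
    void setStopWords(String[] stopWords);
    void setStopWords(File stopWordFile) throws IOException;
}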
Re: Analyzer thread safety; Stop words
Hi Yonik,

Thanks for your comments.

>> Secondly, has anyone thought that it would be a good idea to extend the
>> Analyzer interface (Abstract class) to allow a standard way to set stop
>> words? There seem to be two 'families' of stop word configuration via
>> constructors.
>
> That belongs at the TokenFilter level (where it currently is).

That's true, but all the existing Analyzers allow the stop set to be configured via the analyzer constructors, but in different ways. For example, StandardAnalyzer has:

    public StandardAnalyzer(String[] stopWords)
    public StandardAnalyzer(Set stopWords)
    public StandardAnalyzer(File stopwords)

whereas RussianAnalyzer has:

    public RussianAnalyzer(char[] charset, Hashtable stopwords)
    public RussianAnalyzer(char[] charset, String[] stopwords)

so this does not make common stop word configuration possible without some messy code that looks at constructor signatures and makes some guesses. Perhaps the Analyzer class could have some default methods, e.g.

    public void setStopWords(File stopWordFile);
    public void setStopWords(Set stopWordSet);
    public void setStopWords(String[] stopWords);

> Things currently are pluggable: one makes new Analyzers by plugging
> together a Tokenizer followed by several TokenFilters. If you are talking
> about some sort of external configuration, take a look at Solr.

Yes, you've done some nice stuff there with Solr. Unfortunately, I only came across it some time after I'd already done a lot of the work for our system.

Antony
Analyzer thread safety; Stop words
Two points about Analyzers:

Does anyone have any experience with the thread safety of Analyzer implementations? Apart from PerFieldAnalyzerWrapper, the analyzers seem to be thread safe, but is there a requirement that analyzers should be thread safe?

Secondly, has anyone thought that it would be a good idea to extend the Analyzer interface (Abstract class) to allow a standard way to set stop words? There seem to be two 'families' of stop word configuration via constructors: the Set, File and String[] variants in analyzers such as StandardAnalyzer and StopAnalyzer, and then the Russian/Greek variants that do not have the same constructor signature to configure stopwords. It makes it messy to make analyzers pluggable in a generic way so that stopwords can be configured for any plugged-in analyzer.

Antony