Change to MultiReader

2008-09-11 Thread Antony Bowesman

There was a message from Kirk Roberts, 18/4/2007 - MultiSearcher vs MultiReader

Grant mentioned the visibility of the readerIndex() method in MultiReader, but 
nothing seems to have come of it.


Is there any reason why the following could not be put into MultiReader? 
Something like this seems necessary when handling multiple indices to solve the 
BitSet caching issue I raised on the user thread.


It's slightly more efficient for a Filter implementation's bits() method to track 
these reader numbers itself (as the doc id always seems to increment) rather than 
delegating back to the reader to resolve them on each call.  Even so, having these 
methods on MultiReader provides useful utilities for doing so, and leaves the 
underlying implementation free to change if it needs to.


Antony

/** Fetches the IndexReader instance where the specified document exists
 *  @param  n the MultiReader document number
 *  @return the reader index
 */
public int readerIndex(int n) {// find reader for doc n:
  return MultiSegmentReader.readerIndex(n, this.starts, this.subReaders.length);
}

/** Fetches the document number in the specified reader for the given document
 *  number.
 *  @param  i the reader index obtained from {@link #readerIndex(int)}
 *  @param  n the MultiReader document number
 *  @return the mapped document number
 */
public int id(int i, int n) {// find true doc for doc n:
  return n - this.starts[i];
}

/** Fetches the document number in the owning sub-reader for the given document
 *  number.
 *  @param  n the MultiReader document number
 *  @return the mapped document number
 */
public int id(int n) {// find true doc for doc n:
  return n - this.starts[readerIndex(n)];
}
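
For illustration, here is a hypothetical sketch of how a Filter's bits() 
implementation might use these helpers.  The per-sub-reader BitSet cache 
(cachedPerReader) is made up for the example, and the methods above are assumed 
to exist on MultiReader:

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.Filter;

public class PerReaderCachedFilter extends Filter {
  // Hypothetical cache: one BitSet per sub-reader, in sub-reader doc id space.
  private final BitSet[] cachedPerReader;

  public PerReaderCachedFilter(BitSet[] cachedPerReader) {
    this.cachedPerReader = cachedPerReader;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    MultiReader multi = (MultiReader) reader;
    BitSet result = new BitSet(multi.maxDoc());
    for (int doc = 0; doc < multi.maxDoc(); doc++) {
      int i = multi.readerIndex(doc);   // which sub-reader holds doc
      int subDoc = multi.id(i, doc);    // doc id within that sub-reader
      if (cachedPerReader[i].get(subDoc)) {
        result.set(doc);
      }
    }
    return result;
  }
}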






[jira] Commented: (LUCENE-1150) The token types of the standard tokenizer is not accessible

2008-04-15 Thread Antony Bowesman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588953#action_12588953
 ] 

Antony Bowesman commented on LUCENE-1150:
-

The original tokenImage String array from 2.2 is still not available in this 
patch; it remains in the Impl class.  These are the values returned from 
Token.type(), so should they not be visible as well as the static ints?
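
For illustration, a custom filter keyed off the type string might look like this 
(a minimal sketch against the 2.3-era TokenStream API, assuming the ints and 
TOKEN_TYPES strings end up exported from StandardTokenizer as proposed here):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class DropNumbersFilter extends TokenFilter {
  public DropNumbersFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      // Compare against the exported type string instead of a hard-coded literal.
      if (!StandardTokenizer.TOKEN_TYPES[StandardTokenizer.NUM].equals(t.type())) {
        return t;
      }
    }
    return null;
  }
}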


> The token types of the standard tokenizer is not accessible
> ---
>
> Key: LUCENE-1150
> URL: https://issues.apache.org/jira/browse/LUCENE-1150
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Nicolas Lalevée
>Assignee: Michael McCandless
> Fix For: 2.3.2, 2.4
>
> Attachments: LUCENE-1150.patch, LUCENE-1150.take2.patch
>
>
> Because StandardTokenizerImpl is not public, these token types are not 
> accessible:
> {code:java}
> public static final int ALPHANUM  = 0;
> public static final int APOSTROPHE= 1;
> public static final int ACRONYM   = 2;
> public static final int COMPANY   = 3;
> public static final int EMAIL = 4;
> public static final int HOST  = 5;
> public static final int NUM   = 6;
> public static final int CJ= 7;
> /**
>  * @deprecated this solves a bug where HOSTs that end with '.' are identified
>  * as ACRONYMs. It is deprecated and will be removed in the next
>  * release.
>  */
> public static final int ACRONYM_DEP   = 8;
> public static final String [] TOKEN_TYPES = new String [] {
> "<ALPHANUM>",
> "<APOSTROPHE>",
> "<ACRONYM>",
> "<COMPANY>",
> "<EMAIL>",
> "<HOST>",
> "<NUM>",
> "<CJ>",
> "<ACRONYM_DEP>"
> };
> {code}
> So no custom TokenFilter can be based on the token type. Actually even the 
> StandardFilter cannot be written outside the 
> org.apache.lucene.analysis.standard package.






Re: StandardTokenizerConstants in 2.3

2008-04-09 Thread Antony Bowesman

Thanks Mike/Hoss for the clarification.
Antony


Michael McCandless wrote:


Chris Hostetter wrote:


: > But, StandardTokenizer is public?  It "exports" those constants 
for you?

:
: Really?  Sorry, but I can't find them - in 2.3.1 sources, there are no
: references to those statics.  Javadocs have no reference to them in
: StandardTokenizer

I think Michael is forgetting that he re-added those constants to the
trunk after 2.3.1 was released...

https://issues.apache.org/jira/browse/LUCENE-1150


Woops!  I'm sorry Antony -- Hoss is correct.

I didn't realize this missed 2.3.  I'll backport this fix to the 2.3 branch 
so it'll be included when we release 2.3.2 (which I think we should do 
soon -- a lot of little fixes have been backported).


Mike









Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Antony Bowesman

But, StandardTokenizer is public?  It "exports" those constants for you?


Really?  Sorry, but I can't find them - in 2.3.1 sources, there are no 
references to those statics.  Javadocs have no reference to them in 
StandardTokenizer


http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

and I can't see ALPHANUM in the Javadoc index.  Eclipse cannot resolve them.

Am I missing something?
Antony




Mike

Antony Bowesman wrote:

But, the constants that are used by StandardTokenizer are still
available as static ints in the StandardTokenizer class (ie, ALPHANUM,
APOSTROPHE, etc.).  Does that work?


The problem, as mentioned below, is that StandardTokenizerImpl is package 
private and, even though the ints and string array are declared as public static, 
they are not visible outside the package.


Antony







Re: StandardTokenizerConstants in 2.3

2008-04-08 Thread Antony Bowesman

But, the constants that are used by StandardTokenizer are still
available as static ints in the StandardTokenizer class (ie, ALPHANUM,
APOSTROPHE, etc.).  Does that work?


The problem, as mentioned below, is that StandardTokenizerImpl is package 
private and, even though the ints and string array are declared as public static, 
they are not visible outside the package.


Antony




Mike

Antony Bowesman wrote:
I'm migrating from 2.1 to 2.3 and found that the public interface 
StandardTokenizerConstants has gone.  It looks like the definitions 
have disappeared inside the package private class StandardTokenizerImpl.


Was this intentional?  I was using these to determine the return 
values from Token.type().


Antony














Re: Sort difference between 2.1 and 2.3

2008-04-08 Thread Antony Bowesman
Thanks for the explanation Mike.  It's not a big issue, it's just a test case 
where I needed to ensure ordering for the test, so I'll just use a valid 
high UTF-16 character.  It just seemed odd that the field was showing strangely 
in Luke.  Your explanation gives the reason, thanks.


Antony



Michael McCandless wrote:

You're right, Lucene changed wrt the 0xFFFF character: 2.3 now uses
this character internally as an "end of term" marker when storing term
text.

This was done as part of LUCENE-843 (speeding up indexing).

Technically that character is an invalid UTF16 character (for
interchange), but it looks like a few Lucene users were indeed relying
on older Lucene versions accepting & preserving it.

You could use 0xFFFE instead?  Lucene 2.3 will preserve it, though
it's also invalid for interchange (so future Lucene versions might
change wrt that, too).

Or ... it looks like your use case is to sort all "last" values
after all "first" values?  In which case one way to do this (without
using invalid UTF-16 characters) might be to add a new field marking
whether you have a "last" or a "first" value, then sort first by that
field and second by your value field?

Mike
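
A minimal sketch of that two-field approach (the field name "group" and the 
marker values are my own assumptions for the example, not anything Lucene 
prescribes):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class FirstLastSortSketch {
  /** Index a marker field so "last" entries sort after "first" ones. */
  public static Document makeDoc(String subject, boolean isLast) {
    Document doc = new Document();
    doc.add(new Field("group", isLast ? "1" : "0",
                      Field.Store.NO, Field.Index.NO_NORMS));
    doc.add(new Field("subject", subject,
                      Field.Store.YES, Field.Index.NO_NORMS));
    return doc;
  }

  /** Sort by the marker field first, then by the subject field. */
  public static Sort makeSort() {
    return new Sort(new SortField[] {
        new SortField("group"), new SortField("subject") });
  }
}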

Antony Bowesman <[EMAIL PROTECTED]> wrote:

Hi,

 I had a test case that added two documents, each with one untokenized
field, and sorted them.  The data in each document was

 char(1) + "First"
 char(0xffff) + "Last"

 With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1,
they are not.  Looking at the index with Luke shows that the document with
"Last" has not been handled correctly, i.e. the text for the "subject" field
is empty.

 The test case below shows the problem.

 Regards
 Antony


 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertTrue;

 import java.io.IOException;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.search.Hits;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.MatchAllDocsQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.Sort;
 import org.apache.lucene.search.SortField;
 import org.junit.After;
 import org.junit.Before;
 import org.junit.Test;

 public class LastSubjectTest
 {
/**
 *  Set up a number of documents with 1 duplicate ContentId
 *  @throws Exception
 */
@Before
public void setUp() throws Exception
{
IndexWriter writer = new IndexWriter("TestDir/", new
StandardAnalyzer(), true);
Document doc = new Document();
String subject = new StringBuffer(1).append((char)0xffff).toString()
+ "Last";
Field f = new Field("subject", subject, Field.Store.YES,
Field.Index.NO_NORMS);
doc.add(f);
writer.addDocument(doc);
doc = new Document();
subject = new StringBuffer(1).append((char)0x1).toString() +
"First";
f = new Field("subject", subject, Field.Store.YES,
Field.Index.NO_NORMS);
doc.add(f);
writer.addDocument(doc);
writer.close();
}

/**
 *  @throws Exception
 */
@After
public void tearDown() throws Exception
{
}

/**
 *  Tests that the last is after first document, sorted by subject
 *  @throws IOException
 */
@Test
public void testSortDateAscending()
   throws IOException
{
IndexSearcher searcher = new IndexSearcher("TestDir/");
Query q = new MatchAllDocsQuery();
Sort sort = new Sort(new SortField("subject"));
Hits hits = searcher.search(q, sort);
assertEquals("Hits should match all documents",
searcher.getIndexReader().maxDoc(), hits.length());

Document fd = hits.doc(0);
Document ld = hits.doc(1);
String fs = fd.get("subject");
String ls = ld.get("subject");

for (int i = 0; i < hits.length(); i++)
{
Document doc = hits.doc(i);
String subject = doc.get("subject");
System.out.println("Subject:" + subject);
}
assertTrue("Subjects have been sorted incorrectly", fs.compareTo(ls)
< 0);
}

 }














Sort difference between 2.1 and 2.3

2008-04-07 Thread Antony Bowesman

Hi,

I had a test case that added two documents, each with one untokenized field, and 
sorted them.  The data in each document was


char(1) + "First"
char(0xffff) + "Last"

With Lucene 2.1 the documents are sorted correctly, but with Lucene 2.3.1, they 
are not.  Looking at the index with Luke shows that the document with "Last" has 
not been handled correctly, i.e. the text for the "subject" field is empty.


The test case below shows the problem.

Regards
Antony


import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class LastSubjectTest
{
/**
 *  Set up a number of documents with 1 duplicate ContentId
 *  @throws Exception
 */
@Before
public void setUp() throws Exception
{
IndexWriter writer = new IndexWriter("TestDir/", new 
StandardAnalyzer(), true);

Document doc = new Document();
String subject = new StringBuffer(1).append((char)0xffff).toString() + 
"Last";
Field f = new Field("subject", subject, Field.Store.YES, 
Field.Index.NO_NORMS);

doc.add(f);
writer.addDocument(doc);
doc = new Document();
subject = new StringBuffer(1).append((char)0x1).toString() + "First";
f = new Field("subject", subject, Field.Store.YES, 
Field.Index.NO_NORMS);
doc.add(f);
writer.addDocument(doc);
writer.close();
}

/**
 *  @throws Exception
 */
@After
public void tearDown() throws Exception
{
}

/**
 *  Tests that the last is after first document, sorted by subject
 *  @throws IOException
 */
@Test
public void testSortDateAscending()
   throws IOException
{
IndexSearcher searcher = new IndexSearcher("TestDir/");
Query q = new MatchAllDocsQuery();
Sort sort = new Sort(new SortField("subject"));
Hits hits = searcher.search(q, sort);
assertEquals("Hits should match all documents", 
searcher.getIndexReader().maxDoc(), hits.length());


Document fd = hits.doc(0);
Document ld = hits.doc(1);
String fs = fd.get("subject");
String ls = ld.get("subject");

for (int i = 0; i < hits.length(); i++)
{
Document doc = hits.doc(i);
String subject = doc.get("subject");
System.out.println("Subject:" + subject);
}
assertTrue("Subjects have been sorted incorrectly", fs.compareTo(ls) < 
0);
}

}





StandardTokenizerConstants in 2.3

2008-04-07 Thread Antony Bowesman
I'm migrating from 2.1 to 2.3 and found that the public interface 
StandardTokenizerConstants has gone.  It looks like the definitions have 
disappeared inside the package private class StandardTokenizerImpl.


Was this intentional?  I was using these to determine the return values from 
Token.type().


Antony





FieldSortedHitQueue.maxscore

2008-01-08 Thread Antony Bowesman

Out of interest, is maxscore supposed to be

a) the max score of all the items inserted into the queue, even though they may 
have dropped out of the final results, or

b) the max score of the 'size' items remaining in the queue?

Currently it reflects (a), but I just wondered whether that was correct.

Regards
Antony







FieldSortedHitQueue.fillFields() not visible

2008-01-07 Thread Antony Bowesman
I'm implementing a HitCollector to do sorting and will use FieldSortedHitQueue, 
but for some reason the fillFields() method is package private.


Judging from the comments on the method, I don't need it, but if I do later on, 
I can't use it, unless of course I extend the class and copy the existing code.


Was this done on purpose?

Regards
Antony








Re: Documentation Brainstorming

2007-05-28 Thread Antony Bowesman

Grant Ingersoll wrote:
Mind you, our docs are an order of magnitude better than 
this other project 


I agree, Lucene is a very well documented project compared to many.  In general, 
and in conjunction with LIA, it's a pretty easy project to get into.


3.  There is a whole lot of knowledge stored in the email archives, how 
can we leverage it?


This is indeed a key point.  HitCollector and surrounding classes are poorly 
documented and there have been many replies to questions which recommend using a 
HitCollector.


The search package is generally well described, apart from what are described as 
'low level API' or 'expert' methods and classes.  I found I needed to get to 
that level to get the best out of Lucene in a framework that sits on top of it.


Performance is another topic which would really benefit from a 'best practice' 
guide.  The dev and user posts concerning performance always get many responses. 
Although a challenge to produce, it would be great to bring together some kind of 
recommendations relating user data to reader/writer usage, e.g. what 
maxBufferedDocs, maxMergeDocs and mergeFactor to use in a number of different 
usage scenarios - although there's no substitute for evaluating that with your 
own data.
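
For reference, a minimal sketch of where those knobs live in the 2.x IndexWriter 
API (the values are arbitrary placeholders, not recommendations):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterTuningSketch {
  public static IndexWriter openTunedWriter(String path) throws IOException {
    IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(1000);            // docs buffered in RAM before a segment is flushed
    writer.setMergeFactor(10);                  // how many segments are merged at a time
    writer.setMaxMergeDocs(Integer.MAX_VALUE);  // cap on docs in any merged segment
    return writer;
  }
}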


A definitive statement about 'optimize' - when (not) to use it and what its 
relationship to performance is - would also be valuable.  I know there's lots 
about it already, but it's dotted all over the place.


Maybe this sort of information would be better in LIA2...
Antony






Re: IndexWriter shutdown

2007-05-21 Thread Antony Bowesman

Doron Cohen wrote:

Antony Bowesman wrote:

Another use this may have is that mini-optimize operations could be
done at more
regular intervals to reduce the time for a full optimize.  I could
then schedule
mini-optimise to run for a couple of minutes at more frequent intervals.


This seems to assume the proposed feature allows to continue an
interrupted merge at a later time, from where it was stopped. But
if I understood correctly then the proposed feature does not work
this way - so all the (uncommitted) work done until shutdown will
be "lost" - i.e. next merge() would start from scratch.


Yes, it does (wrongly) assume that.  For some reason I had thought the optimize 
operation was a copy+pack operation, but of course it's not, so I can see why 
this incremental approach is not possible (or at least non-trivial).


Still, the shutdown function would be useful on its own.
Antony







Re: IndexWriter shutdown

2007-05-20 Thread Antony Bowesman

Michael Busch wrote:

Hi,

if you run Lucene as a service you want to be able to shut it down in a 
certain period of time (usually 1-2 mins). This can be a problem if the 
IndexWriter is in the middle of a merge when the service shutdown 
request is received.



My question is if people think that the shutdown feature is something we 
would like to add to the Lucene core? If yes, I can go ahead and attach 
my code to a JIRA issue, if no I'd like to make the small change to 
IndexWriter (add the protected method flushRamSegments(triggerMerge)). 
My approach seems to work quite well, but maybe others (e. g. the 
IndexWriter "experts") have different/better ideas how to implement it.


If these are conditions that also apply during an optimize(), then yes, I would 
vote for this feature.  I have a Lucene based service and optimisation takes 
over an hour for a freshly created 18GB index with 1.3M documents.


Although optimisation can be scheduled to run at whatever time, it could be 
necessary to shut down the service during the optimisation and this presents a 
problem in how to safely interrupt the optimize process.


Another use this may have is that mini-optimize operations could be done at more 
regular intervals to reduce the time for a full optimize.  I could then schedule 
mini-optimise to run for a couple of minutes at more frequent intervals.


Antony





[jira] Created: (LUCENE-862) Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor Query, not cloned copy

2007-04-12 Thread Antony Bowesman (JIRA)
Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor 
Query, not cloned copy
-

 Key: LUCENE-862
 URL: https://issues.apache.org/jira/browse/LUCENE-862
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.1
 Environment: All
Reporter: Antony Bowesman
Priority: Minor


BoostingQuery sets the boost value on the passed context Query:

public BoostingQuery(Query match, Query context, float boost) {
  this.match = match;
  this.context = (Query)context.clone();   // clone before boost
  this.boost = boost;

  context.setBoost(0.0f);                  // ignore context-only matches
}

This should be:
  this.context.setBoost(0.0f);             // ignore context-only matches

Also, a boost value of 0.0 may have the wrong effect - see the discussion at

http://www.mail-archive.com/[EMAIL PROTECTED]/msg12243.html 
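
Putting that together, the corrected constructor would read (my reading of the 
intended fix, shown for clarity):

public BoostingQuery(Query match, Query context, float boost) {
  this.match = match;
  this.context = (Query)context.clone();   // clone before changing boost
  this.boost = boost;

  this.context.setBoost(0.0f);             // zero the clone, not the caller's Query
}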








[jira] Updated: (LUCENE-861) Contrib queries package Query implementations do not override equals()

2007-04-12 Thread Antony Bowesman (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antony Bowesman updated LUCENE-861:
---

Description: 
Query implementations should override equals() so that Query instances can be 
cached and that Filters can know if a Query has been used before.  See the 
discussion in this thread.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

The following three contrib Query implementations do not override equals():

org.apache.lucene.search.BoostingQuery;
org.apache.lucene.search.FuzzyLikeThisQuery;
org.apache.lucene.search.similar.MoreLikeThisQuery;

Test cases below show the problem.

package com.teamware.office.lucene.search;

import static org.junit.Assert.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostingQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similar.MoreLikeThisQuery;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class ContribQueriesEqualsTest
{
/**
 * @throws java.lang.Exception
 */
@Before
public void setUp() throws Exception
{
}

/**
 * @throws java.lang.Exception
 */
@After
public void tearDown() throws Exception
{
}

/**
 *  Show that the BoostingQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testBoostingQueryEquals()
{
TermQuery q1 = new TermQuery(new Term("subject:", "java"));
TermQuery q2 = new TermQuery(new Term("subject:", "java"));
assertEquals("Two TermQueries with same attributes should be equal", 
q1, q2);
BoostingQuery bq1 = new BoostingQuery(q1, q2, 0.1f);
BoostingQuery bq2 = new BoostingQuery(q1, q2, 0.1f);
assertEquals("BoostingQuery with same attributes is not equal", bq1, 
bq2);
}

/**
 *  Show that the MoreLikeThisQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testMoreLikeThisQueryEquals()
{
String moreLikeFields[] = new String[] {"subject", "body"};

MoreLikeThisQuery mltq1 = new MoreLikeThisQuery("java", moreLikeFields, 
new StandardAnalyzer());
MoreLikeThisQuery mltq2 = new MoreLikeThisQuery("java", moreLikeFields, 
new StandardAnalyzer());
assertEquals("MoreLikeThisQuery with same attributes is not equal", 
mltq1, mltq2);
}
/**
 *  Show that the FuzzyLikeThisQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testFuzzyLikeThisQueryEquals()
{
FuzzyLikeThisQuery fltq1 = new FuzzyLikeThisQuery(10, new 
StandardAnalyzer());
fltq1.addTerms("javi", "subject", 0.5f, 2);
FuzzyLikeThisQuery fltq2 = new FuzzyLikeThisQuery(10, new 
StandardAnalyzer());
fltq2.addTerms("javi", "subject", 0.5f, 2);
assertEquals("FuzzyLikeThisQuery with same attributes is not equal", 
fltq1, fltq2);
}
}


  was:
Query implementations should override equals() so that Query instances can be 
cached and that Filters can know if a Query has been used before.  See the 
discussion in this thread.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

Test cases below show the problem.

package com.teamware.office.lucene.search;

import static org.junit.Assert.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostingQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similar.MoreLikeThisQuery;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class ContribQueriesEqualsTest
{
/**
 * @throws java.lang.Exception
 */
@Before
public void setUp() throws Exception
{
}

/**
 * @throws java.lang.Exception
 */
@After
public void tearDown() throws Exception
{
}

/**
 *  Show that the BoostingQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testBoostingQueryEquals()
{
TermQuery q1 = new TermQuery(new Term("subject:", "java"));
TermQuery q2 = new TermQuery(new Term("subject:", "java"));
assertEquals("Two TermQueries with same attributes should be equal", 
q1, q2);
BoostingQuery bq1 = new BoostingQuery(q1, q2, 0.1f);
BoostingQuery bq2 = new BoostingQuery(q1, q2, 0.1f);
  

[jira] Created: (LUCENE-861) Contrib queries package Query implementations do not override equals()

2007-04-12 Thread Antony Bowesman (JIRA)
Contrib queries package Query implementations do not override equals()
--

 Key: LUCENE-861
 URL: https://issues.apache.org/jira/browse/LUCENE-861
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.1
 Environment: All
Reporter: Antony Bowesman
Priority: Minor


Query implementations should override equals() so that Query instances can be 
cached and that Filters can know if a Query has been used before.  See the 
discussion in this thread.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

Test cases below show the problem.

package com.teamware.office.lucene.search;

import static org.junit.Assert.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostingQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similar.MoreLikeThisQuery;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class ContribQueriesEqualsTest
{
/**
 * @throws java.lang.Exception
 */
@Before
public void setUp() throws Exception
{
}

/**
 * @throws java.lang.Exception
 */
@After
public void tearDown() throws Exception
{
}

/**
 *  Show that the BoostingQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testBoostingQueryEquals()
{
TermQuery q1 = new TermQuery(new Term("subject:", "java"));
TermQuery q2 = new TermQuery(new Term("subject:", "java"));
assertEquals("Two TermQueries with same attributes should be equal", 
q1, q2);
BoostingQuery bq1 = new BoostingQuery(q1, q2, 0.1f);
BoostingQuery bq2 = new BoostingQuery(q1, q2, 0.1f);
assertEquals("BoostingQuery with same attributes is not equal", bq1, 
bq2);
}

/**
 *  Show that the MoreLikeThisQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testMoreLikeThisQueryEquals()
{
String moreLikeFields[] = new String[] {"subject", "body"};

MoreLikeThisQuery mltq1 = new MoreLikeThisQuery("java", moreLikeFields, 
new StandardAnalyzer());
MoreLikeThisQuery mltq2 = new MoreLikeThisQuery("java", moreLikeFields, 
new StandardAnalyzer());
assertEquals("MoreLikeThisQuery with same attributes is not equal", 
mltq1, mltq2);
}
/**
 *  Show that the FuzzyLikeThisQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testFuzzyLikeThisQueryEquals()
{
FuzzyLikeThisQuery fltq1 = new FuzzyLikeThisQuery(10, new 
StandardAnalyzer());
fltq1.addTerms("javi", "subject", 0.5f, 2);
FuzzyLikeThisQuery fltq2 = new FuzzyLikeThisQuery(10, new 
StandardAnalyzer());
fltq2.addTerms("javi", "subject", 0.5f, 2);
assertEquals("FuzzyLikeThisQuery with same attributes is not equal", 
fltq1, fltq2);
}
}
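
For illustration, the kind of methods BoostingQuery could add to fix this (a 
sketch based on the fields visible in its constructor - match, context and 
boost - not the committed implementation):

public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof BoostingQuery)) return false;
    BoostingQuery other = (BoostingQuery) o;
    return this.boost == other.boost
        && this.match.equals(other.match)
        && this.context.equals(other.context);
}

public int hashCode() {
    int result = Float.floatToIntBits(boost);
    result = 31 * result + match.hashCode();
    result = 31 * result + context.hashCode();
    return result;
}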







Re: optimize() method call

2007-04-11 Thread Antony Bowesman

Robert Engels wrote:

I think this is great, and it gave me an idea. What if another thread could
call a "stop optimize" which would stop the optimize after it came to a
consistent state (not in the middle of a segment merge).

We schedule our optimizes for the "lull" time period, but with 24/7 operation
this could be hard to find.

Being able to stop and then resume the optimize seems like a great idea.


+1.  It would be useful in shutdown cases where immediate shutdown is needed, or 
to allow a scheduled backup to kick in at a fixed time, rather than having to 
wait for optimize to complete.  Or is there another way to interrupt optimize 
safely?


Antony






Re: ScoreDocComparator extends Comparator?

2007-03-22 Thread Antony Bowesman
Oops.  Java 1.5 PriorityQueue.remove(o) would not be useful for ScoreDoc as it 
would delete the first object where compare(o1, o2) == 0.


Antony

Should ScoreDocComparator extend java.util.Comparator?  The existing 
compare() method has the Javadoc comment @see java.util.Comparator.


It would then be useful with Java 1.5's PriorityQueue and that would be 
good because PriorityQueue has a remove() method which makes it useful 
for manipulating the queue.







ScoreDocComparator extends Comparator?

2007-03-22 Thread Antony Bowesman
Should ScoreDocComparator extend java.util.Comparator?  The existing compare() 
method has the Javadoc comment @see java.util.Comparator.


It would then be useful with Java 1.5's PriorityQueue and that would be good 
because PriorityQueue has a remove() method which makes it useful for 
manipulating the queue.
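
For illustration, a minimal sketch of the idea - today an adapter is needed 
because ScoreDocComparator doesn't extend Comparator (the adapter class is 
hypothetical):

import java.util.Comparator;

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;

public class ScoreDocComparatorAdapter implements Comparator<ScoreDoc> {
  private final ScoreDocComparator delegate;

  public ScoreDocComparatorAdapter(ScoreDocComparator delegate) {
    this.delegate = delegate;
  }

  public int compare(ScoreDoc a, ScoreDoc b) {
    return delegate.compare(a, b);
  }

  // Usage: new java.util.PriorityQueue<ScoreDoc>(size, new ScoreDocComparatorAdapter(cmp));
}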


Antony








Re: ANN: Luke 0.7 released

2007-02-21 Thread Antony Bowesman

Great Andrzej, that fixed it.  Thanks.
Antony


Andrzej Bialecki wrote:

Antony Bowesman wrote:

With the luke.jar download, it throws an Exception

java.lang.NoClassDefFoundError: org/apache/lucene/index/IndexGate


Fixed - I uploaded an updated jar. Sorry for the problem.







Re: ANN: Luke 0.7 released

2007-02-21 Thread Antony Bowesman

Hi Andrzej,

Thanks for this - it's a great tool.

With the luke.jar download, it throws an Exception

java.lang.NoClassDefFoundError: org/apache/lucene/index/IndexGate
at org.getopt.luke.Luke.getIndexFileNames(Unknown Source)
at org.getopt.luke.Luke.showFiles(Unknown Source)
at org.getopt.luke.Luke.initOverview(Unknown Source)
at org.getopt.luke.Luke.openIndex(Unknown Source)
at org.getopt.luke.Luke.openOk(Unknown Source)

That seems to be part of the Luke sources, but is not in luke.jar.  It is in 
lukemin and lukeall.  I can't find it in the Lucene source tree.


Cheers
Antony


Andrzej Bialecki wrote:

Hi all,

I'm happy to announce that a new version of Luke - the Lucene Index 
Toolbox - is now available. As usually, you can get it from:


   http://www.getopt.org/luke

Highlights of this release:

* support for Lucene 2.1.0 release and earlier
* pagination of search results
* support for many new Field flags
* new plugin for term analysis (contributed by Mark Harwood)
* many other usability and functionality improvements.

Have fun!







Re: Lucene 2.1, soon

2007-01-31 Thread Antony Bowesman

Yonik Seeley wrote:

Lucene 2.1 has been a long time in coming, but I think we should plan
on making a release when the file format changes settle down.


Was there any kind of consensus on what 'soon' meant?  Is it likely to be days, 
this month, or sometime later?  I'd really like to get lockless commits, but am 
wary of just taking the latest build for a production environment.


Antony







Re: Analyzer thread safety; Stop words

2006-11-29 Thread Antony Bowesman

Yonik Seeley wrote:

On 11/29/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:


The GreekAnalyzer is just an example of how you can use existing
Analyzers (as long as they have a default constructor), but it's not
the recommended approach.

TokenFilters are preferred over Analyzers; you can plug them
together in any way you see fit to solve your analysis problem.  For
Solr, an added bonus of using chains of filters is that Solr can
"know" about the results after each filter and show you the results on
an analysis web page (very useful for debugging).

If I were to analyze greek text, I might do something like this:

[The Solr fieldtype/analyzer XML example was stripped by the mail archive;
only fragments survive, e.g. a filter with language="Greek" and a reference
to a stopwords .txt file. It showed a tokenizer followed by a chain of
filter elements.]


If you try to put everything in Analyzer constructors, you get
combinatorial explosion.


I guess you would use methods rather than, as you say, getting into constructor 
hell.  Anyway, I'll have a deeper look at the solr stuff when I get to phase 2. 
 Right now, I've gone as far with analysis as I need to, but I would like to 
get better configuration than I've currently got.  I know it will come back to 
bite...


Thanks for your comments Yonik
Antony






Re: Analyzer thread safety; Stop words

2006-11-29 Thread Antony Bowesman

Yonik Seeley wrote:

On 11/29/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:


That's true, but all the existing Analyzers allow the stop set to be 
configured

via the analyzer constructors, but in different ways.


But you can duplicate most Analyzers (all the ones in Lucene?) with a
chain of Tokenizers and TokenFilters (since that is how almost all of
them are implemented).  Most Analyzers are simply shortcuts to putting
together your own.


Something seems confused to me.  Although stop words are used by Filters, they 
are currently exposed via Analyzers, which is the granularity used at the 
IndexWriter/parser level.  This is what contributors are writing, not Filters.


There are lots of analysis contributions which deal with stop words that are 
perfectly usable as is.  They shouldn't need to be duplicated to be re-used, and 
if that's needed, it points to a deficiency in the design.  If we all have to 
put together our own, again, doesn't this argue that there should be a standard 
way of doing it at the higher Analyzer level?


Sure, the Solr way of using configurable filters gives great flexibility, and 
your solrconfig.xml example shows how the GreekAnalyzer can be deployed, but it 
also highlights the problem that it does not seem possible to make use of the 
stopword Hashtable available to the GreekAnalyzer constructor.


It seems to me that Lucene would benefit if there was an Analyzer Interface.  On 
the other hand, maybe your TokenFilterFactory stuff would be useful as part of 
Lucene.


Anyway, just my penny's worth.
Antony





Re: Analyzer thread safety; Stop words

2006-11-29 Thread Antony Bowesman

Hi Yonik,

Thanks for your comments.

Secondly, has anyone thought that it would be a good idea to extend 
the Analyzer
interface (Abstract class) to allow a standard way to set stop words?  
There

seem to be two 'families' of stop word configuration via constructors.


That belongs at the TokenFilter level (where it currently is).


That's true, but all the existing Analyzers allow the stop set to be configured 
via the analyzer constructors, but in different ways.


For example StandardAnalyzer has:

public StandardAnalyzer(String[] stopWords)
public StandardAnalyzer(Set stopWords)
public StandardAnalyzer(File stopwords)

whereas RussianAnalyzer has:

public RussianAnalyzer(char[] charset, Hashtable stopwords)
public RussianAnalyzer(char[] charset, String[] stopwords)

So this does not make common stop word configuration possible without some 
messy code to look at constructor signatures and make some guesses.


Perhaps the Analyzer class could have some default methods, e.g.

public void setStopWords(File stopWordFile);
public void setStopWords(Set stopWordSet);
public void setStopWords(String[] stopWords);
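
For example, a hypothetical base class along those lines (this is the proposal, 
not existing Lucene API; StopFilter.makeStopSet and WordlistLoader are the 
existing helpers):

import java.io.File;
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.WordlistLoader;

public abstract class StopwordConfigurableAnalyzer extends Analyzer {
  protected Set stopWords;

  public void setStopWords(Set stopWordSet) {
    this.stopWords = stopWordSet;
  }

  public void setStopWords(String[] stopWords) {
    this.stopWords = StopFilter.makeStopSet(stopWords);
  }

  public void setStopWords(File stopWordFile) throws IOException {
    this.stopWords = WordlistLoader.getWordSet(stopWordFile);
  }
}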


Things currently are pluggable: one makes new Analyzers by plugging
together a Tokenizer followed by several TokenFilters.

If you are talking about some sort of external configuration, take a
look at Solr.


Yes, you've done some nice stuff there with Solr.  Unfortunately, I only came 
across it some time after I'd already done a lot of the work for our system.


Antony






Analyzer thread safety; Stop words

2006-11-24 Thread Antony Bowesman

Two points about Analyzers:

Does anyone have any experience with the thread safety of Analyzer implementations? 
Apart from PerFieldAnalyzerWrapper, the analyzers seem to be thread safe, but 
is there a requirement that analyzers should be thread safe?


Secondly, has anyone thought that it would be a good idea to extend the Analyzer 
interface (Abstract class) to allow a standard way to set stop words?  There 
seem to be two 'families' of stop word configuration via constructors.


There are the Set, File and String[] variants in Analyzers such as StandardAnalyzer 
and StopAnalyzer, and then the Russian/Greek variants that do not have the same 
constructor signatures to configure stopwords.


This makes it messy to make analyzers pluggable in a generic way so that stopwords 
can be configured for any plugged-in analyzer.


Antony

