Re: Pool of IndexReaders or Pool of Searchers?

2004-07-12 Thread Vince Taluskie
Can you supply details on the config tested?
Vince
Anson Lau wrote:
> Hi,
> When I did some load testing on a Lucene-powered search app, using a
> pool of index searchers didn't give me any more searches per second
> than just using a singleton index searcher.
> Anson
>
> Quoting [EMAIL PROTECTED]:
>> Hi,
>> I have multiple threads reading an index.  Should they all be using
>> the same IndexReader and a pool of IndexSearchers?  Or should they be
>> using a pool of IndexReaders?
>> Basically, one reader or many?
>> Thanks.
   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-12 Thread John Wang
Hi:
   On the same thought, how about the org.apache.lucene.analysis.Token
class. Can we make it non-final?

   I sent out this question 3 different times and still got no responses...

Thanks

-John

On Mon, 12 Jul 2004 18:33:04 -0700, Kevin A. Burton
<[EMAIL PROTECTED]> wrote:
> Doug Cutting wrote:
> 
> > Kevin A. Burton wrote:
> >
> >> I was going to create a new IDField class which just calls super(
> >> name, value, false, true, false) but noticed I was prevented because
> >> Field.java is final?
> >
> >
> > You don't need to subclass to do this, just a static method somewhere.
> >
> >> Why is this? I can't see any harm in making it non-final...
> >
> >
> > Field and Document are not designed to be extensible. They are
> > persisted in such a way that added methods are not available when the
> > field is restored. In other words, when a field is read, it always
> > constructs an instance of Field, not a subclass.
> 
> That's fine... I think that's acceptable behavior. I don't think anyone
> would assume that inner vars are restored or that the field is serialized.
> 
> Not a big deal but it would be nice...
> 
> 
> 
> --
> 
> Please reply using PGP.
> 
>http://peerfear.org/pubkey.asc
> 
>NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
> 

Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Kevin A. Burton
Aviran wrote:
> Bug 30058 posted

Which of course is here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30058
Is this the source of the revision you modified?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html
Also what version of Lucene?
Kevin


Re: Why is Field.java final?

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
> Kevin A. Burton wrote:
>> I was going to create a new IDField class which just calls super(
>> name, value, false, true, false) but noticed I was prevented because
>> Field.java is final?
>
> You don't need to subclass to do this, just a static method somewhere.
>
>> Why is this? I can't see any harm in making it non-final...
>
> Field and Document are not designed to be extensible. They are
> persisted in such a way that added methods are not available when the
> field is restored. In other words, when a field is read, it always
> constructs an instance of Field, not a subclass.

That's fine... I think that's acceptable behavior. I don't think anyone
would assume that inner vars are restored or that the field is serialized.

Not a big deal but it would be nice...
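
Doug's suggestion of a static method in place of a subclass might look roughly like the sketch below. It is only an illustration: SketchField is a hypothetical stand-in for a final field class (it is not Lucene's Field), its (stored, indexed, tokenized) argument order is assumed from the constructor call quoted in this thread, and SketchFields.id is an invented helper name.

```java
// Hypothetical stand-in for a final, non-extensible field class (not Lucene's Field).
final class SketchField {
    final String name;
    final String value;
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    SketchField(String name, String value,
                boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }
}

// Invented helper class: static factory methods replace the need for subclasses.
final class SketchFields {
    private SketchFields() {}

    // Same flags as the proposed IDField: not stored, indexed, not tokenized.
    static SketchField id(String name, String value) {
        return new SketchField(name, value, false, true, false);
    }
}

class StaticFactoryDemo {
    public static void main(String[] args) {
        SketchField f = SketchFields.id("id", "doc-42");
        System.out.println(f.name + ": stored=" + f.stored
                + " indexed=" + f.indexed + " tokenized=" + f.tokenized);
    }
}
```

Because the factory returns a plain SketchField, nothing is lost when the field is later read back as the base class, which is the concern Doug raises about subclasses.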


Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
> It would be best to get the compiler to check the order.
> If we change this, why not use type-safe enumerations:
> http://www.javapractices.com/Topic1.cjp
> The calls would look like:
>
> new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
>
> Stored could be implemented as the nested class:
>
> public final class Stored {
>   private Stored() {}
>   public static final Stored YES = new Stored();
>   public static final Stored NO = new Stored();
> }

+1... I'm not in love with this pattern but since Java < 1.5 doesn't
support enum it's better than nothing.

I also didn't want to submit a recommendation that would break APIs. I 
assume the old API would be deprecated?

Kevin


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
> Aviran wrote:
>> I noticed that the class org.apache.lucene.index.FieldInfos uses private
>> class members Vector byNumber and Hashtable byName, both of which are
>> synchronized objects. By changing the Vector byNumber to ArrayList
>> byNumber I was able to get 110% improvement in performance (number of
>> searches per second).
>
> That's impressive! Good job finding a bottleneck!

Wow... that's awesome.
We have all dual Xeons with Hyper-Threading and kernel 2.6, so I imagine
in this situation we'd see an improvement too.

I wonder if we could break this out into a patch for legacy Lucene
users. I'd like to see the stack trace too.

We're using a lot of synchronized code (Hashtable, Vector, etc.) so I'm
willing to bet this is happening in other places.

>> My question is: do the fields byNumber and byName have to be
>> synchronized, and what can happen if I change them to ArrayList and
>> HashMap, which are not synchronized? Can this corrupt the index or the
>> integrity of the results?
>
> I think that is a safe change. FieldInfos is only modified by
> DocumentWriter and SegmentMerger, and there is no possibility of other
> threads accessing those instances. Please submit a patch to the
> developer mailing list.

That would be great!
Kevin


Optimizing the index

2004-07-12 Thread Bill Tschumy
Is optimizing the index something you should do periodically even if
you are continually adding documents?  I guess another way of asking
the question is: does optimization have any negative effects on the
speed of adding documents?
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com



Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-12 Thread Gerard Sychay
I think this is a great idea.

I've never used the Field.Keyword and Field.Text type methods because I
can never remember what their 3-boolean-argument equivalents are. I
always stick the constructor format in a comment somewhere and use it.

>>> Doug Cutting <[EMAIL PROTECTED]> 07/11/04 12:03PM >>>
Doug Cutting wrote:
> The calls would look like:
> 
> new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
> 
> Stored could be implemented as the nested class:
> 
> public final class Stored {
>   private Stored() {}
>   public static final Stored YES = new Stored();
>   public static final Stored NO = new Stored();
> }

Actually, while we're at it, Indexed and Tokenized are confounded.  A 
single entry would be better, something like:

public final class Index {
   private Index() {}
   public static final Index NO = new Index();
   public static final Index TOKENIZED = new Index();
   public static final Index UN_TOKENIZED = new Index();
}

then calls would look like just:

new Field("name", "value", Store.YES, Index.TOKENIZED);

BTW, I think Stored would be better named Store too.

BooleanQuery's required and prohibited flags could get the same 
treatment, with the addition of a nested class like:

public final class Occur {
   private Occur() {}
   public static final Occur MUST_NOT = new Occur();
   public static final Occur SHOULD = new Occur();
   public static final Occur MUST = new Occur();
}

and adding a boolean clause would look like:

booleanQuery.add(new TermQuery(...), Occur.MUST);

Then we can deprecate the old methods.

Comments?

Doug
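
The proposal above can be tried out in isolation with the minimal sketch below. Store and Index follow the nested classes Doug describes (with a label added only so they print readably); TypedField and its accessor methods are hypothetical and are not Lucene's Field API.

```java
// Type-safe enumerations in the pre-Java-5 style discussed in this thread.
final class Store {
    private final String label;
    private Store(String label) { this.label = label; }
    public static final Store YES = new Store("YES");
    public static final Store NO = new Store("NO");
    @Override public String toString() { return "Store." + label; }
}

final class Index {
    private final String label;
    private Index(String label) { this.label = label; }
    public static final Index NO = new Index("NO");
    public static final Index TOKENIZED = new Index("TOKENIZED");
    public static final Index UN_TOKENIZED = new Index("UN_TOKENIZED");
    @Override public String toString() { return "Index." + label; }
}

// Hypothetical field class used only to show the call site; not Lucene's Field.
final class TypedField {
    final String name;
    final String value;
    final Store store;
    final Index index;

    TypedField(String name, String value, Store store, Index index) {
        if (store == null || index == null) {
            throw new NullPointerException("store and index are required");
        }
        this.name = name;
        this.value = value;
        this.store = store;
        this.index = index;
    }

    boolean isStored()    { return store == Store.YES; }
    boolean isIndexed()   { return index != Index.NO; }
    boolean isTokenized() { return index == Index.TOKENIZED; }
}

class TypeSafeEnumDemo {
    public static void main(String[] args) {
        TypedField f = new TypedField("name", "value", Store.YES, Index.TOKENIZED);
        System.out.println(f.store + " " + f.index);
    }
}
```

With this shape, swapping the last two arguments is a compile-time error, which is the compiler check on argument order that the three-boolean constructor cannot give.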




Re: Could search results give an idea of which field matched

2004-07-12 Thread Grant Ingersoll
See the explain functionality in the Javadocs and previous threads.  You can ask
Lucene to explain why it got the results it did for a given hit.

>>> [EMAIL PROTECTED] 07/12/04 04:52PM >>>
I search the index on multiple fields. Could the search results also
tell me which field matched so that the document was selected? From what
I can tell, only the document number and a score are returned. Is there
a way to also find out which field(s) of the document matched the
query?

 

Sildy

 






Could search results give an idea of which field matched

2004-07-12 Thread Sildy Augustine
I search the index on multiple fields. Could the search results also
tell me which field matched so that the document was selected? From what
I can tell, only the document number and a score are returned. Is there
a way to also find out which field(s) of the document matched the
query?

 

Sildy

 



Re: Exact match search

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 21:17, [EMAIL PROTECTED] wrote:

> I want to match documents that exactly equal a certain value, not just
> contain it.

Just don't tokenize your Fields, and make sure that the query also doesn't 
get tokenized (the easiest way to ensure that is probably to not use 
QueryParser but just build a TermQuery directly from the user's input).

Regards
 Daniel

-- 
http://www.danielnaber.de
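
The difference between a tokenized and an untokenized field can be seen in a toy model. The sketch below is not Lucene: ToyIndex is an invented in-memory term-to-documents map, and whitespace splitting stands in for a real analyzer. It only illustrates why an exact term lookup behaves differently in the two cases.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory index (NOT Lucene): maps each term to the ids of documents containing it.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenized field: every whitespace-separated word becomes its own term.
    void addTokenized(int docId, String value) {
        for (String word : value.trim().split("\\s+")) {
            add(word, docId);
        }
    }

    // Untokenized field: the whole value is indexed as one single term.
    void addUntokenized(int docId, String value) {
        add(value, docId);
    }

    private void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    // Exact term lookup, with no parsing or tokenizing of the query text.
    Set<Integer> termQuery(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }
}

class ExactMatchDemo {
    static final String[] DOCS = {"foo", "foo bar", "bar foo"};

    static ToyIndex build(boolean tokenized) {
        ToyIndex index = new ToyIndex();
        for (int id = 0; id < DOCS.length; id++) {
            if (tokenized) {
                index.addTokenized(id, DOCS[id]);
            } else {
                index.addUntokenized(id, DOCS[id]);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(true).termQuery("foo"));   // [0, 1, 2]
        System.out.println(build(false).termQuery("foo"));  // [0]
    }
}
```

In the untokenized case only the document whose entire value is "foo" is returned, which is the behavior the original poster asked for.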




Re: Exact match search

2004-07-12 Thread Don Vaillancourt
How do you go about getting an exact match for a document that can contain 
hundreds of words?  As I understand it, when you tokenize a document it is 
broken into words so really all the results you show are exact matches.

At 03:17 PM 12/07/2004, you wrote:
> Hi,
> I want to match documents that exactly equal a certain value, not just
> contain it.
> If I search for "foo" in Lucene I get back documents like these:
> "foo"
> "foo bar"
> "bar foo"
> Is there a way to just get the ones that exactly
> equal the value I'm searching for?  In this case, I want to only return the
> first document (ex. "foo").
> I have a workaround where I store all the values
> and then after I get the hits I go through them and skip those that don't
> match.  But this will return result sets of hundreds of documents that I don't
> need.
> Help!

Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
416-815-2000 ext. 245
email: [EMAIL PROTECTED]
web: http://www.web-impact.com

This email message is intended only for the addressee(s)
and contains information that may be confidential and/or
copyright.  If you are not the intended recipient please
notify the sender by reply email and immediately delete
this email. Use, disclosure or reproduction of this email
by anyone other than the intended recipient(s) is strictly
prohibited. No representation is made that this email or
any attachments are free of viruses. Virus scanning is
recommended and is the responsibility of the recipient.







Exact match search

2004-07-12 Thread yahootintin . 1247688
Hi,

I want to match documents that exactly equal a certain value, not just
contain it.

If I search for "foo" in Lucene I get back documents like these:

"foo"
"foo bar"
"bar foo"

Is there a way to just get the ones that exactly
equal the value I'm searching for?  In this case, I want to only return the
first document (ex. "foo").

I have a workaround where I store all the values
and then after I get the hits I go through them and skip those that don't
match.  But this will return result sets of hundreds of documents that I don't
need.

Help!



RE: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Aviran
Bug 30058 posted

Aviran

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 12, 2004 1:38 PM
To: Lucene Users List
Subject: Re: Lucene Search has poor cpu utilization on a 4-CPU machine


Aviran wrote:
> I use Lucene 1.4 final
> 
> Here is the thread dump for one blocked thread (If you want a full 
> thread dump for all threads I can do that too)

Thanks.  I think I get the point.  I recently removed a synchronization 
point higher in the stack, so that now this one shows up!

Whether or not you submit a patch, please file a bug report in Bugzilla 
with your proposed change, so that we don't lose track of this issue.

Thanks,

Doug




Re: Browse by Letter within a Category

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 17:48, O'Hare, Thomas wrote:

> Does Lucene have a "beginning of line" query syntax, like the regular
> expression ^ symbol? For example,
>
> title:^A*

If your title isn't tokenized the "^" is implicit, I think. As usual, if 
your title is tokenized you can easily add another field with the same 
value as title, but in untokenized form.

> What is the best way to sort by a date? I currently have a date field
> that is used for searching in the format MMDD as a Field.Keyword. 

Lucene 1.4 added an IndexSearcher.search() method that takes a Sort() 
object which lets you sort by any field. Your date field can be used for 
that, as it has the correct format (because sorting it alphabetically will 
give you the right order already).

Regards
 Daniel

-- 
http://www.danielnaber.de
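
Why alphabetical sorting of a date field can give chronological order is easy to check outside Lucene. The sketch below assumes the keys are zero-padded with the year first (for example yyyyMMdd); that layout, and the helper names toKey and sortedKeys, are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class DateKeySortDemo {
    // Builds a fixed-width, zero-padded key with the most significant part first.
    // The year-first yyyyMMdd layout is an assumption of this sketch.
    static String toKey(int year, int month, int day) {
        return String.format(Locale.ROOT, "%04d%02d%02d", year, month, day);
    }

    // Sorts with plain String ordering, as an untokenized keyword field would be compared.
    static List<String> sortedKeys(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                toKey(2004, 7, 12), toKey(2003, 12, 31), toKey(2004, 7, 9));
        System.out.println(sortedKeys(keys)); // [20031231, 20040709, 20040712]
    }
}
```

The property depends on fixed width and most-significant-first ordering; a key without the year, or without zero padding, would not sort chronologically across years.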




Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:
> I use Lucene 1.4 final
> Here is the thread dump for one blocked thread (If you want a full thread
> dump for all threads I can do that too)

Thanks.  I think I get the point.  I recently removed a synchronization 
point higher in the stack, so that now this one shows up!

Whether or not you submit a patch, please file a bug report in Bugzilla 
with your proposed change, so that we don't lose track of this issue.

Thanks,
Doug


RE: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Aviran
I use Lucene 1.4 final

Here is the thread dump for one blocked thread (If you want a full thread
dump for all threads I can do that too)

"Thread-32" daemon prio=1 tid=0x082334c0 nid=0xa66 waiting for monitor entry
[4f385000..4f38687c]
at java.util.Vector.elementAt(Vector.java:430)
- waiting to lock <0x452b93a8> (a java.util.Vector)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:51)
at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:364)
at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:59)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:165)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:154)
at gov.gsa.search.SearcherByPageAndSortedField.search(SearcherByPageAndSortedField.java:317)
at gov.gsa.search.SearcherByPageAndSortedField.search(SearcherByPageAndSortedField.java:203)
at gov.gsa.search.grants.SearchGrants.searchByPageAndSortedField(SearchGrants.java:308)
at gov.gsa.search.grants.SearchServlet.searchByIndex(SearchServlet.java:1541)
at gov.gsa.search.grants.SearchServlet.getResults(SearchServlet.java:1325)
at gov.gsa.search.grants.SearchServlet.doGet(SearchServlet.java:500)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2415)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:172)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:223)
at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:261)
at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:360)
at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:604)
at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:562)
at org.apache.jk.common.SocketConnection.runIt(ChannelSocket.java:679)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:619)
at java.lang.Thread.run(Thread.java:534)


And how do I submit a patch to the developer mailing list? Just

Re: Anyone use MultiSearcher class

2004-07-12 Thread Zilverline info
Hi Don,
Yes, I'm using the MultiSearcher (in Zilverline), and have seen no
serious performance issues with it. The app performs well with multiple
indexes; it responds so quickly (with 100k+ documents) that I haven't
even taken the time to measure the difference to a single index search.
Michael Franken

Don Vaillancourt wrote:
> Hello,
> Has anyone used the MultiSearcher class?
> I have noticed that searching two indexes using this MultiSearcher
> class takes 8 times longer than searching only one index.  I could
> understand if it took 3 to 4 times longer to search due to sorting the
> two search results and stuff, but why 8 times longer?
>
> Is there some optimization that can be done to hasten the search?  Or
> should I just write my own MultiSearcher?  The problem though is that
> there is no way for me to create my own Hits object (no methods are
> available and the class is final).
>
> Anyone have any clue?
> Thanks


Re:Anyone use MultiSearcher class

2004-07-12 Thread Don Vaillancourt
Actually, after I implemented the MultiSearcher, I had totally forgotten
about this class.  Although it isn't clear what it does.  I'm assuming that
it uses threads to search multiple indexes.

I'll have to try it.
Thanks
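
The general idea guessed at above, one search per index on its own thread followed by a merge of the hits, can be sketched in plain Java. This is not Lucene's ParallelMultiSearcher and makes no claim about how that class is implemented; ToyHit, ToySearchable, searchAll and fixed are all invented names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Invented result type: a document label and its score.
final class ToyHit {
    final String doc;
    final float score;
    ToyHit(String doc, float score) { this.doc = doc; this.score = score; }
}

// Invented abstraction over "something that can be searched".
interface ToySearchable {
    List<ToyHit> search(String query);
}

class ParallelSearchDemo {
    // Searches every index on its own thread, then merges hits by descending score.
    static List<ToyHit> searchAll(List<ToySearchable> indexes, String query) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, indexes.size()));
        try {
            List<Future<List<ToyHit>>> futures = new ArrayList<>();
            for (ToySearchable index : indexes) {
                futures.add(pool.submit(() -> index.search(query)));
            }
            List<ToyHit> merged = new ArrayList<>();
            for (Future<List<ToyHit>> future : futures) {
                merged.addAll(future.get()); // blocks until that index has answered
            }
            merged.sort((a, b) -> Float.compare(b.score, a.score));
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Index that always returns the same hits, for demonstration only.
    static ToySearchable fixed(ToyHit... hits) {
        return query -> Arrays.asList(hits);
    }

    public static void main(String[] args) {
        List<ToySearchable> indexes = Arrays.asList(
                fixed(new ToyHit("a1", 0.9f), new ToyHit("a2", 0.2f)),
                fixed(new ToyHit("b1", 0.5f)));
        for (ToyHit hit : searchAll(indexes, "foo")) {
            System.out.println(hit.doc + " " + hit.score);
        }
    }
}
```

With this shape the wall-clock time is bounded by the slowest index plus the merge, rather than by the sum over all indexes, provided enough CPUs and I/O capacity are available.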
At 01:10 PM 12/07/2004, you wrote:
> I think there is a ParallelMultiSearcher class that extends MultiSearcher.
> Have you tried it?
>
> -- Start of original message ---
> From    : Don Vaillancourt <[EMAIL PROTECTED]>
> To      : "Lucene Users List" <[EMAIL PROTECTED]>
> Cc      :
> Date    : Mon, 12 Jul 2004 12:36:29 -0400
> Subject : Anyone use MultiSearcher class
>
> Hello,
> Has anyone used the MultiSearcher class?
> I have noticed that searching two indexes using this MultiSearcher class
> takes 8 times longer than searching only one index.  I could understand if
> it took 3 to 4 times longer to search due to sorting the two search results
> and stuff, but why 8 times longer?
> Is there some optimization that can be done to hasten the search?  Or
> should I just write my own MultiSearcher?  The problem though is that there
> is no way for me to create my own Hits object (no methods are available and
> the class is final).
> Anyone have any clue?
> Thanks

Re:Anyone use MultiSearcher class

2004-07-12 Thread fp235-5
 
I think there is a ParallelMultiSearcher class that extends MultiSearcher. Have
you tried it?

-- Start of original message ---

From    : Don Vaillancourt <[EMAIL PROTECTED]>
To      : "Lucene Users List" <[EMAIL PROTECTED]>
Cc      :
Date    : Mon, 12 Jul 2004 12:36:29 -0400
Subject : Anyone use MultiSearcher class

Hello,

Has anyone used the MultiSearcher class?

I have noticed that searching two indexes using this MultiSearcher class 
takes 8 times longer than searching only one index.  I could understand if 
it took 3 to 4 times longer to search due to sorting the two search results 
and stuff, but why 8 times longer.

Is there some optimization that can be done to hasten the search?  Or 
should I just write my own MultiSearcher.  The problem though is that there 
is no way for me to create my own Hits object (no methods are available and 
the class is final).

Anyone have any clue?

Thanks





Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote:
> First let me explain what I found out. I'm running Lucene on a 4 CPU server.
> While doing some stress tests I've noticed (by doing full thread dump) that
> searching threads are blocked on the method: public FieldInfo fieldInfo(int
> fieldNumber). This causes significant CPU idle time.

What version of Lucene are you running?  Also, can you please send the
stack traces of the blocked threads, or at least a description of them?
I'd be interested to see what context this happens in.  In particular,
which IndexReader and Searcher/Scorer/Weight methods does it happen under?

> I noticed that the class org.apache.lucene.index.FieldInfos uses private
> class members Vector byNumber and Hashtable byName, both of which are
> synchronized objects. By changing the Vector byNumber to ArrayList byNumber
> I was able to get 110% improvement in performance (number of searches per
> second).

That's impressive!  Good job finding a bottleneck!

> My question is: do the fields byNumber and byName have to be synchronized,
> and what can happen if I change them to ArrayList and HashMap, which
> are not synchronized? Can this corrupt the index or the integrity of the
> results?

I think that is a safe change.  FieldInfos is only modified by
DocumentWriter and SegmentMerger, and there is no possibility of other
threads accessing those instances.  Please submit a patch to the
developer mailing list.

Doug
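
The reasoning behind "a safe change" can be illustrated with a small sketch: unsynchronized collections are fine to share between reader threads as long as they are fully built by one thread before being shared and never modified afterward. FieldTable below is a hypothetical stand-in for a FieldInfos-like lookup table; it is not Lucene code, and the build-once, read-only assumption is what makes it safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical FieldInfos-like table (not Lucene code): built once, then only read.
final class FieldTable {
    private final List<String> byNumber;        // ArrayList instead of Vector
    private final Map<String, Integer> byName;  // HashMap instead of Hashtable

    FieldTable(List<String> names) {
        List<String> list = new ArrayList<>(names);
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < list.size(); i++) {
            map.put(list.get(i), i);
        }
        // Final fields plus unmodifiable views: no writes are possible after construction.
        this.byNumber = Collections.unmodifiableList(list);
        this.byName = Collections.unmodifiableMap(map);
    }

    String fieldName(int number) { return byNumber.get(number); }

    int fieldNumber(String name) {
        Integer n = byName.get(name);
        return n == null ? -1 : n;
    }

    int size() { return byNumber.size(); }
}

class SharedReadDemo {
    // Runs several reader threads over one shared table; returns the number of wrong lookups.
    static int countMismatches(FieldTable table, int threads, int rounds) {
        AtomicInteger mismatches = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int r = 0; r < rounds; r++) {
                    for (int i = 0; i < table.size(); i++) {
                        if (table.fieldNumber(table.fieldName(i)) != i) {
                            mismatches.incrementAndGet();
                        }
                    }
                }
            });
            workers[t].start(); // the table is fully built before any reader starts
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return mismatches.get();
    }

    public static void main(String[] args) {
        FieldTable table = new FieldTable(Arrays.asList("title", "body", "date"));
        System.out.println("mismatches: " + countMismatches(table, 4, 100000));
    }
}
```

Readers here never contend on a lock, which is the effect reported in the thread. If any thread could still add fields while others read, the unsynchronized collections would no longer be safe.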


Anyone use MultiSearcher class

2004-07-12 Thread Don Vaillancourt
Hello,
Has anyone used the MultiSearcher class?
I have noticed that searching two indexes using this MultiSearcher class
takes 8 times longer than searching only one index.  I could understand if
it took 3 to 4 times longer to search due to sorting the two search results
and stuff, but why 8 times longer?

Is there some optimization that can be done to hasten the search?  Or
should I just write my own MultiSearcher?  The problem though is that there
is no way for me to create my own Hits object (no methods are available and
the class is final).

Anyone have any clue?
Thanks

Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
> What I really would like to see are some best practices or some advice from
> some users who are working with really large indices how they handle this
> situation, or why they don't have to care about it or maybe why I am
> completely missing the point ;-))

Many folks with really large indexes just don't permit things like 
wildcard and range searches.  For example, Google supports no wildcards 
and has only recently added limited numeric range searching.  Yahoo! 
supports neither.

Doug


Re: Browse by Letter within a Category

2004-07-12 Thread Peter M Cipollone
You can use
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/spans/SpanFirstQuery.html

Pete

----- Original Message -----
From: "O'Hare, Thomas" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Monday, July 12, 2004 11:48 AM
Subject: RE: Browse by Letter within a Category


Thank you for the suggestion. I implemented what you recommended and now
have it working. I'm sorting on the first word in the title.

Does Lucene have a "beginning of line" query syntax, like the regular
expression ^ symbol? For example,

title:^A*


What is the best way to sort by a date? I currently have a date field
that is used for searching in the format MMDD as a Field.Keyword.

Thanks,
Tom

  _

From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Friday, July 09, 2004 4:34 AM
To: Lucene Users List
Subject: Re: Browse by Letter within a Category



On Friday 09 July 2004 04:27, O'Hare, Thomas wrote:

> Searcher.search("category:\"Products\" AND title:\"A*\"", new
> Sort("title"));

You can only sort on fields which are not tokenized I think. So add an extra
field with the title, but untokenized, just for sorting. Also, "A*" might
slow down the query execution so you might want to add another field which
just contains the first letter so there's no need for the asterisk.

Regards
 Daniel





RE: Browse by Letter within a Category

2004-07-12 Thread O'Hare, Thomas
Thank you for the suggestion. I implemented what you recommended and now
have it working. I'm sorting on the first word in the title.
 
Does Lucene have a "beginning of line" query syntax, like the regular
expression ^ symbol? For example,
 
title:^A*
 
What is the best way to sort by a date? I currently have a date field
that is used for searching in the format MMDD as a Field.Keyword. 
 
Thanks,
Tom

  _  

From: Daniel Naber [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 09, 2004 4:34 AM
To: Lucene Users List
Subject: Re: Browse by Letter within a Category



On Friday 09 July 2004 04:27, O'Hare, Thomas wrote:

> Searcher.search("category:\"Products\" AND title:\"A*\"", new
> Sort("title"));

You can only sort on fields which are not tokenized I think. So add an extra
field with the title, but untokenized, just for sorting. Also, "A*" might
slow down the query execution so you might want to add another field which
just contains the first letter so there's no need for the asterisk.
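
A minimal sketch of how the two extra field values could be derived from a
title, in plain Java. The class and method names are hypothetical, and
lowercasing the sort key is an assumption made here to get case-insensitive
ordering; how the values are added to the document depends on the Lucene
version in use.

```java
import java.util.Locale;

/** Hypothetical helper; not part of Lucene. */
final class BrowseKeys {
    private BrowseKeys() {}

    /** Value for an untokenized sort field: the whole title as one normalized term. */
    static String sortKey(String title) {
        return title.trim().toLowerCase(Locale.ROOT);
    }

    /** Value for a first-letter field, so browsing by letter needs no trailing wildcard. */
    static String firstLetter(String title) {
        String t = title.trim();
        if (t.isEmpty()) {
            return "";
        }
        // Take one full code point so characters outside the BMP are not split.
        int end = t.offsetByCodePoints(0, 1);
        return t.substring(0, end).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String title = "  Apache Lucene in Action ";
        System.out.println(sortKey(title));     // apache lucene in action
        System.out.println(firstLetter(title)); // A
    }
}
```

With a field like this, browsing the letter A becomes an exact term match on
the first-letter field instead of a prefix expansion on the title.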

Regards
 Daniel







Re: IndexSearcher usage and caching?

2004-07-12 Thread Otis Gospodnetic
Cache/reuse your IndexSearcher.
On every search, check if the index has changed (there are methods for
that).
If it has changed, create a new IndexSearcher and assign it to your
IndexSearcher variable, and do not close the old IndexSearcher, just in
case something is still using it.
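
The reuse-and-swap pattern can be sketched without any Lucene classes. In
this sketch the version supplier stands in for whichever "has the index
changed" check your Lucene version offers, and the opener stands in for
creating a new IndexSearcher; both, along with the class name, are
assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical holder that reopens a shared resource when a version number changes. */
final class ReloadingHolder<T> {
    private final LongSupplier indexVersion; // stand-in for the index-changed check
    private final Supplier<T> opener;        // stand-in for opening a new searcher
    private long loadedVersion;
    private T current;

    ReloadingHolder(LongSupplier indexVersion, Supplier<T> opener) {
        this.indexVersion = indexVersion;
        this.opener = opener;
        this.loadedVersion = indexVersion.getAsLong();
        this.current = opener.get();
    }

    /** Returns the cached instance, replacing it first if the version has moved on. */
    synchronized T get() {
        long now = indexVersion.getAsLong();
        if (now != loadedVersion) {
            // The old instance is deliberately not closed here,
            // because another thread may still be searching with it.
            current = opener.get();
            loadedVersion = now;
        }
        return current;
    }

    public static void main(String[] args) {
        AtomicLong version = new AtomicLong(1);
        ReloadingHolder<Object> holder = new ReloadingHolder<>(version::get, Object::new);
        Object first = holder.get();
        System.out.println(first == holder.get()); // true: same instance reused
        version.incrementAndGet();                 // simulate an index update
        System.out.println(first == holder.get()); // false: a new instance was opened
    }
}
```

Only the short version check is synchronized; the searches themselves run on
the returned instance outside the lock.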

Otis

--- Joel Shellman <[EMAIL PROTECTED]> wrote:
> I'm working on a document management system using lucene to search
> through all the documents.
>
> This means that I'll be adding/updating/deleting documents at the same
> time searches are going on.
>
> I thought to create an IndexSearcher and reuse it throughout, but that
> doesn't seem to work. If I do a search, then add a document, and do
> another search with the same IndexSearcher, it won't find the newly
> added document.
>
> I'd rather not have to create a new IndexSearcher for every query... do
> I have to?
>
> Thanks,
>
> -joel shellman





IndexSearcher usage and caching?

2004-07-12 Thread Joel Shellman
I'm working on a document management system using lucene to search 
through all the documents.

This means that I'll be adding/updating/deleting documents at the same 
time searches are going on.

I thought to create an IndexSearcher and reuse it throughout, but that 
doesn't seem to work. If I do a search, then add a document, and do 
another search with the same IndexSearcher, it won't find the newly 
added document.

I'd rather not have to create a new IndexSearcher for every query... do 
I have to?

Thanks,
-joel shellman


RE: Field.java -> STORED, NOT_STORED, etc...

2004-07-12 Thread wallen
I have 2 suggestions:

1) use Eclipse, or an IDE that references the javadoc with mouseovers
2) if you are going to create constants, consider using bit flags.  Then
each constant gets a power-of-two value, i.e.

STORED = 1
INDEXED = 2
TOKENIZED = 4

Then you can have the constructor look like:

new Field("name", "value", STORED + TOKENIZED)

The constructor would break that down bitwise!
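
A minimal sketch of how such flags could be declared and decoded. The
FieldFlags class and its methods are hypothetical and are not part of
Lucene's Field class; the sketch combines flags with | rather than +, since
adding the same flag twice with + would carry into the next bit.

```java
/** Hypothetical flag holder; Lucene's Field takes booleans, not these constants. */
final class FieldFlags {
    static final int STORED    = 1;      // binary 001
    static final int INDEXED   = 1 << 1; // binary 010
    static final int TOKENIZED = 1 << 2; // binary 100

    private FieldFlags() {}

    // Each test masks out one bit of the combined value.
    static boolean isStored(int flags)    { return (flags & STORED) != 0; }
    static boolean isIndexed(int flags)   { return (flags & INDEXED) != 0; }
    static boolean isTokenized(int flags) { return (flags & TOKENIZED) != 0; }

    public static void main(String[] args) {
        int flags = STORED | TOKENIZED;
        System.out.println(isStored(flags));    // true
        System.out.println(isIndexed(flags));   // false
        System.out.println(isTokenized(flags)); // true
    }
}
```

Unlike three positional booleans, the order of the flags in the call does not
matter.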

-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 11, 2004 5:05 AM
To: Lucene Users List
Subject: Field.java -> STORED, NOT_STORED, etc...


I've been working with the Field class doing index conversions between 
an old index format to my new external content store proposal (thus the 
email about the 14M convert).

Anyway... I find the whole Field.Keyword, Field.Text thing confusing.  
The main problem is that the constructor to Field just takes booleans,
and if you forget the ordering of the booleans it's very confusing.

new Field( "name", "value", true, false, true );

So looking at that, you have NO idea what it's doing without fetching the javadoc.

So I added a few constants to my class:

new Field( "name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED );

which IMO is a lot easier to maintain.

Why not add these constants to Field.java:

public static final boolean STORED = true;
public static final boolean NOT_STORED = false;

public static final boolean INDEXED = true;
public static final boolean NOT_INDEXED = false;

public static final boolean TOKENIZED = true;
public static final boolean NOT_TOKENIZED = false;

Of course you still have to remember the order but this becomes a lot 
easier to maintain.

Kevin

-- 

Please reply using PGP.

http://peerfear.org/pubkey.asc

NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Aviran
Hi all,
First let me explain what I found out. I'm running Lucene on a 4-CPU server.
While doing some stress tests I noticed (by taking a full thread dump) that
searching threads are blocked on the method public FieldInfo fieldInfo(int
fieldNumber). This causes significant CPU idle time.
I noticed that the class org.apache.lucene.index.FieldInfos uses the private
class members Vector byNumber and Hashtable byName, both of which are
synchronized objects. By changing the Vector byNumber to ArrayList byNumber
I was able to get a 110% improvement in performance (number of searches per
second).

My question is: do the fields byNumber and byName have to be synchronized,
and what can happen if I change them to ArrayList and HashMap, which
are not synchronized? Can this corrupt the index or the integrity of the
results?
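
As a general Java point, and not a statement about how FieldInfos is actually
used inside Lucene: an unsynchronized list is safe to read from many threads
as long as it is fully populated before it is shared and never modified
afterwards. The sketch below illustrates only that condition; the class is
hypothetical, and whether FieldInfos meets the condition would have to be
checked in the Lucene source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical read-only lookup table; a stand-in, not Lucene's FieldInfos. */
final class ReadOnlyLookup {
    // A final field assigned in the constructor is safely published to other threads.
    private final List<String> byNumber;

    ReadOnlyLookup(List<String> names) {
        // Copy and wrap, so nothing can modify the list after construction.
        this.byNumber = Collections.unmodifiableList(new ArrayList<>(names));
    }

    String name(int number) {
        return byNumber.get(number); // no lock taken on the read path
    }

    /** Reads concurrently from several threads and counts wrong answers. */
    static int countBadReads(ReadOnlyLookup lookup, List<String> expected, int threads) {
        AtomicInteger bad = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int round = 0; round < 10_000; round++) {
                    int i = round % expected.size();
                    if (!expected.get(i).equals(lookup.name(i))) {
                        bad.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return bad.get();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("title", "category", "date");
        System.out.println(countBadReads(new ReadOnlyLookup(names), names, 4));
    }
}
```

If the collections can still be written to while searches are running, this
reasoning does not apply and the synchronization may be there for a reason.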

Thanks,
Aviran






AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Martin . Stein
Hi Kevin,

thanks for your answer. That could really solve the problem with the
modificationDate or similar fields.

But what if you create queries that ultimately return only a few hits but
contain a RangeQuery that searches for example an ID-Field of some kind,
where you have to cover a wide range of IDs? I think in general, you will
always have fields that contain lots of different terms and searching even a
small range of one of these fields may lead to this Exception. 
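
To make the mechanism concrete: a range query is expanded into one clause per
indexed term that falls inside the range, so the clause count depends on how
many distinct terms the field holds, not on how many hits come back. The
sketch below models that with a sorted set of made-up ID terms; the class,
the ID format and the term counts are assumptions for illustration only.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical model of range expansion; it does not call Lucene. */
final class RangeExpansion {
    static final int MAX_CLAUSE_COUNT = 1024; // the limit quoted in this thread

    private RangeExpansion() {}

    /** Builds n zero-padded ID terms, id000000 .. id(n-1). */
    static NavigableSet<String> sampleIds(int n) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            terms.add(String.format("id%06d", i));
        }
        return terms;
    }

    /** Number of clauses an inclusive range [lower TO upper] would expand to. */
    static int clausesFor(NavigableSet<String> indexedTerms, String lower, String upper) {
        return indexedTerms.subSet(lower, true, upper, true).size();
    }

    static boolean exceedsLimit(NavigableSet<String> indexedTerms, String lower, String upper) {
        return clausesFor(indexedTerms, lower, upper) > MAX_CLAUSE_COUNT;
    }

    public static void main(String[] args) {
        NavigableSet<String> ids = sampleIds(5000);
        System.out.println(clausesFor(ids, "id000100", "id002000"));   // 1901
        System.out.println(exceedsLimit(ids, "id000100", "id002000")); // true
    }
}
```

A check like this could run before the query is built, so the application can
ask the user to narrow the range instead of surfacing the exception.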

The bottom line, in my opinion, is that you have to take care yourself not to
create certain types of queries that could lead to this Exception. Which
queries are affected depends entirely on the index, which means that as the
index grows you have to restrict the ranges of more and more RangeQueries.

One way would be, to catch this Exception and gracefully present a message
to the user to further restrict his query. But this could lead to some
confusion, if the user knows that he has entered some very restrictive query
in addition to some RangeQuery that internally leads to this Exception. 

What I really would like to see are some best practices or some advice from
some users who are working with really large indices how they handle this
situation, or why they don't have to care about it or maybe why I am
completely missing the point ;-))


Thanks,

Martin


-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 8, 2004 21:11
To: Lucene Users List
Subject: Re: Understanding TooManyClauses-Exception and Query-RAM-size


[EMAIL PROTECTED] wrote:

>Hi,
>
>a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
>smoothly, but we are experiencing some problems with that new constant limit
>
>   maxClauseCount=1024
>
>which leads to Exceptions of type
>
>   org.apache.lucene.search.BooleanQuery$TooManyClauses
>
>when certain RangeQueries are executed (in fact, we get this Exception when
>we execute certain Wildcard queries, too). Although we are working with a
>fairly small index with about 35,000 documents, we encounter this Exception
>when we search for the property "modificationDate". For example
>
>   modificationDate:[00 TO 0dwc970kw]

We talked about this the other day.

http://wiki.apache.org/jakarta-lucene/IndexingDateFields

Find out what type of precision you need and use that.  If you only need 
days or hours or minutes then use that.   Millis is just too small. 

We're only using days and have queries for just the last 7 days as max 
so this really works out well...
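
A small sketch of the day-precision idea in plain Java. It formats dates as
yyyyMMdd strings and lists the terms a range could match; the class name is
hypothetical, and it uses java.time rather than any Lucene date helper.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that encodes dates at day precision as sortable strings. */
final class DayTerms {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, which sorts lexicographically.
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    private DayTerms() {}

    static String term(LocalDate date) {
        return date.format(DAY);
    }

    /** All day terms in the inclusive range; at most one term per calendar day. */
    static List<String> termsBetween(LocalDate from, LocalDate to) {
        List<String> terms = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            terms.add(term(d));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> week = termsBetween(LocalDate.of(2004, 7, 6), LocalDate.of(2004, 7, 12));
        System.out.println(week.size()); // 7
        System.out.println(week.get(0)); // 20040706
    }
}
```

At day precision a 7-day range can expand to at most 7 terms, while the same
range at millisecond precision could cover up to 604,800,000 distinct values.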

Kevin

-- 

Please reply using PGP.

http://peerfear.org/pubkey.asc

NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: Lucene shouldn't use java.io.tmpdir

2004-07-12 Thread Daniel Naber
On Monday 12 July 2004 09:04, Morus Walter wrote:

> Lucene might work around this by creating a directory in java.io.tmpdir,
> setting appropriate permissions (can that be done OS-independently in
> Java?) and putting the lock there.

But if everybody can delete your lock files, that would be a security 
problem. Deleting stale locks isn't a problem, but how would one decide if 
a lock is stale?

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Lucene shouldn't use java.io.tmpdir

2004-07-12 Thread Morus Walter
Doug Cutting writes:
> Armbrust, Daniel C. wrote:
> > The problem I ran into the other day with the new lock location is that Person A 
> > had started an index, ran into problems, erased the index and asked me to look at 
> > it.  I tried to rebuild the index (in the same place on a Solaris machine) and 
> > found out that A) - her locks still existed, B) - I didn't have a clue where it 
> > put the locks on the Solaris machine (since no full path was given with the error 
> > - has this been fixed?) and C) - I didn't have permission to remove her locks.
> 
> I think these problems have been fixed.  When an index is created, all 
> old locks are first removed.  And when a lock cannot be obtained, its 
> full pathname is printed.  Can you replicate this with 1.4-final?
>
Hmm.
If user A creates a lock in /tmp and Lucene crashes leaving the lock, user
B won't be able to remove the lock (unless B is root) since /tmp usually
has permissions 
drwxrwxrwt   12 root root 8192 Jul 12 08:50 tmp/
where the 't' means that normal users may delete only their own files 
(at least on Linux and IIRC Solaris).

Or did I miss something?
Lucene might work around this by creating a directory in java.io.tmpdir,
setting appropriate permissions (can that be done OS-independently in Java?)
and putting the lock there.
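
For what it's worth, a sketch of how this could look with the java.nio.file
API of later Java versions (not available in the JDKs current for this
thread). It is only partly OS-independent: the permissions are applied where
the file system reports POSIX support and skipped elsewhere. The class name,
directory name and permission string are assumptions, and which permissions
are appropriate is exactly the open question here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical helper; not how Lucene itself places its lock files. */
final class LockDir {
    private LockDir() {}

    static boolean posixSupported(Path path) {
        return path.getFileSystem().supportedFileAttributeViews().contains("posix");
    }

    /** Creates parent/name and, on POSIX file systems, sets perms such as "rwxrwxrwx". */
    static Path create(Path parent, String name, String perms) {
        try {
            Path dir = Files.createDirectories(parent.resolve(name));
            if (posixSupported(dir)) {
                // Set explicitly, so the result does not depend on the process umask.
                Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString(perms));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the directory has exactly these permissions, or POSIX is unsupported. */
    static boolean permissionsMatch(Path dir, String perms) {
        if (!posixSupported(dir)) {
            return true;
        }
        try {
            Set<PosixFilePermission> actual = Files.getPosixFilePermissions(dir);
            return actual.equals(PosixFilePermissions.fromString(perms));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        Path dir = create(tmp, "lucene-locks-demo", "rwxrwxrwx");
        System.out.println(dir + " " + permissionsMatch(dir, "rwxrwxrwx"));
    }
}
```

A world-writable directory without the sticky bit lets any user remove a
stale lock, but it also lets any user remove a live one.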

Morus
