Re: Lucene in the Humanities

2005-02-19 Thread Paul Elschot
Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
 
 On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
 
  On Friday 18 February 2005 21:55, Erik Hatcher wrote:
 
  On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
 
  Erik,
 
  Just curious: it would seem easier to use multiple fields for the
  original case and lowercase searching. Is there any particular reason
  you analyzed the documents to multiple indexes instead of multiple
  fields?
 
  I considered that approach, however to expose QueryParser I'd have to
  get tricky.  If I have title_orig and title_lc fields, how would I
  allow freeform queries of title:something?
 
  By lowercasing the querytext and searching in title_lc ?
 
 Well sure, but how about this query:
 
   title:Something AND anotherField:someOtherValue
 
 QueryParser, as-is, won't be able to do field-name swapping.  I could 
 certainly apply that technique on all the structured queries that I 
 build up with the API, but with QueryParser it is trickier.   I'm 
 definitely open for suggestions on improving how case is handled.

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

protected Query getFieldQuery(String field, String queryText)
throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one
could change the field depending on the requested type of search
and the field name in the query.
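
For reference, here is a rough sketch of the kind of subclass Paul is
describing, written against the Lucene 1.4.3 API. The class name, the
caseSensitive flag and the _orig/_lc field-naming convention are
illustrative assumptions, not details from Erik's index:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CaseSwitchingQueryParser extends QueryParser {

    private final boolean caseSensitive; // the user's case-sensitivity switch

    public CaseSwitchingQueryParser(String defaultField, Analyzer analyzer,
                                    boolean caseSensitive) {
        super(defaultField, analyzer);
        this.caseSensitive = caseSensitive;
    }

    // Called for every field:text clause, so each clause can be redirected
    // to the field variant that matches the requested kind of search.
    // A real version would first check that the field actually has
    // per-case variants.
    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        String actualField = caseSensitive ? field + "_orig" : field + "_lc";
        return super.getFieldQuery(actualField, queryText);
    }
}

With that in place, a query such as title:Something AND
anotherField:someOtherValue would have both clauses routed through
getFieldQuery; as noted further down the thread, other query types (fuzzy,
wildcard, range) go through different parser callbacks and would need the
same treatment.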

 The only drawback now is that I'm duplicating indexes, but that is only an
 issue in how long it takes to rebuild the index from scratch (currently 
 about 20 minutes or so on a good day - when the machine isn't swamped).

Once the users get the hang of this, you might end up having to quadruple
the index, or more.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scalability of Lucene indexes

2005-02-19 Thread Andy
Hi Bryan,

How big is your index?

Also what is the advantage of binding a user to a
server? 

Thanks.
Andy

--- Bryan McCormick [EMAIL PROTECTED] wrote:

 Hi Chris,
 
 I'm responsible for the webshots.com search index and we've had very
 good results with Lucene. It currently indexes over 100 Million
 documents and performs 4 Million searches / day.
 
 We initially tested running multiple small copies and using a
 MultiSearcher and then merging results, as compared to running a very
 large single index. We actually found that the single large instance
 performed better. To improve load handling we clustered multiple
 identical copies together, then session-bind a user to a particular
 server and cache the results, but each server is running a single
 index.
 
 Bryan McCormick
 
 
  On Fri, 2005-02-18 at 08:01, Chris D wrote:
   Hi all,
   
   I have a question about scaling Lucene across a cluster, and good
   ways of breaking up the work.
   
   We have a very large index and searches sometimes take more time
   than they're allowed. What we have been doing is, during indexing,
   we index into 256 separate indexes (depending on the md5sum), then
   distribute the indexes to the search machines. So if a machine has
   128 indexes it would have to do 128 searches. I gave
   ParallelMultiSearcher a try and it was significantly slower than
   simply iterating through the indexes one at a time.
   
   Our new plan is to somehow have only one index per search machine
   and a larger main index stored on the master.
   
   What I'm interested to know is whether having one extremely large
   index for the master and then splitting the index into several
   smaller indexes (if this is possible) would be better than having
   several smaller indexes and merging them on the search machines
   into one index.
   
   I would also be interested to know how others have divided up
   search work across a cluster.
   
   Thanks,
   Chris
 



__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-19 Thread Erik Hatcher
On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
 By lowercasing the querytext and searching in title_lc ?

 Well sure, but how about this query:

   title:Something AND anotherField:someOtherValue

 QueryParser, as-is, won't be able to do field-name swapping.  I could
 certainly apply that technique on all the structured queries that I
 build up with the API, but with QueryParser it is trickier.  I'm
 definitely open for suggestions on improving how case is handled.

 Overriding this (1.4.3 QueryParser.jj, line 286) might work:

 protected Query getFieldQuery(String field, String queryText)
 throws ParseException { ... }

 It will be called by the parser for both parts of the query above, so
 one could change the field depending on the requested type of search
 and the field name in the query.

But that wouldn't work for any other type of query
title:somethingFuzzy~

Though now that I think more about it, a simple s/title:/title_orig:/
before parsing would work, and of course make the default field
dynamic.  I need to evaluate how many fields would need to be done
this way - it'd be several.  Thanks for the food for thought!
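
A minimal sketch of that pre-parse rewrite, assuming a caseSensitive flag
and an invented list of switchable field names (the real list would come
from the index). The analyzer passed in has to match how the chosen field
variants were indexed, e.g. no lowercasing for the _orig fields:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CaseRewrite {

    // Fields indexed in both _orig and _lc variants; purely illustrative.
    private static final String[] SWITCHABLE = { "title", "speaker" };

    public static Query parse(String userQuery, String defaultField,
                              boolean caseSensitive, Analyzer analyzer)
            throws ParseException {
        String suffix = caseSensitive ? "_orig" : "_lc";
        for (int i = 0; i < SWITCHABLE.length; i++) {
            // The s/title:/title_orig:/ idea; note this naive replaceAll
            // would also rewrite matches inside quoted phrases.
            userQuery = userQuery.replaceAll(SWITCHABLE[i] + ":",
                                             SWITCHABLE[i] + suffix + ":");
        }
        // Make the default field dynamic as well.
        QueryParser parser = new QueryParser(defaultField + suffix, analyzer);
        return parser.parse(userQuery);
    }
}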

 The only drawback now is that I'm duplicating indexes, but that is only
 an issue in how long it takes to rebuild the index from scratch
 (currently about 20 minutes or so on a good day - when the machine
 isn't swamped).

 Once the users get the hang of this, you might end up having to
 quadruple the index, or more.

Why would that be?  They want a case sensitive/insensitive switch.
How would it expand beyond that?

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene vs. in-DB-full-text-searching

2005-02-19 Thread Steven J. Owens
On Fri, Feb 18, 2005 at 04:45:50PM -0500, Mike Rose wrote:
 I can comment on this since I'm in the middle of excising Oracle text
 searching and replacing it with Lucene in one of my projects.

 Interesting, particularly as it's from somebody who's already
tried an existing in-db fulltext search feature.

 All in all, I don't think that a JDBC wrapper is going to do what
 you want.

 I wasn't thinking about trying to do the whole thing under the
JDBC driver.  Mainly I was thinking that one key point is that you
need to treat the lucene index somewhat like a cache.  This also means
that you have to watch database writes and make sure you update your
cache, which means you have to have some sort of single point of data
access to monitor.  Well, we already have that - it's called the JDBC
driver.

 The general design I was eyeing speculatively is basically that
the driver would be set up with a reference to an object that
implements a CacheManager interface.  This interface basically gives
the driver a way to notify the cache manager of when certain tables
and columns are being edited.  Exactly how is another question.  I
don't know enough of the innards of, say, a PreparedStatement, to say
more.  It could be as simple as sending the CacheManager a copy of
every SQL query string and letting the CacheManager figure out the
rest.  Ideally I'd like it to be a little bit more structured.

 From there, it's the CacheManager's job to decide what to do
about it, and how to do it.  This leaves the tricky issue of mapping
from a specific database to a specific lucene index up to the
developer.
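
To make the shape of that concrete, here is one possible Java rendering;
only the CacheManager name comes from the design above, the method names
are invented:

public interface CacheManager {

    // Simplest form: the wrapping driver hands over every statement that
    // might modify data, and the implementation works out what it means.
    void sqlExecuted(String sql);

    // More structured form: the wrapper reports which table and columns
    // changed, and the implementation maps that onto Lucene index updates
    // (typically a delete plus a re-add of the affected documents).
    void rowsChanged(String table, String[] columns);
}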

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - http://darksleep.com/notablog


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-19 Thread Paul Elschot
On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
 
 On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
  By lowercasing the querytext and searching in title_lc ?
 
  Well sure, but how about this query:
 
 title:Something AND anotherField:someOtherValue
 
  QueryParser, as-is, won't be able to do field-name swapping.  I could
  certainly apply that technique on all the structured queries that I
  build up with the API, but with QueryParser it is trickier.   I'm
  definitely open for suggestions on improving how case is handled.
 
  Overriding this (1.4.3 QueryParser.jj, line 286) might work:
 
  protected Query getFieldQuery(String field, String queryText)
  throws ParseException { ... }
 
  It will be called by the parser for both parts of the query above, so one
  could change the field depending on the requested type of search
  and the field name in the query.
 
 But that wouldn't work for any other type of query 
 title:somethingFuzzy~

To get that it would be necessary to override all query parser
methods that take a field argument.

 
 Though now that I think more about it, a simple s/title:/title_orig:/ 
 before parsing would work, and of course make the default field 

In the overriding getFieldQuery method something like:

if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
  field = field + "_orig";
} else { // the other 3 cases
  ...
}
return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.

 dynamic.   I need to evaluate how many fields would need to be done 
 this way - it'd be several.  Thanks for the food for thought!
 
  The only drawback now is that I'm duplicating indexes, but that is only an
  issue in how long it takes to rebuild the index from scratch (currently
  about 20 minutes or so on a good day - when the machine isn't swamped).
 
  Once the users get the hang of this, you might end up having to
  quadruple the index, or more.
 
 Why would that be?   They want a case sensitive/insensitive switch.  
 How would it expand beyond that?

With an index for every combination of case sensitivity across these
fields.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Hi

When I try to search for phrases using the MultiFieldQueryParser v1.8
from CVS, it gives me NullPointerException.

Using the following query works:

title:"IBM backs linux"

However, it gives me the exception if I use the following query:

"IBM backs linux"

Any idea why? I am using this MultiFieldQueryParser with Lucene 1.4.3.
Of course I changed some of the boolean stuff to make it work with
the production release.

Thanks,
Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance

2005-02-19 Thread sergiu gordea
Michael Celona wrote:
My index is changing in real time constantly... in this case I guess this
will not work for me. Any suggestions?
 

Using a singleton pattern for your index searcher makes sense anyway
... I don't think that you change the index after each search. The
computing effort is insignificant, but the gain is not.

How often do you optimize your index?
Run your JMeter tests before and after optimization!
What is the value of your merge factor?
Try to use 2 or 3 and run the tests again.
I think it would be useful for the Lucene community if you provided the
results of your tests.

Best,
 Sergiu
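
For instance, with the Lucene 1.4-era API (where mergeFactor is a public
field on IndexWriter), the two knobs Sergiu mentions look roughly like
this; the index path is a placeholder:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        // false = open an existing index rather than creating a new one
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        writer.mergeFactor = 2; // low merge factor keeps the segment count small
        writer.optimize();      // merge everything down to a single segment
        writer.close();
    }
}
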
Michael
-Original Message-
From: David Townsend [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 11:50 AM
To: Lucene Users List
Subject: RE: Search Performance

IndexSearchers are thread safe, so you can use the same object on multiple
requests.  If the index is static and not constantly updating, just keep one
IndexSearcher for the life of the app.  If the index changes and you need
that instantly reflected in the results, you need to check whether the index
has changed and, if it has, create a new cached IndexSearcher.  To check for
changes you'll need to monitor the version number of the index, obtained via
IndexReader.getCurrentVersion(indexName).
David
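
A minimal sketch of that check-and-reopen pattern; the class and method
names are invented, and real code would also need to avoid closing a
searcher that other threads are still using:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {

    private final String indexPath;
    private IndexSearcher searcher;
    private long version = -1;

    public SearcherCache(String indexPath) {
        this.indexPath = indexPath;
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        long current = IndexReader.getCurrentVersion(indexPath);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close(); // unsafe if still in use elsewhere
            }
            searcher = new IndexSearcher(indexPath);
            version = current;
        }
        return searcher;
    }
}
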
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: 18 February 2005 16:15
To: Lucene Users List
Subject: Re: Search Performance
Try a singleton pattern or a static field.
Stefan
Michael Celona wrote:
 

I am creating new IndexSearchers... how do I cache my IndexSearcher...
Michael
-Original Message-
From: David Townsend [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 11:00 AM
To: Lucene Users List
Subject: RE: Search Performance

Are you creating new IndexSearchers or IndexReaders on each search?
Caching your IndexSearchers has a dramatic effect on speed.
David Townsend
-Original Message-
From: Michael Celona [mailto:[EMAIL PROTECTED]
Sent: 18 February 2005 15:55
To: Lucene Users List
Subject: Search Performance
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?


Michael
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Daniel Naber
On Saturday 19 February 2005 15:26, Ben wrote:

 When I try to search for phrases using the MultiFieldQueryParser v1.8
 from CVS, it gives me NullPointerException.

This has just been fixed in SVN (I assume you mean SVN, CVS still exists 
but is read only and probably not updated anymore).

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Thanks


On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber
[EMAIL PROTECTED] wrote:
 On Saturday 19 February 2005 15:26, Ben wrote:
 
  When I try to search for phrases using the MultiFieldQueryParser v1.8
  from CVS, it gives me NullPointerException.
 
 This has just been fixed in SVN (I assume you mean SVN, CVS still exists
 but is read only and probably not updated anymore).
 
 Regards
  Daniel
 
 --
 http://www.danielnaber.de
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scalability of Lucene indexes

2005-02-19 Thread Praveen Peddi
We are doing the exact same thing. We didn't test with so many documents;
the most we have tested so far is 3 million documents with a 3GB file size.
I would be interested in seeing how you keep the replicated indices in
sync. The way we did it was to run the indexer on each server
independently. If the data changes, one server learns of the change; that
server updates its Lucene index and notifies the other servers (using
multicast).
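
As a rough illustration of the notification half of that (the group
address, port and message format are invented here, and the receiving
side is omitted):

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class IndexChangeNotifier {

    private static final String GROUP = "230.0.0.1";
    private static final int PORT = 4446;

    // Sent after the local Lucene index has been updated, so the other
    // servers know to apply the same change to their own copies.
    public void notifyIndexChanged(String documentId) throws Exception {
        byte[] payload = ("indexChanged:" + documentId).getBytes("UTF-8");
        MulticastSocket socket = new MulticastSocket();
        try {
            DatagramPacket packet = new DatagramPacket(
                    payload, payload.length, InetAddress.getByName(GROUP), PORT);
            socket.send(packet);
        } finally {
            socket.close();
        }
    }
}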

Glad to know someone else is doing a similar thing, and happier still to
know that the solution works even for 100 million documents. I was a
little worried about the index size growing higher and higher, but it
looks like we should not have to worry anymore :)

Thanks
Praveen
- Original Message - 
From: Bryan McCormick [EMAIL PROTECTED]
To: Chris D [EMAIL PROTECTED]
Cc: lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 3:45 PM
Subject: Re: Scalability of Lucene indexes


Hi Chris,
I'm responsible for the webshots.com search index and we've had very
good results with Lucene. It currently indexes over 100 Million
documents and performs 4 Million searches / day.
We initially tested running multiple small copies and using a
MultiSearcher and then merging results as compared to running a very
large single index. We actually found that the single large instance
performed better. To improve load handling we clustered multiple
identical copies together, then session-bind a user to a particular server
and cache the results, but each server is running a single index.
Bryan McCormick
On Fri, 2005-02-18 at 08:01, Chris D wrote:
Hi all,
I have a question about scaling lucene across a cluster, and good ways
of breaking up the work.
We have a very large index and searches sometimes take more time than
they're allowed. What we have been doing is, during indexing, we index
into 256 separate indexes (depending on the md5sum), then distribute
the indexes to the search machines. So if a machine has 128 indexes it
would have to do 128 searches. I gave ParallelMultiSearcher a try and
it was significantly slower than simply iterating through the indexes
one at a time.
Our new plan is to somehow have only one index per search machine and
a larger main index stored on the master.
What I'm interested to know is whether having one extremely large
index for the master then splitting the index into several smaller
indexes (if this is possible) would be better than having several
smaller indexes and merging them on the search machines into one
index.
I would also be interested to know how others have divided up search
work across a cluster.
Thanks,
Chris
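
A small sketch of the md5-based partitioning described above; which
document key gets hashed and how the 256 index directories are named are
assumptions:

import java.security.MessageDigest;

public class IndexPartitioner {

    // Pick one of 256 shards from the first byte of the md5 of a document key.
    public static int shardFor(String documentKey) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(documentKey.getBytes("UTF-8"));
        return digest[0] & 0xff; // 0..255
    }

    // Each shard gets its own index directory, e.g. index-0 .. index-255.
    public static String indexDirFor(String documentKey) throws Exception {
        return "index-" + shardFor(documentKey);
    }
}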


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Mail Archive Broken?

2005-02-19 Thread Owen Densmore
I just beamed into the archive:
http://mail-archives.apache.org/eyebrowse/SummarizeList?listId=30
...and it only goes through Feb 1!
What's up?
Owen
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Scalability of Lucene indexes

2005-02-19 Thread Bryan McCormick
Our index is currently about 40GB.

The advantage of binding a user is that once a search is performed,
caching within Lucene and in the application is very effective if
subsequent searches go back to the same box.  Our initial searches are
usually in the sub-100 ms range, while subsequent requests for deeper
pages of the same search are returned instantly.

Bryan McCormick
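
The application-side caching could be as simple as a bounded map from
query (plus page) to hit ids, kept per server so that a bound user's
follow-up page requests never hit the index again; the names and sizes
below are illustrative only:

import java.util.LinkedHashMap;
import java.util.Map;

public class ResultCache extends LinkedHashMap {

    private static final int MAX_ENTRIES = 1000;

    public ResultCache() {
        super(16, 0.75f, true); // access order, so eviction is roughly LRU
    }

    protected boolean removeEldestEntry(Map.Entry eldest) {
        return size() > MAX_ENTRIES; // drop the least recently used entry
    }

    public synchronized Object getResults(String queryKey) {
        return get(queryKey);
    }

    public synchronized void putResults(String queryKey, Object hitIds) {
        put(queryKey, hitIds);
    }
}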

On Sat, 2005-02-19 at 01:24, Andy wrote:
 Hi Bryan,
 
 How big is your index?
 
 Also what is the advantage of binding a user to a
 server? 
 
 Thanks.
 Andy
 
 --- Bryan McCormick [EMAIL PROTECTED] wrote:
 
  Hi chris, 
  
  I'm responsible for the webshots.com search index
  and we've had very
  good results with lucene. It currently indexes over
  100 Million
  documents and performs 4 Million searches / day. 
  
  We initially tested running multiple small copies
  and using a
  MultiSearcher and then merging results as compared
  to running a very
  large single index. We actually found that the
  single large instance
  performed better. To improve load handling we
  clustered multiple
  identical copies together, then session bind a user
  to particular server
  and cache the results, but each server is running a
  single index. 
  
  Bryan McCormick
  
  
  On Fri, 2005-02-18 at 08:01, Chris D wrote: 
   Hi all, 
   
   I have a question about scaling lucene across a
  cluster, and good ways
   of breaking up the work.
   
   We have a very large index and searches sometimes
  take more time than
   they're allowed. What we have been doing is during
  indexing we index
   into 256 seperate indexes (depending on the
  md5sum) then distribute
   the indexes to the search machines. So if a
  machine has 128 indexes it
   would have to do 128 searches. I gave
  parallelMultiSearcher a try and
   it was significantly slower than simply iterating
  through the indexes
   one at a time.
   
   Our new plan is to somehow have only one index per
  search machine and
   a larger main index stored on the master.
   
   What I'm interested to know is whether having one
  extremely large
   index for the master then splitting the index into
  several smaller
   indexes (if this is possible) would be better than
  having several
   smaller indexes and merging them on the search
  machines into one
   index.
   
   I would also be interested to know how others have
  divided up search
   work across a cluster.
   
   Thanks,
   Chris
   
  
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



JavaLobby Lucene presentation

2005-02-19 Thread Erik Hatcher
I recorded a Meet Lucene presentation at JavaLobby.  It is a
multimedia Flash video that shows slides with my voice recorded over
them, spanning just over 20 minutes (you can jump to specific
slides).  Check it out here:

	http://www.javalobby.org/members-only/eps/meet-lucene/index.html?source=archives

It's tailored as a high-level overview, and a quick one at that.  It'll  
certainly be too basic for most everyone on this list, but maybe your  
manager would enjoy it :)

It's awkward to record this type of thing and it sounds dry to me as I  
ended up having to script what I was going to say and read it rather  
than ad-lib like I would do in a face-to-face presentation.  ah's and  
um's don't work well in an audio-only track.

I'd love to hear (perhaps best through the JavaLobby forum associated  
with the presentation) feedback on it.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]