Re: Lucene in the Humanities
Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
> On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
>> On Friday 18 February 2005 21:55, Erik Hatcher wrote:
>>> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
>>>> Erik,
>>>> Just curious: it would seem easier to use multiple fields for the
>>>> original case and lowercase searching. Is there any particular reason
>>>> you analyzed the documents to multiple indexes instead of multiple
>>>> fields?
>>> I considered that approach, however to expose QueryParser I'd have to
>>> get tricky. If I have title_orig and title_lc fields, how would I allow
>>> freeform queries of title:something?
>> By lowercasing the query text and searching in title_lc?
> Well sure, but how about this query:
>
>     title:Something AND anotherField:someOtherValue
>
> QueryParser, as-is, won't be able to do field-name swapping. I could
> certainly apply that technique on all the structured queries that I build
> up with the API, but with QueryParser it is trickier. I'm definitely open
> for suggestions on improving how case is handled.

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

    protected Query getFieldQuery(String field, String queryText)
        throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query.

> The only drawback now is that I'm duplicating indexes, but that is only an
> issue in how long it takes to rebuild the index from scratch (currently
> about 20 minutes or so on a good day - when the machine isn't swamped).

Once the users get the hang of this, you might end up having to quadruple the index, or more.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Scalability of Lucene indexes
Hi Bryan,

How big is your index? Also, what is the advantage of binding a user to a server?

Thanks,
Andy

--- Bryan McCormick [EMAIL PROTECTED] wrote:
> Hi Chris,
>
> I'm responsible for the webshots.com search index and we've had very good
> results with Lucene. It currently indexes over 100 million documents and
> performs 4 million searches/day.
>
> We initially tested running multiple small copies with a MultiSearcher and
> then merging results, as compared to running a very large single index. We
> actually found that the single large instance performed better. To improve
> load handling we clustered multiple identical copies together, then
> session-bind a user to a particular server and cache the results, but each
> server is running a single index.
>
> Bryan McCormick
>
> On Fri, 2005-02-18 at 08:01, Chris D wrote:
>> Hi all,
>>
>> I have a question about scaling Lucene across a cluster, and good ways of
>> breaking up the work. We have a very large index and searches sometimes
>> take more time than they're allowed. What we have been doing is, during
>> indexing, we index into 256 separate indexes (depending on the md5sum),
>> then distribute the indexes to the search machines. So if a machine has
>> 128 indexes it has to do 128 searches. I gave ParallelMultiSearcher a try
>> and it was significantly slower than simply iterating through the indexes
>> one at a time.
>>
>> Our new plan is to somehow have only one index per search machine and a
>> larger main index stored on the master. What I'm interested to know is
>> whether having one extremely large index for the master and then
>> splitting it into several smaller indexes (if this is possible) would be
>> better than having several smaller indexes and merging them on the search
>> machines into one index. I would also be interested to know how others
>> have divided up search work across a cluster.
>>
>> Thanks,
>> Chris
Re: Lucene in the Humanities
On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
>>> By lowercasing the query text and searching in title_lc?
>> Well sure, but how about this query:
>>
>>     title:Something AND anotherField:someOtherValue
>>
>> QueryParser, as-is, won't be able to do field-name swapping. I could
>> certainly apply that technique on all the structured queries that I build
>> up with the API, but with QueryParser it is trickier. I'm definitely open
>> for suggestions on improving how case is handled.
> Overriding this (1.4.3 QueryParser.jj, line 286) might work:
>
>     protected Query getFieldQuery(String field, String queryText)
>         throws ParseException { ... }
>
> It will be called by the parser for both parts of the query above, so one
> could change the field depending on the requested type of search and the
> field name in the query.

But that wouldn't work for any other type of query: title:somethingFuzzy~

Though now that I think more about it, a simple s/title:/title_orig:/ before parsing would work, and of course make the default field dynamic. I need to evaluate how many fields would need to be done this way - it'd be several. Thanks for the food for thought!

>> The only drawback now is that I'm duplicating indexes, but that is only
>> an issue in how long it takes to rebuild the index from scratch
>> (currently about 20 minutes or so on a good day - when the machine isn't
>> swamped).
> Once the users get the hang of this, you might end up having to quadruple
> the index, or more.

Why would that be? They want a case sensitive/insensitive switch. How would it expand beyond that?

Erik
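The s/title:/title_orig:/ idea above can be sketched as a plain pre-parse text rewrite. This is only an illustration of the substitution approach, not code from the thread; the field names and the `FieldRewriter` class are hypothetical, assuming `title`/`title_orig` style field pairs.

```java
import java.util.Map;

public class FieldRewriter {
    // Hypothetical mapping from the user-visible field name to the
    // case-preserving variant actually stored in the index.
    private static final Map<String, String> CASE_SENSITIVE_FIELDS =
            Map.of("title", "title_orig", "author", "author_orig");

    // Rewrite occurrences of "field:" to "field_orig:" before the query
    // string ever reaches QueryParser, as suggested in the thread.
    public static String rewrite(String query) {
        String result = query;
        for (Map.Entry<String, String> e : CASE_SENSITIVE_FIELDS.entrySet()) {
            // \b prevents matching inside longer names like "subtitle:"
            result = result.replaceAll("\\b" + e.getKey() + ":", e.getValue() + ":");
        }
        return result;
    }
}
```

Because the rewrite happens on the raw query string, it applies uniformly to term, phrase, and fuzzy queries alike, which is exactly why it sidesteps the getFieldQuery limitation discussed above.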
Re: Lucene vs. in-DB-full-text-searching
On Fri, Feb 18, 2005 at 04:45:50PM -0500, Mike Rose wrote:
> I can comment on this since I'm in the middle of excising Oracle text
> searching and replacing it with Lucene in one of my projects.

Interesting, particularly as it's from somebody who's already tried an existing in-db full-text search feature.

> All in all, I don't think that a JDBC wrapper is going to do what you
> want.

I wasn't thinking about trying to do the whole thing under the JDBC driver.

> Mainly I was thinking that one key point is that you need to treat the
> Lucene index somewhat like a cache. This also means that you have to watch
> database writes and make sure you update your cache, which means you have
> to have some sort of single point of data access to monitor.

Well, we already have that - it's called the JDBC driver.

The general design I was eyeing speculatively is basically that the driver would be set up with a reference to an object that implements a CacheManager interface. This interface basically gives the driver a way to notify the cache manager when certain tables and columns are being edited. Exactly how is another question. I don't know enough of the innards of, say, a PreparedStatement, to say more. It could be as simple as sending the CacheManager a copy of every SQL query string and letting the CacheManager figure out the rest. Ideally I'd like it to be a little bit more structured.

From there, it's the CacheManager's job to decide what to do about it, and how to do it. This leaves the tricky issue of mapping from a specific database to a specific Lucene index up to the developer.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - http://darksleep.com/notablog
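The CacheManager design sketched above is only described loosely in the message, so the following is a minimal, hypothetical rendering of it: an interface the driver wrapper calls on writes, plus a trivial implementation that just tracks which tables have gone stale. All names are invented for illustration; a real implementation would re-index the affected Lucene documents.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// The driver wrapper notifies a CacheManager whenever a write touches
// a table, and the manager decides whether the Lucene index needs
// updating -- the "single point of data access to monitor" idea.
interface CacheManager {
    void tableModified(String table, List<String> columns);
}

// A trivial manager that only records which tables are now stale.
class StaleTableTracker implements CacheManager {
    private final Set<String> staleTables = ConcurrentHashMap.newKeySet();

    public void tableModified(String table, List<String> columns) {
        staleTables.add(table);
    }

    public boolean isStale(String table) {
        return staleTables.contains(table);
    }
}
```

As the message notes, the hard part left to the developer is mapping from table/column notifications to specific Lucene documents; this sketch deliberately stops short of that.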
Re: Lucene in the Humanities
On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
> On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
>>>> By lowercasing the query text and searching in title_lc?
>>> Well sure, but how about this query:
>>>
>>>     title:Something AND anotherField:someOtherValue
>>>
>>> QueryParser, as-is, won't be able to do field-name swapping. I could
>>> certainly apply that technique on all the structured queries that I
>>> build up with the API, but with QueryParser it is trickier. I'm
>>> definitely open for suggestions on improving how case is handled.
>> Overriding this (1.4.3 QueryParser.jj, line 286) might work:
>>
>>     protected Query getFieldQuery(String field, String queryText)
>>         throws ParseException { ... }
>>
>> It will be called by the parser for both parts of the query above, so one
>> could change the field depending on the requested type of search and the
>> field name in the query.
> But that wouldn't work for any other type of query: title:somethingFuzzy~

To get that it would be necessary to override all query parser methods that take a field argument.

> Though now that I think more about it, a simple s/title:/title_orig:/
> before parsing would work, and of course make the default field

In the overriding getFieldQuery method, something like:

    if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
        field = field + "_orig";
    } else {
        // the other 3 cases
        ...
    }
    return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.

> dynamic. I need to evaluate how many fields would need to be done this
> way - it'd be several. Thanks for the food for thought!
>
>>> The only drawback now is that I'm duplicating indexes, but that is only
>>> an issue in how long it takes to rebuild the index from scratch
>>> (currently about 20 minutes or so on a good day - when the machine isn't
>>> swamped).
>> Once the users get the hang of this, you might end up having to quadruple
>> the index, or more.
> Why would that be? They want a case sensitive/insensitive switch. How
> would it expand beyond that?

With an index for every combination of fields and case sensitivity for these fields.

Regards,
Paul Elschot
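The field-swapping half of Paul's snippet, and his point about combinatorial index growth, can be sketched without any Lucene dependency. The helper below is hypothetical and assumes the `_orig` / `_lc` field-pair naming used in this thread; a multi-field approach keeps one index, while an index-per-combination scheme needs 2^n indexes for n switchable fields.

```java
public class FieldMapper {
    // Map a user-facing field name to the variant actually indexed,
    // following the title_orig / title_lc naming from the thread.
    // caseSensitive corresponds to the user's case-sensitivity switch.
    public static String mapField(String field, boolean caseSensitive) {
        return caseSensitive ? field + "_orig" : field + "_lc";
    }

    // With n independently switchable fields, an index-per-combination
    // scheme needs 2^n indexes -- Paul's "quadruple the index, or more".
    public static long indexCombinations(int numFields) {
        return 1L << numFields;
    }
}
```

With two switchable fields the index-duplication approach already needs four indexes, which is why the per-field mapping inside getFieldQuery scales better.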
MultiFieldQueryParser 1.8 isn't parsing phrases
Hi,

When I try to search for phrases using the MultiFieldQueryParser v1.8 from CVS, it gives me a NullPointerException. Using the following query works:

    title:IBM backs linux

However, it gives me the exception if I use the following query:

    IBM backs linux

Any idea why? I am using this MultiFieldQueryParser with Lucene 1.4.3. Of course I changed some of the boolean stuff to make it work with the production release.

Thanks,
Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
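For context on what a multi-field parse of an unqualified query amounts to: the phrase gets applied to every field and the results OR'd together. The sketch below only illustrates that expansion as a string rewrite; it is not Lucene's actual implementation (which builds the combined query programmatically), and the class name is invented.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MultiFieldExpander {
    // Expand an unqualified phrase into an OR across every field,
    // roughly what a multi-field parse produces for "IBM backs linux".
    public static String expand(String phrase, List<String> fields) {
        return fields.stream()
                .map(f -> f + ":\"" + phrase + "\"")
                .collect(Collectors.joining(" OR "));
    }
}
```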
Re: Search Performance
Michael Celona wrote:
> My index is changing in real time constantly... in this case I guess this
> will not work for me... any suggestions?

Using a singleton pattern for your index searcher makes sense anyway... I don't think that you change the index after each search; the computing effort is insignificant but the gain is not.

How often do you optimize your index? Run your JMeter tests before and after optimization! What is the value of your merge factor? Try using 2 or 3 and run the tests again. I think it would be useful for the Lucene community if you provide the results of your tests.

Best,
Sergiu

> -----Original Message-----
> From: David Townsend [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 18, 2005 11:50 AM
> To: Lucene Users List
> Subject: RE: Search Performance
>
> IndexSearchers are thread safe, so you can use the same object on multiple
> requests. If the index is static and not constantly updating, just keep
> one IndexSearcher for the life of the app. If the index changes and you
> need that instantly reflected in the results, you need to check if the
> index has changed, and if it has, create a new cached IndexSearcher. To
> check for changes you'll need to monitor the version number of the index,
> obtained via IndexReader.getCurrentVersion(indexName).
>
> David
>
> -----Original Message-----
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]]
> Sent: 18 February 2005 16:15
> To: Lucene Users List
> Subject: Re: Search Performance
>
> Try a singleton pattern or a static field.
>
> Stefan
>
> Michael Celona wrote:
>> I am creating new IndexSearchers... how do I cache my IndexSearcher?
>>
>> Michael
>>
>> -----Original Message-----
>> From: David Townsend [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, February 18, 2005 11:00 AM
>> To: Lucene Users List
>> Subject: RE: Search Performance
>>
>> Are you creating new IndexSearchers or IndexReaders on each search?
>> Caching your IndexSearchers has a dramatic effect on speed.
>>
>> David Townsend
>>
>> -----Original Message-----
>> From: Michael Celona [mailto:[EMAIL PROTECTED]]
>> Sent: 18 February 2005 15:55
>> To: Lucene Users List
>> Subject: Search Performance
>>
>> What is single-handedly the best way to improve search performance? I
>> have an index in the 2G range stored on the local file system of the
>> searcher. Under a load test of 5 simultaneous users my average search
>> time is ~4700 ms. Under a load test of 10 simultaneous users my average
>> search time is ~1 ms. I have given the JVM 2G of memory and am using
>> dual 3GHz Xeons. Any ideas?
>>
>> Michael
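The version-checked caching pattern David describes can be sketched generically. To keep the example self-contained, the suppliers below stand in for `IndexReader.getCurrentVersion(...)` and `new IndexSearcher(...)`; the class and its wiring are hypothetical, not Lucene API.

```java
import java.util.function.Supplier;

// Keep one searcher for the life of the app, but reopen it whenever the
// index version changes -- the caching scheme described in the thread.
public class SearcherCache<S> {
    private final Supplier<Long> versionSupplier; // e.g. IndexReader.getCurrentVersion
    private final Supplier<S> searcherFactory;    // e.g. new IndexSearcher(...)
    private long cachedVersion = -1;
    private S cachedSearcher;

    public SearcherCache(Supplier<Long> versionSupplier, Supplier<S> searcherFactory) {
        this.versionSupplier = versionSupplier;
        this.searcherFactory = searcherFactory;
    }

    public synchronized S get() {
        long current = versionSupplier.get();
        if (cachedSearcher == null || current != cachedVersion) {
            cachedSearcher = searcherFactory.get(); // reopen on version change
            cachedVersion = current;
        }
        return cachedSearcher;
    }
}
```

Every request goes through get(), so a static index pays only one searcher open for the life of the app, while a changing index is picked up on the first search after the version bumps.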
Re: MultiFieldQueryParser 1.8 isn't parsing phrases
On Saturday 19 February 2005 15:26, Ben wrote:
> When I try to search for phrases using the MultiFieldQueryParser v1.8 from
> CVS, it gives me NullPointerException.

This has just been fixed in SVN (I assume you mean SVN; CVS still exists but is read-only and probably not updated anymore).

Regards,
Daniel

--
http://www.danielnaber.de
Re: MultiFieldQueryParser 1.8 isn't parsing phrases
Thanks!

On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber [EMAIL PROTECTED] wrote:
> On Saturday 19 February 2005 15:26, Ben wrote:
>> When I try to search for phrases using the MultiFieldQueryParser v1.8
>> from CVS, it gives me NullPointerException.
> This has just been fixed in SVN (I assume you mean SVN; CVS still exists
> but is read-only and probably not updated anymore).
>
> Regards,
> Daniel
> --
> http://www.danielnaber.de
Re: Scalability of Lucene indexes
We are doing the same exact thing, though we didn't test with so many documents; the most we have tested so far is 3 million documents with a 3GB file size.

I would be interested in seeing how you maintained replicated indices that are in sync. The way we did it was to run the indexer on each server independently. If the data changes, one server will know about the change; that server updates its Lucene index and notifies the other servers (using multicast).

Glad to know someone else is doing a similar thing, and happier still to know that the solution works even for 100 million documents. I was a little worried about the index size growing higher and higher, but it looks like we should not have to worry anymore :)

Thanks,
Praveen

----- Original Message -----
From: Bryan McCormick [EMAIL PROTECTED]
To: Chris D [EMAIL PROTECTED]
Cc: lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 3:45 PM
Subject: Re: Scalability of Lucene indexes

> Hi Chris,
>
> I'm responsible for the webshots.com search index and we've had very good
> results with Lucene. It currently indexes over 100 million documents and
> performs 4 million searches/day.
>
> We initially tested running multiple small copies with a MultiSearcher and
> then merging results, as compared to running a very large single index. We
> actually found that the single large instance performed better. To improve
> load handling we clustered multiple identical copies together, then
> session-bind a user to a particular server and cache the results, but each
> server is running a single index.
>
> Bryan McCormick
>
> On Fri, 2005-02-18 at 08:01, Chris D wrote:
>> Hi all,
>>
>> I have a question about scaling Lucene across a cluster, and good ways of
>> breaking up the work. We have a very large index and searches sometimes
>> take more time than they're allowed. What we have been doing is, during
>> indexing, we index into 256 separate indexes (depending on the md5sum),
>> then distribute the indexes to the search machines. So if a machine has
>> 128 indexes it has to do 128 searches. I gave ParallelMultiSearcher a try
>> and it was significantly slower than simply iterating through the indexes
>> one at a time.
>>
>> Our new plan is to somehow have only one index per search machine and a
>> larger main index stored on the master. What I'm interested to know is
>> whether having one extremely large index for the master and then
>> splitting it into several smaller indexes (if this is possible) would be
>> better than having several smaller indexes and merging them on the search
>> machines into one index. I would also be interested to know how others
>> have divided up search work across a cluster.
>>
>> Thanks,
>> Chris
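Chris's scheme of routing documents into 256 separate indexes by md5sum can be sketched with the standard MessageDigest API. The class name and the choice of keying on a document identifier string are assumptions; the thread doesn't say exactly what is hashed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class IndexPartitioner {
    // Pick one of 256 index partitions from the MD5 of a document key,
    // along the lines of the scheme described in the thread. The first
    // byte of the digest gives an even spread over 0..255.
    public static int partition(String docKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docKey.getBytes(StandardCharsets.UTF_8));
            return digest[0] & 0xFF;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the partition is a pure function of the document key, any indexer can route a document without coordination, which is what makes distributing the 256 indexes across search machines straightforward.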
Mail Archive Broken?
I just beamed into the archive:

http://mail-archives.apache.org/eyebrowse/SummarizeList?listId=30

...and it only has through Feb 1! What's up?

Owen
Re: Scalability of Lucene indexes
Our index is currently about 40GB. The advantage of binding a user is that once a search is performed, caching within Lucene and in the application is very effective if subsequent searches go back to the same box. Our initial searches are usually in the sub-100 ms range, while subsequent requests for deeper pages in the results are returned instantly.

Bryan McCormick

On Sat, 2005-02-19 at 01:24, Andy wrote:
> Hi Bryan,
>
> How big is your index? Also, what is the advantage of binding a user to a
> server?
>
> Thanks,
> Andy
>
> --- Bryan McCormick [EMAIL PROTECTED] wrote:
>> Hi Chris,
>>
>> I'm responsible for the webshots.com search index and we've had very good
>> results with Lucene. It currently indexes over 100 million documents and
>> performs 4 million searches/day.
>>
>> We initially tested running multiple small copies with a MultiSearcher
>> and then merging results, as compared to running a very large single
>> index. We actually found that the single large instance performed better.
>> To improve load handling we clustered multiple identical copies together,
>> then session-bind a user to a particular server and cache the results,
>> but each server is running a single index.
>>
>> Bryan McCormick
>>
>> On Fri, 2005-02-18 at 08:01, Chris D wrote:
>>> Hi all,
>>>
>>> I have a question about scaling Lucene across a cluster, and good ways
>>> of breaking up the work. We have a very large index and searches
>>> sometimes take more time than they're allowed. What we have been doing
>>> is, during indexing, we index into 256 separate indexes (depending on
>>> the md5sum), then distribute the indexes to the search machines. So if a
>>> machine has 128 indexes it has to do 128 searches. I gave
>>> ParallelMultiSearcher a try and it was significantly slower than simply
>>> iterating through the indexes one at a time.
>>>
>>> Our new plan is to somehow have only one index per search machine and a
>>> larger main index stored on the master. What I'm interested to know is
>>> whether having one extremely large index for the master and then
>>> splitting it into several smaller indexes (if this is possible) would be
>>> better than having several smaller indexes and merging them on the
>>> search machines into one index. I would also be interested to know how
>>> others have divided up search work across a cluster.
>>>
>>> Thanks,
>>> Chris
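The application-level cache that makes "deeper pages" instant once a user is bound to one server can be sketched as a bounded LRU of full hit lists, sliced per page. This is a generic illustration of the pattern Bryan describes, not webshots.com's actual code; the class name, entry limit, and use of doc-id lists are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Cache the full hit list per query and slice pages out of it, so a
// request for page 3 of a query already run on this box never touches
// the index. A small LRU keeps memory bounded.
public class ResultCache {
    private static final int MAX_ENTRIES = 1000;

    private final Map<String, List<Integer>> cache =
            new LinkedHashMap<String, List<Integer>>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, List<Integer>> e) {
                    return size() > MAX_ENTRIES; // evict least-recently-used
                }
            };

    public synchronized void put(String query, List<Integer> docIds) {
        cache.put(query, docIds);
    }

    // Return one page of cached hits, or null if the query isn't cached.
    public synchronized List<Integer> page(String query, int pageNum, int pageSize) {
        List<Integer> hits = cache.get(query);
        if (hits == null) return null;
        int from = Math.min(pageNum * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        return hits.subList(from, to);
    }
}
```

Session binding matters because this cache lives in one server's memory: only if the user's follow-up requests land on the same box do the cached pages pay off.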
JavaLobby Lucene presentation
I recorded a "Meet Lucene" presentation at JavaLobby. It is a multimedia Flash video that shows slides with my voice recorded over them, spanning just over 20 minutes (you can jump to specific slides). Check it out here:

http://www.javalobby.org/members-only/eps/meet-lucene/index.html?source=archives

It's tailored as a high-level overview, and a quick one at that. It'll certainly be too basic for most everyone on this list, but maybe your manager would enjoy it :)

It's awkward to record this type of thing, and it sounds dry to me, as I ended up having to script what I was going to say and read it rather than ad-lib like I would in a face-to-face presentation. Ah's and um's don't work well in an audio-only track. I'd love to hear feedback on it (perhaps best through the JavaLobby forum associated with the presentation).

Erik