Re: Lucene vs. in-DB-full-text-searching
David Sitsky wrote:
> On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
> > You are right. Since there are C++ and now C ports of Lucene, it would be interesting to integrate them directly with DBs, so that the RDBMS full-text search under the hood is actually powered by one of the Lucene ports.
>
> Or to see Lucene + Derby (100% JAVA embedded database donated from IBM currently in Apache incubation) integrated together... that would be really nice and powerful. Does anyone know if there are any integration plans?

Don't forget BerkeleyDB Java Edition... that would be interesting too...

Kevin

--
Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat.
Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod!
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
Re: Lucene vs. in-DB-full-text-searching
Otis Gospodnetic wrote:
> The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone get away from MySQL and into Lucene for performance reasons.

Also... MySQL full-text search isn't perfect. If you're not a Java programmer it would be difficult to hack on. Another downside is that FT in MySQL only works with MyISAM tables, which aren't transaction-aware and use global table locks (not fun).

I'm sure though that MySQL would do a better job at online index maintenance than Lucene. Lucene falls down a bit in this area...

Kevin

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene vs. in-DB-full-text-searching
On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
> You are right.
> Since there are C++ and now C ports of Lucene, it would be interesting
> to integrate them directly with DBs, so that the RDBMS full-text search
> under the hood is actually powered by one of the Lucene ports.

Or to see Lucene + Derby (100% JAVA embedded database donated from IBM currently in Apache incubation) integrated together... that would be really nice and powerful. Does anyone know if there are any integration plans?

--
Cheers,
David
Re: Lucene vs. in-DB-full-text-searching
On Fri, Feb 18, 2005 at 04:45:50PM -0500, Mike Rose wrote:
> I can comment on this since I'm in the middle of excising Oracle text
> searching and replacing it with Lucene in one of my projects.

Interesting, particularly as it's from somebody who's already tried an existing in-db fulltext search feature.

> All in all, I don't think that a JDBC wrapper is going to do what
> you want.

I wasn't thinking about trying to do the whole thing under the JDBC driver. Mainly I was thinking that one key point is that you need to treat the Lucene index somewhat like a cache. This also means that you have to watch database writes and make sure you update your cache, which means you have to have some sort of single point of data access to monitor. Well, we already have that - it's called the JDBC driver.

The general design I was eyeing speculatively is basically that the driver would be set up with a reference to an object that implements a CacheManager interface. This interface gives the driver a way to notify the cache manager when certain tables and columns are being edited.

Exactly how is another question. I don't know enough of the innards of, say, a PreparedStatement, to say more. It could be as simple as sending the CacheManager a copy of every SQL query string and letting the CacheManager figure out the rest. Ideally I'd like it to be a little bit more structured.

From there, it's the CacheManager's job to decide what to do about it, and how to do it. This leaves the tricky issue of mapping from a specific database to a specific Lucene index up to the developer.

--
Steven J. Owens
[EMAIL PROTECTED]

"I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt." - http://darksleep.com/notablog
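For readers following the CacheManager idea in the message above, here is a minimal sketch of its simplest variant, where the driver wrapper hands over every SQL string and the manager works out the rest. Only the name CacheManager comes from the message; the method statementExecuted, the class DirtyTableTracker, and the regular expression are assumptions made up for illustration, and the JDBC wrapper that would call it is not shown. The pattern is deliberately crude: it ignores comments, quoted identifiers, and multi-table statements.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical callback a driver wrapper would invoke; not part of JDBC or Lucene. */
interface CacheManager {
    void statementExecuted(String sql);
}

/** Simplest variant: receive every SQL string and remember which watched tables were written. */
class DirtyTableTracker implements CacheManager {
    // Crude match for the table named by INSERT INTO / UPDATE / DELETE FROM.
    private static final Pattern WRITE = Pattern.compile(
        "^\\s*(?:insert\\s+into|update|delete\\s+from)\\s+([A-Za-z_][A-Za-z0-9_.]*)",
        Pattern.CASE_INSENSITIVE);

    private final Set<String> watched = new LinkedHashSet<>();
    private final Set<String> dirty = new LinkedHashSet<>();

    DirtyTableTracker(Set<String> watchedTables) {
        for (String table : watchedTables) {
            watched.add(table.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public void statementExecuted(String sql) {
        Matcher m = WRITE.matcher(sql);
        if (m.find()) {
            String table = m.group(1).toLowerCase(Locale.ROOT);
            if (watched.contains(table)) {
                dirty.add(table); // this table's index entries are now stale
            }
        }
    }

    /** Returns the tables that need reindexing and clears the pending set. */
    Set<String> drainDirtyTables() {
        Set<String> out = new LinkedHashSet<>(dirty);
        dirty.clear();
        return out;
    }
}
```

Mapping a dirty table back to the specific documents to reindex is still left to the developer, as the message says; a more structured callback (table, columns, primary key) would be needed for that.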
Re: Lucene vs. in-DB-full-text-searching
markharw00d wrote:
> >> But this brings up - has anyone run Lucene off a database trigger or are triggers known to be slow and bad for this use?
>
> I suspect the tricky bit would be knowing how to balance the calls to Reader/Writer closes, opens and optimizes. Record updates are the usual fun and games involving a reader.delete and a document.write.

I agree this is the usual tricky/"fun" thing. In similar situations I have:

- batched the updates in, well, sort of a "queue"
- flushed the "queue" after "t" seconds or "n" documents (e.g. t=60sec, n=1000 docs)

Part of the trick is a document that changes multiple times during one of these periods - if you have an "add queue" and a "delete queue" then you'll probably end up with the wrong index, with the doc in it either zero times or more than one time - not impossible to cover, just something to keep in mind.

- Dave
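One way to cover the "document changes several times in one period" case described above is to keep a single pending operation per document id instead of separate add and delete queues, so the last change wins. The sketch below is only an illustration under that assumption: the class PendingUpdates, its method names, and the idea of passing the clock in as a parameter are all made up here, and applying the drained batch to a Lucene index is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical coalescing queue: at most one pending operation per document id. */
class PendingUpdates {
    enum Op { UPSERT, DELETE }

    private final Map<String, Op> pending = new LinkedHashMap<>();
    private final int maxDocs;        // "n" documents
    private final long maxAgeMillis;  // "t" seconds, expressed in milliseconds
    private long oldestMillis;        // time of the first change in the current batch

    PendingUpdates(int maxDocs, long maxAgeMillis) {
        this.maxDocs = maxDocs;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Records a change; a later change to the same id replaces the earlier one. */
    void record(String docId, Op op, long nowMillis) {
        if (pending.isEmpty()) {
            oldestMillis = nowMillis;
        }
        pending.put(docId, op);
    }

    /** True once the batch holds maxDocs distinct documents or is maxAgeMillis old. */
    boolean shouldFlush(long nowMillis) {
        return !pending.isEmpty()
            && (pending.size() >= maxDocs || nowMillis - oldestMillis >= maxAgeMillis);
    }

    /** Hands back the batch to apply and starts a new, empty one. */
    Map<String, Op> drain() {
        Map<String, Op> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }
}
```

When the batch is flushed, each entry would typically be applied as "delete any existing document with this id, then add the new version if the operation is UPSERT", which keeps every document in the index at most once regardless of how many times it changed during the period.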
Re: Lucene vs. in-DB-full-text-searching
>> But this brings up - has anyone run Lucene off a database trigger or are triggers known to be slow and bad for this use?

I suspect the tricky bit would be knowing how to balance the calls to Reader/Writer closes, opens and optimizes. Record updates are the usual fun and games involving a reader.delete and a document.write.
Re: Lucene vs. in-DB-full-text-searching
You are right. Since there are C++ and now C ports of Lucene, it would be interesting to integrate them directly with DBs, so that the RDBMS full-text search under the hood is actually powered by one of the Lucene ports.

Otis

--- David Spencer <[EMAIL PROTECTED]> wrote:
> Otis Gospodnetic wrote:
> > The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed.
>
> Yeah, but in theory, in the ideal world :), it shouldn't be any slower - there's no magic Lucene has that DBs don't. And the big advantage of it being embedded in the DB is the index can always be up to date, just as if you had Lucene updating the index based on a trigger. You don't need any separate cron job to periodically update the index.
>
> But this brings up - has anyone run Lucene off a database trigger or are triggers known to be slow and bad for this use?
>
> > A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone get away from MySQL and into Lucene for performance reasons.
> >
> > Otis
Re: Lucene vs. in-DB-full-text-searching
Otis Gospodnetic wrote:
> The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed.

Yeah, but in theory, in the ideal world :), it shouldn't be any slower - there's no magic Lucene has that DBs don't. And the big advantage of it being embedded in the DB is the index can always be up to date, just as if you had Lucene updating the index based on a trigger. You don't need any separate cron job to periodically update the index.

But this brings up - has anyone run Lucene off a database trigger or are triggers known to be slow and bad for this use?

> A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone get away from MySQL and into Lucene for performance reasons.
>
> Otis
RE: Lucene vs. in-DB-full-text-searching
I can comment on this since I'm in the middle of excising Oracle text searching and replacing it with Lucene in one of my projects.

Oracle does provide mechanisms for creating fuzzy indexes of text and doing word stemming as well, has a scoring mechanism, etc... However, this requires additional licensing (or an enterprise license, big $$$) and index creation is slow. Unlike other indexes in Oracle, this needs to be explicitly dropped and recreated in order to pick up changes to the content, and you can't update a single entry in the index, you have to do the whole thing in one shot. That being said, it has been successful for me so far, you just have to use some non-standard funky SQL operators to make use of it.

So why am I switching to Lucene on this project?

Speed: Lucene is faster at indexing and searching.
Price: I don't think I need to explain this one.
Size: The size of the Lucene index is tiny and easier to deploy to the servers that search it.
Flexibility: If I want to change how I index or search, I don't need to worry about db schema evolution across multiple environments on the way to production.

All in all, I don't think that a JDBC wrapper is going to do what you want. The material you want to index is application-specific, as are the mechanics of searching the index. A JDBC driver isn't going to know which of the fields you are updating you might care to index and search later.

In the end, the approach that worked for me was to create a config-driven wrapper that knows how to index specific properties of POJOs. The same config also drives the formation of the query expressions. This way I don't care if the content was instantiated from a db or xml (I need to do both), or some other source. I think one of the great benefits of Lucene is that it allows me to embed sophisticated search functionality into my apps without being dependent upon any particular persistence mechanism.

Mike
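To make the config-driven wrapper idea above a little more concrete, here is a rough sketch of how one config could drive both what gets indexed and how queries are formed. This is not Mike's implementation, which the message does not show: the classes Book and PojoIndexConfig, their method names, and the use of getter reflection are all assumptions for illustration. The extracted map stands in for the fields of a Lucene document, and the query string uses Lucene's field:term query syntax without any escaping.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Example POJO, made up for this sketch. */
class Book {
    private final String title;
    private final String summary;
    private final int internalId;

    Book(String title, String summary, int internalId) {
        this.title = title;
        this.summary = summary;
        this.internalId = internalId;
    }

    public String getTitle() { return title; }
    public String getSummary() { return summary; }
    public int getInternalId() { return internalId; }
}

/** Hypothetical config: which properties of which classes are indexed and searched. */
class PojoIndexConfig {
    private final Map<Class<?>, List<String>> indexedProperties;

    PojoIndexConfig(Map<Class<?>, List<String>> indexedProperties) {
        this.indexedProperties = indexedProperties;
    }

    /** Reads the configured properties through their public getters. */
    Map<String, String> fieldsFor(Object pojo) {
        Map<String, String> fields = new LinkedHashMap<>();
        List<String> props = indexedProperties.get(pojo.getClass());
        if (props == null) {
            return fields; // nothing configured for this type
        }
        for (String prop : props) {
            String getter = "get" + Character.toUpperCase(prop.charAt(0)) + prop.substring(1);
            try {
                Method m = pojo.getClass().getMethod(getter);
                Object value = m.invoke(pojo);
                if (value != null) {
                    fields.put(prop, String.valueOf(value));
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read property " + prop, e);
            }
        }
        return fields;
    }

    /** Builds a naive query over the same configured fields; the term is not escaped. */
    String queryFor(Class<?> type, String term) {
        List<String> props = indexedProperties.get(type);
        if (props == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String prop : props) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(prop).append(':').append(term);
        }
        return sb.toString();
    }
}
```

Because the wrapper only sees objects and property names, it does not matter whether a Book was loaded from a database, XML, or anywhere else, which is the persistence independence the message describes.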
Re: Lucene vs. in-DB-full-text-searching
The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone get away from MySQL and into Lucene for performance reasons.

Otis

--- "Steven J. Owens" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was rambling to some friends about an idea to build a
> cache-aware JDBC driver wrapper, to make it easier to keep a lucene
> index of a database up to date.
>
> They asked me a question that I have to take seriously, which is
> that most RDBMSes provide some built-in fulltext searching -
> postgres, mysql, even oracle - why not use that instead of adding
> another layer of caching?
>
> I have to take this question seriously, especially since it
> reminds me a lot of what Doug has often said to folks contemplating
> doing similar things (caching query results, etc) with Lucene.
>
> Has anybody done some serious investigation into this, and could
> summarize the pros and cons?
Lucene vs. in-DB-full-text-searching
Hi,

I was rambling to some friends about an idea to build a cache-aware JDBC driver wrapper, to make it easier to keep a lucene index of a database up to date.

They asked me a question that I have to take seriously, which is that most RDBMSes provide some built-in fulltext searching - postgres, mysql, even oracle - why not use that instead of adding another layer of caching?

I have to take this question seriously, especially since it reminds me a lot of what Doug has often said to folks contemplating doing similar things (caching query results, etc) with Lucene.

Has anybody done some serious investigation into this, and could summarize the pros and cons?

--
Steven J. Owens
[EMAIL PROTECTED]