On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള് नोब्ळ् < [EMAIL PROTECTED]> wrote:
> Moving back to RDBMS model will be a big step backwards where we miss multivalued fields and arbitrary fields.

No one is suggesting that we "lose" any of the virtues of the field-based indexing that Lucene provides. Quite the contrary: by extending the RDBMS model with Lucene-based indexes, one can map relational rows to documents and columns to fields. Note that one relational field can be mapped to one or more text-based fields, and multi-valued fields will still be allowed.

Please check the Lucene OJVM implementation for details on the implementation and on the philosophy of the converged RDBMS-Lucene model:
http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

More discussion can be found on the blog of Marcelo, who will be presenting at Oracle World 2008 this week:
http://marceloochoa.blogspot.com/

BTW, it just happens that this was implemented using Oracle, but a similar implementation in H2 seems not only feasible but desirable.

-- Joaquin

>
> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > Cool. I mention H2 because it does have some Lucene code in it, yes. Also, according to some benchmarks it's the fastest of the open source databases. I think it's possible to integrate realtime search for H2. I suppose there is no need to store the data in Lucene in this case? One loses the multiple values per field Lucene offers, and the schema becomes static. Perhaps it's a trade-off?
> >
> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
> >> Yes, both Marcelo and I would be interested.
> >>
> >> We looked into H2 and it looks like something similar to Oracle's ODCI can be implemented. Plus the primitive full-text implementation is based on Lucene.
> >> I say primitive because looking at the code I saw that one cannot define an Analyzer, and for each scan corresponding to a WHERE clause a searcher is opened and closed instead of being taken from a pool. It also does not have any way to queue changes to reduce the use of the IndexWriter, etc.
> >>
> >> But it's open source and that is a great starting point!
> >>
> >> -- Joaquin
> >>
> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Perhaps an interesting project would be to integrate Ocean with H2 (www.h2database.com) to take advantage of both models. I'm not sure how exactly that would work, but it seems like it would not be too difficult. Perhaps this would solve being able to perform faster hierarchical queries and perhaps other types of queries that Lucene is not capable of.
> >>>
> >>> Is this something, Joaquin, you are interested in collaborating on? I am definitely interested in it.
> >>>
> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]> wrote:
> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> >>> >>
> >>> >> Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later.
> >>> >
> >>> > Otis, what do you mean exactly by "adding real-time search to Lucene"? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition "real-time": once you add/write a document to the index it becomes immediately searchable, and once a document is logically deleted it is no longer returned in a search, though physical deletion happens during an index optimization.
> >>> >
> >>> > Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making these documents available for search immediately after the transaction is committed sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non-real-time.
> >>> >
> >>> > For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, having a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services, which powered a broker-based distributed search interface. The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk "add" transaction and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs. We even had further partitioning of the indexes that reflected time periods, with caps on size for them to be merged into older, more archive-based indexes which were used less (yes, the search engine's default search was on data no more than 1 month old, though the user could open the time window by including archives).
> >>> >
> >>> > As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR).
> >>> > Notice that "real-time" CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see the Lucene-Oracle JVM integration: http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html).
> >>> >
> >>> > I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all.
> >>> >
> >>> > -- Joaquin
> >>> >
> >>> >>
> >>> >> I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is "Ocean" something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as "real-time search", so there is no confusion?
> >>> >>
> >>> >> If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast.
> >>> >> But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on "once we get there". I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard?
> >>> >>
> >>> >> That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :)
> >>> >>
> >>> >> Otis
> >>> >> --
> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>> >>
> >>> >> ----- Original Message ----
> >>> >> > From: Yonik Seeley <[EMAIL PROTECTED]>
> >>> >> > To: [email protected]
> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
> >>> >> >
> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
> >>> >> > > I also think it's got a lot of things now which makes integration difficult to do properly.
> >>> >> >
> >>> >> > I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture.
> >>> >> >
> >>> >> > > I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application.
> >>> >> >
> >>> >> > That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a "plan B" alternative to MySQL based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the "rows" parameter, etc).
> >>> >> >
> >>> >> > But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly "stale" views of the data in the range of minutes. It just made things easier/possible at the time. But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications.
> >>> >> >
> >>> >> > Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get are some of the top priorities going forward.
> >>> >> >
> >>> >> > -Yonik
> >>> >> >
> >>> >> > ---------------------------------------------------------------------
> >>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> --
> --Noble Paul
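
The row-to-document mapping described in the reply at the top of this thread (rows become documents, columns become fields, one column may feed several fields, multi-valued fields remain possible) can be illustrated with a toy sketch. This is not the Lucene OJVM code and it does not use the Lucene or H2 APIs; the names row_to_document, build_index, field_map and the sample rows are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def row_to_document(row, field_map):
    """Turn one relational row (column -> value) into a document
    (field -> list of string values).

    field_map (hypothetical) lists, per column, the index fields it feeds.
    A column that is not listed maps to a field with the same name.
    """
    doc = defaultdict(list)
    for column, value in row.items():
        # A list or tuple value models a multi-valued column.
        values = value if isinstance(value, (list, tuple)) else [value]
        for field in field_map.get(column, [column]):
            doc[field].extend(str(v) for v in values)
    return dict(doc)


def build_index(rows, field_map, key="id"):
    """Build a toy inverted index: (field, token) -> set of row keys."""
    index = defaultdict(set)
    for row in rows:
        for field, values in row_to_document(row, field_map).items():
            for value in values:
                for token in value.lower().split():
                    index[(field, token)].add(row[key])
    return index


# Made-up sample data: "tags" is multi-valued, and both "title" and "tags"
# are also copied into a catch-all "all_text" field.
rows = [
    {"id": 1, "title": "Lucene in Oracle", "tags": ["search", "rdbms"]},
    {"id": 2, "title": "H2 full text", "tags": ["search"]},
]
field_map = {"title": ["title", "all_text"], "tags": ["tag", "all_text"]}
index = build_index(rows, field_map)
```

Looking up index[("tag", "search")] then yields the keys of the rows whose tags column contains "search", so the relational row stays the unit of storage while the index keeps Lucene-style fields.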
