Sorry, I meant "loose" (replacing "lose")

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള് नोब्ळ्
> <[EMAIL PROTECTED]> wrote:
>
>> Moving back to the RDBMS model will be a big step backwards, where we
>> miss multivalued fields and arbitrary fields.
>
> No one is suggesting to "lose" any of the virtues of the field-based
> indexing that Lucene provides. Quite the contrary: by extending the RDBMS
> model with Lucene-based indexes one can map relational rows to documents
> and columns to fields. Note that one relational field can be mapped to
> one or more text-based fields, and multi-valued fields will still be
> allowed.
>
> Please check the Lucene OJVM implementation for details on the
> implementation and the philosophy of the RDBMS-Lucene converged model:
>
> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>
> More discussion at the blog of Marcelo, who will be presenting at Oracle
> World 2008 this week:
> http://marceloochoa.blogspot.com/
>
> BTW, it just happens that this was implemented using Oracle, but a
> similar implementation in H2 seems not only feasible but desirable.
>
> -- Joaquin
>
>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>> <[EMAIL PROTECTED]> wrote:
>> > Cool. I mention H2 because it does have some Lucene code in it, yes.
>> > Also, according to some benchmarks it's the fastest of the open source
>> > databases. I think it's possible to integrate realtime search for H2.
>> > I suppose there is no need to store the data in Lucene in this case?
>> > One loses the multiple values per field Lucene offers, and the schema
>> > becomes static. Perhaps it's a trade-off?
>> >
>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
>> >> Yes, both Marcelo and I would be interested.
>> >>
>> >> We looked into H2 and it looks like something similar to Oracle's
>> >> ODCI can be implemented. Plus, the primitive full-text
>> >> implementation is based on Lucene.
>> >>
>> >> I say primitive because, looking at the code, I saw that one cannot
>> >> define an Analyzer, and for each scan corresponding to a where clause
>> >> a searcher is opened and closed instead of being taken from a pool.
>> >> It also has no way to queue changes to reduce the use of the
>> >> IndexWriter, etc.
>> >>
>> >> But it's open source, and that is a great starting point!
>> >>
>> >> -- Joaquin
>> >>
>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> >> <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>> >>> (www.h2database.com) to take advantage of both models. I'm not sure
>> >>> how exactly that would work, but it seems like it would not be too
>> >>> difficult. Perhaps this would make it possible to perform faster
>> >>> hierarchical queries and perhaps other types of queries that Lucene
>> >>> is not capable of.
>> >>>
>> >>> Joaquin, is this something you are interested in collaborating on?
>> >>> I am definitely interested in it.
>> >>>
>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]>
>> >>> wrote:
>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>> >>> > <[EMAIL PROTECTED]> wrote:
>> >>> >>
>> >>> >> Regarding real-time search and Solr, my feeling is the focus
>> >>> >> should be on first adding real-time search to Lucene, and then
>> >>> >> we'll figure out how to incorporate that into Solr later.
>> >>> >
>> >>> > Otis, what do you mean exactly by "adding real-time search to
>> >>> > Lucene"? Note that Lucene, being an indexing/search library (and
>> >>> > not a full-blown search engine), is by definition "real-time":
>> >>> > once you add/write a document to the index it becomes immediately
>> >>> > searchable, and once a document is deleted it is logically removed
>> >>> > and no longer returned in a search, though physical deletion
>> >>> > happens during an index optimization.
>> >>> >
>> >>> > Now, the problem of adding/deleting documents in bulk, as part of a
>> >>> > transaction, and making these documents available for search
>> >>> > immediately after the transaction is committed sounds more like a
>> >>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if
>> >>> > these transactions are known to be I/O expensive and thus are
>> >>> > usually implemented as batched processes with some kind of sync
>> >>> > mechanism, which makes them non-real-time.
>> >>> >
>> >>> > For example, in my previous life, I designed and helped implement a
>> >>> > quasi-realtime enterprise search engine using Lucene, having a set
>> >>> > of multi-threaded indexers hitting a set of multiple indexes
>> >>> > allocated across different search services, which powered a
>> >>> > broker-based distributed search interface. The most recent
>> >>> > documents provided to the indexers were always added to the smaller
>> >>> > in-memory (RAM) indexes, which usually could absorb the load of a
>> >>> > bulk "add" transaction; these would later be merged into larger
>> >>> > disk-based indexes and then flushed to make them ready to absorb
>> >>> > new fresh docs. We even had further partitioning of the indexes
>> >>> > that reflected time periods, with caps on size, for them to be
>> >>> > merged into older, more archive-based indexes which were used less
>> >>> > (yes, the search engine's default search was on data no more than
>> >>> > 1 month old, though users could open the time window by including
>> >>> > archives).
>> >>> >
>> >>> > As for SOLR and OCEAN, I would argue that these semi-structured
>> >>> > search engines are becoming more and more like relational databases
>> >>> > with full-text search capabilities (without the benefit of full
>> >>> > relational algebra -- for example, joins are not possible using
>> >>> > SOLR).
>> >>> > Notice that "real-time" CRUD operations and transactionality are
>> >>> > core DB concepts and have been studied and developed by database
>> >>> > communities for quite a long time. There have been recent efforts
>> >>> > on how to efficiently integrate Lucene into relational databases;
>> >>> > see the Lucene JVM ORACLE integration:
>> >>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
>> >>> >
>> >>> > I think we should seriously look at joining efforts with
>> >>> > open-source database engine projects written in Java (see
>> >>> > http://java-source.net/open-source/database-engines) in order to
>> >>> > blend IR and ORM once and for all.
>> >>> >
>> >>> > -- Joaquin
>> >>> >
>> >>> >> I've read Jason's Wiki as well. Actually, I had to read it a
>> >>> >> number of times to understand bits and pieces of it. I have to
>> >>> >> admit there is still some fuzziness about the whole thing in my
>> >>> >> head - is "Ocean" something that already works, a separate
>> >>> >> project on googlecode.com? I think so. If so, and if you are
>> >>> >> working on getting it integrated into Lucene, would it make it
>> >>> >> less confusing to just refer to it as "real-time search", so
>> >>> >> there is no confusion?
>> >>> >>
>> >>> >> If this is to be initially integrated into Lucene, why are things
>> >>> >> like replication, crowding/field collapsing, locallucene, name
>> >>> >> service, tag index, etc. all mentioned there on the Wiki and
>> >>> >> bundled with the description of how real-time search works and is
>> >>> >> to be implemented? I suppose mentioning replication kind-of makes
>> >>> >> sense because the replication approach is closely tied to
>> >>> >> real-time search - all query nodes need to see index changes
>> >>> >> fast.
>> >>> >> But Lucene itself offers no replication mechanism, so maybe the
>> >>> >> replication is something to figure out separately, say on the
>> >>> >> Solr level, later on "once we get there". I think even just the
>> >>> >> essential real-time search requires substantial changes to Lucene
>> >>> >> (I remember seeing large patches in JIRA), which makes it hard to
>> >>> >> digest, understand, comment on, and ultimately commit (hence the
>> >>> >> lukewarm response, I think). Bringing other non-essential
>> >>> >> elements into the discussion at the same time makes it more
>> >>> >> difficult to process all this new stuff, at least for me. Am I
>> >>> >> the only one who finds this hard?
>> >>> >>
>> >>> >> That said, it sounds like we have some discussion going
>> >>> >> (Karl...), so I look forward to understanding more! :)
>> >>> >>
>> >>> >> Otis
>> >>> >> --
>> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >>> >>
>> >>> >> ----- Original Message ----
>> >>> >> > From: Yonik Seeley <[EMAIL PROTECTED]>
>> >>> >> > To: [email protected]
>> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>> >>> >> >
>> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
>> >>> >> > > I also think it's got a lot of things now which make
>> >>> >> > > integration difficult to do properly.
>> >>> >> >
>> >>> >> > I agree, and that's why the major bump in version number rather
>> >>> >> > than minor - we recognize that some features will need some
>> >>> >> > amount of rearchitecture.
>> >>> >> >
>> >>> >> > > I think the problem with integration with SOLR is it was
>> >>> >> > > designed with a different problem set in mind than Ocean,
>> >>> >> > > originally the CNET shopping application.
>> >>> >> >
>> >>> >> > That was the first use of Solr, but it actually existed before
>> >>> >> > that w/o any defined use other than to be a "plan B"
>> >>> >> > alternative to MySQL based search servers (that's actually
>> >>> >> > where some of the parameter names come from... the default
>> >>> >> > /select URL instead of /search, the "rows" parameter, etc).
>> >>> >> >
>> >>> >> > But you're right... some things like the replication strategy
>> >>> >> > were designed (well, borrowed from Doug to be exact) with the
>> >>> >> > idea that it would be OK to have slightly "stale" views of the
>> >>> >> > data in the range of minutes. It just made things
>> >>> >> > easier/possible at the time. But tons of Solr and Lucene users
>> >>> >> > want almost instantaneous visibility of added documents, if
>> >>> >> > they can get it. It's hardly restricted to social network
>> >>> >> > applications.
>> >>> >> >
>> >>> >> > Bottom line is that Solr aims to be a general enterprise search
>> >>> >> > platform, and getting as real-time as we can get, and as
>> >>> >> > scalable as we can get are some of the top priorities going
>> >>> >> > forward.
>> >>> >> >
>> >>> >> > -Yonik
>> >>> >> >
>> >>> >> > ---------------------------------------------------------------------
>> >>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>>
>> --
>> --Noble Paul
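
The tiered scheme described earlier in the thread (recent documents go into a small in-memory index that is searched together with a larger index and is periodically merged into it) might be sketched roughly as below. This is only an illustration in plain Java, not Lucene or Ocean code: the class name TieredIndex, its methods, and the whitespace/punctuation tokenization are all assumptions made for the example, and the "main" tier is kept in memory here where a real system would use a disk-based index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical two-tier inverted index: a small RAM tier in front of a larger main tier. */
final class TieredIndex {
    // term -> ids of the documents that contain it
    private Map<String, TreeSet<Integer>> ramTier = new HashMap<>();
    private final Map<String, TreeSet<Integer>> mainTier = new HashMap<>();
    private final int ramCapacity; // documents the RAM tier may hold before it is merged
    private int ramDocs = 0;
    private int nextId = 0;

    TieredIndex(int ramCapacity) {
        this.ramCapacity = ramCapacity;
    }

    /** Adds a document to the RAM tier; it is searchable as soon as this call returns. */
    synchronized int add(String text) {
        int id = nextId++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                ramTier.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        ramDocs++;
        if (ramDocs >= ramCapacity) {
            flush(); // RAM tier is full: merge it so it can absorb fresh documents
        }
        return id;
    }

    /** Merges the RAM tier into the main tier and starts a fresh, empty RAM tier. */
    synchronized void flush() {
        for (Map.Entry<String, TreeSet<Integer>> e : ramTier.entrySet()) {
            mainTier.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
        }
        ramTier = new HashMap<>();
        ramDocs = 0;
    }

    /** Searches both tiers and returns matching document ids in ascending order. */
    synchronized List<Integer> search(String term) {
        String key = term.toLowerCase();
        TreeSet<Integer> hits = new TreeSet<>();
        hits.addAll(mainTier.getOrDefault(key, new TreeSet<>()));
        hits.addAll(ramTier.getOrDefault(key, new TreeSet<>()));
        return new ArrayList<>(hits);
    }

    /** Number of documents currently held only in the RAM tier. */
    synchronized int ramDocCount() {
        return ramDocs;
    }
}
```

The point of the sketch is that a search always consults both tiers, so a merge changes where a document lives but never whether it is visible. The time-based partitioning and archive indexes mentioned in the thread are left out.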
