Please ignore the correction... "lose" is fine :-)

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
> Sorry, I meant "loose" (replacing "lose")
>
> On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
>
>> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള് नोब्ळ्
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Moving back to the RDBMS model will be a big step backwards where we
>>> miss multivalued fields and arbitrary fields.
>>
>> No one is suggesting to "lose" any of the virtues of the field-based
>> indexing that Lucene provides. Quite the contrary: by extending the
>> RDBMS model with Lucene-based indexes one can map relational rows to
>> documents and columns to fields. Note that one relational field can be
>> mapped to one or more text-based fields, and multi-valued fields will
>> still be allowed.
>>
>> Please check the Lucene OJVM implementation for details on the
>> implementation and philosophy of the RDBMS-Lucene converged model:
>>
>> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>>
>> More discussion at the blog of Marcelo, who will be presenting at
>> Oracle World 2008 this week:
>> http://marceloochoa.blogspot.com/
>>
>> BTW, it just so happens that this was implemented using Oracle, but a
>> similar implementation in H2 seems not only feasible but desirable.
>>
>> -- Joaquin
>>
>>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>>> <[EMAIL PROTECTED]> wrote:
>>> > Cool. I mention H2 because it does have some Lucene code in it, yes.
>>> > Also, according to some benchmarks it's the fastest of the open source
>>> > databases. I think it's possible to integrate realtime search for H2.
>>> > I suppose there is no need to store the data in Lucene in this case?
>>> > One loses the multiple values per field Lucene offers, and the schema
>>> > becomes static. Perhaps it's a trade-off?
>>> >
>>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
>>> >> Yes, both Marcelo and I would be interested.
>>> >>
>>> >> We looked into H2 and it looks like something similar to Oracle's ODCI
>>> >> can be implemented.
>>> >> Plus the primitive full-text implementation is based on Lucene.
>>> >> I say primitive because looking at the code I saw that one cannot
>>> >> define an Analyzer, and for each scan corresponding to a WHERE clause
>>> >> a searcher is opened and closed instead of being taken from a pool.
>>> >> It also does not have any way to queue changes to reduce the use of
>>> >> the IndexWriter, etc.
>>> >>
>>> >> But it's open source and that is a great starting point!
>>> >>
>>> >> -- Joaquin
>>> >>
>>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>>> >> <[EMAIL PROTECTED]> wrote:
>>> >>>
>>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>>> >>> (www.h2database.com) to take advantage of both models. I'm not sure
>>> >>> how exactly that would work, but it seems like it would not be too
>>> >>> difficult. Perhaps this would make it possible to perform faster
>>> >>> hierarchical queries and perhaps other types of queries that Lucene
>>> >>> is not capable of.
>>> >>>
>>> >>> Joaquin, is this something you are interested in collaborating on?
>>> >>> I am definitely interested in it.
>>> >>>
>>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]>
>>> >>> wrote:
>>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>> >>> > <[EMAIL PROTECTED]> wrote:
>>> >>> >>
>>> >>> >> Regarding real-time search and Solr, my feeling is the focus
>>> >>> >> should be on first adding real-time search to Lucene, and then
>>> >>> >> we'll figure out how to incorporate that into Solr later.
>>> >>> >
>>> >>> > Otis, what do you mean exactly by "adding real-time search to
>>> >>> > Lucene"?
>>> >>> > Note that Lucene, being an indexing/search library (and not a
>>> >>> > full-blown search engine), is by definition "real-time": once you
>>> >>> > add/write a document to the index it becomes immediately
>>> >>> > searchable, and a deleted document is logically deleted and no
>>> >>> > longer returned in a search, though physical deletion happens
>>> >>> > during an index optimization.
>>> >>> >
>>> >>> > Now, the problem of adding/deleting documents in bulk, as part of
>>> >>> > a transaction, and making these documents available for search
>>> >>> > immediately after the transaction is committed sounds more like a
>>> >>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if
>>> >>> > these transactions are known to be I/O expensive and thus are
>>> >>> > usually implemented as batched processes with some kind of sync
>>> >>> > mechanism, which makes them non real-time.
>>> >>> >
>>> >>> > For example, in my previous life, I designed and helped implement
>>> >>> > a quasi-realtime enterprise search engine using Lucene, having a
>>> >>> > set of multi-threaded indexers hitting a set of multiple indexes
>>> >>> > allocated across different search services which powered a
>>> >>> > broker-based distributed search interface. The most recent
>>> >>> > documents provided to the indexers were always added to the
>>> >>> > smaller in-memory (RAM) indexes, which usually could absorb the
>>> >>> > load of a bulk "add" transaction and later would be merged into
>>> >>> > larger disk-based indexes and then flushed to make them ready to
>>> >>> > absorb new fresh docs.
>>> >>> > We even had further partitioning of the indexes that reflected
>>> >>> > time periods, with caps on size, for them to be merged into older,
>>> >>> > more archive-based indexes which were used less (yes, the search
>>> >>> > engine's default search was on data no more than 1 month old,
>>> >>> > though the user could open the time window by including archives).
>>> >>> >
>>> >>> > As for SOLR and OCEAN, I would argue that these semi-structured
>>> >>> > search engines are becoming more and more like relational
>>> >>> > databases with full-text search capabilities (without the benefit
>>> >>> > of full relational algebra -- for example, joins are not possible
>>> >>> > using SOLR). Notice that "real-time" CRUD operations and
>>> >>> > transactionality are core DB concepts and have been studied and
>>> >>> > developed by database communities for quite a long time. There
>>> >>> > have been recent efforts on how to efficiently integrate Lucene
>>> >>> > into relational databases (see Lucene JVM ORACLE integration, see
>>> >>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>>> >>> >
>>> >>> > I think we should seriously look at joining efforts with
>>> >>> > open-source database engine projects written in Java (see
>>> >>> > http://java-source.net/open-source/database-engines) in order to
>>> >>> > blend IR and ORM once and for all.
>>> >>> >
>>> >>> > -- Joaquin
>>> >>> >
>>> >>> >>
>>> >>> >> I've read Jason's Wiki as well. Actually, I had to read it a
>>> >>> >> number of times to understand bits and pieces of it. I have to
>>> >>> >> admit there is still some fuzziness about the whole thing in my
>>> >>> >> head - is "Ocean" something that already works, a separate
>>> >>> >> project on googlecode.com? I think so.
>>> >>> >> If so, and if you are working on getting it integrated into
>>> >>> >> Lucene, would it make it less confusing to just refer to it as
>>> >>> >> "real-time search", so there is no confusion?
>>> >>> >>
>>> >>> >> If this is to be initially integrated into Lucene, why are things
>>> >>> >> like replication, crowding/field collapsing, locallucene, name
>>> >>> >> service, tag index, etc. all mentioned there on the Wiki and
>>> >>> >> bundled with the description of how real-time search works and is
>>> >>> >> to be implemented? I suppose mentioning replication kind-of makes
>>> >>> >> sense because the replication approach is closely tied to
>>> >>> >> real-time search - all query nodes need to see index changes
>>> >>> >> fast. But Lucene itself offers no replication mechanism, so maybe
>>> >>> >> the replication is something to figure out separately, say on the
>>> >>> >> Solr level, later on "once we get there". I think even just the
>>> >>> >> essential real-time search requires substantial changes to Lucene
>>> >>> >> (I remember seeing large patches in JIRA), which makes it hard to
>>> >>> >> digest, understand, comment on, and ultimately commit (hence the
>>> >>> >> lukewarm response, I think). Bringing other non-essential
>>> >>> >> elements into discussion at the same time makes it more difficult
>>> >>> >> to process all this new stuff, at least for me. Am I the only one
>>> >>> >> who finds this hard?
>>> >>> >>
>>> >>> >> That said, it sounds like we have some discussion going
>>> >>> >> (Karl...), so I look forward to understanding more!
>>> >>> >> :)
>>> >>> >>
>>> >>> >> Otis
>>> >>> >> --
>>> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >>> >>
>>> >>> >> ----- Original Message ----
>>> >>> >> > From: Yonik Seeley <[EMAIL PROTECTED]>
>>> >>> >> > To: [email protected]
>>> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>>> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>>> >>> >> >
>>> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
>>> >>> >> > > I also think it's got a lot of things now which makes
>>> >>> >> > > integration difficult to do properly.
>>> >>> >> >
>>> >>> >> > I agree, and that's why the major bump in version number rather
>>> >>> >> > than minor - we recognize that some features will need some
>>> >>> >> > amount of rearchitecture.
>>> >>> >> >
>>> >>> >> > > I think the problem with integration with SOLR is it was
>>> >>> >> > > designed with a different problem set in mind than Ocean,
>>> >>> >> > > originally the CNET shopping application.
>>> >>> >> >
>>> >>> >> > That was the first use of Solr, but it actually existed before
>>> >>> >> > that w/o any defined use other than to be a "plan B"
>>> >>> >> > alternative to MySQL based search servers (that's actually
>>> >>> >> > where some of the parameter names come from... the default
>>> >>> >> > /select URL instead of /search, the "rows" parameter, etc).
>>> >>> >> >
>>> >>> >> > But you're right... some things like the replication strategy
>>> >>> >> > were designed (well, borrowed from Doug to be exact) with the
>>> >>> >> > idea that it would be OK to have slightly "stale" views of the
>>> >>> >> > data in the range of minutes. It just made things
>>> >>> >> > easier/possible at the time. But tons of Solr and Lucene users
>>> >>> >> > want almost instantaneous visibility of added documents, if
>>> >>> >> > they can get it.
>>> >>> >> > It's hardly restricted to social network applications.
>>> >>> >> >
>>> >>> >> > Bottom line is that Solr aims to be a general enterprise search
>>> >>> >> > platform, and getting as real-time as we can get, and as
>>> >>> >> > scalable as we can get are some of the top priorities going
>>> >>> >> > forward.
>>> >>> >> >
>>> >>> >> > -Yonik
>>> >>> >> >
>>> >>> >> > ---------------------------------------------------------------------
>>> >>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> >>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>> --
>>> --Noble Paul
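
For readers following the thread, the quasi-realtime layout Joaquin describes (new documents go into a small in-memory index, which is later merged into a larger disk-based index, while searches consult both) can be sketched roughly as below. This is only a toy illustration under stated assumptions, not Lucene, Solr, or Ocean code: the class names TinyIndex and TieredIndex, the ram_capacity threshold, and the whitespace tokenizer are all hypothetical, and the "disk" index here is just a second in-memory structure standing in for one.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_count = 0

    def add(self, doc_id, text):
        # Naive whitespace tokenizer; a real engine would use an analyzer.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.doc_count += 1

    def merge_from(self, other):
        # Fold another index's postings into this one.
        for term, ids in other.postings.items():
            self.postings[term] |= ids
        self.doc_count += other.doc_count

    def search(self, term):
        return set(self.postings.get(term.lower(), ()))


class TieredIndex:
    """Small in-memory index in front of a larger 'disk' index (assumed design)."""

    def __init__(self, ram_capacity=2):
        self.ram = TinyIndex()
        self.disk = TinyIndex()  # stand-in for the disk-based index
        self.ram_capacity = ram_capacity  # hypothetical flush threshold

    def add(self, doc_id, text):
        # New documents always land in the small RAM index first.
        self.ram.add(doc_id, text)
        if self.ram.doc_count >= self.ram_capacity:
            self.flush()

    def flush(self):
        # Merge the RAM index into the larger index, then start a fresh one.
        self.disk.merge_from(self.ram)
        self.ram = TinyIndex()

    def search(self, term):
        # A document is visible as soon as it is in either tier.
        return self.ram.search(term) | self.disk.search(term)
```

The point of the split is that a document is searchable as soon as it reaches the RAM tier, while the expensive merge into the larger index happens in batches. The time-partitioned archive indexes mentioned in the thread would be further tiers behind the disk index.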
