Re: Realtime Search for Social Networks Collaboration

J. Delgado Sun, 21 Sep 2008 20:39:32 -0700

Sorry, I meant "loose" (replacing "lose")

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]> wrote:


> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> [EMAIL PROTECTED]> wrote:
>
>> Moving back to the RDBMS model will be a big step backwards where we miss
>> multivalued fields and arbitrary fields.
>
>
> No one is suggesting that we "lose" any of the virtues of the field-based
> indexing that Lucene provides. Quite the contrary: by extending the RDBMS
> model with Lucene-based indexes, one can map relational rows to documents and
> columns to fields. Note that one relational field can be mapped to one or
> more text-based fields, and multi-valued fields will still be allowed.
>
> Please check the Lucene OJVM implementation for details on the implementation
> and philosophy of the RDBMS-Lucene converged model:
>
> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>
> More discussion is on Marcelo's blog; he will be presenting at Oracle World
> 2008 this week.
> http://marceloochoa.blogspot.com/
>
> BTW, it just happens that this was implemented using Oracle, but a similar
> implementation in H2 seems not only feasible but desirable.
>
> -- Joaquin
>
>
>
>>
>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>> <[EMAIL PROTECTED]> wrote:
>> > Cool.  I mention H2 because it does have some Lucene code in it yes.
>> > Also according to some benchmarks it's the fastest of the open source
>> > databases.  I think it's possible to integrate realtime search for H2.
>> >  I suppose there is no need to store the data in Lucene in this case?
>> > One loses the multiple values per field Lucene offers, and the schema
>> > becomes static.  Perhaps it's a trade-off?
>> >
>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
>> >> Yes, both Marcelo and I would be interested.
>> >>
>> >> We looked into H2 and it looks like something similar to Oracle's ODCI
>> >> can be implemented. Plus, the primitive full-text implementation is based
>> >> on Lucene.
>> >> I say primitive because, looking at the code, I saw that one cannot define
>> >> an Analyzer, and for each scan corresponding to a WHERE clause a searcher
>> >> is opened and closed instead of being taken from a pool. It also has no
>> >> way to queue changes to reduce the use of the IndexWriter, etc.
>> >>
>> >> But it's open source, and that is a great starting point!
>> >>
>> >> -- Joaquin
>> >>
>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> >> <[EMAIL PROTECTED]> wrote:
>> >>>
>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>> >>> (www.h2database.com) to take advantage of both models.  I'm not sure
>> >>> how exactly that would work, but it seems like it would not be too
>> >>> difficult.  Perhaps this would solve being able to perform faster
>> >>> hierarchical queries and perhaps other types of queries that Lucene is
>> >>> not capable of.
>> >>>
>> >>> Joaquin, is this something you are interested in collaborating on?  I
>> >>> am definitely interested in it.
>> >>>
>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]> wrote:
>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>> >>> > <[EMAIL PROTECTED]> wrote:
>> >>> >>
>> >>> >> Regarding real-time search and Solr, my feeling is the focus should be
>> >>> >> on first adding real-time search to Lucene, and then we'll figure out
>> >>> >> how to incorporate that into Solr later.
>> >>> >
>> >>> >
>> >>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>> >>> > Note that Lucene, being an indexing/search library (and not a
>> >>> > full-blown search engine), is by definition "real-time": once you
>> >>> > add/write a document to the index it becomes immediately searchable,
>> >>> > and once a document is logically deleted it is no longer returned in a
>> >>> > search, though physical deletion happens during an index optimization.
>> >>> >
>> >>> > Now, the problem of adding/deleting documents in bulk, as part of a
>> >>> > transaction, and making these documents available for search
>> >>> > immediately after the transaction is committed sounds more like a
>> >>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if these
>> >>> > transactions are known to be I/O expensive and thus are usually
>> >>> > implemented as batched processes with some kind of sync mechanism,
>> >>> > which makes them non-real-time.
>> >>> >
>> >>> > For example, in my previous life, I designed and helped implement a
>> >>> > quasi-realtime enterprise search engine using Lucene, having a set of
>> >>> > multi-threaded indexers hitting a set of multiple indexes allocated
>> >>> > across different search services, which powered a broker-based
>> >>> > distributed search interface. The most recent documents provided to
>> >>> > the indexers were always added to the smaller in-memory (RAM) indexes,
>> >>> > which usually could absorb the load of a bulk "add" transaction and
>> >>> > later would be merged into larger disk-based indexes and then flushed
>> >>> > to make them ready to absorb new fresh docs. We even had further
>> >>> > partitioning of the indexes that reflected time periods, with caps on
>> >>> > size for them to be merged into older, more archive-based indexes
>> >>> > which were used less (yes, the search engine's default search was on
>> >>> > data no more than 1 month old, though the user could open the time
>> >>> > window by including archives).
>> >>> >
>> >>> > As for SOLR and OCEAN, I would argue that these semi-structured search
>> >>> > engines are becoming more and more like relational databases with
>> >>> > full-text search capabilities (without the benefit of full relational
>> >>> > algebra -- for example, joins are not possible using SOLR). Notice
>> >>> > that "real-time" CRUD operations and transactionality are core DB
>> >>> > concepts and have been studied and developed by database communities
>> >>> > for quite a long time. There have been recent efforts on how to
>> >>> > efficiently integrate Lucene into relational databases (see the Lucene
>> >>> > JVM ORACLE integration,
>> >>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
>> >>> > ).
>> >>> >
>> >>> > I think we should seriously look at joining efforts with open-source
>> >>> > database engine projects written in Java (see
>> >>> > http://java-source.net/open-source/database-engines) in order to blend
>> >>> > IR and ORM once and for all.
>> >>> >
>> >>> > -- Joaquin
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number
>> >>> >> of times to understand bits and pieces of it.  I have to admit there
>> >>> >> is still some fuzziness about the whole thing in my head - is "Ocean"
>> >>> >> something that already works, a separate project on googlecode.com?
>> >>> >> I think so.  If so, and if you are working on getting it integrated
>> >>> >> into Lucene, would it make it less confusing to just refer to it as
>> >>> >> "real-time search", so there is no confusion?
>> >>> >>
>> >>> >> If this is to be initially integrated into Lucene, why are things like
>> >>> >> replication, crowding/field collapsing, locallucene, name service, tag
>> >>> >> index, etc. all mentioned there on the Wiki and bundled with the
>> >>> >> description of how real-time search works and is to be implemented?  I
>> >>> >> suppose mentioning replication kind-of makes sense because the
>> >>> >> replication approach is closely tied to real-time search - all query
>> >>> >> nodes need to see index changes fast.  But Lucene itself offers no
>> >>> >> replication mechanism, so maybe the replication is something to figure
>> >>> >> out separately, say on the Solr level, later on "once we get there".  I
>> >>> >> think even just the essential real-time search requires substantial
>> >>> >> changes to Lucene (I remember seeing large patches in JIRA), which
>> >>> >> makes it hard to digest, understand, comment on, and ultimately commit
>> >>> >> (hence the lukewarm response, I think).  Bringing other non-essential
>> >>> >> elements into the discussion at the same time makes it more difficult
>> >>> >> to process all this new stuff, at least for me.  Am I the only one who
>> >>> >> finds this hard?
>> >>> >>
>> >>> >> That said, it sounds like we have some discussion going (Karl...), so
>> >>> >> I look forward to understanding more! :)
>> >>> >>
>> >>> >>
>> >>> >> Otis
>> >>> >> --
>> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> ----- Original Message ----
>> >>> >> > From: Yonik Seeley <[EMAIL PROTECTED]>
>> >>> >> > To: [email protected]
>> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>> >>> >> >
>> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
>> >>> >> > > I also think it's got a lot of things now which make integration
>> >>> >> > > difficult to do properly.
>> >>> >> >
>> >>> >> > I agree, and that's why the major bump in version number rather than
>> >>> >> > minor - we recognize that some features will need some amount of
>> >>> >> > rearchitecture.
>> >>> >> >
>> >>> >> > > I think the problem with integration with SOLR is it was designed
>> >>> >> > > with a different problem set in mind than Ocean, originally the
>> >>> >> > > CNET shopping application.
>> >>> >> >
>> >>> >> > That was the first use of Solr, but it actually existed before that
>> >>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>> >>> >> > based search servers (that's actually where some of the parameter
>> >>> >> > names come from... the default /select URL instead of /search, the
>> >>> >> > "rows" parameter, etc).
>> >>> >> >
>> >>> >> > But you're right... some things like the replication strategy were
>> >>> >> > designed (well, borrowed from Doug to be exact) with the idea that
>> >>> >> > it would be OK to have slightly "stale" views of the data in the
>> >>> >> > range of minutes.  It just made things easier/possible at the time.
>> >>> >> > But tons of Solr and Lucene users want almost instantaneous
>> >>> >> > visibility of added documents, if they can get it.  It's hardly
>> >>> >> > restricted to social network applications.
>> >>> >> >
>> >>> >> > Bottom line is that Solr aims to be a general enterprise search
>> >>> >> > platform, and getting as real-time as we can get, and as scalable as
>> >>> >> > we can get are some of the top priorities going forward.
>> >>> >> >
>> >>> >> > -Yonik
>> >>> >> >
>> >>> >> >
>> >>> >> > ---------------------------------------------------------------------
>> >>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>>
>>
>
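
The row-to-document mapping described in the top reply (relational rows as
documents, columns as fields, one column feeding one or more fields, and
multi-valued fields preserved) can be sketched roughly as follows. This is
only an illustration in plain Python: `Document`, `row_to_document`, and the
`column_to_fields` and `multi_valued` parameters are hypothetical names, not
part of Lucene, H2, or the Oracle integration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A search document: field name -> list of values (multi-valued)."""
    fields: dict = field(default_factory=dict)

    def add(self, name, value):
        # Appending keeps every value, so a field can hold several values.
        self.fields.setdefault(name, []).append(value)


def row_to_document(row, column_to_fields, multi_valued=()):
    """Map one relational row (dict of column -> value) to a Document.

    column_to_fields: hypothetical mapping from a column name to the index
    field names it feeds; unmapped columns become a field of the same name.
    multi_valued: columns whose value is a list, indexed item by item.
    """
    doc = Document()
    for column, value in row.items():
        values = value if column in multi_valued else [value]
        for field_name in column_to_fields.get(column, [column]):
            for v in values:
                doc.add(field_name, str(v))
    return doc


row = {"id": 7, "title": "Realtime search", "tags": ["lucene", "h2"]}
doc = row_to_document(
    row,
    column_to_fields={"title": ["title", "title_exact"]},
    multi_valued={"tags"},
)
```

Here the `title` column feeds two fields and `tags` stays multi-valued, which
is the property the reply argues need not be given up.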

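The quasi-realtime design described earlier in the thread (new documents go
to a small in-memory index, which is later merged into a larger disk-based
index and then emptied) might look something like the toy sketch below. It
assumes a simple inverted index of term -> document ids; `TieredIndex`,
`ram_cap`, and the dictionary standing in for the disk index are hypothetical
and do not reflect the Lucene API or the original system's code.

```python
class TieredIndex:
    """Toy two-tier inverted index: a small RAM tier plus a large tier."""

    def __init__(self, ram_cap=3):
        self.ram_cap = ram_cap  # merge once this many docs are buffered
        self.ram = {}           # term -> set of doc ids (recent docs)
        self.disk = {}          # stand-in for the larger disk-based index
        self.ram_docs = 0

    def add(self, doc_id, text):
        # New documents always land in the RAM tier first.
        for term in text.lower().split():
            self.ram.setdefault(term, set()).add(doc_id)
        self.ram_docs += 1
        if self.ram_docs >= self.ram_cap:
            self.merge()

    def merge(self):
        # Fold the RAM tier into the large tier, then empty it so it can
        # absorb fresh documents again.
        for term, ids in self.ram.items():
            self.disk.setdefault(term, set()).update(ids)
        self.ram = {}
        self.ram_docs = 0

    def search(self, term):
        # A query consults both tiers, so recent docs are visible at once.
        term = term.lower()
        return self.ram.get(term, set()) | self.disk.get(term, set())


idx = TieredIndex(ram_cap=2)
idx.add(1, "realtime search")
idx.add(2, "Lucene search")      # reaches the cap and triggers a merge
idx.add(3, "fresh search docs")  # sits in the RAM tier only
```

Searching both tiers is what makes freshly added documents visible before
any merge happens; the cap only controls how often the merge cost is paid.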