Re: Realtime Search for Social Networks Collaboration

J. Delgado Sun, 21 Sep 2008 20:54:04 -0700

Please ignore the correction... "lose" is fine:-)

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]> wrote:


> Sorry, I meant "loose" (replacing "lose")
>
>
> On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado <[EMAIL PROTECTED]> wrote:
>
>> On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Moving back to the RDBMS model would be a big step backwards, where we
>>> miss multivalued fields and arbitrary fields.
>>
>>
>> No one is suggesting we "lose" any of the virtues of the field-based
>> indexing that Lucene provides. Quite the contrary: by extending the RDBMS
>> model with Lucene-based indexes one can map relational rows to documents and
>> columns to fields. Note that one relational field can be mapped to one or
>> more text-based fields, and multi-valued fields will still be allowed.
>>
>> Please check the Lucene OJVM implementation for details on the
>> implementation and the philosophy of the RDBMS-Lucene converged model:
>>
>> http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>>
>> More discussion is on the blog of Marcelo, who will be presenting at Oracle
>> World 2008 this week:
>> http://marceloochoa.blogspot.com/
>>
>> BTW, it just happens that this was implemented using Oracle, but a similar
>> implementation in H2 seems not only feasible but desirable.
>>
>> -- Joaquin
>>
>>
>>
>>>
>>> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
>>> <[EMAIL PROTECTED]> wrote:
>>> > Cool.  I mention H2 because it does have some Lucene code in it yes.
>>> > Also according to some benchmarks it's the fastest of the open source
>>> > databases.  I think it's possible to integrate realtime search for H2.
>>> >  I suppose there is no need to store the data in Lucene in this case?
>>> > One loses the multiple values per field Lucene offers, and the schema
>>> > becomes static.  Perhaps it's a trade-off?
>>> >
>>> > On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[EMAIL PROTECTED]>
>>> wrote:
>>> >> Yes, both Marcelo and I would be interested.
>>> >>
>>> >> We looked into H2 and it looks like something similar to Oracle's ODCI
>>> >> can be implemented. Plus, the primitive full-text implementation is
>>> >> based on Lucene.
>>> >> I say primitive because, looking at the code, I saw that one cannot
>>> >> define an Analyzer, and for each scan corresponding to a where clause a
>>> >> searcher is opened and closed instead of being taken from a pool. It also
>>> >> has no way to queue changes to reduce the use of the IndexWriter, etc.
>>> >>
>>> >> But it's open source, and that is a great starting point!
>>> >>
>>> >> -- Joaquin
>>> >>
>>> >> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>>> >> <[EMAIL PROTECTED]> wrote:
>>> >>>
>>> >>> Perhaps an interesting project would be to integrate Ocean with H2
>>> >>> (www.h2database.com) to take advantage of both models.  I'm not sure
>>> >>> exactly how that would work, but it seems like it would not be too
>>> >>> difficult.  Perhaps this would allow faster hierarchical queries and
>>> >>> perhaps other types of queries that Lucene is not capable of.
>>> >>>
>>> >>> Is this something Joaquin you are interested in collaborating on?  I
>>> >>> am definitely interested in it.
>>> >>>
>>> >>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[EMAIL PROTECTED]> wrote:
>>> >>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>> >>> > <[EMAIL PROTECTED]> wrote:
>>> >>> >>
>>> >>> >> Regarding real-time search and Solr, my feeling is the focus should
>>> >>> >> be on first adding real-time search to Lucene, and then we'll figure
>>> >>> >> out how to incorporate that into Solr later.
>>> >>> >
>>> >>> >
>>> >>> > Otis, what do you mean exactly by "adding real-time search to
>>> >>> > Lucene"?  Note that Lucene, being an indexing/search library (and not
>>> >>> > a full-blown search engine), is by definition "real-time": once you
>>> >>> > add/write a document to the index it becomes immediately searchable,
>>> >>> > and a deleted document is logically deleted and no longer returned in
>>> >>> > a search, though physical deletion happens during an index
>>> >>> > optimization.
>>> >>> >
>>> >>> > Now, the problem of adding/deleting documents in bulk, as part of a
>>> >>> > transaction, and making these documents available for search
>>> >>> > immediately after the transaction is committed sounds more like a
>>> >>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if these
>>> >>> > transactions are known to be I/O expensive and thus are usually
>>> >>> > implemented as batched processes with some kind of sync mechanism,
>>> >>> > which makes them non-real-time.
>>> >>> >
>>> >>> > For example, in my previous life, I designed and helped implement a
>>> >>> > quasi-realtime enterprise search engine using Lucene, having a set of
>>> >>> > multi-threaded indexers hitting a set of multiple indexes allocated
>>> >>> > across different search services, which powered a broker-based
>>> >>> > distributed search interface. The most recent documents provided to
>>> >>> > the indexers were always added to the smaller in-memory (RAM)
>>> >>> > indexes, which usually could absorb the load of a bulk "add"
>>> >>> > transaction and later would be merged into larger disk-based indexes
>>> >>> > and then flushed to make them ready to absorb new fresh docs.
>>> >>> > We even had further partitioning of the indexes that reflected time
>>> >>> > periods, with caps on size, for them to be merged into older, more
>>> >>> > archive-based indexes which were used less (yes, the search engine's
>>> >>> > default search was on data no more than 1 month old, though users
>>> >>> > could open the time window by including archives).
>>> >>> >
>>> >>> > As for SOLR and OCEAN, I would argue that these semi-structured
>>> >>> > search engines are becoming more and more like relational databases
>>> >>> > with full-text search capabilities (without the benefit of full
>>> >>> > relational algebra -- for example, joins are not possible using
>>> >>> > SOLR). Notice that "real-time" CRUD operations and transactionality
>>> >>> > are core DB concepts and have been studied and developed by database
>>> >>> > communities for quite a long time. There have been recent efforts on
>>> >>> > how to efficiently integrate Lucene into relational databases (see
>>> >>> > the Lucene JVM ORACLE integration:
>>> >>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
>>> >>> > )
>>> >>> >
>>> >>> > I think we should seriously look at joining efforts with open-source
>>> >>> > database engine projects written in Java (see
>>> >>> > http://java-source.net/open-source/database-engines) in order to
>>> >>> > blend IR and ORM once and for all.
>>> >>> >
>>> >>> > -- Joaquin
>>> >>> >
>>> >>> >
>>> >>> >>
>>> >>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number
>>> >>> >> of times to understand bits and pieces of it.  I have to admit there
>>> >>> >> is still some fuzziness about the whole thing in my head - is
>>> >>> >> "Ocean" something that already works, a separate project on
>>> >>> >> googlecode.com?  I think so.  If so, and if you are working on
>>> >>> >> getting it integrated into Lucene, would it make it less confusing
>>> >>> >> to just refer to it as "real-time search", so there is no confusion?
>>> >>> >>
>>> >>> >> If this is to be initially integrated into Lucene, why are things
>>> >>> >> like replication, crowding/field collapsing, locallucene, name
>>> >>> >> service, tag index, etc. all mentioned there on the Wiki and bundled
>>> >>> >> with the description of how real-time search works and is to be
>>> >>> >> implemented?  I suppose mentioning replication kind-of makes sense
>>> >>> >> because the replication approach is closely tied to real-time
>>> >>> >> search - all query nodes need to see index changes fast.  But Lucene
>>> >>> >> itself offers no replication mechanism, so maybe the replication is
>>> >>> >> something to figure out separately, say on the Solr level, later on
>>> >>> >> "once we get there".  I think even just the essential real-time
>>> >>> >> search requires substantial changes to Lucene (I remember seeing
>>> >>> >> large patches in JIRA), which makes it hard to digest, understand,
>>> >>> >> comment on, and ultimately commit (hence the lukewarm response, I
>>> >>> >> think).  Bringing other non-essential elements into discussion at
>>> >>> >> the same time makes it more difficult to process all this new stuff,
>>> >>> >> at least for me.  Am I the only one who finds this hard?
>>> >>> >>
>>> >>> >> That said, it sounds like we have some discussion going (Karl...),
>>> so I
>>> >>> >> look forward to understanding more! :)
>>> >>> >>
>>> >>> >>
>>> >>> >> Otis
>>> >>> >> --
>>> >>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> ----- Original Message ----
>>> >>> >> > From: Yonik Seeley <[EMAIL PROTECTED]>
>>> >>> >> > To: [email protected]
>>> >>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>>> >>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>>> >>> >> >
>>> >>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>>> >>> >> > wrote:
>>> >>> >> > > I also think it's got a lot of things now which makes
>>> >>> >> > > integration difficult to do properly.
>>> >>> >> >
>>> >>> >> > I agree, and that's why the major bump in version number rather
>>> than
>>> >>> >> > minor - we recognize that some features will need some amount of
>>> >>> >> > rearchitecture.
>>> >>> >> >
>>> >>> >> > > I think the problem with integration with SOLR is it was
>>> >>> >> > > designed with a different problem set in mind than Ocean,
>>> >>> >> > > originally the CNET shopping application.
>>> >>> >> >
>>> >>> >> > That was the first use of Solr, but it actually existed before
>>> that
>>> >>> >> > w/o any defined use other than to be a "plan B" alternative to
>>> MySQL
>>> >>> >> > based search servers (that's actually where some of the
>>> parameter
>>> >>> >> > names come from... the default /select URL instead of /search,
>>> the
>>> >>> >> > "rows" parameter, etc).
>>> >>> >> >
>>> >>> >> > But you're right... some things like the replication strategy
>>> >>> >> > were designed (well, borrowed from Doug to be exact) with the idea
>>> >>> >> > that it would be OK to have slightly "stale" views of the data in
>>> >>> >> > the range of minutes.  It just made things easier/possible at the
>>> >>> >> > time.  But tons of Solr and Lucene users want almost instantaneous
>>> >>> >> > visibility of added documents, if they can get it.  It's hardly
>>> >>> >> > restricted to social network applications.
>>> >>> >> >
>>> >>> >> > Bottom line is that Solr aims to be a general enterprise search
>>> >>> >> > platform, and getting as real-time as we can get, and as
>>> scalable as
>>> >>> >> > we can get are some of the top priorities going forward.
>>> >>> >> >
>>> >>> >> > -Yonik
>>> >>> >> >
>>> >>> >> >
>>> ---------------------------------------------------------------------
>>> >>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> >>> >> > For additional commands, e-mail:
>>> [EMAIL PROTECTED]
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> ---------------------------------------------------------------------
>>> >>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> >>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> --Noble Paul
>>>
>>>
>>>
>>
>
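
For readers following the quoted thread, the "small RAM index merged into a larger disk index" scheme described there can be modeled roughly as below. This is only a minimal sketch in plain Python, not Lucene code: the class name `TieredIndex`, the `ram_limit` cap, and the use of dictionaries to stand in for both tiers are all assumptions made for illustration.

```python
from collections import defaultdict


class TieredIndex:
    """Toy model: recent docs go to a small RAM tier, later merged into a big tier."""

    def __init__(self, ram_limit=2):
        self.ram_limit = ram_limit      # hypothetical cap on docs held in the RAM tier
        self.ram = defaultdict(set)     # term -> doc ids for the most recent documents
        self.disk = defaultdict(set)    # term -> doc ids; stands in for the disk index
        self.ram_docs = 0

    def add(self, doc_id, text):
        # New documents are searchable as soon as they are in the RAM tier.
        for term in text.lower().split():
            self.ram[term].add(doc_id)
        self.ram_docs += 1
        if self.ram_docs >= self.ram_limit:
            self.flush()

    def flush(self):
        # Merge the RAM tier into the larger tier, then empty it for fresh docs.
        for term, ids in self.ram.items():
            self.disk[term] |= ids
        self.ram.clear()
        self.ram_docs = 0

    def search(self, term):
        # A query consults both tiers, so results do not depend on merge timing.
        term = term.lower()
        return self.ram.get(term, set()) | self.disk.get(term, set())
```

The point of the design is that `search` always unions both tiers, so a document is visible right after `add` whether or not a merge has happened yet.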

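The row-to-document mapping mentioned in the quoted thread (rows become documents, columns become fields, one column may feed several fields, multi-valued fields stay allowed) might look something like the following. It is a hedged sketch in plain Python, not the Lucene OJVM implementation: `row_to_document`, `field_map`, and the comma-separated encoding of multi-valued columns are assumptions for illustration only.

```python
def row_to_document(row, field_map=None, multi_valued=()):
    """Map one relational row (a dict) to a document: field name -> list of values.

    field_map    -- hypothetical mapping of column name -> list of target fields
    multi_valued -- columns assumed to hold comma-separated strings
    """
    field_map = field_map or {}
    doc = {}
    for column, value in row.items():
        if value is None:
            continue  # NULL columns produce no field at all
        if column in multi_valued:
            values = [v.strip() for v in value.split(",")]
        else:
            values = [value]
        # One column can be mapped to one or more fields; default is the same name.
        for field in field_map.get(column, [column]):
            doc.setdefault(field, []).extend(values)
    return doc
```

For example, a row with a `tags` column of `"lucene, h2"` would yield a `tags` field holding two values, which is the multi-valued behaviour the thread was worried about losing.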