Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen Sat, 06 Sep 2008 06:13:55 -0700

Hi Otis,

LUCENE-1313 is realtime search.  The Ocean name should be removed from
it, but at the time I was not sure "realtime search" was the right
technical name.  I have since seen the term used elsewhere (at
Summize, the search company Twitter recently purchased, and at Bebo
and LinkedIn) and so believe it is an accepted name.  The question is,
and this is for folks like Michael McCandless, what features should it
have, what version of Lucene should it target, does it need to be in
core or contrib, and when.  I will leave those discussions to others.

The wiki site has become more or less a dumping ground for the many
components of a next generation search database system, hence the name
Ocean Realtime Search.  I prefer to work at the non-linear system
level rather than at the class component level and the documentation
reflects this.  I believe there is no comparable solution to Google's
GData in open source.  In that regard Ocean is more like Nutch in that
it solves a common problem (Nutch solves web indexing, Ocean solves
realtime search databases, and they are both based more or less on
paths Google paved).  Nutch also works above the Lucene level, just
like Ocean.  This is to minimize impact on Lucene and provide a
solution that works today rather than 1-2 years from now when
integration with SOLR and core Lucene may take place.  This simply
reflects my preference for working at the systems level and getting
the entire system working so that the Ocean system may be used in
production applications.

The feedback is helpful and I will start to divide up the
documentation into more discrete pieces like the code itself.  I found
SOLR to be incomplete as a system, at least for the system I wanted,
which is more in line with how Hadoop and Nutch operate.  Hadoop and
Nutch implement distributed objects, which makes coding much simpler
and faster, and they are designed for always-on operation and for
scaling to 1000s of servers.  In SOLR, when the master fails or the
master index is corrupted, the corrupted index is replicated to the
slaves, which causes the entire system to fail immediately (this has
happened in production).  When I tried to address these problems in
SOLR it became a coding nightmare, because the RequestHandlers and
similar components require XML, which in turn requires writing a
custom client.  In Nutch, Hadoop, and Ocean one simply writes the Java
code for the operation and it is done (minutes compared to hours or
days).

While replication is not necessary in the Lucene core realtime search
(it is not included in LUCENE-1313), it is required for the search
systems I have worked on in the past and so I addressed it in the
Ocean search database system.  This way it would not need to be bolted
on later, and perhaps require a major rewrite of the realtime search
component.  I prefer this sort of advance planning so that I do not
have to rewrite core code later, which would throw away valuable
testing and software contributed over time.  The TagIndex is another
example of something that I started on to see how it would work, then
stopped once I understood how it would fit in with the overall system.
This way, again, I do not have to go back and rewrite core code that
would then need to be retested, potentially over several months.

It is unfortunate that I cannot explain the system well enough for
folks to understand it.  It would help to go over it with someone who
does not know too much about it and who can format the documentation
in a way that is easily digested by the Lucene community.

Have a nice weekend,
Jason

On Sat, Sep 6, 2008 at 4:36 AM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Regarding real-time search and Solr, my feeling is the focus should be on 
> first adding real-time search to Lucene, and then we'll figure out how to 
> incorporate that into Solr later.
>
> I've read Jason's Wiki as well.  Actually, I had to read it a number of times 
> to understand bits and pieces of it.  I have to admit there is still some 
> fuzziness about the whole thing in my head - is "Ocean" something that 
> already works, a separate project on googlecode.com?  I think so.  If so, and 
> if you are working on getting it integrated into Lucene, would it make it 
> less confusing to just refer to it as "real-time search", so there is no 
> confusion?
>
> If this is to be initially integrated into Lucene, why are things like 
> replication, crowding/field collapsing, locallucene, name service, tag index, 
> etc. all mentioned there on the Wiki and bundled with description of how 
> real-time search works and is to be implemented?  I suppose mentioning 
> replication kind-of makes sense because the replication approach is closely 
> tied to real-time search - all query nodes need to see index changes fast.  
> But Lucene itself offers no replication mechanism, so maybe the replication 
> is something to figure out separately, say on the Solr level, later on "once 
> we get there".  I think even just the essential real-time search requires 
> substantial changes to Lucene (I remember seeing large patches in JIRA), 
> which makes it hard to digest, understand, comment on, and ultimately commit 
> (hence the lukewarm response, I think).  Bringing other non-essential 
> elements into discussion at the same time makes it more difficult to 
> process all this new stuff, at least for me.  Am I the only one who finds 
> this hard?
>
> That said, it sounds like we have some discussion going (Karl...), so I look 
> forward to understanding more! :)
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Yonik Seeley <[EMAIL PROTECTED]>
>> To: java-dev@lucene.apache.org
>> Sent: Thursday, September 4, 2008 10:13:32 AM
>> Subject: Re: Realtime Search for Social Networks Collaboration
>>
>> On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>> wrote:
>> > I also think it's got a
>> > lot of things now which makes integration difficult to do properly.
>>
>> I agree, and that's why the major bump in version number rather than
>> minor - we recognize that some features will need some amount of
>> rearchitecture.
>>
>> > I think the problem with integration with SOLR is it was designed with
>> > a different problem set in mind than Ocean, originally the CNET
>> > shopping application.
>>
>> That was the first use of Solr, but it actually existed before that
>> w/o any defined use other than to be a "plan B" alternative to MySQL
>> based search servers (that's actually where some of the parameter
>> names come from... the default /select URL instead of /search, the
>> "rows" parameter, etc).
>>
>> But you're right... some things like the replication strategy were
>> designed (well, borrowed from Doug to be exact) with the idea that it
>> would be OK to have slightly "stale" views of the data in the range of
>> minutes.  It just made things easier/possible at the time.  But tons
>> of Solr and Lucene users want almost instantaneous visibility of added
>> documents, if they can get it.  It's hardly restricted to social
>> network applications.
>>
>> Bottom line is that Solr aims to be a general enterprise search
>> platform, and getting as real-time as we can get, and as scalable as
>> we can get are some of the top priorities going forward.
>>
>> -Yonik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
