Re: Realtime Search

2009-01-05 Thread Doug Cutting
Andrzej Bialecki wrote: No matter whether you are right or wrong, please keep a civil tone on this public forum. +1 Ad-hominem remarks are anti-community. Doug

Re: Realtime Search

2009-01-05 Thread Doug Cutting
Robert Engels wrote: Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is des

Re: Realtime Search

2009-01-05 Thread robert engels
Then your comments are misdirected. On Jan 5, 2009, at 1:19 PM, Doug Cutting wrote: Robert Engels wrote: Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) p

Re: Realtime Search

2009-01-05 Thread Jason Rutherglen
+1 Agreed, the initial version should use RAMDirectory in order to keep things simple and to benchmark against other MemoryIndex like index representations. On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting wrote: > Michael McCandless wrote: > >> So then I think we should start with approach #2 (bu

Re: Realtime Search

2009-01-08 Thread Jason Rutherglen
Based on our discussions, it seems best to get realtime search going in small steps. Below are some possible steps to take. Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock Patch #2: Implement a realtime ram index class Patch #3: Implement

Re: Realtime Search

2009-01-08 Thread John Wang
This is the way MS Access worked, and > everyone that wanted performance needed to move to SQL server for the server > model. > > > -Original Message- > >From: Marvin Humphrey > >Sent: Dec 26, 2008 12:53 PM > >To: java-dev@lucene.apache.org > >Su

Re: Realtime Search

2009-01-09 Thread Michael McCandless
Jason Rutherglen wrote: Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock I tentatively like this approach so far... That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments & deletions that have been f

Re: Realtime Search

2009-01-09 Thread Michael McCandless
Marvin Humphrey wrote: > The goal is to improve worst-case write performance. > ... > In between the time when the background merge writer starts up and the time > it finishes consolidating segment data, we assume that the primary writer > will have modified the index. > > * New docs have bee

Re: Realtime Search

2009-01-09 Thread Jason Rutherglen
M.M.: "That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments & deletions that have been flushed but not committed. It's allowed to do its own deletions & norms updating. When reopen() is called, it grabs the writers SegmentInfos again." Are you referring to the

Re: Realtime Search

2009-01-09 Thread Michael McCandless
Jason Rutherglen wrote: > Are you referring to the IW.pendingCommit SegmentInfos variable? No, I'm referring to segmentInfos. (pendingCommit is the "snapshot" of segmentInfos taken when committing...). > When you say "flushed" you are referring to the IW.prepareCommit method? No, I'm referrin

Re: Realtime Search

2009-01-09 Thread Grant Ingersoll
On Jan 9, 2009, at 8:39 AM, Michael McCandless wrote: Jason Rutherglen wrote: Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock I tentatively like this approach so far... That reader is opened using IndexWriter's SegmentInfos insta

Re: Realtime Search

2009-01-09 Thread Michael McCandless
Grant Ingersoll wrote: We've spent a lot of time up until now getting write functionality out of the Reader, and now we are going to add read functionality into the Writer? Well... we're not really adding read functionality into IW; instead, we are asking IW to open the reader for us, exce

Re: Realtime Search

2009-01-09 Thread Grant Ingersoll
I realize we aren't adding read functionality to the Writer, but it would be coupling the Writer to the Reader nonetheless. I understand it is brainstorming (like I said, not trying to distract from the discussion), just saying that if the Reader and the Writer both need access to the unde

Re: Realtime Search

2009-01-09 Thread Jason Rutherglen
> "But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure..." Can't IW

Re: Realtime Search

2009-01-09 Thread Jason Rutherglen
I think the IW integrated IR needs a rule regarding the behavior of IW.flush and IR.flush. There will need to be a flush lock that is shared between the IW and IR. The lock is acquired at the beginning of a flush and released immediately after a successful or unsuccessful call. We will need to shar
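
The shared flush lock described above can be sketched in plain Java. This is only an illustration, assuming a single `ReentrantLock` held for the duration of a flush; `FlushLockSketch` and its methods are hypothetical names and are not Lucene classes or part of any patch in this thread.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the writer/reader pair; not a Lucene class.
class FlushLockSketch {
    // One lock shared by the writer-side and reader-side flush paths.
    private final ReentrantLock flushLock = new ReentrantLock();
    private int completedFlushes = 0;

    // Acquires the lock at the start of the flush and releases it in
    // finally, so a failed flush releases the lock as well.
    void flush(Runnable flushAction) {
        flushLock.lock();
        try {
            flushAction.run();
            completedFlushes++;
        } finally {
            flushLock.unlock();
        }
    }

    boolean isFlushing() {
        return flushLock.isLocked();
    }

    int completedFlushes() {
        return completedFlushes;
    }
}
```

Both the writer and the reader would call `flush` on the same instance, so only one flush runs at a time regardless of which side started it.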

Re: Realtime Search

2009-01-12 Thread Jason Rutherglen
"Patch #2: Implement a realtime ram index class I think this one is optional, or, rather an optimazation that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory." John, Zoie could be of use for this patch. In addition, we may want to impleme

Re: Realtime Search

2009-01-12 Thread Jason Rutherglen
Grant, Do you have a proposal in mind? It would help to suggest something like some classes and methods to help understand an alternative to what is being discussed. -J On Fri, Jan 9, 2009 at 12:05 PM, Grant Ingersoll wrote: > I realize we aren't adding read functionality to the Writer, but it

Re: Realtime Search

2009-01-12 Thread Grant Ingersoll
Just thinking out loud... haven't looked at your patch yet (one of these days I will be back up for air) My initial thought is that you would have a factory that produced both the Reader and the Writer as a pair, or was at least aware of what to go get from the Writer Something like: cl

Re: Realtime Search

2009-01-24 Thread Michael McCandless
Jason Rutherglen wrote: > "But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via I

Re: Realtime Search

2009-01-29 Thread Jason Rutherglen
> We'd also need to ensure when a merge kicks off, the SegmentReaders used by the merging are not newly reopened but also "borrowed" from The IW merge code currently opens the SegmentReader with a 4096 buffer size (different than the 1024 default), how will this case be handled? > reopen would th

Re: Realtime Search

2009-01-30 Thread Michael McCandless
Jason Rutherglen wrote: > > We'd also need to ensure when a merge kicks off, the SegmentReaders > > used by the merging are not newly reopened but also "borrowed" from > > The IW merge code currently opens the SegmentReader with a 4096 > buffer size (different than the 1024 default), how will thi

Re: Realtime Search

2009-01-30 Thread Jason Rutherglen
> deletes made through reader (by docID) are immediately visible, but through writer are buffered until a flush or reopen? This is what I was thinking, IW buffers deletes, IR does not. Making IW.deletes visible immediately by applying them to the IR makes sense as well. What should be the behavio

Re: Realtime Search

2008-12-23 Thread Marvin Humphrey
On Tue, Dec 23, 2008 at 05:51:43PM -0800, Jason Rutherglen wrote: > Are there other implementation options? Here's the plan for Lucy/KS: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible l

Re: Realtime Search

2008-12-23 Thread robert engels
Is there something that I am missing? I see lots of references to using "memory mapped" files to "dramatically" improve performance. I don't think this is the case at all. At the lowest levels, it is somewhat more efficient from a CPU standpoint, but with a decent OS cache the IO performanc

Re: Realtime Search

2008-12-23 Thread Marvin Humphrey
On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: > Is there something that I am missing? Yes. > I see lots of references to using "memory mapped" files to "dramatically" > improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458

Re: Realtime Search

2008-12-23 Thread robert engels
Seems doubtful you will be able to do this without increasing the index size dramatically. Since it will need to be stored "unpacked" (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it

Re: Realtime Search

2008-12-23 Thread robert engels
Also, if you are thinking that accessing the "buffer" directly will be faster than "parsing" the packed structure, I'm not so sure. You can review the source for the various buffers, and since there is no "struct" support in Java, you end up combining bytes to make longs, etc. Also, a lot of

Re: Realtime Search

2008-12-24 Thread robert engels
Thinking about this some more, you could use fixed length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - sinc

Re: Realtime Search

2008-12-24 Thread Paul Elschot
Op Wednesday 24 December 2008 17:51:04 schreef robert engels: > Thinking about this some more, you could use fixed length pages for > the term index, with a page header containing a count of entries, and > use key compression (to avoid the constant entry size). > > The problem with this is that yo

Re: Realtime Search

2008-12-24 Thread Doug Cutting
Jason Rutherglen wrote: 2) Implement realtime search by incrementally creating and merging readers in memory. The system would use MemoryIndex or InstantiatedIndex to quickly (more quickly than RAMDirectory) create indexes from added documents. As a baseline, how fast is it to simply use RAM

Re: Realtime Search

2008-12-24 Thread robert engels
As I pointed out in another email, I understand the benefits of compression (compressed disks vs. uncompressed, etc.). PFOR is definitely a winner ! As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index

Re: Realtime Search

2008-12-24 Thread Jason Rutherglen
> Also, what are the requirements? Must a document be visible to search within 10ms of being added? 0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDirectories if they are of sufficient size. > Or must it be visibl

Re: Realtime Search

2008-12-24 Thread robert engels
On Dec 24, 2008, at 12:23 PM, Jason Rutherglen wrote: > Also, what are the requirements? Must a document be visible to search within 10ms of being added? 0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDir

Re: Realtime Search

2008-12-24 Thread Marvin Humphrey
On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote: > Seems doubtful you will be able to do this without increasing the > index size dramatically. Since it will need to be stored > "unpacked" (in order to have random access), yet the terms are > variable length - leading to using a

Re: Realtime Search

2008-12-24 Thread Marvin Humphrey
On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: > As I understood this discussion though, it was an attempt to remove > the in memory 'skip to' index, to avoid the reading of this during > index open/reopen. No. That idea was entertained briefly and quickly discarded. There se

Re: Realtime Search

2008-12-25 Thread Michael McCandless
I think the necessary low-level changes to Lucene for real-time are actually already well underway... The biggest barrier is how we now ask for FieldCache values at the Multi*Reader level. This makes reopen cost catastrophic for a large index. Once we succeed in making FieldCache usage within Luc

Re: Realtime Search

2008-12-26 Thread Michael McCandless
Marvin Humphrey wrote: > 4) Allow 2 concurrent writers: one for small, fast updates, and one for > big background merges. Marvin can you describe more detail here? It sounds like this is your solution for "decoupling" segments changes due to merges from changes from docs being indexed, fro

Re: Realtime Search

2008-12-26 Thread Robert Engels
1:31 PM >To: java-dev@lucene.apache.org >Subject: Re: Realtime Search > >On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: >> As I understood this discussion though, it was an attempt to remove >> the in memory 'skip to' index, to avoid the reading of th

Re: Realtime Search

2008-12-26 Thread Robert Engels
to be significantly smaller (improving the write time, and the cache efficiency). -Original Message- >From: Robert Engels >Sent: Dec 26, 2008 11:30 AM >To: java-dev@lucene.apache.org, java-dev@lucene.apache.org >Subject: Re: Realtime Search > >That could very well be, but

Re: Realtime Search

2008-12-26 Thread Doug Cutting
Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader

Re: Realtime Search

2008-12-26 Thread J. Delgado
The addition of docs into tiny segments using the current data structures seems the right way to go. Sometime back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM based) "short-term" index that auto-merged into a disk-based "long term" index that eve
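
A rough sketch of the short-term/long-term arrangement described above, in plain Java. The snippet is cut off before it says what triggered the merge, so the size threshold here is an assumption, and the two lists merely stand in for the RAM-based and disk-based indexes. `TwoTierIndexSketch` is a hypothetical name, not code from the implementation being described.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-tier index; plain lists stand in for real indexes.
class TwoTierIndexSketch {
    private final List<String> shortTerm = new ArrayList<>(); // in-memory tier
    private final List<String> longTerm = new ArrayList<>();  // stands in for the disk tier
    private final int mergeThreshold; // assumed merge trigger, not from the thread

    TwoTierIndexSketch(int mergeThreshold) {
        this.mergeThreshold = mergeThreshold;
    }

    // New documents always land in the short-term tier first and are
    // moved to the long-term tier once the threshold is reached.
    void add(String doc) {
        shortTerm.add(doc);
        if (shortTerm.size() >= mergeThreshold) {
            longTerm.addAll(shortTerm);
            shortTerm.clear();
        }
    }

    // Searches both tiers, as a MultiSearcher-style wrapper would.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : longTerm) {
            if (doc.contains(term)) hits.add(doc);
        }
        for (String doc : shortTerm) {
            if (doc.contains(term)) hits.add(doc);
        }
        return hits;
    }

    int shortTermSize() { return shortTerm.size(); }

    int longTermSize() { return longTerm.size(); }
}
```

A document is searchable as soon as `add` returns, whichever tier it currently sits in.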

Re: Realtime Search

2008-12-26 Thread Marvin Humphrey
On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: > > 4) Allow 2 concurrent writers: one for small, fast updates, and one for > > big background merges. > > Marvin can you describe more detail here? The goal is to improve worst-case write performance. Currently, writes

Re: Realtime Search

2008-12-26 Thread J. Delgado
One thing that I forgot to mention is that in our implementation the real-time indexing took place with many "folder-based" listeners writing to many tiny in-memory indexes partitioned by "sub-sources" with fewer long-term and archive indexes per box. Overall distributed search across various luc

Re: Realtime Search

2008-12-26 Thread Robert Engels
needed), and works well for many dbs (i.e. derby) -Original Message- >From: Doug Cutting >Sent: Dec 26, 2008 12:20 PM >To: java-dev@lucene.apache.org >Subject: Re: Realtime Search > >Michael McCandless wrote: >> So then I think we should start with approach

Re: Realtime Search

2008-12-26 Thread Robert Engels
way MS Access worked, and everyone that wanted performance needed to move to SQL server for the server model. -Original Message- >From: Marvin Humphrey >Sent: Dec 26, 2008 12:53 PM >To: java-dev@lucene.apache.org >Subject: Re: Realtime Search > >On Fri, Dec 26, 2008 at

Re: Realtime Search

2008-12-26 Thread Robert Engels
8 2:34 PM >To: java-dev@lucene.apache.org >Subject: Re: Realtime Search > >If you move to the "either embedded, or server model", the post reopen is >trivial, as the structures can be created as the segment is written. > >It is the networked shared access model that

Re: Realtime Search

2008-12-26 Thread Marvin Humphrey
Robert, Three exchanges ago in this thread, you made the incorrect assumption that the motivation behind using mmap was read speed, and that memory mapping was being waved around as some sort of magic wand: Is there something that I am missing? I see lots of references to using "memory ma

Re: Realtime Search

2008-12-26 Thread Robert Engels
ey >Sent: Dec 26, 2008 3:53 PM >To: java-dev@lucene.apache.org, Robert Engels >Subject: Re: Realtime Search > >Robert, > >Three exchanges ago in this thread, you made the incorrect assumption that the >motivation behind using mmap was read speed, and that memory mapping was bein

Re: Realtime Search

2008-12-26 Thread Andrzej Bialecki
Robert Engels wrote: You are full of **beep** *beep* ... No matter whether you are right or wrong, please keep a civil tone on this public forum. We are professionals here, so let's discuss and disagree if we must - but in a professional and grown-up way. Thank you. -- Best regards, Andrze

Re: Realtime Search for Social Networks Collaboration

2008-09-03 Thread Yonik Seeley
On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > I am wondering > if there are social networks (or anyone else) out there who would be > interested in collaborating with Apache on realtime search to get it > to the point it can be used in production. Good timing Jason,

Re: Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
Hi Yonik, The SOLR 2 list looks good. The question is, who is going to do the work? I tried to simplify the scope of Ocean as much as possible to make it possible (and slowly at that over time) for me to eventually finish what is mentioned on the wiki. I think SOLR is very cool and was major

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Yonik Seeley
On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > I also think it's got a > lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some am

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Jason Rutherglen
Hi Yonik, I found the basic integration with SOLR and Ocean to be fairly straightforward, the https://issues.apache.org/jira/browse/SOLR-567 patch is key to that. SOLR just needs an optimistic concurrency update handler and most of the functionality would work. I guess the problem would be, remo

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Otis Gospodnetic
iscussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Yonik Seeley <[EMAIL PROTECTED]> > To: java-dev@lucene.apache.org > Sent: Thursday, September 4, 2008 10

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Jason Rutherglen
o understanding more! :) > > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: Yonik Seeley <[EMAIL PROTECTED]> >> To: java-dev@lucene.apache.org >> Sent: Thursday, September 4, 2008 1

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Yonik Seeley
There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Jason Rutherglen
Hi Yonik, I fully agree with "good for projects in the long term". I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects, that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble wo

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Shalin Shekhar Mangar
Hi Jason, I think this is a misunderstanding. I only want to add these features incrementally so that users can use them as soon as possible, rather than delay them to a later release by re-architecting (which may take more time and shift our focus from our users). The features are more important

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Grant Ingersoll
On Sep 6, 2008, at 4:36 AM, Otis Gospodnetic wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. I've read Jason's Wiki as well. Actually, I had to read

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Paul Elschot
Op Saturday 06 September 2008 18:53:39 schreef Shalin Shekhar Mangar: ... > > The features are more important than the code but it will of course > help a lot too. I think a good starting point for us (Lucene/Solr > folks) would be to study Ocean's source and any documentation that > you can provid

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Jason Rutherglen
Hello Shalin, When I tried to integrate before it seemed fairly simple. However the Ocean core code wasn't quite up to par yet so that needed work. It will help to work with SOLR people directly who can figure how they want to integrate such as yourself. Right now I'm finishing up the OceanData

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Jason Rutherglen
Hi Grant, I think the way to integrate with SOLR and Lucene is if people who are committers to the respective projects work with me (if they want) on the integration which will make it fairly straightforward as it was designed and intended to be. Cheers, Jason On Sat, Sep 6, 2008 at 3:16 PM, Gra

Re: Realtime Search for Social Networks Collaboration

2008-09-06 Thread Jason Rutherglen
Hi Paul, It's unfortunate the code is larger than most contribs. The libraries can be factored out. The next patch includes OceanDatabase. The Ocean package and class names can be removed in favor of "realtime"? > - There is a whole package of logging in there, but there's no logging > in luc

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message > > From: Yonik Seeley <[EMAIL PROTECTED]> > > To: java-dev@lucene.apache.org > > Sent: Thursday, September 4, 2008 10:13:32 AM > > Subject: Re: Realtime Search for Social Net

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread mark harwood
Interesting discussion. >>I think we should seriously look at joining efforts with open-source Database >>engine projects I posted some initial dabblings here with a couple of the databases on your list: http://markmail.org/message/3bu5klzzc5i6uhl7 but this is not really a scalable solution

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[EMAIL PROTECTED]> wrote: >>for example joins are not possible using SOLR). > > It's largely *because* Lucene doesn't do joins that it can be made to scale > out. I've replaced two large-scale database systems this year with > distributed Lucene solutio

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene implementation) the three minimal features a transactional DB should support for Lucene integration are: 1) The ability to define new functions (e.g. lcontains() lscore) which would allow to bind queries to lucene and obtain docu

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread Otis Gospodnetic
Hi, - Original Message From: J. Delgado <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Sunday, September 7, 2008 4:04:58 AM Subject: Re: Realtime Search for Social Networks Collaboration On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
Hi Joaquin, Using HBase with realtime Lucene would be in line with what Google does. However the question is whether or not this is completely necessary or the most simple approach. That probably can only be answered by doing a live comparison of the two! Unfortunately that would require probab

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Ning Li
Hi, We experimented using HBase's scalable infrastructure to scale out Lucene: http://www.mail-archive.com/[EMAIL PROTECTED]/msg01143.html There is the concern on the impact of HDFS's random read performance on Lucene search performance. And we can discuss if HBase's architecture is best for scal

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Mark Miller
Ning Li wrote: I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close.. Maybe we should start fleshing out what we want in realtime search on the wiki? Could it be as simple as making Instantiated

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptabl

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Michael McCandless
I'm also trying to make time to explore the approach of creating an IndexReader impl. that searches IndexWriter's RAM buffer. I think it's quite feasible, but, it'd still have a "reopen" cost in that any buffered delete by term or query would have to be "materialized" into docIDs on reop

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Yonik Seeley
On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > I'm also trying to make time to explore the approach of creating an > IndexReader impl. that searches IndexWriter's RAM buffer. That seems like it could possibly be the best performing approach in the long run. > I t

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Karl Wettin
I need to point out that the only thing I know InstantiatedIndex to be great at is read access in the inverted index. It consumes a lot more heap than RAMDirectory and InstantiatedIndexWriter is slightly less efficient than IndexWriter. Please let me know if your experience differs from the

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Michael McCandless
Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a "reopen" cost in that any buffered delete by term or query would have to be "materialized" into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immedi

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Ning Li
On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > But, how would you maintain a static view of an index...? > > IndexReader r1 = indexWriter.getCurrentIndex() > indexWriter.addDocument(...) > IndexReader r2 = indexWriter.getCurrentIndex() > > I assume r1 will have a view of
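
The static-view question above can be pictured with a plain Java sketch of one possible behaviour, where each view is frozen at the moment it is requested. `getCurrentIndex` is the method name used in the quoted pseudocode. Everything else here is hypothetical, and copying the whole document list is only for illustration (a real reader would share immutable segments rather than copy them).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical writer whose readers are point-in-time views; not Lucene code.
class SnapshotWriterSketch {
    private final List<String> docs = new ArrayList<>();

    synchronized void addDocument(String doc) {
        docs.add(doc);
    }

    // Returns an immutable copy of the current contents. Documents added
    // later are not visible through a previously returned view.
    synchronized List<String> getCurrentIndex() {
        return Collections.unmodifiableList(new ArrayList<>(docs));
    }
}
```

With this behaviour, a view obtained before an `addDocument` call keeps its original contents, and only a view obtained afterwards includes the new document.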

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Yonik Seeley
On Mon, Sep 8, 2008 at 3:56 PM, Ning Li <[EMAIL PROTECTED]> wrote: > On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: >> But, how would you maintain a static view of an index...? >> >> IndexReader r1 = indexWriter.getCurrentIndex() >> indexWriter.addDocument(...) >> IndexRead

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
That sounds about correct and I don't think it matters much. I keep the documents by default stored in InstantiatedIndex to 100. So the heap size doesn't become a problem. On Mon, Sep 8, 2008 at 2:58 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: > I need to point out that the only thing I know Inst

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Yonik Seeley wrote: > >>> I think it's quite feasible, but, it'd still have a "reopen" cost in that >>> any buffered delete by term or query would have to be "m

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Yonik Seeley
On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Right, getCurrentIndex would return a MultiReader that includes > SegmentReader for each segment in the index, plus a "RAMReader" that > searches the RAM buffer. That RAMReader is a tiny shell class that would > basica

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
difficult to >> process all this new stuff, at least for me. Am I the only one who finds >> this hard? >> >> That said, it sounds like we have some discussion going (Karl...), so I >> look forward to understanding more! :) >> >> >> Otis >> -- >

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread J. Delgado
>> substantial changes to Lucene (I remember seeing large patches in JIRA), > >> which makes it hard to digest, understand, comment on, and ultimately > commit > >> (hence the luke warm response, I think). Bringing other non-essential > >> elements into discussion

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
ription of >> >> how real-time search works and is to be implemented? I suppose >> >> mentioning >> >> replication kind-of makes sense because the replication approach is >> >> closely >> >> tied to real-time search - all query nodes need t

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Marcelo Ochoa
Lucene, why are things like >>> >> replication, crowding/field collapsing, locallucene, name service, tag >>> >> index, etc. all mentioned there on the Wiki and bundled with >>> >> description of >>> >> how real-time search works and is

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Jason Rutherglen
, a separate project on googlecode.com? I think so. If >>>> >> so, >>>> >> and if you are working on getting it integrated into Lucene, would it >>>> >> make >>>> >> it less confusing to just refer to it as "real-time search"

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Marcelo Ochoa
ave to admit there is >>>>> >> still >>>>> >> some fuzziness about the whole things in my head - is "Ocean" something >>>>> >> that >>>>> >> already works, a separate project on googlecode.com? I think so. If

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Michael McCandless
Yonik Seeley wrote: On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a "RAMReader" that searches the RAM buffer. That RAMReader is a tiny shell class

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Michael McCandless
This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Mic

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Yonik Seeley
On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley wrote: >> What about something like term freq? Would it need to count the >> number of docs after the local maxDoc or is there a better way? > > Good question... > > I think we'd have to take a full copy o

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Ning Li
On Mon, Sep 8, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: >> I thought an index reader which supports real-time search no longer >> maintains a static view of an index? > > It seems advantageous to just make it really cheap to get a new view > of the index (if you do it for every sear

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Yonik Seeley
On Tue, Sep 9, 2008 at 11:42 AM, Ning Li <[EMAIL PROTECTED]> wrote: > On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote: >> Yeah, I think the underlying RandomAccessFile might do the right >> thing, but IndexInput isn't required to see any changes on the fly >> (and current im

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Michael McCandless
Yonik Seeley wrote: On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: Yonik Seeley wrote: What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Michael McCandless
Yonik Seeley wrote: On Tue, Sep 9, 2008 at 11:42 AM, Ning Li <[EMAIL PROTECTED]> wrote: On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Yonik Seeley
On Tue, Sep 9, 2008 at 12:41 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley wrote: >> OR, if all writes are append-only, perhaps we don't ever need to >> invalidate the read buffer and would just need to remove the current >> logic that caches the file length and then let the unde

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Yonik Seeley
On Tue, Sep 9, 2008 at 12:45 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Yonik Seeley wrote: >> No, it would essentially be a change in the semantics that all >> implementations would need to support. > > Right, which is you are allowed to open an IndexInput on a file when an > IndexOutput

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Ning Li
>>> Even so, >>> this may not be sufficient for some FS such as HDFS... Is it >>> reasonable in this case to keep in memory everything including >>> stored fields and term vectors? >> >> We could maybe do something like a proxy IndexInput/IndexOutput that >> would allow updating the read buffer fro

Re: Realtime Search for Social Networks Collaboration

2008-09-10 Thread Jason Rutherglen
Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved. Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > This would just tap into the live hashtable that DocumentsWriter* maintain > for the p

Re: Realtime Search for Social Networks Collaboration

2008-09-11 Thread Michael McCandless
Right, there would need to be a snapshot taken of all terms when IndexWriter.getReader() is called. This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the terms so TermEnum can just step through them. (We might be able to delay this sorting until the first time someth
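
The two properties of the snapshot described above (a frozen docFreq per term, and terms sorted so they can be stepped through in order) can be sketched in plain Java. This is an assumption-laden illustration: a `Map<String, Integer>` stands in for the live posting hashtable, and `TermSnapshotSketch` is a hypothetical name, not how DocumentsWriter stores terms.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical term snapshot; not Lucene's data structures.
class TermSnapshotSketch {
    // Copies the live term -> docFreq table into a sorted, read-only map.
    // Later changes to the live table do not affect the snapshot, and
    // iterating the result visits the terms in sorted order.
    static SortedMap<String, Integer> snapshot(Map<String, Integer> liveDocFreqs) {
        return Collections.unmodifiableSortedMap(new TreeMap<>(liveDocFreqs));
    }
}
```

This version sorts eagerly when the snapshot is taken. The message also suggests the sort might be delayed until first use, which this sketch does not attempt.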

Re: Realtime Search for Social Networks Collaboration

2008-09-18 Thread Jason Rutherglen
Mike, The other issue that will occur that I addressed is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently in ocean realtime search the individual readers are searched on using a MultiSearcher in order to search in parallel and
