Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-05-08 Thread Ning Li
mance improved by 60% when inserts and deletes were interleaved in small batches. (See attached file: IndexWriter.java)(See attached file: TestWriterDelete.java) Regards, Ning Ning Li Search Technologies IBM Almaden Research Center 650 Harry Roa

Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-05-08 Thread Ning Li
I will create a bug in Jira. Let me try to attach the two files here again. (See attached file: IndexWriter.changed)(See attached file: TestWriterDelete.changed) Regards, Ning

Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-05-09 Thread Ning Li
The machine is swamped with tests. I will run the experiment when the machine is free. Regards, Ning

Re: Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-05-11 Thread Ning Li
riter ---
Insert only: 116 min / 119 min / 116 min
Insert/delete (big batches): -- / 135 min / 125 min
Insert/delete (small batches): -- / 338 min / 134 min
Regards, Ning

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Ning Li
Hi Otis and Robert, I added an overview of my changes in JIRA. Hope that helps. > Anyway, my test did exercise the small batches, in that in our > incremental updates we delete the documents with the unique term, and > then add the new (which is what I assumed this was improving), and I > saw o a

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Ning Li
Hi Yonik, > When one interleaves adds and deletes, it isn't the case that > indexreaders and indexwriters need to be opened and closed each > interleave. I'm not sure I understand this. Could you elaborate? I thought IndexWriter acquires the write lock and holds it until it is done. This will pr

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Ning Li
> Even with you code changes, to see the modification made using the > IndexWriter, it must be closed, and a new IndexReader opened. That behaviour remains the same. > So a far simpler way is to get the collection of updates first, then > using opened indexreader, > for each doc in collection >

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-06 Thread Ning Li
> Yonik mentioned this in email. It does sound like a better place for > this might be in a higher level class. IndexWriter would really not > be just a writer/appender once delete functionality is added to it, > even if it's the IndexReaders behind the scenes doing the work. So > if you are goi

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-09 Thread Ning Li
To clarify, higher level (application level) adds and deletes can be managed at a lower level such that index readers and writers aren't continually opened and closed. ... The big question is, what kind of efficiencies do you get by putting this functionallity in IndexWriter vs a higher level cl

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-10 Thread Ning Li
You keep stating that you never need to close the IndexWriter. I don't believe this is the case, and you are possibly misleading people as to the extent of your patch. Don't you need to close (or flush) to get the documents on disk, so a new IndexReader can find them? If not any documents added

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-10 Thread Ning Li
Random comment... ... An alternate implementation could use a HashMap to associate term with maxSegment. ... Very well taken. :-) I won't submit a new version of the patch at this point to avoid too many versions of the patch. Thanks, Ning

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-12 Thread Ning Li
Then I submit that my proposed "BufferedWriter" is far simpler and probably performs equally as well, if not better, especially for the case where a document can be uniquely identified. Can I find the patch for this already somewhere? Does it require an explicit unique identifier understandable b

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-12 Thread Ning Li
I proposed a design of "BufferedWriter" in a previous email that would not have this limitation. It is similar to what others have suggested, which is to handle the buffering in a higher-level class and leave IndexWriter alone. Could you spell out the details, or better, submit the patch? So that we

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-12 Thread Ning Li
The current implementation makes some assumptions, such as the "unique key" is a single field, not any sort of compound key, and it doesn't allow deletes by query. That, coupled with a more complex implementation makes me wary of putting it in IndexWriter. By "current implementation", you meant

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-12 Thread Ning Li
I'm not sure I understand your question -- you mean why would one want to stick to public APIs? No, that's not what I meant. I definitely agree that we should stick to public APIs as much as we can. If it can be done in a separate class, using public APIs (or at least with a minimum of prote

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-13 Thread Ning Li
Solr's implementation is here: http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/update/DirectUpdateHandler2.java?view=markup I read it and I see which point I didn't make clear. :-) I have viewed "delete by term" (which is supported by IndexReader and NewIndexModifier)

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-07-14 Thread Ning Li
Hey, you're moving the goalposts ;-) You proposed a specific patch, and it certainly doesn't have support for delete-by-query. The patch makes IndexWriter support delete-by-term, which is what IndexReader supports. Granted, delete-by-term is not as general as delete-by-query so you don't have t

Re: [jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-02 Thread Ning Li
I rewrote IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass. NewIndexModifier is such a subclass that supports delete-by-term. Has anyone r

Re: LUCENE-528 and 565

2006-08-15 Thread Ning Li
Lucene-528 and Lucene-565 serve different purposes. One cannot replace the other. I'm totally for a version of addIndexes() where optimize() is not always called. However, with the one proposed in the patch, we could end up with an index where: segment 0 has 1000 docs, 1 has 2000, 2 has 4000, 3 h

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-22 Thread Ning Li
I tested just the IndexWriter from this code base, it does not seem to work. NewIndexModifier does work. I simply used IndexWriter to create several documents and then search for them. Nothing came back even though it seems something was written to disk. The patch worked until several days ago

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-29 Thread Ning Li
Could you elaborate? Jason Rutherglen commented on LUCENE-565: It seems this writer works, but then something mysterious happens to the index and the searcher can no longer read it. I am using this in conjunction with Solr. The index files look ok, howev

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-29 Thread Ning Li
(reopen), then perform a batch addDocuments. Then when a search is executed nothing is returned, and after an optimize the index goes down to 1K. Seems What did you set maxBufferedDocs to? If it is bigger than the number of documents you inserted, the newly added documents haven't reached disk

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-08-29 Thread Ning Li
DirectUpdateHandler2. I will create a non-Solr reproduction of the issue. I'm still not clear how you used the patch. So this will definitely help.

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-01 Thread Ning Li
I believe this patch probably also changes the merge behavior. I think we need to discuss what exactly the new merge behavior is, if it's OK, what we think the index invariants should be (no more than x segments of y size, etc), and I'd like to see some code to test those invariants. Yes, the pa

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-05 Thread Ning Li
What about an invariant that says the number of main index segments with the same level (f(n)) should be less than M? That is exactly what the second property says: "Less than M number of segments whose doc count n satisfies B*(M^c) <= n < B*(M^(c+1)) for any c >= 0." In other words, less than

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
> "Less than M number of segments whose doc count n satisfies B*(M^c) <= > n < B*(M^(c+1)) for any c >= 0." > In other words, less than M number of segments with the same f(n). Ah, I had missed that. But I don't believe that lucene currently obeys this in all cases. I think it does hold for n

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
So, I *think* most of our hypothetical problems go away with a simple adjustment to f(n): f(n) = floor(log_M((n-1)/B)) Correct. And nice. :-) Equivalently, f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)). So f(n) = 0 means n <= B.
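The bucketing behind these formulas can be spelled out in a few lines of Java. This is only an illustrative recomputation of the ceil variant discussed above, using integer arithmetic instead of a floating-point log; the class and method names are ours, not Lucene's:

```java
// Level function from the discussion: f(n) = ceil(log_M(n / B)),
// where B = maxBufferedDocs and M = mergeFactor. Level c means
// B*M^(c-1) < n <= B*M^c; the invariant under discussion is that
// fewer than M segments share any one level.
class MergeLevels {
    static int level(long n, long B, long M) {
        int c = 0;
        long upper = B;                 // upper doc-count bound of level 0
        while (n > upper) { upper *= M; c++; }
        return c;
    }

    public static void main(String[] args) {
        long B = 10, M = 10;
        System.out.println(level(10, B, M));   // 0: n <= B
        System.out.println(level(11, B, M));   // 1: B < n <= B*M
        System.out.println(level(100, B, M));  // 1
        System.out.println(level(101, B, M));  // 2
    }
}
```

With this labeling, a freshly flushed segment of at most B docs sits at level 0, and merging M same-level segments produces a segment exactly one level up, which is what keeps the invariant maintainable.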

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
So what's left... maxMergeDocs I guess. Capping the segment size breaks the simple invariants a bit. Correct. We also need to be able to handle changes to M and maxMergeDocs between different IndexWriter sessions. When checking for a merge for Hmmm. A change of M could easily break the inva

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-06 Thread Ning Li
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: That's one way of thinking about it. There's only one "thing" though: a big bucket of serialized index entries. At the end of a session, those are sorted, pulled apart, and used to write the tis, tii, frq, and prx files. Interesting. Whe

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-09-12 Thread Ning Li
The new code does handle the case. After mergeSegments(...) in maybeMergeSegments(), there is the following code:

  numSegments -= mergeFactor;
  if (docCount > upperBound) {
    minSegment++;
    exceedsUpperLimit = true;
  } else if (docCount > 0) {

Re: Ferret's changes

2006-10-10 Thread Ning Li
On 10/10/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 10/10/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hi, > > Maybe I missed it, but I was surprised that nobody here wondered about the algorithm and data structure changes that Dave Balmain made in Ferret, to make it go faster (than Ja

Re: Ferret's changes

2006-10-11 Thread Ning Li
Actually not using single doc segments was only possible due to the fact that I have constant field numbers so both optimizations stem from this one change... Not using single doc segments can be done without constant field numbers... :-) Ning --

Re: [jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-17 Thread Ning Li
A new scorer that requires reclaiming resources could be used by many other scorers, such as boolean scorers and conjunction scorers. Then those scorers should have a closing method, and so do the ones that use those scorers... A general closing method would be better, wouldn't it?

Re: flushRamSegments possible perf improvement?

2006-10-19 Thread Ning Li
I also don't know if there are any negative performance implications of merging segments with sizes an order of magnitude apart. It should be relatively easy to test different scenarios by manipulating mergeFactor and maxBufferedDocs at the right time. I agree. In addition, it's not clear to me

Re: [jira] Commented: (LUCENE-690) LazyField use of IndexInput not thread safe

2006-10-19 Thread Ning Li
What makes, for example, FSIndexInput and its clones thread-safe is the following. That is, the method is synchronized on the file object.

  protected void readInternal(byte[] b, int offset, int len) throws IOException {
    synchronized (file) {
      long position = getFilePointer();
      i
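The idiom quoted above can be exercised with a plain RandomAccessFile. The sketch below is not Lucene code; it just reproduces the pattern under discussion: clones share one underlying file, each keeps its own position, and only the seek+read pair is guarded by the lock, while per-clone state is touched outside it:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Not Lucene code: a sketch of the idiom quoted above. Clones share one
// RandomAccessFile; the seek+read pair is atomic via synchronized(file).
class SharedFileDemo {
    static class Clone {
        private final RandomAccessFile file; // shared across clones
        private long position;               // private per-clone state

        Clone(RandomAccessFile file, long position) {
            this.file = file;
            this.position = position;
        }

        byte[] read(int len) throws IOException {
            byte[] b = new byte[len];
            synchronized (file) {            // protects the shared file pointer
                file.seek(position);
                file.readFully(b);
            }
            position += len; // outside the lock: safe only if one thread uses this clone
            return b;
        }
    }

    // Two threads, two clones, one file of bytes 0..255: each thread
    // checks that it reads exactly its own byte range.
    static boolean run() throws Exception {
        File f = File.createTempFile("shared", ".bin");
        f.deleteOnExit();
        final boolean[] ok = {true, true};
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            for (int i = 0; i < 256; i++) raf.write(i);
            Clone a = new Clone(raf, 0);
            Clone b = new Clone(raf, 100);
            Thread t1 = new Thread(() -> ok[0] = readRange(a, 0));
            Thread t2 = new Thread(() -> ok[1] = readRange(b, 100));
            t1.start(); t2.start(); t1.join(); t2.join();
        }
        return ok[0] && ok[1];
    }

    private static boolean readRange(Clone c, int start) {
        try {
            for (int off = start; off < start + 100; off += 10) {
                byte[] chunk = c.read(10);
                for (int j = 0; j < 10; j++)
                    if ((chunk[j] & 0xFF) != off + j) return false;
            }
            return true;
        } catch (IOException e) { return false; }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run() ? "reads consistent" : "reads corrupted");
    }
}
```

Both threads read their own byte ranges correctly even though they share a single file handle, which is exactly what the synchronized seek+read pair buys; the unguarded per-clone position is why a single clone is still not thread-safe, as the follow-up message points out.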

Re: [jira] Commented: (LUCENE-690) LazyField use of IndexInput not thread safe

2006-10-19 Thread Ning Li
I don't think that's sufficient in part because the IndexInput's state is manipulated outside that sync block. The sync block is to protect the file only, not the IndexInput, which isn't thread-safe (by design). Correct, that sync block only protects the file. It and the rest of FSIndexInput

Re: flushRamSegments possible perf improvement?

2006-10-19 Thread Ning Li
There is, however, an opportunity of reducing number merges for disk segments. Assume maxBufferedDocs is 10 and mergeFactor is 3. Assume the segment sizes = 90, 30, 30, 10, 10. When a new disk segment of 10 is added, two merges are triggered. First, 3 segments of size 10 are merged and the segmen
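That cascade is easy to replay with a toy model. The sketch below is a deliberate simplification of the real merge policy (it only ever merges the trailing run of equal-sized segments), with names of our own choosing throughout:

```java
import java.util.ArrayList;
import java.util.List;

// Toy replay of the cascade described above (mergeFactor = 3): whenever
// the trailing mergeFactor segments have equal size, they merge, and the
// merged segment may in turn trigger a further merge.
class MergeCascade {
    static int addAndMerge(List<Integer> segs, int newSeg, int mergeFactor) {
        segs.add(newSeg);
        int merges = 0;
        while (segs.size() >= mergeFactor && trailingEqual(segs, mergeFactor)) {
            int sum = 0;
            for (int i = 0; i < mergeFactor; i++)
                sum += segs.remove(segs.size() - 1);
            segs.add(sum);          // merged segment replaces the run
            merges++;
        }
        return merges;
    }

    private static boolean trailingEqual(List<Integer> segs, int k) {
        int last = segs.get(segs.size() - 1);
        for (int i = 2; i <= k; i++)
            if (segs.get(segs.size() - i) != last) return false;
        return true;
    }

    public static void main(String[] args) {
        List<Integer> segs = new ArrayList<>(List.of(90, 30, 30, 10, 10));
        int merges = addAndMerge(segs, 10, 3);
        System.out.println(merges + " merges -> " + segs); // 2 merges -> [90, 90]
    }
}
```

Starting from sizes 90, 30, 30, 10, 10 and adding a 10-doc segment reproduces exactly the two merges described: the three 10s collapse into a 30, which then triggers a merge of three 30s.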

Re: flushRamSegments possible perf improvement?

2006-10-19 Thread Ning Li
This is exactly what I mean - when flushing the ram segments I compute in advance if this (merge) would be followed immediately by an additional merge, and if so, I just do these two merges in one step. This has shown I see. One difference, however, is that I would keep flushing ram segments to

Re: [jira] Commented: (LUCENE-555) Index Corruption

2006-10-27 Thread Ning Li
It's only upon successfully writing the new segments that Lucene will write a new "segments" file referring to the new segments. After that, it removes the old segments. Since it makes these changes in the correct order, it should be the case that disk full exception never affects the already

Re: [jira] Resolved: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-22 Thread Ning Li
I was away so I'm catching up. If this (occasional large documents consume too much memory) happens to a few applications, should it be solved in IndexWriter? A possible design could be: First, in addDocument(), compute the byte size of a ram segment after the ram segment is created. In the sync

Re: [jira] Resolved: (LUCENE-709) [PATCH] Enable application-level management of IndexWriter.ramDirectory size

2006-11-22 Thread Ning Li
There is a flaw in this approach as you exceed the threshold before flushing. With very large documents, that can cause an OOM. This is a good point. I agree that it would be better to do this in IndexWriter, but more machinery would be needed. Lucene would need to estimate the size of the n
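The overshoot being discussed is easy to see in a toy model. Below, the "naive" accounting buffers a document and only then checks the threshold, so its peak RAM can exceed the limit by up to a whole document; the "estimating" variant flushes first when the incoming document would cross the limit. All names and the byte accounting are illustrative, not Lucene's:

```java
// Toy model of the two buffering policies discussed above.
class RamFlushPolicy {
    // Check the threshold only after buffering: peak can overshoot.
    static long naivePeak(long[] docSizes, long threshold) {
        long used = 0, peak = 0;
        for (long size : docSizes) {
            used += size;                   // buffer the document first
            peak = Math.max(peak, used);
            if (used > threshold) used = 0; // ...then notice and flush
        }
        return peak;
    }

    // Flush *before* adding a document that would cross the threshold.
    static long estimatingPeak(long[] docSizes, long threshold) {
        long used = 0, peak = 0;
        for (long size : docSizes) {
            if (used > 0 && used + size > threshold) used = 0; // flush first
            used += size;
            peak = Math.max(peak, used);
        }
        return peak;
    }

    public static void main(String[] args) {
        long[] docs = {10, 10, 95, 10};   // one very large document
        System.out.println(naivePeak(docs, 100));      // 115: overshoots
        System.out.println(estimatingPeak(docs, 100)); // 95: stays under
    }
}
```

Note that even the estimating variant peaks at the size of the largest single document, so a document bigger than the threshold itself would still need special handling.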

Re: Efficiently expunging deletions of recently added documents

2006-12-05 Thread Ning Li
I'd like to open up the API to mergeSegments() in IndexWriter and am wondering if there are potential problems with this. I'm worried that opening up mergeSegments() could easily break the invariants currently guaranteed by the new merge policy(http://issues.apache.org/jira/browse/LUCENE-672).

Re: Efficiently expunging deletions of recently added documents

2006-12-05 Thread Ning Li
addIndexesNoOptimize() (http://issues.apache.org/jira/browse/LUCENE-528 Optimization for IndexWriter.addIndexes()) would solve the problem. Ning On 12/5/06, Ning Li <[EMAIL PROTECTED]> wrote: > I'd like to open up the API to mergeSegments() in IndexWriter and am > wonde

Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-12-12 Thread Ning Li
Thanks for the comments Yonik! To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge() I'm not sure I get this. Buffered deletes are only applied(flushed) d

Re: Payloads

2006-12-21 Thread Ning Li
1. Make the index format extensible by adding user-implementable reader and writer interfaces for postings. ... Here's a very rough, sketchy, first draft of a type (1) proposal. Nice! In approach 1, what is the best abstraction of a flexible index format for Lucene? The draft proposal seems to

Re: Payloads

2006-12-22 Thread Ning Li
On 12/22/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: Precision would be enhanced if boolean scoring took position into account, and could be further enhanced if each position were assigned a boost. For that purpose, having everything in one file is an advantage, as it cuts down disk seeks. T

Re: Payloads

2006-12-22 Thread Ning Li
On 12/22/06, Doug Cutting <[EMAIL PROTECTED]> wrote: Ning Li wrote: > The draft proposal seems to suggest the following (roughly): > A dictionary entry is <Term, FilePointer>. Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a FilePointer and perhaps other information (e.g., frequency data).

Re: adding "explicit commits" to Lucene?

2007-01-15 Thread Ning Li
On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrote: * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature could have a more efficient implementation (just like Solr) when autoCommit is false, because deletes don't need to be flushed until commit() is called. Whe

Re: adding "explicit commits" to Lucene?

2007-01-16 Thread Ning Li
On 1/16/07, Michael McCandless <[EMAIL PROTECTED]> wrote: Good catch Ning! And, I agree, when a reader plans to make modifications to the index, I think the best solution is to require that the reader has opened most recent "segments*_N" (be that a snapshot or a checkpoint). Really a reader is

Re: adding "explicit commits" to Lucene?

2007-01-16 Thread Ning Li
On 1/16/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 1/15/07, Chuck Williams <[EMAIL PROTECTED]> wrote: > (Side thought: I've been wondering how hard it would > be to make merging not a critical section). It would be very nice if segment merging didn't block the addition of new documents... i

Re: adding "explicit commits" to Lucene?

2007-01-17 Thread Ning Li
On 1/17/07, Michael McCandless <[EMAIL PROTECTED]> wrote: robert engels wrote: > Under this new scenario, what is the result of this: > > I open the IndexWriter. > > I delete all documents with Term A. > I add a new document with Term A. > I delete all documents with Term A. > > Is the new docume

Re: NewIndexModifier - - - DeletingIndexWriter

2007-02-10 Thread Ning Li
On 2/9/07, Michael McCandless <[EMAIL PROTECTED]> wrote: I agree w/ Hoss: the way NewIndexModifier works, if you don't do any deletes then there's no added cost (well, only some if statements) to the "addDocument only" case because no readers are opened during the flush when there are no deletes.

Concurrent merge

2007-02-20 Thread Ning Li
I think it's possible for another version of IndexWriter to have a concurrent merge thread so that disk segments could be merged while documents are being added or deleted. This would be beneficial not only because it will improve indexing performance when there are enough system resources, but m

Re: IndexWriter#deleteDocuments

2007-02-21 Thread Ning Li
On 2/20/07, karl wettin <[EMAIL PROTECTED]> wrote: Could the reader per segment be replaced by one single MultiReader created by the original indexDeleterFactory()? Or are the segments partially the RAMDirectory of the writer, partially the persistent index? All segments are disk segments. Howe

Re: Concurrent merge

2007-02-21 Thread Ning Li
I agree that the current blocking model works for some applications, especially if the indexes are batch built. But other applications, e.g. with online indexes, would greatly benefit from a non-blocking model. Most systems that merge data support background merges. As long as we keep it simple (

Re: [jira] Created: (LUCENE-808) bufferDeleteTerm in IndexWriter might flush prematurely

2007-02-21 Thread Ning Li
The code correctly reflects its designed semantics: numBufferedDeleteTerms is a simple sum of terms passed to updateDocument or deleteDocuments. If the first of two successive calls with the same term should be considered a no-op when no docs were added in between, shouldn't the first also be considere

Re: [jira] Commented: (LUCENE-808) bufferDeleteTerm in IndexWriter might flush prematurely

2007-02-21 Thread Ning Li
On 2/21/07, Doron Cohen (JIRA) <[EMAIL PROTECTED]> wrote: Imagine the application and Lucene could talk, with the current implementation we could hear something like this: ... However, there could be multiple threads updating the same index. For example, thread 1 deletes the term "id:5" twice,

Re: Concurrent merge

2007-03-02 Thread Ning Li
Many good points! Thanks, guys! When background merge is employed, document additions can out-pace merging, no matter how many background merge threads are used. Blocking has to happen at some point. So, if we do anything, we make it simple. I agree with what Robert and Yonik have proposed: docu
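A minimal sketch of that model, assuming nothing about Lucene internals: one thread adds segments while a background thread merges them, and the adder blocks (back-pressure) once too many segments accumulate:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the model agreed on above: additions proceed on one
// thread while a background thread merges segments; when additions
// out-pace merging, the adder blocks until the merger catches up.
class ConcurrentMergeSketch {
    private final List<Integer> segments = new ArrayList<>(); // doc counts
    private final int maxSegments;
    private volatile boolean done = false;

    ConcurrentMergeSketch(int maxSegments) { this.maxSegments = maxSegments; }

    synchronized void addSegment(int docs) throws InterruptedException {
        while (segments.size() >= maxSegments) wait(); // back-pressure point
        segments.add(docs);
        notifyAll();
    }

    synchronized void mergeOnce() throws InterruptedException {
        while (!done && segments.size() < 2) wait();   // nothing to merge yet
        if (segments.size() >= 2) {                    // merge the last two
            int merged = segments.remove(segments.size() - 1)
                       + segments.remove(segments.size() - 1);
            segments.add(merged);
            notifyAll();
        }
    }

    synchronized int totalDocs() {
        int sum = 0;
        for (int s : segments) sum += s;
        return sum;
    }

    static int run(int nSegments) throws InterruptedException {
        ConcurrentMergeSketch idx = new ConcurrentMergeSketch(4);
        Thread merger = new Thread(() -> {
            try { while (!idx.done) idx.mergeOnce(); }
            catch (InterruptedException ignored) {}
        });
        merger.start();
        for (int i = 0; i < nSegments; i++) idx.addSegment(10);
        synchronized (idx) { idx.done = true; idx.notifyAll(); }
        merger.join();
        return idx.totalDocs();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(100)); // 1000 docs, conserved across merges
    }
}
```

However the two threads interleave, the total document count is conserved; what the cap on segment count models is exactly the point above that blocking has to happen somewhere once additions out-pace merging.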

Commit while addIndexes is in progress

2008-07-11 Thread Ning Li
Hi, Should we guard against the case when commit() is called during addIndexes? Otherwise, errors such as "file does not exist" could happen during commit. Cheers, Ning Li

Re: Commit while addIndexes is in progress

2008-07-11 Thread Ning Li
think there're similar problems with calling optimize() while addIndexes > is in progress... I think we should disallow that? Optimize waits for addIndexes to finish? I think it's useful to allow addIndexes during maybeMerge and optimize, no? Cheers, Ning Li

Re: [jira] Commented: (LUCENE-1335) Correctly handle concurrent calls to addIndexes, optimize, commit

2008-08-29 Thread Ning Li
+1 On Thu, Aug 28, 2008 at 8:19 PM, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: > >[ > https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626805#action_12626805 > ] > > Michael McCandless commen

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Ning Li
Hi, We experimented using HBase's scalable infrastructure to scale out Lucene: http://www.mail-archive.com/[EMAIL PROTECTED]/msg01143.html There is the concern on the impact of HDFS's random read performance on Lucene search performance. And we can discuss if HBase's architecture is best for scal

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread Ning Li
On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > But, how would you maintain a static view of an index...? > > IndexReader r1 = indexWriter.getCurrentIndex() > indexWriter.addDocument(...) > IndexReader r2 = indexWriter.getCurrentIndex() > > I assume r1 will have a view of
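The point-in-time semantics being debated can be illustrated with a trivial copy-on-open model, where the names are ours and the O(n) copy merely stands in for whatever cheap snapshot mechanism a real implementation would use:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration of the point-in-time semantics above: a "reader"
// obtained from the "writer" sees only the documents present when it
// was opened, no matter what is added later.
class SnapshotDemo {
    private final List<String> docs = new ArrayList<>();

    void addDocument(String doc) { docs.add(doc); }

    // Each "reader" is an immutable copy of the current doc list.
    List<String> getCurrentIndex() {
        return Collections.unmodifiableList(new ArrayList<>(docs));
    }

    public static void main(String[] args) {
        SnapshotDemo writer = new SnapshotDemo();
        writer.addDocument("doc1");
        List<String> r1 = writer.getCurrentIndex();
        writer.addDocument("doc2");
        List<String> r2 = writer.getCurrentIndex();
        System.out.println(r1.size() + " " + r2.size()); // 1 2
    }
}
```

r1 keeps its one-document view after doc2 is added, which is the behavior the pseudo-code above assumes; the open question in the thread is how to provide that view cheaply rather than by copying.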

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Ning Li
On Mon, Sep 8, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: >> I thought an index reader which supports real-time search no longer >> maintains a static view of an index? > > It seems advantageous to just make it really cheap to get a new view > of the index (if you do it for every sear

Re: Realtime Search for Social Networks Collaboration

2008-09-09 Thread Ning Li
>>> Even so, >>> this may not be sufficient for some FS such as HDFS... Is it >>> reasonable in this case to keep in memory everything including >>> stored fields and term vectors? >> >> We could maybe do something like a proxy IndexInput/IndexOutput that >> would allow updating the read buffer fro

Re: 2.4 release candidate 1

2008-09-19 Thread Ning Li
LUCENE-1335 is not listed in CHANGES.txt? It also includes a minor behavior change: "no longer allow the same Directory to be passed into addIndexes* more than once". Cheers, Ning On Thu, Sep 18, 2008 at 2:29 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Hi, > > I just created the first

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: Right I'm calling a newly created segment (ie flushed from RAM) level 0 and then a level 1 segment is created when you merge 10 level 0 segments, level 2 is created when merge 10 level 1 segments, etc. That is not how the current merge p

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote: Yes the code re-computes the level of a given segment from the current values of maxBufferedDocs & mergeFactor. But when these values have changed (or, segments were flushed by RAM not by maxBufferedDocs) then the way it computes level no

Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-03-23 Thread Ning Li
Hi Steven, I haven't read the details, but should maxBufferedDocs be exposed in some subinterfaces instead of the MergePolicy interface? Since some policies may use it and others may use byte size or something else. It's great that you've started on concurrent merge as well! I haven't got a chan

Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage" then IndexWriter may over-merge

2007-03-26 Thread Ning Li
On 3/26/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: Ahhh, this is a very good point. OK I won't deprecate "flushing by doc count" and instead will allow either "flush by RAM usage" (default to this?) or "flush by doc count". Just want to clarify: It's either "flush and merge by by

Re: [jira] Created: (LUCENE-851) Pruning

2007-03-29 Thread Ning Li
It will be great to support early termination for top-K queries within the DAAT query processing model in Lucene. There is quite some work published in related areas. http://portal.acm.org/citation.cfm?id=956944 is one of them. Am I getting it right? If a query requires top-K results, isn't it su

Re: Concurrent merge

2007-03-29 Thread Ning Li
FYI: Patch submitted in http://issues.apache.org/jira/browse/LUCENE-847. Cheers, Ning "Here is a patch for concurrent merge as discussed in: http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651 "I put it under this issue because it helps design and

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Ning Li
On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: * With term vectors and/or stored fields, the new patch has substantially better RAM efficiency. Impressive numbers! The new patch improves RAM efficiency quite a bit even with no term vectors nor stored fields, because of the

Re: [jira] Created: (LUCENE-856) Optimize segment merging

2007-04-04 Thread Ning Li
On 4/4/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: Note that for "autoCommit=false", this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, close the writer, then no

Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

2007-05-02 Thread Ning Li
On 3/31/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: Create merge policy that doesn't periodically inadvertently optimize So we could make a small change to the policy by only merging the first mergeFactor segments o

Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-05-02 Thread Ning Li
On 3/23/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote: In fact, there a few things here that are fairly subtle/important. The relationship/protocol between the writer and policy is pretty strong. This can be seen in the strawman concurrent merge code where the merge policy holds state and

Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-05-03 Thread Ning Li
Having the merge policy own segmentInfos sounds kind of hard to me. Among other things, there's a lot of code in IndexWriter for managing segmentInfos with regards to transactions. I'm pretty wary of touching that code. Is there a way around that? But conceptually, do you agree it's a good idea

Re: [jira] Created: (LUCENE-854) Create merge policy that doesn't periodically inadvertently optimize

2007-05-03 Thread Ning Li
Steve, Mike, Thanks for the explanation! I meant cascading but wrote optimizing. So it still cascades merges. It would merge based on size (not # docs), would be free to merge adjacent segments (not just rightmost segments), and would merge N (configurable) at a time. The part that's still unc

Deprecating IndexModifier

2007-08-07 Thread Ning Li
With the plan towards the 3.0 release laid out, I think it's a good time to deprecate IndexModifier and eventually remove it. The only method in IndexModifier which is not implemented in IndexWriter is "deleteDocument(int doc)". This is because of the concern that document ids are changing

Re: [jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-08 Thread Ning Li
On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210 > ] > > Steven Parkes commented on LUCENE-847: > -- > >

Re: Deprecating IndexModifier

2007-08-08 Thread Ning Li
ffered delete doc ids. I don't think it should be the reason not to support "deleteDocument(int doc)" in IndexWriter. But its impact on concurrent merge is a concern. Ning On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > +1 > > > On Aug 7, 2007, at 3:37 PM, Nin

Re: Deprecating IndexModifier

2007-08-08 Thread Ning Li
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > To make delete by docid useful, one needs a way to *get* those docids. > A callback after flush that provided a current list of readers for the > segments would serve. Interesting. That makes sense. > I think IndexWriter.deleteDocument(int doc)

Re: Deprecating IndexModifier

2007-08-08 Thread Ning Li
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On 8/8/07, Ning Li <[EMAIL PROTECTED]> wrote: > > But you still think it's worth to be included in IndexWriter, right? > > I'm not sure... (unless I'm missing some obvious use-cases). > If one could g

Re: Deprecating IndexModifier

2007-08-08 Thread Ning Li
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Let's take a simple case of deleting documents in a range, like > date:[2006 TO 2008] > One would currently need to close the writer and open a new reader to > ensure that they can "see" all the documents. Then execute a > RangeQuery, collect th
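The delete-by-range flow Yonik describes (open a reader, run the range query, collect matching docids, delete them) can be modeled with plain collections. This is an assumption-laden toy, not Lucene code; a `Doc` list stands in for the index:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "run the range query, collect ids, delete them":
// filter out documents whose date field falls inside [lo, hi].
public class RangeDeleteSketch {
    static class Doc {
        final int id, year;
        Doc(int id, int year) { this.id = id; this.year = year; }
    }

    // Returns the ids of documents that survive a delete of date:[lo TO hi].
    static List<Integer> deleteRange(List<Doc> index, int lo, int hi) {
        List<Integer> remaining = new ArrayList<>();
        for (Doc d : index)
            if (d.year < lo || d.year > hi) remaining.add(d.id);
        return remaining;
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<>();
        index.add(new Doc(0, 2005));
        index.add(new Doc(1, 2007)); // matches date:[2006 TO 2008], deleted
        index.add(new Doc(2, 2009));
        System.out.println(deleteRange(index, 2006, 2008)); // prints [0, 2]
    }
}
```

The pain point in the thread is the first step: with buffered adds, "see all the documents" currently requires flushing and reopening a reader before the query can run.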

Re: Deprecating IndexModifier

2007-08-08 Thread Ning Li
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On 8/8/07, Ning Li <[EMAIL PROTECTED]> wrote: > > This reminds me: It'd be nice if we could support delete-by-query someday. > > :) > > > > I was thinking people use deleteDocument(int docid) whe

Re: Deprecating IndexModifier

2007-08-12 Thread Ning Li
IndexWriter does everything IndexModifier does and more, except "deleteDocument(int doc)". Can we reach consensus on: 1) Should we deprecate IndexModifier before 3.0 and remove it in 3.0? 2) If so, do we have to add "deleteDocument(int doc)" to IndexWriter? We know how to support "deleteDocument(int

Re: [jira] Updated: (LUCENE-847) Factor merge policy out of IndexWriter

2007-08-27 Thread Ning Li
Hi Mike, I cannot apply the patch cleanly. MergePolicy.java, e.g., seems to be missing from the patch. On 8/24/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Re: setRAMBufferSizeMB vs. setMaxBufferedDocs

2007-09-23 Thread Ning Li
Hi Doron, > On the other, the logic of "use memory-limit unless added-docs-limit was > specified" seems somewhat confusing The design intention is to use either maxBufferedDocs/maxBufferedDeleteTerms or ramBufferSize, but not both at the same time. > (why only by pending adds, why not also by pe
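The "either maxBufferedDocs or ramBufferSize, not both" design intention can be sketched as a pair of mutually exclusive flush triggers. Field and method names below mirror the discussion but the implementation is an illustrative guess, not IndexWriter's actual logic:

```java
// Sketch of the one-trigger-at-a-time design: setting one limit disables
// the other, so flushing is driven by doc count OR by RAM, never both.
public class FlushTriggerSketch {
    static final int DISABLED = -1;
    int maxBufferedDocs = DISABLED;
    long ramBufferSizeBytes = 16L << 20; // default: flush by RAM (16 MB)

    void setMaxBufferedDocs(int n) {
        maxBufferedDocs = n;
        ramBufferSizeBytes = DISABLED;   // switch to counting docs
    }

    void setRAMBufferSizeMB(int mb) {
        ramBufferSizeBytes = (long) mb << 20;
        maxBufferedDocs = DISABLED;      // switch to tracking RAM
    }

    boolean shouldFlush(int bufferedDocs, long ramUsedBytes) {
        if (maxBufferedDocs != DISABLED) return bufferedDocs >= maxBufferedDocs;
        return ramUsedBytes >= ramBufferSizeBytes;
    }

    public static void main(String[] args) {
        FlushTriggerSketch w = new FlushTriggerSketch();
        System.out.println(w.shouldFlush(1000, 8L << 20)); // prints false (under 16 MB)
        w.setMaxBufferedDocs(100);
        System.out.println(w.shouldFlush(1000, 8L << 20)); // prints true (doc limit hit)
    }
}
```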

Re: setRAMBufferSizeMB vs. setMaxBufferedDocs

2007-09-24 Thread Ning Li
to max int MB. Ning On 9/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > "Doron Cohen" <[EMAIL PROTECTED]> wrote: > > Hi Ning, > > > > "Ning Li" <[EMAIL PROTECTED]> wrote on 24/09/2007 00:26:36: > > > > > Do y

Re: setRAMBufferSizeMB vs. setMaxBufferedDocs

2007-09-24 Thread Ning Li
On 9/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > On flushing pending deletes by RAM usage: should we just bundle this > up under "flush by RAM usage"? Ie "when total RAM usage, either from > buffered deletes, buffered docs, anything else, exceeds X then it's > time to flush"? (Instead

Re: Exceptions in TestConcurrentMergeScheduler

2007-10-03 Thread Ning Li
The cause is that in MergeThread.run(), merge in the try block is a local variable, while merge in the catch block is the class variable. Merge in the try block could be one different from the original merge, but the catch block always checks the abort flag of the original merge. -
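The bug described above is a classic variable-shadowing mistake, and it can be reproduced in miniature. The class and field names below are simplified stand-ins for `MergeThread.run()`, not the real code:

```java
// Minimal illustration of the shadowing bug: the try block works on a local
// `merge` that may differ from the original, while the catch block reads the
// field `merge`, so it checks the abort flag of the WRONG merge.
public class ShadowBugSketch {
    static class Merge { volatile boolean aborted; }

    Merge merge = new Merge(); // the original merge (field), never aborted here

    // Returns what the catch block *thinks* the failing merge's abort flag is.
    boolean runBuggy(Merge next) {
        try {
            Merge merge = next; // local variable shadows the field
            throw new RuntimeException("merge " + merge + " failed");
        } catch (RuntimeException e) {
            // BUG: this reads the field, i.e. the ORIGINAL merge's flag,
            // not the flag of the merge that actually threw.
            return this.merge.aborted;
        }
    }

    static boolean demo() {
        ShadowBugSketch s = new ShadowBugSketch();
        Merge failing = new Merge();
        failing.aborted = true;       // the failing merge WAS aborted...
        return s.runBuggy(failing);   // ...but the buggy check reports false
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints false
    }
}
```

The fix is to make the catch block consult the same merge object the try block was running, e.g. by keeping it in a variable visible to both.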

Re: lucene indexing and merge process

2007-10-18 Thread Ning Li
Make all documents have a term, say "ID:UID", and for each document, store its UID in the term's payload. You can read off this posting list to create your array. Will this work for you, John? Cheers, Ning On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote: > Forwarding this to java-dev per req
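The payload trick above (one shared term, each posting carrying that document's UID as payload) amounts to scanning a single posting list to build a docid-to-UID array. A dependency-free toy, with a sorted map standing in for the posting list:

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model: every document indexes the shared term "ID:UID", and the
// posting for each doc carries the document's UID as its payload. Scanning
// that one posting list in docid order yields a docid -> UID array.
public class UidPayloadSketch {
    // postings: docid (ascending) -> payload (the UID stored for that doc)
    static long[] buildUidArray(int maxDoc, SortedMap<Integer, Long> postings) {
        long[] uids = new long[maxDoc];
        for (Map.Entry<Integer, Long> e : postings.entrySet())
            uids[e.getKey()] = e.getValue();
        return uids;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Long> postings = new TreeMap<>();
        postings.put(0, 9001L);
        postings.put(1, 9002L);
        postings.put(2, 9003L);
        long[] uids = buildUidArray(3, postings);
        System.out.println(uids[2]); // prints 9003
    }
}
```

Because docids are per-segment and change on merge, the array would be rebuilt (or re-read) when the index is reopened, which is the trade-off John raises downthread.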

Re: lucene indexing and merge process

2007-10-18 Thread Ning Li
lt set is large. But loading it in > memory when opening index can also be slow if the index is large and updates > often. > > Thanks > > -John > > On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote: > > > > Make all documents have a term, say "ID:UID",

Re: Per-document Payloads

2007-10-30 Thread Ning Li
> That may be a little too seamless. We want the user to have specific > control over which fields are efficiently stored separately since they > will know how that field will be used. Maybe let users decide field families, like the column families in BigTable? --

Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
HDFS block. This feature may be useful for other HDFS applications (e.g., HBase). We would like to collaborate with other people who are interested in adding this feature to HDFS. Regards, Ning Li

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
I work for IBM Research. I read the Rackspace article. Rackspace's Mailtrust has a similar design. Happy to see an existing application on such a system. Do they plan to open-source it? Is the AOL project an open source project? On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote: > >

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
No. I'm curious too. :) On Feb 6, 2008 11:44 AM, J. Delgado <[EMAIL PROTECTED]> wrote: > I assume that Google also has distributed index over their > GFS/MapReduce implementation. Any idea how they achieve this? > > J.D. >

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
One main focus is to provide fault-tolerance in this distributed index system. Correct me if I'm wrong, I think SOLR-303 is focusing on merging results from multiple shards right now. We'd like to start an open source project for a fault-tolerant distributed index system (or join if one already exi

[jira] Created: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-05-08 Thread Ning Li (JIRA)
Components: Index Reporter: Ning Li Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches
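The core idea of LUCENE-565, buffering delete terms inside the writer and applying them at flush time instead of cycling IndexWriter/IndexReader per small batch, can be sketched with maps. All names here are illustrative, not the patch's actual API, and a map stands in for the on-disk index:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of buffered deletes: the writer queues delete terms alongside
// added documents and applies both at flush, so a delete-then-re-add
// "update" of the same id needs no reader open/close in between.
public class BufferedDeleteSketch {
    final Map<String, Long> live = new HashMap<>();        // stands in for the index
    private final Map<String, Long> bufferedAdds = new HashMap<>();
    private final Set<String> bufferedDeletes = new HashSet<>();

    void addDocument(String idTerm, long doc) { bufferedAdds.put(idTerm, doc); }
    void deleteDocuments(String idTerm) { bufferedDeletes.add(idTerm); }

    void flush() {
        // Apply deletes before adds so the typical update pattern
        // (delete old doc by unique term, then add the new version)
        // leaves the new version live.
        for (String t : bufferedDeletes) live.remove(t);
        live.putAll(bufferedAdds);
        bufferedDeletes.clear();
        bufferedAdds.clear();
    }

    public static void main(String[] args) {
        BufferedDeleteSketch w = new BufferedDeleteSketch();
        w.addDocument("id:1", 100);
        w.flush();
        w.deleteDocuments("id:1");   // update: delete old version...
        w.addDocument("id:1", 101);  // ...then add the new one
        w.flush();
        System.out.println(w.live.get("id:1")); // prints 101
    }
}
```

The real patch must also track which buffered deletes apply to which buffered docs; this sketch glosses over that ordering subtlety.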

[jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)

2006-05-08 Thread Ning Li (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ] Ning Li updated LUCENE-565: --- Attachment: IndexWriter.java TestWriterDelete.java > Supporting deleteDocuments in IndexWriter (Code and Performance Results > Pr
