Performance improved by 60% when inserts and deletes were interleaved in
small batches.
(See attached file: IndexWriter.java)
(See attached file: TestWriterDelete.java)
Regards,
Ning
Ning Li
Search Technologies
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
I will create a bug in Jira.
Let me try to attach the two files here again.
(See attached file: IndexWriter.changed)
(See attached file: TestWriterDelete.changed)
Regards,
Ning
Ning Li
Search Technologies
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
The machine is swamped with tests. I will run the experiment when the
machine is free.
Regards,
Ning
Ning Li
Search Technologies
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
---
(column headers truncated in the original message)
Insert only                     116 min   119 min   116 min
Insert/delete (big batches)     --        135 min   125 min
Insert/delete (small batches)   --        338 min   134 min
Regards,
Ning
Ning Li
Search Technologies
IBM Almaden Research Center
Hi Otis and Robert,
I added an overview of my changes in JIRA. Hope that helps.
> Anyway, my test did exercise the small batches, in that in our
> incremental updates we delete the documents with the unique term, and
> then add the new (which is what I assumed this was improving), and I
> saw o a
Hi Yonik,
> When one interleaves adds and deletes, it isn't the case that
> indexreaders and indexwriters need to be opened and closed each
> interleave.
I'm not sure I understand this. Could you elaborate?
I thought IndexWriter acquires the write lock and holds it until
it is done. This will prevent an IndexReader from acquiring the write
lock to perform deletes in the meantime.
> Even with your code changes, to see the modifications made using the
> IndexWriter, it must be closed, and a new IndexReader opened.
That behaviour remains the same.
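To illustrate those semantics (a sketch using the patched writer's
deleteDocuments, which plain IndexWriter does not have today):

    // Buffered adds and deletes become visible to searchers only after
    // the writer flushes/closes and a new IndexReader is opened.
    writer.deleteDocuments(new Term("id", "5")); // buffered in the writer
    writer.addDocument(newDoc);                  // also buffered
    writer.close();                              // flush to the Directory
    IndexReader reader = IndexReader.open(dir);  // only now sees the update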
> So a far simpler way is to get the collection of updates first, then
> using opened indexreader,
> for each doc in collection
>
> Yonik mentioned this in email. It does sound like a better place for
> this might be in a higher level class. IndexWriter would really not
> be just a writer/appender once delete functionality is added to it,
> even if it's the IndexReaders behind the scenes doing the work. So
> if you are goi
To clarify, higher level (application level) adds and deletes can be
managed at a lower level such that index readers and writers aren't
continually opened and closed.
...
The big question is, what kind of efficiencies do you get by putting
this functionality in IndexWriter vs a higher level class?
You keep stating that you never need to close the IndexWriter. I
don't believe this is the case, and you are possibly misleading
people as to the extent of your patch.
Don't you need to close (or flush) to get the documents on disk, so
a new IndexReader can find them? If not, any documents added
Random comment...
...
An alternate implementation could use a HashMap to associate term with
maxSegment.
...
Very well taken. :-)
I won't submit a new version of the patch at this point to avoid too
many versions of the patch.
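For reference, a minimal sketch of that alternative (the names are mine,
not from any patch):

    import java.util.HashMap;
    import java.util.Map;

    // Map each buffered delete term to the highest segment number that
    // existed when the delete was buffered; documents added to later
    // segments must survive the delete.
    Map<Term, Integer> bufferedDeletes = new HashMap<Term, Integer>();

    void bufferDeleteTerm(Term term, int maxSegment) {
      // a later buffer of the same term wins: it covers more segments
      bufferedDeletes.put(term, maxSegment);
    }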
Thanks,
Ning
---
Then I submit that my proposed "BufferedWriter" is far simpler and
probably performs equally as well, if not better, especially for the
case where a document can be uniquely identified.
Can I find the patch for this already somewhere? Does it require an
explicit unique identifier understandable by Lucene?
I proposed a design of "BufferedWriter" in a previous email that
would not have this limitation. It is similar to what others have
suggested, which is to handle the buffering in a higher-level class
and leave IndexWriter alone.
Could you spell out the details, or better, submit the patch? So that
we
The current implementation makes some assumptions, such as the "unique
key" is a single field, not any sort of compound key, and it doesn't
allow deletes by query. That, coupled with a more complex
implementation makes me wary of putting it in IndexWriter.
By "current implementation", you meant
I'm not sure I understand your question. Do you mean, why would one
want to stick to public APIs?
No, that's not what I meant. I definitely agree that we should stick
to public APIs as much as we can.
If it can be done in a separate class, using public APIs (or at least
with a minimum of prote
Solr's implementation is here:
http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/update/DirectUpdateHandler2.java?view=markup
I read it and I see which point I didn't make clear. :-)
I have viewed "delete by term" (which is supported by IndexReader and
NewIndexModifier)
Hey, you're moving the goalposts ;-)
You proposed a specific patch, and it certainly doesn't have support
for delete-by-query.
The patch makes IndexWriter support delete-by-term, which is what
IndexReader supports. Granted, delete-by-term is not as general as
delete-by-query so you don't have t
I rewrote IndexWriter in such a way that semantically it's the same as
before, but it provides extension points so that delete-by-term,
delete-by-query, and more functionality can be easily supported in a
subclass. NewIndexModifier is such a subclass that supports delete-by-term.
Has anyone r
Lucene-528 and Lucene-565 serve different purposes. One cannot replace
the other.
I'm totally for a version of addIndexes() where optimize() is not
always called. However, with the one proposed in the patch, we could
end up with an index where: segment 0 has 1000 docs, 1 has 2000, 2 has
4000, 3 has 8000, and so on.
I tested just the IndexWriter from this code base; it does not seem to work.
NewIndexModifier does work. I simply used IndexWriter to add several
documents and then searched for them. Nothing came back even though it seems
something was written to disk.
The patch worked until several days ago
Could you elaborate?
Jason Rutherglen commented on LUCENE-565:
-
It seems this writer works, but then something mysterious happens to the index and
the searcher can no longer read it. I am using this in conjunction with Solr.
The index files look ok, howev
(reopen), then perform a batch addDocuments. Then when a search is executed
nothing is returned, and after an optimize the index goes down to 1K. Seems
What did you set maxBufferedDocs to? If it is bigger than the number
of documents you inserted, the newly added documents haven't reached
disk yet.
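In other words (a sketch of the failure mode, using the 2.x setter):

    // With maxBufferedDocs larger than the number of documents added,
    // nothing reaches the Directory until the writer flushes or closes.
    writer.setMaxBufferedDocs(10000);
    for (int i = 0; i < 500; i++) {
      writer.addDocument(makeDoc(i)); // hypothetical factory; all 500 stay in RAM
    }
    // an IndexReader opened here would not see the 500 documents
    writer.close(); // now they are flushed and visible to new readers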
DirectUpdateHandler2. I will create a non-Solr reproduction of the issue.
I'm still not clear how you used the patch. So this will definitely help.
I believe this patch probably also changes the merge behavior.
I think we need to discuss what exactly the new merge behavior is, if it's OK,
what we think the index invariants should be (no more than x segments of y size,
etc), and I'd like to see some code to test those invariants.
Yes, the pa
What about an invariant that says the number of main index segments
with the same level (f(n)) should be less than M.
That is exactly what the second property says:
"Less than M number of segments whose doc count n satisfies B*(M^c) <=
n < B*(M^(c+1)) for any c >= 0."
In other words, less than M number of segments with the same f(n).
> "Less than M number of segments whose doc count n satisfies B*(M^c) <=
> n < B*(M^(c+1)) for any c >= 0."
> In other words, less than M number of segments with the same f(n).
Ah, I had missed that. But I don't believe that Lucene currently
obeys this in all cases.
I think it does hold for n
So, I *think* most of our hypothetical problems go away with a simple
adjustment to f(n):
f(n) = floor(log_M((n-1)/B))
Correct. And nice. :-)
Equivalently,
f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)).
So f(n) = 0 means n <= B.
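As a sanity check, here is that level function as a tiny helper (a
sketch; floating-point rounding caveats are ignored):

    // f(n) = ceil(log_M(n / B)); f(n) == c means B*M^(c-1) < n <= B*M^c,
    // and f(n) == 0 means n <= B.
    static int level(long n, long B, long M) {
      if (n <= B) {
        return 0;
      }
      return (int) Math.ceil(Math.log((double) n / B) / Math.log(M));
    }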
--
So what's left... maxMergeDocs I guess.
Capping the segment size breaks the simple invariants a bit.
Correct.
We also need to be able to handle changes to M and maxMergeDocs
between different IndexWriter sessions. When checking for a merge for
Hmmm. A change of M could easily break the invariants.
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
That's one way of thinking about it. There's only one "thing"
though: a big bucket of serialized index entries. At the end of a
session, those are sorted, pulled apart, and used to write the tis,
tii, frq, and prx files.
Interesting.
Whe
The new code does handle the case.
After mergeSegments(...) in maybeMergeSegments(), there is the following code:
numSegments -= mergeFactor;
if (docCount > upperBound) {
  minSegment++;
  exceedsUpperLimit = true;
} else if (docCount > 0) {
On 10/10/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
On 10/10/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Maybe I missed it, but I was surprised that nobody here wondered about the
> algorithm and data structure changes that Dave Balmain made in Ferret, to
> make it go faster (than Java Lucene).
Actually not using single doc segments was only possible due to the
fact that I have constant field numbers so both optimizations stem
from this one change...
Not using single doc segments can be done without constant field numbers... :-)
Ning
--
A new scorer that requires reclaiming resources could be used by many
other scorers such as boolean scorers and conjunction scorers. Then
those scorers should have a closing method, and so should the ones that
use those scorers... A general closing method would be better, wouldn't
it?
-
I also don't know if there are any negative performance implications
of merging segments with sizes an order of magnitude apart.
It should be relatively easy to test different scenarios by
manipulating mergeFactor and maxBufferedDocs at the right time.
I agree. In addition, it's not clear to me
What makes, for example, FSIndexInput and its clones, thread-safe is
the following.
That is, the method is synchronized on the file object.
protected void readInternal(byte[] b, int offset, int len) throws IOException {
  synchronized (file) {
    long position = getFilePointer();
    if (position != file.position) {
      file.seek(position);
      file.position = position;
    }
    // ... the bytes are then read under the same lock ...
  }
}
I don't think that's sufficient in part because the IndexInput's state is
manipulated outside that sync block. The sync block is to protect the file
only, not the IndexInput, which isn't thread-safe (by design).
Correct, that sync block only protects the file. It and the rest of
FSIndexInput
There is, however, an opportunity to reduce the number of merges for disk segments.
Assume maxBufferedDocs is 10 and mergeFactor is 3. Assume the segment
sizes = 90, 30, 30, 10, 10. When a new disk segment of 10 is added,
two merges are triggered. First, 3 segments of size 10 are merged and
the segment sizes become 90, 30, 30, 30; then 3 segments of size 30 are
merged, leaving 90, 90.
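A toy simulation of that cascade (not Lucene code; merging the last
mergeFactor equal-sized segments is a simplification of the real policy):

    import java.util.ArrayList;
    import java.util.List;

    public class MergeCascadeDemo {
      public static void main(String[] args) {
        int mergeFactor = 3;
        List<Integer> segs = new ArrayList<Integer>(
            java.util.Arrays.asList(90, 30, 30, 10, 10));
        segs.add(10); // the newly flushed disk segment
        while (segs.size() >= mergeFactor && equalTail(segs, mergeFactor)) {
          int merged = 0;
          for (int i = 0; i < mergeFactor; i++) {
            merged += segs.remove(segs.size() - 1);
          }
          segs.add(merged);
          System.out.println(segs); // [90, 30, 30, 30], then [90, 90]
        }
      }

      static boolean equalTail(List<Integer> segs, int k) {
        int last = segs.get(segs.size() - 1);
        for (int i = 2; i <= k; i++) {
          if (segs.get(segs.size() - i) != last) return false;
        }
        return true;
      }
    }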
This is exactly what I mean - when flushing the ram segments I compute in
advance if this (merge) would be followed immediately by an additional
merge, and if so, I just do these two merges in one step. This has shown
I see. One difference, however, is that I would keep flushing ram
segments to
It's only upon successfully writing the new segments that Lucene will write a new
"segments" file referring to the new segments. After that, it removes the old
segments. Since it makes these changes in the correct order, it should be the case that
a disk full exception never affects the already committed segments.
I was away so I'm catching up.
If this (occasional large documents consume too much memory) happens
to a few applications, should it be solved in IndexWriter?
A possible design could be:
First, in addDocument(), compute the byte size of a ram segment after
the ram segment is created. In the sync
There is a flaw in this approach as you exceed the threshold before
flushing. With very large documents, that can cause an OOM.
This is a good point.
I agree that it would be better to do this in IndexWriter, but more
machinery would be needed. Lucene would need to estimate the size of
the n
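A rough sketch of what that machinery might look like (field and helper
names are assumptions, not the patch):

    // Flush *before* the budget is exceeded: estimate the next document's
    // size first, avoiding the OOM flaw with very large documents.
    public synchronized void addDocument(Document doc) throws IOException {
      long estimate = estimateRamBytes(doc); // hypothetical estimator
      if (ramUsed > 0 && ramUsed + estimate > ramBufferSizeBytes) {
        flushRamSegments(); // flush what is buffered so far
        ramUsed = 0;
      }
      ramUsed += addToRamSegment(doc); // returns actual bytes consumed
    }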
I'd like to open up the API to mergeSegments() in IndexWriter and am
wondering if there are potential problems with this.
I'm worried that opening up mergeSegments() could easily break the
invariants currently guaranteed by the new merge
policy (http://issues.apache.org/jira/browse/LUCENE-672).
addIndexesNoOptimize()
(http://issues.apache.org/jira/browse/LUCENE-528 Optimization for
IndexWriter.addIndexes()) would solve the problem.
Ning
On 12/5/06, Ning Li <[EMAIL PROTECTED]> wrote:
> I'd like to open up the API to mergeSegments() in IndexWriter and am
> wonde
Thanks for the comments Yonik!
To minimize the number of reader open/closes on large persistent segments, I
think the ability to apply deletes only before a merge is important. That
might add a 4th method: doBeforeMerge()
I'm not sure I get this. Buffered deletes are only applied (flushed)
d
1. Make the index format extensible by adding user-implementable reader
and writer interfaces for postings.
...
Here's a very rough, sketchy, first draft of a type (1) proposal.
Nice!
In approach 1, what is the best abstraction of a flexible index format
for Lucene?
The draft proposal seems to
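A very rough interface sketch of what approach 1 might require (all
names hypothetical):

    import java.io.IOException;

    // User-implementable writer for a term's postings; a matching reader
    // interface would mirror these calls.
    public interface PostingsWriter {
      void startTerm(char[] termText, int length) throws IOException;
      void addPosting(int docID, int freq, int[] positions) throws IOException;
      void finishTerm() throws IOException;
      void close() throws IOException;
    }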
On 12/22/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
Precision would be enhanced if boolean scoring took position into
account, and could be further enhanced if each position were assigned
a boost. For that purpose, having everything in one file is an
advantage, as it cuts down disk seeks. T
On 12/22/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
Ning Li wrote:
> The draft proposal seems to suggest the following (roughly):
> A dictionary entry is <Term, FilePointer>.
Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a
FilePointer and perhaps other information (e.g., frequency data).
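That is, roughly (the layout is mine; only the names come from the thread):

    // entry A: Term -> FilePointer (a raw offset into the postings data)
    // entry B: Term -> TermInfo
    class TermInfo {
      long filePointer; // where the term's postings start
      int docFreq;      // e.g., frequency data
    }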
On 1/14/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
* The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
could have a more efficient implementation (just like Solr) when
autoCommit is false, because deletes don't need to be flushed
until commit() is called. Whe
On 1/16/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Good catch Ning! And, I agree, when a reader plans to make
modifications to the index, I think the best solution is to require
that the reader has opened most recent "segments*_N" (be that a
snapshot or a checkpoint). Really a reader is
On 1/16/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
On 1/15/07, Chuck Williams <[EMAIL PROTECTED]> wrote:
> (Side thought: I've been wondering how hard it would
> be to make merging not a critical section).
It would be very nice if segment merging didn't block the addition of
new documents... i
On 1/17/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
robert engels wrote:
> Under this new scenario, what is the result of this:
>
> I open the IndexWriter.
>
> I delete all documents with Term A.
> I add a new document with Term A.
> I delete all documents with Term A.
>
> Is the new docume
On 2/9/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
I agree w/ Hoss: the way NewIndexModifier works, if you don't do any
deletes then there's no added cost (well, only some if statements) to
the "addDocument only" case because no readers are opened during the
flush when there are no deletes.
I think it's possible for another version of IndexWriter to have
a concurrent merge thread so that disk segments could be merged
while documents are being added or deleted.
This would be beneficial not only because it will improve indexing
performance when there are enough system resources, but m
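A minimal sketch of the idea (assumptions throughout; this is not the
eventual patch):

    // A daemon thread that merges disk segments while addDocument() and
    // deleteDocuments() proceed on other threads.
    Thread merger = new Thread(new Runnable() {
      public void run() {
        while (!stopRequested) {
          MergeSpec spec = mergePolicy.findMerge(segmentInfos); // hypothetical
          if (spec != null) {
            merge(spec);          // runs concurrently with adds/deletes
          } else {
            waitForNewSegments(); // hypothetical wait/notify helper
          }
        }
      }
    });
    merger.setDaemon(true);
    merger.start();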
On 2/20/07, karl wettin <[EMAIL PROTECTED]> wrote:
Could the reader per segment be replaced by one single MultiReader
created by the original indexDeleterFactory()? Or are the segments
partially the RAMDirectory of the writer, partially the persistent
index?
All segments are disk segments. Howe
I agree that the current blocking model works for some applications,
especially if the indexes are batch built.
But other applications, e.g. with online indexes, would greatly
benefit from a non-blocking model. Most systems that merge data
support background merges. As long as we keep it simple (
The code correctly reflects its designed semantics:
numBufferedDeleteTerms is a simple sum of terms passed to
updateDocument or deleteDocuments.
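Illustrated (using the patched writer's API):

    writer.deleteDocuments(new Term("id", "5")); // numBufferedDeleteTerms == 1
    writer.deleteDocuments(new Term("id", "5")); // == 2: a simple sum, even
                                                 // with no adds in between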
If the first of two successive calls with the same term should be
considered a no-op when no docs were added in between, shouldn't the first
also be considere
On 2/21/07, Doron Cohen (JIRA) <[EMAIL PROTECTED]> wrote:
Imagine the application and Lucene could talk, with the current
implementation we could hear something like this: ...
However, there could be multiple threads updating the same index. For
example, thread 1 deletes the term "id:5" twice,
Many good points! Thanks, guys!
When background merge is employed, document additions can
out-pace merging, no matter how many background merge threads
are used. Blocking has to happen at some point.
So, if we do anything, we make it simple. I agree with what
Robert and Yonik have proposed: docu
Hi,
Should we guard against the case when commit() is called during addIndexes?
Otherwise, errors such as a file does not exist could happen during commit.
Cheers,
Ning Li
> I think there are similar problems with calling optimize() while addIndexes
> is in progress... I think we should disallow that?
Optimize waits for addIndexes to finish? I think it's useful to allow addIndexes
during maybeMerge and optimize, no?
Cheers,
Ning Li
---
+1
On Thu, Aug 28, 2008 at 8:19 PM, Michael McCandless (JIRA)
<[EMAIL PROTECTED]> wrote:
>
>[
> https://issues.apache.org/jira/browse/LUCENE-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626805#action_12626805
> ]
>
> Michael McCandless commen
Hi,
We experimented using HBase's scalable infrastructure to scale out Lucene:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01143.html
There is a concern about the impact of HDFS's random read performance
on Lucene search performance. And we can discuss whether HBase's architecture
is best for scal
On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> But, how would you maintain a static view of an index...?
>
> IndexReader r1 = indexWriter.getCurrentIndex()
> indexWriter.addDocument(...)
> IndexReader r2 = indexWriter.getCurrentIndex()
>
> I assume r1 will have a view of
On Mon, Sep 8, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> I thought an index reader which supports real-time search no longer
>> maintains a static view of an index?
>
> It seems advantageous to just make it really cheap to get a new view
> of the index (if you do it for every sear
>>> Even so,
>>> this may not be sufficient for some FS such as HDFS... Is it
>>> reasonable in this case to keep in memory everything including
>>> stored fields and term vectors?
>>
>> We could maybe do something like a proxy IndexInput/IndexOutput that
>> would allow updating the read buffer fro
LUCENE-1335 is not listed in CHANGES.txt? It also includes a minor
behavior change: "no longer allow the same Directory to be passed into
addIndexes* more than once".
Cheers,
Ning
On Thu, Sep 18, 2008 at 2:29 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I just created the first
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.
That is not how the current merge p
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor. But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no
Hi Steven,
I haven't read the details, but should maxBufferedDocs be exposed in
some subinterfaces instead of the MergePolicy interface? Since some
policies may use it and others may use byte size or something else.
It's great that you've started on concurrent merge as well! I haven't
got a chan
On 3/26/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
Ahhh, this is a very good point. OK I won't deprecate "flushing by
doc count" and instead will allow either "flush by RAM usage" (default
to this?) or "flush by doc count".
Just want to clarify: It's either "flush and merge by
It will be great to support early termination for top-K queries within
the DAAT query processing model in Lucene. There is quite some work
published in related areas.
http://portal.acm.org/citation.cfm?id=956944 is one of them.
Am I getting it right? If a query requires top-K results, isn't it
su
FYI: Patch submitted in http://issues.apache.org/jira/browse/LUCENE-847.
Cheers,
Ning
"Here is a patch for concurrent merge as discussed in:
http://www.gossamer-threads.com/lists/lucene/java-dev/45651?search_string=concurrent%20merge;#45651
"I put it under this issue because it helps design and
On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
* With term vectors and/or stored fields, the new patch has
substantially better RAM efficiency.
Impressive numbers! The new patch improves RAM efficiency quite a bit
even with no term vectors nor stored fields, because of the
On 4/4/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
Note that for "autoCommit=false", this optimization is somewhat less
important, depending on how often you actually close/open a new
IndexWriter. In the extreme case, if you open a writer, add 100 MM
docs, close the writer, then no
On 3/31/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
Create merge policy that doesn't periodically inadvertently optimize
So we could make a small change to the policy by only merging the
first mergeFactor segments o
On 3/23/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:
In fact, there are a few things here that are fairly subtle/important. The relationship/protocol
between the writer and policy is pretty strong. This can be seen in the strawman concurrent
merge code where the merge policy holds state and
Having the merge policy own segmentInfos sounds kind of hard to me.
Among other things, there's a lot of code in IndexWriter for managing
segmentInfos with regards to transactions. I'm pretty wary of touching
that code. Is there a way around that?
But conceptually, do you agree it's a good idea
Steve, Mike,
Thanks for the explanation! I meant cascading but wrote optimizing. So
it still cascades merges.
It would merge based on size (not # docs), would be free to merge
adjacent segments (not just rightmost segments), and would merge N
(configurable) at a time. The part that's still unc
With the plan towards the 3.0 release laid out, I think it's a good time
to deprecate IndexModifier and eventually remove it.
The only method in IndexModifier which is not implemented in
IndexWriter is "deleteDocument(int doc)". This is because of the
concern that document ids change as segments are merged.
On 8/7/07, Steven Parkes (JIRA) <[EMAIL PROTECTED]> wrote:
>
> [
> https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518210
> ]
>
> Steven Parkes commented on LUCENE-847:
> --
>
>
...buffered delete doc ids. I
don't think it should be the reason not to support "deleteDocument(int
doc)" in IndexWriter. But its impact on concurrent merge is a concern.
Ning
On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> +1
>
>
> On Aug 7, 2007, at 3:37 PM, Nin
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> To make delete by docid useful, one needs a way to *get* those docids.
> A callback after flush that provided a current list of readers for the
> segments would serve.
Interesting. That makes sense.
> I think IndexWriter.deleteDocument(int doc)
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 8/8/07, Ning Li <[EMAIL PROTECTED]> wrote:
> > But you still think it's worth to be included in IndexWriter, right?
>
> I'm not sure... (unless I'm missing some obvious use-cases).
> If one could g
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Let's take a simple case of deleting documents in a range, like
> date:[2006 TO 2008]
> One would currently need to close the writer and open a new reader to
> ensure that they can "see" all the documents. Then execute a
> RangeQuery, collect th
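Sketched with the 2.x API, that flow would be (Hits was the search API
of the day):

    // Flush buffered docs, reopen, query the range, delete by docid.
    writer.close();
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query q = new RangeQuery(new Term("date", "2006"),
                             new Term("date", "2008"), true);
    Hits hits = searcher.search(q);
    for (int i = 0; i < hits.length(); i++) {
      reader.deleteDocument(hits.id(i));
    }
    reader.close(); // commits the deletes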
On 8/8/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 8/8/07, Ning Li <[EMAIL PROTECTED]> wrote:
> > This reminds me: It'd be nice if we could support delete-by-query someday.
> > :)
> >
> > I was thinking people use deleteDocument(int docid) whe
IndexWriter does everything IndexModifier does and more, except
"deleteDocument(int doc)". Can we reach consensus on: 1 Should we
deprecate IndexModifier before 3.0 and remove it in 3.0? 2 If so, do
we have to add "deleteDocument(int doc)" to IndexWriter?
We know how to support "deleteDocument(int
Hi Mike,
I cannot apply the patch cleanly. MergePolicy.java, e.g., seems to be
missing from the patch.
On 8/24/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
>
> [
> https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
Hi Doron,
> On the other, the logic of "use memory-limit unless added-docs-limit was
> specified" seems somewhat confusing
The design intention is to use either
maxBufferedDocs/maxBufferedDeleteTerms or ramBufferSize, but not both
at the same time.
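Roughly, with the setters from that era (DISABLE_AUTO_FLUSH is the
constant I believe applies):

    // either flush by RAM usage ...
    writer.setRAMBufferSizeMB(32.0);
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

    // ... or flush by counts, but not both at once:
    // writer.setMaxBufferedDocs(10000);
    // writer.setMaxBufferedDeleteTerms(1000);
    // writer.setRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);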
> (why only by pending adds, why not also by pe
to max int
MB.
Ning
On 9/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> "Doron Cohen" <[EMAIL PROTECTED]> wrote:
> > Hi Ning,
> >
> > "Ning Li" <[EMAIL PROTECTED]> wrote on 24/09/2007 00:26:36:
> >
> > > Do y
On 9/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> On flushing pending deletes by RAM usage: should we just bundle this
> up under "flush by RAM usage"? Ie "when total RAM usage, either from
> buffered deletes, buffered docs, anything else, exceeds X then it's
> time to flush"? (Instead
The cause is that in MergeThread.run(), the merge in the try block is a
local variable, while the merge in the catch block is the class variable.
The merge in the try block could be one different from the original merge,
but the catch block always checks the abort flag of the original
merge.
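An illustration of the shadowing (names assumed, not the actual code):

    public void run() {
      OneMerge merge = this.merge; // local copy of the field
      try {
        while (merge != null) {
          doMerge(merge);
          merge = getNextMerge(); // the local now differs from this.merge
        }
      } catch (Throwable t) {
        // BUG: always consults the original merge's flag, so an abort of a
        // later merge picked up in the loop is never seen here.
        if (!this.merge.isAborted()) {
          handleMergeException(t);
        }
      }
    }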
-
Make all documents have a term, say "ID:UID", and for each document,
store its UID in the term's payload. You can read off this posting
list to create your array. Will this work for you, John?
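A sketch of reading those payloads back (2.x TermPositions API; the
array sizing and 4-byte decoding are assumptions):

    int[] uids = new int[reader.maxDoc()];
    TermPositions tp = reader.termPositions(new Term("ID", "UID"));
    byte[] buf = new byte[4];
    while (tp.next()) {
      tp.nextPosition(); // each doc has one position carrying the payload
      if (tp.isPayloadAvailable()) {
        tp.getPayload(buf, 0);
        uids[tp.doc()] = decodeInt(buf); // hypothetical 4-byte decoder
      }
    }
    tp.close();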
Cheers,
Ning
On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Forwarding this to java-dev per req
> ...result set is large. But loading it in
> memory when opening index can also be slow if the index is large and updates
> often.
>
> Thanks
>
> -John
>
> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> >
> > Make all documents have a term, say "ID:UID",
> That may be a little too seamless. We want the user to have specific
> control over which fields are efficiently stored separately since they
> will know how that field will be used.
Maybe let users decide field families, like the column families in BigTable?
--
HDFS block. This feature may be useful for other HDFS applications (e.g.,
HBase). We would like to collaborate with other people who are interested in
adding this feature to HDFS.
Regards,
Ning Li
I work for IBM Research. I read the Rackspace article. Rackspace's Mailtrust
has a similar design. Happy to see an existing application on such a system.
Do they plan to open-source it? Is the AOL project an open source project?
On Feb 6, 2008 11:33 AM, Clay Webster <[EMAIL PROTECTED]> wrote:
>
>
No. I'm curious too. :)
On Feb 6, 2008 11:44 AM, J. Delgado <[EMAIL PROTECTED]> wrote:
> I assume that Google also has distributed index over their
> GFS/MapReduce implementation. Any idea how they achieve this?
>
> J.D.
>
One main focus is to provide fault-tolerance in this distributed index
system. Correct me if I'm wrong, but I think SOLR-303 is focused on merging
results from multiple shards right now. We'd like to start an open source
project for a fault-tolerant distributed index system (or join if one
already exi
Components: Index
Reporter: Ning Li
Today, applications have to open/close an IndexWriter and open/close an
IndexReader directly or indirectly (via IndexModifier) in order to handle a
mix of inserts and deletes. This performs well when inserts and deletes
come in fairly large batches
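The pattern being replaced looks like this (2.x API sketch; deletes need
a reader, adds need a writer):

    // delete pass: open a reader, delete by term, close it
    IndexReader reader = IndexReader.open(dir);
    for (int i = 0; i < deleteTerms.length; i++) {
      reader.deleteDocuments(deleteTerms[i]);
    }
    reader.close();

    // add pass: open a writer, add the new documents, close it
    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    for (int i = 0; i < newDocs.length; i++) {
      writer.addDocument(newDocs[i]);
    }
    writer.close(); // and repeat for the next batch of deletes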
[ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]
Ning Li updated LUCENE-565:
---
Attachment: IndexWriter.java
TestWriterDelete.java
> Supporting deleteDocuments in IndexWriter (Code and Performance Results
> Pr