[jira] Resolved: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-08-12 Thread Michael McCandless (JIRA)
Available, New]) improve how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-12 Thread Michael McCandless (JIRA)
a unit test fix it. I will also make the flush(boolean triggerMerge, boolean flushDocStores) protected, not public, and move the javadoc back to the public flush().

[jira] Reopened: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-06 Thread Michael McCandless (JIRA)
. improve how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type: Improvement

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-25 Thread Doron Cohen
Michael McCandless wrote: OK, when you say fair I think you mean because you already had a previous run that used compound file, you had to use compound file in the run with the LUCENE-843 patch (etc)? Yes, that's true. The recommendations above should speed up Lucene with or without my

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-24 Thread Doron Cohen (JIRA)
, for a fair comparison, I will remain with compound.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-24 Thread Michael McCandless (JIRA)
expect less than 50% speedup in this case.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-23 Thread Doron Cohen (JIRA)
the optimize part if this is not of interest for the comparison. (In fact I am still waiting for my optimize() to complete, but if it is not of interest I will just interrupt it...) Thanks, Doron

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-23 Thread Michael McCandless (JIRA)
a separate issue (LUCENE-856) for optimizations in segment merging.

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-22 Thread Grant Ingersoll
328.03.1 X StandardAnalyzer is definitely rather time consuming!

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-22 Thread Michael McCandless
StandardAnalyzer is definitely rather time consuming!

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael Busch (JIRA)
how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type: Improvement

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael McCandless (JIRA)
already? Good question ... I haven't measured the performance cost of using StandardAnalyzer or HTML parsing but I will test and post back.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael McCandless (JIRA)
328.03.1 X StandardAnalyzer is definitely rather time consuming!

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael Busch (JIRA)
is easily extensible in this regard? I'm wondering because of all the optimizations you're doing like e.g. sharing byte arrays. But I'm certainly not familiar enough with your code yet, so I'm only guessing here.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael McCandless (JIRA)
extensibility. Devil is in the details of course... I obviously haven't factored DocumentsWriter in this way (it has its own addPosition that writes the current Lucene index format) but I think this is very doable in the future.

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Michael McCandless (JIRA)
/index and test again? Thanks.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Steven Parkes (JIRA)
to figure out how it will dovetail with the merge policy factoring.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Michael Busch (JIRA)
of invertField() like DocumentWriter did before 880 this is safe, because addPosition() serializes the term strings and payload bytes into the posting hash table right away. Is that right?

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-18 Thread Michael McCandless (JIRA)
performance. I will open a separate issue to change the default after this issue is resolved.

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-15 Thread Michael McCandless (JIRA)
merged segment has its own private doc stores again. So the sharing only occurs for the level 0 segments. I still need to update fileformats doc with this change.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-15 Thread Yonik Seeley (JIRA)
segments are sharing the same ones (big performance gain), Is this only in the case where the segments have no deleted docs?

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-15 Thread Michael McCandless (JIRA)
when all segments are sharing the same ones (big performance gain), Is this only in the case where the segments have no deleted docs? Right. Also the segments must be contiguous which the current merge policy ensures but future merge policies may not.

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-08 Thread Michael McCandless (JIRA)
test cases; added missing writer.close() to one of the contrib tests. * Cleanup, comments, etc. I think the code is getting more approachable now.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-08 Thread Michael McCandless (JIRA)
threaded. improve how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-05-21 Thread Michael McCandless (JIRA)
have another issue (LUCENE-856) to optimize segment merging I can carry over any optimizations that we may want to keep into that issue. If this doesn't lose much performance it will make the approach here even simpler.

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)
segments have accumulated I flush to a real Lucene segment (autoCommit=true) or to on-disk partial segments (autoCommit=false) which are then merged in the end to create a real Lucene segment.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Marvin Humphrey (JIRA)
persistent arrays, improves indexing speed by about 15%.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)
speed by about 15%. Fabulous!! I think it's the custom memory management I'm doing with slices into shared byte[] arrays for the postings that made the persistent hash approach work well, this time around (when I had previously tried this it was slower).
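
The "slices into shared byte[] arrays" idea mentioned above can be illustrated with a minimal sketch. This is not the patch's code: the class SharedByteSlices, its methods, and the 1 KB initial size are hypothetical names and values chosen for illustration. It only shows the general technique of handing out offsets into one shared array instead of allocating a separate small object for every piece of posting data.

```java
import java.util.Arrays;

/** Hypothetical bump allocator over one shared byte[]; not Lucene's actual class. */
final class SharedByteSlices {
    // One shared backing array; 1024 is an arbitrary illustrative starting size.
    private byte[] buffer = new byte[1024];
    // Number of bytes handed out so far; also the start of the next slice.
    private int used = 0;

    /** Reserves length bytes and returns the start offset of the new slice. */
    int allocate(int length) {
        if (used + length > buffer.length) {
            // Grow by doubling, or more if one slice needs it. Offsets stay valid
            // because callers hold offsets, not references into the old array.
            buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, used + length));
        }
        int start = used;
        used += length;
        return start;
    }

    /** Writes one byte at position index inside the slice starting at sliceStart. */
    void put(int sliceStart, int index, byte value) {
        buffer[sliceStart + index] = value;
    }

    /** Reads one byte at position index inside the slice starting at sliceStart. */
    byte get(int sliceStart, int index) {
        return buffer[sliceStart + index];
    }

    /** Total bytes handed out, which makes RAM accounting a single counter. */
    int bytesUsed() {
        return used;
    }
}
```

A side effect of this layout is that the bytes consumed by buffered data can be read from one counter, which is convenient if flushing is triggered by RAM usage.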

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Yonik Seeley
On 4/30/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote: After discussion on java-dev last time, I decided to retry the persistent hash approach, where the Postings hash lasts across many docs and then a single flush produces a partial segment containing all of those docs. This is in

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)
; new 47.5 [354.8% more] Avg RAM used (MB) @ flush: old 74.3; new 42.9 [42.2% less]

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Yonik Seeley (JIRA)
be much larger than segments to its left. I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments? Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)
then it would automatically resolve LUCENE-845 as well (which would otherwise block this issue).

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-07 Thread Otis Gospodnetic
Mike - thanks for explanation, it makes perfect sense! Otis - Original Message From: Michael McCandless [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, April 5, 2007 8:03:44 PM Subject: Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-06 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote: On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote: What we need to do is cut down on decompression and conflict resolution costs when reading from one segment to another. KS has solved this problem for stored fields. Field defs are global and

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote: On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote: (: Ironically, the numbers for Lucene on that page are a little better than they should be because of a sneaky bug. I would have made updating the results a priority if they'd gone the other

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless (JIRA)
in RAM before having to flush to disk. I would also expect this curve to be somewhat content dependent.

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread eks dev
wow, impressive numbers, congrats ! - Original Message From: Michael McCandless (JIRA) [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, 5 April, 2007 3:22:32 PM Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
eks dev [EMAIL PROTECTED] wrote: wow, impressive numbers, congrats ! Thanks! But remember many Lucene apps won't see these speedups since I've carefully minimized cost of tokenization and cost of document retrieval. I think for many Lucene apps these are a sizable part of time spent indexing.

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey
On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote: Marvin do you have any sense of what the equivalent cost is in KS It's big. I don't have any good optimizations to suggest in this area. (I think for KS you add a previous segment not that differently from how you add a document)?

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote: (I think for KS you add a previous segment not that differently from how you add a document)? Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace is that with the Fibonacci merge schedule and the

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Otis Gospodnetic
] To: java-dev@lucene.apache.org Sent: Thursday, April 5, 2007 9:22:32 AM Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey
On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote: (I think for KS you add a previous segment not that differently from how you add a document)? Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace is that with the Fibonacci merge schedule and the

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Chris Hostetter
: Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a sizable part of time spent indexing. true, but as long as the changes you are making have no impact

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Mike Klaas
On 4/5/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a sizable part of time spent indexing. true, but

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Mike Klaas [EMAIL PROTECTED] wrote: On 4/5/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Hi Otis! Otis Gospodnetic [EMAIL PROTECTED] wrote: You talk about a RAM buffer from 1MB - 96MB, but then you have the amount of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new 3.4 [90.1% less]). I don't follow 100% of what you are doing in LUCENE-843, so could

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote: On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote: (I think for KS you add a previous segment not that differently from how you add a document)? Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Grant Ingersoll
Michael, like everyone else, I am watching this very closely. So far it sounds great! On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote: When I measure amount of RAM @ flush time, I'm calling MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual process memory usage

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Grant Ingersoll [EMAIL PROTECTED] wrote: Michael, like everyone else, I am watching this very closely. So far it sounds great! On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote: When I measure amount of RAM @ flush time, I'm calling MemoryMXBean.getHeapMemoryUsage().getUsed().

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey
On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote: What we need to do is cut down on decompression and conflict resolution costs when reading from one segment to another. KS has solved this problem for stored fields. Field defs are global and field values are keyed by name rather than

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey
On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote: Yonik Seeley [EMAIL PROTECTED] wrote: Wow, very nice results Mike! Thanks :) I'm just praying I don't have some sneaky bug making the results far better than they really are!! That's possible, but I'm confident that the model you're

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote: On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote: Yonik Seeley [EMAIL PROTECTED] wrote: Wow, very nice results Mike! Thanks :) I'm just praying I don't have some sneaky bug making the results far better than they really are!! That's

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey
On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote: (: Ironically, the numbers for Lucene on that page are a little better than they should be because of a sneaky bug. I would have made updating the results a priority if they'd gone the other way. :) Hrm. It would be nice to have

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)
, for a given # docs in the index. I also measure overall RAM used in the JVM (using MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush except the last, to also capture the document processing RAM, object overhead, etc.
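
For readers unfamiliar with the call named above, here is a minimal sketch of reading used heap through the standard java.lang.management API. The class name HeapUsageProbe, the helper methods, and the dummy buffers in main are assumptions made for illustration; the benchmark harness used in this issue is not shown in the thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** Illustrative helper (hypothetical name) around MemoryMXBean. */
final class HeapUsageProbe {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** The same figure converted to megabytes. */
    static double usedHeapMB() {
        return usedHeapBytes() / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Stand-in for buffered documents; a real test would sample
        // just before the writer flushes.
        byte[][] buffered = new byte[64][];
        for (int i = 0; i < buffered.length; i++) {
            buffered[i] = new byte[64 * 1024];
        }
        System.out.printf("heap used: %.1f MB while holding %d buffers%n",
                usedHeapMB(), buffered.length);
    }
}
```

Note that this figure covers the whole heap, including garbage not yet collected, so it is an upper bound on live data rather than an exact count of what the writer holds.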

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)
% less] improve how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)
: old 268.9; new 818.7 [204.5% faster] Docs/MB @ flush: old 46.7; new 432.2 [825.2% more] Avg RAM used (MB) @ flush: old 93.0; new 36.6 [60.6% less]
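
The bracketed percentages can be read as relative changes against the old value. The sketch below re-derives two of them from the rounded numbers shown in the snippet; the class and method names are hypothetical, and the label of the first metric is cut off in the snippet, so it is left unnamed here. Figures recomputed from rounded inputs can differ in the last digit from ones computed on the raw measurements.

```java
import java.util.Locale;

/** Hypothetical helper for re-deriving the bracketed percentages. */
final class PercentChangeSketch {
    /** Relative increase of newValue over oldValue, in percent. */
    static double percentMore(double oldValue, double newValue) {
        return (newValue - oldValue) / oldValue * 100.0;
    }

    /** Relative decrease of newValue below oldValue, in percent. */
    static double percentLess(double oldValue, double newValue) {
        return (oldValue - newValue) / oldValue * 100.0;
    }

    public static void main(String[] args) {
        // First pair in the snippet: old 268.9, new 818.7 (about 204.5% more).
        System.out.println(String.format(Locale.ROOT, "%.1f%% faster",
                percentMore(268.9, 818.7)));
        // Avg RAM used (MB) @ flush: old 93.0, new 36.6 (about 60.6% less).
        System.out.println(String.format(Locale.ROOT, "%.1f%% less",
                percentLess(93.0, 36.6)));
    }
}
```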

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)
: old 93.0; new 36.6 [60.6% less]

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)
this also means you could push your RAM buffer size even higher to get better performance.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Marvin Humphrey (JIRA)
as to whether you see the same effect.

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Ning Li
On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote: * With term vectors and/or stored fields, the new patch has substantially better RAM efficiency. Impressive numbers! The new patch improves RAM efficiency quite a bit even with no term vectors nor stored fields, because of the

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Yonik Seeley
Wow, very nice results Mike! -Yonik On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote: [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335 ] Michael McCandless commented on LUCENE-843:

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)
as to whether you see the same effect. Interesting. OK I will run the benchmark across increasing RAM sizes to see where the sweet spot seems to be!

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless
Ning Li [EMAIL PROTECTED] wrote: On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote: * With term vectors and/or stored fields, the new patch has substantially better RAM efficiency. Impressive numbers! The new patch improves RAM efficiency quite a bit even with no term

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless
Yonik Seeley [EMAIL PROTECTED] wrote: Wow, very nice results Mike! Thanks :) I'm just praying I don't have some sneaky bug making the results far better than they really are!! And still plenty to do... Mike - To

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-02 Thread Michael McCandless (JIRA)
consuming more RAM in this case than the baseline (trunk) so I'm still working on this one ... * Fixed a slow memory leak when building large (20+ GB) indices

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-28 Thread Michael McCandless (JIRA)
or not, deleted docs or not, any merges or not.

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-25 Thread Michael McCandless (JIRA)
corruption case * Added more asserts (run with java -ea so asserts run) * Some more small optimizations * Updated to current trunk so patch applies cleanly

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
Chris Hostetter [EMAIL PROTECTED] wrote: : Actually is #2 a hard requirement? : : A lot of Lucene users depend on having document number correspond to : age, I think. ISTR Hatcher at least recommending techniques that : require it. Correspond to age may be misleading as it implies that

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Erik Hatcher
On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote: On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote: Actually is #2 a hard requirement? A lot of Lucene users depend on having document number correspond to age, I think. ISTR Hatcher at least recommending techniques that require

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Steven Parkes
Subject: Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents Chris Hostetter [EMAIL PROTECTED] wrote: : Actually is #2 a hard requirement? : : A lot of Lucene users depend on having document number correspond to : age, I think. ISTR Hatcher at least

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Yonik Seeley
On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote: We say that developers should not rely on docIDs but people still seem to rely on their monotonic ordering (even though they change). Yes. If the benefits of removing that guarantee are large enough, we could consider dumping it... but

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Yonik Seeley
On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote: Merging is costly because you read all data in then write all data out, so you want to minimize, for each byte of data in the index, how many times it will be serviced (read in, written out) as part of a merge. Avoiding the
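
To make the "how many times is each byte serviced" point concrete, here is a rough simulation of a logarithmic merge schedule in which every group of mergeFactor same-level segments is merged into one segment of the next level. It is a sketch under stated assumptions, not Lucene's merge policy: the class MergeCostSketch, its method, and the use of document counts as a stand-in for bytes are all illustrative choices.

```java
/** Hypothetical simulation of merge rewrite cost; not Lucene code. */
final class MergeCostSketch {
    /**
     * Returns the total number of documents rewritten by merges after
     * flushing the given number of equally sized segments.
     * Assumes mergeFactor >= 2.
     */
    static long docsRewritten(int flushes, int docsPerFlush, int mergeFactor) {
        // counts[k] = number of level-k segments currently in the index.
        int[] counts = new int[32];
        long rewritten = 0;
        for (int i = 0; i < flushes; i++) {
            counts[0]++; // a new level-0 segment is flushed from RAM
            long segDocs = docsPerFlush;
            // Cascade: whenever a level fills up, merge it into the next level.
            for (int level = 0; counts[level] == mergeFactor; level++) {
                counts[level] = 0;
                counts[level + 1]++;
                segDocs *= mergeFactor; // size of the merged segment
                rewritten += segDocs;   // a merge reads and rewrites all of it
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        int docsPerFlush = 300;
        for (int flushes : new int[] {10, 100, 1000}) {
            long total = (long) flushes * docsPerFlush;
            long rewritten = docsRewritten(flushes, docsPerFlush, 10);
            System.out.println(flushes + " flushes: each doc rewritten about "
                    + (rewritten / total) + " time(s)");
        }
    }
}
```

In this model the per-document rewrite count grows with the logarithm of the number of flushed segments, so larger flushes (fewer level-0 segments for the same data) directly reduce merge work.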

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
Yonik Seeley [EMAIL PROTECTED] wrote: On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote: Merging is costly because you read all data in then write all data out, so you want to minimize, for each byte of data in the index, how many times it will be serviced (read in, written

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li
On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote: Right I'm calling a newly created segment (ie flushed from RAM) level 0 and then a level 1 segment is created when you merge 10 level 0 segments, level 2 is created when you merge 10 level 1 segments, etc. That is not how the current merge

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Grant Ingersoll
I've only been loosely following this... Do you think it is possible to separate the stored/term vector handling into a separate patch against the current trunk? This seems like a quick win and I know it has been speculated about before. On Mar 23, 2007, at 12:00 PM, Michael McCandless

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless
Grant Ingersoll [EMAIL PROTECTED] wrote: I've only been loosely following this... Do you think it is possible to separate the stored/term vector handling into a separate patch against the current trunk? This seems like a quick win and I know it has been speculated about before. This

[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless (JIRA)
improve how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type: Improvement

[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless (JIRA)
:) improve how IndexWriter uses RAM to buffer added documents -- Key: LUCENE-843 URL: https://issues.apache.org/jira/browse/LUCENE-843 Project: Lucene - Java Issue Type

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
] Sent: Thursday, March 22, 2007 10:09 AM To: java-dev@lucene.apache.org Subject: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless
Steven Parkes [EMAIL PROTECTED] wrote: * Merge policy has problems when you flush by RAM (this is true even before my patch). Not sure how to fix yet. Do you mean where one would be trying to use RAM usage to determine when to do a flush? Right, if you have your indexer measure RAM

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
EG if you set maxBufferedDocs to say 1 but then it turns out based on RAM usage you actually flush every 300 docs then the merge policy will incorrectly merge a level 1 segment (with 3000 docs) in with the level 0 segments (with 300 docs). This is because the merge policy looks at the
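
The mismatch described above can be sketched with a toy level function. Everything here is an assumption for illustration: the class SegmentLevelSketch and its method are hypothetical, the rule "level is derived from document count relative to maxBufferedDocs" is a simplification and not the actual merge policy code, and the maxBufferedDocs value of 10,000 is invented because the value in the message is not clear from the snippet.

```java
/** Hypothetical doc-count-based level estimate; not the real merge policy. */
final class SegmentLevelSketch {
    /**
     * Level 0 holds segments of up to maxBufferedDocs documents; each higher
     * level holds segments up to mergeFactor times larger than the previous one.
     */
    static int levelByDocCount(int docCount, int maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long upperBound = maxBufferedDocs;
        while (docCount > upperBound) {
            upperBound *= mergeFactor;
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;
        int assumedMaxBufferedDocs = 10_000; // illustrative value only

        // RAM-triggered flushes produce 300-doc segments; merging ten of them
        // yields a 3000-doc segment that is logically "level 1".
        System.out.println("300-doc segment  -> level "
                + levelByDocCount(300, assumedMaxBufferedDocs, mergeFactor));
        System.out.println("3000-doc segment -> level "
                + levelByDocCount(3000, assumedMaxBufferedDocs, mergeFactor));

        // If levels were measured against the real flush size, they would differ.
        System.out.println("3000-doc segment vs 300-doc flushes -> level "
                + levelByDocCount(3000, 300, mergeFactor));
    }
}
```

Under these assumptions both the 300-doc and the 3000-doc segments land in level 0, so a policy of this shape would merge them together, whereas measuring against the actual flush size of 300 puts the 3000-doc segment at level 1.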

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless
On Thu, 22 Mar 2007 13:34:39 -0700, Steven Parkes [EMAIL PROTECTED] said: EG if you set maxBufferedDocs to say 1 but then it turns out based on RAM usage you actually flush every 300 docs then the merge policy will incorrectly merge a level 1 segment (with 3000 docs) in with the level

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
Right I'm calling a newly created segment (ie flushed from RAM) level 0 and then a level 1 segment is created when you merge 10 level 0 segments, level 2 is created when you merge 10 level 1 segments, etc. This isn't the way the current code treats things. I'm not saying it's the only way to look

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Marvin Humphrey
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote: Actually is #2 a hard requirement? A lot of Lucene users depend on having document number correspond to age, I think. ISTR Hatcher at least recommending techniques that require it. Do the loose ports of Lucene (KinoSearch,

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless
Steven Parkes wrote: Right I'm calling a newly created segment (ie flushed from RAM) level 0 and then a level 1 segment is created when you merge 10 level 0 segments, level 2 is created when you merge 10 level 1 segments, etc. This isn't the way the current code treats things. I'm not saying

RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
But when these values have changed (or, segments were flushed by RAM not by maxBufferedDocs) then the way it computes level no longer results in the logarithmic policy that it's trying to implement, I think. That's right. Parts of the implementation assume that the segments are

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Chris Hostetter
: Actually is #2 a hard requirement? : : A lot of Lucene users depend on having document number correspond to : age, I think. ISTR Hatcher at least recommending techniques that : require it. Correspond to age may be misleading as it implies that the actual docid has meaning ... it's more that