> improve how IndexWriter uses RAM to buffer added documents
> --
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
>
a unit
test & fix it. I will also make the flush(boolean triggerMerge,
boolean flushDocStores) protected, not public, and move the javadoc
back to the public flush().
(docWriter.getMaxBufferedDocs());
}
ix shortly.
Michael McCandless wrote:
> OK, when you say "fair" I think you mean because you already had a
> previous run that used compound file, you had to use compound file in
> the run with the LUCENE-843 patch (etc)?
Yes, that's true.
> The recommendations above should speed up Lucene with or without m
main with compound.
OK, because you're doing StandardAnalyzer and HTML parsing and
presumably loading one-doc-per-file, most of your time is spent
outside of Lucene indexing, so I'd expect less than a 50% speedup in
this case.
0MB. Again, for a "fair"
comparison, I will remain with compound.
docs in RAM and then flush a new segment when it's
time. I've opened a separate issue (LUCENE-856) for optimizations
in segment merging.
p the
optimize part if this is not of interest for the comparison. (In fact I am
still waiting for my optimize() to complete, but if it is not of interest I
will just interrupt it...)
Thanks,
Doron
> > ANALYZER             PATCH (sec)   TRUNK (sec)   SPEEDUP
> > SimpleSpaceAnalyzer         79.0         326.5     4.1 X
> > StandardAnalyzer           449.0         674.1     1.5 X
> > WhitespaceAnalyzer         104.0         338.9     3.3 X
> > SimpleAnalyzer             104.7         328.0     3.1 X
StandardAnalyzer is definitely rather time consuming!
into the IndexOutputs.
I think a separation like that would work well: we could have good
performance and also extensibility. Devil is in the details of
course...
I obviously haven't factored DocumentsWriter in this way (it has its
own addPosition that writes the current Lucene index
is easily extensible in this regard? I'm
wondering because of all the optimizations you're doing like e. g.
sharing byte arrays. But I'm certainly not familiar enough with your code
yet, so I'm only guessing here.
be a significant improvement. Did you run those kinds
> of benchmarks already?
Good question ... I haven't measured the performance cost of using
StandardAnalyzer or HTML parsing but I will test & post back.
the finally clause of invertField() like DocumentWriter did before 880 this
is safe, because addPosition() serializes the term strings and payload bytes
into the posting hash table right away. Is that right?
mergeSegments really only merges segments.
ry to figure out how it will dovetail
with the merge policy factoring.
src/test/org/apache/lucene/index and test again? Thanks.
h the merge policy
stuff (LUCENE-847). Noticed that there are a couple of test failures?
"out of the box" performance. I will open a separate
issue to change the default after this issue is resolved.
> "doc store" files when all segments
> are sharing the same ones (big performance gain),
Is this only in the case where the segments have no deleted docs?
t of "doc store" files. Then when segments
are merged, the newly merged segment has its own "private" doc
stores again. So the sharing only occurs for the "level 0"
segments.
I still need to update fileform
consuming and single threaded.
does.
* Added some new unit test cases; added missing "writer.close()" to
one of the contrib tests.
* Cleanup, comments, etc. I think the code is getting more
"approachable" now.
>
ent SegmentMerger, possibly w/ some small
changes, to do the merges even when autoCommit=false. Since we have
another issue (LUCENE-856) to optimize segment merging I can carry
over any optimizations that we may want to keep into that issue. If
this doesn't lose much performance it will
ize by net # bytes in the segment. This would preserve the
"docID monotonicity invariance".
If we take that approach then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).
; new 47.5 [354.8% more]
Avg RAM used (MB) @ flush: old 74.3; new 42.9 [42.2% less]
On 4/30/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs. This is in c
to
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).
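The persistent-hash idea can be sketched roughly as follows. This is an illustrative toy, not DocumentsWriter's actual code (the class and method names here, like `PersistentPostings`, are mine): a Postings hash that lasts across many documents, with each term occurrence serialized immediately into one shared growable byte[] pool rather than per-document buffers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.nio.charset.StandardCharsets;

// Illustrative sketch (hypothetical names): a postings hash that persists
// across documents, appending term bytes into a single shared byte[] pool.
class PersistentPostings {
    static class Posting {
        int docFreq;         // number of docs containing this term
        int lastDocID = -1;  // last doc that saw this term
    }

    private final Map<String, Posting> postings = new HashMap<>();
    private byte[] pool = new byte[16];  // shared pool, grown as needed
    private int poolUsed = 0;

    // Record one occurrence of term in docID. The term's bytes go into the
    // shared pool right away, so the caller may reuse its own buffers.
    void addPosition(String term, int docID) {
        Posting p = postings.computeIfAbsent(term, t -> new Posting());
        if (p.lastDocID != docID) {
            p.docFreq++;
            p.lastDocID = docID;
        }
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        while (poolUsed + bytes.length > pool.length)
            pool = Arrays.copyOf(pool, pool.length * 2);  // amortized growth
        System.arraycopy(bytes, 0, pool, poolUsed, bytes.length);
        poolUsed += bytes.length;
    }

    int docFreq(String term) {
        Posting p = postings.get(term);
        return p == null ? 0 : p.docFreq;
    }
}
```

The point of the shared pool is to avoid the per-document allocation and copying that made earlier attempts at a persistent hash slower.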
ue, which is a riff
on your persistent arrays, improves indexing speed by about 15%.
segment. When enough of these
RAM segments have accumulated I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false)
which are then merged in the end to create a real Lucene segment.
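That accumulate-then-flush loop can be sketched as below. All names here are hypothetical, and the toy only counts flushes; the real writer builds RAM segments and later merges the flushed partial segments into a real Lucene segment, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: buffer docs in RAM, flush a segment once the
// estimated RAM usage crosses a threshold (names are hypothetical).
class RamBuffer {
    private final long ramLimitBytes;
    private long ramUsed = 0;
    private final List<String> bufferedDocs = new ArrayList<>();
    int segmentsFlushed = 0;

    RamBuffer(long ramLimitBytes) { this.ramLimitBytes = ramLimitBytes; }

    void addDocument(String doc) {
        bufferedDocs.add(doc);
        ramUsed += 2L * doc.length();  // crude per-doc RAM estimate (chars -> bytes)
        if (ramUsed >= ramLimitBytes) flushSegment();
    }

    private void flushSegment() {
        // In the real writer this writes a (partial) segment to disk.
        bufferedDocs.clear();
        ramUsed = 0;
        segmentsFlushed++;
    }
}
```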
Mike - thanks for explanation, it makes perfect sense!
Otis
- Original Message
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 8:03:44 PM
Subject: Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
>
> >> What we need to do is cut down on decompression and conflict
> >> resolution costs when reading from one segment to another. KS has
> >> solved this problem for stored fields. Field defs
On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another. KS has
solved this problem for stored fields. Field defs are global and
field values are keyed by name rather than fiel
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
>
> Michael, like everyone else, I am watching this very closely. So far
> it sounds great!
>
> On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
>
> > When I measure "amount of RAM @ flush time", I'm calling
> > MemoryMXBean.getHeapMemoryUsage
Michael, like everyone else, I am watching this very closely. So far
it sounds great!
On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual
process memory usage w
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
>
> >>> (I think for KS you "add" a previous segment not that
> >>> differently from how you "add" a document)?
> >>
> >> Yeah. KS has to decompress and serialize posting content, which sux.
> >
Hi Otis!
"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new
> 3.4 [90.1% less]).
>
> I don't follow 100% of what you are doing in LUCENE-843, so
"Mike Klaas" <[EMAIL PROTECTED]> wrote:
> On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > : Thanks! But remember many Lucene apps won't see these speedups since I've
> > : carefully minimized cost of tokenization and cost of document retrieval.
> > I
> > : think for many Lucene ap
On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: Thanks! But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval. I
: think for many Lucene apps these are a sizable part of time spent indexing.
true, bu
: Thanks! But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval. I
: think for many Lucene apps these are a sizable part of time spent indexing.
true, but as long as the changes you are making has no impact on
On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
Yeah. KS has to decompress and serialize posting content, which sux.
The one saving grace is that with the Fibonacci merge schedule and
th
ROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
buffer added documents
[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpan
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> > (I think for KS you "add" a previous segment not that
> > differently from how you "add" a document)?
>
> Yeah. KS has to decompress and serialize posting content, which sux.
>
> The one saving grace is that with the Fibonacci merge schedule and
On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:
Marvin do you have any sense of what the equivalent cost is
in KS
It's big. I don't have any good optimizations to suggest in this area.
(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
"eks dev" <[EMAIL PROTECTED]> wrote:
> wow, impressive numbers, congrats !
Thanks! But remember many Lucene apps won't see these speedups since I've
carefully minimized cost of tokenization and cost of document retrieval. I
think for many Lucene apps these are a sizable part of time spend index
wow, impressive numbers, congrats !
- Original Message
From: Michael McCandless (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
buffer added doc
ompress the terms dict, the
more docs are merged in RAM before having to flush to disk. I
would also expect this curve to be somewhat content dependent.
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
>
> >> (: Ironically, the numbers for Lucene on that page are a little
> >> better than they should be because of a sneaky bug. I would have
> >> made updating the results a priority if they'd go
On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug. I would have
made updating the results a priority if they'd gone the other
way. :)
Hrm. It would be nice to have hard
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
>
> > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> >> Wow, very nice results Mike!
> >
> > Thanks :) I'm just praying I don't have some sneaky bug making
> > the results far better than they rea
On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
Wow, very nice results Mike!
Thanks :) I'm just praying I don't have some sneaky bug making
the results far better than they really are!!
That's possible, but I'm confident that the model you'r
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> Wow, very nice results Mike!
Thanks :) I'm just praying I don't have some sneaky bug making
the results far better than they really are!! And still plenty
to do...
Mike
"Ning Li" <[EMAIL PROTECTED]> wrote:
> On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
>
> > * With term vectors and/or stored fields, the new patch has
> >substantially better RAM efficiency.
>
> Impressive numbers! The new patch improves RAM efficiency quite a bit
> even w
AM
> fragmentation is slowing down malloc/free. I'll be interested as to
> whether you see the same effect.
Interesting. OK I will run the benchmark across increasing RAM sizes
to see where the sweet spot seems to
Wow, very nice results Mike!
-Yonik
On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335
]
Michael McCandless commented on LUCENE-843:
On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
* With term vectors and/or stored fields, the new patch has
substantially better RAM efficiency.
Impressive numbers! The new patch improves RAM efficiency quite a bit
even with no term vectors nor stored fields, because of the
I'll be interested as to whether you see the same effect.
y reclaim. I think this also means you could push your
RAM buffer size even higher to get better performance.
Docs/sec: old 268.9; new 818.7 [204.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [825.2% more]
Avg RAM used (MB) @ flush: old 93.0; new 36.6 [60.6% less]
3 [ 50.9% less]
these segments are
then generally the less merging you need to do, for a given # docs in
the index.
I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the "document processing RAM",
on.
I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
dual core) OS X box with 2 GB RAM. I give java 1 GB heap (-Xmx1024m).
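The measurement call quoted above is easy to reproduce; a minimal helper using the same JMX bean (the helper name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sample the JVM's currently used heap, the same call the benchmark quotes.
class HeapSample {
    static long usedHeapBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }
}
```

Note this measures actual process heap usage, so it also captures "document processing RAM" outside the indexer's own buffers.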
ocs (25 MB plain text
each). I'm still consuming more RAM in this case than the
baseline (trunk) so I'm still working on this one ...
* Fixed a slow memory leak when building large (20+ GB) indices
cases/combinations of added docs or not, deleted docs or
not, any merges or not.
corruption case
* Added more asserts (run with "java -ea" so asserts run)
* Some more small optimizations
* Updated to current trunk so patch applies cleanly
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> I've only been loosely following this...
>
> Do you think it is possible to separate the stored/term vector
> handling into a separate patch against the current trunk? This seems
> like a quick win and I know it has been speculated about before.
I've only been loosely following this...
Do you think it is possible to separate the stored/term vector
handling into a separate patch against the current trunk? This seems
like a quick win and I know it has been speculated about before.
On Mar 23, 2007, at 12:00 PM, Michael McCandless wro
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor. But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.
That is not how the current merge p
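A rough sketch of the logarithmic level inference being debated (illustrative only; the actual merge policy computes things differently, which is exactly the disagreement here): reading a segment's level off its doc count, given maxBufferedDocs and mergeFactor. With example parameters of my choosing (maxBufferedDocs=1000, mergeFactor=10), a segment flushed early by RAM usage at 300 docs still computes as level 0, while a merge of ten level-0 segments lands at level 1.

```java
// Illustrative sketch, not Lucene's merge policy code: infer a segment's
// "level" from its size. Level 0 = freshly flushed; level N = the result
// of merging mergeFactor level-(N-1) segments.
class MergeLevels {
    static int level(int docCount, int maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long threshold = maxBufferedDocs;  // max docs in a level-0 segment
        while (docCount > threshold) {
            threshold *= mergeFactor;      // each merge multiplies size by mergeFactor
            level++;
        }
        return level;
    }
}
```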
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Merging is costly because you read all data in then write all data
> > out, so, you want to minimize for byte of data in the index in the
> > index how many times it will be "serviced" (read i
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Merging is costly because you read all data in then write all data
out, so, you want to minimize for byte of data in the index in the
index how many times it will be "serviced" (read in, written out) as
part of a merge.
Avoiding the re-w
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > We say that
> > developers should not rely on docIDs but people still seem to rely on
> > their monotonic ordering (even though they change).
>
> Yes. If the benefits of removing that guarantee are large enough, we
> could consider dumping it... but
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).
Yes. If the benefits of removing that guarantee are large enough, we
could consider dumping it... but
e.org
Subject: Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses
RAM to buffer added documents
"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote:
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond
to age, I think. ISTR Hatcher at least recommending techniques
that require it.
"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think. ISTR Hatcher at least recommending techniques that
> : require it.
>
> "Correspond to age" may be misleading as it
: > Actually is #2 a hard requirement?
:
: A lot of Lucene users depend on having document number correspond to
: age, I think. ISTR Hatcher at least recommending techniques that
: require it.
"Correspond to age" may be misleading as it implies that the actual
docid has meaning ... it's more tha
> But when these values have
> changed (or, segments were flushed by RAM not by maxBufferedDocs) then
> the way it computes level no longer results in the logarithmic policy
> that it's trying to implement, I think.
That's right. Parts of the implementation assume that the segments are
logarithmic
Steven Parkes wrote:
>> Right I'm calling a newly created segment (ie flushed from RAM)
>> level 0 and then a level 1 segment is created when you merge 10
>> level 0 segments, level 2 is created when merge 10 level 1 segments,
>> etc.
>
> This isn't the way the current code treats things. I'm not
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond to
age, I think. ISTR Hatcher at least recommending techniques that
require it.
Do the loose ports of Lucene
(KinoSearch, Ferret,
> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when merge 10 level 1 segments, etc.
This isn't the way the current code treats things. I'm not saying it's
the only way to loo
On Thu, 22 Mar 2007 13:34:39 -0700, "Steven Parkes" <[EMAIL PROTECTED]> said:
> > EG if you set maxBufferedDocs to say 1 but then it turns out based
> > on RAM usage you actually flush every 300 docs then the merge policy
> > will incorrectly merge a level 1 segment (with 3000 docs) in with th
> EG if you set maxBufferedDocs to say 1 but then it turns out based
> on RAM usage you actually flush every 300 docs then the merge policy
> will incorrectly merge a level 1 segment (with 3000 docs) in with the
> level 0 segments (with 300 docs). This is because the merge policy
> looks at th
"Steven Parkes" <[EMAIL PROTECTED]> wrote:
> * Merge policy has problems when you "flush by RAM" (this is true
> even before my patch). Not sure how to fix yet.
>
> Do you mean where one would be trying to use RAM usage to determine when
> to do a flush?
Right, if you have your indexer m
PROTECTED]
Sent: Thursday, March 22, 2007 10:09 AM
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM
to buffer added documents
[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira
.plugin.system.issuetabpanels:al
er things on my TODO list :)
improve how IndexWriter uses RAM to buffer added documents
--
Key: LUCENE-843
URL: https://issues.apache.org/jira/browse/LUCENE-843
Project: Lucene - Java
Issue Type: Improvement