> improve how IndexWriter uses RAM to buffer added documents
> --
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
>
a unit
test & fix it. I will also make the flush(boolean triggerMerge,
boolean flushDocStores) protected, not public, and move the javadoc
back to the public flush().
(docWriter.getMaxBufferedDocs());
}
ix shortly.
Michael McCandless wrote:
> OK, when you say "fair" I think you mean because you already had a
> previous run that used compound file, you had to use compound file in
> the run with the LUCENE-843 patch (etc)?
Yes, that's true.
> The recommendations above should speed up Lucene with or without m
main with compound.
OK, because you're doing StandardAnalyzer and HTML parsing and
presumably loading one-doc-per-file, most of your time is spent
outside of Lucene indexing, so I'd expect less than a 50% speedup in
this case.
0MB. Again, for a "fair"
comparison, I will remain with compound.
docs in RAM and then flush a new segment when it's
time. I've opened a separate issue (LUCENE-856) for optimizations
in segment merging.
p the
optimize part if this is not of interest for the comparison. (In fact I am
still waiting for my optimize() to complete, but if it is not of interest I
will just interrupt it...)
Thanks,
Doron
> > ANALYZER             PATCH (sec)   TRUNK (sec)   SPEEDUP
> > SimpleSpaceAnalyzer         79.0         326.5     4.1 X
> > StandardAnalyzer           449.0         674.1     1.5 X
> > WhitespaceAnalyzer         104.0         338.9     3.3 X
> > SimpleAnalyzer             104.7         328.0     3.1 X
StandardAnalyzer is definitely rather time consuming!
into the IndexOutputs.
I think a separation like that would work well: we could have good
performance and also extensibility. Devil is in the details of
course...
I obviously haven't factored DocumentsWriter in this way (it has its
own addPosition that writes the current Lucene index
is easily extensible in this regard? I'm
wondering because of all the optimizations you're doing like e. g.
sharing byte arrays. But I'm certainly not familiar enough with your code
yet, so I'm only guessing here.
be a significant improvement. Did you run those kinds
> of benchmarks already?
Good question ... I haven't measured the performance cost of using
StandardAnalyzer or HTML parsing but I will test & post back.
the finally clause of invertField() like DocumentWriter did before 880 this
is safe, because addPosition() serializes the term strings and payload bytes
into the posting hash table right away. Is that right?
mergeSegments really only merges segments.
ry to figure out how it will dovetail
with the merge policy factoring.
src/test/org/apache/lucene/index and test again? Thanks.
h the merge policy
stuff (LUCENE-847). Noticed that there are a couple of test failures?
"out of the box" performance. I will open a separate
issue to change the default after this issue is resolved.
> "doc store" files when all segments
> are sharing the same ones (big performance gain),
Is this only in the case where the segments have no deleted docs?
t of "doc store" files. Then when segments
are merged, the newly merged segment has its own "private" doc
stores again. So the sharing only occurs for the "level 0"
segments.
I still need to update fileform
consuming and single threaded.
does.
* Added some new unit test cases; added missing "writer.close()" to
one of the contrib tests.
* Cleanup, comments, etc. I think the code is getting more
"approachable" now.
>
ent SegmentMerger, possibly w/ some small
changes, to do the merges even when autoCommit=false. Since we have
another issue (LUCENE-856) to optimize segment merging I can carry
over any optimizations that we may want to keep into that issue. If
this doesn't lose much performance it will
ize by net # bytes in the segment. This would preserve the
"docID monotonicity invariance".
If we take that approach then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).
; new 47.5 [354.8% more]
Avg RAM used (MB) @ flush: old 74.3; new 42.9 [42.2% less]
On 4/30/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs. This is in c
to
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).
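The persistent-hash idea can be sketched roughly as follows. This is an illustrative toy, not DocumentsWriter's actual code (the class and method names here, like `PersistentPostings`, are mine): a Postings hash that lasts across many documents, with each term occurrence serialized immediately into one shared growable byte[] pool rather than per-document buffers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.nio.charset.StandardCharsets;

// Illustrative sketch (hypothetical names): a postings hash that persists
// across documents, appending term bytes into a single shared byte[] pool.
class PersistentPostings {
    static class Posting {
        int docFreq;         // number of docs containing this term
        int lastDocID = -1;  // last doc that saw this term
    }

    private final Map<String, Posting> postings = new HashMap<>();
    private byte[] pool = new byte[16];  // shared pool, grown as needed
    private int poolUsed = 0;

    // Record one occurrence of term in docID. The term's bytes go into the
    // shared pool right away, so the caller may reuse its own buffers.
    void addPosition(String term, int docID) {
        Posting p = postings.computeIfAbsent(term, t -> new Posting());
        if (p.lastDocID != docID) {
            p.docFreq++;
            p.lastDocID = docID;
        }
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        while (poolUsed + bytes.length > pool.length)
            pool = Arrays.copyOf(pool, pool.length * 2);  // amortized growth
        System.arraycopy(bytes, 0, pool, poolUsed, bytes.length);
        poolUsed += bytes.length;
    }

    int docFreq(String term) {
        Posting p = postings.get(term);
        return p == null ? 0 : p.docFreq;
    }
}
```

The point of the shared pool is to avoid the per-document allocation and copying that made earlier attempts at a persistent hash slower.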
ue, which is a riff
on your persistent arrays, improves indexing speed by about 15%.
segment. When enough of these
RAM segments have accumulated I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false)
which are then merged in the end to create a real Lucene segment.
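That accumulate-then-flush loop can be sketched as below. All names here are hypothetical, and the toy only counts flushes; the real writer builds RAM segments and later merges the flushed partial segments into a real Lucene segment, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: buffer docs in RAM, flush a segment once the
// estimated RAM usage crosses a threshold (names are hypothetical).
class RamBuffer {
    private final long ramLimitBytes;
    private long ramUsed = 0;
    private final List<String> bufferedDocs = new ArrayList<>();
    int segmentsFlushed = 0;

    RamBuffer(long ramLimitBytes) { this.ramLimitBytes = ramLimitBytes; }

    void addDocument(String doc) {
        bufferedDocs.add(doc);
        ramUsed += 2L * doc.length();  // crude per-doc RAM estimate (chars -> bytes)
        if (ramUsed >= ramLimitBytes) flushSegment();
    }

    private void flushSegment() {
        // In the real writer this writes a (partial) segment to disk.
        bufferedDocs.clear();
        ramUsed = 0;
        segmentsFlushed++;
    }
}
```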
Mike - thanks for explanation, it makes perfect sense!
Otis
- Original Message
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 8:03:44 PM
Subject: Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
>
> >> What we need to do is cut down on decompression and conflict
> >> resolution costs when reading from one segment to another. KS has
> >> solved this problem for stored fields. Field defs
On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another. KS has
solved this problem for stored fields. Field defs are global and
field values are keyed by name rather than fiel
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
>
> Michael, like everyone else, I am watching this very closely. So far
> it sounds great!
>
> On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
>
> > When I measure "amount of RAM @ flush time", I'm calling
> > MemoryMXBean.getHeapMemoryUsage
Michael, like everyone else, I am watching this very closely. So far
it sounds great!
On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual
process memory usage w
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
>
> >>> (I think for KS you "add" a previous segment not that
> >>> differently from how you "add" a document)?
> >>
> >> Yeah. KS has to decompress and serialize posting content, which sux.
> >
Hi Otis!
"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new
> 3.4 [90.1% less]).
>
> I don't follow 100% of what you are doing in LUCENE-843, so
"Mike Klaas" <[EMAIL PROTECTED]> wrote:
> On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > : Thanks! But remember many Lucene apps won't see these speedups since I've
> > : carefully minimized cost of tokenization and cost of document retrieval.
> > I
> > : think for many Lucene ap
On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: Thanks! But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval. I
: think for many Lucene apps these are a sizable part of time spent indexing.
true, bu
: Thanks! But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval. I
: think for many Lucene apps these are a sizable part of time spent indexing.
true, but as long as the changes you are making has no impact on
On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
Yeah. KS has to decompress and serialize posting content, which sux.
The one saving grace is that with the Fibonacci merge schedule and
th
ROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
buffer added documents
[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpan
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> > (I think for KS you "add" a previous segment not that
> > differently from how you "add" a document)?
>
> Yeah. KS has to decompress and serialize posting content, which sux.
>
> The one saving grace is that with the Fibonacci merge schedule and
On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:
Marvin do you have any sense of what the equivalent cost is
in KS
It's big. I don't have any good optimizations to suggest in this area.
(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
"eks dev" <[EMAIL PROTECTED]> wrote:
> wow, impressive numbers, congrats !
Thanks! But remember many Lucene apps won't see these speedups since I've
carefully minimized cost of tokenization and cost of document retrieval. I
think for many Lucene apps these are a sizable part of time spend index
wow, impressive numbers, congrats !
- Original Message
From: Michael McCandless (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to
buffer added doc
ompress the terms dict, the
more docs are merged in RAM before having to flush to disk. I
would also expect this curve to be somewhat content dependent.
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
>
> >> (: Ironically, the numbers for Lucene on that page are a little
> >> better than they should be because of a sneaky bug. I would have
> >> made updating the results a priority if they'd go
On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug. I would have
made updating the results a priority if they'd gone the other
way. :)
Hrm. It would be nice to have hard
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
>
> > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> >> Wow, very nice results Mike!
> >
> > Thanks :) I'm just praying I don't have some sneaky bug making
> > the results far better than they rea
On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
Wow, very nice results Mike!
Thanks :) I'm just praying I don't have some sneaky bug making
the results far better than they really are!!
That's possible, but I'm confident that the model you'r
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> Wow, very nice results Mike!
Thanks :) I'm just praying I don't have some sneaky bug making
the results far better than they really are!! And still plenty
to do...
Mike
"Ning Li" <[EMAIL PROTECTED]> wrote:
> On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
>
> > * With term vectors and/or stored fields, the new patch has
> >substantially better RAM efficiency.
>
> Impressive numbers! The new patch improves RAM efficiency quite a bit
> even w
AM
> fragmentation is slowing down malloc/free. I'll be interested as to
> whether you see the same effect.
Interesting. OK I will run the benchmark across increasing RAM sizes
to see where the sweet spot seems to
Wow, very nice results Mike!
-Yonik
On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335
]
Michael McCandless commented on LUCENE-843:
On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
* With term vectors and/or stored fields, the new patch has
substantially better RAM efficiency.
Impressive numbers! The new patch improves RAM efficiency quite a bit
even with no term vectors nor stored fields, because of the
I'll be interested as to whether you see the same effect.
y reclaim. I think this also means you could push your
RAM buffer size even higher to get better performance.
Docs/sec: old 268.9; new 818.7 [204.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [825.2% more]
Avg RAM used (MB) @ flush: old 93.0; new 36.6 [60.6% less]
3 [ 50.9% less]
these segments are
then generally the less merging you need to do, for a given # docs in
the index.
I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the "document processing RAM",
on.
I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
dual core) OS X box with 2 GB RAM. I give java 1 GB heap (-Xmx1024m).
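The measurement call quoted above is easy to reproduce; a minimal helper using the same JMX bean (the helper name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sample the JVM's currently used heap, the same call the benchmark quotes.
class HeapSample {
    static long usedHeapBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }
}
```

Note this measures actual process heap usage, so it also captures "document processing RAM" outside the indexer's own buffers.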
ocs (25 MB plain text
each). I'm still consuming more RAM in this case than the
baseline (trunk) so I'm still working on this one ...
* Fixed a slow memory leak when building large (20+ GB) indices
cases/combinations of added docs or not, deleted docs or
not, any merges or not.
corruption case
* Added more asserts (run with "java -ea" so asserts run)
* Some more small optimizations
* Updated to current trunk so patch applies cleanly
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> I've only been loosely following this...
>
> Do you think it is possible to separate the stored/term vector
> handling into a separate patch against the current trunk? This seems
> like a quick win and I know it has been speculated about before.
I've only been loosely following this...
Do you think it is possible to separate the stored/term vector
handling into a separate patch against the current trunk? This seems
like a quick win and I know it has been speculated about before.
On Mar 23, 2007, at 12:00 PM, Michael McCandless wro
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor. But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.
That is not how the current merge p
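A rough sketch of the logarithmic level inference being debated (illustrative only; the actual merge policy computes things differently, which is exactly the disagreement here): reading a segment's level off its doc count, given maxBufferedDocs and mergeFactor. With example parameters of my choosing (maxBufferedDocs=1000, mergeFactor=10), a segment flushed early by RAM usage at 300 docs still computes as level 0, while a merge of ten level-0 segments lands at level 1.

```java
// Illustrative sketch, not Lucene's merge policy code: infer a segment's
// "level" from its size. Level 0 = freshly flushed; level N = the result
// of merging mergeFactor level-(N-1) segments.
class MergeLevels {
    static int level(int docCount, int maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long threshold = maxBufferedDocs;  // max docs in a level-0 segment
        while (docCount > threshold) {
            threshold *= mergeFactor;      // each merge multiplies size by mergeFactor
            level++;
        }
        return level;
    }
}
```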
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Merging is costly because you read all data in then write all data
> > out, so, you want to minimize for byte of data in the index in the
> > index how many times it will be "serviced" (read i
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Merging is costly because you read all data in then write all data
out, so, you want to minimize for byte of data in the index in the
index how many times it will be "serviced" (read in, written out) as
part of a merge.
Avoiding the re-w
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > We say that
> > developers should not rely on docIDs but people still seem to rely on
> > their monotonic ordering (even though they change).
>
> Yes. If the benefits of removing that guarantee are large enough, we
> could consider dumping it... but
On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).
Yes. If the benefits of removing that guarantee are large enough, we
could consider dumping it... but
e.org
Subject: Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses
RAM to buffer added documents
"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote:
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond
to age, I think. ISTR Hatcher at least recommending techniques
that require it.
"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think. ISTR Hatcher at least recommending techniques that
> : require it.
>
> "Correspond to age" may be misleading as it
: > Actually is #2 a hard requirement?
:
: A lot of Lucene users depend on having document number correspond to
: age, I think. ISTR Hatcher at least recommending techniques that
: require it.
"Correspond to age" may be misleading as it implies that the actual
docid has meaning ... it's more tha
> But when these values have
> changed (or, segments were flushed by RAM not by maxBufferedDocs) then
> the way it computes level no longer results in the logarithmic policy
> that it's trying to implement, I think.
That's right. Parts of the implementation assume that the segments are
logarithmic
Steven Parkes wrote:
>> Right I'm calling a newly created segment (ie flushed from RAM)
>> level 0 and then a level 1 segment is created when you merge 10
>> level 0 segments, level 2 is created when merge 10 level 1 segments,
>> etc.
>
> This isn't the way the current code treats things. I'm not
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond to
age, I think. ISTR Hatcher at least recommending techniques that
require it.
Do the loose ports of Lucene
(KinoSearch, Ferret,
> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when merge 10 level 1 segments, etc.
This isn't the way the current code treats things. I'm not saying it's
the only way to loo
On Thu, 22 Mar 2007 13:34:39 -0700, "Steven Parkes" <[EMAIL PROTECTED]> said:
> > EG if you set maxBufferedDocs to say 1 but then it turns out based
> > on RAM usage you actually flush every 300 docs then the merge policy
> > will incorrectly merge a level 1 segment (with 3000 docs) in with th
> EG if you set maxBufferedDocs to say 1 but then it turns out based
> on RAM usage you actually flush every 300 docs then the merge policy
> will incorrectly merge a level 1 segment (with 3000 docs) in with the
> level 0 segments (with 300 docs). This is because the merge policy
> looks at th
"Steven Parkes" <[EMAIL PROTECTED]> wrote:
> * Merge policy has problems when you "flush by RAM" (this is true
> even before my patch). Not sure how to fix yet.
>
> Do you mean where one would be trying to use RAM usage to determine when
> to do a flush?
Right, if you have your indexer m
PROTECTED]
Sent: Thursday, March 22, 2007 10:09 AM
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM
to buffer added documents
[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira
.plugin.system.issuetabpanels:al
er things on my TODO list :)
improve how IndexWriter uses RAM to buffer added documents
--
Key: LUCENE-843
URL: https://issues.apache.org/jira/browse/LUCENE-843
Project: Lucene - Java
Issue Type: Improvement