improve how IndexWriter uses RAM to buffer added documents
--
Key: LUCENE-843
URL: https://issues.apache.org/jira/browse/LUCENE-843
Project: Lucene - Java
Issue Type: Improvement
a unit
test fix it. I will also make the flush(boolean triggerMerge,
boolean flushDocStores) protected, not public, and move the javadoc
back to the public flush().
--
Michael McCandless wrote:
OK, when you say fair I think you mean because you already had a
previous run that used compound file, you had to use compound file in
the run with the LUCENE-843 patch (etc)?
Yes, that's true.
The recommendations above should speed up Lucene with or without my
patch; still, for a fair comparison, I will remain with compound.
--
expect less than 50% speedup in
this case.
--
the
optimize part if this is not of interest for the comparison. (In fact I am
still waiting for my optimize() to complete, but if it is not of interest I
will just interrupt it...)
Thanks,
Doron
--
a separate issue (LUCENE-856) for optimizations
in segment merging.
--
StandardAnalyzer is definitely rather time consuming!
--
already?
Good question ... I haven't measured the performance cost of using
StandardAnalyzer or HTML parsing but I will test and post back.
--
is easily extensible in this regard? I'm
wondering because of all the optimizations you're doing like e. g.
sharing byte arrays. But I'm certainly not familiar enough with your code
yet, so I'm only guessing here.
--
extensibility. Devil is in the details of
course...
I obviously haven't factored DocumentsWriter in this way (it has its
own addPosition that writes the current Lucene index format) but I
think this is very doable in the future.
--
/index and test again? Thanks.
--
to figure out how it will dovetail
with the merge policy factoring.
--
of invertField() like DocumentWriter did before 880 this
is safe, because addPosition() serializes the term strings and payload bytes
into the posting hash table right away. Is that right?
--
performance. I will open a separate
issue to change the default after this issue is resolved.
--
merged segment has its own private doc
stores again. So the sharing only occurs for the level 0
segments.
I still need to update fileformats doc with this change.
--
segments
are sharing the same ones (big performance gain),
Is this only in the case where the segments have no deleted docs?
--
when all
segments are sharing the same ones (big performance gain),
Is this only in the case where the segments have no deleted docs?
Right. Also the segments must be contiguous which the current merge
policy ensures but future merge policies may not.
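The sharing condition described here (no deleted docs, and contiguous segments) can be sketched as a small predicate. All class and field names below are invented for illustration; this is not the patch's actual API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the rule above: merged segments may keep sharing
// one set of doc stores only if no segment has deletions and the segments
// are contiguous (each segment's doc-store offset continues exactly where
// the previous segment's docs ended).
public class DocStoreShare {
    static class Seg {
        final int docStoreOffset;   // where this segment's docs start in the shared store
        final int docCount;
        final boolean hasDeletions;
        Seg(int offset, int count, boolean deletions) {
            this.docStoreOffset = offset;
            this.docCount = count;
            this.hasDeletions = deletions;
        }
    }

    static boolean canShare(List<Seg> segs) {
        int expected = segs.get(0).docStoreOffset;
        for (Seg s : segs) {
            if (s.hasDeletions) return false;            // deletes force private doc stores
            if (s.docStoreOffset != expected) return false; // gap => not contiguous
            expected += s.docCount;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Seg> contiguous = Arrays.asList(new Seg(0, 100, false), new Seg(100, 50, false));
        List<Seg> gapped = Arrays.asList(new Seg(0, 100, false), new Seg(200, 50, false));
        System.out.println(canShare(contiguous) + " " + canShare(gapped));
    }
}
```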
--
test cases; added missing writer.close() to
one of the contrib tests.
* Cleanup, comments, etc. I think the code is getting more
approachable now.
--
threaded.
--
have
another issue (LUCENE-856) to optimize segment merging I can carry
over any optimizations that we may want to keep into that issue. If
this doesn't lose much performance it will make the approach here even
simpler.
--
segments have accumulated I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false)
which are then merged in the end to create a real Lucene segment.
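The flow just described can be sketched in miniature: buffer added docs in RAM, flush either a real segment (autoCommit=true) or an on-disk partial segment (autoCommit=false), and merge the partials into one real segment at the end. All names here are invented for illustration, not DocumentsWriter's real API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the buffering/flush flow described above.
public class FlushSketch {
    final boolean autoCommit;
    final int maxBuffered;
    int buffered = 0;
    final List<String> partials = new ArrayList<>();  // autoCommit=false flushes
    final List<String> segments = new ArrayList<>();  // real Lucene segments

    FlushSketch(boolean autoCommit, int maxBuffered) {
        this.autoCommit = autoCommit;
        this.maxBuffered = maxBuffered;
    }

    void addDocument() {
        if (++buffered >= maxBuffered) flush();
    }

    void flush() {
        if (buffered == 0) return;
        if (autoCommit) segments.add("segment-" + segments.size());
        else partials.add("partial-" + partials.size());
        buffered = 0;
    }

    void close() {
        flush();
        // With autoCommit=false, merge all partial segments into one real segment.
        if (!autoCommit && !partials.isEmpty()) {
            segments.add("merged-" + partials.size() + "-partials");
            partials.clear();
        }
    }
}
```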
--
persistent arrays, improves indexing speed by about 15%.
--
speed by about 15%.
Fabulous!!
I think it's the custom memory management I'm doing with slices into
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).
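The shared-byte[] slice idea credited here can be shown with a toy allocator. This is not the patch's real classes (which chain multiple blocks and grow slices in place); it only illustrates why per-posting allocations disappear:

```java
// Toy sketch: postings data is written into slices carved out of one large
// shared byte[] block, so each posting avoids its own small object
// allocation and the GC pressure that comes with it.
public class SlicePool {
    private final byte[] block = new byte[32 * 1024];
    private int upto = 0;   // next free offset in the shared block

    /** Reserve length bytes in the shared block; returns the slice's start offset. */
    public int alloc(int length) {
        if (upto + length > block.length)
            throw new IllegalStateException("block full (a real pool chains new blocks)");
        int start = upto;
        upto += length;
        return start;
    }

    public void writeByte(int offset, byte b) { block[offset] = b; }
    public byte readByte(int offset) { return block[offset]; }
    public int bytesUsed() { return upto; }
}
```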
--
On 4/30/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
After discussion on java-dev last time, I decided to retry the
persistent hash approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs. This is in
; new 47.5 [354.8% more]
Avg RAM used (MB) @ flush: old 74.3; new 42.9 [42.2% less]
--
be much larger than segments to its left. I
suppose the idea of merging rightmost segments should just be dropped in favor
of merging the smallest adjacent segments? Sorry if this has already been
covered... as I said, I'm trying to follow along at a high level.
--
then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).
--
Mike - thanks for explanation, it makes perfect sense!
Otis
in RAM before having to flush to disk. I
would also expect this curve to be somewhat content dependent.
--
wow, impressive numbers, congrats !
eks dev [EMAIL PROTECTED] wrote:
wow, impressive numbers, congrats !
Thanks! But remember many Lucene apps won't see these speedups since I've
carefully minimized cost of tokenization and cost of document retrieval. I
think for many Lucene apps these are a sizable part of time spent indexing.
On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:
Marvin do you have any sense of what the equivalent cost is
in KS
It's big. I don't have any good optimizations to suggest in this area.
(I think for KS you add a previous segment not that
differently from how you add a document)?
Marvin Humphrey [EMAIL PROTECTED] wrote:
(I think for KS you add a previous segment not that
differently from how you add a document)?
Yeah. KS has to decompress and serialize posting content, which sux.
The one saving grace is that with the Fibonacci merge schedule and
the
: Thanks! But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval. I
: think for many Lucene apps these are a sizable part of time spent indexing.
true, but as long as the changes you are making have no impact
Hi Otis!
Otis Gospodnetic [EMAIL PROTECTED] wrote:
You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new
3.4 [90.1% less]).
I don't follow 100% of what you are doing in LUCENE-843, so could
Michael, like everyone else, I am watching this very closely. So far
it sounds great!
On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
When I measure amount of RAM @ flush time, I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual
process memory usage
On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another. KS has
solved this problem for stored fields. Field defs are global and
field values are keyed by name rather than
On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
Yonik Seeley [EMAIL PROTECTED] wrote:
Wow, very nice results Mike!
Thanks :) I'm just praying I don't have some sneaky bug making
the results far better than they really are!!
That's possible, but I'm confident that the model you're
On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug. I would have
made updating the results a priority if they'd gone the other
way. :)
Hrm. It would be nice to have
, for a given # docs in
the index.
I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the document processing RAM, object
overhead, etc.
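The measurement described above can be shown as a runnable fragment. Only the helper's name is invented here; the MemoryMXBean calls are the ones quoted in the thread:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sample the JVM's used heap the same way the benchmark does just prior to
// each flush: MemoryMXBean.getHeapMemoryUsage().getUsed().
public class HeapProbe {
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.printf("used heap: %.1f MB%n", usedHeapBytes() / (1024.0 * 1024.0));
    }
}
```

Note this measures heap in use by the whole JVM (documents being processed, object overhead, etc.), not just the indexer's buffer, which is exactly why it is sampled in the benchmark.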
--
: old 268.9; new 818.7 [204.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [825.2% more]
Avg RAM used (MB) @ flush: old 93.0; new 36.6 [60.6% less]
--
this also means you could push your
RAM buffer size even higher to get better performance.
--
as to whether you see the same effect.
--
On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
* With term vectors and/or stored fields, the new patch has
substantially better RAM efficiency.
Impressive numbers! The new patch improves RAM efficiency quite a bit
even with no term vectors nor stored fields, because of the
Wow, very nice results Mike!
-Yonik
On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
Michael McCandless commented on LUCENE-843:
as to
whether you see the same effect.
Interesting. OK I will run the benchmark across increasing RAM sizes
to see where the sweet spot seems to be!
--
Yonik Seeley [EMAIL PROTECTED] wrote:
Wow, very nice results Mike!
Thanks :) I'm just praying I don't have some sneaky bug making
the results far better than they really are!! And still plenty
to do...
Mike
consuming more RAM in this case than the
baseline (trunk) so I'm still working on this one ...
* Fixed a slow memory leak when building large (20+ GB) indices
--
or not, deleted docs or
not, any merges or not.
--
corruption case
* Added more asserts (run with java -ea so asserts run)
* Some more small optimizations
* Updated to current trunk so patch applies cleanly
--
On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote:
We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).
Yes. If the benefits of removing that guarantee are large enough, we
could consider dumping it... but
On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote:
Merging is costly because you read all data in then write all data
out, so, you want to minimize, for each byte of data in the index, how
many times it will be serviced (read in, written out) as
part of a merge.
Avoiding the
On 3/22/07, Michael McCandless [EMAIL PROTECTED] wrote:
Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.
That is not how the current merge
I've only been loosely following this...
Do you think it is possible to separate the stored/term vector
handling into a separate patch against the current trunk? This seems
like a quick win and I know it has been speculated about before.
On Mar 23, 2007, at 12:00 PM, Michael McCandless
Grant Ingersoll [EMAIL PROTECTED] wrote:
I've only been loosely following this...
Do you think it is possible to separate the stored/term vector
handling into a separate patch against the current trunk? This seems
like a quick win and I know it has been speculated about before.
This
--
:)
--
Steven Parkes [EMAIL PROTECTED] wrote:
* Merge policy has problems when you flush by RAM (this is true
even before my patch). Not sure how to fix yet.
Do you mean where one would be trying to use RAM usage to determine when
to do a flush?
Right, if you have your indexer measure RAM
EG if you set maxBufferedDocs to say 1 but then it turns out based
on RAM usage you actually flush every 300 docs then the merge policy
will incorrectly merge a level 1 segment (with 3000 docs) in with the
level 0 segments (with 300 docs). This is because the merge policy
looks at the
Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.
This isn't the way the current code treats things. I'm not saying it's
the only way to look
On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond to
age, I think. ISTR Hatcher at least recommending techniques that
require it.
Do the loose ports of Lucene
(KinoSearch,
But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no longer results in the logarithmic policy
that it's trying to implement, I think.
That's right. Parts of the implementation assume that the segments are
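The logarithmic level computation being discussed can be sketched as follows. This is illustrative only, not the actual merge-policy code: it treats a freshly flushed segment of up to maxBufferedDocs docs as level 0, mergeFactor of those merged together as level 1, and so on, and the assumption breaks exactly as described when segments are flushed by RAM usage instead of by maxBufferedDocs:

```java
// Illustrative sketch: derive a segment's "level" from its doc count,
// assuming segments were always flushed at maxBufferedDocs docs.
public class LevelSketch {
    static int level(int docCount, int maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long threshold = maxBufferedDocs;   // max docs for a level-0 segment
        while (docCount > threshold) {
            threshold *= mergeFactor;       // each level holds mergeFactor x more
            level++;
        }
        return level;
    }
}
```

For example, with maxBufferedDocs=1000 and mergeFactor=10, a 1000-doc segment computes as level 0 and a 10000-doc segment as level 1; but if flushes actually happen every 300 docs, the computed levels no longer reflect how many merges a segment has been through.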
: Actually is #2 a hard requirement?
:
: A lot of Lucene users depend on having document number correspond to
: age, I think. ISTR Hatcher at least recommending techniques that
: require it.
Correspond to age may be misleading as it implies that the actual
docid has meaning ... it's more that