[jira] Commented: (LUCENE-806) Synchronization bottleneck in FieldSortedHitQueue with many concurrent readers
[ https://issues.apache.org/jira/browse/LUCENE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486252 ] Paul Cowan commented on LUCENE-806:
---
Otis, you're probably right -- it may not be wise to try to kill two birds with one stone. My concern is that doing this the quick and dirty way may mean exposing an API to enable/disable the behaviour, which a subsequent refactor would then remove, and I'd obviously rather keep the API stable. I'm about to attach 3 patches with varying levels of impact on the code; I'd be interested to hear which approach people think is best, given the possible refactor.

Hoss, I've had a look at your patch and rather like it, but it tackles a slightly different problem: it cleans up the FieldCache (which is a great idea), whereas cleaning up FieldSortedHitQueue is only incidentally related to FieldCache. FSHQ uses the cache (and if FSHQ were broken up, each comparator source would use your much cleaner API), but I think the two coexist quite happily. In other words, I'd like to see both!

Synchronization bottleneck in FieldSortedHitQueue with many concurrent readers
--
Key: LUCENE-806
URL: https://issues.apache.org/jira/browse/LUCENE-806
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.0.0
Reporter: Paul Cowan
Priority: Minor
Attachments: lucene-806-proposed-direction.patch, lucene-806.patch

The below is from a post by (my colleague) Paul Smith to the java-users list:
---
Hi ho peoples. We have an application that is internationalized and stores data from many languages (each project has its own index, mostly aligned with a single language, maybe 2). While looking at some thread dumps to diagnose a performance issue, I noticed what appears to be a _potential_ synchronization bottleneck when using Locale-based sorting of Strings. I don't think this problem is the root cause of our performance problem, but I thought I'd mention it here. Here's the stack dump of a waiting thread:

"http-1001-Processor245" daemon prio=1 tid=0x31434da0 nid=0x3744 waiting for monitor entry [0x2cd44000..0x2cd45f30]
    at java.text.RuleBasedCollator.compare(RuleBasedCollator.java)
    - waiting to lock <0x6b1e8c68> (a java.text.RuleBasedCollator)
    at org.apache.lucene.search.FieldSortedHitQueue$4.compare(FieldSortedHitQueue.java:320)
    at org.apache.lucene.search.FieldSortedHitQueue.lessThan(FieldSortedHitQueue.java:114)
    at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:120)
    at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:47)
    at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:58)
    at org.apache.lucene.search.FieldSortedHitQueue.insert(FieldSortedHitQueue.java:90)
    at org.apache.lucene.search.FieldSortedHitQueue.insert(FieldSortedHitQueue.java:97)
    at org.apache.lucene.search.TopFieldDocCollector.collect(TopFieldDocCollector.java:47)
    at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:291)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
    at com.aconex.index.search.FastLocaleSortIndexSearcher.search(FastLocaleSortIndexSearcher.java:90)
    ...

In our case we had 12 threads waiting like this while one thread held the lock on the RuleBasedCollator. It turns out RuleBasedCollator's compare(...) method is synchronized. I wonder if a ThreadLocal-based Collator would be better here? There doesn't appear to be a reason for the threads searching the same index to wait on this sort; it would be just as easy for each to use its own Collator. (Is RuleBasedCollator a heavy object memory-wise? I wouldn't have thought so, per thread.) Thoughts?
---
I've investigated this somewhat, and agree that this is a potential problem with a number of possible workarounds. Further discussion (including a proof-of-concept patch) to follow.
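The attached patches aren't reproduced in this digest, but purely as an illustrative sketch of the ThreadLocal idea under discussion (the class and method names here are invented), the core of it would look something like:

    import java.text.Collator;
    import java.util.Locale;

    public class PerThreadCollator {

      private final Locale locale;

      // One Collator per thread: initialValue() runs lazily, the first time each
      // thread calls get(), so compare() below never blocks on another thread's collator.
      private final ThreadLocal collator = new ThreadLocal() {
        protected Object initialValue() {
          return Collator.getInstance(locale);
        }
      };

      public PerThreadCollator(Locale locale) {
        this.locale = locale;
      }

      public int compare(String a, String b) {
        return ((Collator) collator.get()).compare(a, b);
      }
    }

The trade-off is one Collator instance per searching thread per locale, which matches Paul Smith's guess that the per-thread memory cost should be small.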
[jira] Updated: (LUCENE-806) Synchronization bottleneck in FieldSortedHitQueue with many concurrent readers
[ https://issues.apache.org/jira/browse/LUCENE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Cowan updated LUCENE-806:
--
Attachment: LUCENE-806-minimal-usealways.patch

Minimal ThreadLocal wrapper, implementation #1: an always-on version. The per-thread collator is used all the time, which may not be ideal (I'm not sure there are any major disadvantages, mind you: ThreadLocals are very low-impact, Collators are quite lightweight, and there shouldn't be any duplicated object instances floating around). Note that with this version the original comparatorStringLocale() method could be removed; I've left it in place for now, though.
[jira] Updated: (LUCENE-806) Synchronization bottleneck in FieldSortedHitQueue with many concurrent readers
[ https://issues.apache.org/jira/browse/LUCENE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Cowan updated LUCENE-806:
--
Attachment: LUCENE-806-minimal-systemproperty.patch

Minimal ThreadLocal wrapper, implementation #2: behaviour controlled by a system property (org.apache.lucene.usePerThreadLocaleComparators). This is messy, but it leaves the current behaviour as the default, and it's not unprecedented in the Lucene codebase. If it's decided the behaviour shouldn't be always-on, this may be the best compromise: it still (in a way) exposes a public API, but because it's a system property it's less visible, and it may be less painful to yank later.
[jira] Updated: (LUCENE-806) Synchronization bottleneck in FieldSortedHitQueue with many concurrent readers
[ https://issues.apache.org/jira/browse/LUCENE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Cowan updated LUCENE-806:
--
Attachment: LUCENE-806-minimal-publicapi.patch

Minimal ThreadLocal wrapper, implementation #3: a public static API. This is the easiest way to do it, but it means that if the code is later refactored so the switch becomes unnecessary (or, more accurately, is done in a cleaner way), the API may get yanked after only a relatively short lifespan.
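Taken together, implementations #2 and #3 amount to a switch around the same per-thread collator. As a rough, hypothetical sketch only (this is not the attached patches' actual API; the class name, method names and cache are invented for illustration), the two toggles could be combined like this:

    import java.text.Collator;
    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class LocaleCollatorFactory {

      // Implementation #2: default comes from the system property.
      private static boolean perThread =
          Boolean.getBoolean("org.apache.lucene.usePerThreadLocaleComparators");

      private static final Map CACHE = new HashMap();   // Locale -> ThreadLocal

      // Implementation #3: explicit public API to flip the behaviour at runtime.
      public static synchronized void setUsePerThreadCollators(boolean on) {
        perThread = on;
      }

      // Returns a Collator: per-thread if the feature is enabled, shared otherwise.
      // The synchronization here guards only the brief cache lookup, not compare().
      public static synchronized Collator getCollator(final Locale locale) {
        if (!perThread) {
          return Collator.getInstance(locale);   // current behaviour
        }
        ThreadLocal tl = (ThreadLocal) CACHE.get(locale);
        if (tl == null) {
          tl = new ThreadLocal() {
            protected Object initialValue() {
              return Collator.getInstance(locale);   // one Collator per thread per locale
            }
          };
          CACHE.put(locale, tl);
        }
        return (Collator) tl.get();
      }
    }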
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486292 ] Michael McCandless commented on LUCENE-843:
---
Some details on how I measure RAM usage: both the baseline (current Lucene trunk) and my patch have two general classes of RAM usage.

The first class, document processing RAM, is RAM used while processing a single doc. This RAM is re-used for each document (in the trunk it's GC'd and new RAM is allocated; in my patch I explicitly re-use these objects), and how large it gets is driven by how big each document is.

The second class, indexed documents RAM, is the RAM used up by previously indexed documents. This RAM grows with each added document, and how large it gets is driven by the number and size of docs indexed since the last flush.

So when I say the writer is allowed to use 32 MB of RAM, I'm only measuring the indexed documents RAM. With trunk I do this by calling ramSizeInBytes(), and with my patch I do the analogous thing by measuring how many RAM buffers are held up storing previously indexed documents.

I then define RAM efficiency (docs/MB) as how many docs we can hold in indexed documents RAM per MB of RAM, at the point that we flush to disk. I think this is an important metric because it drives how large your initial (level 0) segments are; the larger these segments are, the less merging you generally need to do for a given # of docs in the index.

I also measure overall RAM used in the JVM (using MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush except the last, to also capture the document processing RAM, object overhead, etc.

improve how IndexWriter uses RAM to buffer added documents
--
Key: LUCENE-843
URL: https://issues.apache.org/jira/browse/LUCENE-843
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
Assigned To: Michael McCandless
Priority: Minor
Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch

I'm working on a new class (MultiDocumentWriter) that writes more than one document directly into a single Lucene segment, more efficiently than the current approach. This only affects the creation of an initial segment from added documents; I haven't changed anything after that, e.g. how segments are merged. The basic ideas are:
* Write stored fields and term vectors directly to disk (don't use up RAM for these).
* Gather posting lists and term infos in RAM, but periodically do in-RAM merges. Once RAM is full, flush buffers to disk (and merge them later when it's time to make a real segment).
* Recycle objects/buffers to reduce time/stress in GC.
* Other various optimizations.
Some of these changes are similar to how KinoSearch builds a segment. But I haven't made any changes to Lucene's file format nor added a requirement for a global fields schema. So far the only externally visible change is a new method setRAMBufferSize in IndexWriter (and setMaxBufferedDocs is deprecated), so that it flushes according to RAM usage rather than a fixed number of added documents.
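As a concrete (hypothetical) sketch of those two measurements, assuming the trunk writer's ramSizeInBytes() and the standard MemoryMXBean, the sampling done just before a flush looks roughly like this; the class name and the exact sampling hook are invented here, and the benchmark tool itself is not shown:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import org.apache.lucene.index.IndexWriter;

    public class FlushStats {
      public static void sample(IndexWriter writer, int docsSinceLastFlush) {
        // "indexed documents RAM" only: what the writer has buffered for prior docs
        long bufferedBytes = writer.ramSizeInBytes();
        double docsPerMB = docsSinceLastFlush / (bufferedBytes / (1024.0 * 1024.0));

        // whole-JVM heap: also captures document-processing RAM, object overhead, etc.
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        double heapUsedMB = mem.getHeapMemoryUsage().getUsed() / (1024.0 * 1024.0);

        System.out.println("docs/MB @ flush = " + docsPerMB
            + ", heap used (MB) = " + heapUsedMB);
      }
    }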
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486332 ] Michael McCandless commented on LUCENE-843:
---
A couple more details on the testing: I run java -server to get all optimizations in the JVM, and the IO system is a local OS X RAID 0 of 4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old (= Lucene trunk) vs new (= this patch), varying the document size (~550 bytes, ~5,500 bytes and ~55,000 bytes of plain text from Europarl en). For each document size I run 4 combinations: term vectors and stored fields on or off, and autoCommit true or false. I measure net docs/sec (= total # docs indexed divided by total time taken), RAM efficiency (= avg # docs flushed with each flush divided by RAM buffer size), and avg heap RAM usage before each flush.

Here are the results for the 10K-token docs (= ~55,000 bytes plain text per document):

20000 DOCS @ ~55,000 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 20000 docs in 200.3 secs, index size = 358M
  new: 20000 docs in 126.0 secs, index size = 356M
  Total Docs/sec:            old  99.8; new 158.7  [ 59.0% faster]
  Docs/MB @ flush:           old  24.2; new  49.1  [102.5% more]
  Avg RAM used (MB) @ flush: old  74.5; new  36.2  [ 51.4% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 20000 docs in 202.7 secs, index size = 358M
  new: 20000 docs in 120.0 secs, index size = 354M
  Total Docs/sec:            old  98.7; new 166.7  [ 69.0% faster]
  Docs/MB @ flush:           old  24.2; new  48.9  [101.7% more]
  Avg RAM used (MB) @ flush: old  74.3; new  37.0  [ 50.2% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 20000 docs in 374.7 secs, index size = 1.4G
  new: 20000 docs in 236.1 secs, index size = 1.4G
  Total Docs/sec:            old  53.4; new  84.7  [ 58.7% faster]
  Docs/MB @ flush:           old  10.2; new  49.1  [382.8% more]
  Avg RAM used (MB) @ flush: old 129.3; new  36.6  [ 71.7% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 20000 docs in 385.7 secs, index size = 1.4G
  new: 20000 docs in 182.8 secs, index size = 1.4G
  Total Docs/sec:            old  51.9; new 109.4  [111.0% faster]
  Docs/MB @ flush:           old  10.2; new  48.9  [380.9% more]
  Avg RAM used (MB) @ flush: old  76.0; new  37.3  [ 50.9% less]
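The benchmark tool itself isn't attached to this thread; a bare-bones stand-in for its timing loop, showing only how net docs/sec falls out, might look like the following (the index path, field name and analyzer are placeholders, and the patch's setRAMBufferSize call is omitted because its exact signature isn't shown in these mails):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ThroughputBench {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/bench-index", new WhitespaceAnalyzer(), true);
        BufferedReader lines = new BufferedReader(new FileReader(args[0]));  // one doc per line
        long start = System.currentTimeMillis();
        int count = 0;
        for (String line = lines.readLine(); line != null; line = lines.readLine()) {
          Document doc = new Document();
          doc.add(new Field("body", line, Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);
          count++;
        }
        writer.close();
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println(count + " docs in " + secs + " secs = " + (count / secs) + " docs/sec");
      }
    }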
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486334 ] Michael McCandless commented on LUCENE-843:
---
Here are the results for normal sized docs (1K tokens = ~5,500 bytes plain text each):

200000 DOCS @ ~5,500 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 200000 docs in 397.6 secs, index size = 415M
  new: 200000 docs in 167.5 secs, index size = 411M
  Total Docs/sec:            old  503.1; new 1194.1  [137.3% faster]
  Docs/MB @ flush:           old   81.6; new  406.2  [397.6% more]
  Avg RAM used (MB) @ flush: old   87.3; new   35.2  [ 59.7% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 200000 docs in 394.6 secs, index size = 415M
  new: 200000 docs in 168.4 secs, index size = 408M
  Total Docs/sec:            old  506.9; new 1187.7  [134.3% faster]
  Docs/MB @ flush:           old   81.6; new  432.2  [429.4% more]
  Avg RAM used (MB) @ flush: old  126.6; new   36.9  [ 70.8% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 200000 docs in 754.2 secs, index size = 1.7G
  new: 200000 docs in 304.9 secs, index size = 1.7G
  Total Docs/sec:            old  265.2; new  656.0  [147.4% faster]
  Docs/MB @ flush:           old   46.7; new  406.2  [769.6% more]
  Avg RAM used (MB) @ flush: old   92.9; new   35.2  [ 62.1% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 200000 docs in 743.9 secs, index size = 1.7G
  new: 200000 docs in 244.3 secs, index size = 1.7G
  Total Docs/sec:            old  268.9; new  818.7  [204.5% faster]
  Docs/MB @ flush:           old   46.7; new  432.2  [825.2% more]
  Avg RAM used (MB) @ flush: old   93.0; new   36.6  [ 60.6% less]
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335 ] Michael McCandless commented on LUCENE-843:
---
Last are the results for small docs (100 tokens = ~550 bytes plain text each):

2000000 DOCS @ ~550 bytes plain text
RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 2000000 docs in 886.7 secs, index size = 438M
  new: 2000000 docs in 230.5 secs, index size = 435M
  Total Docs/sec:            old 2255.6; new 8676.4  [ 284.7% faster]
  Docs/MB @ flush:           old  128.0; new 4194.6  [3176.2% more]
  Avg RAM used (MB) @ flush: old  107.3; new   37.7  [  64.9% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 2000000 docs in 888.7 secs, index size = 438M
  new: 2000000 docs in 239.6 secs, index size = 432M
  Total Docs/sec:            old 2250.5; new 8348.7  [ 271.0% faster]
  Docs/MB @ flush:           old  128.0; new 4146.8  [3138.9% more]
  Avg RAM used (MB) @ flush: old  108.1; new   38.9  [  64.0% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 2000000 docs in 1480.1 secs, index size = 2.1G
  new: 2000000 docs in 462.0 secs, index size = 2.1G
  Total Docs/sec:            old 1351.2; new 4329.3  [ 220.4% faster]
  Docs/MB @ flush:           old   93.1; new 4194.6  [4405.7% more]
  Avg RAM used (MB) @ flush: old  296.4; new   38.3  [  87.1% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 2000000 docs in 1489.4 secs, index size = 2.1G
  new: 2000000 docs in 347.9 secs, index size = 2.1G
  Total Docs/sec:            old 1342.8; new 5749.4  [ 328.2% faster]
  Docs/MB @ flush:           old   93.1; new 4146.8  [4354.5% more]
  Avg RAM used (MB) @ flush: old  297.1; new   38.6  [  87.0% less]

200000 DOCS @ ~5,500 bytes plain text

No term vectors nor stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 200000 docs in 397.6 secs, index size = 415M
  new: 200000 docs in 167.5 secs, index size = 411M
  Total Docs/sec:            old  503.1; new 1194.1  [137.3% faster]
  Docs/MB @ flush:           old   81.6; new  406.2  [397.6% more]
  Avg RAM used (MB) @ flush: old   87.3; new   35.2  [ 59.7% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 200000 docs in 394.6 secs, index size = 415M
  new: 200000 docs in 168.4 secs, index size = 408M
  Total Docs/sec:            old  506.9; new 1187.7  [134.3% faster]
  Docs/MB @ flush:           old   81.6; new  432.2  [429.4% more]
  Avg RAM used (MB) @ flush: old  126.6; new   36.9  [ 70.8% less]

With term vectors (positions + offsets) and 2 small stored fields

AUTOCOMMIT = true (commit whenever RAM is full)
  old: 200000 docs in 754.2 secs, index size = 1.7G
  new: 200000 docs in 304.9 secs, index size = 1.7G
  Total Docs/sec:            old  265.2; new  656.0  [147.4% faster]
  Docs/MB @ flush:           old   46.7; new  406.2  [769.6% more]
  Avg RAM used (MB) @ flush: old   92.9; new   35.2  [ 62.1% less]

AUTOCOMMIT = false (commit only once at the end)
  old: 200000 docs in 743.9 secs, index size = 1.7G
  new: 200000 docs in 244.3 secs, index size = 1.7G
  Total Docs/sec:            old  268.9; new  818.7  [204.5% faster]
  Docs/MB @ flush:           old   46.7; new  432.2  [825.2% more]
  Avg RAM used (MB) @ flush: old   93.0; new   36.6  [ 60.6% less]
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486339 ] Michael McCandless commented on LUCENE-843:
---
A few notes on these results:

* A real Lucene app won't see gains this large, because in practice retrieving docs from the content source and tokenizing them take a substantial amount of time. Those costs are intentionally tiny in this test: I'm 1) pulling one line at a time from a big text file, and 2) using my simplistic SimpleSpaceAnalyzer, which just breaks tokens at the space character.
* The best speedup is ~4.3X, for tiny docs (~550 bytes) with term vectors and stored fields enabled and autoCommit=false.
* The smallest speedup is still ~1.6X, for large docs (~55,000 bytes) with autoCommit=true.
* The autoCommit=false cases are a little unfair to the new patch: with the patch you end up with a single-segment (optimized) index, whereas with the existing Lucene trunk you don't.
* With term vectors and/or stored fields, autoCommit=false is quite a bit faster with the patch, because we never pay the price to merge them; they are written only once.
* With term vectors and/or stored fields, the new patch has substantially better RAM efficiency.
* The patch is especially faster, and more RAM efficient, with smaller documents.
* The actual heap RAM usage is quite a bit more stable with the patch, especially with term vectors and stored fields enabled. I think this is because the patch creates far less garbage for GC to periodically reclaim. I think this also means you could push your RAM buffer size even higher to get better performance.
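SimpleSpaceAnalyzer itself isn't included in this thread; an equivalent throwaway analyzer (illustrative name, not the actual class) that breaks tokens only at the space character would be roughly:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.TokenStream;

    public class SpaceOnlyAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new SpaceTokenizer(reader);
      }

      private static class SpaceTokenizer extends CharTokenizer {
        SpaceTokenizer(Reader in) {
          super(in);
        }
        protected boolean isTokenChar(char c) {
          return c != ' ';   // break tokens only at the space character, no lowercasing or stop words
        }
      }
    }

The point of such an analyzer in the benchmark is simply to keep analysis cost near zero so the measured difference is dominated by the indexing path itself.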
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486373 ] Marvin Humphrey commented on LUCENE-843:
---
> The actual heap RAM usage is quite a bit more stable with the patch, especially with term vectors and stored fields enabled. I think this is because the patch creates far less garbage for GC to periodically reclaim. I think this also means you could push your RAM buffer size even higher to get better performance.

For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when benchmarking with the Reuters corpus on my G4 laptop. Larger than that and things actually slow down, unless the buffer is large enough that it never needs flushing. My hypothesis is that RAM fragmentation is slowing down malloc/free. I'll be interested to see whether you see the same effect.
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
> * With term vectors and/or stored fields, the new patch has substantially better RAM efficiency.

Impressive numbers! The new patch improves RAM efficiency quite a bit even with no term vectors nor stored fields, because of the periodic in-RAM merges of posting lists, term infos, etc. The frequency of the in-RAM merges is controlled by flushedMergeFactor, which is measured in doc count, right? How sensitive is performance to the value of flushedMergeFactor?

Cheers,
Ning
Re: improve how IndexWriter uses RAM to buffer added documents
Wow, very nice results Mike!

-Yonik

On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
> [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335 ] Michael McCandless commented on LUCENE-843:
> ---
> Last are the results for small docs (100 tokens = ~550 bytes plain text each):
> [snip]
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486385 ] Michael McCandless commented on LUCENE-843:
---
> > The actual heap RAM usage is quite a bit more stable with the patch, especially with term vectors and stored fields enabled. I think this is because the patch creates far less garbage for GC to periodically reclaim. I think this also means you could push your RAM buffer size even higher to get better performance.
>
> For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when benchmarking with the Reuters corpus on my G4 laptop. Larger than that and things actually slow down, unless the buffer is large enough that it never needs flushing. My hypothesis is that RAM fragmentation is slowing down malloc/free. I'll be interested to see whether you see the same effect.

Interesting. OK, I will run the benchmark across increasing RAM sizes to see where the sweet spot seems to be!
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Ning Li [EMAIL PROTECTED] wrote:
> On 4/3/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
> > * With term vectors and/or stored fields, the new patch has substantially better RAM efficiency.
>
> Impressive numbers! The new patch improves RAM efficiency quite a bit even with no term vectors nor stored fields, because of the periodic in-RAM merges of posting lists, term infos, etc. The frequency of the in-RAM merges is controlled by flushedMergeFactor, which is measured in doc count, right? How sensitive is performance to the value of flushedMergeFactor?

Right, the in-RAM merges seem to help *a lot*, because you get great compression of the terms dictionary, and also some compression of the freq postings since the docIDs are delta encoded. Also, you waste less buffer space at the end (the buffers are fixed sizes) when you merge together into a large segment.

The in-RAM merges are triggered by the number of bytes used vs the RAM buffer size. Each doc is indexed into its own RAM segment; once these level 0 segments use 1/Nth of the RAM buffer size, I merge them into level 1. Then, once level 1 segments are using 1/Mth of the RAM buffer size, I merge them into level 2. I don't do any merges beyond that. Right now N = 14 and M = 7, but I haven't really tuned them yet...

Once RAM is full, all of those segments are merged into a single on-disk segment. Once enough on-disk segments accumulate, they are periodically merged as well (based on flushedMergeFactor). Finally, when it's time to commit a real segment, I merge all RAM segments and flushed segments into a real Lucene segment.

I haven't done much testing to find the sweet spot for these merge settings just yet. Still plenty to do!

Mike
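In outline (this is not the patch's actual code; the class, field and method names below are invented to mirror the description above), the in-RAM merge triggers read roughly like:

    // Level-0 segments (one per doc) merge up once they occupy 1/N of the RAM buffer,
    // level-1 segments once they occupy 1/M of it; nothing merges beyond level 2 in RAM.
    class RamMergeTriggerSketch {
      static final int LEVEL0_DIVISOR = 14;  // "N" above, untuned
      static final int LEVEL1_DIVISOR = 7;   // "M" above, untuned

      long ramBufferBytes;                   // total RAM budget for buffered docs
      long level0Bytes, level1Bytes;

      void docIndexed(long newSegmentBytes) {
        level0Bytes += newSegmentBytes;
        if (level0Bytes >= ramBufferBytes / LEVEL0_DIVISOR) {
          level1Bytes += mergeLevel0ToLevel1();     // compresses terms dict + freq postings
          level0Bytes = 0;
        }
        if (level1Bytes >= ramBufferBytes / LEVEL1_DIVISOR) {
          mergeLevel1ToLevel2();
          level1Bytes = 0;
        }
        // When the whole buffer fills, all RAM segments are flushed and merged
        // into a single on-disk segment (not shown here).
      }

      long mergeLevel0ToLevel1() { return level0Bytes; }  // placeholder: returns merged size
      void mergeLevel1ToLevel2() {}
    }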
Re: improve how IndexWriter uses RAM to buffer added documents
Yonik Seeley [EMAIL PROTECTED] wrote:
> Wow, very nice results Mike!

Thanks :) I'm just praying I don't have some sneaky bug making the results far better than they really are!! And still plenty to do...

Mike
publish to maven-repository
Hi there,

I will give it another try: could you please publish the Lucene 2.* artifacts (including contribs) to the Maven 2 repository at ibiblio? Currently only lucene-core is available there, and only up to version 2.0.0: http://repo1.maven.org/maven2/org/apache/lucene/

JARs and POMs go to: scp://people.apache.org/www/www.apache.org/dist/maven-repository

If you need assistance I am happy to help, but I am not an official Apache member and do NOT have access to do the deployment myself.

Thank you so much...

Jörg
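For reference, the only coordinates currently resolvable from that repository correspond to lucene-core 2.0.0 (taken from the repository path above); a Maven 2 project would declare it roughly as:

    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>2.0.0</version>
    </dependency>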