Re: publish to maven-repository

2007-04-05 Thread Otis Gospodnetic
All yours - thanks Sami!

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Sami Siren [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:51:30 AM
Subject: Re: publish to maven-repository

hi,

I am volunteering to help put together releasable m2 artifacts for
Lucene. I also hope to start building and publishing m2 artifacts for
the other Lucene sub-projects (assuming there are no objections, of course).

--
 Sami Siren

2007/4/5, Otis Gospodnetic [EMAIL PROTECTED]:

 Jörg,
 Since you offered to help - please see
 https://issues.apache.org/jira/browse/LUCENE-622 .  lucene-core POM is
 there for 2.1.0, but if you need POMs for contrib/*, please attach them to
 that issue.  We have Jars, obviously, so we just need to copy those.  Then
 we'll need .sha1 and .md5 files for all pushed Jars.  One of the other
 developers will have to do that, as I don't have my PGP set up, and hence no
 key for the KEYS file (if that's needed for the .sha1).

 Otis

 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

 - Original Message 
 From: Joerg Hohwiller [EMAIL PROTECTED]
 To: java-dev@lucene.apache.org
 Sent: Tuesday, April 3, 2007 4:49:15 PM
 Subject: publish to maven-repository

 -----BEGIN PGP SIGNED MESSAGE-----
 Hash: SHA1

 Hi there,

 I will give it another try:

 Could you please publish lucene 2.* artifacts (including contribs) to the
 maven2
 repository at ibiblio?

 Currently there is only the lucene-core available up to version 2.0.0:
 http://repo1.maven.org/maven2/org/apache/lucene/

 JARs and POMs go to:
 scp://people.apache.org/www/www.apache.org/dist/maven-repository

 If you need assistance, I am pleased to help.
 But I am not an official apache member and do NOT have access to do the
 deployment myself.

 Thank you so much...
   Jörg
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.5 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFGEr3LmPuec2Dcv/8RAh1sAJ9m3qs7upNGJTgie5tNeAFKZenBowCgjufY
 uB1/RNnI4wB3dviKy0w7XEs=
 =llLh
 -----END PGP SIGNATURE-----

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]














Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote:
 On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
 
  (: Ironically, the numbers for Lucene on that page are a little
  better than they should be because of a sneaky bug.  I would have
  made updating the results a priority if they'd gone the other  
  way.  :)
 
  Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
  Ferret and others?).
 
 Doing honest, rigorous benchmarking is exacting and labor-intensive.   
 Publishing results tends to ignite flame wars I don't have time for.
 
 The main point that I wanted to make with that page was that KS was a  
 lot faster than Plucene, and that it was in Lucene's ballpark.   
 Having made that point, I've moved on.  The benchmarking code is  
 still very useful for internal development and I use it frequently.

Agreed.  Though, if the benchmarking is done in a way that anyone
could download & re-run it (eg as part of Lucene's new & developing
benchmark framework), it should help to keep flaming in check.

Accurate & well-communicated benchmark results, both within each
variant/port of Lucene and across them, are crucial for all of us making
iterative progress on performance.

 At some point I would like to port the benchmarking work that has  
 been contributed to Lucene of late, but I'm waiting for that code  
 base to settle down first.  After that happens, I'll probably make a  
 pass and publish some results.  Better to spend the time preparing  
 one definitive presentation than to have to rebut every idiot's  
 latest wildly inaccurate shootout.

Excellent!

  ... However, Lucene has been tuned by an army of developers over the
  years, while KS is young yet and still had many opportunities for
  optimization.  Current svn trunk for KS is about twice as fast for
  indexing as when I did those benchmarking tests.
 
  Wow, that's an awesome speedup!
 
 The big bottleneck for KS has been its Tokenizer class.  There's only  
 one such class in KS, and it's regex-based.  A few weeks ago, I  
 finally figured out how to hook it into Perl's regex engine at the C  
 level.  The regex engine is not an official part of Perl's C API, so  
 I wouldn't do this if I didn't have to, but the tokenizing loop is  
 only about 100 lines of code and the speedup is dramatic.

Tokenization is a very big part of Lucene's indexing time as well.

StandardAnalyzer is very time consuming.  When I switched to testing
with WhitespaceAnalyzer, it was quite a bit faster (I don't have exact
numbers).  Then, when I created and switched to SimpleSpaceAnalyzer
(which just splits on the space character and, instead of doing new
String(...) for every token, makes offset+length slices into a char[]
array), it was even faster.
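The slice trick described there can be sketched as follows; class and method names here are hypothetical (this is not the actual analyzer code), but it shows tokens as offset+length views into one shared char[] with no per-token String allocation:

```java
import java.util.ArrayList;
import java.util.List;

// A token is an (offset, length) slice into a shared buffer -- no copy.
final class CharSlice {
    final char[] buffer;   // shared backing array, never copied per token
    final int offset;
    final int length;
    CharSlice(char[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }
    // Materialize a String only when the text is actually needed.
    @Override public String toString() { return new String(buffer, offset, length); }
}

final class SimpleSpaceTokenizer {
    /** Splits on the single space character; no per-token String allocation. */
    static List<CharSlice> tokenize(char[] text) {
        List<CharSlice> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= text.length; i++) {
            if (i == text.length || text[i] == ' ') {
                if (i > start) tokens.add(new CharSlice(text, start, i - start));
                start = i + 1;
            }
        }
        return tokens;
    }
}
```

Only when a token's text is actually needed does toString() allocate a String.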

This is why the "your mileage will vary" caveat is extremely important.
For most users of Lucene, I'd expect that 1) retrieving the doc from
whatever its source is, and 2) tokenizing, take a substantial amount
of time.  So the gains I'm seeing in my benchmarks won't usually be
seen by normal applications unless these applications have already
optimized their doc retrieval/tokenization.

And now that indexing each document is so fast, segment merging has
become a BIG part (66% in my large index test in LUCENE-856) of
indexing.  Marvin, do you have any sense of what the equivalent cost is
in KS (I think for KS you "add" a previous segment not that
differently from how you add a document)?
 
 I've also squeezed out another 30-40% by changing the implementation  
 in ways which have gradually winnowed down the number of malloc()  
 calls.  Some of the techniques may be applicable to Lucene; I'll get  
 around to firing up JIRA issues describing them someday.

This generally was my approach in LUCENE-843 (minimize "new
Object()").  I re-use Posting objects, the hash for Posting objects,
byte buffers, etc.  I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.
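The shared-block idea can be illustrated with a toy allocator (much simplified, with hypothetical names; this is not the LUCENE-843 code): postings write into slices of one large int[] instead of each owning small arrays, and a flush reclaims everything at once:

```java
// Toy illustration of sharing one large int[] across postings instead of
// allocating a small array per term.  Hypothetical, much simplified.
final class IntBlockAllocator {
    private final int[] block = new int[1024]; // one shared block
    private int next = 0;

    /** Hand out a slice [start, start+size) of the shared block. */
    int allocate(int size) {
        int start = next;
        next += size;
        return start;
    }

    void write(int slot, int value) { block[slot] = value; }
    int read(int slot) { return block[slot]; }

    /** Reset between flushes: every slice is reclaimed at once, no GC work. */
    void reset() { next = 0; }
}
```

The point is that per-term allocation cost becomes a pointer bump, and deallocation is a single reset.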

The one thing that still baffles me is: I can't get a persistent
Posting hash to be any faster.  I still reset the Posting hash with
every document, but I had variants in my iterations that kept the
Postings hash between documents (just flushing the int[]'s
periodically).  I had expected that leaving Posting instances in the
hash, esp. for frequent terms, would be a win, but so far I haven't
seen that empirically.

  So KS is faster than Lucene today?
 
 I haven't tested recent versions of Lucene.  I believe that the  
 current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
 But... A) I don't have an official release out with the current  
 Tokenizer code, B) I have no immediate plans to prepare further  
 published benchmarks, and C) it's not really important, because so  
 long as the numbers are close you'd be nuts to choose one engine or  
 the other based on that criterion rather than, say, what language your
 development team speaks.  KinoSearch scales to multiple machines, too.

On C) I think it is important so the many 

Re: [jira] Created: (LUCENE-856) Optimize segment merging

2007-04-05 Thread Michael McCandless

Ning Li [EMAIL PROTECTED] wrote:
 On 4/4/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote:
  Note that for autoCommit=false, this optimization is somewhat less
  important, depending on how often you actually close/open a new
  IndexWriter.  In the extreme case, if you open a writer, add 100 MM
  docs, close the writer, then no segment merges happen at all.
 
 I think in the current code, the merge behavior for autoCommit=false
 is the same as that for autoCommit=true, isn't it?

Right, the current code implements the autoCommit=false case rather
inefficiently.  While LUCENE-843 fixes that to some extent, it must
still do its own merging of flushed segments.  But that merging
ought to be faster since it does not merge the term vectors & stored
fields.  I will run a test to compare (the test I ran for that issue
was with autoCommit=true).

Mike




RE: Lucene and Javolution: A good mix ?

2007-04-05 Thread Jean-Philippe Robichaud
I understand your concerns!

I was a little skeptical at the beginning, but even with the 1.5 JVM
the improvements still hold.

Lucene creates a lot of garbage (strings, tokens, ...) at both index
time and query time. While garbage collection strategies have seriously
improved since Java 1.4, the gains are still there, since object
creation is itself a cost that Javolution easily saves us from.

What Javolution requires to give its best is for you to make certain
critical classes extend the RealtimeObject class and implement a
Factory pattern inside.  Once this is done, you can fully profit from
the Javolution features.  Even without doing that, we could still
benefit from the FastList/FastMap classes being pooled already and from
the possibility to iterate lists/maps thread-safely without creating
any objects.
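In outline, that pattern looks something like the following sketch; the names are made up to illustrate the "extend a poolable base class and provide a factory" idea, and this is not Javolution's actual API:

```java
import java.util.ArrayDeque;

// Generic sketch of the pattern described above: a class extends a poolable
// base type and exposes a factory method that recycles instances rather
// than leaving them to the garbage collector.  All names are illustrative;
// this is NOT Javolution's actual API.
abstract class Reusable {
    abstract void reset();               // clear state before reuse
}

final class QueryToken extends Reusable {
    private static final ArrayDeque<QueryToken> POOL = new ArrayDeque<>();
    static int newInstances = 0;         // exposed so the effect is visible

    final StringBuilder text = new StringBuilder();

    /** Factory method: reuse a pooled instance when one is available. */
    static QueryToken newInstance(CharSequence contents) {
        QueryToken t = POOL.poll();
        if (t == null) { t = new QueryToken(); newInstances++; }
        t.text.append(contents);
        return t;
    }

    /** Return an instance to the pool instead of discarding it. */
    static void recycle(QueryToken t) { t.reset(); POOL.push(t); }

    @Override void reset() { text.setLength(0); }
}
```

After a warm-up phase the pool absorbs all allocations, which is where the reduced GC pressure comes from.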

Javolution is also released for gcj, which is great since it won't
interfere with Lucene's gcj effort.

From what I can foresee, the pros/cons would be:
Pros:
Leaner memory footprint
Saving many cpu cycles
Cons:
Adding a dependency to lucene codebase
Lucene developers must get familiar with the Context concepts


Jp
-Original Message-
From: robert engels [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 04, 2007 10:31 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene and Javolution: A good mix ?

I would suggest that the Javolution folks do their tests against a
modern JVM...

I have followed the Javolution project for some time, and while I  
agree that some of the techniques should improve things, I think that  
modern JVMs do most of this work for you (and the latest class  
libraries also help - StringBuilder and others).

I also think that when you start doing your own memory management you
might as well write the code in C/C++, because you need to use similar
techniques (similar to the resource management when using SWT).

Just my thoughts.

On Apr 4, 2007, at 8:54 PM, Jean-Philippe Robichaud wrote:

 Hello Dear Lucene coders!



 Some of you may remember, I'm using lucene for a product (and many
 other internal utilities).  I'm also using another open source library
 called Javolution (http://www.javolution.org/) which does many things,
 one of them being to offer excellent replacements for ArrayList/Map/...
 and a super good memory management extension to the java language.



 As I'm [trying to] follow the conversations on this list, I see that
 many of you are working towards optimizing lucene in terms of memory
 footprint and speed.  I just finished optimizing my code (not lucene
 itself, but my code written on top of it) using Javolution PoolContext
 and the FastList/FastMap/... classes.  The result is code that runs 6
 times faster.



 Javolution makes it easy to recycle objects and do some object
 allocation on the stack rather than on the heap, which removes stress
 on the garbage collector.  Javolution also offers 2 classes (Text and
 TextBuilder) to replace String/StringBuffer, which are perfect for
 anything related to string manipulation, plus some C union/struct
 equivalents for java.  The thing is really great.



 Would anyone be interested in giving Lucene a face lift and starting to
 use javolution as a core lucene dependency?  I understand that right now,
 lucene is free of any dependencies, which is quite great, but anyone
 interested in building fast/lean/stable java applications should seriously
 consider using javolution anyway.



 Any thoughts?



 Jp







[jira] Updated: (LUCENE-789) Custom similarity is ignored when using MultiSearcher

2007-04-05 Thread Alexey Lef (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Lef updated LUCENE-789:
--

Attachment: TestMultiSearcherSimilarity.java

Attached unit test

 Custom similarity is ignored when using MultiSearcher
 -

 Key: LUCENE-789
 URL: https://issues.apache.org/jira/browse/LUCENE-789
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.0.1
Reporter: Alexey Lef
 Attachments: TestMultiSearcherSimilarity.java


 Symptoms:
 I am using Searcher.setSimilarity() to provide a custom similarity that turns 
 off tf() factor. However, somewhere along the way the custom similarity is 
 ignored and the DefaultSimilarity is used. I am using MultiSearcher and 
 BooleanQuery.
 Problem analysis:
 The problem seems to be in MultiSearcher.createWeight(Query) method. It 
 creates an instance of CachedDfSource but does not set the similarity. As a 
 result, CachedDfSource provides DefaultSimilarity to queries that use it.
 Potential solution:
 Adding the following line:
 cacheSim.setSimilarity(getSimilarity());
 after creating an instance of CacheDfSource (line 312) seems to fix the 
 problem. However, I don't understand enough of the inner workings of this 
 class to be absolutely sure that this is the right thing to do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Fwd: Re: svn commit: r525669 - /lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java

2007-04-05 Thread Paul Elschot
Once more, now to java-dev instead of to java-commits:

Otis,

Can I ask which tool you used to catch this, and the previous one?

Regards,
Paul Elschot


On Thursday 05 April 2007 03:06, [EMAIL PROTECTED] wrote:
 Author: otis
 Date: Wed Apr  4 18:06:16 2007
 New Revision: 525669
 
 URL: http://svn.apache.org/viewvc?view=rev&rev=525669
 Log:
 - Removed unused BooleanScore param passed to the inner BucketTable class
 
 Modified:
 lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
 
 Modified: 
lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
 URL: 
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java?view=diff&rev=525669&r1=525668&r2=525669
 
==
 --- lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java 
(original)
 +++ lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java 
Wed Apr  4 18:06:16 2007
 @@ -21,7 +21,7 @@
  
  final class BooleanScorer extends Scorer {
private SubScorer scorers = null;
 -  private BucketTable bucketTable = new BucketTable(this);
 +  private BucketTable bucketTable = new BucketTable();
  
private int maxCoord = 1;
private float[] coordFactors = null;
 @@ -201,11 +201,7 @@
  final Bucket[] buckets = new Bucket[SIZE];
  Bucket first = null;  // head of valid list

 -private BooleanScorer scorer;
 -
 -public BucketTable(BooleanScorer scorer) {
 -  this.scorer = scorer;
 -}
 +public BucketTable() {}
  
  public final int size() { return SIZE; }
  
 
 
 
 



---




[jira] Updated: (LUCENE-622) Provide More of Lucene For Maven

2007-04-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörg Hohwiller updated LUCENE-622:
--

Attachment: lucene-highlighter-2.0.0.pom

pom for lucene-highlighter

 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom, lucene-highlighter-2.0.0.pom


 Please provide javadoc & source jars for lucene-core.  Also, please provide 
 the rest of lucene (the jars inside of contrib in the download bundle) if 
 possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942
 ] 

Michael McCandless commented on LUCENE-843:
---


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the normal sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
    200000 docs in 862.2 secs
    index size = 1.7G

  new
    200000 docs in 297.1 secs
    index size = 1.7G

  Total Docs/sec:             old   232.0; new   673.2 [  190.2% faster]
  Docs/MB @ flush:            old    47.2; new   278.4 [  489.6% more]
  Avg RAM used (MB) @ flush:  old    34.5; new     3.4 [   90.1% less]



2 MB

  old
    200000 docs in 828.7 secs
    index size = 1.7G

  new
    200000 docs in 279.0 secs
    index size = 1.7G

  Total Docs/sec:             old   241.3; new   716.8 [  197.0% faster]
  Docs/MB @ flush:            old    47.0; new   322.4 [  586.7% more]
  Avg RAM used (MB) @ flush:  old    37.9; new     4.5 [   88.0% less]



4 MB

  old
    200000 docs in 840.5 secs
    index size = 1.7G

  new
    200000 docs in 260.8 secs
    index size = 1.7G

  Total Docs/sec:             old   237.9; new   767.0 [  222.3% faster]
  Docs/MB @ flush:            old    46.8; new   363.1 [  675.4% more]
  Avg RAM used (MB) @ flush:  old    33.9; new     6.5 [   80.9% less]



8 MB

  old
    200000 docs in 678.8 secs
    index size = 1.7G

  new
    200000 docs in 248.8 secs
    index size = 1.7G

  Total Docs/sec:             old   294.6; new   803.7 [  172.8% faster]
  Docs/MB @ flush:            old    46.8; new   392.4 [  739.1% more]
  Avg RAM used (MB) @ flush:  old    60.3; new    10.7 [   82.2% less]



16 MB

  old
    200000 docs in 660.6 secs
    index size = 1.7G

  new
    200000 docs in 247.3 secs
    index size = 1.7G

  Total Docs/sec:             old   302.8; new   808.7 [  167.1% faster]
  Docs/MB @ flush:            old    46.7; new   415.4 [  788.8% more]
  Avg RAM used (MB) @ flush:  old    47.1; new    19.2 [   59.3% less]



24 MB

  old
    200000 docs in 658.1 secs
    index size = 1.7G

  new
    200000 docs in 243.0 secs
    index size = 1.7G

  Total Docs/sec:             old   303.9; new   823.0 [  170.8% faster]
  Docs/MB @ flush:            old    46.7; new   430.9 [  822.2% more]
  Avg RAM used (MB) @ flush:  old    70.0; new    27.5 [   60.8% less]



32 MB

  old
    200000 docs in 714.2 secs
    index size = 1.7G

  new
    200000 docs in 239.2 secs
    index size = 1.7G

  Total Docs/sec:             old   280.0; new   836.0 [  198.5% faster]
  Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
  Avg RAM used (MB) @ flush:  old    92.5; new    36.7 [   60.3% less]



48 MB

  old
    200000 docs in 640.3 secs
    index size = 1.7G

  new
    200000 docs in 236.0 secs
    index size = 1.7G

  Total Docs/sec:             old   312.4; new   847.5 [  171.3% faster]
  Docs/MB @ flush:            old    46.7; new   438.5 [  838.8% more]
  Avg RAM used (MB) @ flush:  old   138.9; new    52.8 [   62.0% less]



64 MB

  old
    200000 docs in 649.3 secs
    index size = 1.7G

  new
    200000 docs in 238.3 secs
    index size = 1.7G

  Total Docs/sec:             old   308.0; new   839.3 [  172.5% faster]
  Docs/MB @ flush:            old    46.7; new   441.3 [  844.7% more]
  Avg RAM used (MB) @ flush:  old   302.6; new    72.7 [   76.0% less]



80 MB

  old
    200000 docs in 670.2 secs
    index size = 1.7G

  new
    200000 docs in 227.2 secs
    index size = 1.7G

  Total Docs/sec:             old   298.4; new   880.5 [  195.0% faster]
  Docs/MB @ flush:            old    46.7; new   446.2 [  855.2% more]
  Avg RAM used (MB) @ flush:  old   231.7; new    94.3 [   59.3% less]



96 MB

  old
    200000 docs in 683.4 secs
    index size = 1.7G

  new
    200000 docs in 226.8 secs
    index size = 1.7G

  Total Docs/sec:             old   292.7; new   882.0 [  201.4% faster]
  Docs/MB @ flush:            old    46.7; new   448.0 [  859.1% more]
  Avg RAM used (MB) @ flush:  old   274.5; new   112.7 [   59.0% less]


Some observations:

  * Remember the test is already biased against new because with the
patch you get an optimized index in the end but with old you
don't.

  * Sweet spot for old (trunk) seems to be 48 MB: that is the peak
docs/sec @ 312.4.

  * New (with patch) seems to just get faster the more memory you give
    it, though gradually.  The peak was 96 MB (the largest I ran).  So
    no sweet spot (or maybe I need to give more memory, but above 96
    MB my test env was starting to swap).

  * New gets better and better RAM efficiency, the more RAM you 

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread eks dev
wow, impressive numbers, congrats!

- Original Message 
From: Michael McCandless (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
buffer added documents



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

eks dev [EMAIL PROTECTED] wrote:
 wow, impressive numbers, congrats !

Thanks!  But remember many Lucene apps won't see these speedups since I've
carefully minimized the cost of tokenization and of document retrieval.  I
think for many Lucene apps these are a sizable part of the time spent indexing.

Next up I'm going to test thread concurrency of new vs old.

And then still a fair number of things to resolve before committing...

Mike




Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:


The one thing that still baffles me is: I can't get a persistent
Posting hash to be any faster.


Don't use a hash, then.  :)

KS doesn't.

  * Give Token a position member.
  * After you've accumulated all the Tokens, calculate
position for each token from the position increment.
  * Arrange the postings in an array sorted by position.
  * Count the number of postings in a row with identical
text to get freq.

Relevant code from KinoSearch::Analysis::TokenBatch below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



void
TokenBatch_invert(TokenBatch *self)
{
    Token **tokens = (Token**)self->elems;
    Token **limit  = tokens + self->size;
    i32_t   token_pos = 0;

    /* thwart future attempts to append */
    if (self->inverted)
        CONFESS("TokenBatch has already been inverted");
    self->inverted = true;

    /* assign token positions */
    for ( ; tokens < limit; tokens++) {
        (*tokens)->pos = token_pos;
        token_pos += (*tokens)->pos_inc;
    }

    /* sort the tokens lexically, and hand off to cluster counting routine */
    qsort(self->elems, self->size, sizeof(Token*), Token_compare);
    count_clusters(self);
}

static void
count_clusters(TokenBatch *self)
{
    Token **tokens  = (Token**)self->elems;
    u32_t  *counts  = CALLOCATE(self->size + 1, u32_t);
    u32_t   i;

    /* save the cluster counts */
    self->cluster_counts_size = self->size;
    self->cluster_counts = counts;

    for (i = 0; i < self->size; ) {
        Token *const base_token = tokens[i];
        char  *const base_text  = base_token->text;
        const size_t base_len   = base_token->len;
        u32_t j = i + 1;

        /* iterate through tokens until text doesn't match */
        while (j < self->size) {
            Token *const candidate = tokens[j];

            if (   (candidate->len == base_len)
                && (memcmp(candidate->text, base_text, base_len) == 0)
            ) {
                j++;
            }
            else {
                break;
            }
        }

        /* record a count at the position of the first token in the cluster */
        counts[i] = j - i;

        /* start the next loop at the next token we haven't seen */
        i = j;
    }
}

Token**
TokenBatch_next_cluster(TokenBatch *self, u32_t *count)
{
    Token **cluster = (Token**)(self->elems + self->cur);

    if (self->cur == self->size) {
        *count = 0;
        return NULL;
    }

    /* don't read past the end of the cluster counts array */
    if (!self->inverted)
        CONFESS("TokenBatch not yet inverted");
    if (self->cur > self->cluster_counts_size)
        CONFESS("Tokens were added after inversion");

    /* place cluster count in passed-in var, advance bookmark */
    *count = self->cluster_counts[ self->cur ];
    self->cur += *count;

    return cluster;
}








[jira] Updated: (LUCENE-622) Provide More of Lucene For Maven

2007-04-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörg Hohwiller updated LUCENE-622:
--

Attachment: lucene-maven.patch

patch for partial mavenization of lucene

 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom, lucene-highlighter-2.0.0.pom, 
 lucene-maven.patch


 Please provide javadoc & source jars for lucene-core.  Also, please provide 
 the rest of lucene (the jars inside of contrib in the download bundle) if 
 possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven

2007-04-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487030
 ] 

Jörg Hohwiller commented on LUCENE-622:
---

If you apply this patch to svn 
(http://svn.apache.org/repos/asf/lucene/java/trunk), you can easily use maven 
to build and deploy artifacts to the maven repository.
I did NOT modify your structure in any way because it seems the majority of the 
lucene community is not interested in maven and wants to keep going with ant. 
So all the patch does is add some pom.xml files. Further, I only added POMs 
for toplevel, core, demo and highlighter.
From the highlighter POM you can easily create the POMs for the other contribs 
via cut & paste and add them to the toplevel pom (contrib/pom.xml).
If you need further assistance do NOT hesitate to ask me.

Somehow the tests do NOT work when I build with maven. Maybe they are 
currently broken. If you want to build (package), install or deploy anyway, 
call maven (mvn) with the option -Dmaven.test.skip=true. E.g.:
mvn install -Dmaven.test.skip=true

I did not spend the time to dig into the problem with the tests. If you are 
loading resources from the classpath please consider this:
http://jira.codehaus.org/browse/SUREFIRE-106

 Provide More of Lucene For Maven
 

 Key: LUCENE-622
 URL: https://issues.apache.org/jira/browse/LUCENE-622
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
 Attachments: lucene-core.pom, lucene-highlighter-2.0.0.pom, 
 lucene-maven.patch


 Please provide javadoc & source jars for lucene-core.  Also, please provide 
 the rest of lucene (the jars inside of contrib in the download bundle) if 
 possible.




Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

2007-04-05 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote:
 
 On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:
 
  The one thing that still baffles me is: I can't get a persistent
  Posting hash to be any faster.
 
 Don't use a hash, then.  :)
 
 KS doesn't.
 
* Give Token a position member.
* After you've accumulated all the Tokens, calculate
  position for each token from the position increment.
* Arrange the postings in an array sorted by position.
* Count the number of postings in a row with identical
  text to get freq.
 
 Relevant code from KinoSearch::Analysis::TokenBatch below.

OH!  I like that approach!

So you basically do not de-dup by field+term on your first pass
through the tokens in the doc (which is roughly what that hash
does).  Instead, append all tokens in an array, then sort first by
field+text and second by position?  This is done for each document,
right?

This seems like it could be a major win!
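(Editor's illustration of the steps above, in Python with hypothetical names rather than KinoSearch's actual data structures: append (text, position) pairs with no de-duping, sort, then derive freq from run lengths.)

```python
# Sketch of the sort-based inversion described above: no de-duping on the
# first pass; sort (text, position) pairs, then count identical-text runs
# to recover freq and the position list for each term.
from itertools import groupby

def invert_field(tokens_with_positions):
    """tokens_with_positions: list of (text, position) for one field."""
    ordered = sorted(tokens_with_positions)          # by text, then position
    postings = []
    for text, run in groupby(ordered, key=lambda t: t[0]):
        positions = [pos for _, pos in run]
        postings.append((text, len(positions), positions))  # (term, freq, positions)
    return postings

postings = invert_field([("the", 0), ("quick", 1), ("brown", 2),
                         ("fox", 3), ("the", 4)])
# → [('brown', 1, [2]), ('fox', 1, [3]), ('quick', 1, [1]), ('the', 2, [0, 4])]
```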

Did you ever compare this approach against the hash (or other de-dup data
structure, letter trie or something) approach?

I guess it depends on how many total terms you have in the doc vs how
many unique terms you have in the doc.  Qsort is NlogN, and, the
comparison of 2 terms is somewhat costly.  With de-duping, you pay a
hash lookup/insert cost (plus cost of managing little int buffers to
hold positions/offsets/etc) per term, but then only qsort on the
number of unique terms.

Whereas with your approach you don't pay any deduping cost but your
qsort is on total # terms, not total # unique terms.  I bet your
approach is quite a bit faster for small to medium sized docs (by
far the norm) but not faster for very large docs?  Or maybe it's
even faster for very large docs because qsort is so darned fast :)
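(For contrast, the de-duping strategy sketched with hypothetical names: one hash lookup per token with positions appended as they arrive, then a sort over unique terms only. Both strategies must yield identical postings; only the constant factors differ.)

```python
# Hash-based de-dup: one dict lookup per token, positions appended in
# document order; the final sort runs over unique terms only.
def invert_field_hashed(tokens_with_positions):
    table = {}
    for text, pos in tokens_with_positions:
        table.setdefault(text, []).append(pos)
    return [(text, len(ps), ps) for text, ps in sorted(table.items())]

tokens = [("the", 0), ("quick", 1), ("brown", 2), ("fox", 3), ("the", 4)]
postings = invert_field_hashed(tokens)
# → [('brown', 1, [2]), ('fox', 1, [3]), ('quick', 1, [1]), ('the', 2, [0, 4])]
```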

Mike




Re: publish to maven-repository

2007-04-05 Thread Joerg Hohwiller
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

 Jörg,
Hi Otis,
 Since you offered to help - please see 
 https://issues.apache.org/jira/browse/LUCENE-622 .  
 lucene-core POM is there for 2.1.0, but if you need POMs for contrib/*,
 please attach them to that issue.  We have Jars, obviously,
 so we just need to copy those.
Since you asked for my help, I did an initial mavenization of the lucene project
and submitted it as a patch.
Additionally I attached the 2.0.0 POM for the highlighter that I wanted to have
at ibiblio.
 Then we'll need .sha1 and .md5 files for all pushed Jars.
 One of the other developers will have to do that,
 as I don't have my PGP set up,
 and hence no key for the KEYS file (if that's needed for the .sha1).
You do not need PGP or anything like that for SHA-* or MD5.
Those are just checksums, not authenticated signatures.
I never deployed to ibiblio, but I think these files are generated automatically.

I hope my work helps to make it easier for the lucene community to put further
releases also into the maven repository.
  
 Otis
Best regards
  Jörg
 
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
 
 - Original Message 
 From: Joerg Hohwiller [EMAIL PROTECTED]
 To: java-dev@lucene.apache.org
 Sent: Tuesday, April 3, 2007 4:49:15 PM
 Subject: publish to maven-repository
 
 Hi there,
 
 I will give it another try:
 
 Could you please publish lucene 2.* artifacts (including contribs) to the 
 maven2
 repository at ibiblio?
 
 Currently there is only the lucene-core available up to version 2.0.0:
 http://repo1.maven.org/maven2/org/apache/lucene/
 
 JARs and POMs go to:
 scp://people.apache.org/www/www.apache.org/dist/maven-repository
 
 If you need assistance I am pleased to help.
 But I am not an official apache member and do NOT have access to do the
 deployment myself.
 
 Thank you so much...
   Jörg




-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGFSRRmPuec2Dcv/8RAra/AJ97TFROLjvfxH/fy/oGZdTV7PIzDgCeP9Kj
T784yUbeS3QaqmWIjwAuQ6I=
=LoAq
-END PGP SIGNATURE-




Re: publish to maven-repository

2007-04-05 Thread Joerg Hohwiller
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Eric,
 
 On Apr 4, 2007, at 4:33 PM, Otis Gospodnetic wrote:
 Eh, missing Jars in the Maven repo again.  Why does this always get
 dropped?
 
 Because none of us Lucene committers care much about Maven?  :)
It's okay for you personally. And no one wants you to use maven instead of ant.
But you should know that maven opens up a whole new world.
And maven users expect the projects they use to be available in the central
repository.
BTW - once you are addicted to maven you cannot understand anymore why people
still fiddle around writing ant build-files ;)
 
 Perhaps it's time to keep a lucene-core.pom in our repo, rename it at
 release time (e.g. cp lucene-core.pom lucene-core-2.1.0.pom) and push
 the core jar + core POM out?
 
 I don't know the Maven specifics, but I'm all for us maintaining the
 Maven POM file and bundling it with releases that get pushed to the repos.
I supplied a patch at LUCENE-622 to make it easier for you.
So in the end it will take you a few more minutes to publish your release to the
maven repository as well, but this allows many, many users to use lucene more
easily and therefore NOT beg you on this list - it should be worth the effort.
 
 Erik
Thanks
  Jörg
 
 
 
 

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGFSX5mPuec2Dcv/8RAhnNAJ9iNo1/eAh2mgay78yYobpjDCkWfgCePpdN
S+5/xD5t7wP2/h3wkDBHDms=
=aF2Q
-END PGP SIGNATURE-




[jira] Commented: (LUCENE-856) Optimize segment merging

2007-04-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487049
 ] 

Michael McCandless commented on LUCENE-856:
---

OK I re-ran the above test (10 MM docs @ ~5,500 bytes plain text each)
with autoCommit=false: this time it took 5 hrs 7 minutes, which is
40.7% faster than the autoCommit=true test above.

Both of these tests were run with the patch from LUCENE-843.

So this means, if all you need to do is build a massive index with
term vector positions & offsets, the fastest way to do so is with the
patch from LUCENE-843 and with autoCommit=false with your writer.

Basically LUCENE-843 makes autoCommit=false quite a bit faster for a
very large index, assuming you are storing term vectors / stored
fields.
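(The 40.7% figure checks out against the reported wall-clock times: 8 hrs 38 min for the autoCommit=true run vs 5 hrs 7 min here.)

```python
# Sanity-check the reported speedup from the two wall-clock times.
old_minutes = 8 * 60 + 38   # autoCommit=true:  8 hrs 38 min
new_minutes = 5 * 60 + 7    # autoCommit=false: 5 hrs  7 min
speedup_pct = (old_minutes - new_minutes) / old_minutes * 100
print(round(speedup_pct, 1))  # → 40.7
```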

Still, I think optimizing segment merging is important because for
many uses of Lucene, the interactivity (how quickly a searcher sees
the recently indexed documents) is very important.  For such cases you
should open a writer with autoCommit=false and then periodically close
& re-open it to publish the indexed documents to the searchers.  With
that model, segment merging will still be a factor slowing down indexing
(though how much of a factor depends on how often you close/open
your writers).


 Optimize segment merging
 

 Key: LUCENE-856
 URL: https://issues.apache.org/jira/browse/LUCENE-856
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor

 With LUCENE-843, the time spent indexing documents has been
 substantially reduced and now the time spent merging is a sizable
 portion of indexing time.
 I ran a test using the patch for LUCENE-843, building an index of 10
 million docs, each with ~5,500 byte plain text, with term vectors
 (positions + offsets) on and with 2 small stored fields per document.
 RAM buffer size was 32 MB.  I didn't optimize the index in the end,
 though optimize speed would also improve if we optimize segment
 merging.  Index size is 86 GB.
 Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
 of which was spent merging.  That's 65.6% of the time!
 Most of this time is presumably IO which probably can't be reduced
 much unless we improve overall merge policy and experiment with values
 for mergeFactor / buffer size.
 These tests were run on a Mac Pro with 2 dual-core Intel CPUs.  The IO
 system is RAID 0 of 4 drives, so, these times are probably better than
 the more common case of a single hard drive which would likely be
 slower IO.
 I think there are some simple things we could do to speed up merging:
   * Experiment with buffer sizes -- maybe larger buffers for the
 IndexInputs used during merging could help?  Because at a default
 mergeFactor of 10, the disk heads must do a lot of seeking back and
 forth between these 10 files (and then to the 11th file where we
 are writing).
   * Use byte copying when possible, eg if there are no deletions on a
 segment we can almost (I think?) just copy things like prox
 postings, stored fields, term vectors, instead of full parsing to
 Java objects and then re-serializing them.
   * Experiment with mergeFactor / different merge policies.  For
 example I think LUCENE-854 would reduce time spent merging for a
 given index size.
 This is currently just a place to list ideas for optimizing segment
 merges.  I don't plan on working on this until after LUCENE-843.
 Note that for autoCommit=false, this optimization is somewhat less
 important, depending on how often you actually close/open a new
 IndexWriter.  In the extreme case, if you open a writer, add 100 MM
 docs, close the writer, then no segment merges happen at all.




Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:


So you basically do not de-dup by field+term on your first pass
through the tokens in the doc (which is roughly what that hash
does).  Instead, append all tokens in an array, then sort first by
field+text and second by position?  This is done for each document
right?


Almost.  The sorting is done per-field.

Token doesn't have a field, so comparison is cheaper than you're  
thinking.


int
Token_compare(const void *va, const void *vb)
{
    Token *const a = *((Token**)va);
    Token *const b = *((Token**)vb);

    size_t min_len = a->len < b->len ? a->len : b->len;

    int comparison = memcmp(a->text, b->text, min_len);

    if (comparison == 0) {
        if (a->len != b->len) {
            comparison = a->len < b->len ? -1 : 1;
        }
        else {
            comparison = a->pos < b->pos ? -1 : 1;
        }
    }

    return comparison;
}


Did you ever compare this approach against hash (or other de-dup data
structure, letter trie or something) approach?


KS used to use hashing, though it wasn't directly analogous to how  
Lucene does things.  I've only tried these two techniques.  This was  
faster by about 30%, but the difference is not all in the de-duping.


KS is concerned with preparing serialized postings to feed into an  
external sorter.  In the hashing stratagem, every position added to a  
term_text => serialized_posting pair in the hash requires a string  
concatenation onto the end of serialized_posting, and thus a call to  
realloc().


Besides switching out hashing overhead for qsort overhead, the  
sorting technique also allows KS to know up front how many positions  
are associated with each posting, so the memory for the serialized  
string only has to be allocated once.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






TestIndexWriter.testAddIndexOnDiskFull failed

2007-04-05 Thread Paul Elschot
At revision 525912:

[junit] Testsuite: org.apache.lucene.index.TestIndexWriter
[junit] Tests run: 16, Failures: 1, Errors: 0, Time elapsed: 52.161 sec
[junit]
[junit] Testcase: 
testAddIndexOnDiskFull(org.apache.lucene.index.TestIndexWriter):  FAILED
[junit] max free Directory space required exceeded 1X the total input 
index sizes during addIndexes(IndexReader[]): max temp usage = 127589 bytes; 
starting disk usage = 3915 bytes; input index disk usage = 
7364554824446210604 bytes
[junit] junit.framework.AssertionFailedError: max free Directory space 
required exceeded 1X the total input index sizes during 
addIndexes(IndexReader[]): max temp usage = 127589 bytes; starting disk usage 
= 3915 bytes; input index disk usage = 7364554824446210604 bytes
[junit] at 
org.apache.lucene.index.TestIndexWriter.testAddIndexOnDiskFull(TestIndexWriter.java:387)
[junit]

Is there anything I can do to make this test pass locally?

Regards,
Paul Elschot




Re: svn commit: r525669 - /lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java

2007-04-05 Thread Otis Gospodnetic
Nothing fancy - Eclipse.  It flagged it, I removed it, nothing turned red 
indicating everything still compiled, unit tests still passed, committed.

If I recall correctly, one has to configure Eclipse to alert you to unused 
variables, methods, and such, and I have that turned on.

Otis 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Paul Elschot [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, April 5, 2007 4:14:23 AM
Subject: Re: svn commit: r525669 - 
/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java

Otis,

Can I ask which tool you used to catch this, and the previous one?

Regards,
Paul Elschot


On Thursday 05 April 2007 03:06, [EMAIL PROTECTED] wrote:
 Author: otis
 Date: Wed Apr  4 18:06:16 2007
 New Revision: 525669
 
 URL: http://svn.apache.org/viewvc?view=rev&rev=525669
 Log:
 - Removed unused BooleanScore param passed to the inner BucketTable class
 
 Modified:
 lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
 
 Modified: 
lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
 URL: 
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java?view=diff&rev=525669&r1=525668&r2=525669
 
==
 --- lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java 
(original)
 +++ lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java 
Wed Apr  4 18:06:16 2007
 @@ -21,7 +21,7 @@
  
  final class BooleanScorer extends Scorer {
private SubScorer scorers = null;
 -  private BucketTable bucketTable = new BucketTable(this);
 +  private BucketTable bucketTable = new BucketTable();
  
private int maxCoord = 1;
private float[] coordFactors = null;
 @@ -201,11 +201,7 @@
  final Bucket[] buckets = new Bucket[SIZE];
  Bucket first = null;  // head of valid list

 -private BooleanScorer scorer;
 -
 -public BucketTable(BooleanScorer scorer) {
 -  this.scorer = scorer;
 -}
 +public BucketTable() {}
  
  public final int size() { return SIZE; }
  
 
 
 
 







Re: Lucene and Javolution: A good mix ?

2007-04-05 Thread Mike Klaas

On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED] wrote:

I understand your concerns!

I was a little skeptical at the beginning.  But even with the 1.5 jvm,
the improvements still hold.

Lucene creates a lot of garbage (strings, tokens, ...) either at
index time or query time. While the new garbage collector strategies have
seriously improved since java 1.4, the gains are still there, as
object creation is also a cost that javolution easily saves us from.


I think the best approach at convincing people would be to produce a
patch that implements some of the suggested changes, and benchmark it.
As it stands, the positives are all hypothetical and the negatives
rather tangible.

-MIke




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:


Marvin do you have any sense of what the equivalent cost is
in KS


It's big.  I don't have any good optimizations to suggest in this area.


(I think for KS you add a previous segment not that
differently from how you add a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and  
the seg-at-a-time indexing strategy, segments don't get merged nearly  
as often as they do in Lucene.



I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.


Interesting.  I will have to try something like that!


On C) I think it is important so the many ports of Lucene can compare
notes and cross fertilize.


Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
the patch. ;)


Cross-fertilization is a powerful tool for stimulating algorithmic  
innovation.  Exhibit A: our unfolding collaborative successes.


That's why it was built into the Lucy proposal:

[Lucy's C engine] will provide core, performance-critical
functionality, but leave as much up to the higher-level
language as possible.

Users from diverse communities approach problems from different  
angles and come up with different solutions.  The best ones will  
propagate across Lucy bindings.


The only problem is that since Dave Balmain has been much less  
available than we expected, it's been largely up to me to get Lucy to  
critical mass where other people can start writing bindings.



Performance certainly isn't everything.


That's a given in scripting language culture.  Most users are  
concerned with minimizing developer time above all else.  Ergo, my  
emphasis on API design and simplicity.



But does KS give its users a choice in Tokenizer?


You supply a regular expression which matches one token.

  # Presto! A WhiteSpaceTokenizer:
  my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
      token_re => qr/\S+/
  );
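(The same match-one-token-with-a-regex idea, sketched in Python for readers who don't speak Perl; an analog, not KS's actual API.)

```python
# A whitespace tokenizer driven by a single "match one token" regex,
# mirroring the token_re idea above; also records start/end offsets.
import re

def tokenize(text, token_re=r"\S+"):
    return [(m.group(), m.start(), m.end()) for m in re.finditer(token_re, text)]

tokens = tokenize("the quick  fox")
# → [('the', 0, 3), ('quick', 4, 9), ('fox', 11, 14)]
```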


Or, can users pre-tokenize their fields themselves?


TokenBatch provides an API for bulk addition of tokens; you can  
subclass Analyzer to exploit that.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: TestIndexWriter.testAddIndexOnDiskFull failed

2007-04-05 Thread Michael McCandless

Paul Elschot [EMAIL PROTECTED] wrote:
 At revision 525912:
 
 [junit] Testsuite: org.apache.lucene.index.TestIndexWriter
 [junit] Tests run: 16, Failures: 1, Errors: 0, Time elapsed: 52.161
 sec
 [junit]
 [junit] Testcase: 
 testAddIndexOnDiskFull(org.apache.lucene.index.TestIndexWriter):  FAILED
 [junit] max free Directory space required exceeded 1X the total input 
 index sizes during addIndexes(IndexReader[]): max temp usage = 127589
 bytes; 
 starting disk usage = 3915 bytes; input index disk usage = 
 7364554824446210604 bytes
 [junit] junit.framework.AssertionFailedError: max free Directory
 space 
 required exceeded 1X the total input index sizes during 
 addIndexes(IndexReader[]): max temp usage = 127589 bytes; starting disk
 usage 
 = 3915 bytes; input index disk usage = 7364554824446210604 bytes
 [junit] at 
 org.apache.lucene.index.TestIndexWriter.testAddIndexOnDiskFull(TestIndexWriter.java:387)
 [junit]
 
 Is there anything I can do to make this test pass locally?

I just got a fresh checkout and the test is passing.  That's one scary output
from the test (input index disk usage).  It seems like 
RAMDirectory.fileLength(...)
may be returning a bad (incorrectly immense) result in your checkout?

Mike




Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)

2007-04-05 Thread Michael McCandless
Marvin Humphrey [EMAIL PROTECTED] wrote:
 
 On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:
 
  So you basically do not de-dup by field+term on your first pass
  through the tokens in the doc (which is roughly what that hash
  does).  Instead, append all tokens in an array, then sort first by
  field+text and second by position?  This is done for each document
  right?
 
 Almost.  The sorting is done per-field.

 Token doesn't have a field, so comparison is cheaper than you're  
 thinking.

Got it.  I've done exactly that (process one field's tokens at a time)
with LUCENE-843 as well.

  Did you ever compare this approach against hash (or other de-dup data
  structure, letter trie or something) approach?
 
 KS used to use hashing, though it wasn't directly analogous to how  
 Lucene does things.  I've only tried these two techniques.  This was  
 faster by about 30%, but the difference is not all in the de-duping.

OK.  30% is very nice :)

 KS is concerned with preparing serialized postings to feed into an  
 external sorter.  In the hashing stratagem, every position added to a  
 term_text => serialized_posting pair in the hash requires a string  
 concatenation onto the end of serialized_posting, and thus a call to  
 realloc().

 Besides switching out hashing overhead for qsort overhead, the  
 sorting technique also allows KS to know up front how many positions  
 are associated with each posting, so the memory for the serialized  
 string only has to be allocated once.

Yeah those realloc()'s are costly (Lucene trunk has them too).  In
LUCENE-843 I found a way to share large int[] arrays so that a given
posting uses slices into the shared arrays instead of doing reallocs.

I think I'm doing something similar to feeding an external sorter: I'm
just using the same approach as Lucene's segment merging of the
postings, optimized somewhat to handle a very large number of segments
at once (for the merging of the level 0 single document segments).
I use this same merger to merge level N RAM segments to level N+1 ram
segments, to merge RAM segments into a single flushed segment, to
merge flushed segments into a single flushed segment and then finally
to merge flushed and RAM segments into the real Lucene segment at
the end.

I think it differs from an external sorter in that I manage explicitly
when to flush a run to disk (autoCommit=false case) or to flush it
to a Lucene segment (autoCommit=true case) rather than letting the
sorter API decide.
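(Editor's illustration: the level-to-level merging described here is at its core a k-way merge of sorted runs; a minimal sketch with hypothetical posting tuples.)

```python
# K-way merge of already-sorted posting runs, as when folding several
# level-N segments into one level-N+1 segment. heapq.merge is lazy, so
# runs never need to be concatenated in memory first.
import heapq

run_a = [("apple", 1), ("fox", 3)]
run_b = [("apple", 2), ("zoo", 9)]
run_c = [("brown", 4)]

merged = list(heapq.merge(run_a, run_b, run_c))
# → [('apple', 1), ('apple', 2), ('brown', 4), ('fox', 3), ('zoo', 9)]
```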

Mike




Re: Lucene and Javolution: A good mix ?

2007-04-05 Thread Otis Gospodnetic
What Mike said.  Without seeing the Javalutionized Lucene in action we won't 
get very far.
jean-Philippe, are you interested in making the changes to Lucene and showing 
the performance improvement?
Note that you can use the super-nice and easy to use contrib/benchmark to 
compare the vanilla Lucene and the Javalutionized Lucene.


Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:58:38 PM
Subject: Re: Lucene and Javolution: A good mix ?

On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED] wrote:
 I understand your concerns!

 I was a little skeptical at the beginning.  But even with the 1.5 jvm,
 the improvements still hold.

 Lucene creates a lot of garbage (strings, tokens, ...) either at
 index time or query time. While the new garbage collector strategies have
 seriously improved since java 1.4, the gains are still there, as
 object creation is also a cost that javolution easily saves us from.

I think the best approach at convincing people would be to produce a
patch that implements some of the suggested changes, and benchmark it.
 As it stands, the positives are all hypothetical and the negatives
rather tangible.

-MIke




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

Marvin Humphrey [EMAIL PROTECTED] wrote:

  (I think for KS you add a previous segment not that
  differently from how you add a document)?
 
 Yeah.  KS has to decompress and serialize posting content, which sux.
 
 The one saving grace is that with the Fibonacci merge schedule and  
 the seg-at-a-time indexing strategy, segments don't get merged nearly  
 as often as they do in Lucene.

Yeah we need to work on this one.  One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a pay it forward design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...

  On C) I think it is important so the many ports of Lucene can compare
  notes and cross fertilize.
 
 Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
 the patch. ;)

I hear you!

 Cross-fertilization is a powerful tool for stimulating algorithmic  
 innovation.  Exhibit A: our unfolding collaborative successes.

Couldn't agree more.

 That's why it was built into the Lucy proposal:
 
  [Lucy's C engine] will provide core, performance-critical
  functionality, but leave as much up to the higher-level
  language as possible.
 
 Users from diverse communities approach problems from different  
 angles and come up with different solutions.  The best ones will  
 propagate across Lucy bindings.
 
 The only problem is that since Dave Balmain has been much less  
 available than we expected, it's been largely up to me to get Lucy to  
 critical mass where other people can start writing bindings.

This is a great model.  Are there Python bindings to Lucy yet/coming?

  But does KS give its users a choice in Tokenizer?
 
 You supply a regular expression which matches one token.
 
# Presto! A WhiteSpaceTokenizer:
 my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
 token_re => qr/\S+/
);
 
  Or, can users pre-tokenize their fields themselves?
 
 TokenBatch provides an API for bulk addition of tokens; you can  
 subclass Analyzer to exploit that.

Ahh, I get it.  Nice!

Mike




RE: Lucene and Javolution: A good mix ?

2007-04-05 Thread Jean-Philippe Robichaud
Yes, I believe enough in this approach to try it.  I'm already starting
to play with it, using the current trunk.  That being said, I'm quite busy
right now so I can't promise any steady progress.  Also, I won't apply
patches that are already in JIRA, so the numbers I'll get won't be the
'up-to-date' ones.

I understand that before this idea gets any traction, we must have an
idea of how much this could help.  But before going deep with this work,
I wanted to know if Lucene developers have any interest in this kind of
work.  If the gurus dislike the idea of adding a dependency to Lucene
(which is not the case for others Apache projects!), then I won't spend
too much time on this.

Jp

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 05, 2007 3:01 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene and Javolution: A good mix ?

What Mike said.  Without seeing the Javalutionized Lucene in action we
won't get very far.
jean-Philippe, are you interested in making the changes to Lucene and
showing the performance improvement?
Note that you can use the super-nice and easy to use contrib/benchmark
to compare the vanilla Lucene and the Javalutionized Lucene.


Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:58:38 PM
Subject: Re: Lucene and Javolution: A good mix ?

On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED]
wrote:
 I understand your concerns!

 I was a little skeptical at the beginning.  But even with the 1.5 jvm,
 the improvements still hold.

 Lucene creates a lot of garbage (strings, tokens, ...) either at
 index time or query time. While the new garbage collector strategies have
 seriously improved since java 1.4, the gains are still there, as
 object creation is also a cost that javolution easily saves us from.

I think the best approach at convincing people would be to produce a
patch that implements some of the suggested changes, and benchmark it.
 As it stands, the positives are all hypothetical and the negatives
rather tangible.

-MIke




Re: Lucene and Javolution: A good mix ?

2007-04-05 Thread Otis Gospodnetic
I'm not in love with the dependency idea, though it's not that big of a deal 
for me.
However, I think you will want to get some of the performance patches (e.g. 
LUCENE-843) in first, so you can compare the latest and greatest version of 
Lucene with your Javalutionized version.  From what I gather from Mike's 
emails, he is doing a lot of object and array sharing and reuse in order to 
minimize object creation and memory allocation, and thus create less work for 
the garbage collector.
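(The sharing/reuse pattern referred to here looks roughly like the sketch below: slices of one shared block are handed out and the whole pool is recycled per document, instead of allocating a fresh buffer per posting. Names are illustrative, not Lucene's actual classes.)

```python
class IntBlockPool:
    """Hand out slices of one shared block; recycle it wholesale."""
    def __init__(self, block_size=1024):
        self.block = [0] * block_size
        self.offset = 0

    def alloc(self, n):
        # Return (start, end) bounds of a fresh slice in the shared block.
        start, self.offset = self.offset, self.offset + n
        return start, self.offset

    def reset(self):
        # Next document reuses the same memory: no new allocation, no garbage.
        self.offset = 0

pool = IntBlockPool()
a = pool.alloc(4)   # → (0, 4)
b = pool.alloc(2)   # → (4, 6)
pool.reset()
c = pool.alloc(3)   # → (0, 3)
```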

My 2 (pick a currency, say Levs).

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Jean-Philippe Robichaud [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 3:19:51 PM
Subject: RE: Lucene and Javolution: A good mix ?

Yes, I believe enough in this approach to try it.  I'm already starting
to play with it, using the current trunk.  That being said, I'm quite busy
right now so I can't promise any steady progress.  Also, I won't apply
patches that are already in JIRA, so the numbers I'll get won't be the
'up-to-date' ones.

I understand that before this idea gets any traction, we must have an
idea of how much this could help.  But before going deep with this work,
I wanted to know if Lucene developers have any interest in this kind of
work.  If the gurus dislike the idea of adding a dependency to Lucene
(which is not the case for others Apache projects!), then I won't spend
too much time on this.

Jp

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 05, 2007 3:01 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene and Javolution: A good mix ?

What Mike said.  Without seeing the Javolutionized Lucene in action we
won't get very far.
Jean-Philippe, are you interested in making the changes to Lucene and
showing the performance improvement?
Note that you can use the super-nice and easy-to-use contrib/benchmark
to compare the vanilla Lucene and the Javolutionized Lucene.


Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:58:38 PM
Subject: Re: Lucene and Javolution: A good mix ?

On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED]
wrote:
 I understand your concerns!

 I was a little skeptical at the beginning.  But even with the 1.5 JVM,
 the improvements still hold.

 Lucene creates a lot of garbage (strings, tokens, ...) at both
 index time and query time. While garbage collector strategies have
 seriously improved since Java 1.4, the gains are still there, as
 object creation is also a cost that Javolution easily saves us from.

I think the best approach at convincing people would be to produce a
patch that implements some of the suggested changes, and benchmark it.
 As it stands, the positives are all hypothetical and the negatives
rather tangible.

-Mike














Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Otis Gospodnetic
Quick question, Mike:

You talk about a RAM buffer from 1MB - 96MB, but then you have the amount of 
RAM @ flush time (e.g. Avg RAM used (MB) @ flush:  old 34.5; new 3.4 
[90.1% less]).

I don't follow 100% of what you are doing in LUCENE-843, so could you please 
explain what these 2 different amounts of RAM are?
Is the first (1-96) the RAM you use for in-memory merging of segments?
What is the RAM used @ flush?  More precisely, why is it that that amount of 
RAM exceeds the RAM buffer?

Thanks,
Otis



- Original Message 
From: Michael McCandless (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
buffer added documents


[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942
 ] 

Michael McCandless commented on LUCENE-843:
---


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the normal sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
200000 docs in 862.2 secs
index size = 1.7G

  new
200000 docs in 297.1 secs
index size = 1.7G

  Total Docs/sec: old   232.0; new   673.2 [  190.2% faster]
  Docs/MB @ flush:    old    47.2; new   278.4 [  489.6% more]
  Avg RAM used (MB) @ flush:  old    34.5; new    3.4 [   90.1% less]



2 MB

  old
200000 docs in 828.7 secs
index size = 1.7G

  new
200000 docs in 279.0 secs
index size = 1.7G

  Total Docs/sec: old   241.3; new   716.8 [  197.0% faster]
  Docs/MB @ flush:    old    47.0; new   322.4 [  586.7% more]
  Avg RAM used (MB) @ flush:  old    37.9; new    4.5 [   88.0% less]



4 MB

  old
200000 docs in 840.5 secs
index size = 1.7G

  new
200000 docs in 260.8 secs
index size = 1.7G

  Total Docs/sec: old   237.9; new   767.0 [  222.3% faster]
  Docs/MB @ flush:    old    46.8; new   363.1 [  675.4% more]
  Avg RAM used (MB) @ flush:  old    33.9; new    6.5 [   80.9% less]



8 MB

  old
200000 docs in 678.8 secs
index size = 1.7G

  new
200000 docs in 248.8 secs
index size = 1.7G

  Total Docs/sec: old   294.6; new   803.7 [  172.8% faster]
  Docs/MB @ flush:    old    46.8; new   392.4 [  739.1% more]
  Avg RAM used (MB) @ flush:  old    60.3; new   10.7 [   82.2% less]



16 MB

  old
200000 docs in 660.6 secs
index size = 1.7G

  new
200000 docs in 247.3 secs
index size = 1.7G

  Total Docs/sec: old   302.8; new   808.7 [  167.1% faster]
  Docs/MB @ flush:    old    46.7; new   415.4 [  788.8% more]
  Avg RAM used (MB) @ flush:  old    47.1; new   19.2 [   59.3% less]



24 MB

  old
200000 docs in 658.1 secs
index size = 1.7G

  new
200000 docs in 243.0 secs
index size = 1.7G

  Total Docs/sec: old   303.9; new   823.0 [  170.8% faster]
  Docs/MB @ flush:    old    46.7; new   430.9 [  822.2% more]
  Avg RAM used (MB) @ flush:  old    70.0; new   27.5 [   60.8% less]



32 MB

  old
200000 docs in 714.2 secs
index size = 1.7G

  new
200000 docs in 239.2 secs
index size = 1.7G

  Total Docs/sec: old   280.0; new   836.0 [  198.5% faster]
  Docs/MB @ flush:    old    46.7; new   432.2 [  825.2% more]
  Avg RAM used (MB) @ flush:  old    92.5; new   36.7 [   60.3% less]



48 MB

  old
200000 docs in 640.3 secs
index size = 1.7G

  new
200000 docs in 236.0 secs
index size = 1.7G

  Total Docs/sec: old   312.4; new   847.5 [  171.3% faster]
  Docs/MB @ flush:    old    46.7; new   438.5 [  838.8% more]
  Avg RAM used (MB) @ flush:  old   138.9; new   52.8 [   62.0% less]



64 MB

  old
200000 docs in 649.3 secs
index size = 1.7G

  new
200000 docs in 238.3 secs
index size = 1.7G

  Total Docs/sec: old   308.0; new   839.3 [  172.5% faster]
  Docs/MB @ flush:    old    46.7; new   441.3 [  844.7% more]
  Avg RAM used (MB) @ flush:  old   302.6; new   72.7 [   76.0% less]



80 MB

  old
200000 docs in 670.2 secs
index size = 1.7G

  new
200000 docs in 227.2 secs
index size = 1.7G

  Total Docs/sec: old   298.4; new   880.5 [  195.0% faster]
  Docs/MB @ flush:    old    46.7; new   446.2 [  855.2% more]
  Avg RAM used (MB) @ flush:  old   231.7; new   94.3 [   59.3% less]



96 MB

  old
200000 docs in 683.4 secs
index size = 1.7G

  new
200000 docs in 226.8 secs
index size = 1.7G

  Total Docs/sec: old   

Re: Caching in QueryFilter - why?

2007-04-05 Thread Otis Gospodnetic
Sounds like I need to cut that out.
Since caching is built into the public BitSet bits(IndexReader reader)  method, 
I don't see a way to deprecate that, which means I'll just cut it out and 
document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be able 
to get the caching back by wrapping the QueryFilter in your 
CachingWrapperFilter.
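The wrapping pattern looks roughly like this (a simplified, self-contained sketch; SimpleFilter and the Object reader are stand-ins, not the real Lucene Filter/IndexReader API): the wrapper memoizes the inner filter's BitSet per reader, so callers opt in to caching by wrapping once.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Simplified stand-in for Lucene's Filter; bits() computes matching docs.
interface SimpleFilter {
    BitSet bits(Object reader);
}

// Sketch of the CachingWrapperFilter idea: cache the inner filter's BitSet
// per reader.  WeakHashMap keys mean cached bits go away with the reader.
class CachingWrapper implements SimpleFilter {
    private final SimpleFilter inner;
    private final Map<Object, BitSet> cache = new WeakHashMap<Object, BitSet>();

    CachingWrapper(SimpleFilter inner) {
        this.inner = inner;
    }

    public synchronized BitSet bits(Object reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            cached = inner.bits(reader); // compute once per reader
            cache.put(reader, cached);
        }
        return cached;
    }
}
```

A caller wraps once -- in real terms, new CachingWrapperFilter(queryFilter) -- and repeated bits() calls against the same reader reuse the cached BitSet instead of re-running the query.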


Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Erik Hatcher [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Wednesday, April 4, 2007 7:38:00 PM
Subject: Re: Caching in QueryFilter - why?

CachingWrapperFilter came along after QueryFilter.  I think I added  
CachingWrapperFilter when I realized that every Filter should have  
the capability to be cached without having to implement it.  So, the  
only reason is legacy.  I'm perfectly fine with removing the  
caching from QueryFilter in a future major release.

Erik

On Apr 4, 2007, at 5:57 PM, Otis Gospodnetic wrote:

 Hi,

 I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter  
 and then at QueryFilter.  I noticed QueryFilter does its own BitSet  
 caching, and the caching part of its code is nearly identical to  
 the code in CachingWrapperFilter.

 Why is that?  Is there a good reason for that?

 Thanks,
 Otis
  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share











[jira] Created: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-05 Thread Otis Gospodnetic (JIRA)
Remove BitSet caching from QueryFilter
--

 Key: LUCENE-857
 URL: https://issues.apache.org/jira/browse/LUCENE-857
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Otis Gospodnetic
 Assigned To: Otis Gospodnetic
Priority: Minor


Since caching is built into the public BitSet bits(IndexReader reader)  method, 
I don't see a way to deprecate that, which means I'll just cut it out and 
document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be able 
to get the caching back by wrapping the QueryFilter in the CachingWrapperFilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Lucene and Javolution: A good mix ?

2007-04-05 Thread Grant Ingersoll
I'm not saying I'm against it, but one of the things that makes
Lucene so great is its lack of dependencies in the core.  It isn't
necessarily a slippery slope, either, if we do add one dependency.


Javolution is BSD-licensed, AFAICT.  I don't know if that is a good or
bad license as far as Apache is concerned, but it should be looked
into before you spend any time on it.


This is not meant to be a discouragement.  If it shows a significant  
improvement, people will notice and it will be taken seriously,  
especially if it is backward compatible, well tested and well  
documented.


-Grant

On Apr 5, 2007, at 3:19 PM, Jean-Philippe Robichaud wrote:

Yes, I believe enough in this approach to try it.  I'm already starting
to play with it: I took the current trunk as my starting point.
That being said, I'm quite busy right now so I can't promise any
steady progress.  Also, I won't apply patches that are already in JIRA,
so the numbers I'll get won't be the up-to-date ones.

I understand that before this idea gets any traction, we must have an
idea of how much this could help.  But before going deep with this work,
I wanted to know if Lucene developers have any interest in this kind of
work.  If the gurus dislike the idea of adding a dependency to Lucene
(which is not the case for other Apache projects!), then I won't spend
too much time on this.

Jp

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 05, 2007 3:01 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene and Javolution: A good mix ?

What Mike said.  Without seeing the Javolutionized Lucene in action we
won't get very far.
Jean-Philippe, are you interested in making the changes to Lucene and
showing the performance improvement?
Note that you can use the super-nice and easy-to-use contrib/benchmark
to compare the vanilla Lucene and the Javolutionized Lucene.


Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:58:38 PM
Subject: Re: Lucene and Javolution: A good mix ?

On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED]
wrote:

I understand your concerns!

I was a little skeptical at the beginning.  But even with the 1.5 JVM,
the improvements still hold.

Lucene creates a lot of garbage (strings, tokens, ...) at both
index time and query time. While garbage collector strategies have
seriously improved since Java 1.4, the gains are still there, as
object creation is also a cost that Javolution easily saves us from.


I think the best approach at convincing people would be to produce a
patch that implements some of the suggested changes, and benchmark it.
 As it stands, the positives are all hypothetical and the negatives
rather tangible.

-Mike




--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:


(I think for KS you add a previous segment not that
differently from how you add a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and
the seg-at-a-time indexing strategy, segments don't get merged nearly
as often as they do in Lucene.


Yeah we need to work on this one.


What we need to do is cut down on decompression and conflict  
resolution costs when reading from one segment to another.  KS has  
solved this problem for stored fields.  Field defs are global and  
field values are keyed by name rather than field number in the field  
data file.  Benefits:


  * Whole documents can be read from one segment to
another as blobs.
  * No flags byte.
  * No remapping of field numbers.
  * No conflict resolution at all.
  * Compressed, uncompressed... doesn't matter.
  * Less code.
  * The possibility of allowing the user to provide their
own subclass for reading and writing fields. (For
Lucy, in the language of your choice.)

What I haven't got yet is a way to move terms and postings  
economically from one segment to another.  But I'm working on it.  :)



One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a pay it forward design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.


However, even under Fibo, when you get socked with a big merge, you  
really get socked.  It bothers me that the time for adding to your  
index can vary so unpredictably.



Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...



This is a great model.  Are there Python bindings to Lucy yet/coming?


I'm sure that they will appear once the C core is ready.  The  
approach I am taking is to make some high-level design decisions  
collaboratively on lucy-dev, then implement them in KS.  There's a  
large amount of code that has been written according to our specs  
that is working in KS and ready to commit to Lucy after trivial  
changes.  There's more that's ready for review.  However, release of  
KS 0.20 is taking priority, so code flow into the Lucy repository has  
slowed.


I'll also be looking for a job in about a month.  That may slow us
down some more, though it won't stop things -- I've basically
decided that I'll do what it takes to get Lucy off the ground.  I'll go
with something stopgap if nothing materializes which is compatible
with that commitment.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: Caching in QueryFilter - why?

2007-04-05 Thread Chris Hostetter

: Since caching is built into the public BitSet bits(IndexReader reader)
: method, I don't see a way to deprecate that, which means I'll just cut
: it out and document it in CHANGES.txt.  Anyone who wants QueryFilter
: caching will be able to get the caching back by wrapping the QueryFilter
: in your CachingWrapperFilter.

this seems like a potentially big surprise for people when upgrading ...
old code will continue to work fine without warning, just get a lot less
efficient.

If the concern is duplicated code, perhaps QueryFilter should be
deprecated and changed to be a subclass of CachingWrapperFilter, with a
constructor that takes in the Query and wraps it in some new class
(QueryWrapperFilter perhaps?)  that does the meaty part (collecting the
matches) ...

@deprecated use CachingWrapperFilter and QueryWrapperFilter directly
public class QueryFilter extends CachingWrapperFilter {
  public QueryFilter(Query query) {
super(new QueryWrapperFilter(query));
  }
}

public class QueryWrapperFilter extends Filter {
  private Query query;
  public QueryWrapperFilter(Query query) {
this.query = query;
  }
  public BitSet bits(IndexReader reader) throws IOException {
final BitSet bits = new BitSet(reader.maxDoc());
new IndexSearcher(reader).search(query, new HitCollector() {
  public final void collect(int doc, float score) {
bits.set(doc);  // set bit for hit
  }
});
return bits;
  }
}


(obviously we need some toString, equals, and hashCode methods in here as well)








:
:
: Otis
:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
: Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
:
: - Original Message 
: From: Erik Hatcher [EMAIL PROTECTED]
: To: java-dev@lucene.apache.org
: Sent: Wednesday, April 4, 2007 7:38:00 PM
: Subject: Re: Caching in QueryFilter - why?
:
: CachingWrapperFilter came along after QueryFilter.  I think I added
: CachingWrapperFilter when I realized that every Filter should have
: the capability to be cached without having to implement it.  So, the
: only reason is legacy.  I'm perfectly fine with removing the
: caching from QueryFilter in a future major release.
:
: Erik
:
: On Apr 4, 2007, at 5:57 PM, Otis Gospodnetic wrote:
:
:  Hi,
: 
:  I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter
:  and then at QueryFilter.  I noticed QueryFilter does its own BitSet
:  caching, and the caching part of its code is nearly identical to
:  the code in CachingWrapperFilter.
: 
:  Why is that?  Is there a good reason for that?
: 
:  Thanks,
:  Otis
:   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
:  Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
: 
: 
: 
:
:
:
:
:
:



-Hoss





[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-04-05 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-584:


Attachment: bench-diff.txt

Perhaps I did something wrong with the benchmark, but I didn't get any speed-up 
when using searcher.match(Query, MatchCollector) vs. searcher.search(Query, 
HitCollector).

Here are the benchmark numbers (5 queries with each), HitCollector first, 
MatchCollector second:

HITCOLLECTOR:

 [java]  Report Sum By (any) Name (11 about 41 out of 41)
 [java] Operation   round mrg buf   runCnt   recsPerRun
rec/s  elapsedSecavgUsedMemavgTotalMem
 [java] Rounds_40  10  101   808020
787.51,026.04 7,217,624 17,780,736
 [java] Populate -  -  -  -  -  - - - - - -  -   4 -  -  - 2003 -  -   
129.9 -  -  61.67 -   9,938,986 -   13,821,952
 [java] CreateIndex -   -   -41  
4.40.91 3,937,522 10,916,864
 [java] MAddDocs_2000 -  -  -   - - - - - -  -   4 -  -  - 2000 -  -   
138.1 -  -  57.92 -   9,368,584 -   13,821,952
 [java] Optimize-   -   -41  
1.42.83 9,938,218 13,821,952
 [java] CloseIndex -  -  -  -   - - - - - -  -   4 -  -  -  - 1 -  - 
2,000.0 -  -   0.00 -   9,938,986 -   13,821,952
 [java] OpenReader  -   -   -41 
24.00.17 9,957,592 13,821,952
 [java] SearchSameRdr_5 -   - - - - - -  -   4 -  -   5 -  - 
1,070.3 -  - 186.86 -  10,500,146 -   13,821,952
 [java] CloseReader -   -   -41  
4,000.00.00 9,059,756 13,821,952
 [java] WarmNewRdr_50 -  -  -   - - - - - -  -   4 -  -  10 -   
16,237.7 -  -  24.63 -   9,060,268 -   13,821,952
 [java] SrchNewRdr_5-   -   -45
265.9  752.0210,800,006 13,821,952


 [java]  Report sum by Prefix (MAddDocs) and Round (4 about 4 
out of 41)
 [java] Operation round mrg buf   runCnt   recsPerRunrec/s  
elapsedSecavgUsedMemavgTotalMem
 [java] MAddDocs_2000 0  10  101 2000 94.6  
 21.15 7,844,112 10,407,936
 [java] MAddDocs_2000 -   1 100  10 -  -   1 -  -  - 2000 -  -   136.7 -  - 
 14.63 -   8,968,144 -   11,309,056
 [java] MAddDocs_2000 2  10 1001 2000173.2  
 11.5510,528,264 15,740,928
 [java] MAddDocs_2000 -   3 100 100 -  -   1 -  -  - 2000 -  -   188.7 -  - 
 10.60 -  10,133,816 -   17,829,888


MATCHCOLLECTOR:


 [java]  Report Sum By (any) Name (11 about 41 out of 41)
 [java] Operation   round mrg buf   runCnt   recsPerRun
rec/s  elapsedSecavgUsedMemavgTotalMem
 [java] Rounds_40  10  101   808020
781.01,034.6210,566,608 15,859,712
 [java] Populate -  -  -  -  -  - - - - - -  -   4 -  -  - 2003 -  -   
130.9 -  -  61.23 -  10,963,452 -   14,806,016
 [java] CreateIndex -   -   -41 
33.90.12 3,616,570 11,020,288
 [java] MAddDocs_2000 -  -  -   - - - - - -  -   4 -  -  - 2000 -  -   
137.3 -  -  58.29 -  10,445,568 -   14,806,016
 [java] Optimize-   -   -41  
1.42.8210,979,398 14,806,016
 [java] CloseIndex -  -  -  -   - - - - - -  -   4 -  -  -  - 1 -  - 
2,000.0 -  -   0.00 -  10,963,452 -   14,806,016
 [java] OpenReader  -   -   -41 
22.00.1810,982,058 14,806,016
 [java] SearchSameRdr_5 -   - - - - - -  -   4 -  -   5 -  - 
1,064.7 -  - 187.84 -  11,060,036 -   14,806,016
 [java] CloseReader -   -   -41  
4,000.00.0010,353,206 14,806,016
 [java] WarmNewRdr_50 -  -  -   - - - - - -  -   4 -  -  10 -   
16,419.0 -  -  24.36 -  10,431,062 -   14,806,016
 [java] SrchNewRdr_5-   -   -45
263.0  760.3411,912,358 14,806,016


 [java]  Report sum by Prefix (MAddDocs) and Round (4 about 4 
out of 41)
 [java] Operation round mrg buf   runCnt   recsPerRunrec/s  
elapsedSecavgUsedMemavgTotalMem
 [java] MAddDocs_2000 0  10  101 2000 92.2  
 21.69 7,844,112 10,407,936
 [java] MAddDocs_2000 -   1 100  10 -  -   1 -  -  - 2000 -  -   136.6 -  - 
 14.64 -   7,720,352 -   10,407,936
 [java] MAddDocs_2000 2  10 1001 2000167.8  
 11.9211,325,952 17,571,840
 [java] MAddDocs_2000 -   3 100 100 -  -   1 -  -  - 2000 -  -   199.3 -  - 
 10.03 -  14,891,856 -   

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Chris Hostetter

: Thanks!  But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval.  I
: think for many Lucene apps these are a sizable part of time spent indexing.

true, but as long as the changes you are making have no impact on the
tokenization/docbuilding times, that shouldn't be a factor -- it should
be considered a constant-time adjunct to the code you are varying ...
people with expensive analysis may not see any significant increases, but
that's their own problem -- people concerned about performance will
already have that as fast as they can get it, and now the internals of
document adding will get faster as well.



-Hoss





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-05 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487108
 ] 

Matt Ericson commented on LUCENE-855:
-

I am almost done with my patch and I wanted to test it against this patch to
see who has the faster version.
But the MemoryCachedRangeFilter is written using Java 1.5.

And as far as I know, Lucene is still on Java 1.4.

Lines like 
private static WeakHashMap<IndexReader, Map<String, SortedFieldCache>> cache = 
new WeakHashMap<IndexReader, Map<String, SortedFieldCache>>();


will not compile in Java 1.4.  Andy, I would love to see who has the faster
patch; if you convert your patch to use Java 1.4, I would be happy to put them
side by side.
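For what it's worth, that field backports to 1.4 mechanically by dropping to raw collection types plus an explicit cast where the type parameter used to be; a self-contained sketch (Object here stands in for IndexReader and the cache-entry type, which are not defined in this snippet):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the mechanical Java 1.4 backport of the generic cache field.
// 1.5: WeakHashMap<IndexReader, Map<String, SortedFieldCache>>
// 1.4: raw types, with a cast at each read site replacing the generics.
public class RawTypeCache {
    private static final Map cache = new WeakHashMap();

    static Object get(Object reader, String field) {
        Map byField = (Map) cache.get(reader); // explicit cast replaces generics
        return (byField == null) ? null : byField.get(field);
    }

    static void put(Object reader, String field, Object entry) {
        Map byField = (Map) cache.get(reader);
        if (byField == null) {
            byField = new HashMap();
            cache.put(reader, byField);
        }
        byField.put(field, entry);
    }
}
```

The cost is only lost compile-time type checking (and unchecked warnings under 1.5); the runtime behavior is identical.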

 MemoryCachedRangeFilter to boost performance of Range queries
 -

 Key: LUCENE-855
 URL: https://issues.apache.org/jira/browse/LUCENE-855
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
 Attachments: MemoryCachedRangeFilter.patch


 Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
 within the specified range.  This requires iterating through every single 
 term in the index and can get rather slow for large document sets.
 MemoryCachedRangeFilter reads all docId, value pairs of a given field, 
 sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
 searches are used to find the start and end indices of the lower and upper 
 bound values.  The BitSet is populated by all the docId values that fall in 
 between the start and end indices.
 TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
 index with random date values within a 5 year range.  Executing bits() 1000 
 times on standard RangeQuery using random date intervals took 63904ms.  Using 
 MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
 dramatic when you have fewer unique terms in a field or fewer 
 documents.
 Currently MemoryCachedRangeFilter only works with numeric values (values are 
 stored in a long[] array) but it can be easily changed to support Strings.  A 
 side benefit of storing the values as longs is that there's no 
 longer a need to make the values lexicographically comparable, i.e. no 
 padding of numeric values with zeros.
 The downside of using MemoryCachedRangeFilter is there's a fairly significant 
 memory requirement.  So it's designed to be used in situations where range 
 filter performance is critical and memory consumption is not an issue.  The 
 memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
 MemoryCachedRangeFilter also requires a warmup step which can take a while to 
 run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
 can be called explicitly or is automatically called the first time 
 MemoryCachedRangeFilter is applied using a given field.
 So in summary, MemoryCachedRangeFilter can be useful when:
 - Performance is critical
 - Memory is not an issue
 - Field contains many unique numeric values
 - Index contains a large number of documents
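The lexicographic-comparability point in the description is easy to demonstrate: as plain Strings, "9" sorts after "10", which is why text-based range filters must left-pad numbers with zeros, while a long[] cache simply compares numerically. A small illustration:

```java
// Shows why text range filters must zero-pad numbers: String comparison is
// lexicographic, so "9" > "10", but zero-padded forms sort numerically.
public class PadDemo {
    static String pad(long value, int width) {
        StringBuilder sb = new StringBuilder(Long.toString(value));
        while (sb.length() < width) {
            sb.insert(0, '0'); // left-pad with zeros
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("10") > 0);              // lexicographic: "9" after "10"
        System.out.println(pad(9, 5));                            // zero-padded form
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0);  // padded order matches numeric order
    }
}
```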






Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Mike Klaas

On 4/5/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: Thanks!  But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval.  I
: think for many Lucene apps these are a sizable part of time spent indexing.

true, but as long as the changes you are making have no impact on the
tokenization/docbuilding times, that shouldn't be a factor -- it should
be considered a constant-time adjunct to the code you are varying ...
people with expensive analysis may not see any significant increases, but
that's their own problem -- people concerned about performance will
already have that as fast as they can get it, and now the internals of
document adding will get faster as well.


Especially since it is relatively easy for users to tweak the analysis
bits for performance--compared to the messy guts of index creation.

I am eagerly tracking the progress of your work.

-Mike




[jira] Commented: (LUCENE-857) Remove BitSet caching from QueryFilter

2007-04-05 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487116
 ] 

Hoss Man commented on LUCENE-857:
-

From email, since I didn't notice Otis had already opened this issue...

Date: Thu, 5 Apr 2007 14:24:31 -0700 (PDT)
To: java-dev@lucene.apache.org
Subject: Re: Caching in QueryFilter - why?

: Since caching is built into the public BitSet bits(IndexReader reader)
: method, I don't see a way to deprecate that, which means I'll just cut
: it out and document it in CHANGES.txt.  Anyone who wants QueryFilter
: caching will be able to get the caching back by wrapping the QueryFilter
: in your CachingWrapperFilter.

this seems like a potentially big surprise for people when upgrading ...
old code will continue to work fine without warning, just get a lot less
efficient.

If the concern is duplicated code, perhaps QueryFilter should be
deprecated and changed to be a subclass of CachingWrapperFilter, with a
constructor that takes in the Query and wraps it in some new class
(QueryWrapperFilter perhaps?)  that does the meaty part (collecting the
matches) ...

@deprecated use CachingWrapperFilter and QueryWrapperFilter directly
public class QueryFilter extends CachingWrapperFilter {
  public QueryFilter(Query query) {
super(new QueryWrapperFilter(query));
  }
}

public class QueryWrapperFilter extends Filter {
  private Query query;
  public QueryWrapperFilter(Query query) {
this.query = query;
  }
  public BitSet bits(IndexReader reader) throws IOException {
final BitSet bits = new BitSet(reader.maxDoc());
new IndexSearcher(reader).search(query, new HitCollector() {
  public final void collect(int doc, float score) {
bits.set(doc);  // set bit for hit
  }
});
return bits;
  }
}


(obviously we need some toString, equals, and hashCode methods in here as well)


 Remove BitSet caching from QueryFilter
 --

 Key: LUCENE-857
 URL: https://issues.apache.org/jira/browse/LUCENE-857
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Otis Gospodnetic
 Assigned To: Otis Gospodnetic
Priority: Minor
 Attachments: LUCENE-857.patch


 Since caching is built into the public BitSet bits(IndexReader reader)  
 method, I don't see a way to deprecate that, which means I'll just cut it out 
 and document it in CHANGES.txt.  Anyone who wants QueryFilter caching will be 
 able to get the caching back by wrapping the QueryFilter in the 
 CachingWrapperFilter.






Re: [jira] Resolved: (LUCENE-796) Change Visibility of fields[] in MultiFieldQueryParser

2007-04-05 Thread Mike Klaas

On 4/4/07, Otis Gospodnetic (JIRA) [EMAIL PROTECTED] wrote:


 [ 
https://issues.apache.org/jira/browse/LUCENE-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-796.
-

Resolution: Fixed

Makes sense.  Thanks Steve, applied.  I left those 2 private attributes of MFQP 
as private until somebody asks for them to be protected.


I'm not sure if this applies to this issue, but ISTM that a "private
unless you bug the devs" approach to variable scoping is a little odd.
A few unnecessary privates sprinkled through the code can really
wreak havoc on efforts to extend functionality cleanly. This has
caused me grief in the past, and waiting for a lucene release isn't
usually a good solution--cp is faster.

What if maintaining backward-compatibility of the inheritance
interface of classes was explicitly not guaranteed--would this allow
the default policy for new code to use protected rather than private
(unless there is a reason for the latter)?

A class is either designed with extensibility in mind (or certain
kinds of it), or not at all.  It is perhaps unrealistic to audit all
lucene classes, but perhaps a whole class could be opened up when a
bug report is filed?
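To illustrate the scoping point with a made-up example (these class and field names are hypothetical, not MFQP's actual members): a private member forces subclassers to copy the class wholesale, while a protected one allows a clean override:

```java
// Hypothetical classes showing the difference: a protected member can be
// adjusted by a subclass, while a private one forces the subclass author
// to copy-and-modify the whole parent class instead.
class Parser {
    private String[] fields = {"title", "body"};  // invisible to subclasses
    protected float defaultBoost = 1.0f;          // open for extension

    public String describe() {
        return fields.length + " fields, boost " + defaultBoost;
    }
}

class BoostedParser extends Parser {
    BoostedParser() {
        defaultBoost = 2.0f;
        // fields = new String[] {"title"};  // would not compile: private
    }
}
```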

FWIW:
$ find -name \*.java | xargs grep private | wc
   914
$ find -name \*.java | xargs grep protected | wc
   260

cheers,
-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

Mike Klaas [EMAIL PROTECTED] wrote:
 On 4/5/07, Chris Hostetter [EMAIL PROTECTED] wrote:
 
  : Thanks!  But remember many Lucene apps won't see these speedups since I've
  : carefully minimized cost of tokenization and cost of document retrieval.  I
  : think for many Lucene apps these are a sizable part of time spent indexing.
 
  true, but as long as the changes you are making have no impact on the
  tokenization/docbuilding times, that shouldn't be a factor -- that should
  be considered a constant time adjunct to the code you are varying ...
  people with expensive analysis may not see any significant increases, but
  that's their own problem -- people concerned about performance will
  already have that as fast as they can get it, and now the internals of
  document adding will get faster as well.
 
 Especially since it is relatively easy for users to tweak the analysis
 bits for performance--compared to the messy guts of index creation.
 
 I am eagerly tracking the progress of your work.

Thanks Mike (and Hoss).

Hoss, what you said is correct: I'm only affecting the actual indexing of
a document, nothing before that.

I just want to make sure I get that disclaimer out, as much as possible, so
nobody tries the patch and says "Hey!  My app only got 10% faster!  This was
false advertising!"

People who indeed have minimized their doc retrieval and tokenization time
should see speedups around what I'm seeing with the benchmarks (I hope!).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

Hi Otis!

Otis Gospodnetic [EMAIL PROTECTED] wrote:
 You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
 of RAM @ flush time (e.g. "Avg RAM used (MB) @ flush: old 34.5; new 3.4
 [90.1% less]").
 
 I don't follow 100% of what you are doing in LUCENE-843, so could you
 please explain what these 2 different amounts of RAM are?
 Is the first (1-96) the RAM you use for in-memory merging of segments?
 What is the RAM used @ flush?  More precisely, why is it that that amount
 of RAM exceeds the RAM buffer?

Very good questions!

When I say the RAM buffer size is set to 96 MB, what I mean is I
flush the writer when the in-memory segments are using 96 MB RAM.  On
trunk, I just call ramSizeInBytes().  I do the analogous thing with my
patch (sum up size of RAM buffers used by segments).  I call this part
of the RAM usage the "indexed documents RAM".  With every added
document, this grows.

But: this does not account for all data structures (Posting instances,
HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used,
but not saved away, during the indexing of a single document.  All the
things used temporarily while indexing a document take up RAM too.
I call this part of the RAM usage the "document processing RAM".  This
RAM does not grow with every added document, though its size is in
proportion to how big each document is.  This memory is always
re-used (does not grow with time).  But with the trunk, this is done
by creating garbage, whereas with my patch, I explicitly reuse it.

When I measure amount of RAM @ flush time, I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.
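For reference, the measurement boils down to this (a self-contained sketch of the benchmark's heap probe; requires JDK 1.5+):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapCheck {

    // The same probe the LUCENE-843 benchmark uses to sample actual
    // process heap usage at flush time.
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("Heap used: "
            + usedHeapBytes() / (1024 * 1024) + " MB");
    }
}
```

Note getUsed() counts live objects plus not-yet-collected garbage, which is why the trunk's numbers jump around until GC runs.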

With the trunk, the actual process memory usage tends to be quite a
bit higher than the RAM buffer size and also tends to be very noisy
(jumps around with each flush).  I think this is because of
delays/unpredictability on when GC kicks in to reclaim the garbage
created during indexing of the doc.  Whereas with my patch, it's
usually quite a bit closer to the indexed documents RAM and does not
jump around nearly as much.

So the actual process RAM used will always exceed my RAM buffer
size.  The amount of excess is a measure of the overhead required
to process the document.  The trunk has far worse overhead than with
my patch, which I think means a given application will be able to use
a *larger* RAM buffer size with LUCENE-843.

Does that make sense?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

Marvin Humphrey [EMAIL PROTECTED] wrote:
 On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
 
  (I think for KS you add a previous segment not that
  differently from how you add a document)?
 
  Yeah.  KS has to decompress and serialize posting content, which sux.
 
  The one saving grace is that with the Fibonacci merge schedule and
  the seg-at-a-time indexing strategy, segments don't get merged nearly
  as often as they do in Lucene.
 
  Yeah we need to work on this one.
 
 What we need to do is cut down on decompression and conflict  
 resolution costs when reading from one segment to another.  KS has  
 solved this problem for stored fields.  Field defs are global and  
 field values are keyed by name rather than field number in the field  
 data file.  Benefits:
 
* Whole documents can be read from one segment to
  another as blobs.
* No flags byte.
* No remapping of field numbers.
* No conflict resolution at all.
* Compressed, uncompressed... doesn't matter.
* Less code.
* The possibility of allowing the user to provide their
  own subclass for reading and writing fields. (For
  Lucy, in the language of your choice.)

I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.

I think the ability to suddenly birth a new field, or change a field's
attributes like "has vectors", "stores norms", etc., with a new
document, is something we just can't break at this point with Lucene?

If we could get those benefits without breaking backwards
compatibility then that would be awesome.  I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.

 What I haven't got yet is a way to move terms and postings  
 economically from one segment to another.  But I'm working on it.  :)

Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.  But for freq postings, you can't, because they are delta
coded.

Except: it's only the first entry of the incoming segments's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on last docID for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?
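The re-basing idea can be sketched with plain integers (illustrative only -- real .frq entries interleave freq data and use VInts, which this ignores):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows why delta-coded doc IDs block raw copying of a
// term's postings across segments, and how re-basing just the first delta
// fixes it (assuming no deletions).
public class DeltaMerge {

    // Delta-encode absolute doc IDs, as Lucene's .frq file does conceptually.
    static List<Integer> encode(List<Integer> docIds) {
        List<Integer> deltas = new ArrayList<>();
        int prev = 0;
        for (int id : docIds) {
            deltas.add(id - prev);
            prev = id;
        }
        return deltas;
    }

    // Append an incoming segment's delta list after seg1's postings.
    // docBase is the doc-ID offset of the incoming segment (seg1's maxDoc).
    static List<Integer> mergeDeltas(List<Integer> seg1Docs,
                                     List<Integer> seg2Deltas,
                                     int docBase) {
        List<Integer> merged = new ArrayList<>(encode(seg1Docs));
        int lastDoc = seg1Docs.get(seg1Docs.size() - 1);
        // Only the first incoming delta must be re-interpreted...
        merged.add(docBase + seg2Deltas.get(0) - lastDoc);
        // ...the rest could be a raw byte copy (copyBytes) in the real file.
        merged.addAll(seg2Deltas.subList(1, seg2Deltas.size()));
        return merged;
    }
}
```

With docBase = 10 (the previous segment's maxDoc), local docs [1, 4] in the incoming segment become [11, 14] globally, and only the first delta changes.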

  One thing that irks me about the
  current Lucene merge policy (besides that it gets confused when you
  flush-by-RAM-usage) is that it's a "pay it forward" design so you're
  always over-paying when you build a given index size.  With KS's
  Fibonacci merge policy, you don't.  LUCENE-854 has some more details.
 
 However, even under Fibo, when you get socked with a big merge, you  
 really get socked.  It bothers me that the time for adding to your  
 index can vary so unpredictably.

Yeah, I think that's best solved by concurrency (either with threads
or with our own scheduling eg on adding a doc you go and merge
another N terms in the running merge)?  There have been several
proposals recently for making Lucene's merging concurrent
(backgrounded), as part of LUCENE-847.

  Segment merging really is costly.  In building a large (86 GB, 10 MM
  docs) index, 65.6% of the time was spent merging!  Details are in
  LUCENE-856...
 
  This is a great model.  Are there Python bindings to Lucy yet/coming?
 
 I'm sure that they will appear once the C core is ready.  The  
 approach I am taking is to make some high-level design decisions  
 collaboratively on lucy-dev, then implement them in KS.  There's a  
 large amount of code that has been written according to our specs  
 that is working in KS and ready to commit to Lucy after trivial  
 changes.  There's more that's ready for review.  However, release of  
 KS 0.20 is taking priority, so code flow into the Lucy repository has  
 slowed.

OK, good to hear.

 I'll also be looking for a job in about a month.  That may slow us
 down some more, though it won't stop things -- I've basically
 decided that I'll do what it takes to get Lucy off the ground.  I'll go
 with something stopgap if nothing materializes which is compatible
 with that commitment.

Whoa, I'm sorry to hear that :(  I hope you land, quickly, somewhere
that takes Lucy/KS seriously.  It's clearly excellent work.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Grant Ingersoll


Michael, like everyone else, I am watching this very closely.  So far  
it sounds great!


On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:


When I measure amount of RAM @ flush time, I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.


One thing caught my eye, though, MemoryMXBean is JDK 1.5.  :-(

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/MemoryMXBean.html


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
Grant Ingersoll [EMAIL PROTECTED] wrote:
 
 Michael, like everyone else, I am watching this very closely.  So far  
 it sounds great!
 
 On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
 
  When I measure amount of RAM @ flush time, I'm calling
  MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
  process memory usage which should be (for my tests) around the sum of
  the above two types of RAM usage.
 
 One thing caught my eye, though, MemoryMXBean is JDK 1.5.  :-(
 
 http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/MemoryMXBean.html

Yeah, thanks for pointing this out.  I'm only using that to do my
benchmarking, not to actually measure RAM usage for when to flush,
so I will definitely remove it before committing (I always go to a
1.4.2 environment and do an "ant clean test" to be certain I didn't do
something like this by accident :).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:


What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another.  KS has
solved this problem for stored fields.  Field defs are global and
field values are keyed by name rather than field number in the field
data file.  Benefits:

   * Whole documents can be read from one segment to
 another as blobs.
   * No flags byte.
   * No remapping of field numbers.
   * No conflict resolution at all.
   * Compressed, uncompressed... doesn't matter.
   * Less code.
   * The possibility of allowing the user to provide their
 own subclass for reading and writing fields. (For
 Lucy, in the language of your choice.)


I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.


Yeah, too bad.  This is one area where Lucene and Lucy are going to  
differ.  Balmain and I are of one mind about global field defs.



I think the ability to suddenly birth a new field,


You can do that in KS as of version 0.20_02.  :)


or change a field's attributes like has vectors, stores norms,
etc., with a new document,


Can't do that, though, and I make no apologies.  I think it's a  
misfeature.



I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.


You'll still have to be able to remap field numbers when adding  
entire indexes.



Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.


You can probably squeeze out some nice gains using a skipVint()  
function, even with deletions.



But for freq postings, you can't, because they are delta coded.


I'm working on this task right now for KS.

KS implements the Flexible Indexing paradigm, so all posting data  
goes in a single file.


I've applied an additional constraint to KS:  Every binary file must  
consist of one type of record repeated over and over.  Every indexed  
field gets its own dedicated posting file with the suffix .pNNN to  
allow per-field posting formats.


The I/O code is isolated in subclasses of a new class called  
Stepper:  You can turn any Stepper loose on its file and read it  
from top to tail.  When the file format changes, Steppers will get  
archived, like old plugins.


My present task is to write the code for the Stepper subclasses
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can
wait.)  As I write them, I will see if I can figure out a format that
can be merged as speedily as possible.  Perhaps the precise variant
of delta encoding used in Lucene's .frq file should be avoided.



Except: it's only the first entry of the incoming segments's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on last docID for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?


Ugh, more special case code.

I have to say, I started trying to go over your patch, and the  
overwhelming impression I got coming back to this part of the Lucene  
code base in earnest for the first time since using 1.4.3 as a  
porting reference was: simplicity seems to be nobody's priority these  
days.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: publish to maven-repository

2007-04-05 Thread Sami Siren
Joerg Hohwiller wrote:
 When we'll need .sha1 and .md5 files for all pushed Jars.
 One of the other developers will have to do that,
 as I don't have my PGP set up,
 and hence no key for the KEYS file (if that's needed for the .sha1).
 You do not need PGP or anything like that for SHA-* or MD5.
 Those are just checksums, not authentic signatures.
 I never deployed to ibiblio but I think these files are generated
 automatically.

Actually you need to deploy a PGP signature due to Apache policy.  But
fortunately there is the maven-gpg-plugin that does this for you.
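For the archives, the plugin is typically bound along these lines (a sketch; check the maven-gpg-plugin documentation for the current version and phase defaults):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-gpg-plugin</artifactId>
      <executions>
        <execution>
          <id>sign-artifacts</id>
          <phase>verify</phase>
          <goals>
            <goal>sign</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```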

 I hope my work helps to make it easier for the lucene community to put further
 releases also into the maven repository.

Thanks for your efforts, I will get down to the work you started and put
something together for review soon.

--
 Sami Siren

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]