Re: publish to maven-repository
All yours - thanks Sami!

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: Sami Siren [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:51:30 AM
Subject: Re: publish to maven-repository

hi, I am volunteering to help put together releasable m2 artifacts for Lucene. I have high hopes of building and spreading m2 artifacts for the other Lucene sub-projects too (of course, if there are no objections).

--
Sami Siren

2007/4/5, Otis Gospodnetic [EMAIL PROTECTED]:

Jörg,

Since you offered to help - please see https://issues.apache.org/jira/browse/LUCENE-622 . The lucene-core POM is there for 2.1.0, but if you need POMs for contrib/*, please attach them to that issue. We have the Jars, obviously, so we just need to copy those. Then we'll need .sha1 and .md5 files for all pushed Jars. One of the other developers will have to do that, as I don't have my PGP set up, and hence no key for the KEYS file (if that's needed for the .sha1).

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: Joerg Hohwiller [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Tuesday, April 3, 2007 4:49:15 PM
Subject: publish to maven-repository

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi there,

I will give it another try: Could you please publish the lucene 2.* artifacts (including contribs) to the maven2 repository at ibiblio? Currently only lucene-core is available there, up to version 2.0.0: http://repo1.maven.org/maven2/org/apache/lucene/

JARs and POMs go to: scp://people.apache.org/www/www.apache.org/dist/maven-repository

If you need assistance I am pleased to help. But I am not an official apache member and do NOT have access to do the deployment myself.

Thank you so much...
Jörg

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGEr3LmPuec2Dcv/8RAh1sAJ9m3qs7upNGJTgie5tNeAFKZenBowCgjufY
uB1/RNnI4wB3dviKy0w7XEs=
=llLh
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
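As Otis notes above, each pushed Jar needs accompanying .sha1 and .md5 files. These are just hex digests, which the plain JDK can compute via java.security.MessageDigest. The sketch below is illustrative only (the class name and the in-memory bytes are hypothetical, not the actual release tooling):

```java
import java.security.MessageDigest;

public class ChecksumFiles {
    // Compute a lowercase hex digest ("MD5" or "SHA-1") of some bytes,
    // the same format Maven expects inside .md5 / .sha1 files.
    static String hexDigest(String algorithm, byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the bytes of a release jar.
        byte[] jarBytes = "dummy jar contents".getBytes("UTF-8");
        System.out.println(hexDigest("MD5", jarBytes));
        System.out.println(hexDigest("SHA-1", jarBytes));
    }
}
```

The digest string would then be written next to the jar as lucene-core-2.1.0.jar.md5 / .sha1; no PGP key is involved.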
Re: improve how IndexWriter uses RAM to buffer added documents
Marvin Humphrey [EMAIL PROTECTED] wrote:

On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:

(Ironically, the numbers for Lucene on that page are a little better than they should be because of a sneaky bug. I would have made updating the results a priority if they'd gone the other way. :)

Hrm. It would be nice to have a hard comparison of Lucene, KS (and Ferret and others?).

Doing honest, rigorous benchmarking is exacting and labor-intensive. Publishing results tends to ignite flame wars I don't have time for. The main point that I wanted to make with that page was that KS was a lot faster than Plucene, and that it was in Lucene's ballpark. Having made that point, I've moved on. The benchmarking code is still very useful for internal development and I use it frequently.

Agreed. Though, if the benchmarking is done in a way that anyone can download and re-run it (eg as part of Lucene's new, still-developing benchmark framework), it should help to keep flaming in check. Accurate, well-communicated benchmark results, both within each variant/port of Lucene and across them, are crucial for all of us making iterative progress on performance.

At some point I would like to port the benchmarking work that has been contributed to Lucene of late, but I'm waiting for that code base to settle down first. After that happens, I'll probably make a pass and publish some results. Better to spend the time preparing one definitive presentation than to have to rebut every idiot's latest wildly inaccurate shootout.

Excellent!

... However, Lucene has been tuned by an army of developers over the years, while KS is young yet and still has many opportunities for optimization. Current svn trunk for KS is about twice as fast for indexing as when I did those benchmarking tests.

Wow, that's an awesome speedup!

The big bottleneck for KS has been its Tokenizer class. There's only one such class in KS, and it's regex-based.
A few weeks ago, I finally figured out how to hook it into Perl's regex engine at the C level. The regex engine is not an official part of Perl's C API, so I wouldn't do this if I didn't have to, but the tokenizing loop is only about 100 lines of code and the speedup is dramatic.

Tokenization is a very big part of Lucene's indexing time as well. StandardAnalyzer is very time consuming. When I switched to testing with WhitespaceAnalyzer, it was quite a bit faster (I don't have exact numbers). Then when I created and switched to SimpleSpaceAnalyzer (just splits on the space character, and doesn't do new String(...) for every token; instead it makes offset+length slices into a char[] array), it was even faster.

This is why the "your mileage will vary" caveat is extremely important. For most users of Lucene, I'd expect that 1) retrieving the doc from whatever its source is, and 2) tokenizing, take a substantial amount of time. So the gains I'm seeing in my benchmarks won't usually be seen by normal applications unless these applications have already optimized their doc retrieval/tokenization.

And now that indexing each document is so fast, segment merging has become a BIG part (66% in my large index test in LUCENE-856) of indexing. Marvin, do you have any sense of what the equivalent cost is in KS (I think for KS you add a previous segment not that differently from how you add a document)?

I've also squeezed out another 30-40% by changing the implementation in ways which have gradually winnowed down the number of malloc() calls. Some of the techniques may be applicable to Lucene; I'll get around to firing up JIRA issues describing them someday.

This generally was my approach in LUCENE-843 (minimize new Object()). I re-use Posting objects, the hash for Posting objects, byte buffers, etc. I share large int[] blocks and char[] blocks across Postings and re-use them. Etc.

The one thing that still baffles me is: I can't get a persistent Posting hash to be any faster.
I still reset the Posting hash with every document, but I had variants in my iterations that kept the Postings hash between documents (just flushing the int[]'s periodically). I had expected that leaving Posting instances in the hash, esp. for frequent terms, would be a win, but so far I haven't seen that empirically.

So KS is faster than Lucene today?

I haven't tested recent versions of Lucene. I believe that the current svn trunk for KS is faster for indexing than Lucene 1.9.1. But... A) I don't have an official release out with the current Tokenizer code, B) I have no immediate plans to prepare further published benchmarks, and C) it's not really important, because so long as the numbers are close you'd be nuts to choose one engine or the other based on that criterion rather than, say, what language your development team speaks. KinoSearch scales to multiple machines, too.

On C) I think it is important so the many
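The slice-based tokenization Michael describes above (SimpleSpaceAnalyzer: split on the space character, record offset+length slices into a shared char[] instead of allocating a String per token) can be sketched in plain Java. Names here are hypothetical, not Lucene's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

public class SpaceSlicer {
    // A token is just (start, length) into the shared buffer -- no per-token String.
    public static final class Slice {
        public final int start, length;
        Slice(int start, int length) { this.start = start; this.length = length; }
    }

    // Split on the single space character; returns offset+length slices
    // into 'buf' rather than allocating new Strings per token.
    public static List<Slice> tokenize(char[] buf, int len) {
        List<Slice> slices = new ArrayList<Slice>();
        int start = 0;
        for (int i = 0; i <= len; i++) {
            if (i == len || buf[i] == ' ') {       // end of buffer or separator
                if (i > start) slices.add(new Slice(start, i - start));
                start = i + 1;                     // skip the space
            }
        }
        return slices;
    }

    public static void main(String[] args) {
        char[] text = "hello  lucene world".toCharArray();
        for (Slice s : tokenize(text, text.length)) {
            // Materialize a String only for display; indexing code would
            // compare the char[] region directly.
            System.out.println(new String(text, s.start, s.length));
        }
    }
}
```

The point of the exercise is the allocation pattern, not the splitting logic: downstream code that compares char[] regions directly never pays for a String per token.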
Re: [jira] Created: (LUCENE-856) Optimize segment merging
Ning Li [EMAIL PROTECTED] wrote:

On 4/4/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote: Note that for autoCommit=false, this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, close the writer, then no segment merges happen at all.

I think in the current code, the merge behavior for autoCommit=false is the same as that for autoCommit=true, isn't it?

Right, the current code implements the autoCommit=false case rather inefficiently. While LUCENE-843 fixes that to some extent, it must still do its own merging of flushed segments. But that merging ought to be faster since it does not merge the term vectors or stored fields. I will run a test to compare (the test I ran for that issue was with autoCommit=true).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene and Javolution: A good mix ?
I understand your concerns! I was a little skeptical at the beginning. But even with the 1.5 jvm, the improvements still hold. Lucene creates a lot of garbage (strings, tokens, ...) both at index time and at query time. While garbage collector strategies did seriously improve since java 1.4, the gains are still there, as object creation is also a cost that javolution easily saves us from.

What javolution requires, to give the best it can, is for you to make certain critical classes extend the RealtimeObject class and implement a Factory pattern inside. Once this is done, you can fully profit from the Javolution features. Even without doing that, we could still benefit from the FastList/FastMap classes being already pooled and the possibility to 'thread-safely' iterate lists/maps without creating any objects. Javolution is also released for gcj, which is great since it won't interfere with lucene's gcj effort.

From what I can foresee, the pros/cons would be:

Pros:
- Leaner memory footprint
- Saving many cpu cycles

Cons:
- Adding a dependency to the lucene codebase
- Lucene developers must get familiar with the Context concepts

Jp

-----Original Message-----
From: robert engels [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 04, 2007 10:31 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene and Javolution: A good mix ?

I would suggest that the Javolution folks do their tests against a modern JVM... I have followed the Javolution project for some time, and while I agree that some of the techniques should improve things, I think that modern JVMs do most of this work for you (and the latest class libraries also help - StringBuilder and others).

I also think that when you start doing your own memory management you might as well write the code in C/C++, because you need to use similar techniques (similar to the resource management when using SWT).

Just my thoughts.

On Apr 4, 2007, at 8:54 PM, Jean-Philippe Robichaud wrote:

Hello Dear Lucene coders!
Some of you may remember, I'm using lucene for a product (and many other internal utilities). I'm also using another open source library called Javolution (http://www.javolution.org/) which does many things, one of them being to offer excellent replacements for ArrayList/Map/... and a super good memory management extension to the java language.

As I [try to] follow the conversations on this list, I see that many of you are working towards optimizing lucene in terms of memory footprint and speed. I just finished optimizing my code (not lucene itself, but my code written on top of it) using the Javolution PoolContext and the FastList/FastMap/... classes. The resulting code is 6 times faster. Javolution makes it easy to recycle objects and do some object allocation on the stack rather than on the heap, which removes stress on the garbage collector. Javolution also offers 2 classes (Text and TextBuilder) to replace String/StringBuffer which are perfect for anything related to string manipulation, and some C union/struct equivalents for java. The thing is really great.

Would anyone be interested in giving Lucene a face lift and starting to use javolution as a core lucene dependency? I understand that right now lucene is free of any dependencies, which is quite great, but anyone interested in doing fast/lean/stable java applications should seriously consider using javolution anyway.

Any thoughts?

Jp

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
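The recycling idea behind Javolution's factories (acquire a pre-allocated instance, reset it, return it to the pool instead of letting the GC collect it) can be illustrated with plain JDK code. This is NOT Javolution's actual API, just a minimal sketch of the concept under discussion:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Generic object pool illustrating the recycle-instead-of-allocate idea.
// Class and method names are illustrative, not Javolution's API.
public class TokenPool {
    public static final class Token {
        char[] text = new char[32];
        int length;
        void set(char[] src, int off, int len) {
            if (len > text.length) text = new char[len]; // grow only when needed
            System.arraycopy(src, off, text, 0, len);
            length = len;
        }
    }

    private final Deque<Token> free = new ArrayDeque<Token>();

    // Reuse a recycled Token when available instead of allocating a new one.
    public Token acquire() {
        Token t = free.poll();
        return (t != null) ? t : new Token();
    }

    // Return a Token to the pool so the next acquire() skips allocation.
    public void release(Token t) {
        free.push(t);
    }
}
```

In a tight indexing loop, acquire/release keeps the per-token allocation count near zero, which is the same pressure-relief effect the thread attributes to PoolContext.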
[jira] Updated: (LUCENE-789) Custom similarity is ignored when using MultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Lef updated LUCENE-789:
------------------------------
Attachment: TestMultiSearcherSimilarity.java

Attached unit test

Custom similarity is ignored when using MultiSearcher
-----------------------------------------------------
Key: LUCENE-789
URL: https://issues.apache.org/jira/browse/LUCENE-789
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.0.1
Reporter: Alexey Lef
Attachments: TestMultiSearcherSimilarity.java

Symptoms: I am using Searcher.setSimilarity() to provide a custom similarity that turns off the tf() factor. However, somewhere along the way the custom similarity is ignored and the DefaultSimilarity is used. I am using MultiSearcher and BooleanQuery.

Problem analysis: The problem seems to be in the MultiSearcher.createWeight(Query) method. It creates an instance of CachedDfSource but does not set the similarity. As a result, CachedDfSource provides DefaultSimilarity to queries that use it.

Potential solution: Adding the following line: cacheSim.setSimilarity(getSimilarity()); after creating the instance of CachedDfSource (line 312) seems to fix the problem. However, I don't understand enough of the inner workings of this class to be absolutely sure that this is the right thing to do.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
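The bug pattern described in LUCENE-789 (a delegating wrapper is constructed without copying a configurable property, so consumers silently see the default) and the proposed one-line fix can be illustrated with plain JDK classes. These are NOT Lucene's actual classes, just a minimal sketch of the pattern:

```java
// Illustrative stand-ins: "Searcher" holds a configurable similarity;
// "CachedSource" is the wrapper that should inherit the outer setting.
public class WrapperBug {
    static class Searcher {
        private String similarity = "default";
        void setSimilarity(String s) { similarity = s; }
        String getSimilarity() { return similarity; }
    }

    static class CachedSource extends Searcher {
        // starts out with similarity = "default" unless explicitly copied
    }

    // Buggy shape: the wrapper is built but the property is never propagated.
    static CachedSource createWeightBuggy(Searcher outer) {
        return new CachedSource();
    }

    // Fixed shape: one line propagates the outer searcher's similarity,
    // mirroring the proposed cacheSim.setSimilarity(getSimilarity()).
    static CachedSource createWeightFixed(Searcher outer) {
        CachedSource cache = new CachedSource();
        cache.setSimilarity(outer.getSimilarity());
        return cache;
    }
}
```

The sketch shows why the symptom is silent: nothing fails, the wrapper simply answers with the default until the property is copied across.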
Fwd: Re: svn commit: r525669 - /lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
Once more, now to java-dev instead of to java-commits:

Otis, Can I ask which tool you used to catch this, and the previous one?

Regards, Paul Elschot

On Thursday 05 April 2007 03:06, [EMAIL PROTECTED] wrote:

Author: otis
Date: Wed Apr 4 18:06:16 2007
New Revision: 525669
URL: http://svn.apache.org/viewvc?view=rev&rev=525669
Log: - Removed unused BooleanScorer param passed to the inner BucketTable class

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java?view=diff&rev=525669&r1=525668&r2=525669
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java Wed Apr 4 18:06:16 2007
@@ -21,7 +21,7 @@
 final class BooleanScorer extends Scorer {
   private SubScorer scorers = null;
-  private BucketTable bucketTable = new BucketTable(this);
+  private BucketTable bucketTable = new BucketTable();
   private int maxCoord = 1;
   private float[] coordFactors = null;
@@ -201,11 +201,7 @@
     final Bucket[] buckets = new Bucket[SIZE];
     Bucket first = null; // head of valid list
-    private BooleanScorer scorer;
-
-    public BucketTable(BooleanScorer scorer) {
-      this.scorer = scorer;
-    }
+    public BucketTable() {}
     public final int size() { return SIZE; }

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörg Hohwiller updated LUCENE-622:
----------------------------------
Attachment: lucene-highlighter-2.0.0.pom

pom for lucene-highlighter

Provide More of Lucene For Maven
--------------------------------
Key: LUCENE-622
URL: https://issues.apache.org/jira/browse/LUCENE-622
Project: Lucene - Java
Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
Attachments: lucene-core.pom, lucene-highlighter-2.0.0.pom

Please provide javadoc and source jars for lucene-core. Also, please provide the rest of lucene (the jars inside of contrib in the download bundle) if possible.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

OK I ran old (trunk) vs new (this patch) with increasing RAM buffer sizes up to 96 MB. I used the normal sized docs (~5,500 bytes plain text), left stored fields and term vectors (positions + offsets) on, and autoCommit=false. Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

1 MB
  old: 200000 docs in 862.2 secs; index size = 1.7G
  new: 200000 docs in 297.1 secs; index size = 1.7G
  Total Docs/sec:            old 232.0; new 673.2 [190.2% faster]
  Docs/MB @ flush:           old  47.2; new 278.4 [489.6% more]
  Avg RAM used (MB) @ flush: old  34.5; new   3.4 [ 90.1% less]

2 MB
  old: 200000 docs in 828.7 secs; index size = 1.7G
  new: 200000 docs in 279.0 secs; index size = 1.7G
  Total Docs/sec:            old 241.3; new 716.8 [197.0% faster]
  Docs/MB @ flush:           old  47.0; new 322.4 [586.7% more]
  Avg RAM used (MB) @ flush: old  37.9; new   4.5 [ 88.0% less]

4 MB
  old: 200000 docs in 840.5 secs; index size = 1.7G
  new: 200000 docs in 260.8 secs; index size = 1.7G
  Total Docs/sec:            old 237.9; new 767.0 [222.3% faster]
  Docs/MB @ flush:           old  46.8; new 363.1 [675.4% more]
  Avg RAM used (MB) @ flush: old  33.9; new   6.5 [ 80.9% less]

8 MB
  old: 200000 docs in 678.8 secs; index size = 1.7G
  new: 200000 docs in 248.8 secs; index size = 1.7G
  Total Docs/sec:            old 294.6; new 803.7 [172.8% faster]
  Docs/MB @ flush:           old  46.8; new 392.4 [739.1% more]
  Avg RAM used (MB) @ flush: old  60.3; new  10.7 [ 82.2% less]

16 MB
  old: 200000 docs in 660.6 secs; index size = 1.7G
  new: 200000 docs in 247.3 secs; index size = 1.7G
  Total Docs/sec:            old 302.8; new 808.7 [167.1% faster]
  Docs/MB @ flush:           old  46.7; new 415.4 [788.8% more]
  Avg RAM used (MB) @ flush: old  47.1; new  19.2 [ 59.3% less]

24 MB
  old: 200000 docs in 658.1 secs; index size = 1.7G
  new: 200000 docs in 243.0 secs; index size = 1.7G
  Total Docs/sec:            old 303.9; new 823.0 [170.8% faster]
  Docs/MB @ flush:           old  46.7; new 430.9 [822.2% more]
  Avg RAM used (MB) @ flush: old  70.0; new  27.5 [ 60.8% less]

32 MB
  old: 200000 docs in 714.2 secs; index size = 1.7G
  new: 200000 docs in 239.2 secs; index size = 1.7G
  Total Docs/sec:            old 280.0; new 836.0 [198.5% faster]
  Docs/MB @ flush:           old  46.7; new 432.2 [825.2% more]
  Avg RAM used (MB) @ flush: old  92.5; new  36.7 [ 60.3% less]

48 MB
  old: 200000 docs in 640.3 secs; index size = 1.7G
  new: 200000 docs in 236.0 secs; index size = 1.7G
  Total Docs/sec:            old 312.4; new 847.5 [171.3% faster]
  Docs/MB @ flush:           old  46.7; new 438.5 [838.8% more]
  Avg RAM used (MB) @ flush: old 138.9; new  52.8 [ 62.0% less]

64 MB
  old: 200000 docs in 649.3 secs; index size = 1.7G
  new: 200000 docs in 238.3 secs; index size = 1.7G
  Total Docs/sec:            old 308.0; new 839.3 [172.5% faster]
  Docs/MB @ flush:           old  46.7; new 441.3 [844.7% more]
  Avg RAM used (MB) @ flush: old 302.6; new  72.7 [ 76.0% less]

80 MB
  old: 200000 docs in 670.2 secs; index size = 1.7G
  new: 200000 docs in 227.2 secs; index size = 1.7G
  Total Docs/sec:            old 298.4; new 880.5 [195.0% faster]
  Docs/MB @ flush:           old  46.7; new 446.2 [855.2% more]
  Avg RAM used (MB) @ flush: old 231.7; new  94.3 [ 59.3% less]

96 MB
  old: 200000 docs in 683.4 secs; index size = 1.7G
  new: 200000 docs in 226.8 secs; index size = 1.7G
  Total Docs/sec:            old 292.7; new 882.0 [201.4% faster]
  Docs/MB @ flush:           old  46.7; new 448.0 [859.1% more]
  Avg RAM used (MB) @ flush: old 274.5; new 112.7 [ 59.0% less]

Some observations:

* Remember the test is already biased against new because with the patch you get an optimized index in the end but with old you don't.
* Sweet spot for old (trunk) seems to be 48 MB: that is the peak docs/sec @ 312.4.
* New (with patch) seems to just get faster the more memory you give it, though gradually. The peak was 96 MB (the largest I ran). So no sweet spot (or maybe I need to give it more memory, but above 96 MB the trunk was starting to swap on my test env).
* New gets better and better RAM efficiency, the more RAM you
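The bracketed "% faster" figures in these results follow directly from the timings: percent faster = (oldSecs / newSecs - 1) * 100. A quick sanity check in Java (a hypothetical helper, not part of the benchmark code):

```java
public class Speedup {
    // Percent faster, as used in the bracketed numbers of the results above:
    // e.g. old 862.2s vs new 297.1s -> ~190.2% faster.
    static double percentFaster(double oldSecs, double newSecs) {
        return (oldSecs / newSecs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // 1 MB row of the table: old 862.2 secs, new 297.1 secs
        System.out.printf("%.1f%% faster%n", percentFaster(862.2, 297.1));
    }
}
```

The same ratio holds for the docs/sec columns (673.2 / 232.0 gives the same ~190.2%), which is a useful consistency check when reading benchmark tables like this.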
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
wow, impressive numbers, congrats !

----- Original Message -----
From: Michael McCandless (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

[...]
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
eks dev [EMAIL PROTECTED] wrote: wow, impressive numbers, congrats !

Thanks! But remember many Lucene apps won't see these speedups, since I've carefully minimized the cost of tokenization and the cost of document retrieval. I think for many Lucene apps these are a sizable part of the time spent indexing.

Next up I'm going to test thread concurrency of new vs old. And then there are still a fair number of things to resolve before committing...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)
On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote: The one thing that still baffles me is: I can't get a persistent Posting hash to be any faster.

Don't use a hash, then. :) KS doesn't.

* Give Token a position member.
* After you've accumulated all the Tokens, calculate the position for each token from the position increment.
* Arrange the postings in an array sorted by position.
* Count the number of postings in a row with identical text to get freq.

Relevant code from KinoSearch::Analysis::TokenBatch below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

void
TokenBatch_invert(TokenBatch *self)
{
    Token **tokens  = (Token**)self->elems;
    Token **limit   = tokens + self->size;
    i32_t token_pos = 0;

    /* thwart future attempts to append */
    if (self->inverted)
        CONFESS("TokenBatch has already been inverted");
    self->inverted = true;

    /* assign token positions */
    for ( ; tokens < limit; tokens++) {
        (*tokens)->pos = token_pos;
        token_pos += (*tokens)->pos_inc;
    }

    /* sort the tokens lexically, and hand off to cluster counting routine */
    qsort(self->elems, self->size, sizeof(Token*), Token_compare);
    count_clusters(self);
}

static void
count_clusters(TokenBatch *self)
{
    Token **tokens = (Token**)self->elems;
    u32_t  *counts = CALLOCATE(self->size + 1, u32_t);
    u32_t   i;

    /* save the cluster counts */
    self->cluster_counts_size = self->size;
    self->cluster_counts      = counts;

    for (i = 0; i < self->size; ) {
        Token *const base_token = tokens[i];
        char  *const base_text  = base_token->text;
        const size_t base_len   = base_token->len;
        u32_t j = i + 1;

        /* iterate through tokens until text doesn't match */
        while (j < self->size) {
            Token *const candidate = tokens[j];

            if (   (candidate->len == base_len)
                && (memcmp(candidate->text, base_text, base_len) == 0)
            ) {
                j++;
            }
            else {
                break;
            }
        }

        /* record a count at the position of the first token in the cluster */
        counts[i] = j - i;

        /* start the next loop at the next token we haven't seen */
        i = j;
    }
}

Token**
TokenBatch_next_cluster(TokenBatch *self, u32_t *count)
{
    Token **cluster = (Token**)(self->elems + self->cur);

    if (self->cur == self->size) {
        *count = 0;
        return NULL;
    }

    /* don't read past the end of the cluster counts array */
    if (!self->inverted)
        CONFESS("TokenBatch not yet inverted");
    if (self->cur > self->cluster_counts_size)
        CONFESS("Tokens were added after inversion");

    /* place cluster count in passed-in var, advance bookmark */
    *count = self->cluster_counts[ self->cur ];
    self->cur += *count;

    return cluster;
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörg Hohwiller updated LUCENE-622:
----------------------------------
Attachment: lucene-maven.patch

patch for partial mavenization of lucene

Provide More of Lucene For Maven
--------------------------------
Key: LUCENE-622
URL: https://issues.apache.org/jira/browse/LUCENE-622
Project: Lucene - Java
Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
Attachments: lucene-core.pom, lucene-highlighter-2.0.0.pom, lucene-maven.patch

Please provide javadoc and source jars for lucene-core. Also, please provide the rest of lucene (the jars inside of contrib in the download bundle) if possible.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487030 ]

Jörg Hohwiller commented on LUCENE-622:
---------------------------------------

If you apply this patch to svn (http://svn.apache.org/repos/asf/lucene/java/trunk), you can easily use maven for building and deploying artifacts to the maven repository. I did NOT modify your structure in any way, because it seems the majority of the lucene community is not interested in maven and wants to keep going with ant. So all the patch does is add some pom.xml files. Further, I only added POMs for toplevel, core, demo and highlighter. From the highlighter POM you can easily create the POMs for the other contribs via cut & paste and add them to the toplevel pom (contrib/pom.xml). If you need further assistance do NOT hesitate to ask me.

Somehow the tests do NOT work when I build with maven. Maybe they are currently broken. If you want to build (package), install or deploy anyway, call maven (mvn) with the option -Dmaven.test.skip=true. E.g.: mvn install -Dmaven.test.skip=true

I did not spend the time to dig into the problem with the tests. If you are loading resources from the classpath please consider this: http://jira.codehaus.org/browse/SUREFIRE-106

Provide More of Lucene For Maven
--------------------------------
Key: LUCENE-622
URL: https://issues.apache.org/jira/browse/LUCENE-622
Project: Lucene - Java
Issue Type: Task
Affects Versions: 2.0.0
Reporter: Stephen Duncan Jr
Attachments: lucene-core.pom, lucene-highlighter-2.0.0.pom, lucene-maven.patch

Please provide javadoc and source jars for lucene-core. Also, please provide the rest of lucene (the jars inside of contrib in the download bundle) if possible.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)
Marvin Humphrey [EMAIL PROTECTED] wrote:

On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote: The one thing that still baffles me is: I can't get a persistent Posting hash to be any faster.

Don't use a hash, then. :) KS doesn't.

* Give Token a position member.
* After you've accumulated all the Tokens, calculate the position for each token from the position increment.
* Arrange the postings in an array sorted by position.
* Count the number of postings in a row with identical text to get freq.

Relevant code from KinoSearch::Analysis::TokenBatch below.

OH! I like that approach! So you basically do not de-dup by field+term on your first pass through the tokens in the doc (which is roughly what that hash does). Instead, you append all tokens to an array, then sort first by field+text and second by position? This is done for each document, right? This seems like it could be a major win!

Did you ever compare this approach against a hash (or another de-dup data structure, a letter trie or something)? I guess it depends on how many total terms you have in the doc vs how many unique terms you have in the doc. Qsort is O(N log N), and the comparison of 2 terms is somewhat costly. With de-duping, you pay a hash lookup/insert cost (plus the cost of managing little int buffers to hold positions/offsets/etc) per term, but then only qsort on the number of unique terms. Whereas with your approach you don't pay any de-duping cost but your qsort is on the total # of terms, not the total # of unique terms.

I bet your approach is quite a bit faster for small to medium sized docs (by far the norm) but not faster for very large docs? Or maybe it's even faster for very large docs because qsort is so darned fast :)

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
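The sort-then-count-runs inversion being discussed (append every token, sort once, then read term frequencies off runs of identical text, instead of probing a hash per token) can be sketched in plain Java on strings. This is a hedged illustration of the technique, not KinoSearch's or Lucene's actual code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SortInvert {
    // Compute term frequencies the sort-based way: sort all tokens
    // (O(N log N) on the total token count), then count runs of
    // identical text -- no per-token hash probing during accumulation.
    public static Map<String, Integer> freqs(String[] tokens) {
        String[] sorted = tokens.clone();
        Arrays.sort(sorted);
        Map<String, Integer> freqs = new LinkedHashMap<String, Integer>();
        int i = 0;
        while (i < sorted.length) {
            int j = i + 1;
            // extend j to the end of the run of tokens equal to sorted[i]
            while (j < sorted.length && sorted[j].equals(sorted[i])) j++;
            freqs.put(sorted[i], j - i);   // run length == term freq
            i = j;                         // jump to the next distinct term
        }
        return freqs;
    }
}
```

As the discussion notes, the trade-off is N log N comparisons on total tokens versus hash maintenance on unique terms, so which wins depends on the ratio of total to unique terms per document.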
Re: publish to maven-repository
Hi Otis,

> Jörg, since you offered to help - please see https://issues.apache.org/jira/browse/LUCENE-622 . The lucene-core POM is there for 2.1.0, but if you need POMs for contrib/*, please attach them to that issue. We have the jars, obviously, so we just need to copy those.

Since you asked for my help, I did an initial mavenization of the Lucene project and submitted it as a patch. Additionally, I attached the 2.0.0 POM for the highlighter that I wanted to have at ibiblio.

> Then we'll need .sha1 and .md5 files for all pushed jars. One of the other developers will have to do that, as I don't have my PGP set up, and hence no key for the KEYS file (if that's needed for the .sha1).

You do not need PGP or anything like it for SHA-* or MD5. Those are just checksums, not authenticated signatures. I never deployed to ibiblio, but I think these files are generated automatically.

I hope my work helps to make it easier for the Lucene community to put further releases into the Maven repository as well.

Best regards
Jörg

- Original Message -
From: Joerg Hohwiller [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Tuesday, April 3, 2007 4:49:15 PM
Subject: publish to maven-repository

Hi there,

I will give it another try: could you please publish Lucene 2.* artifacts (including contribs) to the maven2 repository at ibiblio? Currently only lucene-core is available, up to version 2.0.0: http://repo1.maven.org/maven2/org/apache/lucene/

JARs and POMs go to: scp://people.apache.org/www/www.apache.org/dist/maven-repository

If you need assistance, I am pleased to help. But I am not an official Apache member and do NOT have access to do the deployment myself.

Thank you so much...
Jörg
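As Jörg says, the .sha1/.md5 sidecar files the repository expects are plain checksums of the artifact bytes and need no signing key. A sketch of how they could be generated (Python's hashlib; the file names are illustrative):

```python
import hashlib

def checksum_file(path, algo="sha1"):
    """Return the hex digest that goes into the artifact's .sha1 / .md5 sidecar file."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in chunks so large jars don't have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# The sidecar file simply contains the digest, e.g.:
#   lucene-core-2.1.0.jar.sha1  <- checksum_file("lucene-core-2.1.0.jar", "sha1")
#   lucene-core-2.1.0.jar.md5   <- checksum_file("lucene-core-2.1.0.jar", "md5")
```

A PGP key, by contrast, is only needed for the detached .asc signatures that prove who produced the artifact; the checksums above only guard against corrupted downloads.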
Re: publish to maven-repository
Hi Erik,

> On Apr 4, 2007, at 4:33 PM, Otis Gospodnetic wrote:
> > Eh, missing jars in the Maven repo again. Why does this always get dropped?
>
> Because none of us Lucene committers care much about Maven? :)

That's okay for you personally, and no one wants you to use Maven instead of Ant. But you should know that Maven opens up something of a new world, and users of Maven expect the projects they use to be available in the central repository. BTW - once you are addicted to Maven, you cannot understand anymore why people still fiddle around writing Ant build files ;)

> Perhaps it's time to keep a lucene-core.pom in our repo, rename it at release time (e.g. cp lucene-core.pom lucene-core-2.1.0.pom) and push the core jar + core POM out? I don't know the Maven specifics, but I'm all for us maintaining the Maven POM file and bundling it with releases that get pushed to the repos.

I supplied a patch at LUCENE-622 to make it easier for you. So in the end it will take you a few more minutes to publish your release to the Maven repository as well, but it allows many, many users to use Lucene more easily (and therefore NOT beg you on this list) - it should be worth the effort.

Thanks
Jörg
[jira] Commented: (LUCENE-856) Optimize segment merging
[ https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487049 ] Michael McCandless commented on LUCENE-856:
---
OK, I re-ran the above test (10 MM docs @ ~5,500 bytes plain text each) with autoCommit=false: this time it took 5 hrs 7 minutes, which is 40.7% faster than the autoCommit=true test above. Both of these tests were run with the patch from LUCENE-843.

So this means, if all you need to do is build a massive index with term vector positions & offsets, the fastest way to do so is with the patch from LUCENE-843 and with autoCommit=false on your writer. Basically, LUCENE-843 makes autoCommit=false quite a bit faster for a very large index, assuming you are storing term vectors / stored fields.

Still, I think optimizing segment merging is important, because for many uses of Lucene interactivity (how quickly a searcher sees the recently indexed documents) is very important. For such cases you should open a writer with autoCommit=false and then periodically close & re-open it to publish the indexed documents to the searchers. With that model, segment merging will still be a factor slowing down indexing (though how much of a factor depends on how often you close/open your writers).

Optimize segment merging
Key: LUCENE-856
URL: https://issues.apache.org/jira/browse/LUCENE-856
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.1
Reporter: Michael McCandless
Assigned To: Michael McCandless
Priority: Minor

With LUCENE-843, the time spent indexing documents has been substantially reduced, and now the time spent merging is a sizable portion of indexing time. I ran a test using the patch for LUCENE-843, building an index of 10 million docs, each with ~5,500 bytes of plain text, with term vectors (positions + offsets) on and with 2 small stored fields per document. RAM buffer size was 32 MB.

I didn't optimize the index in the end, though optimize speed would also improve if we optimize segment merging. Index size is 86 GB. Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes of which was spent merging. That's 65.6% of the time! Most of this time is presumably IO, which probably can't be reduced much unless we improve the overall merge policy and experiment with values for mergeFactor / buffer size. These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO system is a RAID 0 of 4 drives, so these times are probably better than the more common case of a single hard drive, which would likely have slower IO.

I think there are some simple things we could do to speed up merging:

* Experiment with buffer sizes -- maybe larger buffers for the IndexInputs used during merging could help? At a default mergeFactor of 10, the disk heads must do a lot of seeking back and forth between these 10 files (and then to the 11th file where we are writing).

* Use byte copying when possible, e.g. if there are no deletions on a segment we can almost (I think?) just copy things like prox postings, stored fields, and term vectors, instead of fully parsing them into Java objects and then re-serializing them.

* Experiment with mergeFactor / different merge policies. For example, I think LUCENE-854 would reduce time spent merging for a given index size.

This is currently just a place to list ideas for optimizing segment merges. I don't plan on working on this until after LUCENE-843.

Note that for autoCommit=false, this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, and close the writer, then no segment merges happen at all.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)
On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:

> So you basically do not de-dup by field+term on your first pass through the tokens in the doc (which is roughly what that hash does). Instead, append all tokens in an array, then sort first by field+text and second by position? This is done for each document right?

Almost. The sorting is done per-field. Token doesn't have a field, so comparison is cheaper than you're thinking.

    int
    Token_compare(const void *va, const void *vb)
    {
        Token *const a = *((Token**)va);
        Token *const b = *((Token**)vb);
        size_t min_len = a->len < b->len ? a->len : b->len;
        int comparison = memcmp(a->text, b->text, min_len);
        if (comparison == 0) {
            if (a->len != b->len) {
                comparison = a->len < b->len ? -1 : 1;
            }
            else {
                comparison = a->pos < b->pos ? -1 : 1;
            }
        }
        return comparison;
    }

> Did you ever compare this approach against hash (or other de-dup data structure, letter trie or something) approach?

KS used to use hashing, though it wasn't directly analogous to how Lucene does things. I've only tried these two techniques. This was faster by about 30%, but the difference is not all in the de-duping. KS is concerned with preparing serialized postings to feed into an external sorter. In the hashing stratagem, every position added to a term_text => serialized_posting pair in the hash requires a string concatenation onto the end of serialized_posting, and thus a call to realloc(). Besides switching out hashing overhead for qsort overhead, the sorting technique also allows KS to know up front how many positions are associated with each posting, so the memory for the serialized string only has to be allocated once.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
TestIndexWriter.testAddIndexOnDiskFull failed
At revision 525912:

[junit] Testsuite: org.apache.lucene.index.TestIndexWriter
[junit] Tests run: 16, Failures: 1, Errors: 0, Time elapsed: 52.161 sec
[junit]
[junit] Testcase: testAddIndexOnDiskFull(org.apache.lucene.index.TestIndexWriter): FAILED
[junit] max free Directory space required exceeded 1X the total input index sizes during addIndexes(IndexReader[]): max temp usage = 127589 bytes; starting disk usage = 3915 bytes; input index disk usage = 7364554824446210604 bytes
[junit] junit.framework.AssertionFailedError: max free Directory space required exceeded 1X the total input index sizes during addIndexes(IndexReader[]): max temp usage = 127589 bytes; starting disk usage = 3915 bytes; input index disk usage = 7364554824446210604 bytes
[junit] at org.apache.lucene.index.TestIndexWriter.testAddIndexOnDiskFull(TestIndexWriter.java:387)
[junit]

Is there anything I can do to make this test pass locally?

Regards,
Paul Elschot
Re: svn commit: r525669 - /lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
Nothing fancy - Eclipse. It flagged it, I removed it, nothing turned red indicating everything still compiled, unit tests still passed, committed. If I recall correctly, one has to configure Eclipse to alert you to unused variables, methods, and such, and I have that turned on.

Otis

- Original Message -
From: Paul Elschot [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, April 5, 2007 4:14:23 AM
Subject: Re: svn commit: r525669 - /lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java

Otis,

Can I ask which tool you used to catch this, and the previous one?

Regards,
Paul Elschot

On Thursday 05 April 2007 03:06, [EMAIL PROTECTED] wrote:

Author: otis
Date: Wed Apr 4 18:06:16 2007
New Revision: 525669
URL: http://svn.apache.org/viewvc?view=rev&rev=525669
Log: - Removed unused BooleanScorer param passed to the inner BucketTable class

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java?view=diff&rev=525669&r1=525668&r2=525669
==
--- lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/BooleanScorer.java Wed Apr 4 18:06:16 2007
@@ -21,7 +21,7 @@
 final class BooleanScorer extends Scorer {
   private SubScorer scorers = null;
-  private BucketTable bucketTable = new BucketTable(this);
+  private BucketTable bucketTable = new BucketTable();
   private int maxCoord = 1;
   private float[] coordFactors = null;
@@ -201,11 +201,7 @@
     final Bucket[] buckets = new Bucket[SIZE];
     Bucket first = null; // head of valid list
-    private BooleanScorer scorer;
-
-    public BucketTable(BooleanScorer scorer) {
-      this.scorer = scorer;
-    }
+    public BucketTable() {}

     public final int size() { return SIZE; }
Re: Lucene and Javolution: A good mix ?
On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED] wrote:

> I understand your concerns! I was a little skeptical at the beginning. But even with the 1.5 JVM, the improvements still hold. Lucene creates a lot of garbage (strings, tokens, ...) either at index time or query time. While the new garbage collector strategies have seriously improved since Java 1.4, the gains are still there, as object creation is also a cost that Javolution easily saves us from.

I think the best approach to convincing people would be to produce a patch that implements some of the suggested changes, and benchmark it. As it stands, the positives are all hypothetical and the negatives rather tangible.

-Mike
Re: improve how IndexWriter uses RAM to buffer added documents
On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:

> Marvin, do you have any sense of what the equivalent cost is in KS?

It's big. I don't have any good optimizations to suggest in this area.

> (I think for KS you add a previous segment not that differently from how you add a document)?

Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace is that with the Fibonacci merge schedule and the seg-at-a-time indexing strategy, segments don't get merged nearly as often as they do in Lucene.

> I share large int[] blocks and char[] blocks across Postings and re-use them. Etc.

Interesting. I will have to try something like that!

> On C) I think it is important so the many ports of Lucene can compare notes and cross-fertilize.

Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply the patch. ;) Cross-fertilization is a powerful tool for stimulating algorithmic innovation. Exhibit A: our unfolding collaborative successes. That's why it was built into the Lucy proposal: "[Lucy's C engine] will provide core, performance-critical functionality, but leave as much up to the higher-level language as possible. Users from diverse communities approach problems from different angles and come up with different solutions. The best ones will propagate across Lucy bindings." The only problem is that since Dave Balmain has been much less available than we expected, it's been largely up to me to get Lucy to the critical mass where other people can start writing bindings.

> Performance certainly isn't everything.

That's a given in scripting language culture. Most users are concerned with minimizing developer time above all else. Ergo, my emphasis on API design and simplicity.

> But does KS give its users a choice in Tokenizer?

You supply a regular expression which matches one token.

    # Presto! A WhiteSpaceTokenizer:
    my $tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );

> Or, can users pre-tokenize their fields themselves?

TokenBatch provides an API for bulk addition of tokens; you can subclass Analyzer to exploit that.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: TestIndexWriter.testAddIndexOnDiskFull failed
Paul Elschot [EMAIL PROTECTED] wrote:

> At revision 525912:
> [junit] Testcase: testAddIndexOnDiskFull(org.apache.lucene.index.TestIndexWriter): FAILED
> [junit] max free Directory space required exceeded 1X the total input index sizes during addIndexes(IndexReader[]): max temp usage = 127589 bytes; starting disk usage = 3915 bytes; input index disk usage = 7364554824446210604 bytes
> [...]
> Is there anything I can do to make this test pass locally?

I just got a fresh checkout and the test is passing. That's one scary output from the test (input index disk usage). It seems like RAMDirectory.fileLength(...) may be returning a bad (incorrectly immense) result in your checkout?

Mike
Re: Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)
Marvin Humphrey [EMAIL PROTECTED] wrote:

> On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:
> > So you basically do not de-dup by field+term on your first pass through the tokens in the doc (which is roughly what that hash does). Instead, append all tokens in an array, then sort first by field+text and second by position? This is done for each document right?
>
> Almost. The sorting is done per-field. Token doesn't have a field, so comparison is cheaper than you're thinking.

Got it. I've done exactly that (process one field's tokens at a time) with LUCENE-843 as well.

> > Did you ever compare this approach against hash (or other de-dup data structure, letter trie or something) approach?
>
> KS used to use hashing, though it wasn't directly analogous to how Lucene does things. I've only tried these two techniques. This was faster by about 30%, but the difference is not all in the de-duping.

OK. 30% is very nice :)

> KS is concerned with preparing serialized postings to feed into an external sorter. In the hashing stratagem, every position added to a term_text => serialized_posting pair in the hash requires a string concatenation onto the end of serialized_posting, and thus a call to realloc(). Besides switching out hashing overhead for qsort overhead, the sorting technique also allows KS to know up front how many positions are associated with each posting, so the memory for the serialized string only has to be allocated once.

Yeah, those realloc()'s are costly (Lucene trunk has them too). In LUCENE-843 I found a way to share large int[] arrays so that a given posting uses slices into the shared arrays instead of doing reallocs.

I think I'm doing something similar to feeding an external sorter: I'm just using the same approach as Lucene's segment merging of the postings, optimized somewhat to handle a very large number of segments at once (for the merging of the level-0 single-document segments). I use this same merger to merge level-N RAM segments into level N+1 RAM segments, to merge RAM segments into a single flushed segment, to merge flushed segments into a single flushed segment, and then finally to merge flushed and RAM segments into the real Lucene segment at the end. I think it differs from an external sorter in that I manage explicitly when to flush a run to disk (the autoCommit=false case) or flush it to a Lucene segment (the autoCommit=true case), rather than letting the sorter API decide.

Mike
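The merger Mike describes repeatedly combines sorted runs of postings (single-document segments, RAM segments, flushed segments) into one larger sorted run. The core operation is a k-way merge of sorted streams, which can be sketched with a heap (Python; the (term, doc_id) tuples here are an illustrative stand-in for real serialized postings):

```python
import heapq

def merge_runs(runs):
    """Merge several sorted runs of (term, doc_id) postings into one sorted
    stream -- the heart of segment merging. heapq.merge keeps only one head
    element per run in memory, so runs can be arbitrarily large iterators."""
    return list(heapq.merge(*runs))

run_a = [("apple", 1), ("fox", 3)]     # e.g. one flushed segment
run_b = [("apple", 2), ("zoo", 5)]     # e.g. one RAM segment
print(merge_runs([run_a, run_b]))
```

Applied hierarchically (merge level-N runs into a level N+1 run, and so on), this is the same shape as merging many single-document segments up into one real segment.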
Re: Lucene and Javolution: A good mix ?
What Mike said. Without seeing the Javalutionized Lucene in action, we won't get very far. Jean-Philippe, are you interested in making the changes to Lucene and showing the performance improvement? Note that you can use the super-nice and easy-to-use contrib/benchmark to compare the vanilla Lucene and the Javalutionized Lucene.

Otis

- Original Message -
From: Mike Klaas [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 1:58:38 PM
Subject: Re: Lucene and Javolution: A good mix ?
[...]
Re: improve how IndexWriter uses RAM to buffer added documents
Marvin Humphrey [EMAIL PROTECTED] wrote:

> > (I think for KS you add a previous segment not that differently from how you add a document)?
>
> Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace is that with the Fibonacci merge schedule and the seg-at-a-time indexing strategy, segments don't get merged nearly as often as they do in Lucene.

Yeah, we need to work on this one. One thing that irks me about the current Lucene merge policy (besides that it gets confused when you flush-by-RAM-usage) is that it's a "pay it forward" design, so you're always over-paying when you build a given index size. With KS's Fibonacci merge policy, you don't. LUCENE-854 has some more details. Segment merging really is costly. In building a large (86 GB, 10 MM docs) index, 65.6% of the time was spent merging! Details are in LUCENE-856...

> > On C) I think it is important so the many ports of Lucene can compare notes and cross fertilize.
>
> Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply the patch. ;)

I hear you!

> Cross-fertilization is a powerful tool for stimulating algorithmic innovation. Exhibit A: our unfolding collaborative successes.

Couldn't agree more.

> That's why it was built into the Lucy proposal: "[Lucy's C engine] will provide core, performance-critical functionality, but leave as much up to the higher-level language as possible. Users from diverse communities approach problems from different angles and come up with different solutions. The best ones will propagate across Lucy bindings." The only problem is that since Dave Balmain has been much less available than we expected, it's been largely up to me to get Lucy to critical mass where other people can start writing bindings.

This is a great model. Are there Python bindings for Lucy yet/coming?

> > But does KS give its users a choice in Tokenizer?
>
> You supply a regular expression which matches one token.
>
>     # Presto! A WhiteSpaceTokenizer:
>     my $tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
>
> > Or, can users pre-tokenize their fields themselves?
>
> TokenBatch provides an API for bulk addition of tokens; you can subclass Analyzer to exploit that.

Ahh, I get it. Nice!

Mike
RE: Lucene and Javolution: A good mix ?
Yes, I believe enough in this approach to try it. I took the current trunk and I'm already starting to play with it. That being said, I'm quite busy right now, so I can't promise any steady progress. Also, I won't apply patches that are already in JIRA, so the numbers I'll get won't be the 'up-to-date' ones.

I understand that before this idea gets any traction, we must have an idea of how much this could help. But before going deep with this work, I wanted to know if Lucene developers have any interest in this kind of work. If the gurus dislike the idea of adding a dependency to Lucene (which is not the case for other Apache projects!), then I won't spend too much time on this.

Jp

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 05, 2007 3:01 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene and Javolution: A good mix ?
[...]
Re: Lucene and Javolution: A good mix ?
I'm not in love with the dependency idea, though it's not that big of a deal for me. However, I think you will want to get some of the performance patches (e.g. LUCENE-843) in first, so you can compare the latest and greatest version of Lucene with your Javalutionized version. From what I gather from Mike's emails, he is doing a lot of object and array sharing and reusing in order to minimize object creation and memory allocation, and thus create less work for the garbage collector.

My 2 <pick a currency, say Levs>.

Otis

- Original Message -
From: Jean-Philippe Robichaud [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 3:19:51 PM
Subject: RE: Lucene and Javolution: A good mix ?
[...]
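The object-reuse idea Otis describes (and that Javolution automates for Java) can be illustrated with a tiny pool sketch. This is a generic illustration of the pattern only — the class and field names are hypothetical, not Lucene's or Javolution's actual API:

```python
class TokenPool:
    """Minimal object-reuse pool: hand out recycled token objects instead of
    allocating a fresh one per term, so the collector has less garbage to chase."""
    def __init__(self):
        self._free = []
        self.allocations = 0   # counts real allocations, for demonstration

    def acquire(self, text, pos):
        if self._free:
            tok = self._free.pop()          # reuse a released token
        else:
            self.allocations += 1           # only allocate when the pool is empty
            tok = {"text": None, "pos": None}
        tok["text"], tok["pos"] = text, pos
        return tok

    def release(self, tok):
        self._free.append(tok)

pool = TokenPool()
for i in range(1000):                # process 1000 tokens...
    t = pool.acquire("term%d" % i, i)
    pool.release(t)                  # ...but recycle the same object each time
print(pool.allocations)              # -> 1
```

The trade-off is the one debated in this thread: pooling trades GC pressure for manual lifecycle management (a `release` that is forgotten, or an object used after release, becomes a bug the collector would otherwise have prevented).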
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Quick question, Mike: you talk about a RAM buffer from 1 MB - 96 MB, but then you have the amount of RAM @ flush time (e.g. "Avg RAM used (MB) @ flush: old 34.5; new 3.4 [90.1% less]"). I don't follow 100% of what you are doing in LUCENE-843, so could you please explain what these 2 different amounts of RAM are? Is the first (1-96) the RAM you use for in-memory merging of segments? What is the RAM used @ flush? More precisely, why is it that that amount of RAM exceeds the RAM buffer?

Thanks,
Otis

- Original Message -
From: Michael McCandless (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ] Michael McCandless commented on LUCENE-843:
---
OK, I ran old (trunk) vs new (this patch) with increasing RAM buffer sizes up to 96 MB. I used the normal-sized docs (~5,500 bytes plain text), left stored fields and term vectors (positions + offsets) on, and autoCommit=false.
Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

1 MB
  old: 200000 docs in 862.2 secs; index size = 1.7G
  new: 200000 docs in 297.1 secs; index size = 1.7G
  Total Docs/sec:            old 232.0; new 673.2 [190.2% faster]
  Docs/MB @ flush:           old 47.2; new 278.4 [489.6% more]
  Avg RAM used (MB) @ flush: old 34.5; new 3.4 [90.1% less]

2 MB
  old: 200000 docs in 828.7 secs; index size = 1.7G
  new: 200000 docs in 279.0 secs; index size = 1.7G
  Total Docs/sec:            old 241.3; new 716.8 [197.0% faster]
  Docs/MB @ flush:           old 47.0; new 322.4 [586.7% more]
  Avg RAM used (MB) @ flush: old 37.9; new 4.5 [88.0% less]

4 MB
  old: 200000 docs in 840.5 secs; index size = 1.7G
  new: 200000 docs in 260.8 secs; index size = 1.7G
  Total Docs/sec:            old 237.9; new 767.0 [222.3% faster]
  Docs/MB @ flush:           old 46.8; new 363.1 [675.4% more]
  Avg RAM used (MB) @ flush: old 33.9; new 6.5 [80.9% less]

8 MB
  old: 200000 docs in 678.8 secs; index size = 1.7G
  new: 200000 docs in 248.8 secs; index size = 1.7G
  Total Docs/sec:            old 294.6; new 803.7 [172.8% faster]
  Docs/MB @ flush:           old 46.8; new 392.4 [739.1% more]
  Avg RAM used (MB) @ flush: old 60.3; new 10.7 [82.2% less]

16 MB
  old: 200000 docs in 660.6 secs; index size = 1.7G
  new: 200000 docs in 247.3 secs; index size = 1.7G
  Total Docs/sec:            old 302.8; new 808.7 [167.1% faster]
  Docs/MB @ flush:           old 46.7; new 415.4 [788.8% more]
  Avg RAM used (MB) @ flush: old 47.1; new 19.2 [59.3% less]

24 MB
  old: 200000 docs in 658.1 secs; index size = 1.7G
  new: 200000 docs in 243.0 secs; index size = 1.7G
  Total Docs/sec:            old 303.9; new 823.0 [170.8% faster]
  Docs/MB @ flush:           old 46.7; new 430.9 [822.2% more]
  Avg RAM used (MB) @ flush: old 70.0; new 27.5 [60.8% less]

32 MB
  old: 200000 docs in 714.2 secs; index size = 1.7G
  new: 200000 docs in 239.2 secs; index size = 1.7G
  Total Docs/sec:            old 280.0; new 836.0 [198.5% faster]
  Docs/MB @ flush:           old 46.7; new 432.2 [825.2% more]
  Avg RAM used (MB) @ flush: old 92.5; new 36.7 [60.3% less]

48 MB
  old: 200000 docs in 640.3 secs; index size = 1.7G
  new: 200000 docs in 236.0 secs; index size = 1.7G
  Total Docs/sec:            old 312.4; new 847.5 [171.3% faster]
  Docs/MB @ flush:           old 46.7; new 438.5 [838.8% more]
  Avg RAM used (MB) @ flush: old 138.9; new 52.8 [62.0% less]

64 MB
  old: 200000 docs in 649.3 secs; index size = 1.7G
  new: 200000 docs in 238.3 secs; index size = 1.7G
  Total Docs/sec:            old 308.0; new 839.3 [172.5% faster]
  Docs/MB @ flush:           old 46.7; new 441.3 [844.7% more]
  Avg RAM used (MB) @ flush: old 302.6; new 72.7 [76.0% less]

80 MB
  old: 200000 docs in 670.2 secs; index size = 1.7G
  new: 200000 docs in 227.2 secs; index size = 1.7G
  Total Docs/sec:            old 298.4; new 880.5 [195.0% faster]
  Docs/MB @ flush:           old 46.7; new 446.2 [855.2% more]
  Avg RAM used (MB) @ flush: old 231.7; new 94.3 [59.3% less]

96 MB
  old: 200000 docs in 683.4 secs; index size = 1.7G
  new: 200000 docs in 226.8 secs; index size = 1.7G
  Total Docs/sec: old
Re: Caching in QueryFilter - why?
Sounds like I need to cut that out. Since caching is built into the public BitSet bits(IndexReader reader) method, I don't see a way to deprecate that, which means I'll just cut it out and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be able to get the caching back by wrapping the QueryFilter in a CachingWrapperFilter. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Erik Hatcher [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Wednesday, April 4, 2007 7:38:00 PM Subject: Re: Caching in QueryFilter - why? CachingWrapperFilter came along after QueryFilter. I think I added CachingWrapperFilter when I realized that every Filter should have the capability to be cached without having to implement it. So, the only reason is legacy. I'm perfectly fine with removing the caching from QueryFilter in a future major release. Erik On Apr 4, 2007, at 5:57 PM, Otis Gospodnetic wrote: Hi, I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter and then at QueryFilter. I noticed QueryFilter does its own BitSet caching, and the caching part of its code is nearly identical to the code in CachingWrapperFilter. Why is that? Is there a good reason for that? Thanks, Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
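The wrapping Otis describes is easy to picture with a generic sketch. This is not Lucene's actual CachingWrapperFilter source; it is a minimal, self-contained illustration (with a simplified Filter interface standing in for Lucene's) of the per-reader BitSet cache that wrapping buys you:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Simplified stand-ins for Lucene's classes, for illustration only.
interface Filter {
    BitSet bits(Object reader); // Lucene passes an IndexReader here
}

// Caches the computed BitSet per reader, so any Filter gains caching
// just by being wrapped -- the idea behind CachingWrapperFilter.
class CachingWrapper implements Filter {
    private final Filter wrapped;
    // WeakHashMap lets cached bits be collected once the reader itself is gone.
    private final Map<Object, BitSet> cache = new WeakHashMap<Object, BitSet>();

    CachingWrapper(Filter wrapped) {
        this.wrapped = wrapped;
    }

    public synchronized BitSet bits(Object reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            cached = wrapped.bits(reader); // compute once per reader
            cache.put(reader, cached);
        }
        return cached;
    }
}
```

Any filter wrapped this way computes its BitSet once per reader; later calls are just map lookups, which is essentially what QueryFilter was doing internally before.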
[jira] Created: (LUCENE-857) Remove BitSet caching from QueryFilter
Remove BitSet caching from QueryFilter -- Key: LUCENE-857 URL: https://issues.apache.org/jira/browse/LUCENE-857 Project: Lucene - Java Issue Type: Improvement Reporter: Otis Gospodnetic Assigned To: Otis Gospodnetic Priority: Minor Since caching is built into the public BitSet bits(IndexReader reader) method, I don't see a way to deprecate that, which means I'll just cut it out and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be able to get the caching back by wrapping the QueryFilter in the CachingWrapperFilter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene and Javolution: A good mix ?
I'm not saying I'm against it, but one of the things that makes Lucene so great is its lack of dependencies in the core. It isn't necessarily a slippery slope, either, if we do add one dependency. Javolution is BSD license, AFAICT. I don't know if that is a good or bad license as far as Apache is concerned, but it should be looked into before you spend any time on it. This is not meant to be a discouragement. If it shows a significant improvement, people will notice and it will be taken seriously, especially if it is backward compatible, well tested and well documented. -Grant On Apr 5, 2007, at 3:19 PM, Jean-Philippe Robichaud wrote: Yes, I believe enough in this approach to try it. I'm already playing with it: I took the current trunk as a starting point. That being said, I'm quite busy right now so I can't promise any steady progress. Also, I won't apply patches that are already in JIRA, so the numbers I'll get won't be the up-to-date ones. I understand that before this idea gets any traction, we must have an idea of how much this could help. But before going deep with this work, I wanted to know if Lucene developers have any interest in this kind of work. If the gurus dislike the idea of adding a dependency to Lucene (which is not the case for other Apache projects!), then I won't spend too much time on this. Jp -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, April 05, 2007 3:01 PM To: java-dev@lucene.apache.org Subject: Re: Lucene and Javolution: A good mix ? What Mike said. Without seeing the Javolutionized Lucene in action we won't get very far. Jean-Philippe, are you interested in making the changes to Lucene and showing the performance improvement? Note that you can use the super-nice and easy-to-use contrib/benchmark to compare the vanilla Lucene and the Javolutionized Lucene. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Mike Klaas [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, April 5, 2007 1:58:38 PM Subject: Re: Lucene and Javolution: A good mix ? On 4/4/07, Jean-Philippe Robichaud [EMAIL PROTECTED] wrote: I understand your concerns! I was a little skeptical at the beginning. But even with the 1.5 JVM, the improvements still hold. Lucene creates a lot of garbage (strings, tokens, ...) either at index time or query time. While the new garbage collector strategies have seriously improved since Java 1.4, the gains are still there, as object creation is also a cost that Javolution easily saves us from. I think the best approach to convincing people would be to produce a patch that implements some of the suggested changes, and benchmark it. As it stands, the positives are all hypothetical and the negatives rather tangible. -Mike -- Grant Ingersoll http://www.grantingersoll.com/ http://lucene.grantingersoll.com http://www.paperoftheweek.com/
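The garbage Jean-Philippe mentions comes largely from materializing a new String or Token object per term. The reuse idea that Javolution-style code exploits can be shown without the library at all; this tokenizer is a hypothetical toy, not Lucene's analyzer API or Javolution's, but it illustrates exposing each token through one reused buffer instead of allocating per token:

```java
// Illustration of the allocation-reuse idea (NOT Javolution's actual API):
// instead of creating a new String per token, the tokenizer reuses a single
// buffer and exposes the current token as (chars, length).
class ReusableTokenizer {
    private final char[] buffer = new char[256]; // reused for every token
    private int length;
    private final String text;
    private int pos;

    ReusableTokenizer(String text) {
        this.text = text;
    }

    // Advances to the next whitespace-delimited token; returns false at end.
    boolean next() {
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
        if (pos >= text.length()) return false;
        length = 0;
        while (pos < text.length() && text.charAt(pos) != ' ') {
            buffer[length++] = text.charAt(pos++);
        }
        return true;
    }

    char[] tokenChars() { return buffer; }   // same array every call
    int tokenLength() { return length; }
}
```

Consumers that only compare or hash the (chars, length) pair never force an allocation; a String is built only when one is genuinely needed, so the per-token garbage Mike and Jean-Philippe are discussing simply never exists.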
Re: improve how IndexWriter uses RAM to buffer added documents
On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote: (I think for KS you add a previous segment not that differently from how you add a document)? Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace is that with the Fibonacci merge schedule and the seg-at-a-time indexing strategy, segments don't get merged nearly as often as they do in Lucene. Yeah we need to work on this one. What we need to do is cut down on decompression and conflict resolution costs when reading from one segment to another. KS has solved this problem for stored fields. Field defs are global and field values are keyed by name rather than field number in the field data file. Benefits: * Whole documents can be read from one segment to another as blobs. * No flags byte. * No remapping of field numbers. * No conflict resolution at all. * Compressed, uncompressed... doesn't matter. * Less code. * The possibility of allowing the user to provide their own subclass for reading and writing fields. (For Lucy, in the language of your choice.) What I haven't got yet is a way to move terms and postings economically from one segment to another. But I'm working on it. :) One thing that irks me about the current Lucene merge policy (besides that it gets confused when you flush-by-RAM-usage) is that it's a pay it forward design so you're always over-paying when you build a given index size. With KS's Fibonacci merge policy, you don't. LUCENE-854 has some more details. However, even under Fibo, when you get socked with a big merge, you really get socked. It bothers me that the time for adding to your index can vary so unpredictably. Segment merging really is costly. In building a large (86 GB, 10 MM docs) index, 65.6% of the time was spent merging! Details are in LUCENE-856... This is a great model. Are there Python bindings to Lucy yet/coming? I'm sure that they will appear once the C core is ready. 
The approach I am taking is to make some high-level design decisions collaboratively on lucy-dev, then implement them in KS. There's a large amount of code that has been written according to our specs that is working in KS and ready to commit to Lucy after trivial changes. There's more that's ready for review. However, release of KS 0.20 is taking priority, so code flow into the Lucy repository has slowed. I'll also be looking for a job in about a month. That may slow us down some more, though it won't stop things -- I've basically decided that I'll do what it takes to get Lucy off the ground. I'll go with something stopgap if nothing materializes which is compatible with that commitment. Marvin Humphrey Rectangular Research http://www.rectangular.com/
Re: Caching in QueryFilter - why?
: Since caching is built into the public BitSet bits(IndexReader reader)
: method, I don't see a way to deprecate that, which means I'll just cut
: it out and document it in CHANGES.txt. Anyone who wants QueryFilter
: caching will be able to get the caching back by wrapping the QueryFilter
: in your CachingWrapperFilter.

this seems like a potentially big surprise for people when upgrading ... old code will continue to work fine without warning, just get a lot less efficient. If the concern is duplicated code, perhaps QueryFilter should be deprecated and changed to be a subclass of CachingWrapperFilter, with a constructor that takes in the Query and wraps it in some new class (QueryWrapperFilter perhaps?) that does the meaty part (collecting the matches) ...

@deprecated use CachingWrapperFilter and QueryWrapperFilter directly
public class QueryFilter extends CachingWrapperFilter {
  public QueryFilter(Query query) {
    super(new QueryWrapperFilter(query));
  }
}

public class QueryWrapperFilter extends Filter {
  private Query query;

  public QueryWrapperFilter(Query query) {
    this.query = query;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    final BitSet bits = new BitSet(reader.maxDoc());
    new IndexSearcher(reader).search(query, new HitCollector() {
      public final void collect(int doc, float score) {
        bits.set(doc); // set bit for hit
      }
    });
    return bits;
  }
}

(obviously we need some toString, equals, and hashCode methods in here as well)

:
: Otis
: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
: Simpy -- http://www.simpy.com/ - Tag - Search - Share
:
: - Original Message 
: From: Erik Hatcher [EMAIL PROTECTED]
: To: java-dev@lucene.apache.org
: Sent: Wednesday, April 4, 2007 7:38:00 PM
: Subject: Re: Caching in QueryFilter - why?
:
: CachingWrapperFilter came along after QueryFilter. I think I added
: CachingWrapperFilter when I realized that every Filter should have
: the capability to be cached without having to implement it.
So, the : only reason is legacy. I'm perfectly fine with removing the : caching from QueryFilter in a future major release. : : Erik : : On Apr 4, 2007, at 5:57 PM, Otis Gospodnetic wrote: : : Hi, : : I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter : and then at QueryFilter. I noticed QueryFilter does its own BitSet : caching, and the caching part of its code is nearly identical to : the code in CachingWrapperFilter. : : Why is that? Is there a good reason for that? : : Thanks, : Otis : . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . : Simpy -- http://www.simpy.com/ - Tag - Search - Share : : : : - : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : : : - : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : : : : : : - : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-584: Attachment: bench-diff.txt

Perhaps I did something wrong with the benchmark, but I didn't get any speed-up when using searcher.match(Query, MatchCollector) vs. searcher.search(Query, HitCollector). Here are the benchmark numbers (5 queries with each), HitCollector first, MatchCollector second:

HITCOLLECTOR:
[java] Report Sum By (any) Name (11 about 41 out of 41)
[java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
[java] Rounds_4 0 10 10 1 808020 787.5 1,026.04 7,217,624 17,780,736
[java] Populate - - - - - - - - - - - - 4 - - - 2003 - - 129.9 - - 61.67 - 9,938,986 - 13,821,952
[java] CreateIndex - - - 4 1 4.4 0.91 3,937,522 10,916,864
[java] MAddDocs_2000 - - - - - - - - - - 4 - - - 2000 - - 138.1 - - 57.92 - 9,368,584 - 13,821,952
[java] Optimize - - - 4 1 1.4 2.83 9,938,218 13,821,952
[java] CloseIndex - - - - - - - - - - - 4 - - - - 1 - - 2,000.0 - - 0.00 - 9,938,986 - 13,821,952
[java] OpenReader - - - 4 1 24.0 0.17 9,957,592 13,821,952
[java] SearchSameRdr_5 - - - - - - - - 4 - - 5 - - 1,070.3 - - 186.86 - 10,500,146 - 13,821,952
[java] CloseReader - - - 4 1 4,000.0 0.00 9,059,756 13,821,952
[java] WarmNewRdr_50 - - - - - - - - - - 4 - - 10 - 16,237.7 - - 24.63 - 9,060,268 - 13,821,952
[java] SrchNewRdr_5 - - - 4 5 265.9 752.02 10,800,006 13,821,952
[java] Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41)
[java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
[java] MAddDocs_2000 0 10 10 1 2000 94.6 21.15 7,844,112 10,407,936
[java] MAddDocs_2000 - 1 100 10 - - 1 - - - 2000 - - 136.7 - - 14.63 - 8,968,144 - 11,309,056
[java] MAddDocs_2000 2 10 100 1 2000 173.2 11.55 10,528,264 15,740,928
[java] MAddDocs_2000 - 3 100 100 - - 1 - - - 2000 - - 188.7 - - 10.60 - 10,133,816 - 17,829,888

MATCHCOLLECTOR:
[java] Report Sum By (any) Name (11 about 41 out of 41)
[java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
[java] Rounds_4 0 10 10 1 808020 781.0 1,034.62 10,566,608 15,859,712
[java] Populate - - - - - - - - - - - - 4 - - - 2003 - - 130.9 - - 61.23 - 10,963,452 - 14,806,016
[java] CreateIndex - - - 4 1 33.9 0.12 3,616,570 11,020,288
[java] MAddDocs_2000 - - - - - - - - - - 4 - - - 2000 - - 137.3 - - 58.29 - 10,445,568 - 14,806,016
[java] Optimize - - - 4 1 1.4 2.82 10,979,398 14,806,016
[java] CloseIndex - - - - - - - - - - - 4 - - - - 1 - - 2,000.0 - - 0.00 - 10,963,452 - 14,806,016
[java] OpenReader - - - 4 1 22.0 0.18 10,982,058 14,806,016
[java] SearchSameRdr_5 - - - - - - - - 4 - - 5 - - 1,064.7 - - 187.84 - 11,060,036 - 14,806,016
[java] CloseReader - - - 4 1 4,000.0 0.00 10,353,206 14,806,016
[java] WarmNewRdr_50 - - - - - - - - - - 4 - - 10 - 16,419.0 - - 24.36 - 10,431,062 - 14,806,016
[java] SrchNewRdr_5 - - - 4 5 263.0 760.34 11,912,358 14,806,016
[java] Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41)
[java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
[java] MAddDocs_2000 0 10 10 1 2000 92.2 21.69 7,844,112 10,407,936
[java] MAddDocs_2000 - 1 100 10 - - 1 - - - 2000 - - 136.6 - - 14.64 - 7,720,352 - 10,407,936
[java] MAddDocs_2000 2 10 100 1 2000 167.8 11.92 11,325,952 17,571,840
[java] MAddDocs_2000 - 3 100 100 - - 1 - - - 2000 - - 199.3 - - 10.03 - 14,891,856 -
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
: Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a sizable part of time spent indexing. true, but as long as the changes you are making have no impact on the tokenization/docbuilding times, that shouldn't be a factor -- that should be considered a constant-time adjunct to the code you are varying ... people with expensive analysis may not see any significant increases, but that's their own problem -- people concerned about performance will already have that as fast as they can get it, and now the internals of document adding will get faster as well. -Hoss
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487108 ] Matt Ericson commented on LUCENE-855: - I am almost done with my patch and I wanted to test it against this patch to see who has the faster version. But the MemoryCachedRangeFilter is written using Java 1.5, and as far as I know Lucene is still on Java 1.4. Lines like private static WeakHashMap<IndexReader, Map<String, SortedFieldCache>> cache = new WeakHashMap<IndexReader, Map<String, SortedFieldCache>>(); will not compile in Java 1.4. Andy, I would love to see who has the faster patch; if you would convert your patch to use Java 1.4 I would be happy to put them side by side. MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Attachments: MemoryCachedRangeFilter.patch Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all docId, value pairs of a given field, sorts by value, and stores in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated by all the docId values that fall in between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range. Executing bits() 1000 times on standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. Performance increase is less dramatic when you have fewer unique terms in a field or use fewer documents. 
Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array) but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there's no longer a need to make the values lexicographically comparable, i.e. padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement. So it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly or is automatically called the first time MemoryCachedRangeFilter is applied using a given field. So in summary, MemoryCachedRangeFilter can be useful when: - Performance is critical - Memory is not an issue - Field contains many unique numeric values - Index contains a large number of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
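The bits() procedure described above (sort the (docId, value) pairs by value, binary-search the two bounds, set a bit for every doc in between) can be sketched with plain arrays. The class and method names here are illustrative, not the actual patch's API:

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of MemoryCachedRangeFilter's core idea: values[] holds the field
// values sorted ascending, and docIds[i] is the document whose value is
// values[i] (the two arrays are parallel).
class SortedFieldRange {
    private final long[] values; // sorted ascending
    private final int[] docIds;  // parallel to values

    SortedFieldRange(long[] sortedValues, int[] parallelDocIds) {
        this.values = sortedValues;
        this.docIds = parallelDocIds;
    }

    // Sets a bit for every doc whose value falls in [lower, upper]:
    // two binary searches, then one linear pass over the matching slice.
    BitSet bits(long lower, long upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        int start = lowerBound(lower); // first index with value >= lower
        int end = upperBound(upper);   // first index with value > upper
        for (int i = start; i < end; i++) {
            bits.set(docIds[i]);
        }
        return bits;
    }

    private int lowerBound(long key) {
        int i = Arrays.binarySearch(values, key);
        if (i < 0) return -i - 1;                      // insertion point
        while (i > 0 && values[i - 1] == key) i--;     // first duplicate
        return i;
    }

    private int upperBound(long key) {
        int i = Arrays.binarySearch(values, key);
        if (i < 0) return -i - 1;                      // insertion point
        while (i < values.length - 1 && values[i + 1] == key) i++;
        return i + 1;                                  // one past last dup
    }
}
```

This is why the cost per bits() call is O(log n) for the searches plus the size of the matching range, instead of a TermEnum walk over every term in the index.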
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
On 4/5/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a sizable part of time spend indexing. true, but as long as the changes you are making has no impact on the tokenization/docbuilding times, that shouldn't be a factor -- that should be consiered a cosntant time adjunct to the code you are varying ... people with expensive analysis may not see any significant increases, but that's their own problem -- people concerned about performance will already have that as fast as they can get it, and now the internals of document adding will get faster as well. Especially since it is relatively easy for users to tweak the analysis bits for performance--compared to the messy guts of index creation. I am eagerly tracking the progress of your work. -Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-857) Remove BitSet caching from QueryFilter
[ https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487116 ] Hoss Man commented on LUCENE-857: - From email since i didn't notice Otis opened this issue already... Date: Thu, 5 Apr 2007 14:24:31 -0700 (PDT) To: java-dev@lucene.apache.org Subject: Re: Caching in QueryFilter - why? : Since caching is built into the public BitSet bits(IndexReader reader) : method, I don't see a way to deprecate that, which means I'll just cut : it out and document it in CHANGES.txt. Anyone who wants QueryFilter : caching will be able to get the caching back by wrapping the QueryFilter : in your CachingWrapperFilter. this seems like a potentially big surprise for people when upgrading ... old code will continue to work fine without warning, just get a lot less efficient. If the concern is duplicated code, perhaps QueryFilter should be deprecated and changed to be a subclass of CachingWrapperFilter, with a constructor that takes in the Query and wraps it in some new class (QueryWrapperFilter perhaps?) that does the meaty part (collecting the matches) ... 
@deprecated use CachingWrapperFilter and QueryWrapperFilter directly
public class QueryFilter extends CachingWrapperFilter {
  public QueryFilter(Query query) {
    super(new QueryWrapperFilter(query));
  }
}

public class QueryWrapperFilter extends Filter {
  private Query query;

  public QueryWrapperFilter(Query query) {
    this.query = query;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    final BitSet bits = new BitSet(reader.maxDoc());
    new IndexSearcher(reader).search(query, new HitCollector() {
      public final void collect(int doc, float score) {
        bits.set(doc); // set bit for hit
      }
    });
    return bits;
  }
}

(obviously we need some toString, equals, and hashCode methods in here as well)

Remove BitSet caching from QueryFilter
--
Key: LUCENE-857
URL: https://issues.apache.org/jira/browse/LUCENE-857
Project: Lucene - Java
Issue Type: Improvement
Reporter: Otis Gospodnetic
Assigned To: Otis Gospodnetic
Priority: Minor
Attachments: LUCENE-857.patch

Since caching is built into the public BitSet bits(IndexReader reader) method, I don't see a way to deprecate that, which means I'll just cut it out and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be able to get the caching back by wrapping the QueryFilter in the CachingWrapperFilter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Resolved: (LUCENE-796) Change Visibility of fields[] in MultiFieldQueryParser
On 4/4/07, Otis Gospodnetic (JIRA) [EMAIL PROTECTED] wrote: [ https://issues.apache.org/jira/browse/LUCENE-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-796. - Resolution: Fixed Makes sense. Thanks Steve, applied. I left those 2 private attributes of MFQP as private until somebody asks for them to be protected. I'm not sure if this applies to this issue, but ISTM that a "private unless you bug the devs" approach to variable scoping is a little odd. A few unnecessary privates sprinkled through the code can really wreak havoc on efforts to extend functionality cleanly. This has caused me grief in the past, and waiting for a Lucene release isn't usually a good solution--cp is faster. What if maintaining backward-compatibility of the inheritance interface of classes was explicitly not guaranteed--would this allow the default policy for new code to use protected rather than private (unless there is a reason for the latter)? A class is either designed with extensibility in mind (or certain kinds of it), or not at all. It is perhaps unrealistic to audit all Lucene classes, but perhaps a whole class could be opened up when a bug report is filed? FWIW:
$ find -name \*.java | xargs grep private | wc
914
$ find -name \*.java | xargs grep protected | wc
260
cheers, -Mike
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Mike Klaas [EMAIL PROTECTED] wrote: On 4/5/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Thanks! But remember many Lucene apps won't see these speedups since I've : carefully minimized cost of tokenization and cost of document retrieval. I : think for many Lucene apps these are a sizable part of time spend indexing. true, but as long as the changes you are making has no impact on the tokenization/docbuilding times, that shouldn't be a factor -- that should be consiered a cosntant time adjunct to the code you are varying ... people with expensive analysis may not see any significant increases, but that's their own problem -- people concerned about performance will already have that as fast as they can get it, and now the internals of document adding will get faster as well. Especially since it is relatively easy for users to tweak the analysis bits for performance--compared to the messy guts of index creation. I am eagerly tracking the progress of your work. Thanks Mike (and Hoss). Hoss, what you said is correct: I'm only affecting the actual indexing of a document, nothing before that. I just want to make sure I get that disclaimer out, as much as possible, so nobody tries the patch and says Hey! My app only got 10% faster! This was false advertising!. People who indeed have minimized their doc retrieval and tokenization time should see speedups around what I'm seeing with the benchmarks (I hope!). Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Hi Otis! Otis Gospodnetic [EMAIL PROTECTED] wrote: You talk about a RAM buffer from 1MB - 96MB, but then you have the amount of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new 3.4 [90.1% less]). I don't follow 100% of what you are doing in LUCENE-843, so could you please explain what these 2 different amounts of RAM are? Is the first (1-96) the RAM you use for in-memory merging of segments? What is the RAM used @ flush? More precisely, why is it that that amount of RAM exceeds the RAM buffer? Very good questions! When I say the RAM buffer size is set to 96 MB, what I mean is I flush the writer when the in-memory segments are using 96 MB RAM. On trunk, I just call ramSizeInBytes(). I do the analogous thing with my patch (sum up size of RAM buffers used by segments). I call this part of the RAM usage the "indexed documents RAM". With every added document, this grows. But: this does not account for all data structures (Posting instances, HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used, but not saved away, during the indexing of a single document. All the things used temporarily while indexing a document take up RAM too. I call this part of the RAM usage the "document processing RAM". This RAM does not grow with every added document, though its size is in proportion to how big each document is. This memory is always re-used (does not grow with time). But with the trunk, this is done by creating garbage, whereas with my patch, I explicitly reuse it. When I measure the amount of RAM @ flush time, I'm calling MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual process memory usage, which should be (for my tests) around the sum of the above two types of RAM usage. With the trunk, the actual process memory usage tends to be quite a bit higher than the RAM buffer size and also tends to be very noisy (jumps around with each flush). 
I think this is because of delays/unpredictability on when GC kicks in to reclaim the garbage created during indexing of the doc. Whereas with my patch, it's usually quite a bit closer to the indexed documents RAM and does not jump around nearly as much. So the actual process RAM used will always exceed my RAM buffer size. The amount of excess is a measure of the overhead required to process the document. The trunk has far worse overhead than with my patch, which I think means a given application will be able to use a *larger* RAM buffer size with LUCENE-843. Does that make sense? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
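The measurement Mike describes uses the standard java.lang.management API and is easy to reproduce. A tiny sketch (the 4 MB ballast array is just a hypothetical stand-in for buffered documents; it is not Lucene code):

```java
import java.lang.management.ManagementFactory;

// Samples used heap the way the benchmark above does: via MemoryMXBean.
// The delta between two samples approximates what an allocation cost,
// modulo GC noise -- which is exactly the noisiness Mike mentions.
class RamAtFlush {
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        byte[] ballast = new byte[4 * 1024 * 1024]; // stand-in for buffered docs
        ballast[0] = 1; // keep the array reachable across the next sample
        long after = usedHeapBytes();
        System.out.println("heap grew by ~" + (after - before) + " bytes");
    }
}
```

Note the delta can even come out negative if a GC happens between the two samples; comparing this number against the writer's own accounting (ramSizeInBytes() on trunk) is what exposes the "document processing RAM" overhead Mike is describing.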
Re: improve how IndexWriter uses RAM to buffer added documents
Marvin Humphrey [EMAIL PROTECTED] wrote: On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote: (I think for KS you add a previous segment not that differently from how you add a document)? Yeah. KS has to decompress and serialize posting content, which sux. The one saving grace is that with the Fibonacci merge schedule and the seg-at-a-time indexing strategy, segments don't get merged nearly as often as they do in Lucene. Yeah we need to work on this one. What we need to do is cut down on decompression and conflict resolution costs when reading from one segment to another. KS has solved this problem for stored fields. Field defs are global and field values are keyed by name rather than field number in the field data file. Benefits: * Whole documents can be read from one segment to another as blobs. * No flags byte. * No remapping of field numbers. * No conflict resolution at all. * Compressed, uncompressed... doesn't matter. * Less code. * The possibility of allowing the user to provide their own subclass for reading and writing fields. (For Lucy, in the language of your choice.) I hear you, and I really really love those benefits, but, we just don't have this freedom with Lucene. I think the ability to suddenly birth a new field, or change a field's attributes like has vectors, stores norms, etc., with a new document, is something we just can't break at this point with Lucene? If we could get those benefits without breaking backwards compatibility then that would be awesome. I suppose if we had a single mapping of field names - numbers in the index, that would gain us many of the above benefits? Hmmm. What I haven't got yet is a way to move terms and postings economically from one segment to another. But I'm working on it. :) Here's one idea I just had: assuming there are no deletions, you can almost do a raw bytes copy from input segment to output (merged) segment of the postings for a given term X. I think for prox postings you can. 
But for freq postings, you can't, because they are delta coded. Except:
it's only the first entry of the incoming segment's freq postings that
needs to be re-interpreted? So you could read that one, encode the
delta based on the last docID of the previous segment (I think we'd
have to store this in the index, probably only if termFreq > threshold),
and then copyBytes the rest of the posting? I will try this out on the
merges I'm doing in LUCENE-843; I think it should work and make merging
faster (assuming no deletes)?

> One thing that irks me about the current Lucene merge policy (besides
> that it gets confused when you flush-by-RAM-usage) is that it's a
> pay-it-forward design, so you're always over-paying when you build a
> given index size. With KS's Fibonacci merge policy, you don't.
> LUCENE-854 has some more details. However, even under Fibo, when you
> get socked with a big merge, you really get socked. It bothers me that
> the time for adding to your index can vary so unpredictably.

Yeah, I think that's best solved by concurrency (either with threads or
with our own scheduling, e.g. on adding a doc you go and merge another
N terms in the running merge)? There have been several proposals
recently for making Lucene's merging concurrent (backgrounded), as part
of LUCENE-847.

Segment merging really is costly. In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging! Details are in
LUCENE-856...

>> This is a great model. Are there Python bindings to Lucy yet/coming?
>
> I'm sure that they will appear once the C core is ready. The approach
> I am taking is to make some high-level design decisions
> collaboratively on lucy-dev, then implement them in KS. There's a
> large amount of code that has been written according to our specs that
> is working in KS and ready to commit to Lucy after trivial changes.
> There's more that's ready for review. However, release of KS 0.20 is
> taking priority, so code flow into the Lucy repository has slowed.

OK, good to hear.
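The re-encode-only-the-first-delta idea can be sketched as follows. This is a simplified illustration, not Lucene's actual merge code: docID deltas are plain ints rather than VInts, freq values are omitted, and no deletions are assumed.

```java
/**
 * Simplified sketch (not Lucene's real merge code): docIDs within a
 * term's posting list are delta-coded.  To append segment B's postings
 * after segment A's, only B's first delta must be re-encoded; every
 * other delta is unchanged and could be copied as raw bytes.
 */
public class DeltaMergeSketch {

    /** Decode a delta-coded posting list into absolute docIDs. */
    static int[] decode(int[] deltas) {
        int[] docs = new int[deltas.length];
        int last = 0;
        for (int i = 0; i < deltas.length; i++) {
            last += deltas[i];
            docs[i] = last;
        }
        return docs;
    }

    /**
     * Merge two delta-coded lists.  docBaseB is the number of documents
     * in segment A; segment B's docIDs are shifted up by that amount.
     */
    static int[] mergeFreqDeltas(int[] deltasA, int[] deltasB, int docBaseB) {
        // Absolute last docID in segment A (the value the text suggests
        // storing in the index to avoid a decode pass at merge time).
        int lastDocA = 0;
        for (int d : deltasA) lastDocA += d;

        int[] merged = new int[deltasA.length + deltasB.length];
        System.arraycopy(deltasA, 0, merged, 0, deltasA.length);
        // Re-encode only B's first entry, relative to A's last docID...
        merged[deltasA.length] = (deltasB[0] + docBaseB) - lastDocA;
        // ...and copy the rest untouched (the "copyBytes" step).
        System.arraycopy(deltasB, 1, merged, deltasA.length + 1,
                         deltasB.length - 1);
        return merged;
    }
}
```

For example, if segment A's postings decode to docIDs 2, 5, 6 and segment B's to 0, 4 with a docBase of 7, only one new delta is computed and the remaining entries pass through verbatim.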
> I'll also be looking for a job in about a month. That may slow us
> down some more, though it won't stop things -- I've basically decided
> that I'll do what it takes to get Lucy off the ground. I'll go with
> something stopgap if nothing materializes which is compatible with
> that commitment.

Whoa, I'm sorry to hear that :( I hope you land, quickly, somewhere
that takes Lucy/KS seriously. It's clearly excellent work.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Michael, like everyone else, I am watching this very closely. So far it
sounds great!

On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
> When I measure amount of RAM @ flush time, I'm calling
> MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual
> process memory usage, which should be (for my tests) around the sum of
> the above two types of RAM usage.

One thing caught my eye, though: MemoryMXBean is JDK 1.5. :-(
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/MemoryMXBean.html
Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Grant Ingersoll [EMAIL PROTECTED] wrote:
> Michael, like everyone else, I am watching this very closely. So far
> it sounds great!
>
>> On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
>> When I measure amount of RAM @ flush time, I'm calling
>> MemoryMXBean.getHeapMemoryUsage().getUsed(). So, this measures actual
>> process memory usage, which should be (for my tests) around the sum
>> of the above two types of RAM usage.
>
> One thing caught my eye, though: MemoryMXBean is JDK 1.5. :-(
> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/MemoryMXBean.html

Yeah, thanks for pointing this out. I'm only using that to do my
benchmarking, not to actually measure RAM usage for when to flush, so I
will definitely remove it before committing. (I always go to a 1.4.2
environment and do an "ant clean test" to be certain I didn't do
something like this by accident :)

Mike
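For reference, the call under discussion is in java.lang.management, which was added in JDK 1.5 (hence the compatibility concern). A minimal probe looks like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** The JDK 1.5 heap-usage probe discussed above: asks the JVM for the
 *  number of heap bytes currently in use. */
public class HeapProbe {
    public static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("used heap: " + usedHeapBytes() + " bytes");
    }
}
```

Note the returned figure reflects the whole JVM heap (including garbage not yet collected), which is why it only approximates the sum of the writer's own buffers.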
Re: improve how IndexWriter uses RAM to buffer added documents
On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:

>> What we need to do is cut down on decompression and conflict
>> resolution costs when reading from one segment to another. KS has
>> solved this problem for stored fields. Field defs are global and
>> field values are keyed by name rather than field number in the field
>> data file. Benefits:
>>
>>   * Whole documents can be read from one segment to another as blobs.
>>   * No flags byte.
>>   * No remapping of field numbers.
>>   * No conflict resolution at all.
>>   * Compressed, uncompressed... doesn't matter.
>>   * Less code.
>>   * The possibility of allowing the user to provide their own
>>     subclass for reading and writing fields. (For Lucy, in the
>>     language of your choice.)
>
> I hear you, and I really, really love those benefits, but we just
> don't have this freedom with Lucene.

Yeah, too bad. This is one area where Lucene and Lucy are going to
differ. Balmain and I are of one mind about global field defs.

> I think the ability to suddenly birth a new field,

You can do that in KS as of version 0.20_02. :)

> or change a field's attributes like has vectors, stores norms, etc.,
> with a new document,

Can't do that, though, and I make no apologies. I think it's a
misfeature.

> I suppose if we had a single mapping of field names -> numbers in the
> index, that would gain us many of the above benefits? Hmmm.

You'll still have to be able to remap field numbers when adding entire
indexes.

> Here's one idea I just had: assuming there are no deletions, you can
> almost do a raw bytes copy from the input segment to the output
> (merged) segment of the postings for a given term X. I think for prox
> postings you can.

You can probably squeeze out some nice gains using a skipVint()
function, even with deletions.

> But for freq postings, you can't, because they are delta coded.

I'm working on this task right now for KS. KS implements the Flexible
Indexing paradigm, so all posting data goes in a single file.
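The skipVint() idea relies on the VInt wire format: every byte except the last has its high bit set as a continuation flag, so a value's end can be found without decoding it. A minimal sketch, assuming that byte layout:

```java
/** Sketch of skipping over a VInt without decoding it.  In the VInt
 *  format, each byte except the last has its high bit set, so the end
 *  of a value can be located without reconstructing the number. */
public class VIntSkipper {
    /** Return the index just past the VInt that starts at pos. */
    public static int skipVInt(byte[] buf, int pos) {
        while ((buf[pos] & 0x80) != 0) {
            pos++;  // continuation bit set: more bytes follow
        }
        return pos + 1;  // the byte with a clear high bit is the last
    }
}
```

For example, 300 encodes to two bytes (0xAC, 0x02) and 5 to one byte (0x05); the skipper steps over each without ever computing the values.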
I've applied an additional constraint to KS: every binary file must
consist of one type of record repeated over and over. Every indexed
field gets its own dedicated posting file with the suffix .pNNN, to
allow per-field posting formats. The I/O code is isolated in subclasses
of a new class called Stepper: you can turn any Stepper loose on its
file and read it from top to tail. When the file format changes,
Steppers will get archived, like old plugins.

My present task is to write the code for the Stepper subclasses
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can wait.)
As I write them, I will see if I can figure out a format that can be
merged as speedily as possible. Perhaps the precise variant of delta
encoding used in Lucene's .frq file should be avoided.

> Except: it's only the first entry of the incoming segment's freq
> postings that needs to be re-interpreted? So you could read that one,
> encode the delta based on the last docID of the previous segment (I
> think we'd have to store this in the index, probably only if
> termFreq > threshold), and then copyBytes the rest of the posting? I
> will try this out on the merges I'm doing in LUCENE-843; I think it
> should work and make merging faster (assuming no deletes)?

Ugh, more special-case code. I have to say, I started trying to go over
your patch, and the overwhelming impression I got, coming back to this
part of the Lucene code base in earnest for the first time since using
1.4.3 as a porting reference, was: simplicity seems to be nobody's
priority these days.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
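The Stepper idea described above might look roughly like the following in Java terms. This is a hypothetical sketch only: the real KS code is not Java, and the class shape and the (docID, freq) record format here are purely illustrative.

```java
import java.io.DataInput;
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical sketch of the Stepper idea: a binary file holds one
 *  record type repeated over and over, so a Stepper that knows that
 *  record can be turned loose on the file and run top to tail. */
abstract class Stepper {
    /** Read one record; return false at end of file. */
    abstract boolean step(DataInput in) throws IOException;

    /** Consume an entire file's worth of records; return the count. */
    final int run(DataInput in) throws IOException {
        int records = 0;
        while (step(in)) {
            records++;
        }
        return records;
    }
}

/** Illustrative record type: a stream of (docID, freq) int pairs. */
class MatchPostingStepper extends Stepper {
    int lastDocID = -1;

    boolean step(DataInput in) throws IOException {
        try {
            lastDocID = in.readInt();
            in.readInt();  // freq: read and discarded in this sketch
            return true;
        } catch (EOFException eof) {
            return false;
        }
    }
}
```

The appeal of the constraint is visible even in the sketch: because the file contains nothing but one record shape, the reader has no branching on record kinds, and archiving an old format means archiving one small class.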
Re: publish to maven-repository
Joerg Hohwiller wrote:
>> Then we'll need .sha1 and .md5 files for all pushed Jars. One of the
>> other developers will have to do that, as I don't have my PGP set up,
>> and hence no key for the KEYS file (if that's needed for the .sha1).
>
> You do not need PGP or anything like that for SHA-* or MD5. Those are
> just checksums, not authenticated signatures. I never deployed to
> ibiblio, but I think these files are generated automatically.

Actually, you do need to deploy a PGP signature, due to Apache policy.
But fortunately there is the maven-gpg-plugin, which does this for you.

> I hope my work helps to make it easier for the Lucene community to put
> further releases into the Maven repository.

Thanks for your efforts. I will pick up the work you started and put
something together for review soon.

--
Sami Siren
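To illustrate the checksum point: a Maven-repository .sha1 file contains nothing but the hex digest of the artifact's bytes, which anyone can compute with no key involved. A minimal sketch using the JDK's MessageDigest (the key-based PGP signature required by Apache policy is a separate artifact entirely):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Computes the hex digest that goes into a Maven-repo .sha1 file.
 *  No PGP key is involved: this is an integrity check, not a
 *  signature. */
public class ChecksumSketch {
    public static String sha1Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Writing sha1Hex(jarBytes) to lucene-core-2.1.0.jar.sha1 alongside the jar is all the repository layout requires for the checksum part; MD5 works the same way with "MD5" as the algorithm name.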