Re: improve how IndexWriter uses RAM to buffer added documents
On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote: Yonik Seeley [EMAIL PROTECTED] wrote: Wow, very nice results Mike! Thanks :) I'm just praying I don't have some sneaky bug making the results far better than they really are!! That's possible, but I'm confident that the model you're using is capable of the gains you're seeing. When I benched KinoSearch a year ago against Lucene, KS was getting close, but was still a little behind... http://www.rectangular.com/kinosearch/benchmarks.html (: Ironically, the numbers for Lucene on that page are a little better than they should be because of a sneaky bug. I would have made updating the results a priority if they'd gone the other way. :) ... However, Lucene has been tuned by an army of developers over the years, while KS is young yet and still had many opportunities for optimization. Current svn trunk for KS is about twice as fast for indexing as when I did those benchmarking tests. I look forward to studying your patch in detail at some point to see what you've done differently. It sounds like you only familiarized yourself with the high-level details of how KS has been working, yes? Hopefully, you misunderstood and came up with something better. ;) Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Created: (LUCENE-851) Pruning
On Mar 29, 2007, at 7:44 PM, Ning Li wrote: If a query requires top-K results, isn't it sufficient to find top-K results in each segment and merge them to return the overall top-K results? They are merged by collecting them into a HitQueue. Early termination happens in finding top-K results in one segment. Assuming each document has a static score, document ids are assigned in the same order as their static scores within a segment. If a top-K query is scored by the same static score, query processing on a segment can stop as soon as the first K results are found. Indeed, that's exactly how the loop in Scorer_collect() works. As to the indexing side, applications should be able to pick such a static score? If the Lucene score function is used, norm is a good candidate? (One tricky thing with norm is that it is updatable.) I would argue that only a single mechanism based on indexed, non-tokenized fields should be used to determine sort order. Sort order based upon norms is easy for the user to fake using a dedicated field at a small cost, so library-level support is not needed. Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
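A minimal sketch of the early-termination loop described in the message above, assuming doc ids within a segment are assigned in static-score order. The class and method names are made up for illustration; this is not actual Lucene or KinoSearch API.

import java.util.ArrayList;
import java.util.List;

// Illustrative collector: stops a segment scan once K hits are gathered.
// This is only valid because doc id order == static-score order in the segment.
class EarlyTerminatingCollector {
  private final int k;
  private final List<Integer> topDocs = new ArrayList<Integer>();

  EarlyTerminatingCollector(int k) { this.k = k; }

  // Returns false once K docs are collected, telling the scorer to stop early.
  boolean collect(int docId) {
    topDocs.add(docId);
    return topDocs.size() < k;  // the first K docs seen are the segment's top K
  }

  List<Integer> topDocs() { return topDocs; }
}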
Re: improve how IndexWriter uses RAM to buffer added documents
Marvin Humphrey [EMAIL PROTECTED] wrote: On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote: Yonik Seeley [EMAIL PROTECTED] wrote: Wow, very nice results Mike! Thanks :) I'm just praying I don't have some sneaky bug making the results far better than they really are!! That's possible, but I'm confident that the model you're using is capable of the gains you're seeing. When I benched KinoSearch a year ago against Lucene, KS was getting close, but was still a little behind... http://www.rectangular.com/kinosearch/benchmarks.html OK, glad to hear that :) I *think* I don't have such bugs. (: Ironically, the numbers for Lucene on that page are a little better than they should be because of a sneaky bug. I would have made updating the results a priority if they'd gone the other way. :) Hrm. It would be nice to have a hard comparison of Lucene, KS (and Ferret, and others?). ... However, Lucene has been tuned by an army of developers over the years, while KS is young yet and still had many opportunities for optimization. Current svn trunk for KS is about twice as fast for indexing as when I did those benchmarking tests. Wow, that's an awesome speedup! So KS is faster than Lucene today? I look forward to studying your patch in detail at some point to see what you've done differently. It sounds like you only familiarized yourself with the high-level details of how KS has been working, yes? Hopefully, you misunderstood and came up with something better. ;) Exactly! I very carefully didn't look closely at how KS does indexing. I did read your posts on this list and did read the Wiki page and I think a few other pages describing KS's merge model, but stopped there. We can compare our approaches in detail at some point and then cross-fertilize :) Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-853) Caching does not work when using RMI
[ https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Ericson updated LUCENE-853: Attachment: RemoteCachingWrapperFilter.patch A new version that will hopefully patch more correctly Caching does not work when using RMI Key: LUCENE-853 URL: https://issues.apache.org/jira/browse/LUCENE-853 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.1 Environment: All Reporter: Matt Ericson Priority: Minor Attachments: RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch Filters and caching use transient maps, so caching does not work if you are using RMI and a remote searcher. I want to add a new RemoteCachedFilter that will make sure that the caching is done on the remote searcher side. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486758 ] Otis Gospodnetic commented on LUCENE-855: - A colleague of mine is working on something similar, but possibly more efficient (less sorting and binary searching). He'll probably attach his patch to this issue. MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Attachments: MemoryCachedRangeFilter.patch Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all docId, value pairs of a given field, sorts by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range. Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents. Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array) but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there's no longer any need to make the values lexicographically comparable, i.e. padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement. So it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while to run on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly or is automatically called the first time MemoryCachedRangeFilter is applied using a given field. So in summary, MemoryCachedRangeFilter can be useful when: - Performance is critical - Memory is not an issue - The field contains many unique numeric values - The index contains a large number of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
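A hedged sketch of the cache-and-binary-search scheme the description outlines: load (docId, value) pairs, sort by value, then answer bits() from the sorted arrays. This is illustrative only, not the actual patch; the class name is invented.

import java.util.BitSet;

class SortedFieldCacheSketch {
  final long[] values;  // field values, sorted ascending
  final int[] docIds;   // docIds[i] is the doc whose value is values[i]

  // Memory cost matches the description: one int plus one long per document.
  SortedFieldCacheSketch(long[] sortedValues, int[] docIds) {
    this.values = sortedValues;
    this.docIds = docIds;
  }

  BitSet bits(long lower, long upper, int maxDoc) {
    BitSet result = new BitSet(maxDoc);
    int start = lowerBound(lower);    // first index with values[i] >= lower
    int end = lowerBound(upper + 1);  // first index with values[i] > upper (assumes upper < Long.MAX_VALUE)
    for (int i = start; i < end; i++) {
      result.set(docIds[i]);          // every doc between the two indices is in range
    }
    return result;
  }

  private int lowerBound(long key) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < key) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
}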
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486763 ] Yonik Seeley commented on LUCENE-855: - There is also something from Mark Harwood: https://issues.apache.org/jira/browse/LUCENE-798 MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: publish to maven-repository
Eh, missing Jars in the Maven repo again. Why does this always get dropped? I can push the Jars out, but I see we have no Maven POMs, or have we? I can create one for 2.1.0 based on http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.0.0/lucene-core-2.0.0.pom , but where should we keep those? Perhaps it's time to keep a lucene-core.pom in our repo, rename it at release time (e.g. cp lucene-core.pom lucene-core-2.1.0.pom) and push the core jar + core POM out? Thoughts? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Joerg Hohwiller [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Tuesday, April 3, 2007 4:49:15 PM Subject: publish to maven-repository Hi there, I will give it another try: Could you please publish lucene 2.* artifacts (including contribs) to the maven2 repository at ibiblio? Currently there is only the lucene-core available up to version 2.0.0: http://repo1.maven.org/maven2/org/apache/lucene/ JARs and POMs go to: scp://people.apache.org/www/www.apache.org/dist/maven-repository If you need assistance, I am pleased to help. But I am not an official apache member and do NOT have access to do the deployment myself. Thank you so much... Jörg - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-853) Caching does not work when using RMI
[ https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486764 ] Otis Gospodnetic commented on LUCENE-853: - Nice. Unit tests pass and caching seems to work. I'll make some small javadoc and cosmetic fixes, upload the prettified patch and commit on Friday. This will give 2 more days to others to review your changes and raise any issues they may see. Caching does not work when using RMI Key: LUCENE-853 URL: https://issues.apache.org/jira/browse/LUCENE-853 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767 ] Andy Liu commented on LUCENE-855: - Otis, looking forward to your colleague's patch. LUCENE-798 caches RangeFilters so that if the exact same range is executed again, the cached RangeFilter is used. However, the first time a range is encountered, you'll still have to calculate the RangeFilter, which can be slow. I haven't looked at the patch, but I'm sure LUCENE-798 can be used in conjunction with MemoryCachedRangeFilter to further boost performance for repeated range queries. MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads
[ https://issues.apache.org/jira/browse/LUCENE-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486768 ] Matt Ericson commented on LUCENE-798: - I am working on a patch that will use the field cache to do range queries. The bit sets will be proxies to the field cache. This way the data is stored in the field cache, and if you change the limits of your range it will just need a new proxy BitSet. Factory for RangeFilters that caches sections of ranges to reduce disk reads Key: LUCENE-798 URL: https://issues.apache.org/jira/browse/LUCENE-798 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Mark Harwood Attachments: CachedRangesFilterFactory.java RangeFilters can be cached using CachingWrapperFilter but are only re-used if a user happens to use *exactly* the same upper/lower bounds. This class demonstrates a caching approach where *sections* of ranges are cached as bitsets and these are re-used/combined to construct large range filters if they fall within the required range. This can improve the cache hit ratio and avoid going to disk to read large lists of Doc ids from TermDocs. This class needs some more work to add thread safety, but I'm making it available to gather feedback on the design at this early stage before making it robust. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
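A sketch of the proxy idea in the comment above: instead of materializing a BitSet per range, answer get(doc) straight from the field cache, so changing the bounds only costs a new lightweight proxy object. Illustrative only; this is not the actual patch, and the class name is invented.

import java.util.BitSet;

class RangeProxyBitSet extends BitSet {
  private final long[] valueByDoc;  // field cache: the field value for each docId
  private final long lower, upper;

  RangeProxyBitSet(long[] valueByDoc, long lower, long upper) {
    this.valueByDoc = valueByDoc;   // shared, loaded once per reader
    this.lower = lower;
    this.upper = upper;
  }

  // No population step: membership is computed on demand from the cache.
  public boolean get(int doc) {
    long v = valueByDoc[doc];
    return v >= lower && v <= upper;
  }
}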
Re: publish to maven-repository
On Apr 4, 2007, at 4:33 PM, Otis Gospodnetic wrote: Eh, missing Jars in the Maven repo again. Why does this always get dropped? Because none of us Lucene committers care much about Maven? :) Perhaps it's time to keep a lucene-core.pom in our repo, rename it at release time (e.g. cp lucene-core.pom lucene-core-2.1.0.pom) and push the core jar + core POM out? I don't know the Maven specifics, but I'm all for us maintaining the Maven POM file and bundling it with releases that get pushed to the repos. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-856) Optimize segment merging
Optimize segment merging Key: LUCENE-856 URL: https://issues.apache.org/jira/browse/LUCENE-856 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.1 Reporter: Michael McCandless Assigned To: Michael McCandless Priority: Minor With LUCENE-843, the time spent indexing documents has been substantially reduced and now the time spent merging is a sizable portion of indexing time. I ran a test using the patch for LUCENE-843, building an index of 10 million docs, each with ~5,500 bytes of plain text, with term vectors (positions + offsets) on and with 2 small stored fields per document. RAM buffer size was 32 MB. I didn't optimize the index in the end, though optimize speed would also improve if we optimize segment merging. Index size is 86 GB. Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes of which was spent merging. That's 65.6% of the time! Most of this time is presumably IO, which probably can't be reduced much unless we improve overall merge policy and experiment with values for mergeFactor / buffer size. These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO system is RAID 0 of 4 drives, so these times are probably better than the more common case of a single hard drive, which would likely have slower IO. I think there are some simple things we could do to speed up merging: * Experiment with buffer sizes -- maybe larger buffers for the IndexInputs used during merging could help? At the default mergeFactor of 10, the disk heads must do a lot of seeking back and forth between these 10 files (and then to the 11th file where we are writing). * Use byte copying when possible, e.g. if there are no deletions on a segment we can almost (I think?) just copy things like prox postings, stored fields, term vectors, instead of fully parsing them into Java objects and then re-serializing them. * Experiment with mergeFactor / different merge policies. For example I think LUCENE-854 would reduce time spent merging for a given index size. This is currently just a place to list ideas for optimizing segment merges. I don't plan on working on this until after LUCENE-843. Note that for autoCommit=false, this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, close the writer, then no segment merges happen at all. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
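On the byte-copying idea above, a hedged sketch of what a raw copy might look like. IndexInput.readBytes and IndexOutput.writeBytes are real Lucene store APIs; the helper class itself, its buffer size, and the assumption that a deletion-free segment's postings can be copied verbatim are hypothetical here.

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class RawSegmentCopy {
  // Copy numBytes verbatim, skipping the decode-to-objects / re-serialize cycle.
  static void copyBytes(IndexInput in, IndexOutput out, long numBytes)
      throws IOException {
    byte[] buffer = new byte[64 * 1024];  // a larger buffer also means fewer seeks
    long remaining = numBytes;
    while (remaining > 0) {
      int chunk = (int) Math.min(buffer.length, remaining);
      in.readBytes(buffer, 0, chunk);
      out.writeBytes(buffer, chunk);
      remaining -= chunk;
    }
  }
}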
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486788 ] Yonik Seeley commented on LUCENE-855: - LUCENE-798 caches RangeFilters so that if the same exact range is executed again [...] It's not just the exact same range though... it can reuse parts of ranges AFAIK. MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Caching in QueryFilter - why?
Hi, I'm looking at LUCENE-853, so I also looked at CachingWrapperFilter and then at QueryFilter. I noticed QueryFilter does its own BitSet caching, and the caching part of its code is nearly identical to the code in CachingWrapperFilter. Why is that? Is there a good reason for that? Thanks, Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791 ] Andy Liu commented on LUCENE-855: - Ah, you're right. I didn't read closely enough! MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
[ https://issues.apache.org/jira/browse/LUCENE-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-619. - Resolution: Fixed That Jar is still invalid (2.3K). However, if anyone is going to be upgrading to a newer version of Lucene, they'll go straight to Lucene 2.0.0 or 2.1.0, not 1.9.1, so I'll mark this as Won't Fix. The Jars for Lucene 2.0.0 are good - see LUCENE-734. We still need to push 2.1.0 jars + POMs, though. Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed Key: LUCENE-619 URL: https://issues.apache.org/jira/browse/LUCENE-619 Project: Lucene - Java Issue Type: Bug Affects Versions: 1.9, 2.0.0 Environment: http://www.ibiblio.org/maven2/org/apache/lucene/lucene-core/ Reporter: Jordan Christensen The lucene JARs at the URL listed in the Environment field only contain the Maven 2 POMs, and not the actual compiled classes. The correct JARs need to be uploaded so that Lucene 1.9.1 and 2.0 can work in Maven 2. This was listed as fixed in http://issues.apache.org/jira/browse/LUCENE-551, but was not properly done. The JARs in the Apache Maven repo are incorrect as well. (http://www.apache.org/dist/maven-repository/org/apache/lucene/lucene-core/) This issue was raised and confirmed on the mailing list as well: http://www.gossamer-threads.com/lists/lucene/java-user/37169 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Reopened: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
[ https://issues.apache.org/jira/browse/LUCENE-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reopened LUCENE-619: - Eh, I said Won't Fix, not Fixed. Reopening... Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed Key: LUCENE-619 URL: https://issues.apache.org/jira/browse/LUCENE-619 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-619) Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed
[ https://issues.apache.org/jira/browse/LUCENE-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-619. - Resolution: Won't Fix Lucene 1.9.1 and 2.0.0 Maven 2 packages are incorrectly deployed Key: LUCENE-619 URL: https://issues.apache.org/jira/browse/LUCENE-619 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-622: Attachment: lucene-core.pom Here is the POM for Maven boys and girls who want lucene-core. Stephen: What would the contrib POM look like? I don't think we'd have 1 POM, because each project in Lucene contrib is a separate project and a separate jar with its own dependencies. But maybe one can construct a single POM for the whole Lucene contrib - I haven't touched Maven in a few years. Provide More of Lucene For Maven Key: LUCENE-622 URL: https://issues.apache.org/jira/browse/LUCENE-622 Project: Lucene - Java Issue Type: Task Affects Versions: 2.0.0 Reporter: Stephen Duncan Jr Attachments: lucene-core.pom Please provide javadoc and source jars for lucene-core. Also, please provide the rest of lucene (the jars inside of contrib in the download bundle) if possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-853) Caching does not work when using RMI
[ https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-853: Lucene Fields: [New, Patch Available] (was: [New]) Caching does not work when using RMI Key: LUCENE-853 URL: https://issues.apache.org/jira/browse/LUCENE-853 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-853) Caching does not work when using RMI
[ https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-853: Attachment: RemoteCachingWrapperFilter.patch Here is a cleaned up version. - Changed CachingWrapperFilter private vars to protected, so CachingWrapperFilterHelper can extend it - Expanded unit tests to be more convincing - Javadocs all fixed up + cosmetics + code comments n.b. The @todo in CachingWrapperFilter can go now:

/**
 * @todo What about serialization in RemoteSearchable? Caching won't work.
 *       Should transient be removed?
 */
protected transient Map cache;

We keep the transient, and if you want remote caching, use RemoteCachingWrapperFilter. I'll commit on Friday. Caching does not work when using RMI Key: LUCENE-853 URL: https://issues.apache.org/jira/browse/LUCENE-853 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
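A hedged usage sketch of the split described above: CachingWrapperFilter keeps its transient, JVM-local cache, while RemoteCachingWrapperFilter is what you send over RMI so caching happens on the remote searcher. RangeFilter, CachingWrapperFilter, and Searcher.search(Query, Filter) are real Lucene 2.1 API; RemoteCachingWrapperFilter is the class this patch adds, and the field name and dates are placeholders.

import java.io.IOException;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.RemoteCachingWrapperFilter;  // from this patch
import org.apache.lucene.search.Searcher;

class RemoteCachingUsage {
  Hits search(Searcher remoteSearcher, Query query) throws IOException {
    Filter range = new RangeFilter("date", "20070101", "20071231", true, true);

    // Local searching: the cache stays in this JVM.
    Filter local = new CachingWrapperFilter(range);

    // Remote searching: the wrapper travels over RMI, so the BitSets get
    // cached on the remote searcher side instead of being lost as transient.
    Filter remote = new RemoteCachingWrapperFilter(range);
    return remoteSearcher.search(query, remote);
  }
}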
[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486805 ] Stephen Duncan Jr commented on LUCENE-622: -- Because they are separate projects' jars, they would each have their own POM. Provide More of Lucene For Maven Key: LUCENE-622 URL: https://issues.apache.org/jira/browse/LUCENE-622 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486807 ] Otis Gospodnetic commented on LUCENE-622: - Right, that's what I was trying to say. Can you provide POMs for contrib projects, or maybe just the ones that you use/need? Provide More of Lucene For Maven Key: LUCENE-622 URL: https://issues.apache.org/jira/browse/LUCENE-622 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: publish to maven-repository
Jörg, Since you offered to help - please see https://issues.apache.org/jira/browse/LUCENE-622 . The lucene-core POM is there for 2.1.0, but if you need POMs for contrib/*, please attach them to that issue. We have Jars, obviously, so we just need to copy those. Then we'll need .sha1 and .md5 files for all pushed Jars. One of the other developers will have to do that, as I don't have my PGP set up, and hence no key for the KEYS file (if that's needed for the .sha1). Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-622) Provide More of Lucene For Maven
[ https://issues.apache.org/jira/browse/LUCENE-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486809 ] Stephen Duncan Jr commented on LUCENE-622: -- I'm no longer doing any work with Lucene, and I'm not even sure which contrib project I wanted at the time I filed this request. While I'm sure that having poms for the contrib releases would be helpful to many people using Maven, this is no longer something that's a priority for me. Provide More of Lucene For Maven Key: LUCENE-622 URL: https://issues.apache.org/jira/browse/LUCENE-622 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Caching in QueryFilter - why?
CachingWrapperFilter came along after QueryFilter. I think I added CachingWrapperFilter when I realized that every Filter should have the capability to be cached without having to implement it. So, the only reason is legacy. I'm perfectly fine with removing the caching from QueryFilter in a future major release. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
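The point above in code: with CachingWrapperFilter, any Filter picks up caching by wrapping, which makes QueryFilter's built-in BitSet cache redundant legacy. The classes below are real Lucene 2.1 API; the field and term are placeholders.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

class WrappedFilterExample {
  Filter cachedTypeFilter() {
    // QueryFilter would cache its BitSet on its own; the wrapper makes
    // that internal caching unnecessary.
    Filter base = new QueryFilter(new TermQuery(new Term("type", "article")));
    return new CachingWrapperFilter(base);  // caching is the wrapper's job
  }
}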
Re: [jira] Created: (LUCENE-856) Optimize segment merging
On 4/4/07, Michael McCandless (JIRA) [EMAIL PROTECTED] wrote: Note that for autoCommit=false, this optimization is somewhat less important, depending on how often you actually close/open a new IndexWriter. In the extreme case, if you open a writer, add 100 MM docs, close the writer, then no segment merges happen at all. I think in the current code, the merge behavior for autoCommit=false is the same as that for autoCommit=true, isn't it? Cheers, Ning - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM
[ https://issues.apache.org/jira/browse/LUCENE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-636. - Resolution: Won't Fix We've moved away from using system properties. I think there are only a couple of places in the code that still refer to system properties, and those are, I believe, deprecated:

$ ffjg System.getProp
./org/apache/lucene/analysis/standard/ParseException.java: protected String eol = System.getProperty("line.separator", "\n");
./org/apache/lucene/index/SegmentReader.java: System.getProperty("org.apache.lucene.SegmentReader.class",
./org/apache/lucene/queryParser/ParseException.java: protected String eol = System.getProperty("line.separator", "\n");
./org/apache/lucene/store/FSDirectory.java: public static final String LOCK_DIR = System.getProperty("org.apache.lucene.lockDir",
./org/apache/lucene/store/FSDirectory.java: System.getProperty("java.io.tmpdir"));
./org/apache/lucene/store/FSDirectory.java: System.getProperty("org.apache.lucene.FSDirectory.class",
./org/apache/lucene/store/FSDirectory.java: String lockClassName = System.getProperty("org.apache.lucene.store.FSDirectoryLockFactoryClass");
./org/apache/lucene/util/Constants.java: /** The value of <tt>System.getProperty("java.version")</tt>. **/
./org/apache/lucene/util/Constants.java: public static final String JAVA_VERSION = System.getProperty("java.version");
./org/apache/lucene/util/Constants.java: /** The value of <tt>System.getProperty("os.name")</tt>. **/
./org/apache/lucene/util/Constants.java: public static final String OS_NAME = System.getProperty("os.name");

I'll close this as Won't Fix. [PATCH] Differently configured Lucene 'instances' in same JVM - Key: LUCENE-636 URL: https://issues.apache.org/jira/browse/LUCENE-636 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Johan Stuyts Attachments: Lucene2DifferentConfigurations.patch Currently Lucene can be configured using system properties. When running multiple 'instances' of Lucene for different purposes in the same JVM, it is not possible to use different settings for each 'instance'. I made changes to some Lucene classes so you can pass a configuration to that class. The Lucene 'instance' will use the settings from that configuration. The changes do not affect the API and/or the current behavior, so they are backwards compatible. In addition to the changes above I also made the SegmentReader and SegmentTermDocs extensible outside of their package. I would appreciate the inclusion of these changes but don't mind creating a separate issue for them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-789) Custom similarity is ignored when using MultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486835 ] Otis Gospodnetic commented on LUCENE-789: - Alexey, the best way to start with this, and the way that will help get this fixed in Lucene core, is to write a unit test class that does what your code does with MultiSearcher and BooleanQuery, and shows that the test fails when a custom Similarity class is used. You can make that custom Similarity an inner class in your unit test class, to contain everything neatly in a single class. Once we see the test failing we can apply your suggested fix and see if that works, if your previously broken unit test now passes, and if all other unit tests still pass. Custom similarity is ignored when using MultiSearcher - Key: LUCENE-789 URL: https://issues.apache.org/jira/browse/LUCENE-789 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.0.1 Reporter: Alexey Lef Symptoms: I am using Searcher.setSimilarity() to provide a custom similarity that turns off the tf() factor. However, somewhere along the way the custom similarity is ignored and the DefaultSimilarity is used. I am using MultiSearcher and BooleanQuery. Problem analysis: The problem seems to be in the MultiSearcher.createWeight(Query) method. It creates an instance of CachedDfSource but does not set the similarity. As a result, CachedDfSource provides DefaultSimilarity to queries that use it. Potential solution: Adding the following line: cacheSim.setSimilarity(getSimilarity()); after creating an instance of CachedDfSource (line 312) seems to fix the problem. However, I don't understand enough of the inner workings of this class to be absolutely sure that this is the right thing to do. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
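A hedged sketch of the test Otis describes: a custom Similarity as an inner class, plugged into a MultiSearcher over two sub-searchers. Index creation and the actual score assertions are elided, and the index paths are placeholders; this shows the shape of such a test, not a finished one.

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

public class TestMultiSearcherSimilarity extends junit.framework.TestCase {

  // Custom Similarity as an inner class, turning off the tf() factor
  // as in the bug report.
  static class NoTfSimilarity extends DefaultSimilarity {
    public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
  }

  public void testCustomSimilarityPropagates() throws Exception {
    IndexSearcher s1 = new IndexSearcher("/tmp/index1");  // placeholder paths
    IndexSearcher s2 = new IndexSearcher("/tmp/index2");
    MultiSearcher multi = new MultiSearcher(new Searchable[] { s1, s2 });
    multi.setSimilarity(new NoTfSimilarity());
    // A BooleanQuery searched through 'multi' is weighted via CachedDfSource;
    // without the cacheSim.setSimilarity(getSimilarity()) fix, scores here
    // would match DefaultSimilarity instead of NoTfSimilarity -- assert that.
  }
}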
[jira] Commented: (LUCENE-848) Add support for Wikipedia English as a corpus in the benchmarker stuff
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486837 ] Karl Wettin commented on LUCENE-848: - Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download the bzip2 snapshots they provide (and that they prefer you use, if you're getting much). They also supply the rendered HTML every now and then. It should be enough to change the URL pattern to file:///tmp/wikipedia/. I was considering porting the MediaWiki BNF as a tokenizer, but found it much simpler to just parse the HTML. Add support for Wikipedia English as a corpus in the benchmarker stuff Key: LUCENE-848 URL: https://issues.apache.org/jira/browse/LUCENE-848 Project: Lucene - Java Issue Type: New Feature Components: contrib/benchmark Reporter: Steven Parkes Assigned To: Steven Parkes Priority: Minor Fix For: 2.2 Attachments: WikipediaHarvester.java Add support for using Wikipedia for benchmarking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene and Javolution: A good mix ?
I would suggest that the Javolution folks do their tests against a modern JVM... I have followed the Javolution project for some time, and while I agree that some of the techniques should improve things, I think that modern JVMs do most of this work for you (and the latest class libraries also help - StringBuilder and others). I also think that when you start doing your own memory management you might as well write the code in C/C++, because you need to use similar techniques (similar to the resource management when using SWT). Just my thoughts. On Apr 4, 2007, at 8:54 PM, Jean-Philippe Robichaud wrote: Hello Dear Lucene coders! Some of you may remember, I'm using lucene for a product (and many other internal utilities). I'm also using another open source library called Javolution (http://www.javolution.org/) which does many things, one of them being to offer excellent replacements for ArrayList/Map/... and a super good memory management extension to the java language. As I'm [trying to] follow the conversations on this list, I see that many of you are working towards optimizing lucene in terms of memory footprint and speed. I just finished optimizing my code (not lucene itself, but my code written on top of it) using the Javolution PoolContext and the FastList/FastMap/... classes. The result is code that is 6 times faster. Javolution makes it easy to recycle objects and do some object allocation on the stack rather than on the heap, which removes stress on the garbage collector. Javolution also offers 2 classes (Text and TextBuilder) to replace String/StringBuffer which are perfect for anything related to string manipulation, and some C union/struct equivalents for java. The thing is really great. Would anyone be interested in giving Lucene a face lift and start using javolution as a core lucene dependency? I understand that right now, lucene is free of any dependencies, which is quite great, but anyone interested in writing fast/lean/stable java applications should seriously consider using javolution anyway. Any thoughts? Jp - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
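For reference, a minimal plain-Java sketch of the object-recycling technique discussed above (the kind of thing Javolution's PoolContext automates): reuse instances instead of allocating per call, easing pressure on the garbage collector. This is illustrative only and not Javolution's actual API.

import java.util.ArrayList;

class StringBufferPool {
  private final ArrayList free = new ArrayList();

  StringBuffer acquire() {
    int n = free.size();
    if (n == 0) return new StringBuffer();        // pool empty: allocate
    StringBuffer sb = (StringBuffer) free.remove(n - 1);
    sb.setLength(0);                              // reset state before reuse
    return sb;
  }

  void release(StringBuffer sb) {
    free.add(sb);  // recycle instead of leaving it to the garbage collector
  }
}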
[jira] Commented: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM
[ https://issues.apache.org/jira/browse/LUCENE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486840 ] Ken Geis commented on LUCENE-636: - This is not going to be sufficient. There are active code paths that still use System.getProperty(..). For instance, the static initializers of FSDirectory and SegmentReader. If I load up a Compass-based web app, and it uses an old version of Lucene that works off system properties, it will set the org.apache.lucene.SegmentReader.class property to use a Compass-specific segment reader. Then in another web app that uses a current version of Lucene that has moved away from using system properties, the application will crash when it tries to load the SegmentReader class. [PATCH] Differently configured Lucene 'instances' in same JVM - Key: LUCENE-636 URL: https://issues.apache.org/jira/browse/LUCENE-636 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: improve how IndexWriter uses RAM to buffer added documents
On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote: (: Ironically, the numbers for Lucene on that page are a little better than they should be because of a sneaky bug. I would have made updating the results a priority if they'd gone the other way. :) Hrm. It would be nice to have hard comparison of the Lucene, KS (and Ferret and others?). Doing honest, rigorous benchmarking is exacting and labor-intensive. Publishing results tends to ignite flame wars I don't have time for. The main point that I wanted to make with that page was that KS was a lot faster than Plucene, and that it was in Lucene's ballpark. Having made that point, I've moved on. The benchmarking code is still very useful for internal development and I use it frequently. At some point I would like to port the benchmarking work that has been contributed to Lucene of late, but I'm waiting for that code base to settle down first. After that happens, I'll probably make a pass and publish some results. Better to spend the time preparing one definitive presentation than to have to rebut every idiot's latest wildly inaccurate shootout. ... However, Lucene has been tuned by an army of developers over the years, while KS is young yet and still had many opportunities for optimization. Current svn trunk for KS is about twice as fast for indexing as when I did those benchmarking tests. Wow, that's an awesome speedup! The big bottleneck for KS has been its Tokenizer class. There's only one such class in KS, and it's regex-based. A few weeks ago, I finally figured out how to hook it into Perl's regex engine at the C level. The regex engine is not an official part of Perl's C API, so I wouldn't do this if I didn't have to, but the tokenizing loop is only about 100 lines of code and the speedup is dramatic. I've also squeezed out another 30-40% by changing the implementation in ways which have gradually winnowed down the number of malloc() calls. Some of the techniques may be applicable to Lucene; I'll get around to firing up JIRA issues describing them someday. So KS is faster than Lucene today? I haven't tested recent versions of Lucene. I believe that the current svn trunk for KS is faster for indexing than Lucene 1.9.1. But... A) I don't have an official release out with the current Tokenizer code, B) I have no immediate plans to prepare further published benchmarks, and C) it's not really important, because so long as the numbers are close you'd be nuts to choose one engine or the other based on that criteria rather than, say, what language your development team speaks. KinoSearch scales to multiple machines, too. Looking to the future, I wouldn't be surprised if Lucene edged ahead and stayed slightly ahead speed-wise, because I'm prepared to make some sacrifices for the sake of keeping KinoSearch's core API simple and the code base as small as possible. I'd rather maintain a single, elegant, useful, flexible, plenty fast regex-based Tokenizer than the slew of Tokenizers Lucene offers, for instance. It might be at a slight disadvantage going mano a mano against Lucene's WhiteSpaceTokenizer, but that's fine. Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: publish to maven-repository
hi, I am volunteering to help on putting together releasable m2 artifacts for Lucene. I have high hopes to start building and spreading m2 artifacts for other Lucene sub projects too (of course if there are no objections). -- Sami Siren - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]