Re: [Lucene.Net] Possible bug in Lucene with Prefix Search and Danish Locale
Hey Matt,

This is issue 420: https://issues.apache.org/jira/browse/LUCENENET-420

I think the theory so far has been that the user should manage the culture rather than Lucene. If you disagree could you post on that issue ticket?

Thanks,
-Ben

----- Original Message -----
From: Matt Warren
To: lucene-net-...@lucene.apache.org
Cc:
Sent: Thursday, June 30, 2011 9:28 AM
Subject: [Lucene.Net] Possible bug in Lucene with Prefix Search and Danish Locale

I think that the code here shows a bug in Lucene.NET, see http://gist.github.com/1056231. This happens when using 2.9.2.

After some digging I think that it's due to the way it does a Prefix search. The main problem is shown by this code http://gist.github.com/1056242. If the Locale is Danish, this returns FALSE, weird eh!!

    "daab".StartsWith("da") //false

But this works as expected

    "daab".StartsWith("da", StringComparison.InvariantCulture) //true

The line of code that has this problem is the TermCompare(..) function in PrefixTermEnum.cs, see http://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/core/Search/PrefixTermEnum.cs
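A minimal C# sketch of the behaviour discussed in this thread, assuming a runtime on which the da-DK culture is available. The names PrefixCheck and StartsWithOrdinal are hypothetical and are not part of Lucene.Net. Whether the culture-sensitive call returns false depends on the runtime's collation data, so no result is claimed for it here. Ordinal comparison is shown as one culture-independent option; the thread itself uses StringComparison.InvariantCulture.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class PrefixCheck
{
    // Ordinal comparison looks only at UTF-16 code units, so the result
    // does not depend on the current thread's culture.
    public static bool StartsWithOrdinal(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal);
    }

    public static void Main()
    {
        CultureInfo previous = Thread.CurrentThread.CurrentCulture;
        try
        {
            // Assumption: da-DK is installed; otherwise this constructor throws.
            Thread.CurrentThread.CurrentCulture = new CultureInfo("da-DK");

            // Culture-sensitive overload: uses the current (Danish) culture.
            Console.WriteLine("daab".StartsWith("da"));

            // Culture-independent check.
            Console.WriteLine(StartsWithOrdinal("daab", "da"));
        }
        finally
        {
            // Restore the culture so the sketch has no side effects.
            Thread.CurrentThread.CurrentCulture = previous;
        }
    }
}
```

If the caller is expected to manage the culture, as suggested in the reply, the equivalent step is to set the thread culture before searching rather than changing the comparison inside the library.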
[Lucene.Net] [jira] [Commented] (LUCENENET-425) MMapDirectory implementation
[ https://issues.apache.org/jira/browse/LUCENENET-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049435#comment-13049435 ]

Ben West commented on LUCENENET-425:

Unfortunately (or perhaps fortunately, in that Digy doesn't need to do more work :-) MMap is slower on 64 bit too. Index is 2.2gb.

{panel}
Create index, FSDir: 419061
Create index, MMapdir: 532536
Search index, FSDir: 757
Search index, MMapdir: 2030
{panel}

Reversing order:

{panel}
Search index, FSDir: 734
Search index, MMap dir: 1934
{panel}

I have 8gb ram, so I think the entire index was able to be cached in memory by the OS.

> MMapDirectory implementation
>
> Key: LUCENENET-425
> URL: https://issues.apache.org/jira/browse/LUCENENET-425
> Project: Lucene.Net
> Issue Type: New Feature
> Affects Versions: Lucene.Net 2.9.4g
> Reporter: Digy
> Priority: Trivial
> Fix For: Lucene.Net 2.9.4g
> Attachments: MMapDirectory.patch
>
> Since this is not a direct port of MMapDirectory.java, I'll put it under "Support" and implement MMapDirectory as
> {code}
> public class MMapDirectory:Lucene.Net.Support.MemoryMappedDirectory
> {
> }
> {code}
> If a Mem-Map cannot be created (for example, if the file is too big to fit in the 32 bit address range), it will default to FSDirectory.FSIndexInput.
> In my tests, I didn't see any performance gain in a 32bit environment, and I consider it as better than nothing.
> I would be happy if someone could send test results on a 64bit platform.
> DIGY

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
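The fallback described in the quoted issue (use a memory map when possible, otherwise ordinary file input) can be sketched roughly as follows. This is not the code from MMapDirectory.patch; it is an illustration built on the .NET 4 System.IO.MemoryMappedFiles API, and the class name MappedOrStreamInput is hypothetical.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public sealed class MappedOrStreamInput : IDisposable
{
    private MemoryMappedFile mapping; // stays null when we fall back

    public Stream Stream { get; private set; }
    public bool IsMapped { get { return mapping != null; } }

    public MappedOrStreamInput(string path)
    {
        try
        {
            // Map the whole file read-only (capacity 0 = size of the file).
            mapping = MemoryMappedFile.CreateFromFile(
                path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
            // Size 0 = view to the end of the mapping. The view may be
            // rounded up to a page boundary, so a real implementation
            // should track the true file length separately.
            Stream = mapping.CreateViewStream(0, 0, MemoryMappedFileAccess.Read);
        }
        catch (Exception e)
        {
            // Only mapping-related failures trigger the fallback
            // (for example, not enough address space, or an empty file).
            if (!(e is IOException || e is ArgumentException ||
                  e is UnauthorizedAccessException))
            {
                throw;
            }
            if (mapping != null)
            {
                mapping.Dispose();
                mapping = null;
            }
            Stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        }
    }

    public void Dispose()
    {
        if (Stream != null) Stream.Dispose();
        if (mapping != null) mapping.Dispose();
    }
}
```

Callers read from Stream the same way in both cases, which is also what makes a like-for-like timing comparison such as the one above straightforward to set up.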
[Lucene.Net] [jira] [Commented] (LUCENENET-425) MMapDirectory implementation
[ https://issues.apache.org/jira/browse/LUCENENET-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049299#comment-13049299 ]

Ben West commented on LUCENENET-425:

The entire "Store" set of tests (including TestWindowsMMap) passes on Windows 7 64 bit with your patch. Let me know if there are other tests you'd like me to run. I'm not familiar with what mmap directories do, so I probably won't be able to write a perf test myself.
[Lucene.Net] [jira] [Commented] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039278#comment-13039278 ]

Ben West commented on LUCENENET-415:

No, I don't think we need them. I still don't understand why the CachingWrapperFilters are so much faster than QueryWrapperFilter even on fresh queries. But I guess since the cache has weak references, there isn't a lot of harm in using them.

> Contrib/Faceted Search
>
> Key: LUCENENET-415
> URL: https://issues.apache.org/jira/browse/LUCENENET-415
> Project: Lucene.Net
> Issue Type: New Feature
> Affects Versions: Lucene.Net 2.9.4
> Reporter: Digy
> Priority: Minor
> Attachments: PerformanceTest.cs, PerformanceTest.cs, PerformanceTest.cs, SimpleFacetedSearch.cs, SimpleFacetedSearch.cs, SimpleFacetedSearch.cs, SimpleFacetedSearch.cs, SimpleFacetedSearch2.cs, SimpleFacetedSearch2.cs, SimpleFacetedSearch2.cs, TestSimpleFacetedSearch.cs, TestSimpleFacetedSearch.cs, TestSimpleFacetedSearch2.cs, TestSimpleFacetedSearch2.cs, TestSimpleFacetedSearch2.cs, TestSimpleFacetedSearch2.cs, facet performance.xls, facet performance.xls, facet performance.xls
>
> Since I see a lot of questions about faceted search these days, I plan to add a Faceted-Search project to contrib.
> DIGY

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[Lucene.Net] [jira] [Updated] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben West updated LUCENENET-415:

Attachment: facet performance.xls

Everything is exactly as DIGY predicted. I will never disagree with him again :-)

See tab "Round 3". I commented out the Cardinality() call, and enabled caching but with unique queries. The bitset way is much faster now.
[Lucene.Net] [jira] [Updated] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben West updated LUCENENET-415:

Attachment: SimpleFacetedSearch.cs

Added parameter to choose whether search is via queries or doc id sets. Not sure if this will be desired in a final module, but hopefully useful for running perf tests.
[Lucene.Net] [jira] [Commented] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038239#comment-13038239 ]

Ben West commented on LUCENENET-415:

But that's just because some queries were repeated, right? The cache that I changed is the one which wraps the query, not the one which wraps the groups. So it's comparing a cached query to a fresh one, which isn't legitimate. The second round (comparing fresh to fresh) is the one we should look at.
[Lucene.Net] [jira] [Updated] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben West updated LUCENENET-415:

Attachment: facet performance.xls
            PerformanceTest.cs

I redid the tests, this time with no caching. The faceting is significantly (10x) slower in big indexes. See "Round 2" in attached spreadsheet.

I tried your idea of doing it with boolean queries, and that seems to be much faster. If you think that these stats are correct, we might want to consider doing facets in this style instead of with the bitset.

I tried to look at how Solr does it - I'm not familiar with their code, but it seems like they do it with a DocSet.
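For readers unfamiliar with the "bitset" style of faceting being compared here, the core step is intersecting the set of documents matched by the query with the set of documents in each facet group, then counting the survivors. The sketch below uses System.Collections.BitArray as a stand-in for Lucene's own bit set types; it is not the code in SimpleFacetedSearch.cs, and the names FacetCount and CountIntersection are hypothetical.

```csharp
using System;
using System.Collections;

public static class FacetCount
{
    // Counts documents that are both in the query result and in the facet group.
    // Bit i is set when document i belongs to the set; both arrays must have
    // the same length (one bit per document in the index).
    public static int CountIntersection(BitArray queryHits, BitArray groupDocs)
    {
        // BitArray.And modifies its receiver, so intersect on a copy.
        BitArray both = new BitArray(queryHits).And(groupDocs);

        int count = 0;
        foreach (bool bit in both)
        {
            if (bit) count++;
        }
        return count;
    }
}
```

The counting loop is the part that corresponds to a cardinality call: it touches every bit in the index for every group, which is one plausible reason the cost grows with index size.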
[Lucene.Net] [jira] [Updated] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben West updated LUCENENET-415:

Attachment: PerformanceTest.cs

Forgot to attach performance test code.
[Lucene.Net] [jira] [Updated] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben West updated LUCENENET-415:

Attachment: facet performance.xls
            SimpleFacetedSearch.cs

I added an option to SFS which selects whether queries are cached. I think this cache is only cleared on index reopen, so caching everything can be memory intensive (right?). Without this caching, it seems that memory impact should be minimal.

Also attached some performance results, having made Digy's suggested changes. Faceting is somewhat slower, particularly on larger indexes, but is fine for my usage. Interestingly enough, instantiation is almost instantaneous, which is good for those of us on NRT.
[Lucene.Net] [jira] [Commented] (LUCENENET-415) Contrib/Faceted Search
[ https://issues.apache.org/jira/browse/LUCENENET-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037066#comment-13037066 ]

Ben West commented on LUCENENET-415:

I believe line 94 should be _GroupByField, not "cat".
[jira] Commented: (LUCENENET-358) CloseableThreadLocal memory leak in LocalDataStoreSlot (with workaround)
[ https://issues.apache.org/jira/browse/LUCENENET-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865785#action_12865785 ]

Ben West commented on LUCENENET-358:

Wow. Searching fell from low hundreds of milliseconds to the teens, so an order of magnitude improvement. Thanks guys!

> CloseableThreadLocal memory leak in LocalDataStoreSlot (with workaround)
>
> Key: LUCENENET-358
> URL: https://issues.apache.org/jira/browse/LUCENENET-358
> Project: Lucene.Net
> Issue Type: Bug
> Environment: Microsoft Windows Server 2008 Enterprise x64. SP2. .NET Framework 4.0
> Reporter: Rezgar Cadro
> Priority: Critical
> Attachments: CloseableThreadLocal MemoryLeak.patch, CloseableThreadLocal.diff, CloseableThreadLocal.patch
>
> Recently we have been suffering from a severe memory leak when executing intense open/close operations on IndexSearcher and IndexModifier.
> Memory profiling showed that memory is being held by LocalDataStore[] objects.
> After some digging, the root of the problem has been found in the CloseableThreadLocal class:
> private System.LocalDataStoreSlot t = System.Threading.Thread.AllocateDataSlot();
> What we see is that every instantiated CloseableThreadLocal object causes a new data slot to be allocated for every thread.
> Thread.AllocateDataSlot() does not simply allocate a new slot, replacing an old one; it enlarges an existing in-thread buffer, appending data to the end of the internal LocalDataStore[] collection, which causes a severe memory leak.
> As long as the "t" variable is instantiated on every object creation, and (in the current class implementation) every object is used by a single thread, replacing "private System.LocalDataStoreSlot t = System.Threading.Thread.AllocateDataSlot();" with a simple "private object dataSlot;" and removing the "hardRefs" Dictionary solves the problem and prevents the memory leak.
> We have tried to implement the expected behavior by using the [ThreadStatic] attribute instead of LocalDataStoreSlot, but the attempt failed because of unexpected exceptions being thrown.
> Patch can be found at Lucene.Net repository under

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
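For comparison, here is a sketch of a per-thread holder that does not allocate a LocalDataStoreSlot per instance. It is not the approach taken in the attached patches (the description above replaces the slot with a plain object field). It swaps in System.Threading.ThreadLocal<T>, which exists in .NET Framework 4.0 and can be disposed; whether that type was an option for the framework versions Lucene.Net targeted at the time is an assumption. The class name CloseableThreadLocalSketch is hypothetical.

```csharp
using System;
using System.Threading;

public sealed class CloseableThreadLocalSketch<T> : IDisposable where T : class
{
    // One ThreadLocal<T> per instance; each thread sees its own value.
    private readonly ThreadLocal<T> slot = new ThreadLocal<T>();

    // Returns the calling thread's value, or null if that thread never set one.
    public T Get()
    {
        return slot.Value;
    }

    // Stores a value visible only to the calling thread.
    public void Set(T value)
    {
        slot.Value = value;
    }

    // Releases the underlying storage; Get and Set must not be used afterwards.
    public void Dispose()
    {
        slot.Dispose();
    }
}
```

The point of the sketch is only that disposing the holder gives the runtime a chance to release its per-thread storage, which is the behaviour the issue reports as missing with AllocateDataSlot().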
[jira] Issue Comment Edited: (LUCENENET-366) Spellchecker issues
[ https://issues.apache.org/jira/browse/LUCENENET-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863448#action_12863448 ]

Ben West edited comment on LUCENENET-366 at 5/3/10 2:24 PM:

Hey DIGY,

Java lucene doesn't have the duplicate checking - should I submit a bug to them?

EDIT: bah, I take it back. it does. Will work on porting.

was (Author: xodarap):
Hey DIGY,
Java lucene doesn't have the duplicate checking - should I submit a bug to them?

> Spellchecker issues
>
> Key: LUCENENET-366
> URL: https://issues.apache.org/jira/browse/LUCENENET-366
> Project: Lucene.Net
> Issue Type: Bug
> Reporter: Ben West
> Priority: Minor
> Attachments: LuceneNet-SpellcheckFixes.patch
>
> There are several issues with the spellchecker:
> - It doesn't do duplicate checking across updates (so the same word is often indexed many, many times)
> - The n-gram fields are stored as well as indexed, which increases the size of the index by several orders of magnitude and provides no benefit
> - Some deprecated functions are used, which slows it down
> - Some methods aren't commented fully
> I will attach a patch that fixes these.
[jira] Commented: (LUCENENET-366) Spellchecker issues
[ https://issues.apache.org/jira/browse/LUCENENET-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863448#action_12863448 ]

Ben West commented on LUCENENET-366:

Hey DIGY,

Java lucene doesn't have the duplicate checking - should I submit a bug to them?
[jira] Commented: (LUCENENET-359) Spellchecker accuracy gets overwritten
[ https://issues.apache.org/jira/browse/LUCENENET-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862693#action_12862693 ]

Ben West commented on LUCENENET-359:

Works for me. Thanks Digy.
[jira] Created: (LUCENENET-360) Spellchecker method misnamed "accuraty"
Spellchecker method misnamed "accuraty"

Key: LUCENENET-360
URL: https://issues.apache.org/jira/browse/LUCENENET-360
Project: Lucene.Net
Issue Type: Bug
Reporter: Ben West
Priority: Trivial

In spellchecker.cs there is a function named "setAccura*t*y". I'm pretty sure this is supposed to be "setAccura*c*y".
[jira] Created: (LUCENENET-359) Spellchecker accuracy gets overwritten
Spellchecker accuracy gets overwritten

Key: LUCENENET-359
URL: https://issues.apache.org/jira/browse/LUCENENET-359
Project: Lucene.Net
Issue Type: Bug
Reporter: Ben West
Priority: Minor

Spellchecker.cs line 205 has the following:

{quote}
//if queue full , maintain the min score
min = ((SuggestWord) sugqueue.Top()).score;
{quote}

What this is doing is resetting min to be the highest of the suggestions found so far. This is fine, except that min is a global, persistent variable. So if you set min to be .5, do a search that has a result of .9, your next search will have a min of .9, which means that the next suggestion probably will fail.

Fix would just be to make a localMin copy or some such and update that.
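The proposed fix (a localMin copy) can be illustrated with a small self-contained sketch. This is not the Lucene.Net SpellChecker code or the attached patch. The class SpellCheckerSketch, its Suggest method, and the use of a dictionary of pre-scored candidates are all hypothetical, chosen only to show the configured threshold staying untouched between searches.

```csharp
using System;
using System.Collections.Generic;

public class SpellCheckerSketch
{
    // Configured accuracy threshold; a search must never modify it.
    private float minScore = 0.5f;

    public void SetAccuracy(float value) { minScore = value; }
    public float GetAccuracy() { return minScore; }

    // Returns up to maxSuggestions words whose score is at least the threshold,
    // best first. Candidates arrive already scored (an assumption of this sketch).
    public List<string> Suggest(IDictionary<string, float> candidates, int maxSuggestions)
    {
        var words = new List<string>();
        if (maxSuggestions <= 0) return words;

        // Work on a local copy so raising the bar during this search
        // does not leak into the next one.
        float localMin = minScore;
        var kept = new List<KeyValuePair<string, float>>();

        foreach (KeyValuePair<string, float> candidate in candidates)
        {
            if (candidate.Value < localMin) continue;

            kept.Add(candidate);
            kept.Sort((a, b) => b.Value.CompareTo(a.Value)); // best first
            if (kept.Count > maxSuggestions) kept.RemoveAt(kept.Count - 1);

            // Once the list is full, later candidates must beat its weakest entry.
            if (kept.Count == maxSuggestions) localMin = kept[kept.Count - 1].Value;
        }

        foreach (KeyValuePair<string, float> entry in kept) words.Add(entry.Key);
        return words;
    }
}
```

With the threshold held in a local variable, a search that finds a .9 match leaves the configured accuracy at .5, so a following search can still return a .6 match.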