[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693524#action_12693524 ]

Shai Erera commented on LUCENE-1575:

BooleanScorer defines an internal package-private static final Collector class. Two questions:
# May I change it to BooleanCollector? (the name conflicts with the Collector name we want to give to all base collectors)
# May I change it to private static final? It is used only in BooleanScorer's newCollector() method.

I think the two are safe because it's already package-private and there's no other Lucene code which uses it. BTW, we might wanna review the visibility of BooleanScorer's internal classes. They are all package-private, with some public methods, however they are used by BooleanScorer only ... but that's something for a different issue.

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as they are by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, using the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also, this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()?
Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch, (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
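The setScorer/collect decoupling proposed above can be sketched in a few lines of plain Java. This is a toy stand-in, not Lucene's actual classes - all names here are illustrative. The point it shows: a collector that doesn't need the score (e.g. one that only counts hits) never calls scorer.score() at all, which is the saving the proposal is after.

```java
// Sketch of the proposed decoupling of score from collect().
// All class names are illustrative stand-ins, not Lucene's actual API.
public class CollectorSketch {

    // Minimal stand-in for Lucene's Scorer.
    abstract static class Scorer {
        abstract float score();
    }

    // The proposed base class: collect() takes only a doc id; the score,
    // if needed, is pulled from the Scorer passed via setScorer().
    abstract static class Collector {
        abstract void setScorer(Scorer scorer);
        abstract void collect(int doc);
    }

    // A collector that only counts hits never touches the scorer.
    static class CountingCollector extends Collector {
        int totalHits;
        void setScorer(Scorer scorer) { /* score not needed here */ }
        void collect(int doc) { totalHits++; }
    }

    static int countDocs(int[] docs) {
        CountingCollector c = new CountingCollector();
        c.setScorer(new Scorer() { float score() { return 1.0f; } });
        for (int doc : docs) {
            c.collect(doc);
        }
        return c.totalHits;
    }

    public static void main(String[] args) {
        System.out.println(countDocs(new int[] {0, 3, 7}));  // 3
    }
}
```

A score-aware collector would simply keep the Scorer it receives in setScorer() and call score() inside collect(), which is why the "what if Scorer is null?" question above matters.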
Re: Possible IndexInput optimization
Earwin, I did not experiment lately, but I'd like to add a general compressed integer array to the basic types in an index, that would be compressed on writing and decompressed on reading. A first attempt is at LUCENE-1410, and one of the choices I had there was whether or not to use NIO buffer methods on the index side. I started there using these NIO buffer methods, but it seems that the explicit byte arrays you're using here could be a good alternative. I think my question boils down to whether or not these NIO buffers will (in the end) get in the way of similar low-level optimizations you'd like to see applied here.

Regards, Paul Elschot

On Sunday 29 March 2009 00:43:28 Earwin Burrfoot wrote:
While drooling over MappedBigByteBuffer, which we'll (hopefully) see in JDK7, I revisited my own Directory code and noticed a certain peculiarity, shared by Lucene core classes: each and every IndexInput implementation only implements readByte() and readBytes(), never trying to override the readInt/VInt/Long/etc methods. Currently RAMDirectory uses a list of byte arrays as a backing store, and I got some speedup when I switched to a custom version that knows each file's size beforehand and thus is able to allocate a single byte array (deliberately accepting a 2GB file size limitation) of exactly the needed length. Nothing strange here; the readByte(s) methods are easily the most often called ones in a Lucene app, and they were greatly simplified - readByte became merely:

public byte readByte() throws IOException {
    // Bounds checking dropped, relying on the natural ArrayIndexOutOfBoundsException;
    // we can't easily catch and recover from it anyway.
    return buffer[position++];
}

But now, readInt is four readByte calls, readLong is two readInts (ten calls in total), readString - god knows how many.
Unless you use a single type of Directory through the lifetime of your application, these readByte calls are never inlined, and the JIT's invokevirtual short-circuit optimization (it skips the method lookup if it always finds the same implementation at a given call site) cannot be applied either. There are three cases where we can override the readNNN methods and provide implementations with zero or a minimum of method invocations - RAMDirectory, MMapDirectory, and BufferedIndexInput for FSDirectory/CompoundFileReader. Anybody tried this?

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
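The override Earwin describes - letting a byte-array-backed input read an int straight out of the array instead of issuing four virtual readByte() calls - can be sketched like this. This is a toy model, not Lucene's IndexInput; class and method names are assumptions, but the big-endian layout matches how Lucene writes ints.

```java
// Toy sketch of overriding multi-byte reads on a byte-array-backed input.
// Not Lucene's IndexInput; names are illustrative.
public class ByteArrayInputSketch {

    static class BaseInput {
        byte[] buffer;
        int position;
        BaseInput(byte[] data) { this.buffer = data; }

        byte readByte() {
            return buffer[position++]; // rely on ArrayIndexOutOfBoundsException
        }

        // Default path: four virtual readByte() calls per int (big-endian).
        int readInt() {
            return ((readByte() & 0xFF) << 24) | ((readByte() & 0xFF) << 16)
                 | ((readByte() & 0xFF) << 8)  |  (readByte() & 0xFF);
        }
    }

    // Override readInt() to touch the array directly: zero method calls.
    static class FastInput extends BaseInput {
        FastInput(byte[] data) { super(data); }
        @Override
        int readInt() {
            int p = position;
            position = p + 4;
            return ((buffer[p] & 0xFF) << 24) | ((buffer[p + 1] & 0xFF) << 16)
                 | ((buffer[p + 2] & 0xFF) << 8) | (buffer[p + 3] & 0xFF);
        }
    }

    static int roundTrip(int value) {
        byte[] data = {
            (byte) (value >>> 24), (byte) (value >>> 16),
            (byte) (value >>> 8), (byte) value
        };
        return new FastInput(data).readInt();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(123456789));  // 123456789
    }
}
```

Whether the JIT can devirtualize the base version anyway depends on how many Directory types are live, which is exactly the monomorphic-call-site point made above.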
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693528#action_12693528 ]

Digy commented on LUCENE-1581:
--
I believe also that Character.toLowerCase in Java works OK, but the problem is: I -> i (in US), I -> ı (in TR). So, I think, I should be able to choose the conversion. DIGY.

LowerCaseFilter should be able to be configured to use a specific locale.
-
Key: LUCENE-1581
URL: https://issues.apache.org/jira/browse/LUCENE-1581
Project: Lucene - Java
Issue Type: Improvement
Reporter: Digy

//Since I am a .Net programmer, sample code will be in C#, but I don't think that it would be a problem to understand it.//

Assume an input text like İ and an analyzer like below:
{code}
public class SomeAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TokenStream t = new SomeTokenizer(reader);
        t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
        t = new LowerCaseFilter(t);
        return t;
    }
}
{code}
ASCIIFoldingFilter will return I and, after it, LowerCaseFilter will return i (if the locale is en-US) or 'ı' (if the locale is tr-TR); that means this token should be input to another instance of ASCIIFoldingFilter. So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach can be adding a new constructor to LowerCaseFilter and forcing it to use a specific locale.
{code}
public sealed class LowerCaseFilter : TokenFilter
{
    /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;

    public LowerCaseFilter(TokenStream input) : base(input)
    {
    }

    /* +++ */ public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
    /* +++ */ {
    /* +++ */     this.CultureInfo = CultureInfo;
    /* +++ */ }

    public override Token Next(Token result)
    {
        result = Input.Next(result);
        if (result != null)
        {
            char[] buffer = result.TermBuffer();
            int length = result.TermLength();
            for (int i = 0; i < length; i++)
    /* +++ */     buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
            return result;
        }
        else
            return null;
    }
}
{code}
DIGY
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693533#action_12693533 ]

Michael McCandless commented on LUCENE-1575:

bq. May I change it to BooleanCollector? (the name conflicts with the Collector name we want to give to all base collectors)
bq. May I change it to private static final? It is used only in BooleanScorer's newCollector() method.

I think these are fine.
Re: Possible IndexInput optimization
On 3/29/09 12:43 AM, Earwin Burrfoot wrote:
> There are three cases when we can override readNNN methods and provide implementations with zero or minimum method invocations - RAMDirectory, MMapDirectory and BufferedIndexInput for FSDirectory/CompoundFileReader. Anybody tried this?

A while ago I tried overriding the read* methods in BufferedIndexInput like this:

public int readVInt() throws IOException {
    if (5 <= (bufferLength - bufferPosition)) {
        return readVIntFast();
    }
    return super.readVInt();
}

private int readVIntFast() throws IOException {
    byte b = buffer[bufferPosition++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buffer[bufferPosition++];
        i |= (b & 0x7F) << shift;
    }
    return i;
}

Notice that I don't rely on ArrayIndexOutOfBoundsException; instead I do one range check in readVInt() and then call the readVIntFast() method, which accesses the buffer array directly to avoid multiple range checks. Surprisingly, I did not see any performance improvement. In my test I wrote a huge file (several GBs) to disk with VInts, making sure they occupied more than just a single byte each. Reading the file with and without this optimization in BufferedIndexInput made almost no difference. Only when I ran it in a profiler did I see a big difference, because with this change there are fewer method calls, hence less invocation-count overhead. I'm still surprised there was no performance improvement at all. Maybe something was wrong with my test and I should try it again...

-Michael
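For reference, the format being decoded in the snippet above is Lucene's VInt: 7 payload bits per byte, with the high bit set on every byte except the last. A self-contained round-trip sketch (the helper names are mine, not Lucene's):

```java
// Self-contained sketch of Lucene-style VInt coding: 7 payload bits per byte,
// high bit set on every byte except the last. Helper names are illustrative.
import java.io.ByteArrayOutputStream;

public class VIntSketch {

    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    // Same loop shape as the readVIntFast() in the mail above.
    static int decodeVInt(byte[] buffer) {
        int pos = 0;
        byte b = buffer[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buffer[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(decodeVInt(encodeVInt(16385)));   // 16385
        System.out.println(encodeVInt(16385).length);        // 3
    }
}
```

Values of 16384 and up occupy three or more bytes, which is why the benchmark above had to make sure its VInts "occupied more than just a single byte each" for the fast path to matter.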
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693540#action_12693540 ]

Shai Erera commented on LUCENE-1581:

From the javadocs (http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#toLowerCase(char)):

_In general, String.toLowerCase() should be used to map characters to lowercase. String case mapping methods have several benefits over Character case mapping methods. String case mapping methods can perform locale-sensitive mappings, context-sensitive mappings, and 1:M character mappings, whereas the Character case mapping methods cannot._

So I agree this is a problem, but I see no easy (and efficient) way to fix it. Suppose that we allow LowerCaseFilter to accept a Locale. What would it do with it? We could add in LowerCaseFilter a Map<Locale, char[]> (each array of size 65536) and allow one to pass in the Locale. If it's not null, and there's an entry in the map, look up every character the filter receives. The lookup will be quite fast, as the character will serve as the index into the array (notice that we cover only 2-byte characters though), and if the entry is \u0000 we can assume there's no special handling and call Character.toLowerCase. That is very fragile though, as it's not easy to cover all the special-case characters. Also, every time (including this one) we find a special character that was not handled properly by the filter, fixing it would break back-compat, no?

BTW, when characters are uppercase, I don't think we have a problem, as they will always be lowercased to a single character (even if it's the wrong one, it will be consistent in indexing and search). The problem comes with the lowercase characters. The character \u0131 (lowercase dotless I in Turkish) is lowercased to \u0131, while its uppercase version (I) is lowercased to 'i'. Therefore there is a mismatch, and we'll fail if the user enters a lowercase query (as they often do).
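The mismatch described above is easy to reproduce with plain JDK calls: per-char Character.toLowerCase is locale-blind, while String.toLowerCase(Locale) applies the Turkish mapping. A minimal demonstration (helper names are mine):

```java
import java.util.Locale;

public class TurkishLowerCase {

    // Per-character lowercasing is locale-blind: 'I' always maps to 'i'.
    static char charLower(char c) {
        return Character.toLowerCase(c);
    }

    // String-level lowercasing honors the locale: in Turkish, 'I' maps to
    // dotless \u0131, and dotted capital \u0130 maps to plain 'i'.
    static String stringLower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = new Locale("tr");
        System.out.println(charLower('I'));                  // i
        System.out.println(stringLower("I", turkish));       // ı (\u0131)
        System.out.println(stringLower("\u0130", turkish));  // i
    }
}
```

So an index built with the char-level filter holds 'i' for Turkish I, while a locale-aware query analyzer would produce \u0131 - exactly the indexing/search mismatch the comment warns about.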
Re: Possible IndexInput optimization
> A while ago I tried overriding the read* methods in BufferedIndexInput like this:
> I'm still surprised there was no performance improvement at all. Maybe something was wrong with my test and I should try it again...

For BufferedIndexInput, the improvement should be noticeable only when the file you're reading is loaded completely into the OS disk cache (which was not your case, I guess). Even then, you're making a syscall for each 1KB chunk, and that could probably dominate a thousand method calls. But for RAMDirectory/MMapDirectory you're not reading disk (if the disk cache kicked in), and you're not making syscalls. I guess I should stop asking around and just try it.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Possible IndexInput optimization
> Earwin, I did not experiment lately, but I'd like to add a general compressed integer array to the basic types in an index, that would be compressed on writing and decompressed on reading. A first attempt is at LUCENE-1410, and one of the choices I had there was whether or not to use NIO buffer methods on the index side. I started there using these NIO buffer methods, but it seems that the explicit byte arrays you're using here could be a good alternative. I think my question boils down to whether or not these NIO buffers will (in the end) get in the way of similar low level optimizations you'd like to see applied here. Regards, Paul Elschot

In my case I have to switch to MMap/Buffers; Java behaves badly with 8GB heaps. I'm thinking of trying to use Short/Int/LongBuffers that wrap my initial ByteBuffer.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693545#action_12693545 ]

DM Smith commented on LUCENE-1581:
--
This is a bit larger of a problem; it also pertains to uppercasing. I don't remember exactly, but I seem to remember that Java is behind with regard to the Unicode spec and Locale support (e.g. it does not include fa, Farsi). I find that ICU4J keeps current with the spec. I don't remember which way it goes, maybe it's both, but some Locales don't have a corresponding upper or lower case for some characters. I'm not sure, but I think efficiency pertains to how the text is normalized in Unicode (e.g. NFC, NFKC, NFD, or NFKD). These might produce different performance results. (It is a different issue, but it is critical that search requests perform the same Unicode normalization as the index. As Lucene does not have these normalization filters, I find I have to do this externally in my own filters using ICU.) (Again a different issue: another related kind of folding is that of base-10 number shaping.)

Regarding: bq. I see no easy way (and efficient) to fix it. Suppose that we allow LowerCaseFilter to accept Locale. What would it do with it?

I think that we need upper- and lower-case filters that operate on the token as a whole, using the string-level methods to do case conversion. What I'd like to see is that Lucene has a pluggable way to handle ICU, insofar as it does Locale-specific things such as this - such as using a base class UpperCaseFolder that provides the Java implementation, but that can take an alternate implementation, perhaps by reflection.
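A token-as-a-whole lowercase step of the kind suggested above could look like the following sketch (plain JDK, no ICU; the method is a stand-in for what a string-level TokenFilter would do, and its name is mine). It also shows why string-level mapping matters: dotted capital İ (\u0130) lowercases to two chars ('i' plus combining dot above) in the root locale - a 1:M mapping a char-by-char filter cannot produce.

```java
import java.util.Locale;

public class WholeTokenLowerCase {

    // Lowercase the whole token via the String-level API, which supports
    // locale-sensitive and 1:M mappings that Character.toLowerCase cannot.
    static String lowerCaseToken(String term, Locale locale) {
        return term.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Root locale: \u0130 expands to 'i' + combining dot above (\u0307).
        String root = lowerCaseToken("\u0130", Locale.ROOT);
        System.out.println(root.length());  // 2

        // Turkish locale: \u0130 maps to a single plain 'i'.
        String tr = lowerCaseToken("\u0130", new Locale("tr"));
        System.out.println(tr);  // i
    }
}
```

Because the output length can differ from the input length, such a filter has to resize the term buffer rather than overwrite it in place, which is the main cost of moving from char-level to string-level conversion.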
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693553#action_12693553 ]

Digy commented on LUCENE-1581:
--
Although it is not directly related to this issue, it is good to remember some existing problems in Lucene: https://issues.apache.org/jira/browse/LUCENENET-51

DIGY
[jira] Updated: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1516:
---
Attachment: ssd2.png

OK, using the last patch, I ran another near real-time test, using this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
merge.policy=org.apache.lucene.index.LogDocMergePolicy
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=10
max.field.length=2147483647
directory=FSDirectory
autocommit=false
compound=false
merge.factor = 10
ram.flush.mb = 128
doc.maker.forever = false
doc.random.id.limit = 3204040
work.dir=/lucene/work

{ BuildIndex - OpenIndex - NearRealtimeReader(1) { UpdateDocs UpdateDoc : 10 : 50/sec - CloseIndex }
RepSumByPrefRound BuildIndex
{code}

It opens a full (3.2M docs, previously built) wikipedia index, then randomly selects a doc and updates it (deletes old, adds new) at the rate of 50 docs/sec. Then, once per second, I open a new reader and do the same search (term 1, sorted by date). I attached another graph (ssd2.png) with the results, showing reopen and search time as a function of how many updates have been done; rough comments:

* Search time is pretty constant, ~35 msec, except occasional glitches where it goes as high as ~340 msec. Net/net very reasonable, I think.
* Search time is remarkably non-noisy, except for occasional spikes.
* Reopen time is also fast (~40 msec) but is more noisy.
* It's not clear the merges are really impacting things that much. It could simply be that I didn't run the test for long enough for a big merge to run. Also, this index has no stored fields nor term vectors, so if we added those, merges would get slower.
* This is a better test than the last one, since it's doing some deletes.
* Since I open the writer with autoCommit false, and near-realtime carries all pending deletes in RAM, no *.del file ever gets written to the index.

Integrate IndexReader with IndexWriter
---
Key: LUCENE-1516
URL: https://issues.apache.org/jira/browse/LUCENE-1516
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1516.patch (multiple revisions), magnetic.png, ssd.png, ssd2.png
Original Estimate: 672h
Remaining Estimate: 672h

The current problem is that an IndexReader and IndexWriter cannot be open at the same time and perform updates, as they both require a write lock on the index. While methods such as IW.deleteDocuments enable deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capability of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again - a process which can be detrimental to realtime updates. This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW, in part by sharing the write lock.
All methods of IR will be usable including reopen and clone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693568#action_12693568 ] Shai Erera commented on LUCENE-1581:

bq. What I'd like to see is that lucene has a pluggable way to handle ICU, in so far as it does Locale specific things such as this. Such as using a base class UpperCaseFolder that provides the Java implementation, but that can take an alternate implementation, perhaps by reflection.

Why do this? What prevents you in your application from creating such a filter? Lucene does not provide too many analyzers, or a single Analyzer for use by all, with configurable options. So why provide in Lucene a filter which uses ICU4J? I'm asking that about core Lucene. Of course such a module can sit in contrib, as do the other analyzers for other languages ...

BTW, I've had some experience with ICU4J and it had several performance issues, such as large consecutive array allocations. It also operates on strings, and does not have the efficient API Lucene has in tokenization (i.e., working on char[]). However, I worked with it a long time ago, and perhaps things have changed since.

LowerCaseFilter should be able to be configured to use a specific locale.
-
Key: LUCENE-1581
URL: https://issues.apache.org/jira/browse/LUCENE-1581
Project: Lucene - Java
Issue Type: Improvement
Reporter: Digy

//Since I am a .Net programmer, sample codes will be in C# but I don't think it would be a problem to understand them.
// Assume an input text like İ and an analyzer like below:

{code}
public class SomeAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TokenStream t = new SomeTokenizer(reader);
        t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
        t = new LowerCaseFilter(t);
        return t;
    }
}
{code}

ASCIIFoldingFilter will return "I", and afterwards LowerCaseFilter will return "i" (if the locale is en-US) or "ı" (if the locale is tr-TR); that means this token should be input to another instance of ASCIIFoldingFilter. So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach would be adding a new constructor to LowerCaseFilter, forcing it to use a specific locale:

{code}
public sealed class LowerCaseFilter : TokenFilter
{
    /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;

    public LowerCaseFilter(TokenStream input) : base(input)
    {
    }

    /* +++ */ public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo cultureInfo) : base(input)
    /* +++ */ {
    /* +++ */     this.CultureInfo = cultureInfo;
    /* +++ */ }

    public override Token Next(Token result)
    {
        result = Input.Next(result);
        if (result != null)
        {
            char[] buffer = result.TermBuffer();
            int length = result.TermLength();
            for (int i = 0; i < length; i++)
                /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
            return result;
        }
        else
            return null;
    }
}
{code}

DIGY
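The locale-sensitive case mapping DIGY describes can be reproduced with the JDK alone; here is a minimal Java sketch (class name is illustrative):

```java
import java.util.Locale;

public class TurkishLowerCaseDemo {
    public static void main(String[] args) {
        // In Turkish, capital 'I' lowercases to dotless 'ı' (U+0131),
        // while dotted capital 'İ' (U+0130) lowercases to plain 'i'.
        Locale turkish = new Locale("tr", "TR");
        System.out.println("I".toLowerCase(Locale.ENGLISH)); // i
        System.out.println("I".toLowerCase(turkish));        // ı
        System.out.println("\u0130".toLowerCase(turkish));   // i
    }
}
```

This is exactly why a LowerCaseFilter that silently uses the JVM's default locale can emit different tokens on differently configured machines.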
Re: Possible IndexInput optimization
On Sunday 29 March 2009 13:47:59 Earwin Burrfoot wrote:

Earwin, I did not experiment lately, but I'd like to add a general compressed integer array to the basic types in an index, that would be compressed on writing and decompressed on reading. A first attempt is at LUCENE-1410, and one of the choices I had there was whether or not to use NIO buffer methods on the index side. I started there using these NIO buffer methods, but it seems that the explicit byte arrays you're using here could be a good alternative. I think my question boils down to whether or not these NIO buffers will (in the end) get in the way of similar low level optimizations you'd like to see applied here. Regards, Paul Elschot

> In my case I have to switch to MMap/Buffers, Java behaves ugly with 8Gb heaps.

Do you mean that because garbage collection does not perform well on these larger heaps, one should avoid creating arrays on heaps of that size, and rather use (direct) MMap/Buffers?

> I'm thinking of trying to use Short/Int/LongBuffers that wrap my initial ByteBuffer.

So far I have used an IntBuffer wrapping a ByteBuffer at LUCENE-1410. In case arrays are better not created for data to be read from the index, I'll keep it that way, hoping that doesn't run into backward compatibility problems. Regards, Paul Elschot
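For reference, the "IntBuffer wrapping a ByteBuffer" approach discussed above looks like this in plain NIO (a minimal sketch, not the LUCENE-1410 code):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class IntBufferViewDemo {
    public static void main(String[] args) {
        // One backing ByteBuffer holds the raw index bytes...
        ByteBuffer bytes = ByteBuffer.allocate(4 * Integer.BYTES);
        // ...and an IntBuffer view reads/writes ints without copying into an int[].
        IntBuffer ints = bytes.asIntBuffer();
        ints.put(new int[] {7, 11, 13, 17});
        System.out.println(ints.get(2)); // 13
        // The same bytes are visible through the original buffer (big-endian by default).
        System.out.println(bytes.getInt(2 * Integer.BYTES)); // 13
    }
}
```

The view shares storage with the backing buffer, so whether this "gets in the way" of explicit byte-array optimizations mostly comes down to whether the JIT can see through the buffer indirection.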
Re: LockObtainFailedException exception
Hi Mike,

Thanks for the response. I did a code check, but this was a random error, which pointed towards something to do with the environment. Finally, I did figure out the problem - low disk space. Though there was around 1 GB of free space on the server, it was not sufficient when we had to merge a large number of indexes. Anyway, we have now done the needful and the problem hasn't recurred!

Cheers, Ketan

--- On Mon, 2/3/09, Michael McCandless luc...@mikemccandless.com wrote: From: Michael McCandless luc...@mikemccandless.com Subject: Re: LockObtainFailedException exception To: java-dev@lucene.apache.org Date: Monday, 2 March, 2009, 10:24 PM

Is it possible you accidentally allow two writers to try to open the index? That would explain this failure; the 2nd writer would fail to acquire the lock, because the first writer has the index open. Or, is it possible you're not closing a previously opened writer? Mike

Ketan Deshpande wrote: Hi, I am fairly new to Lucene, so forgive my elaborate explanation. We were facing frequent issues with Lucene 1.2 (unreleased write.lock files).
To overcome the same, we have recently upgraded to Lucene 2.3.2 - however, we observed the following LockObtainFailedException during our testing:

2009-02-26 15:34:35,525 DEBUG [com.eu.prnewswire.search.document.WDPIndexDocument] Document() called
2009-02-26 15:34:35,529 DEBUG [com.eu.prnewswire.search.document.WDPIndexDocument] adding associated type
2009-02-26 15:34:35,529 DEBUG [com.eu.prnewswire.search.document.WDPIndexDocument] added
2009-02-26 15:34:36,535 ERROR [STDERR] org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/jboss/jboss-4.0.5.GA/spool/lucene/search1/index/PRNJ_2009_02/write.lock
2009-02-26 15:34:36,536 ERROR [STDERR] at org.apache.lucene.store.Lock.obtain(Lock.java:85)
2009-02-26 15:34:36,536 ERROR [STDERR] at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:692)
2009-02-26 15:34:36,536 ERROR [STDERR] at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:503)
2009-02-26 15:34:36,536 ERROR [STDERR] at com.eu.prnewswire.search.index.LuceneIndex.addDocument(LuceneIndex.java:124)
2009-02-26 15:34:36,536 ERROR [STDERR] at com.eu.prnewswire.search.indexer.prnjindexer.PRNJIndexerEJB.addToLuceneIndex(PRNJIndexerEJB.java:193)
2009-02-26 15:34:36,536 ERROR [STDERR] at com.eu.prnewswire.search.indexer.prnjindexer.PRNJIndexerEJB.indexDocument(PRNJIndexerEJB.java:121)
2009-02-26 15:34:36,536 ERROR [STDERR] at sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
2009-02-26 15:34:36,536 ERROR [STDERR] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
2009-02-26 15:34:36,536 ERROR [STDERR] at java.lang.reflect.Method.invoke(Method.java:324)
2009-02-26 15:34:36,536 ERROR [STDERR] at org.jboss.invocation.Invocation.performCall(Invocation.java:359)

From the stack trace, we can trace the exception back to the following code in the IndexWriter class (while trying to acquire a lock):

Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
if (!writeLock.obtain(writeLockTimeout)) // obtain write lock
  throw new LockObtainFailedException("Index locked for write: " + writeLock);

We have seen this issue only once till now, and the files did not index until we deleted the lock file manually. (When I checked for existing issues, LUCENE-715 came closest, but it was resolved in the 2.1 version.) I am afraid this may crop up sometime again. Any inputs on how to resolve the error would be appreciated. If any more details are required, I would be happy to share the same. Thanks, Ketan
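For context, SimpleFSLock-style locking boils down to polling for exclusive creation of the write.lock file until a timeout expires; the following is a simplified sketch of that idea (not Lucene's actual implementation - names and the poll interval are illustrative):

```java
import java.io.File;
import java.io.IOException;

public class SimpleLockSketch {
    // Poll for exclusive creation of the lock file, SimpleFSLock-style.
    // Returns false on timeout, where Lucene would throw LockObtainFailedException.
    static boolean obtain(File lockFile, long timeoutMs)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        // createNewFile atomically creates the file only if it does not exist yet.
        while (!lockFile.createNewFile()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(100); // poll interval
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        File lock = File.createTempFile("write", ".lock");
        // The temp file already exists (a "stale" lock), so obtain times out:
        System.out.println(obtain(lock, 300)); // false
        lock.delete();
        // Once the stale file is removed, the lock is acquired immediately:
        System.out.println(obtain(lock, 300)); // true
        lock.delete();
    }
}
```

This also shows why a leftover lock file after a crash (or low disk space aborting a merge) blocks all later writers until it is removed.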
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693579#action_12693579 ] DM Smith commented on LUCENE-1581:

bq. Why do this?

Lucene has a bias toward English texts and does not have a fundamental architecture focused on internationalization and localization. IMHO, it should. Java does not implement Unicode well and does not keep abreast with its changes. It's not that ICU is the right solution. It is *a* robust solution.

bq. What prevents you in your application from creating such a filter?

Nothing at all. But I think that proper behavior regarding Unicode and locales is something that many want, especially for non-English indexes. As such it belongs with Lucene, not individual projects. With that in mind, I think it would be great if Lucene were fully internationalized and localized, at least from a fundamental architecture perspective. (There is a separate issue on what core and contrib should be. I'm not clear where analyzers fall wrt that.) As an implementation, if ICU is present it is used, with potential performance impacts; if not, behavior degrades predictably and gracefully. This would create a quasi dependency, not a hard one.
Re: LockObtainFailedException exception
Super, thanks for bringing closure!

Mike

On Sun, Mar 29, 2009 at 11:58 AM, Ketan Deshpande ketandes...@yahoo.co.in wrote:
> Hi Mike, Thanks for the response. I did a code check but this was a random error, which indicated towards something to do with the environment. Finally, I did figure out the problem - low disk space. Though there was around 1 GB of free space on the server, it was not sufficient when we had to merge a large number of indexes. Anyway, we have now done the needful and the problem hasn't recurred! Cheers, Ketan
[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693591#action_12693591 ] Robert Muir commented on LUCENE-1581:

Some comments I have on this topic. The problems I have with default internationalization support in Lucene revolve around the following:

1. Breaking text into words (parsing) is not unicode-sensitive; i.e., if I have a word containing s + macron (s̄), it will not tokenize it correctly.
2. Various filters, like lowercase as mentioned here but also accent removal, are not unicode-sensitive; i.e., if I have s + macron (s̄), it will not remove the macron. This is not a normalization problem, but it's true it also doesn't seem to work correctly on decomposed NF(K)D text for similar reasons. In this example, there is no composed form for s + macron available in unicode, so I cannot 'hack' around the problem by running NFC on this text before I feed it to lucene.
3. Unicode text must be normalized so that both queries and text are in a consistent representation.

One option I see is to have at least a basic analyzer that uses ICU to do the following:

1. Break text into words correctly.
2. Provide common filters to do things like lowercase and accent-removal correctly.
3. Use a filter to normalize text to one unicode normal form (say, NFKC by default).

In my opinion, having this available would solve a majority of the current problems. I kinda started trying to implement some of this with LUCENE-1488... (at least it does step 1!)
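Robert's points about s + macron and normalization can be checked with the JDK's built-in normalizer (a minimal sketch; ICU behaves the same way on these inputs):

```java
import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        // s + combining macron (U+0304) has no precomposed form in Unicode,
        // so NFC cannot collapse it into a single code point:
        String sMacron = "s\u0304";
        System.out.println(Normalizer.normalize(sMacron, Normalizer.Form.NFC).length()); // 2
        // e + combining acute, by contrast, composes to 'é' (U+00E9):
        String e = Normalizer.normalize("e\u0301", Normalizer.Form.NFC);
        System.out.println(e.equals("\u00e9")); // true
        // Normalizing both queries and indexed text to one form (e.g. NFKC)
        // keeps the two sides in a consistent representation:
        System.out.println(Normalizer.isNormalized(e, Normalizer.Form.NFKC)); // true
    }
}
```

Since the combining macron survives normalization, any filter that only inspects single code points will miss it, which is exactly the accent-removal failure described above.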
[jira] Updated: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries
[ https://issues.apache.org/jira/browse/LUCENE-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1579: --- Attachment: LUCENE-1579.patch

New patch, even simpler.

Cloned SegmentReaders fail to share FieldCache entries
--
Key: LUCENE-1579
URL: https://issues.apache.org/jira/browse/LUCENE-1579
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1579.patch, LUCENE-1579.patch, LUCENE-1579.patch

I just hit this on LUCENE-1516, which returns cloned readOnly readers from IndexWriter. The problem is, when cloning, we create a new [thin] cloned SegmentReader for each segment. FieldCache keys directly off this object, so if you clone the reader and do a search that requires the FieldCache (eg, sorting), then that first search is always very slow because every single segment is reloading its FieldCache. This is of course a complete showstopper for LUCENE-1516. With LUCENE-831 we'll switch to a new FieldCache API; we should ensure this bug is not present there. We should also fix the bug in the current FieldCache API since, for 2.9, users may hit this.
Re: Possible IndexInput optimization
>> In my case I have to switch to MMap/Buffers, Java behaves ugly with 8Gb heaps.
>
> Do you mean that because garbage collection does not perform well on these larger heaps, one should avoid creating arrays on heaps of that size, and rather use (direct) MMap/Buffers?

Yes, exactly. Keeping big Directories in heap is painful in many ways:

1. Old-gen GC is slow on big heaps. Our 3Gb heaps were collected for 6-8 seconds with the parallel collector on four-way machines. The concurrent collector consistently core dumps, whatever the settings :) Then we tried increasing heaps (up to 8Gb) in pursuit of fewer machines in the cluster, and it just collected for eternity.
2. The Eden-survivor-old chain is showering sparks around when you feed it huge arrays created in numbers. So your New-gen GCs are still swift (100-200ms), but happen too often. As a consequence some short-lived objects start leaking into Old-gen.
3. You have to reserve space for merges. Fully optimizing the index is very taxing; I cheat by stopping accepting outside requests, switching off the memory cache, optimizing, then putting everything back in place.

I'm currently testing the mmap approach, and despite Sun's braindead API, it works like a charm.

While I'm at it, I have two more questions about MMapDirectory. How often is openInput() called for a file? Is it worthwhile to do getChannel().map() when the file is written and closed, and then clone the buffer for each openInput()? Why don't you force() a newly-mapped Buffer? It would save the first few searches hitting a new segment from pagefaults and waiting for that segment to be loaded.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
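The "map once, clone per openInput" idea in the question can be sketched with plain NIO; duplicated buffers share the mapped pages but keep independent positions (illustrative only, not MMapDirectory's actual code):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapShareSketch {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("seg", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.write(new byte[] {10, 20, 30});
            // Map the file once when it is finished being written...
            MappedByteBuffer master =
                raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
            // ...then hand each "openInput" caller a cheap independent view:
            ByteBuffer clone1 = master.duplicate();
            ByteBuffer clone2 = master.duplicate();
            clone1.position(2);
            System.out.println(clone2.position()); // 0 - positions are independent
            System.out.println(clone1.get());      // 30 - contents are shared
        }
        f.delete();
    }
}
```

Since duplicate() does not re-map the file, the per-reader cost is just a small buffer object; the mapping itself is created once.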
[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store
[ https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693643#action_12693643 ] Vaijanath N. Rao commented on LUCENE-1039:

Hi Karl, Can you tell me how to use this with FSDirectory() rather than RAMDirectory()? I am getting the following error when I try to use FSDirectory():

Exception in thread "main" java.lang.NullPointerException
 at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.doc(MultiSegmentReader.java:552)
 at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:94)
 at org.apache.lucene.classifier.BayesianClassifier.weightedFeatureClassProbability(BayesianClassifier.java:139)
 at org.apache.lucene.classifier.NaiveBayesClassifier.featuresClassProbability(NaiveBayesClassifier.java:54)
 at org.apache.lucene.classifier.NaiveBayesClassifier.classify(NaiveBayesClassifier.java:71)
 at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:72)
 at org.apache.lucene.classifier.BayesianClassifier.classify(BayesianClassifier.java:64)

I created the instance index as per the test sample and closed it. The way I create the directory is:

FSDirectory dir = FSDirectory.getDirectory(new File(indexPath));
IndexWriter iw = new IndexWriter(dir, instanceFactory.getAnalyzer(), create, MaxFieldLength.LIMITED);
iw.close();

The code for adding the instance is:

instances.addInstance(record.getText(), record.getClass());

instances.flush() and instances.close() all go fine. While doing classification I again open the directory (with just create set to false) and the rest of the calls remain the same:

Instances instances = new Instances(dir, indexCreator.instanceFactory, "class");
classifier = new NaiveBayesClassifier();
return classifier.classify(instances, text)[0].getClassification();

Can you help me point out where I am going wrong? --Thanks and Regards Vaijanath N.
Rao

Bayesian classifiers using Lucene as data store
---
Key: LUCENE-1039
URL: https://issues.apache.org/jira/browse/LUCENE-1039
Project: Lucene - Java
Issue Type: New Feature
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
Attachments: LUCENE-1039.txt

Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and Fisher method algorithms as described by Toby Segaran in Programming Collective Intelligence, ISBN 978-0-596-52932-1. Have fun. Poor java docs, but the TestCase shows how to use it:

{code:java}
public class TestClassifier extends TestCase {
  public void test() throws Exception {
    InstanceFactory instanceFactory = new InstanceFactory() {
      public Document factory(String text, String _class) {
        Document doc = new Document();
        doc.add(new Field("class", _class, Field.Store.YES, Field.Index.NO_NORMS));
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
        doc.add(new Field("text/ngrams/start", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        doc.add(new Field("text/ngrams/end", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        return doc;
      }

      Analyzer analyzer = new Analyzer() {
        private int minGram = 2;
        private int maxGram = 3;

        public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream ts = new StandardTokenizer(reader);
          ts = new LowerCaseFilter(ts);
          if (fieldName.endsWith("/ngrams/start")) {
            ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
          } else if (fieldName.endsWith("/ngrams/inner")) {
            ts = new NGramTokenFilter(ts, minGram, maxGram);
          } else if (fieldName.endsWith("/ngrams/end")) {
            ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, minGram, maxGram);
          }
          return ts;
        }
      };

      public Analyzer getAnalyzer() {
        return analyzer;
      }
    };

    Directory dir = new RAMDirectory();
    new IndexWriter(dir, null, true).close();
    Instances instances = new Instances(dir, instanceFactory, "class");
    instances.addInstance("hello world", "en");
    instances.addInstance("hallå världen", "sv");
    instances.addInstance("this is london calling", "en");
    instances.addInstance("detta är london som ringer", "sv");
    instances.addInstance("john has a long mustache", "en");