Re: Lucene 2.9 and deprecated IR.open() methods
On Sat, Oct 3, 2009 at 03:29, Uwe Schindler u...@thetaphi.de wrote:
>> It is also probably a good idea to move various settings methods from IW
>> to that builder and have IW immutable with regard to configuration. I'm
>> speaking of the likes of setWriteLockTimeout, setRAMBufferSizeMB,
>> setMergePolicy, setMergeScheduler, setSimilarity.
>>
>>     IndexWriter.Builder iwb = IndexWriter.builder().
>>         writeLockTimeout(0).
>>         RAMBufferSize(config.indexationBufferMB).
>>         maxBufferedDocs(...).
>>         similarity(...).
>>         analyzer(...);
>>     ... = iwb.build(dir1);
>>     ... = iwb.build(dir2);
>>
>> A happy user of google-collections API :-)
>
> These builders are really cool!

I feel myself caught in the act. There are still a couple of things bothering me.

1. Introducing a builder, we'll have a whole heap of deprecated constructors
   that will hang there for eternity. And then users will scream in
   frustration - "This class has 14(!) constructors and all of them are
   deprecated! How on earth am I supposed to create this thing?"
2. If someone creates IW with some reflectish, javabeanish tools - he's
   busted. Not that I'm feeling compassionate for such a person.

> I like Earwin's version more. A builder is very flexible, because you can
> concat all your properties (like StringBuilder works with its append method
> returning itself) and create the instance at the end.

Besides (arguably) cleaner syntax, the lack of which is (arguably) a curse of
many Java libraries, it also allows us to return a different concrete
implementation of IW without breaking back-compat, and also to choose this
concrete implementation based on the settings provided. If we feel like doing
it at some point.
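For readers following along, here is a minimal self-contained sketch of the
fluent-builder idea under discussion (all names are hypothetical, not actual
Lucene API): each setter returns this, and build() freezes the configuration
into an immutable writer, so one builder can produce several writers.

    // Hypothetical sketch of an immutable-after-build writer; not Lucene API.
    public final class Writer {
        private final long writeLockTimeout;
        private final double ramBufferMB;

        private Writer(Builder b) {
            this.writeLockTimeout = b.writeLockTimeout;
            this.ramBufferMB = b.ramBufferMB;
        }

        public static Builder builder() { return new Builder(); }

        public static final class Builder {
            private long writeLockTimeout = 1000;  // defaults live here
            private double ramBufferMB = 16.0;

            public Builder writeLockTimeout(long ms) { writeLockTimeout = ms; return this; }
            public Builder ramBufferMB(double mb) { ramBufferMB = mb; return this; }

            // Settings are frozen at this point; the builder stays reusable,
            // so several writers can be built from one configuration.
            public Writer build() { return new Writer(this); }
        }
    }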
Re: Lucene 2.9 and deprecated IR.open() methods
> Though what about required settings? EG IW's builder must have Directory,
> Analyzer. Would we pass these as up-front args to the initial builder?

I'd try to keep required settings to a minimum. The only one absolutely
required, imho, is a Directory, and it's best to specify it in the create()
method, so you can set all your IW parameters and then build several
instances - for different Directories, for example.

If you decide to add more required settings, we're back to square one - after
a couple of years we're looking at 14 builder() methods. Okay, there is a way.
Take a look at how Guice handles binding declarations in Modules - different
builder methods may return different interfaces implemented by 'this'.

    class IndexWriter {
      public static NoAnalyzerYetBuilder builder() {
        return new HiddenTrueBuilder();
      }

      interface NoAnalyzerYetBuilder {
        NoAnalyzerYetBuilder setRAMBuffer(...);
        NoAnalyzerYetBuilder setUseCompound(...);
        Builder setAnalyzer(Analyzer);
      }

      interface Builder extends NoAnalyzerYetBuilder {
        Builder setRAMBuffer(...);
        Builder setUseCompound(...);
        IndexWriter create(Directory);
      }

      private static class HiddenTrueBuilder implements Builder {
        ...
      }
    }

This approach looks nice from the client side, but is a mess to implement.

> And shouldn't we still specify the version up-front so we can improve
> defaults over time without breaking back-compat? (Else, how can we change
> defaults?) EG:
>
>     IndexWriter.builder(Version.29, dir, analyzer)
>         .setRAMBufferSizeMB(128)
>         .setUseCompoundFile(false)
>         ...
>         .create()
>
> ?

It's probably okay to specify the version upfront. But also, nothing bad
happens if we do it like:

    IndexWriter.builder().
        defaultsFor(Version.29).
        setRam...
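To make the Guice-style trick above concrete, a hypothetical usage sketch
(it assumes the staged-builder interfaces sketched in the message; dir and
analyzer are presumed in scope). The point is that create() only exists on
the interface returned by the required setter, so forgetting the analyzer
becomes a compile error rather than a runtime check:

    // Won't compile: NoAnalyzerYetBuilder has no create() method.
    // IndexWriter w = IndexWriter.builder().setRAMBuffer(64).create(dir);

    // Compiles: setAnalyzer() returns the full Builder, which has create().
    IndexWriter w = IndexWriter.builder()
        .setRAMBuffer(64)
        .setAnalyzer(analyzer)
        .create(dir);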
Re: Lucene 2.9 and deprecated IR.open() methods
> Call me old fashioned, but I like how the non-constructor params are set now.

And what happens when you index some docs, change these params, index more
docs, change params, commit? Let's throw in some threads? You either end up
writing really hairy state-control code, or just leave it broken, with a
"Don't change parameters after you start pumping docs through it!" plea
covering your back somewhere in the JavaDocs.

If nothing else, having stuff final keeps the JIT really happy.

> And for some reason I like a config object over a builder pattern for the
> required constructor params.

The builder pattern allows you to switch concrete implementations as you
please, taking parameters into account or not. Besides that there's no real
difference. I prefer a builder, but that's just me :)

> That's just me though.
>
> Michael McCandless wrote:
>> OK, I agree, using the builder approach looks compelling!
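A hypothetical sketch of the config-object alternative mentioned above
(invented names, not actual Lucene API): the settings object is mutable while
you assemble it, but the writer snapshots the values once in its constructor,
so later mutation cannot race with indexing threads - and the fields stay
final, which is the JIT-friendliness point made above.

    // Hypothetical config object; the writer reads it exactly once.
    final class WriterConfig {
        private double ramBufferMB = 16.0;

        WriterConfig setRAMBufferSizeMB(double mb) { ramBufferMB = mb; return this; }
        double getRAMBufferSizeMB() { return ramBufferMB; }
    }

    final class ConfiguredWriter {
        private final double ramBufferMB; // final: safe under concurrency

        ConfiguredWriter(WriterConfig cfg) {
            this.ramBufferMB = cfg.getRAMBufferSizeMB(); // snapshot at construction
        }
    }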
Re: Optimization and Corruption Issues
> 2.0 is pre Mike's fabulous indexing updates - which just for one means one
> thread doing the merging rather than multiple. I'm sure overall it's much
> slower.

If you're doing a full optimize, you're still using a single thread. Am I
wrong?
Re: Optimization and Corruption Issues
>> If you're doing a full optimize, you're still using a single thread. Am I
>> wrong?
>
> Depends on how many merges are required, and the merge scheduler. In this
> case (w/ 7000 segments, which is way too many, normally!), assuming
> ConcurrentMergeScheduler, multiple threads will be used since many merges
> will be pending. When it gets down to the last (enormous) merge, it's only
> one thread.

I'm speaking about full optimize. Is there any way to do it more efficiently
than running a single, last (enormous) merge? If you try to parallelize,
you're merging some documents several times (more work) and killing your
disks, as merges are mostly IO-bound.
Re: Query Parsing was Fwd: Lab - Esqueranto
We use antlr, though without its tree api - that part is a bit of overkill
for us. It directly builds a query in our intermediate format, which is
traversed for synonym/phrase detection and converted to a lucene query. The
library/language itself is pretty easy to learn, flexible, and has a nice
IDE.

On Fri, Sep 25, 2009 at 19:17, Peter Keegan peterlkee...@gmail.com wrote:
> We're using Antlr for our query parsing. What I like about it:
> - flexibility of separate lexer/parser and tree api
> - excellent IDE for building/testing the grammar
> However, the learning curve was quite long for me, although this was my
> first real encounter with parsers.
> Peter
>
> On Fri, Sep 25, 2009 at 9:58 AM, Grant Ingersoll gsing...@apache.org wrote:
>> Has anyone looked at/used Antlr for Query Parser capabilities? There was
>> some discussion over at Apache Labs that might bear discussing in light
>> of our new Query Parser contrib.
>>
>> Begin forwarded message:
>> From: Tim Williams william...@gmail.com
>> Date: August 17, 2009 8:09:04 PM EDT
>> To: l...@labs.apache.org
>> Subject: Re: Lab - Esqueranto
>> Reply-To: l...@labs.apache.org
>>
>> On Mon, Aug 17, 2009 at 7:00 PM, Grant Ingersoll gsing...@apache.org wrote:
>>> On Aug 2, 2009, at 1:43 PM, Tim Williams wrote:
>>>> Hi Martin, Sure, if it works like I envision it, Lucene would just be
>>>> *one* concrete tree grammar implementation - there could be others (ie
>>>> OracleText). I'm thinking it is broader than one implementation -
>>>> otherwise, I reckon it's Yet Another Lucene Query Parser (YALQP). For
>>>> more practical reasons, I'm not a Lucene committer and it'd be slow
>>>> going to play around with this through JIRA patches to their sandbox.
>>> FWIW, Lucene has recently added a new, more flexible Query Parser that
>>> allows for separation of the various pieces (syntax, intermediate
>>> representation, Lucene Query). You might want to check it out and see
>>> how that fits.
>> Thanks Grant, yeah I've looked at that and it seems really (overly?)
>> complex for what I'm trying to achieve. It seems to re-implement much of
>> the goodness that antlr provides for free. For example, with antlr I
>> already get a lexer/parser grammar separate from the tree grammar. So, to
>> plug in a new parser syntax is trivial - just implement a new
>> lexer/parser grammar that provides tree rewrites consistent with a lucene
>> tree grammar. Conversely, to implement a new concrete implementation,
>> just implement a new tree grammar for the existing lexer/parser grammar.
>> Of course, maybe I'll get down this road and realize how naive my path is
>> and just switch over. For now, just looking at a query parser that, by
>> itself, is approaching the size of the lucene core code base is
>> intimidating :)
>> Thanks for the pointer though, I'm subscribed over there and will keep an
>> eye out for progress on the new parser.
>> Thanks,
>> --tim
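For illustration, a rough sketch of the conversion step described at the top
of this message - an ANTLR-built intermediate tree walked into a Lucene
query. The Node/WordNode/AndNode types are invented placeholders; only the
Lucene query classes are real:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Hypothetical intermediate-format nodes produced by the ANTLR parser.
    interface Node { java.util.List<Node> children(); }
    interface WordNode extends Node { String text(); }
    interface AndNode extends Node {}

    class TreeToLucene {
        Query toLucene(Node node, String field) {
            if (node instanceof WordNode) {
                // leaf: a single term
                return new TermQuery(new Term(field, ((WordNode) node).text()));
            }
            // inner node: AND maps to MUST, anything else to SHOULD
            BooleanQuery bq = new BooleanQuery();
            BooleanClause.Occur occur = node instanceof AndNode
                ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
            for (Node child : node.children()) {
                bq.add(toLucene(child, field), occur);
            }
            return bq;
        }
    }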
Re: How to leverage the LogMergePolicy calibrateSizeByDeletes patch in Solr ?
On Tue, Sep 22, 2009 at 19:08, Yonik Seeley yo...@lucidimagination.com wrote:
> On Tue, Sep 22, 2009 at 10:48 AM, Michael McCandless
> luc...@mikemccandless.com wrote:
>> John, are you using IndexWriter.setMergedSegmentWarmer, so that a newly
>> merged segment is warmed before it's put into production (returned by
>> getReader)?
> I'm still not sure I see the reason for complicating the IndexWriter with
> warming... can't this be done just as efficiently (if not more
> efficiently) in user/application space?

+1
Re: who clears attributes?
On Tue, Aug 11, 2009 at 15:09, Yonik Seeley yo...@lucidimagination.com wrote:
> On Tue, Aug 11, 2009 at 6:50 AM, Robert Muir rcm...@gmail.com wrote:
>> On Tue, Aug 11, 2009 at 4:28 AM, Michael Busch busch...@gmail.com wrote:
>>> There was a performance test in Solr that apparently ran much slower
>>> after upgrading to the new Lucene jar. This test is testing a rather
>>> uncommon scenario: very very short documents.
>> Actually, it's more uncommon than that: it's very very short documents
>> without implementing reusableTokenStream(). This makes it basically a
>> benchmark of ctor cost... doesn't really benchmark the token api in my
>> opinion.
> You would be surprised... there are quite a few Solr users that have
> relatively short documents... or even if they are sizeable documents, they
> have up to hundreds of short metadata-type fields (generally a token or
> two). Reusing TokenStreams has become a must in Solr IMO since
> construction costs (hashmap lookups, etc) and GC costs (larger objects)
> have been growing. I'm focused on that now... Robert's taking a crack at
> fixing things up so users can actually create reusable analyzers out of
> our filters: https://issues.apache.org/jira/browse/LUCENE-1794

+1. We don't use Solr, but have quite a bunch of medium and short-sized
documents. Plus heaps of metadata fields.

I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by some of
you. My gripe with the new API is not that it brings us troubles (those are
solved one way or another); it is that the switch and associated migration
costs bring zero benefits in the immediate and remote future. The only
person who has tried to disprove this claim is Uwe. Others either say "the
problems are solved, so it's okay to move to the new API", or "this will be
usable when flexindexing arrives". Sorry, the last phrase doesn't hold its
place - this API is orthogonal to flexindexing, or at least nobody has shown
the opposite.

So, what I'm arguing against is adding some code (and forcing users to
migrate) just because we can, with no other reasons.
Re: who clears attributes?
>> The only person that tried to disprove this claim is Uwe. Others either
>> say the problems are solved, so it's okay to move to the new API, or this
>> will be usable when flexindexing arrives.
> Others (not me) have spent a lot of time going over this before (more than
> once I think) - they prob are just sick of retyping. Lots of searchable
> archives out there though.

Okay, I'll dig into them. Sorry for being a bother.
[jira] Commented: (LUCENE-1799) Unicode compression
[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741868#action_12741868 ]

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

I think right now this can be implemented as a delegating Directory.

> Unicode compression
> -------------------
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Store
> Affects Versions: 2.4.1
> Reporter: DM Smith
> Priority: Minor
>
> In lucene-1793, there is the off-topic suggestion to provide compression
> of Unicode data. The motivation was a custom encoding in a Russian
> analyzer. The original supposition was that it provided a more compact
> index. This led to the comment that a different or compressed encoding
> would be a generally useful feature. BOCU-1 was suggested as a
> possibility. This is a patented algorithm by IBM with an implementation in
> ICU. If Lucene provides its own implementation, a freely available,
> royalty-free license would need to be obtained. SCSU is another Unicode
> compression algorithm that could be used. An advantage of these methods is
> that they work on the whole of Unicode. If that is not needed, an encoding
> such as iso8859-1 (or whatever covers the input) could be used.
Re: indexing_slowdown_with_latest_lucene_udpate
Or, we can just throw that detection out of the window, for a less smooth
back-compat experience, less hacky code and no slowdown.

On Mon, Aug 10, 2009 at 19:02, Uwe Schindler u...@thetaphi.de wrote:
> The question is if that would get better if the reflection calls are only
> done one time per class, using an IdentityHashMap<Class, Boolean>. The
> other reflection code in AttributeSource uses a static cache for such type
> of things (e.g. the Attribute -> AttributeImpl mappings in
> AttributeSource.DefaultAttributeFactory.getClassForInterface()). I could
> do some tests about that and supply a patch. I was thinking about that but
> threw it away (as it needs some synchronization on the cache Map, which
> may also overweigh).
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, August 10, 2009 4:48 PM
> To: java-dev@lucene.apache.org
> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
>
> Robert Muir wrote:
>> This is real and not just for very short docs.
> Yes, you still pay the cost for longer docs, but it just becomes less
> important the longer the docs, as it plays a smaller role. Load a ton of
> one-term docs, and it might be 50-60% slower - add a bunch of articles,
> and it might be closer to 20%-15% (I don't know the numbers, but the
> longer I made the docs, the less % slowdown, obviously). Still a good hit,
> but a short-doc test magnifies the problem. It affects things no matter
> what, but when you don't do much tokenizing/normalizing, the cost of the
> reflection/tokenstream init dominates.
> - Mark
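A sketch of the per-class cache Uwe describes (hypothetical names; the real
LUCENE-1796 code differs): the reflection check runs once per class, and the
synchronized map is exactly the trade-off he mentions.

    import java.util.Collections;
    import java.util.IdentityHashMap;
    import java.util.Map;

    final class OverrideCache {
        // One entry per concrete class; identity semantics are enough (and
        // fast) because Class objects are canonical within a classloader.
        private static final Map<Class<?>, Boolean> CACHE =
            Collections.synchronizedMap(new IdentityHashMap<Class<?>, Boolean>());

        // Does clazz override the given no-arg method declared on base?
        static boolean overrides(Class<?> clazz, Class<?> base, String method) {
            Boolean cached = CACHE.get(clazz);
            if (cached != null) return cached.booleanValue();
            boolean result;
            try {
                // if the declaring class isn't the base class, it's overridden
                result = clazz.getMethod(method).getDeclaringClass() != base;
            } catch (NoSuchMethodException e) {
                result = false;
            }
            CACHE.put(clazz, Boolean.valueOf(result));
            return result;
        }
    }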
[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741372#action_12741372 ]

Earwin Burrfoot commented on LUCENE-1793:
-----------------------------------------

bq. I am guessing the rationale for the current code is to try to reduce
index size? (since these languages are double-byte encoded in Unicode).

The rationale was most probably to support existing non-unicode
systems/databases/files, whatever. My say is - anyone still holding onto
koi8, cp1251 and friends should silently do harakiri.

> remove custom encoding support in Greek/Russian Analyzers
> ----------------------------------------------------------
>
> Key: LUCENE-1793
> URL: https://issues.apache.org/jira/browse/LUCENE-1793
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Robert Muir
> Priority: Minor
> Attachments: LUCENE-1793.patch
>
> The Greek and Russian analyzers support custom encodings such as KOI-8;
> they define things like lowercasing and tokenization for these. I think
> that analyzers should support unicode and that conversion/handling of
> other charsets belongs somewhere else. I would like to deprecate/remove
> the support for these other encodings.
Re: who clears attributes?
I'll deviate from the topic somewhat. What are the exact benefits that the
new tokenstream API yields? Are we sure we want it released with 2.9? By now
I only see various elaborate problems, but haven't seen a single piece of
code becoming simpler.

On Mon, Aug 10, 2009 at 21:50, Uwe Schindler u...@thetaphi.de wrote:
> Yes. Is there a way to enforce this for all Tokenizers automatically? As
> incrementToken() will be abstract in 3.0, there cannot be a default impl.
> So all Tokenizers should call clearAttributes() as the first call in
> incrementToken().
>
> Then we still have the problem of the slow iterator creation (which was
> sped up a little bit by removing the unmodifiable wrapper). This can be
> solved by using an additional ArrayList in AttributeSource that gets all
> AttributeImpl instances, but this would bring an additional initialization
> cost on creating the Tokenizer chain.
>
> -----Original Message-----
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, August 10, 2009 7:42 PM
> To: java-dev@lucene.apache.org
> Subject: Re: who clears attributes?
>
> Thinking through this a little more, I don't see an alternative to the
> tokenizer clearing all attributes at the start of incrementToken().
> Consider a DefaultPayloadTokenFilter that only sets a payload if one isn't
> already set - it's clear that this filter can't clear the payload
> attribute, so it must be cleared by the head of the chain - the tokenizer.
> Right?
> -Yonik
> http://www.lucidimagination.com
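A sketch of the contract being discussed, against the 2.9-style attribute
API (the tokenizer itself is invented for illustration): clearAttributes()
is the very first call in incrementToken(), so stale values - e.g. a payload
a filter set on the previous token - cannot leak into the next one.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class WholeInputTokenizer extends Tokenizer {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);
        private boolean done = false;

        public WholeInputTokenizer(Reader input) { super(input); }

        public boolean incrementToken() throws IOException {
            clearAttributes();           // contract: reset ALL attributes first
            if (done) return false;
            char[] buf = new char[256];
            int len = input.read(buf);   // emit the whole input as one token
            if (len <= 0) return false;
            termAtt.setTermBuffer(buf, 0, len);
            done = true;
            return true;
        }

        public void reset(Reader input) throws IOException {
            super.reset(input);          // makes the tokenizer reusable
            done = false;
        }
    }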
Re: who clears attributes?
On Mon, Aug 10, 2009 at 22:50, Grant Ingersoll gsing...@apache.org wrote:
> On Aug 10, 2009, at 2:00 PM, Earwin Burrfoot wrote:
>> I'll deviate from the topic somewhat. What are the exact benefits that
>> the new tokenstream API yields? Are we sure we want it released with 2.9?
>> By now I only see various elaborate problems, but haven't seen a single
>> piece of code becoming simpler.
> In theory, it sets up for more indexing/searching possibilities in 3.0,
> but in the meantime, it is proving to be quite problematic due to back
> compatibility restrictions.

I'm not quite sure which exact indexing/searching possibilities the new API
opens for us.

Some new ways of handling text? Okay, I'd like each token to have one more
number in addition to posIncr, so I can have my 'true multiword synonyms'.
Maybe, just maybe, there will be a pair of other extensions. Usecases here
are really scarce. Plus, if they're successful/useful, they will most
probably be included out of the box, so we don't need much flexibility here.

Something other than text? Numbers, with good rangequeries. Dates. Spatial
data. Your-type-here. For these, a flexible text-processing stream-oriented
API is totally useless.

> I have serious doubts about releasing this new API until these performance
> issues are resolved and better proven out from a usability standpoint. It
> simply is too much to swallow for most users, as
> Analyzers/TokenStreams/etc. are easily the most common place for people to
> inject their own capabilities, and there is no way we should be taking a
> 30% hit in performance for some theoretical speedup and new search
> capability 1 year from now.

I have a feeling that the best idea, before more damage is done, is to roll
back this new API, store the patch, and try rolling it out once again when
we have usecases/more code to justify it.
Re: 2.5 versus 2.9, was Re: who clears attributes?
On Tue, Aug 11, 2009 at 00:37, Michael Busch busch...@gmail.com wrote:
> On 8/10/09 1:30 PM, Grant Ingersoll wrote:
> I think your 2.5 proposal has drawbacks: if we release 2.5 now to test the
> new major features in the field, then do you want to stop adding new
> features to trunk until we release 2.9, to not have the same situation
> then again? How long should this testing in the field take? I don't know.
> How long does any release cycle last in Lucene? But we'll always have the
> same problem, no? We need to find a solution that allows us to keep adding
> features; dedicated deprecation releases are not good.

Parallel branches. The only way of simultaneously satisfying several
conflicting needs in software development.
Re: who clears attributes?
On Tue, Aug 11, 2009 at 00:54, Uwe Schindler u...@thetaphi.de wrote:
>> I have serious doubts about releasing this new API until these
>> performance issues are resolved and better proven out from a usability
>> standpoint.
> I think LUCENE-1796 has fixed the performance problems, which were caused
> by a missing reflection-cache needed for bw compatibility. I hope to
> commit soon! 2.9 may be a little bit slower when you mix old and new API
> and do not reuse Tokenizers (but Robert is already adding
> reusableTokenStream to all contrib analyzers). When the backwards layer is
> removed completely or setOnlyUseNewAPI is enabled, there is no speed
> impact at all.
>> The Analysis features of Lucene are the single most common place where
>> people enhance Lucene. Very few add queries, or muck with field caches,
>> but they do write their own Analyzers and TokenStreams, etc. Within that,
>> mixing old and new is likely the most common case for everyone who has
>> made their own customizations, so "a little bit slower" is something I'd
>> rather not live with just for the sake of some supposed goodness in a
>> year or two.
> But because of this flexibility, we added the backwards layer. The old
> style with setUseNewAPI was not flexible at all, and nobody would move his
> Tokenizers to the new API without that flexibility (maybe he uses external
> analyzer packages not yet updated). With "a little bit" I mean the cost of
> wrapping the old and new API is really minimal; it is just an if statement
> and a method call, hopefully optimized away by the JVM. In my tests the
> standard deviation between different test runs was much higher than the
> difference between mixing old/new API (on Win32), so it is not really
> certain that the cost comes from the delegation. The only case that is
> really slower (now a minimized cost of creation in TokenStream.init) is if
> you do not reuse TokenStreams: two LinkedHashMaps have to be created and
> set up. But this is not caused by the backwards layer.
> Uwe

Uwe, the problems I raised are still here - what is the benefit of moving to
this API right now? I see none. What is the future benefit of moving to this
API? It is very vague.

Someone said this API is generic, but there are different kinds of
genericity. Are we sure we abstracted the right thing? How will it be used?
Where are the examples? Right now it is an exercise in programming, which
forces us to do new and new exercises. Very exciting, very rewarding, but as
of now - pointless.
Re: pieces missing in reusable analyzers?
> I had thought that implementing reusable analyzers in solr was going to be
> cake... but either I'm missing something, or Lucene is missing something.
> Here's the way that one used to create custom analyzers:
>
>     class CustomAnalyzer extends Analyzer {
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new LowerCaseFilter(new NGramTokenFilter(new StandardTokenizer(reader)));
>       }
>     }
>
> Now let's try to make this reusable:
>
>     class CustomAnalyzer2 extends Analyzer {
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new LowerCaseFilter(new NGramTokenFilter(new StandardTokenizer(reader)));
>       }
>
>       @Override
>       public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
>         TokenStream ts = getPreviousTokenStream();
>         if (ts == null) {
>           ts = tokenStream(fieldName, reader);
>           setPreviousTokenStream(ts);
>           return ts;
>         } else {
>           // uh... how do I reset a token stream?
>           return ts;
>         }
>       }
>     }
>
> See the missing piece? Seems like TokenStream needs a reset(Reader r)
> method or something?

I'm just keeping a reference to the Tokenizer, so I can reset it with a new
reader. Though this situation is awkward, TS definitely does not need a
reset(Reader).
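A sketch of the workaround described in the reply (the SavedStreams idiom;
NGramTokenFilter is left out here since, as noted below, it cannot reset its
state): keep a reference to the Tokenizer at the head of the chain, re-point
it at the new Reader, and reset() the chain so filters can clear their state.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    class CustomAnalyzer3 extends Analyzer {
        private static final class SavedStreams {
            StandardTokenizer source;   // head of the chain, kept for reset
            TokenStream result;         // tail of the chain, handed to callers
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new StandardTokenizer(reader));
        }

        public TokenStream reusableTokenStream(String fieldName, Reader reader)
                throws IOException {
            SavedStreams streams = (SavedStreams) getPreviousTokenStream();
            if (streams == null) {
                streams = new SavedStreams();
                streams.source = new StandardTokenizer(reader);
                streams.result = new LowerCaseFilter(streams.source);
                setPreviousTokenStream(streams);
            } else {
                streams.source.reset(reader); // point the tokenizer at new input
                streams.result.reset();       // let filters clear their state
            }
            return streams.result;
        }
    }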
Re: who clears attributes?
> Well, I have real use cases for it, but all of it is still missing the
> biggest piece: search side support. It's the 900 lb. elephant in the room.
> The 500 lb. elephant is the fact that all these attributes, AIUI, require
> you to hook in your own indexing chain, etc. in order to even be indexed,
> which is all package-private stuff. It's not even clear to me what happens
> right now if you were to, say, have a TokenStream that had only one
> Attribute on it and none of the existing attributes (term buffer, length,
> position, etc.). Please correct me if I am wrong, I still don't have a
> deep understanding of it all. Even pseudocode would be good.

A custom indexing chain for abstract attributes sounds like one of
microsoft.com definitions - serious, determined, but vague. If you take the
current Token and start throwing away some of its fields, the resulting
index contents are obvious for some combinations and absurd for others. You
don't need this new API to handle the obvious ones.

> Oh, and now it seems the new QP is dependent on it all.

That's why I said earlier "before more damage is done".

> Michael has always been up front that this new API is in preparation for
> flexible indexing. It doesn't give us the goodness - he has laid out the
> reasons for moving before the goodness comes more than once, I think.

My problem is not waiting for 'goodness'. It is that I don't currently see
what goodness will come from this API even in the remote future. That's why
I am asking! :)

> Flexible indexing will lead to all kinds of little cool things - the likes
> of which have been discussed a lot in older emails. It will likely lead to
> things we cannot predict as well. Everything will be more flexible. It
> also could play a part in CSF, and work on allowing custom files to plug
> into merging. Plus everything else that's been mentioned (pfor, etc). I've
> been sold on the long term benefits. I don't think you need this API for
> them, but it's my understanding it helps solve part of the equation.

Yeah. I, too, would like to see all these little cool things, and I don't
think we need this API for them. Flexible indexing is going to handle
various different datatypes besides text, so I can only reiterate - it
cannot rely on a generic stream-based text-handling API for consuming data.

> A bunch of issues have come up. To my knowledge, they have been addressed
> with vigor every time. If someone is unhappy with how something has been
> addressed, and it needs to be addressed further, please speak up.
> Otherwise, I don't think the sky is falling - I think the new API is being
> shaken out.

An API is born dead without usecases. If a year later we get closer to the
flexindexing it is supposed to support, and then we understand we missed
some crucial thing - WHAM! - our back-compat policy kicks in and makes our
lives miserable once more.
Re: pieces missing in reusable analyzers?
>>> I'm just keeping a reference to the Tokenizer, so I can reset it with a
>>> new reader. Though this situation is awkward, TS definitely does not
>>> need a reset(Reader).
>> Then how do you notify the other filters that they should reset their
>> state? TokenStream.reset()? The javadoc specifies that it's actually used
>> for something else - but perhaps it can be reused for this purpose?

Yes, exactly. The TokenFilter override of reset() chains the call to the
input stream.

> I specifically used NGramTokenFilter in my example because it did use
> internal state (and it's a bug that it has no way to reset that state
> currently).

My filters are all my own, so they reset and chain properly.
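A sketch of the chaining just described (the filter itself is invented):
TokenFilter.reset() delegates to input.reset(), so a filter that overrides
it and calls super.reset() both clears its own state and propagates the
reset down the chain toward the tokenizer.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public abstract class StatefulFilter extends TokenFilter {
        private int pos; // hypothetical per-stream internal state

        protected StatefulFilter(TokenStream input) {
            super(input);
        }

        public void reset() throws IOException {
            super.reset(); // chains to input.reset()
            pos = 0;       // clear our own state for the next reader
        }
    }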
Re: ConcurrentMergeScheduler and MergePolicy question
On Sun, Aug 9, 2009 at 08:38, Jason Rutherglen jason.rutherg...@gmail.com wrote:
>> You don't have to copy. You can have one machine optimize your indexes
>> whilst the other serves user requests, then they switch roles, rinse,
>> repeat. This approach also works with sharding, and more than 2-way
>> mirroring.
> What does the un-optimized server do after the other server is optimized?
> The search requests go to the newly optimized server, however if we're
> mirroring, the 2nd server now needs the optimized index as well?

The second server now stops servicing requests and starts optimizing. You
can also keep them running together for some time, depending on how serious
you are about always running on an optimized index.
Re: ConcurrentMergeScheduler and MergePolicy question
> Perhaps the ideal search-system architecture that requires optimizing is
> to dedicate a server to it: copy the index to the optimize server, do the
> optimize, copy the index off (to a search server) and start again for the
> next optimize task. I wonder how/if this would work with Hadoop/HDFS, as
> copying 100GB around would presumably tie up the network? Also, I've found
> rsyncing large optimized indexes to be time consuming, and it wreaks havoc
> on the search server's IO subsystem. Usually this is unacceptable for the
> user, as the queries will suddenly degrade.

You don't have to copy. You can have one machine optimize your indexes
whilst the other serves user requests; then they switch roles, rinse,
repeat. This approach also works with sharding, and with more than 2-way
mirroring.
Re: Attributes, DocConsumer, Flexible Indexing, etc.
I always thought flexible indexing is not only for storing your app-specific
data next to terms/docs. It's something more along the lines of efficient
geo search, or the ability to try out various index encoding schemes without
patching lucene. In other words, it is something that can be a basis for an
easy/pluggable implementation of payload-type functionality, not vice-versa.

On Thu, Aug 6, 2009 at 01:55, Grant Ingersoll gsing...@apache.org wrote:
> On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:
>> On 8/5/09 1:07 PM, Grant Ingersoll wrote:
>>> Hmmm, OK. Random, somewhat uneducated thought: Why not just define the
>>> codecs to create byte arrays? Then we can use the existing payload
>>> capability, much like I do with the DelimitedPayloadTokenFilter. We'd
>>> probably have to make sure this still worked with Similarity, but it
>>> seems like it could.
>
> Thinking on this some more, seems like this could work already with an
> AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter
> (I know, horrible name). Then, on the Query side, the AttributeTermQuery
> is just a glorified BoostingTermQuery with some callback hooks for dealing
> with the Attribute (but maybe that isn't even needed); either that, or we
> just provide helper methods on the Similarity class so that people can
> easily decode the byte array into an Attribute. In fact, maybe all that
> needs to happen is the Attributes need to define encode/decode methods
> that (de)serialize a byte array. Seems like this approach would require
> very little in the way of changes to Lucene, but I admit it isn't fully
> baked in my mind just yet. It also has the nice benefit that all the work
> we did on Payloads isn't wasted. This is resonating more and more with me.
> What do you think?
>
>> Well, I think this would be a nice way of using the payloads better.
>> However, the idea behind flexible indexing is that you can customize the
>> on-disk encoding in a way that is as efficient as it can be for your
>> particular use case. E.g. for payloads we currently have to encode the
>> length. An application might not have to do that if it knows exactly what
>> is stored. Then there's only the Payload API that returns you a byte
>> array. It basically copies the contents of the IndexInput (usually a
>> BufferedIndexInput, which means an array copy from the byte buffer to the
>> payload byte array). If the application knows exactly what is stored, it
>> can read/decode it more efficiently.
>
> Yeah, but really, are you saving that much? 4 bytes per token? It's not
> like you are saving much in terms of seeks, since you are already there
> anyway. The only downside I see is a slightly larger index. Would be
> interesting to try it out and see.
>
>> The latter inefficiency we could solve by improving the payloads API: it
>> could return an IndexInput instead of the byte array, and the caller
>> could consume it more efficiently.
>
> This is also interesting, but again requires some changes. With what I'm
> proposing, I think it could be done very simply w/o any API changes; we
> just need to expose some of the IndexInput/Output helper classes a bit
> more to make it easier for people to encode/decode their stuff. Then, just
> documentation and some more Boosting*Query classes (Peter has already done
> BoostingNearQuery), and I think you have a pretty good flexible indexing
> AND searching capability, all in a back-compatible way using our existing
> code.
>
>> So I agree that we could use Attributes to make the payloads feature
>> better usable, but I don't think it will be a replacement for flexible
>> indexing.
>> Michael
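A sketch of the "AttributeToPayloadTokenFilter" idea floated above, against
the 2.9-era attribute API (the encoding here - a single byte holding the
term length - is a made-up example, not a proposed format):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.index.Payload;

    public final class AttributeToPayloadTokenFilter extends TokenFilter {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

        public AttributeToPayloadTokenFilter(TokenStream input) {
            super(input);
        }

        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) return false;
            // serialize some per-token attribute state into a byte[];
            // here we just store the term length as a one-byte payload
            byte[] encoded = new byte[] { (byte) termAtt.termLength() };
            payloadAtt.setPayload(new Payload(encoded));
            return true;
        }
    }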
Re: IndexWriter.getReader usage
> The biggest win for NRT was switching to per-segment Collector, because
> that meant we could re-use FieldCache entries for all segments that hadn't
> changed.

In my opinion, this switch was enough to get as NRT-ey as you want. Fusing
IR/IW together makes Lucene a great deal more complicated and just a
milli-tad closer to RT.

> I'm curious as to how it obviates the need for a RAM dir? In my use case I
> use them to create indexes and perform searches. In the latter it avoids
> OS file indexing and virus scanner contention (40 min reduced to less than
> 2 min).

Isn't indexing your indexes (omg), checking them for viruses and striving
for performance ..err.. a little bit self-contradictory?
Re: Java caching of low-level index data?
> I'm curious if anyone has thought about (or even tried) caching the
> low-level index data in Java, rather than in the OS. For example, at the
> IndexInput level there could be an LRU cache of byte[] blocks, similar to
> how an RDBMS caches index pages. (Conveniently, BufferedIndexInput already
> reads in 1k chunks.) You would reverse the advice above and instead make
> your JVM heap as large as possible (or at least large enough to achieve a
> desired speed/space tradeoff).

I did something along these lines. It sucks. Having big Java heaps lands you
with insane GC times. Loading GB-sized files into a bunch of byte[1024]
blocks also wastes memory. The best bet by now is to rely on the mmap/file
cache.

> I think swappiness is exactly the configuration that tells Linux just how
> happily it should swap out application memory for IO cache vs other IO
> cache for new IO cache.

swappiness is roughly the percentage of free memory after which the OS
starts searching for pages suitable for paging out. If set to low values,
the OS wakes up only in near-OOM conditions. If set to high values, as soon
as the OS decides (according to some heuristics) that a page is eligible for
page-out, it goes to disk.
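For concreteness, the kind of cache being proposed is a few lines with
LinkedHashMap in access order - a sketch that ignores concurrency, which,
along with the GC pressure described above, is exactly where the pain comes
from:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LRU cache of fixed-size byte[] blocks, keyed by a (file, block) id.
    class BlockCache extends LinkedHashMap<Long, byte[]> {
        private final int maxBlocks;

        BlockCache(int maxBlocks) {
            super(16, 0.75f, true);   // accessOrder=true gives LRU iteration
            this.maxBlocks = maxBlocks;
        }

        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
            return size() > maxBlocks; // evict the least-recently-used block
        }
    }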
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732938#action_12732938 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. We should drop PayloadSpans and just add getPayload to Spans. This
should be a compile time break.

+1

> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> -------------------------------------------------------------------------------
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
> Issue Type: Bug
> Components: Query/Scoring
> Affects Versions: 2.4, 2.4.1
> Environment: all
> Reporter: Hugh Cayless
> Fix For: 2.9, 3.0, 3.1
>
> I just spent a long time tracking down a bug resulting from upgrading to
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and
> was written against 2.3. Since the project's SpanQuerys didn't implement
> getPayloadSpans, the call to that method went to
> SpanQuery.getPayloadSpans, which returned null and caused a
> NullPointerException in the Lucene code, far away from the actual source
> of the problem. It would be much better for this kind of thing to show up
> at compile time, I think. Thanks!
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731939#action_12731939 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. Shouldn't it throw a runtime exception (unsupported operation?) or
something?

What is the difference between adding an abstract method and adding a method
that throws an exception, in regards to jar drop-in back compat? In both
cases, when you drop your new jar in you get an exception - except in the
latter case the exception is deferred.
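To spell the comparison out (hypothetical classes, not the actual SpanQuery
code): with a drop-in jar upgrade, both variants fail at runtime on
subclasses compiled against the old jar - only the error type and its timing
differ.

    public abstract class BaseQuery {
        // Variant 1: new abstract method. Old compiled subclasses throw
        // AbstractMethodError the moment it is invoked.
        public abstract Object getSpans();

        // Variant 2: concrete method that throws. Old subclasses inherit
        // it, so the failure is deferred until someone actually calls it.
        public Object getPayloadSpans() {
            throw new UnsupportedOperationException("not overridden by subclass");
        }
    }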
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy.
Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally as broken as the current
solution, but yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like

    ClassA method();

you can then upgrade it to

    SubclassOfClassA method();

without breaking drop-in compatibility, which renders the getPayloadSpans vs
getSpans alternative totally useless.
[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot edited comment on LUCENE-1748 at 7/16/09 7:54 AM:
------------------------------------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy.
Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally as broken as the current
solution, but yields much less strange copy-paste.

-I also have a faint feeling that if you expose a method like-
-ClassA method();-
-you can then upgrade it to-
-SubclassOfClassA method();-
-without breaking drop-in compatibility, which renders the getPayloadSpans
vs getSpans alternative totally useless-

Ok, I'm wrong.
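Why the struck-out idea above fails (a JVM fact, not Lucene-specific):
covariant overriding compiles fine, but changing the declared return type of
an already-published method changes its binary descriptor. A minimal
illustration with hypothetical classes:

    class ClassA {}
    class SubclassOfClassA extends ClassA {}

    // Library v1 ships:
    class Api { ClassA method() { return new ClassA(); } }

    // Library v2 narrows the return type. This is source-compatible and
    // compiles cleanly - but the method's binary descriptor changes from
    // "()LClassA;" to "()LSubclassOfClassA;", so callers compiled against
    // v1 fail with NoSuchMethodError until they are recompiled:
    // class Api { SubclassOfClassA method() { return new SubclassOfClassA(); } }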
[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731632#action_12731632 ]

Earwin Burrfoot commented on LUCENE-1743:
-----------------------------------------

The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive
than reading or writing a few tens of kilobytes of data via the usual read
and write methods. From the standpoint of performance it is generally only
worth mapping relatively large files into memory.

It is probably right if you're doing a single read through the file. If
you're opening/mapping it and doing thousands of repeated reads, mmap is
superior, because after the initial mapping each read is just a memory
access vs a system call for file.read().

> MMapDirectory should only mmap large files, small files should be opened
> using SimpleFS/NIOFS
> -------------------------------------------------------------------------
>
> Key: LUCENE-1743
> URL: https://issues.apache.org/jira/browse/LUCENE-1743
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.9
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> This is a followup to LUCENE-1741. Javadocs state (in FileChannel#map):
> "For most operating systems, mapping a file into memory is more expensive
> than reading or writing a few tens of kilobytes of data via the usual read
> and write methods. From the standpoint of performance it is generally only
> worth mapping relatively large files into memory."
> MMapDirectory should get a user-configurable size parameter that is a
> lower limit for mmapping files. All files with a size < limit should be
> opened using a conventional IndexInput from SimpleFS or NIO (another
> configuration option for the fallback?).
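A minimal JDK-only illustration of the two access paths being compared (this
is plain NIO, not Lucene's MMapDirectory code):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapVsRead {
        public static void main(String[] args) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(args[0], "r");
            FileChannel channel = raf.getChannel();

            // mmap path: one up-front syscall to establish the mapping...
            MappedByteBuffer map =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte viaMap = map.get(0); // ...then reads are plain memory accesses

            // read path: every call pays a syscall
            raf.seek(0);
            int viaRead = raf.read();

            raf.close();
        }
    }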
[jira] Issue Comment Edited: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731632#action_12731632 ]

Earwin Burrfoot edited comment on LUCENE-1743 at 7/15/09 12:14 PM:
-------------------------------------------------------------------

The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive
than reading or writing a few tens of kilobytes of data via the usual read
and write methods. From the standpoint of performance it is generally only
worth mapping relatively large files into memory.

It is probably right if you're doing a single read through the file. If
you're opening/mapping it and doing thousands of repeated reads, mmap is
superior, because after the initial mapping each read is just a memory
access vs a system call for file.read().

Add: In case you're not doing repeated reads, and just read these small
files once from time to time, you can totally neglect the speed difference
between mmap and fopen. At least it doesn't warrant the increased
complexity.
[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731639#action_12731639 ] Earwin Burrfoot commented on LUCENE-1743:

bq. My problem was more with all these small files like segments_ and segments.gen or *.del files. They are small and only used one time.

I can only reiterate my point. These files aren't opened at a rate of 10k files per second, so your win is going to be on the order of microseconds per reopen - at the cost of increased complexity.
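For concreteness, here is a minimal sketch of the mechanism the issue title asks for - choosing the input implementation by file size - assuming the 2.9-era store API. The class name, field names, and the idea of wrapping two concrete directories are illustrative, not taken from the actual LUCENE-1743 patch; a real implementation would extend Directory and delegate the remaining methods as well.

{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

// Hypothetical wrapper: mmap only files above a configurable threshold,
// fall back to a conventional directory for everything smaller.
public class SizeThresholdDirectory {
  private final MMapDirectory mmapDir;
  private final NIOFSDirectory fallbackDir;
  private final long minMMapSize; // lower limit for mmapping, in bytes

  public SizeThresholdDirectory(File path, long minMMapSize) throws IOException {
    this.mmapDir = new MMapDirectory(path);
    this.fallbackDir = new NIOFSDirectory(path);
    this.minMMapSize = minMMapSize;
  }

  public IndexInput openInput(String name) throws IOException {
    // mapping cost is only amortized for large, repeatedly-read files
    if (mmapDir.fileLength(name) >= minMMapSize) {
      return mmapDir.openInput(name);
    }
    return fallbackDir.openInput(name); // small files: plain read() path
  }
}
{code}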
Re: A Comparison of Open Source Search Engines
I'd say out of these libraries only Lucene and Sphinx are worth mentioning. There's also MG4J, which wasn't covered and has a nice algorithmic background. Does anybody know other interesting open-source search engines?

On Tue, Jul 7, 2009 at 00:39, John Wang john.w...@gmail.com wrote: Vik did a very nice job. One thing the experiment did not mention is that Lucene handles incremental updates, whereas many of the other competitors do not. So the indexing performance comparison is not really fair. -John

On Mon, Jul 6, 2009 at 8:06 AM, Sean Owen sro...@gmail.com wrote: http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/ I imagine many of you already saw this -- Lucene does pretty well in this shootout. The only area it tended to lag, it seems, is memory usage and speed in some cases.
[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726571#action_12726571 ] Earwin Burrfoot commented on LUCENE-1488:

bq. There is no morphological processing or any other language-specific functionality in this patch...

I'm speaking of the stemming in ArabicAnalyzer. Why can't you use its stemming tokenfilter over all the ICU goodness from this patch? Everything else ArabicAnalyzer consists of might as well be deleted right after.

issues with standardanalyzer on multilingual text
Key: LUCENE-1488
URL: https://issues.apache.org/jira/browse/LUCENE-1488
Project: Lucene - Java
Issue Type: Wish
Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt

The standard analyzer in Lucene is not exactly unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts, because it is unaware of the unicode word-bounds properties. I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the jflex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the unicode standard. In general this kind of behavior is bad in Lucene even for latin: for instance, the analyzer will break words around accent marks in decomposed form. While most latin letter + accent combinations have composed forms in unicode, some do not (this is also an issue for ASCIIFoldingFilter, I suppose). I've got a partially tested standardanalyzer that uses an ICU rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the unicode bounds properties. After getting it into some good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support use of these properties such as [\p{Word_Break = Extend}], so this is probably the major barrier. Thanks, Robert
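As a quick illustration of the ICU approach (a toy demo, not the attached ICUAnalyzer.patch): ICU4J's word BreakIterator already implements the Unicode word-break properties that the jflex grammar is missing, so scripts like Thai segment correctly without hand-listed codepoint ranges.

{code}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class ICUWordSplitDemo {
  public static void main(String[] args) {
    // word-break rules from the Unicode standard, including
    // dictionary-based breaking for Thai
    BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
    String text = "\u0e17\u0e14\u0e2a\u0e2d\u0e1a test"; // Thai + latin
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE;
         start = end, end = words.next()) {
      String token = text.substring(start, end).trim();
      if (token.length() > 0) { // skip whitespace-only segments
        System.out.println(token);
      }
    }
  }
}
{code}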
Re: Improving TimeLimitedCollector
Why don't you use Thread.interrupt() / .isInterrupted()?

On Sat, Jun 27, 2009 at 16:16, Shai Erera ser...@gmail.com wrote:

"A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously but I guess that might be a reasonable restriction."

I'm not sure I understand that - how can a thread run >1 activity simultaneously anyway, and how does your impl in TimeLimitingIndexReader allow it to do so? You use the thread as a key to the map. Am I missing something?

Anyway, I think we can let go of the static methods and make them instance methods. I think, if I want to use time-limited activities, I should create a TimeLimitedThreadActivity instance and pass it around, to TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter) and any other custom code I have which I want to put a time limit on. A static class has the advantage of not needing to pass it around everywhere, and is accessible from everywhere, so that if we discover that limiting on IndexReader is not enough, and we want some of the scorers to check more frequently whether they should stop, we won't need to pass that instance all the way down to them. I don't mind keeping it static, but I also don't mind if it becomes an instance passed around, since currently it's only passed to IndexReader.

Are you going to open an issue for that? Seems like a nice addition to me. Do you think it should belong in core or contrib? If core, then if possible I'd like to see all timeout classes under one package, including TimeLimitingCollector (which until 2.9 we can safely move around as we want). I don't mind working on that w/ you, if you want. Shai

On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood markharw...@yahoo.co.uk wrote: Thanks for the feedback, Shai. So I guess you're suggesting breaking this out into a general utility class, e.g. something like:

class TimeLimitedThreadActivity {
  // called by client
  public static void startTimeLimitedActivity(long maxTimePermitted);
  public static void endTimeLimitedActivity();
  // called by resources (readers/writers) that need to be shared fairly by threads
  public static void checkActivityNotElapsed(); // throws some form of runtime exception
}

A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously but I guess that might be a reasonable restriction.

"Aside, how about using a PQ for the threads' times, or a TreeMap? That will save looping over the collection to find the next candidate. Just an implementation detail though."

Yep, that was one of the rough edges - I just wanted to get raw timings first for all the "is timed out?" checks we're injecting into reader calls. Cheers Mark

On 27 Jun 2009, at 11:37, Shai Erera wrote: I like the overall approach. However it's very local to an IndexReader. I.e., if someone wanted to limit other operations (say indexing), or does not use an IndexReader (for a Scorer impl maybe), one cannot reuse it. What if we factor out the timeout logic to a Timeout class (I think it can be a static class, with the way you implemented it) and use it in TimeLimitingIndexReader? That class can offer a method check() which will do the internal logic (the 'if' check and throw exception). It will be similar to the current ensureOpen() followed by an operation. It might be considered more expensive since it won't check a boolean, but instead call a check() method, but it will be more reusable.
Also, ensureOpen today is also a method call, so I don't think Timeout.check() is that bad. We can even later create a TimeLimitingIndexWriter and document the Timeout class for other usage by external code. Aside, how about using a PQ for the threads' times, or a TreeMap? That will save looping over the collection to find the next candidate. Just an implementation detail though. Shai

On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood markharw...@yahoo.co.uk wrote: Going back to my post re TimeLimitedIndexReaders - here's an incomplete but functional prototype: http://www.inperspective.com/lucene/TimeLimitedIndexReader.java http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java The principle is that all reader accesses check a volatile variable indicating something may have timed out (no need to check thread locals etc.). If and only if a timeout has been noted are threadlocals checked to see which thread should throw a timeout exception. All time-limited use of the reader must be wrapped in try...finally calls to indicate the start and stop of a timed set of activities. A background thread maintains the next anticipated timeout deadline and simply waits until this is reached or the list of planned activities changes with new deadlines. Performance seems reasonable on my Wikipedia index: //some tests for heavy use of termenum/term
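To make the shape of the proposal concrete, here is a rough sketch of the static utility being discussed - a cheap volatile fast path plus per-thread deadlines. Mark's actual prototype (linked above) differs in details such as the background deadline thread, so treat this as an illustration only:

public final class TimeLimitedThreadActivity {
  private static final ThreadLocal<Long> DEADLINE = new ThreadLocal<Long>();
  // set by a background timer thread when any registered deadline passes
  static volatile boolean anyTimeoutPossible = false;

  public static void startTimeLimitedActivity(long maxTimeMillis) {
    DEADLINE.set(Long.valueOf(System.currentTimeMillis() + maxTimeMillis));
    // a real impl would also register this deadline with the timer thread
  }

  public static void endTimeLimitedActivity() {
    DEADLINE.remove();
  }

  public static void checkActivityNotElapsed() {
    if (!anyTimeoutPossible) {
      return; // the common, cheap path: a single volatile read
    }
    Long deadline = DEADLINE.get();
    if (deadline != null && System.currentTimeMillis() > deadline.longValue()) {
      throw new RuntimeException("Time limit exceeded for this activity");
    }
  }
}

Reader methods would call checkActivityNotElapsed() at the top, paying only a volatile read until some deadline actually passes.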
[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux
[ https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724441#action_12724441 ] Earwin Burrfoot commented on LUCENE-1342:

bq. Sun can't ignore a HotSpot compiler bug, can they?

They are safely ignoring CMS collector bugs on 64bit archs.

64bit JVM crashes on Linux
Key: LUCENE-1342
URL: https://issues.apache.org/jira/browse/LUCENE-1342
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.0.0
Environment: 2.6.18-53.el5 x86_64 GNU/Linux, Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Reporter: Kevin Richards
Attachments: hs_err_pid10565.log, hs_err_pid21301.log, hs_err_pid27882.log

Whilst running Lucene in our QA environment we received the following exception. This problem was also reported here: http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems. Is this a JVM problem or a problem in Lucene?

# An unexpected error has been detected by Java Runtime Environment:
# SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
# Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
# Problematic frame: V [libjvm.so+0x1fce3f]

Current thread (0x2aab0007f000): JavaThread CompilerThread0 daemon [_thread_in_vm, id=2301]
(register, stack, and native-frame dump omitted; see the attached hs_err_pid logs)
Re: Improving TimeLimitedCollector
Having scorers check timeouts while advancing will definitely increase the frequency of said timeouts.

On Wed, Jun 24, 2009 at 13:13, eks dev eks...@yahoo.co.uk wrote: Re: "I think such a parameter should not exist on individual search methods since it's more of a global setting (i.e., I want my searches to be limited to 5 seconds, always, not just for a particular query). Right?" I am not sure about this one; we had cases where one physical index served two logical indices with different requirements for clients. Having Timeout settable per Query is nice to have. At the end of the day, with such a timeout you support Quality/Time compromise settings: if you need all results, be ready to wait longer and set a longer timeout; if you need SOME results quickly then reduce this timeout. That should ideally be the user's decision.

From: Shai Erera ser...@gmail.com To: java-dev@lucene.apache.org Sent: Wednesday, 24 June, 2009 10:55:50 Subject: Re: Improving TimeLimitedCollector

But TimeLimitingCollector's logic is coded in its collect() method. The top scorer calls nextDoc() or advance() on all its sub-scorers, and only when a match is found does it call collect(). If we want the sub-scorers to check whether they should abort, we'd need to revamp (liked the word :)) TimeLimitingCollector, to be something like the CheckAbort that SegmentMerger uses. I.e., the top scorer will pass such an instance to its sub-scorers, which will call a TimeLimit.check() or something, and if the time limit has expired this call will throw a TimeExceededException (like TLC). We can enable this by adding another parameter to IndexSearcher: whether searches should be limited by time, and what the time limit is. It will then instantiate that object and pass it to its Scorer and so on. I think such a parameter should not exist on individual search methods since it's more of a global setting (i.e., I want my searches to be limited to 5 seconds, always, not just for a particular query). Right? Another option would be to add a setTimeout method on Query, which will use it when it constructs its Scorer. The shortcoming of this is that if I want to use someone else's Query which did not implement setTimeout, then I'll need to build a TimeOutQueryWrapper that will wrap a Query and implement the timeout logic, but that gets complicated. I think the Collector approach makes the most sense to me, since it's the only object I fully control in the search process. I cannot control Query implementations, and I cannot control the decisions made by IndexSearcher. But I can always wrap someone else's Collector with TLC and pass it to search(). Shai

On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: As we're revamping collectors, weights, and scorers, perhaps we can push time limiting into the individual subscorers? Currently on a boolean query, we're timing out the query at the top level, which doesn't work well if the subqueries exceed the time limit.
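A minimal sketch of the CheckAbort-style object Shai describes (the names TimeLimit and TimeExceededException are illustrative, not existing Lucene classes): the searcher would create one instance per query and pass it down, and sub-scorers would poll check() while advancing, so timeouts fire inside long advances rather than only at collect() time.

// Illustrative only: a per-query time budget that scorers can poll.
public class TimeLimit {
  private final long deadline; // absolute wall-clock deadline, millis

  public TimeLimit(long maxMillis) {
    this.deadline = System.currentTimeMillis() + maxMillis;
  }

  // called by sub-scorers at the top of nextDoc()/advance()
  public void check() {
    if (System.currentTimeMillis() > deadline) {
      throw new TimeExceededException();
    }
  }

  public static class TimeExceededException extends RuntimeException {
  }
}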
[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter
[ https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722996#action_12722996 ] Earwin Burrfoot commented on LUCENE-1712:

Having half of your methods constantly fail with an exception depending on a constructor parameter - that just screams "Split me into two classes!"

Set default precisionStep for NumericField and NumericRangeFilter
Key: LUCENE-1712
URL: https://issues.apache.org/jira/browse/LUCENE-1712
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.9
Reporter: Michael McCandless
Priority: Minor
Fix For: 2.9

This is a spinoff from LUCENE-1701. A user using Numeric* should not need to understand what's under the hood in order to do their indexing & searching. They should be able to simply:

{code}
doc.add(new NumericField("price", 15.50));
{code}

And have a decent default precisionStep selected for them. Actually, if we add ctors to NumericField for each of the supported types (so the above code works), we can set the default per-type. I think we should do that? 4 for int and 6 for long were proposed as good defaults. The default need not be perfect, as advanced users can always optimize their precisionStep, and for users experiencing slow RangeQuery performance, NumericRangeQuery with any of the defaults we are discussing will be much faster.
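The smell Earwin points at can be shown in a few lines (all names hypothetical, not actual Lucene classes): one class whose methods fail at runtime depending on how it was constructed, versus two small classes where every method is always valid.

{code}
// One class, constructor-dependent failures: half the API throws at runtime.
class NumericValue {
  private final boolean isInt;
  private final long bits;

  NumericValue(int v)  { this.isInt = true;  this.bits = v; }
  NumericValue(long v) { this.isInt = false; this.bits = v; }

  int intValue() {
    if (!isInt) throw new IllegalStateException("constructed as long");
    return (int) bits;
  }

  long longValue() {
    if (isInt) throw new IllegalStateException("constructed as int");
    return bits;
  }
}

// Split into two classes: the compiler enforces what a runtime check used to.
class IntValue  { final int value;  IntValue(int v)   { this.value = v; } }
class LongValue { final long value; LongValue(long v) { this.value = v; } }
{code}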
[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
[ https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723224#action_12723224 ] Earwin Burrfoot commented on LUCENE-1715:

I object to nulling references in an attempt to speed up GC. It's totally useless on any decent JVM implementation, and if someone uses an indecent JVM, I doubt he's concerned with his app's efficiency.

DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
Key: LUCENE-1715
URL: https://issues.apache.org/jira/browse/LUCENE-1715
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4.1
Environment: Sun JDK 6 update 12 64-bit, Debian Lenny
Reporter: Brian Groose
Assignee: Michael McCandless
Fix For: 2.9

DirectoryIndexReader has a finalize method, which causes the JDK to keep a reference to the object until it can be finalized. SegmentReader and MultiSegmentReader are subclasses that contain references to, potentially, hundreds of megabytes of cached data in a TermInfosReader. Some options would be removing finalize() from DirectoryIndexReader (it releases a write lock at the moment) or possibly nulling out references in various close() and doClose() methods throughout the class hierarchy so that the finalizable object doesn't reference the Term arrays. Original mailing list message: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200906.mbox/%3c7a5cb4a7bbce0c40b81c5145c326c31301a62...@numevp06.na.imtn.com%3e
[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
[ https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723225#action_12723225 ] Earwin Burrfoot commented on LUCENE-1715:

And I support removing finalizers everywhere if their only point is to guard against a forgotten close().
[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
[ https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723289#action_12723289 ] Earwin Burrfoot commented on LUCENE-1715:

There's in fact one case where nulling harms. I'm going to try making as much of IR as possible immutable and final - load everything upfront on creation/reopen (or don't load if the IR is created for, say, merging). Unlike nulling references, making frequently accessed fields final does have an impact under adequate JVMs. Well, nulling can be added now and removed when/if I finish my IR stuff.
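Distilling the issue into a few lines (a hypothetical class, not the actual SegmentReader code): an object with a finalize() method survives at least one extra GC cycle, and until the finalizer thread runs it keeps every field it references reachable. That is why nulling in close() helps in this specific case even though it is useless as a general GC "optimization".

{code}
// Hypothetical reduction of the DirectoryIndexReader situation.
class LeakyReader {
  // stands in for TermInfosReader's potentially huge cached arrays
  private byte[] hugeCache = new byte[256 * 1024 * 1024];

  protected void finalize() throws Throwable {
    try {
      close(); // guards against a forgotten close()...
    } finally {
      super.finalize();
    }
    // ...but its mere presence keeps this object - and hugeCache -
    // alive until the finalizer thread gets around to it.
  }

  public void close() {
    hugeCache = null; // the "nulling in close()" workaround from the issue
  }
}
{code}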
[jira] Commented: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723352#action_12723352 ] Earwin Burrfoot commented on LUCENE-1607:

Okay, let's have an extra class and the ability to switch impls. I liked that the static method could get inlined (at least its short path), but that's not necessary. Except I'd like the javadoc to demand that each impl be String.intern()-compatible. There's nothing bad in it, as in any decent impl a unique string will be String.intern()'ed at most once. And the case when you get an infinite flow of unique strings is degenerate anyway - you have to fix something, not deal with it. On the other hand, we can remove the "This should never be changed after other Lucene APIs have been used" clause.

- Rewrite 'for' as 'for (Entry e = first; e != null; e = e.next)' for clarity?
- 'Entry[] arr = cache;' - this can be skipped? 'cache' is already final and the optimizer loves finals. Plus further down the method you use both cache[slot] and arr[slot]. Or am I missing some voodoo?
- The if check around 'nextToLast = e' can also be removed?
- 'public String intern(char[] arr, int offset, int len)' - is this needed?

String.intern() faster alternative
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Assignee: Yonik Seeley
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch

By using our own interned string pool on top of the default, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
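For readers without the patch handy, here is a condensed sketch of the design under review (see the LUCENE-1607 attachments for the real code, which also bounds chain length): a small fixed-size, deliberately unsynchronized cache in front of String.intern(). Lost updates from races are harmless, because every stored value is itself intern()'d, which is what makes the impl String.intern()-compatible.

{code}
// Condensed sketch, not the actual patch.
public class SimpleInterner {
  private static class Entry {
    final String str;
    final Entry next;
    Entry(String str, Entry next) { this.str = str; this.next = next; }
  }

  private final Entry[] cache;

  public SimpleInterner(int size) {
    // size must be a power of two so the mask below works
    cache = new Entry[size];
  }

  public String intern(String s) {
    int slot = s.hashCode() & (cache.length - 1);
    for (Entry e = cache[slot]; e != null; e = e.next) {
      if (e.str.equals(s)) return e.str; // fast path: already cached
    }
    String interned = s.intern(); // String.intern()-compatible by construction
    cache[slot] = new Entry(interned, cache[slot]); // racy by design
    return interned;
  }
}
{code}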
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723355#action_12723355 ] Earwin Burrfoot commented on LUCENE-1677:

Mike, are we going to postpone the actual deletion of these classes until 3.0?

Remove GCJ IndexReader specializations
Key: LUCENE-1677
URL: https://issues.apache.org/jira/browse/LUCENE-1677
Project: Lucene - Java
Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
Fix For: 2.9

These specializations are outdated, unsupported, and most probably pointless due to the speed of modern JVMs, and, I bet, nobody uses them (Mike, you said you were going to ask people on java-user - did anybody reply that they need it?). While giving nothing, they make the SegmentReader instantiation code look really ugly. If nobody objects, I'm going to post a patch that removes these from Lucene.
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723378#action_12723378 ] Earwin Burrfoot commented on LUCENE-1677:

I thought we were doing everything right now, as it is broken already. And I have a half-written patch with the SR cleanup after GCJ removal :)
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722769#action_12722769 ] Earwin Burrfoot commented on LUCENE-1701:

Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.

Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
Key: LUCENE-1701
URL: https://issues.apache.org/jira/browse/LUCENE-1701
Project: Lucene - Java
Issue Type: New Feature
Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: LUCENE-1701-test-tag-special.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, NumericField.java

In discussions about LUCENE-1673, Mike & me wanted to add a new NumericField to o.a.l.document specifically for easy indexing. An alternative would be to add a NumericUtils.newXxxField() factory that creates a preconfigured Field instance with norms and tf off, optionally a stored text (LUCENE-1699), and the TokenStream already initialized. On the other hand, NumericUtils.newXxxSortField could be moved to NumericSortField. I and Yonik tend to use the factory for both; Mike tends to create the new classes. Also, the parsers for string-formatted numerics are not public in FieldCache. As the new SortField API (LUCENE-1478) makes it possible to support a parser in SortField instantiation, it would be good to have the static parsers in FieldCache publicly available. SortField would init its member variable to them (instead of NULL), making code a lot easier (FieldComparator has these ugly null checks when retrieving values from the cache). Moving the trie parsers as static instances into FieldCache as well would make the code cleaner, and we would be able to hide the hack StopFillCacheException by making it private to FieldCache (currently it's public because NumericUtils is in o.a.l.util).
[jira] Issue Comment Edited: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722769#action_12722769 ] Earwin Burrfoot edited comment on LUCENE-1701 at 6/22/09 12:18 PM:

Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8. Though, if you want really fast dates, choosing hour/day/month/year as precision steps is vastly superior, plus it also clicks well with user-selected ranges. Still, I dumped this approach for uniformity and clarity.
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722775#action_12722775 ] Earwin Burrfoot commented on LUCENE-1701:

bq. Design for today.

And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.

bq. "donating something to Lucene means casting it in concrete." We can't let fear of back-compat prevent us from making progress.

My point was that strict back-compat prevents people from donating work which is not yet finalized. They either lose the comfortable volatility of private code, or have to maintain two versions of it - private and Lucene.

bq. "NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection." I agree, per-segment collection was the bulk of the gains needed for NRT. This was a big change and a huge step forward in simple reopen turnaround.

I vote it for the most frustrating (in terms of adapting your custom code) and most useful change of 2.9 :)

bq. But, not having to write & read deletes to disk, not commit (fsync) from the writer in order to see those changes in the reader should also give us decent gains. fsync is surprisingly and intermittently costly.

I'm not sure this can't be achieved without messing with IR/W guts so much. The guys from LinkedIn that drive this feature (if I'm not mistaken) had a prior solution with separate indexes, one on disk, one in RAM. Per-segment collection adds superfast reopens and a MultiReader that is way greater than MultiSearcher - you can finally do adequately fast searches across separate indexes. Do we still need to add complexity for minor performance gains?

bq. And this integration lets us take it a step further with LUCENE-1313, where recently created segments can remain in RAM and be shared with the reader.

RAMDirectory? Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough.)

bq. I'm confused: I thought that effort was to make SegmentReader's components fully pluggable? (Not to actually change what components SegmentReader is creating.) EG does this modularization alter the approach to NRT? I thought they were orthogonal.

Yes, they are orthogonal. This was yet another praise of per-segment collection, and an example of how this approach can be extended to your custom stuff (like a filter cache).
Re: Shouldn't IndexWriter.commit(Map) accept Properties instead?
"What other issues would we be taking on by using Java's serialization here...?"

It's insanely slow. Though, that doesn't apply to a once-per-commit call. The other point is: if you store Object, you can no longer mix Lucene and user data. With Map<String, String>, whatever the approach, you could reserve some key space for Lucene and let the user add his stuff on top.
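For illustration, with the Map<String, String> form of commit user data (the 2.9-era IndexWriter.commit(Map) call), the reserved-key-space idea looks like this. The prefixes below are made up for the example, not an actual convention:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexWriter;

class CommitUserDataDemo {
  static void commitWithUserData(IndexWriter writer, String lastId) throws IOException {
    Map<String, String> userData = new HashMap<String, String>();
    // hypothetical reserved prefix for library-internal entries
    userData.put("lucene.example.internal", "42");
    // application data coexists in the same map under its own prefix
    userData.put("myapp.lastIndexedId", lastId);
    writer.commit(userData);
  }
}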
[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter
[ https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722843#action_12722843 ] Earwin Burrfoot commented on LUCENE-1712:

Am I misunderstanding something, or does the problem still persist? Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?
[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter
[ https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722851#action_12722851 ] Earwin Burrfoot commented on LUCENE-1712:

Aha! And each time you invoke setFloatValue/setDoubleValue it switches the base type behind the scenes? Eeek.
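Concretely, what "base type" means here, using the real NumericUtils helpers from 2.9: floats are encoded as sortable ints and doubles as sortable longs, so setFloatValue vs. setDoubleValue really does flip the width of the underlying trie terms.

{code}
import org.apache.lucene.util.NumericUtils;

public class SortableEncodingDemo {
  public static void main(String[] args) {
    // float -> 32-bit sortable int encoding
    int asInt = NumericUtils.floatToSortableInt(15.50f);
    // double -> 64-bit sortable long encoding
    long asLong = NumericUtils.doubleToSortableLong(15.50);
    // different widths produce different indexed terms
    System.out.println(asInt + " vs " + asLong);
  }
}
{code}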
Re: 3MB lucene-analyzers.jar?
"But: I do not understand the problems with this JAR file. If somebody really wants to have smaller files, one could use some tools that do it automatically on class usage."

"I personally have a couple of use cases for that, as I have to work in very limited environments. Imagine embedded systems or mobile phones, where 500 kb is a lot. If you really need the analyzer you can include the additional jar."

Jar Jar Links - special tools for special tasks.
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721787#action_12721787 ] Earwin Burrfoot commented on LUCENE-1701:

I vote for factories - escaping back-compat woes by exposing a minimum interface.
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721830#action_12721830 ] Earwin Burrfoot commented on LUCENE-1701:

Mike, I very much agree with everything you said, except "factory is less consumable than constructor" and "add stuff to index to handle NumericField". Out of your three examples the second one is bad, no questions. But the first and last are absolutely equal in terms of consumability. Static factories are cool (they allow switching implementations and instantiation logic without changing the API) and are as easy to use (probably even easier with generics in Java 5) as constructors. If we add some generic storable flags for Lucene fields, this is cool (probably); NumericField can then capitalize on them, as well as users writing their own NNNFields. Tying the index format to some particular implementation of numerics is bad design. Why on earth can't my own split-field (vs. single-field as in current Lucene) trie-encoded number enjoy the same benefits as NumericField from Lucene core?

bq. By this same logic, should we remove NumericRangeFilter/Query and use static factories instead?

I do use factory methods for all my queries and filters, and it makes me feel warm and fuzzy! :) Under the hood some of them consult FieldInfo to instantiate custom-tailored query variants, so I just use range(CREATION_TIME, from, to) and don't think about whether the field is trie-encoded or raw. Simple things should be simple, okay. Complex things should be simple too, argh! :)
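A sketch of the factory style described above (the FieldDef descriptor and range() helper are hypothetical private-schema-layer code, not a Lucene API): callers write range(field, from, to) and the factory consults the schema to pick the right query class.

{code}
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;

class Queries {
  // hypothetical schema descriptor
  interface FieldDef {
    String name();
    boolean isTrieLong();
    int precisionStep();
  }

  // callers never care whether the field is trie-encoded or raw
  static Query range(FieldDef field, String from, String to) {
    if (field.isTrieLong()) {
      return NumericRangeQuery.newLongRange(field.name(), field.precisionStep(),
          Long.valueOf(from), Long.valueOf(to), true, true);
    }
    return new TermRangeQuery(field.name(), from, to, true, true);
  }
}
{code}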
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722060#action_12722060 ] Earwin Burrfoot commented on LUCENE-1701:

bq. Someday maybe I'll convince you to donate this schema layer on top of Lucene

It's not generic enough to be of use for every user of Lucene, and it doesn't aim to be such. It also evolves, and donating something to Lucene means casting it in concrete. So that's not me being greedy or lazy (okay, maybe a little bit of the latter); it's simply not public-quality (as I understand it) code. I can share the design if anybody's interested, but everyone's coping with it themselves, it seems. Solr has its own schema approach, and it has its merits and downfalls compared to mine. That's what is nice: we're able to use the same library in differing ways, and it doesn't force its sense of 'best practices' on us.

bq. But I hope there are SOME named classes in there and not all static factory methods returning anonymous untyped impls.

SOME of them aren't static :-D

bq. We shouldn't weaken trie's integration to core just because others have private implementations.

You shouldn't integrate into core something that is not core functionality. Think microkernels. It's strange seeing you drive CSFs, custom indexing chains, and pluggability everywhere on one side, while trying to add some weird custom properties into the index that are tightly interwoven with only one of the possible numeric implementations on the other side.

bq. Design for today.

And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.

bq. What's important is that we don't weaken those private implementations with trie's addition, and I don't think our approach here has done that.

You're weakening Lucene itself by introducing too much coupling between its components. The IndexReader/Writer pair is a good example of what I'm arguing against: a dusty closet of microfeatures that are tightly interwoven into a complex, hard-to-maintain mess with zillions of (possibly broken) control paths - remember the mutable deletes/norms + clone/reopen permutations? It could be avoided if IR/W were kept to the bare minimum (which most people are going to use), and more advanced features were built on top of it, not in the same place. NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection. Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough.)

{quote}
If we add some generic storable flags for Lucene fields, this is cool (probably), NumericField can then capitalize on it, as well as users writing their own NNNFields.
+1 Wanna make a patch?
{quote}

No, I'd like to continue the IR cleanup and play with a positionIncrement companion value that could enable true multiword synonyms. I know, I know, it's do-ocracy. But it's not an excuse for hacks.
Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
--
Key: LUCENE-1701
URL: https://issues.apache.org/jira/browse/LUCENE-1701
Project: Lucene - Java
Issue Type: New Feature
Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: NumericField.java

In discussions about LUCENE-1673, Mike and I wanted to add a new NumericField to o.a.l.document, specifically for easy indexing. An alternative would be to add a NumericUtils.newXxxField() factory that creates a preconfigured Field instance with norms and tf off, optionally a stored text value (LUCENE-1699), and the TokenStream already initialized. On the other hand, NumericUtils.newXxxSortField could be moved to NumericSortField. Yonik and I tend to use the factory for both; Mike tends to create the new classes.

Also, the parsers for string-formatted numerics are not public in FieldCache. As the new SortField API (LUCENE-1478) makes it possible to supply a parser at SortField instantiation, it would be good to have the static parsers in FieldCache publicly available. SortField would init its member variable to them (instead of null), making the code a lot simpler (FieldComparator has these ugly null checks when retrieving values from the cache). Moving the trie parsers as static instances into FieldCache too would make the code cleaner, and we would be able to hide the StopFillCacheException hack by making it private to FieldCache (currently it's public because NumericUtils is in o.a.l.util).
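The factory alternative is easy to picture. A minimal sketch, assuming the 2.9-era names (NumericTokenStream, the omit* setters on Field); the factory class and method names here are hypothetical:

{code}
// Hypothetical NumericUtils-style factory: a Field preconfigured with
// norms and tf off, and the numeric TokenStream already attached.
import org.apache.lucene.analysis.NumericTokenStream;
import org.apache.lucene.document.Field;

public final class NumericFieldFactory {
  private NumericFieldFactory() {}

  public static Field newLongField(String name, long value, int precisionStep) {
    Field f = new Field(name, new NumericTokenStream(precisionStep).setLongValue(value));
    f.setOmitNorms(true);                // norms off, as the description proposes
    f.setOmitTermFreqAndPositions(true); // tf off
    return f;
  }
}
{code}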
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720619#action_12720619 ] Earwin Burrfoot commented on LUCENE-1630:
-

I wasn't following the issue closely, so this question might be silly - how does out-of-order scoring/collection marry with filters? If I remember right, filter/scorer intersection relies on proper orderness.

Mating Collector and Scorer on doc Id orderness
---
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch

This is a spin-off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes:
# Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract.
#* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
#* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0, when we remove the Weight variants, override in all extending classes.
# Add to Scorer isOutOfOrder with a default of false, and override in BS to true.
# Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
# Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false.
#* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
#* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance.
#* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check isOutOfOrder() on the resulting Scorer, so that we can create the optimized Collector instance.
# Modify IndexSearcher to use all of the above logic.

The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following:
* Deprecate Searchable in favor of Searcher.
* Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl that calls the Weight versions, documenting that these will become abstract in 3.0.
* Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable.
I think there is a very small chance this has actually happened, but I would like to confirm with you guys first.
* Add a deprecated, package-private SearchableWrapper which extends Searcher and delegates all calls to the Searchable member.
* Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper.
* Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods.

One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer whose score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2, which check whether they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following this one (as it might add methods to QueryWeight).
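For context on the ordering question asked at the top of this message: scorer/filter intersection is usually a leapfrog over two forward-only iterators, which is exactly what breaks if one side emits docs out of order. A sketch against the 2.9 DocIdSetIterator API:

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public final class Leapfrog {
  // Counts docs in the intersection; relies on both iterators producing
  // strictly increasing docIds and supporting forward-only advance().
  public static int count(DocIdSetIterator a, DocIdSetIterator b) throws IOException {
    int hits = 0;
    int x = a.nextDoc(), y = b.nextDoc();
    while (x != DocIdSetIterator.NO_MORE_DOCS && y != DocIdSetIterator.NO_MORE_DOCS) {
      if (x == y) { hits++; x = a.nextDoc(); y = b.nextDoc(); }
      else if (x < y) x = a.advance(y);  // jump forward, never backward
      else y = b.advance(x);
    }
    return hits;
  }
}
{code}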
Re: madvise(ptr, len, MADV_SEQUENTIAL)
Except you don't know the size of the file to be written upfront. One probable solution is to map the output file in pages. As a complementary solution, you can map a huge area of the file and hope that little real memory is allocated by the OS unless you actually write all over that area. Dunno. The idea of using mmapped writes has stopped looking interesting to me.

On Tue, Jun 16, 2009 at 18:32, Uwe Schindler u...@thetaphi.de wrote:

But to use it, we should change MMapDirectory to also use the mapping when writing to files. I thought about it; it is very simple to implement (just copy the IndexInput and change all gets() to sets()).

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, June 16, 2009 4:22 PM
To: java-dev@lucene.apache.org
Cc: Alan Bateman; nio-disc...@openjdk.java.net
Subject: Re: madvise(ptr, len, MADV_SEQUENTIAL)

Lucene could really make use of this method. When a segment merge takes place, we can read and write many GB of data, which without madvise would, on many OSs, effectively flush the IO cache (thus hurting our search performance). Mike

On Mon, Jun 15, 2009 at 6:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Thanks Alan. I cross-posted this to the Lucene dev list, where we are discussing using madvise to minimize unnecessary IO cache usage when merging segments (where we really want the newly merged segments in the IO cache rather than the old segment files). How would the advise method work? Would there need to be a hint in the FileChannel.map method? -J

On Mon, Jun 15, 2009 at 12:36 AM, Alan Bateman alan.bate...@sun.com wrote:

Jason Rutherglen wrote: Is there going to be a way to do this in the new Java IO APIs? Good question, as it has come up a few times and is needed for some important use-cases. A while back I looked into adding a MappedByteBuffer#advise method to allow the application to provide hints on the expected usage, but didn't complete it. We should probably look at this again for jdk7. -Alan.
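For illustration, roughly what 'map the output file in pages' could look like (a sketch only - the class is hypothetical, the page size arbitrary, and truncating a still-mapped file is platform-sensitive, which hints at why the idea lost its appeal):

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class PagedMmapWriter {
  private static final int PAGE_SIZE = 1 << 24; // 16 MB pages, arbitrary
  private final FileChannel channel;
  private MappedByteBuffer page;
  private long pageStart = 0;

  public PagedMmapWriter(String path) throws IOException {
    channel = new RandomAccessFile(path, "rw").getChannel();
    page = channel.map(FileChannel.MapMode.READ_WRITE, pageStart, PAGE_SIZE);
  }

  public void writeByte(byte b) throws IOException {
    if (!page.hasRemaining()) { // grow the file one mapped page at a time
      pageStart += PAGE_SIZE;
      page = channel.map(FileChannel.MapMode.READ_WRITE, pageStart, PAGE_SIZE);
    }
    page.put(b);
  }

  public void close() throws IOException {
    page.force();
    // trim the unused tail; note this fails on platforms (e.g. Windows)
    // that refuse to truncate a file with live mappings
    channel.truncate(pageStart + page.position());
    channel.close();
  }
}
{code}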
Re: Proposal for changing the backwards-compatibility policy
Oh yes! Again! +1

One point is missing: what about incompatible behavioral changes that do not touch API or file format? Like posIncr=0 at the first token in a stream, or analyzer fixes, or something along these lines. Are we free to introduce them in a minor release without warning, or are we going to warn one release before the change, or do we provide old-behaviour switches that are deprecated since their birth, or do we keep said switches for a couple of major releases?

On Tue, Jun 16, 2009 at 14:37, Michael Busch busch...@gmail.com wrote:

Probably everyone is thinking right now "Oh no! Not again!". I admit I didn't fully read the incredibly long recent thread about backwards-compatibility, so maybe what I'm about to propose has been proposed already. In that case my apologies in advance. Rather than discussing our current backwards-compatibility policy again, I'd like to make a concrete proposal here for changing the policy after Lucene 3.0 is released. I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a 'minor release' and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (We can use different names later; these are just for convenience here...)

1. The file format backwards-compatibility policy will remain unchanged; i.e. Lucene X.Y supports reading all indexes written with Lucene X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x indexes.
2. Deprecated public and protected APIs can be removed if they have been released in at least one major or minor release. E.g. a 3.1 API can be released as deprecated in 3.2 and removed in 3.3 or 4.0 (if 4.0 comes after 3.2).
3. No public or protected APIs are changed in a bugfix release, except if a severe bug can't be fixed otherwise.
4. Each release will have release notes with a new section "Incompatible changes", which lists, as the name says, all changes that break backwards compatibility. The list should also have information about how to convert to the new API. I think the Eclipse releases have such a release notes section.

The big change here apparently is 2. Consider the current situation: we can release e.g. the new TokenStream API with 2.9 and then remove it a month later in 3.0, while still complying with our current backwards-compatibility policy. A transition period of one month is very short for such an important API. On the other hand, a transition period of presumably two years, until 4.0 is released, seems very long to stick with a deprecated API that clutters the APIs and docs. With the proposed change, we couldn't do that. Given our current release schedule, the transition period would be at least 6-9 months, which seems a very reasonable timeframe.

We should also not consider 2 a must. I.e. we don't *have* to deprecate after only one major or minor release. For a very popular API like the TokenStream API, we could send a mail to java-user asking whether people need more transition time, and be flexible. I think this policy is much more dynamic and flexible, but should still give our users enough confidence. It also removes the need to do things just for the sake of the current policy rather than because they make the most sense, like our somewhat goofy X.9 releases. :)

Just to make myself clear: I think we should definitely stick with our 2.9 and 3.0 plans and change the policy afterwards.

My +1 to all 4 points above.
-Michael
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720231#action_12720231 ] Earwin Burrfoot commented on LUCENE-1673:
-

bq. This is that baking in a specific implementation into the index format that I don't like.

+many

bq. I do agree that retrieving a doc is already buggy, in that various things are lost from your index time doc (a well known issue at this point!)

How on earth is it buggy? You're working with an inverted index; you aren't supposed to get the original document back from it in the first place. It's like saying a hash function is buggy because it is not reversible. The less coupling various Lucene components have with each other, the better. If you'd like an end-to-end experience for numeric fields, build something schema-like and put it in contrib. If that's hard to build, Lucene core is to blame for not being extensible enough. From my experience, for that purpose it's okay as it is.

Move TrieRange to core
--
Key: LUCENE-1673
URL: https://issues.apache.org/jira/browse/LUCENE-1673
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch

TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940), and if possible I want to move it to core before the release of 2.9. Before this can be done, there are some things to think about:
# There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how should they be called in core? I would suggest leaving it as it is. On the other hand, if this remains our only numeric query implementation, we could call them LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below; there are problems here). Same for the TokenStreams and Filters.
# Maybe the pairs of classes for indexing and searching should be merged into one class each: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to take int, long, double, float as range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add an L to a long, it suddenly instantiates an int version of the range query, hitting no results, and so on. Same with the other types. Maybe accept java.lang.Number as the parameter (because it's nullable, for half-open bounds) and one enum for the type.
# Should TrieUtils move into o.a.l.util? Or o.a.l.document? Or...?
# Move the TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else?
# If we rename the classes, should Solr stay with Trie (because there are different impls)?
# Maybe add a subclass of AbstractField that automatically creates these TokenStreams and omits norms/tf by default, for easier addition to Document instances?
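Point 2's 'java.lang.Number plus a type enum' idea, sketched (hypothetical names, not the eventual API): null expresses a half-open bound, and the explicit type makes the forgotten-L overload mixup impossible.

{code}
public final class NumericRange {
  public enum Type { INT, LONG, FLOAT, DOUBLE }

  private final String field;
  private final Type type;
  private final Number min, max; // null = half-open bound

  public NumericRange(String field, Type type, Number min, Number max) {
    if (min == null && max == null)
      throw new IllegalArgumentException("at least one bound must be given");
    this.field = field;
    this.type = type;
    this.min = min;
    this.max = max;
  }
}

// usage: new NumericRange("price", NumericRange.Type.LONG, null, 100L)
{code}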
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539 ] Earwin Burrfoot commented on LUCENE-1630:
-

I like the last option most. Creating a dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.

Mating Collector and Scorer on doc Id orderness
---
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9
[jira] Issue Comment Edited: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539 ] Earwin Burrfoot edited comment on LUCENE-1630 at 6/15/09 5:36 AM:
--

I like the last option (move scoresOutOfOrder to Weight) most. Creating a dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.

was (Author: earwin): I like the last option most. Creating dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.

Mating Collector and Scorer on doc Id orderness
---
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9
Re: Payloads and TrieRangeQuery
Just to throw something out: the new Token API is not very consumable in my experience. The old one was very intuitive, and it was very easy to follow the code. I've had to re-figure out what the heck was going on with the new one more than once now. Writing some example code with it is hard to follow or justify to a new user. What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something, right? I dunno - anyone else have any thoughts now that the new API has been in circulation for some time?

I have an advanced, expert custom indexing chain, and it's still not ported over to the new API. It's counterintuitive all right, with names not really saying what's going on (please, for an AttributeSource, whose Attribute is it? An Attribute is a quality of 'something', but that 'something' is amiss). But the biggest problem for me is that it capitalizes on the idea of a token stream even further, making filters whose output is several times the input, token-wise, or which need to inspect a number of tokens before emitting something, much harder to write. I most probably missed something and there IS a way not to trash your memory with non-reused LinkedHashMaps, but then again, there are no pointers.
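To make the 'output is several times the input' complaint concrete, here is about the smallest such filter under the new attribute API (a sketch against the 2.9-era classes; the filter itself is a made-up example): buffering even one token for re-emission means juggling captured state by hand.

{code}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Re-emits every token once more at the same position, as a synonym would be.
public final class DuplicatingFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State pending; // captured token awaiting re-emission

  public DuplicatingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {                // emit the buffered copy first
      restoreState(pending);
      pending = null;
      posIncrAtt.setPositionIncrement(0); // stack it on the same position
      return true;
    }
    if (!input.incrementToken()) return false;
    pending = captureState();             // remember this token for the copy
    return true;
  }
}
{code}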
[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719322#action_12719322 ] Earwin Burrfoot commented on LUCENE-1488:
-

bq. But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems Arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic language text.

What about separating word-tokenizing from morphological processing?

issues with standardanalyzer on multilingual text
-
Key: LUCENE-1488
URL: https://issues.apache.org/jira/browse/LUCENE-1488
Project: Lucene - Java
Issue Type: Wish
Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt

The standard analyzer in Lucene is not exactly Unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of Unicode bounds properties. I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the jflex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the Unicode standard.

In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.) I've got a partially tested StandardAnalyzer that uses an ICU rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the Unicode bounds properties. After getting it into some good shape I'd be happy to contribute it to contrib, but I wonder if there's a better solution so that out-of-the-box Lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support the use of these properties, such as [\p{Word_Break = Extend}], so this is probably the major barrier. Thanks, Robert
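For reference, the ICU approach described above, in miniature (my illustration, using ICU4J's stock word BreakIterator rather than custom rules):

{code}
import com.ibm.icu.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public final class UnicodeWordSplitter {
  // Splits text on Unicode word-break bounds instead of hand-rolled
  // codepoint ranges; spans not starting with a letter/digit are dropped.
  public static List<String> words(String text) {
    BreakIterator bi = BreakIterator.getWordInstance();
    bi.setText(text);
    List<String> out = new ArrayList<String>();
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      String w = text.substring(start, end);
      if (w.length() > 0 && Character.isLetterOrDigit(w.codePointAt(0)))
        out.add(w);
    }
    return out;
  }
}
{code}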
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718009#action_12718009 ] Earwin Burrfoot commented on LUCENE-1453:
-

bq. As the Filter is just a deprecated wrapper, that is removed in 3.0, I think reusing SegmentReader.Ref for that is ok.

Ok. Maybe you are right.

bq. Closeable is a Java 1.5 interface only, so this refactoring must wait until 3.0, but the idea is good!

We can introduce our own Closeable and replace it with the Java native one in 3.0 - thank gods the interface is simple :)

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.4.1, 2.9
Attachments: Failing-testcase-LUCENE-1453.patch, LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch

Rough summary: FSDirectory tracks references to FSDirectory instances, and when IndexReader.reopen shares a Directory with a created IndexReader and closeDirectory is true, FSDirectory's ref management will see two decrements for one increment. You can end up getting an AlreadyClosed exception on the Directory while the IndexReader is open. I have a test I'll put up. A solution seems fairly straightforward (at least in what needs to be accomplished).
Re: Payloads and TrieRangeQuery
And this information about the trie structure and where payloads are should be stored in FieldInfos.

As is the case today, the info is encoded in the class you use (and its settings)... no need to add it to the index structure. In any case, it's a completely different issue and shouldn't be tied to TrieRange improvements.

The problem is, because the details of Trie* at index time affect what's in each segment, this information needs to be stored per segment. And then, when you merge segments indexed with different Trie* settings, you need to convert them to some common form.

Sounds like something too complex and with minimal returns.
[jira] Commented: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718198#action_12718198 ] Earwin Burrfoot commented on LUCENE-1607:
-

bq. but I was waiting for some kind of feedback if people in general thought it was the right approach. It introduces another static, and people tend to not like that.

I just somehow forgot about this issue. You're right about the static: it's not clear how and when to initialize it, plus you introduce some public classes we'll be unable to change/remove later. I still have a feeling we should expose a single static method - intern() - and hide the implementation away, possibly tuning it to be advantageous for thousands of fields and degrading to raw String.intern() level if there are more fields. I'm going to be away from AC power for three days starting now, so I won't be able to reply until then.

String.intern() faster alternative
--
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch

By using our own interned string pool on top of the default one, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
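The shape of that single-static-method idea, as I read it (a simplified sketch, not the attached patch; a ConcurrentHashMap stands in for whatever cache the implementation actually uses):

{code}
import java.util.concurrent.ConcurrentHashMap;

public final class SimpleStringInterner {
  private static final ConcurrentHashMap<String, String> pool =
      new ConcurrentHashMap<String, String>();

  private SimpleStringInterner() {}

  public static String intern(String s) {
    String cached = pool.get(s);
    if (cached != null) return cached;      // fast path: already pooled
    String canonical = s.intern();          // degrade to the JVM pool
    pool.putIfAbsent(canonical, canonical); // remember it for next time
    return canonical;
  }
}
{code}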
Re: Payloads and TrieRangeQuery
* Was the field even indexed w/ Trie, or indexed as simple text? It's useful to know this automatically at search time, so eg a RangeQuery can do the right thing by default. FieldInfos seems like the natural place to store this. It's basically Lucene's per-segment write-once schema. Eg we use this to record "did any token in this field have a Payload?", which is analogous.

This should really be in a schema of some kind (like in my project, for instance). Why do you do autodetection for tries, but recently removed it for FieldCache? Things should be consistent: either store all settings in the index (and die in the process), or don't store them there at all.

* We have a bug (or an important improvement) in how Trie encodes terms that we need to fix. This one is not easy to handle, since such a change could alter the term order, and merging segments then becomes problematic. Not sure how to handle that. Yonik, has Solr ever had to make a change to NumberUtils?

There are cases when reindexing is inevitable. What's so horrible about it anyway? Even if you have a humongous index, you can rebuild it in a matter of days, and you don't do this often.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717657#action_12717657 ] Earwin Burrfoot commented on LUCENE-1453:
-

Patch looks fine. I read the last one, LUCENE-1453-with-FSDir-open.patch.

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
Re: Some thoughts around the use of reader.isDeleted and hasDeletions
Actually: I think we should also change IndexReader.document to not check if it's deleted? (Renaming it to something like rawDocument(), storedDocument(), something, in the process, and deprecating the old one.)

Yup. After all, the most common use-case is to load a document after finding it in one way or another. Pretty hard to come up with the id of a deleted document.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717769#action_12717769 ] Earwin Burrfoot commented on LUCENE-1453:
-

bq. I think it should (be closed in a finally clause).

Then there's the next question of the same sort, though it probably belongs in a separate issue. If we close a DR and one of the SRs throws an exception - should we close the others (currently we don't)? What is the right way, in general, of handling IOExceptions on IR close? Can we retry the close? What does this exception mean?

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717823#action_12717823 ] Earwin Burrfoot commented on LUCENE-1678:
-

Second this. Though I've lost any hope for sane Lucene release/compat rules.

Deprecate Analyzer.tokenStream
--
Key: LUCENE-1678
URL: https://issues.apache.org/jira/browse/LUCENE-1678
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9

The addition of reusableTokenStream to the core analyzers unfortunately broke back-compat of external subclasses: http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html On upgrading, such subclasses would silently not be used anymore, since Lucene's indexing invokes reusableTokenStream. I think we should at least deprecate Analyzer.tokenStream today, so that users see deprecation warnings if their classes override this method. But going forward, when we want to change the API of core classes that are extended, I think we have to introduce entirely new classes to keep back compatibility.
[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717862#action_12717862 ] Earwin Burrfoot commented on LUCENE-1678:
-

bq. If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object.

It's not a matter of finding a smart way. It is a matter of a sacrifice that has to be made, and of readiness to take the blame for a decision that can be unpopular with someone. If you go zealously for back-compat, you sacrifice the readability/maintainability of your code, but free users from any troubles when they want to 'simply upgrade'. If you adopt a more relaxed policy, you sacrifice users' time, but in return you gain a cleaner codebase, and new stuff can be written and used faster. There's no way to ride two horses at once. Some people are comfortable with the current policies. A few cringe when they hear things like the above. Most theoretically want to relax the rules. Nobody's ready to give up something for it.

Okay, there's an escape hatch I (and someone else) mentioned on the list before: adopting a fixed release cycle with small intervals between releases (compared to what we have now). Fixed - as in, releases are made every N months, instead of when everyone feels they've finished and polished up all their pet projects and there's nothing else exciting to do. That way we can keep the current policy, but the deletion-through-deprecation approach will work, at last! This solution is half-assed - I can already see discussions like "That was a big change, let's keep the deprecations around longer, say, for a couple of releases.", and it doesn't solve the good-name-thrashing problem, as you have to go through two rounds of deprecation to change the semantics of something while keeping the name. But it is better than what we have now, a-a-and it is something that needs committer backing.

bq. Thats a great indication to me that the issue is not simple.

The issue is simple; the choice is not. And maintaining the status quo is free.

bq. Giving up is really not the answer though

It is the answer. I have no moral right to hammer my ideals into heads that did tremendously more for the project than I did. And maintaining a patch queue over Lucene trunk is not 'that' hard.

Deprecate Analyzer.tokenStream
--
Key: LUCENE-1678
URL: https://issues.apache.org/jira/browse/LUCENE-1678
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717866#action_12717866 ] Earwin Burrfoot commented on LUCENE-1453:
-

Two suggestions:

Factor out a RefCount class and use it everywhere throughout Lucene. I see at least one identical to yours in SegmentReader. It would be easier to replace all these uses with AtomicInteger later.

Looking at the new unsightly loop in doClose(): what if we change all Lucene closeable classes to implement java.io.Closeable and create a static utility method(-s) that receives a bunch of Closeables (an array, an Iterable, a vararg in 1.5) and tries to close them all? The method should be null-safe (so you can skip != null checks) and will handle/rethrow exceptions. The most proper way to handle exceptions is probably this: rethrow the original exception if it is the only one (be it Runtime or IO); if there are more, gather all the exceptions and wrap them in a special IOException subclass that concatenates their messages and keeps them around, so they are inspectable at debug time, or so you can implement special treatment for that exception in your code. This method can be reused in a heap of places later; SR.doClose() comes first to mind. I can do the latter one in a separate patch, to close this issue faster.

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
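A sketch of the proposed utility (names are mine; the special exception subclass that keeps the original causes around is elided - this version only concatenates messages):

{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public final class IOUtil {
  private IOUtil() {}

  // Closes every non-null Closeable; rethrows a sole failure as-is,
  // otherwise wraps all failures into one IOException.
  public static void closeAll(Closeable... closeables) throws IOException {
    List<Throwable> failures = new ArrayList<Throwable>();
    for (Closeable c : closeables) {
      if (c == null) continue;  // null-safe: callers skip != null checks
      try {
        c.close();
      } catch (Throwable t) {
        failures.add(t);        // keep closing the rest
      }
    }
    if (failures.isEmpty()) return;
    if (failures.size() == 1) {
      Throwable t = failures.get(0);
      if (t instanceof IOException) throw (IOException) t;
      if (t instanceof RuntimeException) throw (RuntimeException) t;
      throw new IOException(t.toString());
    }
    StringBuilder sb = new StringBuilder("multiple close failures:");
    for (Throwable t : failures) sb.append(' ').append(t);
    throw new IOException(sb.toString());
  }
}
{code}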
Re: Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
@Mark:

Okay, there's an escape hatch I (and someone else) mentioned on the list before. Adopting a fixed release cycle with small intervals between releases (compared to what we have now). Fixed - as in, releases are made every N months instead of when everyone feels they finished and polished up all their pet projects and there's nothing else exciting to do. That way we can keep the current policy, but the deletion-through-deprecation approach will work, at last!

That's a big change. I think it's a nice idea, but I don't know how practical it is. Most of us are basically volunteering time for this type of thing. Even still, with the pace of development lately (and you can be sure that the current pace is a *new* thing; Lucene did not always have this amount of activity), it might make sense.

You're missing the most important point. A fixed schedule means that the only reason not to do a release is the total absence of changes. No matter how many or how few changes are released each time, a fixed schedule gives you a predictable lifecycle for all your deprecation/back-compat needs.

But that idea needs a champion, and frankly I don't have the time right now (it wouldn't likely be in my realm anyway). And that's probably the deal with most others. They have work and/or other itches that are higher priority than championing a big change.

And here we get at one of the roots of the problem. The root that is going to stay.

bq. Giving up is really not the answer though

It is the answer. I have no moral right to hammer my ideals into heads that did tremendously more for the project than I did. And maintaining a patch queue over Lucene trunk is not 'that' hard.

It's not about hammering your ideals - that almost feels like what you are doing, but frankly, it doesn't help. If you even just keep prompting the issue as it dies away, you will likely keep progress going. There is a solution that everyone will accept. I promise you that. It's more work than it looks to find that solution and guide it to fruition though. It's fully possible, and I'm sure it will happen eventually. Would have bet even money that Mike had it a few weeks ago. No dice it looks though ;)

I consciously took a bit of an extremist stance in the hope of shifting the mean. Okay, I'll try ditching it in favour of gently bugging people, like Grant did in the comment that spawned this discussion. :)

@Yonik:

You go zealously for back-compat - you sacrifice readability/maintainability of your code but free users from any troubles when they want to 'simply upgrade'. You adopt a more relaxed policy - you sacrifice users' time, but in return you gain a cleaner codebase and new stuff can be written and used faster.

Not sure I agree with that - if changes become too easy you can get a thrashing effect... change just because someone thought it was a little better can lead to more chaos.

You're right. I'm not advocating anarchy. :) But currently we are afraid to break anything at all, and that is as far away from the juste milieu as the chaos you speak of. IMO, changes to interfaces should be clearly better than what existed before. Recent changes to DISI? Were they clearly for the better?
[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717089#action_12717089 ] Earwin Burrfoot commented on LUCENE-1648:
-

As LUCENE-1651 is now committed, this issue can be resolved.

when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
---
Key: LUCENE-1648
URL: https://issues.apache.org/jira/browse/LUCENE-1648
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, LUCENE-1648.patch

While working on LUCENE-1647, I came across this issue... we are failing to carry over hasChanges, norms/deletionsDirty, etc., when cloning the new reader.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717107#action_12717107 ] Earwin Burrfoot commented on LUCENE-1453:
-

bq. There are two possibilities to fix this:

I vote for leaving them open. Yes, it breaches the contract, but the breach is controlled (and thus harmless), and we get rid of some weird code (= a possible point of failure) without introducing new code. There is a way to notice the change in DirectoryReader behaviour, but it is too unrealistic:

{code}
IndexReader r = IndexReader.open("/path/to/index");
...
Directory d = r.directory(); // you have to get the directory reference, as you're not the one who created it
...
r.close();
...
d.doSomething(); // and EXPECT this call to fail with an exception
{code}

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
[jira] Commented: (SOLR-706) Fast auto-complete suggestions
[ https://issues.apache.org/jira/browse/SOLR-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717108#action_12717108 ] Earwin Burrfoot commented on SOLR-706:
--

When I did autocompletion for my project, a simple java.util.TreeMap had superior memory characteristics and almost the same performance as tries. I think it's not worth inventing something elaborate for this task.

Fast auto-complete suggestions
--
Key: SOLR-706
URL: https://issues.apache.org/jira/browse/SOLR-706
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Fix For: 1.5

A lot of users have suggested that facet.prefix in Solr is not the most efficient way to implement an auto-complete suggestion feature. A fast in-memory trie-like structure has often been suggested instead. This issue aims to incorporate a faster/more efficient way to answer auto-complete queries in Solr. Refer to the following discussion on solr-dev: http://markmail.org/message/sjjojrnroo3msugj
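The TreeMap approach is tiny, which is much of its appeal. An illustrative sketch: tailMap(prefix) starts the walk at the first key >= prefix, and the loop stops as soon as a key leaves the prefix range.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class TreeMapSuggester {
  private final TreeMap<String, Integer> terms = new TreeMap<String, Integer>();

  public void add(String term, int weight) {
    terms.put(term, weight);
  }

  public List<String> suggest(String prefix, int max) {
    List<String> out = new ArrayList<String>();
    for (Map.Entry<String, Integer> e : terms.tailMap(prefix).entrySet()) {
      if (!e.getKey().startsWith(prefix)) break; // left the prefix range
      out.add(e.getKey());
      if (out.size() >= max) break;
    }
    return out;
  }
}
{code}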
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717110#action_12717110 ] Earwin Burrfoot commented on SOLR-236:
--

I have implemented collapsing on a high-volume project of mine in a much less flexible, but more practical manner.

Part I. You have to guarantee that all documents having the same value of the collapse-field are dropped into the Lucene index as a sequential batch. That guarantees they get sequential docIds, and, with some more work, that they all end up in the same segment.

Part II. When doing collection you always get docIds in sequential order, and thus, thanks to Part I, you get the docs-to-be-collapsed already grouped by collapse-field, even before you drop the docs into the PriorityQueue to sort them.

Cons: You can only collapse on a single field predetermined at index creation time. If one document changes, you have to reindex all docs that have the same collapse-field value, so it's best if you have either low update/add rates, or few documents sharing the same collapse-field value.

Pros: The CPU and memory costs for collapsing, compared to a usual search, are very close to zero and do not depend on index size or total docs found. The same idea works with the new Lucene per-segment collection and in distributed mode (sharded index). Within a collapsed group you can sort hits however you want, and select the one that will represent the group for the usual sort/paging. The implementation is not brain-dead simple, but it nears it.

Field collapsing
Key: SOLR-236
URL: https://issues.apache.org/jira/browse/SOLR-236
Project: Solr
Issue Type: New Feature
Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Fix For: 1.5
Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch

This patch includes a new feature called "field collapsing": collapsing a group of results with a similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also "duplicate detection": http://www.fastsearch.com/glossary.aspx?m=48amid=299

The implementation adds 3 new query parameters (SolrParams):
* collapse.field to choose the field used to group results
* collapse.type normal (default value) or adjacent
* collapse.max to select how many continuous results are allowed before collapsing

TODO (in progress): more documentation (on source code), test cases.

Two patches: field_collapsing.patch for the current development version, field_collapsing_1.1.0.patch for Solr 1.1.0. P.S.: Feedback and misspelling corrections are welcome ;-)
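A hedged sketch of Part II in 2.9 Collector terms (names and details are mine, not from any attached patch): because equal keys arrive on consecutive docIds, detecting a group border is a single comparison against the previous key.

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

// Keeps the best-scoring doc of each run of equal collapse keys. Assumes
// same-key docs were indexed as one sequential batch within a segment.
public final class AdjacentCollapsingCollector extends Collector {
  private final String keyField;
  private String[] keys;   // per-segment FieldCache values
  private int docBase;
  private Scorer scorer;

  private String currentKey;
  private int bestDoc = -1;
  private float bestScore;

  public AdjacentCollapsingCollector(String keyField) {
    this.keyField = keyField;
  }

  @Override public void setScorer(Scorer scorer) { this.scorer = scorer; }

  @Override public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.keys = FieldCache.DEFAULT.getStrings(reader, keyField);
    this.docBase = docBase;
  }

  @Override public void collect(int doc) throws IOException {
    String key = keys[doc];
    if (currentKey != null && !currentKey.equals(key)) flushGroup();
    float score = scorer.score();
    if (bestDoc == -1 || score > bestScore) { bestDoc = docBase + doc; bestScore = score; }
    currentKey = key;
  }

  // Hand (bestDoc, bestScore) to the surrounding PriorityQueue here;
  // the caller must also flush once after the search finishes.
  private void flushGroup() {
    bestDoc = -1;
  }

  @Override public boolean acceptsDocsOutOfOrder() { return false; } // order is essential
}
{code}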
Re: IR static methods
Index/Commit/SegmentMetadata? Several classes, as you can reflect on various levels of the index. Slightly off-topic - SegmentInfo/SegmentsInfo should really be named Segment/Segments. That's exactly what these objects represent. You don't use names like PreparedStatementInfo or FileInfo or IntegerInfo :) On Fri, Jun 5, 2009 at 02:21, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: We have: $ ff \*Info\*java ./src/java/org/apache/lucene/index/FieldInfo.java ./src/java/org/apache/lucene/index/TermVectorOffsetInfo.java ./src/java/org/apache/lucene/index/SegmentInfo.java ./src/java/org/apache/lucene/index/TermInfosWriter.java ./src/java/org/apache/lucene/index/TermInfo.java ./src/java/org/apache/lucene/index/FieldInfos.java ./src/java/org/apache/lucene/index/SegmentMergeInfo.java ./src/java/org/apache/lucene/index/TermInfosReader.java ./src/java/org/apache/lucene/index/SegmentInfos.java How about IndexInfo? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Earwin Burrfoot ear...@gmail.com To: java-dev@lucene.apache.org Sent: Wednesday, June 3, 2009 8:08:50 AM Subject: IR static methods I have a strong desire to remove all these static methods from IR - lastModified, getCurrentVersion, getCommitUserData, indexExists. But I haven't found a good place for them yet. Directory is a bad place - it shouldn't concern itself with details of what exactly is stored inside, it should think of 'how' it is stored. IndexReader is bad - it is too heavyweight to be created just for getting something simple once. We should probably create some new lightweight class that provides a kind of reflection for the index? Mod dates, versions, userdata, existence, sizes, deletions, whatever. Both per-index and per-segment. Essentially it is a wrapper over SegmentInfos that allows us to keep them hidden (and thus easily changeable), and provides users with a more concise and adequate interface. Any thoughts? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715836#action_12715836 ] Earwin Burrfoot commented on LUCENE-1651: - Seems the patch didn't apply completely. Your line numbers are off, also directory/readOnly are now members of SegmentReader, no way they can't be seen:
{code}
class SegmentReader extends IndexReader implements Cloneable {
  protected Directory directory;
  protected boolean readOnly;

  private String segment;
  private SegmentInfo si;
  private int readBufferSize;
{code}
Here's corresponding part of the patch, I bet $Id$ is the reason.
{code}
-/**
- * @version $Id$
- */
-class SegmentReader extends DirectoryIndexReader {
+/** @version $Id$ */
+class SegmentReader extends IndexReader implements Cloneable {
+  protected Directory directory;
+  protected boolean readOnly;
+
{code}
Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715908#action_12715908 ] Earwin Burrfoot commented on LUCENE-1651: - bq. Patch looks good Earwin, thanks! I believe the readers can be cleaned up further, but I'm short on time and don't want to delay it for another week or two, and then rebase it against updated trunk once again. Might as well do that under a separate issue. bq. I think we should now rename MultiSegmentReader to DirectoryIndexReader? Maybe DirectoryReader instead of DirectoryIndexReader? But all three are in fact okay with me, I really don't have any preference here. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
IR static methods
I have a strong desire to remove all these static methods from IR - lastModified, getCurrentVersion, getCommitUserData, indexExists. But I haven't found a good place for them yet. Directory is a bad place - it shouldn't concern itself with details of what exactly is stored inside, it should think of 'how' it is stored. IndexReader is bad - it is too heavyweight to be created just for getting something simple once. We should probably create some new lightweight class that provides a kind of reflection for the index? Mod dates, versions, userdata, existence, sizes, deletions, whatever. Both per-index and per-segment. Essentially it is a wrapper over SegmentInfos that allows us to keep them hidden (and thus easily changeable), and provides users with a more concise and adequate interface. Any thoughts? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
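For the sake of discussion, such an "index reflection" class could look something like the interface below. It is purely hypothetical - nothing like it exists in Lucene - and would be backed by SegmentInfos without exposing them:
{code}
import java.util.List;
import java.util.Map;

public interface IndexMetadata {
    long lastModified();                  // would replace IndexReader.lastModified(dir)
    long version();                       // would replace IndexReader.getCurrentVersion(dir)
    Map<String, String> commitUserData(); // would replace IndexReader.getCommitUserData(dir)
    int numDocs();
    int numDeletedDocs();
    List<SegmentMetadata> segments();     // the per-segment view

    interface SegmentMetadata {
        String name();
        int docCount();
        int deletedDocCount();
        long sizeInBytes();
    }
}
{code}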
[jira] Commented: (LUCENE-1672) Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715944#action_12715944 ] Earwin Burrfoot commented on LUCENE-1672: - bq. I will later try to solve this problem with the closeDir inside the different IndexReaders (but maybe Earwin has done it already in LUCENE-1651) My issue removes closeDir from SegmentReader, as it cannot 'own' a directory anymore. MSR-to-be-DirectoryReader still has this flag. Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher -- Key: LUCENE-1672 URL: https://issues.apache.org/jira/browse/LUCENE-1672 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1672.patch, LUCENE-1672.patch During investigation of LUCENE-1658, I found out that even LUCENE-1453 is not completely fixed. As 1658 deprecates all FSDirectory.getDirectory() static factories, we should not use them anymore. As the user is now free to choose the correct directory implementation using direct instantiation or using FSDir.open(), he should no longer use all the ctors/methods in IndexWriter/IndexReader/IndexSearcher & Co. that simply take path names as String or File and always instantiate the Directory themselves. LUCENE-1453 currently works for the cached directory implementations from FSDir.getDirectory, but not with uncached, non-refcounting FSDirs. Sometimes reopen() closes the directory (as far as I see, when a SegmentReader changes to a MultiSegmentReader and/or deletes apply). This is hard to track. In Lucene 3.0 we can then remove the whole bunch of closeDirectory parameters/fields in these classes and simply not care anymore about closing directories. To remove this closeDirectory parameter now (before 3.0) and also fix 1453 correctly, an additional idea would be to change these factories that take the File/String to return the IndexReader wrapped by a FilterIndexReader that keeps track of closing the underlying directory after close and reopen. This is simpler than passing this boolean between different DirectoryIndexReader instances. The small performance impact of wrapping with FilterIndexReader should not be so bad, because the method is deprecated and we can state that it is better to use the factory method with a Directory parameter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1672) Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715962#action_12715962 ] Earwin Burrfoot commented on LUCENE-1672: - bq. And DirectoryIR/MSR still have this Flag, but reopening a MSR always returns a MSR again (even if it only consists of one segment)? Exactly. Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher -- Key: LUCENE-1672 URL: https://issues.apache.org/jira/browse/LUCENE-1672 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1672.patch, LUCENE-1672.patch During investigation of LUCENE-1658, I found out that even LUCENE-1453 is not completely fixed. As 1658 deprecates all FSDirectory.getDirectory() static factories, we should not use them anymore. As the user is now free to choose the correct directory implementation using direct instantiation or using FSDir.open(), he should no longer use all the ctors/methods in IndexWriter/IndexReader/IndexSearcher & Co. that simply take path names as String or File and always instantiate the Directory themselves. LUCENE-1453 currently works for the cached directory implementations from FSDir.getDirectory, but not with uncached, non-refcounting FSDirs. Sometimes reopen() closes the directory (as far as I see, when a SegmentReader changes to a MultiSegmentReader and/or deletes apply). This is hard to track. In Lucene 3.0 we can then remove the whole bunch of closeDirectory parameters/fields in these classes and simply not care anymore about closing directories. To remove this closeDirectory parameter now (before 3.0) and also fix 1453 correctly, an additional idea would be to change these factories that take the File/String to return the IndexReader wrapped by a FilterIndexReader that keeps track of closing the underlying directory after close and reopen. This is simpler than passing this boolean between different DirectoryIndexReader instances. The small performance impact of wrapping with FilterIndexReader should not be so bad, because the method is deprecated and we can state that it is better to use the factory method with a Directory parameter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651-tag.patch LUCENE-1651.patch Argh! The rename broke test-tag again :) in new and innovative ways. New patches attached. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651.patch One more version, applies against current trunk without fuzzy hunk matching. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Enhance StandardTokenizer to support words which will not be tokenized
Not sure you can easily marry a generated JFlex grammar and a runtime-provided list of protected words. I took the approach of creating tokens for punctuation inside my tokenizer and later gluing them to nearby text tokens, or dropping them from the stream with a tokenfilter. On Wed, Jun 3, 2009 at 20:10, Grant Ingersoll gsing...@apache.org wrote: You'd have to modify the JFlex grammar. I'd suggest adding in a generic protected words approach whereby you can pass in a list of protected words. This would be a nice patch/improvement. -Grant On Jun 3, 2009, at 4:07 AM, ami dudu wrote: Hi, I'm using a StandardTokenizer which does a great job for me, but I need to enhance it somehow to treat words like c++, c#, .net as-is and not tokenize them into c or net. I know that there are other tokenizers such as KeywordTokenizer and WhitespaceTokenizer but they do not include the StandardTokenizer logic. Any ideas on what is the best way to add this enhancement? Thanks, Amid -- View this message in context: http://www.nabble.com/Enhance-StandardTokenizer-to-support-words-which-will-not-be-tokenized-tp23849495p23849495.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
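The gluing approach is mechanical enough to sketch without tying it to any particular TokenStream API version, so the example below works over a plain token list instead. It only handles trailing punctuation (c++, c#); a leading-punctuation rule (.net) would be symmetric. All names are illustrative, not from an actual filter.
{code}
import java.util.ArrayList;
import java.util.List;

public class PunctuationGluer {
    public static class Token {
        public final String text;
        public final int start, end; // character offsets in the input
        public Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
        boolean isPunctuation() {
            return text.length() == 1 && !Character.isLetterOrDigit(text.charAt(0));
        }
    }

    // The tokenizer is assumed to emit punctuation as separate one-char
    // tokens with exact offsets, so adjacency can be checked precisely.
    public List<Token> glue(List<Token> in) {
        List<Token> out = new ArrayList<Token>();
        for (Token t : in) {
            Token prev = out.isEmpty() ? null : out.get(out.size() - 1);
            if (t.isPunctuation() && prev != null && prev.end == t.start) {
                // punctuation directly follows the previous token: "c" + "+" + "+"
                out.set(out.size() - 1, new Token(prev.text + t.text, prev.start, t.end));
            } else if (!t.isPunctuation()) {
                out.add(t); // ordinary word token
            }
            // lone punctuation with no adjacent word is dropped from the stream
        }
        return out;
    }
}
{code}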
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715973#action_12715973 ] Earwin Burrfoot commented on LUCENE-1630: - Searcher is supposed to be a little cherry of user-friendliness atop a glass of Lucene murky internals, ain't it? I mean, even you had to have the ways of Query, Weight and Scorer explained to you - what would a Lucene neophyte do if we removed his beloved convenience methods? Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin-off of LUCENE-1593. This issue proposes to expose an appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default of false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check the resulting Scorer's isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is very small chance this has actually happened, but would like to confirm with you guys first. 
* Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer that its score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e
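The convenience being defended is roughly the difference below, sketched from memory against the 2.9-era API that this issue shapes - treat the exact signatures as approximate rather than authoritative:
{code}
// what a newcomer writes:
TopDocs hits = searcher.search(query, 10);

// roughly what removing the convenience layer would force on them:
Weight weight = query.weight(searcher);
TopScoreDocCollector collector =
    TopScoreDocCollector.create(10, !weight.scoresDocsOutOfOrder());
searcher.search(weight, null, collector);
TopDocs topDocs = collector.topDocs();
{code}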
[jira] Created: (LUCENE-1677) Remove GCJ IndexReader specializations
Remove GCJ IndexReader specializations -- Key: LUCENE-1677 URL: https://issues.apache.org/jira/browse/LUCENE-1677 Project: Lucene - Java Issue Type: Task Reporter: Earwin Burrfoot Fix For: 2.9 These specializations are outdated, unsupported, most probably pointless due to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you are going to ask people on java-user, anybody replied that they need it?). While giving nothing, they make SegmentReader instantiation code look real ugly. If nobody objects, I'm going to post a patch that removes these from Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715509#action_12715509 ] Earwin Burrfoot commented on LUCENE-1630: - You can't, because Weights produced from the same Query are different for different indexes. You can probably modify the Query in place for a given index, produce some scorers, do scoring, then modify the Query for another index, produce scorers, etc. But now your Query is no longer thread-safe, and I can't reuse it from different threads. So, with all its strange looks, the trio of Q, W, S is still the best approach if you ask me. Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin-off of LUCENE-1593. This issue proposes to expose an appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default of false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check the resulting Scorer's isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. 
That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is very small chance this has actually happened, but would like to confirm with you guys first. * Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer that its score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online
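A skeletal illustration of that layering - immutable Query, per-index Weight, per-reader Scorer - using made-up class names; this is just the thread-safety argument in miniature, not the real Lucene code:
{code}
// Query: immutable, so one instance is safely shared by any number of threads.
public class TermQuerySketch {
    private final String field, term;

    public TermQuerySketch(String field, String term) {
        this.field = field; this.term = term;
    }

    // All per-index state (here, idf) is captured in the Weight, so two
    // threads can weigh the same Query against different indexes at once.
    public WeightSketch createWeight(IndexStats index) {
        double idf = Math.log(1.0 + index.numDocs() / (double) (index.docFreq(field, term) + 1));
        return new WeightSketch((float) idf);
    }

    public static class WeightSketch {
        private final float idf;
        WeightSketch(float idf) { this.idf = idf; }
        // A real Weight hands out one Scorer per reader; the Scorer is the
        // only mutable, single-threaded piece of the trio.
        public float score(int termFreq) { return termFreq * idf * idf; }
    }

    public interface IndexStats {
        int numDocs();
        int docFreq(String field, String term);
    }
}
{code}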
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715672#action_12715672 ] Earwin Burrfoot commented on LUCENE-1651: - Hm.. okay, I've got back to work on this patch. To fix tests relying on getting SR from IR.open() on trunk I introduced a package-private utility method that extracts SR from MSR if it is the only one there. The tests in tags/XXX don't see this method, should I backport it somewhere there? Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651-tag.patch LUCENE-1651.patch Here are the patches for current lucene trunk and back compat tag. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715008#action_12715008 ] Earwin Burrfoot commented on LUCENE-1658: - I told you, Java mmap doesn't work on Windows. And please, don't use the unmap hack! If it doesn't work, it doesn't work. Let's use SimpleFSD for all Windows versions. Look, what are you going to do if you unmap a buffer and then access it by accident? Crash the JVM? Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715016#action_12715016 ] Earwin Burrfoot edited comment on LUCENE-1658 at 6/1/09 1:14 AM: - bq. The buffer is nulled directly after unmapping. Really? Let me quote some code (MacOS, Java 1.6):
{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}
Does windows version differ? What we see here is 'zeroing', not 'nulling'. When doing get/set, buffer never checks for address to have sense, so the next access will yield a GPF :) The guys from Sun explained the absence of unmap() in the original design - the only way of closing mapped buffer and not getting unpredictable behaviour is to introduce a synchronized isClosed check on each read/write operation, which kills the performance even if the sync method used is just a volatile variable. was (Author: earwin): Really? Let me quote some code (MacOS, Java 1.6):
{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}
Does windows version differ? What we see here is 'zeroing', not 'nulling'. When doing get/set, buffer never checks for address to have sense, so the next access will yield a GPF :) The guys from Sun explained the absence of unmap() in the original design - the only way of closing mapped buffer and not getting unpredictable behaviour is to introduce a synchronized isClosed check on each read/write operation, which kills the performance even if the sync method used is just a volatile variable. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715016#action_12715016 ] Earwin Burrfoot commented on LUCENE-1658: - Really? Let me quote some code (MacOS, Java 1.6):
{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}
Does windows version differ? What we see here is 'zeroing', not 'nulling'. When doing get/set, buffer never checks for address to have sense, so the next access will yield a GPF :) The guys from Sun explained the absence of unmap() in the original design - the only way of closing mapped buffer and not getting unpredictable behaviour is to introduce a synchronized isClosed check on each read/write operation, which kills the performance even if the sync method used is just a volatile variable. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
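For context, the unmap hack being debated is the reflection-based Cleaner invocation along these lines (Sun/Oracle JVMs of that era; shown only to make clear why an access after unmap faults):
{code}
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;

public final class UnmapHack {
    // Digs out the buffer's sun.misc.Cleaner via reflection and runs it,
    // munmap()ing the region immediately instead of waiting for GC. Any
    // thread touching the buffer afterwards dereferences freed address
    // space - hence the GPF / silent JVM death discussed below.
    public static void unmap(MappedByteBuffer buffer) throws Exception {
        Method cleanerMethod = buffer.getClass().getMethod("cleaner");
        cleanerMethod.setAccessible(true);
        Object cleaner = cleanerMethod.invoke(buffer);
        cleaner.getClass().getMethod("clean").invoke(cleaner);
    }
}
{code}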
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715018#action_12715018 ] Earwin Burrfoot commented on LUCENE-1658: - Ah! You were referring to your code. It's still not thread-safe. Someone could access the closed buffer before it sees the now-null reference to it. You also employ the hack on non-Windows machines, which work quite well without it. What for? Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715026#action_12715026 ] Earwin Burrfoot commented on LUCENE-1658: - I tested on MacOS: "Invalid memory access of location 8b55a000 rip=0110c367" - here the JVM quietly dies: non-zero return code, all threads are killed, no diagnostic files created. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715027#action_12715027 ] Earwin Burrfoot commented on LUCENE-1658: - bq. It uses less virtual memory :) 64bit systems have an abundance of said valuable resource. Why taint them with dangerous hacks for the sake of zero returns? Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715057#action_12715057 ] Earwin Burrfoot commented on LUCENE-1658: - bq. I'm a bit nervous about creating MMapDirectory automatically for any OS, not just Windows. It's almost okay for 64bit systems. bq. The hack also saves transient disk space, on all systems, right? That's a nice catch. Now I have some of the non-buggy-but-weird behaviour my app exhibits explained. bq. But they have a 64 bit buffer, so you could use it instead of many buffers. They don't. When the NIO2 project was merged into OpenJDK, they left some stuff unmerged, including 64bit buffers. Currently they aren't present in OpenJDK and Java7 preview builds, and not even a rough estimate is given on whether they are going to make it through, and when. bq. Maybe we should move this hack to contrib (a class that extends MMapDirectory by adding a close method) with a big warning! I support this. The hack has some merits if carefully applied, but is simply too dangerous to ship as the default. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715063#action_12715063 ] Earwin Burrfoot edited comment on LUCENE-1658 at 6/1/09 4:16 AM: - bq. On a couple of projects I've worked in, they were very reluctant to having packages allocate memory outside the JVM, and that's my understanding of memory mapped buffers. mmap does not allocate memory. It allocates address space, and uses the same disk cache the system already has. For example, you can't cause an OOM in your (or another co-existing) app with mmaps (except eating up your own address space on 32bit systems). bq. But if you decide to include MMapDir in that auto-create logic, I hope there will be a way to instantiate a specific FSDir, in case we'll have problems with that logic. Public constructors for all D variants are a must, and for me they are the best that this patch has to offer :) was (Author: earwin): bq. On a couple of projects I've worked in, they were very reluctant to having packages allocate memory outside the JVM, and that's my understanding of memory mapped buffers. mmap does not allocate memory. It allocates address space, and uses the same disk cache the system already has. For example, you can't cause an OOM in your (or another co-existing) app with mmaps (except eating up your own address space on 32bit systems). Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715063#action_12715063 ] Earwin Burrfoot commented on LUCENE-1658: - bq. On a couple of projects I've worked in, they were very reluctant to having packages allocate memory outside the JVM, and that's my understanding of memory mapped buffers. mmap does not allocate memory. It allocates address space, and uses the same disk cache the system already has. For example, you can't cause an OOM in your (or another co-existing) app with mmaps (except eating up your own address space on 32bit systems). Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
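Assuming the public constructors this patch introduces, picking a variant explicitly would look like the sketch below. It matches what eventually shipped in 2.9, but is written from the discussion rather than from final code, so treat it as illustrative:
{code}
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.store.SimpleFSDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws Exception {
        File path = new File(args[0]);
        Directory auto   = FSDirectory.open(path);      // heuristic default per OS
        Directory simple = new SimpleFSDirectory(path); // plain java.io, safe everywhere
        Directory nio    = new NIOFSDirectory(path);    // concurrent reads; slow on Windows
        Directory mmap   = new MMapDirectory(path);     // address space permitting (64-bit)
    }
}
{code}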