Re: TestCodecs running time
See you already did that, Mike :). Thanks! Now the tests run for 2s. Shai On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: It's also slow because it repeats all the tests for each of the core codecs (standard, sep, pulsing, intblock). I think it's fine to reduce the number of iterations -- just make sure there's no seed to newRandom() so the distributed testing is effective. Mike On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera ser...@gmail.com wrote: Hi, I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35 and 40 seconds. Is that expected? The reason it runs so long seems to be that its threads each make 4000 iterations ... is that really required to ensure correctness? Shai - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
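Mike's suggestion -- fewer iterations, but no fixed seed to newRandom(), so repeated/distributed runs cover different cases -- can be sketched roughly like this. This is a minimal illustration with made-up names and constants (ITERATIONS, randomDocFreqs, the 1..20 range); it is not TestCodecs' actual code:

```java
import java.util.Random;

public class RandomizedIterations {
    // Reduced per-run iteration count; coverage accumulates across runs
    // because each run draws a fresh, unfixed seed.
    static final int ITERATIONS = 100;

    // Example of per-iteration random test data (doc freqs in 1..20).
    public static int[] randomDocFreqs(Random random, int n) {
        int[] freqs = new int[n];
        for (int i = 0; i < n; i++) {
            freqs[i] = 1 + random.nextInt(20);
        }
        return freqs;
    }
}
```

The point is that the `new Random()` lives in the test harness, not in the data generator, so no run pins itself to one seed.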
SnapshotDeletionPolicy throws NPE if no commit happened
SDP throws NPE if the index includes no commits but snapshot() is called. This is an extreme case, but it can happen if one takes snapshots (for backup purposes, for example) in a separate code path from indexing, and does not know whether commit was ever called. I think we should throw an IllegalStateException instead of failing with an NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... but I prefer the ISE. What do you think? Shai
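The proposed guard could look roughly like the following sketch. The class and members here are illustrative stand-ins, not Lucene's actual SnapshotDeletionPolicy internals:

```java
public class SnapshotPolicySketch {
    private Object lastCommit; // the real policy tracks IndexCommit objects

    // Called by the indexing side once a commit exists.
    public void onCommit(Object commit) {
        lastCommit = commit;
    }

    public Object snapshot() {
        if (lastCommit == null) {
            // Proposed behavior: a descriptive ISE instead of an NPE.
            throw new IllegalStateException(
                "No index commit to snapshot; call IndexWriter.commit() at least once first");
        }
        return lastCommit;
    }
}
```

The alternative discussed (returning null) would push the null check onto every caller, which is why the explicit exception reads as the safer API.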
Re: SnapshotDeletionPolicy throws NPE if no commit happened
Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to hit. I was just asking whether we want to add explicit and clear protective code for it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears. On Thu, Apr 15, 2010 at 11:16, Shai Erera ser...@gmail.com wrote: SDP throws NPE if the index includes no commits, but snapshot() is called. This is an extreme case, but can happen if one takes snapshots (for backup purposes for example) in a separate code segment than indexing, and does not know if commit was called or not. I think we should throw an IllegalStateException instead of falling on NPE, w/ a descriptive message. Alternatively, we can just return null and document it ... But I prefer the ISE instead. What do you think? Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
Well ... I think that version numbers mean more than we'd like them to mean, as people perceive them. Let's discuss the format X.Y.Z: When X is changed, it should mean something 'big' happened - index structure has changed (e.g. the flexible scoring work), a new Java version is supported (Java 1.6), and even stuff like 'flex', which includes statements like if you don't want your app to slow down, consider reindexing. Such things signal a major change in Lucene, sometimes even just policy changes (Java version supported), and therefore I think we should reserve the ability to bump X when such things happen. Another thing is the index structure back-compat policy - today Lucene supports the X-1 index structure, but during upgrades of X.Y versions, your segments are gradually migrated. Eventually, when you upgrade to 4.0 you should know whether you have a 2.x index, and call optimize just in case if you're not sure it's not migrated yet (if you've upgraded to 3.x). If we start bumping up 'X' too often, we'll either need to change the X-1 policy to X-N, which will just complicate matters for users, or we'll keep the X-1 policy, but people will need to call optimize more frequently. Y should change on a regular basis, and no back-compat, API-wise or index runtime-wise, is guaranteed. So the Collector and per-segment searches in 2.9 could go w/o deprecating tons of API, as could the TokenStream work. Changes to Analyzer's runtime capabilities will also be allowed between Y revisions. Z should change when bugs are fixed, or when features are backported. Really ... we rarely fix bugs on a released Y branch, and I don't expect tons of features will be backported to a Y branch (to create a Z+1 release). Therefore this should not confuse anyone. So all I'm saying is that instead of increasing X whenever the API, index structure or runtime behavior has changed, I'm simply proposing to differentiate between really major changes and those that just say 'we're not back-compat compliant'.
But above all, I'd like to see this change happening, so if I need to surrender to the X vs. X+Y approach, I will. Just think it will create some confusion. BTW, w/ all that - does it mean 'backwards' can be dropped, or at least test-backwards activated only on a branch which we decide needs it? That'll be really great. Shai On Thu, Apr 15, 2010 at 10:24 AM, Earwin Burrfoot ear...@gmail.com wrote: We can remove Version, because all incompatible changes go straight to a new major release, which we release more often, yes. 3.x is going to be released after 4.0 if bugs are found and fixed, or if people ask to backport some (minor?) features, and some dev has time for this. The question of what to call major release in X.Y.Z scheme - X or Y, is there, but immaterial :) I think it's okay to settle with X.Y, we have major releases and bugfixes, what that third number can be used for? On Thu, Apr 15, 2010 at 09:29, Shai Erera ser...@gmail.com wrote: So then I don't understand this: {quote} * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release right? 
Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out, unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor as X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote: Ahh ... a dream finally comes true ... what
Re: SnapshotDeletionPolicy throws NPE if no commit happened
BTW, even if it's a stupid thing to do, someone can today create SDP and call snapshot without ever creating IW. And it's not an impossible scenario. Consider a backup-aware application which creates SDP first, then passes it to the indexing process and the backup process, separately. The backup process doesn't need to know of IW at all, and might call snapshot() before IW was even created, and SDP.onInit was called. It's a possibility, not saying it's a great and safe architecture. So this is really about do we want to write clear protective code, or allow the NPE? Shai 2010/4/15 Shai Erera ser...@gmail.com Well ... one can still call commit() or close() right after IW creation. And this is a very rare case to be hit by. Was just asking about whether we want to add an explicit and clear protective code about it or not. Shai On Thu, Apr 15, 2010 at 10:26 AM, Earwin Burrfoot ear...@gmail.com wrote: We should just let IW create a null commit on an empty directory, like it always did ;) Then a whole class of such problems disappears.
Re: Proposal about Version API relaxation
Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. Up until now, Lucene migrated my segments gradually, and before I upgraded from X+1 to X+2 I could run optimize() to ensure my index will be readable by X+2. I don't think I can myself agree to it, let alone convince all the stakeholders in my company who adopt Lucene today in numerous projects, to let go of such a capability. We've been there before (requiring reindexing on version upgrades) w/ some offerings, and customers simply didn't like it and were forced to use an enterprise-class search engine which offered less (and didn't use Lucene, up until recently!) ... until we moved to Lucene. What's Solr's take on it? I differentiate between structural changes and runtime changes. I, myself, don't mind if we let go of back-compat support for runtime changes, such as those generated by analyzers, for a couple of reasons, the most important ones being (1) these are not so frequent (but neither are index structural changes) and (2) that's a decision I, as the application developer, make - using or not a newer version of an Analyzer. I don't mind working hard to make a 2.x Analyzer version work in the 3.x world, but I cannot make a 2.x index readable by a 3.x Lucene jar if the latter doesn't support it. That's the key difference, in my mind, between the two. I can choose not to upgrade at all to a newer analyzer version ... but I don't want to be forced to stay w/ older Lucene versions and features because of that ... well, people might say that it's not Lucene's problem, but I beg to differ.
Lucene benefits from wider and faster adoption, and we rely on new features to be adopted quickly. That might be jeopardized if we let go of that strong capability, IMO. What we can do is provide an index migration tool ... but personally I don't know what's the difference between that and gradually migrating segments as they are merged, code-wise. I mean - it has to be the same code. Only an index migration tool may take days to complete on a very large index, while the ongoing migration takes ~0 time when you come to upgrade to a newer Lucene release. And the note about Terrier requiring reindexing ... well, I can't say it's a strength of it, but a damn big weakness IMO. About the release pace, I don't think we can suddenly release every 2 years ... makes people think the project is stuck. And some out there are not so fond of using a 'trunk' version and releasing it w/ their products, because trunk is perceived as ongoing development (which it is) and thus less stable, or is likely to change, and most importantly harder to maintain (as the consumer). So I still think we should release more often than not. That's why I wanted to differentiate X and Y, but I don't mind if we release just X ... if that's so important to people. BTW Mike, Eclipse's releases are like Lucene's, and in fact I don't know of so many projects that just release X ... many of them seem to release X.Y. I don't understand why we're treating this as an all-or-nothing thing. We can let go of API back-compat, which clearly has no effect on index structure and content. We can even let go of index runtime changes for all I care. But I simply don't think we can let go of index structure back-support. Shai On Thu, Apr 15, 2010 at 1:12 PM, Michael McCandless luc...@mikemccandless.com wrote: 2010/4/15 Shai Erera ser...@gmail.com: One way is to define 'major' as X and minor X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former.
I prefer X.Y, ie, changes to Y only is a minor release (mostly bug fixes but maybe small features); changes to X is a major release. I think that's more standard, ie, people will generally grok that 3.3 - 4.0 is a major change but 3.3 - 3.4 isn't. So this proposal would change how Lucene releases are numbered. Ie, the next release would be 4.0. Bug fixes / small features would then be 4.1. Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. No... in the proposal, you must re-index on upgrading to the next major release (3.x - 4.0). I think supporting old indexes, badly (what we do today) is not a great solution. EG on upgrading to 3.1 you'll immediately see a search perf hit since the flex emulation layer is running. It's a trap. It's this freedom, I think, that'd let us drop Version entirely. It's the back-compat of the index that is the major driver for having Version today (eg so that the analyzers can produce tokens matching your old index). EG Terrier seems
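The Version mechanism Mike mentions -- analyzers producing tokens that still match an old index -- follows a gating pattern that can be sketched like this. This is a toy model: the enum constants, the normalize method, and the lowercasing change are all invented for illustration and are not Lucene's real API or analyzers:

```java
import java.util.Locale;

public class VersionGate {
    // Illustrative version constants, ordered oldest to newest.
    public enum Ver { LUCENE_29, LUCENE_30 }

    // A hypothetical analysis fix gated on the version the application
    // was built against, so tokens keep matching an old index.
    public static String normalize(Ver matchVersion, String token) {
        if (matchVersion.compareTo(Ver.LUCENE_30) >= 0) {
            return token.toLowerCase(Locale.ROOT); // new, corrected behavior
        }
        return token; // legacy behavior preserved for old indexes
    }
}
```

Dropping index back-compat across major releases is what would make this kind of version parameter unnecessary: with a mandatory reindex, there is no old index whose tokens must be matched.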
Re: Proposal about Version API relaxation
Thanks Danil - you reminded me of another reason why reindexing is impossible - fetching the data, even if it's available, is too damn costly. Robert, I think you're driven by Analyzer changes ... been too much around them, I'm afraid :). A major version upgrade is a move to Java 1.5, for example. I can do that, and I don't see why I need to reindex my data because of that. And I simply don't buy that do this work on your own ... people can take a snapshot of the code, maintain it separately, and you'll never hear back from them. Who benefits - neither! It's open source - true, but it's way past the Hey look, I'm a new open source project w/ a dozen users, I can do whatever I want. Lucene is a respected open source project, w/ serious adoption and deployments. People trust the select few committers here to do it right for them, so they don't need to invest the time and resources in developing core IR stuff. And now you're pushing a do it yourself approach? I simply don't get or buy it. When were you stuck w/ maintaining a backwards change because the index structure changed? I bet not so many of us, or shall I say just the few Mikes out there? So how hard is it to require such back-compat support? I wholeheartedly agree that we shouldn't keep back-compat on Analyzer changes, nor on bugs such as the one which changed the position of the field from -1 to 0 (a while ago - don't remember the exact details). Shai On Thu, Apr 15, 2010 at 3:17 PM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view.
I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
I can live w/ that Earwin ... I prefer the ongoing upgrades still, but I won't hold off the back-compat policy change vote because of that. Shai On Thu, Apr 15, 2010 at 3:30 PM, Earwin Burrfoot ear...@gmail.com wrote: I think an index upgrade tool is okay? While you still definetly have to code it, things like if idxVer==m doOneStuff elseif idxVer==n doOtherStuff else blowUp are kept away from lucene innards and we all profit? On Thu, Apr 15, 2010 at 16:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. 
Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from Version 4 and don't want to reindex, and are willing to do the work to port it back to Version 3 in a completely backwards compatible way, then under this new scheme it can happen. -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
Well ... I could argue that it's you who miss the point :). I completely don't buy the all the new features comment -- how many new features in a major release force you to consider reindexing? Yet there are many of them that change the API. How will I know whether a release supports my index or not? Why do I need to work hard to back-port all the newly developed issues onto a branch I use? How many of those branches will exist? Will they all run nightly unit tests? Can I cut a release of such a branch myself? Or will I need the PMC or a VOTE? This will get complicated pretty fast ... Lucene is not a do it yourself kit - we try so hard to have the best defaults, best out of the box experience ... best everything for our users. Even w/ Analyzers we try so damn hard. While we could have simply componentized everything and told the users you can use those filters, tokenizers, segment mergers, policies etc. to make up your indexing application ... And I don't think there are features out there that exist and are not contributed because people are afraid of the index format changes ... obviously if they have done it, they're past the fear of handling index format ... I'd like to hear of one such feature. I'd bet there are such out there that are not contributed for IP, business and laziness reasons. Shai On Thu, Apr 15, 2010 at 3:56 PM, Robert Muir rcm...@gmail.com wrote: I think you guys miss the entire point. The idea that you can keep getting all the new features without reindexing is merely an illusion. Instead, features simply aren't being added at all, because the policy makes it too cumbersome. Why is it problematic to have a different SVN branch/release, with lots of new features, but requires you to reindex and change your app? If its too difficult to reindex, it doesnt break your app that features exist elsewhere that you cannot access.
Its the same as it is today, there are features you cannot access, except they do not even exist in apache SVN at all, even trunk, because of these problems. On Thu, Apr 15, 2010 at 8:42 AM, Earwin Burrfoot ear...@gmail.com wrote: I like the idea of index conversion tool over silent online upgrade because it is 1. controllable - with online upgrade you never know for sure when your index is completely upgraded, even optimize() won't help here, as it is a noop for already-optimized indexes 2. way easier to write - as flex shows, index format changes are accompanied by API changes. Here you don't have to emulate new APIs over old structures (can be impossible for some cases?), you only have to, well, convert. On Thu, Apr 15, 2010 at 16:32, Danil ŢORIN torin...@gmail.com wrote: All I ask is a way to migrate existing indexes to newer format. On Thu, Apr 15, 2010 at 15:21, Robert Muir rcm...@gmail.com wrote: its open source, if you feel this way, you can put the work to add features to some version branch from trunk in a backwards compatible way. Then this branch can have a backwards-compatible minor release with new features, but nothing ground-breaking. but this kinda stuff shouldnt hinder development on trunk. On Thu, Apr 15, 2010 at 8:17 AM, Danil ŢORIN torin...@gmail.com wrote: Sometimes it's REALLY impossible to reindex, or has absolutely prohibitive cost to do in a running production system (i can't shut it down for maintainance, so i need a lot of hardware to reindex ~5 billion documents, i have no idea what are the costs to retrieve that data all over again, but i estimate it to be quite a lot) And providing a way to migrate existing indexes to new lucene is crucial from my point of view. I don't care what this way is: calling optimize() with newer lucene or running some tool that takes 5 days, it's ok with me. Just don't put me through full reindexing as I really don't have all that data anymore. 
It's not my data, i just receive it from clients, and provide a search interface. It took years to build those indexes, rebuilding is not an option, and staying with old lucene forever just sucks. Danil. On Thu, Apr 15, 2010 at 14:57, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 7:52 AM, Shai Erera ser...@gmail.com wrote: Well ... I must say that I completely disagree w/ dropping index structure back-support. Our customers will simply not hear of reindexing 10s of TBs of content because of version upgrades. Such a decision is key to Lucene adoption in large-scale projects. It's entirely not about whether Lucene is a content store or not - content is stored on other systems, I agree. But that doesn't mean reindexing it is tolerable. I don't understand how its helpful to do a MAJOR version upgrade without reindexing... what in the world do you stand to gain from that? The idea here, is that development can be free of such hassles. Development should be this way. If you, Shai, need some feature X.Y.Z from
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857388#action_12857388 ] Shai Erera commented on LUCENE-2396: Robert, I think this is great! Can we move more analyzers from core here? I think, however, that a backwards section in CHANGES is important because it alerts users about those analyzers whose runtime behavior changed. Otherwise how would the poor users know that? It doesn't mean you need to maintain back-compat support, but at least alert them when things change. Even if we eventually decide to remove API bw completely, a section in CHANGES will still be required to help users upgrade easily. remove version from contrib/analyzers. -- Key: LUCENE-2396 URL: https://issues.apache.org/jira/browse/LUCENE-2396 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 3.1 Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2396.patch Contrib/analyzers has no backwards-compatibility policy, so let's remove Version so the API is consumable. if you think we shouldn't do this, then instead explicitly state and vote on what the backwards compatibility policy for contrib/analyzers should be instead, or move it all to core. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (LUCENE-2396) remove version from contrib/analyzers.
[ https://issues.apache.org/jira/browse/LUCENE-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857396#action_12857396 ] Shai Erera commented on LUCENE-2396: Static? Weren't you against that!? But if we remove back compat from analyzers, why do we need Version? Or is it the API bw that we remove?
Re: Proposal about Version API relaxation
I seriously don't understand the fuss around index format back compat. How many times has this changed, such that it is too much to ask that X support X-1? I prefer to have ongoing segment merging, but can live w/ a manual converter tool. Thing is - I'll probably not be able to develop one myself outside the scope of Lucene, because I'll miss tons of API. So having Lucene declare it and adhere to it seems reasonable to me. BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration of the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. At the end of the day, the same migration code will need to be written, whether for the manual or the automatic case, and probably by the same developer who changed the index format. It's the difference of when it happens. And I also think that a manual migration tool will need access to some lower-level API which is not exposed today, and will generally not perform as well as online migration. But that's a side note... Shai On Thursday, April 15, 2010, Earwin Burrfoot ear...@gmail.com wrote: I'd like to remind that Mike's proposal has stable branches. We can branch off preflex trunk right now and wrap it up as 3.1. Current trunk is declared as future 4.0 and all backcompat cruft is removed from it. If some new features/bugfixes appear in trunk, and they don't break stuff - we backport them to 3.x branch, eventually releasing 3.2, 3.3, etc Thus, devs are free to work without back-compat burden, bleeding edge users get their blood, conservative users get their stability + a subset of new features from stable branches. On Thu, Apr 15, 2010 at 22:02, DM Smith dmsmith...@gmail.com wrote: On 04/15/2010 01:50 PM, Earwin Burrfoot wrote: First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index.
And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously. Optimize is a bad way to convert to current. 1. conversion is not guaranteed, optimizing already optimized index is a noop 2. it merges all your segments. if you use BalancedSegmentMergePolicy, that destroys your segment size distribution Dedicated upgrade tool (available both from command-line and programmatically) is a good way to convert to current. 1. conversion happens exactly when you need it, conversion happens for sure, no additional checks needed 2. it should leave all your segments as is, only changing their format It is my observation, though possibly not correct, that core only has rudimentary analysis capabilities, handling English very well. To handle other languages well contrib/analyzers is required. Until recently it did not get much love. There have been many bw compat breaking changes (though w/ version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be core. My guess is that most non-trivial applications will use contrib/analyzers. I counter - most non-trivial applications will use their own analyzers. The more modules - the merrier. You can choose precisely what you need. By and large an analyzer is a simple wrapper for a tokenizer and some filters. Are you suggesting that most non-trivial apps write their own tokenizers and filters? I'd find that hard to believe. For example, I don't know enough Chinese, Farsi, Arabic, Polish, ... to come up with anything better than what Lucene has to tokenize, stem or filter these. Our user base are those with ancient, underpowered laptops in 3-rd world countries. On those machines it might take 10 minutes to create an index and during that time the machine is fairly unresponsive. There is no opportunity to do it in the background. Major Lucene releases (feature-wise, not version-wise) happen like once in a year, or year-and-a-half. 
Is it that hard for your users to wait ten minutes once a year? I said that was for one index. Multiply that times the number of books available (300+) and yes, it is too much to ask. Even if a small subset is indexed, say 30, that's around 5 hours of waiting. Under consideration is the frequency of breakage. Some are suggesting a greater frequency than yearly. DM - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
The reason Earwin why online migration is faster is because when u finally need to *fully* migrate your index, most chances are that most of the segments are already on the newer format. Offline migration will just keep the application idle for some amount of time until ALL segments are migrated. During the lifecycle of the index, segments are merged anyway, so migrating them on the fly virtually costs nothing. At the end, when u upgrade to a Lucene version which doesn't support the previous index format, you'll in the worst case need to migrate a few large segments which were never merged. I don't know how many of those there will be as it really depends on the application, but I'd bet this process will touch just a few segments. And hence, throughput wise it will be a lot faster. We should create a migrate() API on IW which will touch just those segments and not incur a full optimize. That API can also be used for an offline migration tool, if we decide that's what we want. Shai On Thursday, April 15, 2010, jm jmugur...@gmail.com wrote: Not sure if plain users are allowed/encouraged to post in this list, but wanted to mention (just an opinion from a happy user), as other users have, that not all of us can reindex just like that. It would not be 10 min for one of our installations for sure... First, i would need to implement some code to reindex, cause my source data is postprocessed/compressed/encrypted/moved after it arrives to the application, so I would need to retrieve all etc. And then reindexing it would take days. javier On Thu, Apr 15, 2010 at 9:04 PM, Earwin Burrfoot ear...@gmail.com wrote: BTW Earwin, we can come up w/ a migrate() method on IW to accomplish manual migration on the segments that are still on old versions. That's not the point about whether optimize() is good or not. It is the difference between telling the customer to run a 5-day migration process, or a couple of hours. 
At the end of the day, the same migration code will need to be written whether for the manual or automatic case. And probably by the same developer which changed the index format. It's the difference of when it happens. Converting stuff is easier than emulating, that's exactly why I want a separate tool. There's no need to support cross-version merging, nor to emulate old APIs. I also don't understand why offline migration is going to take days instead of hours for online migration?? WTF, it's gonna be even faster, as it doesn't have to merge things. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
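The migrate()-touches-only-old-segments idea discussed above can be sketched with a toy model. Everything here (Segment, FORMAT_CURRENT, the migrate method itself) is an illustrative stand-in, not Lucene's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed migrate(): rewrite only segments whose on-disk
// format is older than the current one, leaving already-current segments
// untouched -- unlike optimize(), which merges everything.
public class MigrateSketch {
    static final int FORMAT_CURRENT = 4; // hypothetical current format version

    static class Segment {
        final String name;
        int format;
        Segment(String name, int format) { this.name = name; this.format = format; }
    }

    // Returns the number of segments that actually had to be rewritten.
    static int migrate(List<Segment> segments) {
        int rewritten = 0;
        for (Segment s : segments) {
            if (s.format < FORMAT_CURRENT) {
                s.format = FORMAT_CURRENT; // stand-in for rewriting the segment's files
                rewritten++;
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("_0", 3)); // old-format segment, never merged since upgrade
        index.add(new Segment("_1", 4)); // already migrated by ordinary merging
        index.add(new Segment("_2", 4));
        System.out.println(migrate(index)); // only _0 is touched
    }
}
```

The point being argued: in a long-lived index most segments get rewritten by ordinary merging anyway, so a migrate() like this typically touches only the few large, old segments.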
Re: Proposal about Version API relaxation
+1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and see what happens'. But I think we need to at least decide what we're going to do so it's clear to everyone. Because I'd like to know if I'm about to propose an index format change, whether I need to build migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. But +1 for changing something ! Analyzers at first, API second. Shai On Thu, Apr 15, 2010 at 10:52 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Apr 15, 2010 at 3:50 PM, Robert Muir rcm...@gmail.com wrote: for now simply moving analyzers to its own jar filE would be a great step! +1 -- why not consolidate all analyzers now? (And fix indexer to require a minimal API = TokenStream minus reset close). Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API relaxation
Grant ... you've made it - the 100th response to that thread. Do we keep records somewhere? :) Ok I'm simply proposing to define 'index back-compat' as index format back-compat. With that, we don't 'wait' for something to happen, we just say up front that if that changes, we provide a migration tool for the latest index format version. Simple as that. The rest, we can 'see what happens' ... Shai On Thu, Apr 15, 2010 at 11:29 PM, Grant Ingersoll gsing...@apache.orgwrote: On Apr 15, 2010, at 4:21 PM, Shai Erera wrote: +1 on the Analyzers as well. Earwin, I think I don't mind if we introduce migrate() elsewhere rather than on IW. What I meant to say is that if we stick w/ index format back-compat and ongoing migration, then such a method would be useful on IW for customers to call to ensure they're on the latest version. But if the majority here agree w/ a standalone tool, then I'm ok if it sits elsewhere. Grant, I'm all for 'just doing it and see what happens'. But I think we need to at least decide what we're going to do so it's clear to everyone. Because I'd like to know if I'm about to propose an index format change, whether I need to build migration tool or not. Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. As I said, we should strive for index compatibility, but even in the past, we said we did, but the implications weren't always clear. I think index compatibility is very important. I've seen plenty of times where reindexing is not possible. But even then, you still have the option of testing to find out whether you can update or not. If you can't update, then don't until you can figure out how to do it. FWIW, I think our approach is much more proactive than see what happens. 
I'd argue that in the past, our approach was see what happens, only the seeing didn't happen until after the release! -Grant
Re: Proposal about Version API relaxation
Robert ... I'm sorry but changes to Analyzers don't *force* people to reindex. They can simply choose not to use the latest version. They can choose not to upgrade a Unicode version. They can copy the entire Analyzer code to match their needs. Index format changes is what I'm worried about because that *forces* people to reindex. Analyzers, believe it or not, are just a tool, an out of the box tool even, we're giving users to analyze their stuff. Probably a tool used by most of our users, but not all. Some have their own tools, that are currently wrapped as a Lucene Analyzer just because the API mandates. But we were talking about that too recently no? Ripping Analyzer off IndexWriter? Just to be clear - I think your work on Analyzers is fantastic ! Really ! Seriously ! But it's a choice someone can make ... whereas index format is a given - you have to live with it, or never upgrade Lucene. But I think we've chewed that way too much. I am all for removing bw on Analyzers, and 2396 is a great step towards it (or maybe it is IT?). Even index format - I don't see when it will change next (but I think I have an idea ...), so we can tackle it then. Shai On Thu, Apr 15, 2010 at 11:33 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote: Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. just look at the 1.8MB of backwards compat code in contrib/analyzers i want to remove in LUCENE-2396? are you serious? I wrote most of that cruft to prevent reindexing and you are trying to say I don't understand the fuss about it? 
We shouldn't make people reindex, but we should have the chance, even if we only do it ONE TIME, to reset Lucene to a new Major Version that has a bunch of stuff fixed we couldn't fix before, and more flexibility. Because with the current policy, it's like we are in 1.x forever, and our version numbers are a joke! -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
By all means Robert ... by all means :). Remember who started that thread, and for what reason :D. Shai On Fri, Apr 16, 2010 at 12:01 AM, Robert Muir rcm...@gmail.com wrote: If you really believe this. then you have no problem if i remove all Version from all core and contrib analyzers right now. On Thu, Apr 15, 2010 at 4:50 PM, Shai Erera ser...@gmail.com wrote: Robert ... I'm sorry but changes to Analyzers don't *force* people to reindex. They can simply choose not to use the latest version. They can choose not to upgrade a Unicode version. They can copy the entire Analyzer code to match their needs. Index format changes is what I'm worried about because that *forces* people to reindex. Analyzers, believe it or not, are just a tool, an out of the box tool even, we're giving users to analyze their stuff. Probably a tool used by most of our users, but not all. Some have their own tools, that are currently wrapped as a Lucene Analyzer just because the API mandates. But we were talking about that too recently no? Ripping Analyzer off IndexWriter? Just to be clear - I think your work on Analyzers is fantastic ! Really ! Seriously ! But it's a choice someone can make ... whereas index format is a given - you have to live with it, or never upgrade Lucene. But I think we've chewed that way too much. I am all for removing bw on Analyzers, and 2396 is a great step towards it (or maybe it is IT?). Even index format - I don't see when it will change next (but I think I have an idea ...), so we can tackle it then. Shai On Thu, Apr 15, 2010 at 11:33 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 15, 2010 at 4:21 PM, Shai Erera ser...@gmail.com wrote: Actually, I'd like to know if people like Robert (basically those who have no problem to reindex and don't understand the fuss around it) will want to change the index format - can I count on them to be asked to provide such tool? That's to me a policy we should decide on ... whatever the consequences. 
just look at the 1.8MB of backwards compat code in contrib/analyzers i want to remove in LUCENE-2396? are you serious? I wrote most of that cruft to prevent reindexing and you are trying to say I don't understand the fuss about it? We shouldnt make people reindex, but we should have the chance, even if we only do it ONE TIME, to reset Lucene to a new Major Version that has a bunch of stuff fixed we couldnt fix before, and more flexibility. because with the current policy, its like we are in 1.x forever our version numbers are a joke! -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Proposal about Version API relaxation
DM I think ICU is great. But currently we use JFlex and you can run Java 10 if you want, but as long as JFlex is compiled w/ Java 1.4, that's what you'll get. Luckily Uwe and Robert recently bumped it up to Java 1.5. Such a change should be clearly documented in CHANGES so people are aware of this, and at least until they figure out what they want to do with it, they should take the pre-3.1 analyzers (assuming that's the next release w/ JFlex 1.5 tokenizers) and use them. Alternatively, we can think of writing an ICU analyzer/tokenizer, but we're still using JFlex, so I don't know how much control we have on that ... Shai On Fri, Apr 16, 2010 at 12:21 AM, DM Smith dmsmith...@gmail.com wrote: On Apr 15, 2010, at 4:50 PM, Shai Erera wrote: Robert ... I'm sorry but changes to Analyzers don't *force* people to reindex. They can simply choose not to use the latest version. They can choose not to upgrade a Unicode version. They can copy the entire Analyzer code to match their needs. Index format changes is what I'm worried about because that *forces* people to reindex. In several threads and issues it has been pointed out that upgrading Unicode versions is not an obvious choice or even controllable. It is dictated by the version of Java, the version of the OS and any Unicode specific libraries. A desktop application which internally uses lucene has no control over the automatic update of Java (yes it can detect the version change and refuse to run or force an upgrade) or when the user feels like upgrading the OS (not sure how to detect the Unicode version of an arbitrary OS. Not sure I want to). Even with server applications, some shared servers have one version of Java that all use. And the owner of an individual application might have no say in if or when that is upgraded. This is to say that one needs to be ready to re-index at all times unless it can be controlled. One way to handle the Java/Unicode is to use ICU at a specific version and control its upgrade. 
One way to handle the OS problem (which really is one of user input) is to keep up with the changes to Unicode and create a filter that handles the differences, normalizing to the Unicode version of the index (if that's even possible). Still goes to your point. The onus is on the application, not on Lucene. -- DM
[jira] Created: (LUCENE-2397) SnapshotDeletionPolicy.snapshot() throws NPE if no commits happened
SnapshotDeletionPolicy.snapshot() throws NPE if no commits happened --- Key: LUCENE-2397 URL: https://issues.apache.org/jira/browse/LUCENE-2397 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 SDP throws NPE if no commits occurred and snapshot() was called. I will replace it w/ throwing IllegalStateException. I'll also move TestSDP from o.a.l to o.a.l.index. I'll post a patch soon. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
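The proposed fix amounts to a guard at the top of snapshot() that fails fast with a descriptive IllegalStateException instead of an opaque NPE. A minimal self-contained sketch (the field and class names are illustrative, not SDP's real internals):

```java
// Sketch of the guard proposed for SnapshotDeletionPolicy.snapshot():
// throw IllegalStateException with a clear message when no commit has
// happened yet. Names here are illustrative stand-ins.
public class SnapshotGuardSketch {
    private Object lastCommit; // null until the deletion policy sees a commit

    public Object snapshot() {
        if (lastCommit == null) {
            throw new IllegalStateException(
                "no index commit to snapshot; call IndexWriter.commit() first");
        }
        return lastCommit;
    }

    public static void main(String[] args) {
        SnapshotGuardSketch sdp = new SnapshotGuardSketch();
        try {
            sdp.snapshot(); // no commit recorded -> descriptive ISE, not NPE
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```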
Re: Build failed in Hudson: Lucene-trunk #1157
DB jars again ... I think this one is a false alarm. Shai On Fri, Apr 16, 2010 at 5:14 AM, Apache Hudson Server hud...@hudson.zones.apache.org wrote: See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1157/changes Changes: [mikemccand] speed up TestStressIndexing2 -- [...truncated 4473 lines...] jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-demo: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo [javac] Compiling 17 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/classes/demo [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. compile-memory: [echo] Building memory... common.init: build-lucene: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/classes/java [javac] Compiling 1 source file to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/classes/java [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/memory/lucene-memory-2010-04-16_02-03-48.jar default: compile-highlighter: [echo] Building highlighter... build-memory: build-queries: [echo] Highlighter building dependency contrib/queries [echo] Building queries... 
common.init: build-lucene: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/classes/java [javac] Compiling 18 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/lucene-queries-2010-04-16_02-03-48.jar default: common.init: build-lucene: init: clover.setup: clover.info: clover: common.compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java [javac] Compiling 35 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. compile-core: jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/highlighter/lucene-highlighter-2010-04-16_02-03-48.jar default: compile-analyzers-common: init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java [javac] Compiling 106 source files to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/classes/java [javac] Note: Some input files use or override a deprecated API. 
[javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/contrib/benchmark/src/java/org/apache/lucene/benchmark/quality/trec/TrecJudge.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar-core: [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/benchmark/lucene-benchmark-2010-04-16_02-03-48.jar jar: compile-test: [echo] Building benchmark... common.init: compile-demo: jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup:
[jira] Resolved: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2316. Lucene Fields: [New, Patch Available] (was: [New]) Assignee: Shai Erera Resolution: Fixed Committed revision 933879. Define clear semantics for Directory.fileLength --- Key: LUCENE-2316 URL: https://issues.apache.org/jira/browse/LUCENE-2316 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 Attachments: LUCENE-2316.patch On this thread: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e it was mentioned that Directory's fileLength behavior is not consistent between Directory implementations if the given file name does not exist. FSDirectory returns a 0 length while RAMDirectory throws FNFE. The problem is that the semantics of fileLength() are not defined. As proposed in the thread, we'll define the following semantics: * Returns the length of the file denoted by name if the file exists. The return value may be anything between 0 and Long.MAX_VALUE. * Throws FileNotFoundException if the file does not exist. Note that you can call dir.fileExists(name) if you are not sure whether the file exists or not. For backwards compatibility we'll create a new method w/ clear semantics. Something like:

{code}
/**
 * @deprecated the method will become abstract when #fileLength(name) has been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}

The first line just calls the current impl. If it throws an exception for a non-existing file, we're ok. The second line verifies whether a 0 length is for an existing file or not and throws an exception appropriately. -- This message is automatically generated by JIRA. 
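The proposed semantics can be exercised with plain java.io, independently of Lucene. This sketch mimics the shim above on top of java.io.File (whose length() also returns 0 for a missing file, just like FSDirectory did):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

// Sketch of the proposed fileLength semantics: return the length for an
// existing file, throw FileNotFoundException for a missing one -- never
// silently return 0 for a file that doesn't exist.
public class FileLengthSketch {
    static long getFileLength(File dir, String name) throws IOException {
        File f = new File(dir, name);
        long len = f.length();          // java.io.File.length() is 0 for missing files
        if (len == 0 && !f.exists()) {  // disambiguate "empty file" from "no such file"
            throw new FileNotFoundException(name);
        }
        return len;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        try {
            getFileLength(dir, "no-such-file-xyz.bin"); // hypothetical missing file
        } catch (FileNotFoundException e) {
            System.out.println("FNFE: " + e.getMessage());
        }
    }
}
```

The `len == 0 && !exists` check is the key trick: a length of 0 alone is ambiguous, so only the combination proves the file is absent.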
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856845#action_12856845 ] Shai Erera commented on LUCENE-2159: This looks like a nice tool. But all it does is create multiple copies of the same segment(s) right? So what exactly do you want to test with it? What worries me is that we'll be multiplying the lexicon, posting lists, statistics etc., therefore I'm not sure how reliable the tests will be (whatever they are), except for measuring things related to large numbers of segments (like merge performance). Am I right? I also think this class better fits in benchmark rather than misc, as it's really for perf. testing/measurements and not as a generic utility ... You can create a Task out of it, like ExpandIndexTask, which one can include in their algorithm. Tool to expand the index for perf/stress testing. - Key: LUCENE-2159 URL: https://issues.apache.org/jira/browse/LUCENE-2159 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 3.0 Reporter: John Wang Attachments: ExpandIndex.java Sometimes it is useful to take a small-ish index and expand it into a large index with K segments for perf/stress testing. This tool does that. See attached class. -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856877#action_12856877 ] Shai Erera commented on LUCENE-2159: bq. I understand having a general performance suite to test regression is a good thing. But we found having a more focused test for segmentation and merge is important. Are you saying that because of the benchmark proposal? I still think that an ExpandIndexTask will be useful for benchmark and fits better there, than in contrib/misc. We can have that task together w/ a predefined .alg for using it ...
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856911#action_12856911 ] Shai Erera commented on LUCENE-2159: Which is fine - I think this would be a neat task to add to benchmark, w/ specific documentation on how to use it and for what purposes. If you can also write a sample .alg file which e.g. creates a small index and then expands it, that'd be great. I've looked at the different PerfTask implementations in benchmark, and I'm thinking if we perhaps should do the following: * Create an AddIndexesTask which receives one or more Directories as input and calls writer.addIndexesNoOptimize * If one wants, he can add an OptimizeTask call afterwards. * Write an expandIndex.alg which initially creates an index of size N from one content source and then calls the AddIndexesTask several times. The .alg file is meant to be an example as well, as people can change it to create bigger or smaller indexes, use other content sources and switch between RAM/FS directories. How's that sound?
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856917#action_12856917 ] Shai Erera commented on LUCENE-2159: bq. There is an excellent section on it in LIA2 Indeed ! Ok so to create a task, you just extend PerfTask. You can look under contrib/benchmark/src/java/o.a.l/benchmark/byTask/tasks for many examples. OptimizeTask seems relevant here (i.e. it calls an IW API and receives a parameter). For writing .alg files, that's SUPER simple, just look under contrib/benchmark/conf for many existing examples. You can post a patch once you feel comfortable enough with it and I can help you with the struggles (if you'll run into any). Another great source (besides LIA2) on writing .alg files is the package.html under contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask.
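The extend-PerfTask pattern discussed in this thread can be illustrated with a simplified, self-contained stand-in. The PerfTask base class here is a toy mirroring the shape of benchmark's real one (which lives under contrib/benchmark/.../byTask/tasks), and AddIndexesTask is the hypothetical task proposed above, with a List of segment names standing in for an actual index:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for contrib/benchmark's PerfTask pattern: a task
// overrides doLogic() and returns a work count. Not the real API.
public class PerfTaskSketch {
    abstract static class PerfTask {
        abstract int doLogic() throws Exception; // returns # of work items done
    }

    // Hypothetical AddIndexesTask: append copies of a "source index"
    // (here just segment names) onto a target, mimicking the proposed
    // writer.addIndexesNoOptimize(dirs) call.
    static class AddIndexesTask extends PerfTask {
        final List<String> target, source;
        AddIndexesTask(List<String> target, List<String> source) {
            this.target = target;
            this.source = source;
        }
        @Override
        int doLogic() {
            target.addAll(source); // stand-in for the actual addIndexes call
            return source.size();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> target = new ArrayList<>(Arrays.asList("_0"));
        PerfTask task = new AddIndexesTask(target, Arrays.asList("_a", "_b"));
        task.doLogic(); // run once: 3 segments
        task.doLogic(); // run again: 5 segments -- the "expand" effect
        System.out.println(target.size());
    }
}
```

Running the task repeatedly from an .alg file is what would give the index-expansion effect LUCENE-2159 is after.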
Re: Proposal about Version API relaxation
Ahh ... a dream finally comes true ... what a great way to start a day :). +1 !!! I have some questions/comments though: * Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their segments when they move from 2.x to 3.x before 4.0 lands and they'll need to call optimize() to ensure 4.0 still works on their index. I hope that will still be the case? Otherwise I don't see how we can prevent reindexing by apps. ** Index behavioral/runtime changes, like those of Analyzers, are ok to require a reindex, as proposed. So after 3.1 is out, trunk can break the API and 3.2 will have a new set of API? Cool and convenient. For how long do we keep the 3.1 branch around? Also, it used to only fix bugs, but from now on it'll be allowed to introduce new features, if they maintain back-compat? So 3.1.1 can have 'flex' (going for the extreme on purpose) if someone maintains back-compat? I think the back-compat on branches should be only for index runtime changes. There's no point, in my opinion, to maintain API back-compat anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1 just to get a new feature but get it API back-supported? As soon as they upgrade to 3.2, that means a new set of API right? Major releases will just change the index structure format then? Or move to Java 1.6? Well ... not even that because as I understand it, 3.2 can move to Java 1.6 ... no API back-compat right :). That's definitely a great step forward ! Shai On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote: On Thu, 15 Apr 2010, Earwin Burrfoot wrote: Can't believe my eyes. +1 Likewise. +1 ! Andi.. 
On Thu, Apr 15, 2010 at 01:22, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote: Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file. So... what if we change up how we develop and release Lucene: * A major release always bumps the major release number (2.x - 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.). The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and, if any devs have the itch, they could freely back-port improvements from trunk as long as they kept back-compat within the branch. I think in such a future world, we could: * Remove Version entirely! * Not worry at all about back-compat when developing on trunk * Give proper names to new improved classes instead of StandardAnalzyer2, or SmartStandardAnalyzer, that we end up doing today; rename existing classes. * Let analyzers freely, incrementally improve * Use interfaces without fear * Stop spending the truly substantial time (look @ Uwe's awesome back-compat layer for analyzers!) that we now must spend when adding new features, for back-compat * Be more free to introduce very new not-fully-baked features/APIs, marked as experimental, on the expectation that once they are used (in trunk) they will iterate/change/improve vs trying so hard to get things right on the first go for fear of future back compat horrors. Thoughts...? 
Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Proposal about Version API relaxation
Also, we will still need to maintain the Backwards section in CHANGES (or move it to API Changes), to help people upgrade from release to release. Just pointing that out as well. Shai On Thu, Apr 15, 2010 at 7:05 AM, Shai Erera ser...@gmail.com wrote:
Re: Proposal about Version API relaxation
So then I don't understand this: {quote} * A major release always bumps the major release number (2.x -> 3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch * There is no back compat across major releases (index nor APIs), but full back compat within branches. {quote} What's different than what's done today? How can we remove Version in that world, if we need to maintain full back-compat between 3.1 and 3.2, index and API-wise? We'll still need to deprecate and come up w/ new classes every time, and we'll still need to maintain runtime changes back-compat. Unless you're telling me we'll start releasing major releases more often? Well ... then we're saying the same thing, only I think that instead of releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ... because if you look back, every minor release included API deprecations as well as back-compat breaks. That means that every minor release should have been a major release, right? Point is, if I understand correctly and you agree w/ my statement above - I don't see why anyone would release a 3.x after 4.0 is out unless someone really wants to work hard on maintaining back-compat of some features. If it's just a numbering thing, then I don't think it matters what is defined as 'major' vs. 'minor'. One way is to define 'major' as X and minor as X.Y, and another is to define major as 'X.Y' and minor as 'X.Y.Z'. I prefer the latter but don't have any strong feelings against the former. Just pointing out that X will grow more rapidly than today. That's all. So did I get it right? Shai On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller markrmil...@gmail.com wrote: I don't read what you wrote and what Mike wrote as even close to the same. - Mark http://www.lucidimagination.com (mobile) On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote:
[jira] Resolved: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2386. Resolution: Fixed Committed revision 933613. (take #2) IndexWriter commits unnecessarily on fresh Directory Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) back :). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Proposal about Version API relaxation
Hi I'd like to propose a relaxation on the Version API. Uwe, please read the entire email before you reply :). I was thinking, following a question on the user list, that the Version-based API may not be very intuitive to users, especially those who don't care about versioning, as well as very inconvenient. So there are two issues here: 1) How should one use Version smartly so that he keeps backwards compatibility. I think we all know the answer, but a Wiki page with some best practices tips would really help users use it. 2) How can one write sane code, which doesn't pass versions all over the place if: (1) he doesn't care about versions, or (2) he cares, and sets the Version to the same value in his app, in all places. Also, I think that today we offer a flexibility to users, to set different Versions on different objects in the life span of their application - which is a good flexibility but can also lead people to shoot themselves in the legs if they're not careful -- e.g. upgrading Version across their app, but failing to do so for one or two places ... So the change I'd like to propose is to mostly alleviate (2) and better protect users - I DO NOT PROPOSE TO GET RID OF Version :). I was thinking that we can add on Version a DEFAULT version, which the caller can set. So Version.setDefault and Version.getDefault will be added, as static members (more on the static-ness of it later). We then change the API which requires Version to also expose an API which doesn't require it, and that API will call Version.getDefault(). People can use it if they want to ... Few points: 1) As a default DEFAULT Version is controversial, I don't want to propose it, even though I think Lucene can define the DEFAULT to be the latest. Instead, I propose that Version.getDefault throw a DefaultVersionNotSetException if it wasn't set, while an API which relies on the default Version is called (I don't want to return null, not sure how safe it is). 
2) That DEFAULT Version is static, which means it will affect all indexing code running inside the JVM. Which is fine: 2.1) Perhaps all the indexing code should use the same Version 2.2) If you know that's not the case, then pass Version to the API which requires it - you cannot use the 'default Version' API -- nothing changes for you. One case is missing -- you might not know if your code is the only indexing code which runs in the JVM ... I don't have a solution to that, but I think it'll be revealed pretty quickly, and you can change your code then ... So to summarize - the current Version API will remain and people can still use it. The DEFAULT Version API is meant for convenience for those who don't want to pass Version everywhere, for the reasons I outlined above. This will also clean our test code significantly, as the tests will set the DEFAULT version to TEST_VERSION_CURRENT at start ... The changes to the Version class will be very simple. If people think that's acceptable, I can open an issue and work on it. Shai
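The proposed mechanism can be sketched in a few lines. Everything here is hypothetical: setDefault/getDefault and the unset-default exception are the proposal, not released Lucene API, and SomeAnalyzer is a made-up stand-in for any Version-aware class:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the proposed app-settable default Version (hypothetical API).
enum Version {
    LUCENE_29, LUCENE_30;

    private static final AtomicReference<Version> DEFAULT = new AtomicReference<Version>();

    /** Sets the JVM-wide default used by the Version-less convenience APIs. */
    public static void setDefault(Version v) { DEFAULT.set(v); }

    /** Returns the default, or throws if the application never set one
     *  (the proposal calls this DefaultVersionNotSetException). */
    public static Version getDefault() {
        Version v = DEFAULT.get();
        if (v == null) {
            throw new IllegalStateException("default Version not set");
        }
        return v;
    }
}

class SomeAnalyzer {
    final Version matchVersion;

    SomeAnalyzer(Version v) { this.matchVersion = v; } // existing Version-aware ctor

    SomeAnalyzer() { this(Version.getDefault()); }     // proposed convenience ctor

    public static void main(String[] args) {
        boolean threw = false;
        try {
            new SomeAnalyzer();                        // no default set yet
        } catch (IllegalStateException e) {
            threw = true;
        }
        System.out.println("unset default throws: " + threw);

        Version.setDefault(Version.LUCENE_30);         // set once, app-wide
        System.out.println("ctor picks up default: "
                + (new SomeAnalyzer().matchVersion == Version.LUCENE_30));
    }
}
```

Note how this matches point 2.2 above: an explicitly passed Version always wins, and the static default only kicks in for callers that opt into the no-arg API.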
[jira] Updated: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2316: --- Attachment: LUCENE-2316.patch Patch clarifies the contract, fixes the directories to adhere to it and adds a CHANGES entry under the backwards section. All tests pass. Define clear semantics for Directory.fileLength --- Key: LUCENE-2316 URL: https://issues.apache.org/jira/browse/LUCENE-2316 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Priority: Minor Fix For: 3.1 Attachments: LUCENE-2316.patch On this thread: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e it was mentioned that Directory's fileLength behavior is not consistent between Directory implementations if the given file name does not exist. FSDirectory returns a 0 length while RAMDirectory throws FNFE. The problem is that the semantics of fileLength() are not defined. As proposed in the thread, we'll define the following semantics: * Returns the length of the file denoted by {{name}} if the file exists. The return value may be anything between 0 and Long.MAX_VALUE. * Throws FileNotFoundException if the file does not exist. Note that you can call dir.fileExists(name) if you are not sure whether the file exists or not. For backwards we'll create a new method w/ clear semantics. Something like:
{code}
/**
 * @deprecated this method will become abstract when #fileLength(name) has been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}
The first line just calls the current impl. If it throws an exception for a non-existing file, we're ok. The check afterwards verifies whether a 0 length is for an existing file or not and throws an exception appropriately.
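The proposed contract can be exercised with a toy, map-backed stand-in. MockDirectory and its fields are hypothetical, sketched only to show why the wrapper needs the fileExists() check: a 0 length may mean either "empty file" or, in lenient implementations, "file does not exist":

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a Directory; not the real Lucene class.
class MockDirectory {
    private final Map<String, Long> files = new HashMap<String, Long>();

    void createFile(String name, long length) { files.put(name, length); }

    boolean fileExists(String name) { return files.containsKey(name); }

    /** Old lenient behavior (as FSDirectory had): 0 for a missing file. */
    long fileLength(String name) {
        Long len = files.get(name);
        return len == null ? 0 : len;
    }

    /** Proposed semantics: length if the file exists, FNFE otherwise. */
    long getFileLength(String name) throws IOException {
        long len = fileLength(name);
        if (len == 0 && !fileExists(name)) {
            throw new FileNotFoundException(name);
        }
        return len;
    }

    public static void main(String[] args) throws IOException {
        MockDirectory dir = new MockDirectory();
        dir.createFile("_0.cfs", 42L);
        dir.createFile("empty", 0L);
        System.out.println(dir.getFileLength("_0.cfs")); // 42
        System.out.println(dir.getFileLength("empty"));  // 0 - file exists
        try {
            dir.getFileLength("missing");
        } catch (FileNotFoundException e) {
            System.out.println("missing file -> FNFE");
        }
    }
}
```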
Re: Proposal about Version API relaxation
Well the no-arg ctor will be using Version.getDefault() which will throw an exception if not set, and delegate the call to the Version-aware ctor. On Tuesday, April 13, 2010, Robert Muir rcm...@gmail.com wrote: On Tue, Apr 13, 2010 at 11:27 AM, Shai Erera ser...@gmail.com wrote: I was thinking that we can add on Version a DEFAULT version, which the caller can set. So Version.setDefault and Version.getDefault will be added, as static members (more on the static-ness of it later). We then change the API which requires Version to also expose an API which doesn't require it, and that API will call Version.getDefault(). People can use it if they want to ... I don't understand how this works... if Something has a no-arg ctor today, and i want to improve it in a backwards-compatible way, how will this work? the way this works today, lets say while working with 3.1 is: Something() is deprecated, and invokes Something(3.0). Something(Version) is added, and emulates the old behavior for < 3.1, and the new behavior for >= 3.1. i dont see how backwards compatibility will work with this proposal, since the no-arg ctor would then emulate some random behavior depending on a static. -- Robert Muir rcm...@gmail.com
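The back-compat idiom Robert describes can be sketched like this (Something and its behavior flag are made up, and a plain int stands in for Version to keep the snippet self-contained): the Version-aware ctor gates old vs. new behavior, the deprecated no-arg ctor pins the old one, and under the proposal the no-arg ctor would instead delegate to the app-set default.

```java
// Hypothetical class illustrating today's Version-gated back-compat pattern.
class Something {
    final boolean newTokenization; // made-up behavioral switch

    /** Version-aware ctor; 31 plays the role of Version.LUCENE_31. */
    Something(int matchVersion) {
        this.newTokenization = matchVersion >= 31; // old behavior for < 3.1
    }

    /** Pre-Version ctor: deprecated, pinned to the old (3.0) behavior. */
    @Deprecated
    Something() { this(30); }

    public static void main(String[] args) {
        System.out.println("no-arg (old): " + new Something().newTokenization);
        System.out.println("3.1 (new): " + new Something(31).newTokenization);
    }
}
```

Robert's objection is that replacing the deprecated ctor's fixed `this(30)` with `this(Version.getDefault())` would make its behavior depend on whatever the static default happens to be.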
Re: Proposal about Version API relaxation
That is a static default! Yes Uwe ... I'm aware of that :) But that's not a static default for Lucene ... only for the application, if it chooses to use it ... so there are no plans to reimplement such a thing again Well ... that's not exactly what I'm proposing here. I'm not for re-implementing any sort of staticness, unless the app chooses to use it. And please don't give me that 'there are no plans ...' answer - it kind of kills the discussion, which is not healthy for a community. I agree that static variables might cause troubles to some deployments, BUT: 1) Not all apps are deployed on a Web Server together with other apps who happen to use Lucene. 2) Those that are deployed on web servers usually include lucene.jar in their classpath and are loaded by a different class loader than the rest ... So we're really talking about deployments where Lucene is a common, shared library between all apps ... And I guess that what bothers me the most is that it feels to me like we're trying to protect people from stuff we haven't yet received complaints on (at least none that I'm aware of), while we're hurting the programming experience of others ... almost recklessly. I'd hope we can find a way around that, because today I pass the same Version value around everywhere, and it's simply inconvenient. Just yesterday people complained about the need to call writer.commit() after new IW() if they want to open a reader ... one-liner inconvenience vs. dozen of lines here -- point is, what's perceived as unnecessary code DOES bother people ... only here it's just a setting thing, and my proposal is not to make it generically static. So let's not get caught on that 'static-ness'. And besides, if you ask me - variables like Version, that are needed in so many places, are usually made static ... but not in Lucene ... So if possible ... I'd like to think how we can fix/improve the use of Version, in ways that won't break apps. 
Because the fact of the matter is - we invented Version to allow for changes w/o breaking back-compat, while the backwards section in CHANGES seems to grow from release to release (I know - I'm partly to blame for it :)), and another fact is that I don't remember even one complaint about a change which broke back-compat. People have raised this issue numerous times in the past, even proposed all sorts of contracts and definitions on how we can be 'allowed' to break back-compat ... but nothing came out of it. The fact that we are not able to reach consensus doesn't mean the problem doesn't bother many out there. And ignoring the fact that currently the API looks cluttered is not doing any good. There must be a way to allow some apps out there (IMO the majority) to set that Version thing once, and let Lucene use that value everywhere else ... while for others to pass it along as much as they want. Shai On Tue, Apr 13, 2010 at 7:41 PM, Uwe Schindler u...@thetaphi.de wrote: Hi Shai, one of the problems I have is: That is a static default! We want to get rid of them (and did it mostly, only some relicts remain), so there are no plans to reimplement such a thing again. The worst one is BooleanQuery.maxClauseCount. The same applies to all types of sysprops. As Lucene and Solr are mostly running in servlet containers, this type of thing makes web applications no longer isolated. This is also a general contract for libraries: never ever rely on sysprops or statics. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
Re: Proposal about Version API relaxation
Because the version mechanism is not a single value for the entire library but rather feature by feature. I don't see how a global setter can help. That's only true if we believe people use different Version values in different places of their code ... and note that they will still be able to. I'm not proposing to take out Version from the ctors, just to add an additional default-version the app can set and use. So if the app doesn't want to do it .. it doesn't have to. Shai On Tue, Apr 13, 2010 at 9:40 PM, DM Smith dmsmith...@gmail.com wrote: I like the concept of version, but I'm concerned about it too. The current Version mechanism allows one to use more than one Version in their code. Imagine that we are at 3.2 and one was unable to upgrade to the most recent version for a particular feature. Let's also suppose that at 3.2 a new feature was introduced and was taken advantage of. But at 3.5 that new feature is versioned but one is unable to upgrade for it, too. Now what? Use 3.0 for the one feature and 3.2 for the other? What about the interoperability of versioned features? Does a version 3.0 class play well with a 3.2 versioned class? How do we test that? A long term issue is that of bw compat for the version itself. The bw compat contract is two fold: API and index. The API has a shorter lifetime of compatibility than that of an index. How does one deprecate a particular version for the api but not the index? How does one know whether one versioned feature impacts the index and another does not? I'm hoping that I'm imagining a problem that will never actually arise. Shai, to your suggestion: Because the version mechanism is not a single value for the entire library but rather feature by feature. I don't see how a global setter can help. -- DM On 04/13/2010 11:27 AM, Shai Erera wrote:
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855870#action_12855870 ] Shai Erera commented on LUCENE-2386: I'm not sure if we're arguing about the same thing here ... why when I open an IW on empty Directory I need an empty segment that's created, and from now on never changed, populated or even read? That just seems wrong to me ... when I fixed the tests to not rely on the buggy behavior, I noticed several which count the list of commits (especially the IDP ones) w/ a documentation like 1 for opening + N for committing ... It just looks weird that when you open IW a commit happens, a set of empty files are created, but from now on they are never modified, until IDP kicks in, after the second commit ... it's nothing like initing the Directory to be able to receive input .. And I don't know what's the benefit of doing new IW() following by IR.open() ... that IR will always see 0 documents, until you call reopen (if commit happened in between). So what's the convenience here? that your code can call IR.open once, and from that point forward just 'reopen()'? That seems low advantage to me, really. Maybe what we should do is fix IR.open to return a null IR in case the directory hasn't been populated w/ anything yet. Then you can check easily if you should call open() (==null) or reopen (otherwise). Or create a blank stub of IR which emulates an empty Dir, and when reopen is called works well (if the Directory is not empty now) ... BTW, FWIW, Solr's code did not break from this change at all ... it was the combination of FSDir and NoLF/SingleInstanceLF that broke some tests that used it ... I don't know how many apps out there are using that combination, but I'd bet it's small? I use that combination, however in my case an IR is opened only after a commit signal/event is raised (so I don't check isCurrent often or attempt to reopen()). 
What I'm trying to say is that this combination is dangerous, and the application needs to ensure that only one IW is open at any given time, and I'm sure such apps are more sophisticated than opening IW and then IR just for the convenience of it.
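The "return null from open() on an empty directory" alternative floated in this comment can be sketched with toy classes (ToyDirectory/ToyReader are hypothetical stand-ins, not Lucene's Directory/IndexReader API):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: open() returns null when the directory holds no commits, so the
// caller can decide between a first open() and a later reopen() without
// hitting an exception. Illustrative only.
class ToyDirectory {
    final List<String> commits = new ArrayList<String>();
}

class ToyReader {
    final int commitCount;

    private ToyReader(int n) { this.commitCount = n; }

    /** Returns null instead of a reader over a commit-less directory. */
    static ToyReader open(ToyDirectory dir) {
        return dir.commits.isEmpty() ? null : new ToyReader(dir.commits.size());
    }

    public static void main(String[] args) {
        ToyDirectory dir = new ToyDirectory();
        System.out.println("before any commit: " + ToyReader.open(dir)); // null
        dir.commits.add("segments_1");
        System.out.println("after one commit: " + ToyReader.open(dir).commitCount);
    }
}
```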
[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855873#action_12855873 ] Shai Erera commented on LUCENE-2316: Well ... dir.fileLength is also used by SegmentInfos.sizeInBytes to compute the size of all the files in the Directory. If we remove fileLength, then SI will need to call dir.openInput(name).length() and then close it? Seems like a lot of work to me, just for obtaining the length of a file. So I agree that if you have an IndexInput at hand, you should call its length() method rather than Dir.fileLength. But otherwise, if you just have a name at hand, dir.fileLength is convenient? I'm also ok w/ the bw break rather than going through the new/deprecate cycle. Define clear semantics for Directory.fileLength --- Key: LUCENE-2316 URL: https://issues.apache.org/jira/browse/LUCENE-2316 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Priority: Minor Fix For: 3.1 On this thread: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e it was mentioned that Directory's fileLength behavior is not consistent between Directory implementations if the given file name does not exist. FSDirectory returns a 0 length while RAMDirectory throws FNFE. The problem is that the semantics of fileLength() are not defined. As proposed in the thread, we'll define the following semantics: * Returns the length of the file denoted by {{name}} if the file exists. The return value may be anything between 0 and Long.MAX_VALUE. * Throws FileNotFoundException if the file does not exist. Note that you can call dir.fileExists(name) if you are not sure whether the file exists or not. For backwards compatibility we'll create a new method w/ clear semantics. Something like:
{code}
/**
 * @deprecated this method will become abstract when #fileLength(name) has been removed.
 */
public long getFileLength(String name) throws IOException {
  long len = fileLength(name);
  if (len == 0 && !fileExists(name)) {
    throw new FileNotFoundException(name);
  }
  return len;
}
{code}
The first line just calls the current impl. If it throws an exception for a non-existing file, we're ok. The second line verifies whether a 0 length is for an existing file or not and throws an exception appropriately.
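To make the proposed semantics concrete, here is a hypothetical in-memory stand-in for Directory (not Lucene code; the class and all names are invented for illustration) showing how getFileLength layers the FileNotFoundException guarantee over a legacy fileLength that returns 0 for a missing file:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory "directory" illustrating the proposed semantics.
public class FileLengthDemo {
    private final Map<String, Long> files = new HashMap<>();

    public void addFile(String name, long length) { files.put(name, length); }

    public boolean fileExists(String name) { return files.containsKey(name); }

    // Legacy FSDirectory-like behavior: returns 0 if the file does not exist.
    public long fileLength(String name) {
        Long len = files.get(name);
        return len == null ? 0 : len;
    }

    // Proposed wrapper: throws FileNotFoundException for a missing file,
    // but still returns 0 for a file that exists and is empty.
    public long getFileLength(String name) throws IOException {
        long len = fileLength(name);
        if (len == 0 && !fileExists(name)) {
            throw new FileNotFoundException(name);
        }
        return len;
    }
}
```

Note the ambiguity the extra fileExists check resolves: a return value of 0 from the legacy method could mean either "empty file" or "no such file".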
[jira] Commented: (LUCENE-2392) Enable flexible scoring
[ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875 ] Shai Erera commented on LUCENE-2392: Mike - it'll also be great if we can store the length of the document in a custom way. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today? I think though that it's not a field-level setting, but an IW one? Enable flexible scoring --- Key: LUCENE-2392 URL: https://issues.apache.org/jira/browse/LUCENE-2392 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2392.patch This is a first step (nowhere near committable!), implementing the design iterated to in the recent "Baby steps towards making Lucene's scoring more flexible" java-dev thread. The idea is (if you turn it on for your Field; it's off by default) to store full stats in the index, into a new _X.sts file, per doc (X field) in the index. And then have FieldSimilarityProvider impls that compute a doc's boost bytes (norms) from these stats. The patch is able to index the stats, merge them when segments are merged, and provides an iterator-only API. It also has a starting point for per-field Sims that use the stats iterator API to compute boost bytes. But it's not at all tied into actual searching! There's still tons left to do, eg, how one configures via Field/FieldType which stats one wants indexed. All tests pass, and I added one new TestStats unit test. 
The stats I record now are:
* field's boost
* field's unique term count (a b c a a b -> 3)
* field's total term count (a b c a a b -> 6)
* total term count per-term (sum of total term count for all docs that have this term)
Still need at least the total term count for each field.
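For illustration only (not code from the patch; the class name is made up), the two per-field counts can be computed from a token list exactly as in the "a b c a a b" example above:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative computation of two of the per-document field stats
// listed above, using the issue's own "a b c a a b" example.
public class FieldStats {
    // Number of distinct terms in the field: a b c a a b -> 3
    public static int uniqueTermCount(String[] tokens) {
        Set<String> seen = new HashSet<>();
        for (String t : tokens) seen.add(t);
        return seen.size();
    }

    // Total number of term occurrences in the field: a b c a a b -> 6
    public static int totalTermCount(String[] tokens) {
        return tokens.length;
    }
}
```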
[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems
[ https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855877#action_12855877 ] Shai Erera commented on LUCENE-2373: I'd rather not count on file length as well ... so a put/getTermDictSize method on Codec will allow one to implement it however one wants, if running on HDFS for example? Change StandardTermsDictWriter to work with streaming and append-only filesystems - Key: LUCENE-2373 URL: https://issues.apache.org/jira/browse/LUCENE-2373 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Andrzej Bialecki Fix For: 3.1 Since early 2.x times Lucene has used a skip/seek/write trick to patch the length of the terms dict into a place near the start of the output data file. This, however, made it impossible to use Lucene with append-only filesystems such as HDFS. In the post-flex trunk the following code in StandardTermsDictWriter initiates this:
{code}
// Count indexed fields up front
CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT);
out.writeLong(0); // leave space for end index pointer
{code}
and completes it in close():
{code}
out.seek(CodecUtil.headerLength(CODEC_NAME));
out.writeLong(dirStart);
{code}
I propose to change this layout so that this pointer is stored simply at the end of the file. It's always 8 bytes long, and we know the final length of the file from Directory, so it's a single additional seek(length - 8) to read it, which is not much considering the benefits.
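As a rough sketch of the proposed layout (not the actual codec code; the class and method names here are invented), the footer trick boils down to appending the dirStart pointer as the last 8 bytes of the file and reading it back with a single seek(length - 8):

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the proposed append-only layout: no seek-back to patch a
// header field, the end index pointer is simply the file's last 8 bytes.
public class EndPointerDemo {
    public static void write(File f, byte[] body, long dirStart) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.write(body);          // terms dict data, written append-only
            out.writeLong(dirStart);  // end index pointer, always last 8 bytes
        }
    }

    public static long readDirStart(File f) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(in.length() - 8); // single extra seek to the footer
            return in.readLong();
        }
    }
}
```

Nothing is ever rewritten in place, which is the property an append-only filesystem like HDFS requires.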
Re: [jira] Commented: (LUCENE-2392) Enable flexible scoring
I'm not sure, Robert, where I proposed to shove random statistics into the index? Lucene computes a doc length today, and some in academia/research disagree w/ how it's done. So instead of attempting to fix it for everyone, I think it'd be great if one could define what the doc length is as one perceives it. Why is that problematic? What Mike opened is an issue titled "enable flexible scoring" ... what I'm asking for falls under that hood? Also, maybe we should have that discussion on the issue? Shai On Mon, Apr 12, 2010 at 11:31 AM, Robert Muir rcm...@gmail.com wrote: I disagree. I think what Mike has defined here is way beyond a baby-step: it's all the stats needed to support modern IR models in Lucene: BM25, additional vector space algorithms, divergence from randomness, and language modelling. I think the ability to calculate your own random statistics and shove them into the index (this would be messy -- like, how to get access to the aggregates you need anyway) is something different entirely, best left to research systems. You can't even do that with Terrier now. On Mon, Apr 12, 2010 at 3:35 AM, Shai Erera (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875 ] Shai Erera commented on LUCENE-2392: Mike - it'll also be great if we can store the length of the document in a custom way. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today? I think though that it's not a field-level setting, but an IW one? 
Enable flexible scoring --- Key: LUCENE-2392 URL: https://issues.apache.org/jira/browse/LUCENE-2392 -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855892#action_12855892 ] Shai Erera commented on LUCENE-2386: bq. what is the proper way (after this fix) to open an IR over a possibly-empty directory? You can simply call commit() immediately after you open IW. If that's what you need then it will work for you. You're right that if I add docs, delete them and then commit, I'll get an empty segment. The same is true if you do new IW() and then iw.close() w/ no addDocument in between. The point here was that we should not create a commit unless the user has specifically asked for it. Calling close() means asking for a commit, per close()'s semantics and contract. But if the app called new IW, added docs and crashed in the middle, the Directory will still remain empty ... which is, IMO, what should happen. I agree it's a matter of perspective. I think that when autoCommit was removed, so should this code have been. I don't know if it was left behind for a good reason, or simply because when someone tried to do it, he found out it's not that simple (like I have :)).
[jira] Commented: (LUCENE-2392) Enable flexible scoring
[ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855913#action_12855913 ] Shai Erera commented on LUCENE-2392: I'd like to withdraw my request from above. I had misunderstood: the stats I need are stored per-field per-doc, so that will allow me to compute the docLength as I want.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855924#action_12855924 ] Shai Erera commented on LUCENE-2386: I don't think that people need to write that emptiness-detection-then-commit code ... if they care, they can simply call commit() immediately after they open IW. bq. Isn't opening IW with CREATE* mode called specifically asking for? It depends on how you interpret the mode ... for example, you cannot pass OpenMode.APPEND for an empty Directory, because IW throws an exception. The modes are just meant to tell IW how to behave: * APPEND - I know there is an index in the Directory, and I'd like to append to it. * CREATE - I don't care if there is an index in the Directory -- create a new one, zeroing out all segments. * CREATE_OR_APPEND - If there is an index, open it, otherwise create a new one. So if you pass CREATE on an already populated index, IW doesn't do the implicit commit until you call commit() yourself. But if you pass CREATE on an empty index, IW suddenly calls commit()? That's just an inconsistency that's meant to allow you to open an IR immediately after the new IW() call, regardless of what was there? And if you open that IR, then if the index was populated you see the previous set of documents, but if it wasn't you see nothing, even though you meant to say "override what's there"? I've checked what FileOutputStream does, using the following code:
{code}
File file = new File("d:/temp/tmpfile");
FileOutputStream fos = new FileOutputStream(file);
fos.write(3);
fos.close();
fos = new FileOutputStream(file);
FileInputStream fis = new FileInputStream(file);
System.out.println(fis.read());
{code}
* Second line creates an empty file immediately, not waiting for close() or flush() -- which resembles the behavior you're suggesting we should take w/ IW (which is today's behavior).
* Fourth line closes the file, flushing and writing the content.
* Fifth line *recreates* the file, empty, again, w/o calling close. So it zeros out the file content immediately, even before you've written a single byte to it.
* Sixth+seventh lines prove it by attempting to read from the file; the output printed is -1.
I've wrapped the FOS w/ a BufferedOS and the behavior is still the same. What I'm trying to show is that we don't fully adhere to the CREATE mode -- and rightfully so, if you ask me: we shouldn't zero out the segments until the application has called commit(). But we choose to adhere differently to the CREATE* mode if the index is already populated. That's inconsistent behavior, at least from my perspective. It's also harder to explain and document, e.g. "you should call commit() if you used CREATE, in case you want to zero out everything immediately, and the Directory is not empty, but you don't need to call commit() if the directory was empty, Lucene will do it for you" -- so now how will the app know if it should call commit()? It will need to write a sort of emptiness-detection-then-commit? I am willing to consider the following semantics: * APPEND - assumes an index exists and opens it. * CREATE - zeros out everything that's in the directory *immediately*, and also prepares an empty directory. * CREATE_OR_APPEND - either loads an existing index, or is able to work on the empty directory. No implicit commit happens in IW if the index does not exist. But I think CREATE is too dangerous, and so I prefer to stick w/ the change proposed in the patch so far -- if you open an index in CREATE*, you should call commit() before you can read it. That will adhere to the semantics of what the application wanted, whether it meant to zero out an existing Directory or create a new one from scratch. 
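For what it's worth, the FileOutputStream snippet above can be turned into a self-contained check (hypothetical class name; a temp file is used here instead of the hard-coded d:/temp path):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Self-contained check of the truncate-on-open behavior described above.
public class TruncateOnOpenDemo {
    // Returns the first byte read after the file has been re-opened for
    // writing: -1 means the old content was zeroed out immediately,
    // before anything new was written.
    public static int readAfterReopen() throws IOException {
        File file = File.createTempFile("truncate", ".tmp");
        FileOutputStream fos = new FileOutputStream(file);
        fos.write(3);
        fos.close();
        fos = new FileOutputStream(file); // recreates the file, empty, w/o any write
        FileInputStream fis = new FileInputStream(file);
        int first = fis.read();
        fis.close();
        fos.close();
        file.delete();
        return first;
    }
}
```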
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856063#action_12856063 ] Shai Erera commented on LUCENE-2386: So just call new IW(), then rollback and ensure dir.listAll() returns an empty list? Or also index stuff, making sure a flush occurs, and then rollback? I'm not sure the latter is related to this issue ...
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch includes the proposed test in TestIndexWriter. I think this is ready for commit, if there are no more objections.
[jira] Resolved: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2386. Lucene Fields: [New, Patch Available] (was: [New]) Resolution: Fixed Committed revision 932868.
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855713#action_12855713 ] Shai Erera commented on LUCENE-1709: Committed revision 932878 with the following: # benchmark tests force sequential run # threadsPerProcessor defaults to 1 and can be overridden by -DthreadsPerProcessor=value # A CHANGES entry Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say "use N threads" and it'd do the right thing... like the -j flag to make. {quote}
Re: svn commit: r932873 - /lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java
Sorry about that ... On Sun, Apr 11, 2010 at 3:10 PM, uschind...@apache.org wrote: Author: uschindler Date: Sun Apr 11 12:10:57 2010 New Revision: 932873 URL: http://svn.apache.org/viewvc?rev=932873view=rev Log: add missing license header Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java?rev=932873r1=932872r2=932873view=diff == --- lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java (original) +++ lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexNotFoundException.java Sun Apr 11 12:10:57 2010 @@ -1,5 +1,22 @@ package org.apache.lucene.index; +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + import java.io.FileNotFoundException; /**
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855727#action_12855727 ] Shai Erera commented on LUCENE-2386: Committed revision 932917 for the revert.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Fixes IndexFileDeleter, adds a proper test to TestIndexWriter. Haven't run all the tests yet though, but the added test passes now with the fix.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855767#action_12855767 ] Shai Erera commented on LUCENE-2386: About IndexReader.listCommits ... the javadocs state: "There must be at least one commit in the Directory, else this method throws java.io.IOException." So I'll change it to reflect that the right exception type is thrown (IndexNotFoundException) and revert the change to DirReader.listCommits which returns an empty list.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch w/ proposed fixes. All tests pass, including Solr's :).
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch updated to latest rev. + the proposed name change -- IndexNotFoundException. All tests pass. I plan to commit this later today.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855344#action_12855344 ] Shai Erera commented on LUCENE-2386: Ok I've added the following to DirReader:

{code}
try {
  latest.read(dir, codecs);
} catch (FileNotFoundException e) {
  if (e.getMessage().startsWith("no segments* file found in")) {
    // Might be that the Directory is empty, in which case just return an
    // empty collection.
    return Collections.emptyList();
  } else {
    throw e;
  }
}
{code}

And now that test passes. I'll continue discovering tests that fail ... probably backwards will have its share too :).
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855369#action_12855369 ] Shai Erera commented on LUCENE-2386: I already did that ... just didn't post back. Created SegmentsFileNotFoundException.
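The dedicated exception discussed here replaces the fragile message-prefix check above with a catch-by-type. A minimal self-contained sketch (the class was later renamed and committed as IndexNotFoundException, as seen elsewhere in this thread; the demo class around it is purely illustrative):

```java
import java.io.FileNotFoundException;

// Sketch: a dedicated exception for the "no segments* file found" case, so
// callers can catch it by type instead of parsing the message string.
// Extending FileNotFoundException keeps existing catch clauses working.
class IndexNotFoundException extends FileNotFoundException {
    public IndexNotFoundException(String msg) {
        super(msg);
    }
}

public class Main {
    public static void main(String[] args) {
        try {
            throw new IndexNotFoundException("no segments* file found in directory");
        } catch (IndexNotFoundException e) {
            // An empty directory can now be detected without inspecting the message.
            System.out.println("empty index detected: " + e.getMessage());
        }
    }
}
```

Since the new class subclasses FileNotFoundException, code that previously caught FNFE is unaffected by the change.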
[jira] Commented: (LUCENE-1879) Parallel incremental indexing
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855379#action_12855379 ] Shai Erera commented on LUCENE-1879: I have found such a version ... and it fails too :). At least the one I received. But never mind that ... as long as we both agree the implementation should change. I didn't mean to say anything bad about what you did .. I know the limitations you had to work with. Parallel incremental indexing - Key: LUCENE-1879 URL: https://issues.apache.org/jira/browse/LUCENE-1879 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael Busch Assignee: Michael Busch Fix For: 3.1 Attachments: parallel_incremental_indexing.tar A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler. Find details on the wiki page for this feature: http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing Discussion on java-dev: http://markmail.org/thread/ql3oxzkob7aqf3jd -- This message is automatically generated by JIRA.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch Patch fixes all tests as well as changes to IndexWriter, IndexFileDeleter, DirectoryReader and SegmentInfos. I'd like to commit this shortly, before all the files get changed by a malicious other commit :). (kidding of course)
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855457#action_12855457 ] Shai Erera commented on LUCENE-2386: Ok sounds good. Is there a preferred package for exceptions? Or is o.a.l.index ok?
Move NoDeletionPolicy to core
Hi, I've noticed benchmark has a NoDeletionPolicy class and I was wondering if we can move it to core. I might want to use it for the parallel index stuff, but I think it'll also fit nicely in core, together with the other No* classes. In addition, this class should be made a singleton. If moving to core is acceptable, do you think any bw policy needs to be enforced (such as deprecating the one in benchmark and referencing the one in core)? I'll also want to change the package name from o.a.l.benchmark.utils to o.a.l.index, where the other IDPs are. Simple move and change (and update to the benchmark algs which use it). Shai
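The singleton shape proposed above is easy to picture. A self-contained sketch follows; the real class implements Lucene's IndexDeletionPolicy, so the tiny local interface here is only a stand-in that lets the sketch compile on its own:

```java
import java.util.List;

// Stand-in for Lucene's IndexDeletionPolicy so the sketch is self-contained.
interface DeletionPolicy {
    void onInit(List<?> commits);
    void onCommit(List<?> commits);
}

// Keeps every commit by simply never deleting anything; made a singleton,
// in the spirit of the other No* classes (e.g. NoMergeScheduler).
final class NoDeletionPolicy implements DeletionPolicy {
    public static final DeletionPolicy INSTANCE = new NoDeletionPolicy();

    private NoDeletionPolicy() {} // no instances besides INSTANCE

    public void onInit(List<?> commits) {}   // keep all commits found on init
    public void onCommit(List<?> commits) {} // keep every new commit
}

public class Main {
    public static void main(String[] args) {
        // All users share the one stateless instance.
        System.out.println(NoDeletionPolicy.INSTANCE == NoDeletionPolicy.INSTANCE);
    }
}
```

Because the policy is stateless, a private constructor plus a shared INSTANCE is all the singleton machinery it needs.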
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854885#action_12854885 ] Shai Erera commented on LUCENE-2074: Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems. Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer --- Key: LUCENE-2074 URL: https://issues.apache.org/jira/browse/LUCENE-2074 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Java 3.0 we switch to Java 1.5, so we should regenerate the file. After regeneration the Tokenizer behaves different for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
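The "huge buffer allocation" guard mentioned above amounts to capping how far a scanner buffer may grow on a pathological token. A self-contained sketch of the idea, with made-up names (the real StandardTokenizer/JFlex internals differ):

```java
// Illustrative only: a buffer that doubles without bound can allocate huge
// arrays when fed one enormous "token". The guard is to cap the buffer
// (i.e. the maximum token length) and drop characters beyond the cap.
final class BoundedBuffer {
    private static final int MAX_TOKEN_LEN = 255; // cap instead of growing forever
    private char[] buf = new char[16];
    private int len = 0;

    void append(char c) {
        if (len == buf.length) {
            if (len >= MAX_TOKEN_LEN) return;         // drop chars beyond the cap
            int next = Math.min(buf.length * 2, MAX_TOKEN_LEN);
            char[] bigger = new char[next];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;                              // bounded doubling
        }
        buf[len++] = c;
    }

    int length() { return len; }
}

public class Main {
    public static void main(String[] args) {
        BoundedBuffer b = new BoundedBuffer();
        for (int i = 0; i < 1_000_000; i++) b.append('a'); // pathological input
        System.out.println(b.length()); // capped at 255, not ~1 MB of chars
    }
}
```

This is why the fix can be "quick and tiny": the cap is a local change to the growth path, independent of the Unicode work the issue is really about.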
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854887#action_12854887 ] Shai Erera commented on LUCENE-2074: bq. I plan to commit this soon! That's great news ! BTW - what are you going to do w/ the JFlex 1.5 binary? Are you going to check it in somewhere? because it hasn't been released last I checked. I'm asking for general knowledge, because I know the scripts are downloading it, or rely on it to exist somewhere. In that case, then yes, let's fix it here.
[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)
[ https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854920#action_12854920 ] Shai Erera commented on LUCENE-1482: I still think that calling isDebugEnabled is better, because the message formatting stuff may do unnecessary things like casting, autoboxing etc. IMO, if logging is enabled, evaluating it twice is not a big deal ... it's a simple check. I'm glad someone here thinks logging will be useful though :). I wish there will be quorum here to proceed w/ that. Note that I also offered to not create any dependency on SLF4J, but rather extract infoStream to a static InfoStream class, which will avoid passing it around everywhere, and give the flexibility to output stuff from other classes which don't have an infoStream at hand. Replace infoSteram by a logging framework (SLF4J) - Key: LUCENE-1482 URL: https://issues.apache.org/jira/browse/LUCENE-1482 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Fix For: 3.1 Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar Lucene makes use of infoStream to output messages in its indexing code only. For debugging purposes, when the search application is run on the customer side, getting messages from other code flows, like search, query parsing, analysis etc can be extremely useful. There are two main problems with infoStream today: 1. It is owned by IndexWriter, so if I want to add logging capabilities to other classes I need to either expose an API or propagate infoStream to all classes (see for example DocumentsWriter, which receives its infoStream instance from IndexWriter). 2. I can either turn debugging on or off, for the entire code. Introducing a logging framework can allow each class to control its logging independently, and more importantly, allows the application to turn on logging for only specific areas in the code (i.e., org.apache.lucene.index.*). 
I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, as its name states, a facade over different logging frameworks. As such, you can include the slf4j.jar in your application, and it recognizes at deploy time what is the actual logging framework you'd like to use. SLF4J comes with several adapters for Java logging, Log4j and others. If you know your application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in your classpath, and your logging statements will use Java logging underneath the covers. This makes the logging code very simple. For a class A the logger will be instantiated like this:

{code}
public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);
}
{code}

And will later be used like this:

{code}
public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);

  public void foo() {
    if (logger.isDebugEnabled()) {
      logger.debug("message");
    }
  }
}
{code}

That's all! Checking for isDebugEnabled is very quick, at least using the JDK14 adapter (but I assume it's fast also over other logging frameworks). The important thing is, every class controls its own logger. Not all classes have to output logging messages, and we can improve Lucene's logging gradually, w/o changing the API, by adding more logging messages to interesting classes. I will submit a patch shortly. -- This message is automatically generated by JIRA.
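The isDebugEnabled() guard discussed here can be demonstrated without any extra jars using java.util.logging, which is the framework the slf4j-jdk14 adapter delegates to; Logger.isLoggable plays the role of isDebugEnabled in this sketch:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// The guard pattern from the discussion, shown with JDK logging so it runs
// standalone. The point: message construction (concatenation, boxing, etc.)
// is skipped entirely when the level is disabled.
public class Main {
    private static final Logger logger = Logger.getLogger(Main.class.getName());

    static String expensiveState() {
        return "state"; // stands in for costly formatting work
    }

    public static void main(String[] args) {
        logger.setLevel(Level.INFO); // FINE (i.e. debug) is disabled

        // Cheap check; expensiveState() only runs when FINE is enabled.
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("current state: " + expensiveState());
        }

        System.out.println(logger.isLoggable(Level.FINE)); // false
    }
}
```

The guard costs one level comparison, which is why evaluating it "twice" (once in the guard, once inside the log call) is negligible compared to building the message eagerly.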
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855020#action_12855020 ] Shai Erera commented on LUCENE-1709: Robert, I will commit the patch, seems good to do anyway. We can handle the ant jars separately later. And this hang behavior is exactly what I experience, including the FileInputStream thing. Only on my machine, when I took a thread dump, it showed that Ant waits on FIS.read() ... Robert - to remind you that even with the patch which forces junit to use a separate temp folder per thread, it still hung ... Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make. {quote} -- This message is automatically generated by JIRA.
[jira] Created: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
Move NoDeletionPolicy from benchmark to core Key: LUCENE-2385 URL: https://issues.apache.org/jira/browse/LUCENE-2385 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark, Index Reporter: Shai Erera Assignee: Shai Erera Priority: Trivial Fix For: 3.1 As the subject says, but I'll also make it a singleton + add some unit tests, as well as some documentation. I'll post a patch hopefully today. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
IndexWriter commits unnecessarily on fresh Directory Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessarily, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) back :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2385: --- Attachment: LUCENE-2385.patch Move NoDeletionPolicy to core, adds javadocs + TestNoDeletionPolicy. Also includes the relevant changes to benchmark (algorithms + CreateIndexTask). I've fixed a typo I had in NoMergeScheduler - not related to this issue, but since it was just a typo, thought it's no harm to do it here. Tests pass. Planning to commit shortly.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855131#action_12855131 ] Shai Erera commented on LUCENE-2386: Took a look at IndexFileDeleter, and located to offending code segment which is responsible for the IndexCorruptException: {code} if (currentCommitPoint == null) { // We did not in fact see the segments_N file // corresponding to the segmentInfos that was passed // in. Yet, it must exist, because our caller holds // the write lock. This can happen when the directory // listing was stale (eg when index accessed via NFS // client with stale directory listing cache). So we // try now to explicitly open this commit point: SegmentInfos sis = new SegmentInfos(); try { sis.read(directory, segmentInfos.getCurrentSegmentFileName(), codecs); } catch (IOException e) { throw new CorruptIndexException(failed to locate current segments_N file); } {code} Looks like this code protects against a real problem, which was raised on the list a couple of times already - stale NFS cache. So I'm reluctant to remove that check ... thought I still think we should differentiate between a newly created index on a fresh Directory, to a stale NFS problem. Maybe we can pass a boolean isNew or something like that to the ctor, and if it's a new index and the last commit point is missing, IFD will not throw the exception, but silently ignore that? So the code would become something like this: {code} if (currentCommitPoint == null !isNew) { } {code} Does this make sense, or am I missing something? IndexWriter commits unnecessarily on fresh Directory Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. 
This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) back :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855140#action_12855140 ] Shai Erera commented on LUCENE-2385: I did that first, but then remembered that when I did that in the past, people were unable to apply my patches, w/o doing the svn move themselves. Anyway, for this file it's not really important I think - a very simple and tiny file, w/ no history to preserve? Is that ok for this file (b/c I have no idea how to do the svn move now ... after I've made all the changes already) :) Move NoDeletionPolicy from benchmark to core Key: LUCENE-2385 URL: https://issues.apache.org/jira/browse/LUCENE-2385 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark, Index Reporter: Shai Erera Assignee: Shai Erera Priority: Trivial Fix For: 3.1 Attachments: LUCENE-2385.patch As the subject says, but I'll also make it a singleton + add some unit tests, as well as some documentation. I'll post a patch hopefully today. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855148#action_12855148 ] Shai Erera commented on LUCENE-2386: Looking at IFD again, I think a boolean ctor arg is not required. What I can do is check if any Lucene file has been seen (in the for-loop iteration on the Directory files), and if not, then deduce it's a new Directory, and skip that 'if' check. I'll give it a shot.
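The detection idea above - deduce a fresh Directory when no Lucene index file was seen while scanning the directory listing - can be sketched in isolation. The class and the file-name pattern below are hypothetical simplifications for illustration, not IndexFileDeleter's actual code (Lucene's real IndexFileNameFilter is more thorough):

```java
import java.util.regex.Pattern;

public class FreshDirectoryCheck {
    // Simplified pattern for Lucene index file names such as
    // segments_2, segments.gen, _0.cfs, _1.fdt (illustration only).
    private static final Pattern LUCENE_FILE =
        Pattern.compile("segments(_[0-9a-z]+)?|segments\\.gen|_[0-9a-z]+\\..+");

    // Returns true if none of the directory's files look like Lucene index
    // files, i.e. the Directory is fresh and the missing-commit check in
    // IndexFileDeleter could be skipped.
    static boolean isFreshDirectory(String[] files) {
        for (String f : files) {
            if (LUCENE_FILE.matcher(f).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isFreshDirectory(new String[0]));               // true
        System.out.println(isFreshDirectory(new String[] {"foo.txt"}));    // true
        System.out.println(isFreshDirectory(new String[] {"segments_1"})); // false
    }
}
```

The appeal over a boolean ctor arg is that the deleter already iterates the listing anyway, so no API change is needed.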
[jira] Updated: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2385: --- Attachment: LUCENE-2385.patch Is it better now?
[jira] Commented: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855155#action_12855155 ] Shai Erera commented on LUCENE-2385: Forgot to mention that the only move I made was of NoDeletionPolicy: svn move contrib/benchmark/src/java/org/apache/lucene/benchmark/utils/NoDeletionPolicy.java src/java/org/apache/lucene/index/NoDeletionPolicy.java I'll remember that in the future Uwe - thanks for the heads up !
[jira] Resolved: (LUCENE-2385) Move NoDeletionPolicy from benchmark to core
[ https://issues.apache.org/jira/browse/LUCENE-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2385. Resolution: Fixed Committed revision 932129.
[jira] Updated: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2386: --- Attachment: LUCENE-2386.patch First stab at this. Patch still missing CHANGES entry, and I haven't run all the tests, just TestIndexWriter. With those changes it passes. One thing that I think should be fixed is testImmediateDiskFull - if I don't add writer.commit(), the test fails, because dir.getRecomputeActualSizeInBytes returns 0 (no RAMFiles yet), and then the test succeeds at adding one document. So maybe just change the test to set maxSizeInBytes to '1', always? TestNoDeletionPolicy is not covered by this patch (should be fixed as well, because now the number of commits is exactly N and not N+1). Will fix it tomorrow. Anyway, it's really late now, so hopefully some fresh eyes will look at it while I'm away, and comment on the proposed changes. I hope I got all the changes to the tests right.
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855265#action_12855265 ] Shai Erera commented on LUCENE-2386: bq. Maybe change testImmediateDiskFull to set max allowed size to max(1, current-usage)? Good idea ! Did it and it works. Now ... one thing I haven't mentioned is the bw break. This is a behavioral bw break, which specifically I'm not so sure we should care about, because I wonder how many apps out there rely on being able to open a reader before they ever committed on a fresh new index. So what do you think - do this change anyway, OR ... utilize Version to our aid? I.e., if the Version that was passed to IWC is before LUCENE_31, we keep the initial commit, otherwise we don't do it? Pros is that I won't need to change many of the tests because they still use the LUCENE_30 version (but that is not a strong argument), so it's a weak Pro. Cons is that IW will keep having that doCommit handling in its ctor, only now w/ added comments on why this is being kept around etc. What do you think?
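If the Version-gated route were taken, the gate itself would be tiny - roughly the sketch below. The enum here is a hypothetical stand-in for org.apache.lucene.util.Version (which exposes the same onOrAfter-style check); the method name is made up for illustration:

```java
public class VersionGateSketch {
    // Hypothetical stand-in for org.apache.lucene.util.Version.
    enum Version {
        LUCENE_30, LUCENE_31;
        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    // Keep the old initial-commit-on-fresh-Directory behavior only for
    // configs created against pre-3.1 versions.
    static boolean shouldDoInitialCommit(Version matchVersion) {
        return !matchVersion.onOrAfter(Version.LUCENE_31);
    }

    public static void main(String[] args) {
        System.out.println(shouldDoInitialCommit(Version.LUCENE_30)); // true
        System.out.println(shouldDoInitialCommit(Version.LUCENE_31)); // false
    }
}
```

The cost, as noted, is that the doCommit handling survives in the IW ctor indefinitely, guarded by a version check.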
TestCodecs running time
Hi I've noticed that TestCodecs takes an insanely long time to run on my machine - between 35-40 seconds. Is that expected? The reason it runs so long seems to be that its threads each make 4000 iterations ... is that really required to ensure correctness? Shai
Re: Controlling the maximum size of a segment during indexing
I'm not sure .. but did you set the RAMBufferSizeMB on IWC? Doesn't look like it, and the default is 16 MB, which can explain why it doesn't flush before that. Shai On Fri, Apr 9, 2010 at 8:01 AM, Lance Norskog goks...@gmail.com wrote: Here is a Java unit test that uses the LogByteSizeMergePolicy to control the maximum size of segment files during indexing. That is, it tries. It does not succeed. Will someone who truly understands the merge policy code please examine it? There is probably one tiny parameter missing. It adds 20 documents that each are 100k in size. It creates an index in a RAMDirectory which should have one segment that's a tad over 1mb, and then a set of segments that are a tad over 500k. Instead, the data does not flush until it commits, writing one 5m segment. - org.apache.lucene.index.TestIndexWriterMergeMB --- package org.apache.lucene.index; /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
*/ import java.io.IOException; import org.apache.lucene.analysis.WhitespaceAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.FieldSelectorResult; import org.apache.lucene.document.Field.Index; import org.apache.lucene.store.Directory; import org.apache.lucene.store.RAMDirectory; import org.apache.lucene.util.LuceneTestCase; /* * Verify that segment sizes are limited to # of bytes. * * Sizing: * Max MB is 0.5m. Verify against this plus 100k slop. (1.2x) * Min MB is 10k. * Each document is 100k. * mergeSegments=2 * MaxRAMBuffer=1m. Verify against this plus 200k slop. (1.2x) * * This test should cause the ram buffer to flush after 10 documents, and create a CFS a little over 1meg. * The later documents should be flushed to disk every 5-6 documents, and create CFS files a little over 0.5meg. */ public class TestIndexWriterMergeMB extends LuceneTestCase { private static final int MERGE_FACTOR = 2; private static final double RAMBUFFER_MB = 1.0; static final double MIN_MB = 0.01d; static final double MAX_MB = 0.5d; static final double SLOP_FACTOR = 1.2d; static final double MB = 1000*1000; static String VALUE_100k = null; // Test controlling the mergePolicy for max # of docs public void testMaxMergeMB() throws IOException { Directory dir = new RAMDirectory(); IndexWriterConfig config = new IndexWriterConfig( TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT)); LogByteSizeMergePolicy mergeMB = new LogByteSizeMergePolicy(); config.setMergePolicy(mergeMB); mergeMB.setMinMergeMB(MIN_MB); mergeMB.setMaxMergeMB(MAX_MB); mergeMB.setUseCompoundFile(true); mergeMB.setMergeFactor(MERGE_FACTOR); config.setMaxBufferedDocs(100); // irrelevant but the next line fails without this. 
config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH); MergeScheduler scheduler = new SerialMergeScheduler(); config.setMergeScheduler(scheduler); IndexWriter writer = new IndexWriter(dir, config); System.out.println("Start indexing"); for (int i = 0; i < 50; i++) { addDoc(writer, i); printSegmentSizes(dir); } checkSegmentSizes(dir); System.out.println("Commit"); writer.commit(); printSegmentSizes(dir); checkSegmentSizes(dir); writer.close(); } // document that takes 100k of RAM private void addDoc(IndexWriter writer, int i) throws IOException { if (VALUE_100k == null) { StringBuilder value = new StringBuilder(10); for (int fill = 0; fill < 10; fill++) { value.append('a'); } VALUE_100k = value.toString(); } Document doc = new Document(); doc.add(new Field("id", i + "", Field.Store.YES, Field.Index.NOT_ANALYZED)); doc.add(new Field("content", VALUE_100k, Field.Store.YES, Field.Index.NOT_ANALYZED)); writer.addDocument(doc); } private void checkSegmentSizes(Directory dir) { try { String[] files = dir.listAll(); for (String file : files) { if
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855277#action_12855277 ] Shai Erera commented on LUCENE-2386: Apparently, there are more tests that fail ... lost count but easy fixing. I tried writing the following test: {code} public void testNoCommits() throws Exception { // Tests that if we don't call commit(), the directory has 0 commits. This has // changed since LUCENE-2386, where before IW would always commit on a fresh // new index. Directory dir = new RAMDirectory(); IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT))); assertEquals("expected 0 commits!", 0, IndexReader.listCommits(dir).size()); // Closing with no changes should still generate a commit, because it's a new index. writer.close(); assertEquals("expected 1 commits!", 1, IndexReader.listCommits(dir).size()); } {code} Simple test - validates that no commits are present following a freshly new index creation, w/o closing or committing. However, IndexReader.listCommits fails w/ the following exception: {code} java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.ramdirect...@2d262d26: files: [] at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:652) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:535) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:323) at org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1033) at org.apache.lucene.index.DirectoryReader.listCommits(DirectoryReader.java:1023) at org.apache.lucene.index.IndexReader.listCommits(IndexReader.java:1341) at org.apache.lucene.index.TestIndexWriter.testNoCommits(TestIndexWriter.java:4966) {code} The failure occurs when SegmentInfos attempts to find segments.gen and fails. 
So I wonder if I should fix DirectoryReader to catch that exception and simply return an empty Collection .. or I should fix SegmentInfos at this point -- notice the "files: []" at the end - I think that by adding a check to the following code (SegmentInfos, line 652) which validates that there were any files before throwing the exception, it'll still work properly and safely (i.e. to detect a problematic Directory). Will probably need to break away from the while loop and I guess fix some other things in upper layers ... therefore I'm not sure if I should not simply catch that exception in DirectoryReader.listCommits w/ proper documentation and be done w/ it. After all, it's not supposed to be called ... ever? or hardly ever? {code} if (gen == -1) { // Neither approach found a generation throw new FileNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files)); } {code}
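The catch-and-return-empty alternative for listCommits can be sketched in isolation. Here readCommits is a hypothetical stand-in for the SegmentInfos machinery that throws the FileNotFoundException shown in the stack trace above; it is not Lucene's actual API:

```java
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ListCommitsSketch {
    // Hypothetical loader standing in for SegmentInfos.read(): throws when
    // no segments_N file exists, as on a freshly created, never-committed index.
    static List<String> readCommits(String[] files) throws FileNotFoundException {
        if (files.length == 0) {
            throw new FileNotFoundException("no segments* file found: files: []");
        }
        return Arrays.asList(files); // pretend each file is a commit point
    }

    // The alternative discussed above: catch the exception in listCommits
    // and return an empty collection instead of propagating it.
    static List<String> listCommits(String[] files) {
        try {
            return readCommits(files);
        } catch (FileNotFoundException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(listCommits(new String[0]));              // empty for a never-committed index
        System.out.println(listCommits(new String[] {"segments_1"}));
    }
}
```

Handling it at this level keeps SegmentInfos' stale-directory protection intact while making the rare no-commits case benign for callers.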
[jira] Updated: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1709: --- Attachment: LUCENE-1709-2.patch Since I had the changes on my local env. I thought it's best to generate a patch out of them, so they don't get lost. The patch doesn't cover the ant .jars, only the changes to common-build.xml as well as benchmark/build.xml Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-1709-2.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make. {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2377. Resolution: Fixed Committed revision 931502. Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark - Key: LUCENE-2377 URL: https://issues.apache.org/jira/browse/LUCENE-2377 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 Attachments: LUCENE-2377.patch Benchmark allows one to set the MP and MS to use, by defining the class name and then use reflection to instantiate them. However NoMP and NoMS are singletons and therefore reflection does not work for them. Easy fix in CreateIndexTask. I'll post a patch soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854588#action_12854588 ] Shai Erera commented on LUCENE-2353: Actually, we've reopened LUCENE-1709 to track that. This is not related to this issue's changes, but seems to be related to the benchmark tests specifically. Please have a look there at a patch I've posted which forces benchmark tests to run in sequential mode. Additionally, you can 'ant test -Drunsequential=1' from the command line, benchmark's root folder, to achieve the same. And it'd be great if you post the above on LUCENE-1709 as well -- because now I know I'm not the only one running into this :). Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2353.patch, LUCENE-2353.patch I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code: {code} if (sval.indexOf(":") < 0) { return sval; } // first time this prop is extracted by round int k = sval.indexOf(":"); String colName = sval.substring(0, k); sval = sval.substring(k + 1); ... {code} It detects ':' in the value and so it thinks it's a per-round property, thus stripping 'd:' from the value ... fix is very simple: {code} if (sval.indexOf(":") < 0) { return sval; } else if (sval.indexOf(":\\") >= 0) { // this previously messed up absolute path names on Windows. Assuming // there is no real value that starts with \\ return sval; } // first time this prop is extracted by round int k = sval.indexOf(":"); String colName = sval.substring(0, k); sval = sval.substring(k + 1); {code} I'll post a patch w/ the above fix + test shortly.
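The bug is pure string handling and easy to reproduce outside benchmark. The sketch below is a stripped-down stand-in for Config.get (the method name and the omitted per-round bookkeeping are simplifications); it shows why a Windows drive letter used to be eaten and how the extra branch fixes it:

```java
public class ConfigColonSketch {
    // Stripped-down stand-in for benchmark's Config.get(String, String):
    // a value containing ':' is treated as a per-round list, and only the
    // part after the first ':' is kept for the current round.
    static String resolve(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval;              // plain value, no per-round syntax
        } else if (sval.indexOf(":\\") >= 0) {
            return sval;              // the fix: "d:\..." is a Windows path, not a round list
        }
        int k = sval.indexOf(":");    // without the fix, "d:\something" lands here
        return sval.substring(k + 1); // per-round property: drop the prefix before ':'
    }

    public static void main(String[] args) {
        System.out.println(resolve("d:\\something"));   // kept intact by the fix
        System.out.println(resolve("merge.factor:10")); // per-round value is extracted
    }
}
```

Without the `:\\` branch, `resolve("d:\\something")` would fall through to the substring and return `\something`, which is exactly the benchmark\work\something misresolution described above.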
Re: Getting fsync out of the loop
How often is fsync called? If it's just during calls to commit, then is that that expensive? I mean, how often do you call commit? If that's that expensive (do you have some numbers to share) then I think that'd be a neat idea. Though losing a few minutes worth of updates may sometimes be unrecoverable, depending on the scenario, but I guess for those cases the 'standard way' should be used. What if your background thread simply committed every couple of minutes? What's the difference between taking the snapshot (which means you had to call commit previously) and commit it, to call iw.commit by a background merge? Shai On Tue, Apr 6, 2010 at 5:11 PM, Earwin Burrfoot ear...@gmail.com wrote: So, I want to pump my IndexWriter hard and fast with documents. Removing fsync from FSDirectory helps. But for that I pay with possibility of index corruption, not only if my node suddenly loses power/kernelpanics, but also if it runs out of disk space (which happens more frequently). I invented the following solution: We write a special deletion policy that resembles SnapshotDeletionPolicy. At all times it takes hold of current synced commit and preserves it. Once every N minutes a special thread takes latest commit, syncs it and nominates as current synced commit. The previous one gets deleted. Now we are disaster-proof, and do fsync asynchronously from indexing threads. We pay for this with somewhat bigger transient disc usage, and probably losing a few minutes worth of updates in case of a crash, but that's acceptable. How does this sound? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Getting fsync out of the loop
Earwin - do you have some numbers to share on the running time of the indexing application? You've mentioned that if you take out fsync into a BG thread, the running time improves, but I'm curious to know by how much. Shai On Wed, Apr 7, 2010 at 2:26 AM, Earwin Burrfoot ear...@gmail.com wrote: Running out of disk space with fsync disabled won't lead to corruption. Even kill -9 the JRE process with fsync disabled won't corrupt. In these cases index just falls back to last successful commit. It's only power loss / OS / machine crash where you need fsync to avoid possible corruption (corruption may not even occur w/o fsync if you get lucky). Sorry to disappoint you, but running out of disk space is worse than kill -9. You can write down the file (to cache in fact), close it, all without getting any exceptions. And then it won't get flushed to disk because the disk is full. This can happen to segments file (and old one is deleted with default deletion policy). This can happen to fat freq/prox files mentioned in segments file (and yeah, the old segments file is deleted, so no falling back). What if your background thread simply committed every couple of minutes? What's the difference between taking the snapshot (which means you had to call commit previously) and commit it, to call iw.commit by a backgroud merge? -- But: why do you need to commit so often? To see stuff on reopen? Yes, I know about NRT. You've reinvented autocommit=true! ?? I'm doing regular commits, syncing down every Nth of it. Doesn't this just BG the syncing? Ie you could make a dedicated thread to do this. Yes, exactly, this BGs the syncing to a dedicated thread. Threads doing indexation/merging can continue unhampered. One possible win with this aproach is the cost of fsync should go way down the longer you wait after writing bytes to the file and before calling fsync. 
This is because typically OS write caches expire by time (eg 30 seconds), so if you wait long enough the bytes will already at least be delivered to the IO system (but the IO system can do further caching which could still take time). On windows at least I definitely noticed this effect -- wait some before fsync'ing and it's net/net much less costly. Yup. In fact you can just hold on to the latest commit for N seconds, then switch to the new latest commit. OS will fsync everything for you. I'm just playing around with stupid idea. I'd like to have NRT look-alike without binding readers and writers. :) Right now it's probably best for me to save my time and cut over to current NRT. But. An important lesson was learnt - no fsyncing blows up your index on out-of-disk-space.
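The scheme discussed in this thread - indexing threads publish commits cheaply, while a dedicated thread periodically fsyncs the newest commit and releases the previous one - can be sketched with plain java.util.concurrent. Commit handles are just strings here; in Lucene terms they would be IndexCommits held alive by a SnapshotDeletionPolicy-like deletion policy, and the actual Directory.sync call is stubbed out:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class BackgroundSyncSketch {
    private final AtomicReference<String> latestCommit = new AtomicReference<>();
    private final AtomicReference<String> lastSyncedCommit = new AtomicReference<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Called by indexing threads; cheap, no fsync on this path.
    public void onCommit(String commitName) {
        latestCommit.set(commitName);
    }

    // One sync pass: fsync the newest commit (stubbed out here) and
    // nominate it as the current synced commit; the previously preserved
    // commit could then be deleted by the deletion policy.
    public void syncOnce() {
        String commit = latestCommit.get();
        if (commit != null && !commit.equals(lastSyncedCommit.get())) {
            // directory.sync(filesOf(commit)) would go here
            lastSyncedCommit.set(commit);
        }
    }

    public String lastSynced() {
        return lastSyncedCommit.get();
    }

    // Run the sync pass in the background every 'period' units.
    public void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::syncOnce, period, period, unit);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

The trade-offs are as described above: transient disk usage grows (two commits are held at once), and a crash can lose up to one sync period of updates, in exchange for taking fsync off the indexing path.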
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854348#action_12854348 ] Shai Erera commented on LUCENE-1709: One more thing - change benchmark tests to run sequentially (by adding the property). Robert, are you going to tackle that soon?
[jira] Created: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark - Key: LUCENE-2377 URL: https://issues.apache.org/jira/browse/LUCENE-2377 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1 Benchmark allows one to set the MP and MS to use by defining the class name and then using reflection to instantiate them. However, NoMP and NoMS are singletons, and therefore reflection does not work for them. Easy fix in CreateIndexTask. I'll post a patch soon. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
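The singleton problem described in the issue is the kind of thing a constructor-or-singleton reflection fallback solves: try the public no-arg constructor, and if there is none, look for a public static INSTANCE field (NoMergeScheduler exposes exactly such a field). The helper and the two stand-in classes below are hypothetical, not the actual CreateIndexTask fix:

```java
import java.lang.reflect.Constructor;

// Hypothetical helper: instantiate via a public no-arg constructor when one
// exists, otherwise fall back to a public static INSTANCE field, which is
// how singleton classes such as NoMergeScheduler expose themselves.
class Singletons {
    static Object instantiate(Class<?> clazz) {
        try {
            Constructor<?> ctor = clazz.getConstructor(); // public no-arg ctor only
            return ctor.newInstance();
        } catch (NoSuchMethodException noCtor) {
            try {
                return clazz.getField("INSTANCE").get(null); // singleton fallback
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(
                    clazz + " has neither a public no-arg ctor nor an INSTANCE field", e);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + clazz, e);
        }
    }
}

// Stand-ins for a normal policy class and a singleton-style one.
class PlainPolicy {
    public PlainPolicy() { }
}

class SingletonPolicy {
    public static final SingletonPolicy INSTANCE = new SingletonPolicy();
    private SingletonPolicy() { }
}
```

Note that NoMergePolicy uses named constants rather than a single INSTANCE, so a real fix would need to resolve those by field name as well; this sketch only shows the basic shape.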
[jira] Updated: (LUCENE-2377) Enable the use of NoMergePolicy and NoMergeScheduler by Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2377: --- Attachment: LUCENE-2377.patch Patch includes both the fix to CreateIndexTask and relevant tests in CreateIndexTaskTest. I plan to commit later today if there are no objections.
Re: Parallel tests in Benchmark
Ok, let's do that (add runsequential to benchmark and all the rest). If I run into this elsewhere as well I will report it and we can talk then about trying to find a solution. If it's just benchmark then I think we'll be ok. Shai

On Thursday, April 1, 2010, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 1, 2010 at 12:03 AM, Shai Erera ser...@gmail.com wrote: [...] I think you got everything. I reopened the JIRA issue too (LUCENE-1709) and listed the things we can do for sure now, such as lowering threadsPerProcessor (and allowing someone to use a system property to override this) and fixing junit temp files to be in the temp directory. Additionally I would like to fix the ant library problem as mentioned there. It works great from the command-line, but we should improve this for IDE users, so they do not see a compile error. I am personally for the idea of adding the runsequential property to benchmark's build.xml, to force it to run serially. While I am unable to reproduce your problem, it does not surprise me, as I had a tough time trying to prevent benchmark
Re: Landing the flex branch
bq. Try a merge back: This would let flex appear as a single commit to trunk, so the history of trunk would be preserved. +1 for that - I think the history of trunk is important to preserve. And there is also a way to ask for flex's history, so everybody wins? Shai On Thursday, April 1, 2010, Uwe Schindler u...@thetaphi.de wrote: Hi, we should think about how to merge the changes to trunk. I can try this out during the weekend, to merge back the changes to trunk, but this can be very hard. So we have the following options: Try a merge back: This would let flex appear as a single commit to trunk, so the history of trunk would be preserved. If somebody wants to see the changes in the flex branch, he could ask for them (e.g. in TortoiseSVN there is a checkbox Include merged revisions). If this is not easy or fails, we can do the following: - Create a big diff between current trunk and flex (after flex is merged up to trunk). Attach this patch to an issue and let everybody review. After that we can apply the patch to trunk. This would result in the same behavior for trunk, no changes lost, but the individual changes in flex cannot be reviewed. - Delete current trunk and svn move the branch to trunk (after flex is merged up to trunk): This would make the history of flex the current history. The drawback: you lose the latest trunk changes since the split of flex. Instead you will only see the merge messages. Therefore we should see this only as a last resort. Comments? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, March 30, 2010 5:35 PM To: java-dev@lucene.apache.org Subject: Landing the flex branch I think the time has finally come! Pending one issue (LUCENE-2354 -- Uwe), I think flex is ready to land. I think the other issues with Fix Version = Flex Branch can be moved to 3.1 after we land. 
We still use the pre-flex APIs in a number of places... I think this is actually good (so we continue to test the back-compat emulation layer). With time we can cut them over. After flex, there are a number of fun things to explore. E.g., we need to make attributes work well with codecs indexing/searching (with Multi/DirReader, serialize/deserialize, etc.); we need a BytesRef + packed ints FieldCache StringIndex variant which should use much less RAM in certain cases; we should build a fast core PForDelta codec; more queries can cut over to operating directly on byte[] terms, etc. But these can all come with time... Thoughts/issues/objections? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Welcome Uwe Schindler to the Lucene PMC
Congratulations Uwe ! Shai On Thursday, April 1, 2010, Earwin Burrfoot ear...@gmail.com wrote: Generics SpecOps made it to the top and are gonna rule us from the shadows :) Congrats! On Thu, Apr 1, 2010 at 16:37, Robert Muir rcm...@gmail.com wrote: Congrats Uwe! On Thu, Apr 1, 2010 at 7:05 AM, Grant Ingersoll gsing...@apache.org wrote: I'm pleased to announce that the Lucene PMC has voted to add Uwe Schindler to the PMC. Uwe has been doing a lot of work in Lucene and Solr, including several of the last releases in Lucene. Please join me in extending congratulations to Uwe! -Grant Ingersoll PMC Chair -- Robert Muir rcm...@gmail.com -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851829#action_12851829 ] Shai Erera commented on LUCENE-2310: +1 for this simplification. Can we just name it Indexable, and omit Document from it? That way it's both shorter, and there is less chance for users to directly link it w/ Document. One thing I didn't understand, though, is what will happen to the ir/is.doc() method? Will those be deprecated in favor of some other class which receives an IR as a parameter and knows how to re-construct an Indexable (Document)? Reduce Fieldable, AbstractField and Field complexity Key: LUCENE-2310 URL: https://issues.apache.org/jira/browse/LUCENE-2310 Project: Lucene - Java Issue Type: Sub-task Components: Index Reporter: Chris Male Attachments: LUCENE-2310-Deprecate-AbstractField-CleanField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-AbstractField.patch, LUCENE-2310-Deprecate-DocumentGetFields-core.patch, LUCENE-2310-Deprecate-DocumentGetFields.patch, LUCENE-2310-Deprecate-DocumentGetFields.patch In order to move field-type-like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality than storing fields, most of which is being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possibly Fieldable), moving much of the functionality into Field and FieldType. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera reassigned LUCENE-2353: -- Assignee: Shai Erera Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2353.patch, LUCENE-2353.patch I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}

It detects ":" in the value and so it thinks it's a per-round property, thus stripping "d:" from the value ... the fix is very simple:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with "\\"
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}

I'll post a patch w/ the above fix + test shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
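The intent of the guard can be captured in a small self-contained sketch. This is not the real Config class: firstRoundValue is a made-up name, the per-round parsing is simplified to "strip the column name and take the first ':'-separated value", and the ":/" check reflects the follow-up patch that also accepts 'c:/temp'-style paths:

```java
// Simplified sketch of the per-round-vs-Windows-path disambiguation
// described in LUCENE-2353 (hypothetical helper, not Lucene's Config).
class ConfigSketch {
    static String firstRoundValue(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval;                 // plain value, no per-round syntax
        } else if (sval.indexOf(":\\") >= 0 || sval.indexOf(":/") >= 0) {
            return sval;                 // Windows absolute path, e.g. d:\something
        }
        // per-round property: "<colName>:v1:v2:..." -- drop the column
        // name and return the first round's value
        int k = sval.indexOf(":");
        String values = sval.substring(k + 1);
        int next = values.indexOf(":");
        return next < 0 ? values : values.substring(0, next);
    }
}
```

Before the guard, "d:\something" was parsed as column name "d" with value "\something", which is exactly the benchmark\work\something failure the report describes.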
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851836#action_12851836 ] Shai Erera commented on LUCENE-2353: Unless there are objections, I plan to commit this shortly.
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851842#action_12851842 ] Shai Erera commented on LUCENE-2310: Right Earwin - agreed. I'd like to summarize a brief discussion we had on IRC around that: the idea is not to provide another interface/class for search purposes, but rather expose the right API from IndexReader, even if it might be a bit low-level. An API like getIndexedFields(docId) and getStoredFields(docId), both optionally taking a FieldSelector, should allow the application to re-construct its Indexable however it wants. And IR/IS don't need to know anything about that. To complete the picture for current users, we can have a static reconstruct() on Document which takes IR, docId and FieldSelector ... BTW, I'm not even sure getIndexedFields can be efficiently supported today. Just listing it here for completeness.
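The application-side re-construction discussed in the comment above can be made concrete with a toy model. None of these types exist in Lucene - StoredDoc and the predicate-based selector are hypothetical stand-ins for "low-level per-field access plus a FieldSelector":

```java
import java.util.*;
import java.util.function.Predicate;

// Toy model: given per-field stored values (as a reader API might expose
// them), re-construct a filtered document, the way an application would
// rebuild its own Indexable from IndexReader primitives.
class StoredDoc {
    private final Map<String, String> fields;

    StoredDoc(Map<String, String> fields) {
        this.fields = fields;
    }

    // 'selector' plays the role of a FieldSelector: only accepted field
    // names make it into the re-constructed view.
    Map<String, String> reconstruct(Predicate<String> selector) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (selector.test(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

The point of the proposal is that IndexReader only needs to hand out fields; the shaping into a Document (or anything else) can live entirely on the application side, as this sketch does.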
[jira] Resolved: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2353. Resolution: Fixed Committed revision 929520.
Parallel tests in Benchmark
Hi

I'd like to summarize a discussion I had w/ Robert and Mike last night on IRC, about the parallelism of tasks in Benchmark:

For some reason, ever since parallel tasks were introduced, when I run 'ant test' from the contrib/benchmark folder (or the root), the tests just hang at some point, after WriteLineDocTaskTest finishes. What's very weird is that it seems I'm the only one experiencing this, and so for a long time I thought it was just a problem w/ my environment ... until yesterday, when I did a fresh checkout of trunk to a fresh folder and project, and the tests still got stuck. A thread dump does not show anything relevant to Lucene code, but rather to Ant. The main thread is waiting on org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on org/apache/tools/ant/taskdefs/Execute.waitFor, and two others on java/io/FileInputStream.read. But nothing is related to Lucene code directly. Also annoyingly, but conveniently for debugging the issue, it happens very consistently on my machine - sometimes the test passes, but 90% of the time it hangs. Running w/ -Drunsequential=1 consistently succeeds.

We've explored different ways to understand the cause of the problem, and came across several improvements and a workaround, but unfortunately not a definite resolution:

* As a last resort, we can add a runsequential property to benchmark's build.xml, which forces Benchmark tests to run sequentially. Since that's a tiny package which takes a few seconds to run anyway, and parallelism doesn't improve much (it actually runs slower, when it passes, on my machine: parallel=15 sec, seq=11 sec), this might be acceptable.

* Moving the junit temp files (such as that flag file) to the temp directory each test uses. This is actually a good thing to do anyway (thanks Robert for spotting that), because it avoids accidental commits of such files :), and doesn't clutter the main environment. We did that because when I hit Ctrl+C to stop one of the runs which hung, we received an FNFE saying a junit flag file is being accessed by another process (something like that), and thought this was related to the hangs I'm seeing. Anyway, this file is accessed by multiple JVMs concurrently, which seems bad.

* Explore the JUnit Formatter code under src/test, since it uses file locking. I've disabled locks (using NoLockFactory), however the test still hung.

* Change common-build.xml threadsPerProcessor to '1' instead of '2'. We think that might be a good thing to do anyway - if people run on machines with just one CPU, threading is not expected to help much, as opposed to running on multiple CPUs. But we don't want to enforce it on anyone, so we think to change the default to '1', but introduce a property 'threadsPerProcessor' which users will be able to set explicitly.

** Surprisingly, when I set it to '1' or '10' (I run on a dual-core Thinkpad W500), the test consistently passes - it just doesn't like the value '2'. At least it passed as long as I ran it; maybe a thread hang is lurking for me around the corner somewhere.

* We made sure the benchmark tests indeed read/write the test data files from/to unique directories. But like I said - there is no hang in Lucene code reported in the thread dump.

It was very late last night when we stopped, and my eyes were tired, so I didn't summarize it right away. Robert, I hope I've captured everything we did; if not, please add. Anyone's got any suggestions? It's unfortunate that I'm the only one running into this problem, because whatever the suggestions are, you'll probably need me to confirm them :). And I'm going away for 3 days (camping - no internet ... well, at least no laptop :)), so unless someone has a suggestion within the coming few hours, we can continue when I get back.

Shai
[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2353: --- Attachment: LUCENE-2353.patch Updated to also match 'c:/temp' like paths, which are also accepted on Windows.
[jira] Commented: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850644#action_12850644 ] Shai Erera commented on LUCENE-2353: I don't have an account yet, so I cannot commit this on my own. Any volunteers?
[jira] Created: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Fix For: 3.1 I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}

It detects ":" in the value and so it thinks it's a per-round property, thus stripping "d:" from the value ... the fix is very simple:

{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with "\\"
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}

I'll post a patch w/ the above fix + test shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2353: --- Attachment: LUCENE-2353.patch The fix is only relevant to get(String, String) and not to the other get(String, type) variants. The benchmark test passed, but after I svn up (to include the latest parallel test change) the test just sits idle (after finishing), waiting for something. If I run the tests in Eclipse they pass, so I'm guessing it's a problem w/ my env. or build.xml? I also tried 'ant clean test' from within benchmark, but it didn't help. I then tried 'ant clean' from root and 'ant test' from benchmark, but the test just keeps waiting on WriteLineDocTaskTest, on this line: [junit] config properties: [junit] directory = RAMDirectory [junit] doc.maker = org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest$JustDateDocMaker [junit] line.file.out = D:\dev\lucene\lucene-trunk\build\contrib\benchmark\test\W\one-line [junit] --- I think this can go in (if it passes on someone else's machine), while I figure out what's wrong in my env. separately. Config incorrectly handles Windows absolute pathnames - Key: LUCENE-2353 URL: https://issues.apache.org/jira/browse/LUCENE-2353 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Reporter: Shai Erera Fix For: 3.1 Attachments: LUCENE-2353.patch I have no idea how no one ran into this so far, but I tried to execute an .alg file which used ReutersContentSource and referenced both docs.dir and work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the run reported an error of missing content under benchmark\work\something. I've traced the problem back to Config, where get(String, String) includes the following code:
{code}
if (sval.indexOf(":") < 0) {
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
...
{code}
It detects ':' in the value and so thinks it's a per-round property, thus stripping 'd:' from the value ... the fix is very simple:
{code}
if (sval.indexOf(":") < 0) {
  return sval;
} else if (sval.indexOf(":\\") >= 0) {
  // this previously messed up absolute path names on Windows. Assuming
  // there is no real value that starts with \\
  return sval;
}
// first time this prop is extracted by round
int k = sval.indexOf(":");
String colName = sval.substring(0, k);
sval = sval.substring(k + 1);
{code}
I'll post a patch w/ the above fix + test shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
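The colon-detection logic above can be sketched in isolation. This is a minimal sketch with the proposed `:\\` guard applied; `resolve` and `ConfigColonSketch` are hypothetical names, not part of the actual Config class, and the real method's per-round bookkeeping is omitted:

```java
// Minimal sketch of the per-round colon detection in Config.get(String, String),
// with the proposed ":\\" guard. Names here are hypothetical stand-ins.
public class ConfigColonSketch {
    static String resolve(String sval) {
        if (sval.indexOf(":") < 0) {
            return sval; // no colon at all: a plain value
        } else if (sval.indexOf(":\\") >= 0) {
            // Windows absolute path such as d:\something -- keep it intact
            return sval;
        }
        // otherwise treat it as a per-round property "name:v1:v2:..."
        // and strip the leading column name, as the quoted code does
        int k = sval.indexOf(":");
        return sval.substring(k + 1);
    }

    public static void main(String[] args) {
        System.out.println(resolve("d:\\something"));        // kept intact
        System.out.println(resolve("merge.factor:10:100"));  // first segment stripped
    }
}
```

Without the guard, the first branch never fires for `d:\something` (it contains a colon), so the per-round path strips `d:` and leaves a relative path that gets resolved under benchmark\work.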
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850075#action_12850075 ] Shai Erera commented on LUCENE-2345: Earwin, w/o knowing too much about the details of your work, I wanted to comment on "get rid of init/reinit/moreinit methods, moving the code to constructors". I'm working now on Parallel Index, and one of the things I do is extend IW. Currently IW's ctor code performs the initialization, however I'm thinking of moving that code to an init method. The reason is to allow easy extensions of IW, such as LUCENE-2330. There I'm going to add a default ctor to IW, accompanied by an init method the extending class can call if needed. So what I'm trying to say is that init methods are not always bad, and sometimes ctors limit you. Perhaps it would make sense though in what you're trying to do ... Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch I would like the ability to subclass SegmentReader for numerous reasons: * to capture initialization/close events * attach custom objects to an instance of a segment reader (caches, statistics, so on and so forth) * override methods on segment reader as needed Currently this isn't really possible. I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader. The default implementation would be something like:
{code}
public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return newSegmentReader(readOnly);
  }
}
{code}
It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc.). I could prepare a patch if others think this has merit. Obviously, this API would be experimental/advanced/will change in future.
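To illustrate the kind of extension the proposal would enable, here is a hypothetical subclass that captures initialization events (one of the stated use cases) by counting the readers it hands out. SegmentReader and ReadOnlySegmentReader are empty stand-ins so the sketch compiles on its own; none of this reflects an actual Lucene API:

```java
// Stand-in reader classes -- NOT the real Lucene types.
class SegmentReader {}
class ReadOnlySegmentReader extends SegmentReader {}

// The factory shape from the proposal, reduced to the get() method.
class SegmentReaderFactory {
    public SegmentReader get(boolean readOnly) {
        return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
    }
}

// Hypothetical subclass: hooks reader creation to collect a statistic.
public class CountingSegmentReaderFactory extends SegmentReaderFactory {
    public int opened = 0;

    @Override
    public SegmentReader get(boolean readOnly) {
        opened++; // capture the initialization event
        return super.get(readOnly);
    }

    public static void main(String[] args) {
        CountingSegmentReaderFactory factory = new CountingSegmentReaderFactory();
        factory.get(true);
        factory.get(false);
        System.out.println(factory.opened); // 2
    }
}
```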
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850083#action_12850083 ] Shai Erera commented on LUCENE-2345: Thanks Uwe, I know that a ctor is the preferred way, and in the process of introducing IWC I deleted IW.init, which all ctors called, and pulled all its code into IW's ctor. I will make that init() on IW final. But sometimes putting code in init() is not bad, and it's used elsewhere in Lucene too (e.g. PQ, and up until recently IW). Make it possible to subclass SegmentReader -- Key: LUCENE-2345 URL: https://issues.apache.org/jira/browse/LUCENE-2345 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith Fix For: 3.1 Attachments: LUCENE-2345_3.0.patch
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850086#action_12850086 ] Shai Erera commented on LUCENE-2215: Sure, let's wait for the patch and some perf. results. paging collector Key: LUCENE-2215 URL: https://issues.apache.org/jira/browse/LUCENE-2215 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4, 3.0 Reporter: Adam Heinz Assignee: Grant Ingersoll Priority: Minor Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898 Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)