Re: Proposal about Version API relaxation
I'd be for this plan if I really thought the stable branch would get similar attention to the experimental branch - but I have some doubts about that. It's a fairly small dev community in comparison to other projects that do this ... Dev on the experimental, latest-greatest fun branch, or the living-in-the-past, back-compat-hassle stable branch? Port most patches to two somewhat diverging code bases? If that was actually how things worked out, I'd be +1. I just wonder ... with the right framing I do think it's possible though.

On 04/16/2010 11:45 AM, Michael McCandless wrote:
Getting back to the stable/experimental branches... I think, with separate stable and experimental branches, development would/should be active on both branches. It'd depend on the feature... E.g. today we'd have the 3.x stable branch and the experimental branch (= trunk). Small features and bug fixes would be ported to both branches. I think features that deprecate some APIs would still be fine on the stable branch. Major changes (e.g. flex) would only be done on the experimental branch. This empowers us, on a feature-by-feature basis, to decide whether it'll be in the stable release or not. The stable branch releases would be 3.0, 3.1, etc., but we could still do the .Z releases (3.0.1, 3.0.2) for bug fixes, if we need to. And we could do alpha releases off the experimental branch as we think we're getting close to cutting a new stable release (4.0). Mike

On Thu, Apr 15, 2010 at 6:58 PM, Robert Muir rcm...@gmail.com wrote:
On Thu, Apr 15, 2010 at 6:50 PM, DM Smith dmsmith...@gmail.com wrote:
Robert has already started one. (1488 I think).

And it could work with this new scheme... because then you could use an older ICU jar file with an older lucene-analyzer-icu.jar or whatever, and you have it more under control.
Under the existing scheme you can't really improve back compat with ICU, because they make API changes and backwards breaks themselves, so you can't make one Tokenizer, say, that does anything meaningful and works with all versions of it... but it would be cool to say: here is lucene-analyzer-icu-4.0.jar that works with ICU 4.4, and you could keep using that as long as you have to (meanwhile trunk could start using ICU 4.6).

-- Robert Muir rcm...@gmail.com

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org

-- - Mark http://www.lucidimagination.com
Re: Proposal about Version API relaxation
On 04/16/2010 12:16 PM, Robert Muir wrote:
On Fri, Apr 16, 2010 at 12:12 PM, Mark Miller markrmil...@gmail.com wrote:
I'd be for this plan if I really thought the stable branch would get similar attention to the experimental branch - but I have some doubts about that. It's a fairly small dev community in comparison to other projects that do this ... Dev on the experimental, latest-greatest fun branch, or the living-in-the-past, back-compat-hassle stable branch? Port most patches to two somewhat diverging code bases? If that was actually how things worked out, I'd be +1. I just wonder ... with the right framing I do think it's possible though.

But this is an open source project still, right? So if you want more attention paid to the stable branch, put your patches where your mouth is (no offense).

I don't think that's how things should work. The project should be framed to guide devs toward what's best for everybody. Right now all devs work on a stable branch because we have policies that encourage that. We could also make policies that encourage every-dev-for-himself crap development.

If no one wants to put new features in the back-compat hassle branch, well, then that's a sign that no one cares about it.

It's not a sign that users don't care about it. Lately I think you have taken the stance: users be damned, Lucene dev should just be geared toward devs. I'm not a fan of that kind of attitude when it comes to Lucene dev myself.

-- Robert Muir rcm...@gmail.com

-- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances
[ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857904#action_12857904 ] Mark Miller commented on LUCENE-2287: - Hey Michael - it looks like there is a lot of reformatting in this patch - if it's not too much of a hassle, is it possible to get a patch without the formatting changes?

Unexpected terms are highlighted within nested SpanQuery instances
Key: LUCENE-2287 URL: https://issues.apache.org/jira/browse/LUCENE-2287 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Affects Versions: 2.9.1 Environment: Linux, Solaris, Windows Reporter: Michael Goddard Priority: Minor Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch Original Estimate: 336h Remaining Estimate: 336h

I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.
This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Proposal about Version API relaxation
If you absolutely cannot re-index, and you have *no* access to the data again - you are one ballsy mofo to upgrade to a new version of Lucene for features. It means you likely BASE jump in your free time?

On 04/15/2010 10:14 AM, Erick Erickson wrote:
Coming in late to the discussion, and without really understanding the underlying Lucene issues, but...

The size of the problem of reindexing is under-appreciated, I think. Somewhere in my company is the original data I indexed. But the effort it would take to resurrect it is O(unknown). An unfortunate reality of commercial products is that they often receive very little love for extended periods of time until, all of a sudden, more work is required. There ensues an extended period of re-orientation, even if the people who originally worked on the project are still around.

*Assuming* the data is available to reindex (and there are many reasons besides poor practice on the part of the company that it may not be), remembering/finding out exactly which of the various backups you made of the original data is the one that's actually in your product can be highly non-trivial. Compounded by the fact that the product manager will be adamant about "Do NOT surprise our customers." So I can be in a spot of saying I *think* I have the original data set, and I *think* I have the original code used to index it, and if I get a new version of Lucene I *think* I can recreate the index, and after all that effort is completed I *think* the user will see the expected change - but we won't know until we try it. That puts me in a very precarious position.

This assumes that I have a reasonable chance of getting the original data. But say I've been indexing data from a live feed. Sure as hell hope I stored the data somewhere, because going back to the source and saying "please resend me 10 years' worth of data that I have in my index" is...er...hard.
Or say that the original provider has gone out of business, or the licensing arrangement specifies a one-time transmission of data that may not be retained in its original form, or...

The point of this long diatribe is that there are many reasons why reindexing is impossible and/or impractical. Making any decision that requires reindexing for a new version is locking a user into a version, potentially forever. We should not underestimate how painful that can be, and should never think that "just reindex" is acceptable in all situations. It's not. Period. Be very clear that some number of Lucene users will absolutely not be able to reindex. We may still make a decision that requires this, but let's make it without deluding ourselves that it's a possible solution for everyone.

So an upgrade tool seems like a reasonable compromise. I agree that being hampered in what we can develop in Lucene by having to accommodate reading old indexes slows new features etc. It's always nice to be able to work without dealing with pesky legacy issues <g>. Perhaps splitting out the index upgrades into a separate program lets us accommodate both concerns.

FWIW
Erick

On Thu, Apr 15, 2010 at 9:42 AM, Danil ŢORIN torin...@gmail.com wrote:
True. Just need the tool.

On Thu, Apr 15, 2010 at 16:39, Earwin Burrfoot ear...@gmail.com wrote:
On Thu, Apr 15, 2010 at 17:17, Yonik Seeley yo...@lucidimagination.com wrote:
Seamless online upgrades have their place too... say you are upgrading one server at a time in a cluster.

Nothing here that can't be solved with an upgrade tool. Down one server, upgrade index, upgrade software, up.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785

-- - Mark http://www.lucidimagination.com
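The "down one server, upgrade index, upgrade software, up" step Earwin describes could be little more than the following sketch (my illustration, not an existing tool; assuming the trunk-era IndexWriterConfig API, where optimize() merges all live segments and thereby rewrites them in the current index format - the analyzer choice is irrelevant since no documents are added):

```java
import java.io.File;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Hypothetical offline upgrade step: open the old index with the new
// Lucene version and optimize, which rewrites every segment in the
// current file format before the upgraded server comes back up.
public class UpgradeIndexTool {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File(args[0]));
    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_30, new WhitespaceAnalyzer(Version.LUCENE_30));
    IndexWriter writer = new IndexWriter(dir, conf);
    writer.optimize(); // merge all segments -> rewritten in current format
    writer.close();
  }
}
```

Whether optimize() is an acceptable cost (it rewrites the whole index) is exactly the kind of trade-off an explicit upgrade tool makes visible, versus paying it invisibly at search time with a back-compat layer.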
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856916#action_12856916 ] Mark Miller commented on LUCENE-2159: - There is an excellent section on it in LIA2 :)

Tool to expand the index for perf/stress testing.
Key: LUCENE-2159 URL: https://issues.apache.org/jira/browse/LUCENE-2159 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 3.0 Reporter: John Wang Attachments: ExpandIndex.java

Sometimes it is useful to take a small-ish index and expand it into a large index with K segments for perf/stress testing. This tool does that. See attached class.
Re: Proposal about Version API relaxation
On 04/14/2010 12:29 PM, Marvin Humphrey wrote:
On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:
The thing I keep going back to is that somehow Lucene has managed for years (and I mean lots of years) w/o stuff like Version and all this massive back compatibility checking.

Non-constant global variables are an anti-pattern.

I think clinging to such rules in the face of all situations is an anti-pattern :) I take it as a rule of thumb.

Regarding this discussion: I agree that the Version stuff is a bit of a mess. I also agree that many users will want to just use one version across their app that is easy to change. I disagree that we should allow that behavior by just using a constructor without the Version param - or that you should be forced to set the static Version setting by running your app and seeing an exception happen. That is all a bit ugly. Too many users will not understand Version, or care to, if they see they can skip passing it. IMO, you should have to specify that you are looking for this behavior. In which case, why not just specify it using the Version param itself :) E.g. if a user wants this kind of static behavior, they can just choose to do it on their own, and pass their *own* static Version constant to all the constructors. I don't think we need to go through this hassle and introduce a less-than-ideal solution just so that users can pass one less param - especially when I think you should explicitly choose this behavior rather than get it by ignoring the Version param.

-- - Mark http://www.lucidimagination.com
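The "pass your *own* static Version constant" approach Mark suggests could look like this sketch (AppVersion is a hypothetical application class of my own invention, not a Lucene API):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Hypothetical app-level constant: one place to bump the match version
// for the whole application, while every Lucene constructor still
// receives Version explicitly - no static mutable state in Lucene itself.
public final class AppVersion {
  public static final Version MATCH = Version.LUCENE_30;

  private AppVersion() {}
}

// Usage elsewhere in the app:
// StandardAnalyzer analyzer = new StandardAnalyzer(AppVersion.MATCH);
```

The upgrade story is then a one-line change in AppVersion, and the choice to share one version app-wide is explicit in the application's own code rather than implicit in a skipped parameter.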
[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857155#action_12857155 ] Mark Miller commented on LUCENE-2393: - Perhaps this should be combined with the high-freq terms tool ... we could make a ton of these little guys, so it's probably best to consolidate them.

Utility to output total term frequency and df from a lucene index
Key: LUCENE-2393 URL: https://issues.apache.org/jira/browse/LUCENE-2393 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Tom Burton-West Priority: Trivial Attachments: LUCENE-2393.patch

This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *prx files and consequent disk I/O demands.
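The core of such a utility might be sketched like this against the 3.x TermDocs API (my illustration of the described behavior, not the actual LUCENE-2393 patch): df counts documents containing the term, and totalTf sums the within-document frequencies.

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Hypothetical helper: returns { df, totalTf } for one term, where
// totalTf is the sum of freq() over all documents containing the term.
public class TermStats {
  public static long[] stats(IndexReader reader, String field, String text)
      throws IOException {
    long df = 0, totalTf = 0;
    TermDocs td = reader.termDocs(new Term(field, text));
    try {
      while (td.next()) {
        df++;              // one more document contains the term
        totalTf += td.freq(); // occurrences within this document
      }
    } finally {
      td.close();
    }
    return new long[] { df, totalTf };
  }
}
```

Consolidating this with the existing high-frequency-terms tool, as Mark suggests, would just mean iterating terms and applying the same loop per term.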
Re: Proposal about Version API relaxation
I don't read what you wrote and what Mike wrote as even close to the same.

- Mark http://www.lucidimagination.com (mobile)

On Apr 15, 2010, at 12:05 AM, Shai Erera ser...@gmail.com wrote:
Ahh ... a dream finally comes true ... what a great way to start a day :). +1 !!! I have some questions/comments though:

* Index back compat should be maintained between major releases, like it is today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their segments when they move from 2.x to 3.x, before 4.0 lands and they'll need to call optimize() to ensure 4.0 still works on their index. I hope that will still be the case? Otherwise I don't see how we can prevent reindexing by apps.
** Index behavioral/runtime changes, like those of Analyzers, are ok to require a reindex, as proposed.

So after 3.1 is out, trunk can break the API and 3.2 will have a new set of APIs? Cool and convenient. For how long do we keep the 3.1 branch around? Also, it used to only fix bugs, but from now on it'll be allowed to introduce new features, if they maintain back-compat? So 3.1.1 can have 'flex' (going for the extreme on purpose) if someone maintains back-compat?

I think the back-compat on branches should be only for index runtime changes. There's no point, in my opinion, to maintain API back-compat anymore for jar drop-ins - if apps will need to upgrade from 3.1 to 3.1.1 just to get a new feature, why keep it API back-supported? As soon as they upgrade to 3.2, that means a new set of APIs, right? Major releases will just change the index structure format then? Or move to Java 1.6? Well ... not even that, because as I understand it, 3.2 can move to Java 1.6 ... no API back-compat, right :).

That's definitely a great step forward!

Shai

On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda va...@osafoundation.org wrote:
On Thu, 15 Apr 2010, Earwin Burrfoot wrote:
Can't believe my eyes. +1

Likewise. +1 ! Andi..
On Thu, Apr 15, 2010 at 01:22, Michael McCandless luc...@mikemccandless.com wrote:
On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey mar...@rectangular.com wrote:
Essentially, we're free to break back compat within Lucy at any time, but we're not able to break back compat within a stable fork like Lucy1, Lucy2, etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file.

So... what if we change up how we develop and release Lucene:

* A major release always bumps the major release number (2.x - 3.0), and starts a new branch for all minor (3.1, 3.2, 3.3) releases along that branch
* There is no back compat across major releases (index nor APIs), but full back compat within branches.

This would match how many other projects work (KS/Lucy, as Marvin describes above; Apache Tomcat; Hibernate; log4j; FreeBSD; etc.). The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and, if any devs have the itch, they could freely back-port improvements from trunk as long as they kept back-compat within the branch.

I think in such a future world, we could:

* Remove Version entirely!
* Not worry at all about back-compat when developing on trunk
* Give proper names to new improved classes, instead of the StandardAnalyzer2 or SmartStandardAnalyzer we end up doing today; rename existing classes.
* Let analyzers freely, incrementally improve
* Use interfaces without fear
* Stop spending the truly substantial time (look @ Uwe's awesome back-compat layer for analyzers!) that we now must spend on back-compat when adding new features
* Be more free to introduce very new, not-fully-baked features/APIs, marked as experimental, on the expectation that once they are used (in trunk) they will iterate/change/improve, vs trying so hard to get things right on the first go for fear of future back-compat horrors.

Thoughts...?
Mike

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855732#action_12855732 ] Mark Miller commented on LUCENE-2386: - Is this change worth it, with all of its repercussions? What are the upsides? There do appear to be downsides...

IndexWriter commits unnecessarily on fresh Directory
Key: LUCENE-2386 URL: https://issues.apache.org/jira/browse/LUCENE-2386 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.1 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch

I've noticed IndexWriter's ctor makes a first (empty) commit if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. I tried to change doCommit to false in the IW ctor, but IndexFileDeleter jumped on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically !) :).
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855740#action_12855740 ] Mark Miller commented on LUCENE-2386: - {quote}I do think this is a good change - IW was previously inconsistent: first, it would even make a commit when we no longer have an autoCommit=true, and second, it would not make the commit for a directory that already had an index (we fixed this case a while back). So I like that this fix makes IW's init behavior more consistent / simpler.{quote}

That's not a very strong argument for a back-compat break on a minor release though...
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855748#action_12855748 ] Mark Miller commented on LUCENE-2386: - bq. Hmmm... I think the back compat break is very minor

Yes - it is - but so was the argument for it, IMO. Your extended argument is more compelling though.
[jira] Created: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10
Spellchecker uses default IW mergeFactor/ramMB settings of 300/10
Key: LUCENE-2391 URL: https://issues.apache.org/jira/browse/LUCENE-2391 Project: Lucene - Java Issue Type: Improvement Components: contrib/spellchecker Reporter: Mark Miller Priority: Trivial

These settings seem odd - I'd like to investigate what makes the most sense here.
Re: Controlling the maximum size of a segment during indexing
Setting maxMergeMB does not limit the size of segments you will see - it simply limits which segments will be merged: segments over maxMergeMB will not be merged with other segments. You can still buffer up a ton of docs in RAM and flush a segment larger than maxMergeMB, or merge n segments smaller than maxMergeMB that create a segment larger than maxMergeMB.

-- - Mark http://www.lucidimagination.com

On 04/09/2010 01:01 AM, Lance Norskog wrote:
Here is a Java unit test that uses the LogByteSizeMergePolicy to control the maximum size of segment files during indexing. That is, it tries. It does not succeed. Will someone who truly understands the merge policy code please examine it? There is probably one tiny parameter missing. It adds 20 documents that are each 100k in size. It creates an index in a RAMDirectory which should have one segment that's a tad over 1mb, and then a set of segments that are a tad over 500k. Instead, the data does not flush until it commits, writing one 5m segment.

- org.apache.lucene.index.TestIndexWriterMergeMB ---

package org.apache.lucene.index;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.LuceneTestCase;

/*
 * Verify that segment sizes are limited to # of bytes.
 *
 * Sizing:
 * Max MB is 0.5m. Verify against this plus 100k slop. (1.2x)
 * Min MB is 10k.
 * Each document is 100k.
 * mergeSegments=2
 * MaxRAMBuffer=1m. Verify against this plus 200k slop. (1.2x)
 *
 * This test should cause the ram buffer to flush after 10 documents, and create a CFS a little over 1meg.
 * The later documents should be flushed to disk every 5-6 documents, and create CFS files a little over 0.5meg.
 */
public class TestIndexWriterMergeMB extends LuceneTestCase {
  private static final int MERGE_FACTOR = 2;
  private static final double RAMBUFFER_MB = 1.0;
  static final double MIN_MB = 0.01d;
  static final double MAX_MB = 0.5d;
  static final double SLOP_FACTOR = 1.2d;
  static final double MB = 1000 * 1000;
  static String VALUE_100k = null;

  // Test controlling the merge policy for max segment size in bytes
  public void testMaxMergeMB() throws IOException {
    Directory dir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(
        TEST_VERSION_CURRENT, new WhitespaceAnalyzer(TEST_VERSION_CURRENT));
    LogByteSizeMergePolicy mergeMB = new LogByteSizeMergePolicy();
    config.setMergePolicy(mergeMB);
    mergeMB.setMinMergeMB(MIN_MB);
    mergeMB.setMaxMergeMB(MAX_MB);
    mergeMB.setUseCompoundFile(true);
    mergeMB.setMergeFactor(MERGE_FACTOR);
    config.setMaxBufferedDocs(100); // irrelevant, but the next line fails without this.
    config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    MergeScheduler scheduler = new SerialMergeScheduler();
    config.setMergeScheduler(scheduler);
    IndexWriter writer = new IndexWriter(dir, config);

    System.out.println("Start indexing");
    for (int i = 0; i < 50; i++) {
      addDoc(writer, i);
      printSegmentSizes(dir);
    }
    checkSegmentSizes(dir);
    System.out.println("Commit");
    writer.commit();
    printSegmentSizes(dir);
    checkSegmentSizes(dir);
    writer.close();
  }

  // document that takes ~100k of RAM
  private void addDoc(IndexWriter writer, int i) throws IOException {
    if (VALUE_100k == null) {
      StringBuilder value = new StringBuilder(100000);
      for (int fill = 0; fill < 100000; fill++) {
        value.append('a');
      }
      VALUE_100k = value.toString();
    }
    Document doc = new Document();
    doc.add(new Field("id", i + "", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content",
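One detail of the test above may explain the behavior Lance reports (an observation on my part, not a verified fix): the config disables RAM-based flushing and sets maxBufferedDocs to 100, but only 50 documents are added, so no flush trigger ever fires before commit() - consistent with Mark's point that maxMergeMB only constrains merging, never the size of a flushed segment. Letting the RAM buffer drive flushing instead might look like:

```java
// Sketch, assuming the trunk-era IndexWriterConfig API: let RAM usage
// trigger flushes instead of disabling auto-flush entirely, so segments
// are written out at roughly RAMBUFFER_MB rather than only at commit().
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
config.setRAMBufferSizeMB(RAMBUFFER_MB); // flush roughly every ~1 MB of buffered docs
```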
[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855489#action_12855489 ] Mark Miller commented on LUCENE-2372: - bq.If I make it final and +1 - let's just remember to add these breaks to the CHANGES BW break section... Replace deprecated TermAttribute by new CharTermAttribute - Key: LUCENE-2372 URL: https://issues.apache.org/jira/browse/LUCENE-2372 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1 Reporter: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2372.patch, LUCENE-2372.patch, LUCENE-2372.patch After LUCENE-2302 is merged to trunk with flex, we need to carry over all tokenizers and consumers of the TokenStreams to the new CharTermAttribute. We should also think about adding an AttributeFactory that creates a subclass of CharTermAttributeImpl that returns collation keys in the toBytesRef() accessor. CollationKeyFilter is then obsolete; instead you can simply convert every TokenStream to indexing only CollationKeys by changing the attribute implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854899#action_12854899 ] Mark Miller commented on LUCENE-2074: - {quote}Uwe, must this be coupled with that issue? This one waits for a long time (why? for the JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems.{quote} Agreed. Whether it's fixed as part of this commit or not, it really deserves its own issue anyway, for changes and tracking. It has nothing to do with this issue other than convenience. Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer --- Key: LUCENE-2074 URL: https://issues.apache.org/jira/browse/LUCENE-2074 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch The current trunk version of StandardTokenizerImpl was generated for Java 1.4 (according to the warning). In Lucene 3.0 we switched to Java 1.5, so we should regenerate the file. After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1895) Point2D defines equals by comparing double types with ==
[ https://issues.apache.org/jira/browse/LUCENE-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853402#action_12853402 ] Mark Miller commented on LUCENE-1895: - I put this up not knowing really anything about the specific use case(s) of the Point2D class - I have never used Spatial - so close if it makes sense to do so. My generic worry is that you can come to the *same* double value in two different ways, but == will not find them to be equal. Point2D defines equals by comparing double types with == Key: LUCENE-1895 URL: https://issues.apache.org/jira/browse/LUCENE-1895 Project: Lucene - Java Issue Type: Bug Components: contrib/spatial Reporter: Mark Miller Assignee: Chris Male Priority: Trivial Ideally, this should allow for a margin of error right? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
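The worry above is easy to reproduce with plain Java. A minimal sketch (not the actual contrib/spatial Point2D code; the class name, `almostEqual` helper, and epsilon value are all assumptions for illustration) of why `==` on doubles is fragile, and the margin-of-error alternative the issue suggests:

```java
// Two computations of the "same" value can produce different doubles,
// so == fails where an epsilon comparison succeeds.
public class EpsilonEqualsDemo {
    static final double EPSILON = 1e-9; // hypothetical margin of error

    static boolean almostEqual(double a, double b) {
        return Math.abs(a - b) <= EPSILON;
    }

    public static void main(String[] args) {
        double x = 0.1 + 0.2; // 0.30000000000000004 in IEEE 754 doubles
        double y = 0.3;
        System.out.println(x == y);            // false
        System.out.println(almostEqual(x, y)); // true
    }
}
```

Picking the epsilon is the hard part; for coordinates it would presumably be tied to the precision the spatial use case actually needs.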
Re: Welcome Shai Erera as Lucene/Solr committer
On 03/26/2010 09:07 AM, Michael McCandless wrote: I'm happy to announce that the PMC has accepted Shai Erera as Lucene/Solr committer! Welcome aboard Shai, Mike PS: it's customary to introduce yourself with a brief bio :) Congrats Shai! -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modules
I also like the idea of a very basic analyzer set - I think you should still be able to do things with just the core jar - even if it's only very basic things. On 03/26/2010 11:56 AM, Uwe Schindler wrote: That will also mean heavy ANT build refactoring (oh no...). But I am also for a basic analyzer set (without Standard!!!). Uwe -Original Message- From: Shai Erera [mailto:ser...@gmail.com] Sent: Friday, March 26, 2010 4:16 PM To: java-dev@lucene.apache.org Subject: Re: Modules +1 for moving modules up one level. As for analyzers, I also prefer if lucene won't depend on modules even if just for the tests. That way one who doesn't use any module can check out lucene only. We can keep in lucene some basic analyzers (Whitespace, Simple) as well as a best out-of-the-box choice - Standard for new users. Shai On Friday, March 26, 2010, Earwin Burrfootear...@gmail.com wrote: Sounds good to me. I guess one thing to think about is the analyzers in core (should they move to this module, too?). If so, perhaps we could make 'ant test' of lucene depend on this module, since core tests use analyzers. But you could use lucene without an analyzers module, it wouldn't be a real dependency. You can leave only the most basic analyzers (one of them?) in core, switching over the tests that currently use more advanced ones. Then you move everything else to the module, along with analyzer-specific tests.
-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New LuSolr trunk
This looks good to me. +1 on landing flex now. On 03/22/2010 08:27 AM, Uwe Schindler wrote: Hi all, the discussion about where to do development after the merge is now becoming concrete: Currently a lusolr test-trunk is done as a branch inside solr (https://svn.apache.org/repos/asf/lucene/solr/branches/newtrunk). The question is where to put the main development and how to switch, so non-developers that have checkouts of solr and/or lucene will notice the change and not send us outdated patches. I propose to do the following:
- Start a new top-level project folder inside the /lucene root svn folder: https://svn.apache.org/repos/asf/lucene/lusolr (please treat lusolr as a placeholder name) and add branches, tags subfolders to it. Do not create trunk; do this together with the next step.
- Move the branch from https://svn.apache.org/repos/asf/lucene/solr/branches/newtrunk to this new directory as trunk
- For lucene flexible indexing, create a corresponding flex branch there and svn copy it from the current new trunk. Merge the lucene flex changes into it. Alternatively, land flex now. Or simply do an svn copy of the current flex branch instead of merging (may be less work).
- Do the same for possible solr branches in development
- Create a tag in the lucene tags folder and in the solr tags folder with the current state of each trunk. After that delete all contents from the old trunk in solr and lucene and place a readme file pointing developers to the new merged trunk folder (for both old trunks).
This last step is important, else people who checkout the old trunk will soon see a very outdated view and may send us outdated patches in JIRA. When the contents of old-trunk disappear it's obvious to them what happened. If they already had some changes in their checkout, the svn client will keep the changed files as unversioned (after upgrade). The history stays available, so it's also possible to checkout an older version from trunk using @rev or -r rev.
I did a similar step with some backwards compatibility changes in lucene (add a README). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, March 22, 2010 11:37 AM To: java-dev@lucene.apache.org Subject: Re: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling I think we should. It (newtrunk) was created to test Hoss's side-by-side proposal, and that approach looks to be working very well. Up until now we've been committing to the old trunk and then systematically merging over to newtrunk. I think we should now flip that, i.e., commit to newtrunk and only merge back to the old trunk if for some strange reason it's needed. Mike On Mon, Mar 22, 2010 at 6:32 AM, Uwe Schindleru...@thetaphi.de wrote: Are we now only working on newtrunk? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless (JIRA) [mailto:j...@apache.org] Sent: Monday, March 22, 2010 11:22 AM To: java-dev@lucene.apache.org Subject: [jira] Resolved: (LUCENE-2297) IndexWriter should let you optionally enable reader pooling [ https://issues.apache.org/jira/browse/LUCENE- 2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all- tabpanel ] Michael McCandless resolved LUCENE-2297. Resolution: Fixed Fixed on newtrunk. IndexWriter should let you optionally enable reader pooling --- Key: LUCENE-2297 URL: https://issues.apache.org/jira/browse/LUCENE- 2297 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-2297.patch For apps that use a large index and frequently need to commit and resolve deletes, the cost of opening the SegmentReaders on demand for every commit can be prohibitive. We can already pool readers (NRT does so), but we only turn it on if NRT readers are in use.
We should allow separate control. We should do this after LUCENE-2294. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848712#action_12848712 ] Mark Miller commented on LUCENE-1709: - +1 on removing those flags - personally I find them unnecessary - and they complicate the build. And I would love to see Lucene run its tests in parallel like Solr does now. Parallelize Tests - Key: LUCENE-1709 URL: https://issues.apache.org/jira/browse/LUCENE-1709 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Fix For: 3.1 Attachments: LUCENE-1709.patch, runLuceneTests.py Original Estimate: 48h Remaining Estimate: 48h The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669 Notes from Mike M.: {quote} I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked-up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make. {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
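The "use N threads and it'd do the right thing" idea Mike describes can be sketched in plain Java with an executor. This is a toy stand-in, not the Lucene build or the Ant `<parallel>` task; the class name and the dummy "test" tasks are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Farm independent "test" tasks out to N worker threads, like make's -j flag.
public class ParallelRunner {
    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> results = new ArrayList<Future<String>>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            results.add(pool.submit(new Callable<String>() {
                public String call() {
                    // stand-in for running one test class
                    return "test-" + id + " passed";
                }
            }));
        }
        // collect in submit order; a Future that threw would fail the "build" here
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

The tests really are embarrassingly parallel in this model: each task is independent, so the only coordination point is collecting the results.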
Re: Running the Solr/Lucene tests failed
Robert very recently committed some stuff that parallelizes the solr tests that may need to be worked out in all cases still (if that is indeed the problem here). A variety of devs have tested it, but there may be a lingering issue? No helpful errors printed above BUILD FAILED? The line the errors you pasted gives is simply the line that fails the build if tests failed. There is still a way to run them sequentially (as Hudson should be doing) that Robert should be able to let you in on as well. But it would be nice to get to the bottom of this. - Mark On 03/23/2010 03:36 PM, Michael Busch wrote: Hi all, I wanted to commit LUCENE-2329. I just checked out the new combined trunk https://svn.apache.org/repos/asf/lucene/dev/trunk and ran ant test. After 20 mins the build failed on the unmodified code (see below). I hadn't applied my patch yet. What's the status of the combined trunk? Should the tests pass? As far as I can tell all lucene tests were successful (core, contrib, bw), but the Solr tests failed. Is there more setup for the Solr part necessary after 'svn checkout'? Michael BUILD FAILED /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/build.xml:28: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:393: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! 
[the same "Tests failed!" line from solr/build.xml:472 repeated for each remaining failed test run] Total time: 19 minutes 38 seconds -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Running the Solr/Lucene tests failed
If you do an update your issue should be resolved. This is something we ran into the other day as well, and have been solving it a bit at a time ;) - Mark On 03/23/2010 04:29 PM, Robert Muir wrote: Yeah, it's a bit confusing... before, exceptions happening in other threads were silently hidden. Uwe fixed this in Lucene I think, and right now the verbosity is cranked for Solr, too. Yonik is hacking away at these tests to quiet the ones that are truly expected exceptions... At least I think I got this right... On Tue, Mar 23, 2010 at 4:26 PM, Michael Buschbusch...@gmail.com wrote: I see. And all the other exceptions printed are expected? Michael On 3/23/10 1:20 PM, Robert Muir wrote: Thanks Michael, this isn't a parallel test problem at all, it's a sporadic problem with Solr's Jetty tests (the same problem I mentioned in the previous response). You might/will see this problem running the tests sequentially too. Test org.apache.solr.client.solrj.embedded.JettyWebappTest FAILED On Tue, Mar 23, 2010 at 4:15 PM, Michael Buschbusch...@gmail.comwrote: Sorry for the lack of details. Thought I had just not done an obvious step. Attached is the output from the Solr part. Btw: This machine is a Solr virgin, Solr never ran on it before. Michael On 3/23/10 1:00 PM, Mark Miller wrote: Robert very recently committed some stuff that parallelizes the solr tests that may need to be worked out in all cases still (if that is indeed the problem here). A variety of devs have tested it, but there may be a lingering issue? No helpful errors printed above BUILD FAILED? The line the errors you pasted give is simply the line that fails the build if tests failed. There is still a way to run them sequentially (as Hudson should be doing) that Robert should be able to let you in on as well. But it would be nice to get to the bottom of this. - Mark On 03/23/2010 03:36 PM, Michael Busch wrote: Hi all, I wanted to commit LUCENE-2329.
I just checked out the new combined trunk https://svn.apache.org/repos/asf/lucene/dev/trunk and ran ant test. After 20 mins the build failed on the unmodified code (see below). I hadn't applied my patch yet. What's the status of the combined trunk? Should the tests pass? As far as I can tell all lucene tests were successful (core, contrib, bw), but the Solr tests failed. Is there more setup for the Solr part necessary after 'svn checkout'? Michael BUILD FAILED /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/build.xml:28: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:393: The following error occurred while executing this line: /Users/michael/Documents/workspace/lucene-solr-trunk/trunk/solr/build.xml:472: Tests failed! [the same "Tests failed!" line from solr/build.xml:472 repeated for each remaining failed test run]
[jira] Commented: (LUCENE-1814) Some Lucene tests try and use a Junit Assert in new threads
[ https://issues.apache.org/jira/browse/LUCENE-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847931#action_12847931 ] Mark Miller commented on LUCENE-1814: - Chris Male mentioned to me that he thinks Uwe has fixed this? Some Lucene tests try and use a Junit Assert in new threads --- Key: LUCENE-1814 URL: https://issues.apache.org/jira/browse/LUCENE-1814 Project: Lucene - Java Issue Type: Bug Reporter: Mark Miller Priority: Minor There are a few cases in Lucene tests where JUnit Asserts are used inside a new threads run method - this won't work because Junit throws an exception when a call to Assert fails - that will kill the thread, but the exception will not propagate to JUnit - so unless a failure is caused later from the thread termination, the Asserts are invalid. TestThreadSafe TestStressIndexing2 TestStringIntern -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
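The failure mode LUCENE-1814 describes, and the usual fix, can be shown in a self-contained sketch (plain Java, no JUnit dependency; `AssertionError` stands in for a failed JUnit Assert, and the class and field names are assumptions): record any Throwable the worker thread produces, then surface it back on the main thread after join().

```java
// An assertion failing inside a spawned thread kills only that thread;
// the test passes unless the main thread captures and rethrows the failure.
public class ThreadAssertDemo {
    public static void main(String[] args) throws InterruptedException {
        final Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    // stand-in for a JUnit Assert that fails in the worker
                    throw new AssertionError("deliberate failure");
                } catch (Throwable t) {
                    failure[0] = t; // capture instead of dying silently
                }
            }
        });
        worker.start();
        worker.join();
        // back on the main (JUnit) thread, the captured failure can now fail the test
        if (failure[0] != null) {
            System.out.println("propagated: " + failure[0].getMessage());
        }
    }
}
```

Without the capture-and-rethrow step, the AssertionError unwinds the worker's stack and the test framework never sees it, which is exactly why the asserts in TestThreadSafe and friends were invalid.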
Re: lucene and solr trunk
Alright, so we have implemented Hoss' suggestion here on the lucene/solr merged dev branch at lucene/solr/branches/newtrunk. Feel free to check it out and give some feedback. We also roughly have Solr running on Lucene trunk - e.g. compiling Solr will first compile lucene and run off those compiled class files. Running dist or example in Solr will grab Lucene's jars and put them in the war. This still needs further love, but it works. There is also a top-level build.xml with two targets: clean, and test. Clean will clean both Lucene and Solr, and test will run tests for both Lucene and Solr. Thanks to everyone that contributed to getting all this working! -- - Mark http://www.lucidimagination.com On 03/17/2010 12:40 PM, Mark Miller wrote: Okay, so this looks good to me (a few others seemed to like it - though Lucene-Dev was somehow dropped earlier) - let's try this out on the branch? (then we can get rid of that horrible branch name ;) ) Anyone on the current branch object to having to do a quick svn switch? On 03/16/2010 06:46 PM, Chris Hostetter wrote: : Otis, yes, I think so, eventually. But that's gonna take much more discussion. : : I don't think this initial cutover should try to solve how modules : will be organized, yet... we'll get there, eventually. But we should at least consider it, and not move in a direction that's distinct from the ultimate goal of better refactoring (especially since that was one of the main goals of unifying development efforts) Here's my concrete suggestion that could be done today (for simplicity: $svn = https://svn.apache.org/repos/asf/lucene)...
svn mv $svn/java/trunk $svn/java/tmp-migration
svn mkdir $svn/java/trunk
svn mv $svn/solr/trunk $svn/java/trunk/solr
svn mv $svn/java/tmp-migration $svn/java/trunk/core
At which point: 0. People who want to work only on Lucene-Java can start checking out $svn/java/trunk/core (I'm pretty sure existing checkouts will continue to work w/o any changes, the svn info should just update itself) 1.
build files can be added to (the new) $svn/java/trunk to build ./core followed by ./solr 2. the build files in $svn/java/trunk/solr can be modified to look at ../core/ to find lucene jars 3. people who care about Solr (including all committers) should start checking out and building all of $svn/java/trunk 4. Long term, we could choose to branch all of $svn/java/trunk for releases ... AND/OR we could choose to branch specific modules (ie: solr) independently (with modifications to the build files on those branches to pull in their dependencies from alternate locations) 5. Long term, we can start refactoring additional modules out of $svn/java/trunk/solr and $svn/java/trunk/core (like $svn/java/trunk/core/contrib) into their own directory in $svn/java/trunk 6. Long term, people who want to work on more than just core but don't care about certain modules (like solr) can do a simple non-recursive checkout of $svn/java/trunk and then do full checkouts of whatever modules they care about (Please note: I'm just trying to list things we *could* do if we go this route, I'm not advocating that we *should* do any of these things) I can't think of any objections people have raised to any of the previous suggestions which apply to this suggestion. Is there anything people can think of that would be useful, but not possible, if we go this route? -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846542#action_12846542 ] Mark Miller commented on LUCENE-2305: - Ah, yes - I didn't remember your comment right: {quote} We could make the change under Version? (Change to true, starting in 3.1). Or maybe not make the change. If set to true, we use pct deletion on a segment to reduce its perceived size when selecting merges, which generally causes segments with pending deletions to be merged away sooner {quote} Sounds like a good move. Introduce Version in more places long before 4.0 Key: LUCENE-2305 URL: https://issues.apache.org/jira/browse/LUCENE-2305 Project: Lucene - Java Issue Type: Improvement Reporter: Shai Erera Fix For: 3.1 We need to introduce Version in as many places as we can (wherever it makes sense of course), and preferably long before 4.0 (or shall I say 3.9?) is out. That way, we can have a bunch of deprecated API now, that will be gone in 4.0, rather than doing it one class at a time and never finish :). The purpose is to introduce Version wherever it is mandatory now, and also in places where we think it might be useful in the future (like most of our Analyzers, configured classes and configuration classes). I marked this issue for 3.1, though I don't expect it to end in 3.1. I still think it will be done one step at a time, perhaps for clusters of classes together. But on the other hand I don't want to mark it for 4.0.0 because that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd mark it for 3.9. We can do several commits in one issue, right? So this one can live for a while in JIRA, while we gradually convert more and more classes. The first candidate is InstantiatedIndexWriter which probably should take an IndexWriterConfig. While I converted the code to use IWC, I've noticed Instantiated defaults its maxFieldLength to the current default (10,000) which is deprecated.
I couldn't change it for back-compat reasons. But we can upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 3.1, otherwise stay w/ the deprecated default. If it's acceptable to have several commits in one issue, I can start w/ Instantiated, post a patch and then we can continue to more classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
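The onOrAfter gate described above is easy to model. This is a simplified self-contained sketch, not org.apache.lucene.util.Version itself: the toy enum, `VersionGateDemo` class, and `maxFieldLength` helper are assumptions mirroring the Instantiated example in the issue.

```java
// Toy stand-in for org.apache.lucene.util.Version: ordinal order gives onOrAfter.
enum Version {
    LUCENE_30, LUCENE_31;
    boolean onOrAfter(Version other) { return compareTo(other) >= 0; }
}

public class VersionGateDemo {
    // Apps passing an old matchVersion keep the deprecated default;
    // apps opting into 3.1+ get the new unlimited behavior.
    static int maxFieldLength(Version matchVersion) {
        return matchVersion.onOrAfter(Version.LUCENE_31)
            ? Integer.MAX_VALUE // new default: unlimited
            : 10000;            // deprecated legacy default
    }

    public static void main(String[] args) {
        System.out.println(maxFieldLength(Version.LUCENE_30)); // 10000
        System.out.println(maxFieldLength(Version.LUCENE_31)); // 2147483647
    }
}
```

The point of the pattern is exactly this: one code path, with the behavior change visible and opt-in via the Version the application declares, instead of a silent back-compat break.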
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846622#action_12846622 ] Mark Miller commented on LUCENE-2320: - +1 - I've had to do this in the past too. Just dropping tests doesn't seem like the way to go in many cases. Add MergePolicy to IndexWriterConfig Key: LUCENE-2320 URL: https://issues.apache.org/jira/browse/LUCENE-2320 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as well. The change is not straightforward and so I've kept it for a separate issue. MergePolicy requires an IndexWriter in its ctor, however none can be passed to it before an IndexWriter actually exists. And today IW may create an MP just for it to be overridden by the application one line afterwards. I don't want to make the iw member of MP non-final, or settable by extending classes, however it needs to remain protected so they can access it directly. So the proposed changes are:
* Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized set(T)* and *T get()*. The wrapped object will be declared volatile, so that get() won't be synchronized.
* MP will define a *protected final SetOnce<IndexWriter> writer* instead of the current writer. *NOTE: this is a bw break*. Any suggestions are welcome.
* MP will offer a public default ctor, together with a set(IndexWriter).
* IndexWriter will set itself on MP using set(this). Note that if set is called more than once, it will throw an exception (AlreadySetException - or does someone have a better suggestion, preferably an already existing Java exception?).
That's the core idea.
I'd like to post a patch soon, so I'd appreciate your review and proposals. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
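The proposed SetOnce could look roughly like the following sketch. This is a reading of the proposal in the message above, not a committed implementation; AlreadySetException (the name floated in the issue) is modeled here as an IllegalStateException subclass:

```java
// Sketch of SetOnce<T>: set() is synchronized and may only succeed once;
// get() is unsynchronized, relying on the volatile field for visibility.
final class SetOnce<T> {

    // The exception name proposed in the issue; subclassing an existing
    // Java exception keeps it familiar to callers.
    static final class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice");
        }
    }

    private volatile T obj;  // volatile so get() needs no lock
    private boolean isSet;   // guarded by synchronized set()

    public synchronized void set(T obj) {
        if (isSet) {
            throw new AlreadySetException();
        }
        this.obj = obj;
        this.isSet = true;
    }

    public T get() {
        return obj;
    }
}
```

With this, MergePolicy could hold a `protected final SetOnce<IndexWriter> writer`, and IndexWriter would call `writer.set(this)` exactly once.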
[jira] Commented: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846711#action_12846711 ] Mark Miller commented on LUCENE-2323: - This reorg is a great step for contrib IMO! +1 reorganize contrib modules -- Key: LUCENE-2323 URL: https://issues.apache.org/jira/browse/LUCENE-2323 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Robert Muir it would be nice to reorganize contrib modules, so that they are bundled together by functionality. For example: * the wikipedia contrib is a tokenizer, I think it really belongs in contrib/analyzers * there are two highlighters, I think they could be one highlighters package. * there are many queryparsers and queries in different places in contrib -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 03:43 AM, Simon Willnauer wrote: One more thing which I wonder about even more is that this whole merging happens so quickly for reasons I don't see right now. I don't want to keep anybody from making progress but it appears like a rush to me. Meh - I think you're just plain wrong about this. Anyone can work as fast as they want on anything. Nothing has happened faster than the community wants yet. You're too concerned. This is called discussion. Nothing has happened. In my opinion, the whole freak out over what goes where in svn was so overblown - it's so easy to move this stuff around at the drop of a hat. That's why it was suggested we put a branch there, and no one saw anything wrong with it for the moment - everyone said, well, we can just easily move it if someone has an issue - which we did. Didn't expect the freak out though. Frankly, we were just seeking a branch really, and didn't care where it went. Some of us are anxious to do some work - some of us are anxious to merge some code - no one is forcing this stuff on the others at a rapid pace - everyone gets their say as always. This is why we wanted a branch we could commit what we wanted to. SVN locations make starting the merge of code easier. They are easy to change. This is not like rushing index format changes. It's source code location - it can be moved at the drop of a hat. The sooner we resolve what we are going to do, the sooner we can start getting more of the work done that we hoped to get done with this merge. This thread starts that discussion. You can't start a discussion too early. Perhaps it leads to another discussion first, but there is no such thing as rushing the start of a discussion. It doesn't say figure it out by tomorrow, cause we are doing this tomorrow. It doesn't say figure this out by next week, because we are doing this next week. It says let's discuss where this is going to go.
I think some people just need to relax, discuss what they would like to see, and worry less about how fast others are working. Fast work is good. It means more work gets done. Nothing is going to happen until the community figures things out. BTW: I still have the impression that if I don't follow IRC constantly I'm missing important things. That's your impression then. Follow IRC if you want. People talk all over the place about Lucene/Solr - many times in places you can't follow - if it didn't happen on the list, it didn't happen. Michael Busch follows up saying, people say it was discussed thoroughly on IRC - so what? It doesn't count as a valid point of reference. I haven't seen that, but if someone does say it, you can just tell them so - they owe you an explanation. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 07:05 AM, Shalin Shekhar Mangar wrote: Wow, you guys are moving fast! That's a good thing. IRC is fine if you want to discuss something quickly. But it has its limitations. For example, I cannot follow IRC most of the time because I'm in a different time zone. But I don't want to stop anyone either. In fact, I can't do that. Nobody can. All I want to say is that once discussions have happened and a plan is agreed upon, it may be a good idea to let solr-dev/java-dev know the plan. In this case I didn't know a new branch was created until I saw a commit notification and then Yonik's email. Hi Shalin - I like your attitude ;) - Yonik's email was the notification of the plan :) Though we had no plan. When Robert and I made the branch we had no plan really - we just needed a place to put together our patches and do the final work. We were trying to do it with patches, but it was becoming difficult. But when we started we had no real plan - just to see if we could get Solr up and running on Lucene 3.0.1 and then trunk. Anything beyond that, we have not planned for - and before that was even completed, there were emails to java-dev about it. But we conceived nothing beyond seeing if we could get Solr running on the latest Lucene. From our perspective, we would have been just as happy with a branch on my local hard drive! That would have taken longer to set up though. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 09:05 AM, Andrzej Bialecki wrote: On 2010-03-16 12:29, Mark Miller wrote: From our perspective, we would have been just as happy with a branch on my local hard drive! That would have taken longer to set up though. You could have used git instead. There is good integration between git and svn, and it's much easier (a giant understatement...) to handle branching and merging in git, both between git branches and syncing with external svn. Yeah, we have actually discussed doing things like Git in the past - probably the main reason we didn't is the learning curve at the moment. I haven't used it yet. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/16/2010 10:09 AM, Yonik Seeley wrote: On Tue, Mar 16, 2010 at 2:51 AM, Michael Buschbusch...@gmail.com wrote: Also, we're in review-and-commit process, not commit-and-review. Changes have to be proposed, discussed and ideally attached to JIRA as patches first. Correction, just for the sake of avoiding future confusion (i.e. I'm not making any point about this thread): Lucene and Solr have always officially been CTR. For trunk, we normally use a bit of informal lazy consensus for anything big, hard, or that might be controversial... but we are not officially RTC. -Yonik - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org In any case, this is a branch. People really want to enforce RTC on a branch??? Even if that was our official process on trunk (which I agree it has not been), that's not how the flex branch worked. That's not how the solr_cloud branch worked. That's not how other previous branches have worked. IMO - anyone should be able to create a branch for anything - to play around with whatever they want. We should encourage this. Branches are good. And they take up little space. Branch changes have to be proposed, discussed, and attached to JIRA? Ugh - I certainly hope not. Branches should be considered replacements for huge unwieldy patches. Do I have to propose and discuss before I put up a patch? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: #lucene IRC log [was: RE: lucene and solr trunk]
On 03/16/2010 02:57 PM, Grant Ingersoll wrote: On Mar 16, 2010, at 2:47 PM, Steven A Rowe wrote: On 03/16/2010 at 6:06 AM, Michael McCandless wrote: Does anyone know how other projects fold in IRC...? I gather from the deafening silence that we'll have to figure it out as we go... I think some (not all) of the discomfort associated with IRC could be addressed with a permanent, searchable, linkable archive of #lucene. I went looking for IRC loggers and found http://colabti.org/. One of the things hosted there is a searchable, linkable permanent archive of several freenode channels. I posted on #irclogger asking about hosting a #lucene archive, and apparently all we have to do is ask, after first determining that nobody objects. Here's a link (not incidentally, this is exactly what we will have for #lucene once the service is switched on): http://colabti.org/irclogger/irclogger_log/irclogger?date=2010-03-16#l2 So, would anybody participating on #lucene object to a permanent archive? (I'm also going to provide a link to this thread on #lucene to make sure everybody there knows about the issue.) There's also a lot of chatter that happens on IRC, so logging is going to have a lot of noise. I'm still on the fence on what to do. I don't want to get in people's way, but we also need to have traceability about decisions, and we certainly can't have answers like "We discussed this on IRC and you missed it, too bad" happening (not saying that has happened, just saying I don't want to see it). -Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org Even with logging, I'm against using IRC for making decisions, or as something people can point to. Even with searchable logging, I think we should stick with: if it didn't happen on the lists, it didn't happen.
It's the same as when some of us get together and talk about Lucene and Solr - that's great stuff - you can get a lot done that is a lot harder on the lists - you can hash a lot out. But I think people should always have the right to act like it didn't happen - the same as if we are at ApacheCon or something - we don't come back and say, sorry, you missed all the discussion, but we had one and this is what we are going to do. We summarize the discussion on the list (like Mike likes to do with IRC), and answer questions as people have them. I personally think it's great to come to mini agreements with real-time talk - then it just has to make its way through the list. This isn't a counter point to anything you said Grant, just a nice place for me to drop this. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On 03/15/2010 08:33 AM, Grant Ingersoll wrote: Right, Mark. I think we would be effectively raising the bar to some extent for what it takes to be a committer. That's part of my point though - some are contrib committers with a lower bar - now they are core/solr committers with that lower bar, but someone else that came along would not get to the same position now? We'd also be making contrib a first class citizen (not that it ever wasn't, but some people have that perception). I think because it was kind of true. I could come along before and donate contrib x, and never show I worked well with the community or built up the merit needed to be a committer, and be made a contrib committer simply to maintain my module. That's happened plenty. Finally, I think we need to recognize that not everyone needs to be a McCandless in order to contribute in a helpful way. We obviously recognize that or else I wouldn't be here! I think it's more about fitting in - showing you get and follow the Apache way. Showing that ideas and changes you might push are in line with what the other committers think is appropriate for a core/solr committer. Talent is not key here - community is. The bar for this has been *much* higher for core than for contrib in the past. And contrib has had different bars over time - I think it was even lower in the past at points. I think sometimes we forget that you can do svn revert. I hate to have to do that. I don't think it's a great way to handle this - we could make everyone a committer at the drop of a hat and say we can just revert. I wouldn't call for a revert except in exceptional circumstances. I don't think that's the point. Obviously, we don't want to have to do it often, but it's not a huge deal if it happens. We've all been there.
-Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org I also wouldn't personally cast my vote on this broadly - some people I might think should be core/solr committers now, others not. Merit at Apache is important - you never lose it. Seems weird to get something like that so easily when in the past you had to work your way to it from contrib committership and get voted on individually by the PMC. Personally I'd prefer we just stop adding them, and the current ones work their way up like normal if they are so inclined, or the ones that are not even around anymore can just stay as they are. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene and solr trunk
On 03/15/2010 11:28 PM, Yonik Seeley wrote: So, we have a few options on where to put Solr's new trunk: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [DISCUSS] Do away with Contrib Committers and make core committers
On 03/14/2010 06:37 PM, Grant Ingersoll wrote: On Mar 14, 2010, at 2:03 PM, Uwe Schindler wrote: This time a +1 without discuss :-) Yeah, but Uwe, the thread was DISCUSS, not VOTE! :-) I had a whole spiel about earning merit, and how some contrib committers were made contrib committers for just a single contrib, some long ago, and didn't necessarily have to show they understood/followed the Apache way - a lower bar (not necessarily from a talent perspective, but you might be made a contrib committer just to maintain the code module you contributed, whether you worked with the community or not), etc, etc. But ah, since everyone is into it without discussion, far be it from me to stand against. And I got my spiel in (super condensed) anyway now. With everyone else into it so far, I just look foolish trying to discuss :) - Mark - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Sunday, March 14, 2010 5:54 PM To: java-dev@lucene.apache.org Subject: [DISCUSS] Do away with Contrib Committers and make core committers Given the notion of one project, one set of committers, I think we should do away with the notion of contrib committers for java-dev and just have everyone be committers. Practically speaking, this would make all existing contrib committers core committers. I think the notion of contrib committers has added to the confusion about the status of contrib, as well as acted like a probation for new committers. To me, I don't think we should make that distinction; as has been evidenced time and time again, if we trust someone to commit to contrib, we can trust them to commit to core. And if we don't trust them to commit to core, then we probably shouldn't trust them to commit to contrib either. Much of being a committer is knowing what not to touch as much as what to touch, and I trust that all of our contrib committers know that. Thoughts?
-Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Welcome Chris Male as Contrib committer!
I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spatial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516 ] Mark Miller commented on LUCENE-2309: - bq. Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google. Apache's rule is, if it didn't happen on the lists, it didn't happen. #IRC is a great way for people to communicate and hash stuff out, but it's not necessary that you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think it's really benefited us to have devs discuss there - but the official position is, if it didn't happen on the list, it didn't actually happen. Fully decouple IndexWriter from analyzers - Key: LUCENE-2309 URL: https://issues.apache.org/jira/browse/LUCENE-2309 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless IndexWriter only needs an AttributeSource to do indexing. Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (it's not analyzed; it is analyzed but it's a Reader or String; it's pre-analyzed). I'd like to have IW only interact with attr sources that already arrived with the fields. This would be a powerful decoupling -- it means others are free to make their own attr sources. They need not even use any of Lucene's analysis impls; eg they can integrate to other things like [OpenPipeline|http://www.openpipeline.org]. Or make something completely custom. LUCENE-2302 is already a big step towards this: it makes IW agnostic about which attr is the term, and only requires that it provide a BytesRef (for flex).
Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
Committers are competent in different areas of the code. Even Mike wasn't big into the search side until per-segment search. Committers are trusted to mess with the pieces they know. I don't see anyone even remotely suggesting that users should have to understand all of the implications of posting format modifications. Just sounds like a nasty jab to me. - Mark http://www.lucidimagination.com On Mar 12, 2010, at 2:43 PM, Marvin Humphrey (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637 ] Marvin Humphrey commented on LUCENE-2308: - If you disable term freq, you also have to disable positions. The freq tells you how many positions there are. I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties. Separately specify a field's type - Key: LUCENE-2308 URL: https://issues.apache.org/jira/browse/LUCENE-2308 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless This came up from discussions on IRC. I'm summarizing here... Today when you make a Field to add to a document you can set things like indexed or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value. We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper). This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index.
This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
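The FieldType factoring described in the issue above might look roughly like this sketch. All class shapes, field names, and options here are illustrative guesses at the proposal, not Lucene's final API:

```java
// Index-time options factored out of Field, so one FieldType instance
// can be shared across many fields (not serialized into the index).
final class FieldType {
    final boolean indexed;
    final boolean stored;
    final boolean analyzed;
    final boolean omitNorms;

    FieldType(boolean indexed, boolean stored, boolean analyzed, boolean omitNorms) {
        this.indexed = indexed;
        this.stored = stored;
        this.analyzed = analyzed;
        this.omitNorms = omitNorms;
    }
}

// The Field instance still holds the actual value; the type info is shared.
final class Field {
    final String name;
    final String value;
    final FieldType type;

    Field(String name, String value, FieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
    }
}
```

The point of the refactor is reuse: one `FieldType` built once, passed to every `Field` of that kind, instead of repeating the same flags per field.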
Re: svn commit: r921480 [1/8] - in /lucene/java/trunk: ./ contrib/analyzers/common/src/test/org/apache/lucene/analysis/query/ contrib/analyzers/common/src/test/org/apache/lucene/analysis/shingle/ cont
On 03/10/2010 01:48 PM, Robert Muir wrote: On Wed, Mar 10, 2010 at 1:40 PM, Shai Ereraser...@gmail.com wrote: I wrote that I defaulted to Whitespace for convenience reasons only. Now you don't need to specify anything if you don't care how the content is indexed, which is really the case for TONS of tests. The code became so much simpler. I guess I don't see it this way. It may be convenient for us, but it's inconvenient for new users, as they see it as 'lucene's default'. No one wants to do more work than is necessary: currently a lot of people use StandardAnalyzer for this reason, maybe without a lot of thought. but this is ok. StandardAnalyzer at least does things like lowercasing. Those who do care will pay attention to it anyway :). I see it as the inverse: I would rather our tests have new WhitespaceAnalyzer than see users complain on the java-user mailing list that lucene doesn't ignore case differences or punctuation, because they don't need to think about this. Whitespace is a shitty default for a search engine, it's only good for tests. +1. I don't think we should default the Analyzer. I agree that WhitespaceAnalyzer is not a good default. And I don't think Standard is a good default. I'm in agreement that you should have to specify one, to force thinking about it. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843717#action_12843717 ] Mark Miller commented on LUCENE-2294: - bq. If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory? Nothing ;) But I think Analyzer should be mandatory and that IndexDeletionPolicy should not be mandatory, looking at them case by case. Create IndexWriterConfiguration and store all of IW configuration there --- Key: LUCENE-2294 URL: https://issues.apache.org/jira/browse/LUCENE-2294 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters: * All of ctors parameters, except for Directory. * The different setters where it makes sense. For example I still think infoStream should be set on IW directly. I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter. I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. 
We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ... I plan to deprecate all the ctors and getters/setters and replace them by: * One ctor as described above * getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest. * About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig); ** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2()); BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start to work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
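The chained-setter idea above (iw.setConfig(iw.getConfig().setSomething1().setSomething2())) can be sketched with a minimal, self-contained stand-in for the proposed IndexWriterConfig. The setter names and defaults here are illustrative only:

```java
// Each setter returns `this`, so configuration reads as one expression.
final class IndexWriterConfig {
    private int maxFieldLength = Integer.MAX_VALUE; // the hoped-for UNLIMITED default
    private int maxBufferedDocs = 1000;             // illustrative default

    IndexWriterConfig setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this; // returning this is what enables chaining
    }

    IndexWriterConfig setMaxBufferedDocs(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
        return this;
    }

    int getMaxFieldLength() {
        return maxFieldLength;
    }

    int getMaxBufferedDocs() {
        return maxBufferedDocs;
    }
}
```

Usage would then look like `iw.setConfig(iw.getConfig().setMaxFieldLength(5000).setMaxBufferedDocs(100))`, with a single IndexWriter(Directory, IndexWriterConfig) ctor replacing today's overloads.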
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843729#action_12843729 ] Mark Miller commented on LUCENE-2294: - bq. Question - does SOLR require everyone to specify an Analyzer, or does it come w/ a default one? Hmm... SOLR doesn't really use Lucene analyzers directly. It comes with a default schema.xml that defines FieldTypes. Then field names can be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most people build off the example, you could say that it does have defaults for the example FieldTypes and defaults for which field names map to those. But it also only accepts certain example fields with the example schema - you really have to go in and customize it to your needs - it's set up to basically show off what options are available and to work with some demo stuff. Solr comes with almost no defaults in a way - but it does ship with an example setup that is meant to show you how to set things up, and what is available. You could consider those defaults since most will build off it.
example of Solr analyzer declaration: {code}
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code} Create IndexWriterConfiguration and store all of IW configuration there --- Key: LUCENE-2294 URL: https://issues.apache.org/jira/browse/LUCENE-2294 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters: * All of ctors parameters, except for Directory. * The different setters where it makes sense. For example I still think infoStream should be set on IW directly.
I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter. I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ... I plan to deprecate all the ctors and getters/setters and replace them by: * One ctor as described above * getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest. * About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig); ** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2()); BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843756#action_12843756 ]

Mark Miller commented on LUCENE-2294:
-

I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843331#action_12843331 ]

Mark Miller commented on LUCENE-2089:
-

Sweet!

explore using automaton for fuzzyquery
--
Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: Flex Branch
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
Fix For: Flex Branch
Attachments: ContrivedFuzzyBenchmark.java, createLevAutomata.py, gen.py, gen.py, gen.py, gen.py, gen.py, gen.py, Lev2ParametricDescription.java, Lev2ParametricDescription.java, Lev2ParametricDescription.java, Lev2ParametricDescription.java, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, moman-57f5dc9dd0e7.diff, TestFuzzy.java

we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed up the core FuzzyQuery in similar fashion to the Wildcard and Regex speedups, maintaining all backwards compatibility. The advantages are:
* we can seek to terms that are useful, instead of brute-forcing the entire terms dict
* we can determine matches faster, as true/false from a DFA is an array lookup; we don't even need to run levenshtein.

We build Levenshtein DFAs in linear time with respect to the length of the word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

To implement support for 'prefix' length, we simply concatenate two DFAs, which doesn't require us to do NFA-DFA conversion, as the prefix portion is a singleton. The concatenation is also constant time with respect to the size of the fuzzy DFA; it only needs to examine its start state.
With this algorithm, parametric tables are precomputed so that DFAs can be constructed very quickly. If the required number of edits is too large (we don't have a table for it), we use dumb mode at first (no seeking, no DFA, just brute force like now). As the priority queue fills up during enumeration, the similarity score required to be a competitive term increases, so the enum gets faster and faster as this happens. This is because terms in core FuzzyQuery are sorted by boost value, then by term (in lexicographic order). For a large term dictionary with a low minimal similarity, you will fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs (edit distance of 2 -> edit distance of 1 -> edit distance of 0) during enumeration, but also to switch from dumb mode to smart mode. With this design, we can add more DFAs at any time by adding additional tables. The tradeoff is that the tables get rather large, so for very high K we would start to increase the size of Lucene's jar file. The idea is that we don't have to include large tables for very high K, by using the 'competitive boost' attribute of the priority queue. For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton
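For contrast with the DFA approach, the "dumb mode" fallback described above amounts to a per-term Levenshtein computation. Here is a minimal, self-contained sketch of that brute-force check (illustrative only, not the actual Lucene implementation):

```java
// Brute-force ("dumb mode") fuzzy matching: compute the classic
// dynamic-programming Levenshtein distance for every candidate term and
// compare it against a cutoff k. The parametric Levenshtein DFA replaces
// this O(|a|*|b|) per-term work with a per-character state lookup.
final class BruteForceFuzzy {
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp; // reuse rows
        }
        return prev[b.length()];
    }

    static boolean matches(String query, String term, int k) {
        return editDistance(query, term) <= k;
    }
}
```

Every term in the dictionary pays the full DP cost here, which is why seeking with a DFA (a cheap accept/reject per term) is such a win on large term dictionaries.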
Vote on merging dev of Lucene and Solr
For those committers that don't follow the general mailing list, or don't follow it that closely: we are currently having a vote for committers: http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development

--
- Mark
http://www.lucidimagination.com
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841249#action_12841249 ]

Mark Miller commented on LUCENE-2294:
-

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats *required* to set on the config class - without the same cstr business ...
[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841249#action_12841249 ]

Mark Miller edited comment on LUCENE-2294 at 3/4/10 1:45 PM:
-

I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats *required* to set on the config class - without the same cstr business ...

*edit* Though I suppose the chaining *does* make this more swallowable... new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...

was (Author: markrmil...@gmail.com):
I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats *required* to set on the config class - without the same cstr business ...
Re: Request for clarification on unordered SpanNearQuery
On 03/04/2010 11:34 AM, Goddard, Michael J. wrote:

// Question: why wouldn't this Span be found?
assertTrue("fourth range", spans.next());
assertEquals("fourth doc", 11, spans.doc());
assertEquals("fourth start", 2, spans.start());
assertEquals("fourth end", 6, spans.end());

Spans are funny beasts ;) No Spans ever start from the same position more than once. In effect, they are always marching forward. The third range starts at 2, and once it finds a match starting at 2, it moves on. So it won't find the other match that starts at 2. Spans are not exhaustive - exhaustive matching would be a different algorithm. So yes, you are wrong in your expectation :) Just how Spans were implemented.

--
- Mark
http://www.lucidimagination.com
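The "marching forward" behavior can be illustrated with a toy two-term near matcher (purely illustrative; Lucene's actual NearSpans code is considerably more involved): once a span starting at position p is emitted, no second span starting at p is ever reported.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of why a Spans enumeration is not exhaustive: for each
// start position it emits at most one match, then marches forward.
final class ToySpans {
    // posA/posB: sorted term positions; returns [start, end) windows in
    // which both terms co-occur within 'slop', at most one per start.
    static List<int[]> nearSpans(int[] posA, int[] posB, int slop) {
        List<int[]> out = new ArrayList<>();
        int lastStart = -1;
        for (int a : posA) {
            for (int b : posB) {
                if (Math.abs(b - a) <= slop) {
                    int start = Math.min(a, b);
                    int end = Math.max(a, b) + 1;
                    if (start > lastStart) { // never re-emit the same start
                        out.add(new int[] { start, end });
                        lastStart = start;
                    }
                    break; // first match for this occurrence only; move on
                }
            }
        }
        return out;
    }
}
```

With posA = {2}, posB = {4, 6}, and slop 4, only the [2, 5) window is reported; the equally valid match that also starts at 2 but ends at 6 is skipped, just as in the question above.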
[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances
[ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839744#action_12839744 ]

Mark Miller commented on LUCENE-2287:
-

bq. Breaks backward compatibility, so need to find a way around that

Wouldn't be the end of the world depending on the break.

Unexpected terms are highlighted within nested SpanQuery instances
--
Key: LUCENE-2287
URL: https://issues.apache.org/jira/browse/LUCENE-2287
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/highlighter
Affects Versions: 2.9.1
Environment: Linux, Solaris, Windows
Reporter: Michael Goddard
Priority: Minor
Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch
Original Estimate: 336h
Remaining Estimate: 336h

I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.
This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

{code}
/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
      expected, observed);
}
{code}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.
Re: Adding .classpath.tmpl
+1 - I'd prefer this stay out of svn as well - I'd rather it go on the wiki too - perhaps in the same place that you can find the formatting file for eclipse and intellij.

--
- Mark
http://www.lucidimagination.com

On 02/25/2010 11:10 AM, Grant Ingersoll wrote:
To me, this is stuff that can go on the wiki or somewhere else; otherwise, over time, there will be others to add in, etc. We could simply add a pointer to the wiki page in the README.

On Feb 24, 2010, at 11:55 PM, Shai Erera wrote:
Hi

I always find it annoying, when I checkout the code to a new project in eclipse, that I need to put everything that I care about in the classpath and add the dependent libraries. On another project I'm involved with, we did that process once, adding all the source code and the libraries to the classpath, and created a .classpath.tmpl. Now when people checkout the code, they can copy the content of that file to their .classpath file, and setting up the project goes from a couple of minutes down to a few seconds. I don't want to check in .classpath because not everyone wants all the code in their classpath. I attached such a file to the mail. Note that the only dependency which will break on other machines is the ant.jar dependency, which on my Windows machine is located under c:\ant. That jar is required to compile contrib/ant from eclipse. Not sure how to resolve that, except removing that line from the file and documenting separately that that's what you need to do if you want to add contrib/ant ... The file is sorted by name, putting the core stuff at the top - so it's easy for people to selectively add the interesting packages. I don't know if an issue is required; if so, I can create one and move the discussion there.
Shai

lucene.classpath.tmpl
Re: Question on highlighting of nested SpanQuery instances
Hey Michael - this is currently just a limitation of the Span highlighter. It does a bit of fudging when determining what a good position is - if a term from the text is found within the span of a spanquery it is in (no matter how deeply nested), the highlighter makes a guess that the term should be highlighted - this is because we don't have the actual positions of each term - just the positions of the start and end of the span. In almost all cases this works as you would expect - but when nesting spans like this, you can get spurious results within the overall span.

So your idea that we should recurse into the Span is on the right track - but it just gets fairly complicated quickly. Consider SpanNear(SpanNear("mark", "miller", 3), SpanTerm("lucene"), 4) - if we recurse in and grab the first SpanNear ("mark", "miller", 3), we can correctly highlight that - but then we will handle "lucene" by itself - so all "lucene" terms will be hit, rather than the one within 4 of the first span. So you have to deal with SpanOr, SpanNear, SpanNot recursively, but then also handle when they are linked, either with each other or with a SpanTerm - and uh - it gets hard real fast. Hence the fuzziness that goes on now. There may be something we can do to improve things in the future, but it's kind of an accepted limitation at the moment - prob something we should add some doc about.

- Mark

Goddard, Michael J. wrote:
Hello,

I initially posted a version of this question to java-user, but think it's more of a java-dev question. I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances.
To illustrate this, I added the code below to the HighlighterTest class in lucene_2_9_1:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
      expected, observed);
}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far. Any suggestions are welcome. Thanks.

Mike

--
- Mark
http://www.lucidimagination.com
Re: (LUCENE-1844) Speed up junit tests
On 02/20/2010 05:45 PM, Michael McCandless wrote:
Currently the tests run 1 jvm per test suite (eg, TestIndexWriter has its own jvm), I believe, and we haven't seen test failures... so I think for the most part tests are not interfering with each other (messing up global state). It should be less likely that we see interactions across test suites (but obviously still possible). I think we should commit this and then if there are somehow problems we can address them, then?

+1

Mike

On Sun, Feb 14, 2010 at 6:27 AM, Robert Muir <rcm...@gmail.com> wrote:
its not just statics, I think we should really look at ensuring files are closed etc, or eventually there will be a problem! I guess in general the tradeoff is, it requires us to have better test code.

On Sun, Feb 14, 2010 at 5:53 AM, Uwe Schindler <u...@thetaphi.de> wrote:
At least we should check all core tests to not set any static defaults without try...finally! Are there any possibilities inside Eclipse/other IDEs to check this?

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Sunday, February 14, 2010 11:43 AM
To: java-dev@lucene.apache.org
Subject: Re: (LUCENE-1844) Speed up junit tests

Wow -- this is MUCH faster! I think we should switch... It seems like we use a batchtest for all core tests, then for all back-compat tests, then once per contrib package? Ie, so ant test-core uses one jvm? I think we should simply fix any badly behaved tests (that don't restore statics). It's impressive we already have no test failures when we do this... I guess our tests are already cleaning things up (though also probably not often changing global state, or changing it in a way that'd lead other tests to fail).
Mike

On Sat, Feb 13, 2010 at 5:23 PM, Robert Muir <rcm...@gmail.com> wrote:
On Fri, Nov 27, 2009 at 1:27 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
Also one thing I'd love to try is NOT forking the JVM for each test (fork=no in the junit task). I wonder how much time that'd buy...

it shaves off a good deal of time on my machine.
'ant test-core': 4 minutes, 39 seconds -> 3 minutes, 3 seconds
'ant test': 11 minutes, 8 seconds -> 7 minutes, 13 seconds

however, it makes me a little nervous because i'm not sure all the tests cleanup nicely if they change statics and stuff. anyway, here's the trivial patch (you don't want fork=no, because it turns off assertions)

Index: common-build.xml
===================================================================
--- common-build.xml (revision 909395)
+++ common-build.xml (working copy)
@@ -398,7 +398,7 @@
     </condition>
     <mkdir dir="@{junit.output.dir}"/>
     <junit printsummary="off" haltonfailure="no" maxmemory="512M"
-           errorProperty="tests.failed" failureProperty="tests.failed">
+           errorProperty="tests.failed" failureProperty="tests.failed" forkmode="perBatch">
       <classpath refid="@{junit.classpath}"/>
       <assertions>
         <enable package="org.apache.lucene"/>

--
Robert Muir
rcm...@gmail.com
Re: Question on highlighting of nested SpanQuery instances
I played with it sometime back, but I don't have any code left from that exercise. Its fairly tricky. Take your example:

SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term(fieldName, "lucene")),
    new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
    new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

First you see the top level SpanNearQuery - you want to recurse in and just work with the "lucene within 5 of doug, ordered" part. But you can't actually work with that alone. That whole span also has to be within 4 of hadoop, ordered ... so how do you constrain the sub highlighting? Lets say you do it somehow. Now you recurse in and want to highlight hadoop - but again, not every hadoop - only the hadoops that are within 4, ordered, of the first Span. So that's really the issue - you want to break up the Span and highlight recursively - but you can't really break them up and maintain all of the positional restrictions required.

So another possible option that gets a little messier might be: when extracting the allowable positions for a term (which it does by checking the start and end of the span), you might also run each inner span that contains that term, and then intersect the positions you find that way with the positions found with the overall span, and use that list as the allowable positions. That could get kind of complicated though, especially taking into account the logic of the or and spannot spanqueries.

- Mark

On 02/22/2010 03:15 PM, Goddard, Michael J. wrote:
Mark, Thanks a lot for the insight. I'm working with this today, diving into the WeightedSpanTermExtractor class and fiddling with it. If you ever did have any code which attempted to recurse into these structures, I'd be happy to get my hands on it. Thanks again.
Mike -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Mon 2/22/2010 9:15 AM To: java-dev@lucene.apache.org Cc: Goddard, Michael J. Subject: Re: Question on highlighting of nested SpanQuery instances Hey Michael - this is currently just a limitation of the Span highlighter. It does a bit of fudging when determining what a good position is - if a term from the text is found within the span of a spanquery it is in (no matter how deeply nested), the highlighter makes a guess that the term should be highlighted - this is because we don't have the actual positions of each term - just the positions of the start and end of the span. In almost all cases this works as you would expect - but when nesting spans like this, you can get spurious results within the overall span. So your idea that we should recurse into the Span is on the right track - but it just gets fairly complicated quick. Consider SpanNear(SpanNear(mark, miller,3), SpanTerm(lucene), 4) - if we recurse in an grab the first SpanNear (mark, miller, 3), we can correctly highlight that - but then we will handle lucene by itself - so all lucene terms will be hit rather than the one within 4 of the first span. So you have to deal with SpanOr, SpanNear, SpanNot recursively, but then also handle when they are linked, either with each other or with a SpanTerm - and uh - its gets hard real fast. Hence the fuzziness that goes on now. There may be something we can do to improve things in the future, but its kind of an accepted limitation at the moment - prob something we should add some doc about. - Mark Goddard, Michael J. wrote: Hello, I initially posted a version of this question to java-user, but think it's more of a java-dev question. I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. 
To illustrate this, I added the code below to the HighlighterTest class in lucene_2_9_1:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected
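The "intersect the positions" option floated in this thread can be sketched as a plain position-set intersection. The helper below is hypothetical, not the Highlighter's actual code: a term occurrence is only highlightable if it lies inside both the overall span's window and the relevant inner span's window.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the position-intersection idea: keep only term
// positions that fall inside BOTH the outer span window and the inner
// span window (windows are [start, end) position ranges). Real code would
// have to combine windows per sub-span, including SpanOr/SpanNot logic,
// which is where it gets complicated.
final class PositionIntersection {
    static Set<Integer> allowablePositions(Set<Integer> termPositions,
                                           int outerStart, int outerEnd,
                                           int innerStart, int innerEnd) {
        Set<Integer> allowed = new TreeSet<>();
        for (int p : termPositions) {
            boolean inOuter = p >= outerStart && p < outerEnd;
            boolean inInner = p >= innerStart && p < innerEnd;
            if (inOuter && inInner) allowed.add(p);
        }
        return allowed;
    }
}
```

In the test text above, "Lucene" occurs at roughly positions 1 and 7; if the outer span covers [1, 10) but the inner (lucene near doug) span only covers [1, 6), the intersection rules out the spurious second occurrence.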
Looks like we missed a little change for 3.0 ...
/* TODO 3.0: change this default to true */
protected boolean calibrateSizeByDeletes = false;

Better to make these JIRA issues to avoid the miss? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
+1 from me - I've put both releases through their paces - though technically, there are a handful of files that look like they need Apache headers (css, html), as reported by rat. I don't think this is a new issue though, so I don't think it's something we need to be that picky about right now. A ref to the Apache header policy: With few exceptions #faq-exceptions, all human-readable Apache-developed files that are included within a distribution must include the header text #header-text. Documentation, including web site documentation distributed with the release, may include the header text within some form of metadata (such as HTML comments) or as a header or footer appearing in the visible documentation. A file without any degree of creativity in either its literal elements or its structure is not protected by copyright law; therefore, such a file does not require a license header. If in doubt about the extent of the file's creativity, add the license header to the file. -- - Mark http://www.lucidimagination.com Uwe Schindler wrote: Hi all, I tested the lucene-core-3.0.1.jar in production since Sunday afternoon, no problems. I also replaced it with the 2.9.2 file in my dev environment (without recompilation, because the locally added generics would break only the compilation, not the JVM, of my projects) and tested: works. I also downloaded the artifacts to a computer without my own trustdb, imported KEYS and verified the signatures - no problems (only the GPG warning that the imported KEYS are not yet trusted by me). Md5/sha1 are also ok. I also downloaded the source zips and built/tested using ANT - passed. So a +1 from myself as a non-PMC member.
- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, February 15, 2010 12:46 AM To: gene...@lucene.apache.org; java-dev@lucene.apache.org Subject: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts Hello Folks, I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug fix level, functionality, and release announcement), built from revision 910082 of the corresponding branches. Thanks for all your help! Please test them and give your votes until Thursday morning, as the scheduled release date for both versions is Friday, Feb 19th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned a parallel release with one announcement because of their parallel development / bug fix level, and to emphasize that they are identical except for the deprecation removal and the Java 5 requirement introduced with major version 3.
Please also read the attached release announcement (Open Document) and send it back corrected if you miss anything or want to improve my bad English :-) You find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/ Maven repo: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/maven/ The changes are here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Contrib-Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Contrib-Changes.html Uwe === Proposed Release Announcement === Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2: Both releases fix bugs in the previous versions, where 2.9.2 is the last release working with Java 1.4, still providing all deprecated APIs of the Lucene Java 2.x series. 3.0.1 has the same bug fix level, but requires Java 5 and is no longer compatible with code using deprecated APIs. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe new API. Users upgrading from 2.9.x can now remove unnecessary casts and add generics to their code, too. Important improvements in these releases include an increased maximum number of unique terms per index segment. They also fix problems with IndexWriter's commit and lost document deletes in near real-time indexing. Lots of bugs in Contrib's Analyzers package were fixed as well. Additionally, the 3.0.1 release restores some public methods that were lost during deprecation removal.
If you are using Lucene in a web application environment, you will notice that
Re: [jira] Commented: (LUCENE-2262) QueryParser should now allow leading '?' wildcards
Nah, let's just make fuzzy not work in the qp by default :) And make that back compat while you're at it - while not abusing Version so that it's used for something subjective :) wouldn't want to rile up Hoss. I'm like 3/4 serious. - Mark http://www.lucidimagination.com (mobile) On Feb 13, 2010, at 10:22 PM, Robert Muir (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833496#action_12833496 ] Robert Muir commented on LUCENE-2262: - bq. in my opinion disallowing these queries with leading wildcards, be it * or ? or whatever, is rather silly, since we allow even slower fuzzyqueries by default. bq. Agree. What do you think, should we skip this step then and simply deprecate the entire setAllowLeadingWildcard concept altogether, setting it to true for Version >= 3.1? QueryParser should now allow leading '?' wildcards -- Key: LUCENE-2262 URL: https://issues.apache.org/jira/browse/LUCENE-2262 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2262.patch, LUCENE-2262_backwards.patch QueryParser currently throws an exception if a wildcard term begins with the '?' operator. The current documentation describes why this is: {noformat} When set, * or ? are allowed as the first character of a PrefixQuery and WildcardQuery. Note that this can produce very slow queries on big indexes. {noformat} In the flexible indexing branch, wildcard queries with a leading '?' operator are no longer slow on big indexes (they do not enumerate terms in linear fashion). Thus, it no longer makes sense to throw a ParseException for a leading '?'. So, users should be able to perform a query of ?foo and no longer get a ParseException from the QueryParser. For the flexible indexing branch, wildcard queries of 'foo?', '?foo', 'f?oo', etc. are all the same from a performance perspective. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Nasty NIO behavior makes NIOFSDirectory silently close channel
Perhaps - one of the things they are supposed to be addressing is extensibility. nio2 does have FileSystemProvider, which would actually allow you to create a custom channel! I have not dug in enough to know much more than that though. *But*, another really interesting thing is that in Java 7, FileDescriptors are ref counted! (though users can't inc/dec). But, FileInputStream and OutputStream have a new constructor that takes a FileDescriptor. So possibly, you could just make one that sits around to keep the FileDescriptor valid, and get your channel off FileInputStream/FileOutputStream? And then if it goes down, make a new one using the FileDescriptor, which was not actually closed because there was still a ref to it. Possibly ;) Michael McCandless wrote: Does anyone know if nio2 has improved this...? Mike On Fri, Jan 29, 2010 at 2:00 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Defaulting NIOFSDir could account for some of the recent speed improvements users have been reporting in Lucene 2.9. So removing it as a default could reverse those and people could then report Lucene 3.X has slowed... On Thu, Jan 28, 2010 at 5:24 AM, Michael McCandless luc...@mikemccandless.com wrote: Bummer. So the only viable workarounds are 1) don't use Thread.interrupt (nor things like Future.cancel, which in turn use Thread.interrupt) with NIOFSDir, or 2) we fix NIOFSDir to reopen the channel AND the app must make a deletion policy that keeps a commit alive if any reader is using it. Or, 3) don't use NIOFSDir! Mike On Thu, Jan 28, 2010 at 7:29 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Jan 28, 2010 at 12:43 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Jan 28, 2010 at 6:38 AM, Uwe Schindler u...@thetaphi.de wrote: So I checked the code of NIOFSIndexInput; my last comment was not really correct: NIOFSIndexInput extends SimpleFSIndexInput, and that opens the RAF. In the ctor, RAF.getChannel() is called.
The RAF stays open until the file is closed (and so does the channel). So it's really simple to fix in my opinion: just call getChannel() again on this exception, because the RAF should still be open? Short answer:

public final FileChannel getChannel() {
  synchronized (this) {
    if (channel == null)
      channel = FileChannelImpl.open(fd, true, rw, this);
    return channel;
  }
}

this is not gonna work - I tried it before. The RandomAccessFile buffers the channel!! simon I think we need a definitive answer on what happens to the RAF when the FileChannel was closed by Thread.interrupt(). Simon, can you test this? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com
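Mark's FileDescriptor idea above could be sketched roughly as follows. This is purely speculative and the class and names are made up for illustration; in particular, whether the descriptor actually remains usable after an interrupt closes the channel is exactly the open question in this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Speculative sketch (hypothetical class, not Lucene code): keep a
// stream around so the FileDescriptor stays referenced, and if the
// channel gets closed (e.g. by Thread.interrupt()), wrap the same
// descriptor in a fresh FileInputStream and take its channel.
class DescriptorHeldChannel {
    private final FileInputStream keeper; // holds the descriptor open
    private FileChannel channel;

    DescriptorHeldChannel(File file) throws IOException {
        keeper = new FileInputStream(file);
        channel = keeper.getChannel();
    }

    FileChannel channel() throws IOException {
        if (!channel.isOpen()) {
            // Assumes the descriptor survived the channel close -
            // this is the unverified part of the idea.
            channel = new FileInputStream(keeper.getFD()).getChannel();
        }
        return channel;
    }
}
```

If the descriptor turns out to be invalidated along with the channel, the only fallback is reopening by path, which is what the deletion-policy caveat above is about.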
[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801972#action_12801972 ] Mark Miller commented on LUCENE-2226: - Contribs back compat policy is that there is no back compat policy unless that contrib specifically states one. move contrib/snowball to contrib/analyzers -- Key: LUCENE-2226 URL: https://issues.apache.org/jira/browse/LUCENE-2226 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2226.patch to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, etc) we should simply merge snowball and analyzers, and replace the buggy impls with the proper snowball stemfilters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802011#action_12802011 ] Mark Miller commented on LUCENE-2226: - {quote}Mark, that is my understanding too. I wasn't commenting on the policy but on the fact of the possible breakage. I think it is a courtesy to notify users of a change to which they might need to pay attention. I don't know that's spelled out in the policy, but I think it should be. Not that a lack of notice is a guarantee of no breakage but that a notice is a guarantee of breakage (at least under some circumstances).{quote} Right - I was just pointing out that jar drop-in is far from a requirement in contrib. We do always try and play nice anyway. bq. Is there any contrib that specifically states one? I couldn't find it. Don't think so - meaning there is no back compat policy in contrib. I think as a contrib matures, it's up to those working on it to decide when it's reached a state that deserves a policy of some kind. The Highlighter could probably use one at this point, but at the same time, nothing has created too much of an outcry. bq. The analysis/common is not clear as it has the Version stuff. Right - just because there is no policy doesn't mean we shouldn't make any attempts at back compat - but the issue you brought up is not something easily addressed, nor, I think, large enough to worry about with the proper warning in Changes. Users should be wary of contrib on upgrading - unless it presents a strong back compat policy. bq. But after all the dust settles and this i18n stuff is solid, I think it might be reasonable to make a stronger bw compat statement. I agree - now that contrib has been getting some much needed love recently, I think it should start heading towards some back compat promises - especially concerning analyzers. We already do tend to bend over backwards when we can anyway.
I think we are on the same page - I'm just not very worried about the break you mention - I think it's a perfectly acceptable growing pain. And I think our back compat has been so weak because contrib has been a bit of a wasteland in the past - no one was willing to take ownership of a lot of this stuff - especially the language analyzers. That has changed recently. As the devs clean up and consolidate this stuff properly, I think we can work towards stronger promises in the future. move contrib/snowball to contrib/analyzers -- Key: LUCENE-2226 URL: https://issues.apache.org/jira/browse/LUCENE-2226 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2226.patch to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, etc) we should simply merge snowball and analyzers, and replace the buggy impls with the proper snowball stemfilters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Compound File Default
Otis Gospodnetic wrote: At the same time, seeing how some people benchmark systems without tuning them and then publish their results, cfs may be safer. Though at the same time you get nailed with a 10-15% indexing speed hit. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
update doc by query
Any reason we don't offer update doc by query along with term? It's easy enough to implement in the same manner - is there some sort of gotcha with this, or is it just because there has been no demand yet? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
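For what it's worth, the "same manner" implementation would presumably be delete-by-query followed by add, mirroring how updateDocument(Term, Document) is delete-by-term plus add. A toy in-memory sketch of those semantics (made-up names, not Lucene's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy illustration only: update-by-query as delete-by-query + add,
// analogous to IndexWriter.updateDocument(Term, Document).
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Delete every document the "query" matches, then add the
    // replacement - the hypothetical updateDocument(Query, Document).
    void updateByQuery(Predicate<Map<String, String>> query,
                       Map<String, String> replacement) {
        docs.removeIf(query);
        docs.add(replacement);
    }

    int size() {
        return docs.size();
    }
}
```

One visible difference from the by-term form is that the number of deleted documents is unbounded and the query has to be evaluated at update time - perhaps part of why it was never offered.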
Re: Lucene Java 2.9.2
Other than what's left of the TokenStream issues, I think we just need a compression solution - which shouldn't be difficult. - Mark Robert Muir wrote: https://issues.apache.org/jira/browse/SOLR-1657 I just struck through the things that are done. Mark, Robert: How far are we with progress in solr? Were there any additional problems with 3.0.0? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Tuesday, January 05, 2010 1:34 PM To: java-dev@lucene.apache.org Subject: RE: Lucene Java 2.9.2 My plan was to release it together with 3.0.1. Both versions then will have the same bug fix status. I have the scripts here to build the artifacts (as I added fast vector highlighter poms), so I could do it for both and start the release. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, January 05, 2010 12:26 PM To: java-dev@lucene.apache.org Subject: Re: Lucene Java 2.9.2 Lucene 2.9.2 hasn't been released yet, but I think we should release it at some point soonish? It's accumulated some important bug fixes. Mike On Mon, Jan 4, 2010 at 10:59 PM, George Aroush geo...@aroush.net wrote: Hi Folks, Over at Lucene.Net, we have 2.9.1 ready for official release. This is a port of the current Lucene Java 2.9.1 release. When I raised the question about releasing Lucene.Net 2.9.1, a question was asked to port over LUCENE-2190, for which a patch was quickly made (see: https://issues.apache.org/jira/browse/LUCENENET-331). This begs the question: if Lucene.Net takes just this one patch, then Lucene.Net 2.9.1 is now 2.9.1.1 (which I personally don't like to see happening, as I prefer to see a 1-to-1 release match).
So, I examined the list of fixes made in 2.9.2 here: https://issues.apache.org/jira/browse/LUCENE/fixforversion/12314342 and found that this is a small task to port over. So far so good? Good. Now, as far as I know, Lucene Java never made an official 2.9.2 release, or is this in the works (I don't recall seeing any email about it)? If so, what's the time line? I think our decision on the Lucene.Net side will be based on the answer to this question. Thanks. -- George - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-2035. - Resolution: Fixed Thanks Christopher! TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
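The positionIncrement arithmetic in the report above can be made concrete with a small worked example (hypothetical helper, not part of TokenSources): a token's absolute position is the running sum of the increments seen so far.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of positionIncrement semantics: each token's absolute
// position is the running sum of increments, so an increment of 0
// stacks a token on the previous position.
class TokenPositions {
    static Map<String, Integer> positions(String[] terms, int[] increments) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int pos = -1; // a first increment of 1 lands on position 0
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            result.put(terms[i], pos);
        }
        return result;
    }
}
```

With increments {1, 1, 1, 0} for {the, fox, jump, jumped}, "fox" sits at position 1 and "jumped" at position 2 - adjacent, so the phrase "fox jumped" matches. With the buggy all-ones increments, "jumped" lands at position 3 and the phrase highlighter sees a phantom gap.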
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Attachment: LUCENE-860-1.patch updated patch that also includes doc site level changes site should call project Lucene Java, not just Lucene - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Assignee: Mark Miller Priority: Minor Fix For: 3.1 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Attachment: LUCENE-860-2.patch site should call project Lucene Java, not just Lucene - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Assignee: Mark Miller Priority: Minor Fix For: 3.1 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791939#action_12791939 ] Mark Miller commented on LUCENE-2035: - I'll commit this soon. TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1922) exposing the ability to get the number of unique term count per field
[ https://issues.apache.org/jira/browse/LUCENE-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1922: Affects Version/s: (was: 2.4.1) Flex Branch exposing the ability to get the number of unique term count per field - Key: LUCENE-1922 URL: https://issues.apache.org/jira/browse/LUCENE-1922 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: Flex Branch Reporter: John Wang Add an api to get the number of unique term count given a field name, e.g.: IndexReader.getUniqueTermCount(String field) This issue has a dependency on LUCENE-1458 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791680#action_12791680 ] Mark Miller commented on LUCENE-2035: - Hey Christopher, why are you going through the trouble of the custom collector to check that there are no hits? Why not just do a standard search? TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). 
So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2035: Attachment: LUCENE-2035.patch I've broken the new tests back out into their own file, changed the hit collector code to just search basically, and improved the test coverage of TokenSources a bit. TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790748#action_12790748 ]

Mark Miller commented on LUCENE-2089:

Sorry Earwin - to be clear, we don't actually use chapter 6 - AutomatonQuery needs the automata. You can get all the states just by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that's subsumed by another. That's what I mean by brute force. But in the paper, they boil this down to nice little i-param tables, extracting some sort of pattern from that process. They give no hint on how they do this, or whether it's applicable to greater n's, though. No big deal I guess - the computer can do the brute force method - but I wouldn't be surprised if it starts to bog down at much higher n's.

explore using automaton for fuzzyquery
---
Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java

Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm). We can optimize fuzzyquery by using AutomatonTermsEnum. Here is my idea:
* up front, calculate the maximum required K edits needed to match the user's supplied float threshold.
* for at least small common E up to some max K (1,2,3, etc.) we should create a DFA for each E. If the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg; at high E, you will typically fill the pq very quickly since you will match many terms.

This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

As you can see, this prototype is no good yet, because it creates the DFA in a slow way. Right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see the TermEnum is fast (AvgMS - AvgMS_DFA); there is no problem there.

Instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 We can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

The paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization; if someone wants to implement this, they should not worry about minimization. In fact, we need to at some point determine whether AutomatonQuery should even minimize FSMs at all, or whether it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the Dumb vs Smart heuristic, and this can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now.
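The first bullet above ("calculate the maximum required K edits") follows from how the classic FuzzyQuery scores terms. A hedged sketch, assuming the old similarity = 1 - editDistance / min(|query|, |term|) scoring (the method name is mine, not a Lucene API, and the real enum handles prefix lengths and shorter terms as well):

```java
// Hedged sketch: derive the maximum edit distance K that can still satisfy a
// FuzzyQuery-style similarity threshold, assuming the classic scoring
// similarity = 1 - editDistance / min(|query|, |term|). For a matching term at
// least as long as the query, the distance is bounded by
// (1 - threshold) * |query|. Method name is mine, not a Lucene API.
public class MaxEditsSketch {
    static int maxRequiredEdits(String queryTerm, float minSimilarity) {
        return (int) ((1.0f - minSimilarity) * queryTerm.length());
    }

    public static void main(String[] args) {
        System.out.println(maxRequiredEdits("lucene", 0.5f)); // -> 3
    }
}
```

Any term farther than K edits away cannot reach the threshold, which is what lets the enum cap the set of DFAs it needs up front.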
[jira] Updated: (LUCENE-2165) SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
[ https://issues.apache.org/jira/browse/LUCENE-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2165:
Fix Version/s: 3.1

SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
---
Key: LUCENE-2165
URL: https://issues.apache.org/jira/browse/LUCENE-2165
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 2.9.1, 3.0
Reporter: Nick Burch
Priority: Minor
Fix For: 3.1

As discussed on the java-user list, the SnowballAnalyzer has been updated to use a Set of stop words. However, there is no constructor which accepts a Set; there's only the original String[] one. This is an issue because most of the common sources of stop words (eg StopAnalyzer) have deprecated their String[] stop word lists and moved over to Sets (eg StopAnalyzer.ENGLISH_STOP_WORDS_SET). So, for now, you either have to use a deprecated field on StopAnalyzer, or manually turn the Set into an array so you can pass it to the SnowballAnalyzer.

I would suggest that a constructor is added to SnowballAnalyzer which accepts a Set. Not sure if the old String[] one should be deprecated or not. A sample patch against 2.9.1 to add the constructor is:

--- SnowballAnalyzer.java.orig 2009-12-15 11:14:08.0 +
+++ SnowballAnalyzer.java 2009-12-14 12:58:37.0 +
@@ -67,6 +67,12 @@
     stopSet = StopFilter.makeStopSet(stopWords);
   }
 
+  /** Builds the named analyzer with the given stop words. */
+  public SnowballAnalyzer(Version matchVersion, String name, Set stopWordsSet) {
+    this(matchVersion, name);
+    stopSet = stopWordsSet;
+  }
+
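Until such a constructor exists, the workaround the report mentions - turning a stop-word Set back into the String[] the current constructor accepts - is a one-liner. This sketch is illustrative only (the class and method names are mine; in real code the Set would be, e.g., StopAnalyzer.ENGLISH_STOP_WORDS_SET):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration of the interim workaround: turn a stop-word Set back into the
// String[] that the existing SnowballAnalyzer constructor accepts.
public class StopWordsWorkaround {
    static String[] toStringArray(Set<String> stopWords) {
        return stopWords.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> stops = new LinkedHashSet<>(Arrays.asList("a", "an", "the"));
        System.out.println(Arrays.toString(toStringArray(stops))); // -> [a, an, the]
    }
}
```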
[jira] Commented: (LUCENE-1769) Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better
[ https://issues.apache.org/jira/browse/LUCENE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791115#action_12791115 ]

Mark Miller commented on LUCENE-1769:

Would be cool to get this issue wrapped up ...

Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better
---
Key: LUCENE-1769
URL: https://issues.apache.org/jira/browse/LUCENE-1769
Project: Lucene - Java
Issue Type: Bug
Components: Build
Affects Versions: 2.9
Reporter: Uwe Schindler
Attachments: clover.license, LUCENE-1769.patch, LUCENE-1769.patch, nicks-LUCENE-1769.patch

This is a followup for [http://www.lucidimagination.com/search/document/6248d6eafbe10ef4/build_failed_in_hudson_lucene_trunk_902]

The problem with clover running on hudson is that it does not instrument all tests that are run. The autodetection of clover 1.x is not able to find out which files are the correct tests, and it only instruments the backwards tests. Because of this, the current coverage report covers only the backwards tests running against the current Lucene JAR. You can see this if you install clover and start the tests: during test-core no clover data is added to the db; only when the backwards tests begin are new files created in the clover db folder.

Clover 2.x supports a new ant task, testsources, that can be used to specify which files are the tests. It works here locally with clover 2.4.3 and produces a really nice coverage report; linking with test files also works, so it tells which tests failed and so on. I will attach a patch that changes common-build.xml to the new clover version (other initialization resource) and tells clover where to find the tests (using the test folder include/exclude properties).

One problem with the current patch: it does *not* instrument the backwards branch, so you see only coverage of the core/contrib tests. Getting coverage from the backwards tests as well is not easily possible, for two reasons:
- the tag test dir is not easy to find out and add to the testsources element (there may be only one of them)
- the test names in the BW branch are identical to the trunk tests. This completely corrupts the linkage between tests and code in the coverage report.

In principle, the best approach would be to generate a second coverage report for the backwards branch with a separate clover DB. The attached patch does not instrument the BW branch; it only covers trunk tests.
[jira] Assigned: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-2035:
Assignee: Mark Miller
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
Fix Version/s: 3.1
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
Attachment: LUCENE-2035.patch
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791152#action_12791152 ]

Mark Miller commented on LUCENE-2035:

Thanks for the tests and fix, Christopher! I've got one more patch coming and I'll commit in a few days. I'm going to break the tests back out into a separate file again (on second thought, I think how you had it is a good idea) and remove an author tag. Then, after one more review, I think this is good to go in.
[jira] Commented: (LUCENE-406) sort missing string fields last
[ https://issues.apache.org/jira/browse/LUCENE-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791153#action_12791153 ]

Mark Miller commented on LUCENE-406:

We should update this and incorporate it into Lucene.

sort missing string fields last
---
Key: LUCENE-406
URL: https://issues.apache.org/jira/browse/LUCENE-406
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 1.4
Environment: Operating System: All; Platform: All
Reporter: Yonik Seeley
Assignee: Hoss Man
Priority: Minor
Attachments: MissingStringLastComparatorSource.java, MissingStringLastComparatorSource.java, TestMissingStringLastComparatorSource.java

A SortComparatorSource for string fields that orders documents with the sort field missing after documents with the field. This is the reverse of the default Lucene implementation. The concept and first-pass implementation was done by Chris Hostetter.
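The idea can be sketched with plain JDK comparators: model a document whose sort field is missing as null, and order nulls after all present values, the reverse of the default missing-first order. This is illustrative only; the attached MissingStringLastComparatorSource implements the same ordering inside Lucene's sorting framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the sort-missing-last idea with plain JDK comparators: a document
// whose sort field is missing is modeled as null, and nulls order after all
// present values - the reverse of the default missing-first behavior.
public class MissingLastSketch {
    public static void main(String[] args) {
        List<String> fieldValues =
            new ArrayList<>(Arrays.asList("banana", null, "apple", null, "cherry"));
        fieldValues.sort(Comparator.nullsLast(Comparator.naturalOrder()));
        System.out.println(fieldValues); // -> [apple, banana, cherry, null, null]
    }
}
```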
[jira] Resolved: (LUCENE-1942) NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way
[ https://issues.apache.org/jira/browse/LUCENE-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-1942.
Resolution: Won't Fix

NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way
---
Key: LUCENE-1942
URL: https://issues.apache.org/jira/browse/LUCENE-1942
Project: Lucene - Java
Issue Type: Bug
Components: Other
Environment: Eclipse 3.4.2
Reporter: Hasan Diwan
Priority: Trivial
Attachments: lucene.pat

The summary contains the problem. No further description needed, I don't think.
[jira] Resolved: (LUCENE-628) Intermittent FileNotFoundException for .fnm when using rsync
[ https://issues.apache.org/jira/browse/LUCENE-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-628.
Resolution: Incomplete

Intermittent FileNotFoundException for .fnm when using rsync
---
Key: LUCENE-628
URL: https://issues.apache.org/jira/browse/LUCENE-628
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 1.9
Environment: Linux RedHat ES3, Jboss402
Reporter: Simon Lorenz
Priority: Minor

We use Lucene 1.9.1 to create and search indexes for web applications. The application runs in Jboss402 on Redhat ES3. A single Master (Writer) Jboss instance creates and writes the indexes using the compound file format, which is optimised after all updates. These index files are replicated every few hours using rsync to a number of other application servers (Searchers). The rsync job only runs if there are no lucene lock files present on the Writer. The Searcher servers that receive the replicated files perform only searches on the index. Up to 60 searches may be performed each minute.

Everything works well most of the time, but we get the following issue on the Searcher servers about 10% of the time. Following an rsync replication, one or all of the Searcher servers throws an IOException, caught when creating an IndexSearcher:

java.io.FileNotFoundException: //_1zm.fnm (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
    at org.apache.lucene.store.Lock$With.run(Lock.java:109)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)

As we use the compound file format, I would not expect .fnm files to be present. When replicating, we do not delete the old .cfs index files, as these could still be referenced by old Searcher threads. We do overwrite the segments and deletable files on the Searcher servers.

My thoughts are: either we are occasionally overwriting a file at the exact time a new searcher is being created, or the lock files are removed from the Writer server before the compaction process is completed, and we then replicate a segments file that still references a ghost .fnm file. I would greatly appreciate any ideas and suggestions to solve this annoying issue.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790368#action_12790368 ]

Mark Miller commented on LUCENE-2089:

bq. If you do take hold of it, do not hesitate to share

The original paper and C++ code likewise melt my brain, and I needed the algo in some other place. The java impl I was onto was about 75% complete according to the author, but I have not yet looked at the code. Robert was convinced it was a different, less efficient algorithm, last I heard though.

We have cracked much of the paper - that's how Robert implemented n=1 here - that's from the paper. The next step is to work out how to construct the tables for n, as Robert says above, and store those tables efficiently, as they start getting quite large rather fast - though we might only use as high as n=3 or 4 in Lucene - Robert suspects term seeking will outweigh any gains at that point.

I think we know how to do the majority of the work for the n case, but I don't really have much/any time for this, so it probably depends on if/when Robert gets to it. If he loses interest in finishing, I def plan to come back to it someday. I'd like to complete my understanding of the paper and see a full n java impl of this in either case. The main piece left that I don't understand fully (computing all possible states for n) can be computed with just a brute force check (that's how the python impl is doing it), so there may not be much more to understand. I would like to know how the paper is getting 'i' parametrized state generators though - that's much more efficient. The paper shows them for n=1 and n=2.
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789901#action_12789901 ]

Mark Miller commented on LUCENE-2126:

bq. I disagree with you here: introducing DataInput/Output makes IMO the API actually easier for the normal user to understand.

I agree with everything you say in the second paragraph, but I don't see how any of that supports the assertion you make in the first paragraph. Presumably, it's because the normal user won't touch/see the IndexInput/Output classes, but may well deal with DataInput/Output - and those classes being limited to what actually makes sense for them (only exposing methods they should use) is easier for them.

I was leaning towards Marvin's arguments - it really seems that documentation should be enough to steer users against doing something stupid - there is no doubt that writing attributes into the posting list is a fairly advanced operation (though more normal than using IndexInput/Output). On the other hand, I'm not really sold on the longer-term downsides either. The complexity argument is a bit overblown: if you understand anything down to the level of these classes, this is a ridiculously simple change. The backcompat argument is not very persuasive either - not only does it look like a slim chance of any future issues, but at this level we are fairly loose about back compat when something comes up. I think advanced users have already realized that the more you dig into Lucene's guts, the more likely you won't be able to count on jar drop-in. That's just the way things have gone. I don't see a looming concrete issue myself anyway, and if there is a hidden one, I don't think anyone is going to get in a ruffle about it.

So net/net, I'm +1. Seems worth it to me to be able to give a user (LUCENE-2125) the correct API. I could go either way on the name change. Not a fan of LuceneInput/Output though.
Split up IndexInput and IndexOutput into DataInput and DataOutput
---
Key: LUCENE-2126
URL: https://issues.apache.org/jira/browse/LUCENE-2126
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: Flex Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Flex Branch
Attachments: lucene-2126.patch

I'd like to introduce the two new classes DataInput and DataOutput, which contain all methods from IndexInput and IndexOutput that actually decode or encode data, such as readByte()/writeByte() and readVInt()/writeVInt(). Methods like getFilePointer(), seek(), close(), etc., which are related not to data encoding but to files as input/output sources, stay in IndexInput/IndexOutput.

This patch also changes ByteSliceReader/ByteSliceWriter to extend DataInput/DataOutput. Previously, ByteSliceReader implemented the methods that stay in IndexInput by throwing RuntimeExceptions. See also LUCENE-2125. All tests pass.
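The proposed split can be sketched as follows. This is a hedged illustration: only the method placement follows the issue description; the class names with the "Sketch" suffix and all bodies are mine, not the patch. readVInt() mirrors Lucene's 7-bits-per-byte VInt encoding.

```java
// Hedged illustration of the proposed split; not the actual patch.
// DataInput keeps the decoding methods; file-oriented methods stay on IndexInput.
public abstract class DataInputSketch {
    public abstract byte readByte();

    // VInt decoding in terms of readByte(): 7 data bits per byte, with the
    // high bit set on every byte except the last (Lucene's VInt format).
    public int readVInt() {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// Methods tied to files as an input source stay on the IndexInput side.
abstract class IndexInputSketch extends DataInputSketch {
    public abstract long getFilePointer();
    public abstract void seek(long pos);
    public abstract void close();
}

// Minimal concrete DataInput over a byte[] just to exercise readVInt();
// 0xAC, 0x02 encodes 44 + (2 << 7) = 300.
class ByteArrayDataInputSketch extends DataInputSketch {
    private final byte[] bytes;
    private int pos;
    ByteArrayDataInputSketch(byte[] bytes) { this.bytes = bytes; }
    public byte readByte() { return bytes[pos++]; }
}
```

Under this split, something like ByteSliceReader can extend the DataInput side directly and no longer needs to stub out seek()/close() with RuntimeExceptions.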
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789384#action_12789384 ]

Mark Miller commented on LUCENE-2133:

bq. Something along these lines maybe?

And we are back to 831 :)

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Key: LUCENE-2133
URL: https://issues.apache.org/jira/browse/LUCENE-2133
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch, LUCENE-2133.patch

Hi all,

up to the current version Lucene contains a conceptual flaw: the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open. The FieldCache is flawed because it is incorrect to assume that:

1. One IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
2. The cached information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus may contain completely different information.
3. All IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct, there was no implementation other than FieldCacheImpl.

Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: there is a central registry for assigning caches to IndexReader instances.

I now propose the following:

1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache).
2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact overall performance).
5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField).

I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications, and some more to preserve backwards compatibility. Backwards compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping the old code in org.apache.lucene.search.

In detail, and besides the above-mentioned improvements, the following is provided:

1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
2. A housekeeping improvement to CloseableThreadLocal. It now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
3. Analyzer.close may now throw an IOException (this is already covered by java.io.Closeable).
4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader().
5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility of throwing IllegalArgumentExceptions because of unsupported type values.

The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:
- FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
- FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
- FieldCache (= IndexFieldCache)
- FieldCacheImpl (= IndexFieldCacheImpl)
- all classes in FieldCacheImpl (= several package-level classes)
- all subclasses of FieldComparator (= several package-level classes)

Final notes:
- The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can be removed immediately, which ones later, and which not at all. Whenever new classes depend on the old ones, an appropriate notice exists in the javadocs.
- The patch introduces a new
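The heart of the proposal - hanging the cache off each reader instead of keying a global singleton by reader instance - can be sketched in plain Java. This is an illustrative sketch only, not the actual LUCENE-2133 API; all class and method names here are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: each reader owns an IndexCache, so clones and
// decorators share cached data, and closing the reader releases it --
// no static registry keyed by reader instances.
class IndexCacheSketch {
    private final Map<String, int[]> fieldInts = new ConcurrentHashMap<>();

    int[] getInts(String field, Function<String, int[]> loader) {
        // compute per field once; later lookups reuse the same array
        return fieldInts.computeIfAbsent(field, loader);
    }

    void close() {
        // cache lifetime is tied to the reader, not to a global singleton
        fieldInts.clear();
    }
}

class ReaderSketch {
    private final IndexCacheSketch cache;

    ReaderSketch() { this(new IndexCacheSketch()); }
    private ReaderSketch(IndexCacheSketch shared) { this.cache = shared; }

    // A clone/decorator shares the underlying cache instead of showing up
    // as a distinct key in a static FieldCache (flaw #1 above).
    ReaderSketch cloneReader() { return new ReaderSketch(cache); }

    IndexCacheSketch getIndexCache() { return cache; }

    void close() { cache.close(); }
}
```

In this shape, flaw #1 (clones treated as separate indexes) disappears because clones share one cache instance, and flaw #2 (stale entries after reopen) becomes a non-issue because a reopened reader simply carries a fresh cache.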
[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR
[ https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788762#action_12788762 ]

Mark Miller commented on LUCENE-1377:

bq. with the exception of a few core committers.

I think the exception is the other way around, especially considering Lucene contrib. Let's look at the Solr list (and consider that some are not very active in Solr currently):

||name||status||
|Bill Au| |
|Doug Cutting|Lucene Core Committer|
|Otis Gospodnetić|Lucene Core Committer|
|Erik Hatcher|Lucene Core Committer|
|Chris Hostetter|Lucene Core Committer|
|Grant Ingersoll|Lucene Core Committer|
|Mike Klaas| |
|Shalin Shekhar Mangar| |
|Ryan McKinley|Lucene Contrib Committer|
|Mark Miller|Lucene Core Committer|
|Noble Paul| |
|Yonik Seeley|Lucene Core Committer|
|Koji Sekiguchi|Lucene Contrib Committer|

Add HTMLStripReader and WordDelimiterFilter from SOLR

Key: LUCENE-1377
URL: https://issues.apache.org/jira/browse/LUCENE-1377
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: 2.3.2
Reporter: Jason Rutherglen
Priority: Minor
Original Estimate: 24h
Remaining Estimate: 24h

SOLR has two classes, HTMLStripReader and WordDelimiterFilter, which are very useful for a wide variety of use cases. It would be good to place them into core Lucene.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788874#action_12788874 ]

Mark Miller commented on LUCENE-2133:

I don't know that back compat is really a concern if we are just leaving the old API intact, with its own caching mechanism? Just deprecate the old API and make a new one. This is a big pain, because you have to be sure you don't straddle the two APIs when upgrading, but that's the boat we will be in anyway. Which means a new impl should provide enough benefits to make that large pain worth enduring. 831 was not committed for the same reason - it didn't bring enough to the table to be worth it after we got to a per-segment cache in another way. Since I don't see that this provides anything over 831, I don't see how it's not in the same boat.

I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. It's anybody's guess - we should probably just do what makes sense and commit it when it's ready.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788908#action_12788908 ]

Mark Miller commented on LUCENE-2133:

bq. LUCENE-831 still requires a static FieldCache, the root of all evil :)

It doesn't require one, though. It supports a cache per segment reader just like this - except it's called a ValueSource. The CacheByReaderValueSource is just there to handle a back compat issue - it's something we would want to get around, using the reader ValueSource instead - but that patch still had a long way to go. Overall, from what I can see, the approach was about the same.

bq. It probably makes sense to start from one of Hoss's original patches or even from scratch

That was said before a lot more work was done. The API was actually starting to shape up nicely.

bq. The more complex the patches are, the longer it will take to integrate them into a new version.

Of course - and this is a complex issue with a lot of upgrade pain. As with 831, it's not really worth the pain to users without more benefits.

bq. The more such patches you have, the longer it will take to get to a new release.

That's not really true. 3.1 doesn't need this patch - there would be no reason to hold it for this. Patches go in when they are ready.

bq. Let's make it simple, submit what we have and build upon that.

I don't think that's simple :) The patch can be iterated on outside of trunk as easily as in.
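The per-segment ValueSource idea referenced above can be sketched in a few lines of plain Java. This is a hypothetical illustration of the concept, not the actual LUCENE-831 API; the SegmentKey and IntValueSource names are invented for the example:

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch of a per-segment value source: field values are
// computed and cached once per segment, with no static FieldCache
// registry involved.
interface SegmentKey {
    int maxDoc();
}

class IntValueSource {
    // Weak keys let an entry disappear when its segment reader is
    // garbage-collected -- nothing to purge from a global singleton.
    private final Map<SegmentKey, int[]> perSegment = new WeakHashMap<>();

    synchronized int[] getInts(SegmentKey segment) {
        // load once per segment; repeated calls reuse the same array
        return perSegment.computeIfAbsent(segment, s -> new int[s.maxDoc()]);
    }
}
```

Because the cache key is the segment itself, a composite reader built from the same segments reuses the same arrays, which is the property that made the per-segment approach attractive in the first place.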
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951 ]

Mark Miller commented on LUCENE-2133:

bq. That is, it adds a lot of duplicated code / different possible implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in an experimentation state - issues that needed more thought had hacked-in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.

bq. They should store arbitrary data, allow cache inspection, eviction of entries and so on.

That's extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.

{quote} It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such development, we have to somehow get the base code into SVN - not necessarily trunk, admittedly; a branch would also do. Of course, this requires additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option. {quote}

A branch is certainly a possibility, but with only one person working on it, I think it's overkill. With some additional interest, a branch can make sense - otherwise it's not worth the merging headaches. You also have to have committer(s) that are willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along the lines of what Mike brought up above started transforming it into something else. We essentially decided that unless that much was brought to the table, the disruptive change just wasn't worth it for a different cache API. I'm definitely a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951 ]

Mark Miller edited comment on LUCENE-2133 at 12/10/09 9:48 PM:

bq. That is, it adds a lot of duplicated code / different possible implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in an experimentation state - issues that needed more thought had hacked-in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.

bq. They should store arbitrary data, allow cache inspection, eviction of entries and so on.

That's extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.

{quote} It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such development, we have to somehow get the base code into SVN - not necessarily trunk, admittedly; a branch would also do. Of course, this requires additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in jira is not a viable option. {quote}

A branch is certainly a possibility, but with only one person working on it, I think it's overkill. With some additional interest, a branch can make sense - otherwise it's not worth the merging headaches. You also have to have committer(s) that are willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along the lines of what Mike brought up above started transforming it into something else. We essentially decided that unless that much was brought to the table, the disruptive change just wasn't worth it for a different cache API. I'm definitely a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
[jira] Commented: (LUCENE-2018) Reconsider boolean max clause exception
[ https://issues.apache.org/jira/browse/LUCENE-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787658#action_12787658 ]

Mark Miller commented on LUCENE-2018:

I still think this should be removed - or moved to the MTQ query itself. Then a setting on the query parser could set it, or a user could set it. It shouldn't be a sys property, and I don't necessarily think it should be on by default either.

Reconsider boolean max clause exception

Key: LUCENE-2018
URL: https://issues.apache.org/jira/browse/LUCENE-2018
Project: Lucene - Java
Issue Type: Improvement
Reporter: Mark Miller
Fix For: 3.1

Now that we have smarter multi-term queries, I think it's time to reconsider the boolean max clause setting. It made more sense before, because you could hit it unawares when the multi-term queries got huge - now it's more likely that if it happens, it's because a user built the boolean themselves. And no duh, thousands more boolean clauses means slower perf and more resources needed. We don't throw an exception when you try to use a ton of resources in a thousand other ways.

The current setting also suffers from the static hell argument - especially when you consider something like Solr's multicore feature: you can have different settings for this in different cores, and the last one is going to win. It's ugly. Yes, that could be addressed better in Solr as well - but I still think it should be less ugly in Lucene too.

I'd like to consider either doing away with it, or at the least raising it by quite a bit. Or an alternative, better solution. Right now, it ain't so great.
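The "static hell" complaint - one JVM-wide limit shared by every core - can be illustrated with a small sketch of the suggested alternative: make the clause limit an instance setting rather than a static one. The class below is hypothetical and is not Lucene's BooleanQuery API; it only shows the shape of a per-query limit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the max-clause limit lives on the query instance,
// so two Solr cores (or two callers) with different limits no longer
// fight over a single static field.
class PerQueryBooleanSketch {
    static final int DEFAULT_MAX_CLAUSES = 1024;

    private final int maxClauses;
    private final List<String> clauses = new ArrayList<>();

    PerQueryBooleanSketch() { this(DEFAULT_MAX_CLAUSES); }

    PerQueryBooleanSketch(int maxClauses) { this.maxClauses = maxClauses; }

    void add(String clause) {
        // the check is local to this query, not a JVM-wide setting
        if (clauses.size() >= maxClauses) {
            throw new IllegalStateException("too many clauses: " + maxClauses);
        }
        clauses.add(clause);
    }

    int size() { return clauses.size(); }
}
```

With this shape, a query parser could set a conservative limit for rewritten multi-term queries while hand-built queries opt out entirely, which is the per-use-case control the comment argues for.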
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787711#action_12787711 ]

Mark Miller commented on LUCENE-2133:

There are a bunch of unrelated changes (imports/names/exceptions thrown) that should be pulled from this patch.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787715#action_12787715 ] Mark Miller commented on LUCENE-2133: - Hmm ... never mind. The exception is related and most of the imports are correct - brain spin. I didn't see that the import org.apache.lucene.search.SortField; // for javadocs wasn't being used anymore anyway. The import org.apache.lucene.search.fields.IndexFieldCache in NumericQuery should get a // for javadocs comment so someone doesn't accidentally remove it. And I guess the t to threadLocal change doesn't hurt with the amount you're changing that anyway. It's a better name. This looks pretty nice overall.

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
Key: LUCENE-2133
URL: https://issues.apache.org/jira/browse/LUCENE-2133
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch

Hi all, up to the current version Lucene contains a conceptual flaw: the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open. The FieldCache is flawed because it is incorrect to assume that:
1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
2. the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information.
3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl.
Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: there is a central registry for assigning caches to IndexReader instances. I now propose the following:
1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache).
2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact the overall performance).
5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField).
I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards compatibility. Backwards compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping the old code in org.apache.lucene.search. In detail, and besides the above-mentioned improvements, the following is provided:
1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
2. A housekeeping improvement to CloseableThreadLocal. It now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
3. Analyzer.close now may throw an IOException (this is already covered by java.io.Closeable).
4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader().
5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility of throwing IllegalArgumentExceptions because of unsupported type values.
The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:
- FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
- FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
- FieldCache (= IndexFieldCache)
- FieldCacheImpl (= IndexFieldCacheImpl)
- all classes in FieldCacheImpl (= several package-level classes)
- all subclasses of FieldComparator (= several package-level classes)
Final notes:
- The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can immediately be removed, which ones later, which not at all. Whenever new classes depend on the old ones, an appropriate notice
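The heart of the proposal - a cache owned by each reader and closed with it, instead of a global singleton keyed by reader - can be sketched in a few lines. This is a minimal illustration under assumptions: only the name IndexCache and the reader-owns-cache rule come from the issue text; the map, the stubbed getInts loader, and the ReaderWithCache class are invented for the sketch.

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-index cache: owned by one reader, closed with it. */
class IndexCache implements Closeable {
    private final Map<String, int[]> ints = new HashMap<>();

    /** Field-cache style lookup: compute once per field, then reuse. */
    int[] getInts(String field) {
        // Real code would uninvert the field here; the loader is stubbed out.
        return ints.computeIfAbsent(field, f -> new int[0]);
    }

    int size() { return ints.size(); }

    /** Flushed/closed by the owning reader, not by a global registry. */
    @Override public void close() { ints.clear(); }
}

/** Stand-in for an IndexReader that owns its cache (point 3 of the proposal). */
class ReaderWithCache implements Closeable {
    final IndexCache cache = new IndexCache();
    @Override public void close() { cache.close(); }
}
```

Because the cache's lifetime is exactly the reader's lifetime, the stale-entry problem with reopen()'ed readers disappears by construction.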
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787729#action_12787729 ] Mark Miller commented on LUCENE-2133: - A couple more quick notes: I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, it's not likely to affect many and it was experimental - so it's a tough call. It's a lot of classes in there ;) I'm also not sure if fields is the right package name? And do the Filters belong in that package? Also, almost a non-issue, but extending a deprecated class is going to be an ultra-minor back-compat break when it's removed. Not likely a problem though. But we might put a note to that effect to be clear. It is almost self-documenting anyway though :) Rather than changing the tests to the new classes, we should probably copy them and make new ones - then remove them when the deprecations are removed. Also, you should pull the author tag(s) - all credit is through JIRA and CHANGES. (I only see it like once, so I bet that's Eclipse?) I haven't done a thorough review of it all, but this is pretty great stuff to appear so complete and out of nowhere :)
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787734#action_12787734 ] Mark Miller commented on LUCENE-2133: - It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed.
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787734#action_12787734 ] Mark Miller edited comment on LUCENE-2133 at 12/8/09 8:42 PM: -- It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter). was (Author: markrmil...@gmail.com): It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787748#action_12787748 ] Mark Miller commented on LUCENE-2133: - bq. I think it does not hurt either.
I didn't notice that you actually just deprecated the originals - I guess that's not a complete rug pull ... By the way, I don't think you need to deprecate something in a new class (IndexFieldCacheImpl):
{code}
/**
 * @deprecated Use {@link #clear()} instead.
 */
public void purgeAllCaches() {
  init();
}
{code}
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787752#action_12787752 ] Mark Miller edited comment on LUCENE-2133 at 12/8/09 9:34 PM: -- And what about the doubling-up insanity? It looks like you just commented out that check? It appears to me that that's still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a top-level reader and its sub-readers for caches unless they *really* intend to. *edit* This type of change actually even exacerbates that problem (though if we want to improve things here, it's something we will have to deal with). Now you might have a mixture of old-API/new-API caches as well if you don't properly upgrade everything at once. was (Author: markrmil...@gmail.com): And what about the doubling-up insanity? It looks like you just commented out that check? It appears to me that that's still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a top-level reader and its sub-readers for caches unless they *really* intend to.
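The "doubling up" insanity being discussed can be illustrated with a toy cache. This is not Lucene's FieldCacheSanityChecker - the ToyCache class and its heuristic are invented for this sketch - but it shows the symptom the checker looks for: caching values for a top-level reader and for each of its sub-readers stores the same data twice, and that shows up as one entry whose document count equals the sum of the others.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Toy field cache keyed by reader identity (illustrative only). */
class ToyCache {
    private final Map<Object, int[]> entries = new IdentityHashMap<>();

    /** Returns a value array of docCount entries for the given reader key. */
    int[] getInts(Object readerKey, int docCount) {
        return entries.computeIfAbsent(readerKey, k -> new int[docCount]);
    }

    /**
     * Crude sanity heuristic: if one entry's doc count equals the sum of all
     * the others, a composite reader and its segments were probably both cached.
     */
    boolean looksDoubled() {
        List<Integer> sizes = new ArrayList<>();
        for (int[] v : entries.values()) sizes.add(v.length);
        for (int i = 0; i < sizes.size(); i++) {
            int sumOfOthers = 0;
            for (int j = 0; j < sizes.size(); j++) {
                if (j != i) sumOfOthers += sizes.get(j);
            }
            if (sumOfOthers > 0 && sizes.get(i) == sumOfOthers) return true;
        }
        return false;
    }
}
```

Populating two segment entries is fine; adding a top-level entry covering both is the pattern the comment wants to keep detectable.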
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787767#action_12787767 ] Mark Miller commented on LUCENE-2133: - bq. not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component
At a minimum, you should be able to set the cache for the reader.
bq. For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching
The way he has it, this is not just for the FieldCache, but also the FieldsReader and TermVectorsReader - if we go down that road, we should consider norms as well.
bq. I see no problems with applying it soon
I still think it might be a little early. This has a lot of consequences.
[jira] Resolved: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used
[ https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-2106. - Resolution: Fixed Benchmark does not close its Reader when OpenReader/CloseReader are not used Key: LUCENE-2106 URL: https://issues.apache.org/jira/browse/LUCENE-2106 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Affects Versions: 3.0 Reporter: Mark Miller Assignee: Mark Miller Fix For: 3.0.1, 3.1 Attachments: LUCENE-2106.patch Only the Searcher is closed, but because the reader is passed to the Searcher, the Searcher does not close the Reader, causing a resource leak. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
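The ownership rule behind this bug can be shown without any Lucene code. BenchReader and BenchSearcher below are invented stand-ins, not the benchmark classes: a searcher that is handed an already-open reader does not own it, so closing only the searcher leaks the reader - the code that opened the reader must close it too.

```java
import java.io.Closeable;

/** Stand-in for an index reader that tracks whether it was closed. */
class BenchReader implements Closeable {
    boolean open = true;
    @Override public void close() { open = false; }
}

/** Stand-in for a searcher built from a borrowed (not owned) reader. */
class BenchSearcher implements Closeable {
    private final BenchReader reader;
    BenchSearcher(BenchReader reader) { this.reader = reader; }
    // Mirrors the bug's cause: a searcher does not close a reader it was given.
    @Override public void close() { /* intentionally leaves the reader open */ }
}

class BenchTask {
    /** The fix: whoever opened the reader closes it, after the searcher. */
    static BenchReader runAndClose() {
        BenchReader reader = new BenchReader();
        BenchSearcher searcher = new BenchSearcher(reader);
        try {
            // ... run queries ...
        } finally {
            searcher.close();
            reader.close(); // without this line, the reader leaks
        }
        return reader;
    }
}
```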
[jira] Commented: (LUCENE-1844) Speed up junit tests
[ https://issues.apache.org/jira/browse/LUCENE-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787004#action_12787004 ] Mark Miller commented on LUCENE-1844: - It should work fine. Speed up junit tests Key: LUCENE-1844 URL: https://issues.apache.org/jira/browse/LUCENE-1844 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Assignee: Michael McCandless Fix For: 3.1 Attachments: FastCnstScoreQTest.patch, hi_junit_test_runtimes.png, LUCENE-1844-Junit3.patch, LUCENE-1844.patch, LUCENE-1844.patch, LUCENE-1844.patch As Lucene grows, so does the number of JUnit tests. This is obviously a good thing, but it comes with longer and longer test times. Now that we also run back compat tests in a standard test run, this problem is essentially doubled. There are some ways this may get better, including running parallel tests. You will need the hardware to fully take advantage, but it should be a nice gain. There is already an issue for this, and Junit 4.6, 4.7 have the beginnings of something we might be able to count on soon. 4.6 was buggy, and 4.7 still doesn't come with nice ant integration. Parallel tests will come though. Beyond parallel testing, I think we also need to concentrate on keeping our tests lean. We don't want to sacrifice coverage or quality, but I'm sure there is plenty of fat to skim. I've started making a list of some of the longer tests - I think with some work we can make our tests much faster - and then with parallelization, I think we could see some really great gains.
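The parallelization idea in the issue above can be illustrated independently of any test framework: independent test tasks go onto a thread pool, and the run fails if any task fails. This is a generic sketch of the concept, not the actual ant/JUnit integration the issue discusses; the ParallelRunner class and runAll method are invented for illustration:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Callable;

/** Toy illustration of running independent test tasks concurrently. */
class ParallelRunner {
    /** Runs all tasks on a fixed pool; returns true only if every task passes. */
    static boolean runAll(List<Callable<Boolean>> tests, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            boolean allPassed = true;
            // invokeAll blocks until every task has completed
            for (Future<Boolean> result : pool.invokeAll(tests)) {
                allPassed &= result.get();
            }
            return allPassed;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caveat from the thread applies here too: the speedup is bounded by the available hardware and by how independent the tests really are (shared directories, ports, or static state serialize them again).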
[jira] Created: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor This issue is likely not to go anywhere, but I thought we might explore it. The only idea I have come up with is fairly ugly, and unless something better comes up, this is not likely to happen. But if we could rewrite constant score multi-term queries per segment, MTQs with auto, constant, or constant boolean rewrite could enum terms against a single segment and then apply a boolean query against each segment with just the terms that are known to be in that segment. This way, if you have a bunch of really large segments and a lot of really small segments, you wouldn't apply a huge BooleanQuery against all of the small segments which don't have those terms anyway. How advantageous this is, I'm not sure yet. No biggie, not likely, but what the heck. So the ugly way to do it is to add a property to queries and weights - lateCnstRewrite or something, that defaults to false. MTQ would return true if it's in a constant score mode. On the top level rewrite, if this is detected, an empty ConstantScoreQuery is made, and its Weight is turned to lateCnstRewrite and it keeps a ref to the original MTQ query. It also gets its boost set to the MTQ's boost. Then when we are searching per segment, if the Weight is lateCnstRewrite, we grab the orig query and actually do the rewrite against the subreader and grab the actual constant score weight. It works I think - but its a little ugly.
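The deferred-rewrite trick described above can be sketched with toy stand-ins. These are not Lucene's real Query/Weight classes; Segment, PrefixQuery, TermListQuery, and LateConstantWeight below are invented to show the shape of the idea: the weight holds a reference to the original multi-term query and only expands it against each segment's own terms, so small segments get a small (possibly empty) expansion:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy segment: just the set of terms it contains. */
class Segment {
    final Set<String> terms;
    Segment(Set<String> terms) { this.terms = terms; }
}

interface Query {
    Query rewrite(Segment seg);
}

/** Stand-in for an MTQ: expands a prefix against one segment's terms only. */
class PrefixQuery implements Query {
    final String prefix;
    PrefixQuery(String prefix) { this.prefix = prefix; }

    @Override
    public Query rewrite(Segment seg) {
        List<String> matched = new ArrayList<>();
        for (String t : seg.terms) {
            if (t.startsWith(prefix)) matched.add(t);
        }
        return new TermListQuery(matched);
    }
}

/** The per-segment expansion: only terms known to exist in that segment. */
class TermListQuery implements Query {
    final List<String> terms;
    TermListQuery(List<String> terms) { this.terms = terms; }

    @Override
    public Query rewrite(Segment seg) { return this; }
}

/** "lateCnstRewrite" weight: keeps the original MTQ and rewrites it per segment. */
class LateConstantWeight {
    final Query original;
    LateConstantWeight(Query original) { this.original = original; }

    Query forSegment(Segment seg) {
        return original.rewrite(seg);
    }
}
```

With a large segment containing the matching terms and a small segment without them, the small segment's rewrite comes back empty instead of carrying the full term expansion, which is exactly the saving the issue speculates about.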
[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239 ] Mark Miller commented on LUCENE-2130: - Whoops - a little off in that summary - you wouldn't apply a huge boolean query - you'd just have a sparser filter. This might not be that beneficial. Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor
[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787248#action_12787248 ] Mark Miller commented on LUCENE-2130: - Okay - so talking to Robert in chat - the advantage when you are enumerating a lot of terms is that you avoid DirectoryReader's MultiTermEnum and its PQ. Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor