Re: [Lucene.Net] [jira] [Commented] (LUCENENET-412) Replacing ArrayLists, Hashtables etc. with appropriate Generics.
Can I use this version with an existing index based on Lucene (Java) 3.0.3?

Alex

On 19.05.2011 00:20, Digy (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENENET-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035795#comment-13035795 ] Digy commented on LUCENENET-412:

Hi All, Lucene.Net 2.9.4g is almost ready for testing feedback. While injecting generics and doing some cleanup in the code, I tried to stay as close to Lucene 3.0.3 as possible. Therefore its position is somewhere between Lucene (Java) 2.9.4 and 3.0.3. DIGY

PS: For those who might want to try this version: it probably won't be a drop-in replacement, since there are a few API changes, like
- StopAnalyzer(List<string> stopWords)
- Query.ExtractTerms(ICollection<string>)
- TopDocs.TotalHits, TopDocs.ScoreDocs
and some removed methods/classes, like
- Filter.Bits
- JustCompileSearch
- Contrib/Similarity.Net

Replacing ArrayLists, Hashtables etc. with appropriate Generics.
Key: LUCENENET-412
URL: https://issues.apache.org/jira/browse/LUCENENET-412
Project: Lucene.Net
Issue Type: Improvement
Affects Versions: Lucene.Net 2.9.4
Reporter: Digy
Priority: Minor
Fix For: Lucene.Net 2.9.4
Attachments: IEquatable for QuerySubclasses.patch, LUCENENET-412.patch, lucene_2.9.4g_exceptions_fix

This will move Lucene.Net 2.9.4 closer to Lucene (Java) 3.0.3 and allow some performance gains.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035223#comment-13035223 ] Shai Erera commented on LUCENE-3102:

There are two things left to do:

(1) Use a bit set instead of int[] for docIDs. If we do this, it means the Collector cannot support out-of-order collection (which is not a big deal IMO). It also means that for large indexes we might consume more RAM than with int[].

(2) Allow this Collector to stand on its own, w/o necessarily wrapping another Collector. There are several ways we can achieve that:
* Take a 'null' Collector and check other != null. Adds an 'if' but not a big deal IMO. Also, acceptDocsOutOfOrder will have to either return false (or true), or we take that as a parameter.
* Take a 'null' Collector and set this.other to a private static instance of a NoOpCollector. We'll still be delegating calls to it, but hopefully it won't be expensive. Same issue w/ out-of-order.
* Create two specialized variants of CachingCollector.

Personally I'm not too much in favor of the last option - too much code dup for not much gain. The option I like the most is the 2nd (introducing a NoOpCollector). We can even introduce it as a public static member of CachingCollector and let users decide if they want to use it or not. For ease of use, we can still allow 'null' to be passed to create(). What do you think?
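The NoOpCollector option could look roughly like the following. This is a minimal sketch, assuming a simplified stand-in for Lucene's abstract Collector (the real class also has setScorer/setNextReader); all names here are illustrative, not the actual patch.

```java
// Simplified stand-in for Lucene's abstract Collector (assumption: the
// real class has more methods, e.g. setScorer and setNextReader).
abstract class Collector {
    public abstract void collect(int doc);
    public abstract boolean acceptsDocsOutOfOrder();
}

// The proposed public static no-op instance: delegating to it is cheap,
// and since it ignores everything, hit order cannot matter.
final class NoOpCollector extends Collector {
    public static final NoOpCollector INSTANCE = new NoOpCollector();
    private NoOpCollector() {}
    @Override public void collect(int doc) { /* ignore everything */ }
    @Override public boolean acceptsDocsOutOfOrder() { return true; }
}

class CachingCollectorSketch {
    // For ease of use, create() can still accept null and map it to the no-op.
    static Collector wrapOrNoOp(Collector other) {
        return other == null ? NoOpCollector.INSTANCE : other;
    }
}
```

With this shape there is no per-call `if (other != null)` check in collect(); the null handling happens once, at creation time.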
Few issues with CachingCollector
Key: LUCENE-3102
URL: https://issues.apache.org/jira/browse/LUCENE-3102
Project: Lucene - Java
Issue Type: Bug
Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, LUCENE-3102.patch

CachingCollector (introduced in LUCENE-1421) has a few issues:
# Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
# It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
# I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity.
# Can this check, if (base + nextLength > maxDocsToCache) (line 168), be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?

Also:
* The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly?
* Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189).
* I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core?
* How about using OpenBitSet instead of int[] for doc IDs?
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries.
** NOTE: OpenBitSet can be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order.
* Do you think we can modify this Collector to not necessarily wrap another Collector?
We have such a Collector, which stores (in memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
* I think a set of dedicated unit tests for this class alone would be good.

That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084:

Attachment: LUCENE-3084-trunk-only.patch

The only addition in this patch is a cache of the unmodifiable collections (like Java's core collections have). This prevents creation of a new instance on each asList() call. Mike: Do you have any further comments? Otherwise I will commit in a day or two (before leaving to Lucene Rev).

MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
Key: LUCENE-3084
URL: https://issues.apache.org/jira/browse/LUCENE-3084
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch

SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List<SI> instead.
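The caching Uwe describes can be sketched like this (a rough illustration, assuming hypothetical names -- String stands in for SegmentInfo; this is not the actual LUCENE-3084 code). The key point is that Collections.unmodifiableList returns a live view of the backing list, so it can be created once and handed out forever, the same trick Java's core collections use for keySet()/values():

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of SegmentInfos hiding its collection and caching the
// unmodifiable view (names are illustrative, not the real fields).
class SegmentInfosSketch {
    private final List<String> segments = new ArrayList<String>(); // stands in for List<SegmentInfo>
    private List<String> asListView; // lazily created, then cached

    void add(String si) { segments.add(si); }

    List<String> asList() {
        if (asListView == null) {
            // Created once; the wrapper reflects later mutations of
            // 'segments', so no invalidation is needed.
            asListView = Collections.unmodifiableList(segments);
        }
        return asListView; // same instance on each call, no new allocation
    }
}
```

Callers can iterate the view freely but any mutation attempt through it throws UnsupportedOperationException, which is exactly the encapsulation the refactoring is after.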
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos and more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084:

Description: SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List<SI> instead. Also, SegmentInfos subclasses Vector<SI>; this should be removed and the collections hidden inside the class. We can add unmodifiable views on it (asList(), asSet()).

was: SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List<SI> instead.

Summary: MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos and more refactoring (was: MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos)

MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos and more refactoring
Key: LUCENE-3084
URL: https://issues.apache.org/jira/browse/LUCENE-3084
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch
[jira] [Commented] (LUCENE-3108) Land DocValues on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035234#comment-13035234 ] Simon Willnauer commented on LUCENE-3108:

Mike, thanks for the review!

bq. Phew been a long time since I looked at this branch!

its been changing :)

bq. We have some stale jdocs that reference .setIntValue methods (they are now .setInt)

True - thanks, I will fix.

bq. Hmm do we have byte ordering problems? Ie, if I write index on machine with little-endian but then try to load values on big-endian...? I think we're OK (we seem to always use IndexOutput.writeInt, and we convert float-to-raw-int-bits using java's APIs)?

We are OK here, since we write big-endian (enforced by DataOutput) and read it back in as plain bytes. The created ByteBuffer will always use BIG_ENDIAN as the default order. I added a comment for this.

bq. How come codecID changed from String to int on the branch?

Due to DocValues I need to compare the ID to certain fields to see for which field I stored docValues and need to open them. I always had to parse the given string, which is kind of odd. I think it's more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing, IMO.

bq. What are oal.util.Pair and ParallelArray for?

Legacy - I will remove them.

bq. FloatsRef should state in the jdocs that it's really slicing a double[]?

Yep, done!

bq. Can SortField somehow detect whether the needed field was stored in FC vs DV and pick the right comparator accordingly...? Kind of like how NumericField can detect whether the ints are encoded as plain text or as NF? We can open a new issue for this, post-landing...

This is tricky though. You can have a DV field that is indexed too, so it's hard to tell if we can do it reliably. If we can't make it reliable, I think we should not do it at all.

bq.
It looks like we can sort by int/long/float/double pulled from DV, but not by terms? This is fine for landing... but I think we should open a post-landing issue to also make FieldComparators for the Terms cases?

Yeah, true. I didn't add a FieldComparator for bytes yet. I think this is post-landing!

bq. Should we rename oal.index.values.Type - .ValueType? Just because... it looks so generic when it's imported and used as Type somewhere?

Agreed. I also think we should rename Source, but I don't have a good name yet. Any idea?

bq. Since we dynamically reserve a value to mean unset, does that mean there are some datasets we cannot index? Or... do we tap into the unused bit of a long, ie the sentinel value can be negative? But if the data set spans Long.MIN_VALUE to Long.MAX_VALUE, what do we do...?

This is tricky! The quick answer is yes, there are datasets we cannot index, but we can't tap into the unused bit anyway, since I have not normalized the range to be 0-based and PackedInts doesn't allow negative values. So the range we can store is (2^63) - 1 wide; essentially, with the current impl we can store (2^63) - 2 values and the max value is Long#MAX_VALUE - 1. Currently there is no assert for this, which is needed I think, but to get around it we would need a different impl - or do I miss something? I will make the changes once SVN is writeable again.

Land DocValues on trunk
Key: LUCENE-3108
URL: https://issues.apache.org/jira/browse/LUCENE-3108
Project: Lucene - Java
Issue Type: Task
Components: core/index, core/search, core/store
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Fix For: 4.0

It's time to move another feature from branch to trunk. I want to start this process now while a couple of issues still remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions), but I think those are not worth separate issues, so we can resolve them as we go.
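Simon's byte-ordering point can be checked with a tiny round trip: Java's float-to-raw-int-bits conversion is platform independent, DataOutput writes big-endian by spec, and a wrapping ByteBuffer reads big-endian by default. A self-contained sketch (the byte shuffling below mimics what DataOutput.writeInt does, rather than calling Lucene itself):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Demonstrates why the index is portable across machines of different
// endianness: writes are big-endian by specification, and ByteBuffer's
// default order is BIG_ENDIAN regardless of the platform's native order.
class ByteOrderCheck {
    // Mimics DataOutput.writeInt: most significant byte first (big-endian).
    static byte[] writeFloat(float value) {
        int bits = Float.floatToRawIntBits(value); // platform-independent raw bits
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8), (byte) bits
        };
    }

    static float readFloat(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        // No order() call needed: the default is already BIG_ENDIAN.
        assert buffer.order() == ByteOrder.BIG_ENDIAN;
        return Float.intBitsToFloat(buffer.getInt());
    }
}
```

The round trip produces identical bits on any JVM, which is why reading the values back as plain bytes into a ByteBuffer is safe.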
The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process, IMO; we can fix them once we are on trunk. Here is a quick overview of what has been implemented:
* DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size, each in sorted, straight and deref variations)
* Integration into the Flex API; a Codec provides a PerDocConsumer / DocValuesConsumer (write) and PerDocValues / DocValues (read)
* Enabled by default in all codecs except PreFlex
* Follows other flex-API patterns, like non-segment readers throwing UOE, forcing MultiPerDocValues if on a DirReader etc.
* Integration into IndexWriter, FieldInfos etc.
* Random testing enabled via RandomIW - injecting random DocValues into documents
* Basic checks in CheckIndex (which
[jira] [Issue Comment Edited] (LUCENE-3108) Land DocValues on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035234#comment-13035234 ] Simon Willnauer edited comment on LUCENE-3108 at 5/18/11 8:20 AM:

Mike, thanks for the review!

bq. Phew been a long time since I looked at this branch!

its been changing :)

{quote}We have some stale jdocs that reference .setIntValue methods (they are now .setInt){quote}

True - thanks, I will fix.

{quote}Hmm do we have byte ordering problems? Ie, if I write index on machine with little-endian but then try to load values on big-endian...? I think we're OK (we seem to always use IndexOutput.writeInt, and we convert float-to-raw-int-bits using java's APIs)?{quote}

We are OK here, since we write big-endian (enforced by DataOutput) and read it back in as plain bytes. The created ByteBuffer will always use BIG_ENDIAN as the default order. I added a comment for this.

{quote}How come codecID changed from String to int on the branch?{quote}

Due to DocValues I need to compare the ID to certain fields to see for which field I stored docValues and need to open them. I always had to parse the given string, which is kind of odd. I think it's more natural to have the same datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() method. Making a string out of it is way simpler / less risky than parsing, IMO.

{quote}What are oal.util.Pair and ParallelArray for?{quote}

Legacy - I will remove them.

{quote}FloatsRef should state in the jdocs that it's really slicing a double[]?{quote}

Yep, done!

{quote}Can SortField somehow detect whether the needed field was stored in FC vs DV and pick the right comparator accordingly...? Kind of like how NumericField can detect whether the ints are encoded as plain text or as NF? We can open a new issue for this, post-landing...{quote}

This is tricky though. You can have a DV field that is indexed too, so it's hard to tell if we can do it reliably.
If we can't make it reliable, I think we should not do it at all.

{quote}It looks like we can sort by int/long/float/double pulled from DV, but not by terms? This is fine for landing... but I think we should open a post-landing issue to also make FieldComparators for the Terms cases?{quote}

Yeah, true. I didn't add a FieldComparator for bytes yet. I think this is post-landing!

{quote}Should we rename oal.index.values.Type - .ValueType? Just because... it looks so generic when it's imported and used as Type somewhere?{quote}

Agreed. I also think we should rename Source, but I don't have a good name yet. Any idea?

{quote}Since we dynamically reserve a value to mean unset, does that mean there are some datasets we cannot index? Or... do we tap into the unused bit of a long, ie the sentinel value can be negative? But if the data set spans Long.MIN_VALUE to Long.MAX_VALUE, what do we do...?{quote}

Again, tricky! The quick answer is yes, there are datasets we cannot index, but we can't tap into the unused bit anyway, since I have not normalized the range to be 0-based and PackedInts doesn't allow negative values. So the range we can store is (2^63) - 1 wide; essentially, with the current impl we can store (2^63) - 2 values and the max value is Long#MAX_VALUE - 1. Currently there is no assert for this, which is needed I think, but to get around it we would need a different impl - or do I miss something? I will make the changes once SVN is writeable again.
Solr ByteUtils
Hey there,

I just ran into org.apache.solr.util.ByteUtils, which seems pretty much like a duplication of UnicodeUtils in Lucene. I think we should get rid of it and merge what needs to be merged into UnicodeUtils. This utils class is really just doing unicode stuff.

simon
Re: Solr ByteUtils
+1

Mike
http://blog.mikemccandless.com

On Wed, May 18, 2011 at 4:34 AM, Simon Willnauer simon.willna...@googlemail.com wrote:
Hey there, I just ran into org.apache.solr.util.ByteUtils which seems pretty much like a duplication of UnicodeUtils in Lucene. I think we should get rid of it and merge what needs to be merged into UnicodeUtils. This utils class is really just doing unicode stuff. simon
[jira] [Commented] (LUCENE-1888) Provide Option to Store Payloads on the Term Vector
[ https://issues.apache.org/jira/browse/LUCENE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035277#comment-13035277 ] Michal Fapso commented on LUCENE-1888:

Hi Peter, I work on the same thing. You can get my code from here: http://speech.fit.vutbr.cz/en/software/speech-search (Lucene extension for bin sequences); there are also some testing data. That code runs behind this website: http://www.superlectures.com/odyssey/ It is a few months old, so if you are interested, I can send you our current version. Best regards, Michal Fapso

Provide Option to Store Payloads on the Term Vector
Key: LUCENE-1888
URL: https://issues.apache.org/jira/browse/LUCENE-1888
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 4.0

Would be nice to have the option to access the payloads in a document-centric way by adding them to the Term Vectors. Naturally, this makes the Term Vectors bigger, but it may be just what one needs.
[jira] [Issue Comment Edited] (LUCENE-1888) Provide Option to Store Payloads on the Term Vector
[ https://issues.apache.org/jira/browse/LUCENE-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035277#comment-13035277 ] Michal Fapso edited comment on LUCENE-1888 at 5/18/11 10:40 AM:

Hi Peter, I work on the same thing. You can get my code from here: http://speech.fit.vutbr.cz/en/software/speech-search (Lucene extension for bin sequences); there are also some testing data. Actually it indexes word confusion networks with scores of hypotheses, but of course it will work also for 1-best string transcripts. That code runs behind this website: http://www.superlectures.com/odyssey/ It is a few months old, so if you are interested, I can send you our current version. Best regards, Michal Fapso

Provide Option to Store Payloads on the Term Vector
Key: LUCENE-1888
URL: https://issues.apache.org/jira/browse/LUCENE-1888
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 4.0
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102:

Attachment: LUCENE-3102-nowrap.patch

Patch against 3x:
* Adds a create() to CachingCollector which does not take a Collector to wrap. Internally, it creates a no-op collector, which ignores everything.
* Javadocs for create()
* A matching test.

Few issues with CachingCollector
Key: LUCENE-3102
URL: https://issues.apache.org/jira/browse/LUCENE-3102
Project: Lucene - Java
Issue Type: Bug
Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
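The cache-and-replay contract behind the no-wrap create() can be modeled without any Lucene types. A deliberately minimal sketch (assumption: the real CachingCollector also tracks per-segment readers, scores and a RAM limit; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Minimal model of a caching collector created without a wrapped Collector:
// collect() only records doc IDs, and replay() can feed them to any number
// of downstream consumers, any number of times.
class MiniCachingCollector {
    private final List<Integer> cachedDocs = new ArrayList<>();

    void collect(int doc) {
        cachedDocs.add(doc); // nothing to delegate to -- just cache
    }

    void replay(IntConsumer target) {
        for (int doc : cachedDocs) {
            target.accept(doc);
        }
    }
}
```

Replaying into two different consumers yields identical streams, which is exactly the property a "replay works twice" test would assert.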
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos and more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035282#comment-13035282 ] Michael McCandless commented on LUCENE-3084:

Looks awesome Uwe! +1 to commit. Some small variable naming suggestions:
* Rename cloneChilds - cloneChildren (sis.createBackupSIS)
* Maybe call it (and invert) mapIndexesValid instead of mapIndexesInvalid (in SIS.java)? I generally prefer not putting 'not' into boolean variables when possible, for sanity...

MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos and more refactoring
Key: LUCENE-3084
URL: https://issues.apache.org/jira/browse/LUCENE-3084
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035283#comment-13035283 ] Michael McCandless commented on LUCENE-3102:

The committed CHANGES entry has a typo ('reply' should be 'replay').

Few issues with CachingCollector
Key: LUCENE-3102
URL: https://issues.apache.org/jira/browse/LUCENE-3102
Project: Lucene - Java
Issue Type: Bug
Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos and more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035284#comment-13035284 ] Uwe Schindler commented on LUCENE-3084: --- OK! Thanks Mike bq. mapIndexesInvalid I will remove the map again and replace it with a simple Set. Using a map that maps to indexes is too complicated and does not bring us anything. contains() works without it, and indexOf() needs to rebuild the map whenever an insert or remove is done. Especially on remove(SI), it will rebuild the map two times in the worst case. A linear scan for indexOf is in my opinion fine. We can only optimize by doing a contains on the set first. MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SIs, but for merging purposes these fields are unused. We should cut over to List<SI> instead. Also, SegmentInfos subclasses Vector<SI>; this should be removed and the collections hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
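The unmodifiable-view idea (asList(), asSet()) can be sketched like this - a toy class with String standing in for SegmentInfo, purely for illustration and not the actual SegmentInfos code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: the internal list stays private (instead of the
// class subclassing Vector), and callers get read-only views.
class SegmentInfosSketch {
    private final List<String> segments = new ArrayList<>();

    void add(String si) { segments.add(si); }

    // Read-only views; mutation attempts throw UnsupportedOperationException.
    List<String> asList() { return Collections.unmodifiableList(segments); }
    Set<String> asSet() { return Collections.unmodifiableSet(new HashSet<>(segments)); }
}
```

The asList() view is a live wrapper (it reflects later internal changes), while asSet() here is a snapshot copy; either way, callers can no longer mutate the internal collection directly.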
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035285#comment-13035285 ] Michael McCandless commented on LUCENE-3102: Patch to allow no wrapped collector looks good! I wonder/hope HotSpot can realize those method calls are no-ops... Maybe change TestGrouping to randomly use this ctor? Ie, randomly, you can use the caching collector (not wrapped), then call its replay method twice (once against the 1st-pass, then against the 2nd-pass, collectors), and then assert results like normal. This is also a good verification that replay works twice... On the OBS, it makes me nervous to just always do this; I'd rather have it cut over at some point? Or perhaps it's an expert optional arg to create(), whether it should back w/ OBS vs int[]? Or, ideally... we make a bit set impl that does this all under the hood (uses int[] when there are few docs, and upgrades to OBS once there are enough to justify it...), then we can just use that bit set here. Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
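The "bit set impl that does this all under the hood" McCandless suggests could look roughly like the sketch below. All names and the upgrade threshold are invented for illustration, and java.util.BitSet stands in for Lucene's OpenBitSet to keep the sketch self-contained:

```java
import java.util.BitSet;

// Illustrative sketch of the "upgrade under the hood" idea: store doc IDs
// in an int[] while the set is small, switch to a BitSet once the array
// would cost more RAM than the bit set would.
class UpgradingDocIdSet {
    private int[] docs = new int[4];
    private int count = 0;
    private BitSet bits;          // non-null once upgraded
    private final int maxDoc;

    UpgradingDocIdSet(int maxDoc) { this.maxDoc = maxDoc; }

    // In-order collection assumed: a bit set loses insertion order, which
    // is why the out-of-order case can't use the upgraded form.
    void add(int doc) {
        if (bits != null) { bits.set(doc); return; }
        if (count == docs.length) {
            // Upgrade once the int[] (32 bits per doc) would outweigh
            // a bit set (1 bit per doc in the index).
            if ((long) docs.length * 32L > maxDoc) {
                bits = new BitSet(maxDoc);
                for (int i = 0; i < count; i++) bits.set(docs[i]);
                bits.set(doc);
                docs = null;
                return;
            }
            int[] bigger = new int[docs.length * 2];
            System.arraycopy(docs, 0, bigger, 0, count);
            docs = bigger;
        }
        docs[count++] = doc;
    }

    boolean contains(int doc) {
        if (bits != null) return bits.get(doc);
        for (int i = 0; i < count; i++) if (docs[i] == doc) return true;
        return false;
    }

    boolean upgraded() { return bits != null; }
}
```

Callers only ever see add() and contains(); whether the backing storage is the int[] or the bit set is an internal detail, which is the point of the suggestion.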
[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035287#comment-13035287 ] Michael McCandless commented on SOLR-2524: -- +1, this would be awesome Martijn!! In general we should try hard to backport features we build on trunk to 3.x, when feasible. Adding grouping to Solr 3x -- Key: SOLR-2524 URL: https://issues.apache.org/jira/browse/SOLR-2524 Project: Solr Issue Type: New Feature Affects Versions: 3.2 Reporter: Martijn van Groningen Grouping was recently added to Lucene 3x; see LUCENE-1421 for more information. I think it would be nice if we expose this functionality also to the Solr users that are bound to a 3.x version. The grouping feature added to Lucene is currently a subset of the functionality that Solr 4.0-trunk offers; mainly, it doesn't support grouping by function / query. The work involved in getting the grouping contrib to work on Solr 3x is acceptable. I have it more or less running here. It supports the response format and request parameters (except group.query and group.func) described in the FieldCollapse page on the Solr wiki. I think it would be great if this is included in the Solr 3.2 release. Many people are using grouping as a patch now and this would help them a lot. Any thoughts? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync
Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2883) Consolidate Solr & Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035331#comment-13035331 ] Chris Male commented on LUCENE-2883: Script that needs to be run to apply the patch:
{code}
svn mkdir --parents modules/queries/src/java/org/apache/lucene/queries/function
svn move solr/src/java/org/apache/solr/search/function/FunctionQuery.java modules/queries/src/java/org/apache/lucene/queries/function/FunctionQuery.java
svn move solr/src/java/org/apache/solr/search/function/ValueSource.java modules/queries/src/java/org/apache/lucene/queries/function/ValueSource.java
svn move solr/src/java/org/apache/solr/search/function/DocValues.java modules/queries/src/java/org/apache/lucene/queries/function/DocValues.java
svn move solr/src/java/org/apache/solr/search/MutableValue.java modules/queries/src/java/org/apache/lucene/queries/function/MutableValue.java
svn move solr/src/java/org/apache/solr/search/MutableValueFloat.java modules/queries/src/java/org/apache/lucene/queries/function/MutableValueFloat.java
{code}
Consolidate Solr & Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: core/search Affects Versions: 4.0 Reporter: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2883.patch Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2883) Consolidate Solr & Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-2883: --- Attachment: LUCENE-2883.patch Patch that factors out the core FunctionQuery stuff into a queries module. There are a lot of issues here, but it does compile. The following issues need to be addressed: - MutableValue and MutableValueFloat are used in the FunctionQuery code, so I've pulled them into the module too. Should all the other Mutable*Value classes come too? Should they go into some other module? - What to return in ValueSource#getSortField, which currently returns a SortField that implements SolrSortField. This is currently commented out so we can determine what's best to do. Having this commented out breaks the Solr tests. - Many of the ValueSources and DocValues in Solr could be moved to the module, but not all of them. Some have dependencies on Solr dependencies / Solr core code. - The module isn't fully integrated into the build.xmls and dev-tools. - Lucene core's FunctionQuery stuff needs to be removed. I'll add a script that needs to be run before applying this patch shortly. Consolidate Solr & Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: core/search Affects Versions: 4.0 Reporter: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2883.patch Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3115) Escaping stars and question marks do not work.
Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string "I have search by st*rs" is indexed. Search by the query *\** doesn't return a matching result. This query returns all non-empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \*\\*\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query *\** doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*\\*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \*\\*\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \\**\\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*\\*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \*\\\*\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \*\\*\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*\\\*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \\**\\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \*\\*\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \\**\\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \*#92\*\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \*\\\*\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*#92\*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \*\\ \*\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \*#92\*\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*\\ \*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035334#comment-13035334 ] Erick Erickson commented on LUCENE-3115: Please raise this on the user's list first. Offhand I suspect Lucene is working as expected, but you haven't provided nearly enough information to decide whether this is a bug or not. A self-contained junit test would be ideal here. Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*\\\ \*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
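For context on the report above: the query parser treats * and ? as wildcard operators, so searching for them literally requires backslash-escaping each special character (Lucene ships QueryParser.escape(String) for exactly this). A self-contained illustration of that escaping rule - a hypothetical stand-in, not the Lucene source - looks like:

```java
// Illustration of backslash-escaping query syntax characters, in the
// spirit of QueryParser.escape(); names and the character list here are
// a simplified stand-in, not the actual Lucene implementation.
class QueryEscapeSketch {
    // Characters the query syntax assigns special meaning to.
    static final String SPECIALS = "+-!():^[]\"{}~*?\\&|";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');  // prefix with backslash
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With escaping applied, a term like st*rs is searched as the literal text st\*rs instead of being expanded as a wildcard - which is the self-contained behavior a junit test for this report would exercise.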
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query \*\\\ \*\* doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \*\\ \*\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query \*\\\ \*\* doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
Ugh my bad, sorry. I'll fix! Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query {code}index.query(*\**);{code} doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query \*\\\ \*\* doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query {code}index.query(*\**);{code} doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Description: The string I have search by st*rs is indexed. Search by query {code}index.query(key, *\**);{code} doesn't return matching result. This query returns all not empty values. (was: The string I have search by st*rs is indexed. Search by query {code}index.query(*\**);{code} doesn't return matching result. This query returns all not empty values.) Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query {code}index.query(key, *\**);{code} doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
Hmm, I know what's wrong here (SentinelIntSet is package private), and I'll fix... but when I run ant javadocs I don't see these warnings (but the build clearly does)... Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035344#comment-13035344 ] Shai Erera commented on LUCENE-3102: bq. The committed CHANGES has a typo (reply should be replay). Thanks, will include it in the next commit. bq. I'd rather have it cut over at some point This can only be done if no out-of-order collection has happened so far, because otherwise cutting over to OBS would put the cached doc IDs and scores out of sync. bq. we make a bit set impl that does this all under the hood (uses int[] when there are few docs, and upgrades to OBS once there are enough to justify it...) That's a good idea. I think we should leave the OBS stuff for another issue - see first how this performs, and optimize only if needed. I'll take a look at TestGrouping. Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
Mike, I can repro on my Win7 box using Oracle JDK 1.5.0_22. Maybe you're using 1.6.0_X? - Steve -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 18, 2011 8:33 AM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] Hmm, I know what's wrong here (SentinelIntSet is package private), and I'll fix... but when I run ant javadocs I don't see these warnings (but the build clearly does)... Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
Ahh yes I'm using 1.6, OK. I'll switch to 1.5... Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:40 AM, Steven A Rowe sar...@syr.edu wrote: Mike, I can repro on my Win7 box using Oracle JDK 1.5.0_22. Maybe you're using 1.6.0_X? - Steve -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 18, 2011 8:33 AM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] Hmm, I know what's wrong here (SentinelIntSet is package private), and I'll fix... but when I run ant javadocs I don't see these warnings (but the build clearly does)... Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Kornev updated LUCENE-3115: Affects Version/s: (was: 3.1) 3.0 Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0 Reporter: Vladimir Kornev The string "search by st*rs" is indexed. Searching with the query {code}index.query(key, "\*");{code} doesn't return the matching result; this query returns all non-empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
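For context on the escaping the reporter is attempting: prefixing special query characters with a backslash is how the classic Lucene query syntax makes them literal, and `QueryParser.escape(String)` does this for callers. The helper below is a hypothetical plain-Java stand-in (covering only `*`, `?`, and `\`), not the actual Lucene method:

```java
// Sketch of wildcard escaping: prefix special characters with a backslash so
// they match literally. escapeWildcards is a made-up stand-in for the idea
// behind Lucene's QueryParser.escape(String); it only handles *, ? and \.
public class EscapeDemo {
    static String escapeWildcards(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '*' || c == '?' || c == '\\') sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        assert escapeWildcards("st*rs").equals("st\\*rs");
        assert escapeWildcards("a?b").equals("a\\?b");
        assert escapeWildcards("plain").equals("plain");
    }
}
```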
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-trunk-only.patch New patch with the renaming and the removal of Map in favour of a simple Set. Again ready to commit. MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List<SI> instead. Also, SegmentInfos subclasses Vector<SI>; this should be removed and the collections hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
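The "hide the collection, expose unmodifiable views" refactor described in this issue can be sketched as below. `SegmentInfo` here is a placeholder class and the method names merely mirror the proposed `asList()`/`asSet()`; this is not the LUCENE-3084 patch itself:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: instead of subclassing Vector<SI>, keep the collection private and
// hand out unmodifiable views, as the issue description suggests.
public class SegmentInfosSketch {
    static class SegmentInfo {
        final String name;
        SegmentInfo(String name) { this.name = name; }
    }

    private final List<SegmentInfo> segments = new ArrayList<>();

    void add(SegmentInfo si) { segments.add(si); }

    /** Unmodifiable live view, as in the proposed asList(). */
    List<SegmentInfo> asList() { return Collections.unmodifiableList(segments); }

    /** Unmodifiable snapshot set, as in the proposed asSet() (simplified: a copy, not a live view). */
    Set<SegmentInfo> asSet() { return Collections.unmodifiableSet(new LinkedHashSet<>(segments)); }

    public static void main(String[] args) {
        SegmentInfosSketch infos = new SegmentInfosSketch();
        infos.add(new SegmentInfo("_0"));
        List<SegmentInfo> view = infos.asList();
        assert view.size() == 1;
        try {
            view.add(new SegmentInfo("_1")); // callers cannot mutate through the view
            assert false : "view should be unmodifiable";
        } catch (UnsupportedOperationException expected) { }
    }
}
```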
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035349#comment-13035349 ] Michael McCandless commented on LUCENE-3084: Patch looks great Uwe! +1 to commit. Thanks! MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List<SI> instead. Also, SegmentInfos subclasses Vector<SI>; this should be removed and the collections hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-3084: - Assignee: Uwe Schindler (was: Michael McCandless) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos, Remove Vector<SI> subclassing from SegmentInfos & more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List<SI> instead. Also, SegmentInfos subclasses Vector<SI>; this should be removed and the collections hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
Hi Mike, Same problem in trunk! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 18, 2011 2:27 PM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS- MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] Ugh my bad, sorry. I'll fix! Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3113) fix analyzer bugs found by MockTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3113. - Resolution: Fixed Committed revision 1104519, 1124242 (branch_3x) fix analyzer bugs found by MockTokenizer Key: LUCENE-3113 URL: https://issues.apache.org/jira/browse/LUCENE-3113 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Reporter: Robert Muir Fix For: 3.2, 4.0 Attachments: LUCENE-3113.patch, LUCENE-3113.patch In LUCENE-3064, we beefed up MockTokenizer with assertions, and I've switched over the analysis tests to use MockTokenizer for better coverage. However, this found a few bugs (one of which is LUCENE-3106): * incrementToken() after it returns false in CommonGramsQueryFilter, HyphenatedWordsFilter, ShingleFilter, SynonymFilter * missing end() implementation for PrefixAwareTokenFilter * double reset() in QueryAutoStopWordAnalyzer and ReusableAnalyzerBase * missing correctOffset()s in MockTokenizer itself. I think it would be nice to just fix all the bugs on one issue... I've fixed everything except Shingle and Synonym -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035357#comment-13035357 ] Robert Muir commented on LUCENE-3092: - {quote} Tests? It's nice to have a test use a RAMDirectory for speed, but still follow the same code path as FSDirectory for debugging + orthogonality. {quote} FWIW currently the lucene tests use a RAMDirectory 90% of the time (and something else the other 10%). We could adjust this... at the time I set it, it seemed to not slow the tests down that much but still give us a little more coverage. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is, when reopening quickly but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO. So NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it write through to the real (delegate) directory. This lets you spend some RAM to reduce IO. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
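The write-through idea in the issue description can be reduced to a toy sketch. Everything here is hypothetical and heavily simplified - the two `Map`s stand in for the RAMDir and the real (delegate) directory, and `createFile`/`sync` are made-up names, not Lucene's Directory API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the NRTCachingDirectory idea: keep small newly created
// "files" in RAM and only write through to the real store when they are
// large (or when flushed, e.g. after being merged into a big segment).
public class NRTCachingSketch {
    final Map<String, byte[]> ramDir = new HashMap<>();   // RAM cache
    final Map<String, byte[]> realDir = new HashMap<>();  // delegate store
    final int maxCachedBytes;

    NRTCachingSketch(int maxCachedBytes) { this.maxCachedBytes = maxCachedBytes; }

    void createFile(String name, byte[] content) {
        if (content.length <= maxCachedBytes) {
            ramDir.put(name, content);   // small file: keep in RAM, no IO
        } else {
            realDir.put(name, content);  // large file: write through immediately
        }
    }

    /** Flush a cached file to the real store (e.g. it survived a merge). */
    void sync(String name) {
        byte[] content = ramDir.remove(name);
        if (content != null) realDir.put(name, content);
    }

    public static void main(String[] args) {
        NRTCachingSketch dir = new NRTCachingSketch(16);
        dir.createFile("_0.frq", new byte[8]);   // small -> RAM
        dir.createFile("_1.frq", new byte[64]);  // large -> real dir
        assert dir.ramDir.containsKey("_0.frq");
        assert dir.realDir.containsKey("_1.frq");
        dir.sync("_0.frq");
        assert dir.realDir.containsKey("_0.frq") && dir.ramDir.isEmpty();
    }
}
```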
[jira] [Commented] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035359#comment-13035359 ] Robert Muir commented on LUCENE-3014: - Any objections? Uwe, do you still want to take this or should I? I want to get LUCENE-3012 wrapped up. comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035361#comment-13035361 ] Uwe Schindler commented on LUCENE-3014: --- It's fine, commit it! We may look for usages of the version field in SegmentInfos, and use this comparator there (especially e.g. my new one for upgrades or the standard IndexTooOldException stuff). But I think that should be a new issue. comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3115) Escaping stars and question marks do not work.
[ https://issues.apache.org/jira/browse/LUCENE-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rowe resolved LUCENE-3115. - Resolution: Invalid Hi Vladimir, I'm resolving this issue as Invalid. When you have a problem with Lucene, please post to the Lucene Java User mailing list first - see [http://lucene.apache.org/java/docs/mailinglists.html]. Also, the next time you use JIRA, please make use of the Preview button below the text entry box, rather than re-editing lots of times. Every time you submit an edit, a message is sent to the Lucene/Solr developer mailing list. I have 11 different versions of your issue description clogging up my mailbox... Steve Escaping stars and question marks do not work. -- Key: LUCENE-3115 URL: https://issues.apache.org/jira/browse/LUCENE-3115 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.0 Reporter: Vladimir Kornev The string I have search by st*rs is indexed. Search by query {code}index.query(key, *\**);{code} doesn't return matching result. This query returns all not empty values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035370#comment-13035370 ] Shai Erera commented on LUCENE-3014: Hey guys, does this affect LUCENE-2921 (or vice versa)? Basically, I thought that we should stop writing a version header in files and just use the release version as a header. Robert, I don't think we are allowed to change index format versions on bug-fix releases, and even if we do, that same bug fix would go into the 3.x release so it would still know how to read 3.1.1? Perhaps that was your point and I missed it ... comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035376#comment-13035376 ] Uwe Schindler commented on LUCENE-3014: --- Shai: we should not change the index format, but it still feels bad not to have a correct version comparison API. With this patch you can even compare 3.0 against just 3 or 3.0.0.0.0 and they will be equal. And once we are at version 10, a simple string compare is a bad idea :-) That's why Robert and I are against pure string comparisons. comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
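A dotted-version comparator with the semantics described in this thread - "3.0", "3", and "3.0.0.0.0" compare equal, and "3.10" sorts after "3.2" - can be sketched in plain Java. This is a hypothetical illustration of the idea, not the actual LUCENE-3014 patch:

```java
// Sketch: compare dotted version strings numerically, part by part, treating
// missing parts as 0. Contrast with plain String.compareTo, which gets
// "3.2" vs "3.10" wrong (the trap the comments above describe).
public class VersionComparatorSketch {
    static int compareVersions(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int va = i < pa.length ? Integer.parseInt(pa[i]) : 0; // missing part == 0
            int vb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (va != vb) return Integer.compare(va, vb);
        }
        return 0;
    }

    public static void main(String[] args) {
        assert compareVersions("3.0", "3.0.0.0.0") == 0;
        assert compareVersions("3.0", "3") == 0;
        assert compareVersions("3.2", "3.10") < 0;  // numeric, not lexicographic
        assert "3.2".compareTo("3.10") > 0;         // the string-compare trap
    }
}
```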
[jira] [Commented] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035378#comment-13035378 ] Robert Muir commented on LUCENE-3014: - {quote} Hey guys, does this affect LUCENE-2921 (or vice versa)? {quote} Hi Shai, I think this helps LUCENE-2921. This is a comparator to use when you want to examine the release version that created the segment (the one you added in LUCENE-2720). It's guaranteed to compare correctly if, say, we released 3.10, and also if the number of trailing zeros etc. is different. In other words, if you implement LUCENE-2921, I think the idea is typically you will want to use this comparator when examining the version string. {quote} Robert, I don't think we are allowed to change index format versions on bug-fix releases and even if we do, that same bug fix would go into the 3.x release so it would still know how to read 3.1.1? Perhaps that was your point and I missed it ... {quote} On LUCENE-3012, I've proposed a fix-for version for Lucene 3.2. But we can discuss on that issue. comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035379#comment-13035379 ] Shai Erera commented on LUCENE-3014: Yes, that makes sense. I can use that API in LUCENE-2921. Thanks a lot for saving me some effort :). comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3014. - Resolution: Fixed Fix Version/s: 4.0 Committed revision 1124266, 1124269 (branch_3x) comparator API for segment versions --- Key: LUCENE-3014 URL: https://issues.apache.org/jira/browse/LUCENE-3014 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Uwe Schindler Priority: Critical Fix For: 3.2, 4.0 Attachments: LUCENE-3014.patch See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch, released 3.2, but later released 3.1.1 (will 3.1.1 be what's written and returned by this function?) Then suddenly we broke the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is >= 3.2 safely. If we don't do this, and we rely upon this version information internally in lucene, I think we are going to break something. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3108) Land DocValues on trunk
[ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035386#comment-13035386 ] Simon Willnauer commented on LUCENE-3108: - FYI, I ran indexing benchmarks, trunk vs. branch, and they are super close together. It's like a 3 sec difference, while the branch was faster, so it's in the noise. I also indexed one docvalues field (floats), which was also about the same - 2 sec. slower including merges etc. So we are on the safe side that this feature does not influence indexing performance. I didn't expect anything else really, since the only difference is a single condition in DocFieldProcessor. Land DocValues on trunk --- Key: LUCENE-3108 URL: https://issues.apache.org/jira/browse/LUCENE-3108 Project: Lucene - Java Issue Type: Task Components: core/index, core/search, core/store Affects Versions: CSF branch, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 It's time to move another feature from branch to trunk. I want to start this process now while a couple of issues still remain on the branch. Currently I am down to a single nocommit (javadocs on DocValues.java) and a couple of testing TODOs (explicit multithreaded tests and unoptimized with deletions) but I think those are not worth separate issues so we can resolve them as we go. The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process here IMO, we can fix them once we are on trunk. Here is a quick feature overview of what has been implemented: * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed / variable size each in sorted, straight and deref variations) * Integration into Flex-API, Codec provides a PerDocConsumer-DocValuesConsumer (write) / PerDocValues-DocValues (read) * By-default enabled in all codecs except PreFlex * Follows other flex-API patterns like non-segment readers throw UOE forcing MultiPerDocValues if on DirReader etc.
* Integration into IndexWriter, FieldInfos etc. * Random-testing enabled via RandomIW - injecting random DocValues into documents * Basic checks in CheckIndex (which runs after each test) * FieldComparator for int and float variants (Sorting, currently directly integrated into SortField, this might go into a separate DocValuesSortField eventually) * Extended TestSort for DocValues * RAM-Resident random access API plus on-disk DocValuesEnum (currently only sequential access) - Source.java / DocValuesEnum.java * Extensible Cache implementation for RAM-Resident DocValues (by-default loaded into RAM only once and freed once IR is closed) - SourceCache.java PS: Currently the RAM resident API is named Source (Source.java) which seems too generic. I think we should rename it into RamDocValues or something like that, suggestion welcome! Any comments, questions (rants :)) are very much appreciated. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3012) if you use setNorm, lucene writes a headerless separate norms file
[ https://issues.apache.org/jira/browse/LUCENE-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3012: Attachment: LUCENE-3012_3x.patch Updated patch (against branch_3x for simplicity) that uses the LUCENE-3014 comparator API. Because separate norms files are independent of the version that created the segment (e.g. one can call setNorm with 3.6 for a 3.1 segment), I think it's really important that we fix this in 3.2 to write the header. If there are no objections, I'd like to commit, and then regenerate the tentative 3.2 indexes for trunk's TestBackwardsCompatibility. There's no need to change the fileformats.html documentation, as what we are doing now is actually inconsistent with it, thus the bug. if you use setNorm, lucene writes a headerless separate norms file -- Key: LUCENE-3012 URL: https://issues.apache.org/jira/browse/LUCENE-3012 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.2 Attachments: LUCENE-3012.patch, LUCENE-3012_3x.patch In this case SR.reWrite just writes the bytes with no header... we should write it always. We can detect in these cases (segment written = 3.1) with a sketchy length == maxDoc check. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
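The "always write a header" fix can be illustrated with a toy round-trip: a magic constant plus a version int ahead of the payload lets a reader detect the format instead of relying on a sketchy `length == maxDoc` check. `NORMS_MAGIC` and the layout below are made up for illustration; this is not Lucene's actual norms file format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch: write a header (magic + version) before the norms bytes, so the
// file is self-describing. All constants and the layout are hypothetical.
public class NormsHeaderSketch {
    static final int NORMS_MAGIC = 0x4E524D53; // "NRMS", made up
    static final int VERSION = 1;

    static byte[] writeNorms(byte[] norms) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(NORMS_MAGIC); // header first, always
            out.writeInt(VERSION);
            out.write(norms);          // then the per-doc norm bytes
            out.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] readNorms(byte[] file, int maxDoc) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
            if (in.readInt() != NORMS_MAGIC) throw new IOException("missing norms header");
            in.readInt(); // version; a real reader would branch on it
            byte[] norms = new byte[maxDoc];
            in.readFully(norms);
            return norms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] norms = {1, 2, 3};
        byte[] file = writeNorms(norms);
        assert file.length == norms.length + 8; // 8 header bytes, so length != maxDoc
        assert java.util.Arrays.equals(readNorms(file, 3), norms);
    }
}
```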
[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #124: POMs out of sync
Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-trunk/124/ No tests ran. Build Log (for compile errors): [...truncated 7394 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3103. - Resolution: Fixed Committed revision 1124288. create a simple test that indexes and searches byte[] terms --- Key: LUCENE-3103 URL: https://issues.apache.org/jira/browse/LUCENE-3103 Project: Lucene - Java Issue Type: Test Components: general/test Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-3103.patch Currently, the only good test that does this is Test2BTerms (disabled by default) I think we should test this capability, and also have a simpler example for how to do this. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035422#comment-13035422 ] Doron Cohen commented on LUCENE-3068: - fixed in trunk in r1124293. The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: --- Attachment: LUCENE-3102-nowrap.patch Patch adds random to TestGrouping and fixes the CHANGES typo. Mike, TestGrouping fails w/ this seed: -Dtests.seed=7295196064099074191:-1632255311098421589 (it picks a no wrapping collector). I guess I didn't insert the random thing properly. It's the only place where the test creates a CachingCollector though. I noticed that it fails on the 'doCache' but '!doAllGroups' case. Can you please take a look? I'm not familiar with this test, and cannot debug it anymore today. Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity # This check if (base + nextLength > maxDocsToCache) (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean cacheScores?
I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core? * How about using OpenBitSet instead of int[] for doc IDs? ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
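The int[] vs. bit-set RAM trade-off raised above is easy to quantify: a dense bit set always costs one bit per document in the index, while the int[] cache costs 4 bytes per collected hit, so the bit set only saves RAM once more than roughly maxDoc/32 documents match (and, as noted, it forces in-order collection). A small back-of-envelope sketch; the class and method names are illustrative, not from the patch:

```java
// Hypothetical helper quantifying the bit-set vs. int[] memory trade-off
// discussed for CachingCollector; not part of the actual patch.
public class CacheMemory {

    /** Bytes for a dense bit set over the whole index. */
    static long bitSetBytes(int maxDoc) {
        return (maxDoc + 7) / 8;
    }

    /** Bytes for caching each hit's docID as an int. */
    static long intArrayBytes(int numHits) {
        return 4L * numHits;
    }

    /** True when the bit set is the cheaper representation. */
    static boolean bitSetSmaller(int maxDoc, int numHits) {
        return bitSetBytes(maxDoc) < intArrayBytes(numHits);
    }
}
```

For a 1M-doc index the bit set is a fixed ~125 KB, so it wins only when more than about 31K documents are cached.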
Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
I think it's resolved now? Sorry! But it's great we now catch jdoc errors before releasing... should we fix our while(1) test to also catch this (not just nightly / maven)? Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 9:01 AM, Uwe Schindler u...@thetaphi.de wrote: Hi Mike, Same problem in trunk! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 18, 2011 2:27 PM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS- MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] Ugh my bad, sorry. I'll fix! Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
On Wed, May 18, 2011 at 11:31 AM, Michael McCandless luc...@mikemccandless.com wrote: I think it's resolved now? Sorry! But it's great we now catch jdoc errors before releasing... should we fix our while(1) test to also catch this (not just nightly / maven)? yeah couldn't we just fire off the 'javadocs-all' task after compiling? this takes 10 seconds on my computer and it could catch these things quicker. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-3068. - Resolution: Fixed fix merged to 3x in r1124302. The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
All fine now! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 18, 2011 5:31 PM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS- MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] I think it's resolved now? Sorry! But it's great we now catch jdoc errors before releasing... should we fix our while(1) test to also catch this (not just nightly / maven)? Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 9:01 AM, Uwe Schindler u...@thetaphi.de wrote: Hi Mike, Same problem in trunk! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, May 18, 2011 2:27 PM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS- MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] Ugh my bad, sorry. I'll fix! Mike http://blog.mikemccandless.com On Wed, May 18, 2011 at 8:22 AM, Steven A Rowe sar...@syr.edu wrote: This build failed because of Javadoc warnings: [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_16-p9 [javadoc] Building tree for all the packages and classes... 
[javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:45: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:62: warning - Tag @link: reference not found: SentinelIntSet [javadoc] .../org/apache/lucene/search/grouping/AllGroupsCollector.java:76: warning - Tag @link: reference not found: SentinelIntSet -Original Message- From: Apache Jenkins Server [mailto:hud...@hudson.apache.org] Sent: Wednesday, May 18, 2011 7:36 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync Build: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-3.x/126/ No tests ran. Build Log (for compile errors): [...truncated 7931 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
Hi, Definitely not. Javadocs-all takes 2 minutes here, so please don’t bundle it with compile, I will stop working on Lucene then, that does not help during development (I don't use Eclipse to develop...). We can trigger this for Hudson half-hourly. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, May 18, 2011 5:35 PM To: dev@lucene.apache.org Subject: Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS- MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync] On Wed, May 18, 2011 at 11:31 AM, Michael McCandless luc...@mikemccandless.com wrote: I think it's resolved now? Sorry! But it's great we now catch jdoc errors before releasing... should we fix our while(1) test to also catch this (not just nightly / maven)? yeah couldn't we just fire off the 'javadocs-all' task after compiling? this takes 10 seconds on my computer and it could catch these things quicker. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3102: --- Attachment: LUCENE-3102.patch Patch. I think I fixed TestGrouping to exercise the no wrapped collector and replay twice case for CachingCollector. Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity # This check if (base + nextLength > maxDocsToCache) (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core? * How about using OpenBitSet instead of int[] for doc IDs? 
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3012) if you use setNorm, lucene writes a headerless separate norms file
[ https://issues.apache.org/jira/browse/LUCENE-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035459#comment-13035459 ] Michael McCandless commented on LUCENE-3012: I agree this is important to fix! Patch looks good. if you use setNorm, lucene writes a headerless separate norms file -- Key: LUCENE-3012 URL: https://issues.apache.org/jira/browse/LUCENE-3012 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.2 Attachments: LUCENE-3012.patch, LUCENE-3012_3x.patch In this case SR.reWrite just writes the bytes with no header... we should write it always. we can detect in these cases (segment written = 3.1) with a sketchy length == maxDoc check. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos, Remove Vector&lt;SI&gt; subclassing from SegmentInfos &amp; more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035474#comment-13035474 ] Uwe Schindler commented on LUCENE-3084: --- Committed trunk revision: 1124307, 1124316 (copy-paste error) Now backporting... MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos, Remove Vector&lt;SI&gt; subclassing from SegmentInfos &amp; more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to List&lt;SI&gt; instead. Also SegmentInfos subclasses Vector&lt;SI&gt;, this should be removed and the collections be hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Javadoc warnings failing the branch_3x build [was: RE: [JENKINS-MAVEN] Lucene-Solr-Maven-3.x #126: POMs out of sync]
On Wed, May 18, 2011 at 11:44 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, Definitely not. Javadocs-all takes 2 minutes here, so please don’t bundle it with compile, I will stop working on Lucene then, that does not help during development (I don't use Eclipse to develop...). We can trigger this for Hudson half-hourly. Hi Uwe... this is what I was asking for, to run it after compile in the half-hourly. I'm sorry your computer has such a slow IO system! - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated SOLR-2524: Attachment: SOLR-2524.patch Attached the initial patch. * Patch is based on what is in the trunk. ** Integrated the grouping contrib collectors ** Same response formats. ** All parameters except group.query and group.func are supported. ** Computed DocSet (for facetComponent and StatsComponent) is based on the ungrouped result. * Also integrated the caching collector. For this I added the group.cache=true|false and group.cache.maxSize=[number] parameters. Things still todo: * Integrate AllGroupsCollector for total count based on groups. * Create a Solr Test for grouping * Cleanup / Refactor / java doc Adding grouping to Solr 3x -- Key: SOLR-2524 URL: https://issues.apache.org/jira/browse/SOLR-2524 Project: Solr Issue Type: New Feature Affects Versions: 3.2 Reporter: Martijn van Groningen Attachments: SOLR-2524.patch Grouping was recently added to Lucene 3x. See LUCENE-1421 for more information. I think it would be nice if we expose this functionality also to the Solr users that are bound to a 3.x version. The grouping feature added to Lucene is currently a subset of the functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping by function / query. The work involved getting the grouping contrib to work on Solr 3x is acceptable. I have it more or less running here. It supports the response format and request parameters (except group.query and group.func) described in the FieldCollapse page on the Solr wiki. I think it would be great if this is included in the Solr 3.2 release. Many people are using grouping as a patch now and this would help them a lot. Any thoughts? -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
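For illustration, the proposed group.cache parameters would ride along with the existing grouping parameters in a request like the one this sketch builds. The /select path and the field name are placeholders; group=true and group.field are existing grouping parameters, while group.cache and group.cache.maxSize are the ones proposed in the patch above:

```java
// Hypothetical request-string builder showing where the proposed
// group.cache parameters would appear; not code from the patch.
public class GroupingParams {

    static String groupedQuery(String q, String groupField,
                               boolean cache, int cacheMaxSize) {
        return "/select?q=" + q
             + "&group=true&group.field=" + groupField
             + "&group.cache=" + cache
             + "&group.cache.maxSize=" + cacheMaxSize;
    }
}
```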
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos, Remove Vector&lt;SI&gt; subclassing from SegmentInfos &amp; more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-3.x-only.patch Merged patch. Will commit now. MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos, Remove Vector&lt;SI&gt; subclassing from SegmentInfos &amp; more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-3.x-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to List&lt;SI&gt; instead. Also SegmentInfos subclasses Vector&lt;SI&gt;, this should be removed and the collections be hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos, Remove Vector&lt;SI&gt; subclassing from SegmentInfos &amp; more refactoring
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-3084. --- Resolution: Fixed Committed 3.x revision: 1124339 MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos, Remove Vector&lt;SI&gt; subclassing from SegmentInfos &amp; more refactoring -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-3.x-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to List&lt;SI&gt; instead. Also SegmentInfos subclasses Vector&lt;SI&gt;, this should be removed and the collections be hidden inside the class. We can add unmodifiable views on it (asList(), asSet()). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2883) Consolidate Solr &amp; Lucene FunctionQuery into modules
[ https://issues.apache.org/jira/browse/LUCENE-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035508#comment-13035508 ] Michael McCandless commented on LUCENE-2883: Thanks Chris! The patch applies cleanly for me (after running the svn commands) and everything compiles. I think the patch is a great start, ie, we will need the low level infra used by FQs in the module. bq. MutableValue &amp; MutableFloatValue are used in the FunctionQuery code so I've pulled them into the module too. Should all the other Mutable*Value classes come too? Should they go into some other module? I think we should move Mutable* over? Grouping module will need all of these, I think? (Ie if we want to allow users to group by arbitrary typed field). bq. What to return in ValueSource#getSortField which currently returns a SortField which implements SolrSortField. This is currently commented out so we can determine what best to do. Having this commented out breaks the Solr tests. Hmm good question. This looks to be related to sorting by FQ (SOLR-1297) because some FQs need to be weighted. Not sure what to do here yet... which FQs in particular require this? bq. Many of the ValueSources and DocValues in Solr could be moved to the module, but not all of them. Some have dependencies on Solr dependencies / Solr core code. I think apply 90/10 rule here? Start with the easy-to-move queries? We don't need initial go to be perfect... progress not perfection. bq. Lucene core's FunctionQuery stuff needs to be removed. Do you have a sense of whether Solr's FQs are a superset of Lucene's? Ie, is there anything Lucene's FQs can do that Solr's can't? Probably, as a separate issue, we should also move contrib/queries -&gt; modules/queries. And I think the cool nested queries (LUCENE-2454) would also go into this module... 
Consolidate Solr &amp; Lucene FunctionQuery into modules - Key: LUCENE-2883 URL: https://issues.apache.org/jira/browse/LUCENE-2883 Project: Lucene - Java Issue Type: Task Components: core/search Affects Versions: 4.0 Reporter: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: LUCENE-2883.patch Spin-off from the [dev list | http://www.mail-archive.com/dev@lucene.apache.org/msg13261.html] -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3109) Rename FieldsConsumer to InvertedFieldsConsumer
[ https://issues.apache.org/jira/browse/LUCENE-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035510#comment-13035510 ] Michael McCandless commented on LUCENE-3109: +1 Rename FieldsConsumer to InvertedFieldsConsumer --- Key: LUCENE-3109 URL: https://issues.apache.org/jira/browse/LUCENE-3109 Project: Lucene - Java Issue Type: Task Components: core/codecs Affects Versions: 4.0 Reporter: Simon Willnauer Priority: Minor Fix For: 4.0 The name FieldsConsumer is misleading: it really is an InvertedFieldsConsumer, and since we are extending codecs to consume non-inverted Fields we should be clear here. Same applies to Fields.java as well as FieldsProducer. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3116) pendingCommit in IndexWriter is not thoroughly tested
pendingCommit in IndexWriter is not thoroughly tested - Key: LUCENE-3116 URL: https://issues.apache.org/jira/browse/LUCENE-3116 Project: Lucene - Java Issue Type: Test Components: core/index Affects Versions: 3.2, 4.0 Reporter: Uwe Schindler When working on LUCENE-3084, I had a copy-paste error in my patch (see revision 1124307 and corrected in 1124316), I replaced pendingCommit by segmentInfos in IndexWriter, corrected by the following patch: {noformat} --- lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java (original) +++ lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/IndexWriter.java Wed May 18 16:16:29 2011 @@ -2552,7 +2552,7 @@ public class IndexWriter implements Clos lastCommitChangeCount = pendingCommitChangeCount; segmentInfos.updateGeneration(pendingCommit); segmentInfos.setUserData(pendingCommit.getUserData()); -rollbackSegments = segmentInfos.createBackupSegmentInfos(true); +rollbackSegments = pendingCommit.createBackupSegmentInfos(true); deleter.checkpoint(pendingCommit, true); } finally { // Matches the incRef done in startCommit: {noformat} This did not cause any test failure. On IRC, Mike said: {quote} [19:21] mikemccand: ThetaPh1: hmm [19:21] mikemccand: well [19:22] mikemccand: pendingCommit and sis only differ while commit() is running [19:22] mikemccand: ie if a thread starts commit [19:22] mikemccand: but fsync is taking a long time [19:22] mikemccand: and another thread makes a change to sis [19:22] ThetaPh1: ok so hard to find that bug [19:22] mikemccand: we need our mock dir wrapper to sometimes take a long time syncing {quote} Maybe we need such a test, I feel bad when such stupid changes don't make any test fail. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
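The MockDirectoryWrapper change Mike suggests in the IRC excerpt — making the sync step sometimes take a long time — can be sketched as a delegating wrapper that stalls before syncing, widening the window during commit() in which pendingCommit and the live SegmentInfos differ. The Syncable interface below is invented for this sketch; the real Directory/MockDirectoryWrapper API differs:

```java
// Hypothetical sketch of a slow-sync wrapper; interface and names are
// invented here, not taken from Lucene's test framework.
public class SlowSyncSketch {

    interface Syncable {
        void sync();
    }

    /** Delegate that sleeps before syncing, simulating a slow fsync. */
    static Syncable slowSync(final Syncable delegate, final long delayMillis) {
        return () -> {
            try {
                Thread.sleep(delayMillis);  // hold the commit open
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            delegate.sync();
        };
    }
}
```

A randomized test could wrap the commit directory this way while another thread mutates the index, so a mix-up like the segmentInfos/pendingCommit one above would actually trip an assertion.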
[jira] [Updated] (LUCENE-3116) pendingCommit in IndexWriter is not thoroughly tested
[ https://issues.apache.org/jira/browse/LUCENE-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3116: --- Fix Version/s: 4.0, 3.2
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035519#comment-13035519 ] Steven Rowe commented on LUCENE-152: bq. Code is fine to afaik: http://www.apache.org/legal/3party.html My interpretation of this is that we can directly include the KStem source code in Lucene/Solr's source tree, and then modify it at will, since its license (BSD style) is in Category A (authorized licenses). Thoughts?

[PATCH] KStem for Lucene Key: LUCENE-152 URL: https://issues.apache.org/jira/browse/LUCENE-152 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Priority: Minor

September 10th 2003 contribution from Sergio Guzman-Lara guz...@cs.umass.edu Original email: Hi all, I have ported the kstem stemmer to Java and incorporated it to Lucene. You can get the source code (Kstem.jar) from the following website: http://ciir.cs.umass.edu/downloads/ Just click on KStem Java Implementation (you will need to register your e-mail, for free of course, with the CIIR -- Center for Intelligent Information Retrieval, UMass -- and get an access code).

Content of Kstem.jar:
java/org/apache/lucene/analysis/KStemData1.java
java/org/apache/lucene/analysis/KStemData2.java
java/org/apache/lucene/analysis/KStemData3.java
java/org/apache/lucene/analysis/KStemData4.java
java/org/apache/lucene/analysis/KStemData5.java
java/org/apache/lucene/analysis/KStemData6.java
java/org/apache/lucene/analysis/KStemData7.java
java/org/apache/lucene/analysis/KStemData8.java
java/org/apache/lucene/analysis/KStemFilter.java
java/org/apache/lucene/analysis/KStemmer.java

KStemData1.java, ..., KStemData8.java contain several lists of words used by Kstem. KStemmer.java implements the Kstem algorithm. KStemFilter.java extends TokenFilter, applying Kstem. To compile, unjar the file Kstem.jar to Lucene's src directory and compile it there. What is Kstem? A stemmer designed by Bob Krovetz (for more information see http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). Copyright issues: This is open source. The actual license agreement is included at the top of every source file. Any comments/questions/suggestions are welcome, Sergio Guzman-Lara Senior Research Fellow CIIR UMass
[jira] [Commented] (LUCENE-3116) pendingCommit in IndexWriter is not thoroughly tested
[ https://issues.apache.org/jira/browse/LUCENE-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035520#comment-13035520 ] Michael McCandless commented on LUCENE-3116: It's great you caught this on backport, Uwe! And, yes, spooky no tests failed... It'll be challenging to have a test catch this. Fixing MockDirWrapper to sometimes take an unusually long time to do the fsync is a great start. What this change would have caused is that rollback() would roll back to a wrong copy of the sis, ie not a commit point but rather a commit point plus some additional flushes.
[jira] [Assigned] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned SOLR-2524: Assignee: Michael McCandless

Adding grouping to Solr 3x -- Key: SOLR-2524 URL: https://issues.apache.org/jira/browse/SOLR-2524 Project: Solr Issue Type: New Feature Affects Versions: 3.2 Reporter: Martijn van Groningen Assignee: Michael McCandless Attachments: SOLR-2524.patch

Grouping was recently added to Lucene 3x. See LUCENE-1421 for more information. I think it would be nice if we expose this functionality also to the Solr users that are bound to a 3.x version. The grouping feature added to Lucene is currently a subset of the functionality that Solr 4.0-trunk offers. Mainly it doesn't support grouping by function / query. The work involved getting the grouping contrib to work on Solr 3x is acceptable. I have it more or less running here. It supports the response format and request parameters (except: group.query and group.func) described in the FieldCollapse page on the Solr wiki. I think it would be great if this is included in the Solr 3.2 release. Many people are using grouping as a patch now and this would help them a lot. Any thoughts?
[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035526#comment-13035526 ] Michael McCandless commented on SOLR-2524: -- Awesome, that was fast! Maybe rename group.cache.maxSize -> group.cache.maxSizeMB? (So it's clear what the units are.) Should we default group.cache to true? (It's false now?) When you get the top groups from collector2, should you pass in offset instead of 0? (Hmm -- maybe groupOffset? It seems like you're using offset for both the first and second phase collectors? Maybe I'm confused...) bq. Computed DocSet (for facetComponent and StatsComponent) is based on the ungrouped result. This matches how Solr does grouping on trunk, right?
[jira] [Resolved] (LUCENE-3012) if you use setNorm, lucene writes a headerless separate norms file
[ https://issues.apache.org/jira/browse/LUCENE-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3012. - Resolution: Fixed Fix Version/s: 4.0 Committed revisions 1124366, 1124369.

if you use setNorm, lucene writes a headerless separate norms file -- Key: LUCENE-3012 URL: https://issues.apache.org/jira/browse/LUCENE-3012 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.2, 4.0 Attachments: LUCENE-3012.patch, LUCENE-3012_3x.patch

In this case SR.reWrite just writes the bytes with no header... we should always write it. We can detect these cases (segment written = 3.1) with a sketchy length == maxDoc check.
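The "sketchy length == maxDoc check" works because a separate norms file stores exactly one norm byte per document, so a file with no header has length equal to maxDoc. A toy sketch of the heuristic (not the actual SegmentReader code; HEADER_LENGTH is a hypothetical placeholder, not Lucene's real value):

```java
// Toy model of the LUCENE-3012 heuristic: a separate norms file holds one
// byte per document, so a file whose length equals maxDoc must have been
// written headerless by an older writer. HEADER_LENGTH is hypothetical.
public class NormsHeaderCheck {
    static final long HEADER_LENGTH = 9; // hypothetical header size in bytes

    static boolean isHeaderless(long fileLength, int maxDoc) {
        return fileLength == maxDoc; // sketchy: relies on an exact length match
    }

    static long normsOffset(long fileLength, int maxDoc) {
        // skip the header only when one is actually present
        return isHeaderless(fileLength, maxDoc) ? 0 : HEADER_LENGTH;
    }
}
```

The check is "sketchy" because it infers the file format from its length alone, which is why always writing the header going forward is the real fix.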
[jira] [Resolved] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-3102. Resolution: Fixed Thanks Mike. Seems that TestGrouping is indeed fixed. Committed revision 1124378 (3x). Committed revision 1124379 (trunk). Resolving this. We can tackle OBS and other optimizations in subsequent issues if the need arises. Thanks Mike!

Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, LUCENE-3102.patch

CachingCollector (introduced in LUCENE-1421) has a few issues:
# Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
# It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
# I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity.
# Can this check, if (base + nextLength > maxDocsToCache) (line 168), be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
Also:
* The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly?
* Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189).
* I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core?
* How about using OpenBitSet instead of int[] for doc IDs?
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries.
** NOTE: OpenBitSet can be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order.
* Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
* I think a set of dedicated unit tests for this class alone would be good.
That's it so far. Perhaps, if we do all of the above, more things will pop up.
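The OpenBitSet idea discussed above trades ordering for space: a bit set records which docs matched, but not the order in which they were collected, so it only works for in-order collection. A Lucene-free sketch of the trade-off, using java.util.BitSet as a stand-in for OpenBitSet:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of caching matching doc IDs in a bit set instead of an int[].
// Replay can only walk the bits in increasing docID order, so a collector
// that delivers docs out of order cannot be cached this way.
public class BitSetDocCache {
    private final BitSet docs = new BitSet();
    private int lastDoc = -1;

    void collect(int doc) {
        if (doc < lastDoc) {
            throw new IllegalArgumentException("out-of-order collection not supported");
        }
        docs.set(doc);
        lastDoc = doc;
    }

    List<Integer> replay() {
        // iterate set bits in increasing order -- the only order a bit set keeps
        List<Integer> out = new ArrayList<>();
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            out.add(doc);
        }
        return out;
    }
}
```

On space: for N hits over maxDoc documents, the bit set costs roughly maxDoc/8 bytes versus 4*N for an int[], so it only wins once more than about 1/32 of the index matches -- which matches the caveat above that for large indexes with few hits it might consume more RAM than the int[].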
[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035586#comment-13035586 ] Martijn van Groningen commented on SOLR-2524: - bq. Maybe rename group.cache.maxSize -> .maxSizeMB? (So it's clear what the units are). Yes, that is a more descriptive name. bq. Should we default group.cache to true? (It's false now?). That makes sense. I think that if cachedCollector.isCached() returns false we should put something in the response indicating that the cache wasn't used because it hit the cache.maxSizeMB limit. Otherwise nobody will know whether the cache was utilized. When I was playing around with the cache options I noticed that searching without cache (~350 ms) was faster than with cache (~500 ms) on a 10M index with 1711 distinct group values. This is not what I'd expect. bq. When you get the top groups from collector2, should you pass in offset instead of 0? (Hmm -- maybe groupOffset? It seems like you're using offset for both the first and second phase collectors? Maybe I'm confused...). I know that is confusing, but the DocSlice expects offset + len documents. So that was a quick way of doing that. I will clean that up. bq. This matches how Solr does grouping on trunk right? Yes it does. I'm already thinking about a new collector that collects the most relevant documents of all groups. This collector should produce something like an OpenBitSet. We can use the OpenBitSet to create a DocSet. I think this should be implemented in a different issue.
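The collector1/collector2 exchange above refers to Lucene's two-pass grouping scheme: pass one scans the hits to find the top N group values, pass two re-collects to gather the best documents within exactly those groups. A self-contained plain-Java sketch of the idea (illustrative only, not the actual grouping-contrib collector classes):

```java
import java.util.*;
import java.util.stream.Collectors;

// Two-pass grouping sketch: each doc has a group value and a score.
// Pass 1 picks the top-N groups by their best score; pass 2 keeps the
// best document per selected group. Plain-Java illustration only.
public class TwoPassGrouping {
    static class Doc {
        final int id; final String group; final float score;
        Doc(int id, String group, float score) {
            this.id = id; this.group = group; this.score = score;
        }
    }

    // Pass 1: best score seen per group, then the top-n groups by that score.
    static List<String> topGroups(List<Doc> docs, int n) {
        Map<String, Float> best = new HashMap<>();
        for (Doc d : docs) {
            best.merge(d.group, d.score, Math::max);
        }
        return best.entrySet().stream()
                .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Pass 2: re-scan the hits, keeping the highest-scoring doc id per
    // group, but only for the groups selected in pass 1.
    static Map<String, Integer> topDocPerGroup(List<Doc> docs, Set<String> groups) {
        Map<String, Doc> best = new HashMap<>();
        for (Doc d : docs) {
            if (groups.contains(d.group)) {
                best.merge(d.group, d, (a, b) -> a.score >= b.score ? a : b);
            }
        }
        Map<String, Integer> out = new HashMap<>();
        best.forEach((g, d) -> out.put(g, d.id));
        return out;
    }
}
```

The offset/groupOffset confusion in the thread comes from this split: the group-level paging offset applies to the pass-one group list, while the within-group document offset applies to what pass two collects.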
Re: Lucene/Solr JIRA
On Tue, May 17, 2011 at 9:23 PM, Steven A Rowe sar...@syr.edu wrote: On 5/17/2011 at 3:02 PM, Chris Hostetter wrote: If we were starting from scratch, i'd agree with you that having a single Jira project makes more sense, but given where we are today, i think we should probably keep them distinct -- partly from a pain-of-migration standpoint on our end, but also from a user-expectations standpoint -- i think the Solr users/community as a whole is used to the existence of the SOLR project in Jira, and used to the SOLR-* issue naming convention, and it would likely be more confusing for *them* to change now. just a few words. I disagree here with you hoss. IMO the suggestion to merge JIRA would help to move us closer together and help close the gap between Solr and Lucene. I think we need to start identifying ourselves with what we work on. It feels like we don't do that today, and we should work hard to stop that and make hard breaks that might hurt, but I think it's the way to go. Drawn from what has happened in the last weeks / months, it would be good to start from scratch, at least in JIRA. I'd go even further and nuke the name entirely and call everything lucene - I know not many folks like the idea and it might take a while to bake in, but I think for us (PMC / Committers) and the community it would be good. I am not calling a vote here, just stating my opinion. here is my +1 to the JIRA suggestion Simon +1
Re: Lucene/Solr JIRA
I didn't know that it was decided that top-level modules issues go under the Lucene project. That indeed reduces some of the confusion (as long as users will adhere to it, but I guess it's also up to us to enforce it). So the Lucene project becomes everything that is not precisely Solr, i.e. not under solr/? I think that one day we will need to merge. The more modules we have, the fewer issues will be opened under the Solr project (if it uses those modules). I agree w/ what Simon wrote - users will get used to it, so that's not a good reason IMO. Also, if we keep claiming the user base is different, then I think we have a problem ... every Solr user is also a Lucene user (eventually) -- true, some only interact w/ the Solr REST API, and may not know/care that Lucene runs at the lower level. But for the community's sake, I think merging JIRA will only help down the road. Not very related to the JIRA merge, but still ... if I look at the Lucene project today, I see ~30 issues marked for 3.2, so I think to myself well, 3.2 in maybe a month seems reasonable. But then I look at the Solr project and see ~230 marked for 3.2, and I think if we need to release both Lucene and Solr, then we're definitely far from 3.2. Now, I don't know if Solr's ~230 is just bad JIRA management, and most of the issues have just drifted from version to version for several releases, or Solr really has 230 issues that need to be addressed in 3.2, and then we have a serious manpower problem. I'm not saying that if the two were merged under the same project we'd have *fewer* 3.2 issues overall for sure, but I have a feeling that would happen, because at least for me, it's hard to track two projects and I usually look @ Lucene. I can imagine that if they were merged under the same project, and I'd see a longer list of issues, I'd do something (radical, like closing/cleaning them :)). But I'm only guessing. Maybe I should try to run w/ Hoss's query for some time and see if it affects my itch to reduce the number of issues.
At the end of the day, I don't think we can maintain two projects for much longer, and I don't think it's the right thing to do at all. And if one day we'll merge JIRA projects, then tomorrow is as good a day as any other day. Our users, I'm sure, will get used to it very quickly. I doubt users care that much about the prefix SOLR-* for Solr issues. Shai
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035616#comment-13035616 ] Michael McCandless commented on LUCENE-152: --- I think that's right.
Re: Lucene/Solr JIRA
On Tue, May 17, 2011 at 10:23 PM, Steven A Rowe sar...@syr.edu wrote: On 5/17/2011 at 3:02 PM, Chris Hostetter wrote: [snip] +1 +1 for keeping separate user lists and separate JIRA projects, stabilize, no rush, release a few, then, perhaps, reiterate on this.
[jira] [Created] (SOLR-2526) Grouping on multiple fields
Grouping on multiple fields --- Key: SOLR-2526 URL: https://issues.apache.org/jira/browse/SOLR-2526 Project: Solr Issue Type: New Feature Components: search Affects Versions: 4.0 Reporter: Arian Karbasi Priority: Minor Grouping on multiple fields and/or ranges should be an option, e.g. (X,Y) groupings.
[jira] [Updated] (SOLR-2526) Grouping on multiple fields
[ https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arian Karbasi updated SOLR-2526: Component/s: search
[jira] [Updated] (SOLR-2526) Grouping on multiple fields
[ https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arian Karbasi updated SOLR-2526: Component/s: (was: search)
[jira] [Commented] (SOLR-2524) Adding grouping to Solr 3x
[ https://issues.apache.org/jira/browse/SOLR-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035631#comment-13035631 ] Michael McCandless commented on SOLR-2524: -- bq. I think that if cachedCollector.isCached() returns false we should put something in the response indicating that the cache wasn't used because it hit the cache.maxSizeMB limit. Otherwise nobody will know whether the cache was utilized. +1, and maybe log a warning? Or is that going to be too much logging? bq. When I was playing around with the cache options I noticed that searching without cache (~350 ms) was faster than with cache (~500 ms) on a 10M index with 1711 distinct group values. This is not what I'd expect. That is worrisome!! Was this a simple TermQuery? Is it somehow possible Solr is already caching the query's results itself...? bq. I'm already thinking about a new collector that collects the most relevant documents of all groups. This collector should produce something like an OpenBitSet. We can use the OpenBitSet to create a DocSet. I think this should be implemented in a different issue. Cool!
[jira] [Commented] (SOLR-2526) Grouping on multiple fields
[ https://issues.apache.org/jira/browse/SOLR-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035635#comment-13035635 ] Michael McCandless commented on SOLR-2526: -- I think LUCENE-3099 could make this possible, by allowing subclasses to define arbitrary group keys per document. Today the grouping module is hardwired to use BytesRef (pulled from FieldCache of a single-valued indexed field) as the group key, but really it should be able to be any key.
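Once the group key can be "any key" as suggested above, multi-field grouping reduces to using a composite value as the key. A plain-Java sketch of such a composite key, whose equals/hashCode combine both field values (illustrative only; today's module keys on a single BytesRef from the FieldCache):

```java
import java.util.*;

// Sketch of grouping on two fields at once: a small value class combining
// both field values serves as the group key in a hash map. Illustrative
// only, not the Lucene grouping module's API.
public class CompositeGroupKey {
    final String x;
    final String y;

    CompositeGroupKey(String x, String y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CompositeGroupKey)) return false;
        CompositeGroupKey k = (CompositeGroupKey) o;
        return x.equals(k.x) && y.equals(k.y);
    }

    @Override public int hashCode() { return 31 * x.hashCode() + y.hashCode(); }

    // group doc IDs by the (x, y) pair of their per-document field values
    static Map<CompositeGroupKey, List<Integer>> group(
            int[] docIds, String[] xVals, String[] yVals) {
        Map<CompositeGroupKey, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i < docIds.length; i++) {
            groups.computeIfAbsent(new CompositeGroupKey(xVals[i], yVals[i]),
                    k -> new ArrayList<>()).add(docIds[i]);
        }
        return groups;
    }
}
```

Range groupings would work the same way: map each document's value to its bucket first, then use the bucket pair as the key.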
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035637#comment-13035637 ] Michael McCandless commented on LUCENE-3102: Thanks Shai -- this is awesome progress! Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: core/search Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, LUCENE-3102-nowrap.patch, LUCENE-3102.patch, LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity # Can this check, if (base + nextLength > maxDocsToCache) (line 168), be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) * I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core? * How about using OpenBitSet instead of int[] for doc IDs? 
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can only be used for in-order collection. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
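The "optionally wrap another Collector" idea above (and the NoOpCollector option discussed earlier in this thread) can be sketched with a minimal stand-in for Lucene's Collector interface. Note this is an illustration only: the real Collector lives in org.apache.lucene.search and also has setScorer/setNextReader; the simplified names below are assumptions, not the actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for Lucene's Collector; illustration only.
interface SimpleCollector {
    void collect(int doc);
    boolean acceptsDocsOutOfOrder();
}

// A shared no-op instance, so the caching collector can always delegate
// without a null check or an extra 'if' in the hot collect() path.
final class NoOpCollector implements SimpleCollector {
    static final NoOpCollector INSTANCE = new NoOpCollector();
    private NoOpCollector() {}
    public void collect(int doc) { /* intentionally empty */ }
    public boolean acceptsDocsOutOfOrder() { return true; }
}

class CachingCollectorSketch implements SimpleCollector {
    private final SimpleCollector other;
    final List<Integer> cachedDocs = new ArrayList<>();

    // Passing null falls back to the no-op delegate.
    CachingCollectorSketch(SimpleCollector other) {
        this.other = (other != null) ? other : NoOpCollector.INSTANCE;
    }
    public void collect(int doc) {
        cachedDocs.add(doc);   // cache doc ID for later replay
        other.collect(doc);    // unconditional delegation
    }
    public boolean acceptsDocsOutOfOrder() {
        return other.acceptsDocsOutOfOrder();
    }
    void replay(SimpleCollector to) {
        for (int doc : cachedDocs) to.collect(doc);
    }
}
```

With this shape, "collect everything, then replay" works identically whether or not a real downstream Collector was supplied at construction time.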
Re: Solr Config XML DTD's
I looked into inserting a formal validation step in o.a.solr.core.Config and ran some preliminary simple tests. The code is fairly simple; just a couple of gotchas: 1) to use the RNC validation language (my preference), we would need to pull in a couple of new jars, one of which is over 600K. Also, support for RNC in the XML world is not very widespread: it's gotten more interest from researchers and less uptake more broadly, so it might not be the best choice, even if, aesthetically it is superior IMO. 2) The other alternatives are XML Schema and DTD. I think DTD is a non-starter since it just can't allow things like arbitrary attributes on an element (you have to list them explicitly). Schema is probably the best choice all things considered: support for it is built into the XML tools already in use, and it is widely adopted. The drawback is that it's a baroque and unwieldy syntax designed by an indecisive committee that loaded it down with excessive featuritis, and someone will end up having to maintain this: every time you add a new configuration option to the schema (or solrconfig, etc), then the schema-schema (validation schema?) will have to be updated to reflect that. 3) Finally, to get good error reporting it's important to show file name and line number where an error occurred. Although you can validate a constructed XML tree (a DOM), it's better to run validation on a Stream so the line numbers are available. Therefore it will probably be necessary to run two passes (one to validate, and one to construct the DOM), which means buffering the config. Doesn't seem like a big deal: these are small files that only get loaded once, but this is a cost of validation, I think. 
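The two-pass approach in (3) is straightforward with the JDK's built-in javax.xml.validation API: validating a StreamSource (rather than an already-built DOM) means a SAXParseException carries the offending line number. A minimal sketch with a toy schema -- the element names here are made up for illustration, not the real solrconfig vocabulary:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXParseException;

class ConfigValidationSketch {
    // Toy schema: a <config> element containing zero or more
    // <luceneMatchVersion> elements. A real schema would be much larger.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        " <xs:element name='config'>" +
        "  <xs:complexType><xs:sequence>" +
        "   <xs:element name='luceneMatchVersion' type='xs:string'" +
        "               minOccurs='0' maxOccurs='unbounded'/>" +
        "  </xs:sequence></xs:complexType>" +
        " </xs:element>" +
        "</xs:schema>";

    // First pass over the buffered config text: returns null if valid,
    // otherwise "line N: message". A second pass would build the DOM.
    static String validate(String xml) throws Exception {
        SchemaFactory f = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = f.newSchema(new StreamSource(new StringReader(XSD)));
        Validator v = schema.newValidator();
        try {
            v.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }
}
```

Since the config is buffered as a String, the same text can be fed to both the validator and the DOM builder, which is exactly the double-parse cost described above.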
Of course the benefit is that users would actually get fast-failing, specific and informative error messages covering a wide variety of misconfigurations: I would hope we could be restrictive enough to catch mis-spelled versions of known element and attribute names, or places where elements are out of order. I'd be willing to work this up, develop a preliminary schema (of whichever sort we choose), and send in a patch, but other folks would probably end up having to maintain it from time to time if it's to have any value at all and not just get disabled, so I just want to make sure this is something you all think is worthwhile before going any further. -Mike On 05/17/2011 09:04 AM, Michael McCandless wrote: https://issues.apache.org/jira/browse/SOLR-2119 is a good example where we are failing to catch mis-configuration on startup. Is there some way we can baby step here? EG use one of these XML validation packages, incrementally, on only sub-strings from the XML? (Or simpler is to just do the checking ourselves w/ custom code). Mike http://blog.mikemccandless.com On Wed, May 4, 2011 at 10:50 PM, Michael Sokolov soko...@ifactory.com wrote: I'm not sure you will find anyone wanting to put in this effort now, but another suggestion for a general approach might be: 1) very basic static analysis to catch what you can - this should be a pretty minimal effort only, given what can reasonably be achieved 2) throw runtime errors as Hoss says (probably already doing this well enough, but maybe some incremental improvements are needed?) 3) an option to run a configtest like httpd provides that preloads all declared handlers/plugins/modules etc, instantiates them and gives them an opportunity to read their config and throw whatever errors they find. This way you can set a standard (error on unrecognized parameter, say) in some core areas, and distribute the effort. 
This is a hugely useful sanity check to be able to run when you want to make config changes and not have your server fall over when it starts (or worse - later). -Mike kibitzer Sokolov On 5/4/2011 6:55 PM, Chris Hostetter wrote: As I said: any improvements to help catch the mistakes we can identify would be great, but we should maintain perspective of the effort/gain tradeoff given that there is likely nothing we can do about the basic problem of a string that won't be evaluated until runtime - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3117) yank SegmentReader.norm out of SegmentReader.java
yank SegmentReader.norm out of SegmentReader.java - Key: LUCENE-3117 URL: https://issues.apache.org/jira/browse/LUCENE-3117 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir While working on flex scoring branch and LUCENE-3012, I noticed it was difficult to navigate the norms handling in SegmentReader's code. I think we should yank this inner class out into a separate file as a start. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3117) yank SegmentReader.norm out of SegmentReader.java
[ https://issues.apache.org/jira/browse/LUCENE-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3117: Attachment: LUCENE-3117.patch yank SegmentReader.norm out of SegmentReader.java - Key: LUCENE-3117 URL: https://issues.apache.org/jira/browse/LUCENE-3117 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Attachments: LUCENE-3117.patch While working on flex scoring branch and LUCENE-3012, I noticed it was difficult to navigate the norms handling in SegmentReader's code. I think we should yank this inner class out into a separate file as a start. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3117) yank SegmentReader.norm out of SegmentReader.java
[ https://issues.apache.org/jira/browse/LUCENE-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035642#comment-13035642 ] Michael McCandless commented on LUCENE-3117: +1, this code is scary, and pulling it out is a great baby step. yank SegmentReader.norm out of SegmentReader.java - Key: LUCENE-3117 URL: https://issues.apache.org/jira/browse/LUCENE-3117 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Attachments: LUCENE-3117.patch While working on flex scoring branch and LUCENE-3012, I noticed it was difficult to navigate the norms handling in SegmentReader's code. I think we should yank this inner class out into a separate file as a start. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3068) The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position
[ https://issues.apache.org/jira/browse/LUCENE-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035643#comment-13035643 ] Doron Cohen commented on LUCENE-3068: - I wonder if this should be fixed also in the 3.1 branch? Probably so, but only if we make a 3.1.1; not needed if it's gonna be a 3.2. What's the best practice then? Reopen until decision? Or rely on rescanning all 3.2 changes in case it's gonna be 3.1.1? The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position -- Key: LUCENE-3068 URL: https://issues.apache.org/jira/browse/LUCENE-3068 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.0.3, 3.1, 4.0 Reporter: Michael McCandless Assignee: Doron Cohen Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch, LUCENE-3068.patch In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was matching docs that it shouldn't; but I think those changes caused it to fail to match docs that it should, specifically when the doc itself has tokens at the same position. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3117) yank SegmentReader.norm out of SegmentReader.java
[ https://issues.apache.org/jira/browse/LUCENE-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3117: Attachment: LUCENE-3117.patch oops the last patch had an outdated hack (for calling the silly SR.cloneBytes) yank SegmentReader.norm out of SegmentReader.java - Key: LUCENE-3117 URL: https://issues.apache.org/jira/browse/LUCENE-3117 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Attachments: LUCENE-3117.patch, LUCENE-3117.patch While working on flex scoring branch and LUCENE-3012, I noticed it was difficult to navigate the norms handling in SegmentReader's code. I think we should yank this inner class out into a separate file as a start. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-152) [PATCH] KStem for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035647#comment-13035647 ] Yonik Seeley commented on LUCENE-152: - heh - I had heard enough times that the license wouldn't permit it that I never looked into it myself. http://markmail.org/message/zlett7y3dj76xa2f Anyway, I did a bunch of optimizations for Lucid's version way back when. It makes sense for those to be contributed back here... I'll see what I can do (but it might be delayed a week by everyone being busy at Lucene Revolution). [PATCH] KStem for Lucene Key: LUCENE-152 URL: https://issues.apache.org/jira/browse/LUCENE-152 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: unspecified Environment: Operating System: other Platform: Other Reporter: Otis Gospodnetic Priority: Minor September 10th 2003 contribution from Sergio Guzman-Lara guz...@cs.umass.edu Original email: Hi all, I have ported the kstem stemmer to Java and incorporated it to Lucene. You can get the source code (Kstem.jar) from the following website: http://ciir.cs.umass.edu/downloads/ Just click on KStem Java Implementation (you will need to register your e-mail, for free of course, with the CIIR --Center for Intelligent Information Retrieval, UMass -- and get an access code). 
Content of Kstem.jar: java/org/apache/lucene/analysis/KStemData1.java java/org/apache/lucene/analysis/KStemData2.java java/org/apache/lucene/analysis/KStemData3.java java/org/apache/lucene/analysis/KStemData4.java java/org/apache/lucene/analysis/KStemData5.java java/org/apache/lucene/analysis/KStemData6.java java/org/apache/lucene/analysis/KStemData7.java java/org/apache/lucene/analysis/KStemData8.java java/org/apache/lucene/analysis/KStemFilter.java java/org/apache/lucene/analysis/KStemmer.java KStemData1.java, ..., KStemData8.java Contain several lists of words used by Kstem KStemmer.java Implements the Kstem algorithm KStemFilter.java Extends TokenFilter applying Kstem To compile, unjar the file Kstem.jar into Lucene's src directory and compile it there. What is Kstem? A stemmer designed by Bob Krovetz (for more information see http://ciir.cs.umass.edu/pubfiles/ir-35.pdf). Copyright issues This is open source. The actual license agreement is included at the top of every source file. Any comments/questions/suggestions are welcome, Sergio Guzman-Lara Senior Research Fellow CIIR UMass -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
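The filter pattern described above (KStemFilter extends TokenFilter, rewriting each token with its stem) can be sketched without the Lucene classes. Everything below is a simplified stand-in: the interfaces mimic the TokenStream/TokenFilter shape but are not the real API, and the toy suffix rule is only a placeholder for KStem's dictionary-based algorithm.

```java
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for Lucene's TokenStream; illustration only.
interface ToyTokenStream {
    String next(); // returns the next token, or null when exhausted
}

// Source stream backed by a token list.
class ListTokenStream implements ToyTokenStream {
    private final Iterator<String> it;
    ListTokenStream(List<String> tokens) { this.it = tokens.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// Plays the role of KStemFilter: wraps an input stream and stems each
// token as it passes through. The rule here (strip a trailing "s" from
// longer tokens) is a toy; real KStem is far more careful.
class ToyStemFilter implements ToyTokenStream {
    private final ToyTokenStream input;
    ToyStemFilter(ToyTokenStream input) { this.input = input; }
    public String next() {
        String t = input.next();
        if (t == null) return null;
        return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }
}
```

The key design point is that filters compose: an analyzer builds a chain (tokenizer, then any number of filters), and each stage only ever sees one token at a time.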
Re: Lucene/Solr JIRA
: I didn't know that it was decided that top-level modules issues go under the : Lucene project. That indeed reduces some of the confusion (as long as users : will adhere to it, but I guess it's also up to us to enforce it). And as noted: moving a Jira issue from SOLR-LUCENE (or vice versa) is really simple with the current version of Jira ... almost as easy as changing Component : I think that one day we will need to merge. The more modules we'll have, the : less issues will be open under Solr project (if it uses those modules). I : agree w/ what Simon wrote - users will get used to it, so that's not a good : reason IMO. Also, if we keep claiming the user base is different, then I : think we have a problem ... every Solr user is also a Lucene user : (eventually) -- true, some only interact w/ Solr REST API, and may not : know/care Lucene is run at the lower level. But for the community's sake, I : think merging JIRA will only help down the road. I just don't see it that way ... saying every Solr user is also a Lucene user is like saying every Solr user is a Java user, or every Solr user is a commons-io user ... we don't expect our users to even know that, let alone assume that Solr users should know what layer of the stack a bug/feature should be filed against. If a Solr user has an issue/improvement at the Solr level of the stack they file a SOLR issue -- if they are a savvy dev and know that the only real change is at the Lucene level they file a LUCENE issue, if we feel like we need to move a SOLR issue to a LUCENE issue we can do so, if we feel like there should be two issues to track the bug/change, one covering the Solr layer changes and one covering some dependent Lucene layer change we can do that too -- just like we could if someone filed a bug against a module component and we decided there was a dependent, but fundamentally distinct core change that we wanted to track as a distinct jira. 
: At the end of the day, I don't think we can maintain two projects for much : longer, and I don't think it's the right thing to do at all. And if one day : we'll merge JIRA projects, then tomorrow is as good a day as any other day. We are one Apache project, two Apache Products. The fact that the Jira terminology is Project is an implementation detail. If we get to the point where we decide that we want to release some module as distinct release artifacts, because we think it has a distinct user community who doesn't know/care about the Lucene Core as a whole, then I would totally argue in favor of that module having a distinct Jira Product/Project as well. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3118) Tools for making explanations easier to consume/understand
Tools for making explanations easier to consume/understand -- Key: LUCENE-3118 URL: https://issues.apache.org/jira/browse/LUCENE-3118 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Oftentimes, reading Explanations (i.e. the breakdown of scores for a particular query and result, say via Solr's debugQuery) is a pretty cryptic and difficult undertaking. I often say people suffer from explain blindness from staring at explanation results for too long. We could add a layer of explanation helpers above the core Explain functionality that help people understand better what is going on. The goal is to give a higher level of tools to people who aren't necessarily well versed in all the underpinnings of Lucene's scoring mechanisms but still want information about why something didn't match For instance (brainstorming some things that might be doable): * Explain Diff Tool -- Given 1 or more explanations, quickly highlight what the key things are that differentiate the results (i.e. fieldNorm is higher, etc.) * Given a query and any document, give a more friendly reason why it ranks lower than others without the need to parse through all the pieces of the score; for instance, could you simply say something like, programmatically that is, this document scored lower compared to your top 10 b/c it had no values in the foo Field. * Could even maybe return codes for these reasons which could then be hooked into actual user messages. I don't have anything concrete patch-wise here, but am putting this up as a way to capture the idea and potentially spur others to think about it. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
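The Explain Diff Tool idea could work by walking two explanation trees in parallel and reporting the component with the largest value gap. A sketch, using a minimal stand-in class for Lucene's Explanation (the real class exposes getValue(), getDescription(), and getDetails(); the simplified fields below and the diff heuristic itself are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for org.apache.lucene.search.Explanation.
class Expl {
    final float value;
    final String description;
    final List<Expl> details = new ArrayList<>();
    Expl(float value, String description) { this.value = value; this.description = description; }
    Expl add(Expl child) { details.add(child); return this; }
}

class ExplainDiff {
    // Walks two structurally-identical explanation trees and returns the
    // description of the leaf component with the largest absolute value gap.
    static String biggestGap(Expl a, Expl b) {
        String[] best = {null};
        float[] bestGap = {-1f};
        walk(a, b, best, bestGap);
        return best[0];
    }
    private static void walk(Expl a, Expl b, String[] best, float[] bestGap) {
        if (a.details.isEmpty()) {
            float gap = Math.abs(a.value - b.value);
            if (gap > bestGap[0]) { bestGap[0] = gap; best[0] = a.description; }
            return;
        }
        for (int i = 0; i < a.details.size(); i++) {
            walk(a.details.get(i), b.details.get(i), best, bestGap);
        }
    }
}
```

A helper like this could feed the "return codes" idea: map the winning description (fieldNorm, tf, etc.) to a user-facing message.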
Re: Lucene/Solr JIRA
: just a few words. I disagree here with you hoss IMO the suggestion to : merge JIRA would help to move us closer together and help close the : gap between Solr and Lucene. I think we need to start identifying us : with what we work on. It feels like we don't do that today and we : should work hard to stop that and make hard breaks that might hurt but I just don't see how you think that would help anything ... we still need to distinguish Jira issues to identify where in the stack they affect. If there is a divide among the developers because of the niches where they tend to work, will that divide magically go away because we partition all issues using the component feature instead of by the Jira project feature? I don't really see how that makes any sense. Even if we all thought it did, and even if the cost/effort of migrating/converting were totally free, the user bases (who interact with the Solr APIs vs directly using the Lucene-Core/Module APIs) are so distinct that I genuinely think sticking with distinct Jira Projects makes more sense for our users. : JIRA. I'd go even further and nuke the name entirely and call : everything lucene - I know not many folks like the idea and it might : take a while to bake in but I think for us (PMC / Committers) and the Everything already is called Lucene ... the Project is Apache Lucene the community is Lucene ... the Lucene project currently releases several products, and one of them is called Apache Solr ... if you're suggesting that we should ultimately eliminate the name Solr then we'd still have to decide what we're going to call that end product, the artifact that we ship that provides the abstraction layer that Solr currently provides. Even if you mean to suggest that we should only have one unified product -- one singular release artifact -- that abstraction layer still needs a name. 
The name we have now is Solr, it has brand awareness and a user base who understands what it means to say they are 'Installing Solr' or that a new feature is available when 'Using Solr'. Eliminating that name doesn't seem like it would benefit the user community in any way. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org