[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113912#comment-13113912 ] Chris Male commented on LUCENE-1536: Actually, the more I look at the nocommits, the less I like what I've suggested. I think having getRandomAccessBits as it is in the patch is fine. But I think we should maybe make setLiveDocsOnly and setAllowRandomAccessFiltering 1st class features of the Bits interface.

if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch

I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop.
Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
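Abstractly, the two application strategies under test can be sketched as follows; every name and shape below is illustrative, not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative sketch (not the patch's API) of the two ways a filter
// can be applied to the docs a scorer produces.
class FilterSketch {
    // Random access ("method high"): one bits.get(doc)-style check per
    // candidate doc inside the search loop.
    static List<Integer> randomAccess(int[] scorerDocs, IntPredicate bits) {
        List<Integer> out = new ArrayList<>();
        for (int doc : scorerDocs) {
            if (bits.test(doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    // Iterator style (trunk's baseline): leapfrog the scorer's docs
    // against the filter's sorted doc-id stream.
    static List<Integer> byIterator(int[] scorerDocs, int[] filterDocs) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int doc : scorerDocs) {
            while (j < filterDocs.length && filterDocs[j] < doc) {
                j++;
            }
            if (j < filterDocs.length && filterDocs[j] == doc) {
                out.add(doc);
            }
        }
        return out;
    }
}
```

The random-access form trades one cheap membership test per candidate doc against the iterator form's advance/leapfrog bookkeeping, which is why filter density dominates the results.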
Re: Prettify JS and CSS excluded from Javadocs
Hi Steve, As I noted before, jarring prettify won't solve the problem entirely, as the references in the HTML will point to an incorrect location. Why don't we just package prettify.js in the jar like we do with stylesheet+prettify.css? We only need this .js as we only write in Java ... It's a tiny file, and it will simplify the whole process. What do you think? Shai

On Thu, Sep 22, 2011 at 5:05 PM, Steven A Rowe sar...@syr.edu wrote: The patch I gave for lucene/contrib-build.xml's javadocs target was wrong (I placed the nested tag outside of the jarify invocation). Here's a fixed patch:

Index: lucene/contrib/contrib-build.xml
===================================================================
--- lucene/contrib/contrib-build.xml	(revision 1165174)
+++ lucene/contrib/contrib-build.xml	(revision )
@@ -95,7 +95,11 @@
         <packageset dir="${src.dir}"/>
       </sources>
     </invoke-javadoc>
-    <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar"/>
+    <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar">
+      <nested>
+        <fileset dir="${prettify.dir}"/>
+      </nested>
+    </jarify>
   </sequential>
 </target>
[jira] [Commented] (LUCENE-3452) The native FS lock used in test-framework's o.a.l.util.LuceneJUnitResultFormatter prohibits testing on a multi-user system
[ https://issues.apache.org/jira/browse/LUCENE-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113932#comment-13113932 ] Uwe Schindler commented on LUCENE-3452: --- bq. On the first pass, the build hung in the middle of the lucene core tests - I killed the process after half an hour with no output. I restarted the tests, and the build made it through the Lucene tests, but then at least one Solr core test failed.

We had this several times on Jenkins, too. I killed the JVM approx. 4 times in the last 2 weeks.

The native FS lock used in test-framework's o.a.l.util.LuceneJUnitResultFormatter prohibits testing on a multi-user system -- Key: LUCENE-3452 URL: https://issues.apache.org/jira/browse/LUCENE-3452 Project: Lucene - Java Issue Type: Bug Components: general/test Affects Versions: 3.4, 4.0 Reporter: Steven Rowe Priority: Minor Attachments: LUCENE-3452.patch

{{LuceneJUnitResultFormatter}} uses a lock to buffer test suites' output, so that when run in parallel, they don't interrupt each other when they are displayed on the console. The current implementation uses a fixed directory ({{lucene_junit_lock/}}) in {{java.io.tmpdir}} (by default {{/tmp/}} on Unix/Linux systems) as the location of this lock. This functionality was introduced in SOLR-1835. As Shawn Heisey reported on SOLR-2739, some tests fail when run as root, but succeed when run as a non-root user. On #lucene IRC today, Shawn wrote:

{quote} (2:06:07 PM) elyograg: Now that I know I can't run the tests as root, I have discovered /tmp/lucene_junit_lock. Once you run the tests as user A, you cannot run them again as user B until that directory is deleted, and only root or the original user can do so. {quote}
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113957#comment-13113957 ] Michael McCandless commented on LUCENE-3449: bq. No matter what, this loop needs 2 ifs per cycle

Duh, I was wrong about this! We just need to change the sentinel value returned by nextSetBit when there is no next set bit, from -1 to MAX_INT. In fact we did this for DISI.nextDoc, for the same reason (saves an if per cycle). Then you just rotate the loop:

{noformat}
if (bits.length() != 0) {
  int bit = bits.nextSetBit(0);
  final int limit = bits.length()-1;
  while (bit < limit) {
    // ...do something with bit...
    bit = bits.nextSetBit(1+bit);
  }
  if (bit == bits.length()-1) {
    // ...do something with bit...
  }
}
{noformat}

Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book --- Key: LUCENE-3449 URL: https://issues.apache.org/jira/browse/LUCENE-3449 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.4, 4.0 Reporter: Uwe Schindler Priority: Minor Attachments: LUCENE-3449.patch

The usage pattern for nextSetBit/prevSetBit is the following:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
}
{code}

The problem is that the i+1 at the end can be bs.length(), but the code in nextSetBit does not allow this (same applies to prevSetBit(0)). The above usage pattern is in every programming book, so it should really be supported. The check has to be done in all cases (with the current impl, in the calling code). If the check is done inside xxxSetBit() it can also be optimized to be called only seldom and not all the time, unlike in the ugly-looking replacement that's currently needed:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
}
{code}

We should change this and allow out-of-bounds indexes for those two methods (they already do some checks in that direction). Enforcing this with an assert is unusable on the client side. The test code for FixedBitSet also uses this; horrible. Please support the common usage pattern for BitSets.
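For comparison, java.util.BitSet already tolerates the out-of-bounds fromIndex: nextSetBit accepts any non-negative index, even one past the highest set bit, and returns -1 when exhausted, which is exactly what lets the textbook loop run unguarded. A minimal runnable sketch of that pattern:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class TextbookLoop {
    // Collect the indexes of all set bits with the classic loop.
    // java.util.BitSet.nextSetBit(i + 1) is legal even when i is the
    // highest set bit, so the caller needs no extra bounds check.
    public static List<Integer> setBits(BitSet bs) {
        List<Integer> out = new ArrayList<>();
        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        bs.set(1);
        bs.set(3);
        bs.set(63); // last bit of the first 64-bit word
        System.out.println(setBits(bs)); // [1, 3, 63]
    }
}
```

The issue asks that FixedBitSet accept the same loop without throwing on the final nextSetBit(i+1) call.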
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113958#comment-13113958 ] hadas raviv commented on LUCENE-2959: - Hi, First of all, I would like to thank you for the great contribution you made by adding the state of the art ranking methods to lucene. I was waiting for these features for a long time, since they enable an IR researcher like me to use lucene, which is a powerful tool, for research purposes. I downloaded the latest version of lucene trunk and played a little with the models you implemented. There is a question I have and I would really appreciate your answer (my apologies in advance - I'm new to lucene so maybe this question is trivial for you): I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that? Is it possible to save the real document length coded in some other way (maybe with the new flexible index)? Is there any example for such an implementation? It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue? In addition - do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future?
Thanks in advance, Hadas

[GSoC] Implementing State of the Art Ranking for Lucene --- Key: LUCENE-2959 URL: https://issues.apache.org/jira/browse/LUCENE-2959 Project: Lucene - Java Issue Type: New Feature Components: core/query/scoring, general/javadocs, modules/examples Reporter: David Mark Nemeskey Assignee: Robert Muir Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: flexscoring branch, 4.0 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, implementation_plan.pdf, proposal.pdf

Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task. This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions. The wiki page for the project can be found at http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.
[jira] [Commented] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113960#comment-13113960 ] Michael McCandless commented on LUCENE-3451: Patch looks great Uwe! Nice catch on the analyzers removing stop words and then making an all-MUST_NOT BQ. But, I think we should throw an exception in this case, since it's a horrible trap now? The user will get 0 results, but that's flat-out silently wrong?

Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Java Issue Type: Bug Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.0 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch

We should, at least in Lucene 4.0, remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throwing UOE.
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113961#comment-13113961 ] Dawid Weiss commented on LUCENE-3449: - Sorry, but my gut feeling says no to this loop logic. It just seems strangely complicated. If adherence to BitSet is not an issue, why not:

{code}
for (int i = bs.firstSetBit(); i >= 0; i = bs.nextSetBitAfter(i)) {
{code}

This seems clearer on method naming, has a single if... and I think could be implemented nearly identically to what's already in the code. We can run microbenchmarks for fun and see what comes out better and by what margin.

Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book --- Key: LUCENE-3449 URL: https://issues.apache.org/jira/browse/LUCENE-3449 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.4, 4.0 Reporter: Uwe Schindler Priority: Minor Attachments: LUCENE-3449.patch

The usage pattern for nextSetBit/prevSetBit is the following:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
}
{code}

The problem is that the i+1 at the end can be bs.length(), but the code in nextSetBit does not allow this (same applies to prevSetBit(0)). The above usage pattern is in every programming book, so it should really be supported. The check has to be done in all cases (with the current impl, in the calling code). If the check is done inside xxxSetBit() it can also be optimized to be called only seldom and not all the time, unlike in the ugly-looking replacement that's currently needed:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
}
{code}

We should change this and allow out-of-bounds indexes for those two methods (they already do some checks in that direction). Enforcing this with an assert is unusable on the client side. The test code for FixedBitSet also uses this; horrible. Please support the common usage pattern for BitSets.
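A rough sketch of how the proposed pair could be implemented over a raw long[], the way FixedBitSet stores its bits. The method names firstSetBit/nextSetBitAfter come from the comment above; the class, the bounds handling, and everything else are assumptions, not the actual FixedBitSet code:

```java
// Hypothetical implementation of the proposed API over a raw long[]
// like FixedBitSet's backing words. The single bounds check lives
// inside nextSetBitAfter, hidden from the caller; -1 is the
// exhausted sentinel.
public class SetBitCursor {
    private final long[] words; // bit i lives in words[i >> 6]
    private final int numBits;

    public SetBitCursor(long[] words, int numBits) {
        this.words = words;
        this.numBits = numBits;
    }

    public int firstSetBit() {
        return nextSetBitAfter(-1);
    }

    public int nextSetBitAfter(int index) {
        int i = index + 1;
        if (i >= numBits) {
            return -1; // the one "if" per cycle
        }
        int word = i >> 6;
        long bits = words[word] >>> i; // Java shifts a long by i & 63
        if (bits != 0) {
            return i + Long.numberOfTrailingZeros(bits);
        }
        for (word++; word < words.length; word++) {
            if (words[word] != 0) {
                return (word << 6) + Long.numberOfTrailingZeros(words[word]);
            }
        }
        return -1;
    }
}
```

The caller's loop then reads `for (int i = c.firstSetBit(); i >= 0; i = c.nextSetBitAfter(i)) { ... }`, with no bounds arithmetic at the call site.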
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113965#comment-13113965 ] Michael McCandless commented on LUCENE-1536: bq. But I think we should maybe make setLiveDocsOnly and setAllowRandomAccessFiltering 1st class features of the Bits interface.

Hmm... that also makes me a bit nervous ;) Bits is too low-level for these concepts? Ie whether a filter/DIS folded in live docs already, and whether the filter/DIS is best applied by iteration vs by random access, are higher level filter concepts, not low level Bits concepts, I think? Also, Bits by definition is random-access so I don't think it should have set/getAllowRandomAccessFiltering.

if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch

I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search).
* I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop).
[jira] [Commented] (LUCENE-3435) Create a Size Estimator model for Lucene and Solr
[ https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113967#comment-13113967 ] Grant Ingersoll commented on LUCENE-3435: - A patch would be great for all of these things. Thanks! Create a Size Estimator model for Lucene and Solr - Key: LUCENE-3435 URL: https://issues.apache.org/jira/browse/LUCENE-3435 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 4.0 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It is often handy to be able to estimate the amount of memory and disk space that both Lucene and Solr use, given certain assumptions. I intend to check in an Excel spreadsheet that allows people to estimate memory and disk usage for trunk. I propose to put it under dev-tools, as I don't think it should be official documentation just yet and like the IDE stuff, we'll see how well it gets maintained.
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113977#comment-13113977 ] Robert Muir commented on LUCENE-2959: - {quote} I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that? {quote} So that you can switch between ranking models without re-indexing. {quote} It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue? {quote} I ran experiments on this a long time ago; the changes were not statistically significant. But there is an issue open to still switch norms to docvalues fields, for other reasons: LUCENE-3221 {quote} In addition - do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future? {quote} No, there won't be any additional work on this issue; GSoC is over.

[GSoC] Implementing State of the Art Ranking for Lucene --- Key: LUCENE-2959 URL: https://issues.apache.org/jira/browse/LUCENE-2959 Project: Lucene - Java Issue Type: New Feature Components: core/query/scoring, general/javadocs, modules/examples Reporter: David Mark Nemeskey Assignee: Robert Muir Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: flexscoring branch, 4.0 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, implementation_plan.pdf, proposal.pdf

Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task.
This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions. The wiki page for the project can be found at http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.
[jira] [Updated] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sebastian L. updated LUCENE-3440: - Fix Version/s: 4.0 Affects Version/s: 4.0

FastVectorHighlighter: IDF-weighted terms for ordered fragments Key: LUCENE-3440 URL: https://issues.apache.org/jira/browse/LUCENE-3440 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.5, 4.0 Reporter: sebastian L. Priority: Minor Labels: FastVectorHighlighter Fix For: 3.5, 4.0 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, LUCENE-4.0-SNAPSHOT-3440-3.patch

The FastVectorHighlighter assigns an equal weight to every term found in a fragment, which gives a higher ranking to fragments with a high number of words, or, in the worst case, a high number of very common words, than to fragments that contain *all* of the terms used in the original query. This patch provides ordered fragments with IDF-weighted terms: total weight = total weight + (IDF of unique term per fragment * boost of query). The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer. The patch is simple, but it works for us. Some ideas: - A better approach would be moving the whole fragment scoring into a separate class. - Switch scoring via parameter. - Exact phrases should be given an even better score, regardless of whether a phrase query was executed or not. - The edismax/dismax parameters pf, ps and pf^boost should be observed and corresponding fragments ranked higher.
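The proposed scoring rule can be sketched as follows; the class and method names are illustrative, not the patch's actual classes:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the proposed rule: score a fragment by summing
// IDF * query boost over the *unique* query terms it contains, so a
// fragment holding all the rare query terms outranks one padded with
// many repetitions of a single common term. Names are illustrative.
class FragmentScoreSketch {
    static double score(Set<String> uniqueTermsInFragment,
                        Map<String, Double> idf,
                        double queryBoost) {
        double total = 0.0;
        for (String term : uniqueTermsInFragment) {
            total += idf.getOrDefault(term, 0.0) * queryBoost;
        }
        return total;
    }
}
```

Counting each term once per fragment is the key difference from equal-weight counting, where repeated common terms inflate the score.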
[jira] [Commented] (LUCENE-3312) Break out StorableField from IndexableField
[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113984#comment-13113984 ] Chris Male commented on LUCENE-3312: Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I'm struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: - Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simply initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. - It separates the 'data type' away from the actual value. If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.

Break out StorableField from IndexableField --- Key: LUCENE-3312 URL: https://issues.apache.org/jira/browse/LUCENE-3312 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Fix For: Field Type branch

In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own documents instead of the user-space impls we provide in oal.document. Similarly, with LUCENE-3309, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor. But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables -- one for the IndexableFields and one for the StorableFields. Either can be null.
One downside is a possible perf hit for fields that are both indexed and stored (ie, we visit them twice, look up their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in the API.
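A rough sketch of the proposed split. Only the names IndexableField and StorableField come from the issue; the methods, the TextField class, and the writer are illustrative assumptions:

```java
// Sketch of the proposed split: the indexer takes one Iterable per
// concern, and either may be null. Only the two interface names come
// from the issue; everything else here is illustrative.
interface IndexableField {
    String name();
    String stringValue(); // inverted by the indexer
}

interface StorableField {
    String name();
    String stringValue(); // written verbatim to stored fields
}

// A field that is both indexed and stored implements both interfaces:
// this is the "visit it twice" perf concern from the description.
class TextField implements IndexableField, StorableField {
    private final String name, value;

    TextField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String name() { return name; }
    public String stringValue() { return value; }
}

class SketchIndexWriter {
    int indexedCount, storedCount;

    void addDocument(Iterable<? extends IndexableField> indexed,
                     Iterable<? extends StorableField> stored) {
        if (indexed != null) for (IndexableField f : indexed) indexedCount++;
        if (stored != null) for (StorableField f : stored) storedCount++;
    }
}
```

An index-only or store-only app simply passes null for the Iterable it doesn't need, which is the cleaner separation the description argues for.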
[jira] [Issue Comment Edited] (LUCENE-3312) Break out StorableField from IndexableField
[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113984#comment-13113984 ] Chris Male edited comment on LUCENE-3312 at 9/24/11 2:25 PM: - Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I'm struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: - Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simply initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. - It separates the 'data type' away from the actual value. If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.

was (Author: cmale): Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I'm struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: - Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simply initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. - It separates the 'data type' away from the actual value. If these properties were to stay on StorableFieldType, I can't really see the need for a StorableFieldType.
Break out StorableField from IndexableField --- Key: LUCENE-3312 URL: https://issues.apache.org/jira/browse/LUCENE-3312 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Fix For: Field Type branch

In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up to use their own documents instead of the user-space impls we provide in oal.document. Similarly, with LUCENE-3309, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor. But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables -- one for the IndexableFields and one for the StorableFields. Either can be null. One downside is a possible perf hit for fields that are both indexed and stored (ie, we visit them twice, look up their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in the API.
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113988#comment-13113988 ] Michael McCandless commented on LUCENE-3449: {quote} If adherence to BitSet is not an issue why not:

for (int i = bs.firstSetBit(); i >= 0; i = bs.nextSetBitAfter(i)) {

this seems clearer on method naming, has a single if... and I think could be implemented nearly identically to what's already in the code. We can run microbenchmarks for fun and see what comes out better and by what margin. {quote}

Ooh I love that! If in fact we can achieve such clean code (above), a clean API (all methods require an in-bounds index), and not incur added cost in nextSetBitAfter (vs the nextSetBit we have today) then I agree this would be the best of all worlds. I think we should give the sentinel a name (eg FBS.END)? Then the end condition can be {{i != FBS.END}}.

Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book --- Key: LUCENE-3449 URL: https://issues.apache.org/jira/browse/LUCENE-3449 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.4, 4.0 Reporter: Uwe Schindler Priority: Minor Attachments: LUCENE-3449.patch

The usage pattern for nextSetBit/prevSetBit is the following:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=bs.nextSetBit(i+1)) {
  // operate on index i here
}
{code}

The problem is that the i+1 at the end can be bs.length(), but the code in nextSetBit does not allow this (same applies to prevSetBit(0)). The above usage pattern is in every programming book, so it should really be supported. The check has to be done in all cases (with the current impl, in the calling code). If the check is done inside xxxSetBit() it can also be optimized to be called only seldom and not all the time, unlike in the ugly-looking replacement that's currently needed:

{code:java}
for(int i=bs.nextSetBit(0); i>=0; i=(i<bs.length()-1) ? bs.nextSetBit(i+1) : -1) {
  // operate on index i here
}
{code}

We should change this and allow out-of-bounds indexes for those two methods (they already do some checks in that direction). Enforcing this with an assert is unusable on the client side. The test code for FixedBitSet also uses this; horrible. Please support the common usage pattern for BitSets.
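A hedged sketch of what a named sentinel might look like, using java.util.BitSet as a stand-in backing store; END, the class, and the loop shape are assumptions mirroring DocIdSetIterator.NO_MORE_DOCS (Integer.MAX_VALUE), not committed API:

```java
import java.util.BitSet;

// Sketch of the named-sentinel idea: nextSetBit returns END
// (Integer.MAX_VALUE, like DocIdSetIterator.NO_MORE_DOCS) instead of
// -1, so the loop tests a single condition per cycle. java.util.BitSet
// stands in for FixedBitSet; END and the class are assumed names.
public class SentinelBits {
    public static final int END = Integer.MAX_VALUE;

    private final BitSet bs;

    public SentinelBits(BitSet bs) {
        this.bs = bs;
    }

    public int nextSetBit(int from) {
        int i = bs.nextSetBit(from);
        return i < 0 ? END : i;
    }

    // Sum of all set-bit indexes; the loop exits before ever calling
    // nextSetBit(END + 1), so END never overflows anything.
    public static int sum(SentinelBits bits) {
        int total = 0;
        for (int i = bits.nextSetBit(0); i != END; i = bits.nextSetBit(i + 1)) {
            total += i;
        }
        return total;
    }
}
```

With a MAX_VALUE sentinel, `i != END` also composes naturally with range checks like `i < limit`, which a -1 sentinel does not.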
Re: [jira] [Commented] (LUCENE-3435) Create a Size Estimator model for Lucene and Solr
What about putting this in Google Docs for easy collaboration? Patching an Excel file will be tough to coordinate. On Sep 24, 2011, at 9:11, Grant Ingersoll (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/LUCENE-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113967#comment-13113967 ] Grant Ingersoll commented on LUCENE-3435: - A patch would be great for all of these things. Thanks! Create a Size Estimator model for Lucene and Solr - Key: LUCENE-3435 URL: https://issues.apache.org/jira/browse/LUCENE-3435 Project: Lucene - Java Issue Type: Task Components: core/other Affects Versions: 4.0 Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It is often handy to be able to estimate the amount of memory and disk space that both Lucene and Solr use, given certain assumptions. I intend to check in an Excel spreadsheet that allows people to estimate memory and disk usage for trunk. I propose to put it under dev-tools, as I don't think it should be official documentation just yet and like the IDE stuff, we'll see how well it gets maintained. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3453) remove IndexDocValuesField
remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir It's confusing how we present CSF functionality to the user: it's actually not a field but an attribute of a field, like STORED or INDEXED. Otherwise, it's really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113997#comment-13113997 ] Chris Male commented on LUCENE-3453: I'm not sure what the better alternative is, but +1 to removing this class.
[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2769: -- Attachment: SOLR-2769-branch_3x.patch Attaching branch_3x patch, identical except for the package name for WhitespaceTokenizer HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2769: -- Attachment: SOLR-2769-branch_3x.patch Same for branch HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2621: Attachment: LUCENE-2621_rote.patch Here is a minimal 'rote refactor' for stored fields; there is a lot more to do (e.g. filenames/extensions should come from the codec, segmentmerger optimizations (bulk merging) should not be in the API but customized by the codec, the codec name (format) of fields should be recorded in the index, we should implement a simpletext version and refactor/generalize, ...) but more importantly, I think we need to restructure the class hierarchy: Codec is a per-field thing currently but I think the name Codec should represent the entire index... maybe what is Codec now should be named FieldCodec? maybe the parts of CodecProvider (e.g. segmentinfosreader, storedfields, etc.) should be moved to this new Codec class? in this world maybe the PreFlex codec for example returns its hardcoded representation for every field, since in 3.x this stuff is *not* per field, and with more of the back compat code refactored down into PreFlex. Would be good to come up with a nice class naming/hierarchy that represents reality here. Extend Codec to handle also stored fields and term vectors -- Key: LUCENE-2621 URL: https://issues.apache.org/jira/browse/LUCENE-2621 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Andrzej Bialecki Assignee: Robert Muir Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2621_rote.patch Currently Codec API handles only writing/reading of term-related data, while stored fields data and term frequency vector data writing/reading is handled elsewhere. I propose to extend the Codec API to handle this data as well. -- This message is automatically generated by JIRA. 
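The restructuring Robert sketches above (a whole-index Codec whose per-field part becomes a "FieldCodec", with a 3.x-style PreFlex codec returning one hardcoded format for every field) could look roughly like the following. All names here are hypothetical illustrations of the proposal, not the committed Lucene 4.0 API:

```java
public class CodecSketch {
    // Hypothetical per-field postings format (what "Codec" is today).
    interface FieldCodec {
        String name();
    }

    // Hypothetical whole-index Codec: owns the per-field mapping, and would
    // also grow stored-fields/segment-infos responsibilities per the issue.
    interface Codec {
        String name();
        FieldCodec fieldCodec(String field);
    }

    // A 3.x-style codec is *not* per-field: every field gets the same
    // hardcoded representation, so back-compat logic stays inside it.
    static class PreFlexStyleCodec implements Codec {
        private final FieldCodec only = () -> "PreFlex";
        public String name() { return "PreFlex"; }
        public FieldCodec fieldCodec(String field) { return only; }
    }

    public static void main(String[] args) {
        Codec c = new PreFlexStyleCodec();
        System.out.println(c.fieldCodec("title").name());
        // Same instance for every field: the per-field machinery is bypassed.
        System.out.println(c.fieldCodec("body") == c.fieldCodec("title"));
    }
}
```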
[jira] [Updated] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2769: -- Attachment: SOLR-2769.patch Better Javadoc with example XML and link to dictionaries HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl resolved SOLR-2769. --- Resolution: Fixed Checked in for trunk and 3x HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114022#comment-13114022 ] Michael McCandless commented on LUCENE-2621: Awesome! I think, like the postings, we can add a .merge() method, and the impl for that would do bulk-merge when it can? On the restructuring, maybe we can go back to a PerFieldCodecWrapper, which is itself a Codec? This would simplify CodecProvider back to just being a name -> Codec instance provider? We would still use SegmentCodecs/FieldInfo(s) to compute/record the codecID, though in theory this could become private to PFCW once it's a Codec again.
[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114024#comment-13114024 ] Robert Muir commented on LUCENE-2621: - I think so, assuming FieldInfos etc. are *also* read/written by the codec. Then I think PFCW could be an abstract class that writes per-field configuration into the index, but for example PreFlexCodec would *not* extend this class, as a 3.x index is the same codec across all fields. I think if we do things this way we have a lot more flexibility with backwards compatibility, instead of all this if-then-else conditional version-checking code when reading these files... Really, for example, if someone wanted to make a Codec that reads Lucene 2.x indexes (compressed fields and all), they should be able to do this if we reorganize this right.
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025 ] Dawid Weiss commented on LUCENE-3449: - Eh... I shouldn't be throwing suggestions not backed up by patches... ;) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :) wants to give it a take, go ahead.
[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025 ] Dawid Weiss edited comment on LUCENE-3449 at 9/24/11 5:54 PM: -- Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :)) wants to give it a take, go ahead. was (Author: dweiss): Eh... I shouldn't be throwing suggestions not backed up by patches... ;) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :) wants to give it a take, go ahead.
[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114025#comment-13114025 ] Dawid Weiss edited comment on LUCENE-3449 at 9/24/11 5:55 PM: -- Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :)) wants to give it a go, go ahead. was (Author: dweiss): Eh... I shouldn't be throwing suggestions not backed up by patches... ;)) I'm working on something else tonight, but I'll add it to my queue. If anybody (Uwe, Uwe, Uwe! :)) wants to give it a take, go ahead.
[jira] [Commented] (LUCENE-2621) Extend Codec to handle also stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114036#comment-13114036 ] Robert Muir commented on LUCENE-2621: - I created a branch (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene2621) for extending and refactoring the codec API to cover more portions of the index... I think it would be really nice to flesh this out for 4.0.
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114037#comment-13114037 ] Simon Willnauer commented on LUCENE-3453: - +1
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114041#comment-13114041 ] Robert Muir commented on LUCENE-1536: - I didn't look too hard here at what's going on, but maybe we could use the RandomAccess marker interface from the JDK? if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). 
* Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
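The two strategies benchmarked above can be sketched with plain Java stand-ins. The Bits interface and the leapfrog loop below are simplified illustrations of the random-access vs. iterator styles, not the actual Lucene API:

```java
import java.util.BitSet;

public class FilterApplication {
    // Minimal stand-in for a random-access filter view (cf. Lucene's Bits).
    interface Bits {
        boolean get(int index);
    }

    // "High" method: the scorer drives iteration and the filter is
    // consulted per candidate doc via random access.
    static int countHitsRandomAccess(int[] scorerDocs, Bits filter) {
        int hits = 0;
        for (int doc : scorerDocs) {
            if (filter.get(doc)) hits++;
        }
        return hits;
    }

    // Iterator baseline: the filter is exposed as a sorted doc-id stream
    // and leapfrogged against the scorer's docs.
    static int countHitsIterator(int[] scorerDocs, BitSet filter) {
        int hits = 0;
        int f = filter.nextSetBit(0);
        for (int doc : scorerDocs) {
            while (f >= 0 && f < doc) f = filter.nextSetBit(f + 1);
            if (f == doc) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(2); filter.set(5); filter.set(9);
        int[] scorerDocs = {1, 2, 3, 5, 8, 9};
        Bits randomAccess = filter::get;
        // Both strategies accept the same docs; the benchmark compares
        // their cost, not their results.
        System.out.println(countHitsRandomAccess(scorerDocs, randomAccess)); // 3
        System.out.println(countHitsIterator(scorerDocs, filter));           // 3
    }
}
```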
RE: Prettify JS and CSS excluded from Javadocs
Hi Shai, Sure, that sounds fine to me. Steve

From: Shai Erera [mailto:ser...@gmail.com] Sent: Saturday, September 24, 2011 3:12 AM To: dev@lucene.apache.org Subject: Re: Prettify JS and CSS excluded from Javadocs

Hi Steve, As I noted before, jarring prettify won't solve the problem entirely, as the references in the HTML will point to an incorrect location. Why don't we just package prettify.js in the jar like we do with stylesheet+prettify.css? We only need this .js as we only write in Java ... It's a tiny file, and it will simplify the whole process. What do you think? Shai

On Thu, Sep 22, 2011 at 5:05 PM, Steven A Rowe sar...@syr.edu wrote: The patch I gave for lucene/contrib-build.xml's javadocs target was wrong (I placed the nested tag outside of the jarify invocation). Here's a fixed patch:

Index: lucene/contrib/contrib-build.xml
===================================================================
--- lucene/contrib/contrib-build.xml    (revision 1165174)
+++ lucene/contrib/contrib-build.xml    (revision )
@@ -95,7 +95,11 @@
       <packageset dir="${src.dir}"/>
     </sources>
   </invoke-javadoc>
-  <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar"/>
+  <jarify basedir="${javadoc.dir}/contrib-${name}" destfile="${build.dir}/${final.name}-javadoc.jar">
+    <nested>
+      <fileset dir="${prettify.dir}"/>
+    </nested>
+  </jarify>
 </sequential>
</target>
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114055#comment-13114055 ] Uwe Schindler commented on LUCENE-3449: --- bq (Uwe, Uwe, Uwe! :)) Yes, but today is/was freetime :-)
[jira] [Issue Comment Edited] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114055#comment-13114055 ] Uwe Schindler edited comment on LUCENE-3449 at 9/24/11 7:45 PM: bq. (Uwe, Uwe, Uwe! :)) Yes, but today is/was freetime :-) was (Author: thetaphi): bq (Uwe, Uwe, Uwe! :)) Yes, but today is/was freetime :-)
[jira] [Commented] (LUCENE-3449) Fix FixedBitSet.nextSetBit/prevSetBit to support the common usage pattern in every programming book
[ https://issues.apache.org/jira/browse/LUCENE-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114057#comment-13114057 ] Dawid Weiss commented on LUCENE-3449: - I was just teasing you, enjoy your weekend :)
[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114062#comment-13114062 ] Jan Høydahl commented on SOLR-2769: --- Updated documentation: http://wiki.apache.org/solr/Hunspell http://wiki.apache.org/solr/HunspellStemFilterFactory http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters http://wiki.apache.org/solr/LanguageAnalysis#Stemming HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2792) Allow case insensitive Hunspell stemming
Allow case insensitive Hunspell stemming Key: SOLR-2792 URL: https://issues.apache.org/jira/browse/SOLR-2792 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.5, 4.0 Reporter: Jan Høydahl Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3 Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus needs an option to allow case insensitive matching of the dictionaries. Imagine a query for microsofts. It will never be stemmed to the dictionary word Microsoft because of the case difference. This problem cannot be fixed by putting LowercaseFilter before Hunspell. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2792) Allow case insensitive Hunspell stemming
[ https://issues.apache.org/jira/browse/SOLR-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114067#comment-13114067 ] Jan Høydahl commented on SOLR-2792: --- Propose an option ignoreCase=true for HunspellStemFilterFactory, which effectively lowercases everything before matching. Allow case insensitive Hunspell stemming Key: SOLR-2792 URL: https://issues.apache.org/jira/browse/SOLR-2792 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.5, 4.0 Reporter: Jan Høydahl Same as http://code.google.com/p/lucene-hunspell/issues/detail?id=3 Hunspell dictionaries are by nature case sensitive. The Hunspell stemmer thus needs an option to allow case insensitive matching of the dictionaries. Imagine a query for microsofts. It will never be stemmed to the dictionary word Microsoft because of the case difference. This problem cannot be fixed by putting LowercaseFilter before Hunspell. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
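A sketch of how the proposed option might look in a schema.xml analyzer chain. The ignoreCase attribute is the option proposed in this comment (not committed at the time of writing), and the dictionary/affix file names are placeholders:

```
<filter class="solr.HunspellStemFilterFactory"
        dictionary="en_GB.dic"
        affix="en_GB.aff"
        ignoreCase="true"/>
```

With lowercasing applied both to the dictionary at load time and to tokens at match time, a token like "microsofts" could then match the dictionary entry "Microsoft".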
[jira] [Commented] (SOLR-1895) ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time
[ https://issues.apache.org/jira/browse/SOLR-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114068#comment-13114068 ] Karl Wright commented on SOLR-1895: --- I now have 4 versions of the plugin, all of which are still SearchComponents. The four are:
(1) uses filters and wildcards
(2) uses queries and wildcards
(3) uses filters and a special token to mark security fields that are empty
(4) uses queries and a special token to mark security fields that are empty
I've done some timings, using 5000 documents, a realistic number of user tokens (100), for 3000 user queries. The numbers are interesting:
Filter + wildcard = 193948ms
Query + wildcard = 26137ms
Filter + token = 39012ms
Query + token = 25078ms
Since the current implementation is the first, and that's obviously by far the worst performance-wise, I recommend switching to a query-based implementation regardless of whether it's a SearchComponent or query parser plugin. ManifoldCF SearchComponent plugin for enforcing ManifoldCF security at search time -- Key: SOLR-1895 URL: https://issues.apache.org/jira/browse/SOLR-1895 Project: Solr Issue Type: New Feature Components: SearchComponents - other Reporter: Karl Wright Labels: document, security, solr Fix For: 3.5, 4.0 Attachments: LCFSecurityFilter.java, LCFSecurityFilter.java, LCFSecurityFilter.java, LCFSecurityFilter.java, SOLR-1895-service-plugin.patch, SOLR-1895-service-plugin.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch, SOLR-1895.patch I've written an LCF SearchComponent which filters returned results based on access tokens provided by LCF's authority service.
The component requires you to configure the appropriate authority service URL base, e.g.:
<!-- LCF document security enforcement component -->
<searchComponent name="lcfSecurity" class="LCFSecurityFilter">
  <str name="AuthorityServiceBaseURL">http://localhost:8080/lcf-authority-service</str>
</searchComponent>
Also required are the following schema.xml additions:
<!-- Security fields -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
Finally, to tie it into the standard request handler, it seems to need to run last:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <arr name="last-components">
    <str>lcfSecurity</str>
  </arr>
  ...
I have not set a package for this code. Nor have I been able to get it reviewed by someone as conversant with Solr as I would prefer. It is my hope, however, that this module will become part of the standard Solr 1.5 suite of search components, since that would tie it in with LCF nicely. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114071#comment-13114071 ] Robert Muir commented on SOLR-2769: --- I think we should be more cautious about recommending Hunspell on the wiki here, for these reasons:
* The algorithm relies entirely on the quality of the dictionary; for many of these languages the dictionary is not good for this purpose: no affix rules, just a list of words, etc.
* Even in the case where a particular dictionary is pretty good, there are a number of problems: the primary use case of these dictionaries is spellchecking, and that doesn't necessarily imply that the rules+affix combinations yield good results here.
* Finally, there are the usual problems of a dictionary-based technique: languages are not static, and there is absolutely no handling for OOV words.
HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3454) rename optimize to a less cool-sounding name
rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? but this probably just results in wasted time and resources. maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114121#comment-13114121 ] Chris Male commented on LUCENE-3453: Hey Robert, Are you putting something together on this, or should I give it a shot? remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Its confusing how we present CSF functionality to the user, its actually not a field but an attribute of a field like STORED or INDEXED. Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114122#comment-13114122 ] Robert Muir commented on LUCENE-3453: - please take it! it was just an idea after some discussion with Andrzej, who was experimenting in Luke (I think if you are not careful its easy to get norms with your indexdocvaluesfield?) also I noticed in the tests that the added dv fields were hitting up Similarity... I have no ideas on naming or api, maybe UNINVERTED? remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Its confusing how we present CSF functionality to the user, its actually not a field but an attribute of a field like STORED or INDEXED. Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3453) remove IndexDocValuesField
[ https://issues.apache.org/jira/browse/LUCENE-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male reassigned LUCENE-3453: -- Assignee: Chris Male remove IndexDocValuesField -- Key: LUCENE-3453 URL: https://issues.apache.org/jira/browse/LUCENE-3453 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Robert Muir Assignee: Chris Male Its confusing how we present CSF functionality to the user, its actually not a field but an attribute of a field like STORED or INDEXED. Otherwise, its really hard to think about CSF because there is a mismatch between the APIs and the index format. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114124#comment-13114124 ] Robert Muir commented on LUCENE-3454: - if anyone wants to take this, don't hesitate! rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? but this probably just results in wasted time and resources. maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
SolrCloud Branch
FYI I'm going to make another SolrCloud branch to collect some ideas for the indexing side. I've got stuff I'm playing with, like leader election, that heavily intersects with other stuff I'm playing with, so juggling patches would just be a nightmare, verging on impossible. - Mark Miller lucidimagination.com 2011.lucene-eurocon.org | Oct 17-20 | Barcelona - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: SolrCloud Branch
+1, we shouldn't hesitate to make branches, I think. It makes it easier to collaborate. On Sep 24, 2011 8:39 PM, Mark Miller markrmil...@gmail.com wrote: FYI I'm going to make another SolrCloud branch to collect some ideas for the indexing side. I've got stuff I'm playing with, like leader election, that heavily intersects with other stuff I'm playing with, so juggling patches would just be a nightmare, verging on impossible. - Mark Miller lucidimagination.com 2011.lucene-eurocon.org | Oct 17-20 | Barcelona - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: showItems for LRUCache missing?
Let's add it. Add a SOLR ticket in JIRA. On 9/22/11 7:59 AM, Eric Pugh ep...@opensourceconnections.com wrote: Folks, I was trying to figure out what explicitly was in my various Solr caches. After building a custom request handler and using Reflection to gain access to the private Map's in the caches, I realized that showItems is an option on both fieldValueCache and FastLRUCache, but isn't an option on LRUCache... Is there a reason for that? If I wanted to submit a patch, is it best to do it against Trunk? Eric - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Can we auto generate ID?
It would be a great feature if the ID could be auto-generated by a GUID inside the update or DIH handlers. Thoughts?
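As a sketch of what the generation itself could look like: java.util.UUID already produces GUID-style identifiers. How this would be wired into the update or DIH handlers is left open, so the class and method names below are purely illustrative:

```java
import java.util.UUID;

public class AutoId {
    // Illustrative only: produce a GUID-style id for a document that
    // arrives without one. The Solr wiring (update handler / DIH) is
    // not shown; this is just the id generation.
    static String generateId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Canonical UUID text form, e.g. "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
        System.out.println(generateId().length()); // prints 36
    }
}
```

The interesting design questions are elsewhere: whether the id is assigned before or after distributed routing, and whether re-posting the same document should get a new id or reuse the old one.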
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114133#comment-13114133 ] Otis Gospodnetic commented on LUCENE-3454: -- Would it be wise to stick with a name less specific than collapseSegments for example, in order not to have an incorrect name that requires another renaming when this command ends up doing something new in the future? rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? but this probably just results in wasted time and resources. maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114136#comment-13114136 ] Chris Male commented on LUCENE-3396: I haven't actually committed yet. I was hoping you'd have a chance to review before I did. I'll now commit. Make TokenStream Reuse Mandatory for Analyzers -- Key: LUCENE-3396 URL: https://issues.apache.org/jira/browse/LUCENE-3396 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet. I plan to attack this in the following way: - Collapse the logic of ReusableAnalyzerBase into Analyzer - Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field. - Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused). - Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2757) Switch min(a,b) function to min(a,b,...)
[ https://issues.apache.org/jira/browse/SOLR-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2757: Attachment: SOLR-2757-2.patch Test cases Switch min(a,b) function to min(a,b,...) Key: SOLR-2757 URL: https://issues.apache.org/jira/browse/SOLR-2757 Project: Solr Issue Type: Improvement Affects Versions: 3.4 Reporter: Bill Bell Priority: Minor Attachments: SOLR-2757-2.patch Original Estimate: 1h Remaining Estimate: 1h Would like the ability to use min(1,5,10,11) to return 1. To do that today it is parenthesis nightmare: min(min(min(1,5),10),11) Should extend max() as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2757) Switch min(a,b) function to min(a,b,...)
[ https://issues.apache.org/jira/browse/SOLR-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Bell updated SOLR-2757: Attachment: (was: SOLR-2757.patch) Switch min(a,b) function to min(a,b,...) Key: SOLR-2757 URL: https://issues.apache.org/jira/browse/SOLR-2757 Project: Solr Issue Type: Improvement Affects Versions: 3.4 Reporter: Bill Bell Priority: Minor Attachments: SOLR-2757-2.patch Original Estimate: 1h Remaining Estimate: 1h Would like the ability to use min(1,5,10,11) to return 1. To do that today it is parenthesis nightmare: min(min(min(1,5),10),11) Should extend max() as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
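The variadic form being requested is straightforward at the Java level. A minimal sketch of the semantics (the actual patch works in terms of Solr function-query ValueSources, which this does not show):

```java
public class VarMin {
    // Sketch of min(a, b, ...): fold Math.min over the trailing arguments.
    static float min(float first, float... rest) {
        float m = first;
        for (float v : rest) m = Math.min(m, v);
        return m;
    }

    public static void main(String[] args) {
        // min(1,5,10,11) replaces the nested min(min(min(1,5),10),11)
        System.out.println(min(1, 5, 10, 11)); // prints 1.0
    }
}
```

Extending max() is the same fold with Math.max.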
[jira] [Commented] (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet
[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114140#comment-13114140 ] Robert Muir commented on LUCENE-2279: - since we have merged lucene and solr, and Chris has fixed analyzer to have a performant api, not by experts but by default, I think we can mark this issue resolved? eliminate pathological performance on StopFilter when using a Set&lt;String&gt; instead of CharArraySet - Key: LUCENE-2279 URL: https://issues.apache.org/jira/browse/LUCENE-2279 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: thushara wijeratna Priority: Minor passing a Set&lt;String&gt; to a StopFilter instead of a CharArraySet results in a very slow filter. This is because for each document, Analyzer.tokenStream() is called, which ends up calling the StopFilter (if used). And if a regular Set&lt;String&gt; is used in the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:
public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase) {
  super(input);
  if (stopWords instanceof CharArraySet) {
    this.stopWords = (CharArraySet)stopWords;
  } else {
    this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
    this.stopWords.addAll(stopWords);
  }
  this.enablePositionIncrements = enablePositionIncrements;
  init();
}
I feel we should make the StopFilter signature specific, as in specifying CharArraySet vs Set&lt;String&gt;, and there should be a JavaDoc warning on using the other variants of the StopFilter, as they all result in a copy for each invocation of Analyzer.tokenStream(). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
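The cost being described is one full copy per Analyzer.tokenStream() call, i.e. per document. A self-contained sketch of the hazard using plain JDK collections (CharArraySet is Lucene's class, so a HashSet subclass stands in for the "optimized" set here; all names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopSetDemo {
    static int copies = 0; // counts how often the stop set gets rebuilt

    // Marker for the already-converted form (CharArraySet in Lucene).
    static class OptimizedSet extends HashSet<String> {}

    // Stand-in for StopFilter's ctor behavior: a plain Set<String> gets
    // copied element-by-element; an already-optimized set is used as-is.
    static Set<String> asOptimized(Set<String> stopWords) {
        if (stopWords instanceof OptimizedSet) return stopWords;
        copies++;
        OptimizedSet s = new OptimizedSet();
        s.addAll(stopWords);
        return s;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "the", "of"));

        // Anti-pattern: the plain set is re-copied on every call,
        // once per document analyzed.
        for (int doc = 0; doc < 3; doc++) asOptimized(plain);
        System.out.println("plain set copies: " + copies); // prints 3

        // Fix: convert once up front, then reuse the optimized instance.
        copies = 0;
        Set<String> optimized = asOptimized(plain);          // the single copy
        for (int doc = 0; doc < 3; doc++) asOptimized(optimized); // no further copies
        System.out.println("pre-converted copies: " + copies); // prints 1
    }
}
```

The caller-side fix is the second half of main(): build the optimized set once and pass that same instance to every filter construction, which is exactly what tightening the signature to CharArraySet would force.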
[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114148#comment-13114148 ] Robert Muir commented on LUCENE-3055: - Chris has made a ton of progress here, I think we are very close, though it would be good to revisit LUCENE-2788 in the future and ensure that for 4.0 charfilters have a reusable API as well (this is currently not the case). LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers -- Key: LUCENE-3055 URL: https://issues.apache.org/jira/browse/LUCENE-3055 Project: Lucene - Java Issue Type: Bug Components: modules/analysis Affects Versions: 3.1 Reporter: Ian Soboroff LUCENE-2372 and LUCENE-2389 marked all analyzers as final. This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification, e.g. to tokenStream(). These issues don't indicate a new method of doing this. The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114149#comment-13114149 ] Robert Muir commented on LUCENE-3396: - +1! Make TokenStream Reuse Mandatory for Analyzers -- Key: LUCENE-3396 URL: https://issues.apache.org/jira/browse/LUCENE-3396 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet. I plan to attack this in the following way: - Collapse the logic of ReusableAnalyzerBase into Analyzer - Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field. - Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused). - Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3396) Make TokenStream Reuse Mandatory for Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114151#comment-13114151 ] Chris Male commented on LUCENE-3396: Committed revision 1175297. I don't want to mark this as resolved just yet. I want to spin off another sub-task to move all consumers over to reusableTokenStream (and then rename it back to tokenStream). Make TokenStream Reuse Mandatory for Analyzers -- Key: LUCENE-3396 URL: https://issues.apache.org/jira/browse/LUCENE-3396 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3396-forgotten.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-rab.patch, LUCENE-3396-remaining-analyzers.patch, LUCENE-3396-remaining-merging.patch In LUCENE-2309 it became clear that we'd benefit a lot from Analyzer having to return reusable TokenStreams. This is a big chunk of work, but its time to bite the bullet. I plan to attack this in the following way: - Collapse the logic of ReusableAnalyzerBase into Analyzer - Add a ReuseStrategy abstraction to Analyzer which controls whether the TokenStreamComponents are reused globally (as they are today) or per-field. - Convert all Analyzers over to using TokenStreamComponents. I've already seen that some of the TokenStreams created in tests need some work to be reusable (even if they aren't reused). - Remove Analyzer.reusableTokenStream and convert everything over to using .tokenStream (which will now be returning reusable TokenStreams). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2769) HunspellStemFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114152#comment-13114152 ] Chris Male commented on SOLR-2769: -- Hey Jan, I fixed the @see javadoc in the factory. Both IntelliJ and ant javadoc reported that you can't do @see like that (with URLs). HunspellStemFilterFactory - Key: SOLR-2769 URL: https://issues.apache.org/jira/browse/SOLR-2769 Project: Solr Issue Type: New Feature Components: Schema and Analysis Reporter: Jan Høydahl Labels: stemming Fix For: 3.5, 4.0 Attachments: SOLR-2769-branch_3x.patch, SOLR-2769-branch_3x.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch, SOLR-2769.patch Now that Hunspell stemmer is added to Lucene (LUCENE-3414), let's make a Factory for it in Solr -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3455) All Analysis Consumers should use reusableTokenStream
All Analysis Consumers should use reusableTokenStream - Key: LUCENE-3455 URL: https://issues.apache.org/jira/browse/LUCENE-3455 Project: Lucene - Java Issue Type: Sub-task Reporter: Chris Male With Analyzer now using TokenStreamComponents, there's no reason for Analysis consumers to use tokenStream() (it just gives bad performance). Consequently all consumers will be moved over to using reusableTokenStream(). The only challenge here is that reusableTokenStream throws an IOException which many consumers are not rigged to deal with. Once all consumers have been moved, we can rename reusableTokenStream() back to tokenStream(). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3455) All Analysis Consumers should use reusableTokenStream
[ https://issues.apache.org/jira/browse/LUCENE-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114153#comment-13114153 ] Robert Muir commented on LUCENE-3455: - +1, there is a lot of crazy code around this area, consumers catching the exception from reusableTokenStream() and falling back to tokenStream() and other silly things. All Analysis Consumers should use reusableTokenStream - Key: LUCENE-3455 URL: https://issues.apache.org/jira/browse/LUCENE-3455 Project: Lucene - Java Issue Type: Sub-task Components: modules/analysis Reporter: Chris Male With Analyzer now using TokenStreamComponents, theres no reason for Analysis consumers to use tokenStream() (it just gives bad performance). Consequently all consumers will be moved over to using reusableTokenStream(). The only challenge here is that reusableTokenStream throws an IOException which many consumers are not rigged to deal with. Once all consumers have been moved, we can rename reusableTokenStream() back to tokenStream(). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114154#comment-13114154 ] Robert Muir commented on LUCENE-3454: - Otis: that's a good point, currently optimize is just a request, we should probably figure out what it should be. Should it be collapseSegments, which is a well-defined request without a cool-sounding name? Or should it be something else, which gives you a more optimal configuration for search performance (I still think optimize is a bad name even for this)? Personally I suspect it's going to be hard to support this case, e.g. you would really need to know things like whether the user has an executionService set on the IndexSearcher, how big the threadpool is, and things like that to make an 'optimal' configuration... and we don't have a nice way of knowing that information today. rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Robert Muir I think users see the name optimize and feel they must do this, because who wants a suboptimal system? But this probably just results in wasted time and resources. Maybe rename to collapseSegments or something? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org