[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992382#comment-12992382 ]

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

bq. All tests pass. I'll commit this in a day or two if nobody objects.

+1

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field naming / ordering. IW carries FI instances over from previous segments, which also carries over field properties like isIndexed etc. While consistent field ordering per IW session appears to be important for bulk merging of stored fields etc., carrying over other properties can become problematic with Lucene's Codec support: codecs that rely on consistent properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field (using the field id within the file name). Yet, if a segment has no DocValues indexed but a previous segment in the same IW session had DocValues, FieldInfo#docValues will be true, since those values are reused from previous segments.
> We already work around this limitation in SegmentInfo with properties like hasVectors or hasProx, which is really something we should manage per codec and segment. Ideally FieldInfo would be managed per segment and codec such that its properties are valid per segment. It also seems necessary to bind FieldInfos to SegmentInfo logically, since it's really just per-segment metadata.

--
This message is automatically generated by JIRA.
- For more information on JIRA, see: http://www.atlassian.com/software/jira
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
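The carry-over problem the issue describes can be sketched with a toy model (illustrative classes only, not Lucene's real FieldInfo/SegmentInfo API): the global name/number assignment survives the IW session so field ordering stays consistent, while per-segment copies reset codec-visible flags such as docValues.

```java
// Toy model of the proposed split (names are illustrative, not Lucene's API):
// the global name->number mapping survives the IW session, per-segment flags do not.
public class FieldInfoSketch {
    final String name;
    final int number;          // global, stable across segments (bulk merge relies on it)
    boolean hasDocValues;      // segment-private: must not leak into the next segment

    FieldInfoSketch(String name, int number) {
        this.name = name;
        this.number = number;
    }

    // Start a fresh per-segment FieldInfo: keep the global identity,
    // drop every property that only held for the previous segment.
    FieldInfoSketch forNewSegment() {
        return new FieldInfoSketch(name, number);
    }

    public static void main(String[] args) {
        FieldInfoSketch seg1 = new FieldInfoSketch("price", 0);
        seg1.hasDocValues = true;                     // segment 1 indexed DocValues
        FieldInfoSketch seg2 = seg1.forNewSegment();  // segment 2 did not
        assert seg2.number == 0;                      // ordering stays consistent
        assert !seg2.hasDocValues;                    // flag no longer carried over
        System.out.println("field number " + seg2.number + ", docValues=" + seg2.hasDocValues);
    }
}
```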
[jira] Updated: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
[ https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2911:
--------------------------------

    Attachment: LUCENE-2911.patch

Improved the patch by using a simpler De Morgan expression Steven came up with. I think this one is ready to commit.

> synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2911
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2911
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: Analysis
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.1
>         Attachments: LUCENE-2911.patch, LUCENE-2911.patch
>
> I'd like to do LUCENE-2906 (better CJK support for these tokenizers) for a future target such as 3.2. But in 3.1 I would like to do a little cleanup first and synchronize all these token types, etc.
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992476#comment-12992476 ]

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

I just integrated this patch into the docvalues branch, and it works like a charm! Nice work Michael, this brings docValues a huge step closer. All tests that failed before now pass; FieldInfo is reliable now.

> Track FieldInfo per segment instead of per-IW-session
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
[jira] Resolved: (LUCENE-1165) Reduce exposure of nightly build documentation
[ https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-1165.
-----------------------------------

    Resolution: Fixed

The corresponding INFRA-3389 was solved: https://hudson.apache.org/robots.txt

> Reduce exposure of nightly build documentation
> ----------------------------------------------
>
>                 Key: LUCENE-1165
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1165
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Website
>            Reporter: Doron Cohen
>            Assignee: Uwe Schindler
>            Priority: Minor
>
> From LUCENE-1157: "...the nightly build documentation is too prominent. A search for indexwriter api on Google or Yahoo! returns nightly documentation before released documentation." (https://issues.apache.org/jira/browse/LUCENE-1157?focusedCommentId=12565820#action_12565820)
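The INFRA fix works at the crawler level: a robots.txt on the nightly-build host tells search engines not to index those pages, so released documentation ranks first. A minimal sketch of the mechanism (the actual contents served at https://hudson.apache.org/robots.txt may differ):

```
# keep auto-generated nightly builds out of search indexes
User-agent: *
Disallow: /
```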
[jira] Closed: (LUCENE-1165) Reduce exposure of nightly build documentation
[ https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler closed LUCENE-1165.
---------------------------------

> Reduce exposure of nightly build documentation
>                 Key: LUCENE-1165
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1165
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992524#comment-12992524 ]

Michael Busch commented on LUCENE-2881:
---------------------------------------

Awesome, thanks for letting me know! I hope I'll be able to say the same about the RT branch after I try it there... :)

> Track FieldInfo per segment instead of per-IW-session
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
[ https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992531#comment-12992531 ]

Steven Rowe commented on LUCENE-2911:
-------------------------------------

bq. I think this one is ready to commit.

+1. I applied the patch, jflex generates properly, and the tests pass.

> synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
>                 Key: LUCENE-2911
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2911
[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
[ https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992601#comment-12992601 ]

Robert Muir commented on LUCENE-2911:
-------------------------------------

Committed revision 1068979. Now backporting...

> synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
>                 Key: LUCENE-2911
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2911
[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992614#comment-12992614 ]

Yonik Seeley commented on SOLR-2338:
------------------------------------

Yep, sounds like a great idea! Should we specify the similarity class in each fieldType that wants to use a non-default similarity:

{code}
<fieldType>
  <analyzer>...</analyzer>
  <similarity class="..."/>
</fieldType>
{code}

Or use named similarities and refer to them:

{code}
<fieldType>
  <analyzer>...</analyzer>
  <similarity name="short_text"/>
</fieldType>

<similarity name="short_text" class="..."/>
{code}

> improved per-field similarity integration into schema.xml
> ---------------------------------------------------------
>
>                 Key: SOLR-2338
>                 URL: https://issues.apache.org/jira/browse/SOLR-2338
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>
> Currently since LUCENE-2236, we can enable Similarity per-field, but in schema.xml there is only a 'global' factory for the SimilarityProvider. In my opinion this is too low-level, because to customize Similarity on a per-field basis you have to set your own CustomSimilarityProvider with <similarity class="..."/> and manage the per-field mapping yourself in Java code. Instead I think it would be better if you could just specify the Similarity in the FieldType, e.g. after <analyzer>.
> As far as the example goes, one idea from LUCENE-1360 was to make a short_text or metadata_text type, used by the various metadata fields in the example, that has better norm quantization for its shortness...
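The "manage the per-field mapping yourself in Java code" status quo that this issue wants to replace looks roughly like the following (illustrative sketch with placeholder types, not Solr's or Lucene's real SimilarityProvider API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (placeholder types, not the real API): the kind of
// CustomSimilarityProvider you currently have to write by hand, and which a
// <similarity> element inside <fieldType> would make unnecessary.
public class CustomSimilarityProviderSketch {
    static class Similarity {}                  // stand-in for the real Similarity class
    static final Similarity DEFAULT = new Similarity();
    static final Map<String, Similarity> PER_FIELD = new HashMap<>();
    static {
        PER_FIELD.put("title", new Similarity());   // e.g. better norm quantization
    }

    static Similarity get(String field) {
        return PER_FIELD.getOrDefault(field, DEFAULT);
    }

    public static void main(String[] args) {
        assert get("title") != DEFAULT;  // per-field override applies
        assert get("body") == DEFAULT;   // unmapped fields fall back to the default
        System.out.println("ok");
    }
}
```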
[jira] Created: (LUCENE-2912) remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, switch SweetSpot to per-field
remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, switch SweetSpot to per-field
----------------------------------------------------------------------------------------------------------

                Key: LUCENE-2912
                URL: https://issues.apache.org/jira/browse/LUCENE-2912
            Project: Lucene - Java
         Issue Type: Improvement
           Reporter: Robert Muir
            Fix For: 4.0

In LUCENE-2236 we switched Similarity to per-field (SimilarityProvider returns a per-field Similarity), but we didn't completely clean up there. I think we should now do this:
* SweetSpotSimilarity loses all its hashmaps. Instead, just configure one per field and return it in your SimilarityProvider. This means, for example, that all its TF factors can now be configured per-field too, not just the length normalization factors.
* computeNorm and scorePayload lose their field parameter, as it's redundant and confusing.
* The UOE'd obsolete lengthNorm is removed. I also updated javadocs that were pointing to it (this is bad!).
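The first bullet can be sketched as follows (hypothetical names, not the real SweetSpotSimilarity API): instead of one similarity instance holding per-field hashmaps internally, the provider hands out one pre-configured instance per field, so every factor, TF included, becomes per-field.

```java
import java.util.Map;

// Hypothetical sketch of the proposed shape (names are illustrative):
// one fully configured similarity per field, returned by the provider,
// instead of SweetSpotSimilarity keeping internal per-field hashmaps.
public class PerFieldSweetSpotSketch {
    static class SweetSpotish {
        final int lnMin, lnMax;        // length-norm plateau
        final double tfBase;           // TF factor, now also per-field
        SweetSpotish(int lnMin, int lnMax, double tfBase) {
            this.lnMin = lnMin; this.lnMax = lnMax; this.tfBase = tfBase;
        }
    }

    static final SweetSpotish DEFAULT = new SweetSpotish(1, 100, 1.0);
    static final Map<String, SweetSpotish> PER_FIELD = Map.of(
        "title", new SweetSpotish(1, 10, 1.5),
        "body",  new SweetSpotish(20, 200, 1.0));

    // what a SimilarityProvider#get(field) would hand back
    static SweetSpotish get(String field) {
        return PER_FIELD.getOrDefault(field, DEFAULT);
    }

    public static void main(String[] args) {
        assert get("title").lnMax == 10;     // per-field length normalization
        assert get("title").tfBase == 1.5;   // per-field TF factor too
        assert get("unknown") == DEFAULT;    // fallback for unconfigured fields
        System.out.println("ok");
    }
}
```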
[jira] Updated: (LUCENE-2912) remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, switch SweetSpot to per-field
[ https://issues.apache.org/jira/browse/LUCENE-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2912:
--------------------------------

    Attachment: LUCENE-2912.patch

Attached is an initial patch, all tests pass.

> remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, switch SweetSpot to per-field
>                 Key: LUCENE-2912
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2912
[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992670#comment-12992670 ]

Robert Muir commented on SOLR-2338:
-----------------------------------

Doesn't matter to me really, but what is the advantage of the named similarities? That would be a bit inconsistent with how you configure analyzers (and an additional level of indirection that might be confusing)... or am I missing something?

> improved per-field similarity integration into schema.xml
>                 Key: SOLR-2338
>                 URL: https://issues.apache.org/jira/browse/SOLR-2338
[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992676#comment-12992676 ]

Yonik Seeley commented on SOLR-2338:
------------------------------------

Other components in solrconfig use that indirection, but I'm fine w/ the approach taken by tokenizer / token filter config.

> improved per-field similarity integration into schema.xml
>                 Key: SOLR-2338
>                 URL: https://issues.apache.org/jira/browse/SOLR-2338
[jira] Created: (SOLR-2353) SpellCheckCollator uses org.mortbay.log.Log for logging
SpellCheckCollator uses org.mortbay.log.Log for logging
-------------------------------------------------------

                Key: SOLR-2353
                URL: https://issues.apache.org/jira/browse/SOLR-2353
            Project: Solr
         Issue Type: Bug
          Components: spellchecker
            Reporter: Sami Siren
            Priority: Trivial

SLF4J should be used instead.
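The fix is mechanical: replace the Jetty logger with an SLF4J logger obtained from LoggerFactory. A minimal sketch of the intended pattern (class body illustrative; relies on the slf4j-api jar Solr already uses elsewhere):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SpellCheckCollatorLoggingSketch {
    // SLF4J convention: one static logger per class, keyed by the class itself
    private static final Logger LOG =
        LoggerFactory.getLogger(SpellCheckCollatorLoggingSketch.class);

    void collate() {
        // was: org.mortbay.log.Log.info(...) -- Jetty's logger, not Solr's
        LOG.info("Exceeded spellcheck collation limit");
    }
}
```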
[jira] Updated: (SOLR-2353) SpellCheckCollator uses org.mortbay.log.Log for logging
[ https://issues.apache.org/jira/browse/SOLR-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated SOLR-2353:
-----------------------------

    Attachment: SOLR-2353.patch

> SpellCheckCollator uses org.mortbay.log.Log for logging
>                 Key: SOLR-2353
>                 URL: https://issues.apache.org/jira/browse/SOLR-2353
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992687#comment-12992687 ]

hao yan commented on LUCENE-2903:
---------------------------------

Hi Robert and Michael,

To test whether ByteBuffer/IntBuffer works better than manual int[]/byte[] conversion, I have now separated them into 3 different codecs. All of them use the same PForDelta implementation; they differ only in the IndexInput/IndexOutput handling:

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[] <-> byte[] manually. Its corresponding Java code is PForDeltaFixedIntBlockCodec.java.
2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[] <-> byte[] via ByteBuffer/IntBuffer. Its corresponding Java code is PForDeltaFixedIntBlockWithByteBufferCodec.java.
3. PatchedFrameOfRef5 - uses in.readInt() in a loop, so it needs no conversion. Its corresponding Java code is PForDeltaFixedIntBlockWithReadIntCodec.java.

I tested them against BulkVInt on MacOS. The detailed results are attached. The conclusions:

1) Yes, Michael and Robert, you guys are right! ByteBuffer/IntBuffer is faster than my manual byte[]/int[] conversion. I guess the reason I thought it was worse is that I had not separated the codecs before, so the test results were not stable due to JVM/JIT effects.
2) PatchedFrameOfRef4 is still worse than BulkVInt on many kinds of queries, although it seems to do better on fuzzy and wildcard queries.
3) PatchedFrameOfRef3/4/5 are all better than PatchedFrameOfRef and FrameOfRef for almost all queries.
4) The new patch is just uploaded; please check it out.

The following are the experimental results for 0.1M data.

(1) BulkVInt vs. PatchedFrameOfRef4 (with ByteBuffer, in.readBytes(..)):

Query                              QPS bulkVInt  QPS patchedFrameOfRef4  Pct diff
united states                            389.26                  361.79     -7.1%
united states~3                          234.52                  228.99     -2.4%
+nebraska +states                       1138.95                  992.06    -12.9%
+united +states                          670.69                  603.86    -10.0%
doctimesecnum:[1 TO 6]                   415.28                  447.83      7.8%
doctitle:.*[Uu]nited.*                   496.03                  522.47      5.3%
spanFirst(unit, 5)                      1176.47                 1086.96     -7.6%
spanNear([unit, state], 10, true)        502.26                  423.73    -15.6%
states                                  1612.90                 1453.49     -9.9%
u*d                                      167.95                  171.17      1.9%
un*d                                     260.69                  275.33      5.6%
uni*                                     602.41                  577.37     -4.2%
unit*                                   1016.26                 1041.67      2.5%
united states                            617.28                  549.45    -11.0%
united~0.6                                12.22                   12.93      5.9%
united~0.75                               53.88                   56.78      5.4%
unit~0.5                                  12.58                   13.19      4.9%
unit~0.7                                  52.41                   54.93      4.8%

(2) BulkVInt vs. PatchedFrameOfRef3 (my own int[] <-> byte[] conversion, still in.readBytes(..)):

Query                              QPS bulkVInt  QPS patchedFrameOfRef3  Pct diff
united states                            388.50                  363.24     -6.5%
united states~3                          234.80                  223.56     -4.8%
+nebraska +states                       1138.95                 1016.26    -10.8%
+united +states                          671.14                  607.90     -9.4%
doctimesecnum:[1 TO 6]                   418.24                  441.89      5.7%
doctitle:.*[Uu]nited.*                   489.00                  522.74      6.9%
spanFirst(unit, 5)                      1246.88                 1127.40     -9.6%
spanNear([unit, state], 10, true)        514.14                  473.71     -7.9%
states                                  1612.90                 1488.10     -7.7%
u*d                                      170.77                  167.31     -2.0%
un*d                                     261.37                  264.48      1.2%
uni*                                     609.38                  602.41     -1.1%
unit*                                   1028.81                 1052.63      2.3%
united states                            614.25                  564.33     -8.1%
united~0.6                                12.05                   12.11      0.5%
united~0.75                               53.16                   54.97      3.4%
unit~0.5                                  12.43                   12.50      0.6%
unit~0.7                                  52.81                   53.23      0.8%

(3) BulkVInt vs. PatchedFrameOfRef5 (in.readInt() in a loop, no conversion):

Query                              QPS bulkVInt  QPS patchedFrameOfRef5  Pct diff
united states                            391.24                  366.70     -6.3%
united states~3                          235.40                  235.07     -0.1%
+nebraska +states                       1137.66                 1072.96     -5.7%
+united +states                          673.40                  642.26     -4.6%
doctimesecnum:[1 TO 6]                   414.25                  407.66     -1.6%
doctitle:.*[Uu]nited.*                   492.61                  538.21      9.3%
spanFirst(unit, 5)                      1253.13                 1175.09     -6.2%
spanNear([unit, state], 10, true)        511.25                  483.56     -5.4%
states                                  1642.04                 1490.31     -9.2%
u*d                                      166.78                  160.28     -3.9%
un*d                                     261.64                  255.36     -2.4%
uni*                                     609.38                  593.47     -2.6%
unit*                                   1026.69
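For reference, the ByteBuffer/IntBuffer bulk conversion that distinguishes PatchedFrameOfRef4 from the manual variant can be sketched like this (standalone sketch; the real codec works against IndexInput/IndexOutput, and the byte order must match what the writer used):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

// Standalone sketch of int[] <-> byte[] conversion via NIO buffer views,
// the approach used by the WithByteBuffer codec variant.
public class IntByteConversion {
    // encode a block of ints into the byte[] an IndexOutput would write
    static byte[] intsToBytes(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocate(ints.length * Integer.BYTES);
        buf.asIntBuffer().put(ints);     // bulk put, no per-element shifting
        return buf.array();
    }

    // decode the byte[] read back by IndexInput#readBytes into ints
    static int[] bytesToInts(byte[] bytes) {
        IntBuffer view = ByteBuffer.wrap(bytes).asIntBuffer();
        int[] ints = new int[view.remaining()];
        view.get(ints);                  // bulk get
        return ints;
    }

    public static void main(String[] args) {
        int[] block = {1, 2, 300, 70000};
        assert Arrays.equals(block, bytesToInts(intsToBytes(block)));
        System.out.println(Arrays.toString(bytesToInts(intsToBytes(block))));
    }
}
```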
[jira] Commented: (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992690#comment-12992690 ]

Grant Ingersoll commented on SOLR-1942:
---------------------------------------

Hey Simon, any progress on this? Seems like a useful feature. I haven't had time to review, but if you feel it's ready, I say go for it.

> Ability to select codec per field
> ---------------------------------
>
>                 Key: SOLR-1942
>                 URL: https://issues.apache.org/jira/browse/SOLR-1942
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 4.0
>            Reporter: Yonik Seeley
>             Fix For: 4.0
>         Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch
>
> We should use PerFieldCodecWrapper to allow users to select the codec per-field.
[jira] Commented: (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992691#comment-12992691 ]

Grant Ingersoll commented on SOLR-1942:
---------------------------------------

FWIW, I like the syntax of the schema and solrconfig.xml configuration that you showed in the example here. Would be good to add to the wiki once you commit it.

> Ability to select codec per field
>                 Key: SOLR-1942
>                 URL: https://issues.apache.org/jira/browse/SOLR-1942
[jira] Created: (SOLR-2354) CoreAdminRequest#createCore should allow you to specify the data dir
CoreAdminRequest#createCore should allow you to specify the data dir
--------------------------------------------------------------------

                Key: SOLR-2354
                URL: https://issues.apache.org/jira/browse/SOLR-2354
            Project: Solr
         Issue Type: Improvement
          Components: clients - java
            Reporter: Mark Miller
            Assignee: Mark Miller
            Priority: Minor
            Fix For: 4.0
[jira] Created: (SOLR-2355) simple distrib update processor
simple distrib update processor
-------------------------------

                Key: SOLR-2355
                URL: https://issues.apache.org/jira/browse/SOLR-2355
            Project: Solr
          Issue Type: New Feature
            Reporter: Yonik Seeley
            Priority: Minor

Here's a simple update processor for distributed indexing that I implemented years ago. It implements a simple hash(id) MOD nservers scheme and just fails if any servers are down. Given the recent activity in distributed indexing, I thought this might be at least a good source for ideas.
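The hash(id) MOD nservers routing can be sketched as follows (illustrative helper, not the code from the attached DistributedUpdateProcessorFactory; Math.floorMod keeps negative hashCode values in range):

```java
// Illustrative sketch of hash(id) MOD nservers document routing,
// not the code from the attached DistributedUpdateProcessorFactory.
public class SimpleShardRouter {
    static int shardFor(String docId, int numShards) {
        // floorMod: String#hashCode may be negative, plain % would be too
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        String[] shards = {"localhost:8983/solr", "localhost:7574/solr"};
        int target = shardFor("doc-42", shards.length);
        assert target >= 0 && target < shards.length;  // always a valid shard
        assert target == shardFor("doc-42", shards.length);  // deterministic per id
        System.out.println(shards[target]);
    }
}
```

The scheme is deterministic per id, so updates and deletes for a document always hit the same shard, but it has no failover: if a shard is down, the update simply fails.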
[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
[ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992707#comment-12992707 ]

Simon Willnauer commented on LUCENE-2881:
-----------------------------------------

I actually see some intermittent failures on docvalues which seem to be caused by mixed-up codec IDs. I found one spot that fixed most of the cases, in FieldInfo#clone() where the codec ID is lost, but there must be some other situation where the codec ID gets mixed up. I will be AFK for 10 days at least, so I have to leave you alone with this. It seems we need a dedicated test for it, though; the docValues tests bring it up quickly :(

> Track FieldInfo per segment instead of per-IW-session
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
[jira] Updated: (SOLR-2355) simple distrib update processor
[ https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-2355: --- Attachment: TestDistributedUpdate.java, DistributedUpdateProcessorFactory.java

Here's the processor class and the test class (not in patch form; I just pulled these files straight from our commercial product).

simple distrib update processor --- Key: SOLR-2355 URL: https://issues.apache.org/jira/browse/SOLR-2355 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley Priority: Minor Attachments: DistributedUpdateProcessorFactory.java, TestDistributedUpdate.java

Here's a simple update processor for distributed indexing that I implemented years ago. It implements a simple hash(id) MOD nservers policy and just fails if any servers are down. Given the recent activity in distributed indexing, I thought this might at least be a good source of ideas.
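The hash(id) MOD nservers policy described above can be sketched in plain Java. This is a minimal illustration, not the attached DistributedUpdateProcessorFactory; the class and method names are made up:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of a hash(id) MOD nservers distribution policy.
// Names are illustrative; this is not the attached implementation.
public class HashShardPolicy {
    private final List<String> shards;

    public HashShardPolicy(List<String> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("no shards configured");
        }
        this.shards = shards;
    }

    /** Route a document to a shard by hashing its unique id modulo the shard count. */
    public String shardFor(String docId) {
        // Math.floorMod keeps the index non-negative even when hashCode() is negative.
        return shards.get(Math.floorMod(docId.hashCode(), shards.size()));
    }

    public static void main(String[] args) {
        HashShardPolicy policy = new HashShardPolicy(
                Arrays.asList("localhost:8983/solr", "localhost:7574/solr"));
        // The same id always routes to the same shard.
        System.out.println(policy.shardFor("doc-42").equals(policy.shardFor("doc-42")));
    }
}
```

Note that with plain modulo hashing, changing the number of shards remaps most ids; and as the description says, this simple scheme has no failover.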
[jira] Commented: (SOLR-2355) simple distrib update processor
[ https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992711#comment-12992711 ] Yonik Seeley commented on SOLR-2355: Some sample configuration:

{code}
<updateRequestProcessorChain name="distrib">
  <processor class="com.lucid.update.DistributedUpdateProcessorFactory">
    <!-- example configuration... shards should be in the *same* order for every
         server in a cluster. Only "self" should change to represent what server
         *this* is.
    <str name="self">localhost:8983/solr</str>
    <arr name="shards">
      <str>localhost:8983/solr</str>
      <str>localhost:7574/solr</str>
    </arr>
    -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory">
    <int name="maxNumToLog">10</int>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
{code}

Now on any update command, you can set update.processor=distrib and have distributed indexing controlled by the shards and self params, either configured in solrconfig or passed in with the update command. Or, if you don't want to have to specify update.processor=distrib, you can set it as the default update processor for any update request handlers:

{code}
<!-- CSV update handler, loaded on demand -->
<requestHandler class="solr.CSVRequestHandler" name="/update/csv" startup="lazy">
  <lst name="defaults">
    <str name="update.processor">distrib</str>
  </lst>
</requestHandler>
{code}
Re: Distributed Indexing
I haven't had time to follow all of this discussion, but this issue might help: https://issues.apache.org/jira/browse/SOLR-2355 It's an implementation of the basic http://localhost:8983/solr/update/csv?shards=shard1,shard2... -Yonik http://lucidimagination.com

On Mon, Feb 7, 2011 at 8:55 AM, Upayavira u...@odoko.co.uk wrote:

Surely you want to be implementing an UpdateRequestProcessor rather than a RequestHandler. The ContentStreamHandlerBase, in its handleRequestBody method, gets an UpdateRequestProcessor and uses it to process the request. What we need is for that handleRequestBody method to, as you have suggested, check the shards parameter and, if necessary, call a different UpdateRequestProcessor (a DistributedUpdateRequestProcessor).

I don't think we really need it to be configurable at this point. The ContentStreamHandlerBase could just use a single hardwired implementation. If folks want a choice of DistributedUpdateRequestProcessor, it can be added later. For configuration, the DistributedUpdateRequestProcessor should get its config from the parent RequestHandler. The configuration I'm most interested in is the DistributionPolicy, and that can be done with a distributionPolicyClass=solr.IDHashDistributionPolicy request parameter, which could potentially be configured in solrconfig.xml as an invariant, or provided in the request by the user if necessary. So, I'd avoid another thing that needs to be configured unless there are real benefits to it (which there don't seem to me to be right now).

Upayavira

On Sun, 06 Feb 2011 23:08 +, Alex Cowell alxc...@gmail.com wrote:

Hey, we're making good progress, but our DistributedUpdateRequestHandler is having a bit of an identity crisis, so we thought we'd ask what other people's opinions are. The current situation is as follows: we've added a method to ContentStreamHandlerBase to check if an update request is distributed or not (based on the presence/validity of the 'shards' parameter).
So a non-distributed request will proceed as normal, but a distributed request would be passed on to the DistributedUpdateRequestHandler to deal with. The reason this choice is made in the ContentStreamHandlerBase is so that the DistributedUpdateRequestHandler can use the URL the request came in on to determine where to distribute update requests. E.g. if an update request is sent to: http://localhost:8983/solr/update/csv?shards=shard1,shard2... then the DistributedUpdateRequestHandler knows to send requests to: shard1/update/csv and shard2/update/csv. Alternatively, if the request wasn't distributed, it would simply be handled by whichever request handler /update/csv uses.

Herein lies the problem. The DistributedUpdateRequestHandler is not really a request handler in the same way as the CSVRequestHandler or XmlUpdateRequestHandler are. If anything, it's more like a plugin for the various existing update request handlers, to allow them to deal with distributed requests - a distributor, if you will. It isn't designed to receive and handle requests directly. We would like this DistributedUpdateRequestHandler to be defined in the solrconfig, to allow flexibility for setting up multiple different DistributedUpdateRequestHandlers with different ShardDistributionPolicies etc., and also to allow us to get the appropriate instance from the core in the code. There seem to be two paths for doing this:

1. Leave it as an implementation of SolrRequestHandler and hope the user doesn't directly send update requests to it (i.e. a request to the http://localhost:8983/solr/distrib update handler path would most likely cripple something). It would be defined in the solrconfig something like:

<requestHandler name="distrib-update" class="solr.DistributedUpdateRequestHandler" />

2. Create a new plugin type for the solrconfig, say updateRequestDistributor, which would involve creating a new interface for the DistributedUpdateRequestHandler to implement, then registering it with the core.
It would be defined in the solrconfig something like:

<updateRequestDistributor name="distrib-update" class="solr.DistributedUpdateRequestHandler">
  <lst name="defaults">
    <str name="policy">solr.HashedDistributionPolicy</str>
  </lst>
</updateRequestDistributor>

This would mean that it couldn't directly receive requests, but that an instance could still easily be retrieved from the core to handle the distribution of update requests. Any thoughts on the above issue (or a more succinct, descriptive name for the class) are most welcome!

Alex

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
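The URL-based routing Alex describes - reusing the handler path from the incoming request to address the same handler on each shard - amounts to simple string composition. A dependency-free sketch, with hypothetical names:

```java
// Sketch of deriving per-shard update URLs from the incoming handler path,
// e.g. a request to .../solr/update/csv?shards=... fans out to
// shard1/update/csv, shard2/update/csv, and so on. Names are hypothetical.
public class ShardUrlBuilder {
    /** Append the handler path (e.g. "/update/csv") to each shard address. */
    public static String[] shardUrls(String[] shards, String handlerPath) {
        String[] urls = new String[shards.length];
        for (int i = 0; i < shards.length; i++) {
            urls[i] = shards[i] + handlerPath;
        }
        return urls;
    }

    public static void main(String[] args) {
        String[] urls = shardUrls(
                new String[] {"http://shard1/solr", "http://shard2/solr"},
                "/update/csv");
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
```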
[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992728#comment-12992728 ] Hoss Man commented on SOLR-2338: Most existing situations where plugins are dereferenced by name are so we can reuse the exact same object instance (i.e. for recording stats, or because they are heavyweight to construct on the fly). In the case of similarity, the main advantage I can think of would be if we wanted true per-field similarity declaration, not just per field type, i.e.:

{code}
<similarity name="S_XX" class="..."/>
<similarity name="S_YY" class="..."/>
...
<fieldType name="FT_AA">
  <analyzer>...</analyzer>
  <similarity name="S_XX"/>
</fieldType>
...
<field name="F_111" type="FT_AA" /> <!-- implied S_XX -->
<field name="F_222" type="FT_AA" similarity="S_YY" />
{code}

...but even if we don't do that, I suppose it's also conceivable that someone might have their own Similarity implementation that is expensive to instantiate (i.e. maintains some big in-memory data structures?) and might want to be able to declare one instance and then refer to it by name in many different fieldType declarations. I think for now just supporting the first example Yonik cited...

{code}
<fieldType>
  <analyzer>...</analyzer>
  <similarity class="..."/>
</fieldType>
{code}

...would be a huge win, and we can always enhance it to add name dereferencing later.

improved per-field similarity integration into schema.xml - Key: SOLR-2338 URL: https://issues.apache.org/jira/browse/SOLR-2338 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 4.0 Reporter: Robert Muir

Currently, since LUCENE-2236, we can enable Similarity per-field, but in schema.xml there is only a 'global' factory for the SimilarityProvider. In my opinion this is too low-level, because to customize Similarity on a per-field basis you have to set your own CustomSimilarityProvider with <similarity class="..."/> and manage the per-field mapping yourself in java code. Instead I think it would be better if you could just specify the Similarity in the FieldType, like after <analyzer>. As far as the example, one idea from LUCENE-1360 was to make a short_text or metadata_text used by the various metadata fields in the example that has better norm quantization for its shortness...
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hao yan updated LUCENE-2903: Attachment: LUCENE-2903.patch

This patch further improves the PForDelta codec (PForDeltaFixedIntBlockCodec). I used 3 different implementations (3 codecs) for the input/output index. In particular:

1. PatchedFrameOfRef3 uses in.readBytes(); it converts between int[] and byte[] manually. Its corresponding java code is PForDeltaFixedIntBlockCodec.java.
2. PatchedFrameOfRef4 uses in.readBytes(); it converts between int[] and byte[] via ByteBuffer/IntBuffer. Its corresponding java code is PForDeltaFixedIntBlockWithByteBufferCodec.java.
3. PatchedFrameOfRef5 uses in.readInt() in a loop, so it needs no conversion. Its corresponding java code is PForDeltaFixedIntBlockWithReadIntCodec.java.

Improvement of PForDelta Codec -- Key: LUCENE-2903 URL: https://issues.apache.org/jira/browse/LUCENE-2903 Project: Lucene - Java Issue Type: Improvement Reporter: hao yan Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch

There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. FrameOfRef is a very basic one which is essentially a binary encoding (and may result in huge index size). PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.) In particular, the changes are:

1. I fixed the bug of my previous version (in Lucene-1410.patch), where the old PForDelta did not support very large exceptions (since Simple16 does not support very large numbers). This has now been fixed in the new LCPForDelta.
2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec.
3. The performance test results are: 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, and slightly worse than BulkVInt. 2) My NewPForDelta codec results in the smallest index size among all 4 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). 3) All performance test results were achieved by running with -server instead of -client.
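The difference between the manual conversion (PatchedFrameOfRef3) and the ByteBuffer/IntBuffer conversion (PatchedFrameOfRef4) can be illustrated with a small self-contained sketch; the class and method names below are made up, not from the patch:

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Minimal sketch of the two int[] <-> byte[] conversion strategies compared
// above: manual shift-based packing vs. a ByteBuffer/IntBuffer view.
// Names are illustrative, not from the attached codecs.
public class IntByteConversion {

    /** Manual conversion: write each int as four big-endian bytes. */
    public static byte[] intsToBytesManual(int[] ints) {
        byte[] bytes = new byte[ints.length * 4];
        for (int i = 0; i < ints.length; i++) {
            int v = ints[i];
            bytes[4 * i]     = (byte) (v >>> 24);
            bytes[4 * i + 1] = (byte) (v >>> 16);
            bytes[4 * i + 2] = (byte) (v >>> 8);
            bytes[4 * i + 3] = (byte) v;
        }
        return bytes;
    }

    /** ByteBuffer/IntBuffer conversion: let the buffer view do the packing. */
    public static byte[] intsToBytesBuffer(int[] ints) {
        ByteBuffer bb = ByteBuffer.allocate(ints.length * 4);
        bb.asIntBuffer().put(ints);
        return bb.array();
    }

    /** Decode back to ints via an IntBuffer view over the byte array. */
    public static int[] bytesToInts(byte[] bytes) {
        IntBuffer ib = ByteBuffer.wrap(bytes).asIntBuffer();
        int[] ints = new int[ib.remaining()];
        ib.get(ints);
        return ints;
    }

    public static void main(String[] args) {
        int[] block = {1, 255, 65536, -1};
        // Both strategies produce the same layout, since ByteBuffer's
        // default byte order is big-endian.
        System.out.println(java.util.Arrays.equals(
                intsToBytesManual(block), intsToBytesBuffer(block)));
        System.out.println(java.util.Arrays.equals(
                block, bytesToInts(intsToBytesBuffer(block))));
    }
}
```

The third strategy (reading ints directly with in.readInt() in a loop) avoids the intermediate byte[] entirely, trading per-call overhead for no conversion step.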
[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992735#comment-12992735 ] Robert Muir commented on SOLR-2338: ---

{quote} ...but even if we don't do that, i suppose it's also conceivable that someone might have their own Similarity implementation that is expensive to instantiate (ie: maintains some big in memory data structures?) and might want to be able to declare one instance and then refer to it by name in many different fieldType declarations. {quote}

I don't think this is really a use case we need to support: the purpose of Similarity today is to do term weighting, not to be a huge data-structure holder. While I know Mike's original patch went this way with LUCENE-2392 (e.g. norms), I'm not sure I like it being in Similarity in the future either. Otherwise, concepts like lazy-loading norms and all this other stuff get pushed onto the sim, which is an awkward place (imagine if you have many fields). So, I think we shouldn't really design for abuses of the API. If there are other use cases for named similarity that have to do with term weighting, I'm interested.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992757#comment-12992757 ] Robert Muir commented on LUCENE-2903: - Hello, I don't see the new files you referred to in the patch. Maybe the new files were not added to svn with 'svn add' before making the patch?
[jira] Updated: (LUCENE-2892) Add QueryParser.newFieldQuery
[ https://issues.apache.org/jira/browse/LUCENE-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2892: Fix Version/s: 4.0

Add QueryParser.newFieldQuery - Key: LUCENE-2892 URL: https://issues.apache.org/jira/browse/LUCENE-2892 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2892.patch

Note: this patch changes no behavior, it just makes QP more subclassable. Currently we have Query getFieldQuery(String field, String queryText, boolean quoted). This contains very hairy methods for producing a query from QP's analyzer. I propose we factor this into newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted); then getFieldQuery just calls newFieldQuery(this.analyzer, field, queryText, quoted).

The reasoning is: it can be quite useful to treat the double quote as more than phrases - as a more exact search. In the case where the user quoted the terms, you might want to analyze the text with an alternate analyzer that doesn't produce synonyms, doesn't decompose compounds, doesn't use WordDelimiterFilter (you would need to be using preserveOriginal=true at index time for the WDF one), etc. This is similar to the way Google's double-quote operator works: it's defined not as a phrase but as "this exact wording or phrase". For example, compare the results of a query for tests versus "tests". Currently you can do this without heavy code duplication, but really only if you make a separate field (which is wasteful) and make your custom QP lie about its field... in the examples I listed above, you can do this with a single field, yet still have a more exact phrase search.
[jira] Commented: (LUCENE-2892) Add QueryParser.newFieldQuery
[ https://issues.apache.org/jira/browse/LUCENE-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992759#comment-12992759 ] Robert Muir commented on LUCENE-2892: - I think this is an easy win for some common use cases from the API perspective. If no one objects, I'll commit in a few days.
[jira] Assigned: (LUCENE-2892) Add QueryParser.newFieldQuery
[ https://issues.apache.org/jira/browse/LUCENE-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2892: --- Assignee: Robert Muir
[jira] Assigned: (SOLR-2353) SpellCheckCollator uses org.mortbay.log.Log for logging
[ https://issues.apache.org/jira/browse/SOLR-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned SOLR-2353: - Assignee: Mark Miller

SpellCheckCollator uses org.mortbay.log.Log for logging --- Key: SOLR-2353 URL: https://issues.apache.org/jira/browse/SOLR-2353 Project: Solr Issue Type: Bug Components: spellchecker Reporter: Sami Siren Assignee: Mark Miller Priority: Trivial Attachments: SOLR-2353.patch

SLF4J should be used instead.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992809#comment-12992809 ] hao yan commented on LUCENE-2903: - Just uploaded. Sorry.
LUCENE-1608 CustomScoreQuery should support arbitrary queries.
Hi everyone. I don't know yet (I am new on the discussion list) whether updates in Jira are automatically sent to the list, so that is why I'm writing this mail. I just uploaded a class for demonstration purposes to LUCENE-1608: CustomScoreQuery should support arbitrary queries (https://issues.apache.org/jira/browse/LUCENE-1608). Comments and criticism are more than welcome. Thank you. Regards, Fernando. PS: sorry if this is not the correct way to announce an update of a ticket in Jira.