[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603780#action_12603780 ]

Mike Klaas commented on SOLR-556:

Thanks for the patch, Lars. I think that the basic approach is sound, though I am a little nervous about the performance implications (especially in the case of phrase highlighting, where we spin up an entirely new spanhighlighter for each value in a multi-valued field). I wonder if I am the only one who highlights large text fields composed of dozens of individual values?

Highlighting of multi-valued fields returns snippets which span multiple different values

Key: SOLR-556
URL: https://issues.apache.org/jira/browse/SOLR-556
Project: Solr
Issue Type: Bug
Components: highlighter
Affects Versions: 1.3
Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Assignee: Mike Klaas
Priority: Minor
Fix For: 1.3
Attachments: SOLR-556-highlight-multivalued.patch, solr-highlight-multivalued-example.xml

When highlighting multi-valued fields, the highlighter sometimes returns snippets which span multiple values, e.g. with values "foo" and "bar" and search term "ba" the highlighter will create the snippet "foo<em>ba</em>r". Furthermore it sometimes returns smaller snippets than it should, e.g. with value "foobar" and search term "oo" it will create the snippet "<em>oo</em>" regardless of hl.fragsize. I have been unable to determine the real cause for this, or indeed what actually goes on at all.

To reproduce the problem, I've used the following steps:
* create an index with multi-valued fields, one document should have at least 3 values for these fields (in my case strings of length between 5 and 15 Japanese characters -- as far as I can tell plain old ASCII should produce the same effect though)
* search for part of a value in such a field with highlighting enabled, the additional parameters I use are hl.fragsize=70, hl.requireFieldMatch=true, hl.mergeContiguous=true (changing the parameters does not seem to have any effect on the result though)
* highlighted snippets should show effects described above

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
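For anyone trying the reproduction steps above, the request could be assembled roughly as follows. This is only a sketch: the base URL `http://localhost:8983/solr/select`, the field name `tags` and the helper name `build_highlight_query` are assumptions for illustration and do not come from the report. Only `hl.fragsize`, `hl.requireFieldMatch` and `hl.mergeContiguous` are taken from the description; `hl` and `hl.fl` are the usual Solr parameters for switching highlighting on and choosing the field. Whether a partial term such as "ba" matches anything depends on the field's analyzer.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, term, field):
    """Build a select URL with the highlighting parameters from the report.

    base_url and field are placeholders; adjust them to the local setup.
    """
    params = [
        ("q", f"{field}:{term}"),
        ("hl", "true"),                    # enable highlighting
        ("hl.fl", field),                  # field to highlight
        ("hl.fragsize", "70"),             # values quoted in the report
        ("hl.requireFieldMatch", "true"),
        ("hl.mergeContiguous", "true"),
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical host and field name, for illustration only.
url = build_highlight_query("http://localhost:8983/solr/select", "ba", "tags")
print(url)
```

The returned snippets for the multi-valued field can then be compared against the stored values by hand.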
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603782#action_12603782 ]

Lars Kotthoff commented on SOLR-556:

In the setup I've been testing it with (one large single-valued text field and several multi-valued fields) it didn't seem to have any serious performance implications -- i.e. the randomness of my test queries was enough to mask any loss of performance ;) The length of the multi-valued fields is in the order of 10-20 characters on average though and there're not many multiple different values.

I personally think that returning correct data is more important than performance in this case, but that may just be because my particular setup doesn't suffer any significant loss of performance. I didn't see any other way to correct the behaviour of the current trunk code, but if anybody else has a better idea how to do it, please let us know!
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603785#action_12603785 ]

Mike Klaas commented on SOLR-556:

Hey Lars,

Yeah, I'm talking about highlighting 15kB of text in 100-200 character chunks. Maybe I can whip up a perf test for this soon.

The reason we probably see this issue differently is that the incorrect behaviour is quite minor for most users (perhaps a bit of punctuation leaking from value to value at most). One way to correct what you are seeing is to use a tokenizer that creates tokens out of the CJK characters, or things on boundaries. In your case, inserting a fake token when encountering a right bracket [)] would fix the problem, I think.

Nevertheless, I think I will probably end up committing your patch after pondering it some more.
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603787#action_12603787 ]

Lars Kotthoff commented on SOLR-556:

Hi Mike,

In my opinion the most important reason for committing the patch is that the current implementation breaks the multi-valued field abstraction. There's no way to assert that tokenizers will always produce tokens suitable for the current implementation. It also makes for a very hard to find bug, because there're no error messages. I just found it by chance. And even if you notice that something is wrong, fixing it is non-trivial and requires quite some knowledge of how Solr does highlighting of multi-valued fields.

So the other option is to add a page to the wiki with a workaround like you've suggested, but I think that's rather going to scare people evaluating Solr for use with CJK text away ;)
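Since the bug produces no error messages, one way to catch it is to check snippets against the stored values after the fact. The following is a minimal sketch of such a check, not part of Solr or of the attached patch; the helper name `spans_multiple_values` is made up, and it assumes snippets are marked up with plain `<em>` tags and are not otherwise escaped or truncated.

```python
import re

# Matches the <em> / </em> markers assumed to wrap highlighted terms.
TAG = re.compile(r"</?em>")


def spans_multiple_values(snippet, values):
    """Return True if the snippet text is not contained in any single value."""
    plain = TAG.sub("", snippet)
    return not any(plain in value for value in values)


values = ["foo", "bar"]
print(spans_multiple_values("foo<em>ba</em>r", values))  # snippet from the report
print(spans_multiple_values("<em>ba</em>r", values))     # snippet within one value
```

A check like this could run in a test against a small example index, flagging any snippet that cannot be found inside one value of the field.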
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12602541#action_12602541 ]

Mike Klaas commented on SOLR-556:

Ah, I see what the problem is: although it is impossible for tokens from different values to appear in the same fragment (due to the semantics of MultiValuedTokenFilter), the non-token text (typically, punctuation) from different values can bleed into the same fragment, since Lucene's highlighter can only create a new fragment on token boundaries.

Unfortunately SOLR-553 was committed a day after you submitted your patch, and rearranges the code slightly so that it no longer applies. Could you sync the patch with trunk? I think the basic approach is sound.
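The effect described in this comment can be illustrated with a toy model. The sketch below is not Lucene's or Solr's code: it assumes the values are joined with a single space, that tokens are runs of word characters, and that a new fragment may only begin at the first token of each value. The function names are invented for the example. Under these assumptions, punctuation between the last token of one value and the first token of the next ends up in the later fragment, whereas fragmenting each value separately (the approach discussed for the patch) keeps the values apart.

```python
import re

TOKEN = re.compile(r"\w+")  # toy tokenizer: runs of word characters


def fragments_concatenated(values, sep=" "):
    """Fragment the joined text, breaking only where a token starts a new value."""
    text = sep.join(values)
    bounds, pos = [], 0  # (start, end) offsets of each value in the joined text
    for value in values:
        bounds.append((pos, pos + len(value)))
        pos += len(value) + len(sep)

    def value_index(offset):
        return next(i for i, (s, e) in enumerate(bounds) if s <= offset < e)

    fragments, frag_start, last_end, current = [], 0, 0, None
    for match in TOKEN.finditer(text):
        idx = value_index(match.start())
        if current is not None and idx != current:
            # Close the fragment at the end of the previous token; any
            # non-token text after it falls into the next fragment.
            fragments.append(text[frag_start:last_end])
            frag_start = last_end
        current, last_end = idx, match.end()
    fragments.append(text[frag_start:])
    return fragments


def fragments_per_value(values):
    """Run the same toy fragmenter once per value, so nothing can bleed across."""
    return [frag for value in values for frag in fragments_concatenated([value])]


values = ["foo)", "(bar"]
print(fragments_concatenated(values))  # ['foo', ') (bar']
print(fragments_per_value(values))     # ['foo)', '(bar']
```

In the first result the closing bracket of the first value has moved into the second fragment, which is the kind of leak the comment describes; the per-value variant trades this for one fragmenter run per value, which is the performance concern raised elsewhere in the thread.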
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12597226#action_12597226 ]

Mike Klaas commented on SOLR-556:

Thanks for the report, Lars. I'll take a look at this shortly.
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12596925#action_12596925 ]

Otis Gospodnetic commented on SOLR-556:

Lars - could you please try the patch in SOLR-553 and see if it fixes the problem you described here?
[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values
[ https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12596999#action_12596999 ]

Lars Kotthoff commented on SOLR-556:

I've applied SOLR-553 and confirmed that this problem is not fixed, regardless of the setting of usePhraseHighlighter.