[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-06-10 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603780#action_12603780
 ] 

Mike Klaas commented on SOLR-556:
-

Thanks for the patch, Lars.  I think that the basic approach is sound, though I 
am a little nervous about the performance implications (especially in the case 
of phrase highlighting, where we spin up an entirely new spanhighlighter for 
each value in a multi-valued field).  I wonder if I am the only one who 
highlights large text fields composed of dozens of individual values?
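
To make the trade-off concrete, here is a minimal sketch of the per-value idea. It is not the attached patch: it uses plain substring matching instead of Lucene analysis, and the function name `highlight_per_value` is made up for illustration. Each value is highlighted on its own, so a snippet can never span two values, but the work is repeated once per value, which is where the performance concern comes from.

```python
import re

def highlight_per_value(values, term, pre="<em>", post="</em>"):
    """Highlight each value of a multi-valued field independently.

    Hypothetical helper: one matching pass per value, substring match only.
    """
    pattern = re.compile(re.escape(term))
    snippets = []
    for value in values:  # one "highlighter run" per value
        if pattern.search(value):
            snippets.append(pattern.sub(lambda m: pre + m.group(0) + post, value))
    return snippets

snippets = highlight_per_value(["foo", "bar"], "ba")
print(snippets)  # ['<em>ba</em>r']
```

With dozens of values per field, the loop body runs dozens of times per document, so the cost of setting up whatever does the real highlighting matters.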




 Highlighting of multi-valued fields returns snippets which span multiple 
 different values
 -

 Key: SOLR-556
 URL: https://issues.apache.org/jira/browse/SOLR-556
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.3
 Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Assignee: Mike Klaas
Priority: Minor
 Fix For: 1.3

 Attachments: SOLR-556-highlight-multivalued.patch, 
 solr-highlight-multivalued-example.xml


 When highlighting multi-valued fields, the highlighter sometimes returns 
 snippets which span multiple values, e.g. with values "foo" and "bar" and 
 search term "ba" the highlighter will create the snippet "foo<em>ba</em>r". 
 Furthermore it sometimes returns smaller snippets than it should, e.g. with 
 value "foobar" and search term "oo" it will create the snippet "<em>oo</em>" 
 regardless of hl.fragsize.
 I have been unable to determine the real cause for this, or indeed what 
 actually goes on at all. To reproduce the problem, I've used the following 
 steps:
 * create an index with multi-valued fields, one document should have at least 
 3 values for these fields (in my case strings of length between 5 and 15 
 Japanese characters -- as far as I can tell plain old ASCII should produce 
 the same effect though)
 * search for part of a value in such a field with highlighting enabled, the 
 additional parameters I use are hl.fragsize=70, hl.requireFieldMatch=true, 
 hl.mergeContiguous=true (changing the parameters does not seem to have any 
 effect on the result though)
 * highlighted snippets should show effects described above
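
The request described in the steps above could be built roughly as follows. This is only a sketch: the host, the field name `tags` and the helper `build_highlight_query` are assumptions, not taken from the report. The `hl.*` parameters are the ones named in the steps.

```python
from urllib.parse import urlencode

def build_highlight_query(base_url, field, term, fragsize=70):
    """Build a select URL with the highlighting parameters from the report."""
    params = {
        "q": f"{field}:{term}",          # search for part of a value
        "hl": "true",                    # enable highlighting
        "hl.fl": field,                  # highlight the multi-valued field
        "hl.fragsize": str(fragsize),
        "hl.requireFieldMatch": "true",
        "hl.mergeContiguous": "true",
    }
    return f"{base_url}/select?{urlencode(params)}"

# Assumed local instance and field name, for illustration only.
url = build_highlight_query("http://localhost:8983/solr", "tags", "ba")
print(url)
```

Fetching that URL against an index that contains a document with at least three values in the field should return the highlighting section to inspect.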

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-06-10 Thread Lars Kotthoff (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603782#action_12603782
 ] 

Lars Kotthoff commented on SOLR-556:


In the setup I've been testing it with (one large single-valued text field 
and several multi-valued fields) it didn't seem to have any serious performance 
implications -- i.e. the randomness of my test queries was enough to mask any 
loss of performance ;)
The multi-valued fields are on the order of 10-20 characters long on 
average, though, and there are not many different values.

I personally think that returning correct data is more important than 
performance in this case, but that may just be because my particular setup 
doesn't suffer any significant loss of performance. I didn't see any other way 
to correct the behaviour of the current trunk code, but if anybody else has a 
better idea how to do it, please let us know!




[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-06-10 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603785#action_12603785
 ] 

Mike Klaas commented on SOLR-556:
-

Hey Lars,

Yeah, I'm talking about highlighting 15kB of text in 100-200 character chunks.  
Maybe I can whip up a perf test for this soon.

The reason we probably see this issue differently is that the incorrect 
behaviour is quite minor for most users (perhaps a bit of punctuation leaking 
from value to value at most).  One way to correct what you are seeing is to 
use a tokenizer that creates tokens out of the CJK characters, or things on 
boundaries.  In your case, inserting a fake token when encountering a right 
bracket [)] would fix the problem, I think.
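
A rough sketch of why a boundary token would help, under simplifying assumptions: values are concatenated without a separator, tokens are found with a regular expression, and a fragment can only start where the previous token ended. The function `split_fragments` and both token patterns are hypothetical and do not model the real Solr or Lucene classes.

```python
import re

def split_fragments(values, token_pattern):
    """Split concatenated values into fragments, breaking only at tokens."""
    text = "".join(values)
    starts, pos = [], 0
    for v in values:              # offset at which each value begins
        starts.append(pos)
        pos += len(v)
    fragments, frag_start, last_end, current = [], 0, 0, 0
    for m in re.finditer(token_pattern, text):
        owner = max(i for i, s in enumerate(starts) if s <= m.start())
        if owner != current:      # first token of a new value: cut here
            fragments.append(text[frag_start:last_end])
            frag_start, current = last_end, owner
        last_end = m.end()
    fragments.append(text[frag_start:])
    return fragments

values = ["(foo)", "bar"]
plain = split_fragments(values, r"\w+")           # word tokens only
with_boundary = split_fragments(values, r"\w+|\)")  # ")" is also a token
print(plain)          # ['(foo', ')bar']
print(with_boundary)  # ['(foo)', 'bar']
```

With word tokens only, the closing bracket belongs to no token and ends up in the next fragment. Treating it as a token moves the cut to after the bracket.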

Nevertheless, I think I will probably end up committing your patch after 
pondering it some more.






[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-06-10 Thread Lars Kotthoff (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603787#action_12603787
 ] 

Lars Kotthoff commented on SOLR-556:


Hi Mike,

In my opinion the most important reason for committing the patch is that the 
current implementation breaks the multi-valued field abstraction. There's no 
way to assert that tokenizers will always produce tokens suitable for the 
current implementation. It also makes for a very hard-to-find bug, because 
there are no error messages. I just found it by chance. And even if you notice 
that something is wrong, fixing it is non-trivial and requires quite some 
knowledge of how Solr does highlighting of multi-valued fields.

So the other option is to add a page to the wiki with a workaround like the one 
you've suggested, but I think that would rather scare away people evaluating 
Solr for use with CJK text ;)




[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-06-04 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12602541#action_12602541
 ] 

Mike Klaas commented on SOLR-556:
-

Ah, I see what the problem is:  Although it is impossible for tokens from 
different values to appear in the same fragment (due to the semantics of 
MultiValuedTokenFilter), the non-token text (typically, punctuation) from 
different values can bleed into the same fragment, since Lucene's highlighter 
can only create a new fragment on token boundaries.
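
The effect can be illustrated with a toy model. This is a sketch, not the Lucene highlighter: values are simply concatenated, tokens are runs of word characters, and a fragment break is only allowed at the end of the previous token. The name `fragments_on_token_boundaries` is made up for the example.

```python
import re

def fragments_on_token_boundaries(values):
    """Toy model: fragments may only be cut where a token ended."""
    text = ""
    value_starts = []
    for v in values:
        value_starts.append(len(text))   # offset where this value begins
        text += v
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]

    fragments = []
    frag_start = 0
    last_end = 0
    current_value = 0
    for start, end in tokens:
        # index of the value this token came from
        value_idx = max(i for i, s in enumerate(value_starts) if s <= start)
        if value_idx != current_value:
            # cut at the end of the previous token, not at the value boundary
            fragments.append(text[frag_start:last_end])
            frag_start = last_end
            current_value = value_idx
        last_end = end
    fragments.append(text[frag_start:])
    return fragments

result = fragments_on_token_boundaries(["(foo)", "bar"])
print(result)  # ['(foo', ')bar']
```

The tokens "foo" and "bar" land in separate fragments, but the closing bracket of the first value is carried into the second fragment because no token covers it.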

Unfortunately SOLR-553 was committed a day after you submitted your patch, and 
rearranges the code slightly so that it no longer applies.  Could you sync the 
patch with trunk?  I think the basic approach is sound.

 Highlighting of multi-valued fields returns snippets which span multiple 
 different values
 -

 Key: SOLR-556
 URL: https://issues.apache.org/jira/browse/SOLR-556
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.3
 Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Assignee: Mike Klaas
Priority: Minor
 Fix For: 1.3

 Attachments: solr-highlight-multivalued-example.xml, 
 solr-highlight-multivalued.patch





[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-05-15 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12597226#action_12597226
 ] 

Mike Klaas commented on SOLR-556:
-

Thanks for the report, Lars.  I'll take a look at this shortly.

 Highlighting of multi-valued fields returns snippets which span multiple 
 different values
 -

 Key: SOLR-556
 URL: https://issues.apache.org/jira/browse/SOLR-556
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.3
 Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Priority: Minor
 Attachments: solr-highlight-multivalued-example.xml, 
 solr-highlight-multivalued.patch





[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-05-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12596925#action_12596925
 ] 

Otis Gospodnetic commented on SOLR-556:
---

Lars - could you please try the patch in SOLR-553 and see if it fixes the 
problem you described here?


 Highlighting of multi-valued fields returns snippets which span multiple 
 different values
 -

 Key: SOLR-556
 URL: https://issues.apache.org/jira/browse/SOLR-556
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.3
 Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Priority: Minor
 Attachments: solr-highlight-multivalued.patch





[jira] Commented: (SOLR-556) Highlighting of multi-valued fields returns snippets which span multiple different values

2008-05-14 Thread Lars Kotthoff (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12596999#action_12596999
 ] 

Lars Kotthoff commented on SOLR-556:


I've applied SOLR-553 and confirmed that this problem is not fixed, regardless 
of the setting of usePhraseHighlighter.
