Summary#toString always Entity encodes -- problem for 
OpenSearchServlet#description field
-----------------------------------------------------------------------------------------

         Key: NUTCH-257
         URL: http://issues.apache.org/jira/browse/NUTCH-257
     Project: Nutch
        Type: Bug

  Components: searcher  
    Versions: 0.8-dev    
    Reporter: [EMAIL PROTECTED]
    Priority: Minor


All search result data we display in search results has to be explicitly 
Entity.encoded when output in search.jsp (title, url, etc.), except Summaries.  
They are already Entity.encoded.  This is fine when outputting HTML, but it 
gets in the way when outputting anything else, XML for example.  I'd suggest 
we not make any presumption about how search results are used.

The problem becomes especially acute when the text is in a language other 
than English.

Here is an example of what a Czech description field in an OpenSearchServlet 
hit record looks like:

<description>&lt;span class="ellipsis"&gt; ... 
&lt;/span&gt;V&amp;#283;deck&amp;aacute; knihovna v Olomouci Bezru&amp;#269;ova 
2, Olomouc 9, 779 11, &amp;#268;esk&amp;aacute; republika &amp;nbsp; tel. 
+420-585223441 &amp;nbsp; fax +420-585225774 http://www.&lt;span 
class="highlight"&gt;vkol&lt;/span&gt;.cz/ &amp;nbsp;&amp;nbsp; 
mailto:info@&lt;span class="highlight"&gt;vkol&lt;/span&gt;.cz 
Otev&amp;#345;eno : &amp;nbsp; po-p&amp;aacute; &amp;nbsp; 8 30 -19 00 
&amp;nbsp;&amp;nbsp;&amp;nbsp; so &amp;nbsp; 9 00 -13 00 
&amp;nbsp;&amp;nbsp;&amp;nbsp; ne &amp;nbsp; zav&amp;#345;eno V katalogu s 
&amp;uacute;pln&amp;yacute;m &amp;#269;asov&amp;yacute;m&lt;span 
class="ellipsis"&gt; ... &lt;/span&gt;03 Organizace 20/12 Odkazy 19/04 Hledej 
23/03 &amp;nbsp; 23/03 &amp;nbsp; Po&amp;#269;et 
p&amp;#345;&amp;iacute;stup&amp;#367; od 1.9.1998. Statistiky . [ ] &amp;nbsp; 
[ Nahoru ] &lt;span class="highlight"&gt;VKOL&lt;/span&gt;</description>

Here is the same description field with Entity.encoding disabled:

<description>&lt;span class="ellipsis"&gt; ... &lt;/span&gt;tisky statistiky 
knihovny WWW serveru středověké rukopisy studovny CD-ROM historických fondů 
hlavní Internet Německé knihovny vázaných novin SVKOL viz &lt;span 
class="highlight"&gt;VKOL&lt;/span&gt; šatna T telefonní čísla knihovny 
zaměstnanců U V vazba věcný popis vedení knihovny vedoucí oddělení video 
&lt;span class="highlight"&gt;VKOL&lt;/span&gt; volný výběr výpůjčka výroční 
zpráva výstavy W webmaster WWW odkazy X Y Z - Ž zamluvení knihy zahraniční 
periodika zpracování fondu&lt;span class="highlight"&gt;VKOL&lt;/span&gt; - 
hledej Hledej [ &lt;span class="highlight"&gt;VKOL&lt;/span&gt; ] [ Novinky ] [ 
Katalog ] [ Služby ] [ Aktivity ] [ Průvodce ] [ Dokumenty ] [ Regionální fce ] 
[ Organizace ] [ Odkazy ] [ Hledej ] [     ] [     ] Obsah full-textové 
vyhledávání, 19/04/2003 rejstřík vybraných&lt;span class="ellipsis"&gt; ... 
&lt;/span&gt;</description>

Notice how the Czech characters in the first are all entity encoded, most of 
them numerically (i.e. &#NNN;), and how each entity's '&' is then escaped 
again as &amp; in the XML output.
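
The double encoding visible in the first example can be reproduced with a 
small sketch. Both helpers below are hypothetical stand-ins written for this 
illustration, not Nutch's Entity class or the servlet's XML writer. The first 
assumes an HTML-oriented encoder that turns every non-ASCII character into a 
numeric reference (the real output above also uses named entities such as 
&aacute;), and the second assumes the minimal escaping an XML serializer 
applies to element text.

```java
/** Illustration only: hypothetical helpers, not part of Nutch. */
final class DoubleEncodingDemo {

    /** Assumed HTML-style encoder: replaces each non-ASCII code point with &#NNN;. */
    static String htmlEntityEncode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Minimal escaping of element text, as an XML serializer would apply it. */
    static String xmlEscape(String s) {
        // '&' must be replaced first so the other replacements are not re-escaped.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Under these assumptions, passing already entity-encoded text through 
xmlEscape yields the &amp;#NNN; form seen in the first description, while 
passing the raw text through xmlEscape alone leaves the Czech characters 
intact.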

I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and 
that toString() return the raw aggregation of Fragments.  This would likely 
require adding methods to the HitSummarizer interface, and to NutchBean, so 
callers can ask for either raw or entity-encoded text.  Or, better, I'd 
suggest that Summarizer never return Entity.encoded text and that the encoding 
happen in search.jsp (I can make a patch to do the latter if that's amenable).
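
A minimal sketch of the first suggestion, assuming a simplified summary 
class. SummarySketch, its Fragment class, and escape() are hypothetical names 
used only for illustration and do not reproduce the actual Nutch Summary 
code. The sketch also assumes the HTML page is served in an encoding such as 
UTF-8, so only markup characters are escaped and non-ASCII text is left as 
is.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a summary; not the actual Nutch Summary class. */
final class SummarySketch {

    /** One piece of summary text, optionally marked as a highlighted term. */
    static final class Fragment {
        final String text;
        final boolean highlight;

        Fragment(String text, boolean highlight) {
            this.text = text;
            this.highlight = highlight;
        }
    }

    private final List<Fragment> fragments = new ArrayList<>();

    SummarySketch add(String text, boolean highlight) {
        fragments.add(new Fragment(text, highlight));
        return this;
    }

    /** Raw aggregation of fragment text: no markup and no entity encoding. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** HTML rendering: escapes markup characters and wraps highlights in a span. */
    String toEntityEncodedString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (f.highlight) {
                sb.append("<span class=\"highlight\">")
                  .append(escape(f.text))
                  .append("</span>");
            } else {
                sb.append(escape(f.text));
            }
        }
        return sb.toString();
    }

    /** Escapes only the characters that are significant in HTML markup. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this split, an HTML caller such as search.jsp would call 
toEntityEncodedString(), while an XML producer such as OpenSearchServlet 
would take toString() and leave escaping to its own XML writer.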



