[ 
http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ] 

Doug Cutting commented on NUTCH-257:
------------------------------------

I'd vote to never have Summary#toString() perform entity encoding, to fix 
search.jsp to encode things itself, and *not* to add a new 
Summary#toEntityEncodedString() method.
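
To illustrate the split being proposed, here is a minimal sketch, not the 
actual Nutch code. SummarySketch, its nested Summary class and the 
encodeForHtml helper are hypothetical stand-ins: toString() returns the raw 
fragments, and only the HTML caller (the role search.jsp would play) encodes 
them.

```java
import java.util.ArrayList;
import java.util.List;

final class SummarySketch {

    // Hypothetical stand-in for Nutch's Summary: it stores raw text only.
    static final class Summary {
        private final List<String> fragments = new ArrayList<>();

        void add(String text) {
            fragments.add(text);
        }

        // Raw aggregation of the fragments, with no entity encoding.
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            for (String fragment : fragments) {
                sb.append(fragment);
            }
            return sb.toString();
        }
    }

    // Hypothetical helper for the HTML page: escapes only the characters
    // that are special in HTML and leaves non-ASCII text untouched.
    static String encodeForHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Summary summary = new Summary();
        summary.add("V\u011bdeck\u00e1 knihovna ");
        summary.add("<VKOL> & co");

        // A non-HTML consumer (for example an XML servlet) takes the raw text
        // and applies its own escaping.
        System.out.println(summary);

        // The HTML page encodes at output time.
        System.out.println(encodeForHtml(summary.toString()));
    }
}
```

With this arrangement each output format applies exactly one layer of 
escaping, so an XML consumer would not end up with doubly escaped text such as 
&amp;#283;.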

> Summary#toString always Entity encodes -- problem for 
> OpenSearchServlet#description field
> -----------------------------------------------------------------------------------------
>
>          Key: NUTCH-257
>          URL: http://issues.apache.org/jira/browse/NUTCH-257
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.8-dev
>     Reporter: [EMAIL PROTECTED]
>     Priority: Minor
>
> All search result data we display has to be explicitly Entity.encoded when 
> it is output in search.jsp (title, url, etc.), except Summaries, which are 
> already Entity.encoded.  This is fine when outputting HTML, but it gets in 
> the way when outputting anything else -- XML, for example.  I'd suggest we 
> not make any presumption about how search results are used.
> The problem becomes especially acute when the text language is other than 
> English.
> Here is an example of what a Czech description field in an OpenSearchServlet 
> hit record looks like:
> <description>&lt;span class="ellipsis"&gt; ... 
> &lt;/span&gt;V&amp;#283;deck&amp;aacute; knihovna v Olomouci 
> Bezru&amp;#269;ova 2, Olomouc 9, 779 11, &amp;#268;esk&amp;aacute; republika 
> &amp;nbsp; tel. +420-585223441 &amp;nbsp; fax +420-585225774 
> http://www.&lt;span class="highlight"&gt;vkol&lt;/span&gt;.cz/ 
> &amp;nbsp;&amp;nbsp; mailto:info@&lt;span 
> class="highlight"&gt;vkol&lt;/span&gt;.cz Otev&amp;#345;eno : &amp;nbsp; 
> po-p&amp;aacute; &amp;nbsp; 8 30 -19 00 &amp;nbsp;&amp;nbsp;&amp;nbsp; so 
> &amp;nbsp; 9 00 -13 00 &amp;nbsp;&amp;nbsp;&amp;nbsp; ne &amp;nbsp; 
> zav&amp;#345;eno V katalogu s &amp;uacute;pln&amp;yacute;m 
> &amp;#269;asov&amp;yacute;m&lt;span class="ellipsis"&gt; ... &lt;/span&gt;03 
> Organizace 20/12 Odkazy 19/04 Hledej 23/03 &amp;nbsp; 23/03 &amp;nbsp; 
> Po&amp;#269;et p&amp;#345;&amp;iacute;stup&amp;#367; od 1.9.1998. Statistiky 
> . [ ] &amp;nbsp; [ Nahoru ] &lt;span 
> class="highlight"&gt;VKOL&lt;/span&gt;</description>
> Here is the same description field with Entity.encoding disabled:
> <description>&lt;span class="ellipsis"&gt; ... &lt;/span&gt;tisky statistiky 
> knihovny WWW serveru st?edov?ké rukopisy studovny CD-ROM historických fond? 
> hlavní Internet N?mecké knihovny vázaných novin SVKOL viz &lt;span 
> class="highlight"&gt;VKOL&lt;/span&gt; ?atna T telefonní ?ísla knihovny 
> zam?stnanc? U V vazba v?cný popis vedení knihovny vedoucí odd?lení video 
> &lt;span class="highlight"&gt;VKOL&lt;/span&gt; volný výb?r výp?j?ka výro?ní 
> zpráva výstavy W webmaster WWW odkazy X Y Z - ? zamluvení knihy zahrani?ní 
> periodika zpracování fondu&lt;span class="highlight"&gt;VKOL&lt;/span&gt; - 
> hledej Hledej [ &lt;span class="highlight"&gt;VKOL&lt;/span&gt; ] [ Novinky ] 
> [ Katalog ] [ Slu?by ] [ Aktivity ] [ Pr?vodce ] [ Dokumenty ] [ Regionální 
> fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [     ] [     ] Obsah full-textové 
> vyhledávání, 19/04/2003 rejst?ík vybraných&lt;span class="ellipsis"&gt; ... 
> &lt;/span&gt;</description>
> Notice how the Czech characters in the first example are all entity 
> encoded, e.g. numerically as &#NNN;.
> I'd suggest that Summary#toString() become Summary#toEntityEncodedString() 
> and that toString() return the raw aggregation of Fragments.  This would 
> likely require adding methods to the HitSummarizer interface, and to 
> NutchBean, so that callers can ask for either raw or entity-encoded text.  
> Or, better, I'd suggest that Summarizer never return Entity.encoded text and 
> that the encoding happen in search.jsp (I can make a patch to do the latter 
> if that's amenable).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
