[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ]
Doug Cutting commented on NUTCH-257:
------------------------------------

I'd vote to never have Summary#toString() perform entity encoding, to fix search.jsp to encode things itself, and *not* to add a new Summary#toEntityEncodedString() method.

> Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field
> -----------------------------------------------------------------------------------------
>
> Key: NUTCH-257
> URL: http://issues.apache.org/jira/browse/NUTCH-257
> Project: Nutch
> Type: Bug
> Components: searcher
> Versions: 0.8-dev
> Reporter: [EMAIL PROTECTED]
> Priority: Minor
>
> All search result data we display in search results has to be explicitly
> Entity.encoded when output in search.jsp (title, url, etc.), except Summaries.
> They are already Entity.encoded. This is fine when outputting HTML, but it gets
> in the way when outputting anything else -- XML, for example. I'd suggest we
> not make any presumption about how search results are used.
> The problem becomes especially acute when the text language is other than
> English.
> Here is an example of what a Czech description field in an OpenSearchServlet
> hit record looks like:
> <description><span class="ellipsis"> ...
> </span>V&#283;deck&aacute; knihovna v Olomouci
> Bezru&#269;ova 2, Olomouc 9, 779 11, &#268;esk&aacute; republika
> &nbsp; tel. +420-585223441 &nbsp; fax +420-585225774
> http://www.<span class="highlight">vkol</span>.cz/
> &nbsp;&nbsp; mailto:info@<span
> class="highlight">vkol</span>.cz Otev&#345;eno : &nbsp;
> po-p&aacute; &nbsp; 8 30 -19 00 &nbsp;&nbsp;&nbsp; so
> &nbsp; 9 00 -13 00 &nbsp;&nbsp;&nbsp; ne &nbsp;
> zav&#345;eno V katalogu s &uacute;pln&yacute;m
> &#269;asov&yacute;m<span class="ellipsis"> ... </span>03
> Organizace 20/12 Odkazy 19/04 Hledej 23/03 &nbsp; 23/03 &nbsp;
> Po&#269;et p&#345;&iacute;stup&#367; od 1.9.1998. Statistiky
> .
> [ ] &nbsp; [ Nahoru ] <span
> class="highlight">VKOL</span></description>
> Here is the same description field with Entity.encoding disabled:
> <description><span class="ellipsis"> ... </span>tisky statistiky
> knihovny WWW serveru st?edov?ké rukopisy studovny CD-ROM historických fond?
> hlavní Internet N?mecké knihovny vázaných novin SVKOL viz <span
> class="highlight">VKOL</span> ?atna T telefonní ?ísla knihovny
> zam?stnanc? U V vazba v?cný popis vedení knihovny vedoucí odd?lení video
> <span class="highlight">VKOL</span> volný výb?r výp?j?ka výro?ní
> zpráva výstavy W webmaster WWW odkazy X Y Z - ? zamluvení knihy zahrani?ní
> periodika zpracování fondu<span class="highlight">VKOL</span> -
> hledej Hledej [ <span class="highlight">VKOL</span> ] [ Novinky ]
> [ Katalog ] [ Slu?by ] [ Aktivity ] [ Pr?vodce ] [ Dokumenty ] [ Regionální
> fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové
> vyhledávání, 19/04/2003 rejst?ík vybraných<span class="ellipsis"> ...
> </span></description>
> Notice how the Czech characters in the first are all numerically encoded,
> i.e. &#NNN;.
> I'd suggest that Summary#toString() become Summary#toEntityEncodedString()
> and that toString() return the raw aggregation of Fragments. This would likely
> require adding methods to the HitSummarizer interface, plus a matching
> addition to NutchBean, so that callers can ask for either raw or
> entity-encoded text. Or, better, I'd suggest that Summarizer never return
> Entity.encoded text. Let that happen in search.jsp (I can make a patch to do
> the latter if it's amenable).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
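
For illustration only, here is a minimal sketch of the split discussed in this issue: the summary hands back raw text, and each output path does its own encoding. Every name in it (SummaryEncodingSketch, Fragment, Kind, toRawString, escapeMarkup, toHtml, toXmlDescription) is hypothetical and is not the actual Nutch API. It also assumes that the XML output wants plain text without highlight spans, which is a design choice rather than anything decided in the issue.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a summary; NOT the real Nutch Summary class. */
final class SummaryEncodingSketch {

    /** Assumed fragment kinds: plain text, a highlighted term, or an ellipsis. */
    enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

    static final class Fragment {
        final Kind kind;
        final String text;
        Fragment(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    /** Raw aggregation of fragments: no markup and no entity encoding. */
    static String toRawString(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.kind == Kind.ELLIPSIS ? " ... " : f.text);
        }
        return sb.toString();
    }

    /** Escapes only the characters that are special in HTML and XML;
     *  non-ASCII letters pass through untouched. */
    static String escapeMarkup(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** What an HTML page such as search.jsp could do: add its own markup
     *  and encode the text at the point of output. */
    static String toHtml(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            switch (f.kind) {
                case HIGHLIGHT:
                    sb.append("<span class=\"highlight\">")
                      .append(escapeMarkup(f.text)).append("</span>");
                    break;
                case ELLIPSIS:
                    sb.append("<span class=\"ellipsis\"> ... </span>");
                    break;
                default:
                    sb.append(escapeMarkup(f.text));
            }
        }
        return sb.toString();
    }

    /** What an XML producer could do: escape the raw text exactly once. */
    static String toXmlDescription(List<Fragment> fragments) {
        return "<description>" + escapeMarkup(toRawString(fragments)) + "</description>";
    }

    public static void main(String[] args) {
        List<Fragment> fragments = Arrays.asList(
            new Fragment(Kind.ELLIPSIS, ""),
            new Fragment(Kind.TEXT, "V\u011bdeck\u00e1 knihovna "),
            new Fragment(Kind.HIGHLIGHT, "VKOL"));
        System.out.println(toHtml(fragments));
        System.out.println(toXmlDescription(fragments));
    }
}
```

With this arrangement the Czech characters are never turned into numeric or named entities by the summary itself, so an XML consumer receives them as ordinary characters, provided the response is served in an encoding such as UTF-8 that can represent them.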