Summary#toString always Entity encodes -- problem for
OpenSearchServlet#description field
-----------------------------------------------------------------------------------------
Key: NUTCH-257
URL: http://issues.apache.org/jira/browse/NUTCH-257
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor
All search result data we display in search results has to be explicitly
Entity.encoded when it is output in search.jsp (title, url, etc.), except
Summaries. A Summary is already Entity.encoded. This is fine when outputting
HTML, but it gets in the way when outputting anything else, such as XML. I'd
suggest we not make any presumption about how search results are used.
The problem becomes especially acute when the text language is other than
English.
Here is an example of what a Czech description field in an OpenSearchServlet
hit record looks like:
<description><span class="ellipsis"> ...
</span>V&#283;deck&aacute; knihovna v Olomouci Bezru&#269;ova
2, Olomouc 9, 779 11, &#268;esk&aacute; republika &nbsp; tel.
+420-585223441 &nbsp; fax +420-585225774 http://www.<span
class="highlight">vkol</span>.cz/ &nbsp;&nbsp;
mailto:info@<span class="highlight">vkol</span>.cz
Otev&#345;eno : &nbsp; po-p&aacute; &nbsp; 8 30 -19 00
&nbsp;&nbsp;&nbsp; so &nbsp; 9 00 -13 00
&nbsp;&nbsp;&nbsp; ne &nbsp; zav&#345;eno V katalogu s
&uacute;pln&yacute;m &#269;asov&yacute;m<span
class="ellipsis"> ... </span>03 Organizace 20/12 Odkazy 19/04 Hledej
23/03 &nbsp; 23/03 &nbsp; Po&#269;et
p&#345;&iacute;stup&#367; od 1.9.1998. Statistiky . [ ] &nbsp;
[ Nahoru ] <span class="highlight">VKOL</span></description>
Here is the same description field with Entity.encoding disabled:
<description><span class="ellipsis"> ... </span>tisky statistiky
knihovny WWW serveru středověké rukopisy studovny CD-ROM historických fondů
hlavní Internet Německé knihovny vázaných novin SVKOL viz <span
class="highlight">VKOL</span> šatna T telefonní čísla knihovny
zaměstnanců U V vazba věcný popis vedení knihovny vedoucí oddělení video
<span class="highlight">VKOL</span> volný výběr výpůjčka výroční
zpráva výstavy W webmaster WWW odkazy X Y Z - Ž zamluvení knihy zahraniční
periodika zpracování fondu<span class="highlight">VKOL</span> -
hledej Hledej [ <span class="highlight">VKOL</span> ] [ Novinky ] [
Katalog ] [ Služby ] [ Aktivity ] [ Průvodce ] [ Dokumenty ] [ Regionální fce ]
[ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové
vyhledávání, 19/04/2003 rejstřík vybraných<span class="ellipsis"> ...
</span></description>
Notice how the Czech characters in the first example are all entity encoded,
either numerically (i.e. &#NNN;) or by name (e.g. &aacute;).
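To illustrate what XML output needs instead, here is a minimal sketch. XML predefines only five entities (lt, gt, amp, quot, apos), so an escaper for XML has to touch only those characters and can pass Czech letters through unchanged, assuming the response is written in a Unicode encoding such as UTF-8. The class and method names are hypothetical and are not part of Nutch.

```java
// Hypothetical helper for illustration only; not Nutch code.
final class XmlEscapeSketch {

    /** Escapes only the five characters XML reserves; everything else is copied as-is. */
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // non-ASCII letters are left untouched
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u011b is e-caron and \u00e1 is a-acute, as in the first word of the Czech example.
        System.out.println(escapeXml("V\u011bdeck\u00e1 knihovna <VKOL> & spol."));
    }
}
```

With an escaper like this, the Czech text stays readable in the XML source and no HTML-only named entities such as &nbsp; or &aacute; end up in the feed.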
I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and
that toString() return the raw aggregation of Fragments. This would likely
require adding methods to the HitSummarizer interface, with matching additions
to NutchBean, so that callers can ask for either raw or entity encoded text.
Or, better, I'd suggest that the Summarizer never return Entity.encoded text
and that the encoding happen in search.jsp instead (I can make a patch to do
the latter if it's amenable).
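The first proposal could be sketched roughly as follows. This is a simplified stand-in, not Nutch's actual Summary class: fragments are plain strings, highlight and ellipsis markup is left out, and the encode helper is a hypothetical substitute for whatever entity encoder the real code uses.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for a summary; NOT Nutch's actual Summary class.
final class SummarySketch {
    private final List<String> fragments = new ArrayList<>();

    void add(String fragmentText) {
        fragments.add(fragmentText);
    }

    /** Raw aggregation of the fragments, with no encoding applied. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String f : fragments) {
            sb.append(f);
        }
        return sb.toString();
    }

    /** HTML-oriented form: each fragment is entity encoded before it is appended. */
    String toEntityEncodedString() {
        StringBuilder sb = new StringBuilder();
        for (String f : fragments) {
            sb.append(encode(f));
        }
        return sb.toString();
    }

    // Assumed encoder: escapes HTML markup characters and writes every
    // non-ASCII code point as a numeric character reference.
    private static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '&') sb.append("&amp;");
            else if (cp == '<') sb.append("&lt;");
            else if (cp == '>') sb.append("&gt;");
            else if (cp == '"') sb.append("&quot;");
            else if (cp > 127) sb.append("&#").append(cp).append(';');
            else sb.append((char) cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SummarySketch s = new SummarySketch();
        s.add("V\u011bdeck\u00e1 knihovna ");
        s.add("<VKOL>");
        System.out.println(s);                         // raw text, e.g. for XML writers
        System.out.println(s.toEntityEncodedString()); // encoded text, e.g. for search.jsp
    }
}
```

Under the second proposal, only the raw form would exist on the summary side, and the encoding step would move to the page that renders HTML.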