[jira] Commented: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472841 ]

[EMAIL PROTECTED] commented on NUTCH-437:

+1. I reviewed and applied the patch along with a hadoop-0.11.1-core jar in place of the current hadoop-0.10.1. Ran the unit tests and all passed. Might want to change this issue's subject: it says Hadoop 0.10.2, which implies the patch only works with this (non-existent) Hadoop release.

MapFile in Hadoop 0.10.2 has changed, must update references

Key: NUTCH-437
URL: https://issues.apache.org/jira/browse/NUTCH-437
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8.2, 0.9.0
Environment: windows xp and java
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Fix For: 0.8.2, 0.9.0
Attachments: nutch-hadoop-0.10.2-mapfile.patch

The MapFile.Writer signature has changed in Hadoop 0.10.2 to include a Configuration object. Objects in the Nutch codebase that reference MapFile.Writer will need to be updated.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-425) parse-js pollutes anchor text with base URL of source page
[ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

[EMAIL PROTECTED] updated NUTCH-425:

Attachment: nutch425.patch

parse-js pollutes anchor text with base URL of source page

Key: NUTCH-425
URL: https://issues.apache.org/jira/browse/NUTCH-425
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Attachments: nutch425.patch

The parse-js plugin always adds a URL -- usually the page base URL -- as anchor text for any link discovered while parsing javascript. Anchor text is tokenized when indexed and by default gets a heavy weighting. The upshot is that pages often rank high in search results for no reason other than that a query term appears in their (URL) anchors. See http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg06935.html for related user list postings.

Here is an extract from the linkdb exhibiting the problem:

https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks:
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
 fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
 fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMOs=6547 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMOs=6547
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-425) parse-js pollutes anchor text with base URL of source page
[ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291 ] [EMAIL PROTECTED] commented on NUTCH-425: I took a look at what is passed to parse-js both when called from parsehtml and when run by the parser passed javascript files. It doesn't look like there is anything to hand that could possibly be construed as 'anchor text' when an URL is found in javascript. Following on from this, the attached patch does the most basic 'fix'. It just sets the anchor text param to the empty string when getJSLinks is called.
[jira] Created: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse
parse-js skips parsing if found URL fails java.net.URL parse

Key: NUTCH-426
URL: https://issues.apache.org/jira/browse/NUTCH-426
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor

The parse-js plugin in getJSLinks runs a regex looking for likely URLs against a string of javascript. Any matches that do not begin with 'www' are given to java.net.URL, along with the base URL, to test 'URLness'. Often this test will fail. Currently, when it fails, Nutch logs an error and skips processing the rest of the javascript snippet.
[jira] Updated: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse
[ https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] [EMAIL PROTECTED] updated NUTCH-426: Attachment: nutch426.patch
[jira] Commented: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse
[ https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462307 ] [EMAIL PROTECTED] commented on NUTCH-426: Just attached a patch that catches the MalformedURLException, logs the failure at trace level, and then continues, picking up any further matches found in the remaining javascript.
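The behaviour described in the comment above (catch the exception, log quietly, keep scanning) can be sketched roughly as follows. This is not the contents of nutch426.patch: the class name, the method name, the deliberately loose regex, and the use of java.util.logging at FINEST as a stand-in for "trace level" are all assumptions made for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual parse-js plugin code.
class JsLinkSketch {
    private static final Logger LOG = Logger.getLogger(JsLinkSketch.class.getName());

    // Assumed, deliberately loose pattern: any quoted token without whitespace.
    private static final Pattern QUOTED = Pattern.compile("['\"]([^'\"\\s]+)['\"]");

    static List<String> extractLinks(String javascript, String baseUrl) {
        URL base;
        try {
            base = new URL(baseUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad base URL: " + baseUrl, e);
        }
        List<String> links = new ArrayList<>();
        Matcher m = QUOTED.matcher(javascript);
        while (m.find()) {
            String candidate = m.group(1);
            try {
                // Resolve the candidate against the base URL to test its 'URLness'.
                links.add(new URL(base, candidate).toString());
            } catch (MalformedURLException e) {
                // Log at the lowest level and move on to the next match
                // instead of abandoning the rest of the snippet.
                LOG.finest("Skipping " + candidate + ": " + e.getMessage());
            }
        }
        return links;
    }
}
```

The point of the design is that one bad candidate costs a single log line rather than every link that follows it in the same snippet.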
[jira] Created: (NUTCH-423) Add other index-basic fields as query plugins
Add other index-basic fields as query plugins

Key: NUTCH-423
URL: http://issues.apache.org/jira/browse/NUTCH-423
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor

The basic indexer plugin adds 'host', 'site', 'url', 'content', 'title', and 'anchor'. The query-basic plugin expands queries against the 'default' field to run against all basic indexer plugin fields. The query-url plugin adds query filtering on the 'url' field and query-site on 'site'. This patch adds plugins to filter on the remainder: 'host', 'content', 'title', and 'anchor'.
[jira] Updated: (NUTCH-423) Add other index-basic fields as query plugins
[ http://issues.apache.org/jira/browse/NUTCH-423?page=all ] [EMAIL PROTECTED] updated NUTCH-423: Attachment: other-index-basic-query-fields.patch
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:

Attachment: fixIllegalXmlChars08-v5.patch

No, the double call to getLegalXml is not intentional. It's a mistake. Thanks for finding it. I've attached yet another version (any prizes for most revisions to a patch?).

OpenSearchServlet outputs illegal xml characters

Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Assignee: Sami Siren
Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08-v5.patch, fixIllegalXmlChars08.patch

OpenSearchServlet does not check the text it outputs for illegal XML characters; depending on the search result, it's possible for OSS to output XML that is not well-formed. For example, if the text has the form feed (FF) character in it, i.e. the ASCII character at decimal position 12, the produced XML will show the FF character as '&#12;'. The character reference '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars08-v4.patch v3 mistakenly included debugging code. Attached cleaned up v4.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars08-v3.patch Version of the patch that doesn't "...process the String twice if it contains some illegal characters!". Its name is fixIllegalXmlChars08-v3.patch (be careful, it's not the last patch in the list). It was made against 414852. At least 3 different people have run into this awkward issue, going by the comments here. I petition that this is sufficient to earn a commit. Thanks.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Version: 0.8-dev (was: 0.7) Was version 0.7. Changed 'Affects Version' to 0.8-dev.
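For readers following the NUTCH-110 thread, here is a minimal sketch of a single-pass filter that drops characters outside the XML 1.0 Char production cited in the issue. It is not the attached patch and not the getLegalXml method mentioned in the comments; the class and method names are hypothetical, and dropping (rather than replacing) illegal characters is an assumption.

```java
// Hypothetical sketch; names are illustrative and not taken from the patch.
class XmlCharSketch {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the text once. The output buffer is only allocated when the
    // first illegal character is seen, so clean input is returned as-is.
    static String stripIllegalXml(String text) {
        StringBuilder out = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXml(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                // First illegal character: copy the clean prefix, skip this one.
                out = new StringBuilder(text.length());
                out.append(text, 0, i);
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }
}
```

With this shape, a string containing the form feed character from the issue description comes back without it, while tabs, newlines and carriage returns are kept.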
[jira] Created: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
CrawlDbReducer: OOME because no upper-bound on inlinks count

Key: NUTCH-269
URL: http://issues.apache.org/jira/browse/NUTCH-269
Project: Nutch
Type: Bug
Reporter: [EMAIL PROTECTED]
Priority: Trivial

A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands of inlinks (the British Foreign Office likes putting a clear.gif multiple times into each page: http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).
[jira] Updated: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ http://issues.apache.org/jira/browse/NUTCH-269?page=all ] [EMAIL PROTECTED] updated NUTCH-269: Attachment: too-many-links.patch Add configurable upper limit to amount of links we'll read.
[jira] Updated: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ http://issues.apache.org/jira/browse/NUTCH-269?page=all ] [EMAIL PROTECTED] updated NUTCH-269: Attachment: too-many-links2.patch Previous patch is useless. This one actually breaks the loop.
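The idea behind the two NUTCH-269 patches, a configurable cap on how many inlinks are read with an explicit break once the cap is reached, might look something like the sketch below. It does not reproduce too-many-links2.patch or the real CrawlDbReducer; the class name, method name and maxLinks parameter are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of bounding how many values a reducer-style loop keeps.
class InlinkCapSketch {

    // Reads at most maxLinks values from the iterator and ignores the rest,
    // so memory use is bounded no matter how many inlinks a URL has.
    static <T> List<T> readCapped(Iterator<T> values, int maxLinks) {
        List<T> kept = new ArrayList<>();
        while (values.hasNext()) {
            if (kept.size() >= maxLinks) {
                break; // the cap only helps if the loop really stops here
            }
            kept.add(values.next());
        }
        return kept;
    }
}
```

Checking the limit before calling next() means nothing beyond the cap is ever pulled into memory, which is the property the second patch comment ("actually breaks the loop") is about.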
[jira] Created: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field
Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

Key: NUTCH-257
URL: http://issues.apache.org/jira/browse/NUTCH-257
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor

All search result data we display has to be explicitly Entity.encoded when output in search.jsp (title, url, etc.), except Summaries, which are already Entity.encoded. This is fine when outputting HTML but it gets in the way when outputting anything else, XML for example. I'd suggest we not make any presumption about how search results are used.

The problem becomes especially acute when the text language is other than English. Here is an example of what a Czech description field in an OpenSearchServlet hit record looks like:

<description>&lt;span class="ellipsis"&gt; ... &lt;/span&gt;V&amp;#283;deck&amp;aacute; knihovna v Olomouci Bezru&amp;#269;ova 2, Olomouc 9, 779 11, &amp;#268;esk&amp;aacute; republika &amp;nbsp; tel. +420-585223441 &amp;nbsp; fax +420-585225774 http://www.&lt;span class="highlight"&gt;vkol&lt;/span&gt;.cz/ &amp;nbsp;&amp;nbsp; mailto:info@&lt;span class="highlight"&gt;vkol&lt;/span&gt;.cz Otev&amp;#345;eno : &amp;nbsp; po-p&amp;aacute; &amp;nbsp; 8 30 -19 00 &amp;nbsp;&amp;nbsp;&amp;nbsp; so &amp;nbsp; 9 00 -13 00 &amp;nbsp;&amp;nbsp;&amp;nbsp; ne &amp;nbsp; zav&amp;#345;eno V katalogu s &amp;uacute;pln&amp;yacute;m &amp;#269;asov&amp;yacute;m&lt;span class="ellipsis"&gt; ... &lt;/span&gt;03 Organizace 20/12 Odkazy 19/04 Hledej 23/03 &amp;nbsp; 23/03 &amp;nbsp; Po&amp;#269;et p&amp;#345;&amp;iacute;stup&amp;#367; od 1.9.1998. Statistiky . [ ] &amp;nbsp; [ Nahoru ] &lt;span class="highlight"&gt;VKOL&lt;/span&gt;</description>

Here is the same description field with Entity.encoding disabled:

<description>&lt;span class="ellipsis"&gt; ... &lt;/span&gt;tisky statistiky knihovny WWW serveru středověké rukopisy studovny CD-ROM historických fondů hlavní Internet Německé knihovny vázaných novin SVKOL viz &lt;span class="highlight"&gt;VKOL&lt;/span&gt; šatna T telefonní čísla knihovny zaměstnanců U V vazba věcný popis vedení knihovny vedoucí oddělení video &lt;span class="highlight"&gt;VKOL&lt;/span&gt; volný výběr výpůjčka výroční zpráva výstavy W webmaster WWW odkazy X Y Z - Ž zamluvení knihy zahraniční periodika zpracování fondu&lt;span class="highlight"&gt;VKOL&lt;/span&gt; - hledej Hledej [ &lt;span class="highlight"&gt;VKOL&lt;/span&gt; ] [ Novinky ] [ Katalog ] [ Služby ] [ Aktivity ] [ Průvodce ] [ Dokumenty ] [ Regionální fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové vyhledávání, 19/04/2003 rejstřík vybraných&lt;span class="ellipsis"&gt; ... &lt;/span&gt;</description>

Notice how the Czech characters in the first example are all numerically encoded, i.e. &#NNN;.

I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and that toString() return the raw aggregation of Fragments. That would likely require adding methods to the HitSummarizer interface, and to NutchBean, so callers can ask for either raw or entity-encoded text. Or, better, I'd suggest that the Summarizer never return Entity.encoded text and that the encoding happen in search.jsp (I can make a patch to do the latter if that is agreeable).
[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field
[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376997 ] [EMAIL PROTECTED] commented on NUTCH-257: I took a closer look. Turns out Summary is inherently all about rendering HTML (see the different Summary.Fragment subclasses: one for ellipsis, another for highlight. In each of these, toString wraps the fragment in some HTML 'span' markup). What about changing HitSummarizer#getSummary to return Summary instead of String or String[]? If the rendering context requires HTML, ask Summary to compose the HTML to output (Summary#toHtmlString()?). If XML, get the plain-text version of the summary (Summary#toString()).
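The split proposed in the NUTCH-257 comment, a summary object that can render itself either as plain text or as entity-encoded HTML, could be sketched as below. This is not Nutch's Summary class: the SummarySketch name, the Kind enum and the escapeHtml helper are hypothetical, and only the toString()/toHtmlString() pairing and the 'highlight' and 'ellipsis' span classes come from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the real org.apache.nutch Summary class.
class SummarySketch {
    enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

    private static class Fragment {
        final Kind kind;
        final String text;
        Fragment(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    private final List<Fragment> fragments = new ArrayList<>();

    SummarySketch add(Kind kind, String text) {
        fragments.add(new Fragment(kind, text));
        return this;
    }

    // Plain text: the raw aggregation of fragments, no markup and no encoding.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    // HTML: text is entity-encoded here, at render time, and wrapped in spans.
    String toHtmlString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            String escaped = escapeHtml(f.text);
            switch (f.kind) {
                case HIGHLIGHT:
                    sb.append("<span class=\"highlight\">").append(escaped).append("</span>");
                    break;
                case ELLIPSIS:
                    sb.append("<span class=\"ellipsis\">").append(escaped).append("</span>");
                    break;
                default:
                    sb.append(escaped);
            }
        }
        return sb.toString();
    }

    // Minimal escaping of markup characters only; other characters pass through.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Under this arrangement an XML producer such as OpenSearchServlet would take toString() and apply its own XML escaping once, so the double encoding shown in the first Czech example would not occur.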
[jira] Commented: (NUTCH-256) Cannot open filename ....index.done.crc
[ http://issues.apache.org/jira/browse/NUTCH-256?page=comments#action_12376999 ] [EMAIL PROTECTED] commented on NUTCH-256: Works for me. Thanks. Please close as fixed.
[jira] Created: (NUTCH-256) Cannot open filename ....index.done.crc
Cannot open filename index.done.crc

Key: NUTCH-256
URL: http://issues.apache.org/jira/browse/NUTCH-256
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor

Trying to copy indices out of DFS I always get:

[bregeon] workspace ./hadoop/bin/hadoop dfs -get outputs .
060427 160317 parsing file:/home/stack/workspace/hadoop-local-conf/hadoop-default.xml
060427 160317 parsing file:/home/stack/workspace/hadoop-local-conf/hadoop-site.xml
060427 160318 No FS indicated, using default:localhost:9001
060427 160318 Client connection to 127.0.0.1:9001: starting
060427 160318 Problem opening checksum file: /user/stack/outputs/indexes/part-0/index.done. Ignoring with exception org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /user/stack/outputs/indexes/part-0/.index.done.crc
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:589)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)
[jira] Updated: (NUTCH-256) Cannot open filename ....index.done.crc
[ http://issues.apache.org/jira/browse/NUTCH-256?page=all ] [EMAIL PROTECTED] updated NUTCH-256: Attachment: index.done.crc.patch Ensure creation of companion index.done .crc file
[jira] Created: (NUTCH-190) ParseUtil drops reason for failed parse
ParseUtil drops reason for failed parse
---

Key: NUTCH-190
URL: http://issues.apache.org/jira/browse/NUTCH-190
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: linux
Reporter: [EMAIL PROTECTED]
Priority: Minor

Doing the below:

    Parse parse;
    ParseStatus parseStatus;
    try {
      parse = ParseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + url + ": " + parseStatus);
      parse = null;
    }

...on failure, the LOG.warning never prints out the reason for failure. Here's an example:

    Error parsing: http://www.dfrc.nasa.gov/DTRS/1967/PDF/H-478.pdf: failed(0,0).

ParseUtil is dropping messages lovingly crafted by parsers.
[jira] Updated: (NUTCH-190) ParseUtil drops reason for failed parse
[ http://issues.apache.org/jira/browse/NUTCH-190?page=all ]

[EMAIL PROTECTED] updated NUTCH-190:

Attachment: ParseUtil_drops_failure_reason.patch

Attached is a suggested patch against revision 369598.

ParseUtil drops reason for failed parse
---

Key: NUTCH-190
URL: http://issues.apache.org/jira/browse/NUTCH-190
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: linux
Reporter: [EMAIL PROTECTED]
Priority: Minor
Attachments: ParseUtil_drops_failure_reason.patch

Doing the below:

    Parse parse;
    ParseStatus parseStatus;
    try {
      parse = ParseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + url + ": " + parseStatus);
      parse = null;
    }

...on failure, the LOG.warning never prints out the reason for failure. Here's an example:

    Error parsing: http://www.dfrc.nasa.gov/DTRS/1967/PDF/H-478.pdf: failed(0,0).

ParseUtil is dropping messages lovingly crafted by parsers.
[jira] Commented: (NUTCH-190) ParseUtil drops reason for failed parse
[ http://issues.apache.org/jira/browse/NUTCH-190?page=comments#action_12364145 ]

[EMAIL PROTECTED] commented on NUTCH-190:

Here's an example of failure output after the patch is applied:

    060126 141413 task_m_bx2ifn Error parsing: http://techreports.jpl.nasa.gov/2000/00-1147.pdf: failed(2,202): Content truncated at 102013 bytes. Parser can't handle incomplete application/pdf file

ParseUtil drops reason for failed parse
---

Key: NUTCH-190
URL: http://issues.apache.org/jira/browse/NUTCH-190
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: linux
Reporter: [EMAIL PROTECTED]
Priority: Minor
Attachments: ParseUtil_drops_failure_reason.patch

Doing the below:

    Parse parse;
    ParseStatus parseStatus;
    try {
      parse = ParseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + url + ": " + parseStatus);
      parse = null;
    }

...on failure, the LOG.warning never prints out the reason for failure. Here's an example:

    Error parsing: http://www.dfrc.nasa.gov/DTRS/1967/PDF/H-478.pdf: failed(0,0).

ParseUtil is dropping messages lovingly crafted by parsers.
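The difference between "failed(0,0)" and "failed(2,202): Content truncated ..." comes down to whether the status object carries the parser's message through to its string form. The sketch below illustrates that idea with a hypothetical FailureStatus class; it is not Nutch's ParseStatus and not the attached patch, and the meaning of the two numeric codes is assumed to be a major/minor pair.

```java
// Hypothetical stand-in for a parse status: two numeric codes plus an
// optional human-readable reason. Not the Nutch ParseStatus class.
final class FailureStatus {
    private final int major;
    private final int minor;
    private final String reason; // may be null when no reason was recorded

    FailureStatus(int major, int minor, String reason) {
        this.major = major;
        this.minor = minor;
        this.reason = reason;
    }

    // Wraps an exception without losing its message.
    static FailureStatus fromException(int major, int minor, Throwable t) {
        String msg = t.getMessage();
        return new FailureStatus(major, minor, msg != null ? msg : t.toString());
    }

    @Override
    public String toString() {
        String base = "failed(" + major + "," + minor + ")";
        // Append the reason only when there is one, so callers that log the
        // status get the parser's message for free.
        return (reason == null || reason.isEmpty()) ? base : base + ": " + reason;
    }
}
```

With this shape, any intermediate layer that rebuilds the status has to copy the reason along with the codes; dropping it reproduces the bare "failed(0,0)" output from the report.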
[jira] Commented: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)
[ http://issues.apache.org/jira/browse/NUTCH-130?page=comments#action_12358981 ]

[EMAIL PROTECTED] commented on NUTCH-130:

Need to do the same for the plugin compile:

$ /usr/local/bin/svn diff src/plugin/build-plugin.xml
Index: src/plugin/build-plugin.xml
===
--- src/plugin/build-plugin.xml (revision 350057)
+++ src/plugin/build-plugin.xml (working copy)
@@ -85,6 +85,8 @@
      includes="**/*.java"
      destdir="${build.classes}"
      debug="${javac.debug}"
+     target="1.4"
+     source="1.4"
      deprecation="${javac.deprecation}">
       <classpath refid="classpath"/>
     </javac>

Be explicit about target JVM when building (1.4.x?)
---

Key: NUTCH-130
URL: http://issues.apache.org/jira/browse/NUTCH-130
Project: Nutch
Type: Improvement
Reporter: [EMAIL PROTECTED]
Priority: Minor

Below is a patch for nutch build.xml. It stipulates that the target JVM is 1.4.x. Without an explicit target, a nutch built with 1.5.x java defaults to a 1.5.x java target and won't run in a 1.4.x JVM. Can be annoying (from the ant javac doc, regarding the target attribute: "We highly recommend to always specify this attribute.").

[debord 282] nutch svn diff -u build.xml
Subcommand 'diff' doesn't accept option '-u [--show-updates]'
Type 'svn help diff' for usage.
[debord 283] nutch svn diff build.xml
Index: build.xml
===
--- build.xml (revision 349779)
+++ build.xml (working copy)
@@ -72,6 +72,8 @@
      destdir="${build.classes}"
      debug="${debug}"
      optimize="${optimize}"
+     target="1.4"
+     source="1.4"
      deprecation="${deprecation}">
       <classpath refid="classpath"/>
     </javac>
[jira] Created: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)
Be explicit about target JVM when building (1.4.x?)
---

Key: NUTCH-130
URL: http://issues.apache.org/jira/browse/NUTCH-130
Project: Nutch
Type: Improvement
Reporter: [EMAIL PROTECTED]
Priority: Minor

Below is a patch for nutch build.xml. It stipulates that the target JVM is 1.4.x. Without an explicit target, a nutch built with 1.5.x java defaults to a 1.5.x java target and won't run in a 1.4.x JVM. Can be annoying (from the ant javac doc, regarding the target attribute: "We highly recommend to always specify this attribute.").

[debord 282] nutch svn diff -u build.xml
Subcommand 'diff' doesn't accept option '-u [--show-updates]'
Type 'svn help diff' for usage.
[debord 283] nutch svn diff build.xml
Index: build.xml
===
--- build.xml (revision 349779)
+++ build.xml (working copy)
@@ -72,6 +72,8 @@
      destdir="${build.classes}"
      debug="${debug}"
      optimize="${optimize}"
+     target="1.4"
+     source="1.4"
      deprecation="${deprecation}">
       <classpath refid="classpath"/>
     </javac>
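One way to confirm which JVM a compiled class targets is to read the version field in the class-file header: after the 0xCAFEBABE magic come a two-byte minor and a two-byte major version, where major 48 corresponds to a 1.4 target and 49 to a 1.5 target. The following is a minimal sketch, not part of the patch; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;

// Illustrative helper (hypothetical name): reads the major version from the
// first 8 bytes of a compiled .class file.
final class ClassFileVersion {

    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default, as in class files
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing class-file magic number");
        }
        buf.getShort();                 // minor version, skipped
        return buf.getShort() & 0xFFFF; // major version as an unsigned 16-bit value
    }
}
```

Feeding it the bytes of a class built with target="1.4" should give 48; a class built with a 1.5.x compiler and no explicit target should give 49, which a 1.4.x JVM refuses to load.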
[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12357300 ]

[EMAIL PROTECTED] commented on NUTCH-110:

Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities (first by the new toValidXmlText method, second by the javax.xml.transform.Transformer used by OpenSearchServlet). Use the original patch, fixIllegalXmlChars.patch, to address the problem described in this issue.

OpenSearchServlet outputs illegal xml characters
---

Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if text has the FF character in it -- i.e. the ASCII character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
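The double-encoding described in this comment can be reproduced with the JDK alone: a javax.xml.transform.Transformer escapes '&' when it serializes a DOM text node, so text that was already entity-encoded gets escaped a second time. This is a minimal sketch, not code from OpenSearchServlet; the class name and the "title" element are chosen only for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative demo (hypothetical class): serializes one element holding the
// given text through the default identity Transformer.
final class DoubleEncodingDemo {

    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode(text));
            doc.appendChild(title);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked JAXP exceptions are wrapped to keep the demo signature simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("a & b"));     // raw text: escaped once by the serializer
        System.out.println(serialize("a &amp; b")); // pre-escaped text: escaped again
    }
}
```

The second call yields "a &amp;amp; b" inside the element, which is why escaping the five special characters before handing text to the transformer is counterproductive, while merely dropping illegal characters is safe.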
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:

Attachment: NUTCH-110-version2.patch

Patch version 2. This patch benefits from discussion held on the nutch dev list. It differs from the first in that it handles ALL illegal XML characters, entity encoding the 5 'special characters' AND (silently) dropping characters outside the xml legal range of characters. The previous patch did only the latter task, letting the configured transformer/DOM Serializer handle entity escaping. This patch also differs from patch version 1 in that it moves the method that processes the xml out into util.StringUtil, the assumption being that not only OpenSearchServlet needs to make text safe to include in xml.

The core method, StringUtil#toValidXmlText, was authored by Dawid Weiss and was taken from carrot2 XMLSerializerHelper. Below is an excerpt from mail on nutch dev in which he grants permission to copy toValidXmlText.

Message-ID: [EMAIL PROTECTED]
Date: Fri, 14 Oct 2005 08:42:48 +0200
From: Dawid Weiss [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
...
> So, will I amend the patch in NUTCH-110 so it uses XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?

Copy the method's contents. It doesn't really make sense to copy the entire class just for this method. Good luck.
D.

OpenSearchServlet outputs illegal xml characters
---

Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if text has the FF character in it -- i.e. the ASCII character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
OpenSearchServlet outputs illegal xml characters
---

Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if text has the FF character in it -- i.e. the ASCII character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:

Attachment: fixIllegalXmlChars.patch

Attached patch runs all xml text through a check for bad xml characters. This patch is brutal, silently dropping illegal characters. I made the patch after hunting through xalan, the jdk, and nutch itself for a method that would do the above filtering, but was unable to find any such method -- perhaps an oversight on my part?

OpenSearchServlet outputs illegal xml characters
---

Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Attachments: fixIllegalXmlChars.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if text has the FF character in it -- i.e. the ASCII character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
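For readers without the attachment, the kind of filtering described here can be sketched as follows. This is not the contents of fixIllegalXmlChars.patch; the class and method names are hypothetical, and the legal ranges are those of the XML 1.0 Char production linked in the issue (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).

```java
// Illustrative filter (hypothetical names): silently drops every code point
// that falls outside the XML 1.0 Char production.
final class XmlCharFilter {

    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripIllegalXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // handles surrogate pairs as one code point
            if (isLegalXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Called on text containing the form feed from the example, e.g. stripIllegalXmlChars("page\fbreak"), it returns "pagebreak". Entity escaping of '&', '<' and friends is deliberately left to the serializer.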