[jira] Commented: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references

2007-02-13 Thread [EMAIL PROTECTED] (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472841
 ] 

[EMAIL PROTECTED] commented on NUTCH-437:
-

+1.  I reviewed and applied the patch, with a hadoop-0.11.1-core jar in place 
of the current hadoop-0.10.1.  Ran the unit tests and all passed.

Might want to change this issue's subject.  It says hadoop 0.10.2.  The 
implication is that the patch only works with this (non-existent) hadoop.

 MapFile in Hadoop 0.10.2 has changed, must update references
 

 Key: NUTCH-437
 URL: https://issues.apache.org/jira/browse/NUTCH-437
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.2, 0.9.0
 Environment: windows xp and java
Reporter: Dennis Kubes
 Assigned To: Dennis Kubes
 Fix For: 0.8.2, 0.9.0

 Attachments: nutch-hadoop-0.10.2-mapfile.patch


 The MapFile.Writer signature has changed in hadoop 0.10.2 to include a 
 Configuration object.  Objects in the Nutch codebase that reference 
 MapFile.Writer will need to be updated.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

[EMAIL PROTECTED] updated NUTCH-425:


Attachment: nutch425.patch

 parse-js pollutes anchor text with base URL of source page
 --

 Key: NUTCH-425
 URL: https://issues.apache.org/jira/browse/NUTCH-425
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
 Attachments: nutch425.patch


 The parse-js plugin always adds a URL -- usually the page's base URL -- as 
 anchor text for any link discovered while parsing javascript.  Anchor text is 
 tokenized when indexed and by default gets a heavy weighting.  The upshot is 
 that pages often show high in search results for no reason other than that 
 the query term appears in (URL) anchors.  
 See http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg06935.html 
 for related user list postings.
 Here is an extract from the linkdb exhibiting the problem:
 https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks: 
  fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx 
 anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
  fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx 
 anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
  fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 
 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
  fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx 
 anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
  fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMOs=6547 
 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMOs=6547
  fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx 
 anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
  fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx 
 anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291
 ] 

[EMAIL PROTECTED] commented on NUTCH-425:
-

I took a look at what is passed to parse-js, both when it is called from 
parse-html and when the parser is handed javascript files directly.  It doesn't 
look like there is anything to hand that could possibly be construed as 
'anchor text' when a URL is found in javascript.  Following on from this, the 
attached patch does the most basic 'fix': it just sets the anchor text param 
to the empty string when getJSLinks is called.

 parse-js pollutes anchor text with base URL of source page
 --

 Key: NUTCH-425
 URL: https://issues.apache.org/jira/browse/NUTCH-425
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
 Attachments: nutch425.patch






[jira] Created: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)
parse-js skips parsing if found URL fails java.net.URL parse


 Key: NUTCH-426
 URL: https://issues.apache.org/jira/browse/NUTCH-426
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor


The parse-js plugin in getJSLinks tries a regex looking for likely URLs against 
a string of javascript.  Any matches that do not begin with 'www' are given to 
java.net.URL, along with the base URL, to test 'URLness'.  Often this test will 
fail.  Currently, when it fails, nutch skips processing any more of the 
javascript snippet and logs an error.





[jira] Updated: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

[EMAIL PROTECTED] updated NUTCH-426:


Attachment: nutch426.patch

 parse-js skips parsing if found URL fails java.net.URL parse
 

 Key: NUTCH-426
 URL: https://issues.apache.org/jira/browse/NUTCH-426
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor
 Attachments: nutch426.patch






[jira] Commented: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462307
 ] 

[EMAIL PROTECTED] commented on NUTCH-426:
-

Just attached a patch that catches the MalformedURLException, logs the failure 
at trace level, and then continues on to pick up any further matches found in 
the remaining javascript.
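
The catch-and-continue behaviour described above can be sketched roughly as 
follows.  This is an illustration only, not the code in nutch426.patch: the 
class name, the method name and the regex are all hypothetical, and the 
trace-level logging is reduced to a comment.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and pattern are assumptions, not Nutch code.
@SuppressWarnings("deprecation") // java.net.URL constructors are deprecated in recent JDKs
class JsLinkSketch {

  // Illustrative pattern only: a quoted string containing a '.' or '/'.
  private static final Pattern CANDIDATE =
      Pattern.compile("[\"']([^\"'\\s]+[./][^\"'\\s]*)[\"']");

  // Returns every candidate that java.net.URL accepts, resolved against baseUrl.
  static List<String> extractLinks(String javascript, String baseUrl) {
    URL base;
    try {
      base = new URL(baseUrl);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("bad base URL: " + baseUrl, e);
    }
    List<String> links = new ArrayList<>();
    Matcher m = CANDIDATE.matcher(javascript);
    while (m.find()) {
      String candidate = m.group(1);
      try {
        links.add(new URL(base, candidate).toString());
      } catch (MalformedURLException e) {
        // A real implementation would log at trace level here.
        // Skip only this candidate and keep scanning the rest of the snippet.
        continue;
      }
    }
    return links;
  }

  public static void main(String[] args) {
    String js = "var a = 'bogusproto://x/y'; var b = '/next.html';";
    System.out.println(extractLinks(js, "http://example.com/dir/page.html"));
  }
}
```

The point of the design is that the try/catch sits inside the loop, so one 
rejected match costs only that match and not the rest of the snippet.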

 parse-js skips parsing if found URL fails java.net.URL parse
 

 Key: NUTCH-426
 URL: https://issues.apache.org/jira/browse/NUTCH-426
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor
 Attachments: nutch426.patch






[jira] Created: (NUTCH-423) Add other index-basic fields as query plugins

2006-12-28 Thread [EMAIL PROTECTED] (JIRA)
Add other index-basic fields as query plugins
-

 Key: NUTCH-423
 URL: http://issues.apache.org/jira/browse/NUTCH-423
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor


The basic indexer plugin adds 'host', 'site', 'url', 'content', 'title', and 
'anchor'.  The query-basic plugin expands queries against the 'default' field 
to run against all basic indexer plugin fields.  The query-url plugin adds 
query filtering on the 'url' field and query-site on 'site'.  This patch adds 
plugins to filter on the remainder: 'host', 'content', 'title', and 'anchor'.





[jira] Updated: (NUTCH-423) Add other index-basic fields as query plugins

2006-12-28 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-423?page=all ]

[EMAIL PROTECTED] updated NUTCH-423:


Attachment: other-index-basic-query-fields.patch

 Add other index-basic fields as query plugins
 -

 Key: NUTCH-423
 URL: http://issues.apache.org/jira/browse/NUTCH-423
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor
 Attachments: other-index-basic-query-fields.patch






[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-20 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:


Attachment: fixIllegalXmlChars08-v5.patch

No, the double call to getLegalXml is not intentional.  It's a mistake.  Thanks 
for finding it.

I've attached yet another version (any prizes for most revisions to a patch?).

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
 Assignee: Sami Siren
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, 
 fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, 
 fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08-v5.patch, 
 fixIllegalXmlChars08.patch

 OpenSearchServlet does not check text-to-output for illegal xml characters; 
 depending on the search result, it's possible for OSS to output xml that is 
 not well-formed.  For example, if the text has the FF (form feed) character 
 in it -- i.e. the ascii character at position (decimal) 12 -- the produced 
 XML will show the FF character as '&#12;'.  The character/entity '&#12;' is 
 not legal in XML according to 
 http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
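
 A minimal sketch of the kind of check involved, assuming the simplest policy 
 of dropping anything outside the Char production of the XML 1.0 
 recommendation linked above.  The class and method names are hypothetical; 
 this is not the getLegalXml method from the attached patches.

```java
// Hypothetical sketch; not the code from the fixIllegalXmlChars patches.
class XmlCharSketch {

  // True if the code point is allowed by the XML 1.0 Char production:
  // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  static boolean isLegalXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  // Returns a copy of text with every illegal code point removed.
  static String stripIllegalXmlChars(String text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (isLegalXmlChar(cp)) {
        out.appendCodePoint(cp);
      }
      // Advance by one or two chars depending on the code point's width.
      i += Character.charCount(cp);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // The form feed (decimal 12) is dropped; the tab is kept.
    System.out.println(stripIllegalXmlChars("before\fafter\tend"));
  }
}
```

 Dropping the character is only one option; replacing it with a space would 
 be an equally reasonable choice for summary text.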




[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-19 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:


Attachment: fixIllegalXmlChars08-v4.patch

v3 mistakenly included debugging code.

Attached cleaned up v4.

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, 
 fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, 
 fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08.patch




[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:


Attachment: fixIllegalXmlChars08-v3.patch

Version of the patch that doesn't process the String twice if it contains some 
illegal characters.  Its name is fixIllegalXmlChars08-v3.patch (be careful, 
it's not the last patch in the list).  It was made against 414852.

At least 3 different people have run into this awkward issue, going by the 
comments on this issue.  I petition that this is sufficient to earn a commit.

Thanks.

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.7
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, 
 fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, 
 fixIllegalXmlChars08.patch




[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:


Version: 0.8-dev
 (was: 0.7)

Was version 0.7.  Changed 'Affects Version' to 0.8-dev.

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, 
 fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, 
 fixIllegalXmlChars08.patch




[jira] Created: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2006-05-15 Thread [EMAIL PROTECTED] (JIRA)
CrawlDbReducer: OOME because no upper-bound on inlinks count


 Key: NUTCH-269
 URL: http://issues.apache.org/jira/browse/NUTCH-269
 Project: Nutch
Type: Bug

Reporter: [EMAIL PROTECTED]
Priority: Trivial


A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands of 
inlinks (the British Foreign Office likes putting a clear.gif multiple times 
into each page: 
http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).




[jira] Updated: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2006-05-15 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-269?page=all ]

[EMAIL PROTECTED] updated NUTCH-269:


Attachment: too-many-links.patch

Add a configurable upper limit on the number of links we'll read. 

 CrawlDbReducer: OOME because no upper-bound on inlinks count
 

  Key: NUTCH-269
  URL: http://issues.apache.org/jira/browse/NUTCH-269
  Project: Nutch
 Type: Bug

 Reporter: [EMAIL PROTECTED]
 Priority: Trivial
  Attachments: too-many-links.patch




[jira] Updated: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2006-05-15 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-269?page=all ]

[EMAIL PROTECTED] updated NUTCH-269:


Attachment: too-many-links2.patch

Previous patch is useless.  This one actually breaks the loop.  
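
The idea of capping how many links are read, and actually leaving the loop 
once the cap is hit, can be sketched like this.  It is an illustration only: 
the class, the method and the maxLinks parameter are hypothetical, and how the 
patch reads its limit from configuration is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; not the code in too-many-links2.patch.
class InlinkCapSketch {

  // Reads at most maxLinks values, then stops consuming the iterator.
  static <T> List<T> readCapped(Iterator<T> values, int maxLinks) {
    List<T> kept = new ArrayList<>();
    while (values.hasNext()) {
      if (kept.size() >= maxLinks) {
        // Break out of the loop so no further values are buffered in memory.
        break;
      }
      kept.add(values.next());
    }
    return kept;
  }

  public static void main(String[] args) {
    Iterator<String> inlinks = Collections.nCopies(500000, "clear.gif").iterator();
    System.out.println(readCapped(inlinks, 10000).size());
  }
}
```

Memory use is then bounded by the cap rather than by how many pages happen to 
link to the URL.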

 CrawlDbReducer: OOME because no upper-bound on inlinks count
 

  Key: NUTCH-269
  URL: http://issues.apache.org/jira/browse/NUTCH-269
  Project: Nutch
 Type: Bug

 Reporter: [EMAIL PROTECTED]
 Priority: Trivial
  Attachments: too-many-links.patch, too-many-links2.patch




[jira] Created: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread [EMAIL PROTECTED] (JIRA)
Summary#toString always Entity encodes -- problem for 
OpenSearchServlet#description field
-

 Key: NUTCH-257
 URL: http://issues.apache.org/jira/browse/NUTCH-257
 Project: Nutch
Type: Bug

  Components: searcher  
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor


All search result data we display in search results has to be explicitly 
Entity.encoded when output in search.jsp (title, url, etc.), except Summaries.  
These are already Entity.encoded.  This is fine when outputting HTML but it 
gets in the way when outputting anything else -- as xml, for example.  I'd 
suggest we not make any presumption about how search results are used.

The problem becomes especially acute when the text language is other than 
English.

Here is an example of what a Czech description field in an OpenSearchServlet 
hit record looks like:

<description>&lt;span class=ellipsis&gt; ... 
&lt;/span&gt;V&amp;#283;deck&amp;aacute; knihovna v Olomouci Bezru&amp;#269;ova 
2, Olomouc 9, 779 11, &amp;#268;esk&amp;aacute; republika &amp;nbsp; tel. 
+420-585223441 &amp;nbsp; fax +420-585225774 http://www.&lt;span 
class=highlight&gt;vkol&lt;/span&gt;.cz/ &amp;nbsp;&amp;nbsp; 
mailto:info@&lt;span class=highlight&gt;vkol&lt;/span&gt;.cz 
Otev&amp;#345;eno : &amp;nbsp; po-p&amp;aacute; &amp;nbsp; 8 30 -19 00 
&amp;nbsp;&amp;nbsp;&amp;nbsp; so &amp;nbsp; 9 00 -13 00 
&amp;nbsp;&amp;nbsp;&amp;nbsp; ne &amp;nbsp; zav&amp;#345;eno V katalogu s 
&amp;uacute;pln&amp;yacute;m &amp;#269;asov&amp;yacute;m&lt;span 
class=ellipsis&gt; ... &lt;/span&gt;03 Organizace 20/12 Odkazy 19/04 Hledej 
23/03 &amp;nbsp; 23/03 &amp;nbsp; Po&amp;#269;et 
p&amp;#345;&amp;iacute;stup&amp;#367; od 1.9.1998. Statistiky . [ ] &amp;nbsp; 
[ Nahoru ] &lt;span class=highlight&gt;VKOL&lt;/span&gt;</description>

Here is same description field with Entity.encoding disabled:

<description>&lt;span class=ellipsis&gt; ... &lt;/span&gt;tisky statistiky 
knihovny WWW serveru středověké rukopisy studovny CD-ROM historických fondů 
hlavní Internet Německé knihovny vázaných novin SVKOL viz &lt;span 
class=highlight&gt;VKOL&lt;/span&gt; šatna T telefonní čísla knihovny 
zaměstnanců U V vazba věcný popis vedení knihovny vedoucí oddělení video 
&lt;span class=highlight&gt;VKOL&lt;/span&gt; volný výběr výpůjčka výroční 
zpráva výstavy W webmaster WWW odkazy X Y Z - Ž zamluvení knihy zahraniční 
periodika zpracování fondu&lt;span class=highlight&gt;VKOL&lt;/span&gt; - 
hledej Hledej [ &lt;span class=highlight&gt;VKOL&lt;/span&gt; ] [ Novinky ] [ 
Katalog ] [ Služby ] [ Aktivity ] [ Průvodce ] [ Dokumenty ] [ Regionální fce ] 
[ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové 
vyhledávání, 19/04/2003 rejstřík vybraných&lt;span class=ellipsis&gt; ... 
&lt;/span&gt;</description>

Notice how the Czech characters in the first are all numerically encoded: i.e. 
&#NNN;.

I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and 
that toString return the raw aggregation of Fragments.  This would likely 
require adding methods to the HitSummarizer interface, and to NutchBean, so 
callers can ask for either raw or entity-encoded text.  Or, better, I'd 
suggest that Summarizer never return Entity.encoded text.  Let that happen 
in search.jsp (I can make a patch to do the latter if that's amenable).




[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread [EMAIL PROTECTED] (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376997 ] 

[EMAIL PROTECTED] commented on NUTCH-257:
-

I took a closer look.  Turns out Summary is inherently all about rendering HTML 
(see the different Summary.Fragment subclasses: one for ellipsis, another for 
highlight.  In each of these, toString wraps the fragment in some HTML 
'span' markup).

What about changing HitSummarizer#getSummary to return Summary instead of 
String or String[]?  If the rendering context requires HTML, ask Summary to 
compose the HTML to output (Summary#toHtmlString()?).  If XML, get the 
plain-text version of the summary (Summary#toString()).
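
A rough sketch of the split proposed above, assuming a summary is a list of 
fragments that each know how to render themselves two ways.  Every name here 
(SummarySketch, toPlain, toHtml, escape) is hypothetical and the classes are 
stand-ins, not the existing Nutch Summary code; only toHtmlString() and 
toString() follow the names floated in the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Summary that can render as plain text or HTML.
class SummarySketch {

  // A plain run of text.
  static class Fragment {
    final String text;
    Fragment(String text) { this.text = text; }
    String toPlain() { return text; }
    String toHtml() { return escape(text); }
  }

  // A query-term hit, wrapped in a span only when rendering HTML.
  static class Highlight extends Fragment {
    Highlight(String text) { super(text); }
    @Override String toHtml() {
      return "<span class=\"highlight\">" + escape(text) + "</span>";
    }
  }

  // A gap marker between excerpts.
  static class Ellipsis extends Fragment {
    Ellipsis() { super(" ... "); }
    @Override String toHtml() { return "<span class=\"ellipsis\"> ... </span>"; }
  }

  private final List<Fragment> fragments = new ArrayList<>();

  SummarySketch add(Fragment f) {
    fragments.add(f);
    return this;
  }

  // Raw aggregation of fragments: no markup, no entity encoding.
  @Override public String toString() {
    StringBuilder sb = new StringBuilder();
    for (Fragment f : fragments) sb.append(f.toPlain());
    return sb.toString();
  }

  // HTML rendering: markup added and text entity-encoded.
  String toHtmlString() {
    StringBuilder sb = new StringBuilder();
    for (Fragment f : fragments) sb.append(f.toHtml());
    return sb.toString();
  }

  // Minimal HTML escaping for the sketch; '&' must be replaced first.
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
            .replace(">", "&gt;").replace("\"", "&quot;");
  }

  public static void main(String[] args) {
    SummarySketch s = new SummarySketch()
        .add(new Ellipsis())
        .add(new Fragment("a < b "))
        .add(new Highlight("VKOL"));
    System.out.println(s);                // plain text, for an XML writer to escape once
    System.out.println(s.toHtmlString()); // ready to drop into search.jsp
  }
}
```

With this shape the XML path never sees pre-encoded text, so the serializer 
escapes it exactly once.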

 Summary#toString always Entity encodes -- problem for 
 OpenSearchServlet#description field
 -

  Key: NUTCH-257
  URL: http://issues.apache.org/jira/browse/NUTCH-257
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
 Reporter: [EMAIL PROTECTED]
 Priority: Minor





[jira] Commented: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-28 Thread [EMAIL PROTECTED] (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-256?page=comments#action_12376999 ] 

[EMAIL PROTECTED] commented on NUTCH-256:
-

Works for me.  Thanks.  Please close as fixed.

 Cannot open filename index.done.crc
 ---

  Key: NUTCH-256
  URL: http://issues.apache.org/jira/browse/NUTCH-256
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: [EMAIL PROTECTED]
 Priority: Minor
  Attachments: index.done.crc.patch

 Trying to copy indices out of DFS I always get:
 [bregeon] workspace  ./hadoop/bin/hadoop dfs -get outputs .
 060427 160317 parsing 
 file:/home/stack/workspace/hadoop-local-conf/hadoop-default.xml
 060427 160317 parsing 
 file:/home/stack/workspace/hadoop-local-conf/hadoop-site.xml
 060427 160318 No FS indicated, using default:localhost:9001
 060427 160318 Client connection to 127.0.0.1:9001: starting
 060427 160318 Problem opening checksum file: 
 /user/stack/outputs/indexes/part-0/index.done.  Ignoring with exception 
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open 
 filename /user/stack/outputs/indexes/part-0/.index.done.crc
 at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
 at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:589)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)




[jira] Created: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread [EMAIL PROTECTED] (JIRA)
Cannot open filename index.done.crc
---

 Key: NUTCH-256
 URL: http://issues.apache.org/jira/browse/NUTCH-256
 Project: Nutch
Type: Bug

  Components: indexer  
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor


Trying to copy indices out of DFS I always get:

[bregeon] workspace  ./hadoop/bin/hadoop dfs -get outputs .
060427 160317 parsing 
file:/home/stack/workspace/hadoop-local-conf/hadoop-default.xml
060427 160317 parsing 
file:/home/stack/workspace/hadoop-local-conf/hadoop-site.xml
060427 160318 No FS indicated, using default:localhost:9001
060427 160318 Client connection to 127.0.0.1:9001: starting
060427 160318 Problem opening checksum file: 
/user/stack/outputs/indexes/part-0/index.done.  Ignoring with exception 
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open 
filename /user/stack/outputs/indexes/part-0/.index.done.crc
at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:589)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)
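
The log above suggests a naming convention: for a data file named index.done, the client looks for a hidden companion named .index.done.crc in the same directory. A minimal sketch of that mapping in plain Java follows. The class and method names are hypothetical and this is not Hadoop's own code; it only reproduces the path pair seen in the log.

```java
final class ChecksumNameSketch {
    private ChecksumNameSketch() {}

    /**
     * Returns the path of the hidden checksum companion for a data file,
     * following the "." + name + ".crc" pattern visible in the log output.
     */
    static String checksumPathFor(String dataPath) {
        int slash = dataPath.lastIndexOf('/');
        String dir = dataPath.substring(0, slash + 1);  // empty when there is no directory part
        String name = dataPath.substring(slash + 1);
        return dir + "." + name + ".crc";
    }
}
```

If index.done is written without this companion, a checksummed read has nothing to open, which matches the exception reported here.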






[jira] Updated: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-256?page=all ]

[EMAIL PROTECTED] updated NUTCH-256:


Attachment: index.done.crc.patch

Ensure creation of companion index.done .crc file

 Cannot open filename index.done.crc
 ---

  Key: NUTCH-256
  URL: http://issues.apache.org/jira/browse/NUTCH-256
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: [EMAIL PROTECTED]
 Priority: Minor
  Attachments: index.done.crc.patch

 Trying to copy indices out of DFS I always get:
 [bregeon] workspace  ./hadoop/bin/hadoop dfs -get outputs .
 060427 160317 parsing 
 file:/home/stack/workspace/hadoop-local-conf/hadoop-default.xml
 060427 160317 parsing 
 file:/home/stack/workspace/hadoop-local-conf/hadoop-site.xml
 060427 160318 No FS indicated, using default:localhost:9001
 060427 160318 Client connection to 127.0.0.1:9001: starting
 060427 160318 Problem opening checksum file: 
 /user/stack/outputs/indexes/part-0/index.done.  Ignoring with exception 
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open 
 filename /user/stack/outputs/indexes/part-0/.index.done.crc
 at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
 at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:589)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)




[jira] Created: (NUTCH-190) ParseUtil drops reason for failed parse

2006-01-26 Thread [EMAIL PROTECTED] (JIRA)
ParseUtil drops reason for failed parse
---

 Key: NUTCH-190
 URL: http://issues.apache.org/jira/browse/NUTCH-190
 Project: Nutch
Type: Bug
  Components: fetcher  
Versions: 0.8-dev
 Environment: linux
Reporter: [EMAIL PROTECTED]
Priority: Minor


Doing the below:

Parse parse;
ParseStatus parseStatus;
try {
  parse = ParseUtil.parse(content);
  parseStatus = parse.getData().getStatus();
} catch (Exception e) {
  parseStatus = new ParseStatus(e);
}
if (!parseStatus.isSuccess()) {
  LOG.warning("Error parsing: " + url + ": " + parseStatus);
  parse = null;
}

...on failure, the LOG.warning never prints out the reason for failure.  Here's 
an example: Error parsing: http://www.dfrc.nasa.gov/DTRS/1967/PDF/H-478.pdf: 
failed(0,0).

ParseUtil is dropping messages lovingly crafted by parsers.
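
One way to keep the reason visible is for the status object to carry a message and include it when printed. The sketch below is a stand-alone illustration, not Nutch's ParseStatus: the class name StatusSketch, its fields, and the zero codes used for exceptions are all assumptions made for the example.

```java
final class StatusSketch {
    final boolean success;
    final int majorCode;
    final int minorCode;
    final String message;  // human-readable reason; may be null

    StatusSketch(boolean success, int majorCode, int minorCode, String message) {
        this.success = success;
        this.majorCode = majorCode;
        this.minorCode = minorCode;
        this.message = message;
    }

    /** Builds a failed status that keeps the exception's text instead of discarding it. */
    static StatusSketch fromThrowable(Throwable t) {
        return new StatusSketch(false, 0, 0, t.toString());
    }

    @Override
    public String toString() {
        String base = (success ? "success" : "failed") + "(" + majorCode + "," + minorCode + ")";
        // Append the reason only when one was supplied.
        return (message == null || message.isEmpty()) ? base : base + ": " + message;
    }
}
```

With something along these lines, a wrapper that passes the parser's own status through unchanged would log the reason next to the codes rather than a bare failed(0,0).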




[jira] Updated: (NUTCH-190) ParseUtil drops reason for failed parse

2006-01-26 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-190?page=all ]

[EMAIL PROTECTED] updated NUTCH-190:


Attachment: ParseUtil_drops_failure_reason.patch

Attached is a suggested patch against revision 369598.



 ParseUtil drops reason for failed parse
 ---

  Key: NUTCH-190
  URL: http://issues.apache.org/jira/browse/NUTCH-190
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.8-dev
  Environment: linux
 Reporter: [EMAIL PROTECTED]
 Priority: Minor
  Attachments: ParseUtil_drops_failure_reason.patch

 Doing the below:
 Parse parse;
 ParseStatus parseStatus;
 try {
   parse = ParseUtil.parse(content);
   parseStatus = parse.getData().getStatus();
 } catch (Exception e) {
   parseStatus = new ParseStatus(e);
 }
 if (!parseStatus.isSuccess()) {
   LOG.warning("Error parsing: " + url + ": " + parseStatus);
   parse = null;
 }
 ...on failure, the LOG.warning never prints out the reason for failure.  
 Here's an example: Error parsing: 
 http://www.dfrc.nasa.gov/DTRS/1967/PDF/H-478.pdf: failed(0,0).
 ParseUtil is dropping messages lovingly crafted by parsers.




[jira] Commented: (NUTCH-190) ParseUtil drops reason for failed parse

2006-01-26 Thread [EMAIL PROTECTED] (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-190?page=comments#action_12364145 ] 

[EMAIL PROTECTED] commented on NUTCH-190:
-

Here's an example of failure output after patch is applied:

060126 141413 task_m_bx2ifn  Error parsing: 
http://techreports.jpl.nasa.gov/2000/00-1147.pdf: failed(2,202): Content 
truncated at 102013 bytes. Parser can't handle incomplete application/pdf file

 ParseUtil drops reason for failed parse
 ---

  Key: NUTCH-190
  URL: http://issues.apache.org/jira/browse/NUTCH-190
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.8-dev
  Environment: linux
 Reporter: [EMAIL PROTECTED]
 Priority: Minor
  Attachments: ParseUtil_drops_failure_reason.patch

 Doing the below:
 Parse parse;
 ParseStatus parseStatus;
 try {
   parse = ParseUtil.parse(content);
   parseStatus = parse.getData().getStatus();
 } catch (Exception e) {
   parseStatus = new ParseStatus(e);
 }
 if (!parseStatus.isSuccess()) {
   LOG.warning("Error parsing: " + url + ": " + parseStatus);
   parse = null;
 }
 ...on failure, the LOG.warning never prints out the reason for failure.  
 Here's an example: Error parsing: 
 http://www.dfrc.nasa.gov/DTRS/1967/PDF/H-478.pdf: failed(0,0).
 ParseUtil is dropping messages lovingly crafted by parsers.




[jira] Commented: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-11-30 Thread [EMAIL PROTECTED] (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-130?page=comments#action_12358981 ] 

[EMAIL PROTECTED] commented on NUTCH-130:
-

Need to do same for plugin compile:

$  /usr/local/bin/svn diff src/plugin/build-plugin.xml
Index: src/plugin/build-plugin.xml
===================================================================
--- src/plugin/build-plugin.xml (revision 350057)
+++ src/plugin/build-plugin.xml (working copy)
@@ -85,6 +85,8 @@
      includes="**/*.java"
      destdir="${build.classes}"
      debug="${javac.debug}"
+     target="1.4"
+     source="1.4"
      deprecation="${javac.deprecation}">
       <classpath refid="classpath"/>
     </javac>

 Be explicit about target JVM when building (1.4.x?)
 ---

  Key: NUTCH-130
  URL: http://issues.apache.org/jira/browse/NUTCH-130
  Project: Nutch
 Type: Improvement
 Reporter: [EMAIL PROTECTED]
 Priority: Minor


 Below is a patch for nutch build.xml.  It stipulates that the target JVM is 
 1.4.x.  Without an explicit target, a nutch built with 1.5.x java defaults to 
 a 1.5.x java target and won't run in a 1.4.x JVM.  This can be annoying.  
 (From the ant javac doc, regarding the target attribute: "We highly recommend 
 to always specify this attribute.")
 [debord 282] nutch  svn diff -u build.xml
 Subcommand 'diff' doesn't accept option '-u [--show-updates]'
 Type 'svn help diff' for usage.
 [debord 283] nutch  svn diff build.xml
 Index: build.xml
 ===================================================================
 --- build.xml   (revision 349779)
 +++ build.xml   (working copy)
 @@ -72,6 +72,8 @@
       destdir="${build.classes}"
       debug="${debug}"
       optimize="${optimize}"
 +     target="1.4"
 +     source="1.4"
       deprecation="${deprecation}">
        <classpath refid="classpath"/>
      </javac>




[jira] Created: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-11-29 Thread [EMAIL PROTECTED] (JIRA)
Be explicit about target JVM when building (1.4.x?)
---

 Key: NUTCH-130
 URL: http://issues.apache.org/jira/browse/NUTCH-130
 Project: Nutch
Type: Improvement
Reporter: [EMAIL PROTECTED]
Priority: Minor


Below is a patch for nutch build.xml.  It stipulates that the target JVM is 
1.4.x.  Without an explicit target, a nutch built with 1.5.x java defaults to 
a 1.5.x java target and won't run in a 1.4.x JVM.  This can be annoying.  
(From the ant javac doc, regarding the target attribute: "We highly recommend 
to always specify this attribute.")

[debord 282] nutch  svn diff -u build.xml
Subcommand 'diff' doesn't accept option '-u [--show-updates]'
Type 'svn help diff' for usage.
[debord 283] nutch  svn diff build.xml
Index: build.xml
===================================================================
--- build.xml   (revision 349779)
+++ build.xml   (working copy)
@@ -72,6 +72,8 @@
      destdir="${build.classes}"
      debug="${debug}"
      optimize="${optimize}"
+     target="1.4"
+     source="1.4"
      deprecation="${deprecation}">
       <classpath refid="classpath"/>
     </javac>
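
To check which JVM a compiled class actually targets, the class file header can be inspected: after the 0xCAFEBABE magic number come a two-byte minor and a two-byte major version, and major 48 corresponds to a 1.4 target while 49 corresponds to 1.5. The helper below is a hypothetical sketch, not part of the nutch build; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;

final class ClassVersionSketch {
    private ClassVersionSketch() {}

    /**
     * Returns the major version stored in the first eight bytes of a class file.
     * 48 means the class was compiled for a 1.4 JVM, 49 for a 1.5 JVM.
     */
    static int majorVersion(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);  // big-endian by default, like class files
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                  // minor version, not needed here
        return buf.getShort() & 0xFFFF;  // major version as an unsigned value
    }
}
```

Reading the bytes of any class under the build output directory and passing them to majorVersion should show 48 once target is set to 1.4.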





[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-11-10 Thread [EMAIL PROTECTED] (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12357300 ] 

[EMAIL PROTECTED] commented on NUTCH-110:
-

Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities 
(first by the new toValidXmlText method, second by the 
javax.xml.transform.Transformer used by OpenSearchServlet). 

Use the original patch, fixIllegalXmlChars.patch, to address the problem 
described in this issue.
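
The double-encoding effect is easy to reproduce in isolation. The sketch below uses a hypothetical one-character escaper (not the toValidXmlText method from the patch) and applies it twice, standing in for the two escaping passes described above.

```java
final class DoubleEncodeSketch {
    private DoubleEncodeSketch() {}

    /** Escapes only '&', which is enough to show what a second pass does. */
    static String escapeAmpersand(String text) {
        return text.replace("&", "&amp;");
    }
}
```

One pass turns AT&T into AT&amp;T. A second pass over that result escapes the ampersand of the entity itself, giving AT&amp;amp;T, which a reader of the XML then sees as the literal text AT&amp;T.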

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.7
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch

 OpenSearchServlet does not check text-to-output for illegal XML characters; 
 depending on the search result, it's possible for OSS to output XML that is 
 not well-formed.  For example, if the text contains the FF (form feed) 
 character, i.e. the ASCII character at position (decimal) 12, the produced 
 XML will show the FF character as '&#12;'.  The character reference '&#12;' 
 is not legal in XML according to 
 http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.




[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-14 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:


Attachment: NUTCH-110-version2.patch

Patch version 2.  This patch benefits from the discussion held on the nutch dev 
list. It differs from the first in that it handles ALL illegal XML 
characters, entity-encoding the 5 'special characters' AND (silently) dropping 
characters outside the legal XML range. The previous patch did only the 
latter, leaving entity escaping to the configured transformer/DOM serializer.

This patch also differs from patch version 1 in that it moves the method that 
processes the xml out into util.StringUtil: The assumption being that not only 
OpenSearchServlet needs to make text safe to include in xml.
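
A minimal sketch of the combined behaviour described above: escape the five special characters and silently drop anything outside the XML 1.0 Char range. This is not the toValidXmlText method from the patch or from carrot2; the class and method names below are hypothetical, and only the legal ranges are taken from the XML 1.0 recommendation.

```java
final class XmlTextSketch {
    private XmlTextSketch() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes &, <, >, " and ' and drops illegal characters in a single pass. */
    static String escapeAndFilter(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (isLegalXmlChar(cp)) {
                        out.appendCodePoint(cp);
                    }
                    // illegal characters are dropped without notice
            }
        }
        return out.toString();
    }
}
```

Output from a method like this is already escaped, so it should only be written by code that does not escape again.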

The core method, StringUtil#toValidXmlText, was authored by Dawid Weiss and was 
taken from carrot2 XMLSerializerHelper.  Below is an excerpt from mail on the 
nutch dev list where he grants permission to copy toValidXmlText.

Message-ID: [EMAIL PROTECTED]
Date: Fri, 14 Oct 2005 08:42:48 +0200
From: Dawid Weiss [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
 xml characters

...

 So, will I amend the patch in NUTCH-110 so it uses 
 XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?

Copy the method's contents. It doesn't really make sense to copy the 
entire class just for this method. Good luck.

D. 

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.7
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch

 OpenSearchServlet does not check text-to-output for illegal XML characters; 
 depending on the search result, it's possible for OSS to output XML that is 
 not well-formed.  For example, if the text contains the FF (form feed) 
 character, i.e. the ASCII character at position (decimal) 12, the produced 
 XML will show the FF character as '&#12;'.  The character reference '&#12;' 
 is not legal in XML according to 
 http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.




[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread [EMAIL PROTECTED] (JIRA)
OpenSearchServlet outputs illegal xml characters


 Key: NUTCH-110
 URL: http://issues.apache.org/jira/browse/NUTCH-110
 Project: Nutch
Type: Bug
  Components: searcher  
Versions: 0.7
 Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]


OpenSearchServlet does not check text-to-output for illegal XML characters; 
depending on the search result, it's possible for OSS to output XML that is 
not well-formed.  For example, if the text contains the FF (form feed) 
character, i.e. the ASCII character at position (decimal) 12, the produced 
XML will show the FF character as '&#12;'.  The character reference '&#12;' 
is not legal in XML according to 
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.




[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:


Attachment: fixIllegalXmlChars.patch

Attached patch runs all xml text through a check for bad xml characters.  This 
patch is brutal, silently dropping illegal characters.  I made the patch after 
hunting through xalan, the jdk, and nutch itself for a method that would do 
the above filtering, but was unable to find one -- perhaps an oversight on my 
part?
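
For reference, the kind of filter described above can be sketched in a few lines. This is not the code in fixIllegalXmlChars.patch; the names are hypothetical, and the legal ranges are those of the XML 1.0 Char production. It only drops characters and leaves entity escaping to whatever serializes the XML.

```java
final class XmlCharFilterSketch {
    private XmlCharFilterSketch() {}

    /** Returns the text with every character outside the XML 1.0 Char range removed. */
    static String stripIllegal(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // handles surrogate pairs as one code point
            i += Character.charCount(cp);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Because the form feed character (decimal 12) falls outside those ranges, a string such as "page one\fpage two" comes back without it.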

 OpenSearchServlet outputs illegal xml characters
 

  Key: NUTCH-110
  URL: http://issues.apache.org/jira/browse/NUTCH-110
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.7
  Environment: linux, jdk 1.5
 Reporter: [EMAIL PROTECTED]
  Attachments: fixIllegalXmlChars.patch

 OpenSearchServlet does not check text-to-output for illegal XML characters; 
 depending on the search result, it's possible for OSS to output XML that is 
 not well-formed.  For example, if the text contains the FF (form feed) 
 character, i.e. the ASCII character at position (decimal) 12, the produced 
 XML will show the FF character as '&#12;'.  The character reference '&#12;' 
 is not legal in XML according to 
 http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
