[ https://issues.apache.org/jira/browse/NUTCH-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2318:
-----------------------------------
    Fix Version/s: 1.17

> Text extraction in HtmlParser adds too much whitespace.
> --------------------------------------------------------
>
>                 Key: NUTCH-2318
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2318
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.1, 1.15
>            Reporter: Felix Zett
>            Priority: Major
>             Fix For: 1.17
>
>
> In parse-html, org.apache.nutch.parse.html.HtmlParser calls
> DOMContentUtils.getText() to extract the text content. For every text node
> encountered in the document, the getTextHelper() function first appends a
> space character to the already extracted text and then the text content
> itself (stripped of excess whitespace). This means that parsing HTML such as
> {{<p>behavi<em>ou</em>r</p>}}
> leads to this extracted text:
> {{behavi ou r}}
> I would expect a parser not to add whitespace to content that visually (and
> actually) does not contain any in the first place. This applies to all
> similar semantic tags as well as {{<span>}}.
> My naive approach would be to remove the lines {{text = text.trim()}} and
> {{sb.append(' ')}}, but I'm aware that this will lead to bad parsing of
> markup like {{<p>foo</p><p>bar</p>}}.
> This is not an issue in parse-tika, since Tika removes all "unimportant"
> tags beforehand. However, I'd like to keep using parse-html because I need
> to keep the document reasonably intact for parse filters applied later.
> I know I could write a parse filter that re-extracts the text content, but
> this feels like a bug (or at least a shortcoming) in parse-html.
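To make the reported behaviour concrete, here is a minimal, self-contained sketch. It is not the actual DOMContentUtils code; the class name WhitespaceDemo and the simplified helper are illustrative only. Like the logic described above, it inserts a separator space between extracted text nodes, so the inline {{<em>}} element splits "behaviour" into three fragments:

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Simplified illustration of the issue, NOT the real DOMContentUtils:
 * a space is inserted before each text node, so inline elements such as
 * <em> produce "behavi ou r" instead of "behaviour".
 */
public class WhitespaceDemo {

  /** Recursively collect text, adding a separator before each text node. */
  static void getTextHelper(StringBuilder sb, Node node) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (text.length() > 0) {
        if (sb.length() > 0) {
          sb.append(' '); // the separator that splits "behavi|ou|r"
        }
        sb.append(text);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      getTextHelper(sb, children.item(i));
    }
  }

  public static void main(String[] args) throws Exception {
    // The example markup is well-formed XML, so a plain DOM parser suffices.
    String html = "<p>behavi<em>ou</em>r</p>";
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(
        new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

    StringBuilder sb = new StringBuilder();
    getTextHelper(sb, doc.getDocumentElement());

    System.out.println(sb); // prints "behavi ou r"
  }
}
{code}

As the reporter notes, simply dropping the separator would mangle {{<p>foo</p><p>bar</p>}}; a fix presumably has to distinguish block-level elements (where a separator is wanted) from inline elements such as {{<em>}} and {{<span>}} (where it is not).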