[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler reassigned TIKA-2683:
---------------------------------

    Assignee: Ken Krugler

> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2683
>                 URL: https://issues.apache.org/jira/browse/TIKA-2683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: Reproducible in all environments
>            Reporter: Karanjeet Singh
>            Assignee: Ken Krugler
>            Priority: Major
>              Labels: Boilerplate_Removal, boilerpipe, parser
>             Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to capture space and new-line
> characters in HTML.
> It also inserts additional new-line characters into the text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> In the extracted text, the space in "family Psychrolutidae" is missing, and
> additional new-line characters appear around parentheses '('.
>
> A related issue was reported long ago:
> https://issues.apache.org/jira/browse/TIKA-961



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)