[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-2683:
---------------------------------

    Assignee: Ken Krugler

> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2683
>                 URL: https://issues.apache.org/jira/browse/TIKA-2683
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>         Environment: Replicable everywhere in all environments
>            Reporter: Karanjeet Singh
>            Assignee: Ken Krugler
>            Priority: Major
>              Labels: Boilerplate_Removal, boilerpipe, parser
>             Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to capture space and new-line 
> characters that are present in the HTML.
> It also inserts additional new-line characters in the middle of the text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> The space in "family Psychrolutidae" is missing, and additional new-line 
> characters appear around round brackets '('.
>  
> A related issue was reported long ago: 
> https://issues.apache.org/jira/browse/TIKA-961



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
