Hi,

We're having an issue with Boilerpipe (BP): it drops the whitespace between terms that are separated only by tags. The ordinary Tika HTML parser handles this correctly. Take the following HTML, for example:

    abc<br>def<br>xyz

Without BP this becomes: abc def xyz
With BP this becomes:    abcdefxyz

How does the Tika parser determine when to put whitespace between tags? What about languages that do not separate words with whitespace? When testing with ordinary Chinese pages, I see whitespace being added there too.

Also, any hints as to where to look for the problem in the Boilerpipe code would be appreciated.

Thanks,
Markus
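
P.S. To make the two behaviours concrete, here is a minimal sketch in plain Java. It is not Tika or Boilerpipe code and says nothing about how either library is actually implemented; the class and method names (TagWhitespaceDemo, stripTags) are made up for illustration. It only mimics the two outputs above by replacing every tag with either a space or nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagWhitespaceDemo {
    // Naive tag matcher; an assumption for illustration, not a real HTML parser.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replace every tag with `separator`, then collapse whitespace runs.
    public static String stripTags(String html, String separator) {
        String text = TAG.matcher(html)
                .replaceAll(Matcher.quoteReplacement(separator));
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "abc<br>def<br>xyz";
        // Tag treated as a word boundary (what we see without BP).
        System.out.println(stripTags(html, " "));  // abc def xyz
        // Tag dropped with no separator (what we see with BP).
        System.out.println(stripTags(html, ""));   // abcdefxyz
    }
}
```

Note that the first variant pads every tag regardless of script, so text in a language written without spaces would get whitespace inserted at each tag as well, which matches what I see on the Chinese pages.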