Hi,

We're having an issue with Boilerpipe: it drops the whitespace between terms 
that are separated only by tags. The ordinary Tika HTML parser handles this 
correctly. Take the following HTML, for example:

abc<br>def<br>xyz

Without Boilerpipe (BP) this becomes: abc def xyz
With BP this becomes: abcdefxyz

How does the Tika parser determine when to put whitespace between tags? What 
about languages that do not use whitespace? When testing with ordinary Chinese 
pages I see whitespace being added there too.
Also, any hints on where to look for the problem in the Boilerpipe code are 
appreciated.

Thanks,
Markus
