Missing spaces on HTML parsing
------------------------------
Key: TIKA-394
URL: https://issues.apache.org/jira/browse/TIKA-394
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.6
Environment: Tomcat 6, Windows XP (Russian locale)
Reporter: Andrey Barhatov
When parsing the following HTML:
text<p>more<br>yet<select><option>city1<option>city2</select>
the resulting text is:
textmore
yetcity1city2
But it should be:
text
more
yet city1 city2
Code sample:
import java.io.*;
import org.apache.tika.metadata.*;
import org.apache.tika.parser.*;

public class test {
    public static void main(String[] args) throws Exception {
        // Tell Tika up front that the input is HTML.
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, "text/html");

        String content =
            "text<p>more<br>yet<select><option>city1<option>city2</select>";
        InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));

        // ParsingReader exposes the extracted plain text as a character stream.
        AutoDetectParser parser = new AutoDetectParser();
        Reader reader = new ParsingReader(parser, in, metadata, new ParseContext());

        // Read the whole stream into memory.
        char[] buf = new char[10000];
        int len;
        StringBuilder text = new StringBuilder();
        while ((len = reader.read(buf)) > 0) {
            text.append(buf, 0, len);
        }

        // Prints the text without the expected separators.
        System.out.print(text);
    }
}
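
For comparison, the expected output listed above can be reproduced with a small stand-alone sketch. It does not use the Tika API and is not how Tika extracts text; the class name ExpectedTextSketch, the method toText, and the choice of tags that end a line (p, br, div) are assumptions made only to illustrate the separator rule: line-breaking tags become a newline, and every other tag becomes a single space.

```java
import java.util.regex.Pattern;

// Illustration only: a regex-based tag stripper, not an HTML parser.
class ExpectedTextSketch {
    // Assumed set of tags that should end a line in the extracted text.
    private static final Pattern LINE_BREAK_TAGS =
        Pattern.compile("(?i)</?(p|br|div)\\b[^>]*>");
    // Any remaining tag is treated as a word separator.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    public static String toText(String html) {
        // 1. Line-breaking tags become newlines.
        String text = LINE_BREAK_TAGS.matcher(html).replaceAll("\n");
        // 2. All other tags become a space so adjacent words do not merge.
        text = ANY_TAG.matcher(text).replaceAll(" ");
        // 3. Collapse runs of spaces and drop spaces next to newlines.
        text = text.replaceAll("[ \\t]+", " ");
        text = text.replaceAll(" ?\\n ?", "\n");
        return text.trim();
    }

    public static void main(String[] args) {
        String html =
            "text<p>more<br>yet<select><option>city1<option>city2</select>";
        System.out.println(toText(html));
    }
}
```

With the HTML from this report, the sketch keeps "text", "more" and "yet city1 city2" on separate lines, which is the separation the report asks the parser to preserve.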