WriteOutContentHandler concatenates title tag and body text.
------------------------------------------------------------
Key: TIKA-730
URL: https://issues.apache.org/jira/browse/TIKA-730
Project: Tika
Issue Type: Bug
Components: general, parser
Affects Versions: 0.9
Reporter: Raimund Merkert
I just noticed that WriteOutContentHandler concatenates strings that it
should keep separate. I saw this with a title tag whose text was joined to
the first text in the body, e.g.:
<head><title>a</title></head><body>b</body>
results in "ab" and not "a b" (or something else with a break). Interestingly,
"<p>a</p><p>b</p>" does get broken into separate words.
I'm not aware of a better way to extract text only with an out-of-the-box Tika.
I've added a small unit test here:
{code}
package tika;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.Charset;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Assert;
import org.junit.Test;

public class WriteOutContentHandler_JUnit {

    // Minimal XHTML document: a title ("title") followed by body text ("a").
    private static final String HTML =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>title</title></head> <body>a</body></html>";

    // Parses the given markup and returns everything written to the handler.
    public static String processStream(String str) throws Exception {
        InputStream in = new ByteArrayInputStream(
                str.getBytes(Charset.forName("UTF-8")));
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        Metadata m = new Metadata();
        StringWriter out = new StringWriter();
        WriteOutContentHandler ctHandler = new WriteOutContentHandler(out);
        try {
            parser.parse(in, ctHandler, m, context);
            return out.toString();
        } finally {
            out.flush();
        }
    }

    @Test
    public void testParse() throws Exception {
        String data = processStream(HTML);
        data = data.trim();
        System.err.println("Extracted:\n" + data);
        // The title and the body text must not run together.
        Assert.assertFalse("title and body text were concatenated",
                data.equals("titlea"));
    }
}
{code}
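One possible explanation, sketched below with plain JDK SAX rather than Tika: a handler that writes out only characters() events and ignores element boundaries will join the text of adjacent elements. This is an assumption about the cause, not a confirmed description of WriteOutContentHandler. The class TextOnlySketch, its TextCollector handler and the separateElements flag are hypothetical names made up for the illustration, and the input is well-formed XML so the JDK parser can read it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlySketch {

    /** Collects character events; optionally emits a space at each element end. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean separateElements;

        TextCollector(boolean separateElements) {
            this.separateElements = separateElements;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text is appended as-is; element boundaries are not visible here.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (separateElements) {
                text.append(' ');
            }
        }

        String text() {
            // Collapse runs of whitespace so nested end tags yield one space.
            return text.toString().replaceAll("\\s+", " ").trim();
        }
    }

    static String extract(String xml, boolean separateElements) throws Exception {
        TextCollector handler = new TextCollector(separateElements);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><title>a</title></head><body>b</body></html>";
        System.out.println(extract(xml, false)); // characters only
        System.out.println(extract(xml, true));  // space after each element
    }
}
```

With separateElements set to false the title and body text run together. With it set to true they come out as separate words, which is the kind of break I would expect from a text-only extraction.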