WriteOutContentHandler concatenates title tag and body text.
------------------------------------------------------------

                 Key: TIKA-730
                 URL: https://issues.apache.org/jira/browse/TIKA-730
             Project: Tika
          Issue Type: Bug
          Components: general, parser
    Affects Versions: 0.9
            Reporter: Raimund Merkert


I just noticed that WriteOutContentHandler concatenates strings that it 
should keep separate. I saw this with a title tag whose text was joined 
to the first text in the body, e.g.: 
<head><title>a</title></head><body>b</body>
results in "ab" and not "a b" (or some other form of break). Interestingly, 
"<p>a</p><p>b</p>" does get broken into separate words. 

I'm not aware of a better way to extract plain text only with an out-of-the-box Tika.

I've added a small unit test here:
{code}
package tika;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.Charset;

import org.junit.Assert;

import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Test;

public class WriteOutContentHandler_JUnit {

        private static final String HTML = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>title</title></head>  "
                        + "<body>a</body></html>";

        public static String processStream(String str) throws Exception {

                InputStream in = new ByteArrayInputStream(str.getBytes(Charset
                                .forName("UTF-8")));

                AutoDetectParser parser = new AutoDetectParser();
                ParseContext context = new ParseContext();
                org.apache.tika.metadata.Metadata m = new org.apache.tika.metadata.Metadata();
                StringWriter out = new StringWriter();
                WriteOutContentHandler ctHandler = new WriteOutContentHandler(out);

                try {
                        parser.parse(in, ctHandler, m, context);
                        return out.toString();
                } finally {
                        out.flush();
                }
        }

        @Test
        public void testParse() throws Exception {
                String data = processStream(HTML);
                data = data.trim();
                System.err.println("Extracted:\n" + data);
                Assert.assertFalse(data.equals("titlea"));
        }
}
{code}
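
For illustration, the same effect can be reproduced without Tika. The following is a minimal sketch using only the JDK's SAX parser. The class TextBoundarySketch and its methods are hypothetical names, and this is not how WriteOutContentHandler is implemented. It assumes that appending a space at each element end is an acceptable way to keep adjacent text nodes apart.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika.
class TextBoundarySketch {

    /** Collects character data and optionally appends a space after every element. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean separateElements;

        TextCollector(boolean separateElements) {
            this.separateElements = separateElements;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text from different elements arrives here with no separator.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Assumed workaround: mark every element boundary with a space.
            if (separateElements) {
                text.append(' ');
            }
        }

        String result() {
            // Trim and collapse whitespace runs so results are easy to compare.
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Parses well-formed XML/XHTML and returns the collected text. */
    static String extract(String xml, boolean separateElements) throws Exception {
        TextCollector handler = new TextCollector(separateElements);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.result();
    }
}
```

Calling extract with separateElements set to false on the title/body example joins the two strings, while true keeps them apart. A real fix would probably add breaks only for selected elements rather than for every closing tag.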

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
