[jira] [Resolved] (TIKA-730) WriteOutContentHandler concatenates title tag and body text.

Jukka Zitting (Resolved) (JIRA) Wed, 05 Oct 2011 08:22:02 -0700

     [ https://issues.apache.org/jira/browse/TIKA-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-730.
--------------------------------

    Resolution: Won't Fix

Resolving as Won't Fix since in this case the WriteOutContentHandler class 
works exactly as designed and documented.

Have you looked at the [Tika facade 
class|http://tika.apache.org/0.10/api/org/apache/tika/Tika.html] that provides 
a simplified API for extracting just the text content of a document as a String 
or a Reader? That should be a better match for your use case than 
WriteOutContentHandler.
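
As a rough illustration of why a text-only SAX handler yields "ab" for the 
reported input, the sketch below uses only the JDK's SAX parser, not Tika. 
The names TextOnlySketch, PlainTextHandler and SpacedTextHandler are 
hypothetical and are not part of Tika's API. The sketch assumes only that a 
write-out style handler copies character events to its output without adding 
separators between elements; it does not reproduce Tika's implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlySketch {

    /** Hypothetical handler: collects character events, adds no separators. */
    static class PlainTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical variant: appends a single space when an element closes. */
    static class SpacedTextHandler extends PlainTextHandler {
        @Override
        public void endElement(String uri, String localName, String qName) {
            int n = text.length();
            // Skip the separator if nothing was written yet or one is already there.
            if (n > 0 && !Character.isWhitespace(text.charAt(n - 1))) {
                text.append(' ');
            }
        }
    }

    /** Parses the XML string with the given handler and returns the trimmed text. */
    static String extract(String xml, PlainTextHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><title>a</title></head><body>b</body></html>";
        System.out.println(extract(xml, new PlainTextHandler()));  // prints "ab"
        System.out.println(extract(xml, new SpacedTextHandler())); // prints "a b"
    }
}
```

The second handler shows one way a caller could get "a b" instead: subclass 
or wrap the handler and emit whitespace at element boundaries.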
                
> WriteOutContentHandler concatenates title tag and body text.
> ------------------------------------------------------------
>
>                 Key: TIKA-730
>                 URL: https://issues.apache.org/jira/browse/TIKA-730
>             Project: Tika
>          Issue Type: Bug
>          Components: general, parser
>    Affects Versions: 0.9
>            Reporter: Raimund Merkert
>
> I just noticed that the WriteOutContentHandler concatenates strings that it 
> should not concatenate. I noticed this in case of a title tag which was 
> combined with the first text in a body, e.g.: 
> <head><title>a</title></head><body>b</body>
> results in "ab" and not "a b" (or something else with a break). 
> Interestingly, "<p>a</p><p>b</p>" does get broken into separate words. 
> I'm not aware of a better way to extract text only with an out-of-the-box 
> tika.
> I've added a small unit test here:
> {code}
> package tika;
> import java.io.ByteArrayInputStream;
> import java.io.InputStream;
> import java.io.StringWriter;
> import java.nio.charset.Charset;
> import junit.framework.Assert;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.WriteOutContentHandler;
> import org.junit.Test;
> public class WriteOutContentHandler_JUnit {
>       private static final String HTML =
>                       "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
>                       + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
>                       + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
>                       + "<head><title>title</title></head>  "
>                       + "<body>a</body></html>";
>       public static String processStream(String str) throws Exception {
>               InputStream in = new ByteArrayInputStream(str.getBytes(Charset
>                               .forName("UTF-8")));
>               AutoDetectParser parser = new AutoDetectParser();
>               ParseContext context = new ParseContext();
>               org.apache.tika.metadata.Metadata m = new org.apache.tika.metadata.Metadata();
>               StringWriter out = new StringWriter();
>               WriteOutContentHandler ctHandler = new WriteOutContentHandler(out);
>               try {
>                       parser.parse(in, ctHandler, m, context);
>                       return out.toString();
>               } finally {
>                       out.flush();
>               }
>       }
>       @Test
>       public void testParse() throws Exception {
>               String data = processStream(HTML);
>               data = data.trim();
>               System.err.println("Extracted:\n" + data);
>               Assert.assertFalse(data.equals("titlea"));
>       }
> }
> {code}

