Re: Getting no text content from html

Martin Grotzke Sat, 25 Jul 2009 05:45:26 -0700

Hi,

On Sat, 2009-07-25 at 00:49 +0200, Martin Grotzke wrote:
> Hi all,
> 
> I'm just starting with Tika and am trying to extract the text content of
> some HTML. Unfortunately, I get no content at all.
> 
> This is my test method (in Scala):
> 
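>   import java.io.ByteArrayInputStream
>   import org.apache.tika.metadata.Metadata
>   import org.apache.tika.parser.html.HtmlParser
>   import org.apache.tika.sax.BodyContentHandler
> 
>   def testHtml(): Unit = {
>     val html = "<html><body>my content</body></html>"
>     val input = new ByteArrayInputStream(html.getBytes("UTF-8"))
>     val metadata = new Metadata
>     // Collects the text found inside the <body> element
>     val textHandler = new BodyContentHandler
>     val parser = new HtmlParser
>     parser.parse(input, textHandler, metadata)
>     input.close()
>     println("HTML Input: " + html)
>     println("Title: " + metadata.get("title"))
>     println("Author: " + metadata.get("Author"))
>     // Expected "my content" here, but the string is empty
>     println("content: " + textHandler.toString)
>   }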

In case the above was not explicit enough: textHandler.toString returns an
empty string.
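
One way to narrow this down might be to swap BodyContentHandler for a handler
that records every SAX event, to see whether any text events arrive at all.
A minimal sketch, using JDK classes only: EventRecorder and EventRecorderDemo
are made-up names, and the JDK XML parser stands in for HtmlParser here so
that the snippet runs without Tika on the classpath.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical helper (not part of Tika): records every SAX event it receives.
class EventRecorder extends DefaultHandler {
  val events = ListBuffer.empty[String]

  override def startElement(uri: String, localName: String,
                            qName: String, atts: Attributes): Unit =
    events += s"start: $qName"

  override def endElement(uri: String, localName: String, qName: String): Unit =
    events += s"end: $qName"

  // May be called more than once for a single run of text.
  override def characters(ch: Array[Char], start: Int, length: Int): Unit =
    events += "text: " + new String(ch, start, length)
}

object EventRecorderDemo {
  // Assumption: the JDK SAX parser is used as a stand-in for Tika's HtmlParser,
  // so the input must be well-formed XML.
  def record(html: String): List[String] = {
    val input = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))
    val recorder = new EventRecorder
    try SAXParserFactory.newInstance().newSAXParser().parse(input, recorder)
    finally input.close()
    recorder.events.toList
  }

  def main(args: Array[String]): Unit =
    record("<html><body>my content</body></html>").foreach(println)
}
```

In the test method, the same recorder would go in place of textHandler, i.e.
parser.parse(input, recorder, metadata). If no text event shows up there
either, the parser is probably not emitting the body text; if one does show
up, the question becomes how BodyContentHandler selects what it keeps.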


Any help?

Thx && cheers,
Martin


> 
> Is there anything wrong here?
> 
> Thanx && cheers,
> Martin
>
