Hi Jukka,

On Sep 20, 2009, at 2:26pm, Jukka Zitting wrote:

2. Is it possible to skip HTML tags with Tika (say I don't want to have <script>
or <style> contents in my resulting plain text)?

Yes. That's actually what the HTML parser in Tika is programmed to do
by default. See the DISCARD_ELEMENTS set in
org.apache.tika.parser.html.HTMLParser.
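
As a rough illustration of what "discarding" an element means here, the sketch below drops an element together with everything inside it while collecting plain text. It is not Tika's code: DiscardingTextHandler, its DISCARD set, and the extract() helper are hypothetical names, and only the standard JDK SAX classes are used, so the input has to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's implementation.
class DiscardingTextHandler extends DefaultHandler {
    // Assumed set of elements whose contents should never reach the output.
    private static final Set<String> DISCARD =
            new HashSet<>(Arrays.asList("script", "style"));

    private final StringBuilder text = new StringBuilder();
    private int discardDepth = 0; // > 0 while inside a discarded element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Enter discard mode on a discarded element, and keep counting
        // nested elements so the matching end tag can be found.
        if (discardDepth > 0 || DISCARD.contains(qName.toLowerCase(Locale.ROOT))) {
            discardDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (discardDepth > 0) {
            discardDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is outside every discarded element.
        if (discardDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses well-formed XHTML and returns the text that survived.
    static String extract(String xhtml) throws Exception {
        DiscardingTextHandler handler = new DiscardingTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }
}
```

With this sketch, the text inside <script> and <style> never reaches the collected output, while text in other elements does.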

I recently ran into the need to customize the behavior of the HtmlParser, specifically which tags it passes through.

In particular, the <span> tag carried attributes I wanted, but <span> isn't part of the "SAFE_ELEMENTS" set.

1. From what I can see, <span> should be part of the XHTML safe set.

2. It would be great to have some way to easily customize this behavior, e.g. a protected isSafeElement() method.

3. It looks like the code currently skips calling startElement/endElement for non-safe tags, but still outputs any characters found between those tags.

Depending on where the non-safe tag occurs, this could result in an invalid XHTML document, e.g. if you had

<body><non-safe tag>some text</non-safe tag></body>

this would output

<body>some text</body>
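
To make points 2 and 3 concrete, here is a rough sketch built on the standard SAX XMLFilterImpl class rather than on Tika's actual code. SafeElementFilter, its SAFE set, and the run() and newReader() helpers are made-up names for illustration, and isSafeElement() is the hook proposed in point 2, not an existing Tika method.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch, not Tika's implementation.
class SafeElementFilter extends XMLFilterImpl {
    // Assumed "safe" set; deliberately tiny for the example.
    private static final Set<String> SAFE =
            new HashSet<>(Arrays.asList("html", "body", "p"));

    SafeElementFilter(XMLReader parent) {
        super(parent);
    }

    // The proposed extension point: subclasses decide what is safe.
    protected boolean isSafeElement(String name) {
        return SAFE.contains(name);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (isSafeElement(qName)) {
            super.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (isSafeElement(qName)) {
            super.endElement(uri, local, qName);
        }
    }

    // characters() is inherited unchanged, so text between non-safe tags
    // is still forwarded even though the tags themselves are dropped.

    // Runs the filter and renders the forwarded events as a string
    // (attributes are ignored to keep the sketch short).
    static String run(SafeElementFilter filter, String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }
}
```

Fed <body><span>some text</span></body>, the default filter forwards the text but not the <span> tags, which is the behavior described in point 3. A subclass that overrides isSafeElement() to also accept "span" would keep the tags, which is the kind of customization point 2 asks for.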

Thanks,

-- Ken


3. Are there any plans for outputting the result as RDF? I'm currently using Aperture, but I would be more than happy to switch to an Apache project, and I'm also willing
to contribute on that one.

We've had discussions about using XMP for expressing and handling
extracted document metadata. So far we haven't reached clear consensus
and not much work has yet been done on this, but contributions are
of course welcome.

BR,

Jukka Zitting
