Re: rdf output

Ken Krugler Sun, 20 Sep 2009 16:28:37 -0700

Hi Jukka,

On Sep 20, 2009, at 2:26pm, Jukka Zitting wrote:

2. Is it possible to skip HTML tags with Tika (say I don't want to have <script>
or <style> contents in my resulting plain text)?


Yes. That's actually what the HTML parser in Tika is programmed to do
by default. See the DISCARD_ELEMENTS set in
org.apache.tika.parser.html.HTMLParser.
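
As a rough illustration of the discard idea described above, here is a minimal sketch that uses only the JDK's SAX classes rather than Tika itself. The class name DiscardElementsSketch, the extractText() helper and the contents of the DISCARD set are assumptions made for this example, not Tika's actual code, and the input has to be well-formed XHTML because a plain XML parser is used.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DiscardElementsSketch {
    // Hypothetical discard set for this sketch; Tika's real DISCARD_ELEMENTS may differ.
    static final Set<String> DISCARD = Set.of("script", "style");

    /** Returns the character content of the document, minus discarded elements. */
    static String extractText(String xhtml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int discardDepth = 0; // > 0 while inside a discarded element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Enter (or go deeper into) a discarded subtree.
                if (discardDepth > 0 || DISCARD.contains(qName)) {
                    discardDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (discardDepth > 0) {
                    discardDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep text only when we are outside every discarded element.
                if (discardDepth == 0) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String doc = "<html><head><style>p { color: red }</style>"
                   + "<script>var x = 1;</script></head>"
                   + "<body><p>Hello</p></body></html>";
        System.out.println(extractText(doc)); // prints "Hello"
    }
}
```

The depth counter is there so that elements nested inside a discarded one are dropped as well.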

I recently ran into the need to customize the behavior of the HtmlParser, in terms of what tags it passed through.

In particular, the <span> tag contained attributes I wanted, but <span> isn't part of the "SAFE_ELEMENTS" set.


1. From what I can see, <span> should be part of the XHTML safe set.

2. It would be great to have some way to easily customize this behavior, e.g. a protected isSafeElement() method.

3. It looks like the code currently will skip calling startElement/endElement for non-safe tags, but will output any characters found between those tags.

Depending on where the non-safe tag occurs, this could result in an invalid XHTML document, e.g. if you had


<body><non-safe tag>some text</non-safe tag></body>

this would output

<body>some text</body>
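
To make points 2 and 3 concrete, here is a small sketch built on the JDK's SAX classes only. It is not the Tika code: SafeElementFilter, SpanAwareFilter, filter() and the tiny SAFE set are hypothetical names and values chosen for illustration, with <font> standing in for a non-safe tag. It shows the text of a non-safe element leaking into its parent, and how an overridable isSafeElement() hook could let a subclass pass <span> and its attributes through.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SafeElementFilterSketch {

    /** Emits start/end tags only for "safe" elements, but always emits text. */
    static class SafeElementFilter extends DefaultHandler {
        // Hypothetical safe set for this sketch, not Tika's SAFE_ELEMENTS.
        private static final Set<String> SAFE = Set.of("body", "p");
        private final StringBuilder out = new StringBuilder();

        // The customization hook suggested in point 2.
        protected boolean isSafeElement(String name) {
            return SAFE.contains(name);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (!isSafeElement(qName)) {
                return; // tag is dropped, its text content is not
            }
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                // No escaping of attribute values, for brevity.
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (isSafeElement(qName)) {
                out.append("</").append(qName).append('>');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length); // text is always passed through
        }

        String result() {
            return out.toString();
        }
    }

    /** Subclass that additionally treats <span> as safe. */
    static class SpanAwareFilter extends SafeElementFilter {
        @Override
        protected boolean isSafeElement(String name) {
            return name.equals("span") || super.isSafeElement(name);
        }
    }

    static String filter(String xml, SafeElementFilter handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.result();
    }

    public static void main(String[] args) {
        // prints <body>some text</body>
        System.out.println(filter("<body><font>some text</font></body>",
                                  new SafeElementFilter()));
        // prints <body><span class="x">hi</span></body>
        System.out.println(filter("<body><span class=\"x\">hi</span></body>",
                                  new SpanAwareFilter()));
    }
}
```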

Thanks,

-- Ken

3. Are there any plans for outputting the result into RDF? (Currently I'm using Aperture, but I would be more than happy to switch to an Apache project, and I'm also willing
to contribute on that one.)

We've had discussions about using XMP for expressing and handling
extracted document metadata. So far we haven't reached a clear consensus,
and not much work has been done on this yet, but contributions are
of course welcome.

BR,

Jukka Zitting
