[ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763976#action_12763976 ]
Benson Margulies commented on TIKA-303:
---------------------------------------
To reproduce, feed in any HTML page that already has a title. The regular
startDocument is called first, and the document's own html/head/title is
produced. lazyStartDocument then adds a second layer on top of that.
You get:
<html>
<head>
<title>title</title>
</head>
<body>
<html>
<head><title>...</title></head><body> the body
</body>
</html>
</body>
</html>
I'll attach a code example later on.
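A minimal, self-contained Java sketch of the double-start behavior described
above. It is a toy model, not Tika's XHTMLContentHandler: the class name
LazyStartSketch, the trackStart switch, and the htmlOpens helper are all
hypothetical, and the real handler works on SAX events rather than strings.
It assumes that the fix amounts to recording that the document prefix has
already been written.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a lazily starting XHTML handler; not Tika's actual class. */
class LazyStartSketch {
    private final boolean trackStart;  // hypothetical switch: true models a possible fix
    private boolean started = false;   // has the html/head/body prefix been written?
    private final List<String> out = new ArrayList<>();

    LazyStartSketch(boolean trackStart) {
        this.trackStart = trackStart;
    }

    /** Explicit start: the source document supplies its own html/head/title. */
    void startDocument(String title) {
        writePrefix(title);
        if (trackStart) {
            started = true;            // the bookkeeping the issue says is missing
        }
    }

    /** Body text; triggers the lazy start if the handler believes none happened. */
    void characters(String text) {
        lazyStartDocument();
        out.add(text);
    }

    private void lazyStartDocument() {
        if (!started) {
            started = true;
            writePrefix("...");
        }
    }

    private void writePrefix(String title) {
        out.add("<html>");
        out.add("<head><title>" + title + "</title></head>");
        out.add("<body>");
    }

    /** Runs the scenario from the comment and counts the <html> start tags. */
    static long htmlOpens(boolean trackStart) {
        LazyStartSketch handler = new LazyStartSketch(trackStart);
        handler.startDocument("title");
        handler.characters("the body");
        return handler.out.stream().filter("<html>"::equals).count();
    }

    public static void main(String[] args) {
        System.out.println("without flag: " + htmlOpens(false)); // prints "without flag: 2"
        System.out.println("with flag:    " + htmlOpens(true));  // prints "with flag:    1"
    }
}
```

In this model the only difference between the two runs is whether
startDocument sets the flag that lazyStartDocument checks.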
> XHTMLContentHandler mishandles headers
> --------------------------------------
>
> Key: TIKA-303
> URL: https://issues.apache.org/jira/browse/TIKA-303
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.4
> Reporter: Benson Margulies
>
> XHTMLContentHandler.startDocument does not record that it has been called, so
> lazyStartDocument still runs afterwards and embeds an extra layer of
> head/title/body processing.