[jira] [Issue Comment Deleted] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-725:
--

Comment: was deleted

(was: When a {{TransformerHandler}} is used, the actual writing of the final 
elements is delegated to an XML serializer such as {{ToHTMLStream}}, which 
extends {{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether or not to close as {{<title/>}} or {{<title></title>}}.
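
The behavior described above can be exercised with a small sketch like the following. It assumes the factory returns the JDK's built-in transformer and that the elements are in the XHTML namespace; the class name {{EmptyTitleRepro}} and the method {{render}} are made up for illustration. The exact serialization can differ between JDK versions, so no expected output is shown.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical reproduction helper; not part of Tika.
class EmptyTitleRepro {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Serializes a minimal document whose title receives only a zero-length characters() call.
    static String render() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Output properties are set before the result is attached.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "head", "head", none);
        handler.startElement(XHTML, "title", "title", none);
        // The zero-length call that the serializer returns from immediately.
        handler.characters(new char[0], 0, 0);
        handler.endElement(XHTML, "title", "title");
        handler.endElement(XHTML, "head", "head");
        handler.endElement(XHTML, "html", "html");
        handler.endPrefixMapping("");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Inspect how the empty title element was closed on this JDK.
        System.out.println(render());
    }
}
```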

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue, a call to 
{{characters(new char[0], 0, 0)}} for the title element, and if present changes 
the length to 1 then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed, since the length is no 
longer zero and the early return is skipped, while no characters are written 
because the exception is thrown before any output.
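
A minimal sketch of such a decorator, assuming only the standard SAX API, might look like the following. The class name {{TitleExpandingFilter}} is hypothetical and this is not the actual {{ExpandedTitleContentHandler}} source; it only illustrates the idea of turning the zero-length call into a length-1 call on an empty array and swallowing the resulting exception.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-in for the decorator described above; not Tika's class.
class TitleExpandingFilter extends XMLFilterImpl {
    private boolean inTitle;

    TitleExpandingFilter(ContentHandler delegate) {
        // XMLFilterImpl forwards every SAX event to this handler.
        setContentHandler(delegate);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        inTitle = "title".equalsIgnoreCase(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        inTitle = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inTitle && length == 0) {
            try {
                // Claim one character so the serializer closes the start tag
                // before it fails while reading from the empty array.
                super.characters(new char[0], 0, 1);
            } catch (ArrayIndexOutOfBoundsException expected) {
                // Expected: there is nothing to write.
            }
        } else {
            super.characters(ch, start, length);
        }
    }
}
```

Only zero-length calls inside a title element are rewritten; all other events pass through unchanged.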

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

Those calling Tika programmatically will need to wrap their transformer 
handlers in an {{ExpandedTitleContentHandler}} as well, e.g.:

{code}
...
// indent ("yes" or "no"), encoding and output are supplied by the caller
SAXTransformerFactory factory =
        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
// Wrap the handler so that an empty title is written as <title></title>
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.


[1] http://xml.apache.org/xalan-j/usagepatterns.html)

 Empty title element makes Tika-generated HTML documents not open in Chromium
 

 Key: TIKA-725
 URL: https://issues.apache.org/jira/browse/TIKA-725
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 0.9
 Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
  Labels: html
 Fix For: 0.10


 Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
 empty title element as <title/> into the document HEAD section. This causes 
 Chromium not to display the document contents.
 Switching it to <title></title> fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Issue Comment Deleted] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-725:
--

Comment: was deleted

(was: Sorry, reopening to move comments.)
