Tim Allison created TIKA-3027:
---------------------------------

             Summary: Consider using html parser instead of xml parser for epub contents
                 Key: TIKA-3027
                 URL: https://issues.apache.org/jira/browse/TIKA-3027
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison
         Attachments: testEPUB_html.epub

We have a good number of EPUB files in our regression set whose "xhtml" content
files cause problems for the XML parser.  Should we switch to the HTMLParser?
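
To make the question concrete, here is a minimal sketch of the trade-off using only the Python standard library. It is not Tika's code and does not reflect how Tika's parsers are wired together; extract_text and TextCollector are hypothetical names chosen for illustration. It tries a strict XML parse first and falls back to a lenient HTML parse when the markup is not well-formed, which is roughly the behavior being asked about.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Lenient HTML parser that collects non-empty text nodes (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Keep only text that is not pure whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(markup):
    """Return (parser_used, text): strict XML first, lenient HTML as a fallback."""
    try:
        # Strict path: fails on unclosed tags and other well-formedness errors.
        root = ET.fromstring(markup)
        text = " ".join(t.strip() for t in root.itertext() if t.strip())
        return "xml", text
    except ET.ParseError:
        # Lenient path: the HTML parser tolerates markup the XML parser rejects.
        collector = TextCollector()
        collector.feed(markup)
        collector.close()
        return "html", " ".join(collector.chunks)
```

Switching to the HTML parser outright would amount to always taking the second branch; the fallback shape shown here is just one possible alternative, assuming well-formed content should still go through the XML path.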

 

To name a few:
{noformat}
commoncrawl3/6H/6HAGP5DFUKFYPUAUBPZ6NX54LUT6H5YO
commoncrawl3/LR/LR53ZVY5VR4BILUK27LGKROTBMVQ4YMV
commoncrawl3/Q4/Q4F2HATL7V5A6AYDJKZYNXV4AU6NXRMX
commoncrawl3/7I/7I6CKCIX75V22UNG7YPUVL6O2F3WVUTF
commoncrawl3/PF/PFYKV55F57N46PQJXAPZDEXCGJ54W26N
commoncrawl3/QK/QKVFV2QCCPXCQT27ZKRTOTTA5PHLFLIE
commoncrawl3/XB/XBUNGEOTNUBZ4EDHIEXRR5NW2PWF4WNN
commoncrawl3/72/72CJJQCXYVNIBX6O2M2AEJOHUZJUK625
{noformat}
I'm attaching 6HA..., renamed to testEPUB_html.epub.

 

The few that I've tried to open in iBooks cause errors and don't open at all.
I will try a few other readers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
