Sara Miller created TIKA-2177:
---------------------------------

             Summary: microsoft.OfficeParser adds links in additional 
paragraphs
                 Key: TIKA-2177
                 URL: https://issues.apache.org/jira/browse/TIKA-2177
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 1.13
         Environment: org.apache.tika.parser.microsoft.OfficeParser and 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
            Reporter: Sara Miller
            Priority: Minor


I'm converting Excel files, both .xls and .xlsx.
.xls uses org.apache.tika.parser.microsoft.OfficeParser and 
.xlsx uses org.apache.tika.parser.microsoft.ooxml.OOXMLParser

If my Excel document contains a link, for example [email protected], the .xls 
parser adds extra elements to the output that do not reflect the actual 
structure of the document. 

For example, this table in file.xls: 
mailadress      password
[email protected] hohoho

will output: 
        <div class="page">
            <h1>Sheet1</h1>
            <table>
                <tbody>
                    <tr>
                        <td>mailadress</td>
                        <td>password</td>
                    </tr>
                    <tr>
                        <td>[email protected]</td>
                        <td>hohoho</td>
                    </tr>
                </tbody>
            </table>
            <div class="outside">
                <a href="mailto:[email protected]">[email protected]</a>
            </div>
        </div>

The <div class="outside"> should be removed because it does not correspond to 
the document structure. 
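
Until the parser output changes, one possible workaround is to post-process 
the XHTML and drop the extra block. The following is only a minimal sketch 
using JDK classes, assuming the output is well-formed XHTML and that 
class="outside" is used only for this extra content. OutsideDivFilter and 
stripOutsideDivs are hypothetical names and are not part of Tika.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): removes every
// <div class="outside"> element from a well-formed XHTML string.
public class OutsideDivFilter {

    public static String stripOutsideDivs(String xhtml) throws Exception {
        // Parse the XHTML into a DOM tree (assumes well-formed input).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // getElementsByTagName returns a live list, so collect the
        // matching elements first and remove them in a second pass.
        NodeList divs = doc.getElementsByTagName("div");
        List<Element> toRemove = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("outside".equals(div.getAttribute("class"))) {
                toRemove.add(div);
            }
        }
        for (Element div : toRemove) {
            div.getParentNode().removeChild(div);
        }

        // Serialize the tree back to a string without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Passing the output shown above to OutsideDivFilter.stripOutsideDivs(...) 
would return the same markup without the <div class="outside"> block, while 
leaving the table untouched.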



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
