[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710 ]
Tilman Hausherr commented on TIKA-3544:
---------------------------------------

It seems to depend on the value:

{noformat}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="extended-properties:AppVersion" content="16.0300"/>
<meta name="protected" content="false"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="meta:last-author" content="Jitin Jindal"/>
<meta name="X-TIKA:digest:SHA256" content="7d1109045508e7fdc0148d9e9e7b16d01ce18ae0794f7381145e23973996c0b6"/>
<meta name="extended-properties:DocSecurityString" content="None"/>
<meta name="resourceName" content="Credit Card Numbers.xlsx"/>
<meta name="dcterms:modified" content="2021-09-07T20:57:34Z"/>
<meta name="Content-Length" content="500481"/>
<meta name="X-TIKA:digest:MD5" content="72c4c6777f1f9144542ddf5a059d2ffa"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<title/>
</head>
<body><div><h1>Payments - Payment Details</h1>
<table><tbody><tr> <td>Payment Details</td></tr>
<tr> <td>Credit Card Numbers (Source: http://www.getcreditcardnumbers.com/)</td></tr>
<tr> <td>6,48019534464278E+15</td></tr>
<tr> <td>30295201231669</td></tr>
<tr> <td>30082494556063</td></tr>
<tr> <td>344850003945824</td></tr>
<tr> <td>3,58338792333363E+15</td></tr>
<tr> <td>3,58738537059364E+15</td></tr>
<tr/>
</tbody></table>
<p>&"Helvetica,Regular"&12&K000000&P </p>
<a href="http://www.getcreditcardnumbers.com/">http://www.getcreditcardnumbers.com/</a></div>
</body></html>
{noformat}

> Extraction of long sequences of digits from Excel spreadsheets using Tika
> 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3544
>                 URL: https://issues.apache.org/jira/browse/TIKA-3544
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Jitin Jindal
>            Priority: Major
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the
> attached spreadsheet as 6.480195344642784E15, which clearly is not the
> desired output.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers to name a few.
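
For reference, the reported value 6.480195344642784E15 has the same shape as what plain Java produces when a 16-digit whole number held in a double is converted with Double.toString. The following is only a minimal sketch of that effect, not Tika or POI code. It assumes the spreadsheet stores the number as a numeric (double) cell, and the class LongDigitsDemo and the helper toPlainDigits are hypothetical names made up for illustration.

```java
import java.math.BigDecimal;

public class LongDigitsDemo {

    // Hypothetical helper (not part of Tika or POI): renders a double that
    // holds a whole number as plain digits, without an exponent.
    // Note: for non-integer values this prints the exact binary expansion,
    // which can be very long, so it is only meant for whole numbers.
    public static String toPlainDigits(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // 16 digits, below 2^53, so the double represents it exactly.
        double cell = 6480195344642784d;

        // Default double-to-string conversion switches to scientific
        // notation for magnitudes of 10^7 and above.
        System.out.println(Double.toString(cell)); // 6.480195344642784E15

        // Plain rendering keeps every digit.
        System.out.println(toPlainDigits(cell));   // 6480195344642784
    }
}
```

Whether the extracted text goes through a conversion like this or through the cell's number format is not established here. Values of 16 digits or more can also lose precision in a double, so identifiers such as card numbers are generally safer stored as text cells.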