[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411710#comment-17411710
 ] 

Tilman Hausherr commented on TIKA-3544:
---------------------------------------

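Whether a number is emitted in scientific notation seems to depend on its value. In the XHTML output below, the 16-digit numbers come out in scientific notation, while the 14- and 15-digit numbers are kept as plain digits: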
{noformat}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="extended-properties:AppVersion" content="16.0300"/>
<meta name="protected" content="false"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" 
content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="meta:last-author" content="Jitin Jindal"/>
<meta name="X-TIKA:digest:SHA256" 
content="7d1109045508e7fdc0148d9e9e7b16d01ce18ae0794f7381145e23973996c0b6"/>
<meta name="extended-properties:DocSecurityString" content="None"/>
<meta name="resourceName" content="Credit Card Numbers.xlsx"/>
<meta name="dcterms:modified" content="2021-09-07T20:57:34Z"/>
<meta name="Content-Length" content="500481"/>
<meta name="X-TIKA:digest:MD5" content="72c4c6777f1f9144542ddf5a059d2ffa"/>
<meta name="Content-Type" 
content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<title/>
</head>
<body><div><h1>Payments - Payment Details</h1>
<table><tbody><tr>      <td>Payment Details</td></tr>
<tr>    <td>Credit Card Numbers (Source: 
http://www.getcreditcardnumbers.com/)</td></tr>
<tr>    <td>6,48019534464278E+15</td></tr>
<tr>    <td>30295201231669</td></tr>
<tr>    <td>30082494556063</td></tr>
<tr>    <td>344850003945824</td></tr>
<tr>    <td>3,58338792333363E+15</td></tr>
<tr>    <td>3,58738537059364E+15</td></tr>
<tr/>
</tbody></table>
<p>&amp;"Helvetica,Regular"&amp;12&amp;K000000&amp;P  </p>
<a href="http://www.getcreditcardnumbers.com/">http://www.getcreditcardnumbers.com/</a></div>
</body></html>
{noformat}
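
As a rough illustration of why a 16-digit value can end up in scientific notation once it is held as a double, here is a minimal Java sketch. It is not the code path Tika or POI actually uses; the class name PlainDigits and both helper methods are hypothetical, and the sample value is the one implied by the 6.480195344642784E15 quoted in the issue description.

```java
import java.math.BigDecimal;

public class PlainDigits {
    // Default double-to-string conversion. Java switches to scientific
    // notation for magnitudes of 10^7 and above.
    static String asDefault(double value) {
        return Double.toString(value);
    }

    // Exact decimal expansion of the double, written without an exponent.
    static String asPlain(double value) {
        return new BigDecimal(value).toPlainString();
    }

    public static void main(String[] args) {
        // Hypothetical numeric cell value, assumed to be stored as a double.
        double cell = 6480195344642784d;
        System.out.println(asDefault(cell)); // scientific notation
        System.out.println(asPlain(cell));   // plain digits
    }
}
```

For this value asPlain returns 6480195344642784, because every integer below 2^53 is exactly representable as a double. For larger integers, or for values with a fractional part, the exact expansion would not match what was typed into the cell, so a sketch like this only helps for whole numbers of up to about 15 or 16 digits.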


> Extraction of long sequences of digits from Excel spreadsheets using Tika 
> 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3544
>                 URL: https://issues.apache.org/jira/browse/TIKA-3544
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Jitin Jindal
>            Priority: Major
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit 
> card number, Tika 1.13 emits that sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the 
> attached spreadsheet as 6.480195344642784E15, which clearly is not the 
> desired output.
> I think the impact of this issue is significant. There’s plenty of 
> information that can no longer be reliably extracted from spreadsheets. Think 
> credit card numbers, telephone numbers and product identifiers to name a few.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
