[ 
https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412025#comment-17412025
 ] 

Tim Allison commented on TIKA-3544:
-----------------------------------

Oh, this is hilarious: if I type '6480195344542781' (16 digits), Excel 
automatically floors that to '6480195344542780', which means Excel is 
corrupting 16-digit credit card numbers that do not happen to end in zero!

I note that Excel is not rounding; it also floors '6480195344542789' to 
'6480195344542780'.

So, yes, we could bump it to 16, but that would be wrong 90% of the time 
(whenever the real 16th digit is anything other than zero)...  I'm now 
inclined to propose that we not do anything here.
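
A minimal Java sketch of the arithmetic behind that 90% figure. It assumes 
Excel simply keeps the first 15 significant digits and zeroes the rest, which 
matches the two values above but is an assumption about Excel, not something 
taken from Tika or POI. The class and method names are made up for 
illustration.

```java
public class FifteenDigitDemo {

    // Assumption: Excel keeps at most 15 significant digits of a typed number.
    static final int EXCEL_SIGNIFICANT_DIGITS = 15;

    // Hypothetical helper: zero out every digit past the first `digits`
    // significant digits. Truncates (floors toward zero), does not round.
    static long truncateToSignificantDigits(long value, int digits) {
        int length = Long.toString(Math.abs(value)).length();
        if (length <= digits) {
            return value; // nothing to drop
        }
        long scale = 1;
        for (int i = 0; i < length - digits; i++) {
            scale *= 10;
        }
        return (value / scale) * scale;
    }

    // Counts how many of the ten possible final digits survive truncation
    // for a 16-digit number whose first 15 digits are given by `prefixTimesTen`.
    static int countSurvivingLastDigits(long prefixTimesTen) {
        int survivors = 0;
        for (int last = 0; last <= 9; last++) {
            long typed = prefixTimesTen + last;
            if (truncateToSignificantDigits(typed, EXCEL_SIGNIFICANT_DIGITS) == typed) {
                survivors++;
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // Both values from the comment collapse to the same stored number.
        System.out.println(truncateToSignificantDigits(6480195344542781L, EXCEL_SIGNIFICANT_DIGITS));
        System.out.println(truncateToSignificantDigits(6480195344542789L, EXCEL_SIGNIFICANT_DIGITS));
        // Only a trailing zero survives: 1 of 10 possible last digits.
        System.out.println(countSurvivingLastDigits(6480195344542780L));
    }
}
```

Under that assumption, only one of the ten possible final digits (zero) comes 
back unchanged, so emitting 16 digits would report a wrong last digit for the 
other nine.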

Note: This is Excel for Mac (16.52); your mileage may vary.

> Extraction of long sequences of digits from Excel spreadsheets using Tika 
> 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3544
>                 URL: https://issues.apache.org/jira/browse/TIKA-3544
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Jitin Jindal
>            Priority: Major
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit 
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the 
> attached spreadsheet as 6.480195344642784E15, which clearly is not the 
> desired output.
> I think the impact of this issue is significant. There’s plenty of 
> information that can no longer be reliably extracted from spreadsheets. Think 
> credit card numbers, telephone numbers and product identifiers to name a few.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
