[ 
https://issues.apache.org/jira/browse/TIKA-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547807#comment-13547807
 ] 

Nick Burch commented on TIKA-1054:
----------------------------------

Excel date formatting is not nearly as straightforward as you might initially 
suspect. Quite a few of the common formatting strings are stored in the file 
in their US style, but Excel displays them according to its current locale. 
This especially catches out people who aren't in the US: they enter a date in 
their normal format, and when Apache POI processes the file the date comes out 
looking American, because Excel stored the US format string but displayed it 
differently.
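
To make the stored-versus-displayed distinction concrete, here is a minimal 
sketch in plain Java (no POI involved). It assumes the workbook uses Excel's 
default 1900 date system, and it takes the raw number from the "Custom Date" 
row of the report. The class and method names are made up for illustration, 
and the JDK's own short date style only stands in for whatever Excel would 
choose for a given locale.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class ExcelSerialDates {
    // Day 0 of the 1900 date system (assumed here); using 1899-12-30 makes
    // serials from 1 March 1900 onwards line up with calendar dates.
    private static final LocalDateTime EPOCH = LocalDateTime.of(1899, 12, 30, 0, 0);

    // The file stores a plain number: whole days plus a fraction of a day.
    static LocalDateTime toDateTime(double serial) {
        long wholeDays = (long) Math.floor(serial);
        long millis = Math.round((serial - wholeDays) * 24 * 60 * 60 * 1000);
        return EPOCH.plusDays(wholeDays).plusNanos(millis * 1_000_000L);
    }

    // Same stored value, rendered with the JDK's short date style for a locale.
    // The exact text depends on the JDK's locale data, so it is not shown here.
    static String display(double serial, Locale locale) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(locale);
        return toDateTime(serial).format(formatter);
    }

    public static void main(String[] args) {
        double serial = 39219.18056369212; // raw "Custom Date" value from the report
        System.out.println(toDateTime(serial).toLocalDate()); // 2007-05-17
        System.out.println(display(serial, Locale.US));
        System.out.println(display(serial, Locale.forLanguageTag("sv-SE")));
    }
}
```

The number in the file never changes; only the rendering step depends on 
the locale, which is why the same cell can look different in Excel and in 
POI output.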

If you switch your computer to a US locale, and try re-loading the file in 
Excel, I strongly suspect your problematic dates will look different there too. 
Could you try that?

(It has been suggested that POI could provide a way to translate these magic 
locale formats from their US style into the corresponding locale-specific 
formats, but so far no-one has volunteered to identify what the 
locale-specific format strings are for all the key locales.)
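
For what it's worth, such a translation could be little more than a lookup 
table. The sketch below is plain Java, not anything that exists in POI. The 
class, the table and the Java patterns are all hypothetical; the only entry 
with any backing is the Swedish one, which is taken from the report quoted 
below (m/d/yy shown as 2009-10-03). Every other locale falls back to the 
US style.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;

class LocaleDateFormats {
    // Java pattern corresponding to the US-style "m/d/yy" stored in the file.
    static final String US_PATTERN = "M/d/yy";

    // Hypothetical table: language tag -> Java pattern that locale shows for
    // "m/d/yy". Only the Swedish entry comes from the report; a real table
    // would need someone to verify each locale.
    private static final Map<String, String> SHORT_DATE_BY_LOCALE =
            Map.of("sv-SE", "yyyy-MM-dd");

    // Unknown locales fall back to the US pattern.
    static String patternFor(Locale locale) {
        return SHORT_DATE_BY_LOCALE.getOrDefault(locale.toLanguageTag(), US_PATTERN);
    }

    static String format(LocalDate date, Locale locale) {
        return date.format(DateTimeFormatter.ofPattern(patternFor(locale), locale));
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2009, 10, 3);
        System.out.println(format(date, Locale.forLanguageTag("sv-SE"))); // 2009-10-03
        System.out.println(format(date, Locale.US));                      // 10/3/09
    }
}
```

The hard part is not the code but filling in the table, which is exactly 
the piece nobody has volunteered for yet.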
                
> Problem with parsing excel date formats
> ---------------------------------------
>
>                 Key: TIKA-1054
>                 URL: https://issues.apache.org/jira/browse/TIKA-1054
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.2
>            Reporter: Olof Jonasson
>
> I'm using Solr 4.0 and Tika 1.2 and have problems indexing Excel files 
> containing date formats. Having read TIKA-125, TIKA-371, TIKA-103 and 
> TIKA-360, I get the impression that the date formatting problem is solved 
> (for some cases at least).
> I've used testEXCEL-formats.xls from TIKA-103, and also resaved it as xlsx 
> and tested that as well. The default locale on my computer is Swedish. This 
> is what I get (sorry for the occasional Swedish):
> Content of testEXCEL-formats.xlsx and testEXCEL-formats.xls:
> Number #,##0.00 1 599,99 -1 599,99
> Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
> Scientific 0.00E+00 1,98E+08 -1,98E+08
> Percentage (0.025) 3% 2,50%
> Fraction (2.5) 2 1/2
> Time Format: h:mm AM/PM 6:15 AM 6:15 PM
> Time Format: h:mm 06:15 18:15
> Date Format: m/d/yy 2009-10-03
> Date Format: d-mmm-yy 17-maj-07
> Date/Time Format 2008-01-19 04:35
> Custom Number: 19 dollars and ,99 cents
> Custom Date: At 4:20 AM on torsdag maj 17, 2007
> What the Tika 1.2 parser returns for the xlsx (and what is indexed by Solr):
> Number #,##0.00 1 599,99 -1 599,99
> Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
> Scientific 0.00E+00 1,98E+08 -1,98E+08
> Percentage (0.025) 3% 2,50%
> Fraction (2.5) 2 1/2
> Time Format: h:mm AM/PM 6:15 fm 6:15 em
> Time Format: h:mm 6:15 18:15
> Date Format: m/d/yy 2009/10/03
> Date Format: d-mmm-yy 17-maj-07
> Date/Time Format 1/19/08 4:35
> Custom Number: 19,99 dollars and cents
> Custom Date: 39219.18056369212 
> What the Tika 1.2 parser returns for the xls (and what is indexed by Solr):
> Number #,##0.00  1 599,99 -1 599,99
> Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
> Scientific 0.00E+00 1,98E+08 -1,98E+08
> Percentage (0.025) 3% 2,50%
> Fraction (2.5) 2 1/2
> Time Format: h:mm AM/PM 6:15 fm 6:15 em
> Time Format: h:mm  6:15 18:15
> Date Format: m/d/yy 10/3/09
> Date Format: d-mmm-yy 17-maj-07
> Date/Time Format  1/19/08 4:35
> Custom Number: 19,99 dollars and cents
> Custom Date: 39219.18056369212
> --- 
> Unexpected formats for:
> Date Format: m/d/yy 2009-10-03
> Date/Time Format 2008-01-19 04:35
> Custom Date: At 4:20 AM on torsdag maj 17, 2007

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
