[ https://issues.apache.org/jira/browse/TIKA-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olof Jonasson updated TIKA-1054:
--------------------------------

Description:

I'm using Solr 4.0 and Tika 1.2 and have some problems indexing Excel files containing date formats. I've read TIKA-125, TIKA-371, TIKA-103 and TIKA-360, and from those I get the impression that the date formatting problem is solved (for some cases at least).

I've used testEXCEL-formats.xls from TIKA-103, and I also resaved it as xlsx and tested that as well. The default locale on my computer is Swedish. This is what I get (sorry for the occasional Swedish):

Content of testEXCEL-formats.xlsx and testEXCEL-formats.xls
Number #,##0.00 1 599,99 -1 599,99
Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
Scientific 0.00E+00 1,98E+08 -1,98E+08
Percentage (0.025) 3% 2,50%
Fraction (2.5) 2 1/2
Time Format: h:mm AM/PM 6:15 AM 6:15 PM
Time Format: h:mm 06:15 18:15
Date Format: m/d/yy 2009-10-03
Date Format: d-mmm-yy 17-maj-07
Date/Time Format 2008-01-19 04:35
Custom Number: 19 dollars and ,99 cents
Custom Date: At 4:20 AM on torsdag maj 17, 2007

What the Tika 1.2 parser returns for the xlsx (and what is indexed by Solr)
Number #,##0.00 1 599,99 -1 599,99
Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
Scientific 0.00E+00 1,98E+08 -1,98E+08
Percentage (0.025) 3% 2,50%
Fraction (2.5) 2 1/2
Time Format: h:mm AM/PM 6:15 fm 6:15 em
Time Format: h:mm 6:15 18:15
Date Format: m/d/yy 2009/10/03
Date Format: d-mmm-yy 17-maj-07
Date/Time Format 1/19/08 4:35
Custom Number: 19,99 dollars and cents
Custom Date: 39219.18056369212

What the Tika 1.2 parser returns for the xls (and what is indexed by Solr)
Number #,##0.00 1 599,99 -1 599,99
Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
Scientific 0.00E+00 1,98E+08 -1,98E+08
Percentage (0.025) 3% 2,50%
Fraction (2.5) 2 1/2
Time Format: h:mm AM/PM 6:15 fm 6:15 em
Time Format: h:mm 6:15 18:15
Date Format: m/d/yy 10/3/09
Date Format: d-mmm-yy 17-maj-07
Date/Time Format 1/19/08 4:35
Custom Number: 19,99 dollars and cents
Custom Date: 39219.18056369212

---
Unexpected formats for:
Date Format: m/d/yy 2009-10-03
Date/Time Format 2008-01-19 04:35
Custom Date: At 4:20 AM on torsdag maj 17, 2007

was:

I'm using Solr 4.0 and Tika 1.2 and have some problems indexing Excel files containing date formats. I've read TIKA-103 and TIKA-360, and from those I get the impression that the date formatting problem is solved (for some cases at least).

I've used testEXCEL-formats.xls from TIKA-103, and I also resaved it as xlsx and tested that as well. The default locale on my computer is Swedish. This is what I get (sorry for the occasional Swedish):

Content of testEXCEL-formats.xlsx and testEXCEL-formats.xls
Number #,##0.00 1 599,99 -1 599,99
Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
Scientific 0.00E+00 1,98E+08 -1,98E+08
Percentage (0.025) 3% 2,50%
Fraction (2.5) 2 1/2
Time Format: h:mm AM/PM 6:15 AM 6:15 PM
Time Format: h:mm 06:15 18:15
Date Format: m/d/yy 2009-10-03
Date Format: d-mmm-yy 17-maj-07
Date/Time Format 2008-01-19 04:35
Custom Number: 19 dollars and ,99 cents
Custom Date: At 4:20 AM on torsdag maj 17, 2007

What the Tika 1.2 parser returns for the xlsx (and what is indexed by Solr)
Number #,##0.00 1 599,99 -1 599,99
Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
Scientific 0.00E+00 1,98E+08 -1,98E+08
Percentage (0.025) 3% 2,50%
Fraction (2.5) 2 1/2
Time Format: h:mm AM/PM 6:15 fm 6:15 em
Time Format: h:mm 6:15 18:15
Date Format: m/d/yy 2009/10/03
Date Format: d-mmm-yy 17-maj-07
Date/Time Format 1/19/08 4:35
Custom Number: 19,99 dollars and cents
Custom Date: 39219.18056369212

What the Tika 1.2 parser returns for the xls (and what is indexed by Solr)
Number #,##0.00 1 599,99 -1 599,99
Currency $#,##0.00;[Red]($#,##0.00) $1 599,99 ($1 599,99)
Scientific 0.00E+00 1,98E+08 -1,98E+08
Percentage (0.025) 3% 2,50%
Fraction (2.5) 2 1/2
Time Format: h:mm AM/PM 6:15 fm 6:15 em
Time Format: h:mm 6:15 18:15
Date Format: m/d/yy 10/3/09
Date Format: d-mmm-yy 17-maj-07
Date/Time Format 1/19/08 4:35
Custom Number: 19,99 dollars and cents
Custom Date: 39219.18056369212

---
Unexpected formats for:
Date Format: m/d/yy 2009-10-03
Date/Time Format 2008-01-19 04:35
Custom Date: At 4:20 AM on torsdag maj 17, 2007

> Problem with parsing excel date formats
> ---------------------------------------
>
> Key: TIKA-1054
> URL: https://issues.apache.org/jira/browse/TIKA-1054
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2
> Reporter: Olof Jonasson

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
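
The value returned for the Custom Date cell, 39219.18056369212, looks like the cell's raw Excel serial date rather than formatted text. The following is a minimal sketch of how such a serial maps back to a calendar date, assuming the workbook uses Excel's 1900 date system. The function name excel_serial_to_datetime is hypothetical and is not part of Tika or POI; the sketch only illustrates the arithmetic and does not reproduce how the parser handles the custom format.

```python
from datetime import datetime, timedelta

# Day 0 of Excel's 1900 date system (assumed here). Using 1899-12-30 instead
# of 1899-12-31 absorbs Excel's fictitious 1900-02-29, so serials from
# 1900-03-01 onward map to the correct calendar date.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date (whole days plus day fraction) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


# The raw value reported for the "Custom Date" cell in the listings above.
custom_date = excel_serial_to_datetime(39219.18056369212)
print(custom_date.strftime("%Y-%m-%d %H:%M"))  # 2007-05-17 04:20
```

Under that assumption the serial corresponds to Thursday 17 May 2007 at 04:20, which matches the "At 4:20 AM on torsdag maj 17, 2007" text shown for the same cell in the spreadsheet content.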