[
https://issues.apache.org/jira/browse/TIKA-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796474#action_12796474
]
Dave Meikle commented on TIKA-103:
----------------------------------
I am not sure how others feel about this issue, but I would like to see it
addressed: I have some applications that parse Excel files containing various
types of formatting, and it would be good to have the 'as-is' value within the
parsed content.
As can be seen above, I have attached an initial patch that makes
TikaHSSFListener use the FormatTrackingHSSFListener proxy class to handle the
formatting that POI currently supports. This moves us on somewhat and leaves
the outstanding support as something to progress within POI[1]. I have attached
the patch instead of committing directly because I would like to propose the
following:
* This initial support is included in the upcoming 0.6 release.
* An issue is raised against POI, and any fixes to support the other formatting
are progressed there.
Not sure what you all think?
I am going to have a track through to see how Niall got on, if he managed to
get the time.
Cheers,
Dave
[1] I will continue to see if there are any user model features we can use to
add further support from the current POI code base.
> Excel parsing ignores cell formating
> ------------------------------------
>
> Key: TIKA-103
> URL: https://issues.apache.org/jira/browse/TIKA-103
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Niall Pemberton
> Attachments: testEXCEL-formats.xls, tika-103_initial_patch.diff
>
>
> Unfortunately Excel stores dates as the number of days since 1900 (or 1904,
> but ignore that atm) with the time element being stored in the fractional
> part of the numeric value. So for example 19 Jan 2008 04:35:01 is stored as
> Double value 39466.190980358806. The only way to make sense of the data is
> to look at the formatting on the cell. Although dates are the worst case, it
> also affects other numeric values - currencies, percentages, scientific,
> fractions and, worst of all, custom formats.
> POI recognises 49 "built in" formats of Excel and for those it has the
> limited capability of determining whether a numeric cell is a date or not and,
> if it is, a utility to convert it to a Java Date, something like:
>     if (HSSFDateUtil.isCellDateFormatted(cell)) {
>         Date date = HSSFDateUtil.getJavaDate(cell.getNumericCellValue());
>     }
> The current ExcelParser implementation takes no account of the data format,
> which IMO is going to severely limit how useful that implementation is. I
> also think that the above, while improving the situation slightly, is still
> not great. I asked about this on the POI dev list a couple of days ago[1] and
> the only light is that someone posted a format parser a few months back. It
> sounds like POI will accept that contribution if it has unit tests, so I'm
> going to try and find time to do that. If the data format can be properly
> parsed, then the data can be extracted in the format the user sees within
> Excel - which IMO would be the ideal situation.
> [1] http://www.mail-archive.com/[email protected]/msg00582.html
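
For illustration, the serial-date storage described in the issue can be
sketched with plain java.time, without POI. This is only a rough sketch under
stated assumptions: the class name ExcelSerialDate and the method
fromExcelSerial are hypothetical, it assumes the 1900 date system, it only
handles serials of 61 or more (dates after Excel's phantom 29 Feb 1900), and it
rounds the time part to the nearest second.

```java
import java.time.LocalDateTime;

// Hypothetical helper, not part of Tika or POI.
class ExcelSerialDate {

    // Assumption: 1900 date system and serial >= 61. For those serials,
    // counting from 1899-12-30 absorbs Excel's non-existent 29 Feb 1900.
    static LocalDateTime fromExcelSerial(double serial) {
        long wholeDays = (long) Math.floor(serial);
        // The fractional part is the fraction of a 24-hour day.
        long secondsOfDay = Math.round((serial - wholeDays) * 86_400d);
        return LocalDateTime.of(1899, 12, 30, 0, 0)
                .plusDays(wholeDays)
                .plusSeconds(secondsOfDay);
    }

    public static void main(String[] args) {
        // The value quoted in the issue description.
        System.out.println(fromExcelSerial(39466.190980358806));
    }
}
```

The raw double alone does not say whether it is a date, a currency or a
percentage, which is why the cell format still has to be consulted before a
conversion like this is applied.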