[ 
https://issues.apache.org/jira/browse/CONNECTORS-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031505#comment-17031505
 ] 

Karl Wright commented on CONNECTORS-1617:
-----------------------------------------

The internal Tika extractor treats all metadata as strings, using the Tika 
library.  I don't think the date format is configurable.  Indeed, there's a
thread on the Tika user list about this:

https://grokbase.com/t/tika/user/10982he7yd/how-can-i-configure-tika-to-extract-dates-in-single-format

Note that Tika tries to maintain the date format present in the original 
spreadsheet!!

The solution proposed when you want a specific date format is this:

{quote}
* Write your own excel parser for Tika, which ignores the date formatting
set for cells, and always uses iso8601
{quote}

That's not going to cut it here because we don't have any information that
would allow us to autodetect the incoming format properly.  What we get back
from Tika is basically just text, and there are no hints, especially for dates
like "01-01-2010".  Which comes first, the day or the month?
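
To make the ambiguity concrete, here is a minimal Python sketch.  The value
"02-03-2010" is a made-up example (not taken from the attached file); it parses
cleanly under both conventions and yields two different dates, so nothing in
the string itself tells us which reading is right:

```python
from datetime import datetime

# Hypothetical extracted value; both readings below are valid parses.
raw = "02-03-2010"

day_first = datetime.strptime(raw, "%d-%m-%Y").date()
month_first = datetime.strptime(raw, "%m-%d-%Y").date()

print(day_first.isoformat())    # 2010-03-02
print(month_first.isoformat())  # 2010-02-03
```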

The external Tika extractor has even less configurability because you cannot 
run custom code there.

Now, suppose all you want to do is post-process just *dates* to change the
separator character.  Well, we do not even know whether a field returned from
Tika is a date.  If we replaced all /'s with -'s in every field, we'd corrupt
other kinds of fields.
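
A rough Python sketch of why a blanket separator swap is unsafe.  The field
names and the non-date values are invented for illustration; only "10/05/18"
comes from the report:

```python
def naive_fix(value: str) -> str:
    # Blindly swap the separator, with no idea what the field holds.
    return value.replace("/", "-")

# Hypothetical extracted fields; only the first one is actually a date.
fields = {
    "date": "10/05/18",
    "path": "reports/2018/summary",
    "ratio": "3/4",
}

for name, value in fields.items():
    print(name, "->", naive_fix(value))
# date -> 10-05-18
# path -> reports-2018-summary
# ratio -> 3-4
```

Note that even the date field does not come out as the requested 2018-05-10:
the field order and the two-digit year are untouched.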

My conclusion: there's nothing we can do in ManifoldCF to fix this problem.  A
solution might be found in Tika itself, but only if somebody opens a ticket
for it.  Tika would need to go through the column definitions, work out which
columns are dates, and format those values accordingly.  Feel free to open a
Tika ticket.
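
As a sketch of what "act accordingly" could look like in principle: once the
parser still knows a cell is a date, emitting ISO 8601 is trivial.  This is
not Tika's API; `render_cell` and the row layout are hypothetical, and the
typed values stand in for what a spreadsheet parser sees before it flattens
everything to text:

```python
from datetime import date

def render_cell(value):
    # Date-typed cells are emitted as ISO 8601; everything else is left alone.
    if isinstance(value, date):
        return value.isoformat()
    return str(value)

# Hypothetical row: two date-typed cells and one plain text cell.
row = [date(2018, 5, 10), date(2002, 2, 2), "10/05/18"]

print([render_cell(v) for v in row])
# ['2018-05-10', '2002-02-02', '10/05/18']
```

The text cell passes through unchanged, which is why this can only be done
where the cell type is still available.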



> Date format extraction problem in XLS/XLSX
> ------------------------------------------
>
>                 Key: CONNECTORS-1617
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1617
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Tika extractor, Tika service connector
>    Affects Versions: ManifoldCF 2.10
>            Reporter: Zoltan Farago
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: exceldatum.xlsx
>
>
> Currently TIKA/ManifoldCF 2.10 extracts dates from the attached file this way:
> 2018.05.10 -> 10/05/18
> 2002.02.02 -> 2/2/2
> We need this format:
> 2018.05.10 -> 2018-05-10
> 2002.02.02 -> 2002-02-02
> This occurs only when the field type is date. When the field type is text 
> then the output is fine.
>  
> Please help us with a recommendation with any settings in the pipeline (Tika 
> configs, excel setting, OS local settings, etc.), or provide a fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
