[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents

2022-11-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635958#comment-17635958
 ] 

Tim Allison commented on TIKA-3493:
---

There are some image formats that leave us with the same problem.  We should do 
with RTF whatever we're doing there (I think we leave the value without a 
time zone?).

I had to add this [1] to allow for successful pipes "emits" to Solr, 
Elasticsearch and OpenSearch.

[1] 
https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/filter/DateNormalizingMetadataFilter.java

> dcterms:created date depends on the current TimeZone in RTF documents
> -
>
>                 Key: TIKA-3493
>                 URL: https://issues.apache.org/jira/browse/TIKA-3493
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.0.0
>            Reporter: David Pilato
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: Test_case_to_demo_the_change_with_Tika_1_x1.patch
>
>
> I'm migrating an existing project to Tika 2.0.0.
> I'm seeing a strange behavior.
> TL;DR: the created date of the document changes depending on the timezone.
> Long story:
> I have a unit test which extracts content and metadata from an [RTF 
> document|https://github.com/dadoonet/fscrawler/raw/master/test-documents/src/main/resources/documents/test.rtf].
> When using Tika 1.27, whatever the timezone defined for my JVM, I'm always 
> getting the same value for "dcterms:created": "2016-07-07T13:38:00Z".
> When running the same test with Tika 2.0.0, the date changes depending on the 
> timezone.
> For example:
>  * Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z
>  * Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z
>  * Europe/Stockholm gives dcterms:created=2016-07-07T08:38:00Z
>  
> I don't know if it's a bug or expected. Maybe the RTF format 
> does not specify the timezone.
> I'm surprised that I don't see the same behavior for Office documents, 
> actually.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents

2022-11-18 Thread Konstantin Gribov (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635951#comment-17635951
 ] 

Konstantin Gribov commented on TIKA-3493:
-

Just hit the same issue with one of the tests failing. I looked through the 
RTF 1.9 spec, and it effectively stores a local date/time (just wall-clock 
time without a time zone) there. 

Right now it's interpreted as a date/time in the current JVM time zone. Both 
LibreOffice and Word (on Mac) interpret it the same way.
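
To illustrate the behavior described above, here is a minimal Java sketch using only {{java.time}}. It assumes the wall-clock value 2016-07-07 10:38 from the sample document's {{\creatim}} group; the class and method names are hypothetical, and this is not Tika's parser code. Interpreting the same local value in the three zones from the issue description should give three different UTC instants:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical demo class, not part of Tika.
class RtfCreatimDemo {

    // Interpret a zone-less wall-clock value as if it were recorded in the given zone.
    static Instant interpret(LocalDateTime wallClock, String zoneId) {
        return wallClock.atZone(ZoneId.of(zoneId)).toInstant();
    }

    public static void main(String[] args) {
        // Wall-clock value taken from {\creatim\yr2016\mo7\dy7\hr10\min38}
        LocalDateTime creatim = LocalDateTime.of(2016, 7, 7, 10, 38);

        String[] zones = {"Asia/Sakhalin", "Asia/Colombo", "Europe/Stockholm"};
        for (String zone : zones) {
            // Same local value, different instant for each zone
            System.out.println(zone + " -> " + interpret(creatim, zone));
        }
    }
}
```

With a reasonably current time zone database, the three instants match the values in the issue description (23:38Z on the previous day, 05:08Z and 08:38Z), which suggests the 2.x output is simply the local value shifted by the JVM's default offset.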

Maybe we should keep it without a time zone in the metadata string (in 
{{dcterms:created}} or another property) and only reinterpret it with a TZ in 
{{Metadata#getDate}}, but that would be a breaking change. Alternatively, we 
could keep the raw representation plus Tika's best guess at what instant it 
meant, which is likely to require breaking changes too. 
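
A rough sketch of the first option, assuming a hypothetical holder class (nothing like this exists in Tika as far as this thread shows): the stored string stays zone-less, and a time zone is applied only when a caller explicitly asks for an instant.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical value holder, for illustration only; not a Tika API.
final class LocalDateValue {

    // Zone-less ISO-like pattern: no offset and no trailing "Z"
    private static final DateTimeFormatter RAW_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss");

    private final LocalDateTime raw;

    LocalDateValue(LocalDateTime raw) {
        this.raw = raw;
    }

    // What would be stored in the metadata string: the wall-clock value as written.
    String asMetadataString() {
        return raw.format(RAW_FORMAT);
    }

    // Best-guess instant, computed only when the caller supplies a zone.
    Instant toInstant(ZoneId zone) {
        return raw.atZone(zone).toInstant();
    }

    public static void main(String[] args) {
        LocalDateValue created = new LocalDateValue(LocalDateTime.of(2016, 7, 7, 10, 38));
        System.out.println(created.asMetadataString());
        System.out.println(created.toInstant(ZoneId.of("UTC")));
    }
}
```

The stored string no longer depends on the JVM default zone; the trade-off is that existing consumers expecting a trailing {{Z}} would see a different format, which is the breaking change mentioned above.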






[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents

2021-07-22 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385505#comment-17385505
 ] 

David Pilato commented on TIKA-3493:


{quote}It doesn't look like the RTF specifies a timezone
{quote}
Yeah. That looks more like a feature than a bug to me... The bug was most 
likely in the 1.x branch. :)

 




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents

2021-07-22 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385472#comment-17385472
 ] 

Tim Allison commented on TIKA-3493:
---

It doesn't look like the RTF specifies a timezone: 

{noformat}
{\creatim\yr2016\mo7\dy7\hr10\min38}
{noformat}

But let me take a look at how the 2.x code differs from 1.x.
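
For reference, the {{\creatim}} group quoted above carries only year, month, day, hour and minute fields, so the natural Java type for it is a zone-less {{LocalDateTime}}. The sketch below is a simplified illustration with a hypothetical class name; it assumes the control words appear in exactly this order with no intervening whitespace, and it is not how Tika's RTF parser is implemented.

```java
import java.time.LocalDateTime;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Tika's RTF parser.
class CreatimParser {

    // Matches \creatim\yrN\moN\dyN\hrN\minN with an optional \secN
    private static final Pattern CREATIM = Pattern.compile(
            "\\\\creatim\\\\yr(\\d+)\\\\mo(\\d+)\\\\dy(\\d+)\\\\hr(\\d+)\\\\min(\\d+)(?:\\\\sec(\\d+))?");

    // Returns the wall-clock value; no time zone is available to attach.
    static LocalDateTime parse(String rtfGroup) {
        Matcher m = CREATIM.matcher(rtfGroup);
        if (!m.find()) {
            throw new IllegalArgumentException("No \\creatim group found: " + rtfGroup);
        }
        int seconds = m.group(6) == null ? 0 : Integer.parseInt(m.group(6));
        return LocalDateTime.of(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                Integer.parseInt(m.group(5)),
                seconds);
    }

    public static void main(String[] args) {
        System.out.println(parse("{\\creatim\\yr2016\\mo7\\dy7\\hr10\\min38}"));
    }
}
```

Any conversion of that value to a UTC timestamp has to bring in a zone from somewhere else, such as the JVM default or a configured value.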






[jira] [Commented] (TIKA-3493) dcterms:created date depends on the current TimeZone in RTF documents

2021-07-22 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385445#comment-17385445
 ] 

David Pilato commented on TIKA-3493:


I attached a patch which adds a unit test. 

It is failing with:

{{org.junit.ComparisonFailure: }}
{{Expected :2006-05-18T07:19:00Z}}
{{Actual :2006-05-18T10:19:00Z}}



