[ 
https://issues.apache.org/jira/browse/TIKA-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566429#comment-17566429
 ] 

Luís Filipe Nassif commented on TIKA-3815:
------------------------------------------

That test also doesn't pass on my machine, sorry about that...

I may have found a similar issue in RFC822Parser: when no timezone is 
specified in a Date string, it is parsed using the local timezone. That could 
lead to different results in different timezones, but I'm not sure whether 
that behavior is acceptable or expected.
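
To illustrate the effect, here is a minimal sketch. It is not RFC822Parser's 
actual code: the class name, the parseIn helper and the date pattern are 
hypothetical, and it assumes a SimpleDateFormat-style parser. The same 
zone-less string is parsed once with a -03:00 zone and once with UTC, and the 
two resulting instants differ by the zone offset.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not part of Tika.
class LocalZoneParsingDemo {

    // Parses a date string that carries no zone, interpreting it in `zone`.
    // This mimics what happens when a parser falls back to the local zone.
    static Date parseIn(String text, TimeZone zone) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        format.setTimeZone(zone);
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        String text = "2022-06-16 11:18:49";
        Date minusThree = parseIn(text, TimeZone.getTimeZone("GMT-03:00"));
        Date utc = parseIn(text, TimeZone.getTimeZone("UTC"));
        // 11:18:49 at -03:00 is 14:18:49 UTC, so it is 3 hours later than 11:18:49 UTC.
        long diffHours = (minusThree.getTime() - utc.getTime()) / 3_600_000L;
        System.out.println("difference in hours: " + diffHours); // prints 3
    }
}
```

A test that formats such a Date without pinning a zone will therefore only 
pass on machines whose default zone matches the one the expected value was 
written for.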

I'll fix RFC822ParserTest to use the UTC timezone when formatting parsed Dates 
that carry timezone information. If a tested Date string has no timezone info, 
the test won't set a timezone, just as RFC822Parser doesn't.

But maybe RFC822Parser should be changed to use UTC when no timezone is 
specified in Date strings. This would make results consistent in different 
timezones, but would change the behavior and I'm not sure if it is desired.
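
A rough sketch of what "use UTC when no timezone is specified" could look 
like, using java.time. This is not a proposed patch: the class and method 
names are hypothetical, and ISO-8601 input strings are assumed only to keep 
the example short (real mail Date headers use a different syntax).

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical demo class, not part of Tika.
class UtcFallbackDemo {

    // Honors an explicit offset when present; otherwise assumes UTC
    // instead of the JVM default zone, so results do not depend on the machine.
    static Instant parse(String text) {
        try {
            return OffsetDateTime.parse(text, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant();
        } catch (DateTimeParseException noOffset) {
            return LocalDateTime.parse(text, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
                    .toInstant(ZoneOffset.UTC);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2022-06-16T11:18:49-03:00")); // 2022-06-16T14:18:49Z
        System.out.println(parse("2022-06-16T11:18:49"));       // 2022-06-16T11:18:49Z
    }
}
```

The trade-off is the one mentioned above: output becomes reproducible across 
machines, but zone-less dates that used to be read as local time would now 
shift by the local offset.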

Opinions?

> Inconsistent Date/Time information extracted from Exif data
> -----------------------------------------------------------
>
>                 Key: TIKA-3815
>                 URL: https://issues.apache.org/jira/browse/TIKA-3815
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1, 1.28.4
>            Reporter: Luís Filipe Nassif
>            Assignee: Luís Filipe Nassif
>            Priority: Major
>             Fix For: 2.4.2
>
>         Attachments: IMG_20220616_111848_HDR.jpg
>
>
> Running tika-app-2.4.1.jar on the attached image, this metadata is returned:
> Exif IFD0:Date/Time: 2022:06:16 11:18:49
> Exif SubIFD:Date/Time Digitized: 2022:06:16 11:18:49
> Exif SubIFD:Date/Time Original: 2022:06:16 11:18:49
> Exif SubIFD:Time Zone: -03:00
> Exif SubIFD:Time Zone Digitized: -03:00
> Exif SubIFD:Time Zone Original: -03:00
> File Modified Date: Thu Jun 16 11:18:50 -03:00 2022
> GPS:GPS Date Stamp: 2022:06:16
> GPS:GPS Time-Stamp: 14:18:47.000 UTC
> dcterms:created: 2022-06-16T08:18:49
> dcterms:modified: 2022-06-16T08:18:49
> exif:DateTimeOriginal: 2022-06-16T08:18:49
>  
> The right value is 2022-06-16T14:18:49Z. Although no timezone is 
> specified for some values, I think it makes no sense to convert them to a 
> timezone other than GMT, the one where the picture was taken (-03:00), or 
> the one where the application runs (-03:00), so Tika may be making an 
> incorrect timezone conversion on the last 3 fields.
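
The expected value quoted above can be checked by combining the Exif 
Date/Time Original with the Exif Time Zone Original field. The sketch below 
is only a worked example of that arithmetic. The class and method names are 
hypothetical and this is not how Tika reads Exif data.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class, not part of Tika.
class ExifOffsetDemo {

    // Exif writes dates as "yyyy:MM:dd HH:mm:ss" with no zone attached.
    private static final DateTimeFormatter EXIF =
            DateTimeFormatter.ofPattern("uuuu:MM:dd HH:mm:ss");

    // Applies the separately stored offset (e.g. "-03:00") to the local Exif time.
    static Instant toInstant(String exifDateTime, String exifOffset) {
        LocalDateTime local = LocalDateTime.parse(exifDateTime, EXIF);
        return local.toInstant(ZoneOffset.of(exifOffset));
    }

    public static void main(String[] args) {
        // Values taken from the metadata listed in the issue.
        System.out.println(toInstant("2022:06:16 11:18:49", "-03:00")); // 2022-06-16T14:18:49Z
    }
}
```

For comparison, the reported 08:18:49 equals 11:18:49 minus three hours, that 
is, the local time shifted in the opposite direction from the UTC value.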



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
