[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110592#comment-14110592 ] Vilmos Papp commented on TIKA-1369: --- Hi Nick, In ImageMetadataExtrctorTest there are a test cases where new ExifHanlders created like: {code} ExifSubIFDDirectory exif = mock(ExifSubIFDDirectory.class); when(exif.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)).thenReturn(true); GregorianCalendar calendar = new GregorianCalendar(TimeZone.getDefault(), Locale.getDefault()); calendar.setTimeInMillis(0); calendar.set(2000, 0, 1, 0, 0, 0); when(exif.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)).thenReturn( calendar.getTime()); // jvm default timezone as in Metadata Extractor Metadata metadata = new Metadata(); new ImageMetadataExtractor.ExifHandler().handle(exif, metadata); {code} What happens if we cache ExifHandler like this: {code} ImageMetadataExtractor.ExifHandler cachedHandler = new ImageMetadataExtractor.ExifHandler(); {code} And use it like this (possibly several times for different metadata): {code} cachedHandler.handle(exif, metadata); {code} This seems to me a valid scenario, and as caching an object for later usage should be fine, perhaps the original thread safety issue could occur, but I should create a test case for it to prove it or prove that it not occurs. What do you think? > Date parsing and thread safety in ImageMetadataExtractor > > > Key: TIKA-1369 > URL: https://issues.apache.org/jira/browse/TIKA-1369 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: OS X 10.9.4 Java 7_60 >Reporter: John Gibson >Priority: Critical > > The {{ImageMetadataExtractor}} uses a static instance of > {{SimpleDateFormat}}. This is not thread safe. > {code:title=ImageMetadataExtractor.java} > static class ExifHandler implements DirectoryHandler { > private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new > SimpleDateFormat("-MM-dd'T'HH:mm:ss"); > ... > public void handleDateTags(Directory directory, Metadata metadata) > throws MetadataException { > // Date/Time Original overrides value from > ExifDirectory.TAG_DATETIME > Date original = null; > if > (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) { > original = > directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL); > // Unless we have GPS time we don't know the time zone so > date must be set > // as ISO 8601 datetime without timezone suffix (no Z or +/-) > if (original != null) { > String datetimeNoTimeZone = > DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor > uses > metadata.set(TikaCoreProperties.CREATED, > datetimeNoTimeZone); > metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone); > } > } >... > {code} > This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. > In the discussion there the idea of using alternative thread-safe (and > faster) formatters from either Joda time or Commons Lang were dismissed > because they would add too many dependencies. Given that Tika already has a > fairly large laundry list of dependencies to parse content, adding one more > JAR to make sure things don't break is probably a good idea. > In addition, because no timezone or locale are specified by either Tika's > formatter or the call to com.drew.metadata.Directory it can wreak havok > during randomized testing. Given that the timezone is unknown, why not just > default it to UTC and let the caller guess the timezone? As it stands I have > to reparse all of the dates into UTC to get stable behavior across timezones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110444#comment-14110444 ] Vilmos Papp commented on TIKA-1369: --- Hi Nick, Yes, that should fix it, I think. > Date parsing and thread safety in ImageMetadataExtractor > > > Key: TIKA-1369 > URL: https://issues.apache.org/jira/browse/TIKA-1369 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: OS X 10.9.4 Java 7_60 >Reporter: John Gibson >Priority: Critical > > The {{ImageMetadataExtractor}} uses a static instance of > {{SimpleDateFormat}}. This is not thread safe. > {code:title=ImageMetadataExtractor.java} > static class ExifHandler implements DirectoryHandler { > private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new > SimpleDateFormat("-MM-dd'T'HH:mm:ss"); > ... > public void handleDateTags(Directory directory, Metadata metadata) > throws MetadataException { > // Date/Time Original overrides value from > ExifDirectory.TAG_DATETIME > Date original = null; > if > (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) { > original = > directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL); > // Unless we have GPS time we don't know the time zone so > date must be set > // as ISO 8601 datetime without timezone suffix (no Z or +/-) > if (original != null) { > String datetimeNoTimeZone = > DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor > uses > metadata.set(TikaCoreProperties.CREATED, > datetimeNoTimeZone); > metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone); > } > } >... > {code} > This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. > In the discussion there the idea of using alternative thread-safe (and > faster) formatters from either Joda time or Commons Lang were dismissed > because they would add too many dependencies. Given that Tika already has a > fairly large laundry list of dependencies to parse content, adding one more > JAR to make sure things don't break is probably a good idea. > In addition, because no timezone or locale are specified by either Tika's > formatter or the call to com.drew.metadata.Directory it can wreak havok > during randomized testing. Given that the timezone is unknown, why not just > default it to UTC and let the caller guess the timezone? As it stands I have > to reparse all of the dates into UTC to get stable behavior across timezones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108981#comment-14108981 ] Vilmos Papp commented on TIKA-1369: --- Hi Nick, Did you manage to take a look at my fix? Regards, Vilmos > Date parsing and thread safety in ImageMetadataExtractor > > > Key: TIKA-1369 > URL: https://issues.apache.org/jira/browse/TIKA-1369 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: OS X 10.9.4 Java 7_60 >Reporter: John Gibson >Priority: Critical > > The {{ImageMetadataExtractor}} uses a static instance of > {{SimpleDateFormat}}. This is not thread safe. > {code:title=ImageMetadataExtractor.java} > static class ExifHandler implements DirectoryHandler { > private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new > SimpleDateFormat("-MM-dd'T'HH:mm:ss"); > ... > public void handleDateTags(Directory directory, Metadata metadata) > throws MetadataException { > // Date/Time Original overrides value from > ExifDirectory.TAG_DATETIME > Date original = null; > if > (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) { > original = > directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL); > // Unless we have GPS time we don't know the time zone so > date must be set > // as ISO 8601 datetime without timezone suffix (no Z or +/-) > if (original != null) { > String datetimeNoTimeZone = > DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor > uses > metadata.set(TikaCoreProperties.CREATED, > datetimeNoTimeZone); > metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone); > } > } >... > {code} > This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. > In the discussion there the idea of using alternative thread-safe (and > faster) formatters from either Joda time or Commons Lang were dismissed > because they would add too many dependencies. Given that Tika already has a > fairly large laundry list of dependencies to parse content, adding one more > JAR to make sure things don't break is probably a good idea. > In addition, because no timezone or locale are specified by either Tika's > formatter or the call to com.drew.metadata.Directory it can wreak havok > during randomized testing. Given that the timezone is unknown, why not just > default it to UTC and let the caller guess the timezone? As it stands I have > to reparse all of the dates into UTC to get stable behavior across timezones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077587#comment-14077587 ] Vilmos Papp commented on TIKA-1369: --- Hi Nick, Thanks, for the quick answer. I prefer pull request over attachments of patches. Cheers, Vilmos > Date parsing and thread safety in ImageMetadataExtractor > > > Key: TIKA-1369 > URL: https://issues.apache.org/jira/browse/TIKA-1369 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: OS X 10.9.4 Java 7_60 >Reporter: John Gibson >Priority: Critical > > The {{ImageMetadataExtractor}} uses a static instance of > {{SimpleDateFormat}}. This is not thread safe. > {code:title=ImageMetadataExtractor.java} > static class ExifHandler implements DirectoryHandler { > private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new > SimpleDateFormat("-MM-dd'T'HH:mm:ss"); > ... > public void handleDateTags(Directory directory, Metadata metadata) > throws MetadataException { > // Date/Time Original overrides value from > ExifDirectory.TAG_DATETIME > Date original = null; > if > (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) { > original = > directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL); > // Unless we have GPS time we don't know the time zone so > date must be set > // as ISO 8601 datetime without timezone suffix (no Z or +/-) > if (original != null) { > String datetimeNoTimeZone = > DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor > uses > metadata.set(TikaCoreProperties.CREATED, > datetimeNoTimeZone); > metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone); > } > } >... > {code} > This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. > In the discussion there the idea of using alternative thread-safe (and > faster) formatters from either Joda time or Commons Lang were dismissed > because they would add too many dependencies. Given that Tika already has a > fairly large laundry list of dependencies to parse content, adding one more > JAR to make sure things don't break is probably a good idea. > In addition, because no timezone or locale are specified by either Tika's > formatter or the call to com.drew.metadata.Directory it can wreak havok > during randomized testing. Given that the timezone is unknown, why not just > default it to UTC and let the caller guess the timezone? As it stands I have > to reparse all of the dates into UTC to get stable behavior across timezones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077561#comment-14077561 ] Vilmos Papp commented on TIKA-1369: --- Hi, I've sent a pull request on github to fix this: https://github.com/chrismattmann/tika/pull/1, I hope I sent it to the proper person, if not, where should I send it? Regards, Vilmos > Date parsing and thread safety in ImageMetadataExtractor > > > Key: TIKA-1369 > URL: https://issues.apache.org/jira/browse/TIKA-1369 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: OS X 10.9.4 Java 7_60 >Reporter: John Gibson >Priority: Critical > > The {{ImageMetadataExtractor}} uses a static instance of > {{SimpleDateFormat}}. This is not thread safe. > {code:title=ImageMetadataExtractor.java} > static class ExifHandler implements DirectoryHandler { > private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new > SimpleDateFormat("-MM-dd'T'HH:mm:ss"); > ... > public void handleDateTags(Directory directory, Metadata metadata) > throws MetadataException { > // Date/Time Original overrides value from > ExifDirectory.TAG_DATETIME > Date original = null; > if > (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) { > original = > directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL); > // Unless we have GPS time we don't know the time zone so > date must be set > // as ISO 8601 datetime without timezone suffix (no Z or +/-) > if (original != null) { > String datetimeNoTimeZone = > DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor > uses > metadata.set(TikaCoreProperties.CREATED, > datetimeNoTimeZone); > metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone); > } > } >... > {code} > This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. > In the discussion there the idea of using alternative thread-safe (and > faster) formatters from either Joda time or Commons Lang were dismissed > because they would add too many dependencies. Given that Tika already has a > fairly large laundry list of dependencies to parse content, adding one more > JAR to make sure things don't break is probably a good idea. > In addition, because no timezone or locale are specified by either Tika's > formatter or the call to com.drew.metadata.Directory it can wreak havok > during randomized testing. Given that the timezone is unknown, why not just > default it to UTC and let the caller guess the timezone? As it stands I have > to reparse all of the dates into UTC to get stable behavior across timezones. -- This message was sent by Atlassian JIRA (v6.2#6252)