[ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404953#comment-13404953 ]
Jörg Ehrlich commented on TIKA-930:
-----------------------------------

Hi Ray and Nick,

It is very important to also "educate" average developers to use the standards properly.

As I wrote for the Rating field: it is imperative to stick with standards; otherwise you risk sacrificing interoperability, which is one of the most important features of metadata.

Regarding the Creator field: IPTC and PLUS are very strong and well-known standards for describing who created which part of an asset. I strongly recommend sticking with at least one of them instead of coming up with our own proprietary creator scheme that no one knows about.

It's nice to be able to be pragmatic, but not using standards for metadata today causes a lot of headaches in the future.

Regarding Geo data: I'm OK with using the W3C properties for the core properties.

> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we
> should minimize ambiguity by consolidating them into a single composite
> property with the clearest name, the most general specification referenced as
> its primary property, and the others and deprecated strings as its
> secondaries.
> Here's the proposed pseudo-code for the changes:
>
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS,
> MSOffice.KEYWORDS, Metadata.SUBJECT }
>
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE,
> MSOffice.CREATION_DATE, Metadata.DATE }
>
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE,
> MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
>
> and an example of the Java changes:
>
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT =
>         Property.composite(DublinCore.SUBJECT,
>             new Property[] { Property.internalText(Metadata.SUBJECT) });
>
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>         Property.composite(Office.KEYWORDS,
>             new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
>
> would become
>
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>         Property.composite(DublinCore.SUBJECT,
>             new Property[] {
>                 Office.KEYWORDS,
>                 Property.internalTextBag(MSOffice.KEYWORDS),
>                 Property.internalText(Metadata.SUBJECT)
>             });
> {code}
>
> Since this would require a bit of refactoring for parsers that use the
> properties being removed I thought it best to get some feedback before
> working up a full patch.
>
> Does this seem like a reasonable approach?
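
To illustrate the consolidation idea in the quoted proposal, here is a minimal sketch of how a single composite key could keep the old names readable. It is a toy model only: Composite and SimpleMetadata are hypothetical stand-ins, not Tika's Property and Metadata classes, and the key strings are chosen for illustration. It assumes that setting a composite property writes the value under the primary name and under every secondary name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a composite property; NOT Tika's real Property/Metadata API. */
public class CompositePropertySketch {

    /** One primary name plus the legacy (secondary) names it replaces. */
    static final class Composite {
        final String primary;
        final List<String> secondaries;

        Composite(String primary, String... secondaries) {
            this.primary = primary;
            this.secondaries = Arrays.asList(secondaries);
        }
    }

    /** Minimal name-to-value store standing in for a metadata object. */
    static final class SimpleMetadata {
        private final Map<String, String> values = new LinkedHashMap<>();

        /** Assumption: a composite write fans out to the primary and all secondaries. */
        void set(Composite property, String value) {
            values.put(property.primary, value);
            for (String name : property.secondaries) {
                values.put(name, value);
            }
        }

        /** Returns the value stored under the given name, or null if absent. */
        String get(String name) {
            return values.get(name);
        }
    }

    // Illustrative key strings for the consolidated KEYWORDS property (assumed, not verified).
    static final Composite KEYWORDS =
            new Composite("dc:subject", "meta:keyword", "Keywords", "subject");

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        metadata.set(KEYWORDS, "tika, metadata");

        // Old and new names resolve to the same value after a single write.
        System.out.println(metadata.get("dc:subject"));
        System.out.println(metadata.get("Keywords"));
    }
}
```

In this model a parser would only need to set KEYWORDS once, while code that still reads one of the deprecated names would keep working. Whether that matches the real behaviour depends on how Tika's Metadata handles composite properties.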