Ray Gauss II created TIKA-930:
---------------------------------

             Summary: Consolidation of Some Tika Core Properties
                 Key: TIKA-930
                 URL: https://issues.apache.org/jira/browse/TIKA-930
             Project: Tika
          Issue Type: Improvement
          Components: metadata
    Affects Versions: 1.2
            Reporter: Ray Gauss II



There are a few properties in TikaCoreProperties which overlap, and I think we 
should minimize ambiguity by consolidating each overlapping set into a single 
composite property. Each composite would take the clearest name, reference the 
most general specification as its primary property, and list the other 
properties and deprecated strings as its secondaries.

Here's the proposed pseudo-code for the changes:

Remove TikaCoreProperties.SUBJECT
TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT,
    { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }

Remove TikaCoreProperties.DATE
TikaCoreProperties.CREATION_DATE <- DublinCore.DATE,
    { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }

Remove TikaCoreProperties.MODIFIED
TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED,
    { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }


and an example of the Java changes:
{code:title=TikaCoreProperties.java *Before*}
    /**
     * @see DublinCore#SUBJECT
     */
    public static final Property SUBJECT = Property.composite(DublinCore.SUBJECT,
            new Property[] { Property.internalText(Metadata.SUBJECT) });
      
    /**
     * @see Office#KEYWORDS
     */
    public static final Property KEYWORDS = Property.composite(Office.KEYWORDS,
            new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
{code}
would become
{code:title=TikaCoreProperties.java *After*}
    /**
     * @see DublinCore#SUBJECT
     * @see Office#KEYWORDS
     */
    public static final Property KEYWORDS = Property.composite(DublinCore.SUBJECT,
            new Property[] {
                    Office.KEYWORDS,
                    Property.internalTextBag(MSOffice.KEYWORDS),
                    Property.internalText(Metadata.SUBJECT)
            });
{code}
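
To illustrate the effect I'm after, here is a minimal stand-alone sketch of how 
a consolidated composite would behave, assuming that setting a composite writes 
the value under its primary and every secondary. It is not Tika code: the 
CompositeSketch class, its set() helper, and the key strings are made up for 
illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone model of a composite property; NOT Tika's actual classes. */
public class CompositeSketch {

    /** A composite: one primary key plus any number of secondary keys. */
    static final class Composite {
        final String primary;
        final List<String> secondaries;

        Composite(String primary, List<String> secondaries) {
            this.primary = primary;
            this.secondaries = secondaries;
        }
    }

    /** Models the proposed KEYWORDS composite; key strings are assumptions. */
    static final Composite KEYWORDS = new Composite(
            "dc:subject",                                        // stands in for DublinCore.SUBJECT
            Arrays.asList("meta:keyword", "Keywords", "subject")); // Office, MSOffice, Metadata

    /** Writes the value under the primary key and under every secondary key. */
    static void set(Map<String, String> metadata, Composite property, String value) {
        metadata.put(property.primary, value);
        for (String key : property.secondaries) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        set(metadata, KEYWORDS, "tika, metadata");
        // Every key of the composite now carries the same value.
        System.out.println(metadata);
    }
}
```

The point of the sketch is that a parser makes one call against the single 
consolidated property, while anything still reading the older keys keeps 
seeing a value.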


Since this would require a bit of refactoring for parsers that use the 
properties being removed, I thought it best to get some feedback before working 
up a full patch.
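
As a rough idea of the scope of that refactoring, the renames implied by the 
pseudo-code above can be written down as a lookup table. This is only a 
sketch: the RemovedPropertyRenames class and replacementFor() are hypothetical 
helpers, not part of Tika, and the real patch would simply change the constant 
each parser references.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper listing the renames a parser would need; NOT part of Tika. */
public class RemovedPropertyRenames {

    /** Removed TikaCoreProperties constant name -> proposed replacement name. */
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("SUBJECT", "KEYWORDS");
        RENAMES.put("DATE", "CREATION_DATE");
        RENAMES.put("MODIFIED", "SAVE_DATE");
    }

    /** Returns the replacement name, or the input unchanged if it is not being removed. */
    static String replacementFor(String constantName) {
        return RENAMES.getOrDefault(constantName, constantName);
    }

    public static void main(String[] args) {
        // Print each rename in the order the proposal lists them.
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            System.out.println("TikaCoreProperties." + e.getKey()
                    + " -> TikaCoreProperties." + e.getValue());
        }
    }
}
```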

Does this seem like a reasonable approach?
