[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363834 ]
Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

I really don't like this "X-Nutch" naming convention. First, it's really
protocol-level oriented, and it forces us to map "X-Nutch" values to the
original ones (of course, a utility method can easily provide this mapping).
But I really don't think this solution is clean (from my point of view).

We should perhaps define once more what a MetaData value is. I suggest
defining a new class to represent a metadata value instead of using a simple
String. That way, the class can hold both the original and the final value.
The idea is that the only way to set the original value is to construct a new
object (I will call this class MetaValue, but native English speakers are
encouraged to propose a better name). After that, setting the value of this
metadata value never overrides the original one, only the final one.

Here is a short piece of code:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class MetaValue {

      private String[] original = null;
      private List<String> actual = null;

      // Constructor for multiple values
      public MetaValue(String[] values) {
        original = values;
      }

      // Constructor for a single value
      public MetaValue(String value) {
        original = new String[] { value };
      }

      public void setValue(String[] values) {
        // copies the values into a new actual list
        actual = new ArrayList<String>(Arrays.asList(values));
      }

      public void addValue(String value) {
        // appends this value to the list of final values
        if (actual == null) { actual = new ArrayList<String>(); }
        actual.add(value);
      }

      public String[] getOriginalValues() {
        return original;
      }

      public String[] getFinalValues() {
        // null means no final value has been set yet
        return (actual == null) ? null : actual.toArray(new String[0]);
      }

      public String[] getValues() {
        // returns the final values if the list of values is not null,
        // otherwise returns the original values
        return (actual != null) ? getFinalValues() : original;
      }
    }

With this approach we can keep the same value (MetaValue) with the same key.
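To illustrate the last point (one MetaValue per key), here is a minimal sketch of how it might look in use. It is only an illustration under assumptions: the map-based store, the MetaValueDemo class with its sample() helper, and the method bodies of the condensed MetaValue are hypothetical and are not Nutch code. The "content-type" key is taken from the issue description quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Condensed copy of the MetaValue sketch so that this example compiles on
// its own. The method bodies are an assumption, not part of the proposal.
class MetaValue {
  private final String[] original;
  private List<String> actual = null;

  MetaValue(String value) {
    original = new String[] { value };
  }

  void setValue(String[] values) {
    actual = new ArrayList<String>(Arrays.asList(values));
  }

  String[] getOriginalValues() {
    return original;
  }

  String[] getValues() {
    // final values if any were set, otherwise the original values
    return (actual != null) ? actual.toArray(new String[0]) : original;
  }
}

// Hypothetical demo class: a plain map stands in for the metadata store.
class MetaValueDemo {

  static Map<String, MetaValue> sample() {
    Map<String, MetaValue> metadata = new HashMap<String, MetaValue>();
    // value as reported by the protocol layer
    metadata.put("content-type", new MetaValue("text/html"));
    // a later stage corrects the value; the original is kept under the same key
    metadata.get("content-type").setValue(new String[] { "application/xhtml+xml" });
    return metadata;
  }

  public static void main(String[] args) {
    MetaValue v = sample().get("content-type");
    System.out.println("original: " + v.getOriginalValues()[0]);
    System.out.println("final:    " + v.getValues()[0]);
  }
}
```

Both values stay reachable through a single key, so no second "X-Nutch" style key is needed to preserve what the protocol layer originally reported.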
> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM,
>               although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
>               NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything
> that they want, such as having names of "Content-type", "content-TyPe",
> "CONTENT_TYPE" all having the same meaning. Stefan G., I believe, proposed a
> solution in which all property names be converted to lower case, but in
> essence this really only fixes half the problem (the case of identifying
> that "CONTENT_TYPE" and "conTeNT_TyPE" and all the permutations are really
> the same). What if I named it "Content Type", or "ContentType"?
>
> I propose that a way to correct this would be to create a standard set of
> named Strings in the ParseData class that the protocol framework and the
> parsing framework could use to identify common properties such as
> "Content-type", "Creator", "Language", etc.
>
> The properties would be defined at the top of the ParseData class, something
> like:
>
> public class ParseData{
>   .....
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ....
> }
>
> In this fashion, users could at least know what the names of the standard
> properties that they can obtain from the ParseData are, for example by making
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the
> content type or a call to
> ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml");
> Of course, this wouldn't preclude users from doing what they are currently
> doing; it would just provide a standard method of obtaining some of the more
> common, critical metadata without poring over the code base to figure out
> what they are named.
>
> I'll contribute a patch near the end of this week, or the beginning of next
> week, that addresses this issue.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira