[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362618 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Here is a new proposal for this issue.

org.apache.nutch.util.MetaData
  * becomes a utility class that is only a container of multi-valued, 
typo-tolerant String properties (using the same kind of API as JavaMail: the 
add / set methods mentioned by Doug - this is already implemented in the 
current patch).
  * There are no more metadata name constants in this class, since it becomes a 
generic object for storing String/String[] mappings
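
As a rough illustration of such a container, here is a minimal sketch. It is 
not the code from the patch: the class name MetaDataSketch, the method names, 
and above all the normalization rule (lower-case the name and drop every 
character that is not a letter or digit) are assumptions made for this example. 
It is written against a recent JDK for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a multi-valued, typo-tolerant String container.
class MetaDataSketch {
    // Normalized property name -> all values stored under that name.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Assumed normalization rule: lower-case the name and keep only letters
    // and digits, so "Content-Type", "CONTENT_TYPE" and "Content   Type"
    // all map to the same key.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Appends a value, keeping any values already stored for this name.
    void add(String name, String value) {
        values.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(value);
    }

    // Replaces all values stored for this name with a single value.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(normalize(name), single);
    }

    // Returns the first value for this name, or null if there is none.
    String get(String name) {
        List<String> list = values.get(normalize(name));
        return list == null ? null : list.get(0);
    }

    // Returns every value for this name (an empty array if there is none).
    String[] getValues(String name) {
        List<String> list = values.get(normalize(name));
        return list == null ? new String[0] : list.toArray(new String[0]);
    }
}
```

With this sketch, add() accumulates values while set() overwrites them, which 
is the add / set distinction referred to above.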


org.apache.nutch.protocol.ContentProperties
  * This class simply extends the MetaData class
  * It defines the content related constants (Content-Type, and so on)

org.apache.nutch.parse.ParseProperties
  * This class simply extends the MetaData class
  * It defines the parse related constants (Dublin Core constants)
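
To make the intended layout concrete, the two subclasses could look roughly 
like the sketch below. Everything here is illustrative: BaseMetaSketch is a 
single-valued stand-in for the generic container, and the class names and the 
particular constants are assumptions, not the contents of the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the generic container (single-valued to keep the sketch short).
class BaseMetaSketch {
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }
}

// Content related names live with the protocol-side class.
class ContentPropertiesSketch extends BaseMetaSketch {
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT_LENGTH = "Content-Length";
}

// Parse related names (Dublin Core element names) live with the parse-side class.
class ParsePropertiesSketch extends BaseMetaSketch {
    static final String CREATOR = "creator";
    static final String LANGUAGE = "language";
}
```

The point of the split is only where the constants are declared; both 
subclasses inherit all storage behaviour from the container.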

org.apache.nutch.parse.ParseData
  * The constructor becomes ParseData(ParseStatus, String, Outlink[], 
ContentProperties)
  * This class holds two metadata sets:
     1. ContentProperties for the original metadata set, which came from the 
protocol
     2. ParseProperties for the parse metadata set.

  * This class provides 3 ways to retrieve a metadata value:
    1. public ContentProperties getContentMeta();
    2. public ParseProperties getParseMeta();
    3. public MetaData getMetaData(); // Returns a merge of the two previous 
sets, where values in parse properties override those in content properties.
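
The override rule for the merged view can be sketched as follows. This is a 
hedged example, not the patch: plain maps stand in for the two property 
classes, MergedMetaSketch and merge() are hypothetical names, and it assumes 
that a parse entry replaces all content values stored under the same name 
(rather than being appended to them).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing "parse properties override content properties".
class MergedMetaSketch {
    static Map<String, String[]> merge(Map<String, String[]> contentMeta,
                                       Map<String, String[]> parseMeta) {
        // Start from a copy of the content metadata so the inputs stay untouched.
        Map<String, String[]> merged = new LinkedHashMap<>(contentMeta);
        // putAll replaces any entry whose name also exists in the parse metadata.
        merged.putAll(parseMeta);
        return merged;
    }
}
```

Names that appear in only one of the two sets pass through unchanged; only 
names present in both are resolved in favour of the parse value.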

In all parser implementations:
* Remove the copying of content metadata to parse metadata.

From my point of view, the key benefits are:
  1. Provide a clear separation between content metadata and parse metadata.
  2. Metadata names are defined at the right places.
  3. Keeps the advantage of metadata name normalization and syntax correction.
  4. An easy mapping between content metadata names and parse metadata names 
(both can use the real name of the metadata, without adding an artificial 
X-Nutch prefix to parse metadata names)


Comments are welcome.

Jérôme

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G., I believe, proposed a 
> solution in which all property names are converted to lower case, but in 
> essence this really only fixes half the problem (the case of identifying 
> that "CONTENT_TYPE" and "conTeNT_TyPE" and all the permutations are really 
> the same). What if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the names of the standard 
> properties that they can obtain from the ParseData are, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing, it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
