[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Andrzej Bialecki (JIRA) Fri, 20 Jan 2006 04:12:17 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363394 ]


Andrzej Bialecki  commented on NUTCH-139:
-----------------------------------------

Yes, I agree with the split into a generic MetaData container, and subclasses 
that define necessary constants for metadata names.
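
To make the split concrete, here is a minimal sketch of what I understand by 
it. The class names MetaData, ContentProperties and ParseProperties come from 
this discussion; the method names (add, set, get, getValues) and the constant 
names are only assumptions for illustration, not the API of any attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Generic container: multi-valued string properties, no knowledge of key names.
// Method names here are hypothetical.
class MetaData {
    private final Map<String, List<String>> values = new HashMap<>();

    // Appends a value, keeping any values already stored under this name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Replaces all values stored under this name with a single value.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Returns the first value stored under this name, or null if there is none.
    String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    // Returns all values stored under this name (empty list if none).
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? Collections.<String>emptyList() : list;
    }
}

// Subclasses only contribute the standard names for their own level.
class ContentProperties extends MetaData {
    static final String CONTENT_TYPE = "content-type";
}

class ParseProperties extends MetaData {
    static final String DC_TITLE = "dc:title";
}
```

With this shape the container stays generic, and a caller writes 
props.get(ParseProperties.DC_TITLE) instead of a hand-typed string.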

However, your proposal still doesn't address the key issue of having a set of 
"approved" or "final" values for metadata.

Example: in index-basic you need to index the title. It's not in the 
ContentProperties (wrong level) but in ParseProperties. A content parser may 
discover that the original title is empty or invalid. Still, this original 
value should be stored under the standard key "dc:title". But the parser 
knows best what to do with an invalid value (it is the ultimate authority), 
and it knows that the rest of Nutch really needs a meaningful value for the 
title, so it constructs one from the first line of the body text. However, with 
your proposed approach the parser cannot put it under the same key in 
ParseProperties (dc:title), because either it overwrites the original value, or 
it turns it into a multi-valued property, as if the original metadata had 
multiple values.

That's why we insisted that the parser needs to use an "X-nutch-dc:title", so 
that it has a way to mark the final value that will remain in force for further 
processing. If you have a better way to achieve the same semantics, please 
explain.
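
A rough sketch of the semantics I mean, using a plain map rather than any real 
Nutch class. The two key strings are the ones discussed above; the class 
TitleKeys, the parse and firstLine helpers, and the "empty means invalid" rule 
are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: original value and final value live under separate keys.
class TitleKeys {
    static final String DC_TITLE = "dc:title";               // value as found in the document
    static final String FINAL_DC_TITLE = "X-nutch-dc:title"; // value the rest of Nutch should use

    // Assumes bodyText is non-null; a null or blank title is treated as invalid.
    static Map<String, List<String>> parse(String originalTitle, String bodyText) {
        Map<String, List<String>> meta = new HashMap<>();
        String original = originalTitle == null ? "" : originalTitle;

        // The original value is always recorded, even when it is empty.
        meta.computeIfAbsent(DC_TITLE, k -> new ArrayList<>()).add(original);

        // The parser decides on the final value and stores it under its own key,
        // so it neither overwrites the original nor looks like a second original value.
        String finalTitle = original.trim().isEmpty() ? firstLine(bodyText) : original.trim();
        meta.computeIfAbsent(FINAL_DC_TITLE, k -> new ArrayList<>()).add(finalTitle);
        return meta;
    }

    // Returns the first line of the trimmed body text.
    static String firstLine(String body) {
        String trimmed = body.trim();
        int newline = trimmed.indexOf('\n');
        return (newline < 0 ? trimmed : trimmed.substring(0, newline)).trim();
    }
}
```

An indexing plugin would then read only FINAL_DC_TITLE, while anything that 
needs to know what the document really contained reads DC_TITLE.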

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, so names such as "Content-type", "content-TyPe", and 
> "CONTENT_TYPE" can all have the same meaning. Stefan G., I believe, proposed 
> a solution in which all property names are converted to lower case, but this 
> really only fixes half the problem (recognizing that "CONTENT_TYPE" and 
> "conTeNT_TyPE" and all the case permutations are really the same). What if I 
> named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData {
>    .....
>    public static final String CONTENT_TYPE = "content-type";
>    public static final String CREATOR = "creator";
>    ....
>  }
> In this fashion, users could at least know what the names of the standard 
> properties that they can obtain from the ParseData are, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
