[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Chris A. Mattmann (JIRA) Thu, 26 Jan 2006 10:45:41 -0800

    [ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364116 ]


Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Just to add to Jerome's last comment, I think the key here is simplicity. As a 
software developer, and ultimately as an end user of Nutch, I identified the 
issue that there were several places where a developer has to remember the exact 
string used in a particular piece of code hidden under layers of OO 
abstraction, etc., just to get the value of a metadata property returned from 
the protocol layer. For example, did you know that in order to get the content 
encoding at the protocol level, you have to use the EXACT string 
"Content-Encoding", not "ContentEncoding", or "ContenT-ENcodING", etc., but 
"Content-Encoding"? There are numerous other examples at the protocol level, 
such as "Content-Length" and "Content-Type" (even though Doug by now I'm sure 
hates that example :-) ). The whole point is that if you go look at the 
protocol level plugins, they all share the fact that they are reading these 
properties, and in some cases writing them to a metadata map. The whole issue 
is, why, as a writer of a protocol layer plugin, should I have to worry about 
the exact format of the String to get the "Content-Encoding" from the protocol 
layer? Wouldn't it be nice to standardize public static final Strings and then 
reference them instead of replicating them at the protocol plugin levels?

So, instead of having within protocol-http/HttpResponse.java:

    String contentLengthString = (String)headers.get("Content-Length");

and then in protocol-file/FileResponse.java:

    hdrs.put("Content-Length", new Long(size).toString());

wouldn't it be nice to have a public static final String CONTENT_LENGTH = 
"Content-Length", and then replace the hard-coded strings in the protocol 
plugin code with it? So the above becomes:

protocol-http/HttpResponse.java:

    String contentLengthString = (String)headers.get(CONTENT_LENGTH);

protocol-file/FileResponse.java:

    hdrs.put(CONTENT_LENGTH, new Long(size).toString());
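
To make that concrete, here is a minimal, self-contained sketch. The class 
names ProtocolMetadataNames and ContentLengthDemo are hypothetical (they are 
not existing Nutch classes), and a plain Map stands in for the plugins' header 
maps. It only illustrates that the writer and the reader can share one 
constant instead of each retyping the literal:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for protocol-level names; NOT an existing Nutch class.
final class ProtocolMetadataNames {
    static final String CONTENT_ENCODING = "Content-Encoding";
    static final String CONTENT_LENGTH = "Content-Length";
    static final String CONTENT_TYPE = "Content-Type";

    private ProtocolMetadataNames() {
        // constants only, never instantiated
    }
}

// Hypothetical demo; a plain Map stands in for the plugins' header maps.
final class ContentLengthDemo {

    // Writer side, roughly what a file-protocol plugin would do.
    static Map<String, String> writeHeaders(long size) {
        Map<String, String> hdrs = new HashMap<>();
        hdrs.put(ProtocolMetadataNames.CONTENT_LENGTH, Long.toString(size));
        return hdrs;
    }

    // Reader side, roughly what an http-protocol plugin would do.
    // Returns -1 when the header is absent.
    static long readContentLength(Map<String, String> headers) {
        String value = headers.get(ProtocolMetadataNames.CONTENT_LENGTH);
        return value == null ? -1L : Long.parseLong(value);
    }

    public static void main(String[] args) {
        Map<String, String> hdrs = writeHeaders(1024L);
        System.out.println(readContentLength(hdrs));
    }
}
```

With this layout, a misspelled name such as CONTENT_LENGHT fails at compile 
time, whereas a misspelled literal such as "Content-Lenght" just returns null 
at run time.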

Of course, that's just one layer of the issue. As we've all identified, these 
so-called "magic" strings exist at the parsing layers too. For example, in the 
RTF parser, there are * 17 * of these so-called magic strings, ranging from 
"Security" to "Last-Save-Date" to "Last-Printed". Of course it would be naive 
to put every single metadata string that is written or read from a Map in the 
parsing and protocol layers of Nutch into a single monolithic metadata class, 
but in the end, there are several standard metadata properties (* cough cough 
Dublin Core *) that deserve such first class status, along with certain other 
commonly used metadata properties at each respective layer, protocol and 
parsing. I believe that the purpose of this patch should be not only to provide 
an extensible Metadata class; let's not forget the simple stuff too. And 
also, let's not turn this issue into 993939393 different things that need to be 
done. It should be phased into several capabilities, and the first phase would 
be providing a standard metadata names container at the protocol and parsing 
layers, which Jerome and I are working towards. I guess what I'm just trying to 
advocate is to not bury this issue by adding a million things to 
it and making it so difficult to complete that it never gets completed and 
accepted. Let's just keep it focused and simple, because in the end, as a user 
of Nutch, and as a software developer, I think it is very time-saving and 
helpful to have common Strings defined in one place, or a few places, rather 
than spread out across 20 or 30 classes, where you have to inspect each class 
to find out the exact way to read/write a String to make stuff work. That's all 
I'm saying.
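
As a rough sketch of what "a few places" could look like for the parsing 
layer: the class names DublinCoreNames and RtfParserSketch below are 
hypothetical, the lowercase key strings are simply the Dublin Core element 
names used as an assumed convention, and a plain Map again stands in for the 
metadata container. The common names get first-class status, while a 
format-specific name like "Last-Printed" stays with the parser that uses it:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: a few Dublin Core element names shared by all parsers.
// The lowercase key strings are an assumption made for this sketch.
final class DublinCoreNames {
    static final String CREATOR = "creator";
    static final String LANGUAGE = "language";
    static final String TITLE = "title";

    private DublinCoreNames() {
        // constants only, never instantiated
    }
}

// Hypothetical stand-in for a format-specific parser plugin.
final class RtfParserSketch {

    // Format-specific name: kept local instead of going into the shared class.
    static final String LAST_PRINTED = "Last-Printed";

    // Collects metadata under a mix of shared and parser-local names.
    static Map<String, String> parseMetadata(String author, String lastPrinted) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(DublinCoreNames.CREATOR, author);
        metadata.put(LAST_PRINTED, lastPrinted);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = parseMetadata("Jane Doe", "2006-01-26");
        // Any consumer can look up the author without knowing the parser.
        System.out.println(metadata.get(DublinCoreNames.CREATOR));
    }
}
```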

> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard 
> properties that they can obtain from the ParseData are, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"); Of course, this wouldn't preclude users from doing what they are 
> currently doing, it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next week, 
> that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
