[Nutch-dev] [jira] Commented: (NUTCH-192) meta data support for CrawlDatum

Andrzej Bialecki (JIRA) Wed, 01 Feb 2006 02:05:25 -0800

    [ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364782 ]


Andrzej Bialecki  commented on NUTCH-192:
-----------------------------------------

There is a very real hazard in the fact that we don't store the dictionary. 
Let's consider this example: two plugins invoke WritableName.setName() with 
different classes, ClassA and ClassB. We get the mapping ClassA -> 23, ClassB 
-> 24. The files written by these plugins use just the byte IDs, 23 and 24. Then 
someone changes the config file, and the plugins are initialized in reverse 
order, so consequently we get ClassB -> 23, ClassA -> 24. And now the plugins 
cannot read the files they created because of the wrong class returned from 
MapWritable ...
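
To make the hazard concrete, here is a minimal sketch of order-dependent ID assignment. IdOrderDemo, register() and resolve() are hypothetical stand-ins, not the actual WritableName or MapWritable code, and the IDs start at 0 rather than 23 purely for brevity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the registry discussed above; NOT the real
// WritableName class. It only models "IDs are handed out in init order".
class IdOrderDemo {

  /** Assigns sequential byte IDs in the order the class names are registered. */
  static Map<String, Byte> register(List<String> classNamesInInitOrder) {
    Map<String, Byte> classToId = new HashMap<>();
    for (String name : classNamesInInitOrder) {
      if (!classToId.containsKey(name)) {
        classToId.put(name, (byte) classToId.size());
      }
    }
    return classToId;
  }

  /** Reverse lookup: which class does a reader get back for a stored ID? */
  static String resolve(Map<String, Byte> classToId, byte id) {
    for (Map.Entry<String, Byte> e : classToId.entrySet()) {
      if (e.getValue() == id) {
        return e.getKey();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Run 1: plugins are initialized as A, B and the file stores ClassA's ID.
    Map<String, Byte> writerIds = register(Arrays.asList("ClassA", "ClassB"));
    byte storedId = writerIds.get("ClassA");

    // Run 2: the config changed, so plugins are initialized as B, A.
    Map<String, Byte> readerIds = register(Arrays.asList("ClassB", "ClassA"));

    // The same stored ID now resolves to a different class.
    System.out.println("written as ClassA, read back as "
        + resolve(readerIds, storedId));
  }
}
```

Nothing in the file itself reveals the mismatch, since only the bare ID is stored.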

So, I'm still convinced that we need to save the dictionary. Unfortunately, for 
small amounts of metadata (typical use case) it blows up the on-disk size of 
MapWritable, which is why I thought using Strings would be cheaper ...
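
As a rough illustration of what saving the dictionary costs, the sketch below writes an ID -> class-name table ahead of a tiny payload and reads it back. The layout, the class DictionaryHeaderDemo and the org.example class names are assumptions made for this example; this is not the on-disk format of MapWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical layout: [entry count][id, UTF class name]...[payload length][payload]
class DictionaryHeaderDemo {

  /** Serializes the dictionary followed by the payload that uses its IDs. */
  static byte[] writeWithDictionary(Map<Byte, String> idToName, byte[] payload) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(idToName.size());
      for (Map.Entry<Byte, String> e : idToName.entrySet()) {
        out.writeByte(e.getKey());
        out.writeUTF(e.getValue());
      }
      out.writeInt(payload.length);
      out.write(payload);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads back only the dictionary, so IDs resolve as the writer meant them. */
  static Map<Byte, String> readDictionary(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int entries = in.readUnsignedByte();
      Map<Byte, String> idToName = new LinkedHashMap<>();
      for (int i = 0; i < entries; i++) {
        byte id = in.readByte();
        idToName.put(id, in.readUTF());
      }
      return idToName;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Map<Byte, String> dictionary = new LinkedHashMap<>();
    dictionary.put((byte) 23, "org.example.ClassA"); // hypothetical class names
    dictionary.put((byte) 24, "org.example.ClassB");
    byte[] payload = {23, 24}; // stands in for a small amount of metadata

    byte[] data = writeWithDictionary(dictionary, payload);
    System.out.println("payload bytes: " + payload.length
        + ", bytes with dictionary: " + data.length);
    System.out.println("recovered dictionary: " + readDictionary(data));
  }
}
```

For a payload this small the class names dominate the output, which is the size blow-up described above; the benefit is that a reader no longer depends on plugin initialization order.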

Other things: In the javadoc for MapWritable it should be mentioned that any 
Writable type that one is going to use needs to be registered first with 
WritableName.setName(). Or perhaps the method could do it automatically, but 
then the IDs will be unpredictable, depending on the order of iteration (which 
leads to the problem described above).

Also, there is a bug in setName(): if you try adding the same mapping twice 
(which could happen in different places), the method should allocate just one 
ID for the class. As it is now, it will allocate a new ID each time you call the 
method, even if the class name is the same. Just add this:

   public static synchronized void setName(Class writableClass, String name) {
     Object o = CLASS_TO_NAME.put(writableClass, name);
     NAME_TO_CLASS.put(name, writableClass);
     if (o != null) return; // already has an ID
     CLASS_TO_ID.put(writableClass, new Byte((byte)CLASS_TO_ID.size()));
     ID_TO_CLASS.put(new Byte((byte)ID_TO_CLASS.size()), writableClass);
   }
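
A self-contained way to check the intended behavior is sketched below. IdempotentRegistryDemo is a hypothetical miniature registry that mirrors the guard above; the idCount() and idOf() helpers are added only for the check and are not part of WritableName.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature registry, NOT the real WritableName.
class IdempotentRegistryDemo {
  private static final Map<Class<?>, String> CLASS_TO_NAME = new HashMap<>();
  private static final Map<Class<?>, Byte> CLASS_TO_ID = new HashMap<>();
  private static final Map<Byte, Class<?>> ID_TO_CLASS = new HashMap<>();

  static synchronized void setName(Class<?> writableClass, String name) {
    Object previous = CLASS_TO_NAME.put(writableClass, name);
    if (previous != null) {
      return; // already has an ID, do not allocate another one
    }
    byte id = (byte) CLASS_TO_ID.size();
    CLASS_TO_ID.put(writableClass, id);
    ID_TO_CLASS.put(id, writableClass);
  }

  /** Number of IDs handed out so far (helper for this sketch only). */
  static synchronized int idCount() {
    return ID_TO_CLASS.size();
  }

  /** ID assigned to a class, or null if it was never registered. */
  static synchronized Byte idOf(Class<?> writableClass) {
    return CLASS_TO_ID.get(writableClass);
  }

  public static void main(String[] args) {
    setName(String.class, "string");
    int afterFirstCall = idCount();
    setName(String.class, "string"); // same mapping registered a second time
    System.out.println("ID count unchanged by repeat registration: "
        + (idCount() == afterFirstCall));
  }
}
```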

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to realize a set of new Nutch 
> features and would make a lot possible for smaller, specially focused search 
> engines.
