[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364923 ] 

Doug Cutting commented on NUTCH-192:
------------------------------------

I'm worried that this will slow things down substantially.

I'd like to see some effort made to ensure that:

1. If no metadata is used, then no MapWritables should be allocated.

2. If readFields() is called repeatedly on a single CrawlDatum instance, as few 
new objects should be allocated as possible.  If MapWritable were to extend 
HashMap rather than wrap it, and MapWritable.readFields() first called clear(), 
then the HashMap's entry table could be reused.  Better yet would be to try to 
reuse the entries in the table.  If an entry exists with the same classes, then 
it and its key and value instances could be reused.  This optimization would 
require the use of a more extensible HashMap, perhaps like that in Jakarta 
Commons Collections.  Alternatively, one could use a linked list instead of a 
HashMap, which should be plenty fast for things this size.

If an entry were defined as:

class Entry {
  Writable key;
  Writable value;
  Entry next;
}

Then MapWritable could have fields:
  Entry first;
  Entry last;
  Entry old;

clear() would set old=first and first=last=null.
allocateEntry(Class keyClass, Class valueClass) would scan old, splicing out 
and returning the first entry whose classes match these.  If none is found then 
a new entry would be allocated.
readFields() would call clear(), then for each entry identify its key and 
value classes, call allocateEntry(), then call entry.key.readFields() and 
entry.value.readFields(), and finally append the entry by setting 
last.next=entry and last=entry (or first=last=entry when the list is empty).
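
To make the scheme above concrete, here is a rough sketch of how it might look.
It is not the code from any of the attached patches.  Assumptions: Writable is
a local stand-in for the real interface, TextWritable, LinkedMapWritable,
ReuseDemo, put() and the allocations counter are hypothetical names added for
illustration, and the wire format (an entry count, then class names written as
UTF strings before each key/value pair) is invented for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the real Writable interface so the sketch compiles on its own.
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical key/value type, used only for the demo.
class TextWritable implements Writable {
  String text = "";
  public TextWritable() {}
  public TextWritable(String text) { this.text = text; }
  public void write(DataOutput out) throws IOException { out.writeUTF(text); }
  public void readFields(DataInput in) throws IOException { text = in.readUTF(); }
}

class LinkedMapWritable implements Writable {
  static class Entry { Writable key; Writable value; Entry next; }

  Entry first, last, old;
  int size;
  int allocations;  // entries created by allocateEntry(); demo instrumentation

  // Park the current entries on the old list so the next read can reuse them.
  void clear() { old = first; first = last = null; size = 0; }

  void put(Writable key, Writable value) {
    Entry e = new Entry();
    e.key = key;
    e.value = value;
    append(e);
  }

  private void append(Entry e) {
    if (last == null) first = e; else last.next = e;  // empty list: e is first
    last = e;
    size++;
  }

  // Splice out and return the first old entry whose classes match, else allocate.
  Entry allocateEntry(Class<?> keyClass, Class<?> valueClass) throws IOException {
    Entry prev = null;
    for (Entry e = old; e != null; prev = e, e = e.next) {
      if (e.key.getClass() == keyClass && e.value.getClass() == valueClass) {
        if (prev == null) old = e.next; else prev.next = e.next;
        e.next = null;
        return e;
      }
    }
    try {
      Entry e = new Entry();
      e.key = (Writable) keyClass.getDeclaredConstructor().newInstance();
      e.value = (Writable) valueClass.getDeclaredConstructor().newInstance();
      allocations++;
      return e;
    } catch (ReflectiveOperationException ex) {
      throw new IOException(ex);
    }
  }

  // Writes straight to out, with no intermediate buffer.
  public void write(DataOutput out) throws IOException {
    out.writeInt(size);
    for (Entry e = first; e != null; e = e.next) {
      out.writeUTF(e.key.getClass().getName());  // assumed wire format
      out.writeUTF(e.value.getClass().getName());
      e.key.write(out);
      e.value.write(out);
    }
  }

  public void readFields(DataInput in) throws IOException {
    clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      try {
        Entry e = allocateEntry(Class.forName(in.readUTF()), Class.forName(in.readUTF()));
        e.key.readFields(in);
        e.value.readFields(in);
        append(e);
      } catch (ClassNotFoundException ex) {
        throw new IOException(ex);
      }
    }
  }
}

class ReuseDemo {
  // Serializes a two-entry map, then reads it 'passes' times into one instance.
  static LinkedMapWritable readTimes(int passes) {
    try {
      LinkedMapWritable src = new LinkedMapWritable();
      src.put(new TextWritable("lang"), new TextWritable("en"));
      src.put(new TextWritable("score"), new TextWritable("0.5"));
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      src.write(new DataOutputStream(bytes));
      LinkedMapWritable dst = new LinkedMapWritable();
      for (int i = 0; i < passes; i++) {
        dst.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      }
      return dst;
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }
}
```

In this sketch, ReuseDemo.readTimes(3) reads the same two-entry map three times
into one instance, and the allocations counter stays at 2: the second and
third passes find both entries on the old list and reuse them along with their
key and value objects.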

Also, why does MapWritable.write() create a DataOutputBuffer?  It should just 
write to out.




> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting metadata in CrawlDatum would help realize a set of new Nutch 
> features and would make a lot possible for smaller, specialized search 
> engines.



