[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]
Stefan Groschupf updated NUTCH-192:
-----------------------------------
Attachment: metadata060206.patch
Doug, did you mean something like this?
Writing 1 million maps (each holding one tuple [int key, long value]) into a
sequence file that uses an int key takes around 5400 ms on my box.
Writing 1 million int keys with UTF8 values into a sequence file took pretty
much the same time.
However, reading UTF8 requires only 60% of the time I need to read the map.
This may be because UTF8 just reads a byte array and converts it to a string
only when toString is called. If I call toString in my test, then reading UTF8
is slower than reading the map.
So another possible improvement could be to read just a byte array into the map
and parse this byte array only when the first get method is called.
This can save some time when processing CrawlDatum in situations where we do
not need to access the meta data at all.
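To illustrate the lazy parsing idea, here is a minimal sketch in plain Java.
It is not the code from the attached patch: the class name LazyMetaData, its
methods, and the byte layout (entry count, then int key / long value pairs) are
all assumptions made up for this example, and it uses only java.io instead of
the Nutch io classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: meta data that stays a raw byte array until first access. */
class LazyMetaData {
    private final byte[] raw;           // serialized form, stored without decoding
    private Map<Integer, Long> parsed;  // stays null until the first get() call

    /** "Reading" only stores the bytes; no entries are decoded here. */
    LazyMetaData(byte[] raw) {
        this.raw = raw;
    }

    /** Assumed layout: entry count, then (int key, long value) pairs. */
    static byte[] serialize(Map<Integer, Long> entries) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(entries.size());
            for (Map.Entry<Integer, Long> entry : entries.entrySet()) {
                out.writeInt(entry.getKey());
                out.writeLong(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** True once the byte array has been decoded into the map. */
    boolean isParsed() {
        return parsed != null;
    }

    /** Decodes the byte array on the first call, then answers from the map. */
    Long get(int key) {
        if (parsed == null) {
            parse();
        }
        return parsed.get(key);
    }

    private void parse() {
        Map<Integer, Long> result = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int key = in.readInt();
                long value = in.readLong();
                result.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        parsed = result;
    }
}
```

A datum that is read and written back without anyone calling get never pays the
decoding cost in this sketch; the first get call pays it once for all entries.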
However, reading and writing 10 million maps with one key-value tuple each can
be done in less than a minute on my desktop box.
> meta data support for CrawlDatum
> --------------------------------
>
> Key: NUTCH-192
> URL: http://issues.apache.org/jira/browse/NUTCH-192
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Fix For: 0.8-dev
> Attachments: metadata010206.patch, metadata060206.patch,
> metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new Nutch
> features realized and would make a lot possible for smaller, specially
> focused search engines.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers