Developer Developer wrote:
I am sorry I did not realize that I was so vague :).
Here are the details.
I am in the process of evaluating Nutch to determine its suitability for one
of my projects. In particular, I need to find out the crawl speed of a single
running instance of Nutch (i.e. NO CLUSTER using MapReduce).
This is, for the most part, dependent on the urls that you are fetching :)
I have downloaded, installed, and configured Nutch 0.9 to crawl and index
www.eclipse.org. I am using the default Nutch utility to crawl and index, i.e.
the following command:
sh bin/nutch crawl urls -dir crawl_test -depth 8 -topN 1000
Now here is my question.
How do I time the start and end of the Nutch crawl and index process? I tried to
alter the Nutch script to print the time, but it only executes the time command
at the beginning of the script. The time command at the end of the script
somehow is not executed. So, what is the best approach to measure the time
for crawl and index?
In the Hadoop admin screen for the job tracker, usually at master:50030,
when you click on the jobdetails.jsp link it will show you a total time.
If you want to get a total time for just fetching (and not processing
the content), use the -noParsing option for the fetcher.
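
For a single local instance, timing the command from outside the Nutch script may be simpler. A minimal sketch follows, assuming a POSIX shell and that it is run from the Nutch install directory. The helper name `timed` is hypothetical and not part of Nutch. One possible reason the trailing time command never ran is that bin/nutch hands control to Java with `exec`, so nothing after that line executes; that is a guess about the script, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): run a command and report
# its wall-clock duration in whole seconds.
timed() {
    label=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?          # keep going even if the command fails
    end=$(date +%s)
    echo "$label: $((end - start))s (exit $status)"
    return "$status"
}

# Only attempt the crawl when a Nutch launcher is present in this directory.
# Arguments are the ones from the question above.
if [ -x bin/nutch ]; then
    timed crawl bin/nutch crawl urls -dir crawl_test -depth 8 -topN 1000
fi
```

The same wrapper can be put in front of the individual steps (inject, generate, fetch, updatedb) if they are run one at a time, which gives one timing line per step.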
Further, I suspect that logging to the console takes a lot of time, so is there a
way to turn off logging for testing purposes?
This is handled through the log4j.properties file in the conf directory.
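
As a rough sketch of one way to quiet the logging, the filter below raises every logger set to INFO to WARN. It assumes the file uses log4j 1.x lines of the form `name=INFO,appender`; the sample line is illustrative and the actual conf/log4j.properties may look different. The function name `quiet_log4j` is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: rewrite "=INFO,<appender>" settings to "=WARN,<appender>".
# Reads a log4j 1.x properties file on stdin and writes the result to stdout.
quiet_log4j() {
    sed 's/=INFO,/=WARN,/'
}

# Illustrative input line (assumed, not copied from a real Nutch install).
printf '%s\n' 'log4j.rootLogger=INFO,DRFA' | quiet_log4j
```

To apply it, keep a backup and filter the backup into place, for example `cp conf/log4j.properties conf/log4j.properties.bak` followed by `quiet_log4j < conf/log4j.properties.bak > conf/log4j.properties`.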
Thanks !
On Jan 25, 2008 12:23 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:
Well, I suspect you'll get the usual response to a question this open:
"It depends". You have to provide a lot more details before any
meaningful response can be made.
Imagine yourself on the receiving end of a question this vague
in an area of your own expertise. Could you answer it?
Best
Erick
On Jan 25, 2008 12:10 PM, Developer Developer <[EMAIL PROTECTED]>
wrote:
Please provide any comments on this one.
Thanks !
On Jan 23, 2008 9:57 AM, Developer Developer <[EMAIL PROTECTED]>
wrote:
Folks,
I want to record the performance numbers of Nutch crawl and index.
Can you please let me know what the best way to do it is? How do I obtain
performance numbers for inject, generate, fetch and updatedb?
Thanks !