result tuning

2005-08-16 Thread webmaster
Where would I change it so that every search returns only 10 results instead
of 100, so it won't cache 10 pages of sub-results? I take it that it is not
the io.sort.factor option!
-Jay


128-bit and 64-bit MD5 Hash Value

2005-08-16 Thread Michael Ji
hi there,

1.

I dumped the WebDB to a text file and took a look at it
myself.

In LinkByURL there are fields with sample values like:

FROM_ID = "2093bd0edd595fe47d6ea0a7b1858e3"

while

DOMAIN_ID = "-7601366135611285483"

If I interpret Link.java correctly, FROM_ID is a
128-bit MD5 hash and DOMAIN_ID is a 64-bit MD5 hash.

Am I correct? Why is the 64-bit MD5 all digits?

2.
Is the MD5 hash calculated by Java code, or generated by
hardware as the fetched data stream comes in?
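
(A guess, not stated in the thread: the 64-bit value is probably the digest
folded into a Java long and printed in signed decimal, which is why it is all
digits, and the hashing is done in plain Java. The following is a minimal
sketch of that idea, not Nutch's actual Link.java code.)

// Sketch only -- illustrates why a "64-bit MD5" prints as all digits:
// packing 8 bytes of the 128-bit digest into a Java long and printing it
// in decimal gives values like -7601366135611285483. Hashing is plain Java.
import java.security.MessageDigest;

public class HashSketch {
  public static void main(String[] args) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest("http://example.com/".getBytes("UTF-8"));

    // 128-bit form: the full digest rendered as 32 hex characters.
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));

    // 64-bit form: fold the first 8 bytes into one long.
    long folded = 0;
    for (int i = 0; i < 8; i++) folded = (folded << 8) | (digest[i] & 0xff);

    System.out.println("128-bit: " + hex);
    System.out.println(" 64-bit: " + folded);
  }
}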

thanks,

Michael Ji






RE: Slow Results

2005-08-16 Thread Paul Harrison
Doug,

I appreciate the feedback.  We have altered the Nutch implementation to add
a field for later retrieval.  The field holds information we use to sort
the data differently than Nutch does out of the box.  Is it possible to
do the following:

1.  We would need to alter the code to grab only the document ids of the top
n documents (500, 1000, 10,000). We could do this by using NutchBean, or by
modifying OpenSearchServlet to generate summaries conditionally.

2.  We would then look at the field we added to the data associated with
those document ids and sort the n documents using our own sorting
mechanism.

3.  We would then generate summaries for the first 10 documents based on
the newly sorted list of document ids.
 
4.  We would then display those 10 results with the summaries.
 
5.  When a user clicks through to the next 10 results, we would already have
the next 10 ids stored somewhere and could generate their summaries without
having to look everything up again.  (A rough sketch of this flow follows
below.)
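
Roughly, as a sketch of steps 1-5 (assuming a 0.7-era NutchBean API with
search(Query, n), getDetails(Hit[]), and getSummary(HitDetails, Query); the
"sortkey" field name and the class below are made up for illustration):

// Sketch only; API names approximate a 0.7-era NutchBean, and "sortkey"
// is a hypothetical custom field stored in the index.
import java.util.Arrays;
import java.util.Comparator;

import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;

public class CustomSortSearch {
  public static void main(String[] args) throws Exception {
    NutchBean bean = new NutchBean();
    Query query = Query.parse(args[0]);

    // 1. Grab the top n matches without generating any summaries.
    Hits hits = bean.search(query, 1000);
    Hit[] top = new Hit[hits.getLength()];
    for (int i = 0; i < top.length; i++) top[i] = hits.getHit(i);

    // 2. Pull the stored details (which carry our added field) and re-sort.
    final HitDetails[] details = bean.getDetails(top);
    Integer[] order = new Integer[top.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return details[a].getValue("sortkey")          // hypothetical field
            .compareTo(details[b].getValue("sortkey"));
      }
    });

    // 3 & 4. Generate summaries only for the first 10 of the re-sorted list.
    for (int i = 0; i < Math.min(10, order.length); i++) {
      HitDetails d = details[order[i]];
      System.out.println(d.getValue("url") + "\n  " + bean.getSummary(d, query));
    }
    // 5. The remaining ids in `order` could be cached (e.g. in the session)
    //    so the next page of 10 needs only getSummary calls.
  }
}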

What do you think?

Thanks,

Paul

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 16, 2005 3:27 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Slow Results

What API are you using to get hits, NutchBean or OpenSearchServlet?  If 
you're using OpenSearchServlet, then, with 1000 hits, most of your time 
is probably spent constructing summaries.  Do you need the summaries? 
If not, use NutchBean instead, or modify OpenSearchServlet to not 
generate summaries.  If you only need unique document ids, then perhaps 
you can only fetch the Hit instance for each match.  That would be 
fastest.  If you need titles, urls, etc., then you need HitDetails, 
which are slower to access.  Slowest is summaries.

Doug

Paul Harrison wrote:
> I have crawled some 100 million pages and am running this on five P4 3.0 GHz
> machines with a 40 GB OS drive and two 250 GB data drives.  I am trying to
> get Nutch to grab 1000 results so I can pass them to a separate program I
> have, instead of using the Nutch default (100, I think).  As a result it
> takes an enormous amount of time to get results.  So I backed the number of
> pages indexed down to 7 million, while still having Nutch grab 1000 results
> instead of the default.  While the results were better, they are still
> unusable, as it is taking between 15 and 20 seconds to complete the task.
> Does anyone have any idea why Nutch slows down so badly when you have it
> grab 1000 pages instead of the default number?  Does anyone have any
> suggestions on how to speed this process up?  Do I use more machines,
> upgrade to a newer version of Nutch, etc.?
> 
> Any help would be MOST appreciated.
> 
> Thanks,
> 
> Paul



Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Doug Cutting

Jeremy Bensley wrote:

After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location.


The temporary storage between map and reduce is actually not stored in 
NDFS, but on the nodes' local disks.  But the input (the url file in this 
case) must be shared.



So, my process for running crawl is now:
1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS 
3. Set up / start job and task trackers

4. run crawl with arguments referencing the NDFS positions of my
inputs and outputs


That looks right to me.

We really need a mapred & ndfs-based tutorial...


The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.


The config files are a bit confusing.  mapred-default.xml is for stuff 
that may be reasonably overridden by applications, while nutch-site.xml 
is for stuff that should not be overridden by applications.  So the name 
of the shared filesystem and of the job tracker should be in 
nutch-site.xml, since they should not be overridden.  But, e.g., the 
default number of map and reduce tasks should be in mapred-default.xml, 
since applications do sometimes change these.


The "local" job tracker should only be used in standalone 
configurations, when everything runs in the same process.  It doesn't 
make sense to start a task tracker process configured with a "local" job 
tracker.  If you want to run them on the same host then you might 
configure "localhost:" as the job tracker.


Doug


Re: page ranking weights

2005-08-16 Thread Ken Krugler

Also, how does it keep track of incoming links globally for these pages? If
the weight is determined by the number of incoming links, then it would have
to be tracked somewhere, so that when you split your indexes the distributed
search can still have an accurate value.


The WebDB keeps track of this info. It's not in the segments/indexes.


 > At which step does nutch figure out the weight of each page, the updatedb
 > step or the index step?


The updatedb step.

In UpdateDatabaseTool.java's PageContentChanged() method, first all 
of the outlink URLs are harvested from the fetched page. Then a score 
is calculated for each of the pages referenced by these outlink URLs, 
based on the score of the fetched page, multiplied by either the 
internal or external link weight (from Nutch config XML data, both 
1.0 by default), depending on whether the URL is in the same domain 
as the fetched page.
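
In code terms, something like the following (a paraphrase of that logic, not
the actual UpdateDatabaseTool source; the names are illustrative):

// Paraphrase only -- not the actual UpdateDatabaseTool.java code.
// Each outlink target inherits the fetched page's score, scaled by the
// configured internal or external link factor (both 1.0 by default), so
// with defaults and no link analysis every page keeps a score of 1.0.
static float outlinkScore(float fetchedPageScore, boolean sameDomain,
                          float internalFactor, float externalFactor) {
  return fetchedPageScore * (sameDomain ? internalFactor : externalFactor);
}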


When you inject URLs, there is no referring page, so it arbitrarily 
uses the db.score.injected value (1.0 by default).


So if you leave everything set to default values, and don't perform 
link analysis, I think every page will wind up with a score of 1.0.


-- Ken
--
Ken Krugler
TransPac Software, Inc.

+1 530-470-9200


Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Jeremy Bensley
After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location. So, my process for running crawl is now:

1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS 
3. Set up / start job and task trackers
4. run crawl with arguments referencing the NDFS positions of my
inputs and outputs

Following these steps I was able to get it to work as expected.


The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.

I appreciate the help, and look forward to experimenting with the software.

Jeremy


On 8/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jeremy Bensley wrote:
> > First, I have observed the same behavior as a previous poster from
> > yesterday who, instead of specifying a file for the URLs to be read
> > from, must now specify a directory (full path) to which a file
> > containing the URL list is stored. From the response to that thread I
> > am gathering that it isn't desired behavior to specify a directory
> > instead of a file.
> 
> A directory is required.  For consistency, all inputs and outputs are
> now directories of files rather than individual files.
> 
> > Second, and more importantly, I am having issues with task trackers. I
> > have three machines running task tracker, and a fourth running the job
> > tracker, and they seem to be talking well. Whenever I try to invoke
> > crawl using the job tracker, however, all of my task trackers
> > continually fail with this:
> >
> > 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> > [Fatal Error] :-1:-1: Premature end of file.
> > 050816 134532 SEVERE error parsing conf file:
> > org.xml.sax.SAXParseException: Premature end of file.
> > java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> > end of file.
> > at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> > at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> > at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> > at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> > at 
> > org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> > at 
> > org.apache.nutch.mapred.TaskTracker$TaskInProgress.(TaskTracker.java:319)
> > at 
> > org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> > at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> > at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> > Caused by: org.xml.sax.SAXParseException: Premature end of file.
> > at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> > at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> > at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> > ... 8 more
> >
> > Whenever I look at the job.xml file specified by this location, it
> > turns out that it is a directory, not a file.
> >
> > drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml
> 
> I have not seen this before.  If you remove everything in /tmp/nutch, is
> this reproducible?  Are you using NDFS?  If not, how are you sharing
> files between task trackers?  Is this on Win32, Linux or what?  Are you
> running the latest mapred code?  If your troubles continue, please post
> your nutch-site.xml and mapred-default.xml.
> 
> Doug
>


Re: Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski

Hi,
Just for information:
The only change I plan to make is to change the tar task.
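The XML itself did not survive the list archive; a minimal sketch of one way
to keep the bin/* scripts executable, using a tarfileset with mode="755"
(the property names below are placeholders, not necessarily those in Nutch's
build.xml):

<tar destfile="${dist.dir}/${final.name}.tar.gz"
     compression="gzip" longfile="gnu">
  <!-- bin scripts with the executable bit preserved -->
  <tarfileset dir="${dist.dir}" prefix="${final.name}" mode="755">
    <include name="bin/*"/>
  </tarfileset>
  <!-- everything else with default permissions -->
  <tarfileset dir="${dist.dir}" prefix="${final.name}">
    <include name="**"/>
    <exclude name="bin/*"/>
  </tarfileset>
</tar>
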
I will commit it tomorrow and test.
Regards
Piotr

Doug Cutting wrote:

Piotr Kosiorowski wrote:


After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch - is not executable (and nutch-daemon.sh too).




It is strange nobody reported it so far so it may still be my fault.



No, it looks like a problem with ant's tar task, which erases executable 
bits.  In prior releases I think Nutch used to directly exec 'tar czf' 
since ant's tar task didn't support compression.  Since it added 
compression we started using the ant task...


But if not - should we make a release with bin/* scripts not 
executable or change the build process?



I think we should fix this before we release.

Good job catching it.

Doug





Re: Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski

So I will move the release to tomorrow, as I am a bit sleepy now.
Regards
Piotr
Doug Cutting wrote:

Piotr Kosiorowski wrote:


After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch - is not executable (and nutch-daemon.sh too).




It is strange nobody reported it so far so it may still be my fault.



No, it looks like a problem with ant's tar task, which erases executable 
bits.  In prior releases I think Nutch used to directly exec 'tar czf' 
since ant's tar task didn't support compression.  Since it added 
compression we started using the ant task...


But if not - should we make a release with bin/* scripts not 
executable or change the build process?



I think we should fix this before we release.

Good job catching it.

Doug





Re: Release 0.7 problem

2005-08-16 Thread Doug Cutting

Piotr Kosiorowski wrote:

After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch - is not executable (and nutch-daemon.sh too).



It is strange nobody reported it so far so it may still be my fault.


No, it looks like a problem with ant's tar task, which erases executable 
bits.  In prior releases I think Nutch used to directly exec 'tar czf' 
since ant's tar task didn't support compression.  Since it added 
compression we started using the ant task...


But if not - should we make a release with bin/* scripts not executable 
or change the build process?


I think we should fix this before we release.

Good job catching it.

Doug


Release 0.7 problem

2005-08-16 Thread Piotr Kosiorowski

Hello,
I have a problem related to 0.7 release.
After making a tar I was trying to go through crawl tutorial.
 - tar xvfz nutch-0.7.tar.gz
 bin/nutch - is not executable (and nutch-daemon.sh too).
I thought it was my mistake - I started to do it on Windows so I moved 
to Linux, but the problem persisted.
I downloaded the latest nightly build (nutch-2005-08-16.tar.gz) and it is 
still the same.


I am not using the standard nutch script (or build.xml) for my local 
installation at work, so I had a look and noticed that in my build.xml I 
have additional elements inside the tar element.

It is strange nobody reported it so far so it may still be my fault.
But if not - should we make a release with bin/* scripts not executable 
or change the build process?


I would go for the change, but then I will do the release tomorrow, as I 
would like to test it.

Comments?

Regards
Piotr



Re: Slow Results

2005-08-16 Thread Doug Cutting
What API are you using to get hits, NutchBean or OpenSearchServlet?  If 
you're using OpenSearchServlet, then, with 1000 hits, most of your time 
is probably spent constructing summaries.  Do you need the summaries? 
If not, use NutchBean instead, or modify OpenSearchServlet to not 
generate summaries.  If you only need unique document ids, then perhaps 
you can only fetch the Hit instance for each match.  That would be 
fastest.  If you need titles, urls, etc., then you need HitDetails, 
which are slower to access.  Slowest is summaries.
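
A compact sketch of those three access levels, cheapest first (method names
approximate a 0.7-era NutchBean and assume the org.apache.nutch.searcher
classes):

// Sketch only; API names are approximate.
static void showCostLevels(NutchBean bean, Query query) throws IOException {
  Hits hits = bean.search(query, 1000);               // ranked matches
  Hit hit = hits.getHit(0);                           // cheapest: just the doc id
  HitDetails details = bean.getDetails(hit);          // slower: title, url, fields
  String summary = bean.getSummary(details, query);   // slowest: summary text
  System.out.println(details.getValue("url") + ": " + summary);
}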


Doug

Paul Harrison wrote:

I have crawled some 100 million pages and am running this on five P4 3.0 GHz
machines with a 40 GB OS drive and two 250 GB data drives.  I am trying to
get Nutch to grab 1000 results so I can pass them to a separate program I
have, instead of using the Nutch default (100, I think).  As a result it takes
an enormous amount of time to get results.  So I backed the number of pages
indexed down to 7 million, while still having Nutch grab 1000 results instead
of the default.  While the results were better, they are still unusable, as it
is taking between 15 and 20 seconds to complete the task.  Does anyone have
any idea why Nutch slows down so badly when you have it grab 1000 pages
instead of the default number?  Does anyone have any suggestions on how to
speed this process up?  Do I use more machines, upgrade to a newer version of
Nutch, etc.?

 


Any help would be MOST appreciated.

 


Thanks,

 


Paul




Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Doug Cutting

Jeremy Bensley wrote:

First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) to which a file
containing the URL list is stored. From the response to that thread I
am gathering that it isn't desired behavior to specify a directory
instead of a file.


A directory is required.  For consistency, all inputs and outputs are 
now directories of files rather than individual files.



Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.(TaskTracker.java:319)
at 
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more

Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml


I have not seen this before.  If you remove everything in /tmp/nutch, is 
this reproducible?  Are you using NDFS?  If not, how are you sharing 
files between task trackers?  Is this on Win32, Linux or what?  Are you 
running the latest mapred code?  If your troubles continue, please post 
your nutch-site.xml and mapred-default.xml.


Doug


(mapred branch) Job.xml as a directory instead of a file, other issues.

2005-08-16 Thread Jeremy Bensley
I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.

First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) to which a file
containing the URL list is stored. From the response to that thread I
am gathering that it isn't desired behavior to specify a directory
instead of a file.

Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at 
org.apache.nutch.mapred.TaskTracker$TaskInProgress.(TaskTracker.java:319)
at 
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more

Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml


Any help / observation of these issues is most appreciated.

Thanks,

Jeremy


Difference Between 0.6 and 0.7

2005-08-16 Thread Paul Harrison
I have a March 9 release (0.6) of Nutch that I crawled a significant number
of pages with.  Can I install and use the 0.7 release against those crawled
pages or will I have to recrawl all my pages?

 

Thanks,

 

Paul



Slow Results

2005-08-16 Thread Paul Harrison
I have crawled some 100 million pages and am running this on five P4 3.0 GHz
machines with a 40 GB OS drive and two 250 GB data drives.  I am trying to
get Nutch to grab 1000 results so I can pass them to a separate program I
have, instead of using the Nutch default (100, I think).  As a result it takes
an enormous amount of time to get results.  So I backed the number of pages
indexed down to 7 million, while still having Nutch grab 1000 results instead
of the default.  While the results were better, they are still unusable, as it
is taking between 15 and 20 seconds to complete the task.  Does anyone have
any idea why Nutch slows down so badly when you have it grab 1000 pages
instead of the default number?  Does anyone have any suggestions on how to
speed this process up?  Do I use more machines, upgrade to a newer version of
Nutch, etc.?

 

Any help would be MOST appreciated.

 

Thanks,

 

Paul




Re: Release 0.7

2005-08-16 Thread Doug Cutting

Piotr Kosiorowski wrote:

Is anyone working on preparing the release?


I am not.


If not, I can spend some time on it in an hour or so.


+1

Thanks,

Doug


Release 0.7

2005-08-16 Thread Piotr Kosiorowski

Hello Nutch Committers,
Is anyone working on preparing the release?
If not, I can spend some time on it in an hour or so.
Regards
Piotr