update crawldb

2006-04-24 Thread Anton Potehin
How do we update information about links already added to the db? In
particular, we need to update the status of some of the links. Which
classes should we use to read the info about each link stored in the db
and then update its status? We use the trunk branch of Nutch.
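As a hedged pointer (our reading of the trunk code, not a confirmed answer from the list): the CrawlDb is a set of Hadoop map files of url/CrawlDatum pairs, and the trunk ships a readdb tool (backed by the CrawlDbReader class) for reading entries and statuses from the command line. A sketch, with placeholder paths:

```shell
# Hedged sketch: inspecting the CrawlDb with the readdb tool
# (crawl/crawldb and dumpdir are placeholder paths).
bin/nutch readdb crawl/crawldb -stats                    # counts per status
bin/nutch readdb crawl/crawldb -dump dumpdir             # dump entries as text
bin/nutch readdb crawl/crawldb -url http://example.com/  # one entry
```

Updating statuses is normally done by running a map-reduce job over the whole db, the way updatedb does, rather than by random-access writes to individual entries.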

 



mapred.map.tasks

2006-04-20 Thread Anton Potehin
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

 

We have a question about this property. Is it really preferable to set
this parameter to a prime several times greater than the number of
available hosts? We do not understand why that should be so.

Our spider is distributed across 3 machines. What value is best for this
parameter in our case? What other factors may affect the best value for
it?
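For what it's worth, a hedged starting point for a 3-machine cluster (an assumption, not a recommendation from the list): the usual rationale for "several times greater than the number of hosts" is load balancing - with more, smaller tasks, a fast machine simply takes on more of them, and a slow machine holds up only its last small task rather than a whole third of the job. With 3 tasktrackers, a small prime such as 7 or 11 gives each host several map tasks:

```xml
<!-- hedged example for a 3-node cluster; 11 is an assumed value -->
<property>
  <name>mapred.map.tasks</name>
  <value>11</value>
</property>
```

Other factors that plausibly matter: how many tasks each tasktracker runs concurrently, input size per task, and how uneven the per-task work is.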

 



question about crawldb

2006-04-18 Thread Anton Potehin
1.  We have found these flags in the CrawlDatum class: 

  public static final byte STATUS_SIGNATURE = 0;

  public static final byte STATUS_DB_UNFETCHED = 1;

  public static final byte STATUS_DB_FETCHED = 2;

  public static final byte STATUS_DB_GONE = 3;

  public static final byte STATUS_LINKED = 4;

  public static final byte STATUS_FETCH_SUCCESS = 5;

  public static final byte STATUS_FETCH_RETRY = 6;

  public static final byte STATUS_FETCH_GONE = 7;

Though the names of these flags suggest their purpose, it is not
completely clear what they mean, or what the difference is between, for
example, STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS.
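As a hedged illustration of the distinction (our reading of the code, not an authoritative statement): the STATUS_FETCH_* values are produced by the fetcher and live in segment data, while the STATUS_DB_* values are what ends up in the CrawlDb after updatedb folds the fetch results back in. A self-contained sketch of that folding, with an assumed mapping that should be verified against the CrawlDb update code:

```shell
#!/bin/sh
# Hedged sketch: an assumed mapping from fetch statuses (set by the
# fetcher, stored in segment data) to db statuses (stored in the
# CrawlDb after updatedb) -- verify against the CrawlDb update code.
map_fetch_status() {
  case "$1" in
    STATUS_FETCH_SUCCESS) echo STATUS_DB_FETCHED ;;    # fetched OK
    STATUS_FETCH_GONE)    echo STATUS_DB_GONE ;;       # permanent failure
    STATUS_FETCH_RETRY)   echo STATUS_DB_UNFETCHED ;;  # transient; try again
    *)                    echo "$1" ;;                 # db statuses pass through
  esac
}

map_fetch_status STATUS_FETCH_SUCCESS   # prints STATUS_DB_FETCHED
```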

 

 

2.  Where are new links added into the CrawlDb? 

 



mapred branch

2006-04-10 Thread Anton Potehin
Where is the mapred branch of Nutch now located?



image search

2006-04-10 Thread Anton Potehin
Has anybody tried to create an image search based on Nutch?



mapred crawl

2005-11-23 Thread Anton Potehin
We used Nutch for whole-web crawling.

In an infinite loop we run these tasks:

1) bin/nutch generate <db> <segments_dir> -topN 1

2) bin/nutch fetch <segment>

3) bin/nutch updatedb <db> <segment>

4) bin/nutch analyze <db> <segment>

5) bin/nutch index <segment>

6) bin/nutch dedup <segments_dir> dedup.tmp

 

After each iteration we produce a new segment and can use it for searching.
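The six steps above can be wrapped in a shell loop; a hedged sketch for the commands quoted (db and segment paths are placeholders, and the newest-segment lookup is an assumption about the directory layout):

```shell
#!/bin/sh
# Hedged sketch of one crawl-and-index iteration (pre-mapred commands).
db=db
segments=segments
while true; do
  bin/nutch generate $db $segments -topN 1
  s=`ls -d $segments/* | tail -1`   # assume generate created the newest dir
  bin/nutch fetch $s
  bin/nutch updatedb $db $s
  bin/nutch analyze $db $s
  bin/nutch index $s
  bin/nutch dedup $segments dedup.tmp
  # $s is now searchable
done
```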

 

Now we are trying mapred. How can we use crawl in a similar way? We need
results as the crawl runs, not only at the end (since it is a very long
process - weeks).

 



About tomcat

2005-11-21 Thread Anton Potehin
We have come to the conclusion that we need to restart the webapp for new results 
to appear in search. How do we do this correctly without restarting Tomcat?

 

After Tomcat has been running for a long time, we get a "too many open files" 
error. Maybe this is a result of restarting the webapp by touching web.xml? For 
now, before starting Tomcat, we set the maximum number of open files to 4096 
(1024 by default), but we think this is not the right solution.
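Two hedged notes on the setup described (assumptions about a standard Tomcat install, not a fix confirmed on the list): the manager webapp can reload a single context without restarting all of Tomcat, and the open-files limit can be raised for the Tomcat process alone before launch:

```shell
# Reload only the search webapp via the Tomcat manager webapp
# (manager must be enabled; user:pass and /nutch are placeholders).
curl -u user:pass "http://localhost:8080/manager/reload?path=/nutch"

# Raise the per-process open-files limit just for Tomcat
ulimit -n 4096
$CATALINA_HOME/bin/startup.sh
```

Note that if each reload leaks file descriptors, raising the limit only delays the "too many open files" error rather than curing it.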

 

 

 



rank system

2005-11-08 Thread Anton Potehin
What about scoring in mapred? I have looked at crawl/Crawl.java but did
not find anything concerned with calculating page scores. Does mapred
use a ranking system somehow? 

Is it possible to use mapred for clustered whole-web crawling, or does
it work with intranet crawling only?

 



questions

2005-11-08 Thread Anton Potehin
After looking through Crawl.java, I split all the tasks into several phases:

1) Inject - here we add web links into the crawlDb

2) Generate segment - here we create a data segment

3) Fetch

4) Parse segment

5) Update crawlDb - here the information from the segment is added into
the crawlDb

6) Phases 2-5 are repeated several times

7) Link db

 

I can't understand how the clustering is performed. Which phases may be
performed in parallel on several machines, and how can jobs be divided
among the machines? 

What is performed in the 7th phase?
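For reference, a hedged mapping of the phases above onto the mapred command-line tools (tool names as we understand the trunk code; paths and the loop count are placeholders). Each command is itself a map-reduce job, so each phase is spread across the machines by the jobtracker; the 7th phase builds the LinkDb, inverting the link graph so each page lists its incoming links:

```shell
# Hedged sketch: the phases as individual mapred jobs
bin/nutch inject crawldb urls              # 1) seed the crawlDb
for i in 1 2 3; do                         # 6) repeat phases 2-5
  bin/nutch generate crawldb segments      # 2) create a segment
  s=`ls -d segments/* | tail -1`           #    newest segment (assumed layout)
  bin/nutch fetch $s                       # 3) fetch
  bin/nutch parse $s                       # 4) parse segment
  bin/nutch updatedb crawldb $s            # 5) merge results into the crawlDb
done
bin/nutch invertlinks linkdb segments/*    # 7) build the link db
```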