update crawldb
How can we update information about links already added to the db? In particular, we need to update the status of some subset of the links. Which classes should we use to read the information stored for each link in the db and then update its status? We use the trunk branch of Nutch.
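A minimal sketch of the update step, not using real Nutch classes (in Nutch itself the entries are CrawlDatum records rewritten by a CrawlDb update job; here the db is modeled as a plain URL-to-status map purely for illustration, and the constant values mirror CrawlDatum):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the CrawlDb: URL -> status byte.
public class CrawlDbStatusUpdate {
    public static final byte STATUS_DB_UNFETCHED = 1;
    public static final byte STATUS_DB_GONE = 3;

    // Mark every URL in 'gone' with STATUS_DB_GONE, leaving other entries untouched.
    public static Map<String, Byte> updateStatus(Map<String, Byte> db, Set<String> gone) {
        Map<String, Byte> updated = new HashMap<>(db);
        for (String url : gone) {
            if (updated.containsKey(url)) {
                updated.put(url, STATUS_DB_GONE);
            }
        }
        return updated;
    }
}
```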
mapred.map.tasks
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>The default number of map tasks per job. Typically set
    to a prime several times greater than number of available hosts.
    Ignored when mapred.job.tracker is "local".</description>
  </property>

We have a question about this property. Is it really preferable to set this parameter to several times the number of available hosts? We do not understand why this should be so. Our spider is distributed across 3 machines. What value is best for this parameter in our case? Which other factors may affect the best value for this parameter?
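For example (a hedged suggestion, not a tested setting): the usual reason for using several times the host count is load balancing - with many small map tasks, a fast machine picks up extra tasks while a slow one lags, instead of everyone waiting on one big task per host. With 3 machines, a small prime such as 11 would give roughly 3-4 map tasks per host; in nutch-site.xml that could look like:

```xml
<property>
  <name>mapred.map.tasks</name>
  <value>11</value>
  <description>Roughly 3-4 map tasks per host on a 3-machine cluster.</description>
</property>
```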
question about crawldb
1. We have found these flags in the CrawlDatum class:

   public static final byte STATUS_SIGNATURE = 0;
   public static final byte STATUS_DB_UNFETCHED = 1;
   public static final byte STATUS_DB_FETCHED = 2;
   public static final byte STATUS_DB_GONE = 3;
   public static final byte STATUS_LINKED = 4;
   public static final byte STATUS_FETCH_SUCCESS = 5;
   public static final byte STATUS_FETCH_RETRY = 6;
   public static final byte STATUS_FETCH_GONE = 7;

Though the names of these flags describe their purpose, it is not completely clear what each one means - for example, what is the difference between STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS?

2. Where are new links added into the CrawlDb?
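A rough reading of the names (an assumption drawn from the constant names, not from the Nutch source): the STATUS_FETCH_* values record the outcome of a single fetch attempt in a segment, while the STATUS_DB_* values record the merged long-term state of a URL in the CrawlDb; the updatedb step translates the former into the latter. A hypothetical sketch of that translation:

```java
public class StatusTranslation {
    public static final byte STATUS_DB_UNFETCHED = 1;
    public static final byte STATUS_DB_FETCHED = 2;
    public static final byte STATUS_DB_GONE = 3;
    public static final byte STATUS_FETCH_SUCCESS = 5;
    public static final byte STATUS_FETCH_RETRY = 6;
    public static final byte STATUS_FETCH_GONE = 7;

    // Hypothetical mapping applied when merging a segment's fetch results
    // back into the CrawlDb (in Nutch the real logic lives in the updatedb job).
    public static byte toDbStatus(byte fetchStatus) {
        switch (fetchStatus) {
            case STATUS_FETCH_SUCCESS: return STATUS_DB_FETCHED;
            case STATUS_FETCH_GONE:    return STATUS_DB_GONE;
            case STATUS_FETCH_RETRY:   return STATUS_DB_UNFETCHED; // retry on a later pass
            default: throw new IllegalArgumentException("not a fetch status: " + fetchStatus);
        }
    }
}
```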
mapred branch
Where is the mapred branch of Nutch located now?
image search
Has anybody tried to create an image search based on Nutch?
mapred crawl
We used Nutch for whole-web crawling. In an infinite loop we ran these tasks:

1) bin/nutch generate db segmentsPath -topN 1
2) bin/nutch fetch segmentName
3) bin/nutch updatedb db segmentName
4) bin/nutch analyze db segmentName
5) bin/nutch index segmentName
6) bin/nutch dedup segments dedup.tmp

After each iteration we produce a new segment and can use it for search. Now we are trying mapred. How can we use crawl in a similar way? We need results while the crawl is in progress, not only at the end (since it is a very long process - weeks).
About tomcat
We have come to the conclusion that we need to restart the webapp for new results to appear in search. How can we do this correctly without restarting Tomcat? After Tomcat has been running for a long time, we get a "too many open files" error. Could this be a result of restarting the webapp by touching web.xml? For now, before starting Tomcat, we set the maximum number of open files to 4096 (1024 by default), but we do not think this is the right solution.
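One common workaround (a sketch assuming a standard Tomcat startup script; raising the limit only delays the error if descriptors are actually leaking across webapp reloads) is to raise the soft file-descriptor limit in the shell that launches Tomcat, for example near the top of catalina.sh or a wrapper script:

```shell
# Raise the soft limit on open file descriptors for this shell and its
# children (Tomcat inherits it). The hard limit caps how high this can go.
ulimit -S -n 4096
echo "open file limit: $(ulimit -S -n)"
```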
rank system
What about scoring in mapred? I have looked at crawl/Crawl.java but did not find anything related to calculating page scores. Does the mapred branch use a ranking system somehow? Is it possible to use mapred for clustered whole-web crawling, or does it work for intranet crawling only?
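For reference, the usual idea behind link-based page scoring (a generic illustration, not the mapred branch's actual algorithm) is that on each iteration every page splits its current score equally among its outlinks, and a page's new score is the sum of the shares it receives. A tiny self-contained sketch:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScorePropagation {
    // One iteration: every page distributes its score equally to its outlinks.
    // 'links' maps a page to the pages it links to; 'scores' maps page -> score.
    public static Map<String, Double> iterate(Map<String, List<String>> links,
                                              Map<String, Double> scores) {
        Map<String, Double> next = new HashMap<>();
        for (String page : scores.keySet()) next.put(page, 0.0);
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = scores.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue()) {
                next.merge(target, share, Double::sum);
            }
        }
        return next;
    }
}
```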
questions
After looking through Crawl.java, I split all the tasks into several phases:

1) Inject - here we add web links into the CrawlDb
2) Generate segment - here we create a data segment
3) Fetching
4) Parse segment
5) Update CrawlDb - here the information from the segment is added into the CrawlDb
6) Phases 2-5 are repeated several times
7) LinkDb

I can't understand how the clusterization is performed. Which phases can be performed in parallel on several machines, and how can jobs be divided across several machines? What is performed in the 7th phase?
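On the parallelism question, the general mechanism (an illustration of Hadoop-style hash partitioning, not the exact Nutch code) is that each phase is a MapReduce job whose records are split by key across tasks running on different machines; for fetching, URLs are typically partitioned by host so that all URLs of one host land on one machine and per-host politeness limits can be enforced. A minimal sketch of that assignment:

```java
public class HostPartitioner {
    // Assign a URL's host to one of numMachines partitions, the way a
    // Hadoop-style hash partitioner assigns keys to tasks.
    public static int partition(String host, int numMachines) {
        return (host.hashCode() & Integer.MAX_VALUE) % numMachines;
    }
}
```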