Sorry Doug,
At first I didn't understand your answer. Thanks for the suggestion, I will try it.
Regards, Ferenc
Doug Cutting (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-7?page=comments#action_63160 ]
Doug Cutting commented on NUTCH-7:
----------------------------------
The link analysis tool is not actively maintained. Its use is optional, so if you have problems with it, you can just stop using it. To get some of its effects (prioritizing pages when crawling and searching) without using the analyze command, set both fetchlist.score.by.link.count and indexer.boost.by.link.count to true. This "poor man's link analysis" implementation works surprisingly well.
analyze tool takes up all the disk space when there are circular links
----------------------------------------------------------------------
Key: NUTCH-7
URL: http://issues.apache.org/jira/browse/NUTCH-7
Project: Nutch
Type: Bug
Components: indexer
Environment: analyze runs for an excessive amount of time and creates huge temp files until it runs out of disk space (if you let the db grow)
Reporter: Phoebe Miller
It is repeatable by running an instance with these seeds:
http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
http://www.acf.hhs.gov/programs/ofs/
and limit it (for best effect) to just:
*.acf.hhs.gov/*
Let it go for about 12 cycles to build it up and the temp file size roughly doubles with each segment.
]$ ls -l /db/tmpdir2344la/
...
1503641425 Mar 10 17:42 scoreEdits.0.unsorted
for a very small db:
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 6916
Number of links: 8085
scoreEdits.0.sorted.0 contains rows of links that looked like the first seed url, but with more grants/ and data/ in the sub dirs.
In the file DistributedAnalysisTool.java:
345 if (curIndex - startIndex > extent) {
346 break;
347 }
is the hard stop.
Further down the score is written:
381 for (int i = 0; i < outLinks.length; i++) {
...
385 scoreWriter.append(outLinks[i].getURL(), score);
Putting a check here stops the growth of the tmpdir.../scoreEdits.0 file,
but the links themselves should not be produced in the generation either.
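One possible form for such a check is to skip URLs whose path repeats the same segment many times, as the first seed URL does with grants/ and data/. The following is only a rough sketch of that idea, not code from DistributedAnalysisTool.java: the class name CircularLinkCheck, the method looksCircular, and the threshold are all hypothetical, and it uses java.util features newer than the original code base.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: flags URLs whose path
// repeats one segment suspiciously often.
final class CircularLinkCheck {
    private CircularLinkCheck() {}

    /**
     * Returns true if any single path segment occurs more than
     * maxRepeats times in the path of the given URL.
     */
    static boolean looksCircular(String url, int maxRepeats) {
        // Skip the scheme and host so that only the path is examined.
        int schemeEnd = url.indexOf("://");
        int pathStart = url.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        if (pathStart < 0) {
            return false; // no path at all
        }
        // Drop any query string or fragment.
        String path = url.substring(pathStart).split("[?#]", 2)[0];

        // Count how often each non-empty segment occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            int seen = counts.merge(segment, 1, Integer::sum);
            if (seen > maxRepeats) {
                return true;
            }
        }
        return false;
    }
}
```

With a threshold of 3, the first seed URL above would be flagged (grants occurs nine times in its path) while http://www.acf.hhs.gov/programs/ofs/ would not. A guard like this only treats the symptom in the score file; a legitimate site could repeat a segment, so the threshold is a judgment call.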
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
