I don't want the results (i.e., the fetched pages) to be modified. I want a tool that can extract this information from the log that Nutch outputs.
Let's say a website has:

http://www.linkA.com has links to linkInternal1, linkInternal2, linkExternal1
http://www.linkB.com has links to linkInternal3, linkExternal2

My input is a URL list:
www.linkA.com
www.linkB.com

My filters: no external websites.

For some reason, pages with links linkInternal2 and linkInternal2 could not be fetched. The rest were fetched.

So, I want a tool that can output, say (verbose = off):

Total number of input links: 2
Total number of links: 7
Total number of internal links: 5
Total number of filtered links: 2
Total number of pages fetched: 3
Total number of errors (pages not fetched): 2

If I turn verbose on, it should give a more detailed explanation about pages, links, fetched pages, time, errors, what kind of errors...

In this way, I know exactly what Nutch does: how many links it visited, how many it fetched, and why it didn't get some of them.

Also, I am most likely to have a predetermined range of webpages I want to visit in a website, i.e. I already know there are 7 links (5 internal + 2 external) in the website. The above tool would make it very easy for a person to compare what Nutch did with what I wanted. Currently, one has to enter Nutch commands manually and look into the logs. I just want a customizable log report.

Lukas Vlcek wrote:
>
> Hi,
>
> I haven't been watching Nutch development progress for some time (so my
> answer may not be accurate), but I don't think there is such a tool/report.
> Anyway, your contribution would be warmly welcomed! :-)
>
> On the other hand, based on your short description of the features you are
> looking for, my personal opinion is that you are looking for a tool which
> should provide exact information about something that is very variable
> (mutable) in its nature and heavily dependent on the Nutch setup.
>
> For example, the size of a parsed document (for example an HTML document)
> can be limited to a specific size. So can the number of links extracted
> from a document, etc. Such variables have a fatal impact on the crawl
> result and thus on the result of your report as well.
>
> Just my 2 cents.
>
> Regards,
> Lukas
>
> On 12/2/06, karthik085 <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> How do I check that all pages have been fetched? Is there a command or
>> tool that says something like: these are the number of pages in the
>> website, the number of pages fetched, pages filtered... and gives a
>> report? If there are errors, how many, with a brief description...
>>
>> I understand analyzing the log and readdb with stats/dumppageurl is one
>> option. But it is time consuming and requires unwanted manual work. If
>> there is a tool/command that did the above, I could just easily parse
>> the report for my web services.
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch-Data-Testing-tf2742246.html#a7651128
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>

--
View this message in context: http://www.nabble.com/Nutch-Data-Testing-tf2742246.html#a7684864
Sent from the Nutch - User mailing list archive at Nabble.com.
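
For illustration only, the non-verbose report described at the top of this message could be sketched roughly as below. This does not parse real Nutch logs: `LinkRecord`, `build_report`, the status names and the error text are all hypothetical, and the choice of which two internal pages failed is made up for the example. It also assumes that "no external websites" is the only filter, so internal links = total links - filtered links.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LinkRecord:
    """One link the crawler saw (hypothetical record, not a Nutch log format)."""
    url: str
    status: str       # "fetched", "filtered", or "error"
    detail: str = ""  # free-text explanation, shown only in verbose mode


def build_report(seeds, records, verbose=False):
    """Return the report as a list of text lines."""
    counts = Counter(r.status for r in records)
    total = len(records)
    lines = [
        f"Total number of input links: {len(seeds)}",
        f"Total number of links: {total}",
        # Assumption: the only filter is "no external websites".
        f"Total number of internal links: {total - counts['filtered']}",
        f"Total number of filtered links: {counts['filtered']}",
        f"Total number of pages fetched: {counts['fetched']}",
        f"Total number of errors (pages not fetched): {counts['error']}",
    ]
    if verbose:
        # One extra line per link, with the reason when one is known.
        for r in records:
            suffix = f" ({r.detail})" if r.detail else ""
            lines.append(f"{r.status}: {r.url}{suffix}")
    return lines


# Example data mirroring the scenario above; failures and details are invented.
seeds = ["www.linkA.com", "www.linkB.com"]
records = [
    LinkRecord("www.linkA.com", "fetched"),
    LinkRecord("www.linkB.com", "fetched"),
    LinkRecord("linkInternal1", "fetched"),
    LinkRecord("linkInternal2", "error", "example error text"),
    LinkRecord("linkInternal3", "error", "example error text"),
    LinkRecord("linkExternal1", "filtered", "external website"),
    LinkRecord("linkExternal2", "filtered", "external website"),
]

report = build_report(seeds, records)
print("\n".join(report))
```

Calling `build_report(seeds, records, verbose=True)` appends one line per link after the six totals. Filling `records` from an actual crawl would still need a separate step that reads whatever the Nutch setup in question writes to its logs or crawl database.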
