I don't want the results (i.e. the fetched pages) to be modified. I want a
tool that can extract this information from the log that Nutch outputs.

Let's say a website has:
http://www.linkA.com, which links to linkInternal1, linkInternal2 and
linkExternal1
http://www.linkB.com, which links to linkInternal3 and linkExternal2

My input is a URL list:
www.linkA.com
www.linkB.com

My filters:
No external websites.

For some reason, the pages for links linkInternal2 and linkInternal2 could
not be fetched. The rest were fetched.

So, I want a tool that can output, say (verbose = off):
Total number of input links: 2
Total number of links: 7
Total number of internal links: 5
Total number of filtered links: 2
Total number of pages fetched: 3
Total number of errors (pages not fetched): 2

If I turn verbose on, it should give a more detailed explanation about
pages, links, fetched pages, time, errors, what the errors were...
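
As a rough illustration of the report I mean, here is a minimal Python
sketch. It does not read real Nutch logs or call any Nutch API: the function
crawl_report, its arguments, and the full URLs for the example links are all
hypothetical. The second failed page is assumed to be linkInternal3 (the
example above names linkInternal2 twice), and "internal" is assumed to mean
"same host as one of the input URLs".

```python
from urllib.parse import urlparse


def crawl_report(seeds, outlinks, failed, verbose=False):
    """Build report lines from hand-supplied crawl data (hypothetical inputs).

    seeds    -- list of input URLs
    outlinks -- dict mapping each page URL to the URLs it links to
    failed   -- dict mapping URLs that could not be fetched to an error text
    """
    seed_hosts = {urlparse(u).hostname for u in seeds}

    # Every distinct URL seen: the seeds plus everything they link to.
    all_links = set(seeds)
    for links in outlinks.values():
        all_links.update(links)

    # Assumption: a link is internal if its host matches a seed's host.
    internal = {u for u in all_links if urlparse(u).hostname in seed_hosts}
    filtered = all_links - internal
    errors = {u: msg for u, msg in failed.items() if u in internal}
    fetched = internal - set(errors)

    lines = [
        f"Total number of input links: {len(seeds)}",
        f"Total number of links: {len(all_links)}",
        f"Total number of internal links: {len(internal)}",
        f"Total number of filtered links: {len(filtered)}",
        f"Total number of pages fetched: {len(fetched)}",
        f"Total number of errors (pages not fetched): {len(errors)}",
    ]
    if verbose:
        # One extra line per URL, grouped by outcome.
        lines += [f"FETCHED {u}" for u in sorted(fetched)]
        lines += [f"FILTERED {u}" for u in sorted(filtered)]
        lines += [f"ERROR {u}: {msg}" for u, msg in sorted(errors.items())]
    return lines


# Example data mirroring the scenario above (URLs are made up).
SEEDS = ["http://www.linkA.com", "http://www.linkB.com"]
OUTLINKS = {
    "http://www.linkA.com": [
        "http://www.linkA.com/linkInternal1",
        "http://www.linkA.com/linkInternal2",
        "http://www.linkExternal1.com",
    ],
    "http://www.linkB.com": [
        "http://www.linkB.com/linkInternal3",
        "http://www.linkExternal2.com",
    ],
}
FAILED = {
    "http://www.linkA.com/linkInternal2": "error reason goes here",
    "http://www.linkB.com/linkInternal3": "error reason goes here",
}

print("\n".join(crawl_report(SEEDS, OUTLINKS, FAILED)))
```

With the example data this gives the same six counts as the summary above
(2, 7, 5, 2, 3, 2). Passing verbose=True adds one line per URL. A real tool
would have to fill seeds, outlinks and failed from whatever Nutch actually
records, which is the part I am asking about.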

In this way, I know exactly what Nutch does: how many links it visited and
fetched, and why it didn't get some of them.

Also, I am most likely to have a predetermined set of web pages I want to
visit in a website, i.e. I already know there are 7 links (5 internal + 2
external) in the website. The above tool would make it very easy for a
person to compare what Nutch did against what I wanted.
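
That comparison is essentially a set difference. A minimal sketch, assuming
both URL lists are already available as plain Python collections (the
function name compare_crawl and the example URLs are hypothetical and not
part of Nutch, and the two missing pages are assumed as in the example):

```python
def compare_crawl(expected, fetched):
    """Return (missing, unexpected) as sorted lists of URLs."""
    expected, fetched = set(expected), set(fetched)
    missing = sorted(expected - fetched)     # wanted but not fetched
    unexpected = sorted(fetched - expected)  # fetched but not wanted
    return missing, unexpected


# Hypothetical data: the 5 internal pages I expect, and the 3 that were got.
expected = [
    "http://www.linkA.com",
    "http://www.linkB.com",
    "http://www.linkA.com/linkInternal1",
    "http://www.linkA.com/linkInternal2",
    "http://www.linkB.com/linkInternal3",
]
fetched = [
    "http://www.linkA.com",
    "http://www.linkB.com",
    "http://www.linkA.com/linkInternal1",
]

missing, unexpected = compare_crawl(expected, fetched)
print("Missing:", missing)
print("Unexpected:", unexpected)
```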

Currently, one has to do this manually: enter Nutch commands, look into the
logs... I just want a customizable log report.


Lukas Vlcek wrote:
> 
> Hi,
> 
> I haven't been watching nutch development progress for some time (so my
> answer may not be accurate) but I don't think there is such a tool/report.
> Anyway, your contribution would be warmly welcomed! :-)
> 
> On the other hand, based on your short description of the features you are
> looking for, my personal opinion is that you are looking for a tool which
> should provide exact information about something that is very variable
> (mutable) in its nature and heavily dependent on the Nutch setup.
> 
> For example, the size of a parsed document (for example an HTML document)
> can be limited to a specific size. So can the number of links extracted
> from a document ... etc, etc ... Such variables have a major impact on the
> crawl result and thus on the result of your report as well.
> 
> Just my 2 cents.
> 
> Regards,
> Lukas
> 
> On 12/2/06, karthik085 <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hello,
>>
>> How do I check that all pages have been fetched? Is there a command or
>> tool, that says like:
>> these are the number of pages in the website, the number of pages fetched,
>> pages filtered...
>> give a report. If errors, how many and give a brief description...
>>
>> I understand analyzing log and readdb with stats/dumppageurl is one option.
>> But, it is time consuming and requires unwanted manual work. If there is a
>> tool/command that did the above option, I could just easily parse the report
>> for my web services.
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch-Data-Testing-tf2742246.html#a7651128
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-Data-Testing-tf2742246.html#a7684864
Sent from the Nutch - User mailing list archive at Nabble.com.

