Hi,

IMHO it is not that easy to get this result from the Nutch log file. I
would say that in general it is not possible at all.

I see two reasons:
(1) Nutch uses log4j, so the output depends on how the logging system is
configured. How can you count fetched pages if, for example, the logger for
the Fetcher class is set to FATAL level only? In that case there won't be any
record from Fetcher in the log file. Or imagine that the logging
configuration is changed to write to a database instead of a flat file.
Worse still, one log file can be shared by multiple Nutch instances, each
parsing a different set of URLs.

(2) Generally, you can't rely on any specific configuration of Nutch
(nutch-site.xml or any other XML config file). As you probably know, it is
possible to tell Nutch to parse only the first X bytes of each document
(and that portion of the document may not contain the links you are
interested in) or to exclude a specific type of link (not to follow Word
links, for example), and so on.

I think the logging system is meant mainly for debug/trace purposes. If I
were you, I wouldn't rely on the logging system and the Nutch configuration
being in some specific state.
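
To illustrate point (1), here is a minimal Python sketch of a log-based
counter. The log line format, the regular expression and the sample lines
are assumptions made up for this example, not guaranteed Nutch output. The
same two-page crawl gives a different count depending only on the logger
level.

```python
import re

# Assumed line format: "<timestamp> INFO  fetcher.Fetcher - fetching <url>".
# This pattern is hypothetical; real output depends on the log4j layout.
FETCH_LINE = re.compile(r"\bfetcher\.Fetcher\b.*\bfetching (\S+)")


def count_fetch_lines(log_lines):
    """Count log lines that look like Fetcher 'fetching <url>' records."""
    return sum(1 for line in log_lines if FETCH_LINE.search(line))


# Made-up sample: a two-page crawl logged at INFO level ...
log_at_info = [
    "2006-12-04 10:00:01 INFO  fetcher.Fetcher - fetching http://www.linkA.com/",
    "2006-12-04 10:00:02 INFO  fetcher.Fetcher - fetching http://www.linkB.com/",
]
# ... and the same crawl with the Fetcher logger set to FATAL: nothing is written.
log_at_fatal = []

print("counted at INFO: ", count_fetch_lines(log_at_info))
print("counted at FATAL:", count_fetch_lines(log_at_fatal))
```

Both runs fetched the same pages, yet the counter reports two in one case
and zero in the other, which is why a report built only on the log cannot be
trusted without also pinning down the logging configuration.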

Regards,
Lukas

On 12/4/06, karthik085 <[EMAIL PROTECTED]> wrote:


I don't want the results, i.e. the fetched pages, to be modified. I want a
tool that can extract the information Nutch writes to the log.

Let's say, a website has:
http://www.linkA.com has a link to linkInternal1, linkInternal2,
linkExternal1
http://www.linkB.com has a link to linkInternal3, linkExternal2

My input is, a url list:
www.linkA.com
www.linkB.com

My filters:
No external websites.

For some reason, pages with links linkInternal2 and linkInternal2 could not
be fetched. The rest were fetched.

So, I want a tool, that can output, say (verbose = off),
Total number of input links: 2
Total number of links: 7
Total number of internal links: 5
Total number of filtered links: 2
Total number of pages fetched: 3
Total number of errors(pages not fetched): 2
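
The arithmetic behind these figures can be sketched in Python. This is only
an illustration under stated assumptions, not an existing Nutch tool: the
Link record, the summarize function, the full URLs and the external host
www.example.org are all hypothetical, and which two internal pages failed is
assumed because the text above names the same link twice.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Link:
    url: str
    fetched: bool = False  # outcome recorded by some hypothetical crawl monitor


def host(url):
    """Return the lowercased host name of a URL."""
    return urlparse(url).hostname


def summarize(seeds, links):
    """Compute the report counts; 'links' includes the seed URLs themselves."""
    seed_hosts = {host(s) for s in seeds}
    internal = [l for l in links if host(l.url) in seed_hosts]
    filtered = [l for l in links if host(l.url) not in seed_hosts]
    fetched = [l for l in internal if l.fetched]
    return {
        "input": len(seeds),
        "total": len(links),
        "internal": len(internal),
        "filtered": len(filtered),
        "fetched": len(fetched),
        "errors": len(internal) - len(fetched),
    }


# Hypothetical data mirroring the example; paths and external host are made up.
SEEDS = ["http://www.linkA.com", "http://www.linkB.com"]
LINKS = [
    Link("http://www.linkA.com", fetched=True),
    Link("http://www.linkB.com", fetched=True),
    Link("http://www.linkA.com/linkInternal1", fetched=True),
    Link("http://www.linkA.com/linkInternal2"),   # assumed failure
    Link("http://www.linkB.com/linkInternal3"),   # assumed failure
    Link("http://www.example.org/linkExternal1"),  # external, filtered out
    Link("http://www.example.org/linkExternal2"),  # external, filtered out
]

report = summarize(SEEDS, LINKS)
print("Total number of input links:", report["input"])
print("Total number of links:", report["total"])
print("Total number of internal links:", report["internal"])
print("Total number of filtered links:", report["filtered"])
print("Total number of pages fetched:", report["fetched"])
print("Total number of errors(pages not fetched):", report["errors"])
```

With this sample data the six counts come out as 2, 7, 5, 2, 3 and 2, the
same as the figures listed above. Where the per-link fetch outcome would
come from in a real crawl is left open here.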

If I turn verbose on,
it should give a more detailed explanation of pages, links, fetched
pages, time, errors, what errors....

In this way, I know exactly what nutch does - how many links did it visit,
fetch, why it didn't get some of them....

Also, I am most likely to have a predetermined range of webpages I want
to visit in a website, i.e. I already know there are 7 links (5 internal +
2 external) in the website. The above tool would make it very easy for a
person to compare what Nutch did with what I wanted.

Currently, one has to do this manually: enter Nutch commands, look into
the logs.... I just want a customizable log report.


Lukas Vlcek wrote:
>
> Hi,
>
> I haven't been watching nutch development progress for some time (so my
> answer may not be accurate) but I don't think there is such a tool/report.
> Anyway, your contribution would be warmly welcomed! :-)
>
> On the other hand, based on your short description of features you are
> looking for, my personal opinion is that you are looking for a tool which
> should provide exact information about something that is very variable
> (mutable) in its nature and heavily dependent on the Nutch setup.
>
> For example, the size of a parsed document (an HTML document, say) can be
> limited to a specific size. So can the number of links extracted from a
> document, and so on. Such variables have a decisive impact on the crawl
> result and thus on the result of your report as well.
>
> Just my 2 cents.
>
> Regards,
> Lukas
>
> On 12/2/06, karthik085 <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hello,
>>
>> How do I check that all pages have been fetched? Is there a command or
>> tool, that says like:
>> these are the number of pages in the website, the number of pages fetched,
>> pages filtered...
>> give a report. If errors, how many and give a brief description...
>>
>> I understand analyzing log and readdb with stats/dumppageurl is one
>> option. But, it is time consuming and requires unwanted manual work. If
>> there is a tool/command that did the above option, I could just easily
>> parse the report for my web services.
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch-Data-Testing-tf2742246.html#a7651128
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general