[jira] [Commented] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump

2014-04-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961376#comment-13961376
 ] 

Sebastian Nagel commented on NUTCH-1615:


No question, reading an entire [Wikimedia 
dump|http://dumps.wikimedia.org/backup-index.html] into the web table would 
provide a nice playground to test content extraction, link rank algorithms, 
etc. Crawling Wikipedia is no alternative because of its size and because you 
are encouraged [not to do 
so|http://en.wikipedia.org/wiki/Wikipedia:Download#Please_do_not_use_a_web_crawler].
 There are already tools to process Wikipedia dumps via Hadoop (e.g., search 
for "[hadoop process wikipedia 
dump|https://www.google.com/search?q=hadoop%20process%20wikipedia%20dump]"). 
But wiki markup is quite complex, and to convert it properly to HTML there is 
hardly any other choice than to set up your own MediaWiki server and import 
the Wikipedia dumps. The situation for other content management systems isn't 
better: usually dumps can be generated, but the format isn't standardized. 
Consequently, there will probably be no way to implement a generalized tool 
that can "fetch from website dumps".

> Implementing A Feature for Fetching From Websites Dump
> --
>
> Key: NUTCH-1615
> URL: https://issues.apache.org/jira/browse/NUTCH-1615
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 2.1
>Reporter: cihad güzel
>Priority: Minor
>
> Some websites provide dumps (such as http://dumps.wikimedia.org/enwiki/ for 
> wikipedia.org). We should fetch from the dumps for such websites, which 
> would make fetching quicker.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (NUTCH-1750) Improvement of Fetcher's reportStatus

2014-04-06 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1750:


 Summary: Improvement of Fetcher's reportStatus
 Key: NUTCH-1750
 URL: https://issues.apache.org/jira/browse/NUTCH-1750
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor








[jira] [Updated] (NUTCH-1750) Improvement of Fetcher's reportStatus

2014-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1750:
-

Attachment: NUTCH-1750.patch

> Improvement of Fetcher's reportStatus
> -
>
> Key: NUTCH-1750
> URL: https://issues.apache.org/jira/browse/NUTCH-1750
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-1750.patch
>
>






[jira] [Commented] (NUTCH-1750) Improvement of Fetcher's reportStatus

2014-04-06 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961401#comment-13961401
 ] 

Julien Nioche commented on NUTCH-1750:
--

The attached patch improves a few things:
* makes it explicit that the stats in brackets are for the last second
* average pages per sec was always an int but displayed as a float
* avgBytesSec could be incorrect on large fetch sessions (because a long was 
cast to a float)
* bytes for the last second are now shown in kbits instead of bytes
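
The arithmetic pitfalls named in the list above can be illustrated with a 
small sketch. This is not the code from NUTCH-1750.patch: the class and method 
names are hypothetical, and the conversion assumes 1 kbit = 1000 bits, which 
the comment does not specify. It only shows why integer division drops the 
fraction of an average and why routing a large long through a float can change 
its value.

```java
// Hypothetical illustration; not taken from the Nutch Fetcher or the patch.
final class ReportStatusSketch {

    /** Average pages per second, computed in floating point. */
    static double avgPagesPerSec(long pages, long elapsedSec) {
        return elapsedSec <= 0 ? 0.0 : (double) pages / elapsedSec;
    }

    /** Integer division for contrast: the fraction is lost before display. */
    static long truncatedPagesPerSec(long pages, long elapsedSec) {
        return elapsedSec <= 0 ? 0L : pages / elapsedSec;
    }

    /** Round trip through float, which keeps only 24 significant bits. */
    static long viaFloat(long value) {
        return (long) (float) value;
    }

    /** Round trip through double, which keeps 53 significant bits. */
    static long viaDouble(long value) {
        return (long) (double) value;
    }

    /** Bytes to kilobits, assuming 1 kbit = 1000 bits (an assumption). */
    static double toKbits(long bytes) {
        return bytes * 8.0 / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(truncatedPagesPerSec(10, 4)); // integer average
        System.out.println(avgPagesPerSec(10, 4));       // fractional average
        long bytes = 16_777_217L;                        // 2^24 + 1
        System.out.println(viaFloat(bytes) == bytes);    // float alters it
        System.out.println(viaDouble(bytes) == bytes);   // double keeps it
        System.out.println(toKbits(125));
    }
}
```

Any byte counter above 2^24 (about 16.8 million) can already be affected by 
the float cast, so a long fetch session reaches that range quickly.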
 







[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-04-06 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961562#comment-13961562
 ] 

Chris Schneider commented on NUTCH-385:
---

Hi Julien,

Actually, I believe the original bug report made two basic requests for 
improvement:

1) The behavior of these two configuration parameters should be changed to make 
them more consistent with one another.

2) The behavior of these two configuration parameters should be clearly 
documented in the configuration file, including any interactions between them 
(such as who trumps whom).

Since then, Andrzej has attempted to justify the current behavior, though there 
seem to be other opinions on how it really ought to work. Even if we decide not 
to change the current implementation, I think it certainly deserves better 
documentation.

Chris

> Server delay feature conflicts with maxThreadsPerHost
> -
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Chris Schneider
>
> For some time I've been puzzled by the interaction between two parameters that 
> control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our 
> processing of the robots.txt file, and which can be limited by 
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than 
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
> continuously. In other words, it never tries to point 3 at the host, and it 
> always points a second thread at the host before the first thread finishes 
> accessing it. Since HttpBase.unblockAddr never gets called with 
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. Thus, the server delay will never be used at all. The fetcher will be 
> continuously retrieving pages from the host, often with 2 fetchers accessing 
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to 
> complete before it gets around to pointing another thread at the target host. 
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. This, in turn, will prevent any threads from accessing this host until 
> the delay is complete, even though zero threads are currently accessing the 
> host.
> I see this behavior as inconsistent. More importantly, the current 
> implementation certainly doesn't seem to answer my original question about 
> appropriate definitions for what appear to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more 
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this 
> trumped the server delay, causing the latter to be ignored completely. That 
> is certainly not the case in the current implementation, as it will wait for 
> server delay whenever the number of threads accessing a given host drops to 
> zero.
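
The two scenarios in the description above can be reproduced with a small 
model. This is a simplified sketch built only from that description, not the 
actual HttpBase code: the class HostBlockModel and its methods are 
hypothetical, the two maps merely play the roles of THREADS_PER_HOST_COUNT and 
BLOCKED_ADDR_TO_TIME, and time is passed in explicitly so the behavior is 
deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the described blocking logic; not Nutch source code.
final class HostBlockModel {
    private final Map<String, Integer> threadsPerHost = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();
    private final int maxThreadsPerHost;
    private final long crawlDelayMs;

    HostBlockModel(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    /** True while a recorded delay for the host has not yet expired. */
    boolean isDelayed(String host, long nowMs) {
        Long until = blockedUntil.get(host);
        return until != null && nowMs < until;
    }

    /** Starts a fetch unless the host is delayed or already saturated. */
    boolean tryAcquire(String host, long nowMs) {
        if (isDelayed(host, nowMs)) {
            return false;
        }
        blockedUntil.remove(host); // any recorded delay has expired
        int active = threadsPerHost.getOrDefault(host, 0);
        if (active >= maxThreadsPerHost) {
            return false;
        }
        threadsPerHost.put(host, active + 1);
        return true;
    }

    /** Ends a fetch; only the last active thread records the delay. */
    void release(String host, long nowMs) {
        int active = threadsPerHost.getOrDefault(host, 0);
        if (active == 0) {
            return; // nothing to release
        }
        if (active == 1) {
            threadsPerHost.remove(host);
            blockedUntil.put(host, nowMs + crawlDelayMs);
        } else {
            threadsPerHost.put(host, active - 1);
        }
    }

    public static void main(String[] args) {
        // Scenario 1: overlapping threads, so the delay is never recorded.
        HostBlockModel overlap = new HostBlockModel(2, 5000);
        overlap.tryAcquire("example.org", 0);
        overlap.tryAcquire("example.org", 100);
        overlap.release("example.org", 200);
        System.out.println(overlap.isDelayed("example.org", 200));

        // Scenario 2: the last thread finishes, so the host is blocked
        // even though no thread is accessing it.
        HostBlockModel idle = new HostBlockModel(2, 5000);
        idle.tryAcquire("example.org", 0);
        idle.release("example.org", 100);
        System.out.println(idle.tryAcquire("example.org", 200));
    }
}
```

In this model the delay depends entirely on whether the per-host count happens 
to reach zero, which is the inconsistency the report points out.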


