[jira] [Updated] (NUTCH-1749) Title duplicated in document body

2014-04-05 Thread Greg Padiasek (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Padiasek updated NUTCH-1749:
-

Attachment: DOMContentUtils.patch

> Title duplicated in document body
> -
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Greg Padiasek
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content. 
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the 
> content via DOMContentUtils.getText(), there is no need to duplicate the 
> title in the content. When the title is included in the content it becomes 
> difficult or impossible to extract the document body without the title, which 
> is needed when a user wants to index or display body and title separately.
> Attached is a patch which prevents the HTML parser plugin from including the 
> title in the document content.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (NUTCH-1749) Title duplicated in document body

2014-04-05 Thread Greg Padiasek (JIRA)
Greg Padiasek created NUTCH-1749:


 Summary: Title duplicated in document body
 Key: NUTCH-1749
 URL: https://issues.apache.org/jira/browse/NUTCH-1749
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Greg Padiasek


The HTML parser plugin inserts the document title into the document content. 
Since the title alone can be retrieved via DOMContentUtils.getTitle() and the 
content via DOMContentUtils.getText(), there is no need to duplicate the title 
in the content. When the title is included in the content it becomes difficult 
or impossible to extract the document body without the title, which is needed 
when a user wants to index or display body and title separately.

Attached is a patch which prevents the HTML parser plugin from including the 
title in the document content.
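The idea of the fix can be sketched as follows. This is a hypothetical
illustration using the JDK's DOM parser, not the attached
DOMContentUtils.patch: when accumulating body text, <title> subtrees are
simply skipped, so getText() no longer duplicates what getTitle() returns.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TextExtractor {
    // Recursively collect text, skipping any <title> subtree.
    static void getText(StringBuilder sb, Node node) {
        if ("title".equalsIgnoreCase(node.getNodeName())) return; // skip title
        if (node.getNodeType() == Node.TEXT_NODE) sb.append(node.getNodeValue());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) getText(sb, children.item(i));
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>My Title</title></head>"
                    + "<body>Body text</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        StringBuilder sb = new StringBuilder();
        getText(sb, doc.getDocumentElement());
        System.out.println(sb.toString().trim()); // Body text
    }
}
```

With the title excluded from the text pass, callers can still combine the two
fields themselves when they want both.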





[jira] [Issue Comment Deleted] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump

2014-04-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cihad güzel updated NUTCH-1615:
---

Comment: was deleted

(was: I'm trying for this issue.)

> Implementing A Feature for Fetching From Websites Dump
> --
>
> Key: NUTCH-1615
> URL: https://issues.apache.org/jira/browse/NUTCH-1615
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 2.1
>Reporter: cihad güzel
>Priority: Minor
>
> Some web sites provide dumps (for example http://dumps.wikimedia.org/enwiki/ 
> for wikipedia.org). We should be able to fetch from dumps for such web 
> sites; fetching would then be quicker.





[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961236#comment-13961236
 ] 

Hudson commented on NUTCH-1747:
---

SUCCESS: Integrated in Nutch-trunk #2592 (See 
[https://builds.apache.org/job/Nutch-trunk/2592/])
NUTCH-1747 Use AtomicInteger as semaphore in Fetcher (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1585196)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


> Use AtomicInteger as semaphore in Fetcher
> -
>
> Key: NUTCH-1747
> URL: https://issues.apache.org/jira/browse/NUTCH-1747
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-1747-trunk.patch
>
>
> In Fetcher we currently use 
> Set<FetchItem> inProgress = Collections.synchronizedSet(new 
> HashSet<FetchItem>());
> as semaphore within the FetchItemQueues to keep track of the URLs being 
> fetched and prevent threads from pulling from them. It works fine but we 
> could use AtomicIntegers instead as all we need is the counts, not the 
> contents.
> This change would have little impact on the performance but would make the 
> code a bit cleaner.





[jira] [Created] (NUTCH-1748) despite Unix systems allowing "abc..xyz.txt" kind of urls, url validator plugin rejects them.

2014-04-05 Thread Sertac TURKEL (JIRA)
Sertac TURKEL created NUTCH-1748:


 Summary: despite Unix systems allowing "abc..xyz.txt" kind of urls, 
url validator plugin rejects them. 
 Key: NUTCH-1748
 URL: https://issues.apache.org/jira/browse/NUTCH-1748
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.3


Unix systems accept file names containing two consecutive dots, e.g. 
"abc..xyz.txt", so urlfilter-validator should not reject such URLs. Paths 
containing "/../", or with "/.." in final position, should still be rejected.





[jira] [Updated] (NUTCH-1342) Read time out protocol-http

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1342:
-

Component/s: (was: fetcher)
 protocol

> Read time out protocol-http
> ---
>
> Key: NUTCH-1342
> URL: https://issues.apache.org/jira/browse/NUTCH-1342
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.4, 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1342-1.6-1.patch
>
>
> For some reason some URL's always time out with protocol-http but not 
> protocol-httpclient. The stack trace is always the same:
> {code}
> 2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
> at java.io.FilterInputStream.read(FilterInputStream.java:90)
> at 
> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
> at 
> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:157)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
> {code}
> Some example URL's:
> * 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
> * 301 http://shop.fcgroningen.nl/aanbieding





[jira] [Updated] (NUTCH-827) HTTP POST Authentication

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-827:


Component/s: (was: fetcher)
 protocol

> HTTP POST Authentication
> 
>
> Key: NUTCH-827
> URL: https://issues.apache.org/jira/browse/NUTCH-827
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.1, nutchgora
>Reporter: Jasper van Veghel
>Priority: Minor
>  Labels: authentication
> Fix For: 1.9
>
> Attachments: http-client-form-authtication.patch, 
> nutch-http-cookies.patch
>
>
> I've created a patch against the trunk which adds support for very 
> rudimentary POST-based authentication support. It takes a link from 
> nutch-site.xml with a site to POST to and its respective parameters 
> (username, password, etc.). It then checks upon every request whether any 
> cookies have been initialized, and if none have, it fetches them from the 
> given link.
> This isn't perfect but Works For Me (TM) as I generally only need to retrieve 
> results from a single domain and so have no cookie overlap (i.e. if the 
> domain cookies expire, all cookies disappear from the HttpClient and I can 
> simply re-fetch them). A natural improvement would be to be able to specify 
> one particular cookie to check the expiration-date against. If anyone is 
> interested in this beside me I'd be glad to put some more effort into making 
> this more universally applicable.
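A form-based login of the kind described above might be sketched like this.
Note this is hypothetical: Nutch's actual patch targets the Apache HttpClient
used by protocol-httpclient, while this sketch uses the JDK's java.net.http
only to show the shape of a form-encoded POST whose session cookies would
then be reused; the URL and parameter names are illustrative.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class FormLogin {
    // Build a form-encoded login POST (assumed endpoint and field names).
    static HttpRequest buildLogin(String url, String user, String pass) {
        String form = "username=" + user + "&password=" + pass;
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildLogin("https://example.com/login", "user", "secret");
        // The request is only built here, not sent; a cookie-enabled client
        // would send it once and reuse the resulting cookies on later fetches.
        System.out.println(req.method() + " " + req.uri());
    }
}
```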





[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-410:


Component/s: (was: fetcher)

> Faster RegexNormalize with more features
> 
>
> Key: NUTCH-410
> URL: https://issues.apache.org/jira/browse/NUTCH-410
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.8
> Environment: Tested on MacOS X 10.4.7/10.4.8
>Reporter: Doug Cook
>Priority: Minor
> Fix For: 1.9
>
> Attachments: betterRegexNorm.patch
>
>
> The patch associated with this is backwards-compatible and has several 
> improvements over the stock 0.8 RegexURLNormalizer:
> 1) About a 34% performance improvement, from only executing the superclass 
> (BasicURLNormalizer) once in most cases, instead of twice as the stock 
> version did. 
> 2) Support for expensive host-specific normalizations with good performance. 
> Each  block optionally takes a list of hosts to which to apply the 
> associated regex. If supplied, the regex will only be applied to these hosts. 
> This should have scalable performance; the comparison is O(1) regardless of 
> the number of hosts. The format is:
> 
> www.host1.com
> host2.site2.com
>  my pattern here 
>  my substitution here 
>
> 3)  Support for decoding URLs with escaped character encodings (e.g. %20, 
> etc.). This is useful, for example, to decode "jump redirects" which have the 
> target URL encoded within the source, as on Yahoo. I tried to create an 
> extensible notion of "options," the first of which is "unescape." The 
> unescape function is applied *after* the substitution and *only* if the 
> substitution pattern matches. A simple pattern to unescape Yahoo directory 
> redirects would be something like:
> 
>   ^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&]+)
>   $1
>   unescape
> 
> 4) Added the notion of iterating the pattern chain. This is useful when the 
> result of a normalization can itself be normalized. While some of this can be 
> handled in the stock version by repeating patterns, or by careful ordering of 
> patterns, the notion of iterating is cleaner and more powerful. The chain is 
> defined to iterate only when the previous iteration changes the input, up to 
> a configurable maximum number of iterations. The config parameter to change 
> is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous 
> behavior). The change is performance-neutral when disabled, and has a 
> relatively small performance cost when enabled.
> Pardon any potentially unconventional Java on my part. I've got lots of C/C++ 
> search engine experience, but Nutch is my first large Java app. I welcome any 
> feedback, and hope this is useful.
> Doug
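The iteration feature in point 4 can be sketched as below. Only the
urlnormalizer.regex.maxiterations parameter name comes from the description;
the class, the sample rule, and the fixed-point loop are illustrative:

```java
public class IteratingNormalizer {
    static final int MAX_ITERATIONS = 5; // urlnormalizer.regex.maxiterations

    // One pass of a sample rule: collapse a doubled slash in the path
    // (the lookbehind leaves the "//" of the scheme intact).
    static String normalizeOnce(String url) {
        return url.replaceAll("(?<!:)//", "/");
    }

    // Re-apply the chain until the URL stops changing, up to the cap.
    static String normalize(String url) {
        for (int i = 0; i < MAX_ITERATIONS; i++) {
            String next = normalizeOnce(url);
            if (next.equals(url)) break; // fixed point reached
            url = next;
        }
        return url;
    }

    public static void main(String[] args) {
        // "///" needs two passes to fully collapse, showing why iterating helps.
        System.out.println(normalize("http://host/a///b")); // http://host/a/b
    }
}
```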





[jira] [Resolved] (NUTCH-1278) Fetch Improvement in threads per host

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1278.
--

Resolution: Won't Fix

No follow-up from the contributor, and the proposed solution is quite invasive 
(changes at several levels).

> Fetch Improvement in threads per host
> -
>
> Key: NUTCH-1278
> URL: https://issues.apache.org/jira/browse/NUTCH-1278
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.4
>Reporter: behnam nikbakht
> Fix For: 1.9
>
> Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip
>
>
> The value of maxThreads is equal to fetcher.threads.per.host and is constant 
> for every host.
> It could instead be a dynamic per-host value influenced by the number of 
> blocked requests: if the number of blocked requests for one host increases, 
> we should decrease this value and increase http.timeout.





[jira] [Resolved] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1297.
--

Resolution: Won't Fix

NUTCH-1687 is a nicer approach + no feedback from original contributor

> it is better for fetchItemQueues to select items from greater queues first
> --
>
> Key: NUTCH-1297
> URL: https://issues.apache.org/jira/browse/NUTCH-1297
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.4
>Reporter: behnam nikbakht
>  Labels: fetch_queues
> Fix For: 1.9
>
> Attachments: NUTCH-1297.patch
>
>
> If we have multiple hosts in a fetch and the hosts differ in size, large 
> hosts face a long delay until getFetchItem() in the FetchItemQueues class 
> selects a URL from them, so we could give them more priority.
> For example, with 10 URLs from host1, 1000 URLs from host2 and 5 threads: if 
> all threads first select from host1, the fetch takes longer than if the 
> threads first select from host2 and only fall back to host1 while host2 is 
> busy.
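The suggested priority could be sketched as "serve the host with the most
queued URLs first". This is a hypothetical illustration, not the attached
NUTCH-1297.patch:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LargestFirst {
    // Pick the queue id with the largest backlog.
    static String pick(Map<String, Integer> queueSizes) {
        return Collections.max(queueSizes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Integer> queueSizes = new HashMap<>();
        queueSizes.put("host1", 10);
        queueSizes.put("host2", 1000);
        System.out.println(pick(queueSizes)); // host2
    }
}
```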





[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-490:


Component/s: (was: fetcher)
 parser

> Extension point with filters for Neko HTML parser (with patch)
> --
>
> Key: NUTCH-490
> URL: https://issues.apache.org/jira/browse/NUTCH-490
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.9.0
> Environment: Any
>Reporter: Marcin Okraszewski
>Priority: Minor
> Fix For: 1.9
>
> Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, 
> nutch-extensionpoins_plugin.xml.diff
>
>
> In my project I need to set filters for the Neko HTML parser. So instead of 
> hard-coding them, I made an extension point to define filters for Neko. I 
> was following the code for the HtmlParser filters. In fact, I think the 
> method to get filters could be generalized to handle both cases, but I 
> didn't want to make too big a mess.
> The attached patch is for Nutch 0.9. This part of the code wasn't changed in 
> trunk, so it should apply easily.
> BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an 
> extension point itself. Now there are options for Neko and TagSoup, but if 
> someone would like to use something else or give the parser different 
> settings, they would need to modify the HtmlParser class instead of 
> replacing a plugin.





[jira] [Resolved] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-385.
-

Resolution: Not a Problem

This is not a problem but a discussion of how things work in the Fetcher. No 
action needed.

> Server delay feature conflicts with maxThreadsPerHost
> -
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Chris Schneider
>
> For some time I've been puzzled by the interaction between two parameters 
> that control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our 
> processing of the robots.txt file, and which can be limited by 
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than 
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
> continuously. In other words, it never tries to point 3 at the host, and it 
> always points a second thread at the host before the first thread finishes 
> accessing it. Since HttpBase.unblockAddr never gets called with 
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. Thus, the server delay will never be used at all. The fetcher will be 
> continuously retrieving pages from the host, often with 2 fetchers accessing 
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to 
> complete before it gets around to pointing another thread at the target host. 
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
> host. This, in turn, will prevent any threads from accessing this host until 
> the delay is complete, even though zero threads are currently accessing the 
> host.
> I see this behavior as inconsistent. More importantly, the current 
> implementation certainly doesn't seem to answer my original question about 
> appropriate definitions for what appear to be conflicting parameters. 
> In a nutshell, how could we possibly honor the server delay if we allow more 
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this 
> trumped the server delay, causing the latter to be ignored completely. That 
> is certainly not the case in the current implementation, as it will wait for 
> server delay whenever the number of threads accessing a given host drops to 
> zero.
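A minimal model of the behavior described above makes the inconsistency
concrete. This is an illustrative sketch, not the actual HttpBase code: the
crawl delay is only recorded when the last thread leaves the host, so it is
skipped entirely while threads overlap.

```java
import java.util.HashMap;
import java.util.Map;

public class DelayModel {
    static final Map<String, Integer> THREADS_PER_HOST_COUNT = new HashMap<>();
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    static final long CRAWL_DELAY = 5_000; // ms, from robots.txt

    // A thread starts accessing the host.
    static void block(String host) {
        THREADS_PER_HOST_COUNT.merge(host, 1, Integer::sum);
    }

    // A thread finishes; the delay is only recorded when the count hits zero.
    static void unblock(String host, long now) {
        int n = THREADS_PER_HOST_COUNT.merge(host, -1, Integer::sum);
        if (n == 0) {
            THREADS_PER_HOST_COUNT.remove(host);
            BLOCKED_ADDR_TO_TIME.put(host, now + CRAWL_DELAY);
        }
    }

    public static void main(String[] args) {
        block("h"); block("h");      // two overlapping threads
        unblock("h", 1_000);         // count drops to 1: no delay recorded
        System.out.println(BLOCKED_ADDR_TO_TIME.containsKey("h")); // false
        unblock("h", 2_000);         // last thread leaves: delay recorded
        System.out.println(BLOCKED_ADDR_TO_TIME.get("h"));         // 7000
    }
}
```

As long as a second thread always overlaps the first, unblock() never sees the
count reach zero and the server delay is never honored, which is exactly the
first scenario in the description.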





[jira] [Resolved] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1747.
--

Resolution: Fixed

Committed revision 1585196.


> Use AtomicInteger as semaphore in Fetcher
> -
>
> Key: NUTCH-1747
> URL: https://issues.apache.org/jira/browse/NUTCH-1747
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-1747-trunk.patch
>
>
> In Fetcher we currently use 
> Set<FetchItem> inProgress = Collections.synchronizedSet(new 
> HashSet<FetchItem>());
> as semaphore within the FetchItemQueues to keep track of the URLs being 
> fetched and prevent threads from pulling from them. It works fine but we 
> could use AtomicIntegers instead as all we need is the counts, not the 
> contents.
> This change would have little impact on the performance but would make the 
> code a bit cleaner.





[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961211#comment-13961211
 ] 

Sebastian Nagel commented on NUTCH-1747:


+1
Looks like inProgress was intended to hold more than the bare count of 
FetchItems in progress. In doubt, we can get the in-progress FetchItems and 
their associated queue from FetcherThreads (cf. NUTCH-1182).


> Use AtomicInteger as semaphore in Fetcher
> -
>
> Key: NUTCH-1747
> URL: https://issues.apache.org/jira/browse/NUTCH-1747
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-1747-trunk.patch
>
>
> In Fetcher we currently use 
> Set<FetchItem> inProgress = Collections.synchronizedSet(new 
> HashSet<FetchItem>());
> as semaphore within the FetchItemQueues to keep track of the URLs being 
> fetched and prevent threads from pulling from them. It works fine but we 
> could use AtomicIntegers instead as all we need is the counts, not the 
> contents.
> This change would have little impact on the performance but would make the 
> code a bit cleaner.





[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1182:
---

Attachment: NUTCH-1182-trunk-v1.patch

From time to time this problem is reported by users 
([2013|http://mail-archives.apache.org/mod_mbox/nutch-user/201304.mbox/%3ccajvbnigoqjl2hbuhv0gdbcjea2xzxhabqrsbpjaqtmfldkw...@mail.gmail.com%3E], 
[2012a|http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads], 
[2012b|http://stackoverflow.com/questions/12181249/nutch-crawl-fails-when-run-as-a-background-process-on-linux], 
[2011|http://lucene.472066.n3.nabble.com/Nutch-1-2-fetcher-aborting-with-N-hung-threads-td2411724.html]). 
Shutting down hung threads is hard to implement (cf. NUTCH-1387). But logging 
the URLs which cause threads to hang would definitely help in many situations. 
Patch attached.
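The logging idea could be sketched as follows. This is hypothetical code, not
the attached NUTCH-1182-trunk-v1.patch: each fetcher thread registers the URL
it is working on, so when the fetcher aborts with hung threads, the URLs still
in flight can be reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetchActivity {
    // thread name -> current URL, and thread name -> fetch start time (ms)
    private final Map<String, String> currentUrl = new ConcurrentHashMap<>();
    private final Map<String, Long> startedAt = new ConcurrentHashMap<>();

    void beginFetch(String thread, String url, long now) {
        currentUrl.put(thread, url);
        startedAt.put(thread, now);
    }

    void endFetch(String thread) {
        currentUrl.remove(thread);
        startedAt.remove(thread);
    }

    // URLs in flight longer than timeoutMs: the likely hang causes to log.
    List<String> hungUrls(long now, long timeoutMs) {
        List<String> hung = new ArrayList<>();
        for (Map.Entry<String, Long> e : startedAt.entrySet()) {
            if (now - e.getValue() > timeoutMs) hung.add(currentUrl.get(e.getKey()));
        }
        return hung;
    }

    public static void main(String[] args) {
        FetchActivity a = new FetchActivity();
        a.beginFetch("t1", "http://slow/xyz.pdf", 0);
        a.beginFetch("t2", "http://fast/page.html", 9_000);
        a.endFetch("t2");
        System.out.println(a.hungUrls(20_000, 10_000)); // [http://slow/xyz.pdf]
    }
}
```

Logging these URLs makes it possible to exclude or retry them, instead of
hitting the same hung thread crawl after crawl.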

> fetcher should track and shut down hung threads
> ---
>
> Key: NUTCH-1182
> URL: https://issues.apache.org/jira/browse/NUTCH-1182
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.3, 1.4
> Environment: Linux, local job runner
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.9
>
> Attachments: NUTCH-1182-trunk-v1.patch
>
>
> While crawling a slow server hosting a couple of very large PDF documents 
> (30 MB), after some time and a bulk of successfully fetched documents the 
> fetcher stops with the message: ??Aborting with 10 hung threads.??
> From now on every cycle ends with hung threads, almost no documents are 
> fetched
> successfully. In addition, strange hadoop errors are logged:
> {noformat}
>fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
> at java.lang.System.arraycopy(Native Method)
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> ...
> {noformat}
> or
> {noformat}
>Exception in thread "QueueFeeder" java.lang.NullPointerException
>  at 
> org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
>  at 
> org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
> {noformat}
> I've run the debugger and found:
> # after the "hung threads" are reported the fetcher stops but the threads are 
> still alive and continue fetching a document. In consequence, this will
> #* limit the small bandwidth of network/server even more
> #* after the document is fetched the thread tries to write the content via 
> {{output.collect()}} which must fail because the fetcher map job is already 
> finished and the associated temporary mapred directory is deleted. The error 
> message may get mixed with the progress output of the next fetch cycle 
> causing additional confusion.
> # documents/URLs causing the hung thread are never reported nor stored. That 
> is, it's hard to track them down, and they will cause a hung thread again and 
> again.
> The problem is reproducible when fetching bigger documents and setting 
> {{mapred.task.timeout}} to a low value (this will definitely cause hung 
> threads).





[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1182:
---

Fix Version/s: 1.9

> fetcher should track and shut down hung threads
> ---
>
> Key: NUTCH-1182
> URL: https://issues.apache.org/jira/browse/NUTCH-1182
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.3, 1.4
> Environment: Linux, local job runner
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.9
>
>
> While crawling a slow server hosting a couple of very large PDF documents 
> (30 MB), after some time and a bulk of successfully fetched documents the 
> fetcher stops with the message: ??Aborting with 10 hung threads.??
> From now on every cycle ends with hung threads, almost no documents are 
> fetched
> successfully. In addition, strange hadoop errors are logged:
> {noformat}
>fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
> at java.lang.System.arraycopy(Native Method)
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
> ...
> {noformat}
> or
> {noformat}
>Exception in thread "QueueFeeder" java.lang.NullPointerException
>  at 
> org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
>  at 
> org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
> {noformat}
> I've run the debugger and found:
> # after the "hung threads" are reported the fetcher stops but the threads are 
> still alive and continue fetching a document. In consequence, this will
> #* limit the small bandwidth of network/server even more
> #* after the document is fetched the thread tries to write the content via 
> {{output.collect()}} which must fail because the fetcher map job is already 
> finished and the associated temporary mapred directory is deleted. The error 
> message may get mixed with the progress output of the next fetch cycle 
> causing additional confusion.
> # documents/URLs causing the hung thread are never reported nor stored. That 
> is, it's hard to track them down, and they will cause a hung thread again and 
> again.
> The problem is reproducible when fetching bigger documents and setting 
> {{mapred.task.timeout}} to a low value (this will definitely cause hung 
> threads).





[jira] [Commented] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961185#comment-13961185
 ] 

Hudson commented on NUTCH-1735:
---

SUCCESS: Integrated in Nutch-trunk #2591 (See 
[https://builds.apache.org/job/Nutch-trunk/2591/])
NUTCH-1735 code dedup fetcher queue redirects (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1585144)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


> code dedup fetcher queue redirects
> --
>
> Key: NUTCH-1735
> URL: https://issues.apache.org/jira/browse/NUTCH-1735
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.7
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: NUTCH-1735.patch
>
>
> 20 lines of duplicated code in Fetcher when a new FetchItem is created for a 
> redirect and queued.





[jira] [Resolved] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1735.


Resolution: Fixed

Committed to trunk r1585144.

> code dedup fetcher queue redirects
> --
>
> Key: NUTCH-1735
> URL: https://issues.apache.org/jira/browse/NUTCH-1735
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.7
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: NUTCH-1735.patch
>
>
> 20 lines of duplicated code in Fetcher when a new FetchItem is created for a 
> redirect and queued.





[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2014-04-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961009#comment-13961009
 ] 

Julien Nioche commented on NUTCH-1687:
--

I like the idea but am a bit concerned by the potential impact of:

it = Iterables.cycle(queues.keySet()).iterator();

whenever a new FetchItemQueue is added. It will be called a lot at the 
beginning of a fetch, when we create most of the queues, and we'd create loads 
of iterators that would be discarded straight away.

What about doing this lazily and triggering the generation of a new iterator 
only when getFetchItem() is called and at least one FetchItemQueue has been 
added?

I agree that in the middle of a fetch queues don't get added often compared to 
calls to getFetchItem(), so not having to create an iterator there, as we 
currently do, would definitely be a plus.

In extreme cases, when there is a large diversity of hostnames / domains 
within a fetchlist, we could end up creating a new iterator for every new URL 
and always starting at the first queue anyway, which is what we currently do, 
so the new approach would not be worse.

What do you think?

Also, why not use Iterators.cycle() directly?

Thanks
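The lazy variant proposed above could look roughly like this. The names are
hypothetical, and a plain-Java stand-in replaces Guava's
Iterables.cycle().iterator() so the sketch is self-contained: the cycling
iterator is only rebuilt inside getFetchItem(), and only when a queue was
added since the last call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinQueues {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private Iterator<String> cycle;  // cycles over queue ids
    private boolean dirty = true;    // set when a queue may have been added

    public void addUrl(String queueId, String url) {
        queues.computeIfAbsent(queueId, k -> new ArrayDeque<>()).add(url);
        dirty = true; // rebuild the cycle lazily, on the next getFetchItem()
    }

    public String getFetchItem() {
        if (dirty) {                 // lazy rebuild: here, not on every add
            cycle = cycleOf(queues.keySet());
            dirty = false;
        }
        for (int i = 0; i < queues.size(); i++) {
            Deque<String> q = queues.get(cycle.next());
            if (!q.isEmpty()) return q.poll();
        }
        return null;                 // all queues empty
    }

    // Minimal stand-in for Guava's Iterables.cycle(keys).iterator()
    private static Iterator<String> cycleOf(Collection<String> keys) {
        List<String> snapshot = new ArrayList<>(keys);
        return new Iterator<String>() {
            int i = 0;
            public boolean hasNext() { return !snapshot.isEmpty(); }
            public String next() {
                String k = snapshot.get(i);
                i = (i + 1) % snapshot.size();
                return k;
            }
        };
    }

    public static void main(String[] args) {
        RoundRobinQueues q = new RoundRobinQueues();
        q.addUrl("host1", "h1-a");
        q.addUrl("host1", "h1-b");
        q.addUrl("host2", "h2-a");
        System.out.println(q.getFetchItem()); // h1-a
        System.out.println(q.getFetchItem()); // h2-a (round robin, not host1)
        System.out.println(q.getFetchItem()); // h1-b
    }
}
```

When queues are added mid-fetch the cursor resets to the head, which matches
the "not worse than today" argument in the comment above.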

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from starting at the head of 
> the queues list, so queues at the start of the list have a better chance of 
> being picked first. That can cause a long-tail problem: at the end, only a 
> few queues with many URLs remain available.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); ==> always resets the search to the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues round robin: that reduces the time to 
> find an available queue, picks every queue in turn, and if we use topN 
> during generation there is no long-tail queue at the end.





[jira] [Commented] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961002#comment-13961002
 ] 

Julien Nioche commented on NUTCH-1735:
--

+1 Nice to simplify the code of the Fetcher

> code dedup fetcher queue redirects
> --
>
> Key: NUTCH-1735
> URL: https://issues.apache.org/jira/browse/NUTCH-1735
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.7
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: NUTCH-1735.patch
>
>
> 20 lines of duplicated code in Fetcher when a new FetchItem is created for a 
> redirect and queued.





[jira] [Assigned] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-207:
---

Assignee: Julien Nioche

Will see if I can port this patch to the current version of the Fetcher

> Bandwidth target for fetcher rather than a thread count
> ---
>
> Key: NUTCH-207
> URL: https://issues.apache.org/jira/browse/NUTCH-207
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Rod Taylor
>Assignee: Julien Nioche
> Fix For: 1.9
>
> Attachments: ratelimit.patch
>
>
> Increases or decreases the number of threads from the starting value 
> (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve 
> a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when 
> large numbers of errors are found or when a run of large pages is 
> encountered.
> To achieve more accurate tracking Nutch should keep track of protocol 
> overhead as well as the volume of pages downloaded.
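The adjustment loop described above can be sketched as a simple controller.
Only the property names come from the description; the thresholds and logic
here are illustrative, not the attached ratelimit.patch:

```java
public class BandwidthController {
    // Grow or shrink the thread count toward the bandwidth target
    // (fetcher.threads.bandwidth), bounded by fetcher.threads.maximum.
    static int adjust(int threads, int maxThreads,
                      double measuredKbps, double targetKbps) {
        if (measuredKbps < 0.9 * targetKbps && threads < maxThreads)
            return threads + 1; // under target: add a thread
        if (measuredKbps > 1.1 * targetKbps && threads > 1)
            return threads - 1; // over target: drop a thread
        return threads;         // within 10% of target: hold steady
    }

    public static void main(String[] args) {
        System.out.println(adjust(10, 50, 400, 1000));  // 11
        System.out.println(adjust(10, 50, 1300, 1000)); // 9
        System.out.println(adjust(10, 50, 1000, 1000)); // 10
    }
}
```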





[jira] [Updated] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1747:
-

Attachment: NUTCH-1747-trunk.patch

> Use AtomicInteger as semaphore in Fetcher
> -
>
> Key: NUTCH-1747
> URL: https://issues.apache.org/jira/browse/NUTCH-1747
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-1747-trunk.patch
>
>
> In Fetcher we currently use 
> Set  inProgress = Collections.synchronizedSet(new 
> HashSet());
> as semaphore within the FetchItemQueues to keep track of the URLs being 
> fetched and prevent threads from pulling from them. It works fine but we 
> could use AtomicIntegers instead as all we need is the counts, not the 
> contents.
> This change would have little impact on the performance but would make the 
> code a bit cleaner.





[jira] [Created] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1747:


 Summary: Use AtomicInteger as semaphore in Fetcher
 Key: NUTCH-1747
 URL: https://issues.apache.org/jira/browse/NUTCH-1747
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.8
Reporter: Julien Nioche
Priority: Minor


In Fetcher we currently use 

Set<FetchItem> inProgress = Collections.synchronizedSet(new 
HashSet<FetchItem>());

as semaphore within the FetchItemQueues to keep track of the URLs being fetched 
and prevent threads from pulling from them. It works fine but we could use 
AtomicIntegers instead as all we need is the counts, not the contents.

This change would have little impact on the performance but would make the code 
a bit cleaner.
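The proposed change can be sketched as follows. The names are hypothetical,
not the attached NUTCH-1747-trunk.patch: since the set was only consulted for
its size, a bounded AtomicInteger counter carries the same information.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InProgressCounter {
    private final AtomicInteger inProgress = new AtomicInteger();
    private final int maxThreads;

    InProgressCounter(int maxThreads) { this.maxThreads = maxThreads; }

    // Reserve a slot for one more in-flight fetch from this queue.
    boolean tryAcquire() {
        int cur;
        do {
            cur = inProgress.get();
            if (cur >= maxThreads) return false; // queue saturated
        } while (!inProgress.compareAndSet(cur, cur + 1)); // atomic check+bump
        return true;
    }

    // Release the slot once the fetch has completed.
    void release() { inProgress.decrementAndGet(); }

    public static void main(String[] args) {
        InProgressCounter c = new InProgressCounter(2);
        System.out.println(c.tryAcquire()); // true
        System.out.println(c.tryAcquire()); // true
        System.out.println(c.tryAcquire()); // false: both slots taken
        c.release();
        System.out.println(c.tryAcquire()); // true again
    }
}
```

The compare-and-set loop keeps the check-and-increment atomic without the
lock contention of the synchronized set.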


