Re: upgrading protocol-httpclient to httpclient 4.1.1
Alright. I'll look into it. Thanks!

On Sat, Apr 5, 2014 at 12:39 AM, Sebastian Nagel wastl.na...@googlemail.com wrote:

Define 'addressing'. :-) I didn't refactor because I don't really know which direction will be the right direction for that plugin. So in a way the plugin is still the same. All I did was change all the API calls to httpclient 4.1.1 and check that the tests still run (it wasn't as easy as it sounds. :-P )

That's at least something. Unfortunately, I never had a closer look at the httpclient plugin, and cannot estimate what level of rewriting is required.

So what you are saying is that I can make protocol-httpclient use the latest 4.3.x version without breaking anything?

Yes, it should be possible. It happens quite often that different versions of a lib are used.

So what do you say? Should I redo it with 4.2.6? Go straight for 4.3.x? I would like to be able to provide a patch for 2.2.1 users and trunk users, considering I'm a 2.2.1 user myself. What would be the correct approach?

Go straight for 4.3.x and don't depend indirectly on the Solr version.

What exactly do I need to change and where?

src/plugin/protocol-httpclient/ivy.xml - add as dependency
src/plugin/protocol-httpclient/plugin.xml - add as library - add also transitive dependencies
The best way is to have a look at another plugin, e.g., indexer-elastic

Will I still be able to use Eclipse or will it break because Eclipse won't know how to provide the correct dependency?

You have to update the dependencies:
- if you use IvyDE: add the ivy.xml as an IvyDE lib to the Java build path
- if ant eclipse: change ivy.xml, close the Eclipse project, call ant eclipse, open the project again and press F5 (Refresh)

Sebastian

On 04/04/2014 10:56 PM, d_k wrote: On Fri, Apr 4, 2014 at 11:28 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, does it mean you are (also) addressing NUTCH-1086? Would be great, since this issue has been waiting for a solution for a long time! Define 'addressing'.
:-) I didn't refactor because I don't really know which direction will be the right direction for that plugin. So in a way the plugin is still the same. All I did was change all the API calls to httpclient 4.1.1 and check that the tests still run (it wasn't as easy as it sounds. :-P ) The reason I picked version 4.1.1 and not the latest is that I noticed it is already in the build/lib dir and I wasn't sure I could use two versions of the jar with the same namespace without creating conflicts.

You should be able to use any version of httpclient, but it must be registered as a dependency in the plugin's ivy.xml (src/plugin/protocol-httpclient/ivy.xml), not in the main ivy/ivy.xml.

Actually I didn't change any ivy xml. I just changed the code to use the new imports and it must have picked up the dependencies by itself. I used Eclipse, so maybe it has something to do with it.

Each plugin gets its own class loader to solve the problem of conflicting dependencies, see https://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading

So what you are saying is that I can make protocol-httpclient use the latest 4.3.x version without breaking anything? What exactly do I need to change and where? Will I still be able to use Eclipse or will it break because Eclipse won't know how to provide the correct dependency?

I didn't check 2.2.1, but in the head of 2.x httpclient 4.2.6 is a dependency of a dependency (solrj) of the indexer-solr plugin. The upgrade was done with NUTCH-1568.

So what do you say? Should I redo it with 4.2.6? Go straight for 4.3.x? I would like to be able to provide a patch for 2.2.1 users and trunk users, considering I'm a 2.2.1 user myself. What would be the correct approach?
Sebastian On 04/04/2014 04:14 PM, d_k wrote:

I've written a patch for the 2.2.1 source code that upgrades protocol-httpclient to httpclient 4.1.1. Unfortunately I had to adjust the test because currently httpclient 4.1.1 does not support authenticating with different credentials against different realms in the same domain: HTTPCLIENT-1490 https://issues.apache.org/jira/browse/HTTPCLIENT-1490 .

The reason I picked version 4.1.1 and not the latest is that I noticed it is already in the build/lib dir and I wasn't sure I could use two versions of the jar with the same namespace without creating conflicts.

My questions are: 1) Does anyone need this patch, or did I take the wrong path in choosing 4.1.1? 2) If so, under what JIRA issue should I submit it? NUTCH-751? NUTCH-1086? something else? a new issue?
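Sebastian's two-file checklist above (register the jar in the plugin's ivy.xml, export it and its transitive dependencies in plugin.xml) could look roughly like the following sketch. This is illustrative only: the 4.3.1 revision and the exact jar names/versions are assumptions and must match whatever Ivy actually retrieves.

```xml
<!-- src/plugin/protocol-httpclient/ivy.xml : add inside <dependencies> -->
<dependency org="org.apache.httpcomponents" name="httpclient" rev="4.3.1"
            conf="*->default"/>

<!-- src/plugin/protocol-httpclient/plugin.xml : list the jar and its
     transitive dependencies inside the <runtime> element -->
<library name="httpclient-4.3.1.jar"/>
<library name="httpcore-4.3.jar"/>
<library name="commons-logging-1.1.3.jar"/>
```

As Sebastian suggests, comparing against an existing plugin that already declares external dependencies is the safest way to get the exact layout right.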
[jira] [Created] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher
Julien Nioche created NUTCH-1747: Summary: Use AtomicInteger as semaphore in Fetcher Key: NUTCH-1747 URL: https://issues.apache.org/jira/browse/NUTCH-1747 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.8 Reporter: Julien Nioche Priority: Minor In Fetcher we currently use Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>()); as semaphore within the FetchItemQueues to keep track of the URLs being fetched and prevent threads from pulling from them. It works fine but we could use AtomicIntegers instead as all we need is the counts, not the contents. This change would have little impact on the performance but would make the code a bit cleaner. -- This message was sent by Atlassian JIRA (v6.2#6252)
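The substitution proposed in the issue could look like the following minimal sketch. This is not the attached NUTCH-1747-trunk.patch; the class and method names are hypothetical, and only the counter idea is taken from the issue: since the set is used purely for its size, an AtomicInteger can gate how many items are in flight per queue.

```java
// Hypothetical sketch of the NUTCH-1747 idea: replace a synchronized
// Set<FetchItem> that is only ever consulted for its size with a plain
// AtomicInteger counter.
import java.util.concurrent.atomic.AtomicInteger;

public class FetchItemQueueSketch {
    // Before: Set<FetchItem> inProgress =
    //     Collections.synchronizedSet(new HashSet<FetchItem>());
    // After: only the count matters.
    private final AtomicInteger inProgress = new AtomicInteger();
    private final int maxThreads;

    public FetchItemQueueSketch(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A thread may pull from this queue only while fewer than
    // maxThreads fetches are in progress.
    public boolean tryStartFetch() {
        if (inProgress.get() >= maxThreads) return false;
        inProgress.incrementAndGet();
        return true;
    }

    public void finishFetch() {
        inProgress.decrementAndGet();
    }

    public int getInProgressCount() {
        return inProgress.get();
    }
}
```

The trade-off named in Sebastian's later comment applies: with a bare counter the queue no longer knows *which* URLs are in progress, so that information has to live elsewhere (e.g. in the fetcher threads).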
[jira] [Updated] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1747: - Attachment: NUTCH-1747-trunk.patch Use AtomicInteger as semaphore in Fetcher - Key: NUTCH-1747 URL: https://issues.apache.org/jira/browse/NUTCH-1747 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.8 Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-1747-trunk.patch In Fetcher we currently use Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>()); as semaphore within the FetchItemQueues to keep track of the URLs being fetched and prevent threads from pulling from them. It works fine but we could use AtomicIntegers instead as all we need is the counts, not the contents. This change would have little impact on the performance but would make the code a bit cleaner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-207: --- Assignee: Julien Nioche Will see if I can port this patch to the current version of the Fetcher Bandwidth target for fetcher rather than a thread count --- Key: NUTCH-207 URL: https://issues.apache.org/jira/browse/NUTCH-207 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8 Reporter: Rod Taylor Assignee: Julien Nioche Fix For: 1.9 Attachments: ratelimit.patch Increases or decreases the number of threads from the starting value (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve a target bandwidth (fetcher.threads.bandwidth). It seems to be able to keep within 10% of the target bandwidth even when large numbers of errors are found or when a number of large pages is run across. To achieve more accurate tracking Nutch should keep track of protocol overhead as well as the volume of pages downloaded. -- This message was sent by Atlassian JIRA (v6.2#6252)
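The control loop described in NUTCH-207 (grow the thread count from fetcher.threads.fetch up to fetcher.threads.maximum to hit fetcher.threads.bandwidth) can be sketched as below. This is not the attached ratelimit.patch; the class name and the one-thread-per-adjustment policy are assumptions made for illustration.

```java
// Hypothetical sketch of the NUTCH-207 idea: nudge the fetcher thread
// count toward a target bandwidth, one thread per measurement interval.
public class BandwidthTargetSketch {
    public static int adjustThreads(int current, int max,
                                    double measuredKbps, double targetKbps) {
        // Below target: add a thread if the maximum allows it.
        if (measuredKbps < targetKbps && current < max) return current + 1;
        // Above target: remove a thread, but never drop below one.
        if (measuredKbps > targetKbps && current > 1) return current - 1;
        return current;
    }
}
```

As the issue notes, accuracy ultimately depends on what is measured: counting only page bytes ignores protocol overhead, so the measured rate should include headers and failed fetches where possible.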
[jira] [Commented] (NUTCH-1735) code dedup fetcher queue redirects
[ https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961002#comment-13961002 ] Julien Nioche commented on NUTCH-1735: -- +1 Nice to simplify the code of the Fetcher code dedup fetcher queue redirects -- Key: NUTCH-1735 URL: https://issues.apache.org/jira/browse/NUTCH-1735 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.9 Attachments: NUTCH-1735.patch 20 lines of duplicated code in Fetcher when a new FetchItem is created for a redirect and queued. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961009#comment-13961009 ] Julien Nioche commented on NUTCH-1687: -- I like the idea but am a bit concerned by the potential impact of: it = Iterables.cycle(queues.keySet()).iterator(); whenever a new FetchItemQueue is added. It will be called a lot at the beginning of a fetch, when we create most of the queues, and we'd create loads of iterators that would be overridden straight away. What about doing this lazily and triggering the generation of a new iterator only if getFetchItem() is called and at least one FetchItemQueue has been added? I agree that in the middle of a fetch queues don't get added often compared to calls to getFetchItem(), so not having to create an iterator there as we currently do would definitely be a plus. In extreme cases, when there is a large diversity of hostnames / domains within a fetchlist, we could end up creating a new iterator for every new URL and would always start at the first one anyway, which is what we currently do, so the new approach would not be worse in any case. What do you think? Also, why not use Iterators.cycle() directly? Thanks Pick queue in Round Robin - Key: NUTCH-1687 URL: https://issues.apache.org/jira/browse/NUTCH-1687 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Tien Nguyen Manh Priority: Minor Fix For: 1.9 Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch Currently we choose the queue to pick a URL from starting at the head of the queues list, so queues at the head of the list have a greater chance of being picked first. That can cause the problem of a long tail of queues: at the end of the fetch only a few queues, each with many URLs, remain available.
public synchronized FetchItem getFetchItem() { final Iterator<Map.Entry<String, FetchItemQueue>> it = queues.entrySet().iterator(); <== always resets to find a queue from the start while (it.hasNext()) { I think it is better to pick the queue in round robin. That would reduce the time to find an available queue, make all queues get picked in round-robin order, and if we use TopN during the generate step there would be no long tail of queues at the end. -- This message was sent by Atlassian JIRA (v6.2#6252)
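Julien's "lazy rebuild" suggestion above can be sketched without Guava by keeping a snapshot of the queue ids and a round-robin cursor, rebuilding the snapshot only when getFetchItem() is called after a queue was added. This is a hypothetical illustration, not NUTCH-1687.patch; the class, String-based items, and field names are invented for the sketch.

```java
// Hypothetical sketch of lazy round-robin queue selection: the cycling
// order is rebuilt on demand in getFetchItem(), never in addUrl(), so
// adding thousands of queues at the start of a fetch is cheap.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class RoundRobinQueuesSketch {
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    private List<String> order = new ArrayList<>(); // snapshot used for cycling
    private int next = 0;                           // round-robin cursor
    private boolean changed = false;                // queue added since last rebuild

    public synchronized void addUrl(String queueId, String url) {
        Queue<String> q = queues.get(queueId);
        if (q == null) {
            q = new ArrayDeque<>();
            queues.put(queueId, q);
            changed = true; // don't rebuild the cycle here; do it lazily
        }
        q.add(url);
    }

    public synchronized String getFetchItem() {
        if (changed) { // rebuild the cycling order only on demand
            order = new ArrayList<>(queues.keySet());
            if (!order.isEmpty()) next %= order.size();
            changed = false;
        }
        for (int i = 0; i < order.size(); i++) {
            Queue<String> q = queues.get(order.get(next));
            next = (next + 1) % order.size(); // advance cursor past this queue
            if (!q.isEmpty()) return q.poll();
        }
        return null; // every queue is empty
    }
}
```

Because the cursor survives across calls, successive calls walk the queues in rotation instead of always rescanning from the head, which is exactly the long-tail problem the issue describes.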
[jira] [Resolved] (NUTCH-1735) code dedup fetcher queue redirects
[ https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1735. Resolution: Fixed Committed to trunk r1585144. code dedup fetcher queue redirects -- Key: NUTCH-1735 URL: https://issues.apache.org/jira/browse/NUTCH-1735 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.9 Attachments: NUTCH-1735.patch 20 lines of duplicated code in Fetcher when a new FetchItem is created for a redirect and queued. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1735) code dedup fetcher queue redirects
[ https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961185#comment-13961185 ] Hudson commented on NUTCH-1735: --- SUCCESS: Integrated in Nutch-trunk #2591 (See [https://builds.apache.org/job/Nutch-trunk/2591/]) NUTCH-1735 code dedup fetcher queue redirects (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1585144) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java code dedup fetcher queue redirects -- Key: NUTCH-1735 URL: https://issues.apache.org/jira/browse/NUTCH-1735 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.9 Attachments: NUTCH-1735.patch 20 lines of duplicated code in Fetcher when a new FetchItem is created for a redirect and queued. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1182: --- Fix Version/s: 1.9 fetcher should track and shut down hung threads --- Key: NUTCH-1182 URL: https://issues.apache.org/jira/browse/NUTCH-1182 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.3, 1.4 Environment: Linux, local job runner Reporter: Sebastian Nagel Priority: Minor Fix For: 2.4, 1.9 While crawling a slow server with a couple of very large PDF documents (30 MB) on it after some time and a bulk of successfully fetched documents the fetcher stops with the message: ??Aborting with 10 hung threads.?? From now on every cycle ends with hung threads, almost no documents are fetched successfully. In addition, strange hadoop errors are logged: {noformat} fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException at java.lang.System.arraycopy(Native Method) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108) ... {noformat} or {noformat} Exception in thread QueueFeeder java.lang.NullPointerException at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48) at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214) {noformat} I've run the debugger and found: # after the hung threads are reported the fetcher stops but the threads are still alive and continue fetching a document. In consequence, this will #* limit the small bandwidth of network/server even more #* after the document is fetched the thread tries to write the content via {{output.collect()}} which must fail because the fetcher map job is already finished and the associated temporary mapred directory is deleted. The error message may get mixed with the progress output of the next fetch cycle causing additional confusion. 
# documents/URLs causing the hung thread are never reported nor stored. That is, it's hard to track them down, and they will cause a hung thread again and again. The problem is reproducible when fetching bigger documents and setting {{mapred.task.timeout}} to a low value (this will definitely cause hung threads). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1182: --- Attachment: NUTCH-1182-trunk-v1.patch From time to time this problem is reported by users ([2013|http://mail-archives.apache.org/mod_mbox/nutch-user/201304.mbox/%3ccajvbnigoqjl2hbuhv0gdbcjea2xzxhabqrsbpjaqtmfldkw...@mail.gmail.com%3E], [2012a|http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads], [2012b|http://stackoverflow.com/questions/12181249/nutch-crawl-fails-when-run-as-a-background-process-on-linux], [2011|http://lucene.472066.n3.nabble.com/Nutch-1-2-fetcher-aborting-with-N-hung-threads-td2411724.html]). Shutting down hung threads is hard to implement (cf. NUTCH-1387). But logging the URLs which cause threads to hang would definitely help in many situations. Patch attached. fetcher should track and shut down hung threads --- Key: NUTCH-1182 URL: https://issues.apache.org/jira/browse/NUTCH-1182 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.3, 1.4 Environment: Linux, local job runner Reporter: Sebastian Nagel Priority: Minor Fix For: 2.4, 1.9 Attachments: NUTCH-1182-trunk-v1.patch While crawling a slow server with a couple of very large PDF documents (30 MB) on it after some time and a bulk of successfully fetched documents the fetcher stops with the message: ??Aborting with 10 hung threads.?? From now on every cycle ends with hung threads, almost no documents are fetched successfully. In addition, strange hadoop errors are logged: {noformat} fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException at java.lang.System.arraycopy(Native Method) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108) ... 
{noformat} or {noformat} Exception in thread QueueFeeder java.lang.NullPointerException at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48) at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214) {noformat} I've run the debugger and found: # after the hung threads are reported the fetcher stops but the threads are still alive and continue fetching a document. In consequence, this will #* limit the small bandwidth of network/server even more #* after the document is fetched the thread tries to write the content via {{output.collect()}} which must fail because the fetcher map job is already finished and the associated temporary mapred directory is deleted. The error message may get mixed with the progress output of the next fetch cycle causing additional confusion. # documents/URLs causing the hung thread are never reported nor stored. That is, it's hard to track them down, and they will cause a hung thread again and again. The problem is reproducible when fetching bigger documents and setting {{mapred.task.timeout}} to a low value (this will definitely cause hung threads). -- This message was sent by Atlassian JIRA (v6.2#6252)
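The "logging the URLs which cause threads to hang" idea from the comment above could be sketched like this. It is a hypothetical illustration, not the attached NUTCH-1182-trunk-v1.patch: each fetcher thread registers the URL it is working on, so that when the fetcher aborts with hung threads the offending URLs can be reported instead of lost.

```java
// Hypothetical sketch for NUTCH-1182: track which URL each fetcher
// thread is currently on, so hung threads can be reported by URL.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HungThreadReporterSketch {
    // Thread -> URL currently being fetched by that thread.
    private final Map<Thread, String> active = new ConcurrentHashMap<>();

    public void startFetch(String url) {
        active.put(Thread.currentThread(), url);
    }

    public void endFetch() {
        active.remove(Thread.currentThread());
    }

    // Called when the fetcher gives up ("Aborting with N hung threads"):
    // returns the URLs that still-alive threads were working on.
    public List<String> hungUrls() {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<Thread, String> e : active.entrySet()) {
            if (e.getKey().isAlive()) urls.add(e.getValue());
        }
        return urls;
    }
}
```

Logging these URLs (or marking them in the CrawlDb) would address the second point in the issue: without it, the same document hangs a thread again on every cycle and is never identified.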
[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961211#comment-13961211 ] Sebastian Nagel commented on NUTCH-1747: +1 Looks like inProgress was intended to hold more than the bare count of FetchItems in progress. In doubt, we can get the in-progress FetchItems and their associated queue from FetcherThreads (cf. NUTCH-1182). Use AtomicInteger as semaphore in Fetcher - Key: NUTCH-1747 URL: https://issues.apache.org/jira/browse/NUTCH-1747 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.8 Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-1747-trunk.patch In Fetcher we currently use Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>()); as semaphore within the FetchItemQueues to keep track of the URLs being fetched and prevent threads from pulling from them. It works fine but we could use AtomicIntegers instead as all we need is the counts, not the contents. This change would have little impact on the performance but would make the code a bit cleaner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-385. - Resolution: Not a Problem This is not a problem but a discussion of how things work in the Fetcher. No action needed. Server delay feature conflicts with maxThreadsPerHost - Key: NUTCH-385 URL: https://issues.apache.org/jira/browse/NUTCH-385 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Chris Schneider For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host: 1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay. 2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1. According to my (limited) understanding of the code in HttpBase.java: Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointing at a particular host continuously. In other words, it never tries to point 3 at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be used at all. The fetcher will be continuously retrieving pages from the host, often with 2 fetchers accessing the host simultaneously. Suppose instead that the fetcher finally does allow the last thread to complete before it gets around to pointing another thread at the target host. When the last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. 
This, in turn, will prevent any threads from accessing this host until the delay is complete, even though zero threads are currently accessing the host. I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters. In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host? It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for server delay whenever the number of threads accessing a given host drops to zero. -- This message was sent by Atlassian JIRA (v6.2#6252)
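The inconsistency Chris describes can be reproduced with a small model of the blocking logic. This is a deliberately simplified sketch, not HttpBase itself (class and method names are invented): the crawl delay is only installed when the *last* thread on a host finishes, so with threads.per.host > 1 and overlapping fetches the delay never takes effect, while a host that briefly drains to zero threads is penalized with the full delay.

```java
// Hypothetical model of the per-host blocking behavior described in
// NUTCH-385: the delay is recorded only when the thread count for a
// host drops to zero.
import java.util.HashMap;
import java.util.Map;

public class PerHostBlockingSketch {
    private final Map<String, Integer> threadsPerHost = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();
    private final int maxThreadsPerHost;
    private final long crawlDelayMs;

    public PerHostBlockingSketch(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    // A thread asks for permission to fetch from 'host' at time 'now'.
    public synchronized boolean tryBlock(String host, long now) {
        Long until = blockedUntil.get(host);
        if (until != null && now < until) return false;   // still in crawl delay
        int n = threadsPerHost.getOrDefault(host, 0);
        if (n >= maxThreadsPerHost) return false;         // thread limit reached
        threadsPerHost.put(host, n + 1);
        return true;
    }

    // A thread finished fetching from 'host' at time 'now'.
    public synchronized void unblock(String host, long now) {
        int n = threadsPerHost.getOrDefault(host, 1) - 1;
        threadsPerHost.put(host, n);
        // The delay is installed only when the LAST thread leaves the host:
        if (n == 0) blockedUntil.put(host, now + crawlDelayMs);
    }
}
```

Running the two scenarios from the issue against this model shows both behaviors: overlapping threads never trigger the delay, and a fully drained host is delayed even though nothing is fetching from it.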
[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-490: Component/s: (was: fetcher) parser Extension point with filters for Neko HTML parser (with patch) -- Key: NUTCH-490 URL: https://issues.apache.org/jira/browse/NUTCH-490 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 0.9.0 Environment: Any Reporter: Marcin Okraszewski Priority: Minor Fix For: 1.9 Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff In my project I need to set filters for the Neko HTML parser. So instead of adding it hard-coded, I made an extension point to define filters for Neko. I was following the code for the HtmlParser filters. In fact the method to get filters could, I think, be generalized to handle both cases, but I didn't want to make too big a mess. The attached patch is for Nutch 0.9. This part of the code wasn't changed in trunk, so it should apply easily. BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an extension point itself. Now there are options for Neko and TagSoup, but if someone would like to use something else, or give different settings to the parser, they would need to modify the HtmlParser class instead of replacing a plugin. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1297. -- Resolution: Won't Fix NUTCH-1687 is a nicer approach + no feedback from original contributor it is better for fetchItemQueues to select items from greater queues first -- Key: NUTCH-1297 URL: https://issues.apache.org/jira/browse/NUTCH-1297 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Labels: fetch_queues Fix For: 1.9 Attachments: NUTCH-1297.patch If we fetch from multiple hosts of different sizes, large hosts can wait a long time before getFetchItem() in the FetchItemQueues class selects a URL from them, so we could give them more priority. For example, if we have 10 URLs from host1 and 1000 URLs from host2, and have 5 threads: if all threads first select from host1, the fetch takes longer than if the threads first selected from host2 and then, when host2 was busy, selected from host1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-1278) Fetch Improvement in threads per host
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1278. -- Resolution: Won't Fix No follow-up from contributor + proposed solution quite invasive (changes at several levels) Fetch Improvement in threads per host - Key: NUTCH-1278 URL: https://issues.apache.org/jira/browse/NUTCH-1278 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.4 Reporter: behnam nikbakht Fix For: 1.9 Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host. There is a possibility of using dynamic values for each host, influenced by the number of blocked requests. This means that if the number of blocked requests for one host increases, we must decrease this value and increase http.timeout. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-827) HTTP POST Authentication
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-827: Component/s: (was: fetcher) protocol HTTP POST Authentication Key: NUTCH-827 URL: https://issues.apache.org/jira/browse/NUTCH-827 Project: Nutch Issue Type: New Feature Components: protocol Affects Versions: 1.1, nutchgora Reporter: Jasper van Veghel Priority: Minor Labels: authentication Fix For: 1.9 Attachments: http-client-form-authtication.patch, nutch-http-cookies.patch I've created a patch against the trunk which adds support for very rudimentary POST-based authentication support. It takes a link from nutch-site.xml with a site to POST to and its respective parameters (username, password, etc.). It then checks upon every request whether any cookies have been initialized, and if none have, it fetches them from the given link. This isn't perfect but Works For Me (TM) as I generally only need to retrieve results from a single domain and so have no cookie overlap (i.e. if the domain cookies expire, all cookies disappear from the HttpClient and I can simply re-fetch them). A natural improvement would be to be able to specify one particular cookie to check the expiration-date against. If anyone is interested in this beside me I'd be glad to put some more effort into making this more universally applicable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1342) Read time out protocol-http
[ https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1342: - Component/s: (was: fetcher) protocol Read time out protocol-http --- Key: NUTCH-1342 URL: https://issues.apache.org/jira/browse/NUTCH-1342 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 1.4, 1.5 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.9 Attachments: NUTCH-1342-1.6-1.patch For some reason some URL's always time out with protocol-http but not protocol-httpclient. The stack trace is always the same: {code} 2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.FilterInputStream.read(FilterInputStream.java:116) at java.io.PushbackInputStream.read(PushbackInputStream.java:169) at java.io.FilterInputStream.read(FilterInputStream.java:90) at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228) at org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:157) at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64) at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138) {code} Some example URL's: * 404 http://www.fcgroningen.nl/tribunenamen/stemmen/ * 301 http://shop.fcgroningen.nl/aanbieding -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (NUTCH-1748) despite unix systems allow abc..xyz.txt kind of urls, url validator plugin rejects.
Sertac TURKEL created NUTCH-1748: Summary: despite unix systems allow abc..xyz.txt kind of urls, url validator plugin rejects. Key: NUTCH-1748 URL: https://issues.apache.org/jira/browse/NUTCH-1748 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Sertac TURKEL Priority: Minor Fix For: 2.3 Unix systems accept file names containing two dots, e.g. abc..xyz.txt, so urlfilter-validator should not reject such URLs. Paths containing /../ or ending in /.. should still be rejected. -- This message was sent by Atlassian JIRA (v6.2#6252)
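The rule the issue asks for can be stated in a few lines. This is a hypothetical sketch of the proposed behavior, not the urlfilter-validator code: consecutive dots inside a file name are allowed, while path-traversal segments remain rejected.

```java
// Hypothetical sketch of the NUTCH-1748 rule: accept ".." inside a
// file name (abc..xyz.txt) but keep rejecting "/../" segments and a
// trailing "/..".
public class DoubleDotRuleSketch {
    public static boolean acceptPath(String path) {
        return !path.contains("/../") && !path.endsWith("/..");
    }
}
```

A fix along these lines would narrow the validator's dot check from "any repeated dots" to actual traversal segments, which is what the issue's last sentence requests.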
[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961236#comment-13961236 ] Hudson commented on NUTCH-1747: --- SUCCESS: Integrated in Nutch-trunk #2592 (See [https://builds.apache.org/job/Nutch-trunk/2592/]) NUTCH-1747 Use AtomicInteger as semaphore in Fetcher (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1585196) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Use AtomicInteger as semaphore in Fetcher - Key: NUTCH-1747 URL: https://issues.apache.org/jira/browse/NUTCH-1747 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.8 Reporter: Julien Nioche Priority: Minor Attachments: NUTCH-1747-trunk.patch In Fetcher we currently use Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>()); as semaphore within the FetchItemQueues to keep track of the URLs being fetched and prevent threads from pulling from them. It works fine but we could use AtomicIntegers instead as all we need is the counts, not the contents. This change would have little impact on the performance but would make the code a bit cleaner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump
[ https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cihad güzel updated NUTCH-1615: --- Comment: was deleted (was: I'm trying for this issue.) Implementing A Feature for Fetching From Websites Dump -- Key: NUTCH-1615 URL: https://issues.apache.org/jira/browse/NUTCH-1615 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 2.1 Reporter: cihad güzel Priority: Minor Some web sites provide dump (as like http://dumps.wikimedia.org/enwiki/ for wikipedia.org). We should fetch from dumps for such kind of web sites. Thus fetching will be quicker. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (NUTCH-1749) Title duplicated in document body
Greg Padiasek created NUTCH-1749: Summary: Title duplicated in document body Key: NUTCH-1749 URL: https://issues.apache.org/jira/browse/NUTCH-1749 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Greg Padiasek The HTML parser plugin inserts document title into document content. Since the title alone can be retrieved via DOMContentUtils.getTitle() and content is retrieved via DOMContentUtils.getText(), there is no need to duplicate title in the content. When title is included in the content it becomes difficult/impossible to extract document body without title. A need to extract document body without title is visible when user wants to index or display body and title separately. Attached is a patch which prevents including title in document content in the HTML parser plugin. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1749) Title duplicated in document body
[ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Padiasek updated NUTCH-1749: - Attachment: DOMContentUtils.patch Title duplicated in document body - Key: NUTCH-1749 URL: https://issues.apache.org/jira/browse/NUTCH-1749 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Greg Padiasek Attachments: DOMContentUtils.patch The HTML parser plugin inserts document title into document content. Since the title alone can be retrieved via DOMContentUtils.getTitle() and content is retrieved via DOMContentUtils.getText(), there is no need to duplicate title in the content. When title is included in the content it becomes difficult/impossible to extract document body without title. A need to extract document body without title is visible when user wants to index or display body and title separately. Attached is a patch which prevents including title in document content in the HTML parser plugin. -- This message was sent by Atlassian JIRA (v6.2#6252)