Re: upgrading protocol-httpclient to httpclient 4.1.1

2014-04-05 Thread d_k
Alright. I'll look into it. Thanks!


On Sat, Apr 5, 2014 at 12:39 AM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

  Define 'addressing'. :-)
  I didn't refactor because I don't really know which direction will be the
  right direction for that plugin. So in a way the plugin is still the
 same.
  All I did was to change all the API calls to httpclient 4.1.1 and check
  that the tests still run (it wasn't as easy as it sounds. :-P )

 That's at least something. Unfortunately, I never had a closer look at the
 httpclient plugin, and cannot estimate what level of rewriting is required.

  So what you are saying is that I can make the protocol-httpclient use the
  latest 4.3.x version without breaking anything?

 Yes, it should be possible. It just happens often that different plugins use
 different versions of a lib.

  So what do you say? Should I redo it with 4.2.6? Go straight for 4.3.x?
  I would like to be able to provide a patch for 2.2.1 users and trunk users,
  considering I'm a 2.2.1 user myself.
  What would be the correct approach?

 Go straight for 4.3.x and do not depend indirectly on the Solr version.

  What exactly do I need to change and where?
 src/plugin/protocol-httpclient/ivy.xml
  - add as dependency
 src/plugin/protocol-httpclient/plugin.xml
  - add as library
  - add also transitive dependencies

 The best way is to have a look at another plugin, e.g.,
 indexer-elastic
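As a rough sketch of what that registration could look like (the 4.3.x version number and the transitive jar names below are assumptions for illustration, not taken from this thread):

```xml
<!-- src/plugin/protocol-httpclient/ivy.xml: declare the dependency -->
<dependency org="org.apache.httpcomponents" name="httpclient" rev="4.3.3"
            conf="*->default"/>

<!-- src/plugin/protocol-httpclient/plugin.xml: export the jar plus its
     transitive dependencies as plugin libraries -->
<runtime>
  <library name="httpclient-4.3.3.jar"/>
  <library name="httpcore-4.3.2.jar"/>
  <library name="commons-logging-1.1.3.jar"/>
</runtime>
```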

  Will I still be able to use
  Eclipse or will it break because Eclipse won't know how to provide the
  correct dependency?

 You have to update the dependencies:
 - if you use IvyDE : add the ivy.xml as IvyDE lib to Java build path
 - if ant eclipse: change ivy.xml, close the Eclipse project,
call ant eclipse, open project again and press F5 Refresh

 Sebastian

 On 04/04/2014 10:56 PM, d_k wrote:
  On Fri, Apr 4, 2014 at 11:28 PM, Sebastian Nagel 
 wastl.na...@googlemail.com
  wrote:
 
  Hi,
 
  does it mean you are (also) addressing NUTCH-1086? Would be great,
  since this issue has been waiting for a solution for a long time!
 
 
  Define 'addressing'. :-)
  I didn't refactor because I don't really know which direction will be the
  right direction for that plugin. So in a way the plugin is still the
 same.
  All I did was to change all the API calls to httpclient 4.1.1 and check
  that the tests still run (it wasn't as easy as it sounds. :-P )
 
 
  The reason I picked version 4.1.1 and not the latest is because I
 noticed
  it is already in the build/lib dir and I wasn't sure I can use two
  versions
  of the jar with the same namespace without creating conflicts.
 
  You should be able to use any version of httpclient, but it must be
  registered as dependency in the plugin's ivy.xml
  (src/plugin/protocol-httpclient/ivy.xml),
  not in the main ivy/ivy.xml.
 
 
  Actually I didn't change any ivy xml. I just changed the code to use the
  new imports and it must have picked up the dependencies by itself. I used
  Eclipse so maybe it has something to do with it.
 
 
  Each plugin gets its own class loader to solve the problem of
 conflicting
  dependencies, see
 
 https://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading
 
 
  So what you are saying is that I can make the protocol-httpclient use the
  latest 4.3.x version without breaking anything?
  What exactly do I need to change and where? Will I still be able to use
  Eclipse or will it break because Eclipse won't know how to provide the
  correct dependency?
 
 
  I didn't check 2.2.1, but in head of 2.x httpclient 4.2.6 is a
 dependency
  of a dependency (solrj) of the indexer-solr plugin. The upgrade has been
  done
  with NUTCH-1568.
 
 
  So what do you say? Should I redo it with 4.2.6? Go straight for 4.3.x?
  I would like to be able to provide a patch for 2.2.1 users and trunk users,
  considering I'm a 2.2.1 user myself.
  What would be the correct approach?
 
 
 
  Sebastian
 
  On 04/04/2014 04:14 PM, d_k wrote:
  I've written a patch for the 2.2.1 source code that upgrades the
  protocol-httpclient to httpclient 4.1.1
 
  Unfortunately I had to adjust the test because currently httpclient
 4.1.1
  does not support authenticating with different credentials against
  different realms in the same domain:
  HTTPCLIENT-1490 (https://issues.apache.org/jira/browse/HTTPCLIENT-1490).
 
  The reason I picked version 4.1.1 and not the latest is because I
 noticed
  it is already in the build/lib dir and I wasn't sure I can use two
  versions
  of the jar with the same namespace without creating conflicts.
 
  My questions are:
  1) Does anyone need this patch, or did I take the wrong path in choosing
  4.1.1?
  2) If so, under what JIRA issue should I submit it? NUTCH-751?
  NUTCH-1086? Something else? A new issue?
 
 
 
 




[jira] [Created] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1747:


 Summary: Use AtomicInteger as semaphore in Fetcher
 Key: NUTCH-1747
 URL: https://issues.apache.org/jira/browse/NUTCH-1747
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.8
Reporter: Julien Nioche
Priority: Minor


In Fetcher we currently use 

Set<FetchItem> inProgress = Collections.synchronizedSet(new
HashSet<FetchItem>());

as semaphore within the FetchItemQueues to keep track of the URLs being fetched 
and prevent threads from pulling from them. It works fine but we could use 
AtomicIntegers instead as all we need is the counts, not the contents.

This change would have little impact on the performance but would make the code 
a bit cleaner.
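A minimal sketch of the proposed change (the class is heavily reduced; names like tryStartFetch are illustrative, not the real Fetcher API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Reduced sketch of a FetchItemQueue that tracks only how many fetches are
// in progress, as proposed, instead of a synchronized Set of FetchItems.
class FetchItemQueue {
    private final AtomicInteger inProgress = new AtomicInteger();
    private final int maxThreads;

    FetchItemQueue(int maxThreads) { this.maxThreads = maxThreads; }

    // Atomically claim a slot; the CAS loop avoids a check-then-act race.
    boolean tryStartFetch() {
        int n;
        do {
            n = inProgress.get();
            if (n >= maxThreads) return false;  // queue saturated
        } while (!inProgress.compareAndSet(n, n + 1));
        return true;
    }

    void finishFetch() { inProgress.decrementAndGet(); }

    int getInProgressSize() { return inProgress.get(); }
}
```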



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1747:
-

Attachment: NUTCH-1747-trunk.patch

 Use AtomicInteger as semaphore in Fetcher
 -

 Key: NUTCH-1747
 URL: https://issues.apache.org/jira/browse/NUTCH-1747
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.8
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-1747-trunk.patch


 In Fetcher we currently use 
 Set<FetchItem> inProgress = Collections.synchronizedSet(new
 HashSet<FetchItem>());
 as semaphore within the FetchItemQueues to keep track of the URLs being 
 fetched and prevent threads from pulling from them. It works fine but we 
 could use AtomicIntegers instead as all we need is the counts, not the 
 contents.
 This change would have little impact on the performance but would make the 
 code a bit cleaner.





[jira] [Assigned] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-207:
---

Assignee: Julien Nioche

Will see if I can port this patch to the current version of the Fetcher

 Bandwidth target for fetcher rather than a thread count
 ---

 Key: NUTCH-207
 URL: https://issues.apache.org/jira/browse/NUTCH-207
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8
Reporter: Rod Taylor
Assignee: Julien Nioche
 Fix For: 1.9

 Attachments: ratelimit.patch


 Increases or decreases the number of threads from the starting value 
 (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve 
 a target bandwidth (fetcher.threads.bandwidth).
 It seems to be able to keep within 10% of the target bandwidth even when 
 large numbers of errors are found or when a number of large pages is run 
 across.
 To achieve more accurate tracking Nutch should keep track of protocol 
 overhead as well as the volume of pages downloaded.
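The adjustment loop described above could be sketched as follows (the one-thread-per-interval step policy is an assumption; the property names are those named in the issue):

```java
// Sketch: nudge the active fetcher thread count toward a target bandwidth,
// one thread per measurement interval, within [minThreads, maxThreads].
class ThreadCountController {
    private final int minThreads;  // fetcher.threads.fetch (starting value)
    private final int maxThreads;  // fetcher.threads.maximum
    private final long targetBps;  // fetcher.threads.bandwidth

    ThreadCountController(int minThreads, int maxThreads, long targetBps) {
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
        this.targetBps = targetBps;
    }

    // Called once per measurement interval with the observed bandwidth.
    int adjust(int currentThreads, long measuredBps) {
        if (measuredBps < targetBps && currentThreads < maxThreads)
            return currentThreads + 1;  // under target: add a thread
        if (measuredBps > targetBps && currentThreads > minThreads)
            return currentThreads - 1;  // over target: remove a thread
        return currentThreads;
    }
}
```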





[jira] [Commented] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961002#comment-13961002
 ] 

Julien Nioche commented on NUTCH-1735:
--

+1 Nice to simplify the code of the Fetcher

 code dedup fetcher queue redirects
 --

 Key: NUTCH-1735
 URL: https://issues.apache.org/jira/browse/NUTCH-1735
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.9

 Attachments: NUTCH-1735.patch


 20 lines of duplicated code in Fetcher when a new FetchItem is created for a 
 redirect and queued.





[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2014-04-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961009#comment-13961009
 ] 

Julien Nioche commented on NUTCH-1687:
--

I like the idea but am a bit concerned by the potential impact of : 

it = Iterables.cycle(queues.keySet()).iterator();

whenever a new FetchItemQueue is added. It will be called a lot at the 
beginning of a fetch, when we create most of the queues, and we'd create loads 
of iterators that would be overwritten straight away.

What about doing this lazily and triggering the generation of a new iterator 
only when getFetchItem() is called and at least one FetchItemQueue has been added?

I agree that in the middle of a fetch queues don't get added often compared 
to calls to getFetchItem(), so not having to create an iterator there, as we 
currently do, would definitely be a plus.

In extreme cases, when there is a large diversity of hostnames / domains within 
a fetchlist, we could end up creating a new iterator for every new URL and would 
always start at the first one anyway, which is what we currently do, so the new 
approach would not be worse.

What do you think?

Also, why not use Iterators.cycle() directly?

Thanks
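The lazy variant suggested above might look like this sketch (the names are illustrative; the real FetchItemQueues is more involved):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: cycle over queue ids in round-robin order, but rebuild the cyclic
// iterator lazily, only once a queue has been added since the last call,
// instead of rebuilding on every addition.
class RoundRobinQueues<Q> {
    private final Map<String, Q> queues = new LinkedHashMap<>();
    private Iterator<String> it;
    private boolean dirty = true;  // set whenever the key set changes

    void addQueue(String id, Q q) {
        if (!queues.containsKey(id)) dirty = true;
        queues.put(id, q);
    }

    // Returns the id of the next queue in round-robin order, or null if empty.
    String nextQueueId() {
        if (queues.isEmpty()) return null;
        if (dirty || it == null) {          // rebuild only when needed
            it = queues.keySet().iterator();
            dirty = false;
        }
        if (!it.hasNext()) it = queues.keySet().iterator();  // wrap around
        return it.next();
    }
}
```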

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch


 Currently we choose the queue to pick a URL from starting at the head of the 
 queues list, so queues at the start of the list have a better chance of being 
 picked first. That can cause the problem of long-tail queues: only a few 
 queues, each with many URLs, remain available at the end.
   public synchronized FetchItem getFetchItem() {
     final Iterator<Map.Entry<String, FetchItemQueue>> it =
         queues.entrySet().iterator(); <== always resets to find a queue from the start
     while (it.hasNext()) {
 
 I think it is better to pick queues in round robin; that reduces the time to 
 find an available queue, makes all queues get picked in round robin, and if we 
 use TopN during generation there is no long-tail queue at the end.





[jira] [Resolved] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1735.


Resolution: Fixed

Committed to trunk r1585144.

 code dedup fetcher queue redirects
 --

 Key: NUTCH-1735
 URL: https://issues.apache.org/jira/browse/NUTCH-1735
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.9

 Attachments: NUTCH-1735.patch


 20 lines of duplicated code in Fetcher when a new FetchItem is created for a 
 redirect and queued.





[jira] [Commented] (NUTCH-1735) code dedup fetcher queue redirects

2014-04-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961185#comment-13961185
 ] 

Hudson commented on NUTCH-1735:
---

SUCCESS: Integrated in Nutch-trunk #2591 (See 
[https://builds.apache.org/job/Nutch-trunk/2591/])
NUTCH-1735 code dedup fetcher queue redirects (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1585144)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


 code dedup fetcher queue redirects
 --

 Key: NUTCH-1735
 URL: https://issues.apache.org/jira/browse/NUTCH-1735
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.9

 Attachments: NUTCH-1735.patch


 20 lines of duplicated code in Fetcher when a new FetchItem is created for a 
 redirect and queued.





[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1182:
---

Fix Version/s: 1.9

 fetcher should track and shut down hung threads
 ---

 Key: NUTCH-1182
 URL: https://issues.apache.org/jira/browse/NUTCH-1182
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3, 1.4
 Environment: Linux, local job runner
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.4, 1.9


 While crawling a slow server with a couple of very large PDF documents (30 
 MB) on it, after some time and a bulk of successfully fetched documents, the 
 fetcher stops with the message: ??Aborting with 10 hung threads.??
 From now on every cycle ends with hung threads, almost no documents are 
 fetched
 successfully. In addition, strange hadoop errors are logged:
 {noformat}
fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
 at java.lang.System.arraycopy(Native Method)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
 ...
 {noformat}
 or
 {noformat}
Exception in thread QueueFeeder java.lang.NullPointerException
  at 
 org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
  at 
 org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
  at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
 {noformat}
 I've run the debugger and found:
 # after the hung threads are reported the fetcher stops but the threads are 
 still alive and continue fetching a document. In consequence, this will
 #* limit the small bandwidth of network/server even more
 #* after the document is fetched the thread tries to write the content via 
 {{output.collect()}} which must fail because the fetcher map job is already 
 finished and the associated temporary mapred directory is deleted. The error 
 message may get mixed with the progress output of the next fetch cycle 
 causing additional confusion.
 # documents/URLs causing the hung thread are never reported nor stored. That 
 is, it's hard to track them down, and they will cause a hung thread again and 
 again.
 The problem is reproducible when fetching bigger documents and setting 
 {{mapred.task.timeout}} to a low value (this will definitely cause hung 
 threads).





[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2014-04-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1182:
---

Attachment: NUTCH-1182-trunk-v1.patch

From time to time this problem is reported by users 
([2013|http://mail-archives.apache.org/mod_mbox/nutch-user/201304.mbox/%3ccajvbnigoqjl2hbuhv0gdbcjea2xzxhabqrsbpjaqtmfldkw...@mail.gmail.com%3E],
 
[2012a|http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads],
 
[2012b|http://stackoverflow.com/questions/12181249/nutch-crawl-fails-when-run-as-a-background-process-on-linux],
 
[2011|http://lucene.472066.n3.nabble.com/Nutch-1-2-fetcher-aborting-with-N-hung-threads-td2411724.html]).
 Shutting down hung threads is hard to implement (cf. NUTCH-1387). But logging 
the URLs which cause threads to hang would definitely help in many situations. 
Patch attached.
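A rough sketch of the logging idea (the field and method names here are assumptions for illustration, not the attached patch itself):

```java
import java.util.List;

// Sketch: each fetcher thread publishes the URL it is currently working on,
// so whatever detects the hang can log the offending URLs before giving up,
// instead of losing them.
class FetcherThread extends Thread {
    private volatile String reprUrl;  // URL currently being fetched, or null

    void startFetching(String url) { reprUrl = url; }
    void doneFetching()            { reprUrl = null; }
    String getReprUrl()            { return reprUrl; }
}

class HungThreadReporter {
    // Builds a log message listing what each hung thread was fetching.
    static String report(List<FetcherThread> threads) {
        StringBuilder sb = new StringBuilder();
        for (FetcherThread t : threads) {
            if (t.getReprUrl() != null)
                sb.append("hung thread fetching ").append(t.getReprUrl()).append('\n');
        }
        return sb.toString();
    }
}
```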

 fetcher should track and shut down hung threads
 ---

 Key: NUTCH-1182
 URL: https://issues.apache.org/jira/browse/NUTCH-1182
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.3, 1.4
 Environment: Linux, local job runner
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.4, 1.9

 Attachments: NUTCH-1182-trunk-v1.patch


 While crawling a slow server with a couple of very large PDF documents (30 
 MB) on it, after some time and a bulk of successfully fetched documents, the 
 fetcher stops with the message: ??Aborting with 10 hung threads.??
 From now on every cycle ends with hung threads, almost no documents are 
 fetched
 successfully. In addition, strange hadoop errors are logged:
 {noformat}
fetch of http://.../xyz.pdf failed with: java.lang.NullPointerException
 at java.lang.System.arraycopy(Native Method)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1108)
 ...
 {noformat}
 or
 {noformat}
Exception in thread QueueFeeder java.lang.NullPointerException
  at 
 org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48)
  at 
 org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:41)
  at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:214)
 {noformat}
 I've run the debugger and found:
 # after the hung threads are reported the fetcher stops but the threads are 
 still alive and continue fetching a document. In consequence, this will
 #* limit the small bandwidth of network/server even more
 #* after the document is fetched the thread tries to write the content via 
 {{output.collect()}} which must fail because the fetcher map job is already 
 finished and the associated temporary mapred directory is deleted. The error 
 message may get mixed with the progress output of the next fetch cycle 
 causing additional confusion.
 # documents/URLs causing the hung thread are never reported nor stored. That 
 is, it's hard to track them down, and they will cause a hung thread again and 
 again.
 The problem is reproducible when fetching bigger documents and setting 
 {{mapred.task.timeout}} to a low value (this will definitely cause hung 
 threads).





[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961211#comment-13961211
 ] 

Sebastian Nagel commented on NUTCH-1747:


+1
Looks like inProgress was intended to hold more than the bare count of 
FetchItems in progress. If in doubt, we can get the in-progress FetchItems and 
their associated queue from the FetcherThreads (cf. NUTCH-1182).


 Use AtomicInteger as semaphore in Fetcher
 -

 Key: NUTCH-1747
 URL: https://issues.apache.org/jira/browse/NUTCH-1747
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.8
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-1747-trunk.patch


 In Fetcher we currently use 
 Set<FetchItem> inProgress = Collections.synchronizedSet(new
 HashSet<FetchItem>());
 as semaphore within the FetchItemQueues to keep track of the URLs being 
 fetched and prevent threads from pulling from them. It works fine but we 
 could use AtomicIntegers instead as all we need is the counts, not the 
 contents.
 This change would have little impact on the performance but would make the 
 code a bit cleaner.





[jira] [Resolved] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-385.
-

Resolution: Not a Problem

This is not a problem but a discussion of how things work in the Fetcher. No 
action needed.

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Chris Schneider

 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 the server delay whenever the number of threads accessing a given host drops to 
 zero.
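The two scenarios above can be condensed into a small model of the block/unblock logic (simplified and renamed from HttpBase; treat it as an illustration of the described behavior, not the actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the logic described above: a host is unavailable either while
// maxThreadsPerHost threads are on it, or, after the LAST thread leaves,
// for crawlDelay milliseconds. With 2+ threads overlapping continuously the
// count never drops to zero, so the delay branch never triggers.
class HostGate {
    private final int maxThreadsPerHost;
    private final long crawlDelayMs;
    private final Map<String, Integer> threadCount = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();

    HostGate(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized boolean tryAcquire(String host, long now) {
        Long until = blockedUntil.get(host);
        if (until != null && now < until) return false;  // serving the delay
        int n = threadCount.getOrDefault(host, 0);
        if (n >= maxThreadsPerHost) return false;        // too many threads
        threadCount.put(host, n + 1);
        return true;
    }

    synchronized void release(String host, long now) {
        int n = threadCount.get(host) - 1;
        threadCount.put(host, n);
        // The delay is only scheduled when the last thread leaves: this is
        // the inconsistency discussed above.
        if (n == 0) blockedUntil.put(host, now + crawlDelayMs);
    }
}
```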





[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-490:


Component/s: (was: fetcher)
 parser

 Extension point with filters for Neko HTML parser (with patch)
 --

 Key: NUTCH-490
 URL: https://issues.apache.org/jira/browse/NUTCH-490
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.9.0
 Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
 Fix For: 1.9

 Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, 
 nutch-extensionpoins_plugin.xml.diff


 In my project I need to set filters for the Neko HTML parser. So instead of 
 adding them hard coded, I made an extension point to define filters for Neko. I 
 was following the code for the HtmlParser filters. In fact, I think the method 
 to get filters could be generalized to handle both cases, but I didn't want 
 to make too big a mess.
 The attached patch is for Nutch 0.9. This part of the code wasn't changed in 
 trunk, so it should apply easily.
 BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an 
 extension point itself. Now there are options for Neko and TagSoup, but if 
 someone would like to use something else or give different settings to the 
 parser, they would need to modify the HtmlParser class instead of replacing a 
 plugin.





[jira] [Resolved] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1297.
--

Resolution: Won't Fix

NUTCH-1687 is a nicer approach + no feedback from original contributor

 it is better for fetchItemQueues to select items from greater queues first
 --

 Key: NUTCH-1297
 URL: https://issues.apache.org/jira/browse/NUTCH-1297
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
  Labels: fetch_queues
 Fix For: 1.9

 Attachments: NUTCH-1297.patch


 There is a situation where, if we have multiple hosts in a fetch and the 
 sizes of the hosts differ, large hosts have a long delay until getFetchItem() 
 in the FetchItemQueues class selects a URL from them, so we could give them 
 more priority.
 For example, if we have 10 URLs from host1 and 1000 URLs from host2, and have 
 5 threads: if all threads first select from host1, we get a longer delay in 
 the fetch than if the threads first selected from host2 and, when host2 was 
 busy, then selected from host1.





[jira] [Resolved] (NUTCH-1278) Fetch Improvement in threads per host

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1278.
--

Resolution: Won't Fix

No follow up from contributor + solution proposed quite invasive (changes at 
several levels)

 Fetch Improvement in threads per host
 -

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.9

 Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip


 The value of maxThreads is equal to fetcher.threads.per.host and is constant 
 for every host.
 There is a possibility of using dynamic values for every host, influenced by 
 the number of blocked requests.
 This means that if the number of blocked requests for one host increases, we 
 should decrease this value and increase http.timeout.





[jira] [Updated] (NUTCH-827) HTTP POST Authentication

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-827:


Component/s: (was: fetcher)
 protocol

 HTTP POST Authentication
 

 Key: NUTCH-827
 URL: https://issues.apache.org/jira/browse/NUTCH-827
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: 1.1, nutchgora
Reporter: Jasper van Veghel
Priority: Minor
  Labels: authentication
 Fix For: 1.9

 Attachments: http-client-form-authtication.patch, 
 nutch-http-cookies.patch


 I've created a patch against the trunk which adds support for very 
 rudimentary POST-based authentication support. It takes a link from 
 nutch-site.xml with a site to POST to and its respective parameters 
 (username, password, etc.). It then checks upon every request whether any 
 cookies have been initialized, and if none have, it fetches them from the 
 given link.
 This isn't perfect but Works For Me (TM) as I generally only need to retrieve 
 results from a single domain and so have no cookie overlap (i.e. if the 
 domain cookies expire, all cookies disappear from the HttpClient and I can 
 simply re-fetch them). A natural improvement would be to be able to specify 
 one particular cookie to check the expiration date against. If anyone is 
 interested in this besides me, I'd be glad to put some more effort into making 
 this more universally applicable.





[jira] [Updated] (NUTCH-1342) Read time out protocol-http

2014-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1342:
-

Component/s: (was: fetcher)
 protocol

 Read time out protocol-http
 ---

 Key: NUTCH-1342
 URL: https://issues.apache.org/jira/browse/NUTCH-1342
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.4, 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1342-1.6-1.patch


 For some reason some URL's always time out with protocol-http but not 
 protocol-httpclient. The stack trace is always the same:
 {code}
 2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
 java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
 at java.io.FilterInputStream.read(FilterInputStream.java:116)
 at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
 at java.io.FilterInputStream.read(FilterInputStream.java:90)
 at 
 org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
 at 
 org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:157)
 at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
 at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
 {code}
 Some example URL's:
 * 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
 * 301 http://shop.fcgroningen.nl/aanbieding





[jira] [Created] (NUTCH-1748) despite unix systems allow abc..xyz.txt kind of urls, url validator plugin rejects.

2014-04-05 Thread Sertac TURKEL (JIRA)
Sertac TURKEL created NUTCH-1748:


 Summary: despite unix systems allow abc..xyz.txt kind of urls, 
url validator plugin rejects. 
 Key: NUTCH-1748
 URL: https://issues.apache.org/jira/browse/NUTCH-1748
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.3


Unix systems accept file names containing two dots, like abc..xyz.txt, so
urlfilter-validator should not reject this kind of URL. Paths containing /../ 
or /.. in final position should still be rejected.
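The intended rule could be checked as in this sketch (the class and method names are hypothetical, not part of the plugin):

```java
// Sketch of the rule in NUTCH-1748: a ".." embedded in a file name such as
// "abc..xyz.txt" is a valid Unix name, but "/../" segments (path traversal)
// and a trailing "/.." must still be rejected.
class PathValidator {
    static boolean isValidPath(String path) {
        return !path.contains("/../") && !path.endsWith("/..");
    }
}
```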





[jira] [Commented] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961236#comment-13961236
 ] 

Hudson commented on NUTCH-1747:
---

SUCCESS: Integrated in Nutch-trunk #2592 (See 
[https://builds.apache.org/job/Nutch-trunk/2592/])
NUTCH-1747 Use AtomicInteger as semaphore in Fetcher (jnioche: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1585196)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


 Use AtomicInteger as semaphore in Fetcher
 -

 Key: NUTCH-1747
 URL: https://issues.apache.org/jira/browse/NUTCH-1747
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.8
Reporter: Julien Nioche
Priority: Minor
 Attachments: NUTCH-1747-trunk.patch


 In Fetcher we currently use 
 Set<FetchItem> inProgress = Collections.synchronizedSet(new
 HashSet<FetchItem>());
 as semaphore within the FetchItemQueues to keep track of the URLs being 
 fetched and prevent threads from pulling from them. It works fine but we 
 could use AtomicIntegers instead as all we need is the counts, not the 
 contents.
 This change would have little impact on the performance but would make the 
 code a bit cleaner.





[jira] [Issue Comment Deleted] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump

2014-04-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cihad güzel updated NUTCH-1615:
---

Comment: was deleted

(was: I'm trying for this issue.)

 Implementing A Feature for Fetching From Websites Dump
 --

 Key: NUTCH-1615
 URL: https://issues.apache.org/jira/browse/NUTCH-1615
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 2.1
Reporter: cihad güzel
Priority: Minor

 Some web sites provide dumps (like http://dumps.wikimedia.org/enwiki/ for 
 wikipedia.org). We should fetch from dumps for such web sites; fetching will 
 then be quicker.





[jira] [Created] (NUTCH-1749) Title duplicated in document body

2014-04-05 Thread Greg Padiasek (JIRA)
Greg Padiasek created NUTCH-1749:


 Summary: Title duplicated in document body
 Key: NUTCH-1749
 URL: https://issues.apache.org/jira/browse/NUTCH-1749
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Greg Padiasek


The HTML parser plugin inserts the document title into the document content. Since the 
title alone can be retrieved via DOMContentUtils.getTitle() and the content is 
retrieved via DOMContentUtils.getText(), there is no need to duplicate the title in 
the content. When the title is included in the content, it becomes difficult or 
impossible to extract the document body without the title, which matters when a 
user wants to index or display the body and title separately.

Attached is a patch which prevents including title in document content in the 
HTML parser plugin.





[jira] [Updated] (NUTCH-1749) Title duplicated in document body

2014-04-05 Thread Greg Padiasek (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Padiasek updated NUTCH-1749:
-

Attachment: DOMContentUtils.patch

 Title duplicated in document body
 -

 Key: NUTCH-1749
 URL: https://issues.apache.org/jira/browse/NUTCH-1749
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Greg Padiasek
 Attachments: DOMContentUtils.patch


 The HTML parser plugin inserts the document title into the document content. Since 
 the title alone can be retrieved via DOMContentUtils.getTitle() and the content 
 is retrieved via DOMContentUtils.getText(), there is no need to duplicate the 
 title in the content. When the title is included in the content, it becomes 
 difficult or impossible to extract the document body without the title, which 
 matters when a user wants to index or display the body and title separately.
 Attached is a patch which prevents including title in document content in the 
 HTML parser plugin.


