[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860140#comment-13860140
 ] 

Markus Jelsma commented on NUTCH-1360:
--

Almost all unit tests fail due to improper use of entities in configuration.

{code}
org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1249)
	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1117)
	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1053)
	at org.apache.hadoop.conf.Configuration.get(Configuration.java:460)
	at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:131)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
	at org.apache.nutch.crawl.TestCrawlDbFilter.setUp(TestCrawlDbFilter.java:50)
Caused by: org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1156)
{code}
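
For readers unfamiliar with this failure mode, here is a minimal sketch of how a bare ampersand in an XML text node trips the JDK parser, while the escaped form parses. The class name EntityCheck, the method parseError and the two sample strings are made up for illustration; they are not part of Nutch or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not Nutch code: checks whether a snippet is well-formed XML.
final class EntityCheck {

    /** Returns null if the XML is well-formed, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) is what a bare '&' produces.
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample strings loosely modelled on the description text in nutch-default.xml.
        String broken = "<description>protocol-ftp & http.</description>";
        String fixed = "<description>protocol-ftp &amp; http.</description>";
        System.out.println("raw ampersand: " + parseError(broken));
        System.out.println("escaped:       " + parseError(fixed));
    }
}
```

Either escaping the character as &amp; or rewording the text so that no ampersand is needed should avoid the error.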

 Suport the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860142#comment-13860142
 ] 

Markus Jelsma commented on NUTCH-1360:
--

{code}
--- conf/nutch-default.xml  (revision 1554785)
+++ conf/nutch-default.xml  (working copy)
@@ -29,7 +29,7 @@
   <value>false</value>
   <description>Enables us to capture the specific IP address 
   (InetSocketAddress) of the host which we connect to via 
-  the given protocol. Currently supported is protocol-ftp & 
+  the given protocol. Currently supported is protocol-ftp and
   http.
   </description>
 </property>
{code}

Will commit shortly.



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860153#comment-13860153
 ] 

Markus Jelsma commented on NUTCH-1360:
--

Committed revision 1554791.




[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2014-01-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860222#comment-13860222
 ] 

Hudson commented on NUTCH-1360:
---

FAILURE: Integrated in Nutch-trunk #2472 (See 
[https://builds.apache.org/job/Nutch-trunk/2472/])
NUTCH-1360 fix entity in configuration (markus: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554791)
* /nutch/trunk/conf/nutch-default.xml




[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-12-23 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855745#comment-13855745
 ] 

Julien Nioche commented on NUTCH-1360:
--

Looks good mate, +1 to commit



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-12-23 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855780#comment-13855780
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Thanks [~jnioche]
Committed @revision 1553151 in trunk

Update made to obtain Configuration object from HttpBase in 2.x HEAD
Committed @revision 1553154 in 2.x HEAD
 



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-12-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855794#comment-13855794
 ] 

Hudson commented on NUTCH-1360:
---

SUCCESS: Integrated in Nutch-nutchgora #858 (See 
[https://builds.apache.org/job/Nutch-nutchgora/858/])
NUTCH-1360 Support the storing of IP address connected to when web crawling 
(lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1553154)
* /nutch/branches/2.x/src/java/org/apache/nutch/net/protocols/Response.java
* /nutch/branches/2.x/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java




[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-12-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856108#comment-13856108
 ] 

Hudson commented on NUTCH-1360:
---

FAILURE: Integrated in Nutch-trunk #2462 (See 
[https://builds.apache.org/job/Nutch-trunk/2462/])
NUTCH-1360 Support the storing of IP address connected to when web crawling 
(lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1553151)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java




[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-28 Thread Walter Tietze (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835056#comment-13835056
 ] 

Walter Tietze commented on NUTCH-1360:
--

Hi Lewis,

As mentioned in the mail I sent you, I am attaching my patch for storing IP 
addresses in apache-nutch-1.5.1.

( https://issues.apache.org/jira/browse/NUTCH-289 might also be appropriate!)

In our project MIA (http://mia-marktplatz.de/) we crawl the German web. To 
stay polite we had to switch to a 'byIP' policy to guarantee an interval of at 
least one minute between requests to the same server. Crawling 'byHost' was not 
an option, because many sites use up to several thousand subdomains hosted on a 
single server with one IP address. 
As our crawl proceeded, I realized that crawling by IP seemed to slow down, 
because while generating the URL lists Nutch has to determine the IP address of 
each URL in order to build the per-IP queues. 
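
To illustrate why 'byIP' differs from 'byHost', the following is a minimal sketch that groups URLs into queues keyed by IP address. It is not the Nutch generator: the class ByIpQueues, the method group and the pre-resolved hostToIp map are assumptions made for this example, and real code would obtain the addresses from DNS or from stored metadata.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Nutch code.
final class ByIpQueues {

    /** Groups URLs by the IP of their host; URLs whose host has no known IP are skipped. */
    static Map<String, List<String>> group(List<String> urls, Map<String, String> hostToIp) {
        Map<String, List<String>> queues = new TreeMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no host component in this URL
            }
            String ip = hostToIp.get(host);
            if (ip == null) {
                continue; // IP could not be determined
            }
            queues.computeIfAbsent(ip, k -> new ArrayList<>()).add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        // Two subdomains on one server plus one unrelated host (documentation addresses).
        Map<String, String> hostToIp = new HashMap<>();
        hostToIp.put("shop.example.de", "192.0.2.1");
        hostToIp.put("blog.example.de", "192.0.2.1");
        hostToIp.put("other.example.org", "192.0.2.2");
        List<String> urls = Arrays.asList(
            "http://shop.example.de/a",
            "http://blog.example.de/b",
            "http://other.example.org/c");
        System.out.println(group(urls, hostToIp));
    }
}
```

With a per-queue delay, both subdomains in the sample share one queue and therefore one politeness interval, which a 'byHost' grouping would not give.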

This is a simple solution which writes the IP address, once determined, into 
the metadata field of the CrawlDatum object. When a crawl cycle has finished 
its fetch job, an additional map-reduce job is started to determine the IP 
addresses of newly fetched and parsed URLs. New URLs are inserted into the 
crawldb together with their IP addresses whenever an IP address could be 
determined.

The solution also contains two classes, IpAddressResolver.java and 
DNSCache.java, which cache IP addresses already fetched from the DNS and control 
the number of concurrent calls to the DNS from each map job. Since many URLs 
with the same IP address are likely to end up in the same queue, I wanted to 
minimize the work needed to build the queues. Caching IP addresses in memory 
shouldn't consume much memory. To avoid too many concurrent requests to a DNS 
server from the crawler, I added some code to restrict the number of parallel 
DNS requests.
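
The caching and throttling idea described above could look roughly like the sketch below. This is not the code from the attached patch: CachingResolver, Lookup and usingDns are hypothetical names, and the design (an in-memory map plus a semaphore) is only one plausible way to do it.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch; the patch's IpAddressResolver and DNSCache may work differently.
final class CachingResolver {

    /** Pluggable lookup so the cache can be exercised without real DNS traffic. */
    interface Lookup {
        String resolve(String host) throws UnknownHostException;
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Semaphore permits;
    private final Lookup lookup;

    CachingResolver(int maxConcurrentLookups, Lookup lookup) {
        this.permits = new Semaphore(maxConcurrentLookups);
        this.lookup = lookup;
    }

    /** Resolver backed by the JVM's normal name resolution. */
    static CachingResolver usingDns(int maxConcurrentLookups) {
        return new CachingResolver(maxConcurrentLookups,
            host -> InetAddress.getByName(host).getHostAddress());
    }

    /** Returns the cached or freshly resolved IP, or null if the host cannot be resolved. */
    String resolve(String host) {
        String cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        permits.acquireUninterruptibly(); // caps the number of lookups in flight
        try {
            String ip = lookup.resolve(host);
            cache.put(host, ip);
            return ip;
        } catch (UnknownHostException e) {
            return null; // failures are not cached in this sketch
        } finally {
            permits.release();
        }
    }
}
```

Two threads asking for the same uncached host at the same moment may both perform the lookup; the sketch accepts that duplicate work in exchange for simplicity.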

I have been using this code in production for about three quarters of this year 
and it seems to work fine. The four configuration entries should be 
self-explanatory. 

Cheers, Walter

 









[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835062#comment-13835062
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Hi Walter, thank you very much for the patch and the detailed explanation. 
Great to hear and see how you've adapted and added to the code to suit your 
requirements. 



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812288#comment-13812288
 ] 

Hudson commented on NUTCH-1360:
---

SUCCESS: Integrated in Nutch-nutchgora #808 (See 
[https://builds.apache.org/job/Nutch-nutchgora/808/])
NUTCH-1360 Suport the storing of IP address connected to when web crawling 
(lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1538280)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
* /nutch/branches/2.x/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* /nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummyX509TrustManager.java
* /nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java




[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-02 Thread Yasin Kılınç (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811999#comment-13811999
 ] 

Yasin Kılınç commented on NUTCH-1360:
-

Hi Lewis,
One more question: 
what is the difference between Configuration conf = new Configuration(); and 
NutchConfiguration.create(); ?

Thanks.



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-02 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812163#comment-13812163
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Committed @revision 1538280 in 2.x HEAD



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-01 Thread Yasin Kılınç (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811088#comment-13811088
 ] 

Yasin Kılınç commented on NUTCH-1360:
-

1- When the protocol is anything other than HTTP, the property name 
http.store.ip.address doesn't make sense; in my opinion it is not a general 
name. Maybe we can name it store.ip.address so that other protocols can 
implement it too. What do you think?

2- Don't we already have the IP of the connected host when we establish the 
socket connection? Why do we need to add the IP to the HTTP GET request? Is it 
because the initially connected host IP and the fetched host IP may be different?

Thanks 



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-01 Thread Yasin Kılınç (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811558#comment-13811558
 ] 

Yasin Kılınç commented on NUTCH-1360:
-

I looked at your code, Lewis.
Firstly, even if this property is false in the configuration file, a connection 
to the remote host will be established for every URL, and this can significantly 
slow down fetching. Why don't we do this during the socket connection? 
I think that would be faster, and it would be done only once. 



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-01 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811565#comment-13811565
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Hi Yasin. Yeah, I am +1 for your suggestion. Can you identify where we are 
making the initial socket connection, so that we are only doing this once? I 
thought that this was where the socket connection was being made. If we can move 
the function to the correct place, or even identify where the socket connection 
is happening, then I would be happy to amend the patch. 



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-01 Thread Yasin Kılınç (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811568#comment-13811568
 ] 

Yasin Kılınç commented on NUTCH-1360:
-

The following code in HttpResponse.java makes the socket connection:

// connect
String sockHost = http.useProxy() ? http.getProxyHost() : host;
int sockPort = http.useProxy() ? http.getProxyPort() : port;
InetSocketAddress sockAddr = new InetSocketAddress(sockHost, sockPort);
socket.connect(sockAddr, http.getTimeout());

and we can get the host IP address via this code:

sockAddr.getAddress().getHostAddress()
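
As a standalone illustration of the call above, the sketch below reads the IP from an InetSocketAddress using only JDK classes. The class ConnectedIp and the method ipOf are hypothetical names, not Nutch code. Note that getAddress() returns null for an unresolved address, so a null check seems prudent before calling getHostAddress().

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical helper, not part of Nutch.
final class ConnectedIp {

    /** Returns the textual IP of a socket address, or null if the host was not resolved. */
    static String ipOf(InetSocketAddress sockAddr) {
        InetAddress addr = sockAddr.getAddress(); // null when the address is unresolved
        return addr == null ? null : addr.getHostAddress();
    }

    public static void main(String[] args) {
        // A literal IP needs no DNS lookup, so this runs offline.
        InetSocketAddress resolved = new InetSocketAddress("127.0.0.1", 80);
        System.out.println(ipOf(resolved));

        // createUnresolved never attempts a lookup; getAddress() is null here.
        InetSocketAddress unresolved =
            InetSocketAddress.createUnresolved("host.invalid", 80);
        System.out.println(ipOf(unresolved));
    }
}
```

In the snippet above, sockHost is the proxy host when a proxy is in use, so in that case the address obtained this way would be the proxy's rather than the origin server's.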
 



[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-10 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410499#comment-13410499
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Reverted @revision 1359760 in trunk
Reverted @revision 1359762 in 2.X branch

OK so I understand exactly where I've gone wrong here and will work to improve 
this patch when I have more time. I now acknowledge that it shouldn't have made 
its way into the codebase(s) and can only thank you guys for pointing this out.

{bq} Guys, unless a change is trivial please do not commit it yourself unless 
it has been reviewed by a fellow committer {bq}

Point absolutely taken Julien, thanks. I'll get around to this in the near 
future and attach a patch when it's ready. Also thanks for the suggestions, 
Ferdy & Julien. 




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-10 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410884#comment-13410884
 ] 

Ferdy Galema commented on NUTCH-1360:
-

Thanks! Keep up the good work!





[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-04 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495
 ] 

Ferdy Galema commented on NUTCH-1360:
-

Sorry for the late response, but this issue is not properly implemented (for 
both branch and trunk).

- IP is always stored instead of depending on property: headers.set(_ip,... 
should be done only if http.getIP_Header() is true.

- http.store.ip.address appends the _ip:true or false property to the request 
string? What is the purpose of that? If not intentional, we should simply 
revert this. On top of that, it uses the property with a default of true, but 
it should be false if adding it to the request string is intentional.
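
The behaviour asked for in the two points above (store the IP only when the property is enabled, with a default of false) can be sketched as follows. This is not the patch itself: the class IpStoragePolicy, the method maybeStoreIp and the use of java.util.Properties and a plain Map in place of Nutch's Configuration and metadata containers are assumptions made to keep the example self-contained. The property name and the _ip key are taken from this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for the real protocol plugin code.
final class IpStoragePolicy {

    static final String STORE_IP_PROPERTY = "http.store.ip.address";
    static final String IP_KEY = "_ip";

    /** Records the IP in the given metadata map only if the property is set to true. */
    static void maybeStoreIp(Properties conf, Map<String, String> metadata, String ip) {
        // Default is false: nothing is stored unless the user opts in.
        boolean storeIp = Boolean.parseBoolean(conf.getProperty(STORE_IP_PROPERTY, "false"));
        if (storeIp && ip != null) {
            metadata.put(IP_KEY, ip);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        Map<String, String> metadata = new HashMap<>();

        maybeStoreIp(conf, metadata, "192.0.2.7"); // property unset, nothing stored
        System.out.println(metadata);

        conf.setProperty(STORE_IP_PROPERTY, "true");
        maybeStoreIp(conf, metadata, "192.0.2.7"); // opted in, IP is recorded
        System.out.println(metadata);
    }
}
```

Nothing is added to the outgoing request in this sketch; the flag only controls what is recorded locally.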

Thanks.






[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-04 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406561#comment-13406561
 ] 

Ferdy Galema commented on NUTCH-1360:
-

Just one more thing:
Should the IP not be stored in the metadata instead of the headers field? It is 
technically not a response header. As far as I know, the headers container is 
currently used only for the headers returned by the HTTP server. (But correct 
me if I'm wrong.)
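
A minimal Java sketch of the separation suggested here, under the assumption 
that a fetch result keeps two containers: one for headers sent by the server 
and one for crawler-side metadata. The class FetchRecordSketch and its methods 
are hypothetical names for illustration only and do not correspond to Nutch 
classes.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical container; shows the IP kept apart from response headers. */
public class FetchRecordSketch {

    /** Only values actually returned by the HTTP server go here. */
    private final Map<String, String> headers = new LinkedHashMap<>();

    /** Values the crawler derives itself, such as the connected IP. */
    private final Map<String, String> metadata = new LinkedHashMap<>();

    /** Records a header received from the server. */
    public void addResponseHeader(String name, String value) {
        headers.put(name, value);
    }

    /** Records the IP as metadata, since the server never sent it. */
    public void setConnectedIp(String ip) {
        metadata.put("_ip", ip);
    }

    public Map<String, String> getHeaders() {
        return Collections.unmodifiableMap(headers);
    }

    public Map<String, String> getMetadata() {
        return Collections.unmodifiableMap(metadata);
    }
}
```

Kept this way, code that iterates over the headers sees only what the server 
returned, while the IP remains available through the metadata map.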





[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293085#comment-13293085
 ] 

Hudson commented on NUTCH-1360:
---

Integrated in nutch-trunk-maven #307 (See 
[https://builds.apache.org/job/nutch-trunk-maven/307/])
commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* 
/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* 
/nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java






[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1329#comment-1329
 ] 

Hudson commented on NUTCH-1360:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

 Result = SUCCESS
lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348993
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* 
/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* 
/nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java






[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-05-21 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280269#comment-13280269
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Committed @revision 1341100 in nutchgora branch





[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-05-10 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272234#comment-13272234
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Since all protocol plugins try to imitate an HTTP scenario, how about placing 
the nutch-default.xml property under the HTTP properties and naming it 
http.store.ip.address?
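
As a rough sketch of what such a property might look like and how it could be 
read, the Java below embeds a Hadoop-style property entry as a string and 
parses it with the JDK's XML parser. The description text, the false value, 
and the names StoreIpPropertySketch and readBoolean are assumptions for 
illustration; Nutch itself loads nutch-default.xml through Hadoop's 
Configuration class rather than code like this.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical sketch of the proposed property and a plain-JDK reader. */
public class StoreIpPropertySketch {

    /** Assumed shape of the entry; the description wording is invented. */
    public static final String SAMPLE_XML =
        "<configuration>\n"
      + "  <property>\n"
      + "    <name>http.store.ip.address</name>\n"
      + "    <value>false</value>\n"
      + "    <description>Whether to store the IP address of the host\n"
      + "    connected to when fetching a page.</description>\n"
      + "  </property>\n"
      + "</configuration>\n";

    /** Returns the boolean value of the named property, or the default. */
    public static boolean readBoolean(String xml, String name, boolean defaultValue) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Configuration XML is not parseable", e);
        }
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return Boolean.parseBoolean(values.item(0).getTextContent().trim());
            }
        }
        return defaultValue;
    }
}
```

Calling readBoolean(SAMPLE_XML, "http.store.ip.address", true) yields the 
value from the entry, and an unknown property name falls back to the supplied 
default.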

 Suport the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.6

 Attachments: NUTCH-1360-nutchgora.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to when fetching a page.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira