[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860140#comment-13860140 ]
Markus Jelsma commented on NUTCH-1360:

Almost all unit tests fail due to improper use of entities in configuration.

{code}
org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1249)
	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1117)
	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1053)
	at org.apache.hadoop.conf.Configuration.get(Configuration.java:460)
	at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:131)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
	at org.apache.nutch.crawl.TestCrawlDbFilter.setUp(TestCrawlDbFilter.java:50)
Caused by: org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1156)
{code}

Suport the storing of IP address connected to when web crawling
---
Key: NUTCH-1360
URL: https://issues.apache.org/jira/browse/NUTCH-1360
Project: Nutch
Issue Type: New Feature
Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 2.3, 1.9
Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, NUTCH-1360v5.patch

Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
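The SAXParseException quoted above is the error the JDK's XML parser raises when element text contains a bare ampersand. A minimal sketch of that behaviour, assuming only the JDK's built-in parser; the class name `EntityCheck` and the sample strings are made up for illustration and are not taken from nutch-default.xml:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Nutch or Hadoop.
public class EntityCheck {

    /** Returns true if the given text parses as a well-formed XML document. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // A bare '&' ends up here as a SAXParseException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bare = "<description>ftp & http</description>";
        String escaped = "<description>ftp &amp; http</description>";
        System.out.println("bare ampersand well-formed:    " + isWellFormed(bare));
        System.out.println("escaped ampersand well-formed: " + isWellFormed(escaped));
    }
}
```

Either escaping the character as `&amp;` or rewording the text so that it contains no ampersand keeps the file parseable.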
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860142#comment-13860142 ]
Markus Jelsma commented on NUTCH-1360:

{code}
--- conf/nutch-default.xml (revision 1554785)
+++ conf/nutch-default.xml (working copy)
@@ -29,7 +29,7 @@
   <value>false</value>
   <description>Enables us to capture the specific IP address
   (InetSocketAddress) of the host which we connect to via
-  the given protocol. Currently supported is protocol-ftp & http.
+  the given protocol. Currently supported is protocol-ftp and http.
   </description>
 </property>
{code}

Will commit shortly.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860153#comment-13860153 ] Markus Jelsma commented on NUTCH-1360: -- Committed revision 1554791.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860222#comment-13860222 ]
Hudson commented on NUTCH-1360:

FAILURE: Integrated in Nutch-trunk #2472 (See [https://builds.apache.org/job/Nutch-trunk/2472/])
NUTCH-1360 fix entity in configuration (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554791)
* /nutch/trunk/conf/nutch-default.xml
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855745#comment-13855745 ] Julien Nioche commented on NUTCH-1360: -- Looks good mate, +1 to commit
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855780#comment-13855780 ]
Lewis John McGibbney commented on NUTCH-1360:

Thanks [~jnioche]
Committed @revision 1553151 in trunk
Update made to obtain Configuration object from HttpBase in 2.x HEAD
Committed @revision 1553154 in 2.x HEAD
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855794#comment-13855794 ]
Hudson commented on NUTCH-1360:

SUCCESS: Integrated in Nutch-nutchgora #858 (See [https://builds.apache.org/job/Nutch-nutchgora/858/])
NUTCH-1360 Support the storing of IP address connected to when web crawling (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1553154)
* /nutch/branches/2.x/src/java/org/apache/nutch/net/protocols/Response.java
* /nutch/branches/2.x/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856108#comment-13856108 ]
Hudson commented on NUTCH-1360:

FAILURE: Integrated in Nutch-trunk #2462 (See [https://builds.apache.org/job/Nutch-trunk/2462/])
NUTCH-1360 Support the storing of IP address connected to when web crawling (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1553151)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* /nutch/trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835056#comment-13835056 ]
Walter Tietze commented on NUTCH-1360:

Hi Lewis,

as mentioned in the mail I sent you, I am attaching my patch for storing IP addresses in apache-nutch-1.5.1. (https://issues.apache.org/jira/browse/NUTCH-289 might also be an appropriate place for it!)

In our project MIA (http://mia-marktplatz.de/) we crawl the German web. To stay polite we had to switch to a 'byIP' policy to guarantee an interval of at least one minute between requests to the same server. Crawling 'byHost' was not an option, because many sites use up to several thousand subdomains hosted on a single server with one IP address.

As the crawl progressed, I noticed that crawling by IP seemed to slow down, because while generating the URL lists Nutch has to resolve the IP address of each URL in order to build the per-IP queues.

This patch is a simple solution that writes the IP address, once determined, into the metadata field of the CrawlDatum object. When a crawl cycle has finished its fetch job, an additional map-reduce job is started to determine the IP addresses of newly fetched and parsed URLs. New URLs are inserted into the crawldb with their IP addresses if an IP address could be determined.

The patch also contains two classes, IpAddressResolver.java and DNSCache.java, which cache IP addresses already fetched from the DNS and control the number of concurrent calls to the DNS from each map job.

Since many URLs with the same IP address end up in the same queue, I wanted to minimize the load needed to build the queues. Caching IP addresses in memory shouldn't consume much memory. To avoid too many concurrent requests to a DNS from the crawler, I added some code to restrict the number of parallel requests to the DNS.

I have been using this code in production for about three quarters of this year and it seems to work fine. The four configuration entries should be self-explanatory.

Cheers, Walter
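The caching and throttling idea described in this comment can be sketched roughly as follows. This is not the code from the attached patch: the class name `CachingResolverSketch`, its constructor parameters, and the pluggable lookup function are all assumptions made for illustration, and the real IpAddressResolver.java and DNSCache.java may work differently.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical sketch: an in-memory host -> IP cache with a cap on parallel lookups.
public class CachingResolverSketch {
    private final ConcurrentHashMap<String, Optional<String>> cache = new ConcurrentHashMap<>();
    private final Semaphore permits;
    private final Function<String, Optional<String>> lookup;

    public CachingResolverSketch(int maxParallelLookups,
                                 Function<String, Optional<String>> lookup) {
        this.permits = new Semaphore(maxParallelLookups);
        this.lookup = lookup;
    }

    /** Returns the cached IP for a host, resolving it at most a few times. */
    public Optional<String> resolve(String host) {
        Optional<String> cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        permits.acquireUninterruptibly(); // wait while too many lookups are in flight
        try {
            cached = cache.get(host);     // another thread may have filled it meanwhile
            if (cached != null) {
                return cached;
            }
            Optional<String> resolved = lookup.apply(host);
            Optional<String> previous = cache.putIfAbsent(host, resolved);
            return previous != null ? previous : resolved;
        } finally {
            permits.release();
        }
    }

    /** Real DNS lookup; an empty result is cached too, so failures are not retried. */
    public static Optional<String> dnsLookup(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        CachingResolverSketch resolver =
            new CachingResolverSketch(4, CachingResolverSketch::dnsLookup);
        System.out.println(resolver.resolve("localhost"));
        System.out.println(resolver.resolve("localhost")); // served from the cache
    }
}
```

Passing the lookup in as a function keeps the sketch testable without network access; a production version would probably also want an expiry for cached entries, which is left out here.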
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835062#comment-13835062 ] Lewis John McGibbney commented on NUTCH-1360: - Hi Walter thank you very much for the patch and the verbose explanation. Great to hear and see how you've adapted and added to the code to suit your requirements.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812288#comment-13812288 ]
Hudson commented on NUTCH-1360:

SUCCESS: Integrated in Nutch-nutchgora #808 (See [https://builds.apache.org/job/Nutch-nutchgora/808/])
NUTCH-1360 Suport the storing of IP address connected to when web crawling (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1538280)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Client.java
* /nutch/branches/2.x/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
* /nutch/branches/2.x/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
* /nutch/branches/2.x/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/DummyX509TrustManager.java
* /nutch/branches/2.x/src/plugin/protocol-sftp/src/java/org/apache/nutch/protocol/sftp/Sftp.java
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811999#comment-13811999 ] Yasin Kılınç commented on NUTCH-1360: - Hi Lewis, one more question: what is the difference between Configuration conf = new Configuration(); and NutchConfiguration.create();? Thanks.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812163#comment-13812163 ] Lewis John McGibbney commented on NUTCH-1360: - Committed @revision 1538280 in 2.x HEAD
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811088#comment-13811088 ]
Yasin Kılınç commented on NUTCH-1360:

1- When the protocol is anything other than HTTP, the property name http.store.ip.address doesn't make sense; in my opinion it is not a general name. Maybe we can name it store.ip.address so that other protocols can implement it too. What do you think?
2- Don't we already have the IP of the connected host when we establish the socket connection? Why do we need to add the IP in the HTTP GET request? Is it because the initially connected host IP and the fetched host IP may be different?

Thanks
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811558#comment-13811558 ] Yasin Kılınç commented on NUTCH-1360: - I looked at your code, Lewis. Firstly, even if this property is false in the configuration file, a connection will be established to the remote host for every URL, and this code can significantly affect the speed of the fetch. Why don't we do this during the socket connection? I think that would be faster, and it would be done only once.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811565#comment-13811565 ] Lewis John McGibbney commented on NUTCH-1360: - Hi Yasin. Yeah, I am +1 for your suggestion. Can you identify where we are making the initial socket connection, so that we only do this once? I thought that this was where the socket connection was being made. If we can move the function to the correct place, or even identify where the socket connection is happening, then I would be happy to amend the patch.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811568#comment-13811568 ]
Yasin Kılınç commented on NUTCH-1360:

The following code makes the socket connection in HttpResponse.java:

    // connect
    String sockHost = http.useProxy() ? http.getProxyHost() : host;
    int sockPort = http.useProxy() ? http.getProxyPort() : port;
    InetSocketAddress sockAddr = new InetSocketAddress(sockHost, sockPort);
    socket.connect(sockAddr, http.getTimeout());

and we can get the host IP address via this code:

    sockAddr.getAddress().getHostAddress()
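A standalone sketch of the `getAddress().getHostAddress()` call suggested in this comment, using only `java.net` and no Nutch classes. The class name `SocketIpSketch` and the helper `ipOf` are hypothetical. Note that, going by the snippet in the comment, the address obtained this way would be the proxy's when a proxy is in use.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical helper, not part of Nutch.
public class SocketIpSketch {

    /** Returns the textual IP of a socket address, or null if it is unresolved. */
    public static String ipOf(InetSocketAddress sockAddr) {
        // getAddress() returns null when the host name could not be resolved.
        InetAddress addr = sockAddr.getAddress();
        return addr == null ? null : addr.getHostAddress();
    }

    public static void main(String[] args) {
        // An IP literal needs no DNS lookup, so this part runs offline.
        InetSocketAddress resolved = new InetSocketAddress("127.0.0.1", 80);
        System.out.println(ipOf(resolved));

        // createUnresolved() never attempts resolution, so there is no address to report.
        InetSocketAddress unresolved =
            InetSocketAddress.createUnresolved("example.invalid", 80);
        System.out.println(ipOf(unresolved));
    }
}
```

The `InetSocketAddress(String, int)` constructor performs the name resolution itself, so reading the address afterwards costs no extra lookup; the null check matters because calling `getHostAddress()` on an unresolved address would otherwise throw a NullPointerException.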
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410499#comment-13410499 ]
Lewis John McGibbney commented on NUTCH-1360:

Reverted @revision 1359760 in trunk and @revision 1359762 in the 2.X branch.

OK, so I understand exactly where I've gone wrong here and will work to improve this patch when I have more time. I now acknowledge that it shouldn't have made its way into the codebase(s) and can only thank you guys for pointing this out.

{bq} Guys, unless a change is trivial please do not commit it yourself unless it has been reviewed by a fellow committer {bq}

Point absolutely taken Julien, thanks. I'll get around to this in the near future and attach a patch when it's ready. Also thanks for the suggestions, Ferdy and Julien.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410884#comment-13410884 ] Ferdy Galema commented on NUTCH-1360: - Thanks! Keep up the good work!
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406495#comment-13406495 ]
Ferdy Galema commented on NUTCH-1360:

Sorry for the late response, but this issue is not properly implemented (for both branch and trunk).

- IP is always stored instead of depending on the property: headers.set(_ip,... should be done only if http.getIP_Header() is true.
- http.store.ip.address appends the _ip:true or false property to the request string? What is the purpose of that? If not intentional, we should simply revert this. On top of that, it uses the property with a default of true, but it should be false if adding it to the request string is intentional.

Thanks.
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406561#comment-13406561 ] Ferdy Galema commented on NUTCH-1360: - Just one more thing: Should the IP not be stored in the metadata instead of the headers field? It is technically not a response header. As far as I know currently the headers container is only used for the headers returned by the http server. (But correct me if I'm wrong)
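The two review points (store the IP only when the property is enabled, and keep it apart from the server's response headers) can be illustrated with a small sketch. Plain `Map`s stand in for Nutch's header and metadata containers here, and `StoreIpSketch`, `recordIp`, and the boolean flag are hypothetical names, not the committed implementation; only the `_ip` key comes from the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration using plain maps instead of Nutch classes.
public class StoreIpSketch {
    /** Key name taken from the "_ip" mentioned in the review comment. */
    static final String IP_KEY = "_ip";

    /**
     * Records the IP in the metadata map only when the feature flag is on.
     * The response headers map is deliberately not touched.
     */
    public static void recordIp(boolean storeIpAddress, String ip,
                                Map<String, String> metadata) {
        if (storeIpAddress && ip != null) {
            metadata.put(IP_KEY, ip);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "text/html"); // as returned by the server
        Map<String, String> metadata = new HashMap<>();

        recordIp(false, "192.0.2.10", metadata);
        System.out.println("flag off: " + metadata);

        recordIp(true, "192.0.2.10", metadata);
        System.out.println("flag on:  " + metadata);
        System.out.println("headers:  " + headers);
    }
}
```

Keeping the flag check in one place means a disabled property costs nothing beyond a boolean test, which is the default behaviour asked for in the review.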
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293085#comment-13293085 ]
Hudson commented on NUTCH-1360:

Integrated in nutch-trunk-maven #307 (See [https://builds.apache.org/job/nutch-trunk-maven/307/])
commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

Result = SUCCESS
lewismc :
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1329#comment-1329 ]

Hudson commented on NUTCH-1360:

Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/])
commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

Result = SUCCESS
lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348993
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280269#comment-13280269 ]

Lewis John McGibbney commented on NUTCH-1360:

Committed @revision 1341100 in nutchgora branch
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272234#comment-13272234 ]

Lewis John McGibbney commented on NUTCH-1360:

As all protocol plugins try to imitate an HTTP scenario, I propose that the nutch-default.xml property be homed under the HTTP properties and named http.store.ip.address. Any objections?

Suport the storing of IP address connected to when web crawling

Key: NUTCH-1360
URL: https://issues.apache.org/jira/browse/NUTCH-1360
Project: Nutch
Issue Type: New Feature
Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: nutchgora, 1.6
Attachments: NUTCH-1360-nutchgora.patch

Simple issue enabling us to capture the specific IP address of the host which we connect to to fetch a page.
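To make the proposed property concrete, here is a minimal Java sketch of how a boolean flag named http.store.ip.address could gate whether the connected IP is recorded. Only the property name comes from the comment above. The class `StoreIpFlagSketch`, the use of `java.util.Properties` as a stand-in for the real configuration object, and the default of `false` are all assumptions made for illustration.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Nutch.
final class StoreIpFlagSketch {
    // Property name as proposed in the comment above.
    static final String STORE_IP_PROPERTY = "http.store.ip.address";

    // Assumption: the flag is off unless explicitly set to "true".
    static boolean storeIpEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(STORE_IP_PROPERTY, "false"));
    }

    // Returns the IP to record, or empty when the flag is disabled.
    // connectedIp is the address the fetcher actually connected to.
    static Optional<String> ipToStore(Properties conf, String connectedIp) {
        if (!storeIpEnabled(conf)) {
            return Optional.empty();
        }
        return Optional.of(connectedIp);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(STORE_IP_PROPERTY, "true");
        // 192.0.2.10 is a documentation-range address used as sample data.
        System.out.println(ipToStore(conf, "192.0.2.10"));
    }
}
```

Keeping the check in one place means each protocol plugin could consult the same flag instead of defining its own.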