[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835062#comment-13835062
 ] 

Lewis John McGibbney commented on NUTCH-1360:
-

Hi Walter, thank you very much for the patch and the detailed explanation. It's 
great to hear and see how you've adapted and extended the code to suit your 
requirements. 

> Suport the storing of IP address connected to when web crawling
> ---
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
> NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
> NUTCH-1360-trunk.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
> NUTCH-1360v5.patch
>
>
> A simple issue enabling us to capture the specific IP address of the host we 
> connect to when fetching a page.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-28 Thread Walter Tietze (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walter Tietze updated NUTCH-1360:
-

Attachment: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch

IP storing in CrawlDb metadata for apache-nutch-1.5.1

> Suport the storing of IP address connected to when web crawling
> ---
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
> NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
> NUTCH-1360-trunk.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
> NUTCH-1360v5.patch
>
>
> A simple issue enabling us to capture the specific IP address of the host we 
> connect to when fetching a page.





[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2013-11-28 Thread Walter Tietze (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835056#comment-13835056
 ] 

Walter Tietze commented on NUTCH-1360:
--

Hi Lewis,

as mentioned in the mail I sent you, I'm providing my patch for storing IP 
addresses in apache-nutch-1.5.1 as an attachment.

( https://issues.apache.org/jira/browse/NUTCH-289 might also be appropriate!)

In our project MIA (http://mia-marktplatz.de/) we spider the German web. To 
stay polite we had to switch to a 'byIP' policy to guarantee a request 
frequency of at most one request per minute per server. Crawling 'byHost' was 
not an option, because many sites use up to several thousand subdomains hosted 
on a single server with one IP address. 
As the crawl proceeded I realized that crawling by IP seemed to slow down, 
because while generating the URL lists Nutch has to determine the IP address of 
every URL in order to build the per-IP queues. 

My solution is simple: it writes the IP address, once determined, into the 
metadata field of the CrawlDatum object. When a crawl cycle has finished its 
fetch job, an additional map-reduce job is started to determine the IP 
addresses of newly fetched and parsed URLs. New URLs are inserted into the 
crawldb together with their IP addresses whenever an address could be determined.

The solution also includes two classes, IpAddressResolver.java and 
DNSCache.java, which cache already resolved IP addresses and control the number 
of concurrent DNS calls from each map job. Since many URLs with the same IP 
address end up in the same queue, I wanted to minimize the load needed to build 
up the queues, and caching IP addresses in memory shouldn't be 
memory-consuming. To avoid too many concurrent requests to the DNS from the 
crawler, I added some code to restrict the number of parallel requests 
to the DNS.

I have been using this code in production for about three quarters of this 
year and it seems to work fine. The four configuration entries should be 
self-explanatory. 
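The caching and throttling described above could be sketched roughly as 
follows. This is a hypothetical simplification, not the actual 
IpAddressResolver/DNSCache code from the patch; the class and method names here 
are invented:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Caches resolved addresses in memory and caps concurrent DNS lookups. */
class CachingResolver {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Semaphore dnsPermits; // limits parallel DNS requests

    CachingResolver(int maxConcurrentLookups) {
        this.dnsPermits = new Semaphore(maxConcurrentLookups);
    }

    /** Resolves a host to its IP address, consulting the cache first. */
    String resolve(String host) throws UnknownHostException, InterruptedException {
        String ip = cache.get(host);
        if (ip != null) {
            return ip; // cache hit: no DNS traffic at all
        }
        dnsPermits.acquire(); // block if too many lookups are in flight
        try {
            ip = InetAddress.getByName(host).getHostAddress();
        } finally {
            dnsPermits.release();
        }
        cache.put(host, ip);
        return ip;
    }
}
```

In the real patch the four configuration entries presumably control things like 
the cache behaviour and the concurrency limit; only the limit is modelled here.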

Cheers, Walter

> Suport the storing of IP address connected to when web crawling
> ---
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1360-nutchgora-v2.patch, 
> NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch, NUTCH-1360v3.patch, 
> NUTCH-1360v4.patch, NUTCH-1360v5.patch
>
>
> A simple issue enabling us to capture the specific IP address of the host we 
> connect to when fetching a page.





Re: [DISCUSS] Release Trunk

2013-11-28 Thread Julien Nioche
Hi Lewis

We've done quite a few things in 1.x since the previous release (e.g.
generic deduplication, removing the indexer.solr package, etc.) and the next
2.x release will come after the changes to GORA have been made, tested and
used on the Nutch side, so that could be quite a while.

I am neutral as to whether we should do a 1.x release now. There are some
minor issues that we could address in 1.x before the next release, like:
* https://issues.apache.org/jira/browse/NUTCH-1360
* https://issues.apache.org/jira/browse/NUTCH-1676
and probably a few others, but they could also be done later.

Let's hear what others think.

Thanks

Julien




On 28 November 2013 16:34, Lewis John Mcgibbney
wrote:

> Hi Folks,
> Thread says it all.
> There are some hot tickets over in Gora right now so I think holding off
> the next while for a 2.x release would be wise.
> I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs
> up.
> Ta
> Lewis
>
> --
> *Lewis*
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[DISCUSS] Release Trunk

2013-11-28 Thread Lewis John Mcgibbney
Hi Folks,
Thread says it all.
There are some hot tickets over in Gora right now so I think holding off
the next while for a 2.x release would be wise.
I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs
up.
Ta
Lewis

-- 
*Lewis*


[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2013-11-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834964#comment-13834964
 ] 

Lewis John McGibbney commented on NUTCH-1325:
-

Hi [~otis], there is already a host table implementation in 2.x :)

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2013-11-28 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834944#comment-13834944
 ] 

Otis Gospodnetic commented on NUTCH-1674:
-

[~alparslan.avci] - thanks! Did you see speedups after your changes (because 
they pull less data over the network)?

> Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
> -
>
> Key: NUTCH-1674
> URL: https://issues.apache.org/jira/browse/NUTCH-1674
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Nguyen Manh Tien
> Fix For: 2.3
>
> Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch
>
>
> Nutch always scans the whole crawldb in each phase (generate, fetch, parse, 
> update, index). When the crawldb is big, the time to scan is bigger than the 
> actual processing time.
> We really need to skip records while scanning, using GORA-119; for example, 
> we could fetch only the records belonging to a specified batchId.
> In my crawl the filter reduced the scan time from 90 min to 30 min.





[jira] [Updated] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2013-11-28 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alparslan Avcı updated NUTCH-1674:
--

Attachment: NUTCH-1674_2.patch

Nguyen's patch works well; however, I want to comment on a few points. 
First, the patch applies the filter using the WebPage.Field.BATCH_ID field. 
That filter also returns rows that are not marked for generate, fetch, etc. I 
think a FilterOp.EQUALS_IN_MAP operator could be used to filter by the marks of 
each job. For example, for FetcherJob, WebPage.Field.MARKERS could be used as 
the filter field with Mark.GENERATE_MARK and the batchId as operands. This 
would return only the rows that carry Mark.GENERATE_MARK and whose 
Mark.GENERATE_MARK value equals the batchId. Moreover, with this method we 
could remove the NutchJob.shouldProcess(mark, batchId) checks in the mappers. 
Secondly, filters beyond batchId may be implemented in the future. For 
extensibility, I think the filters should be implemented in each job class, not 
in StorageUtils; since it is a utility class, StorageUtils should only apply 
the filter (or filter set) passed in as a method parameter.
In the patch I attached, I applied the currently possible filters (only batchId 
filters for now) to the jobs. Once the new HBase filters and a filter set are 
implemented in Gora, we can add further filters (e.g. a filter on the absence 
of Mark.FETCH_MARK for FetcherJob) and clean some checks out of the map 
functions. 

> Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
> -
>
> Key: NUTCH-1674
> URL: https://issues.apache.org/jira/browse/NUTCH-1674
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Nguyen Manh Tien
> Fix For: 2.3
>
> Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch
>
>
> Nutch always scans the whole crawldb in each phase (generate, fetch, parse, 
> update, index). When the crawldb is big, the time to scan is bigger than the 
> actual processing time.
> We really need to skip records while scanning, using GORA-119; for example, 
> we could fetch only the records belonging to a specified batchId.
> In my crawl the filter reduced the scan time from 90 min to 30 min.





[jira] [Comment Edited] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2013-11-28 Thread Luke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834757#comment-13834757
 ] 

Luke edited comment on NUTCH-1647 at 11/28/13 11:54 AM:


Embarrassingly, I've found my problem was related to something else - I'd set 
http.content.limit to 0. Changing this has resolved my issues. (in my defense, 
it is a fairly misleading error message)

I can confirm that I'm getting the same behaviour for 
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale




was (Author: lukejira):
Embarrassingly, I've found my problem was related to something else - I'd set 
http.content.limit to 0. Changing this has resolved my problems. (in my 
defense, it is a fairly misleading error message)

I can confirm that I'm getting the same behaviour for 
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale



> protocol-http throws unzipBestEffort returned null for some pages
> -
>
> Key: NUTCH-1647
> URL: https://issues.apache.org/jira/browse/NUTCH-1647
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Markus Jelsma
> Fix For: 1.8
>
>
> bin/nutch indexchecker 
> http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale  
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.io.IOException: unzipBestEffort returned null
> {code}
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.host = null
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.port = 8080
> 2013-10-01 13:44:55,612 INFO  http.Http - http.timeout = 12000
> 2013-10-01 13:44:55,612 INFO  http.Http - http.content.limit = 5242880
> 2013-10-01 13:44:55,612 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +http://www.openindex.io/en/webmasters/spider.html)
> 2013-10-01 13:44:55,612 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-10-01 13:44:55,613 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output
> java.io.IOException: unzipBestEffort returned null
> at 
> org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
> at 
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:86)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:150)
> {code}
> Haven't got a clue yet as to what the exact issue is.





[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2013-11-28 Thread Luke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834757#comment-13834757
 ] 

Luke commented on NUTCH-1647:
-

Embarrassingly, I've found my problem was related to something else - I'd set 
http.content.limit to 0. Changing this has resolved my problems. (in my 
defense, it is a fairly misleading error message)

I can confirm that I'm getting the same behaviour for 
http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale
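For readers hitting the same trap: http.content.limit is set in 
conf/nutch-site.xml, and (per nutch-default.xml, if memory serves) a 
non-negative value truncates downloaded content at that many bytes, so 0 
truncates every response to zero bytes and leaves gzip streams incomplete, 
which is what triggers the unzipBestEffort error above. A sketch of a sane 
setting (5 MB, matching the log output above):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Max bytes downloaded per document. 0 truncates everything;
       a negative value disables truncation entirely. -->
  <value>5242880</value>
</property>
```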



> protocol-http throws unzipBestEffort returned null for some pages
> -
>
> Key: NUTCH-1647
> URL: https://issues.apache.org/jira/browse/NUTCH-1647
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Markus Jelsma
> Fix For: 1.8
>
>
> bin/nutch indexchecker 
> http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale  
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.io.IOException: unzipBestEffort returned null
> {code}
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.host = null
> 2013-10-01 13:44:55,612 INFO  http.Http - http.proxy.port = 8080
> 2013-10-01 13:44:55,612 INFO  http.Http - http.timeout = 12000
> 2013-10-01 13:44:55,612 INFO  http.Http - http.content.limit = 5242880
> 2013-10-01 13:44:55,612 INFO  http.Http - http.agent = Mozilla/5.0 
> (compatible; OpenindexSpider; 
> +http://www.openindex.io/en/webmasters/spider.html)
> 2013-10-01 13:44:55,612 INFO  http.Http - http.accept.language = 
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-10-01 13:44:55,613 INFO  http.Http - http.accept = 
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output
> java.io.IOException: unzipBestEffort returned null
> at 
> org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
> at 
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:86)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:150)
> {code}
> Haven't got a clue yet as to what the exact issue is.





[jira] [Updated] (NUTCH-1676) Add rudimentary SSL support to protocol-http

2013-11-28 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1676:
-

Attachment: NUTCH-1676.patch

> Add rudimentary SSL support to protocol-http
> 
>
> Key: NUTCH-1676
> URL: https://issues.apache.org/jira/browse/NUTCH-1676
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Julien Nioche
> Fix For: 1.8
>
> Attachments: NUTCH-1676.patch
>
>
> Adding https support to our http protocol would be a good thing even if it 
> does not fully handle security. This would save us from having to use the 
> http-client plugin, which is buggy in its current form. 
> Patch generated from 
> https://github.com/Aloisius/nutch/commit/d3e15a1db0eb323ccdcf5ad69a3d3a01ec65762c#commitcomment-4720772
> Needs testing...





[jira] [Created] (NUTCH-1676) Add rudimentary SSL support to protocol-http

2013-11-28 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1676:


 Summary: Add rudimentary SSL support to protocol-http
 Key: NUTCH-1676
 URL: https://issues.apache.org/jira/browse/NUTCH-1676
 Project: Nutch
  Issue Type: Improvement
  Components: protocol
Affects Versions: 1.7
Reporter: Julien Nioche
 Fix For: 1.8


Adding https support to our http protocol would be a good thing even if it does 
not fully handle security. This would save us from having to use the 
http-client plugin, which is buggy in its current form. 

Patch generated from 
https://github.com/Aloisius/nutch/commit/d3e15a1db0eb323ccdcf5ad69a3d3a01ec65762c#commitcomment-4720772

Needs testing...



