[jira] [Updated] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-21 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2536:
--
Environment: Non-distributed, single node, standalone Nutch jobs run in a 
single JVM with HBase as the data store. 2.3.1

> GeneratorReducer.count is a static variable
> ---
>
> Key: NUTCH-2536
> URL: https://issues.apache.org/jira/browse/NUTCH-2536
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>Reporter: Ben Vachon
>Priority: Minor
>  Labels: Generate
> Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means 
> that if the GeneratorJob is run multiple times within the same JVM, it will 
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration 
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value, 
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the 
> script starts a new JVM each time. I'm not using the script; I'm calling the 
> Java classes directly (using the ToolRunner) within an existing JVM, so I'm 
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class 
> as it's run by the script.
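
The effect described above can be reproduced outside Nutch with a minimal sketch. The names `CountDemo`, `StaticCountReducer`, `InstanceCountReducer` and `accept` are hypothetical stand-ins and not the actual GeneratorReducer code; they only model a counter that is checked against topN.

```java
// Hypothetical stand-ins for illustration only; this is not the Nutch source.
final class CountDemo {

    /** Models the reported behavior: the counter is shared JVM-wide. */
    static final class StaticCountReducer {
        private static long count = 0; // survives across job runs in one JVM
        private final long topN;

        StaticCountReducer(long topN) { this.topN = topN; }

        /** Returns true if one more URL still fits under topN. */
        boolean accept() {
            if (count >= topN) return false;
            count++;
            return true;
        }
    }

    /** Models the proposed change: one counter per reducer instance. */
    static final class InstanceCountReducer {
        private long count = 0; // starts at zero for every new reducer
        private final long topN;

        InstanceCountReducer(long topN) { this.topN = topN; }

        boolean accept() {
            if (count >= topN) return false;
            count++;
            return true;
        }
    }

    /** Simulates one job run that offers `offered` URLs to a fresh reducer. */
    static int runStatic(long topN, int offered) {
        StaticCountReducer reducer = new StaticCountReducer(topN);
        int accepted = 0;
        for (int i = 0; i < offered; i++) {
            if (reducer.accept()) accepted++;
        }
        return accepted;
    }

    static int runInstance(long topN, int offered) {
        InstanceCountReducer reducer = new InstanceCountReducer(topN);
        int accepted = 0;
        for (int i = 0; i < offered; i++) {
            if (reducer.accept()) accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("static,   run 1: " + runStatic(3, 5));   // 3
        System.out.println("static,   run 2: " + runStatic(3, 5));   // 0, cap already reached
        System.out.println("instance, run 1: " + runInstance(3, 5)); // 3
        System.out.println("instance, run 2: " + runInstance(3, 5)); // 3
    }
}
```

With the static field, the second run in the same JVM admits nothing because the cap was already consumed by the first run; with the instance field, every run gets its own budget of topN.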



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2540) Support Generic Deduplication in Nutch 2.x

2018-03-21 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2540:
--
Environment: (was: Non-distributed, single node, standalone Nutch jobs 
run in a single JVM with HBase as the data store. 2.3.1)

> Support Generic Deduplication in Nutch 2.x
> --
>
> Key: NUTCH-2540
> URL: https://issues.apache.org/jira/browse/NUTCH-2540
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Ben Vachon
>Priority: Major
>  Labels: dedupe
> Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use-case for Nutch required deduplication so I wrote custom code that 
> checks for duplicates based on digest and deletes them at index time. I 
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to Deduplication. There's plenty of room to 
> improve it.
> This change adds a new DataStore for Duplicate entries, which are just lists 
> of URLs with signatures as keys.
> A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
> WebPages into the Duplicate DataStore.
> Since the key of the Duplicate store is the digest field of the WebPage store 
> entries, duplicate matching can be configured via extension of the Signature 
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default). 
> If this flag is used, the IndexingJob will check the Duplicate DataStore for 
> duplicate URLs and run pluggable DuplicateFilters to determine which URL 
> belongs to the original WebPage. It then skips the WebPage if it is not the 
> original, or deletes the other pages from the index if it is.
> I've also added a BasicDuplicateFilter plugin class that considers the URL 
> with the shortest path to be the original.
> Eventually, it would be best to consider things like score and fetch time 
> when determining which WebPage to keep and which to remove.
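
The shortest-path rule mentioned above can be sketched in a few lines of plain Java. `ShortestPathPicker`, `pathLength` and `pickOriginal` are hypothetical names chosen for illustration and are not the BasicDuplicateFilter from the patch; breaking ties lexicographically is an assumption, since the description does not say how ties are handled.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not the BasicDuplicateFilter from the patch.
final class ShortestPathPicker {

    /** Length of the URL's path component; opaque URIs count as length 0. */
    static int pathLength(String url) {
        String path = URI.create(url).getPath();
        return path == null ? 0 : path.length();
    }

    /** Picks the URL with the shortest path; ties break lexicographically (an assumption). */
    static String pickOriginal(List<String> urls) {
        return urls.stream()
                .min(Comparator.<String>comparingInt(ShortestPathPicker::pathLength)
                        .thenComparing(Comparator.<String>naturalOrder()))
                .orElseThrow(() -> new IllegalArgumentException("no URLs given"));
    }

    public static void main(String[] args) {
        // example.com URLs are placeholders.
        List<String> duplicates = Arrays.asList(
                "http://example.com/a/b/page.html",
                "http://example.com/page.html",
                "http://example.com/a/page.html");
        System.out.println(pickOriginal(duplicates)); // http://example.com/page.html
    }
}
```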





[jira] [Updated] (NUTCH-2540) Support Generic Deduplication in Nutch 2.x

2018-03-21 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2540:
--
Environment: Non-distributed, single node, standalone Nutch jobs run in a 
single JVM with HBase as the data store. 2.3.1

> Support Generic Deduplication in Nutch 2.x
> --
>
> Key: NUTCH-2540
> URL: https://issues.apache.org/jira/browse/NUTCH-2540
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>Reporter: Ben Vachon
>Priority: Major
>  Labels: dedupe
> Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use-case for Nutch required deduplication so I wrote custom code that 
> checks for duplicates based on digest and deletes them at index time. I 
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to Deduplication. There's plenty of room to 
> improve it.
> This change adds a new DataStore for Duplicate entries, which are just lists 
> of URLs with signatures as keys.
> A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
> WebPages into the Duplicate DataStore.
> Since the key of the Duplicate store is the digest field of the WebPage store 
> entries, duplicate matching can be configured via extension of the Signature 
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default). 
> If this flag is used, the IndexingJob will check the Duplicate DataStore for 
> duplicate URLs and run pluggable DuplicateFilters to determine which URL 
> belongs to the original WebPage. It then skips the WebPage if it is not the 
> original, or deletes the other pages from the index if it is.
> I've also added a BasicDuplicateFilter plugin class that considers the URL 
> with the shortest path to be the original.
> Eventually, it would be best to consider things like score and fetch time 
> when determining which WebPage to keep and which to remove.
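
Conceptually, the Duplicate store described above is a map from a signature to the URLs that share it. The following in-memory sketch assumes an MD5 hash of the raw page content as the signature; `DuplicateIndex`, `digest` and `groupByDigest` are hypothetical names, and the real change would compute digests through Nutch's Signature implementations and persist them in a DataStore rather than a `Map`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a "signature -> URLs" duplicate store.
final class DuplicateIndex {

    /** Hex-encoded MD5 of the content; MD5 is an assumption made for this sketch. */
    static String digest(String content) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Groups URLs (map keys) by the digest of their content (map values). */
    static Map<String, List<String>> groupByDigest(Map<String, String> pages) {
        Map<String, List<String>> byDigest = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            byDigest.computeIfAbsent(digest(page.getValue()), k -> new ArrayList<>())
                    .add(page.getKey());
        }
        return byDigest;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "same body");
        pages.put("http://example.com/b", "same body");
        pages.put("http://example.com/c", "other body");
        // Any list with more than one URL is a set of duplicates.
        groupByDigest(pages).forEach((sig, urls) -> System.out.println(sig + " -> " + urls));
    }
}
```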





[jira] [Commented] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408075#comment-16408075
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2541:
---

I've tested against master, I still see the same issue:

{code}
➜  local (master) ✔ bin/nutch parsechecker 
"http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: 
http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
Fetch failed with protocol status: exception(16), lastModified=0: 
java.lang.IllegalArgumentException: Invalid uri 
'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html':
 escaped absolute path not valid
{code}


> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters, Nutch fails with an 
> {{IllegalArgumentException}}. This happens because the HTTP client library 
> internally uses {{java.net.URI}}, which does not support these characters 
> unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225
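
For reference, one way to obtain a properly escaped form with the JDK alone is to build a `java.net.URI` from unescaped components and call `toASCIIString()`. This is only a sketch of the escaping step, not the fix adopted in Nutch; `PathEscaper` and `toAsciiUrl` are hypothetical names and example.com is a placeholder host.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; a sketch of percent-encoding a non-ASCII path.
final class PathEscaper {

    /**
     * Builds an ASCII-only URL from unescaped components.
     * The multi-argument URI constructor quotes illegal characters such as
     * spaces, and toASCIIString() percent-encodes non-ASCII characters as UTF-8.
     * The path must be unescaped: a literal '%' would itself be quoted.
     */
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build URI", e);
        }
    }

    public static void main(String[] args) {
        // \u067E is the Arabic-script letter PEH, written as an escape to
        // keep this source file ASCII-only.
        System.out.println(toAsciiUrl("http", "example.com", "/some other/\u067E.html"));
        // prints http://example.com/some%20other/%D9%BE.html
    }
}
```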





[jira] [Comment Edited] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407880#comment-16407880
 ] 

Jorge Luis Betancourt Gonzalez edited comment on NUTCH-2541 at 3/21/18 3:18 PM:


Normal characters (like a space) in the URL path {{/some other}} (the space) 
are escaped without any issues.


was (Author: jorgelbg):
Normal characters (like a space) in the URL path {{/some other}} the space is 
escaped without any issues.

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters, Nutch fails with an 
> {{IllegalArgumentException}}. This happens because the HTTP client library 
> internally uses {{java.net.URI}}, which does not support these characters 
> unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Comment Edited] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407981#comment-16407981
 ] 

Markus Jelsma edited comment on NUTCH-2541 at 3/21/18 3:17 PM:
---

This is probably not a 1.14 problem, we fixed it some versions ago in the 
BasicURLNormalizer.
Edit: I can't seem to find the ticket; I bet Sebastian can!


was (Author: markus17):
This is probably not a 1.14 problem, we fixed it some versions ago in the 
BasicURLNormalizer.

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters, Nutch fails with an 
> {{IllegalArgumentException}}. This happens because the HTTP client library 
> internally uses {{java.net.URI}}, which does not support these characters 
> unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Commented] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407981#comment-16407981
 ] 

Markus Jelsma commented on NUTCH-2541:
--

This is probably not a 1.14 problem, we fixed it some versions ago in the 
BasicURLNormalizer.

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters, Nutch fails with an 
> {{IllegalArgumentException}}. This happens because the HTTP client library 
> internally uses {{java.net.URI}}, which does not support these characters 
> unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Commented] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407880#comment-16407880
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2541:
---

Normal characters (like a space) in the URL path, e.g. {{/some other}}, are 
escaped without any issues.

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters, Nutch fails with an 
> {{IllegalArgumentException}}. This happens because the HTTP client library 
> internally uses {{java.net.URI}}, which does not support these characters 
> unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Updated] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2541:
--
Summary: Arabic characters in the URL path are not properly escaped by the 
protocol-httpclient plugin  (was: Arabic characters in the URL are not properly 
escaped by the protocol-httpclient plugin)

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters, Nutch fails with an 
> {{IllegalArgumentException}}. This happens because the HTTP client library 
> internally uses {{java.net.URI}}, which does not support these characters 
> unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Created] (NUTCH-2541) Arabic characters in the URL are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2541:
-

 Summary: Arabic characters in the URL are not properly escaped by 
the protocol-httpclient plugin
 Key: NUTCH-2541
 URL: https://issues.apache.org/jira/browse/NUTCH-2541
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Affects Versions: 1.14, 2.3.1
Reporter: Jorge Luis Betancourt Gonzalez


As reported on [1] 

When trying to crawl some URLs with Arabic characters, Nutch fails with an 
{{IllegalArgumentException}}. This happens because the HTTP client library 
internally uses {{java.net.URI}}, which does not support these characters 
unless they're properly escaped.

[1] 
https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2018-03-21 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407778#comment-16407778
 ] 

Semyon Semyonov commented on NUTCH-2455:


I see a conflict between this branch and master; let me know when you want to 
merge it and I'll resolve it.

By the way, we ran it several times with between 100,000 and 2,000,000 hosts, 
and it worked quite well.

> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use  pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??
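
The ordering guarantee in the quoted proposal can be illustrated without Hadoop. In the sketch below the composite key is assumed to be a host plus a source tag (the exact pair is missing from the quote above), and `SecondarySortSketch`, `Entry`, `Source` and `sortForReducer` are hypothetical names. In an actual MapReduce job the same ordering would live in a sort comparator, combined with a partitioner and grouping comparator keyed on the host.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical plain-Java model of secondary sorting; no Hadoop types involved.
final class SecondarySortSketch {

    /** Declaration order matters: HOSTDB has the lower ordinal, so it sorts first. */
    enum Source { HOSTDB, CRAWLDB }

    static final class Entry {
        final String host;
        final Source source;
        final String payload;

        Entry(String host, Source source, String payload) {
            this.host = host;
            this.source = source;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return host + " " + source + " " + payload;
        }
    }

    /** Primary order by host, secondary order by source (HostDb before CrawlDb). */
    static final Comparator<Entry> HOST_THEN_SOURCE =
            Comparator.comparing((Entry e) -> e.host)
                    .thenComparing((Entry e) -> e.source);

    /** Returns a sorted copy, in the order a reducer would see the values. */
    static List<Entry> sortForReducer(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(HOST_THEN_SOURCE);
        return sorted;
    }

    public static void main(String[] args) {
        List<Entry> input = Arrays.asList(
                new Entry("b.example", Source.CRAWLDB, "url-3"),
                new Entry("a.example", Source.CRAWLDB, "url-1"),
                new Entry("b.example", Source.HOSTDB, "host-stats"),
                new Entry("a.example", Source.HOSTDB, "host-stats"));
        // Each host's HostDb entry is printed before its CrawlDb entries.
        sortForReducer(input).forEach(System.out::println);
    }
}
```

Because the HostDb entry arrives first for each host, a reducer can read the host-level data once and then stream through that host's CrawlDb entries without buffering them.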


