[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2018-03-21 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407778#comment-16407778
 ] 

Semyon Semyonov commented on NUTCH-2455:


I see a conflict between this branch and master. Let me know when you want to 
merge it and I will fix the conflicts.

By the way, we ran it several times with between 100 000 and 2 000 000 hosts, 
and it worked quite well.

> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use  pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??
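
The ordering described in the quote can be sketched in plain Java, without Hadoop. This is only an illustration under assumptions: the Entry class, its fromHostDb marker and the url tie-breaker are hypothetical names, not Nutch classes. The comparator groups entries by host and places the HostDb entry ahead of the CrawlDb entries of the same host, which is the guarantee a secondary-sort comparator in the Selector job would have to provide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSketch {

  /** Hypothetical composite key: host plus a marker separating HostDb from CrawlDb entries. */
  static final class Entry {
    final String host;
    final boolean fromHostDb;
    final String url; // empty for the HostDb entry

    Entry(String host, boolean fromHostDb, String url) {
      this.host = host;
      this.fromHostDb = fromHostDb;
      this.url = url;
    }
  }

  /** Sort by host first; within a host, the HostDb entry sorts before all CrawlDb entries. */
  static final Comparator<Entry> ORDER =
      Comparator.comparing((Entry e) -> e.host)
          .thenComparingInt(e -> e.fromHostDb ? 0 : 1)
          .thenComparing(e -> e.url);

  /** Returns a sorted copy, leaving the input list untouched. */
  static List<Entry> sorted(List<Entry> entries) {
    List<Entry> copy = new ArrayList<>(entries);
    copy.sort(ORDER);
    return copy;
  }
}
```

In a real MapReduce job the same ordering would have to be combined with a partitioner and a grouping comparator that look at the host only, so that all entries of one host reach the same reduce call.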



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-03-19 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2539:
--

 Summary: Not correct naming of db.url.filters and 
db.url.normalizers in nutch-default.xml
 Key: NUTCH-2539
 URL: https://issues.apache.org/jira/browse/NUTCH-2539
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.15
Reporter: Semyon Semyonov


There is a mismatch between config and code.

In the code, CrawlDbFilter (lines 41-43) defines:
> public static final String URL_FILTERING = "crawldb.url.filters";
> public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> public static final String URL_NORMALIZING_SCOPE = 
> "crawldb.url.normalizers.scope";

 

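In nutch-default.xml:
> <property>
>   <name>db.url.normalizers</name>
>   <value>false</value>
>   <description>Normalize urls when updating crawldb</description>
> </property>
>
> <property>
>   <name>db.url.filters</name>
>   <value>false</value>
>   <description>Filter urls when updating crawldb</description>
> </property>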



These property names should match the names used in the code.
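
A minimal sketch of why the mismatch matters. It uses java.util.Properties as a stand-in for the Nutch configuration, and the default value of "false" is an assumption for illustration. A value set under the name from nutch-default.xml is never seen by code that looks up the constant from CrawlDbFilter.

```java
import java.util.Properties;

final class PropertyNameMismatchSketch {

  // Key read by the code, as quoted from CrawlDbFilter.
  static final String URL_FILTERING = "crawldb.url.filters";

  /** Looks up the filtering flag by the name the code uses, falling back to "false". */
  static boolean filteringEnabled(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(URL_FILTERING, "false"));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Name documented in nutch-default.xml: silently ignored by the lookup above.
    conf.setProperty("db.url.filters", "true");
    System.out.println(filteringEnabled(conf));
    // Name used in the code: this one takes effect.
    conf.setProperty("crawldb.url.filters", "true");
    System.out.println(filteringEnabled(conf));
  }
}
```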





[jira] [Created] (NUTCH-2538) Refactoring of Regex Url Normalizer and Bidirectional Url ExemptionFilter

2018-03-16 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2538:
--

 Summary: Refactoring of Regex Url Normalizer and Bidirectional Url 
ExemptionFilter
 Key: NUTCH-2538
 URL: https://issues.apache.org/jira/browse/NUTCH-2538
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Semyon Semyonov


NUTCH-2522 uses the same regex logic as RegexURLNormalizer. 
These plugins can be refactored to share a common base class.





[jira] [Created] (NUTCH-2537) Logical OR instead of AND in UrlExemptionFilters

2018-03-16 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2537:
--

 Summary: Logical OR instead of AND in UrlExemptionFilters
 Key: NUTCH-2537
 URL: https://issues.apache.org/jira/browse/NUTCH-2537
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Semyon Semyonov


NUTCH-2522 adds another URL exemption filter, so the filters can now be 
combined.
We should combine the ExemptionFilters with a logical OR, which is more 
reasonable than the current AND.

The following code should be modified 
(URLExemptionFilters.java, line 66):
 for (int i = 0; i < this.filters.length && exempted; i++) {
 exempted = this.filters[i].filter(fromUrl, toUrl);
 }
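
A minimal sketch of the OR-based combination proposed here, not the actual patch. BiPredicate<String, String> is used as a hypothetical stand-in for the exemption filter interface, and the class and method names are illustrative. The loop starts from "not exempted" and stops as soon as one filter exempts the link.

```java
import java.util.List;
import java.util.function.BiPredicate;

final class OrExemptionSketch {

  /** Returns true if at least one filter exempts the (fromUrl, toUrl) link. */
  static boolean isExempted(
      List<BiPredicate<String, String>> filters, String fromUrl, String toUrl) {
    boolean exempted = false;
    // OR semantics: keep asking filters only while none has exempted the link yet.
    for (int i = 0; i < filters.size() && !exempted; i++) {
      exempted = filters.get(i).test(fromUrl, toUrl);
    }
    return exempted;
  }
}
```

With no filters configured this variant returns false, so nothing is exempted by default.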





[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2018-03-12 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394972#comment-16394972
 ] 

Semyon Semyonov commented on NUTCH-1541:


[~wastl-nagel],
why not write directly to HDFS, without the local file system step? In other 
words, why not create a new file in HDFS for each reducer?
I understand that this would lower the I/O per file, but it would give control 
over how the output is distributed across multiple reducers.

> Indexer plugin to write CSV
> ---
>
> Key: NUTCH-1541
> URL: https://issues.apache.org/jira/browse/NUTCH-1541
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.7
>Reporter: Sebastian Nagel
>Priority: Minor
> Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer a simple plugin would be handy to write 
> configurable fields into a CSV file - for further analysis or just for export.





[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2018-03-08 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391262#comment-16391262
 ] 

Semyon Semyonov commented on NUTCH-1541:


Hi [~wastl-nagel]
Why wasn't this plugin merged into master? 



[jira] [Created] (NUTCH-2524) Crawl Script , if file exists in HDFS doesnt work.

2018-03-06 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2524:
--

 Summary: Crawl Script , if file exists in HDFS doesnt work.
 Key: NUTCH-2524
 URL: https://issues.apache.org/jira/browse/NUTCH-2524
 Project: Nutch
  Issue Type: Bug
  Components: bin
Reporter: Semyon Semyonov


The crawl script contains something like:
if [[ -d "$CRAWL_PATH"/hostdb ]]; then
 echo "Processing sitemaps based on hosts in HostDB"
 __bin_nutch sitemap "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb -threads $NUM_THREADS
fi

The check if [[ -d "$CRAWL_PATH"/hostdb ]]; does not work on HDFS, only in 
local mode.





[jira] [Created] (NUTCH-2522) Bidirectional URL exemption filter

2018-03-05 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2522:
--

 Summary:  Bidirectional URL exemption filter
 Key: NUTCH-2522
 URL: https://issues.apache.org/jira/browse/NUTCH-2522
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Semyon Semyonov


The current Nutch URL exemption plugin exempts based on toUrl only. The new 
plugin uses both fromUrl and toUrl: it applies a regex transformation to each 
and exempts the link when regex(fromUrl) == regex(toUrl).

This approach allows more complex URL exemption checks, such as allowing links 
like:
http://www.website.com/home -> http://website.com/about (with/without www).
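
A minimal sketch of the regex(fromUrl) == regex(toUrl) check, not the plugin's actual code. The transformation chosen here, reducing a URL to its host without an optional "www." prefix, is an assumption for illustration; the class and method names are hypothetical.

```java
final class BidirectionalExemptionSketch {

  /** Hypothetical transformation: keep only the host and drop an optional "www." prefix. */
  static String transform(String url) {
    return url.replaceAll("^https?://(?:www\\.)?([^/]+).*$", "$1");
  }

  /** Exempts the link when both URLs transform to the same string. */
  static boolean isExempted(String fromUrl, String toUrl) {
    return transform(fromUrl).equals(transform(toUrl));
  }
}
```

Under this transformation the link from http://www.website.com/home to http://website.com/about is exempted, while a link to a different host is not.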





[jira] [Comment Edited] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and descirption

2018-02-14 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364079#comment-16364079
 ] 

Semyon Semyonov edited comment on NUTCH-2510 at 2/14/18 2:50 PM:
-

I have provided the pull request.
 There are two indicator flags for the script:
 1) To update hostdb (but not use it in generate), pass --hostdbupdate
 2) To update hostdb and use it in generate, pass both --hostdbgenerate and 
--hostdbupdate


was (Author: semyon.semyo...@mail.com):
I have provided the pull request.
There are two indicator flags for the script:
1) To update hostdb(but not use it in generate) put --hostdbupdate
2) To update hostdb and use it in generate use both --hostdbgenerate 
--hostdbupdate

> Crawl script modification. HostDb : generate, optional usage and descirption
> 
>
> Key: NUTCH-2510
> URL: https://issues.apache.org/jira/browse/NUTCH-2510
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Minor
> Fix For: 1.14
>
>
> The crawl script now includes the hostdb update as part of the crawling 
> cycle, but:
> 1) There is no hostdb parameter for generate.
> 2) Generation of hostdb is not optional, so hostdb is generated at each 
> step without asking the user. It should be an optional parameter.
> 3) A description of 1 and 2 is needed.





[jira] [Comment Edited] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and descirption

2018-02-14 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364079#comment-16364079
 ] 

Semyon Semyonov edited comment on NUTCH-2510 at 2/14/18 2:50 PM:
-

I have provided the pull request.
 There are two indicator flags for the script:
 1) To update hostdb (but not use it in generate), pass --hostdbupdate
 2) To update hostdb and use it in generate, pass both --hostdbgenerate and 
--hostdbupdate


was (Author: semyon.semyo...@mail.com):
I have provided the pull request.
 There are two indicator flags for the script:
 1) To update hostdb(but not use it in generate) put --hostdbupdate
 2) To update hostdb and use it in generate use both ---hostdbgenerate-  
--hostdbupdate



[jira] [Commented] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and descirption

2018-02-14 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364079#comment-16364079
 ] 

Semyon Semyonov commented on NUTCH-2510:


I have provided the pull request.
There are two indicator flags for the script:
1) To update hostdb (but not use it in generate), pass --hostdbupdate
2) To update hostdb and use it in generate, pass both --hostdbgenerate and 
--hostdbupdate



[jira] [Created] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and descirption

2018-02-14 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2510:
--

 Summary: Crawl script modification. HostDb : generate, optional 
usage and descirption
 Key: NUTCH-2510
 URL: https://issues.apache.org/jira/browse/NUTCH-2510
 Project: Nutch
  Issue Type: Improvement
  Components: bin
Affects Versions: 1.15
Reporter: Semyon Semyonov
 Fix For: 1.14


The crawl script now includes the hostdb update as part of the crawling cycle, 
but:
1) There is no hostdb parameter for generate.

2) Generation of hostdb is not optional, so hostdb is generated at each 
step without asking the user. It should be an optional parameter.

3) A description of 1 and 2 is needed.





[jira] [Comment Edited] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions

2018-01-25 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332139#comment-16332139
 ] 

Semyon Semyonov edited comment on NUTCH-2481 at 1/25/18 4:19 PM:
-

An example of usage: using fetched deltas in generate.

1) To calculate FetchedDelta in the hostdb update:
 
 hostdb.deltaExpression
 {return new ("javafx.util.Pair","FetchedDelta", 
currentHostDatum.fetched - previousHostDatum.fetched);}
 

2) To use FetchedDelta in generate, so that websites with FetchedDelta < 5 are 
not crawled:


generate.max.count.expr 
 if(fetched > 70 && FetchedDelta < 5 ) {return 
new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", 
-1);} 
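
The effect of the generate.max.count.expr condition above can be traced with a small plain-Java simulation. This is a sketch under assumptions, not Nutch code: the class and method names are hypothetical, and the per-round page counts (1, 10, 80, 1, 1) are the ones given in the original description of this issue.

```java
final class FetchedDeltaSketch {

  /** Mirrors the condition of the expression: fetched > 70 && FetchedDelta < 5. */
  static boolean shouldStop(int fetched, int fetchedDelta) {
    return fetched > 70 && fetchedDelta < 5;
  }

  /** Returns the total number of fetched pages at the point where generation stops. */
  static int simulate(int[] pagesPerRound) {
    int previous = 0;
    int current = 0;
    for (int pages : pagesPerRound) {
      previous = current;
      current += pages;
      // Delta between the hostdb state after this round and after the previous one.
      int delta = current - previous;
      if (shouldStop(current, delta)) {
        break; // max count becomes 0, so no further pages are generated for the host
      }
    }
    return current;
  }

  public static void main(String[] args) {
    System.out.println(simulate(new int[] {1, 10, 80, 1, 1}));
  }
}
```

With these counts the condition first holds after round 4 (92 fetched, delta 1), so round 5 is never generated.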


 


was (Author: semyon.semyo...@mail.com):
An example of usage.

For example to use fetched deltas in generate.

1) To calculate FetchedDelta in the hostdb update

 hostdb.deltaExpression
 \{return new ("javafx.util.Pair","FetchedDelta", 
currentHostDatum.fetched - previousHostDatum.fetched);}


2) To use FetchedDelta in generate to not crawl the websites with FetchedDelta 
< 5


 generate.max.count.expr 
 if(fetched > 70 && FetchedDelta < 5 ) \{return 
new("java.lang.Double", 0);} else \{return conf.getDouble("generate.max.count", 
-1);} 


 

> HostDatum deltas(previous step statistics) and Metadata expressions
> ---
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
>  Issue Type: Improvement
>  Components: hostdb
>Reporter: Semyon Semyonov
>Priority: Minor
>
> To allow the usage of previous step statistics (deltas of fetched, 
> unfetched, etc.) in hostdb. The motivation is to use these statistics in 
> generate with maxCount expressions.
>  
> The solution fills in the metadata of a hostdatum from a custom JEXL 
> expression that uses two hostdatum objects: before the update 
> (previousHostDatum) and after the update (currentHostDatum).
> For example, to store the difference in the number of fetched pages between 
> rounds t and t-1, we can use the following expression:
> 
>  hostdb.deltaExpression
>  {return new ("javafx.util.Pair","FetchedDelta", 
> currentHostDatum.fetched - previousHostDatum.fetched);}
> 
> A pull request will be provided shortly.





[jira] [Updated] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate

2018-01-25 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2504:
---
Priority: Minor  (was: Major)

> Results of maxCountExpr and fetchDelayExpr should be stored in memory in 
> Generate
> -
>
> Key: NUTCH-2504
> URL: https://issues.apache.org/jira/browse/NUTCH-2504
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Minor
>
> With NUTCH-2455, the expressions maxCountExpr and fetchDelayExpr are 
> evaluated for every value, which slows the process. Instead, we can store 
> the results for each host in hostDomainCounts. 
> That takes only 2 x sizeof(long) of extra memory per host.





[jira] [Updated] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate

2018-01-25 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2504:
---
Issue Type: Improvement  (was: Sub-task)
Parent: (was: NUTCH-2455)

> Results of maxCountExpr and fetchDelayExpr should be stored in memory in 
> Generate
> -
>
> Key: NUTCH-2504
> URL: https://issues.apache.org/jira/browse/NUTCH-2504
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Major
>
> With NUTCH-2455, the expressions maxCountExpr and fetchDelayExpr are 
> evaluated for every value, which slows the process. Instead, we can store 
> the results for each host in hostDomainCounts. 
> That takes only 2 x sizeof(long) of extra memory per host.





[jira] [Updated] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate

2018-01-25 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2504:
---
Affects Version/s: 1.15



[jira] [Created] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate

2018-01-25 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2504:
--

 Summary: Results of maxCountExpr and fetchDelayExpr should be 
stored in memory in Generate
 Key: NUTCH-2504
 URL: https://issues.apache.org/jira/browse/NUTCH-2504
 Project: Nutch
  Issue Type: Sub-task
  Components: generator
Reporter: Semyon Semyonov


With NUTCH-2455, the expressions maxCountExpr and fetchDelayExpr are evaluated 
for every value, which slows the process. Instead, we can store the results 
for each host in hostDomainCounts. 

That takes only 2 x sizeof(long) of extra memory per host.
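
A minimal sketch of the caching idea, not the Generator code. A HashMap keyed by host stands in loosely for hostDomainCounts, the two ToLongFunction parameters stand in for the compiled expressions, and all names are hypothetical. Each host costs one long[2], matching the 2 x sizeof(long) estimate above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

final class ExprCacheSketch {

  private final Map<String, long[]> cache = new HashMap<>();
  private final ToLongFunction<String> maxCountExpr;
  private final ToLongFunction<String> fetchDelayExpr;

  /** Number of times the pair of expressions was actually evaluated. */
  int evaluations = 0;

  ExprCacheSketch(ToLongFunction<String> maxCountExpr, ToLongFunction<String> fetchDelayExpr) {
    this.maxCountExpr = maxCountExpr;
    this.fetchDelayExpr = fetchDelayExpr;
  }

  /** Returns {maxCount, fetchDelay}, evaluating both expressions at most once per host. */
  long[] limitsFor(String host) {
    return cache.computeIfAbsent(host, h -> {
      evaluations++;
      return new long[] {maxCountExpr.applyAsLong(h), fetchDelayExpr.applyAsLong(h)};
    });
  }
}
```

Calling limitsFor for every URL of a host then costs one map lookup after the first evaluation.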





[jira] [Commented] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions

2018-01-19 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332139#comment-16332139
 ] 

Semyon Semyonov commented on NUTCH-2481:


An example of usage: using fetched deltas in generate.

1) To calculate FetchedDelta in the hostdb update:

 hostdb.deltaExpression
 {return new ("javafx.util.Pair","FetchedDelta", 
currentHostDatum.fetched - previousHostDatum.fetched);}


2) To use FetchedDelta in generate, so that websites with FetchedDelta < 5 are 
not crawled:


 generate.max.count.expr 
 if(fetched > 70 && FetchedDelta < 5 ) {return 
new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", 
-1);} 


 



[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions

2018-01-17 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Component/s: (was: generator)



[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions

2018-01-17 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Description: 
To allow the usage of previous step statistics (deltas of fetched, unfetched, 
etc.) in hostdb. The motivation is to use these statistics in generate with 
maxCount expressions.

 

The solution fills in the metadata of a hostdatum from a custom JEXL 
expression that uses two hostdatum objects: before the update 
(previousHostDatum) and after the update (currentHostDatum).

For example, to store the difference in the number of fetched pages between 
rounds t and t-1, we can use the following expression:


 hostdb.deltaExpression
 {return new ("javafx.util.Pair","FetchedDelta", 
currentHostDatum.fetched - previousHostDatum.fetched);}


A pull request will be provided shortly.

  was:
To allow the usage of previous step statistics(deltas of fetched,unfetced etc) 
in hostdb. The motivation is usage of this statistics in generate with maxCount 
expressions.

 

The 




[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2018-01-17 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Description: 
To allow the usage of previous step statistics (deltas of fetched, unfetched, 
etc.) in hostdb. The motivation is to use these statistics in generate with 
maxCount expressions.

 

The 

  was:
To allow the usage of previous step statistics (deltas of fetched, unfetched, 
etc.) in hostdb. The motivation is to use these statistics in generate with 
maxCount expressions.

See an example below and two possible solutions.

??Let's say that for each website we have the condition: generate while the 
number of fetched < 150. 
The problem is that for some websites this condition will (almost) never be 
met, because of the site's structure. 
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page 
...etc.

I would like to add a delta condition for fetched that describes the speed of 
the process. Let's say: generate while number of fetched < 150 && 
delta_fetched > 1. 
In this case the process should stop at round 5 with the total number of 
fetched equal to 92. 
??

I see two possible solutions:
1. In the HostDatum class, keep the last step's statistics alongside the 
current statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum:
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

Both are updated in UpdateHostDb. *The main problem is space: in generate, 
HostDatum is stored in a Dictionary (RAM).*

2. 
Include metadata flag(s) in HostDatum, stored as a field in 
HostDatum (Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDB.
*The main advantage is space: we store only a flag in the db. The main problem 
is the lack of flexibility in Generate.*  
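
Solution 1 can be sketched as follows. This is an illustration under assumptions, not the HostDatum implementation: HostDatumSketch, update, fetchedDelta and withFetched are hypothetical names, and only the two statistics fields and the PagesStatistics counters come from the proposal.

```java
final class HostStatsSketch {

  /** Counters as listed in the proposal. */
  static final class PagesStatistics {
    int unfetched = 0;
    int fetched = 0;
    int notModified = 0;
    int redirTemp = 0;
    int redirPerm = 0;
    int gone = 0;

    /** Illustrative helper: statistics with only the fetched counter set. */
    static PagesStatistics withFetched(int fetched) {
      PagesStatistics s = new PagesStatistics();
      s.fetched = fetched;
      return s;
    }
  }

  /** Hypothetical stand-in for HostDatum holding current and previous step statistics. */
  static final class HostDatumSketch {
    private PagesStatistics currentStatistics = new PagesStatistics();
    private PagesStatistics previousStepStatistics = new PagesStatistics();

    /** Called once per update step: the old current statistics become the previous ones. */
    void update(PagesStatistics latest) {
      previousStepStatistics = currentStatistics;
      currentStatistics = latest;
    }

    /** Number of pages fetched during the last step. */
    int fetchedDelta() {
      return currentStatistics.fetched - previousStepStatistics.fetched;
    }
  }
}
```

The space concern is visible here: every host carries a second set of six counters.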


> HostDatum deltas(previous step statistics)
> --
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator, hostdb
>Reporter: Semyon Semyonov
>Priority: Minor
>
> To allow the usage of previous step statistics (deltas of fetched, 
> unfetched, etc.) in hostdb. The motivation is to use these statistics in 
> generate with maxCount expressions.
>  
> The 





[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions

2018-01-17 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Summary: HostDatum deltas(previous step statistics) and Metadata 
expressions  (was: HostDatum deltas(previous step statistics))

> HostDatum deltas(previous step statistics) and Metadata expressions
> ---
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator, hostdb
>Reporter: Semyon Semyonov
>Priority: Minor
>
> To allow the usage of previous step statistics (deltas of fetched, 
> unfetched, etc.) in hostdb. The motivation is to use these statistics in 
> generate with maxCount expressions.
>  
> The 





[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2018-01-17 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Priority: Minor  (was: Major)



[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2018-01-17 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Component/s: generator

> HostDatum deltas(previous step statistics)
> --
>
> Key: NUTCH-2481
> URL: https://issues.apache.org/jira/browse/NUTCH-2481
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator, hostdb
>Reporter: Semyon Semyonov
>Priority: Major
>
> To allow the usage of previous step statistics(deltas of fetched,unfetced 
> etc) in hostdb. The motivation is usage of this statistics in generate with 
> maxCount expressions.
> See an example bellow and two possible solutions.
> ??Lets say for each website we have condition of generate while number of 
> fetched < 150. 
> The problem is for some websites that condition will (almost)never be 
> finished, because of its structure. 
> 1) Round1. 1 page
> 2) Round2. 10 pages
> 3) Round3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page 
> ...etc.
> I would like to add the delta condition for fetched that describes speed of 
> the process. Lets say generate while number of fetched < 150 && delta_fetched 
> > 1. 
> Therefore in this case the process should stop on round 5 with total number 
> of fetched equals to 92. 
> ??
> I see two possible solutions :
> 1. In HostDatum class apart from current statistic include last step 
> statistics.
> class PagesStatistics
> {
>   protected int unfetched = 0;
>   protected int fetched = 0;
>   protected int notModified = 0;
>   protected int redirTemp = 0;
>   protected int redirPerm = 0;
>   protected int gone = 0;
> }
> Inside HostDatum
> private PagesStatistics currentStatistics;
> private PagesStatistics previousStepStatistics;
> And update both in UpdateHostDb. *The main problem is space: in generate,
> HostDatum is stored in a Dictionary (RAM).*
> 2.
> Include metadata flag(s) in HostDatum, stored as a field in
> HostDatum (Metadata.StopGenerate = true/false). Calculate the value of
> StopGenerate in UpdateHostDb.
> *The main advantage is space: we store only a flag in the db. The main problem
> is the lack of flexibility in Generate.*





[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2017-12-15 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Description: 
Allow the use of previous-step statistics (deltas of fetched, unfetched, etc.)
in hostdb. The motivation is to use these statistics in generate with maxCount
expressions.

See the example below and the two possible solutions.

??Let's say that for each website we generate while the number of fetched
pages < 150.
The problem is that some websites will (almost) never reach that limit,
because of their structure.
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
...etc.

I would like to add a delta condition on fetched that describes the speed of the
process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
In this case the process should stop at round 5, with the total number of
fetched pages equal to 92.
??

I see two possible solutions:
1. In the HostDatum class, store the last step's statistics in addition to the
current statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem is space: in generate,
HostDatum is stored in a Dictionary (RAM).*

2.
Include metadata flag(s) in HostDatum, stored as a field in
HostDatum (Metadata.StopGenerate = true/false). Calculate the value of
StopGenerate in UpdateHostDb.
*The main advantage is space: we store only a flag in the db. The main problem
is the lack of flexibility in Generate.*

[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)

2017-12-15 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
---
Description: 
Allow the use of previous-step statistics (deltas of fetched, unfetched, etc.)
in hostdb. The motivation is to use these statistics in generate with maxCount
expressions.

See the example below and the two possible solutions.

??Let's say that for each website we generate while the number of fetched
pages < 150.
The problem is that some websites will (almost) never reach that limit,
because of their structure.
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
...etc.

I would like to add a delta condition on fetched that describes the speed of the
process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
In this case the process should stop at round 5, with the total number of
fetched pages equal to 92.
??

I see two possible solutions:
1. In the HostDatum class, store the last step's statistics in addition to the
current statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem is space: in generate,
HostDatum is stored in a Dictionary in memory.*

2.
Include metadata flag(s) in HostDatum, stored as a field in
HostDatum (Metadata.StopGenerate = true/false). Calculate the value of
StopGenerate in UpdateHostDb.
*The main advantage is space: we store only a flag in the db. The main problem
is the lack of flexibility in Generate.*

[jira] [Created] (NUTCH-2481) HostDatum deltas(previous step statistics)

2017-12-15 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2481:
--

 Summary: HostDatum deltas(previous step statistics)
 Key: NUTCH-2481
 URL: https://issues.apache.org/jira/browse/NUTCH-2481
 Project: Nutch
  Issue Type: Improvement
  Components: hostdb
Reporter: Semyon Semyonov


Allow the use of previous-step statistics (deltas of fetched, unfetched, etc.)
in hostdb. The motivation is to use these statistics in generate with maxCount
expressions.

See the example below and the two possible solutions.

??Let's say that for each website we generate while the number of fetched
pages < 150.
The problem is that some websites will (almost) never reach that limit,
because of their structure.
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
...etc.

I would like to add a delta condition on fetched that describes the speed of the
process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
In this case the process should stop at round 5, with the total number of
fetched pages equal to 92.
??

I see two possible solutions:
1. In the HostDatum class, store the last step's statistics in addition to the
current statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem is space: in generate,
HostDatum is stored in a Dictionary in memory.*

2.
Include metadata flag(s) in HostDatum, stored as a field in
HostDatum (Metadata.StopGenerate = true/false). Calculate the value of
StopGenerate in UpdateHostDb.
*The main advantage is space: we store only a flag in the db. The main problem
is the lack of flexibility in Generate.*
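
The stop condition from the example can be sketched as a small simulation. This is only an illustration, not Nutch code: the class HostStats and the methods shouldGenerate and simulate are hypothetical names, and it assumes that the delta condition is applied only once two updates have been recorded (otherwise round 1, with a single fetched page, would already stop the process).

```java
class DeltaStopSketch {

  /** Hypothetical per-host record with current and previous-step totals (solution 1). */
  static final class HostStats {
    int fetched = 0;          // total fetched after the latest update
    int previousFetched = 0;  // total fetched after the update before that
    int updates = 0;          // number of updates seen so far

    /** Called once per round, the way UpdateHostDb would refresh the counters. */
    void update(int fetchedThisRound) {
      previousFetched = fetched;
      fetched += fetchedThisRound;
      updates++;
    }

    /** Assumption: a delta is meaningful only after at least two updates. */
    boolean hasDelta() {
      return updates >= 2;
    }

    int deltaFetched() {
      return fetched - previousFetched;
    }
  }

  /** Condition from the description: fetched < 150 && delta_fetched > 1. */
  static boolean shouldGenerate(HostStats s) {
    if (s.fetched >= 150) {
      return false;
    }
    return !s.hasDelta() || s.deltaFetched() > 1;
  }

  /** Returns {round at which generation stops, total fetched at that point}. */
  static int[] simulate(int[] pagesPerRound) {
    HostStats s = new HostStats();
    for (int round = 1; round <= pagesPerRound.length; round++) {
      if (!shouldGenerate(s)) {
        return new int[] {round, s.fetched};
      }
      s.update(pagesPerRound[round - 1]);
    }
    return new int[] {pagesPerRound.length + 1, s.fetched};
  }

  public static void main(String[] args) {
    int[] result = simulate(new int[] {1, 10, 80, 1, 1});
    System.out.println("stopped at round " + result[0] + ", fetched = " + result[1]);
  }
}
```

With the rounds from the example (1, 10, 80, 1, 1), the delta after round 4 is 1, so generation stops at round 5 with 92 pages fetched.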



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-12-08 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283775#comment-16283775
 ] 

Semyon Semyonov commented on NUTCH-2455:


[~wastl-nagel]  [~markus17]
Please have a look.

Could you please review two more issues together with this one, because
they are closely related:
https://issues.apache.org/jira/browse/NUTCH-2454
and
https://issues.apache.org/jira/browse/NUTCH-2461

Repeating from the commit message:
Three questions/modifications are left open:
1) In several places in the Nutch code we use url.getHost(), in others we use
url.getHost().toLowerCase(). Why?
2) public static class ScoreHostKeyComparator extends WritableComparator should
implement a raw comparator. If you know how to do it, you are welcome to do so.
3) The Generator file as a whole is too big; it should be split into several
files. Again, if you know a good way to do this, you are welcome to.
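
On question 1, a small sketch of why the two forms can give different results: java.net.URL returns the host as it was written, without lower-casing it. The class name, helper names and the example URL below are made up for illustration; this says nothing about which form each place in Nutch actually needs.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

class HostCaseSketch {

  /** Host exactly as java.net.URL reports it (case is preserved). */
  static String rawHost(String spec) {
    try {
      URL url = URI.create(spec).toURL();
      return url.getHost();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("not a valid URL: " + spec, e);
    }
  }

  /** Host lower-cased, so that differently cased spellings map to one key. */
  static String normalizedHost(String spec) {
    return rawHost(spec).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String spec = "http://WWW.Example.COM/index.html"; // hypothetical URL
    System.out.println(rawHost(spec));
    System.out.println(normalizedHost(spec));
  }
}
```

If hosts are used as keys (for partitioning or for HostDb lookups), mixing the two forms could make the same host appear under two different keys.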



> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host, score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??





[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-12-05 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278949#comment-16278949
 ] 

Semyon Semyonov commented on NUTCH-2455:


Hi Sebastian,
I have already started working on the solution that I proposed. What do you
think about it?
Will it work?

> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host, score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??





[jira] [Comment Edited] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-12-05 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278820#comment-16278820
 ] 

Semyon Semyonov edited comment on NUTCH-2455 at 12/5/17 4:31 PM:
-

[~wastl-nagel]
I have started to work on this issue and ran into some problems with the
combination of host and score.

You proposed:
the map function then emits key-value pairs  -> 

of course, the HostDatums must be wrapped into the value structure. It's 
already a custom class (SelectorEntry), so that should be doable
via partitioning and secondary sorting these arrive in the reduce function:
all keys with the same host in one call of the function
in the following order: first the HostDatum (just assign an artificially high 
score), then the CrawlDatum items sorted by decreasing score

In the code,
limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE) / job.getNumReduceTasks()
and in reduce it is used as follows:
    if (count == limit) {
      // do we have any segments left?
      if (currentsegmentnum < maxNumSegments) {
        count = 0;
        currentsegmentnum++;
      } else
        break;
    }

This check runs for each key in the reducer, and the keys arrive sorted by
score. Therefore the reducer takes the topN highest-scored URLs across all hosts.

With the proposed approach this no longer works, because the data becomes
sorted by host (all keys with the same host in one call of the function).

For example, take bbc.com (300 pages) and amazon.com (200 pages) with topN = 70.
Now it works as follows:
1 - call for weight 1: 20 pages from bbc.com + 10 pages from amazon.com
2 - call for weight 0.5: 5 pages from bbc.com + 35 pages from amazon.com

If we introduce "one call per host" for the hostdb, it will be:
1 - call for bbc.com: 70 pages from bbc.com, 0 from amazon.com

I'm thinking about an alternative solution:
1) Use a composite key (score, host). As the value we use SelectorEntry and add
the hostdatum to it. From the first mapper (the hostdb reader) we get only
hostdb data, from the second mapper only crawldb data.

Therefore, the combined output from the two mappers can look like this:
(1, bbc.com) - (crawl, null)
(1, bbc.com) - (crawl, null)
(0.5, bbc.com) - (crawl, null)
(null, bbc.com) - (null, hostdb)

host is the partitioner key (or domain/ip, as it works now).

2) Implement a SortComparatorClass.
If score == null, return 1, so that all keys with score == null go to the
top.

3) (Optionally) use a grouping comparator to combine all keys with
score == null into one.

After these steps, the hostdb data for all keys of the reducer should be at
the top, so we first read it and load it into memory. Afterwards we just
follow the natural order by score and check the hostdb restrictions.

What do you think about this approach?
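
The ordering in steps 1) and 2) can be sketched outside Hadoop with a plain java.util.Comparator. This is only an illustration under assumptions: Key, HOSTDB_FIRST_THEN_SCORE_DESC, partition and sortedView are hypothetical names, a null score marks a HostDb entry, and ties are broken by host name. A real job would need a WritableComparable key, a Hadoop partitioner and a sort comparator registered on the job instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CompositeKeySortSketch {

  /** Hypothetical composite key: a null score marks a HostDb entry. */
  static final class Key {
    final Float score;
    final String host;

    Key(Float score, String host) {
      this.score = score;
      this.host = host;
    }

    @Override
    public String toString() {
      return score + "@" + host;
    }
  }

  /** HostDb keys (null score) first, then CrawlDb keys by decreasing score. */
  static final Comparator<Key> HOSTDB_FIRST_THEN_SCORE_DESC = (a, b) -> {
    if (a.score == null && b.score == null) {
      return a.host.compareTo(b.host);
    }
    if (a.score == null) {
      return -1; // with this comparator, "smaller" means "earlier"
    }
    if (b.score == null) {
      return 1;
    }
    int byScore = Float.compare(b.score, a.score); // descending by score
    return byScore != 0 ? byScore : a.host.compareTo(b.host);
  };

  /** Partition by host only, so a host's HostDb and CrawlDb keys meet in one reducer. */
  static int partition(Key key, int numReducers) {
    return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  /** Sorts a copy of the keys and returns their string forms, for inspection. */
  static List<String> sortedView(List<Key> keys) {
    List<Key> copy = new ArrayList<>(keys);
    copy.sort(HOSTDB_FIRST_THEN_SCORE_DESC);
    List<String> out = new ArrayList<>();
    for (Key k : copy) {
      out.add(k.toString());
    }
    return out;
  }

  public static void main(String[] args) {
    List<Key> keys = new ArrayList<>();
    keys.add(new Key(1.0f, "bbc.com"));
    keys.add(new Key(0.5f, "bbc.com"));
    keys.add(new Key(null, "bbc.com"));
    keys.add(new Key(1.0f, "amazon.com"));
    keys.add(new Key(null, "amazon.com"));
    System.out.println(sortedView(keys));
  }
}
```

Because the score stays the primary sort criterion for CrawlDb keys, the per-reducer topN cut still applies across hosts, while the HostDb entries are all seen before the first CrawlDb entry. Whether the null case has to return 1 or -1 depends on the direction of the comparator that is actually configured.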


[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-30 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272716#comment-16272716
 ] 

Semyon Semyonov commented on NUTCH-2455:


[~wastl-nagel]
What about this step: "read HostDatums together with CrawlDatums (cf.
MultipleInputFormat, depends on NUTCH-2375) as input of the select step"?
Isn't it simpler to read the HostDb in each mapper separately?

> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host, score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??





[jira] [Issue Comment Deleted] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.

2017-11-30 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2465:
---
Comment: was deleted

(was: It seems the fix is more complex than I thought.
Please review and fix accordingly.

There are some problems with 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
)

> Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
> ---
>
> Key: NUTCH-2465
> URL: https://issues.apache.org/jira/browse/NUTCH-2465
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Semyon Semyonov
> Fix For: 1.14
>
>
> With the latest version of develop the Eclipse project doesn't work anymore.
> There are two sets of problems:
> 1) Classpath problems 
> 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the 
> code. Should be replaced by 
> org.apache.nutch.protocol.interactiveselenium.handlers 





[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-29 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271137#comment-16271137
 ] 

Semyon Semyonov commented on NUTCH-2455:


Thanks for the suggestion, [~wastl-nagel].

The main question is: why not change the key of the reducer to the host instead?
As far as I can see, the reducers reduce on the sort value, but it is never
used in the reducer itself.
output.collect(key, entry); is the only call that uses the key.

Why not perform the host parsing in the mapper and then reduce on that key
(which means all the values of a host go to the same reducer)?

> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host, score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??





[jira] [Commented] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.

2017-11-27 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266865#comment-16266865
 ] 

Semyon Semyonov commented on NUTCH-2465:


It seems the fix is more complex than I thought.
Please review and fix accordingly.

There are some problems with 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java


> Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
> ---
>
> Key: NUTCH-2465
> URL: https://issues.apache.org/jira/browse/NUTCH-2465
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Semyon Semyonov
> Fix For: 1.14
>
>
> With the latest version of develop the Eclipse project doesn't work anymore.
> There are two sets of problems:
> 1) Classpath problems 
> 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the 
> code. Should be replaced by 
> org.apache.nutch.protocol.interactiveselenium.handlers 





[jira] [Created] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.

2017-11-27 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2465:
--

 Summary: Broken Eclipse project. Classpaths and 
interactiveselenium should be fixed.
 Key: NUTCH-2465
 URL: https://issues.apache.org/jira/browse/NUTCH-2465
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.14
Reporter: Semyon Semyonov
 Fix For: 1.14


With the latest version of develop the Eclipse project doesn't work anymore.

There are two sets of problems:
1) Classpath problems 
2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the 
code. Should be replaced by 
org.apache.nutch.protocol.interactiveselenium.handlers 





[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-14 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251388#comment-16251388
 ] 

Semyon Semyonov commented on NUTCH-2368:


Added NUTCH-2461 with proposed solution in the description

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch
>
>
> In some cases we need to use host-specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl hosts with up to 800k URLs.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as can be fetched, based on the 
> number of threads, the 95th percentile response time and the fetch time 
> limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
> The second expression just follows up on that, setting the crawlDelay of the 
> fetch queue.
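
As a rough worked example of the two expressions quoted above, the following Java sketch redoes the arithmetic by hand. The class and method names are hypothetical, the inputs (a 12-minute time limit, a 95th percentile response time of 1500 ms, one thread per queue) are made-up illustration values, and a real Jexl evaluation may treat numeric types differently.

```java
// Hypothetical helper that mirrors the arithmetic of the two Jexl expressions.
// It is not part of Nutch; all names here are illustrative assumptions.
final class GenerateExprSketch {

  /** Max URLs per host: (time limit in seconds / seconds per fetch) * threads. */
  static long maxCount(int timeLimitMins, double pct95ResponseMs, int threadsPerQueue) {
    double secondsPerFetch = (pct95ResponseMs + 500) / 1000.0;
    return (long) ((timeLimitMins * 60) / secondsPerFetch * threadsPerQueue);
  }

  /** Fetch delay in milliseconds: 95th percentile response time plus 500 ms. */
  static long fetchDelayMs(double pct95ResponseMs) {
    return (long) (pct95ResponseMs + 500);
  }

  public static void main(String[] args) {
    // 12 min * 60 = 720 s; (1500 + 500) / 1000 = 2 s per fetch; 720 / 2 * 1 = 360
    System.out.println(maxCount(12, 1500, 1));
    // 1500 + 500 = 2000 ms
    System.out.println(fetchDelayMs(1500));
  }
}
```

With these assumed inputs a slow host gets a smaller per-host quota and a longer delay, which is the behaviour the description asks for.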





[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-14 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251386#comment-16251386
 ] 

Semyon Semyonov commented on NUTCH-2368:


The critical bug for maxCount equals to 0



[jira] [Issue Comment Deleted] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-14 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2368:
---
Comment: was deleted

(was: The critical bug for maxCount equals to 0)



[jira] [Updated] (NUTCH-2461) Generate passes the data to when maxCount == 0

2017-11-14 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2461:
---
Summary: Generate passes the data to when maxCount  == 0  (was: Generate 
pass the data to when maxCount  == 0)

> Generate passes the data to when maxCount  == 0
> ---
>
> Key: NUTCH-2461
> URL: https://issues.apache.org/jira/browse/NUTCH-2461
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.14
>Reporter: Semyon Semyonov
>Priority: Critical
> Fix For: 1.14
>
>
> The generator checks the condition if (maxCount > 0) at line 421 and stops 
> generating for a host when the amount per host exceeds maxCount (continue, 
> line 455).
> But when maxCount == 0 it skips that check and goes directly to line 465: 
> output.collect(key, entry);
> This is not correct. The correct solution would be to add 
> if (maxCount == 0) {
>   continue;
> }
> at line 380.





[jira] [Created] (NUTCH-2461) Generate pass the data to when maxCount == 0

2017-11-14 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2461:
--

 Summary: Generate pass the data to when maxCount  == 0
 Key: NUTCH-2461
 URL: https://issues.apache.org/jira/browse/NUTCH-2461
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.14
Reporter: Semyon Semyonov
Priority: Critical
 Fix For: 1.14


The generator checks the condition if (maxCount > 0) at line 421 and stops 
generating for a host when the amount per host exceeds maxCount (continue, 
line 455).
But when maxCount == 0 it skips that check and goes directly to line 465: 
output.collect(key, entry);

This is not correct. The correct solution would be to add 
if (maxCount == 0) {
  continue;
}
at line 380.
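
The control flow described above can be sketched in isolation. This is a simplified, hypothetical model and not the Generator code: the class name, the method name, the host/URL pairs, and the convention that a missing or negative limit means "unlimited" are assumptions. Only the early continue for maxCount == 0 reflects the proposal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of per-host selection; not the Nutch Generator.
final class MaxCountGuardSketch {

  /**
   * Emits URLs while respecting a per-host limit.
   * Assumption: a missing or negative limit means "unlimited", 0 means "emit nothing".
   */
  static List<String> select(List<String[]> hostUrlPairs, Map<String, Integer> maxCountByHost) {
    Map<String, Integer> emittedPerHost = new HashMap<>();
    List<String> output = new ArrayList<>();
    for (String[] pair : hostUrlPairs) {
      String host = pair[0];
      String url = pair[1];
      int maxCount = maxCountByHost.getOrDefault(host, -1);

      // Proposed guard: without it, maxCount == 0 bypasses the quota check
      // below and the URL is emitted anyway.
      if (maxCount == 0) {
        continue;
      }
      if (maxCount > 0) {
        int emitted = emittedPerHost.getOrDefault(host, 0);
        if (emitted >= maxCount) {
          continue; // host quota exhausted
        }
        emittedPerHost.put(host, emitted + 1);
      }
      output.add(url);
    }
    return output;
  }

  public static void main(String[] args) {
    List<String[]> input = new ArrayList<>();
    input.add(new String[] {"a.example", "http://a.example/1"});
    input.add(new String[] {"a.example", "http://a.example/2"});
    input.add(new String[] {"b.example", "http://b.example/1"});
    Map<String, Integer> limits = new HashMap<>();
    limits.put("a.example", 1);
    limits.put("b.example", 0);
    System.out.println(select(input, limits)); // prints [http://a.example/1]
  }
}
```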





[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-14 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251359#comment-16251359
 ] 

Semyon Semyonov commented on NUTCH-2368:


I found a nasty bug that breaks the feature completely.

The generator collects the URL if maxCount == 0, because the condition at line 
421 is if (maxCount > 0) instead of >= 0.

I propose adding a check for this condition:
 if (maxCount == 0) {
   continue;
 }

Could you check it ASAP?




[jira] [Updated] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-09 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2455:
---
Attachment: NUTCH-2455.patch

The proposed patch is attached.

However, I'm not sure that this is the best way to copy the values of objects 
in Java:
key.toString(), (HostDatum)value.clone()
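
For context on why a copy is needed at all: Hadoop readers typically reuse the same key and value instances across calls to next, so storing the references directly would leave every stored entry pointing at the last record read. The sketch below reproduces that effect with a hypothetical mutable Datum class and a hand-written reader method. It does not use the Nutch or Hadoop classes and is not taken from the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for a reusable record and reader; not Hadoop or Nutch classes.
final class ReusedValueSketch {

  static final class Datum implements Cloneable {
    int score;

    @Override
    public Datum clone() {
      try {
        return (Datum) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e); // cannot happen: Datum is Cloneable
      }
    }
  }

  /** Fills the caller's key and Datum in place, the way a reusing reader would. */
  static boolean next(int index, StringBuilder key, Datum value) {
    String[] hosts = {"a.example", "b.example"};
    int[] scores = {1, 2};
    if (index >= hosts.length) {
      return false;
    }
    key.setLength(0);
    key.append(hosts[index]);
    value.score = scores[index];
    return true;
  }

  /** Reads all records into a map, either copying each value or storing the shared instance. */
  static Map<String, Datum> load(boolean copy) {
    Map<String, Datum> result = new HashMap<>();
    StringBuilder key = new StringBuilder();
    Datum value = new Datum();
    for (int i = 0; next(i, key, value); i++) {
      result.put(key.toString(), copy ? value.clone() : value);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(load(false).get("a.example").score); // 2: overwritten by the last record
    System.out.println(load(true).get("a.example").score);  // 1: the copy kept its own state
  }
}
```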

> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use  pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??





[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-08 Thread Semyon Semyonov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244223#comment-16244223
 ] 

Semyon Semyonov commented on NUTCH-2368:


I found a bug. The hostdbReaders streams are not reset for each key of the 
reducer.

Assume we have four hosts in the HostDb:
A B C D

The first time, the reducer calls reduce for website C. After 
hostdbReaders[i].next has advanced to C, the only entry left to read is D.
The second time we are looking for B, but the only entry left is D. Therefore 
the lookup finds nothing and the result is null.
The same holds for all following keys of the reducer: the HostDb entry is null. 

private HostDatum getHostDatum(String host) throws Exception {
  Text key = new Text();
  HostDatum value = new HostDatum();

  for (int i = 0; i < hostdbReaders.length; i++) {
    while (hostdbReaders[i].next(key, value)) {
      if (host.equals(key.toString())) {
        return value;
      }
    }
  }
  return null;
}

What do you think is the best method to solve it? Recreate the readers each 
time?
  Path path = new Path(job.get(GENERATOR_HOSTDB), "current");
  hostdbReaders = SequenceFileOutputFormat.getReaders(job, path);
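
The effect can be reproduced without Hadoop. In the sketch below a plain Iterator stands in for the forward-only reader. The fix shown (reading all hosts into memory once) is only one possible approach, it assumes the data fits in memory, and it is not necessarily what the eventual patch does. All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical model of a forward-only reader; not the Nutch or Hadoop API.
final class ForwardOnlyLookupSketch {

  private final Iterator<String> reader; // never rewound, like the shared readers
  private final Set<String> loaded;      // filled once up front

  ForwardOnlyLookupSketch(List<String> hosts) {
    this.reader = hosts.iterator();
    this.loaded = new HashSet<>(hosts);
  }

  /** Buggy variant: continues scanning from where the previous call stopped. */
  boolean foundByScan(String host) {
    while (reader.hasNext()) {
      if (host.equals(reader.next())) {
        return true;
      }
    }
    return false;
  }

  /** One possible fix: look the host up in the preloaded set. */
  boolean foundInMemory(String host) {
    return loaded.contains(host);
  }

  public static void main(String[] args) {
    ForwardOnlyLookupSketch db =
        new ForwardOnlyLookupSketch(Arrays.asList("A", "B", "C", "D"));
    System.out.println(db.foundByScan("C"));   // true: A and B are consumed on the way
    System.out.println(db.foundByScan("B"));   // false: only D is left to read
    System.out.println(db.foundInMemory("B")); // true
  }
}
```

Recreating the readers for every key, as asked above, would also avoid the stale position, at the cost of rescanning the files for each host.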



[jira] [Updated] (NUTCH-2454) REST API fix for usage of hostdb in generator

2017-11-03 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2454:
---
Attachment: NUTCH-2368_RESTAPI_Fix.patch

> REST API fix for usage of hostdb in generator
> -
>
> Key: NUTCH-2454
> URL: https://issues.apache.org/jira/browse/NUTCH-2454
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.12
>Reporter: Semyon Semyonov
>Priority: Normal
> Fix For: 1.14
>
> Attachments: NUTCH-2368_RESTAPI_Fix.patch
>
>
> NUTCH-2368: Variable generate.max.count and fetcher.server.delay





[jira] [Created] (NUTCH-2454) REST API fix for usage of hostdb in generator

2017-11-03 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2454:
--

 Summary: REST API fix for usage of hostdb in generator
 Key: NUTCH-2454
 URL: https://issues.apache.org/jira/browse/NUTCH-2454
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.12
Reporter: Semyon Semyonov
Priority: Normal
 Fix For: 1.14


NUTCH-2368: Variable generate.max.count and fetcher.server.delay





[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-11-03 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2368:
---
Attachment: NUTCH-2368_RESTAPI_Fix.patch

There was a problem with the REST API client, because the API uses a different 
run method and this method didn't include the hostdb parameter.

The patch fixes this problem.
It may have some problems with the line offsets, because I'm not so fluent with 
SVN yet.

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch
>


[jira] [Updated] (NUTCH-2441) ARG_SEGMENT usage

2017-10-16 Thread Semyon Semyonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2441:
---
Attachment: metadataARG_SEGMENT.patch

> ARG_SEGMENT usage
> -
>
> Key: NUTCH-2441
> URL: https://issues.apache.org/jira/browse/NUTCH-2441
> Project: Nutch
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 1.13
>Reporter: Semyon Semyonov
> Fix For: 1.14
>
> Attachments: metadataARG_SEGMENT.patch
>
>
> The constant public static final String ARG_SEGMENT = "segment" in 
> metadata/Nutch.java is not used consistently. In some cases (Fetcher and 
> ParseSegment) it is interpreted as a single segment, in others (CrawlDb, 
> LinkDb, IndexingJob) as an array of segments. This mismatch leads to 
> inconsistent usage of the parameter.
> After a discussion with [~wastl-nagel], the proposed solution is to allow 
> both an array and a string in all cases. That avoids introducing breaking 
> changes.
> A patch is proposed.
>  *The remaining question is refactoring: all five components share the same 
> code (two versions of the same code, to be precise). Shouldn't we extract a 
> method and reduce duplication?*





[jira] [Created] (NUTCH-2441) ARG_SEGMENT usage

2017-10-16 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2441:
--

 Summary: ARG_SEGMENT usage
 Key: NUTCH-2441
 URL: https://issues.apache.org/jira/browse/NUTCH-2441
 Project: Nutch
  Issue Type: Improvement
  Components: metadata
Affects Versions: 1.13
Reporter: Semyon Semyonov
 Fix For: 1.14


The constant public static final String ARG_SEGMENT = "segment" in 
metadata/Nutch.java is not used consistently. In some cases (Fetcher and 
ParseSegment) it is interpreted as a single segment, in others (CrawlDb, 
LinkDb, IndexingJob) as an array of segments. This mismatch leads to 
inconsistent usage of the parameter.

After a discussion with [~wastl-nagel], the proposed solution is to allow both 
an array and a string in all cases. That avoids introducing breaking changes.

A patch is proposed.

 *The remaining question is refactoring: all five components share the same 
code (two versions of the same code, to be precise). Shouldn't we extract a 
method and reduce duplication?*
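
A shared helper along the lines of the refactoring question might look like the sketch below. The class name, the method name, and the choice to accept a String, a String[] or null are assumptions for illustration; the attached patch may handle the argument differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not taken from the Nutch code base or the attached patch.
final class SegmentArgSketch {

  /** Normalizes a "segment" argument that may be a single String or a String[]. */
  static List<String> toSegmentList(Object segmentArg) {
    if (segmentArg == null) {
      return Collections.emptyList();
    }
    if (segmentArg instanceof String) {
      return Collections.singletonList((String) segmentArg);
    }
    if (segmentArg instanceof String[]) {
      return Arrays.asList((String[]) segmentArg);
    }
    throw new IllegalArgumentException(
        "Unsupported segment argument type: " + segmentArg.getClass().getName());
  }

  public static void main(String[] args) {
    System.out.println(toSegmentList("crawl/segments/1"));
    System.out.println(toSegmentList(new String[] {"crawl/segments/1", "crawl/segments/2"}));
  }
}
```

Each of the five components could then call the one helper and iterate over the resulting list, whichever form the caller passed in.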


