[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407778#comment-16407778 ] Semyon Semyonov commented on NUTCH-2455: I see a conflict between this branch and master; let me know when you want to merge it and I'll fix them. By the way, we ran it several times with between 100,000 and 2,000,000 hosts, and it worked quite well. > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma >Priority: Major > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
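Sebastian's suggestion can be sketched in plain Java, with the Hadoop classes omitted so the example stays self-contained: partition on the host alone (so all records of a host reach one reducer), and sort by (host, kind) so the HostDb entry precedes the CrawlDb entries of the same host. The Key class and the kind flag below are illustrative assumptions, not the patch's actual types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch (not the actual patch): secondary sort so that, per host, the
// HostDb entry sorts before all CrawlDb entries. Key = (host, kind),
// where kind 0 marks the HostDb record and kind 1 a CrawlDb record.
public class HostKeySort {
    static final class Key {
        final String host; final int kind; final String url;
        Key(String host, int kind, String url) {
            this.host = host; this.kind = kind; this.url = url;
        }
    }

    // Partitioner analogue: hash the host only, so every record of a host
    // lands in the same reducer regardless of the secondary part of the key.
    static int partition(Key k, int numReducers) {
        return (k.host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort comparator analogue: host first, then kind, so the HostDb entry
    // (kind 0) arrives ahead of every CrawlDb entry (kind 1) of that host.
    static final Comparator<Key> SORT = Comparator
        .comparing((Key k) -> k.host)
        .thenComparingInt(k -> k.kind);

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>(List.of(
            new Key("a.com", 1, "http://a.com/p1"),
            new Key("a.com", 0, "HOSTDB"),
            new Key("b.com", 1, "http://b.com/p1")));
        keys.sort(SORT);
        // After sorting, the first a.com record is the HostDb entry.
        System.out.println(keys.get(0).kind); // 0
    }
}
```

In real Hadoop code the same effect comes from a custom Partitioner plus a sort comparator on the composite key; the reducer can then read the HostDb entry first and apply it to the CrawlDb entries that follow.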
[jira] [Created] (NUTCH-2539) Incorrect naming of db.url.filters and db.url.normalizers in nutch-default.xml
Semyon Semyonov created NUTCH-2539: -- Summary: Incorrect naming of db.url.filters and db.url.normalizers in nutch-default.xml Key: NUTCH-2539 URL: https://issues.apache.org/jira/browse/NUTCH-2539 Project: Nutch Issue Type: Improvement Affects Versions: 1.15 Reporter: Semyon Semyonov There is a mismatch between the configuration and the code. In CrawlDbFilter, lines 41-43: > public static final String URL_FILTERING = "crawldb.url.filters"; > public static final String URL_NORMALIZING = "crawldb.url.normalizers"; > public static final String URL_NORMALIZING_SCOPE = > "crawldb.url.normalizers.scope"; In nutch-default.xml: > > db.url.normalizers > false > Normalize urls when updating crawldb > > > > db.url.filters > false > Filter urls when updating crawldb > These property names should be brought in line with the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
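A sketch of what the aligned nutch-default.xml entries could look like, assuming the code's crawldb.* constants are kept and the XML is renamed to match (the direction of the rename is an assumption; the issue only asks for consistency):

```xml
<!-- Sketch: property names aligned with CrawlDbFilter's constants
     (crawldb.url.filters / crawldb.url.normalizers). -->
<property>
  <name>crawldb.url.filters</name>
  <value>false</value>
  <description>Filter urls when updating crawldb</description>
</property>
<property>
  <name>crawldb.url.normalizers</name>
  <value>false</value>
  <description>Normalize urls when updating crawldb</description>
</property>
```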
[jira] [Created] (NUTCH-2538) Refactoring of Regex Url Normalizer and Bidirectional Url ExemptionFilter
Semyon Semyonov created NUTCH-2538: -- Summary: Refactoring of Regex Url Normalizer and Bidirectional Url ExemptionFilter Key: NUTCH-2538 URL: https://issues.apache.org/jira/browse/NUTCH-2538 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Semyon Semyonov NUTCH-2522 uses the same regex logic as RegexURLNormalizer. These plugins can be refactored to share a common base class. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2537) Logical OR instead of AND in UrlExemptionFilters
Semyon Semyonov created NUTCH-2537: -- Summary: Logical OR instead of AND in UrlExemptionFilters Key: NUTCH-2537 URL: https://issues.apache.org/jira/browse/NUTCH-2537 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Semyon Semyonov With NUTCH-2522 another URL exemption filter is added, so the filters can now be combined. We should use a more reasonable combination of ExemptionFilters, based on OR instead of AND. The following code should be modified, URLExemptionFilters.java:66 : for (int i = 0; i < this.filters.length && exempted; i++) { exempted = this.filters[i].filter(fromUrl, toUrl); } -- This message was sent by Atlassian JIRA (v7.6.3#76005)
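A minimal sketch of the OR semantics, with the URLExemptionFilter interface reduced to one method to keep the example self-contained (the real interface lives in org.apache.nutch.net; this is not the actual patch):

```java
import java.util.List;

// Sketch of the OR combination: a link is exempted if ANY filter exempts it.
// The current loop in URLExemptionFilters requires ALL filters to agree (AND).
public class OrExemption {
    interface URLExemptionFilter {
        boolean filter(String fromUrl, String toUrl);
    }

    static boolean filterDisjunctive(List<URLExemptionFilter> filters,
                                     String fromUrl, String toUrl) {
        boolean exempted = false;
        // Stop as soon as one filter exempts the link.
        for (int i = 0; i < filters.size() && !exempted; i++) {
            exempted = filters.get(i).filter(fromUrl, toUrl);
        }
        return exempted;
    }

    public static void main(String[] args) {
        List<URLExemptionFilter> fs = List.of(
            (f, t) -> false,
            (f, t) -> t.startsWith("http://"));
        System.out.println(filterDisjunctive(fs, "http://a.com", "http://b.com")); // true
    }
}
```

Note the only changes relative to the quoted loop are the loop condition (`!exempted` instead of `exempted`) and the initial value `false`: the loop now short-circuits on the first exempting filter rather than on the first rejecting one.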
[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394972#comment-16394972 ] Semyon Semyonov commented on NUTCH-1541: [~wastl-nagel], but why don't you write directly to HDFS, without the local file system step? In other words, why not create a new file on HDFS for each reducer? I understand the local step reduces I/O for the file, but per-reducer HDFS files would give control over the distribution across multiple reducers. > Indexer plugin to write CSV > --- > > Key: NUTCH-1541 > URL: https://issues.apache.org/jira/browse/NUTCH-1541 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.7 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch > > > With the new pluggable indexer a simple plugin would be handy to write > configurable fields into a CSV file - for further analysis or just for export. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391262#comment-16391262 ] Semyon Semyonov commented on NUTCH-1541: Hi [~wastl-nagel] Why wasn't this plugin merged with master? > Indexer plugin to write CSV > --- > > Key: NUTCH-1541 > URL: https://issues.apache.org/jira/browse/NUTCH-1541 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.7 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch > > > With the new pluggable indexer a simple plugin would be handy to write > configurable fields into a CSV file - for further analysis or just for export. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2524) Crawl script: 'if file exists' check doesn't work on HDFS
Semyon Semyonov created NUTCH-2524: -- Summary: Crawl script: 'if file exists' check doesn't work on HDFS Key: NUTCH-2524 URL: https://issues.apache.org/jira/browse/NUTCH-2524 Project: Nutch Issue Type: Bug Components: bin Reporter: Semyon Semyonov In the crawl script you can find something like: if [[ -d "$CRAWL_PATH"/hostdb ]]; then echo "Processing sitemaps based on hosts in HostDB" __bin_nutch sitemap "$CRAWL_PATH"/crawldb -hostdb "$CRAWL_PATH"/hostdb -threads $NUM_THREADS fi The test if [[ -d "$CRAWL_PATH"/hostdb ]]; works only in local mode, not on HDFS. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
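A possible fix (sketch): bash's [[ -d ]] only sees the local filesystem, so on HDFS the check needs "hadoop fs -test -d". The helper name hostdb_exists and the crawl_on_hdfs flag below are illustrative assumptions, not variables that exist in the crawl script.

```shell
# Sketch: directory-existence check that works both locally and on HDFS.
# "hadoop fs -test -d PATH" exits 0 iff PATH is a directory on the
# configured filesystem; [[ -d PATH ]] only checks the local disk.
hostdb_exists() {
  local path="$1"
  if [[ "$crawl_on_hdfs" = true ]]; then
    hadoop fs -test -d "$path"
  else
    [[ -d "$path" ]]
  fi
}

# Usage inside the crawl script would then become:
#   if hostdb_exists "$CRAWL_PATH"/hostdb; then
#     echo "Processing sitemaps based on hosts in HostDB"
#     ...
#   fi
```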
[jira] [Created] (NUTCH-2522) Bidirectional URL exemption filter
Semyon Semyonov created NUTCH-2522: -- Summary: Bidirectional URL exemption filter Key: NUTCH-2522 URL: https://issues.apache.org/jira/browse/NUTCH-2522 Project: Nutch Issue Type: Improvement Components: plugin Reporter: Semyon Semyonov The current Nutch URL exemption plugin exempts based on toUrl only; the new plugin uses both fromUrl and toUrl and, after the regex transformation, exempts based on the condition regex(fromUrl) == regex(toUrl). This approach allows us to perform more complex URL exemption filter checks, such as allowing links like http://www.website.com/home -> http://website.com/about (with/without www). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
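The condition regex(fromUrl) == regex(toUrl) can be sketched as follows; the www-stripping pattern is an illustrative assumption, not the plugin's shipped configuration:

```java
import java.util.regex.Pattern;

// Sketch of the bidirectional idea: normalize both ends of the link with
// the same regex substitution and exempt the link when the results match.
public class BidirectionalExemption {
    // Illustrative rule: strip a leading "www." after the scheme.
    private static final Pattern WWW = Pattern.compile("^(https?://)www\\.");

    static String normalize(String url) {
        return WWW.matcher(url).replaceFirst("$1");
    }

    // Exempt the link iff both URLs normalize to the same string.
    static boolean exempt(String fromUrl, String toUrl) {
        return normalize(fromUrl).equals(normalize(toUrl));
    }

    public static void main(String[] args) {
        System.out.println(exempt("http://www.website.com/home",
                                  "http://website.com/home")); // true
        System.out.println(exempt("http://www.website.com/home",
                                  "http://other.com/home"));   // false
    }
}
```

Under this scheme the with/without-www pair is exempted because both sides collapse to the same normalized URL, while unrelated hosts still differ after normalization.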
[jira] [Comment Edited] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and description
[ https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364079#comment-16364079 ] Semyon Semyonov edited comment on NUTCH-2510 at 2/14/18 2:50 PM: - I have provided the pull request. There are two indicator flags for the script: 1) To update the hostdb (but not use it in generate), pass --hostdbupdate 2) To update the hostdb and use it in generate, pass both --hostdbgenerate --hostdbupdate was (Author: semyon.semyo...@mail.com): I have provided the pull request. There are two indicator flags for the script: 1) To update the hostdb (but not use it in generate), pass --hostdbupdate 2) To update the hostdb and use it in generate, pass both --hostdbgenerate --hostdbupdate > Crawl script modification. HostDb : generate, optional usage and description > > > Key: NUTCH-2510 > URL: https://issues.apache.org/jira/browse/NUTCH-2510 > Project: Nutch > Issue Type: Improvement > Components: bin >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Minor > Fix For: 1.14 > > > The crawl script now includes the hostdb update as part of the crawling cycle, but: > 1) There is no hostdb parameter for generate > 2) Generation of the hostdb is not optional, so the hostdb is generated on each > step without asking the user. It should be an optional parameter. > 3) Description of 1 and 2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and description
[ https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364079#comment-16364079 ] Semyon Semyonov edited comment on NUTCH-2510 at 2/14/18 2:50 PM: - I have provided the pull request. There are two indicator flags for the script: 1) To update the hostdb (but not use it in generate), pass --hostdbupdate 2) To update the hostdb and use it in generate, pass both --hostdbgenerate --hostdbupdate was (Author: semyon.semyo...@mail.com): I have provided the pull request. There are two indicator flags for the script: 1) To update the hostdb (but not use it in generate), pass --hostdbupdate 2) To update the hostdb and use it in generate, pass both --hostdbgenerate --hostdbupdate > Crawl script modification. HostDb : generate, optional usage and description > > > Key: NUTCH-2510 > URL: https://issues.apache.org/jira/browse/NUTCH-2510 > Project: Nutch > Issue Type: Improvement > Components: bin >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Minor > Fix For: 1.14 > > > The crawl script now includes the hostdb update as part of the crawling cycle, but: > 1) There is no hostdb parameter for generate > 2) Generation of the hostdb is not optional, so the hostdb is generated on each > step without asking the user. It should be an optional parameter. > 3) Description of 1 and 2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and description
[ https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364079#comment-16364079 ] Semyon Semyonov commented on NUTCH-2510: I have provided the pull request. There are two indicator flags for the script: 1) To update the hostdb (but not use it in generate), pass --hostdbupdate 2) To update the hostdb and use it in generate, pass both --hostdbgenerate --hostdbupdate > Crawl script modification. HostDb : generate, optional usage and description > > > Key: NUTCH-2510 > URL: https://issues.apache.org/jira/browse/NUTCH-2510 > Project: Nutch > Issue Type: Improvement > Components: bin >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Minor > Fix For: 1.14 > > > The crawl script now includes the hostdb update as part of the crawling cycle, but: > 1) There is no hostdb parameter for generate > 2) Generation of the hostdb is not optional, so the hostdb is generated on each > step without asking the user. It should be an optional parameter. > 3) Description of 1 and 2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and description
Semyon Semyonov created NUTCH-2510: -- Summary: Crawl script modification. HostDb : generate, optional usage and description Key: NUTCH-2510 URL: https://issues.apache.org/jira/browse/NUTCH-2510 Project: Nutch Issue Type: Improvement Components: bin Affects Versions: 1.15 Reporter: Semyon Semyonov Fix For: 1.14 The crawl script now includes the hostdb update as part of the crawling cycle, but: 1) There is no hostdb parameter for generate 2) Generation of the hostdb is not optional, so the hostdb is generated on each step without asking the user. It should be an optional parameter. 3) Description of 1 and 2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332139#comment-16332139 ] Semyon Semyonov edited comment on NUTCH-2481 at 1/25/18 4:19 PM: - An example of usage: using fetched deltas in generate. 1) To calculate FetchedDelta in the hostdb update: hostdb.deltaExpression \{return new ("javafx.util.Pair","FetchedDelta", currentHostDatum.fetched - previousHostDatum.fetched);} 2) To use FetchedDelta in generate, so as not to crawl websites with FetchedDelta < 5: generate.max.count.expr if(fetched > 70 && FetchedDelta < 5 ) \{return new("java.lang.Double", 0);} else \{return conf.getDouble("generate.max.count", -1);} was (Author: semyon.semyo...@mail.com): An example of usage: using fetched deltas in generate. 1) To calculate FetchedDelta in the hostdb update: hostdb.deltaExpression \{return new ("javafx.util.Pair","FetchedDelta", currentHostDatum.fetched - previousHostDatum.fetched);} 2) To use FetchedDelta in generate, so as not to crawl websites with FetchedDelta < 5: generate.max.count.expr if(fetched > 70 && FetchedDelta < 5 ) \{return new("java.lang.Double", 0);} else \{return conf.getDouble("generate.max.count", -1);} > HostDatum deltas (previous step statistics) and Metadata expressions > --- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: hostdb >Reporter: Semyon Semyonov >Priority: Minor > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > > The solution allows filling in the metadata of a hostdatum based on a custom JEXL > expression using two hostdatum: before update (previousHostDatum) and after > update (currentHostDatum). > For example, to fill in the difference in the number of fetched pages between > rounds t and t-1, we can use the following expression: > > hostdb.deltaExpression > \{return new ("javafx.util.Pair","FetchedDelta", > currentHostDatum.fetched - previousHostDatum.fetched);} > > A pull request will be provided shortly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate
[ https://issues.apache.org/jira/browse/NUTCH-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2504: --- Priority: Minor (was: Major) > Results of maxCountExpr and fetchDelayExpr should be stored in memory in > Generate > - > > Key: NUTCH-2504 > URL: https://issues.apache.org/jira/browse/NUTCH-2504 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Minor > > With NUTCH-2455 the expressions maxCountExpr and fetchDelayExpr are > calculated for each value. That slows the process; instead, we can store the > results for each host in hostDomainCounts. > That will take only 2 x sizeof(long) of extra memory per host. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
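The proposed caching can be sketched like this; HostCounts and the evaluate* stand-ins are illustrative assumptions, not Nutch's actual types, but the structure shows why the cost drops to one expression evaluation per host (and two cached longs of memory):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-host memoization: evaluate the maxCount/fetchDelay
// expressions once per host and cache the results, instead of
// re-evaluating them for every URL of that host.
public class PerHostExprCache {
    // Two cached longs per host, as the issue suggests.
    static final class HostCounts {
        final long maxCount; final long fetchDelay;
        HostCounts(long maxCount, long fetchDelay) {
            this.maxCount = maxCount; this.fetchDelay = fetchDelay;
        }
    }

    private final Map<String, HostCounts> hostDomainCounts = new HashMap<>();
    int evaluations = 0; // exposed only so the example can show the saving

    HostCounts countsFor(String host) {
        return hostDomainCounts.computeIfAbsent(host, h -> {
            evaluations++; // the expression engine runs once per host
            return new HostCounts(evaluateMaxCountExpr(h), evaluateFetchDelayExpr(h));
        });
    }

    // Stand-ins for the real JEXL evaluations.
    long evaluateMaxCountExpr(String host) { return 100; }
    long evaluateFetchDelayExpr(String host) { return 5000; }

    public static void main(String[] args) {
        PerHostExprCache c = new PerHostExprCache();
        c.countsFor("a.com"); c.countsFor("a.com"); c.countsFor("b.com");
        System.out.println(c.evaluations); // 2
    }
}
```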
[jira] [Updated] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate
[ https://issues.apache.org/jira/browse/NUTCH-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2504: --- Issue Type: Improvement (was: Sub-task) Parent: (was: NUTCH-2455) > Results of maxCountExpr and fetchDelayExpr should be stored in memory in > Generate > - > > Key: NUTCH-2504 > URL: https://issues.apache.org/jira/browse/NUTCH-2504 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Major > > With NUTCH-2455 the expressions maxCountExpr and fetchDelayExpr are > calculated for each value. That slows the process; instead, we can store the > results for each host in hostDomainCounts. > That will take only 2 x sizeof(long) of extra memory per host. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate
[ https://issues.apache.org/jira/browse/NUTCH-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2504: --- Affects Version/s: 1.15 > Results of maxCountExpr and fetchDelayExpr should be stored in memory in > Generate > - > > Key: NUTCH-2504 > URL: https://issues.apache.org/jira/browse/NUTCH-2504 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.15 >Reporter: Semyon Semyonov >Priority: Minor > > With NUTCH-2455 the expressions maxCountExpr and fetchDelayExpr are > calculated for each value. That slows the process; instead, we can store the > results for each host in hostDomainCounts. > That will take only 2 x sizeof(long) of extra memory per host. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2504) Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate
Semyon Semyonov created NUTCH-2504: -- Summary: Results of maxCountExpr and fetchDelayExpr should be stored in memory in Generate Key: NUTCH-2504 URL: https://issues.apache.org/jira/browse/NUTCH-2504 Project: Nutch Issue Type: Sub-task Components: generator Reporter: Semyon Semyonov With NUTCH-2455 the expressions maxCountExpr and fetchDelayExpr are calculated for each value. That slows the process; instead, we can store the results for each host in hostDomainCounts. That will take only 2 x sizeof(long) of extra memory per host. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16332139#comment-16332139 ] Semyon Semyonov commented on NUTCH-2481: An example of usage: using fetched deltas in generate. 1) To calculate FetchedDelta in the hostdb update: hostdb.deltaExpression \{return new ("javafx.util.Pair","FetchedDelta", currentHostDatum.fetched - previousHostDatum.fetched);} 2) To use FetchedDelta in generate, so as not to crawl websites with FetchedDelta < 5: generate.max.count.expr if(fetched > 70 && FetchedDelta < 5 ) \{return new("java.lang.Double", 0);} else \{return conf.getDouble("generate.max.count", -1);} > HostDatum deltas (previous step statistics) and Metadata expressions > --- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: hostdb >Reporter: Semyon Semyonov >Priority: Minor > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > > The solution allows filling in the metadata of a hostdatum based on a custom JEXL > expression using two hostdatum: before update (previousHostDatum) and after > update (currentHostDatum). > For example, to fill in the difference in the number of fetched pages between > rounds t and t-1, we can use the following expression: > > hostdb.deltaExpression > \{return new ("javafx.util.Pair","FetchedDelta", > currentHostDatum.fetched - previousHostDatum.fetched);} > > A pull request will be provided shortly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Component/s: (was: generator) > HostDatum deltas (previous step statistics) and Metadata expressions > --- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: hostdb >Reporter: Semyon Semyonov >Priority: Minor > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > > The solution allows filling in the metadata of a hostdatum based on a custom JEXL > expression using two hostdatum: before update (previousHostDatum) and after > update (currentHostDatum). > For example, to fill in the difference in the number of fetched pages between > rounds t and t-1, we can use the following expression: > > hostdb.deltaExpression > \{return new ("javafx.util.Pair","FetchedDelta", > currentHostDatum.fetched - previousHostDatum.fetched);} > > A pull request will be provided shortly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Description: To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. The solution allows filling in the metadata of a hostdatum based on a custom JEXL expression using two hostdatum: before update (previousHostDatum) and after update (currentHostDatum). For example, to fill in the difference in the number of fetched pages between rounds t and t-1, we can use the following expression: hostdb.deltaExpression \{return new ("javafx.util.Pair","FetchedDelta", currentHostDatum.fetched - previousHostDatum.fetched);} A pull request will be provided shortly. was: To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. The > HostDatum deltas (previous step statistics) and Metadata expressions > --- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: hostdb >Reporter: Semyon Semyonov >Priority: Minor > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > > The solution allows filling in the metadata of a hostdatum based on a custom JEXL > expression using two hostdatum: before update (previousHostDatum) and after > update (currentHostDatum). > For example, to fill in the difference in the number of fetched pages between > rounds t and t-1, we can use the following expression: > > hostdb.deltaExpression > \{return new ("javafx.util.Pair","FetchedDelta", > currentHostDatum.fetched - previousHostDatum.fetched);} > > A pull request will be provided shortly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Description: To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. The was: To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. See an example below and two possible solutions. ??Let's say for each website we have the generate condition: generate while the number of fetched pages < 150. The problem is that for some websites that condition will (almost) never be met, because of the site's structure. 1) Round 1: 1 page 2) Round 2: 10 pages 3) Round 3: 80 pages 4) Round 4: 1 page 5) Round 5: 1 page ...etc. I would like to add a delta condition for fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1. In this case the process should stop at round 5 with a total number of fetched pages equal to 92.?? I see two possible solutions: 1. In the HostDatum class, apart from the current statistics, include the last step's statistics. class PagesStatistics { protected int unfetched = 0; protected int fetched = 0; protected int notModified = 0; protected int redirTemp = 0; protected int redirPerm = 0; protected int gone = 0; } Inside HostDatum: private PagesStatistics currentStatistics; private PagesStatistics previousStepStatistics; And update both in UpdateHostDb. *The main problem - space. In generate, HostDatum is stored in a Dictionary (RAM)* 2. Include metadata flag(s) in HostDatum and store them as a field in HostDatum (Metadata.StopGenerate = true/false). Calculate the value of StopGenerate in UpdateHostDB. *The main advantage is space: we store only a flag in the db. The main problem - lack of flexibility in Generate* > HostDatum deltas (previous step statistics) > -- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: generator, hostdb >Reporter: Semyon Semyonov >Priority: Minor > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > > The -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics) and Metadata expressions
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Summary: HostDatum deltas(previous step statistics) and Metadata expressions (was: HostDatum deltas(previous step statistics)) > HostDatum deltas(previous step statistics) and Metadata expressions > --- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: generator, hostdb >Reporter: Semyon Semyonov >Priority: Minor > > To allow the usage of previous step statistics(deltas of fetched,unfetced > etc) in hostdb. The motivation is usage of this statistics in generate with > maxCount expressions. > > The -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Priority: Minor (was: Major) > HostDatum deltas (previous step statistics) > -- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: generator, hostdb >Reporter: Semyon Semyonov >Priority: Minor > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > See an example below and two possible solutions. > ??Let's say for each website we have the generate condition: generate while the > number of fetched pages < 150. > The problem is that for some websites that condition will (almost) never be > met, because of the site's structure. > 1) Round 1: 1 page > 2) Round 2: 10 pages > 3) Round 3: 80 pages > 4) Round 4: 1 page > 5) Round 5: 1 page > ...etc. > I would like to add a delta condition for fetched that describes the speed of > the process. Let's say: generate while number of fetched < 150 && delta_fetched > > 1. > In this case the process should stop at round 5 with a total number of fetched > pages equal to 92. > ?? > I see two possible solutions: > 1. In the HostDatum class, apart from the current statistics, include the last > step's statistics. > class PagesStatistics > { > protected int unfetched = 0; > protected int fetched = 0; > protected int notModified = 0; > protected int redirTemp = 0; > protected int redirPerm = 0; > protected int gone = 0; > } > Inside HostDatum: > private PagesStatistics currentStatistics; > private PagesStatistics previousStepStatistics; > And update both in UpdateHostDb. *The main problem - space. In generate, > HostDatum is stored in a Dictionary (RAM)* > 2. > Include metadata flag(s) in HostDatum and store them as a field in > HostDatum (Metadata.StopGenerate = true/false). Calculate the value of > StopGenerate in UpdateHostDB. > *The main advantage is space: we store only a flag in the db. The main problem > - lack of flexibility in Generate* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Component/s: generator > HostDatum deltas (previous step statistics) > -- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: generator, hostdb >Reporter: Semyon Semyonov >Priority: Major > > > To allow the usage of previous step statistics (deltas of fetched, unfetched, > etc.) in the hostdb. The motivation is to use these statistics in generate with > maxCount expressions. > See an example below and two possible solutions. > ??Let's say for each website we have the generate condition: generate while the > number of fetched pages < 150. > The problem is that for some websites that condition will (almost) never be > met, because of the site's structure. > 1) Round 1: 1 page > 2) Round 2: 10 pages > 3) Round 3: 80 pages > 4) Round 4: 1 page > 5) Round 5: 1 page > ...etc. > I would like to add a delta condition for fetched that describes the speed of > the process. Let's say: generate while number of fetched < 150 && delta_fetched > > 1. > In this case the process should stop at round 5 with a total number of fetched > pages equal to 92. > ?? > I see two possible solutions: > 1. In the HostDatum class, apart from the current statistics, include the last > step's statistics. > class PagesStatistics > { > protected int unfetched = 0; > protected int fetched = 0; > protected int notModified = 0; > protected int redirTemp = 0; > protected int redirPerm = 0; > protected int gone = 0; > } > Inside HostDatum: > private PagesStatistics currentStatistics; > private PagesStatistics previousStepStatistics; > And update both in UpdateHostDb. *The main problem - space. In generate, > HostDatum is stored in a Dictionary (RAM)* > 2. > Include metadata flag(s) in HostDatum and store them as a field in > HostDatum (Metadata.StopGenerate = true/false). Calculate the value of > StopGenerate in UpdateHostDB. > *The main advantage is space: we store only a flag in the db. The main problem > - lack of flexibility in Generate* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (NUTCH-2481) HostDatum deltas(previous step statistics)
[ https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2481: --- Description: To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. See an example below and two possible solutions. ??Let's say for each website the generate condition is: generate while the number of fetched < 150. The problem is that for some websites this condition will (almost) never be satisfied, because of their structure. 1) Round 1. 1 page 2) Round 2. 10 pages 3) Round 3. 80 pages 4) Round 4. 1 page 5) Round 5. 1 page ...etc. I would like to add a delta condition for fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1. In this case the process should stop at round 5 with a total number of fetched equal to 92.?? I see two possible solutions: 1. In the HostDatum class, keep last step statistics next to the current statistics. class PagesStatistics { protected int unfetched = 0; protected int fetched = 0; protected int notModified = 0; protected int redirTemp = 0; protected int redirPerm = 0; protected int gone = 0; } Inside HostDatum: private PagesStatistics currentStatistics; private PagesStatistics previousStepStatistics; and update both in UpdateHostDb. *The main problem is space: in generate, HostDatum is stored in a Dictionary (RAM).* 2. Include metadata flag(s) in HostDatum and store them as a field (Metadata.StopGenerate = true/false). Calculate the value of StopGenerate in UpdateHostDb. *The main advantage is space: we store only a flag in the db. The main problem is the lack of flexibility in Generate.* > HostDatum deltas (previous step statistics) > -- > > Key: NUTCH-2481 > URL: https://issues.apache.org/jira/browse/NUTCH-2481 > Project: Nutch > Issue Type: Improvement > Components: hostdb >Reporter: Semyon Semyonov > > To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. > See an example below and two possible solutions. > ??Let's say for each website the generate condition is: generate while the number of fetched < 150. > The problem is that for some websites this condition will (almost) never be satisfied, because of their structure. > 1) Round 1. 1 page > 2) Round 2. 10 pages > 3) Round 3. 80 pages > 4) Round 4. 1 page > 5) Round 5. 1 page > ...etc. > I would like to add a delta condition for fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1. > In this case the process should stop at round 5 with a total number of fetched equal to 92. > ?? > I see two possible solutions: > 1. In the HostDatum class, keep last step statistics next to the current statistics. > class PagesStatistics > { > protected int unfetched = 0; > protected int fetched = 0; > protected int notModified = 0; > protected int redirTemp = 0; > protected int redirPerm = 0; > protected int gone = 0;
[jira] [Created] (NUTCH-2481) HostDatum deltas(previous step statistics)
Semyon Semyonov created NUTCH-2481: -- Summary: HostDatum deltas(previous step statistics) Key: NUTCH-2481 URL: https://issues.apache.org/jira/browse/NUTCH-2481 Project: Nutch Issue Type: Improvement Components: hostdb Reporter: Semyon Semyonov To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.) in the hostdb. The motivation is to use these statistics in generate with maxCount expressions. See an example below and two possible solutions. ??Let's say for each website the generate condition is: generate while the number of fetched < 150. The problem is that for some websites this condition will (almost) never be satisfied, because of their structure. 1) Round 1. 1 page 2) Round 2. 10 pages 3) Round 3. 80 pages 4) Round 4. 1 page 5) Round 5. 1 page ...etc. I would like to add a delta condition for fetched that describes the speed of the process. Let's say: generate while number of fetched < 150 && delta_fetched > 1. In this case the process should stop at round 5 with a total number of fetched equal to 92.?? I see two possible solutions: 1. In the HostDatum class, keep last step statistics next to the current statistics. class PagesStatistics { protected int unfetched = 0; protected int fetched = 0; protected int notModified = 0; protected int redirTemp = 0; protected int redirPerm = 0; protected int gone = 0; } Inside HostDatum: private PagesStatistics currentStatistics; private PagesStatistics previousStepStatistics; and update both in UpdateHostDb. *The main problem is space: in generate, HostDatum is stored in a Dictionary (RAM).* 2. Include metadata flag(s) in HostDatum and store them as a field (Metadata.StopGenerate = true/false). Calculate the value of StopGenerate in UpdateHostDb. *The main advantage is space: we store only a flag in the db. The main problem is the lack of flexibility in Generate.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
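A minimal plain-Java sketch of option 1 above, under stated assumptions: `HostDatumSketch`, `deltaFetched`, and `keepGenerating` are hypothetical illustrative names, not the actual Nutch `HostDatum` API. It shows how keeping a previous-step snapshot beside the current counters supports a delta condition like `fetched < 150 && delta_fetched > 1`.

```java
// Counters as listed in the proposal (quoted in the issue description).
class PagesStatistics {
    int unfetched, fetched, notModified, redirTemp, redirPerm, gone;
}

// Hypothetical variant of a HostDb entry carrying both current and
// previous-step statistics, so the generator can evaluate deltas.
class HostDatumSketch {
    PagesStatistics current = new PagesStatistics();
    PagesStatistics previousStep = new PagesStatistics();

    // How many pages were fetched since the last round.
    int deltaFetched() {
        return current.fetched - previousStep.fetched;
    }

    // Generate condition from the example: stop when totals stall,
    // even if the absolute cap has not been reached yet.
    boolean keepGenerating(int maxFetched, int minDelta) {
        return current.fetched < maxFetched && deltaFetched() > minDelta;
    }
}
```

With the rounds from the example (1, 10, 80, 1 pages), after round 4 the delta is 1, so `keepGenerating(150, 1)` turns false and round 5 is never generated, leaving the total at 92.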
[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283775#comment-16283775 ] Semyon Semyonov commented on NUTCH-2455: [~wastl-nagel] [~markus17] Please have a look. Could you also review two more closely related issues at the same time as this one: https://issues.apache.org/jira/browse/NUTCH-2454 and https://issues.apache.org/jira/browse/NUTCH-2461 From the commit, I repeat the three questions/modifications left open: 1) In several places in the Nutch code we use url.getHost(), in others url.getHost().toLower(). Why? 2) public static class ScoreHostKeyComparator extends WritableComparator should implement RawComparator. If you know how to do it, you are welcome to. 3) The whole Generator file is too big; it should be split into several files. Again, if you know how to fix it in a good way, you are welcome. > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
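On question 2, a raw comparator orders serialized keys by inspecting their bytes directly instead of deserializing whole objects. A minimal plain-Java sketch of the idea, with assumptions stated up front: `RawScoreCompare` is illustrative and not the Hadoop `WritableComparator`/`RawComparator` API, and it assumes the serialized key begins with a big-endian 4-byte float score.

```java
// Illustrative raw comparison: read only the leading float score from
// each serialized key, without deserializing the rest of the record.
class RawScoreCompare {
    // Decode a big-endian IEEE 754 float starting at `off`.
    static float readFloat(byte[] b, int off) {
        int bits = ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                 | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
        return Float.intBitsToFloat(bits);
    }

    // Decreasing-score order, as the Selector wants high scores first.
    static int compare(byte[] b1, byte[] b2) {
        return Float.compare(readFloat(b2, 0), readFloat(b1, 0));
    }
}
```

The real Hadoop hook would be overriding the byte-level `compare(byte[], int, int, byte[], int, int)` method, which avoids object allocation during the shuffle sort.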
[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278949#comment-16278949 ] Semyon Semyonov commented on NUTCH-2455: Hi Sebastian, I have already started working on the solution I proposed. What do you think about it? Will it work? > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278820#comment-16278820 ] Semyon Semyonov edited comment on NUTCH-2455 at 12/5/17 4:31 PM: - [~wastl-nagel] I have started to work on this issue and faced some problems with the combination of host and score. You proposed: ??the map function then emits key-value pairs -> of course, the HostDatums must be wrapped into the value structure. It's already a custom class (SelectorEntry), so that should be doable via partitioning and secondary sorting these arrive in the reduce function: all keys with the same host in one call of the function in the following order: first the HostDatum (just assign an artificially high score), then the CrawlDatum items sorted by decreasing score?? In the code, ??limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE) / job.getNumReduceTasks()?? acts in reduce as follows: if (count == limit) { // do we have any segments left? if (currentsegmentnum < maxNumSegments) { count = 0; currentsegmentnum++; } else break; } for each key in the reducer, where the key is a sorted score. Therefore the reducer takes the topN-scored urls across all hosts. With the proposed approach this no longer works, because the data is now sorted by host (??all keys with the same host in one call of the function??). For example, take bbc.com (300 pages) and amazon.com (200 pages) with topN = 70. Currently it works as follows: 1st call, for weight 1: 20 pages from bbc.com + 10 pages from amazon.com; 2nd call, for weight 0.5: 5 pages from bbc.com + 35 pages from amazon.com. If we introduce "one call per host" it becomes: 1st call, for bbc.com: 70 pages from bbc.com, 0 from amazon.com. I'm thinking about an alternative solution: 1) Use a composite key (score, host). As the value we use SelectorEntry and add the hostdatum there. From the first mapper (hostdb reader) we get only hostdb data, from the second mapper only crawldb data. Therefore, the combined output from the two mappers can look like this: (1, bbc.com) - (crawl, null) (1, bbc.com) - (crawl, null) (0.5, bbc.com) - (crawl, null) (null, bbc.com) - (null, hostdb) The host is the partitioner key (or domain/ip, as it works now). 2) Implement a SortComparatorClass: if score == null, return 1, so that all keys with score == null go to the top. 3) (Optionally) use a grouping comparator to combine all keys with score == null into one. After these steps, at the top we should have the hostdb data for all keys in the reducer, so first check it and load it into memory. Afterwards we just follow the natural score order and check the hostdb restriction. What do you think about this way?
[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278820#comment-16278820 ] Semyon Semyonov commented on NUTCH-2455: [~wastl-nagel] I have started to work on this issue and faced some problems with the combination of host and score. You proposed: ??the map function then emits key-value pairs -> of course, the HostDatums must be wrapped into the value structure. It's already a custom class (SelectorEntry), so that should be doable via partitioning and secondary sorting these arrive in the reduce function: all keys with the same host in one call of the function in the following order: first the HostDatum (just assign an artificially high score), then the CrawlDatum items sorted by decreasing score?? In the code, ??limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE) / job.getNumReduceTasks()?? acts in reduce as follows: if (count == limit) { // do we have any segments left? if (currentsegmentnum < maxNumSegments) { count = 0; currentsegmentnum++; } else break; } for each key in the reducer, where the key is a sorted score. Therefore the reducer takes the topN-scored urls across all hosts. With the proposed approach this no longer works, because the data is now sorted by host (??all keys with the same host in one call of the function??). For example, take bbc.com (300 pages) and amazon.com (200 pages) with topN = 70. Currently it works as follows: *1st call, for weight 1: 20 pages from bbc.com + 10 pages from amazon.com; 2nd call, for weight 0.5: 5 pages from bbc.com + 35 pages from amazon.com.* If we introduce "one call per host" it becomes: *1st call, for bbc.com: 70 pages from bbc.com, 0 from amazon.com.* I'm thinking about an alternative solution: 1) Use a composite key (score, host). As the value we use SelectorEntry and add the hostdatum there. From the first mapper (hostdb reader) we get only hostdb data, from the second mapper only crawldb data. Therefore, the combined output from the two mappers can look like this: *(1, bbc.com) - (crawl, null) (1, bbc.com) - (crawl, null) (0.5, bbc.com) - (crawl, null) (null, bbc.com) - (null, hostdb)* The host is the partitioner key (or domain/ip, as it works now). 2) Implement a SortComparatorClass: if score == null, return 1, so that all keys with score == null go to the top. 3) (Optionally) use a grouping comparator to combine all keys with score == null into one. After these steps, at the top we should have the hostdb data for all keys in the reducer, so first check it and load it into memory. Afterwards we just follow the natural score order and check the hostdb restriction. What do you think about this way? > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
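The composite-key ordering discussed in this thread (the HostDb marker first, then CrawlDb entries by decreasing score, with all keys of one host partitioned to the same reducer) can be sketched in plain Java. `HostScoreKey` and its comparator below are hypothetical stand-ins, not the actual Nutch `SelectorEntry` or Hadoop classes; a `null` score marks the HostDb entry, matching the `(null, bbc.com)` row in the example.

```java
import java.util.Comparator;

// Illustrative composite key: (score, host), where a null score marks
// the HostDb entry that must arrive first in the reduce call.
class HostScoreKey {
    final String host;
    final Float score; // null => HostDb entry, non-null => CrawlDb entry

    HostScoreKey(String host, Float score) {
        this.host = host;
        this.score = score;
    }

    // Sort comparator: group by host; within a host, HostDb entry first,
    // then CrawlDb entries by decreasing score (secondary sort).
    static final Comparator<HostScoreKey> SORT = Comparator
        .comparing((HostScoreKey k) -> k.host)
        .thenComparingInt(k -> k.score == null ? 0 : 1)
        .thenComparing(k -> k.score == null ? 0f : -k.score);

    // Partitioner: all keys of one host (or domain/ip) hit one reducer.
    static int partition(HostScoreKey k, int numReducers) {
        return (k.host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

After shuffling with `SORT` and `partition`, each reduce call sees its host's HostDb entry before any CrawlDb entry, so the per-host limit can be loaded into memory before the scored urls stream through.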
[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272716#comment-16272716 ] Semyon Semyonov commented on NUTCH-2455: [~wastl-nagel] What about this step: read HostDatums together with CrawlDatums (cf. MultipleInputFormat, depends on NUTCH-2375) as input of the select step? Isn't it simpler to read the HostDb in each mapper separately? > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Issue Comment Deleted] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
[ https://issues.apache.org/jira/browse/NUTCH-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2465: --- Comment: was deleted (was: Seems the fix is more complex than I thought. Please review and fix accordingly. There are some problems with src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java ) > Broken Eclipse project. Classpaths and interactiveselenium should be fixed. > --- > > Key: NUTCH-2465 > URL: https://issues.apache.org/jira/browse/NUTCH-2465 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Semyon Semyonov > Fix For: 1.14 > > > With the latest version of develop the Eclipse project doesn't work anymore. > There are two sets of problems: > 1) Classpath problems > 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the > code. Should be replaced by > org.apache.nutch.protocol.interactiveselenium.handlers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271137#comment-16271137 ] Semyon Semyonov commented on NUTCH-2455: Thanks for the suggestion, [~wastl-nagel]. The main question is: why not change the reducer key to host instead? Now I see that the reducer reduces based on the sorting value, but it is never used in the reducer itself; output.collect(key, entry); is the only call using the key. Why not perform host parsing in the mapper and then reduce based on that key (which means all the values from the same host go to the same reducer)? > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
[ https://issues.apache.org/jira/browse/NUTCH-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266865#comment-16266865 ] Semyon Semyonov commented on NUTCH-2465: Seems the fix is more complex than I thought. Please review and fix accordingly. There are some problems with src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java > Broken Eclipse project. Classpaths and interactiveselenium should be fixed. > --- > > Key: NUTCH-2465 > URL: https://issues.apache.org/jira/browse/NUTCH-2465 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.14 >Reporter: Semyon Semyonov > Fix For: 1.14 > > > With the latest version of develop the Eclipse project doesn't work anymore. > There are two sets of problems: > 1) Classpath problems > 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the > code. Should be replaced by > org.apache.nutch.protocol.interactiveselenium.handlers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2465) Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
Semyon Semyonov created NUTCH-2465: -- Summary: Broken Eclipse project. Classpaths and interactiveselenium should be fixed. Key: NUTCH-2465 URL: https://issues.apache.org/jira/browse/NUTCH-2465 Project: Nutch Issue Type: Bug Affects Versions: 1.14 Reporter: Semyon Semyonov Fix For: 1.14 With the latest version of develop the Eclipse project doesn't work anymore. There are two sets of problems: 1) Classpath problems 2) Incorrect usage of org.apache.nutch.protocol.interactiveselenium in the code. Should be replaced by org.apache.nutch.protocol.interactiveselenium.handlers -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251388#comment-16251388 ] Semyon Semyonov commented on NUTCH-2368: Added NUTCH-2461 with a proposed solution in the description. > Variable generate.max.count and fetcher.server.delay > > > Key: NUTCH-2368 > URL: https://issues.apache.org/jira/browse/NUTCH-2368 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch > > > In some cases we need to use host specific characteristics in determining > crawl speed and bulk sizes because with our (Openindex) settings we can just > recrawl hosts with up to 800k urls. > This patch solves the problem by introducing the HostDB to the Generator and > providing powerful Jexl expressions. Check these two expressions added to the > Generator: > {code} > -Dgenerate.max.count.expr=' > if (unfetched + fetched > 80) { > return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + > 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1) > } else { > return conf.getDouble("generate.max.count", 300); > }' > -Dgenerate.fetch.delay.expr=' > if (unfetched + fetched > 80) { > return (pct95._rs_ + 500); > } else { > return conf.getDouble("fetcher.server.delay", 1000) > }' > {code} > For each large host: select as many records as can be fetched, based on the > number of threads, the 95th percentile response time, and the fetch time limit. > Or: queueMaxCount = (timelimit / responsetime) * numThreads. > The second expression just follows up on that, setting the crawlDelay of the > fetch queue. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
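The quoted formula queueMaxCount = (timelimit / responsetime) * numThreads can be worked through in plain Java. The method below is an illustrative sketch, not Nutch code; it mirrors the quoted Jexl expression (time limit in minutes, 95th percentile response time padded by 500 ms, integer-second division).

```java
// Sketch of the quoted per-host queue sizing: how many fetches fit in
// the fetcher time limit, given the host's 95th-percentile response time.
class QueueMaxCount {
    static long queueMaxCount(int timelimitMins, long pct95ResponseMs, int threadsPerQueue) {
        long timelimitSecs = timelimitMins * 60L;
        // Integer seconds, as in the quoted Jexl; note a pct95 below 500 ms
        // truncates to 0 s here, so real code would need a lower bound.
        long perFetchSecs = (pct95ResponseMs + 500) / 1000;
        return timelimitSecs / perFetchSecs * threadsPerQueue;
    }
}
```

With the quoted defaults (12 min limit, one thread) and a 500 ms 95th percentile, each fetch costs one second, so 720 records fit in the 720-second window.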
[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251386#comment-16251386 ] Semyon Semyonov commented on NUTCH-2368: A critical bug when maxCount equals 0. > Variable generate.max.count and fetcher.server.delay > > > Key: NUTCH-2368 > URL: https://issues.apache.org/jira/browse/NUTCH-2368 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch > > > In some cases we need to use host specific characteristics in determining > crawl speed and bulk sizes because with our (Openindex) settings we can just > recrawl hosts with up to 800k urls. > This patch solves the problem by introducing the HostDB to the Generator and > providing powerful Jexl expressions. Check these two expressions added to the > Generator: > {code} > -Dgenerate.max.count.expr=' > if (unfetched + fetched > 80) { > return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + > 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1) > } else { > return conf.getDouble("generate.max.count", 300); > }' > -Dgenerate.fetch.delay.expr=' > if (unfetched + fetched > 80) { > return (pct95._rs_ + 500); > } else { > return conf.getDouble("fetcher.server.delay", 1000) > }' > {code} > For each large host: select as many records as can be fetched, based on the > number of threads, the 95th percentile response time, and the fetch time limit. > Or: queueMaxCount = (timelimit / responsetime) * numThreads. > The second expression just follows up on that, setting the crawlDelay of the > fetch queue. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Issue Comment Deleted] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2368: --- Comment: was deleted (was: A critical bug when maxCount equals 0) > Variable generate.max.count and fetcher.server.delay > > > Key: NUTCH-2368 > URL: https://issues.apache.org/jira/browse/NUTCH-2368 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.12 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.14 > > Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, > NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368_RESTAPI_Fix.patch > > > In some cases we need to use host specific characteristics in determining > crawl speed and bulk sizes because with our (Openindex) settings we can just > recrawl hosts with up to 800k urls. > This patch solves the problem by introducing the HostDB to the Generator and > providing powerful Jexl expressions. Check these two expressions added to the > Generator: > {code} > -Dgenerate.max.count.expr=' > if (unfetched + fetched > 80) { > return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + > 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1) > } else { > return conf.getDouble("generate.max.count", 300); > }' > -Dgenerate.fetch.delay.expr=' > if (unfetched + fetched > 80) { > return (pct95._rs_ + 500); > } else { > return conf.getDouble("fetcher.server.delay", 1000) > }' > {code} > For each large host: select as many records as can be fetched, based on the > number of threads, the 95th percentile response time, and the fetch time limit. > Or: queueMaxCount = (timelimit / responsetime) * numThreads. > The second expression just follows up on that, setting the crawlDelay of the > fetch queue. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2461) Generate passes the data to when maxCount == 0
[ https://issues.apache.org/jira/browse/NUTCH-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2461: --- Summary: Generate passes the data to when maxCount == 0 (was: Generate pass the data to when maxCount == 0) > Generate passes the data to when maxCount == 0 > --- > > Key: NUTCH-2461 > URL: https://issues.apache.org/jira/browse/NUTCH-2461 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.14 >Reporter: Semyon Semyonov >Priority: Critical > Fix For: 1.14 > > > The generator checks the condition > if (maxCount > 0) at line 421 and stops generation when the amount per host > exceeds maxCount (continue at line 455), > but when maxCount == 0 it goes directly to line 465: output.collect(key, > entry); > This is obviously not correct; the correct solution would be to add > if (maxCount == 0) { > continue; > } > at line 380. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2461) Generate pass the data to when maxCount == 0
Semyon Semyonov created NUTCH-2461: -- Summary: Generate pass the data to when maxCount == 0 Key: NUTCH-2461 URL: https://issues.apache.org/jira/browse/NUTCH-2461 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.14 Reporter: Semyon Semyonov Priority: Critical Fix For: 1.14 The generator checks the condition if (maxCount > 0) at line 421 and stops generation when the amount per host exceeds maxCount (continue at line 455), but when maxCount == 0 it goes directly to line 465: output.collect(key, entry); This is obviously not correct; the correct solution would be to add if (maxCount == 0) { continue; } at line 380. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
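The guard proposed in NUTCH-2461 can be sketched as a skip decision at the top of the per-host selection loop; the class and method names below are hypothetical, not the actual org.apache.nutch.crawl.Generator code:

```java
// Illustrative sketch of the proposed maxCount guard: when generate.max.count
// is 0, a host should yield no records at all, so the entry must be skipped
// before any output.collect happens. Hypothetical names, not real Nutch code.
public class MaxCountGuard {

    /** Returns true when the entry for this host should be skipped. */
    public static boolean skipHost(int maxCount, int alreadySelected) {
        if (maxCount == 0) {
            // The reported bug: without this branch the code fell through
            // past the (maxCount > 0) check and still collected the entry.
            return true;
        }
        if (maxCount > 0 && alreadySelected >= maxCount) {
            return true; // quota for this host exhausted
        }
        return false; // negative maxCount means "unlimited" in this sketch
    }

    public static void main(String[] args) {
        System.out.println(skipHost(0, 0)); // true: nothing selected when maxCount == 0
        System.out.println(skipHost(5, 5)); // true: quota reached
        System.out.println(skipHost(5, 2)); // false: still under quota
    }
}
```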
[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251359#comment-16251359 ] Semyon Semyonov commented on NUTCH-2368: I found a nasty bug that breaks the feature completely. The generator collected the URL when maxCount == 0, because of the condition at line 421, if (maxCount > 0), instead of >= 0. I propose adding the check if (maxCount == 0) { continue; } Could you check it ASAP? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay
[ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2455: --- Attachment: NUTCH-2455.patch The proposed patch is attached. Though, I'm not sure this is the best way to copy object values in Java: key.toString(), (HostDatum) value.clone() > Speed up the merging of HostDb entries for variable fetch delay > --- > > Key: NUTCH-2455 > URL: https://issues.apache.org/jira/browse/NUTCH-2455 > Project: Nutch > Issue Type: Improvement > Components: generator >Affects Versions: 1.13 >Reporter: Markus Jelsma > Attachments: NUTCH-2455.patch > > > Citing Sebastian at NUTCH-2420: > ??The correct solution would be to use pairs as keys in the > Selector job, with a partitioner and secondary sorting so that all keys with the > same host end up in the same call of the reducer. If values can also hold a > HostDb entry and the sort comparator guarantees that the HostDb entry > (entries if partitioned by domain or IP) comes in front of all CrawlDb > entries. But that would be a substantial improvement...?? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
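The copying the comment above asks about (key.toString(), (HostDatum) value.clone()) guards against a well-known Hadoop pitfall: the framework reuses one key/value object across iterator steps, so caching the bare reference keeps only the last record. This is a hypothetical sketch with StringBuilder standing in for a reused Writable, not the actual patch code:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates why cached MapReduce values must be copied: one buffer is
// reused per record, so stored references all end up holding the last value.
// StringBuilder stands in for a reused Writable here (hypothetical sketch).
public class ReusedValuePitfall {

    /** Buggy: every cached slot points at the single reused buffer. */
    public static List<CharSequence> cacheByReference(String[] records) {
        StringBuilder reused = new StringBuilder(); // framework-style object reuse
        List<CharSequence> out = new ArrayList<>();
        for (String r : records) {
            reused.setLength(0);
            reused.append(r);
            out.add(reused); // all slots share this one buffer
        }
        return out;
    }

    /** Fixed: copy the value before caching, as the patch does. */
    public static List<CharSequence> cacheByCopy(String[] records) {
        StringBuilder reused = new StringBuilder();
        List<CharSequence> out = new ArrayList<>();
        for (String r : records) {
            reused.setLength(0);
            reused.append(r);
            out.add(reused.toString()); // independent copy survives the next reuse
        }
        return out;
    }

    public static void main(String[] args) {
        String[] hosts = { "a.example", "b.example" };
        System.out.println(cacheByReference(hosts).get(0)); // b.example (clobbered)
        System.out.println(cacheByCopy(hosts).get(0));      // a.example (preserved)
    }
}
```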
[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244223#comment-16244223 ] Semyon Semyonov commented on NUTCH-2368: I found a bug. The hostdbReaders streams are not reset for each key of the reducer. Assume we have four hosts in the HostDb: A B C D. The first time, the reducer reduces for website C, so the hostdbReaders[i].next leftover is D. The second time we are looking for B, but the leftover is D; therefore the result of hostdbReaders[i].next is null. The same holds for all following keys of the reducer: the HostDb lookup is null. private HostDatum getHostDatum(String host) throws Exception { Text key = new Text(); HostDatum value = new HostDatum(); for (int i = 0; i < hostdbReaders.length; i++) { while (hostdbReaders[i].next(key, value)) { if (host.equals(key.toString())) { return value; } } } return null; } What do you think is the best way to solve it? Recreate the readers each time? Path path = new Path(job.get(GENERATOR_HOSTDB), "current"); hostdbReaders = SequenceFileOutputFormat.getReaders(job, path); -- This message was sent by Atlassian JIRA (v6.4.14#64029)
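The lookup failure described in the comment above can be reproduced with a plain iterator standing in for the SequenceFile readers; this is simplified, hypothetical code, not the actual Generator internals:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Reproduces the reader-reset bug: once the shared stream has advanced past
// an entry, a later lookup for an earlier key scans to the end and fails.
public class HostDbLookup {

    /** Buggy variant: the shared stream is never reset between lookups. */
    public static String lookupNoReset(Iterator<String> stream, String host) {
        while (stream.hasNext()) {
            if (stream.next().equals(host)) {
                return host; // found, but the stream is left past this entry
            }
        }
        return null; // exhausted: lookups for earlier keys now always fail
    }

    public static void main(String[] args) {
        List<String> hostdb = Arrays.asList("A", "B", "C", "D");
        Iterator<String> shared = hostdb.iterator();
        System.out.println(lookupNoReset(shared, "C")); // C (leftover is D)
        System.out.println(lookupNoReset(shared, "B")); // null (already past B)

        // One possible fix: materialize the table into a map once, then look
        // up in any order without re-reading the files for every reducer key.
        Map<String, String> cache = new HashMap<>();
        for (String h : hostdb) {
            cache.put(h, "datum-" + h);
        }
        System.out.println(cache.get("B")); // datum-B
    }
}
```

Recreating the readers per key, as the comment suggests, also works but re-reads the HostDb files for every host; caching trades memory for that I/O.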
[jira] [Updated] (NUTCH-2454) REST API fix for usage of hostdb in generator
[ https://issues.apache.org/jira/browse/NUTCH-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2454: --- Attachment: NUTCH-2368_RESTAPI_Fix.patch > REST API fix for usage of hostdb in generator > - > > Key: NUTCH-2454 > URL: https://issues.apache.org/jira/browse/NUTCH-2454 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.12 >Reporter: Semyon Semyonov >Priority: Normal > Fix For: 1.14 > > Attachments: NUTCH-2368_RESTAPI_Fix.patch > > > NUTCH-2368 > Variable generate.max.count and fetcher.server.delay -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2454) REST API fix for usage of hostdb in generator
Semyon Semyonov created NUTCH-2454: -- Summary: REST API fix for usage of hostdb in generator Key: NUTCH-2454 URL: https://issues.apache.org/jira/browse/NUTCH-2454 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.12 Reporter: Semyon Semyonov Priority: Normal Fix For: 1.14 NUTCH-2368 Variable generate.max.count and fetcher.server.delay -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2368: --- Attachment: NUTCH-2368_RESTAPI_Fix.patch There was a problem with the REST API client: the API uses a different run method, and this method didn't include the hostdb parameter. The patch fixes this problem. It may have some problems with the line offsets, because I'm not so fluent with SVN yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (NUTCH-2441) ARG_SEGMENT usage
[ https://issues.apache.org/jira/browse/NUTCH-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Semyon Semyonov updated NUTCH-2441: --- Attachment: metadataARG_SEGMENT.patch > ARG_SEGMENT usage > - > > Key: NUTCH-2441 > URL: https://issues.apache.org/jira/browse/NUTCH-2441 > Project: Nutch > Issue Type: Improvement > Components: metadata >Affects Versions: 1.13 >Reporter: Semyon Semyonov > Fix For: 1.14 > > Attachments: metadataARG_SEGMENT.patch > > > The constant in metadata/Nutch.java, public static final String ARG_SEGMENT = > "segment", is not used consistently. In some cases (Fetcher and ParseSegment) it is > interpreted as a single segment, in others (CrawlDb, LinkDb, IndexingJob) as > an array of segments. This mismatch leads to inconsistent usage > of the parameter. > After a discussion with [~wastl-nagel] the proposed solution is to allow the > usage of both an array and a string in all cases, which avoids introducing > breaking changes. > A patch is proposed. > *The question left is refactoring: all five components share the same > code (two versions of the same code, to be precise). Shouldn't we extract a > method and reduce the duplicates?* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (NUTCH-2441) ARG_SEGMENT usage
Semyon Semyonov created NUTCH-2441: -- Summary: ARG_SEGMENT usage Key: NUTCH-2441 URL: https://issues.apache.org/jira/browse/NUTCH-2441 Project: Nutch Issue Type: Improvement Components: metadata Affects Versions: 1.13 Reporter: Semyon Semyonov Fix For: 1.14 The constant in metadata/Nutch.java, public static final String ARG_SEGMENT = "segment", is not used consistently. In some cases (Fetcher and ParseSegment) it is interpreted as a single segment, in others (CrawlDb, LinkDb, IndexingJob) as an array of segments. This mismatch leads to inconsistent usage of the parameter. After a discussion with [~wastl-nagel] the proposed solution is to allow the usage of both an array and a string in all cases, which avoids introducing breaking changes. A patch is proposed. *The question left is refactoring: all five components share the same code (two versions of the same code, to be precise). Shouldn't we extract a method and reduce the duplicates?* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
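The "both array and string" behavior proposed in NUTCH-2441 amounts to a small normalization step in each tool; the helper below is a hypothetical sketch of that idea, not the attached patch:

```java
// Sketch of the proposed ARG_SEGMENT normalization: accept the argument as
// either a single String (as Fetcher and ParseSegment expect) or a String[]
// (as CrawlDb, LinkDb, and IndexingJob expect). Hypothetical helper, not the
// actual Nutch code.
public class SegmentArg {

    public static String[] normalizeSegments(Object arg) {
        if (arg == null) {
            return new String[0];
        }
        if (arg instanceof String) {
            return new String[] { (String) arg }; // single segment
        }
        if (arg instanceof String[]) {
            return (String[]) arg;                // list of segments
        }
        throw new IllegalArgumentException("segment must be a String or String[]");
    }

    public static void main(String[] args) {
        System.out.println(normalizeSegments("crawl/segments/20170101").length); // 1
        System.out.println(normalizeSegments(new String[] { "s1", "s2" }).length); // 2
    }
}
```

Extracting one such method, as the issue suggests, would also answer the refactoring question: the five components could share this single normalization instead of two diverging copies.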