[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375857#comment-14375857 ]

Julien Nioche commented on NUTCH-1958:
--------------------------------------

I agree, but I think there could be benefits in using depth as a default score. The main one is that people often confuse the crawl iteration number with depth; making the depth explicit via the score would be a good debugging / educational step. It is only a default value, and people can override it or remove it altogether. Not having a default value is certainly OK, but having one is better in the sense that it helps users realise that there is something there that they can use (or not).

Am happy with not having a default value BTW, just thinking aloud here.

Thanks!

Remove scoring-opic from nutch-default.xml
------------------------------------------

Key: NUTCH-1958
URL: https://issues.apache.org/jira/browse/NUTCH-1958
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 2.4, 1.10

I propose we remove scoring-opic from nutch-default. We all know it is flawed for any kind of incremental crawl, which most of us do. It is also useless if you want to perform a single crawl: if you must crawl all records of a domain, using OPIC for prioritizing URLs makes no sense. It also confuses users, as we have seen in the past and recently [1]. What do you think?

[1]: http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375787#comment-14375787 ]

Markus Jelsma commented on NUTCH-1958:
--------------------------------------

Hello Julien - neither. Scoring-depth does not really assign a score to a document in the same sense as OPIC or webgraph. OPIC is flawed for any crawl where pages are going to get refetched, and enabling webgraph by default is perhaps a bit too much, both in terms of performance and because it does not automatically converge to a stable state (the number of cycles is predefined). If you do a single crawl without refetching, i.e. get all pages of a domain, OPIC is not required: if you are going to crawl everything anyway, then prioritizing is useless.

What do you think?
[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375696#comment-14375696 ]

Julien Nioche commented on NUTCH-1958:
--------------------------------------

What would you suggest as a replacement? scoring-depth?
[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376227#comment-14376227 ]

Sebastian Nagel commented on NUTCH-1958:
----------------------------------------

Scoring-opic is not that bad: scores are plausible, also for smaller site crawls. An option would be to finally fix our OPIC implementation so that scores do not get out of control in long-running incremental crawls. This should be possible by keeping the cash and the score used for indexing separate. It is a challenge worth taking on, since the problem has been known for a long time and some thought has already gone into it (see [1]).

[1]: http://wiki.apache.org/nutch/FixingOpicScoring
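The idea of keeping cash and the indexing score separate can be illustrated with a toy model. This is only a sketch under stated assumptions, not Nutch's scoring-opic code: the three-page link graph, the function names, and the "naive" update rule (a refetched page passes score on to its outlinks again without giving any up) are all made up for illustration. The second variant follows the cash/history bookkeeping of the original OPIC algorithm, where cash is recorded in a history, handed on, and then reset.

```python
from collections import defaultdict

# Hypothetical three-page site; every page is refetched in every cycle.
LINKS = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}


def naive_cycle(score):
    """Each page hands score/len(outlinks) to its outlinks but keeps its own score."""
    received = defaultdict(float)
    for page, outlinks in LINKS.items():
        share = score[page] / len(outlinks)
        for target in outlinks:
            received[target] += share
    # Nothing is ever subtracted, so the total grows with every refetch.
    return {page: score[page] + received[page] for page in LINKS}


def separated_cycle(cash, history):
    """Cash is recorded in history, passed on and reset, so total cash stays constant."""
    new_cash = defaultdict(float)
    new_history = dict(history)
    for page, outlinks in LINKS.items():
        new_history[page] += cash[page]      # remember what the page held
        share = cash[page] / len(outlinks)   # then give all of it away
        for target in outlinks:
            new_cash[target] += share
    return {page: new_cash[page] for page in LINKS}, new_history


def run(cycles):
    naive = {page: 1.0 for page in LINKS}
    cash = {page: 1.0 for page in LINKS}
    history = {page: 0.0 for page in LINKS}
    for _ in range(cycles):
        naive = naive_cycle(naive)
        cash, history = separated_cycle(cash, history)
    total = sum(history.values())
    # The score used for indexing is derived from history only, never from cash.
    index_score = {page: history[page] / total for page in LINKS}
    return naive, cash, index_score


if __name__ == "__main__":
    naive, cash, index_score = run(10)
    print("naive total:", sum(naive.values()))
    print("cash total:", sum(cash.values()))
    print("index scores:", index_score)
```

In this toy graph the naive total doubles with every cycle, while the separated variant keeps the total cash fixed and yields index scores that always sum to 1, however long the crawl runs.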
[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372572#comment-14372572 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1958:
-------------------------------------------------------

+1