[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375857#comment-14375857
 ] 

Julien Nioche commented on NUTCH-1958:
--

I agree, but I think there could be benefits in using depth as a default score. 
The main one is that people often confuse crawl iteration number with depth; 
making the depth explicit via the score would be a good debugging / 
educational step. 
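
To illustrate the iteration-versus-depth confusion mentioned above, here is a minimal toy sketch. It is not Nutch code: the function name crawl, the top_n cap per round, the FIFO frontier, and counting depth from 0 at the seed are all assumptions made for illustration. Depth is the number of link hops along the path by which a page was first discovered; the round is the generate/fetch cycle in which it was actually fetched.

```python
from collections import deque


def crawl(links, seed, top_n):
    """Toy crawl: fetch at most top_n discovered-but-unfetched pages per round."""
    depth = {seed: 0}         # link hops from the seed (hypothetical 0-based count)
    fetched_in = {}           # round in which each page was fetched
    frontier = deque([seed])  # discovered, not yet fetched (FIFO order)
    round_no = 0
    while frontier:
        round_no += 1
        # Take this round's batch before processing, so pages discovered
        # during the round wait until a later round.
        batch = [frontier.popleft() for _ in range(min(top_n, len(frontier)))]
        for page in batch:
            fetched_in[page] = round_no
            for target in links.get(page, []):
                if target not in depth:
                    depth[target] = depth[page] + 1
                    frontier.append(target)
    return depth, fetched_in


# Hypothetical link graph: the seed links to three pages, one of which links on.
links = {"seed": ["a", "b", "c"], "a": ["d"]}
depth, fetched_in = crawl(links, "seed", top_n=2)
for page in sorted(depth):
    print(page, "depth", depth[page], "round", fetched_in[page])
```

With a cap of two pages per round, page "c" sits at depth 1 but is only fetched in round 3, which is the kind of mismatch that a depth-based score would make visible.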

It is a default value, and people can override it or remove it altogether. Not 
having a default value is certainly OK, but having one is better in the sense 
that it helps users realise that there is something there that they can use (or 
not).

Am happy with not having a default value BTW, just thinking aloud here. Thanks!


 Remove scoring-opic from nutch-default.xml
 --

 Key: NUTCH-1958
 URL: https://issues.apache.org/jira/browse/NUTCH-1958
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.4, 1.10


 I propose we remove scoring-opic from nutch-default. We all know it is flawed 
 for any kind of incremental crawl, which most of us do. It is also useless if 
 you want to perform a single crawl: if you must crawl all records of a 
 domain, using OPIC for prioritizing URLs makes no sense. It also confuses 
 users, as we have seen in the past and recently [1].
 What do you think?
 [1]: 
 http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375787#comment-14375787
 ] 

Markus Jelsma commented on NUTCH-1958:
--

Hello Julien - neither. Scoring-depth does not really assign a score to a 
document in the same sense as opic or webgraph. OPIC is flawed for any crawl where 
pages are going to get refetched, and enabling webgraph by default is perhaps a 
bit too much, both in terms of performance and because it does not automatically 
converge to a stable state (the number of cycles is predefined).

If you do a single crawl without refetching, i.e. get all pages of a domain, 
OPIC is not required. If you are going to crawl everything anyway, then 
prioritizing is useless.

What do you think?





[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375696#comment-14375696
 ] 

Julien Nioche commented on NUTCH-1958:
--

What would you suggest as a replacement? scoring-depth?  



[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376227#comment-14376227
 ] 

Sebastian Nagel commented on NUTCH-1958:


Scoring-opic is not that bad; scores are plausible even for smaller site 
crawls. An option would be to finally fix our OPIC implementation, so that 
scores do not get out of control for long-running incremental crawls. This 
should be possible by keeping cash and the score used for indexing separate. It is 
a challenge worth taking on, since the problem has long been known and some 
considerations have already been written up ([[1|http://wiki.apache.org/nutch/FixingOpicScoring]]).
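
As a rough illustration of why keeping cash and the indexing score separate could help, here is a toy model. It is neither Nutch's scoring-opic code nor the proposal on the wiki page; the function names conflated and separated, the link graph, and the update rules are assumptions made for this sketch. It also assumes every page appears as a key with at least one outlink and that cycles is at least 1.

```python
def conflated(links, cycles):
    """One value serves as both cash and score: each refetch hands the
    whole accumulated score on to the outlinks again."""
    score = {page: 1.0 for page in links}
    for _ in range(cycles):
        incoming = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            for target in outlinks:
                incoming[target] += score[page] / len(outlinks)
        for page in links:
            score[page] += incoming[page]  # never reset, so it keeps growing
    return score


def separated(links, cycles):
    """Cash is spent on every fetch; a separate history records what each
    page has received and feeds a normalised score for indexing."""
    cash = {page: 1.0 for page in links}
    history = {page: 0.0 for page in links}
    for _ in range(cycles):
        incoming = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            history[page] += cash[page]  # record the cash before spending it
            for target in outlinks:
                incoming[target] += cash[page] / len(outlinks)
        cash = incoming                  # total cash in the graph is conserved
    total = sum(history.values())
    return {page: history[page] / total for page in links}


# Hypothetical three-page site that is refetched on every cycle.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
for cycles in (5, 50):
    print(cycles, "conflated:", conflated(links, cycles))
    print(cycles, "separated:", separated(links, cycles))
```

In this model the conflated scores grow with every refetch cycle, while the separated scores are shares of a conserved total and therefore stay between 0 and 1 however long the crawl runs.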



[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372572#comment-14372572
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1958:
---

+1 
