HEADS UP: Config changes related to scoring API
Hi,

I just committed the scoring API (NUTCH-240). Please note that if you re-define the 'plugin.includes' property in your nutch-site.xml, you now have to add the 'scoring-opic' plugin (and/or any other scoring plugin that you've implemented) to your list.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
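For illustration, an override in nutch-site.xml might look like the fragment below. The plugin list shown is only an example loosely based on the 0.8-dev defaults; keep whatever plugins your existing value already enables and append scoring-opic to it.

```xml
<!-- Illustrative only: your existing plugin.includes value will differ.
     The point is that 'scoring-opic' must now appear in the list. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```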
[jira] Closed: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]

Andrzej Bialecki closed NUTCH-240:
----------------------------------
    Fix Version: 0.8-dev
     Resolution: Fixed

Patches applied. Any further API improvements are welcome; the current API is less than ideal, but it allows experimenting with various scoring strategies, which is IMHO more important at this moment than API purity.

> Scoring API: extension point, scoring filters and an OPIC plugin
> ----------------------------------------------------------------
>
>          Key: NUTCH-240
>          URL: http://issues.apache.org/jira/browse/NUTCH-240
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>      Fix For: 0.8-dev
>  Attachments: Generator.patch.txt, patch.txt, patch1.txt, patch2.txt
>
> This patch refactors all places where Nutch manipulates page scores into a
> plugin-based API. Using this API it's possible to implement different
> scoring algorithms. It is also much easier to understand how scoring works.
> Multiple scoring plugins can be run in sequence, in a manner similar to
> URLFilters.
> Included is also an OPICScoringFilter plugin, which contains the current
> implementation of the scoring algorithm. Together with the scoring API it
> provides fully backward-compatible scoring.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators:
  http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
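The "run in sequence, in a manner similar to URLFilters" idea can be sketched as below. This is not the actual NUTCH-240 interface -- the interface name, method name, and chain runner here are all illustrative assumptions; only the chaining pattern itself is taken from the message.

```java
import java.util.List;

// Illustrative sketch of chained scoring plugins (URLFilters-style).
// 'ScoringFilter' and 'passScore' are hypothetical names, not Nutch API.
public class ScoringChainSketch {
    interface ScoringFilter {
        float passScore(String url, float score);
    }

    // Each filter receives the score produced by the previous one.
    static float runChain(List<ScoringFilter> filters, String url, float initial) {
        float score = initial;
        for (ScoringFilter f : filters) {
            score = f.passScore(url, score);
        }
        return score;
    }

    static float demoScore() {
        List<ScoringFilter> chain = List.of(
            (url, s) -> s * 0.5f,   // e.g. a dampening plugin
            (url, s) -> s + 1.0f    // e.g. a boosting plugin
        );
        return runChain(chain, "http://example.com/", 2.0f);
    }

    public static void main(String[] args) {
        System.out.println(demoScore());  // (2.0 * 0.5) + 1.0 = 2.0
    }
}
```

Because each plugin sees only the running score, plugins compose without knowing about each other, which is what makes "experimenting with various scoring strategies" cheap.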
Experiment on crawler behaviour
Hi,

I found this article pretty interesting: http://drunkmenworkhere.org/219

Could we come up with some codified rules, reverse-engineered from the bots' behavior?

--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com  Contact: info at sigram dot com
[jira] Commented: (NUTCH-268) Generator and lib-http use different definitions of "unique host"
[ http://issues.apache.org/jira/browse/NUTCH-268?page=comments#action_12383327 ]

Andrzej Bialecki commented on NUTCH-268:
----------------------------------------

I forgot to add: if we change Generator to use IP addresses, then we should warn users that running a local caching DNS server becomes practically mandatory -- otherwise Generator would be very slow, not to mention that it would generate a lot of DNS traffic to external servers.

> Generator and lib-http use different definitions of "unique host"
> -----------------------------------------------------------------
>
>          Key: NUTCH-268
>          URL: http://issues.apache.org/jira/browse/NUTCH-268
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>      Fix For: 0.8-dev
>
> Generator uses the host name, as extracted from the URL, to determine the
> maximum number of URLs from a unique host (when generate.max.per.host is
> set > 0). This is supposed to prevent fetchlists from becoming dominated by
> URLs coming from the same hosts, which in turn would clash with
> "politeness" rules.
> However, the http plugins (lib-http HttpBase.blockAddr) don't use the host
> name; instead they use its IP address (explicitly doing a DNS lookup on the
> host name extracted from the URL). This leads to the following undesirable
> behavior:
> * if a DNS name resolves to different IPs (round-robin balancing), then
> technically we are in violation of the "politeness" rules, because lib-http
> doesn't see this as a conflict and permits concurrent accesses to the same
> host name.
> * if different DNS names resolve to the same IP address (very common:
> CNAMEs, subdomains, web hosting, etc.) then the purpose of
> generate.max.per.host is defeated, because lib-http will block more
> frequently than intended, leading to excessive numbers of "Exceeded
> http.max.delays" exceptions.
> Proposed solution: synchronize Generator and lib-http in their
> interpretation of "unique host". Introduce a boolean property which
> instructs both Generator and lib-http to use either IP addresses or host
> names as "unique hosts".
[jira] Created: (NUTCH-268) Generator and lib-http use different definitions of "unique host"
Generator and lib-http use different definitions of "unique host"
-----------------------------------------------------------------

         Key: NUTCH-268
         URL: http://issues.apache.org/jira/browse/NUTCH-268
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
 Assigned to: Andrzej Bialecki
     Fix For: 0.8-dev

Generator uses the host name, as extracted from the URL, to determine the maximum number of URLs from a unique host (when generate.max.per.host is set > 0). This is supposed to prevent fetchlists from becoming dominated by URLs coming from the same hosts, which in turn would clash with "politeness" rules.

However, the http plugins (lib-http HttpBase.blockAddr) don't use the host name; instead they use its IP address (explicitly doing a DNS lookup on the host name extracted from the URL). This leads to the following undesirable behavior:

* if a DNS name resolves to different IPs (round-robin balancing), then technically we are in violation of the "politeness" rules, because lib-http doesn't see this as a conflict and permits concurrent accesses to the same host name.
* if different DNS names resolve to the same IP address (very common: CNAMEs, subdomains, web hosting, etc.) then the purpose of generate.max.per.host is defeated, because lib-http will block more frequently than intended, leading to excessive numbers of "Exceeded http.max.delays" exceptions.

Proposed solution: synchronize Generator and lib-http in their interpretation of "unique host". Introduce a boolean property which instructs both Generator and lib-http to use either IP addresses or host names as "unique hosts".
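The proposed boolean switch can be sketched as a single key-normalization function that both components would share. The helper name and the byIP flag are illustrative assumptions, not Nutch code; the sketch only shows the two behaviors the issue describes.

```java
import java.net.InetAddress;
import java.net.URL;

// Sketch of a shared "unique host" key for Generator and lib-http.
// 'uniqueHostKey' and 'byIP' are hypothetical names, not Nutch API.
public class UniqueHostSketch {
    // byIP=true:  CNAMEs/virtual hosts sharing one address collapse to one key.
    // byIP=false: round-robin DNS names are treated as one key by name.
    static String uniqueHostKey(String urlString, boolean byIP) {
        try {
            String host = new URL(urlString).getHost();
            if (!byIP) return host.toLowerCase();
            // NOTE: with byIP=true every call may trigger a DNS lookup --
            // this is why a local caching DNS server becomes practically
            // mandatory, as noted in the comment on this issue.
            return InetAddress.getByName(host).getHostAddress();
        } catch (Exception e) {
            return null;  // malformed URL or unresolvable host
        }
    }

    public static void main(String[] args) {
        // By name: case-normalized host, no network traffic.
        System.out.println(uniqueHostKey("http://Example.COM/page", false));
    }
}
```

Whichever interpretation is chosen, the key point is that Generator's per-host counting and HttpBase.blockAddr's politeness blocking would then agree on what one "host" is.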
summarizer.setConf(conf) should be removed.
Hi,

getExtensionInstance() already sets the conf in case the class implements Configurable:

  ...
  if (object instanceof Configurable) {
    ((Configurable) object).setConf(this.conf);
  }
  ...

So calling summarizer.setConf(conf) sets the configuration a second time, which is useless. Should I file a bug?

Stefan
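The redundancy Stefan describes can be demonstrated in miniature. The interfaces below only mimic the shape of the Nutch/Hadoop ones (Configurable takes a String here instead of a Configuration object), and the call counter is added purely to make the double assignment visible.

```java
// Minimal illustration: the framework configures Configurable instances on
// creation, so an explicit setConf() afterwards repeats the same assignment.
public class DoubleSetConfSketch {
    interface Configurable {
        void setConf(String conf);
        int setCount();
    }

    static class Summarizer implements Configurable {
        private String conf;
        private int calls = 0;
        public void setConf(String conf) { this.conf = conf; calls++; }
        public int setCount() { return calls; }
    }

    // Mirrors getExtensionInstance(): configure the object when handing it out.
    static Object getExtensionInstance(Object object, String conf) {
        if (object instanceof Configurable) {
            ((Configurable) object).setConf(conf);
        }
        return object;
    }

    public static void main(String[] args) {
        Summarizer summarizer =
            (Summarizer) getExtensionInstance(new Summarizer(), "conf");
        summarizer.setConf("conf");  // the redundant second call
        System.out.println(summarizer.setCount());  // 2 -- configured twice
    }
}
```

Harmless here, but redundant; and if a setConf() implementation ever did non-trivial work (opening resources, reading files), the double call would stop being free.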
Re: distance between words
Sorry for my bad English. OK, I see that I phrased my suggestion very poorly. Please try the following: search in MSN and Google for:

  Freddie i want to ride my bicycle

I think it is unambiguous what I would like to see in the results. In MSN there are 21,958 hits, and the good result is in the 4th position (4th out of 21,958). In Google there are 308,000 hits, and the first hit is the full text of the song (1st out of 308,000). I think in this situation the Google results are better than MSN's: Google has the larger dataset, and still the better result. I think the Nutch results are bad in most cases. I found in 'explain.jsp' that the result is also scored by the full phrase ("Freddie i want to ride my bicycle"). But in this situation that is bad, because "Freddie" is not near to "i want...".

Best Regards,
Ferenc
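Ferenc's point is that scoring should reward query terms appearing close together, not just exact phrase matches. A from-scratch toy version of that idea (not Nutch's or Lucene's actual scoring code) can be written as a minimal term-distance function:

```java
// Toy proximity measure: smallest positional distance between any
// occurrence of term a and any occurrence of term b in a token stream.
// A proximity-aware scorer would boost documents where this is small.
public class TermDistanceSketch {
    static int minDistance(String[] tokens, String a, String b) {
        int lastA = -1, lastB = -1, best = Integer.MAX_VALUE;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(a)) lastA = i;
            if (tokens[i].equals(b)) lastB = i;
            if (lastA >= 0 && lastB >= 0) {
                best = Math.min(best, Math.abs(lastA - lastB));
            }
        }
        return best;  // Integer.MAX_VALUE if either term is absent
    }

    public static void main(String[] args) {
        String[] doc = "freddie sang i want to ride my bicycle".split(" ");
        // "freddie" and "want" are 3 positions apart in this document.
        System.out.println(minDistance(doc, "freddie", "want"));  // 3
    }
}
```

In Lucene terms this is roughly what a sloppy phrase query does: the smaller the edit distance between the query phrase and the document's term positions, the higher the contribution to the score.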
mozdex
Dear List!

I don't know who supports mozdex.com, but this server has not been able to search since Saturday.

Regards,
Ferenc
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Yes, this should definitely be mentioned somewhere (in the documentation :) At least we left a trace on the mailing list, so it'll be possible to refer to it.

D.

Jérôme Charron wrote:
> > You're right -- changing anything with the input (snippet length, number
> > of documents, etc.) will alter the clusters. This is basically how it
> > works. If you want clustering in your search engine then, depending on
> > the type of data you serve, you'll have to experiment with the settings
> > a bit and see which give you satisfactory results. I don't think there
> > is any particular reason to provide different data to the clusterer.
> > Moreover, it'd complicate things quite badly.
>
> Thanks Dawid for your response. In fact, I don't really want to change
> this, but just to be sure that everybody is aware of it and to have some
> opinions.
>
> Regards
>
> Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> You're right -- changing anything with the input (snippet length, number
> of documents, etc.) will alter the clusters. This is basically how it
> works. If you want clustering in your search engine then, depending on the
> type of data you serve, you'll have to experiment with the settings a bit
> and see which give you satisfactory results. I don't think there is any
> particular reason to provide different data to the clusterer. Moreover,
> it'd complicate things quite badly.

Thanks Dawid for your response. In fact, I don't really want to change this, but just to be sure that everybody is aware of it and to have some opinions.

Regards

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
Hi Jerome,

> Yes Dawid, but it is already committed => the clustering now uses the
> plain text version returned by the toString() method.

Ugh, yes, sorry about that. It uses Summary.toStrings(summaries), to be specific, and that uses toString() internally.

> Actually, the clustering uses the summaries as input. I assume it would
> provide better results if it took the whole document content, no? I assume
> that clustering uses the summaries instead of the document content for
> performance reasons.

Not always. Or rather: it depends what your goals are. Full-document clustering will take longer (word segmentation, feature extraction, etc.), but since you have more data to work with, document similarity should be more accurate and hence the clusters more sensible. In practice, however, similarity between documents and "cluster quality" is just a mathematical concept which is never shown to the user -- what the user sees is the representation of a cluster, which in the case of full-document clustering is usually quite inconvenient to build and has a weak relationship with the actual mathematical model of the clusters.

Contextual (keyword-in-context) snippets have a great advantage: they are shorter and carry the neighborhood of your query's terms. This very neighborhood (or rather: repetitive sequences of terms) can be used first to determine "clusters" of documents and then to describe them to the user. This is how most Web clustering algorithms work (excuse me if I explained it in a very imprecise way).

> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the summary
> size configuration. I really find this very confusing: when folks adjust
> this parameter it is only for front-end considerations (they want to
> display a long or a short summary), but certainly not for clustering
> reasons.

You're right -- changing anything with the input (snippet length, number of documents, etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it'd complicate things quite badly.

D.
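The side effect discussed above -- snippet length leaking into cluster quality -- can be made concrete with a toy bag-of-words cosine similarity. This is not the actual Nutch clustering pipeline, just a minimal demonstration that the clusterer's input vectors, and therefore the similarities, change when only the summary length changes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy demonstration: the clusterer only sees term vectors built from the
// snippets, so truncating a snippet changes the vectors and hence the
// pairwise similarities that drive clustering.
public class SnippetSimilaritySketch {
    static Map<String, Integer> bag(String text) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) m.merge(t, 1, Integer::sum);
        return m;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        String full = "nutch crawler fetches pages and indexes them with lucene";
        String truncated = "nutch crawler fetches";  // shorter summary setting
        String other = "lucene indexes documents for search";
        // Same document pair, two different similarities -- driven only by
        // how long the configured summary happens to be.
        System.out.println(cosine(bag(full), bag(other)));
        System.out.println(cosine(bag(truncated), bag(other)));
    }
}
```

The first comparison shares the terms "lucene" and "indexes" with the other snippet; the truncated version shares none of them, so the same pair of documents drifts apart purely because of a front-end display setting.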