HEADS UP: Config changes related to scoring API

2006-05-12 Thread Andrzej Bialecki

Hi,

I just committed the scoring API (NUTCH-240). Please note that if you
redefine the 'plugin.includes' property in your nutch-site.xml, you now
have to add the 'scoring-opic' plugin (and/or any other scoring plugin
that you've implemented) to your list.


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Closed: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-05-12 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-240?page=all ]
 
Andrzej Bialecki  closed NUTCH-240:
---

Fix Version: 0.8-dev
 Resolution: Fixed

Patches applied. Any further API improvements are welcome; the current API is
less than ideal, but it allows experimenting with various scoring strategies,
which is IMHO more important at this moment than API purity.

> Scoring API: extension point, scoring filters and an OPIC plugin
> 
>
>  Key: NUTCH-240
>  URL: http://issues.apache.org/jira/browse/NUTCH-240
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 
>  Fix For: 0.8-dev
>  Attachments: Generator.patch.txt, patch.txt, patch1.txt, patch2.txt
>
> This patch refactors all places where Nutch manipulates page scores into a
> plugin-based API. Using this API it's possible to implement different scoring
> algorithms. It is also much easier to understand how scoring works.
> Multiple scoring plugins can be run in sequence, in a manner similar to
> URLFilters.
> Also included is an OPICScoringFilter plugin, which contains the current
> implementation of the scoring algorithm. Together with the scoring API it
> provides fully backward-compatible scoring.
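
To illustrate what "run in sequence" means in the description above, here is a
minimal sketch of chained scoring steps. The names ScoreStep and ScoreChain
are hypothetical and are not the interface committed in NUTCH-240; the sketch
only assumes that each step receives the score produced by the previous one.

```java
import java.util.List;

// Hypothetical stand-in for one scoring plugin; NOT the actual Nutch interface.
interface ScoreStep {
    float adjust(String url, float score);
}

// Runs the steps in list order, feeding each step the previous step's output.
final class ScoreChain {
    private final List<ScoreStep> steps;

    ScoreChain(List<ScoreStep> steps) {
        this.steps = steps;
    }

    float apply(String url, float initialScore) {
        float score = initialScore;
        for (ScoreStep step : steps) {
            score = step.adjust(url, score);
        }
        return score;
    }
}
```

Because every step sees the output of the one before it, the order of the
steps matters, just as the order of chained URL filters does.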

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Experiment on crawler behaviour

2006-05-12 Thread Andrzej Bialecki

Hi,

I found this article pretty interesting:

   http://drunkmenworkhere.org/219

Could we come up with some codified rules, reverse-engineered from the 
bots' behavior?






[jira] Commented: (NUTCH-268) Generator and lib-http use different definitions of "unique host"

2006-05-12 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-268?page=comments#action_12383327 ]

Andrzej Bialecki  commented on NUTCH-268:
-

I forgot to add: if we change Generator to use IP addresses, then we should 
warn users that running a local caching DNS server becomes practically 
mandatory - otherwise Generator would be very slow, not to mention that it 
would generate a lot of DNS traffic to external servers.





[jira] Created: (NUTCH-268) Generator and lib-http use different definitions of "unique host"

2006-05-12 Thread Andrzej Bialecki (JIRA)
Generator and lib-http use different definitions of "unique host"
-

 Key: NUTCH-268
 URL: http://issues.apache.org/jira/browse/NUTCH-268
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Andrzej Bialecki 
 Assigned to: Andrzej Bialecki  
 Fix For: 0.8-dev


Generator uses the host name, as extracted from the URL, to determine the
maximum number of URLs from a unique host (when generate.max.per.host is set
> 0). This is supposed to prevent fetchlists from becoming dominated by URLs
coming from the same hosts, which in turn would clash with "politeness" rules.

However, the http plugins (lib-http HttpBase.blockAddr) don't use the host
name; instead they use its IP address (explicitly doing a DNS lookup on the
host name extracted from the URL). This leads to the following undesirable
behavior:

* if a DNS name resolves to different IPs (round-robin balancing), then
technically we are in violation of the "politeness" rules, because lib-http
doesn't see this as a conflict and permits concurrent accesses to the same host
name.

* if different DNS names resolve to the same IP address (very common: CNAMEs,
subdomains, web hosting, etc.), then the purpose of generate.max.per.host is
defeated, because lib-http will block more frequently than intended, leading to
excessive numbers of "Exceeded http.max.delays" exceptions.

Proposed solution: synchronize Generator and lib-http in their interpretation
of "unique host". Introduce a boolean property which instructs both Generator
and lib-http to use either IP addresses or host names as "unique hosts".
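
The proposed solution can be sketched roughly as follows. This is only an
illustration under stated assumptions: the class UniqueHostKey, its methods and
the byIp flag are hypothetical names, not the Nutch code or the actual property.
The point is that both the per-host cap and the blocking logic would derive
their key from one shared function.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the names are illustrative, not part of Nutch.
final class UniqueHostKey {
    private UniqueHostKey() {}

    // Returns the key that identifies a "unique host" for the given URL:
    // the lower-cased host name, or its IP address when byIp is true.
    static String keyFor(String url, boolean byIp) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (!byIp) {
            return host;
        }
        try {
            // Resolves the name; for non-literal hosts this may hit DNS.
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Keeps at most maxPerKey URLs per unique host, preserving input order.
    static List<String> capPerKey(List<String> urls, int maxPerKey, boolean byIp) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String key = keyFor(url, byIp);
            int count = seen.getOrDefault(key, 0);
            if (count < maxPerKey) {
                kept.add(url);
                seen.put(key, count + 1);
            }
        }
        return kept;
    }
}
```

With byIp set to true, every URL whose host is not an IP literal may trigger a
name lookup, so the cost of this mode depends heavily on how quickly names can
be resolved.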




summarizer.setConf(conf) should be removed.

2006-05-12 Thread Stefan Groschupf

Hi,
getExtensionInstance() already sets the conf if the class
implements Configurable:

...
  if (object instanceof Configurable) {
    ((Configurable) object).setConf(this.conf);
  }

... so calling summarizer.setConf(conf); sets the configuration a
second time, which is redundant.
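
A small self-contained sketch of the pattern described above. DemoConfigurable,
DemoSummarizer and DemoExtensionFactory are hypothetical stand-ins (with a plain
String in place of the real configuration object), not the Nutch or Hadoop
classes; the sketch only shows that an explicit setConf() after the factory call
configures the object a second time.

```java
// Hypothetical stand-in for the Configurable contract; not the real interface.
interface DemoConfigurable {
    void setConf(String conf);
}

// Counts how often it gets configured, to make the double call visible.
final class DemoSummarizer implements DemoConfigurable {
    int setConfCalls = 0;
    String conf;

    @Override
    public void setConf(String conf) {
        this.conf = conf;
        setConfCalls++;
    }
}

// Mimics a factory that configures every Configurable instance it hands out.
final class DemoExtensionFactory {
    private final String conf;

    DemoExtensionFactory(String conf) {
        this.conf = conf;
    }

    Object configure(Object object) {
        if (object instanceof DemoConfigurable) {
            ((DemoConfigurable) object).setConf(this.conf);
        }
        return object;
    }
}
```

After configure() has run, setConfCalls is 1; a further explicit setConf() call
with the same configuration raises it to 2 without changing any state.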


Should I file a bug?

Stefan 


Re: distance between words

2006-05-12 Thread YourSoft

Sorry for my bad English.
OK, I see that I worded my suggestion very badly.

Please try the following:
search MSN and Google for:
Freddie i want to ride my bicycle

I think it is unambiguous what I would like to see in the results.
MSN returns 21,958 hits, and the good result is in 4th position
(4th of 21,958).
Google returns 308,000 hits, and the first hit is the full
text of the song (1st of 308,000).


I think in this case the Google result is better than MSN's: Google
has the larger dataset and still gives the better result.

I think the Nutch results are bad in most cases.

I found in 'explain.jsp' that the result is also scored by the full phrase
("Freddie i want to ride my bicycle").
But in this situation that is bad, because "Freddie" is not near "i
want...".


Best Regards,
   Ferenc



mozdex

2006-05-12 Thread YourSoft

Dear List!

I don't know who supports mozdex.com, but the server hasn't been returning
search results since Saturday.


Regards,
   Ferenc


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Dawid Weiss


Yes, this should definitely be mentioned somewhere (in the documentation :).
At least we left a trace on the mailing list, so it'll be possible to
refer to it.


D.

Jérôme Charron wrote:

> > You're right -- changing anything with the input (snippets length,
> > number of documents etc) will alter the clusters. This is basically how
> > it works. If you want clustering in your search engine then, depending
> > on the type of data you serve, you'll have to experiment with the
> > settings a bit and see which give you satisfactory results. I don't
> > think there is any particular reason to provide different data to the
> > clusterer. Moreover, it'd complicate things quite badly.
>
> Thanks Dawid for your response.
> In fact, I don't really want to change this; I just want to be sure that
> everybody is aware of it, and to get some opinions.
>
> Regards
>
> Jérôme



Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Jérôme Charron

> You're right -- changing anything with the input (snippets length,
> number of documents etc) will alter the clusters. This is basically how
> it works. If you want clustering in your search engine then, depending
> on the type of data you serve, you'll have to experiment with the
> settings a bit and see which give you satisfactory results. I don't
> think there is any particular reason to provide different data to the
> clusterer. Moreover, it'd complicate things quite badly.

Thanks Dawid for your response.
In fact, I don't really want to change this; I just want to be sure that
everybody is aware of it, and to get some opinions.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Dawid Weiss


Hi Jerome,


> Yes Dawid, but it is already committed => the clustering now uses the plain
> text version returned by the toString() method.


Ugh, yes, sorry about that. To be specific, it uses
Summary.toStrings(summaries), which uses toString internally.



> Actually, the clustering uses the summaries as input. I assume it would
> provide better results if it took the whole document content, no?
> I assume that clustering uses the summaries instead of the document content
> for performance reasons.


Not always. Or rather: it depends on what your goals are. Full document
clustering will take longer (word segmentation, feature extraction etc),
but since you have more data to work with, document similarity should be
more accurate and hence clusters more sensible. In practice, however,
similarity between documents and "cluster quality" is just a
mathematical concept which is never shown to the user -- what the user
sees is the representation of a cluster, which in the case of full-document
clustering is usually quite inconvenient to build and has a weak
relationship with the actual mathematical model of clusters.


Contextual (keyword-in-context) snippets have a great advantage: they 
are shorter and carry the neighborhood of your query's terms. This very 
neighborhood (or rather: repetitive sequences of terms) can be used to 
first determine "clusters" of documents and then to describe them to the 
user. This is how most Web clustering algorithms work (excuse me if I 
explained it in a very imprecise way).
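
As a rough illustration of what a keyword-in-context snippet is, here is a
minimal sketch that cuts a window of words around the first occurrence of a
query term. KwicSnippet and its parameters are hypothetical names chosen for
this example; this is not how the Nutch summarizer or the clustering plugin
builds its input.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class KwicSnippet {
    private KwicSnippet() {}

    // Returns up to `window` words on each side of the first occurrence of
    // `term` (case-insensitive, punctuation ignored), or "" if it is absent.
    static String around(String text, String term, int window) {
        String wanted = normalize(term);
        if (wanted.isEmpty()) {
            return "";
        }
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (normalize(tokens[i]).equals(wanted)) {
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length, i + window + 1);
                return String.join(" ", Arrays.copyOfRange(tokens, from, to));
            }
        }
        return "";
    }

    // Strips everything except letters and digits, then lower-cases.
    private static String normalize(String token) {
        return token.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
    }
}
```

The window size plays the same role as the configurable snippet length: a
larger window gives the clusterer more neighboring terms to work with.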



> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the configured
> summary size. I find this really confusing: when folks adjust
> this parameter it is only for front-end reasons (they want to display
> a long or a short summary), and certainly not for clustering reasons.


You're right -- changing anything with the input (snippets length, 
number of documents etc) will alter the clusters. This is basically how 
it works. If you want clustering in your search engine then, depending 
on the type of data you serve, you'll have to experiment with the 
settings a bit and see which give you satisfactory results. I don't 
think there is any particular reason to provide different data to the 
clusterer. Moreover, it'd complicate things quite badly.


D.