new location! nutch user meeting San Francisco

2006-05-11 Thread Stefan Groschupf

Hi there,
since there is such big interest in the nutch user meeting,
we decided to move to another location.


We will now meet:
Rite-Spot Cafe
(415) 552-6066
2099 Folsom St
San Francisco, CA 94110
"Its in a good location too for parking and its even reachable by  
public transport -- 2 blocks from BART."

map:
http://www.google.com/maps?hl=en&lr=&q=the+rite+spot&near=San+Francisco,+CA&latlng=37775000,-122418333,13300172632540369639

Want to join?:
http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1

Thanks a bunch to Michael Stack, who also helped get the new location
organized!


So see you next week.
Greetings,
Stefan


Preventing overlapped search results.

2006-05-11 Thread Brian Hill

I'm new to Nutch, but I couldn't find this in the archives or docs and
it has me stumped.

I have two websites that I need to index in Nutch. I am presently
running two separate crawls to index these sites, but a single link is
screwing up my search results. 

I have two flat files in my Nutch directory, "Domain1" and "Domain2".
Each of these files contains the appropriate starting URL for each of
the two sites, and the two crawls generate completely separate database
folders, which are in turn called by two independent Nutch frontend
installations in Tomcat.

My problem is with the crawl-urlfilter.txt file. Because this is a local
search, I need to limit the domains and the file contains these lines:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*domain1.edu/
+^http://([a-z0-9]*\.)*domain2.edu/

This would work perfectly EXCEPT that there is a single link on the
domain1.edu site to the homepage of the domain2.edu site. Nutch is
following this link, and as a result the domain1 search results are
bringing up the full domain1.edu AND domain2.edu sites. 

What's the best way to deal with this problem? When I run the Domain1
Nutch search, I need the results to be limited to the domain1.edu,
subdomain1.domain1.edu, and subdomain2.domain1.edu websites. Likewise,
if I add a reciprocal link to domain2.edu, I need users of THAT search
interface to receive results only relevant to that domain.

PLEASE don't tell me I need two independent Nutch installations! Your
help is appreciated.
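(One way to keep the two crawls apart - a sketch only, not necessarily what
the list recommended: give each crawl its own conf directory, each a full
copy of the stock conf/ directory that differs only in crawl-urlfilter.txt,
and select it via the NUTCH_CONF_DIR environment variable if the bin/nutch
script in use honors it. The directory names below are illustrative.)

  # conf-domain1/crawl-urlfilter.txt -- accept only domain1.edu and subdomains
  #   +^http://([a-z0-9]*\.)*domain1\.edu/
  #   -.
  # conf-domain2/crawl-urlfilter.txt -- accept only domain2.edu and subdomains
  #   +^http://([a-z0-9]*\.)*domain2\.edu/
  #   -.

  # run each crawl against its own configuration and output directory
  NUTCH_CONF_DIR=conf-domain1 bin/nutch crawl Domain1 -dir crawl-domain1 -depth 3
  NUTCH_CONF_DIR=conf-domain2 bin/nutch crawl Domain2 -dir crawl-domain2 -depth 3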

Brian Hill


Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron

Bob Carpenter of alias-i had this to say when I brought up this very
idea:
http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599


Thanks for your response, Marvin.
But finally my question is: shouldn't the nutch clustering use some
fixed-size snippets instead of the configurable display size?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron

> (but if the nutch-site.xml overrides the plugin.includes property and
> doesn't include it, it will not be activated, like any other plugin)
yes, that's what I meant, I guess that's the default case for people
hacking plugins.


Oh, yes Sami, I understand what you mean...
Sorry, I just forgot to mention this point on the list (so, plugin hackers,
you need to add one of the new summary plugins if you want any summaries
displayed).
Sorry, I also forgot to add the summary plugins to the default webapp context
file (nutch.xml) ... I will add this once svn write access is available
again.
And one more time sorry, because I also forgot to report the summary API
changes to the web2 module...
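(For plugin hackers who override plugin.includes in nutch-site.xml, the
override needs to list one of the summarizer plugins - a sketch only; the
value below is illustrative, so copy the actual default from
nutch-default.xml and make sure a summary-* plugin is appended:)

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic</value>
    <description>Without a summarizer plugin (e.g. summary-basic or
    summary-lucene) no summaries are displayed.</description>
  </property>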

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] 

Doug Cutting commented on NUTCH-267:


re: it's as if we didn't want it to be re-crawled if we can't find any inlinks 
to it

We prioritize crawling based on the number of pages we've crawled that link to 
it since we last crawled it.  Assuming it had links to it that caused it to be 
crawled the first time, and that some of those will also be re-crawled, then 
its score will again increase.  But if no one links to it anymore, it will 
languish, and not be crawled again unless there are no higher-scoring pages.  
That sounds right to me, and I think it's what's suggested in the OPIC paper 
(if I skimmed it correctly).

Perhaps it should be reset not to zero but to one, since that's where pages 
start out.

re: why use "sqrt(opic) * docSimilarity" instead of "log(opic * docSimilarity)"

Wrapping log() around things changes the score value but not the ranking.  So 
the question is really: why use sqrt(opic)*docSimilarity and not just 
opic*docSimilarity?  The answer is simply that I tried a few queries and sqrt 
seemed to be required to keep OPIC from overly dominating the scoring.  It was 
a "seat of the pants" calculation, trying to balance the strength of anchor 
matches, OPIC scoring, and title, url and body matching, etc.  One can change 
this behavior via the score power parameter.
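(In code form, the boost Doug describes - a sketch only; the configuration
property is assumed to be indexer.score.power with a default of 0.5, i.e.
the square root:)

  // index-time boost derived from the page's OPIC score
  float scorePower = conf.getFloat("indexer.score.power", 0.5f);
  float boost = (float) Math.pow(pageScore, scorePower);
  // scorePower = 0.5 -> sqrt(opic); scorePower = 1.0 -> raw opic score
  doc.setBoost(boost);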

> Indexer doesn't consider linkdb when calculating boost value
> 
>
>  Key: NUTCH-267
>  URL: http://issues.apache.org/jira/browse/NUTCH-267
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Chris Schneider
> Priority: Minor

>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
> indexer.boost.by.link.count was true, the indexer boost value was scaled 
> based on the log of the # of inbound links:
> if (boostByLinkCount)
>   res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). 
> Instead, the boost value is just the square root (or some other scorePower) 
> of the page score. Shouldn't the invertlinks command, which creates the 
> linkdb, have some affect on the boost value calculated during indexing 
> (either via the OPICScoringFilter or some other built-in filter)?




Re: Interleaved (parallel) fetch cycles

2006-05-11 Thread Doug Cutting

Andrzej Bialecki wrote:

> I'm planning to work on adding support in 0.8 for interleaved fetch cycles.


Great!

> Then, when running an updatedb, the issue of scores and metadata comes
> into question. We can imagine that some other updatedb-s were run in the
> meantime, not necessarily with earlier fetchlists - so the score and
> metadata info could actually be newer in the latest CrawlDB than what we
> have inside the current segment. In such a case, we will get the
> following in CrawlDbReducer:
>
> * the "old" value from CrawlDb (which could actually be newer!). Even if
> it's old, its fetchTime could be in the future due to the trick
> described above. We could also get null here, if we just discovered a
> new page.
>
> * the "original" value from CrawlDb, which was recorded in the fetchlist.
> This one, at least, has a true fetch time, and its metadata and score are
> snapshots of that information at the time of "generate".
>
> * "new" value from Fetcher, with new score / metadata information. We
> will also get "new" values from redirects, which might not match any of
> the above values (i.e. they could use unique urls).
>
> * "linked" values from parsers, with score / metadata contributions.
>
> Now, the question is how to update the score, metadata, fetchTime and
> fetchInterval information. We need a way to determine whether the "new"
> value we have is in fact newer or older than the "old" value - I'm not
> sure how to do this; fetchTime and fetchInterval could have been
> modified, so they are not reliable... Perhaps we should add a
> "generation ID" to CrawlDatum?


Would it work to set the fetch time for generated items to the current 
time when generating?  That way, the "new" value will always be a bit 
after the "old" time.  In 0.7 we stored not the fetched-time but the 
time-to-next-fetch, so we had to set it into the future.  But if we 
instead just mark it as fetched now, so that it won't be re-generated 
until its fetch interval has expired, that would resolve this, no?
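(A sketch of that suggestion - illustrative only, not committed code; the
CrawlDatum fetch-time accessors are assumed to look like this:)

  // in the Generator's reduce step: stamp each selected entry with the
  // current time before writing the fetchlist, so that a later "generate"
  // skips it until its fetch interval expires
  datum.setFetchTime(System.currentTimeMillis());
  // a value coming back from the Fetcher then always carries a fetch time
  // later than the one stored in the crawl db at generate time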



> Anyway, assuming we have a way to know this:

* if "new" is newer than "old", then we take all metadata from "old", 
overwrite all info with the values from "new", and we keep "new".


* if "new" is older than "old", then we overwrite its metadata with all 
values from "old". We do the same with fetchTime and fetchInterval.


That sounds right to me.  When is "original" used, if at all?

> What about the score? I think that for new score calculations we should
> take the latest available score info from the "old" value.


That also sounds right.  The crawl db should own the scores.  Scores 
should not be updated by the fetcher, but only by crawldb updates.


> Updatedb would also have to lock CrawlDB so that no other updatedb or
> generate could run while we modify it.


Yes, that sounds right too.

Thanks for working on this!

Doug


Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Marvin Humphrey


On May 11, 2006, at 3:36 AM, Jérôme Charron wrote:

> Actually, the clustering uses the summaries as input. I assume it would
> provide better results if it took the whole document content, no?
> I assume that clustering uses the summaries instead of the document
> content for performance reasons.
> But there is a (bad) side effect: since the size of the summaries is
> configurable, the clustering "quality" will vary depending on the
> summary size configuration. I find this very confusing: when folks
> adjust this parameter it is only for front-end considerations (they
> want to display a long or a short summary), but certainly not for
> clustering reasons.
>
> What do you and others think about this?


Bob Carpenter of alias-i had this to say when I brought up this very idea:


http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: [jira] Updated: (NUTCH-251) Administration GUI

2006-05-11 Thread TDLN

I have my local changes, so I can't use the binary distribution.
Anyway, I will have a go at it and let you know.

Rgrds, Thomas



On 5/11/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

Hi,

the easiest way is to download one of the binary distributions.
However, as far as I know, the patches still work and need to be applied
to both projects.
Stefan

Am 11.05.2006 um 08:38 schrieb TDLN:

> Hi Stephan.
>
> I am about to get started with the Admin GUI and was wondering if
> these instructions are still valid.
>
> More in particular, is it still necessary to patch Hadoop, or has this
> patch already been integrated?
>
> Also, do you know whether the Nutch patches still apply to the latest
> revision?
>
> Rgrds, Thomas Delnoij
>
> How to:
>
> + checkout latest nutch sources
>
> + checkout hadoop sources
> + patch hadoop with the hadoop patch
> + build hadoop jar
> + remove old hadoop jar from nutch/lib
> + place new hadoop jar in nutch/lib
>
>
> + uncompress plugin zip file
> + place plugins in nutch/src/plugins (patch not possible since svn
> does not support binary patches)
> + patch nutch with nutch patch
> + start gui with bin/nutch gui
>  + point your browser to: http://localhost:50060/general/
> + username and password are "admin" (can be changed in nutch-default.xml)
> + select the "default" instance or create a new instance.
>




Re: [jira] Updated: (NUTCH-251) Administration GUI

2006-05-11 Thread Stefan Groschupf

Hi,

the easiest way is to download one of the binary distributions.
However, as far as I know, the patches still work and need to be applied
to both projects.

Stefan

Am 11.05.2006 um 08:38 schrieb TDLN:


Hi Stephan.

I am about to get started with the Admin GUI and was wondering if
these instructions are still valid.

More in particular, is it still necessary to patch Hadoop, or has this
patch already been integrated?

Also, do you know whether the Nutch patches still apply to the latest
revision?

Rgrds, Thomas Delnoij

How to:

+ checkout latest nutch sources

+ checkout hadoop sources
+ patch hadoop with the hadoop patch
+ build hadoop jar
+ remove old hadoop jar from nutch/lib
+ place new hadoop jar in nutch/lib


+ uncompress plugin zip file
+ place plugins in nutch/src/plugins (patch not possible since svn
does not support binary patches)
+ patch nutch with nutch patch
+ start gui with bin/nutch gui  

+ point your browser to: http://localhost:50060/general/
+ username and password are "admin" (can be changed in nutch-default.xml)

+ select the "default" instance or create a new instance.





Re: [jira] Updated: (NUTCH-251) Administration GUI

2006-05-11 Thread TDLN

Hi Stephan.

I am about to get started with the Admin GUI and was wondering if
these instructions are still valid.

More in particular, is it still necessary to patch Hadoop, or has this
patch already been integrated?

Also, do you know whether the Nutch patches still apply to the latest
revision?

Rgrds, Thomas Delnoij

How to:

+ checkout latest nutch sources

+ checkout hadoop sources
+ patch hadoop with the hadoop patch
+ build hadoop jar
+ remove old hadoop jar from nutch/lib
+ place new hadoop jar in nutch/lib


+ uncompress plugin zip file
+ place plugins in nutch/src/plugins (patch not possible since svn
does not support binary patches)
+ patch nutch with nutch patch
+ start gui with bin/nutch gui
+ point your browser to: http://localhost:50060/general/
+ username and password are "admin" (can be changed in nutch-default.xml)
+ select the "default" instance or create a new instance.


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Sami Siren

Jérôme Charron wrote:

> (but if the nutch-site.xml overrides the plugin.includes property and
> doesn't include it, it will not be activated, like any other plugin)

yes, that's what I meant, I guess that's the default case for people
hacking plugins.


--
Sami Siren


[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379072 ] 

Andrzej Bialecki  commented on NUTCH-267:
-

Hmm, resetting the score to 0 is also dubious - it's as if we didn't want it to 
be re-crawled if we can't find any inlinks to it... I believe it should be 
reset to the following value:

newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)

where initialScore is the score we got from previous iterations (or the 
injectedScore), sum(distributedScoreM) is what we have distributed to the M 
outlinks from that page, and sum(incomingScoreN) is what is contributed by the 
N inlinks. The current formula omits sum(distributedScoreM); it also doesn't 
provide any way to "sponsor" pages with no incoming links so that they don't 
go broke (the concept of "virtual nodes" I mentioned above).
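(The same update as a small Java-style sketch - the variable names are
illustrative, not from the Nutch code base:)

  // proposed reset value instead of 0
  float sumDistributed = 0.0f;                   // given away to M outlinks
  for (float s : distributedToOutlinks) sumDistributed += s;
  float sumIncoming = 0.0f;                      // received from N inlinks
  for (float s : contributedByInlinks) sumIncoming += s;
  float newScore = initialScore - sumDistributed + sumIncoming;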

Re: summing logs: yes, but then why use "sqrt(opic) * docSimilarity" instead of 
"log(opic * docSimilarity)"?

> Indexer doesn't consider linkdb when calculating boost value
> 
>
>  Key: NUTCH-267
>  URL: http://issues.apache.org/jira/browse/NUTCH-267
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Chris Schneider
> Priority: Minor

>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
> indexer.boost.by.link.count was true, the indexer boost value was scaled 
> based on the log of the # of inbound links:
> if (boostByLinkCount)
>   res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). 
> Instead, the boost value is just the square root (or some other scorePower) 
> of the page score. Shouldn't the invertlinks command, which creates the 
> linkdb, have some affect on the boost value calculated during indexing 
> (either via the OPICScoringFilter or some other built-in filter)?




Interleaved (parallel) fetch cycles

2006-05-11 Thread Andrzej Bialecki

Hi,

I'm planning to work on adding support in 0.8 for interleaved fetch cycles.

What this means is that (within some limits) you can generate multiple 
fetchlists, fetch them at different times, and then update the crawldb, not 
necessarily in the sequence in which the fetchlists were generated. You can 
also generate more fetchlists before any updatedb is run.


This functionality was supported in 0.7.x. When FetchListTool selected a 
Page for fetching, its next fetch time was pushed 1 week into the future. 
This was a simple and effective way to prevent the same Pages from ending up 
on the next fetchlist, while still letting them "time out" after 1 week if, 
e.g., fetching failed, the segment was lost, or whatever. Please note that 
this method requires modifying the WebDB.


If fetching was completed and an updatedb was run, the original 
fetchTime/fetchInterval could be recovered from a copy of the Page 
inside the FetcherOutput.


Now, in 0.8 we do it differently. We don't modify CrawlDB, so we have no 
way of recording which CrawlDatums end up on some fetchlist. This means 
that two "generate" operations run in sequence, without an intervening 
updatedb, will produce exactly the same fetchlists.


The Generator would have to be modified to use the same trick as in 0.7. 
Unfortunately, this probably means that it will have to run a sort of 
updatedb, using its output fetchlist to mark entries in the CrawlDB. This 
adds another map-reduce job to an already long-ish operation (the Generator 
already uses two map-reduce jobs). It also means that the Generator will 
have to put a lock on the CrawlDB for the duration of this job, so that no 
other "generate" or "updatedb" can update it at the same time.


Then, when running an updatedb, the issue of scores and metadata comes 
into question. We can imagine that some other updatedb-s were run in the 
meantime, not necessarily with earlier fetchlists - so the score and 
metadata info could actually be newer in the latest CrawlDB than what we 
have inside the current segment. In such a case, we will get the 
following in CrawlDbReducer:


* "old" value from CrawlDb (which could be actually newer!). Even if 
it's old, its fetchTime could be in the future due to the trick 
described above. We could also get null here, if we just discovered a 
new page.


* "original" value from CrawlDb, which was recorded in fetchlist. This, 
for once, has a true fetch time, and its metadata and score are 
snapshots of that information at the time of "generate".


* "new" value from Fetcher, with new score / metadata information. We 
will also get "new" values from redirects, which might not match any of 
the above values (i.e. they could use unique urls).


* "linked" values from parsers, with score / metadata contributions.

Now, the question is how to update the score, metadata, fetchTime and 
fetchInterval information. We need a way to determine whether the "new" 
value we have is in fact newer or older than the "old" value - I'm not 
sure how to do this; fetchTime and fetchInterval could have been 
modified, so they are not reliable... Perhaps we should add a 
"generation ID" to CrawlDatum? Anyway, assuming we have a way to know this:


* if "new" is newer than "old", then we take all metadata from "old", 
overwrite all info with the values from "new", and we keep "new".


* if "new" is older than "old", then we overwrite its metadata with all 
values from "old". We do the same with fetchTime and fetchInterval. What 
about the score? I think that for new score calculations we should take 
the latest available score info from the "old" value.
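(The update rules above as a Java-style sketch - the helper and accessor
names are illustrative only, not the actual CrawlDbReducer code:)

  if (isNewer(newDatum, oldDatum)) {
    // keep "new": start from the metadata of "old" and let the
    // entries from "new" overwrite it
    result = newDatum;
    mergeMetaDataPreferNew(oldDatum, result);
  } else {
    // "new" is stale: overwrite its metadata, fetchTime and
    // fetchInterval with the values from "old"
    result = newDatum;
    result.setMetaData(oldDatum.getMetaData());
    result.setFetchTime(oldDatum.getFetchTime());
    result.setFetchInterval(oldDatum.getFetchInterval());
  }
  // scores: for new score calculations, start from the latest score
  // available in the "old" value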


Updatedb would also have to lock CrawlDB so that no other updatedb or 
generate could run while we modify it.


That's probably all at the moment ... Any comments or suggestions 
appreciated!


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron

Add 3. Clustering would benefit from a plain text version.


Yes Dawid, but it is already committed => the clustering now uses the plain
text version returned by the toString() method.

Dawid, I have a question about clustering.
Actually, the clustering uses the summaries as input. I assume it would
provide better results if it took the whole document content, no?
I assume that clustering uses the summaries instead of the document
content for performance reasons.
But there is a (bad) side effect: since the size of the summaries is
configurable, the clustering "quality" will vary depending on the
summary size configuration. I find this very confusing: when folks
adjust this parameter it is only for front-end considerations (they
want to display a long or a short summary), but certainly not for
clustering reasons.

What do you and others think about this?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Dawid Weiss

> The reason is that they should not use the same HTML code:
> 1. OpenSearch should only use  around highlights
> 2. search.jsp should use some more complicated HTML code ()


Add 3. Clustering would benefit from a plain text version.

D.