0.7-dev, the search scoring

2005-07-28 Thread Fredrik Andersson
Hey guys!

I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff
has changed, I see! One thing I can't quite grasp, though, is why
Hit.getScore() has been removed in favour of the TopDocs-thingie?
I wrote a quick add-on to support getting the score straight
from the hit, which worked fine, but it would be nice to hear the reason
why the method was removed in the first place!

Also, are there any secret wikis, mailing lists, forums or similar
for the 0.7 development? It would be very interesting to see what's
cooking!

Greetings,
Fredrik


Re: [Nutch-dev] Re: ranking algorithm

2005-07-28 Thread Massimo Miccoli

Hi,

A clarification: without the analyze step, the score is calculated only
by counting the inlinks for a page.




Fredrik Andersson wrote:


It's open source, there's your in-depth info! : )

Kleinberg's original report, "Authoritative Sources in a Hyperlinked
Environment", can be downloaded at
http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by
various people since it was released, but the principle of hubs and
authorities is still the same.

Fredrik

On 7/28/05, Jay Pound [EMAIL PROTECTED] wrote:
 


is there a whitepaper on the algorithm for nutch, or some in-depth info on
it anywhere?
Thanks,
-J
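
For readers who want the hubs-and-authorities principle from Kleinberg's paper in concrete form, here is a minimal sketch of the basic iteration. It is not taken from the Nutch source, and the class name HitsSketch, the adjacency-array input and the fixed iteration count are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical illustration of Kleinberg's hubs-and-authorities iteration.
// Not Nutch code: the names and the input format are assumptions.
final class HitsSketch {

    /**
     * links[p] holds the indices of the pages that page p links to.
     * Returns a two-row array: row 0 = hub scores, row 1 = authority scores.
     */
    static double[][] run(int[][] links, int iterations) {
        int n = links.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);

        for (int it = 0; it < iterations; it++) {
            // Authority update: a page scores high if good hubs link to it.
            double[] nextAuth = new double[n];
            for (int p = 0; p < n; p++) {
                for (int q : links[p]) {
                    nextAuth[q] += hub[p];
                }
            }
            normalize(nextAuth);

            // Hub update: a page scores high if it links to good authorities.
            double[] nextHub = new double[n];
            for (int p = 0; p < n; p++) {
                for (int q : links[p]) {
                    nextHub[p] += nextAuth[q];
                }
            }
            normalize(nextHub);

            auth = nextAuth;
            hub = nextHub;
        }
        return new double[][] {hub, auth};
    }

    /** Scales v to unit Euclidean length; an all-zero vector is left as is. */
    private static void normalize(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) {
            sumSquares += x * x;
        }
        if (sumSquares == 0.0) {
            return;
        }
        double norm = Math.sqrt(sumSquares);
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
    }

    public static void main(String[] args) {
        // Pages 0 and 1 both link to page 2; page 2 links nowhere.
        int[][] links = { {2}, {2}, {} };
        double[][] result = run(links, 10);
        System.out.println("hubs        = " + Arrays.toString(result[0]));
        System.out.println("authorities = " + Arrays.toString(result[1]));
    }
}
```

In the tiny example graph, page 2 ends up as the only authority and pages 0 and 1 share the hub weight equally.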



   




___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers

 



Re: 0.7-dev, the search scoring

2005-07-28 Thread Doug Cutting

Fredrik Andersson wrote:

I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff
has changed, I see! One thing I can't quite grasp, though, is why
Hit.getScore() has been removed in favour of the TopDocs-thingie?


Hit.getScore() was generalized to Hit.getSortValue() in order to support 
sorting results by things other than score.  If you sort by score, as is 
the default, then ((FloatWritable)Hit.getSortValue()).get() is the 
score.  But if you sort by, e.g., a date string, then 
((UTF8)Hit.getSortValue()).toString() is the date string sorted on, and 
the score is unavailable.  Perhaps the score should be made available 
regardless?


Doug
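
The casts described in the reply above can be sketched as follows. To keep the example compilable without Nutch on the classpath, FloatWritable, UTF8 and Hit are minimal stand-ins written for this sketch, not the real Nutch classes; the helper scoreOrNull and the class name SortValueSketch are likewise hypothetical.

```java
// Hypothetical stand-ins for the Nutch types named in the reply above.
// Only the methods used in this sketch are modelled.
final class SortValueSketch {

    static final class FloatWritable {
        private final float value;
        FloatWritable(float value) { this.value = value; }
        float get() { return value; }
    }

    static final class UTF8 {
        private final String text;
        UTF8(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    static final class Hit {
        private final Object sortValue;
        Hit(Object sortValue) { this.sortValue = sortValue; }
        Object getSortValue() { return sortValue; }
    }

    /**
     * Returns the score when the results were sorted by score (the default),
     * or null when they were sorted by something else and the score is
     * therefore unavailable from the hit.
     */
    static Float scoreOrNull(Hit hit) {
        Object value = hit.getSortValue();
        if (value instanceof FloatWritable) {
            return Float.valueOf(((FloatWritable) value).get());
        }
        return null;
    }

    public static void main(String[] args) {
        Hit sortedByScore = new Hit(new FloatWritable(1.5f));
        Hit sortedByDate = new Hit(new UTF8("20050728"));

        System.out.println("score: " + scoreOrNull(sortedByScore));
        System.out.println("score: " + scoreOrNull(sortedByDate));
        System.out.println("date:  " + ((UTF8) sortedByDate.getSortValue()).toString());
    }
}
```

Checking the type before casting avoids a ClassCastException in code that may receive hits from either kind of sort.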


RE: ranking algorithm

2005-07-28 Thread Andrey Ilinykh
Slightly off-topic: the Nutch ranking algorithm uses score and nextScore.
Can anyone explain why we need nextScore?
Thank you,
  Andrey

-----Original Message-----
From: Fredrik Andersson [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 28, 2005 7:42 AM
To: nutch-dev@lucene.apache.org
Subject: Re: ranking algorithm


It's open source, there's your in-depth info! : )

Kleinberg's original report, "Authoritative Sources in a Hyperlinked
Environment", can be downloaded at
http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by
various people since it was released, but the principle of hubs and
authorities is still the same.

Fredrik

On 7/28/05, Jay Pound [EMAIL PROTECTED] wrote:
 is there a whitepaper on the algorithm for nutch, or some in-depth info on
 it anywhere?
 Thanks,
 -J
 
 



Re: 0.7-dev, the search scoring

2005-07-28 Thread Fredrik Andersson
Ah God, I am stupid ... thanks for that, Doug! I must have a bad
coding day today : )

Fredrik

On 7/28/05, Doug Cutting [EMAIL PROTECTED] wrote:
 Fredrik Andersson wrote:
  I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff
  has changed, I see! One thing I can't quite grasp, though, is why
  Hit.getScore() has been removed in favour of the TopDocs-thingie?
 
 Hit.getScore() was generalized to Hit.getSortValue() in order to support
 sorting results by things other than score.  If you sort by score, as is
 the default, then ((FloatWritable)Hit.getSortValue()).get() is the
 score.  But if you sort by, e.g., a date string, then
 ((UTF8)Hit.getSortValue()).toString() is the date string sorted on, and
 the score is unavailable.  Perhaps the score should be made available
 regardless?
 
 Doug



NDFS Bug, Mapred from SVN - Tokenizer and New Line Error

2005-07-28 Thread Jon Shoberg

I'm trying to start an NDFS datanode and keep getting the following error:

[EMAIL PROTECTED] nutchmapre]$ bin/nutch datanode
050728 213401 10 parsing file:/usr/local/nutchmapre/conf/nutch-default.xml
050728 213402 10 parsing file:/usr/local/nutchmapre/conf/nutch-site.xml
050728 213402 10 Opened server at 7000
050728 213402 11 Starting DataNode in: /tmp/nutch/ndfs/data/data
050728 213402 11 Exception: java.util.NoSuchElementException
050728 213402 11 Lost connection to namenode.  Retrying...

I opened the source and added a stack trace to 
src/java/org/apache/nutch/ndfs/DataNode.java


   551   public void run() {
   552     LOG.info("Starting DataNode in: " + data.data);
   553     while (true) {
   554       try {
   555         offerService();
   556       } catch (Exception ex) {
   557         LOG.info("Exception: " + ex);
   558         LOG.info("Lost connection to namenode.  Retrying...");
   559         ex.printStackTrace(); /*** Added by [EMAIL PROTECTED] ***/
   560         try {
   561           Thread.sleep(5000);
   562         } catch (InterruptedException ie) {
   563         }
   564       }
   565     }
   566   }

The stack trace presents the following:

java.util.NoSuchElementException
   at java.util.StringTokenizer.nextToken(StringTokenizer.java:259)
   at org.apache.nutch.ndfs.DF.<init>(DF.java:52)
   at org.apache.nutch.ndfs.FSDataset.getCapacity(FSDataset.java:204)
   at org.apache.nutch.ndfs.DataNode.offerService(DataNode.java:134)
   at org.apache.nutch.ndfs.DataNode.run(DataNode.java:555)
   at java.lang.Thread.run(Thread.java:534)

Looking at the code in  src/java/org/apache/nutch/ndfs/DF.java

    38     Process process = Runtime.getRuntime().exec(new String[] {"df","-k",path});
    39
    40     try {
    41       if (process.waitFor() == 0) {
    42         BufferedReader lines =
    43           new BufferedReader(new InputStreamReader(process.getInputStream()));
    44
    45         lines.readLine(); // skip headings
    46
    47         StringTokenizer tokens =
    48           new StringTokenizer(lines.readLine(), " \t\n\r\f%");
    49
    50         this.filesystem = tokens.nextToken();
    51         this.capacity = Long.parseLong(tokens.nextToken()) * 1024;
    52         this.used = Long.parseLong(tokens.nextToken()) * 1024;
    53         this.available = Long.parseLong(tokens.nextToken()) * 1024;
    54         this.percentUsed = Integer.parseInt(tokens.nextToken());
    55         this.mount = tokens.nextToken();
    56
    57       } else {
    58         throw new IOException
    59           (new BufferedReader(new InputStreamReader(process.getErrorStream()))
    60            .readLine());
    61       }

There is a call to df -k.  Here is the output from my df -k

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     286735816  31398804 240771636  12% /
/dev/hda1               101086     31962     63905  34% /boot
none                    387484         0    387484   0% /dev/shm

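I'm sure this email will not preserve the formatting 100%, but you can see 
that df wraps the long /dev/mapper device name onto a line of its own, so 
the line DF.java tokenizes holds only the filesystem name and a later 
nextToken() call fails.  This should be easy to fix; I have some local 
hacks but may be able to submit something more final.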
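
One possible way to tolerate the wrapped line is sketched below. This is not the actual DF.java fix and not the local hack mentioned above; the class name DfParseSketch and its parse method are hypothetical, and the sketch parses a captured output string instead of running df so that it can be tried anywhere. The tokenizer delimiters and the field order follow the DF.java excerpt.

```java
import java.util.StringTokenizer;

// Hypothetical sketch: parse the first filesystem entry of "df -k" output,
// accepting a device name that df has wrapped onto a line of its own.
final class DfParseSketch {
    private static final String DELIMITERS = " \t\n\r\f%";

    final String filesystem;
    final long capacity;    // bytes
    final long used;        // bytes
    final long available;   // bytes
    final int percentUsed;
    final String mount;

    private DfParseSketch(String filesystem, long capacity, long used,
                          long available, int percentUsed, String mount) {
        this.filesystem = filesystem;
        this.capacity = capacity;
        this.used = used;
        this.available = available;
        this.percentUsed = percentUsed;
        this.mount = mount;
    }

    static DfParseSketch parse(String dfOutput) {
        String[] lines = dfOutput.split("\\r?\\n");
        // lines[0] is the heading row; the data starts at lines[1].
        if (lines.length < 2) {
            throw new IllegalArgumentException("no data line in df output");
        }
        StringTokenizer tokens = new StringTokenizer(lines[1], DELIMITERS);
        if (!tokens.hasMoreTokens()) {
            throw new IllegalArgumentException("empty data line in df output");
        }
        String filesystem = tokens.nextToken();

        // A long device name leaves nothing else on the line: the numbers
        // follow on the next line, so continue tokenizing from there.
        if (!tokens.hasMoreTokens()) {
            if (lines.length < 3) {
                throw new IllegalArgumentException("truncated df output");
            }
            tokens = new StringTokenizer(lines[2], DELIMITERS);
        }

        long capacity = Long.parseLong(tokens.nextToken()) * 1024;
        long used = Long.parseLong(tokens.nextToken()) * 1024;
        long available = Long.parseLong(tokens.nextToken()) * 1024;
        int percentUsed = Integer.parseInt(tokens.nextToken());
        String mount = tokens.nextToken();
        return new DfParseSketch(filesystem, capacity, used, available,
                                 percentUsed, mount);
    }

    public static void main(String[] args) {
        String output =
            "Filesystem           1K-blocks      Used Available Use% Mounted on\n"
          + "/dev/mapper/VolGroup00-LogVol00\n"
          + "                     286735816  31398804 240771636  12% /\n";
        DfParseSketch df = parse(output);
        System.out.println(df.filesystem + " mounted on " + df.mount
            + ", " + df.percentUsed + "% used");
    }
}
```

Another option that may be worth trying is asking df for its POSIX output format (df -kP), which keeps each filesystem on a single line.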


-j






NDFS and Fedora Core 3

2005-07-28 Thread Jon Shoberg


After working through the tokenizer error, I kept getting another error 
when bin/nutch datanode started up: java.net.SocketException: Invalid 
argument or cannot assign requested address.  If you see this, you may 
have to add the following JVM option to the nutch script.  With it, I'm 
able to start the namenode and datanode without issue.


-Djava.net.preferIPv4Stack=true

A tech note explains why this is needed:

If IPv6 is available on the operating system the underlying native 
socket will be an IPv6 socket. This allows Java(tm) applications to 
connect to, and accept connections from, both IPv4 and IPv6 hosts.
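
To confirm that the option actually reached the JVM after editing the script, a small check along these lines may help. The class name Ipv4StackCheck is hypothetical. The sketch only reports the property value; the option itself should still be passed on the java command line as shown above, since it is consulted when the JVM's networking code starts up.

```java
// Hypothetical helper: reports whether java.net.preferIPv4Stack is set.
final class Ipv4StackCheck {

    /** True when the JVM was started with -Djava.net.preferIPv4Stack=true. */
    static boolean prefersIpv4Stack() {
        // Boolean.getBoolean reads the named system property and returns
        // true only if it is set to "true" (case-insensitive).
        return Boolean.getBoolean("java.net.preferIPv4Stack");
    }

    public static void main(String[] args) {
        System.out.println("java.net.preferIPv4Stack = " + prefersIpv4Stack());
    }
}
```

Running it once with and once without the option shows whether the setting is being picked up.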


Jon
http://jon.shoberg.net