0.7-dev, the search scoring
Hey guys!

I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff has changed, I see! One thing I can't quite grasp, though, is why Hit.getScore() has been removed in favour of the TopDocs-thingie. I wrote a quick add-on to support getting the score straight from the hit, which worked fine, but it would be nice to hear why the method was removed in the first place!

Also, are there any secret wikis, mailing lists, forums or similar for the 0.7 development? It would be very interesting to see what's cooking!

Greetings,
Fredrik
Re: [Nutch-dev] Re: ranking algorithm
Hi,

A clarification: without the analyze process, the score is calculated only from the count of inlinks for a page.

Fredrik Andersson wrote:
> It's open source, there's your in-depth info! : )
>
> Kleinberg's original report, "Authoritative sources in a hyperlinked environment", can be downloaded at http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by various people since it was released, but the principle of hubs and authorities is still the same.
>
> Fredrik
>
> On 7/28/05, Jay Pound [EMAIL PROTECTED] wrote:
>> Is there a whitepaper on the algorithm for Nutch, or some in-depth info on it anywhere?
>>
>> Thanks,
>> -J
Re: 0.7-dev, the search scoring
Fredrik Andersson wrote:
> I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff has changed I see! One thing I can't quite grasp though, is why the Hit.getScore() has been removed in favour of the TopDocs-thingie instead?

Hit.getScore() was generalized to Hit.getSortValue() in order to support sorting results by things other than score. If you sort by score, as is the default, then ((FloatWritable)Hit.getSortValue()).get() is the score. But if you sort by, e.g., a date string, then ((UTF8)Hit.getSortValue()).toString() is the date string sorted on, and the score is unavailable.

Perhaps the score should be made available regardless?

Doug
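For anyone porting similar code, the cast Doug describes can be wrapped in a small helper that returns the score only when the results were sorted by score. This is a minimal sketch, not Nutch code: the nested Hit, FloatWritable and UTF8 classes below are tiny stand-ins written only so the example compiles on its own, and the names SortValueSketch and scoreOf are hypothetical. The only calls taken from the message above are Hit.getSortValue(), FloatWritable.get() and UTF8.toString().

```java
public class SortValueSketch {

    // Stand-in for Nutch's FloatWritable (assumption: only get() is modelled).
    static final class FloatWritable {
        private final float value;
        FloatWritable(float value) { this.value = value; }
        float get() { return value; }
    }

    // Stand-in for Nutch's UTF8 (assumption: only toString() is modelled).
    static final class UTF8 {
        private final String value;
        UTF8(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    // Stand-in for Nutch's Hit (assumption: only getSortValue() is modelled).
    static final class Hit {
        private final Object sortValue;
        Hit(Object sortValue) { this.sortValue = sortValue; }
        Object getSortValue() { return sortValue; }
    }

    /**
     * Returns the score if the hit was sorted by score (sort value is a
     * FloatWritable), or null if it was sorted by something else.
     */
    static Float scoreOf(Hit hit) {
        Object sortValue = hit.getSortValue();
        return (sortValue instanceof FloatWritable)
                ? Float.valueOf(((FloatWritable) sortValue).get())
                : null;
    }

    public static void main(String[] args) {
        Hit byScore = new Hit(new FloatWritable(1.5f));
        Hit byDate = new Hit(new UTF8("20050728"));
        System.out.println("sorted by score: " + scoreOf(byScore));
        System.out.println("sorted by date:  " + scoreOf(byDate));
    }
}
```

Checking the type before casting avoids a ClassCastException when the search was sorted on a field other than score, which is the case where, as noted above, the score is not available.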
RE: ranking algorithm
A little bit off-topic: the Nutch ranking algorithm uses score and nextScore. Can anyone explain why we need nextScore?

Thank you,
Andrey

-----Original Message-----
From: Fredrik Andersson [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 28, 2005 7:42 AM
To: nutch-dev@lucene.apache.org
Subject: Re: ranking algorithm

It's open source, there's your in-depth info! : )

Kleinberg's original report, "Authoritative sources in a hyperlinked environment", can be downloaded at http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by various people since it was released, but the principle of hubs and authorities is still the same.

Fredrik

On 7/28/05, Jay Pound [EMAIL PROTECTED] wrote:
> Is there a whitepaper on the algorithm for Nutch, or some in-depth info on it anywhere?
>
> Thanks,
> -J
Re: 0.7-dev, the search scoring
Ah God, I am stupid ... thanks for that, Doug! I must be having a bad coding day today : )

Fredrik

On 7/28/05, Doug Cutting [EMAIL PROTECTED] wrote:
> Fredrik Andersson wrote:
>> I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff has changed I see! One thing I can't quite grasp though, is why the Hit.getScore() has been removed in favour of the TopDocs-thingie instead?
>
> Hit.getScore() was generalized to Hit.getSortValue() in order to support sorting results by things other than score. If you sort by score, as is the default, then ((FloatWritable)Hit.getSortValue()).get() is the score. But if you sort by, e.g., a date string, then ((UTF8)Hit.getSortValue()).toString() is the date string sorted on, and the score is unavailable.
>
> Perhaps the score should be made available regardless?
>
> Doug
NDFS Bug, Mapred from SVN - Tokenizer and New Line Error
I'm trying to start an NDFS datanode and keep getting the following error:

[EMAIL PROTECTED] nutchmapre]$ bin/nutch datanode
050728 213401 10 parsing file:/usr/local/nutchmapre/conf/nutch-default.xml
050728 213402 10 parsing file:/usr/local/nutchmapre/conf/nutch-site.xml
050728 213402 10 Opened server at 7000
050728 213402 11 Starting DataNode in: /tmp/nutch/ndfs/data/data
050728 213402 11 Exception: java.util.NoSuchElementException
050728 213402 11 Lost connection to namenode. Retrying...

I opened the source and added a stack trace to src/java/org/apache/nutch/ndfs/DataNode.java:

551   public void run() {
552     LOG.info("Starting DataNode in: " + data.data);
553     while (true) {
554       try {
555         offerService();
556       } catch (Exception ex) {
557         LOG.info("Exception: " + ex);
558         LOG.info("Lost connection to namenode. Retrying...");
559         ex.printStackTrace(); /*** Added by [EMAIL PROTECTED] ***/
560         try {
561           Thread.sleep(5000);
562         } catch (InterruptedException ie) {
563         }
564       }
565     }
566   }

The stack trace shows the following:

java.util.NoSuchElementException
    at java.util.StringTokenizer.nextToken(StringTokenizer.java:259)
    at org.apache.nutch.ndfs.DF.<init>(DF.java:52)
    at org.apache.nutch.ndfs.FSDataset.getCapacity(FSDataset.java:204)
    at org.apache.nutch.ndfs.DataNode.offerService(DataNode.java:134)
    at org.apache.nutch.ndfs.DataNode.run(DataNode.java:555)
    at java.lang.Thread.run(Thread.java:534)

Looking at the code in src/java/org/apache/nutch/ndfs/DF.java:

38     Process process = Runtime.getRuntime().exec(new String[] {"df", "-k", path});
39
40     try {
41       if (process.waitFor() == 0) {
42         BufferedReader lines =
43           new BufferedReader(new InputStreamReader(process.getInputStream()));
44
45         lines.readLine(); // skip headings
46
47         StringTokenizer tokens =
48           new StringTokenizer(lines.readLine(), " \t\n\r\f%");
49
50         this.filesystem = tokens.nextToken();
51         this.capacity = Long.parseLong(tokens.nextToken()) * 1024;
52         this.used = Long.parseLong(tokens.nextToken()) * 1024;
53         this.available = Long.parseLong(tokens.nextToken()) * 1024;
54         this.percentUsed = Integer.parseInt(tokens.nextToken());
55         this.mount = tokens.nextToken();
56
57       } else {
58         throw new IOException
59           (new BufferedReader(new InputStreamReader(process.getErrorStream()))
60            .readLine());
61       }

There is a call to df -k. Here is the output from my df -k:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     286735816  31398804 240771636  12% /
/dev/hda1               101086     31962     63905  34% /boot
none                    387484         0    387484   0% /dev/shm

I'm sure this email will not format the text 100%, but you can see there is an extra newline after /dev/mapper/VolGroup00-LogVol00. This should be easy to fix; I have some local hacks but may be able to submit something more final.

-j
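To spell out the failure: when df wraps a long device name onto its own line, the second readLine() in DF.java returns only the device name, so the tokenizer runs out of tokens as soon as the numeric fields are requested. One possible workaround is to keep collecting tokens from the following lines until all six fields are present. The sketch below only illustrates that idea and is not the fix that went into Nutch. DfParseSketch, DfEntry and parseFirstEntry are hypothetical names, and the parser works on a string rather than running df, so it can be tried anywhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DfParseSketch {

    /** The six fields of one df -k entry; sizes are converted from 1K blocks to bytes. */
    static final class DfEntry {
        final String filesystem;
        final long capacity;
        final long used;
        final long available;
        final int percentUsed;
        final String mount;

        DfEntry(List<String> fields) {
            this.filesystem = fields.get(0);
            this.capacity = Long.parseLong(fields.get(1)) * 1024;
            this.used = Long.parseLong(fields.get(2)) * 1024;
            this.available = Long.parseLong(fields.get(3)) * 1024;
            this.percentUsed = Integer.parseInt(fields.get(4));
            this.mount = fields.get(5);
        }
    }

    /** Parses the first filesystem entry, tolerating an entry wrapped over several lines. */
    static DfEntry parseFirstEntry(String dfOutput) {
        String[] lines = dfOutput.split("\\r?\\n");
        List<String> fields = new ArrayList<>();
        // Start at index 1 to skip the heading line, then gather tokens
        // across as many lines as it takes to reach six fields.
        for (int i = 1; i < lines.length && fields.size() < 6; i++) {
            StringTokenizer tokens = new StringTokenizer(lines[i], " \t\n\r\f%");
            while (tokens.hasMoreTokens() && fields.size() < 6) {
                fields.add(tokens.nextToken());
            }
        }
        if (fields.size() < 6) {
            throw new IllegalArgumentException("Unexpected df output: " + dfOutput);
        }
        return new DfEntry(fields);
    }

    public static void main(String[] args) {
        String wrapped =
            "Filesystem 1K-blocks Used Available Use% Mounted on\n"
            + "/dev/mapper/VolGroup00-LogVol00\n"
            + "                     286735816 31398804 240771636 12% /\n"
            + "/dev/hda1 101086 31962 63905 34% /boot\n";
        DfEntry entry = parseFirstEntry(wrapped);
        System.out.println(entry.filesystem + " mounted on " + entry.mount
            + ", " + entry.percentUsed + "% used");
    }
}
```

Like the original tokenizer code, this still assumes the mount point contains no whitespace. Another option, not tested against this setup, is to ask df for the POSIX portable format with -P, which is specified to print each filesystem on a single line.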
NDFS and Fedora Core 3
After working through a tokenizer error, I kept getting an error when bin/nutch datanode started up. It said:

java.net.SocketException: Invalid argument or cannot assign requested address

If you see this, you may have to add the following option to the nutch script. Afterwards I was able to start the namenode and datanode without issue.

-Djava.net.preferIPv4Stack=true

A tech note about why this is needed:

"If IPv6 is available on the operating system the underlying native socket will be an IPv6 socket. This allows Java(tm) applications to connect to, and accept connections from, both IPv4 and IPv6 hosts."

Jon
http://jon.shoberg.net