Re: Re: Creating a new scoring filter.

2007-02-23 Thread Nicolás Lichtmaier


Hi, I'm working with a fixed set of URLs and I'd like to replace the 
standard OPIC scoring plugin with something different. I'd like to 
create a scoring plugin which bases its score entirely on the 
document's parsed data (yes, I will trust the document text itself to 
decide its relevance).


I've been reading the code and the ScoringFilter interface seems to 
be targeted at OPIC-like algorithms. For example, the step 
called after parsing is named "passScoreAfterParsing()", which tells me 
what I'm supposed to do in that method, and the method that sets the 
scores is called "distributeScoreToOutlink()". All of this scares 
me... would it be safe to use these methods differently and, e.g., 
modify the document score in "passScoreAfterParsing()" instead of 
just "passing it"?


You can modify it whichever way you want - it's up to you. These methods 
simply ensure that the score data (not just CrawlDatum.getScore(), 
but possibly a multitude of metadata collected along the way) is passed 
to the appropriate segment parts.


E.g. in distributeScoreToOutlink() you could simply set the default 
score for new pages to a fixed value, without actually using the score 
information from the source page.
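
For example, something like this (I'm going from memory of the 0.8.x 
ScoringFilter signature here, so double-check it against your version):

  // Give every newly discovered page a fixed score, ignoring the source page.
  public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
      ParseData parseData, CrawlDatum target, CrawlDatum adjust,
      int allCount, int validCount) {
    target.setScore(0.5f);   // fixed default score for the new page
    return adjust;           // no adjustment back to the source page
  }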




Yeah, but there I don't have the parse data for those new pages. What I 
would like to do is override "passScoreAfterParsing()" and not pass 
anything: just analyze the parsed data and decide a score. The problem 
is that that function doesn't get passed the CrawlDatum... it seems I'll 
need to modify Nutch itself =(


Thanks!



Re: Creating a new scoring filter.

2007-02-27 Thread Nicolás Lichtmaier



Yeah, but there I don't have the parse data for those new pages. What I
would like to do is override "passScoreAfterParsing()" and not pass
anything: just analyze the parsed data and decide a score. The problem
is that that function doesn't get passed the CrawlDatum... it seems I'll
need to modify Nutch itself =(

Can you be a bit more specific about your problem?


I'm indexing a fixed set of URLs that I think are a specific type of 
document. I don't care about links (I'm using -noAdditions to prevent 
adding links to crawldb, I've backported that to 0.8.x and it's waiting 
for somebody to commit it =) 
https://issues.apache.org/jira/browse/NUTCH-438 ).


I just want to replace the scoring algorithm with one which tests whether 
each URL really is that specific type of document. I want to use a 
document's parse data to calculate its relevance.



Anyway, without the details, here is my guess on how you can do it:
1) In passScoreAfterParsing(), analyze the content and parse text and
put the relevant score information in parse data's metadata.
2) In distributeScoreToOutlink() ignore the outlinks (just give them
initialScore()),
but check your parse data and return an adjust datum with the status
STATUS_LINKED and score extracted from parse data. This adjust datum
will update the score of the original datum in updatedb.

Does this work for you?
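
Here is a rough sketch of those two steps (I'm going from memory of the 
0.8.x/0.9 ScoringFilter signatures and metadata accessors, so treat it 
only as an illustration; the other interface methods and the Configurable 
plumbing are omitted, and "content.score" is just a made-up metadata key):

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

public class ContentScoringFilter /* implements ScoringFilter */ {

  private static final String SCORE_KEY = "content.score";   // hypothetical key

  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    // 1) Analyze the parse text and stash the computed score in the parse metadata.
    float score = computeRelevance(parse.getText());          // your own analysis
    parse.getData().getParseMeta().set(SCORE_KEY, Float.toString(score));
  }

  public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
      ParseData parseData, CrawlDatum target, CrawlDatum adjust,
      int allCount, int validCount) {
    // 2) Leave the outlink alone and return an adjust datum that carries the
    //    computed score back to the source page at updatedb time.
    String s = parseData.getParseMeta().get(SCORE_KEY);
    if (s == null) return adjust;
    if (adjust == null) adjust = new CrawlDatum();
    adjust.setStatus(CrawlDatum.STATUS_LINKED);
    adjust.setScore(Float.parseFloat(s));
    return adjust;
  }

  private float computeRelevance(String text) {
    // Placeholder: whatever content analysis decides the document's relevance.
    return (text != null && text.length() > 0) ? 1.0f : 0.0f;
  }
}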


It doesn't seem like a good way to do it. What if there are no outlinks? This 
method won't be called at all. And anyway, it would be called once for 
each outlink, which would multiply the work.


Thanks!



Re: Creating a new scoring filter.

2007-02-27 Thread Nicolás Lichtmaier



It doesn't seem like a good way to do it. What if there are no outlinks? This
method won't be called at all. And anyway, it would be called once for
each outlink, which would multiply the work.


Multiplication is easy to solve but you are right that it won't work
if there are no outlinks.

Maybe the scoring filter API should change? A distributeScoreToOutlinks
method (which would be called even if there are no outlinks) may be more
useful than the current one:

CrawlDatum distributeScoreToOutlinks(Text fromUrl, List toUrlList,
    List datumList, ParseData parseData, CrawlDatum adjust)

This method gives more control to the plugin, since knowing all the
outlinks the plugin can make more informed decisions. For example, right now
there is no way a scoring filter can be sure that it has distributed
all its cash (e.g. if db.score.internal.link is 0.5 and
db.score.external.link is 1.0, the filter will almost always distribute
less than its cash).

This will also work for your case, since you will just ignore the
outlinks and return the adjust datum based on information in parse
metadata.

What do you (and others) think?


I think that good API design here means not assuming so many things 
about the plugin's behaviour. You are right about this 
"distributeScoreToOutlinks()", but IMO it should be called something 
like assignScores(). Then you could add an abstract class 
DistributingScorePlugin (implementing the interface) which overrides 
assignScores() and calls an "abstract protected" method called 
distributeScoreToOutlink(). So the code for traversing the outlinks 
would be in DistributingScorePlugin.


I would need another class, called ContentBasedScorePlugin. That class 
could call an abstract protected method called calculateScore() which 
would receive the parsed data and return the score.


What do you think?



Re: Creating a new scoring filter.

2007-02-27 Thread Nicolás Lichtmaier



I didn't understand the point of creating abstract base classes for
plugins. I am not strictly opposing it or anything, I just don't see
why it would make things simpler/more flexible. AFAICS, there is not
much an abstract base class can do but to pass the arguments of
assignScores to calculateScore/distributeScoreToOutlinks. I mean, here
is how I envision a ContentBasedScoringFilter class (or a
DistributingScoringFilter):

abstract class ContentBasedScoringFilter implements ScoringFilter {
  assignScores(args) { return calculateScore(args);  }
  protected abstract calculateScore(args);
}

Or do you have something else in mind?


Yes, something like that. But I also thought that if you don't want to 
repeat the logic of traversing the outlinks (with all the logic which 
is now in ParseOutputFormat), that logic could live in an abstract class 
which would just traverse them and call an abstract method for each one, 
roughly like the sketch below.
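
Something along these lines (all the names here are hypothetical, and 
assignScores() is the proposed interface method, not something that exists 
in the current ScoringFilter):

import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;

/**
 * Hypothetical base class: it owns the outlink traversal that currently
 * lives in ParseOutputFormat, so "distributing" plugins only implement
 * the per-outlink decision.
 */
public abstract class DistributingScoringFilter /* implements ScoringFilter */ {

  public CrawlDatum assignScores(Text fromUrl, List<Text> toUrls,
      List<CrawlDatum> targets, ParseData parseData, CrawlDatum adjust) {
    // Visit every outlink exactly once and delegate the per-link scoring.
    for (int i = 0; i < toUrls.size(); i++) {
      distributeScoreToOutlink(fromUrl, toUrls.get(i), targets.get(i), parseData);
    }
    return adjust;
  }

  /** Called once per outlink by the traversal above. */
  protected abstract void distributeScoreToOutlink(Text fromUrl, Text toUrl,
      CrawlDatum target, ParseData parseData);
}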




Re: [Nutch-dev] Creating a new scoring filter

2007-04-19 Thread Nicolás Lichtmaier


sorry to re-open this thread, but I am facing the same problem as 
Nicolás. I like both your (Doğacan's) and Nicolás's ideas, yours more, 
as I think abstract classes are not good extension points.


That wasn't what I had proposed. My suggestion was to use an interface, 
as always, but to make this API really clean, expressing the minimum the 
rest of the code needs from a scoring plugin and removing assumptions 
about its implementation. Then I proposed an abstract class, 
implementing this interface, with a skeleton for any class which works 
by "distributing score to outlinks". So we would have the best of both 
worlds: people creating new "PageRank" algorithms wouldn't need to 
reimplement anything, they would just subclass the abstract class. And 
people like you and me would implement the interface directly (or use a 
different abstract class if there's common logic to share). My boss put 
all of this on hold, but I'd like to implement this idea in the near 
future and try to have it included in Nutch.




Plugins initialized all the time!

2007-05-28 Thread Nicolás Lichtmaier
I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems 
that the plugin repository initializes itself all the time until I get 
an out-of-memory exception. I've been looking at the code... the plugin 
repository maintains a map from Configuration to plugin repositories, but 
the Configuration object does not have an equals or hashCode method... 
wouldn't it be nice to add such methods (comparing property values)? 
Wouldn't that help prevent initializing so many plugin repositories? What 
could be the cause of my problem? (Aaah.. so many questions... =) )


Bye!


Re: Plugins initialized all the time!

2007-05-28 Thread Nicolás Lichtmaier


More info...

I see "map" progressing from 0% to 100. It seems to reload plugins whan 
reaching 100%. Besides, I've realized that each NutchJob is a 
Configuration, so (as is there's no "equals") a plugin repo would be 
created per each NutchJob...




Re: Plugins initialized all the time!

2007-05-29 Thread Nicolás Lichtmaier



Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch


Some comments about your patch. The approach seems nice: you only check 
the parameters that affect plugin loading. But bear in mind that the 
plugins themselves will configure themselves with many other parameters, 
so to keep things safe there should be a PluginRepository for each set 
of parameters (including all of them). Besides, remember that CACHE is a 
WeakHashMap and you are creating ad-hoc PluginProperty objects as keys; 
something doesn't look right there... the lifespan of those objects will 
be much shorter than you require. Perhaps you should be using 
SoftReferences instead, or a simple LRU cache (LinkedHashMap provides 
that easily).


Anyway, I'll try to build my own Nutch to test your patch.

Thanks!



Re: Plugins initialized all the time!

2007-05-29 Thread Nicolás Lichtmaier



I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out-of-memory exception. I've been looking at the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such methods (comparing property values)?
Wouldn't that help prevent initializing so many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )


Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch


I'm running it. So far it's working ok, and I haven't seen all those 
plugin loadings...


I've modified your patch though to define CACHE like this:

  private static final Map CACHE =
      new LinkedHashMap() {
        @Override
        protected boolean removeEldestEntry(Entry eldest) {
          return size() > 10;
        }
      };

...which means an LRU cache with a fixed size of 10.
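
(Strictly speaking, the no-argument LinkedHashMap constructor keeps 
insertion order, so this evicts the oldest-inserted entry; for true LRU 
eviction you'd want the access-order constructor, roughly like this, with 
PluginProperty/PluginRepository as the assumed key and value types:)

  private static final Map<PluginProperty, PluginRepository> CACHE =
      new LinkedHashMap<PluginProperty, PluginRepository>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(
            Map.Entry<PluginProperty, PluginRepository> eldest) {
          return size() > 10;   // keep at most 10 plugin repositories
        }
      };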



Re: Plugins initialized all the time!

2007-05-31 Thread Nicolás Lichtmaier



Actually thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get() where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository
but PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository-s again and
again...

What do you think?


IMO a better way would be to add a proper equals() method (and hashCode()) 
to Hadoop's Configuration object that would call 
getProps().equals(o.getProps()), so that you could use them as keys... 
Every class which is a map from keys to values has equals & hashCode 
(Properties, HashMap, etc.).
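
Something along these lines (just a sketch of the idea; it would have to 
live inside Configuration itself, since getProps() is not public):

// Sketch: two configurations are equal iff their resolved properties are equal.
public boolean equals(Object o) {
  if (this == o) return true;
  if (!(o instanceof Configuration)) return false;
  return getProps().equals(((Configuration) o).getProps());
}

public int hashCode() {
  return getProps().hashCode();
}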


Another nice thing would be to be able to "freeze" a configuration 
object, preventing anyone from modifying it.




Making "Hits" work as a normal List

2007-05-31 Thread Nicolás Lichtmaier

Why not?

I'm attaching a patch which does that. Nutch's API looks very messy, 
very un-Java. Why not make Hits work as a list, so that it can be 
iterated with the new Java "for" loop and processed using the normal 
Collections APIs (for getting subsets, for instance, or for dumping the 
hits into another collection)?
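
For example, with the patch applied you could do things like this (a sketch; 
the Hits come from wherever you normally get them, and the cast is only 
needed because the class stays a raw AbstractList):

Hits hits = nutchBean.search(query, 20);       // or however the Hits were obtained
for (Object o : hits) {
  Hit hit = (Hit) o;
  System.out.println(hit.getIndexDocNo());
}
List firstTen = hits.subList(0, Math.min(10, hits.size()));   // plain Collections API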
Index: src/java/org/apache/nutch/searcher/Hits.java
===
--- src/java/org/apache/nutch/searcher/Hits.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/Hits.java	(working copy)
@@ -20,13 +20,14 @@
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
+import java.util.AbstractList;
 
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.io.Text;
 
 /** A set of hits matching a query. */
-public final class Hits implements Writable {
+public final class Hits extends AbstractList implements Writable  {
 
   private long total;
   private boolean totalIsExact = true;
@@ -83,7 +84,7 @@
   public void readFields(DataInput in) throws IOException {
 total = in.readLong();// read total hits
 top = new Hit[in.readInt()];  // read hits returned
-Class sortClass = null;
+Class sortClass = null;
 if (top.length > 0) { // read sort value class
   try {
 sortClass = Class.forName(Text.readString(in));
@@ -109,4 +110,23 @@
 }
   }
 
+  @Override
+  public Hit get(int index) {
+return getHit(index);
+  }
+
+  @Override
+  public int size() {
+return getLength();
+  }
+  
+  @Override
+  public boolean equals(Object o) {
+if(!super.equals(o))
+  return false;
+if(!(o instanceof Hits))
+  return false;
+Hits h = (Hits)o;
+return h.totalIsExact == totalIsExact && h.total == total;
+  }
 }


[PATCH] Moving HitDetails construction to a constructor =)

2007-05-31 Thread Nicolás Lichtmaier
I propose this patch. It moves the logic of constructing a HitDetails 
from a Lucene document into the HitDetails constructor. It also removes 
useless array copies. The benefit of this patch is being able to use 
part of Nutch's machinery (get a Lucene document by other means and later 
construct a HitDetails from it). It also looks cleaner IMO.
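
For example (a sketch; docId stands for whatever Lucene document id you 
obtained from your own query):

Document doc = luceneSearcher.doc(docId);      // a Lucene document fetched by other means
HitDetails details = new HitDetails(doc);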


Thanks!

Index: src/java/org/apache/nutch/searcher/IndexSearcher.java
===
--- src/java/org/apache/nutch/searcher/IndexSearcher.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/IndexSearcher.java	(working copy)
@@ -21,6 +21,8 @@
 
 import java.util.ArrayList;
 import java.util.Enumeration;
+import java.util.Iterator;
+import java.util.List;
 
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
@@ -105,20 +107,8 @@
   }
 
   public HitDetails getDetails(Hit hit) throws IOException {
-ArrayList fields = new ArrayList();
-ArrayList values = new ArrayList();
-
 Document doc = luceneSearcher.doc(hit.getIndexDocNo());
-
-Enumeration e = doc.fields();
-while (e.hasMoreElements()) {
-  Field field = (Field)e.nextElement();
-  fields.add(field.name());
-  values.add(field.stringValue());
-}
-
-return new HitDetails((String[])fields.toArray(new String[fields.size()]),
-  (String[])values.toArray(new String[values.size()]));
+return new HitDetails(doc);
   }
 
   public HitDetails[] getDetails(Hit[] hits) throws IOException {
Index: src/java/org/apache/nutch/searcher/HitDetails.java
===
--- src/java/org/apache/nutch/searcher/HitDetails.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/HitDetails.java	(working copy)
@@ -21,8 +21,11 @@
 import java.io.DataOutput;
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.List;
 
 import org.apache.hadoop.io.*;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
 import org.apache.nutch.html.Entities;
 
 /** Data stored in the index for a hit.
@@ -52,7 +55,20 @@
 this.fields[1] = "url";
 this.values[1] = url;
   }
+  
+  /** Construct from Lucene document. */
+  public HitDetails(Document doc)
+  {
+List ff = doc.getFields();
+length = ff.size();
 
+for(int i = 0 ; i < length ; i++) {
+  Field field = (Field)ff.get(i);
+  fields[i] = field.name();
+  values[i] = field.stringValue();
+}
+  }
+
   /** Returns the number of fields contained in this. */
   public int getLength() { return length; }
 


[PATCH] Moving HitDetails construction to a HitDetails constructor (v2).

2007-06-01 Thread Nicolás Lichtmaier
This is a fixed version of the previous patch. Please, don't ignore me 
=). I'm trying to use Lucene queries with Nutch and this patch will 
help. This patch also removes a deprecated API usage, removes useless 
object creation and array copying.


Thanks!

Index: src/java/org/apache/nutch/searcher/IndexSearcher.java
===
--- src/java/org/apache/nutch/searcher/IndexSearcher.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/IndexSearcher.java	(working copy)
@@ -21,6 +21,8 @@
 
 import java.util.ArrayList;
 import java.util.Enumeration;
+import java.util.Iterator;
+import java.util.List;
 
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
@@ -105,20 +107,8 @@
   }
 
   public HitDetails getDetails(Hit hit) throws IOException {
-ArrayList fields = new ArrayList();
-ArrayList values = new ArrayList();
-
 Document doc = luceneSearcher.doc(hit.getIndexDocNo());
-
-Enumeration e = doc.fields();
-while (e.hasMoreElements()) {
-  Field field = (Field)e.nextElement();
-  fields.add(field.name());
-  values.add(field.stringValue());
-}
-
-return new HitDetails((String[])fields.toArray(new String[fields.size()]),
-  (String[])values.toArray(new String[values.size()]));
+return new HitDetails(doc);
   }
 
   public HitDetails[] getDetails(Hit[] hits) throws IOException {
Index: src/java/org/apache/nutch/searcher/HitDetails.java
===
--- src/java/org/apache/nutch/searcher/HitDetails.java	(revision 543252)
+++ src/java/org/apache/nutch/searcher/HitDetails.java	(working copy)
@@ -21,8 +21,11 @@
 import java.io.DataOutput;
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.List;
 
 import org.apache.hadoop.io.*;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
 import org.apache.nutch.html.Entities;
 
 /** Data stored in the index for a hit.
@@ -52,7 +55,23 @@
 this.fields[1] = "url";
 this.values[1] = url;
   }
+  
+  /** Construct from Lucene document. */
+  public HitDetails(Document doc)
+  {
+List ff = doc.getFields();
+length = ff.size();
+
+fields = new String[length];
+values = new String[length];
 
+for(int i = 0 ; i < length ; i++) {
+  Field field = (Field)ff.get(i);
+  fields[i] = field.name();
+  values[i] = field.stringValue();
+}
+  }
+
   /** Returns the number of fields contained in this. */
   public int getLength() { return length; }
 


Re: [PATCH] Moving HitDetails construction to a HitDetails constructor (v2).

2007-06-03 Thread Nicolás Lichtmaier



Please, don't ignore me =).
We don't - but there's only so much you can do in 24 hrs/day, and Nutch 
developers have their own lives to attend to ... ;)


=) Sorry, I didn't mean to sound "demanding". It's just that there's a 
natural focus on real features, and I thought that "tidiness" patches go 
unnoticed.





I'm trying to use Lucene queries with Nutch and this patch will help. 
This patch also removes a deprecated API usage, removes useless 
object creation and array copying.


I believe the conversion from Document to HitDetails was separated 
this way on purpose. Please note that front-end Nutch API has no 
dependencies on Lucene classes. If we applied your patch, all of a 
sudden HitDetails would become dependent on Lucene, causing front-end 
applications to become dependent on Lucene, too.


We can certainly fix the use of deprecated API as you suggested. As 
for the rest of the patch, in my opinion it should not be applied.




Oh, I see... a pity. It looked cleaner to me, and I'll have to 
copy+paste that into my code. What about the other patch? (Retrofitting 
Hits to implement List.)
to implement List)