Re: Contributing

2006-03-13 Thread Alexander E Genaud
Mr. Vertical Search,

Are you suggesting changing the end user interface, the middle user
(crawl and content guy), or developer interface?

I am considering writing Ant Tasks for crawling. Do we expect that the
targets could remain consistent between releases (crawls crawl,
injects inject, whether nutch 0.7, 0.8, or 0.9)?

Cheers,
Alex
--
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1
--


From: Vertical Search [EMAIL PROTECTED]
To: nutch-dev nutch-dev@lucene.apache.org
Date: Thu, 9 Mar 2006 12:10:42 -0600
Subject: Contributing
Hello,
I was wondering if anyone is willing to consider some changes to make
Nutch more user friendly.
I would like to get a general feeling for the code base by reviewing code and
cleaning up shadowed variables, etc.

Is someone doing this already? I am willing to take some time to contribute.
Are there any specific qualifications needed to contribute? Please let me know.


Thanks


[jira] Closed: (NUTCH-229) improved handling of plugin folder configuration

2006-03-13 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-229?page=all ]
 
Andrzej Bialecki  closed NUTCH-229:
---

Resolution: Fixed

Applied. Thanks!

 improved handling of plugin folder configuration
 

  Key: NUTCH-229
  URL: http://issues.apache.org/jira/browse/NUTCH-229
  Project: Nutch
 Type: Improvement
 Reporter: Stefan Groschupf
 Priority: Critical
  Fix For: 0.8-dev
  Attachments: pluginFolder.patch

 Currently Nutch only supports absolute paths, or relative paths that are part
 of the classpath.
 There are cases where it would be useful to be able to use relative paths
 that are not in the classpath, for example to have a centralized plugin
 repository on a shared disk in a cluster, or to run Nutch inside an IDE.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-206) search server throws InstantiationException

2006-03-13 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-206?page=all ]
 
Andrzej Bialecki  closed NUTCH-206:
---

Fix Version: 0.8-dev
 Resolution: Fixed

Fixed in r 384011.

 search server throws InstantiationException
 ---

  Key: NUTCH-206
  URL: http://issues.apache.org/jira/browse/NUTCH-206
  Project: Nutch
 Type: Bug
   Components: searcher
 Versions: 0.8-dev
  Environment: windows 2003
 cygwin
 Reporter: jimmy
  Fix For: 0.8-dev


 060207 230215 23 Server connection on port  from 127.0.0.1 caught: java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.searcher.Query
 java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.searcher.Query
     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:238)
     at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:88)
     at org.apache.hadoop.ipc.Server$Connection.run(Server.java:138)
 Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query
     at java.lang.Class.newInstance0(Class.java:335)
     at java.lang.Class.newInstance(Class.java:303)
     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231)
     ... 2 more
 060207 230215 23 Server connection on port  from 127.0.0.1: exiting
 060207 230225 24 Server connection on port  from 127.0.0.1: starting




[jira] Closed: (NUTCH-3) multi values of header discarded

2006-03-13 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-3?page=all ]
 
Andrzej Bialecki  closed NUTCH-3:
-

Resolution: Fixed

Fixed in r 376089.

 multi values of header discarded
 

  Key: NUTCH-3
  URL: http://issues.apache.org/jira/browse/NUTCH-3
  Project: Nutch
 Type: Bug
 Reporter: Stefan Groschupf
 Assignee: Stefan Groschupf
  Fix For: 0.8-dev
  Attachments: contentPropertiesAddpatch.txt, multiValuesPropertyPatch.txt

 original by: phoebe
 http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356
 multi values of header discarded
 Each successive setting of a header value deletes the
 previous one.
 This patch allows multiple values to be retained, such as
 cookies, using CR LF ("\r\n") as a delimiter between values.
 --- /tmp/HttpResponse.java 2005-01-27 19:57:55.0 -0500
 +++ HttpResponse.java 2005-01-27 20:45:01.0 -0500
 @@ -324,7 +324,19 @@
  }
  String value = line.substring(valueStart);
 - headers.put(key, value);
 +//Spec allows multiple values, such as Set-Cookie - using lf cr as delimiter
 + if ( headers.containsKey(key)) {
 + try {
 + Object obj= headers.get(key);
 + if ( obj != null) {
 + String oldvalue= headers.get(key).toString();
 + value = oldvalue + "\r\n" + value;
 + }
 + } catch (Exception e) {
 + e.printStackTrace();
 + }
 + }
 + headers.put(key, value);
  }
  private Map parseHeaders(PushbackInputStream in, StringBuffer line)
 @@ -399,5 +411,3 @@
  }




Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Stefan Groschupf

* Change the syntax used in Nutch?
+1, my point of view is that we can do that for Nutch 0.8 as long as we
document it (see nutch-user). :-)



Stefan 


Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang



I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in Nutch)


Are there other ways to match start/end of string in the other
regex library? I use ^http a lot because a lot of sites pass around
urls in the query string, and I don't want them (eg.
http://del.icio.us/howie?url=http://lucene.apache.org/nutch)
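
A minimal illustration of why the ^ anchor matters for this case, using java.util.regex as a stand-in engine. It assumes the filter applies patterns with unanchored "find" semantics; the class and method names are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

class AnchoredMatchDemo {
    /** Unanchored search: true if the regex matches anywhere in the URL. */
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String direct = "http://lucene.apache.org/nutch";
        String embedded = "http://del.icio.us/howie?url=http://lucene.apache.org/nutch";

        // Without ^ the pattern also matches the URL embedded in the query string.
        System.out.println(accepts("http://lucene\\.apache\\.org/", embedded));
        // With ^ only URLs that start with the pattern are accepted.
        System.out.println(accepts("^http://lucene\\.apache\\.org/", embedded));
        System.out.println(accepts("^http://lucene\\.apache\\.org/", direct));
    }
}
```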

Howie




Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Matt Kangas
I've been watching the discussion of faster regex libs with much
interest. But if regex speed seems to be a problem, would using fewer
regexes be a good answer?


Protocol and extension filtering could be done by another URLFilter  
plugin that is dedicated to this task, and uses more lightweight  
string-chopping techniques. That way full regex support could be  
retained for the tasks where it's really needed.
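
A rough sketch of what such a lightweight filter might look like, using only string operations. The class name, constructor, and the convention of returning null for a rejected URL are assumptions made for illustration; this is not an existing Nutch plugin.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PrefixSuffixUrlFilter {
    private final Set<String> allowedSchemes;
    private final Set<String> blockedExtensions;

    PrefixSuffixUrlFilter(Set<String> allowedSchemes, Set<String> blockedExtensions) {
        this.allowedSchemes = allowedSchemes;
        this.blockedExtensions = blockedExtensions;
    }

    /** Returns the URL unchanged if it passes, or null if it is rejected. */
    String filter(String url) {
        // Protocol check: everything before the first ':' is the scheme.
        int colon = url.indexOf(':');
        if (colon <= 0) return null;
        String scheme = url.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!allowedSchemes.contains(scheme)) return null;

        // Ignore the query string and fragment when looking for an extension.
        int end = url.length();
        int q = url.indexOf('?');
        if (q >= 0) end = Math.min(end, q);
        int h = url.indexOf('#');
        if (h >= 0) end = Math.min(end, h);

        // Skip "//host" so that a host such as "example.com" is not read as ".com".
        int hostStart = url.startsWith("//", colon + 1) ? colon + 3 : colon + 1;
        int pathStart = url.indexOf('/', hostStart);
        if (pathStart < 0 || pathStart >= end) return url; // no path to inspect

        String path = url.substring(pathStart, end);
        int dot = path.lastIndexOf('.');
        if (dot > path.lastIndexOf('/') && dot < path.length() - 1) {
            String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (blockedExtensions.contains(ext)) return null;
        }
        return url;
    }

    public static void main(String[] args) {
        PrefixSuffixUrlFilter f = new PrefixSuffixUrlFilter(
            new HashSet<>(Arrays.asList("http", "https")),
            new HashSet<>(Arrays.asList("gif", "jpg", "exe")));
        System.out.println(f.filter("http://example.com/index.html"));
        System.out.println(f.filter("http://example.com/logo.GIF?x=1"));
        System.out.println(f.filter("ftp://example.com/file.txt"));
    }
}
```

Both checks are plain set lookups on substrings, so no regex engine is involved for the common protocol and extension rules.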



On Mar 13, 2006, at 12:31 PM, Howie Wang wrote:




I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in Nutch)


Are there other ways to match start/end of string in the other
regex library? I use ^http a lot because a lot of sites pass around
urls in the query string, and I don't want them (eg.
http://del.icio.us/howie?url=http://lucene.apache.org/nutch)

Howie


--
Matt Kangas / [EMAIL PROTECTED]




Re: AnalyzerFactory

2006-03-13 Thread Doug Cutting

Jérôme Charron wrote:

It seems that the usage of AnalyzerFactory was removed while porting Indexer
to map/reduce.
(AnalyzerFactory is no longer called in trunk code)
Is it intentional?
(If not, I have a patch that I can commit, so please confirm.)


It was not intentional.  Thanks for fixing this!

Doug


Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Chris Mattmann
Hi Folks,

  I updated to the latest SVN revision (385691) today, and I am now seeing a
Null Pointer exception in the AnalyzerFactory.java class. It seems that in
some cases, the method:

  private Extension getExtension(String lang) {
    Extension extension = (Extension) this.conf.getObject(lang);
    if (extension == null) {
      extension = findExtension(lang);
      if (extension != null) {
        this.conf.setObject(lang, extension);
      }
    }
    return extension;
  }


is passed a null lang parameter, which causes a NullPointerException
at line 81 in
src/java/org/apache/nutch/analysis/AnalyzerFactory.java

I found that if I checked for null in the lang variable, and returned null
if lang == null, that my crawl finished. Here is a small patch that will fix
the crawl:

Index: /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
===================================================================
--- /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (revision 385691)
+++ /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (working copy)
@@ -78,14 +78,19 @@
   private Extension getExtension(String lang) {
-    Extension extension = (Extension) this.conf.getObject(lang);
-    if (extension == null) {
-      extension = findExtension(lang);
-      if (extension != null) {
-        this.conf.setObject(lang, extension);
-      }
-    }
-    return extension;
+    if(lang == null){
+      return null;
+    }
+    else{
+      Extension extension = (Extension) this.conf.getObject(lang);
+      if (extension == null) {
+        extension = findExtension(lang);
+        if (extension != null) {
+          this.conf.setObject(lang, extension);
+        }
+      }
+      return extension;
+    }
   }
   private Extension findExtension(String lang) {


NOTE: not sure if returning null is the right thing to do here, but hey, at
least it made my crawl finish! :-)

Cheers,
  Chris



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246

___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Jérôme Charron
 I updated to the latest SVN revision (385691) today, and I am now seeing
 a Null Pointer exception in the AnalyzerFactory.java class.

Fixed (r385702). Thanks Chris.


 NOTE: not sure if returning null is the right thing to do here, but hey,
 at least it made my crawl finish! :-)

It is the right thing to do.

Cheers,

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang

Thanks to everybody for your suggestions.
But really, my problem is not technical, but political:
What should we do if we switch to the automaton regexp lib?
1. Keep the well-known Perl syntax for regexps (and then find a way to
simulate it with the automaton's limited syntax)?
2. Switch to the automaton's limited syntax (which must be well documented)?


My vote would be for option 1. It's less work for everyone
(except for the person incorporating the new library :)




[proposal] catching session-id urls

2006-03-13 Thread Matt Kangas

Hi nutch-dev,

I know that we have RegexUrlNormalizer already for removing session-ids
from URLs, but lately I've been wondering if there isn't a more
general way to solve this, without relying on pre-built patterns.


I think I have an answer that will work. I haven't seen this approach  
published anywhere, so any failings are entirely my fault. ;) What  
I'm wondering is:
- Does this seem like a good (effective, efficient) algorithm for  
catching session-id URLs?

- If so, where is the best place to implement it within Nutch?

Basic idea: session ids within URLs only cause problems for crawlers
when they change. This typically occurs when a server-side session
expires and a new id is issued. So, rather than looking for URL
argument patterns (as RegexUrlNormalizer does), look for a
value-transition pattern.


Algorithm:

1) Iterate over each page in a fetched segment

2) For each successful fetch, extract:
 - The fetched URL. Call this (u0)
 - All links on the page that refer to the same site/domain. Call  
this set (u1..N)


3) Parse u0 into parameters (p0) as follows:
 - named parameters: add (key,value) to Map
 - positional (path) params: add (position,value) to Map

So for the url "http://foo.bar/spam/eggs?x=true&y=2", pseudocode
would look like:

 Map p0 = new HashMap();
 p0.put(new Integer(1), "spam");
 p0.put(new Integer(2), "eggs");
 p0.put("x", "true");
 p0.put("y", "2");

4) Parse u1..N into (p1..N) using the same method

5) Compare p0 with p1..N. Look for the following pattern:
 - keys that are present for all p0..N, and
 - values that are identical for all p1..N, and
 - the value in p0 is _different_

If you see this condition, flag the page as "contains session id that
just changed" and deal with it accordingly. (Delete from crawldb, etc.)
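
Steps 3 to 5 above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, URLs are parsed with java.net.URI (which throws IllegalArgumentException on malformed input), and the same-site link extraction of step 2 is assumed to have happened already.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SessionIdDetector {
    /** Step 3: Integer keys hold path positions, String keys hold query names. */
    static Map<Object, String> parse(String url) {
        URI uri = URI.create(url);
        Map<Object, String> params = new LinkedHashMap<>();
        String path = uri.getRawPath();
        if (path != null) {
            int position = 0;
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) params.put(Integer.valueOf(++position), segment);
            }
        }
        String query = uri.getRawQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(key, value);
            }
        }
        return params;
    }

    /**
     * Steps 4 and 5: returns the keys that are present in the page URL and in
     * every link, carry one identical value across all links, and carry a
     * different value in the page URL.
     */
    static Set<Object> changedKeys(String pageUrl, List<String> sameSiteLinks) {
        Set<Object> suspects = new LinkedHashSet<>();
        if (sameSiteLinks.isEmpty()) return suspects;
        Map<Object, String> p0 = parse(pageUrl);
        List<Map<Object, String>> others = new ArrayList<>();
        for (String link : sameSiteLinks) others.add(parse(link));

        for (Map.Entry<Object, String> entry : p0.entrySet()) {
            Object key = entry.getKey();
            String first = others.get(0).get(key);
            // Key missing from a link, or value unchanged: not a candidate.
            if (first == null || first.equals(entry.getValue())) continue;
            boolean uniform = true;
            for (Map<Object, String> p : others) {
                if (!first.equals(p.get(key))) { uniform = false; break; }
            }
            if (uniform) suspects.add(key);
        }
        return suspects;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("http://foo.bar/spam/eggs?x=true&sid=NEW");
        links.add("http://foo.bar/spam/ham?x=false&sid=NEW");
        System.out.println(changedKeys("http://foo.bar/spam/eggs?x=true&sid=OLD", links));
    }
}
```

A non-empty result from changedKeys would correspond to the "session id just changed" flag described above.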


So... for anyone who's still reading ;), does this seem like it would  
work for catching session-ids? What corner-cases would trip it up?  
Can you think of cases when it would fall flat? And if it still seems  
worthwhile, where's the best place within Nutch to put it? (Perhaps a  
new ExtensionPoint that is used by nutch updatedb?)


--Matt

--
Matt Kangas / [EMAIL PROTECTED]




[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-13 Thread Ken Krugler (JIRA)
OPIC score for outlinks should be based on # of valid links, not total # of 
links.
--

 Key: NUTCH-230
 URL: http://issues.apache.org/jira/browse/NUTCH-230
 Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Ken Krugler
Priority: Minor


In ParseOutputFormat.java, the write() method currently divides the page score 
by the # of outlinks:

  score /= links.length;

It then loops over the links, and any that pass the normalize/filter gauntlet 
get added to the crawl output.

But this means that any filtered links result in some amount of the page's OPIC 
score being lost.

For Nutch 0.7, I built a list of valid (post-filter) links, and then used that 
to determine the per-link OPIC score, after which I iterated over the list, 
adding entries to the crawl output.
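
A small sketch of the two-pass approach described above: collect the links that survive filtering first, then divide the score by that count. The class name, the method signature, and the use of a function that returns null for rejected links are assumptions for illustration; this is not the ParseOutputFormat code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class OutlinkScoreSplitter {
    /**
     * Splits pageScore evenly over the outlinks that pass normalizeAndFilter.
     * normalizeAndFilter is a stand-in for the normalize/filter chain and
     * returns null for a link that should be dropped.
     */
    static Map<String, Float> split(float pageScore, List<String> outlinks,
                                    UnaryOperator<String> normalizeAndFilter) {
        // First pass: keep only the valid (post-filter) links.
        List<String> valid = new ArrayList<>();
        for (String link : outlinks) {
            String kept = normalizeAndFilter.apply(link);
            if (kept != null) valid.add(kept);
        }

        Map<String, Float> scores = new LinkedHashMap<>();
        if (valid.isEmpty()) return scores;

        // Second pass: divide by the number of valid links, not links.length.
        float perLink = pageScore / valid.size();
        for (String link : valid) {
            // Links that normalize to the same URL accumulate their shares.
            scores.merge(link, perLink, Float::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("http://a/", "mailto:x@y", "http://b/");
        System.out.println(split(1.0f, links, u -> u.startsWith("http") ? u : null));
    }
}
```

With this split, the shares handed to the surviving links add up to the full page score, so nothing is lost to filtered links.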




Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Andrzej Bialecki

Incze Lajos wrote:
* simulate ^ and $ operators by prepending and appending special start 
and end markers to the input string.


E.g.
   String START = "__START__";
   String END = "__END__";
   inputString = START + inputString + END;



What about

char START = '^';
char END = '$';
inputString = START + inputString + END;

?
  


The probability of encountering a $ sign somewhere inside URL is not 
insignificant... I agree that it's very unlikely (perhaps even illegal) 
to use ^ in URLs, but $ are sometimes used.
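
One way to sidestep the collision with $ could be to use marker characters that should never occur in a URL. The sketch below assumes control characters U+0001 and U+0002 as markers and uses java.util.regex purely as a stand-in for an engine without anchors; the class and method names are hypothetical, and only a leading ^ and a trailing unescaped $ are rewritten.

```java
import java.util.regex.Pattern;

class AnchorSimulation {
    // Assumed markers: control characters U+0001 and U+0002.
    static final char START = '\u0001';
    static final char END = '\u0002';

    /** Replaces a leading ^ and a trailing unescaped $ with the markers. */
    static String rewrite(String regex) {
        String r = regex;
        if (r.startsWith("^")) {
            r = START + r.substring(1);
        }
        if (r.endsWith("$") && !isEscaped(r, r.length() - 1)) {
            r = r.substring(0, r.length() - 1) + END;
        }
        return r;
    }

    /** True if the character at index is preceded by an odd number of backslashes. */
    static boolean isEscaped(String s, int index) {
        int backslashes = 0;
        for (int i = index - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    /** Wraps the input in the markers and runs an unanchored search. */
    static boolean matches(String regex, String input) {
        String wrapped = START + input + END;
        return Pattern.compile(rewrite(regex)).matcher(wrapped).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("^http", "http://lucene.apache.org/nutch"));
        System.out.println(matches("\\.gif$", "http://example.com/a.gif?x=1"));
        System.out.println(matches("price\\$", "http://example.com/price$list"));
    }
}
```

A literal $ inside a URL no longer interferes, because the end marker is a different character.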


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com