Re: Contributing
Mr. Vertical Search,

Are you suggesting changing the end-user interface, the middle-user (crawl and content) interface, or the developer interface? I am considering writing Ant tasks for crawling. Do we expect that the targets could remain consistent between releases (crawls crawl, injects inject, whether Nutch 0.7, 0.8, or 0.9)?

Cheers,
Alex
-- CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1 --

From: Vertical Search [EMAIL PROTECTED]
To: nutch-dev nutch-dev@lucene.apache.org
Date: Thu, 9 Mar 2006 12:10:42 -0600
Subject: Contributing

Hello,

I was wondering if anyone is willing to consider some changes to make Nutch more user friendly: getting a general feeling of the code base, reviewing code, cleaning up shadowed variables, etc. Is someone doing this already? I am willing to take some time to contribute. Are there any specific qualifications to contribute? Please let me know.

Thanks
[jira] Closed: (NUTCH-229) improved handling of plugin folder configuration
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ]

Andrzej Bialecki closed NUTCH-229:
----------------------------------
    Resolution: Fixed

Applied. Thanks!

> improved handling of plugin folder configuration
>          Key: NUTCH-229
>          URL: http://issues.apache.org/jira/browse/NUTCH-229
>      Project: Nutch
>         Type: Improvement
>     Reporter: Stefan Groschupf
>     Priority: Critical
>      Fix For: 0.8-dev
>  Attachments: pluginFolder.patch
>
> Currently Nutch only supports absolute paths, or relative paths that are part of the classpath. There are cases where it would be useful to be able to use relative paths that are not in the classpath, for example to have a centralized plugin repository on a shared HDD in a cluster, or to run Nutch inside an IDE.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
  http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
  http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-206) search server throws InstantiationException
[ http://issues.apache.org/jira/browse/NUTCH-206?page=all ]

Andrzej Bialecki closed NUTCH-206:
----------------------------------
    Fix Version: 0.8-dev
     Resolution: Fixed

Fixed in r384011.

> search server throws InstantiationException
>             Key: NUTCH-206
>             URL: http://issues.apache.org/jira/browse/NUTCH-206
>         Project: Nutch
>            Type: Bug
>      Components: searcher
>        Versions: 0.8-dev
>     Environment: Windows 2003, Cygwin
>        Reporter: jimmy
>         Fix For: 0.8-dev
>
> 060207 230215 23 Server connection on port from 127.0.0.1 caught: java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.searcher.Query
> java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.searcher.Query
>     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:238)
>     at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:88)
>     at org.apache.hadoop.ipc.Server$Connection.run(Server.java:138)
> Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query
>     at java.lang.Class.newInstance0(Class.java:335)
>     at java.lang.Class.newInstance(Class.java:303)
>     at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:231)
>     ... 2 more
> 060207 230215 23 Server connection on port from 127.0.0.1: exiting
> 060207 230225 24 Server connection on port from 127.0.0.1: starting
[jira] Closed: (NUTCH-3) multi values of header discarded
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ]

Andrzej Bialecki closed NUTCH-3:
--------------------------------
    Resolution: Fixed

Fixed in r376089.

> multi values of header discarded
>         Key: NUTCH-3
>         URL: http://issues.apache.org/jira/browse/NUTCH-3
>     Project: Nutch
>        Type: Bug
>    Reporter: Stefan Groschupf
>    Assignee: Stefan Groschupf
>     Fix For: 0.8-dev
> Attachments: contentPropertiesAddpatch.txt, multiValuesPropertyPatch.txt
>
> Original by: phoebe http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356
>
> multi values of header discarded
> Each successive setting of a header value deletes the previous one. This patch allows multiple values to be retained, such as cookies, using CR LF as a delimiter between values.
>
> --- /tmp/HttpResponse.java  2005-01-27 19:57:55.0 -0500
> +++ HttpResponse.java       2005-01-27 20:45:01.0 -0500
> @@ -324,7 +324,19 @@
>      }
>      String value = line.substring(valueStart);
> -    headers.put(key, value);
> +    // Spec allows multiple values, such as Set-Cookie - using CR LF as delimiter
> +    if (headers.containsKey(key)) {
> +      try {
> +        Object obj = headers.get(key);
> +        if (obj != null) {
> +          String oldvalue = headers.get(key).toString();
> +          value = oldvalue + "\r\n" + value;
> +        }
> +      } catch (Exception e) {
> +        e.printStackTrace();
> +      }
> +    }
> +    headers.put(key, value);
>    }
>
>    private Map parseHeaders(PushbackInputStream in, StringBuffer line)
> @@ -399,5 +411,3 @@
>    }
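The core of the patch above is easy to demonstrate in isolation. The following is a minimal, standalone sketch of the same append-instead-of-overwrite behavior; the class name `MultiValueHeaders` is hypothetical and not part of Nutch's `HttpResponse`:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates the NUTCH-3 approach: keep multiple values for one header
// (e.g. Set-Cookie) in a single map entry, joined with "\r\n".
public class MultiValueHeaders {
  private final Map<String, String> headers = new HashMap<String, String>();

  public void put(String key, String value) {
    String old = headers.get(key);
    if (old != null) {
      // Append to the existing value instead of overwriting it.
      value = old + "\r\n" + value;
    }
    headers.put(key, value);
  }

  public String get(String key) {
    return headers.get(key);
  }
}
```

Callers that expect a single value still get one string back; consumers that understand multi-valued headers can split on the CR LF delimiter.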
Re: Much faster RegExp lib needed in nutch?
> * Change the syntax used in Nutch?

+1. My point of view is that we can do that for Nutch 0.8 as long as we document it (see nutch-user). :-)

Stefan
Re: Much faster RegExp lib needed in nutch?
I have made some quick tests with regex-urlfilter... The major problem is that it doesn't use the Perl syntax... For instance, it doesn't support the boundary matchers ^ and $ (which are used in Nutch).

Are there other ways to match start/end of string in the other regex library? I use ^http a lot because a lot of sites pass around URLs in the query string, and I don't want them (e.g. http://del.icio.us/howie?url=http://lucene.apache.org/nutch).

Howie
Re: Much faster RegExp lib needed in nutch?
I've been watching the discussion of faster regex libs with much interest. But if regex speed is a problem, would using fewer regexes be a good answer?

Protocol and extension filtering could be done by another URLFilter plugin that is dedicated to this task and uses more lightweight string-chopping techniques. That way, full regex support could be retained for the tasks where it's really needed.

On Mar 13, 2006, at 12:31 PM, Howie Wang wrote:
> I have made some quick tests with regex-urlfilter... The major problem is that it doesn't use the Perl syntax... For instance, it doesn't support the boundary matchers ^ and $ (which are used in Nutch).
> Are there other ways to match start/end of string in the other regex library? I use ^http a lot because a lot of sites pass around URLs in the query string, and I don't want them (e.g. http://del.icio.us/howie?url=http://lucene.apache.org/nutch).
> Howie

--
Matt Kangas / [EMAIL PROTECTED]
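The "string-chopping" filter suggested above could look something like this. This is only a sketch: the class name, the protocol/suffix lists, and the null-means-rejected convention are assumptions for illustration, not the actual Nutch URLFilter plugin interface:

```java
import java.util.Arrays;
import java.util.List;

// Lightweight, regex-free URL filtering: protocol checks via startsWith,
// extension checks via endsWith. Regexes remain available for the cases
// that genuinely need them.
public class LightweightUrlFilter {
  // Accept only these protocols (hypothetical whitelist).
  private static final List<String> PROTOCOLS =
      Arrays.asList("http://", "https://");
  // Reject these file extensions (hypothetical blacklist).
  private static final List<String> SKIP_SUFFIXES =
      Arrays.asList(".gif", ".jpg", ".png", ".css", ".zip", ".exe");

  /** Returns the url if accepted, or null to filter it out. */
  public static String filter(String url) {
    if (url == null) return null;
    String lower = url.toLowerCase();
    boolean okProtocol = false;
    for (String p : PROTOCOLS) {
      if (lower.startsWith(p)) { okProtocol = true; break; }
    }
    if (!okProtocol) return null;
    for (String suffix : SKIP_SUFFIXES) {
      if (lower.endsWith(suffix)) return null;
    }
    return url;
  }
}
```

Every check is a constant-time-ish string comparison, so a long filter list stays cheap compared with running a stack of regexes per URL.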
Re: AnalyzerFactory
Jérôme Charron wrote:
> It seems that the usage of AnalyzerFactory was removed while porting Indexer to map/reduce. (AnalyzerFactory is no longer called in trunk code.) Is it intentional? (If not, I have a patch that I can commit, so thanks to confirm.)

It was not intentional. Thanks for fixing this!

Doug
Null Pointer exception in AnalyzerFactory?
Hi Folks,

I updated to the latest SVN revision (385691) today, and I am now seeing a NullPointerException in the AnalyzerFactory.java class. It seems that in some cases, the method:

  private Extension getExtension(String lang) {
    Extension extension = (Extension) this.conf.getObject(lang);
    if (extension == null) {
      extension = findExtension(lang);
      if (extension != null) {
        this.conf.setObject(lang, extension);
      }
    }
    return extension;
  }

has a null lang parameter passed to it, which causes a NullPointerException at line 81 in src/java/org/apache/nutch/analysis/AnalyzerFactory.java. I found that if I checked for null in the lang variable, and returned null if lang == null, my crawl finished. Here is a small patch that will fix the crawl:

Index: /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
===================================================================
--- /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (revision 385691)
+++ /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (working copy)
@@ -78,14 +78,19 @@
   private Extension getExtension(String lang) {
-    Extension extension = (Extension) this.conf.getObject(lang);
-    if (extension == null) {
-      extension = findExtension(lang);
-      if (extension != null) {
-        this.conf.setObject(lang, extension);
-      }
-    }
-    return extension;
+    if (lang == null) {
+      return null;
+    } else {
+      Extension extension = (Extension) this.conf.getObject(lang);
+      if (extension == null) {
+        extension = findExtension(lang);
+        if (extension != null) {
+          this.conf.setObject(lang, extension);
+        }
+      }
+      return extension;
+    }
   }

   private Extension findExtension(String lang) {

NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-)

Cheers,
Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B  Mailstop: 171-246
___

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Null Pointer exception in AnalyzerFactory?
> I updated to the latest SVN revision (385691) today, and I am now seeing a NullPointerException in the AnalyzerFactory.java class.

Fixed (r385702). Thanks Chris.

> NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-)

It is the right thing to do.

Cheers,
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Much faster RegExp lib needed in nutch?
Thanks to everybody for your suggestions. But really, my problem is not technical but political: what should we do if we switch to the automaton regexp lib?

1. Keep the well-known Perl syntax for regexps (and then find a way to simulate it with automaton's limited syntax)?
2. Switch to the automaton limited syntax (which must then be well documented)?

My vote would be for option 1. It's less work for everyone (except for the person incorporating the new library :)
[proposal] catching session-id urls
Hi nutch-dev,

I know that we have RegexUrlNormalizer already for removing session-ids from URLs, but lately I've been wondering if there isn't a more general way to solve this, without relying on pre-built patterns. I think I have an answer that will work. I haven't seen this approach published anywhere, so any failings are entirely my fault. ;)

What I'm wondering is:
- Does this seem like a good (effective, efficient) algorithm for catching session-id URLs?
- If so, where is the best place to implement it within Nutch?

Basic idea: session ids within URLs only cause problems for crawlers when they change. This typically occurs when a server-side session expires and a new id is issued. So, rather than looking for URL argument patterns (as RegexUrlNormalizer does), look for a value-transition pattern.

Algorithm:
1) Iterate over each page in a fetched segment.
2) For each successful fetch, extract:
   - The fetched URL. Call this (u0).
   - All links on the page that refer to the same site/domain. Call this set (u1..N).
3) Parse u0 into parameters (p0) as follows:
   - named parameters: add (key, value) to Map
   - positional (path) params: add (position, value) to Map
   So for the URL "http://foo.bar/spam/eggs?x=true&y=2", pseudocode would look like:
     p0 = new HashMap();
     p0.put(new Integer(1), "spam");
     p0.put(new Integer(2), "eggs");
     p0.put("x", "true");
     p0.put("y", "2");
4) Parse u1..N into (p1..N) using the same method.
5) Compare p0 with p1..N. Look for the following pattern:
   - keys that are present for all p0..N, and
   - values that are identical for all p1..N, and
   - the value in p0 is _different_.
   If you see this condition, flag the page as "contains session id that just changed" and deal with it accordingly. (Delete from crawldb, etc.)

So... for anyone who's still reading ;), does this seem like it would work for catching session-ids? What corner-cases would trip it up? Can you think of cases when it would fall flat? And if it still seems worthwhile, where's the best place within Nutch to put it? (Perhaps a new ExtensionPoint that is used by nutch updatedb?)

--Matt

--
Matt Kangas / [EMAIL PROTECTED]
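Steps 3-5 of the proposal can be sketched concretely. The class and method names below are hypothetical (this is not Nutch code); it only shows the parameter-map parsing and the value-transition check, leaving segment iteration and crawldb handling aside:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SessionIdDetector {

  // Steps 3/4: path segments keyed by position ("1", "2", ...),
  // query parameters keyed by name.
  public static Map<String, String> parseParams(String url) {
    Map<String, String> params = new HashMap<String, String>();
    URI uri = URI.create(url);
    String path = uri.getPath();
    if (path != null) {
      int pos = 1;
      for (String seg : path.split("/")) {
        if (seg.length() > 0) {
          params.put(Integer.toString(pos++), seg);
        }
      }
    }
    String query = uri.getQuery();
    if (query != null) {
      for (String pair : query.split("&")) {
        int eq = pair.indexOf('=');
        if (eq > 0) {
          params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
      }
    }
    return params;
  }

  // Step 5: true if some key is present in p0 and every outlink,
  // has an identical value across all outlinks, but a *different*
  // value in p0 -- i.e. the session id just changed.
  public static boolean sessionIdChanged(Map<String, String> p0,
                                         List<Map<String, String>> outlinks) {
    if (outlinks.isEmpty()) return false;
    for (Map.Entry<String, String> e : p0.entrySet()) {
      String key = e.getKey();
      String v0 = e.getValue();
      String common = null;
      boolean presentEverywhere = true;
      boolean identical = true;
      for (Map<String, String> p : outlinks) {
        String v = p.get(key);
        if (v == null) { presentEverywhere = false; break; }
        if (common == null) common = v;
        else if (!common.equals(v)) { identical = false; break; }
      }
      if (presentEverywhere && identical && common != null
          && !common.equals(v0)) {
        return true;
      }
    }
    return false;
  }
}
```

One corner case is visible even in this sketch: a page whose outlinks all share some constant parameter that legitimately differs from the fetched URL (say, a version or locale value) would be flagged too, so a real implementation would probably want a whitelist or a frequency threshold.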
[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.
OPIC score for outlinks should be based on # of valid links, not total # of links.
----------------------------------------------------------------------------------

         Key: NUTCH-230
         URL: http://issues.apache.org/jira/browse/NUTCH-230
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev
    Reporter: Ken Krugler
    Priority: Minor

In ParseOutputFormat.java, the write() method currently divides the page score by the # of outlinks:

    score /= links.length;

It then loops over the links, and any that pass the normalize/filter gauntlet get added to the crawl output. But this means that any filtered links result in some amount of the page's OPIC score being lost. For Nutch 0.7, I built a list of valid (post-filter) links, and then used that to determine the per-link OPIC score, after which I iterated over the list, adding entries to the crawl output.
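The fix described for Nutch 0.7 can be sketched as follows. This is not the actual ParseOutputFormat code; the class name and the `filterFn` stand-in for the normalizer/filter chain are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// NUTCH-230 idea: filter the outlinks first, then divide the page's OPIC
// score by the number of surviving links, so filtered links don't leak score.
public class OpicOutlinkScorer {

  /** Returns the links that pass the filter (filterFn returns null to drop). */
  public static List<String> validLinks(String[] links,
                                        UnaryOperator<String> filterFn) {
    List<String> valid = new ArrayList<String>();
    for (String link : links) {
      String kept = filterFn.apply(link);
      if (kept != null) valid.add(kept);
    }
    return valid;
  }

  /** Per-link score over the post-filter count, not links.length. */
  public static float scorePerValidLink(float pageScore, String[] links,
                                        UnaryOperator<String> filterFn) {
    List<String> valid = validLinks(links, filterFn);
    if (valid.isEmpty()) return 0.0f;
    // Dividing by valid.size() distributes the full page score among the
    // links that are actually written to the crawl output.
    return pageScore / valid.size();
  }
}
```

With `score /= links.length` (the current behavior), a page whose outlinks are half filtered away would pass on only half its score; dividing by the valid-link count conserves it.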
Re: Much faster RegExp lib needed in nutch?
Incze Lajos wrote:
> * simulate ^ and $ operators by prepending and appending special start and end markers to the input string. E.g.
>
>     String START = "__START__";
>     String END = "__END__";
>     inputString = START + inputString + END;
>
> What about
>
>     char START = '^';
>     char END = '$';
>     inputString = START + inputString + END;
>
> ?

The probability of encountering a $ sign somewhere inside a URL is not insignificant... I agree that it's very unlikely (perhaps even illegal) to use ^ in URLs, but $ signs are sometimes used.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
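The marker idea can be shown end to end with a small sketch. It uses control characters as markers (an assumption: unlike '^' and '$', the characters \u0001/\u0002 cannot appear in a legal URL) and uses java.util.regex's full-string matches() to stand in for an automaton-style engine that lacks boundary matchers:

```java
// Simulating ^ and $ for a full-match-only regex engine by wrapping the
// input in marker characters and referring to them in the pattern.
public class AnchorSimulation {
  // Hypothetical marker choice: control chars that never occur in URLs.
  static final char START = '\u0001';
  static final char END = '\u0002';

  /** Wrap the input once before matching. */
  public static String mark(String input) {
    return START + input + END;
  }

  /** "^http" expressed as a full-string match against the marked input. */
  public static boolean startsWithHttp(String url) {
    return mark(url).matches(START + "http.*");
  }
}
```

This handles Howie's case: a URL that merely *contains* "http" in its query string no longer matches, because the pattern requires "http" immediately after the start marker.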