Re: Contributing

2006-03-13 Thread Alexander E Genaud
Mr. Vertical Search, Are you suggesting changing the end user interface, the middle user (crawl and content guy), or developer interface? I am considering writing Ant Tasks for crawling. Do we expect that the targets could remain consistent between releases (crawls crawl, injects inject, whether

[jira] Closed: (NUTCH-229) improved handling of plugin folder configuration

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ] Andrzej Bialecki closed NUTCH-229: --- Resolution: Fixed Applied. Thanks! improved handling of plugin folder configuration

[jira] Closed: (NUTCH-206) search server throws InstantiationException

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-206?page=all ] Andrzej Bialecki closed NUTCH-206: --- Fix Version: 0.8-dev Resolution: Fixed Fixed in r 384011. search server throws InstantiationException

[jira] Closed: (NUTCH-3) multi values of header discarded

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Andrzej Bialecki closed NUTCH-3: - Resolution: Fixed Fixed in r 376089. multi values of header discarded Key: NUTCH-3 URL:

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Stefan Groschupf
* Change the syntax used in Nutch? +1, my point of view is that we can do that for nutch 0.8 as long as we document it (see nutch-user). :-) Stefan

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang
I have made some quick tests with regex-urlfilter... The major problem is that it doesn't use the Perl syntax... For instance, it doesn't support the boundary matchers ^ and $ (which are used in nutch) Are there other ways to match start/end of string in the other regex library? I use ^http a

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Matt Kangas
I've been watching discussion of faster regex libs with much interest. But if regex speed seems to be a problem, would using fewer regexes be a good answer? Protocol and extension filtering could be done by another URLFilter plugin that is dedicated to this task, and uses more lightweight
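The idea in this message, checking protocol and file extension with plain string operations instead of regular expressions, could look roughly like the sketch below. It is only an illustration: the class name PrefixSuffixFilter and its constructor arguments are hypothetical, it does not implement the actual Nutch URLFilter plugin interface, and it only borrows the convention of returning the URL when accepted and null when rejected.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical lightweight filter: no regexes, only index and set lookups. */
public class PrefixSuffixFilter {
    private final Set<String> allowedSchemes;    // e.g. "http", "https" (lowercase)
    private final Set<String> blockedExtensions; // e.g. "gif", "jpg" (lowercase, no dot)

    public PrefixSuffixFilter(Set<String> allowedSchemes, Set<String> blockedExtensions) {
        this.allowedSchemes = allowedSchemes;
        this.blockedExtensions = blockedExtensions;
    }

    /** Returns the URL unchanged if accepted, or null if rejected. */
    public String filter(String url) {
        // Protocol check: everything before the first ':' is the scheme.
        int colon = url.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        String scheme = url.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!allowedSchemes.contains(scheme)) {
            return null;
        }

        // Ignore the query string and fragment when looking for an extension.
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) end = Math.min(end, query);
        int fragment = url.indexOf('#');
        if (fragment >= 0) end = Math.min(end, fragment);

        // Skip the authority ("//host") so "example.com" is not read as an extension.
        int authority = url.startsWith("//", colon + 1) ? colon + 3 : colon + 1;
        int pathStart = url.indexOf('/', authority);
        if (pathStart < 0 || pathStart >= end) {
            return url; // no path component, so there is no extension to check
        }

        // Extension check: text after the last '.' in the last path segment.
        String path = url.substring(pathStart, end);
        int dot = path.lastIndexOf('.');
        if (dot > path.lastIndexOf('/')) {
            String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (blockedExtensions.contains(ext)) {
                return null;
            }
        }
        return url;
    }
}
```

Both checks run in time proportional to the URL length and need no pattern compilation, so in this sketch regex filters would only be needed for rules that cannot be expressed as a scheme or extension lookup.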

Re: AnalyzerFactory

2006-03-13 Thread Doug Cutting
Jérôme Charron wrote: It seems that the usage of AnalyzerFactory was removed while porting Indexer to map/reduce. (AnalyzerFactory is no longer called in trunk code) Is it intentional? (if not, I have a patch that I can commit, so please confirm) It was not intentional. Thanks for fixing

Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Chris Mattmann
Hi Folks, I updated to the latest SVN revision (385691) today, and I am now seeing a Null Pointer exception in the AnalyzerFactory.java class. It seems that in some cases, the method: private Extension getExtension(String lang) { Extension extension = (Extension)

Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Jérôme Charron
I updated to the latest SVN revision (385691) today, and I am now seeing a Null Pointer exception in the AnalyzerFactory.java class. Fixed (r385702). Thanks Chris. NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-) It is the

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang
Thanks to everybody for your suggestions. But really, my problem is not technical, but political: What should we do if we switch to automaton regexp lib? 1. Keep the well-known Perl syntax for regexp (and then find a way to simulate them with automaton limited syntax)? 2. Switch to the

[proposal] catching session-id urls

2006-03-13 Thread Matt Kangas
Hi nutch-dev, I know that we have RegexUrlNormalizer already for removing session-ids from URLs, but lately I've been wondering if there isn't a more general way to solve this, without relying on pre-built patterns. I think I have an answer that will work. I haven't seen this approach

[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-13 Thread Ken Krugler (JIRA)
OPIC score for outlinks should be based on # of valid links, not total # of links. -- Key: NUTCH-230 URL: http://issues.apache.org/jira/browse/NUTCH-230 Project: Nutch Type: Improvement

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Andrzej Bialecki
Incze Lajos wrote: * simulate ^ and $ operators by prepending and appending special start and end markers to the input string. E.g. String START = "__START__"; String END = "__END__"; inputString = START + inputString + END; What about char START = '^'; char END = '$';
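The marker trick quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the thread or from Nutch: the class AnchorMarkers and its methods are hypothetical, java.util.regex with matches() stands in for a library that only supports whole-string matching, and only a ^ at the very start or a $ at the very end of the pattern is handled.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that emulates ^ and $ with start and end markers. */
public class AnchorMarkers {
    // Marker strings as suggested in the thread; they contain no regex metacharacters.
    static final String START = "__START__";
    static final String END = "__END__";

    /** Wraps the input so its beginning and end become ordinary matchable text. */
    static String wrap(String input) {
        return START + input + END;
    }

    /** Rewrites a pattern using ^ and $ into one meant for whole-string matching. */
    static String rewrite(String regex) {
        boolean anchoredStart = regex.startsWith("^");
        // Simplification: a trailing "\$" is treated as a literal dollar sign.
        boolean anchoredEnd = regex.endsWith("$") && !regex.endsWith("\\$");
        String body = regex.substring(anchoredStart ? 1 : 0,
                                      regex.length() - (anchoredEnd ? 1 : 0));
        // An unanchored side may be preceded or followed by arbitrary text.
        String prefix = anchoredStart ? START : START + ".*";
        String suffix = anchoredEnd ? END : ".*" + END;
        return prefix + "(" + body + ")" + suffix;
    }

    /** Emulates a find() with anchors using only whole-string matching. */
    static boolean find(String regex, String input) {
        return Pattern.compile(rewrite(regex)).matcher(wrap(input)).matches();
    }

    public static void main(String[] args) {
        System.out.println(find("^http", "http://example.com/"));
        System.out.println(find("\\.gif$", "http://example.com/a.gif?x=1"));
    }
}
```

Multi-character markers keep the rewritten pattern readable, while single-character markers such as the '^' and '$' proposed in the reply would add less to every input string. Either way the markers must not be able to occur inside a real URL, otherwise an unanchored pattern could match across them.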