Mr. Vertical Search,
Are you suggesting changing the end user interface, the middle user
(crawl and content guy), or developer interface?
I am considering writing Ant Tasks for crawling. Do we expect that the
targets could remain consistent between releases (crawls crawl,
injects inject, whether
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ]
Andrzej Bialecki closed NUTCH-229:
---
Resolution: Fixed
Applied. Thanks!
improved handling of plugin folder configuration
[ http://issues.apache.org/jira/browse/NUTCH-206?page=all ]
Andrzej Bialecki closed NUTCH-206:
---
Fix Version: 0.8-dev
Resolution: Fixed
Fixed in r 384011.
search server throws InstantiationException
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ]
Andrzej Bialecki closed NUTCH-3:
-
Resolution: Fixed
Fixed in r 376089.
multi values of header discarded
Key: NUTCH-3
URL:
* Change the syntax used in Nutch?
+1, my point of view is that we can do that for Nutch 0.8 as long as we
document it (see nutch-user). :-)
Stefan
I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in Nutch)
Are there other ways to match start/end of string in the other
regex library? I use ^http a
I've been watching discussion of faster regex libs with much
interest. But if regex speed seems to be a problem, would using fewer
regexes be a good answer?
Protocol and extension filtering could be done by another URLFilter
plugin that is dedicated to this task, and uses more lightweight
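
A minimal sketch of what such a dedicated, regex-free filter might look like. It is not code from this thread or from Nutch: the class name PrefixSuffixFilter and its constructor are hypothetical, and the convention of returning the URL to accept and null to reject is an assumption made for illustration. It only uses plain string operations and set lookups:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter: accepts a URL only if its protocol is allowed
// and its file extension is not on the blocked list. No regexes involved.
class PrefixSuffixFilter {
    private final Set<String> allowedProtocols;
    private final Set<String> blockedExtensions;

    PrefixSuffixFilter(String[] protocols, String[] extensions) {
        allowedProtocols = new HashSet<String>(Arrays.asList(protocols));
        blockedExtensions = new HashSet<String>(Arrays.asList(extensions));
    }

    // Returns the URL unchanged if accepted, or null if rejected
    // (assumed convention, chosen for this sketch).
    String filter(String url) {
        int colon = url.indexOf(':');
        if (colon <= 0) {
            return null; // no protocol part at all
        }
        String protocol = url.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!allowedProtocols.contains(protocol)) {
            return null;
        }

        // Ignore the query string and fragment when looking for an extension.
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            end = Math.min(end, query);
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0) {
            end = Math.min(end, fragment);
        }
        String path = url.substring(0, end);

        // An extension is whatever follows the last dot after the last slash.
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot > slash) {
            String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (blockedExtensions.contains(ext)) {
                return null;
            }
        }
        return url;
    }
}
```

With such a filter in front, the regex-based filter would only need to handle the rules that genuinely require pattern matching.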
Jérôme Charron wrote:
It seems that the usage of AnalyzerFactory was removed while porting Indexer
to map/reduce.
(AnalyzerFactory is no longer called in trunk code)
Is it intentional?
(If not, I have a patch that I can commit, so please confirm.)
It was not intentional. Thanks for fixing
Hi Folks,
I updated to the latest SVN revision (385691) today, and I am now seeing a
Null Pointer exception in the AnalyzerFactory.java class. It seems that in
some cases, the method:
private Extension getExtension(String lang) { Extension extension =
(Extension)
I updated to the latest SVN revision (385691) today, and I am now seeing a
Null Pointer exception in the AnalyzerFactory.java class.
Fixed (r385702). Thanks Chris.
NOTE: not sure if returning null is the right thing to do here, but hey, at
least it made my crawl finish! :-)
It is the
Thanks to everybody for your suggestions.
But really, my problem is not technical, but political:
What should we do if we switch to the automaton regexp lib?
1. Keep the well-known Perl syntax for regexps (and then find a way to
simulate it with automaton's limited syntax)?
2. Switch to the
Hi nutch-dev,
I know that we have RegexUrlNormalizer already for removing session-
ids from URLs, but lately I've been wondering if there isn't a more
general way to solve this, without relying on pre-built patterns.
I think I have an answer that will work. I haven't seen this approach
OPIC score for outlinks should be based on # of valid links, not total # of
links.
--
Key: NUTCH-230
URL: http://issues.apache.org/jira/browse/NUTCH-230
Project: Nutch
Type: Improvement
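
To illustrate the difference the issue title describes, here is a small sketch. It is not Nutch's scoring code: the class OutlinkShare, the method name, and the validity predicate are all hypothetical. It divides a page's score among only those outlinks that pass a validity check, instead of among all outlinks:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: splits a page score evenly across valid outlinks only.
class OutlinkShare {
    static double sharePerValidLink(double pageScore,
                                    List<String> outlinks,
                                    Predicate<String> isValid) {
        int valid = 0;
        for (String link : outlinks) {
            if (isValid.test(link)) {
                valid++;
            }
        }
        // With no valid outlinks there is nothing to distribute.
        return valid == 0 ? 0.0 : pageScore / valid;
    }
}
```

For a page with score 1.0 and four outlinks of which two are valid, each valid link receives 0.5, whereas dividing by the total number of links would give each only 0.25.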
Incze Lajos wrote:
* simulate ^ and $ operators by prepending and appending special start
and end markers to the input string.
E.g.
String START = "__START__";
String END = "__END__";
inputString = START + inputString + END;
What about
char START = '^';
char END = '$';
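
The marker idea discussed above can be sketched as follows. This is only an illustration under stated assumptions: java.util.regex's whole-string Pattern.matches stands in for a matcher that has no ^ and $ operators (the automaton library itself is not used here), the class AnchorMarkers and its methods are hypothetical, and only a leading ^ and a trailing unescaped $ are handled:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: emulate ^ and $ with start/end markers on top of a
// matcher that can only test whether the WHOLE input matches a pattern.
class AnchorMarkers {
    // Marker strings from the example above; they must not occur in real input.
    static final String START = "__START__";
    static final String END = "__END__";

    // Rewrites a Perl-style pattern into a whole-string pattern over the
    // marker-wrapped input. A leading ^ becomes the start marker and a
    // trailing $ becomes the end marker; otherwise ".*" is added on that
    // side to get "find anywhere" semantics.
    static String rewrite(String perlPattern) {
        String body = perlPattern;
        String prefix = ".*";
        String suffix = ".*";
        if (body.startsWith("^")) {
            prefix = START;
            body = body.substring(1);
        }
        // Simplification: a pattern ending in "\$" is treated as a literal dollar.
        if (body.endsWith("$") && !body.endsWith("\\$")) {
            suffix = END;
            body = body.substring(0, body.length() - 1);
        }
        return prefix + body + suffix;
    }

    // Wraps the input in markers and runs a whole-string match.
    static boolean find(String perlPattern, String input) {
        return Pattern.matches(rewrite(perlPattern), START + input + END);
    }
}
```

For example, find("^http", "http://example.com/") is true while find("^http", "ftp://http.example/") is false. Single-character markers such as '^' and '$' would work the same way in a library where those characters have no special meaning; with java.util.regex they would need escaping, which is why the string markers are used in this sketch.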