Yes, I have been debugging, and everything looks fine as it goes into the
mapper code.
From Injector.java, at line 69:
try {
  url = urlNormalizer.normalize(url);
  url = filters.filter(url);   <- this is what returns null
} catch ( ... ) {
  ...
}
if (url != null) {   <-- this check always fails because of that
  ...
}
I traced the call into PrefixURLFilter.filter(url) and always get null
returned from here:
if (trie.shortestMatch(url) == null)
  return null;
else
  return url;
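To illustrate what I think is going on, here is a minimal standalone sketch of a prefix filter. It is not the Nutch code: the class name, the constructor, and the example URLs are made up for illustration, and a plain list stands in for the trie. The point is only that any URL which does not start with one of the configured prefixes comes back as null, and that in this sketch an empty prefix list rejects every URL.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a prefix-based URL filter; not the Nutch class.
public class PrefixFilterSketch {
    private final List<String> prefixes;

    public PrefixFilterSketch(List<String> prefixes) {
        this.prefixes = prefixes;
    }

    // Returns the URL unchanged if it starts with a configured prefix,
    // otherwise null (meaning "filtered out").
    public String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed example prefix; a real setup would load these from a config file.
        PrefixFilterSketch configured =
            new PrefixFilterSketch(Arrays.asList("http://intranet.example.com/"));
        // A filter that ended up with no prefixes at all.
        PrefixFilterSketch empty =
            new PrefixFilterSketch(Collections.<String>emptyList());

        String url = "http://intranet.example.com/index.html";
        System.out.println(configured.filter(url));                         // matching prefix: URL is kept
        System.out.println(configured.filter("http://other.example.org/")); // no matching prefix: null
        System.out.println(empty.filter(url));                              // no prefixes at all: null
    }
}
```

If the standalone Injector run ends up with a different (or empty) prefix list than the crawl command, every URL would be dropped in exactly this way.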
Does this clarify the root of the problem?
-Charlie
On 2/12/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
Hey Charlie,
What do the logs say in logs/hadoop.log?
You can also try to set a breakpoint in Eclipse in the map method of
InjectMapper and reduce method of InjectReducer. When you get there in
debug mode, inspect your variables and check if everything looks good.
You can also check if your urls make it through: url =
filters.filter(url); in InjectMapper
HTH,
Renaud
Charlie Williams wrote:
> I have been trying to learn the Nutch code base by stepping through
> the code in debug mode of Eclipse. However I am unable to understand
> a piece of code in the Injector.
>
> When I run the crawl command used for intranet crawling, it successfully
> injects urls into the database. When I run standalone Injector, on the
> same set of urls it injects nothing, returning null from each pass of
> PrefixURLFilter.filter( url )
>
> I saw in an archive that the crawl command uses crawl-tool.xml for its
> config, where otherwise nutch-site.xml is used. So I made the
> nutch-site.xml file exactly the same, but this seemed to have no
> effect. Does anyone know why?
>
> I apologize for the newb question, but any help would be greatly
> appreciated.
>
> -Charlie
>
--
Renaud Richardet +1 617 230 9112
my email is my first name at apache.org http://www.oslutions.com