I thought I would follow up on this for anyone who has also had the problem.
I found the root of the problem to be that conf/prefix-url.txt is not
included in the nutch-0.8.1 download on the site. Therefore the file cannot
be loaded when running the inject/generate/etc. calls.

I'm not sure why the crawl command still worked properly, but adding the
file and filling it with 'http' solved my problem.
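
For anyone following along, here is a minimal sketch of why a missing or empty prefix file rejects every URL, and why a single 'http' line fixes it. This is not the Nutch code: SimplePrefixFilter is a hypothetical stand-in that uses a plain list and startsWith instead of Nutch's trie, and the one-prefix-per-line format with '#' comments is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a prefix-based URL filter; NOT the Nutch class.
class SimplePrefixFilter {
    private final List<String> prefixes = new ArrayList<>();

    // Assumption: each non-blank line that does not start with '#' is one allowed prefix.
    SimplePrefixFilter(String configText) {
        for (String line : configText.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                prefixes.add(trimmed);
            }
        }
    }

    // Returns the url if it starts with any allowed prefix, otherwise null.
    String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null; // no prefix matched, so the url is rejected
    }
}

class PrefixFilterDemo {
    public static void main(String[] args) {
        String url = "http://www.example.com/";

        // No prefixes loaded (file missing or empty): nothing can match.
        SimplePrefixFilter noPrefixes = new SimplePrefixFilter("");
        System.out.println(noPrefixes.filter(url));

        // One line containing 'http': every http and https url passes.
        SimplePrefixFilter withHttp = new SimplePrefixFilter("http\n");
        System.out.println(withHttp.filter(url));
    }
}
```

With no prefixes, the loop in filter never runs and the method falls through to null, which matches the behaviour described in this thread.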

-Charlie

On 2/12/07, Charlie Williams <[EMAIL PROTECTED]> wrote:

Yes, I have been debugging, and everything looks fine as it goes into the
mapper code.

From Injector.java, around line 69:

try {
  url = urlNormalizer.normalize(url);
  url = filters.filter(url);   // <- this is what returns null
} catch ( ... ) {
  ...
}

if (url != null) {   // <-- this check always fails because of that
  ...
}

I traced the call into PrefixURLFilter.filter(url) and always get null
returned from here:

if (trie.shortestMatch(url) == null)
  return null;
else
  return url;
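
To illustrate why one filter returning null is enough to drop every URL at the null check above, here is a small sketch of a filter chain. It assumes that filters.filter(url) applies the configured filters one after another and gives up at the first null; FilterChainSketch and filterAll are hypothetical names, not Nutch APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a chain of URL filters; not the Nutch implementation.
class FilterChainSketch {
    // Applies each filter in order and stops as soon as one has returned null.
    static String filterAll(List<UnaryOperator<String>> filters, String url) {
        for (UnaryOperator<String> filter : filters) {
            if (url == null) {
                return null; // an earlier filter already rejected the url
            }
            url = filter.apply(url);
        }
        return url;
    }

    public static void main(String[] args) {
        UnaryOperator<String> acceptAll = u -> u;
        // Stands in for a prefix filter whose prefix list is empty.
        UnaryOperator<String> rejectAll = u -> null;

        String url = "http://www.example.com/";
        System.out.println(filterAll(Arrays.asList(acceptAll, acceptAll), url));
        System.out.println(filterAll(Arrays.asList(acceptAll, rejectAll), url));
    }
}
```

Under that assumption, the other filters can be configured perfectly and the URL is still skipped, because the caller only sees the final null.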

Does this clarify the root of the problem?

-Charlie


On 2/12/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>
> Hey Charlie,
>
> What do the logs say in logs/hadoop.log?
>
> You can also try setting a breakpoint in Eclipse in the map method of
> InjectMapper and the reduce method of InjectReducer. When you get there in
> debug mode, inspect your variables and check that everything looks good.
> You can also check whether your URLs make it through
> url = filters.filter(url); in InjectMapper.
>
> HTH,
> Renaud
>
>
> Charlie Williams wrote:
> > I have been trying to learn the Nutch code base by stepping through the
> > code in Eclipse's debug mode. However, I am unable to understand a piece
> > of code in the Injector.
> >
> > When I run the crawl command used for intranet crawling, it successfully
> > injects URLs into the database. When I run the standalone Injector on the
> > same set of URLs, it injects nothing, returning null from each pass of
> > PrefixURLFilter.filter( url ).
> >
> > I saw in an archive that the crawl command uses crawl-tool.xml for its
> > config, where otherwise nutch-site.xml is used. So I made the
> > nutch-site.xml file exactly the same, but this seemed to have no effect.
> > Does anyone know why?
> >
> > I apologize for the newb question, but any help would be greatly
> > appreciated.
> >
> > -Charlie
> >
>
>
> --
> Renaud Richardet                                      +1 617 230 9112
> my email is my first name at apache.org      http://www.oslutions.com
>
>

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
