webgraph in limited domain

Martin Aesch Sat, 14 Dec 2013 09:51:18 -0800

Dear nutchers,

I have a large set of domains and many URLs that I want to process. I
only want to crawl pages from those domains, but I am interested in all
outlinks, regardless of whether they point inside those domains or not.


I am using the property db.ignore.external.links=true, and I want to
create a webgraphdb. Currently, I am getting an empty webgraphdb.

In org/apache/nutch/parse/ParseOutputFormat.java, anchors pointing
outside the domain are already filtered out in the parse phase and never
make their way into the parse data. I had hoped this would happen at a
later stage.
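
The distinction can be sketched in plain Java. This is not Nutch code:
the class OutlinkSketch and its methods are hypothetical names, and the
host comparison is only an assumption about how the external-link check
behaves. keepInternalOnly mimics dropping external links at parse time;
splitForGraph shows the behaviour asked for here, where every outlink is
kept for the graph and only same-host links are queued for fetching.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OutlinkSketch {

    /** Result of splitForGraph: all links for the graph, internal ones for fetching. */
    static class Split {
        final List<String> graphLinks = new ArrayList<>();
        final List<String> fetchLinks = new ArrayList<>();
    }

    /** Lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** True when the outlink's host differs from the source page's host (assumed check). */
    static boolean isExternal(String fromUrl, String toUrl) {
        String from = hostOf(fromUrl);
        String to = hostOf(toUrl);
        return from == null || to == null || !from.equals(to);
    }

    /** Parse-time filtering: external links are dropped and are lost to the graph. */
    static List<String> keepInternalOnly(String fromUrl, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String to : outlinks) {
            if (!isExternal(fromUrl, to)) {
                kept.add(to);
            }
        }
        return kept;
    }

    /** Desired behaviour: keep every outlink for the graph, fetch only internal ones. */
    static Split splitForGraph(String fromUrl, List<String> outlinks) {
        Split split = new Split();
        for (String to : outlinks) {
            split.graphLinks.add(to);
            if (!isExternal(fromUrl, to)) {
                split.fetchLinks.add(to);
            }
        }
        return split;
    }
}
```

With a page on a.example that links to one page on a.example and one on
b.example, keepInternalOnly returns only the first link, while
splitForGraph keeps both in graphLinks and only the first in fetchLinks.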

Is there any (hackish) way to keep the external outlinks in the parse
data while still restricting the crawl to my domains?

Any suggestions are very welcome.

Martin
