Did you check crawl-urlfilter.txt?
All the domain names that you'd like to crawl have to be mentioned there.
e.g.
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/
Also check the property db.ignore.external.links in nutch-default.xml. It should be
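For reference, a minimal sketch of how that property could be overridden in conf/nutch-site.xml (the value true here is an assumption; use whatever fits your crawl):

```xml
<!-- Hypothetical override in conf/nutch-site.xml: when set to true,
     Nutch drops outlinks that point to external hosts. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to external hosts are ignored.</description>
</property>
```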
Hi,
I am running the latest version of Nutch. While crawling one particular
site I get an AbstractMethodError in the cyberneko plugin for all of its pages
when doing a fetch.
As I understand it, this happens because of a difference between the runtime
and compile-time versions. However, I am running it
Replacing the current xercesimpl.jar with the one from nutch 1.0 seems to
fix the problem.
On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch harrynu...@gmail.com wrote:
Hi Harry,
Could you try using parse-tika instead and see if you are getting the same
problem? I gather from your email that you are using Nutch 1.1 or the SVN
version, so parse-tika should be used by default. Have you deactivated it?
Thanks
Julien
On 21 April 2010 11:58, Harry Nutch harrynu
I think you need to specify the individual segment:
bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816
dumpSegmentDirectory
On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 jinietosanc...@gmail.com wrote:
Thank you a lot! Now I'm working on that. I have some more doubts... I'm not
Try bin/nutch on the console.
It will give you a list of commands. You can use them to read segments, e.g.
bin/nutch readdb ..
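As a sketch of what those commands look like in practice (the crawl/crawldb path below is an assumption based on the default crawl directory layout):

```shell
# Print statistics about the crawl database (path is hypothetical)
bin/nutch readdb crawl/crawldb -stats

# Dump the crawl database contents as plain text into a directory
bin/nutch readdb crawl/crawldb -dump crawldb-dump
```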
On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 jinietosanc...@gmail.com wrote:
I have a doubt... How are the final results of Nutch stored? I mean, in which
format is
Did you check robots.txt?
On Wed, Apr 21, 2010 at 7:57 AM, joshua paul jos...@neocodesoftware.com wrote:
After getting this email, I tried commenting out this line in
regex-urlfilter.txt:
#-[...@=]
but it didn't help... I still get the same message: no urls to fetch.
I am new to Nutch and still trying to figure out the code flow. However, as
a workaround to issue #1, after the crawl finishes you could run the linkdb and
index commands separately from Cygwin.
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/
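If the index command complains about usage, note that the indexer in this line of Nutch versions typically expects the linkdb and the segments as well; a hedged sketch, with paths assumed from the crawl layout above:

```shell
# Hypothetical full invocation: index dir, crawldb, linkdb, then segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```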