Re: Hi

2010-05-06 Thread Harry Nutch
Did you check crawl-urlfilter.txt? All the domain names that you'd like to crawl have to be mentioned, e.g.:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/
Also check the property db.ignore.external.links in nutch-default.xml. Should be
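To illustrate how rules like the ones above behave, here is a minimal sketch in Python of Nutch-style regex URL filtering: each rule is a '+' (accept) or '-' (reject) prefix followed by a regex, and the first matching rule wins. This is a hypothetical re-implementation for illustration, not Nutch's actual code; the two '+' patterns mirror the example crawl-urlfilter.txt entries from the post.

```python
import re

# Ordered filter rules: (sign, compiled pattern). First match wins.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*mersin\.edu\.tr/")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*tubitak\.gov\.tr/")),
    ("-", re.compile(r".")),  # reject everything else
]

def accepts(url: str) -> bool:
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched

print(accepts("http://www.mersin.edu.tr/index.html"))  # True
print(accepts("http://example.com/"))                  # False
```

If a seed URL's host is not covered by any '+' rule, the catch-all '-' rule rejects it, which produces exactly the "No URLs to fetch" symptom discussed elsewhere in this thread.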

AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Hi, I am running the latest version of Nutch. While crawling one particular site I get an AbstractMethodError in the cyberneko plugin for all of its pages when doing a Fetch. As I understand it, this happens because of a difference between the runtime and compile-time versions. However, I am running it

Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Replacing the current xercesimpl.jar with the one from Nutch 1.0 seems to fix the problem. On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch harrynu...@gmail.com wrote: Hi, I am running the latest version of Nutch. While crawling one particular site I get an AbstractMethodError in the cyberneko

Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Hi Harry, Could you try using parse-tika instead and see if you are getting the same problem? I gather from your email that you are using Nutch 1.1 or the SVN version, so parse-tika should be used by default. Have you deactivated it? Thanks Julien On 21 April 2010 11:58, Harry Nutch harrynu

Re: Format of the Nutch Results

2010-04-21 Thread Harry Nutch
I think you need to specify the individual segment, e.g. bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dumpSegmentDirectory On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 jinietosanc...@gmail.com wrote: Thank you a lot! Now I'm working on that. I have some more doubts... I'm not

Re: Format of the Nutch Results

2010-04-20 Thread Harry Nutch
Try running bin/nutch on the console. It will give you a list of commands. You could use them to read segments, e.g. bin/nutch readdb .. On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 jinietosanc...@gmail.com wrote: I have a doubt... How are the final results of Nutch stored? I mean, in which format is
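As a sketch of the commands mentioned in these two replies (paths are placeholders; adjust them to your own crawl directory and segment timestamp, and run from the Nutch home directory):

```shell
# Print CrawlDb statistics (URL counts by fetch status):
bin/nutch readdb crawl/crawldb -stats

# Dump the contents of one segment to a plain-text output directory:
bin/nutch readseg -dump crawl/segments/20100422092816 dumpSegmentDirectory
```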

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Harry Nutch
Did you check robots.txt? On Wed, Apr 21, 2010 at 7:57 AM, joshua paul jos...@neocodesoftware.com wrote: after getting this email, I tried commenting out this line in regex-urlfilter.txt = #-[...@=] but it didn't help... I still get the same message - no urls to fetch regex-urlfilter.txt = #
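The robots.txt suggestion can be checked by hand. Below is a minimal Python sketch using the standard library's robots.txt parser; the rules and URLs are made-up examples (in practice Nutch fetches http://&lt;host&gt;/robots.txt itself and applies it during the fetch phase).

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical; substitute the real site's file).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A disallowed path explains why a fetcher would skip those URLs.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
```

If every seed URL falls under a Disallow rule, the fetcher will report no URLs to fetch even when the URL filters are correct.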

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Harry Nutch
I am new to Nutch and still trying to figure out the code flow; however, as a workaround to issue #1, after the crawl finishes you could run the linkdb and index commands separately from cygwin: $ bin/nutch invertlinks crawl/linkdb -dir crawl/segments $ bin/nutch index crawl/indexes crawl/crawldb/