Have a good look at your hadoop.log, which should be created when you
initiate a crawl with Nutch; it will be extremely valuable. In addition,
there are various properties in nutch-site.xml which can be set to make
logging more verbose at various stages, e.g. fetching.
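
As a rough sketch of what such an override could look like: the fragment
below assumes that the fetcher.verbose and http.verbose properties are
listed in your nutch-default.xml (please check there first, as property
names can differ between releases) and simply switches them on in
nutch-site.xml.

```xml
<!-- Assumed to go inside the <configuration> element of conf/nutch-site.xml. -->
<!-- Property names are assumptions: confirm them in nutch-default.xml first. -->
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>
<property>
  <name>http.verbose</name>
  <value>true</value>
</property>
```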

In order to root out various errors you will need to get used to looking
through your logs. It is also advisable to include as much log data as
possible when posting queries on the user list, as this will greatly help
you get accurate and detailed help from the list in future. You can find
more information about this at [1].

I would advise you to delete all crawled data and begin a fresh crawl. This
way you can try the above, looking at your logs, before we try to root out
exactly where the errors are stemming from.
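
Something along these lines might do as a starting point. It is only a
sketch: it assumes you run it from runtime/local, that your crawl data
lives in the crawl directory used in your command, and that the log ends
up in logs/hadoop.log (an assumed default, adjust if yours differs). The
helper names scan_log and fresh_crawl are made up for illustration.

```shell
#!/bin/sh
# Sketch only: both paths below are assumptions, adjust them to your install.
# CRAWL_DIR matches the "-dir crawl" argument used earlier in this thread.
CRAWL_DIR="${CRAWL_DIR:-crawl}"
# Assumed log location relative to runtime/local.
LOG_FILE="${LOG_FILE:-logs/hadoop.log}"

# Print every line mentioning WARN, ERROR or FATAL, prefixed by its line number.
scan_log() {
    grep -nE '(WARN|ERROR|FATAL)' "$1"
}

# Delete the old crawl data, re-run the crawl, then show the problem lines.
fresh_crawl() {
    rm -rf "$CRAWL_DIR"
    bin/nutch crawl urls -dir "$CRAWL_DIR" -depth 3 -topN 50
    scan_log "$LOG_FILE"
}
```

Sourcing the file only defines the two functions; nothing is deleted until
you call fresh_crawl yourself. Running scan_log logs/hadoop.log on its own
is a harmless way to check an existing log first.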

HTH

[1]
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Becoming_a_Nutch_Developer



On Tue, Jul 12, 2011 at 7:31 PM, Sethi, Parampreet <
parampreet.se...@teamaol.com> wrote:

> Hey Lewis, Thanks for the quick reply. Looks like I am tangled now =)
>
> I tried the tutorial mentioned at
> http://wiki.apache.org/nutch/RunningNutchAndSolr
>
> For me step 3 is not working. Two of the directories are not created (which
> should be there after step 3 is complete.)
>
> crawl/crawldb - Created
> crawl/linkdb - not created
> crawl/segments - not created
>
> Also, I changed the url to http://nutch.apache.org, but still same log
> message "Generator: 0 records selected for fetching, exiting ..."
>
> Looks like I am missing some key step =(.
>
> -param
>
> On 7/12/11 1:37 PM, "lewis john mcgibbney" <lewis.mcgibb...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I think you are maybe getting tangled here. Please see the following
> > tutorial for Nutch 1.3 [1]
> >
> > Please also note that the URL you provided is the old Nutch site and now
> > redirects to http://nutch.apache.org
> >
> > [1] http://wiki.apache.org/nutch/RunningNutchAndSolr
> >
> > On Tue, Jul 12, 2011 at 5:23 PM, Sethi, Parampreet <
> > parampreet.se...@teamaol.com> wrote:
> >
> >> Thanks for updating the tutorial. I tried my setup, the crawl command is
> >> running. But none of the pages are being crawled.
> >> I created a urls directory inside the local folder and added a new file,
> >> nutch, with the url in it, as mentioned in the tutorial.
> >>
> >> (I also tried a file named urls inside the nutch/runtime/local directory.
> >> The contents of the urls file is http://lucene.apache.org/nutch/ )
> >>
> >> Here's the log:
> >>
> >> us137390:local parampreetsethi$  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >> solrUrl is not set, indexing will be skipped...
> >> crawl started in: crawl
> >> rootUrlDir = urls
> >> threads = 10
> >> depth = 3
> >> solrUrl=null
> >> topN = 50
> >> Injector: starting at 2011-07-12 12:22:12
> >> Injector: crawlDb: crawl/crawldb
> >> Injector: urlDir: urls
> >> Injector: Converting injected urls to crawl db entries.
> >> Injector: Merging injected urls into crawl db.
> >> Injector: finished at 2011-07-12 12:22:15, elapsed: 00:00:03
> >> Generator: starting at 2011-07-12 12:22:15
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 50
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: 0 records selected for fetching, exiting ...
> >> Stopping at depth=0 - no more URLs to fetch.
> >> No URLs to fetch - check your seed list and URL filters.
> >> crawl finished: crawl
> >>
> >>
> >> Please help.
> >>
> >> Thanks
> >> Param
> >>
> >> On 7/12/11 5:52 AM, "Julien Nioche" <lists.digitalpeb...@gmail.com>
> wrote:
> >>
> >>> On 12 July 2011 10:30, Julien Nioche <lists.digitalpeb...@gmail.com>
> >> wrote:
> >>>
> >>>>
> >>>>
> >>>>>>> There seems to be no crawl-urlfilter file indeed. Don't know why it's
> >>>>>>> gone since the crawl command is still there. You can find the file in
> >>>>>>> the 1.2 release:
> >>>>>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/
> >>>>>>
> >>>>>> Crawl-urlfilter has been removed purposefully as it did not add
> >>>>>> anything to the other url filters (automaton | regex) in terms of
> >>>>>> functionality. By default the urlfilters contain (+.) which IIRC was
> >>>>>> what the Crawl-urlfilter used to do.
> >>>>>>
> >>>>>
> >>>>> That's reasonable. But now new users are unaware and don't know what
> >>>>> to do with this error message.
> >>>>>
> >>>>
> >>>> Yep, the tutorial needs updating indeed
> >>>>
> >>>
> >>> done
> >>>
> >>>
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>>>>> Thanks for a quick reply.
> >>>>>>>>
> >>>>>>>> I searched in the nutch directory but still do not see that file :(.
> >>>>>>>> Here's complete file list inside runtime/local/conf directory.
> >>>>>>>>
> >>>>>>>> us137390:conf parampreetsethi$ pwd
> >>>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/runtime/local/conf
> >>>>>>>> us137390:conf parampreetsethi$ ls -t
> >>>>>>>> automaton-urlfilter.txt    domain-urlfilter.txt    nutch-default.xml    prefix-urlfilter.txt    solrindex-mapping.xml
> >>>>>>>> configuration.xsl    httpclient-auth.xml    nutch-site.xml    regex-normalize.xml    subcollections.xml
> >>>>>>>> domain-suffixes.xml    log4j.properties    parse-plugins.dtd    regex-urlfilter.txt    suffix-urlfilter.txt
> >>>>>>>> domain-suffixes.xsd    nutch-conf.xsl        parse-plugins.xml    schema.xml    tika-mimetypes.xml
> >>>>>>>>
> >>>>>>>> By the way, I tried deploying the code by checking out from svn
> >>>>>>>> repository, but could not build it. I was getting following error:
> >>>>>>>>
> >>>>>>>> resolve-default:
> >>>>>>>> [ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
> >>>>>>>> [ivy:resolve] :: loading settings :: file = /Users/parampreetsethi/Documents/workspace/nutch/ivy/ivysettings.xml
> >>>>>>>> [ivy:resolve]
> >>>>>>>> [ivy:resolve] :: problems summary ::
> >>>>>>>> [ivy:resolve] :::: WARNINGS
> >>>>>>>> [ivy:resolve]         module not found: org.apache.gora#gora-core;0.2-incubating
> >>>>>>>> [ivy:resolve]     ==== local: tried
> >>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/ivys/ivy.xml
> >>>>>>>> [ivy:resolve]       -- artifact org.apache.gora#gora-core;0.2-incubating!gora-core.jar:
> >>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/jars/gora-core.jar
> >>>>>>>> [ivy:resolve]         module not found: org.apache.gora#gora-sql;0.2-incubating
> >>>>>>>> [ivy:resolve]     ==== local: tried
> >>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/ivys/ivy.xml
> >>>>>>>> [ivy:resolve]       -- artifact org.apache.gora#gora-sql;0.2-incubating!gora-sql.jar:
> >>>>>>>> [ivy:resolve]       /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/jars/gora-sql.jar
> >>>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
> >>>>>>>> [ivy:resolve]         ::          UNRESOLVED DEPENDENCIES         ::
> >>>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
> >>>>>>>> [ivy:resolve]         :: org.apache.gora#gora-core;0.2-incubating: not found
> >>>>>>>> [ivy:resolve]         :: org.apache.gora#gora-sql;0.2-incubating: not found
> >>>>>>>> [ivy:resolve]         ::::::::::::::::::::::::::::::::::::::::::::::
> >>>>>>>> [ivy:resolve]
> >>>>>>>> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> >>>>>>>>
> >>>>>>>> BUILD FAILED
> >>>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/build.xml:458: impossible to resolve dependencies:
> >>>>>>>>     resolve failed - see output for details
> >>>>>>>>
> >>>>>>>> -param
> >>>>>>>>
> >>>>>>>> On 7/11/11 5:56 PM, "Jerry E. Craig, Jr." <jcr...@inforeverse.com
> >
> >>>>>>>
> >>>>>>> wrote:
> >>>>>>>>> Look down a little further for the
> >>>>>>>>>
> >>>>>>>>> or
> >>>>>>>>> runtime/local/bin/nutch (version >= 1.3)
> >>>>>>>>>
> >>>>>>>>> If you download the bin then it's in the runtime directory.
> >>>>>>>>>
> >>>>>>>>> Jerry E. Craig, Jr.
> >>>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Sethi, Parampreet [mailto:parampreet.se...@teamaol.com]
> >>>>>>>>> Sent: Monday, July 11, 2011 2:51 PM
> >>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>> Subject: Nutch Novice help
> >>>>>>>>>
> >>>>>>>>> Hi All,
> >>>>>>>>>
> >>>>>>>>> Sorry for such a naïve question. I downloaded the nutch 1.3 binary
> >>>>>>>>> today and am trying to set it up as mentioned in the Tutorial at
> >>>>>>>>> http://wiki.apache.org/nutch/NutchTutorial
> >>>>>>>>>
> >>>>>>>>> However, I am not able to find crawl-urlfilter.txt inside the conf
> >>>>>>>>> directory. Is there any other place where I should look for this file?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Param
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> *
> >>>> *Open Source Solutions for Text Engineering
> >>>>
> >>>> http://digitalpebble.blogspot.com/
> >>>> http://www.digitalpebble.com
> >>>>
> >>>
> >>>
> >>
> >>
> >
>
>


-- 
*Lewis*
