Re: Questions/issues with nutch

Tejas Patil Wed, 26 Jun 2013 22:04:17 -0700

On Wed, Jun 26, 2013 at 9:53 PM, h b <hb6...@gmail.com> wrote:

> Thanks for the response Lewis.
> I did read these links. I mostly followed the first link and tried both
> sections 3.2 and 3.3. Using bin/crawl gave me a null pointer exception
> on Solr, so I figured that I should first get the crawl part to work
> and then deal with Solr indexing. Hence I went back to trying it
> stepwise.
>


You should try running the crawl using individual commands and see where
the problem is. The Nutch tutorial that Lewis pointed you to lists those
commands. Peeking into the bin/crawl script would also help, as it
calls the nutch commands.
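
For reference, one round of a step-by-step crawl in Nutch 2.x could look
roughly like the sketch below. It rests on some assumptions: the seed
directory name (urls), the -topN value, and the crawl_round function and
NUTCH variable are illustrative names that do not come from this thread,
and the exact flags can differ between Nutch versions, so check the usage
output of each command. By default the script only echoes the commands
(a dry run); set NUTCH=bin/nutch from the Nutch runtime directory to
actually execute them.

```shell
#!/bin/sh
# Sketch of one Nutch 2.x crawl round, run one command at a time.
# NUTCH defaults to a dry run that only echoes each command.
NUTCH="${NUTCH:-echo bin/nutch}"
SEED_DIR="${SEED_DIR:-urls}"   # assumed directory containing seed.txt

crawl_round() {
    # Stop at the first failing step so the failing command is obvious.
    $NUTCH inject "$SEED_DIR"   || { echo "inject failed" >&2; return 1; }
    $NUTCH readdb -stats        || { echo "readdb failed" >&2; return 1; }
    $NUTCH generate -topN 10    || { echo "generate failed" >&2; return 1; }
    $NUTCH fetch -all           || { echo "fetch failed" >&2; return 1; }
    $NUTCH parse -all           || { echo "parse failed" >&2; return 1; }
    $NUTCH updatedb             || { echo "updatedb failed" >&2; return 1; }
}

crawl_round
```

Running readdb right after inject is a quick way to check whether the
seed URL reached the store before moving on to generate and fetch.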

>
> As for the second link, it is more about using HBase as store instead of
> gora. This is not really an option for me yet, because my grid does not
> have hbase installed yet. Getting it done is not much under my control.
>

HBase is one of the datastores supported by Apache Gora. That tutorial
explains how to configure Nutch (actually Gora) to use HBase as a
backend. So it's wrong to say that the tutorial was about HBase and not
Gora.

>
> The FAQ link is the one I had not gone through until I checked your
> response, but I do not find answers to any of my questions
> (directly/indirectly) in it.
>

Ok

>
>
>
>
> On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
> > Hi Hemant,
> > I strongly advise you to take some time to look through the Nutch
> > Tutorial for 1.x and 2.x.
> > http://wiki.apache.org/nutch/NutchTutorial
> > http://wiki.apache.org/nutch/Nutch2Tutorial
> > Also please see the FAQs, which you will find very useful.
> > http://wiki.apache.org/nutch/FAQ
> >
> > Thanks
> > Lewis
> >
> >
> > On Wed, Jun 26, 2013 at 5:18 PM, h b <hb6...@gmail.com> wrote:
> >
> > > Hi,
> > > I am a first-time user of Nutch. I installed
> > > nutch(2.2)/solr(4.3)/hadoop(0.20) and started by crawling a single
> > > webpage.
> > >
> > > I am running nutch step by step. These are the problems I came across -
> > >
> > > 1. Inject did not work, i.e. the URL does not show up in the
> > > webdb (gora-memstore). To verify this, after running inject I ran
> > > readdb with dump. This created a directory in HDFS containing a
> > > 0-size part file.
> > >
> > > 2. Config files - This confused me a lot. When run from the deploy
> > > directory, does Nutch use the config files from local/conf? Changes
> > > made to local/conf/nutch-site.xml did not take effect. I had to edit
> > > this file in order to get rid of the 'http.agent.name' error. I
> > > finally ended up hard-coding the value in the code, rebuilding, and
> > > running to keep going forward.
> > >
> > > 3. How to interpret readdb - Running readdb -stats shows a lot of
> > > output, but I do not see my URL from seed.txt in there. So I do not
> > > know whether the entry in webdb actually reflects my seed.txt.
> > >
> > > 4. Logs - When Nutch is run from the deploy directory, logs/hadoop.log
> > > is no longer generated, neither locally nor on the grid. I tried to
> > > make logging verbose by changing log4j.properties to DEBUG, but still
> > > no file was generated.
> > >
> > > Any help with this would help me move forward with nutch.
> > >
> > > Regards
> > > Hemant
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>
