Thanks Tejas, I tried these steps. One step I added was updatedb: *bin/nutch updatedb*
Just to be consistent with the doc, and your suggestion on some other
thread, I used Solr 3.6 instead of 4.x. I copied the schema.xml from
nutch/conf (root level) and started Solr. It failed with:

SEVERE: org.apache.solr.common.SolrException: undefined field text

One of the Google threads suggested I ignore this error, so I ignored it
and indexed anyway. So now I got it to work. Playing some more with the
queries now.

On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <tejas.patil...@gmail.com> wrote:

> The "storage.schema.webpage" seems messed up but I don't have ample time
> now to look into it. Here is what I would suggest to get things working:
>
> *[1] Remove all the old data from HBase*
>
> (I assume that HBase is running while you do this)
> *cd $HBASE_HOME*
> *./bin/hbase shell*
>
> In the HBase shell, use "list" to see all the tables, and delete all of
> those related to Nutch (ones named as *webpage).
> Remove them using the "disable" and "drop" commands.
>
> e.g. if one of the tables is "webpage", you would run this:
> *disable 'webpage'*
> *drop 'webpage'*
>
> *[2] Run crawl*
> I assume that you have not changed "storage.schema.webpage" in
> nutch-site.xml and nutch-default.xml. If yes, revert it to:
>
> *<property>*
> *  <name>storage.schema.webpage</name>*
> *  <value>webpage</value>*
> *  <description>This value holds the schema name used for Nutch web db.*
> *  Note that Nutch ignores the value in the gora mapping files, and uses*
> *  this as the webpage schema name.*
> *  </description>*
> *</property>*
>
> Run the crawl commands:
> *bin/nutch inject urls/*
> *bin/nutch generate -topN 50000 -noFilter -adddays 0*
> *bin/nutch fetch -all -threads 5*
> *bin/nutch parse -all*
>
> *[3] Perform indexing*
> I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml copied
> to ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0] for details.
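Steps [1] and [2] above can be combined into one non-interactive script. A sketch follows; the table name, the topN value, and the assumption that $HBASE_HOME is set are all things to adjust for your own setup:

```shell
#!/bin/sh
# Sketch of steps [1] and [2]: drop the old Nutch table from HBase,
# then re-run the crawl phases. TABLE and TOPN are assumptions.
TABLE="webpage"
TOPN=50000

# [1] Drop the old table non-interactively by piping the same
# disable/drop commands you would type into the HBase shell.
printf "disable '%s'\ndrop '%s'\n" "$TABLE" "$TABLE" | "$HBASE_HOME"/bin/hbase shell

# [2] Re-run the crawl phases step by step.
bin/nutch inject urls/
bin/nutch generate -topN "$TOPN" -noFilter -adddays 0
bin/nutch fetch -all -threads 5
bin/nutch parse -all
bin/nutch updatedb
```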
> Start Solr and run the indexing command:
> *bin/nutch solrindex $SOLR_URL -all*
>
> [0] : http://wiki.apache.org/nutch/NutchTutorial
>
> Thanks,
> Tejas
>
> On Thu, Jun 27, 2013 at 1:47 PM, h b <hb6...@gmail.com> wrote:
>
> > Ok, so Avro did not work quite well for me. I got a test grid with
> > HBase, and I started using that for now. All steps ran without errors
> > and I see my crawled doc in HBase.
> > However, after running the Solr integration and querying Solr, I get
> > back nothing. The index files look very tiny. The one thing I noted is
> > a message during almost every step:
> >
> > 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match but
> > mismatching table names mappingfile schema is 'webpage' vs actual schema
> > 'crawl2_webpage' , assuming they are the same.
> >
> > This looks suspicious and I think this is the one causing the Solr
> > index to be empty. Googling suggested I should edit nutch-default.xml;
> > I tried that and rebuilt the job, but no luck with this message.
> >
> > On Thu, Jun 27, 2013 at 10:30 AM, h b <hb6...@gmail.com> wrote:
> >
> > > Ok, I ran ant, ant jar and ant job, and that seems to have picked up
> > > the config changes.
> > > Now the inject output shows that it is using AvroStore as the Gora
> > > storage.
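The "mismatching table names ... 'crawl2_webpage'" warning quoted above comes from the -crawlId flag: in Nutch 2.x the crawl id is prefixed onto the schema name, so the data lives in a crawl2_webpage table while the gora mapping file still says webpage. If that is what happened here, every later step, including solrindex, has to be run with the same crawl id or it reads an empty table. A sketch (the Solr URL is a placeholder, and -crawlId support on solrindex should be checked against your Nutch version):

```shell
# Sketch: "-crawlId crawl2" makes Nutch 2.x store into "crawl2_webpage"
# instead of "webpage", so the indexing job must use the same id.
# The Solr URL is a placeholder.
CRAWL_ID="crawl2"
TABLE="${CRAWL_ID}_webpage"   # the table HBase actually holds
echo "$TABLE"
bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId "$CRAWL_ID"
```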
> > >
> > > Now I am getting a NullPointerException:
> > >
> > > java.lang.NullPointerException
> > >     at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
> > >     at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
> > >     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
> > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:264)
> > >
> > > which does not look Nutch related. I will work on this and write back
> > > if I get stuck on something else, or will write back if I succeed.
> > >
> > > On Thu, Jun 27, 2013 at 10:18 AM, h b <hb6...@gmail.com> wrote:
> > >
> > >> Hi Lewis,
> > >>
> > >> Sorry for missing that one. So I updated the top level conf and
> > >> rebuilt the job.
> > >>
> > >> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
> > >>
> > >> ......
> > >> <property>
> > >>   <name>storage.data.store.class</name>
> > >>   <value>org.apache.gora.avro.store.AvroStore</value>
> > >> </property>
> > >> ......
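The NPE in GoraOutputFormat.setOutputPath above is consistent with AvroStore having no output path configured. If that is the cause, giving AvroStore explicit paths in conf/gora.properties may help. The property keys and HDFS paths below are assumptions to verify against the Gora version bundled with your Nutch, not a confirmed fix:

```properties
# conf/gora.properties -- assumed AvroStore path keys; verify the key
# names against the Gora docs for your version before relying on them.
gora.avrostore.input.path=hdfs://localhost:9000/user/nutch/webpage
gora.avrostore.output.path=hdfs://localhost:9000/user/nutch/webpage
```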
> > >>
> > >> cd ~/nutch/apache-nutch-2.2/
> > >> ant job
> > >> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
> > >>
> > >> bin/nutch inject urls -crawlId crawl1
> > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at
> > >> 2013-06-27 17:12:01
> > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting
> > >> urlDir: urls
> > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
> > >> org.apache.gora.memory.store.MemStore as the Gora storage class.
> > >>
> > >> It still shows me MemStore.
> > >>
> > >> In the jobtracker I see that the [crawl1]inject urls job does not
> > >> have a urls_injected counter.
> > >> I have *db.score.injected* 1.0, but I don't think that says anything
> > >> about the urls injected.
> > >>
> > >> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
> > >> lewis.mcgibb...@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>> Please re-read my mail.
> > >>> If you are using the deploy directory, e.g. running on a Hadoop
> > >>> cluster, then make sure to edit nutch-site.xml from within the top
> > >>> level conf directory _not_ the conf directory in runtime/local.
> > >>> If you look at the ant runtime target in the build script you will
> > >>> see the code which generates the runtime directory structure.
> > >>> Make changes to conf/nutch-site.xml, build the job jar, navigate to
> > >>> runtime/deploy, run the code.
> > >>> It's easier to make the job jar and scripts in deploy available to
> > >>> the job tracker.
> > >>> You also didn't comment on the counters for the inject job. Do you
> > >>> see any?
> > >>> Best
> > >>> Lewis
> > >>>
> > >>> On Wednesday, June 26, 2013, h b <hb6...@gmail.com> wrote:
> > >>> > Here is an example of what I am saying about the config changes
> > >>> > not taking effect.
> > >>> >
> > >>> > cd runtime/deploy
> > >>> > cat ../local/conf/nutch-site.xml
> > >>> > ......
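When inject keeps reporting MemStore after an `ant job`, one quick sanity check is to read the config back out of the job jar itself, since a .job file is a zip archive. A sketch; the jar file name is an assumption that varies by Nutch version, so `ls runtime/deploy` to find yours:

```shell
# Sketch: confirm the rebuilt job jar actually packs the edited
# nutch-site.xml. The .job file name is an assumption.
JOB_JAR="runtime/deploy/apache-nutch-2.2.job"
# Print the packed nutch-site.xml and look for the storage class
# you set; if AvroStore does not appear, the jar is stale.
unzip -p "$JOB_JAR" nutch-site.xml | grep -A1 storage.data.store.class
```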
> > >>> >
> > >>> > <property>
> > >>> >   <name>storage.data.store.class</name>
> > >>> >   <value>org.apache.gora.avro.store.AvroStore</value>
> > >>> > </property>
> > >>> > .....
> > >>> >
> > >>> > cd ../..
> > >>> > ant job
> > >>> > cd runtime/deploy
> > >>> > bin/nutch inject urls -crawlId crawl1
> > >>> > .....
> > >>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> > >>> > org.apache.gora.memory.store.MemStore as the Gora storage class.
> > >>> > .....
> > >>> >
> > >>> > So nutch-site.xml was changed to use AvroStore as the storage
> > >>> > class, the job was rebuilt, and I reran inject, the output of
> > >>> > which still shows that it is trying to use MemStore.
> > >>> >
> > >>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> > >>> > lewis.mcgibb...@gmail.com> wrote:
> > >>> >
> > >>> >> The Gora MemStore was introduced to deal predominantly with test
> > >>> >> scenarios. This is justified as the 2.x code is pulled nightly
> > >>> >> and after every commit and tested.
> > >>> >> It is not thread safe and should not be used (until we fix some
> > >>> >> issues) for any kind of serious deployment.
> > >>> >> From your inject task on the job tracker, you will be able to see
> > >>> >> 'urls_injected' counters which represent the number of urls
> > >>> >> actually persisted through Gora into the datastore.
> > >>> >> I understand that HBase is not an option. Gora should also
> > >>> >> support writing the output into Avro sequence files... which can
> > >>> >> be pumped into HDFS. We have done some work on this so I suppose
> > >>> >> that right now is as good a time as any for you to try it out.
> > >>> >> Use the default datastore as org.apache.gora.avro.store.AvroStore,
> > >>> >> I think.
> > >>> >> You can double check by looking into gora.properties.
> > >>> >> As a note, you should use nutch-site.xml within the top level
> > >>> >> conf directory for all your Nutch configuration. You should then
> > >>> >> create a new job jar for use in hadoop by calling 'ant job' after
> > >>> >> the changes are made.
> > >>> >> hth
> > >>> >> Lewis
> > >>> >>
> > >>> >> On Wednesday, June 26, 2013, h b <hb6...@gmail.com> wrote:
> > >>> >> > The quick responses flowing are very encouraging. Thanks Tejas.
> > >>> >> > Tejas, as I mentioned earlier, I actually ran it step by step.
> > >>> >> >
> > >>> >> > So first I ran the inject command and then readdb with the dump
> > >>> >> > option, and did not see anything in the dump files; that leads
> > >>> >> > me to say that the inject did not work. I verified
> > >>> >> > regex-urlfilter and made sure that my url is not getting
> > >>> >> > filtered.
> > >>> >> >
> > >>> >> > I agree that the second link is about configuring HBase as a
> > >>> >> > storage DB. However, I do not have HBase installed and don't
> > >>> >> > foresee getting it installed any sooner, hence using HBase for
> > >>> >> > storage is not an option, so I am going to have to stick to
> > >>> >> > Gora with the memory store.
> > >>> >> >
> > >>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <
> > >>> >> > tejas.patil...@gmail.com> wrote:
> > >>> >> >
> > >>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <hb6...@gmail.com> wrote:
> > >>> >> >>
> > >>> >> >> > Thanks for the response Lewis.
> > >>> >> >> > I did read these links; I mostly followed the first link and
> > >>> >> >> > tried both the 3.2 and 3.3 sections.
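Lewis's pointer to gora.properties above refers to the default datastore key. A minimal conf/gora.properties for the AvroStore setup he suggests would look roughly like the sketch below; the key name is the standard Gora one, but it is worth verifying against the Gora version your Nutch bundles:

```properties
# conf/gora.properties -- select AvroStore as the default Gora datastore
# (overrides whatever storage.data.store.class would otherwise fall
# back to; key name to be verified against your Gora version)
gora.datastore.default=org.apache.gora.avro.store.AvroStore
```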
> > >>> >> >> > Using bin/crawl gave me a null pointer exception on solr,
> > >>> >> >> > so I figured that I should first deal with getting the
> > >>> >> >> > crawl part to work and then deal with solr indexing. Hence
> > >>> >> >> > I went back to trying it stepwise.
> > >>> >> >>
> > >>> >> >> You should try running the crawl using individual commands
> > >>> >> >> and see where the problem is. The nutch tutorial which Lewis
> > >>> >> >> pointed you to had those commands. Even peeking into the
> > >>> >> >> bin/crawl script would also help, as it calls the nutch
> > >>> >> >> commands.
> > >>> >> >>
> > >>> >> >> > As for the second link, it is more about using HBase as the
> > >>> >> >> > store instead of gora. This is not really an option for me
> > >>> >> >> > yet, because my grid does not have hbase installed yet.
> > >>> >> >> > Getting it done is not much under my control.
> > >>> >> >>
> > >>> >> >> HBase is one of the datastores supported by Apache Gora. That
> > >>> >> >> tutorial speaks about how to configure Nutch (actually Gora)
> > >>> >> >> to use HBase as a backend. So, it's wrong to say that the
> > >>> >> >> tutorial was about HBase and not Gora.
> > >>> >> >>
> > >>> >> >> > the FAQ link is the one I had not gone through until I
> > >>> >> >> > checked your response, but I do not find answers to any of
> > >>> >> >> > my questions (directly/indirectly) in it.
> > >>> >> >>
> > >>> >> >> Ok
> > >>> >> >>
> > >>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > >>> >> >>
> > >>> >> > *Lewis*
> > >>>
> > >>> --
> > >>> *Lewis*