Thanks Tejas, I tried these steps. One step I added was updatedb: *bin/nutch updatedb*
Just to be consistent with the doc, and your suggestion on some other
thread, I used Solr 3.6 instead of 4.x. I copied the schema.xml from
nutch/conf (root level) and started Solr. It failed with:

SEVERE: org.apache.solr.common.SolrException: undefined field text

One of the Google threads suggested I ignore this error, so I ignored it
and indexed anyway. So now I got it to work. Playing some more with the
queries now.

On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <tejas.patil...@gmail.com> wrote:

> The "storage.schema.webpage" seems messed up but I don't have ample time
> now to look into it. Here is what I would suggest to get things working:
>
> *[1] Remove all the old data from HBase*
>
> (I assume that HBase is running while you do this)
> *cd $HBASE_HOME*
> *./bin/hbase shell*
>
> In the HBase shell, use "list" to see all the tables, and delete all of
> those related to Nutch (ones named as *webpage).
> Remove them using the "disable" and "drop" commands.
>
> e.g. if one of the tables is "webpage", you would run this:
> *disable 'webpage'*
> *drop 'webpage'*
>
> *[2] Run crawl*
> I assume that you have not changed "storage.schema.webpage" in
> nutch-site.xml and nutch-default.xml. If yes, revert it to:
>
> *<property>*
> *  <name>storage.schema.webpage</name>*
> *  <value>webpage</value>*
> *  <description>This value holds the schema name used for Nutch web db.*
> *  Note that Nutch ignores the value in the gora mapping files, and uses*
> *  this as the webpage schema name.*
> *  </description>*
> *</property>*
>
> Run the crawl commands:
> *bin/nutch inject urls/*
> *bin/nutch generate -topN 50000 -noFilter -adddays 0*
> *bin/nutch fetch -all -threads 5*
> *bin/nutch parse -all*
>
> *[3] Perform indexing*
> I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml copied
> to ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0] for details.
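Steps [1] and [2] above can be combined into one non-interactive script. A sketch follows; the table name, the topN value, and the assumption that $HBASE_HOME is set are all things to adjust for your own setup:

```shell
#!/bin/sh
# Sketch of steps [1] and [2]: drop the old Nutch table from HBase,
# then re-run the crawl phases. TABLE and TOPN are assumptions.
TABLE="webpage"
TOPN=50000

# [1] Drop the old table non-interactively by piping the same
# disable/drop commands you would type into the HBase shell.
printf "disable '%s'\ndrop '%s'\n" "$TABLE" "$TABLE" | "$HBASE_HOME"/bin/hbase shell

# [2] Re-run the crawl phases step by step.
bin/nutch inject urls/
bin/nutch generate -topN "$TOPN" -noFilter -adddays 0
bin/nutch fetch -all -threads 5
bin/nutch parse -all
bin/nutch updatedb
```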
> Start Solr and run the indexing command:
> *bin/nutch solrindex $SOLR_URL -all*
>
> [0] : http://wiki.apache.org/nutch/NutchTutorial
>
> Thanks,
> Tejas
>
> On Thu, Jun 27, 2013 at 1:47 PM, h b <hb6...@gmail.com> wrote:
>
> > Ok, so Avro did not work quite well for me. I got a test grid with
> > HBase, and I started using that for now. All steps ran without errors
> > and I see my crawled doc in HBase.
> > However, after running the Solr integration and querying Solr, I get
> > back nothing. The index files look very tiny. The one thing I noted is
> > a message during almost every step:
> >
> > 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match but
> > mismatching table names mappingfile schema is 'webpage' vs actual schema
> > 'crawl2_webpage' , assuming they are the same.
> >
> > This looks suspicious and I think this is the one causing the Solr
> > index to be empty. Googling suggested I should edit nutch-default.xml;
> > I tried that and rebuilt the job, but no luck with this message.
> >
> > On Thu, Jun 27, 2013 at 10:30 AM, h b <hb6...@gmail.com> wrote:
> >
> > > Ok, I ran ant, ant jar and ant job, and that seems to have picked up
> > > the config changes.
> > > Now the inject output shows that it is using AvroStore as the Gora
> > > storage.
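The "mismatching table names ... 'crawl2_webpage'" warning quoted above comes from the -crawlId flag: in Nutch 2.x the crawl id is prefixed onto the schema name, so the data lives in a crawl2_webpage table while the gora mapping file still says webpage. If that is what happened here, every later step, including solrindex, has to be run with the same crawl id or it reads an empty table. A sketch (the Solr URL is a placeholder, and -crawlId support on solrindex should be checked against your Nutch version):

```shell
# Sketch: "-crawlId crawl2" makes Nutch 2.x store into "crawl2_webpage"
# instead of "webpage", so the indexing job must use the same id.
# The Solr URL is a placeholder.
CRAWL_ID="crawl2"
TABLE="${CRAWL_ID}_webpage"   # the table HBase actually holds
echo "$TABLE"
bin/nutch solrindex http://localhost:8983/solr/ -all -crawlId "$CRAWL_ID"
```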
> > >
> > > Now I am getting a NullPointerException:
> > >
> > > java.lang.NullPointerException
> > >     at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
> > >     at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
> > >     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
> > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:264)
> > >
> > > which does not look Nutch related. I will work on this and write back
> > > if I get stuck on something else, or will write back if I succeed.
> > >
> > > On Thu, Jun 27, 2013 at 10:18 AM, h b <hb6...@gmail.com> wrote:
> > >
> > >> Hi Lewis,
> > >>
> > >> Sorry for missing that one. So I updated the top level conf and
> > >> rebuilt the job.
> > >>
> > >> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
> > >>
> > >> ......
> > >> <property>
> > >>   <name>storage.data.store.class</name>
> > >>   <value>org.apache.gora.avro.store.AvroStore</value>
> > >> </property>
> > >> ......
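The NPE in GoraOutputFormat.setOutputPath above is consistent with AvroStore having no output path configured. If that is the cause, giving AvroStore explicit paths in conf/gora.properties may help. The property keys and HDFS paths below are assumptions to verify against the Gora version bundled with your Nutch, not a confirmed fix:

```properties
# conf/gora.properties -- assumed AvroStore path keys; verify the key
# names against the Gora docs for your version before relying on them.
gora.avrostore.input.path=hdfs://localhost:9000/user/nutch/webpage
gora.avrostore.output.path=hdfs://localhost:9000/user/nutch/webpage
```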
> > >>
> > >> cd ~/nutch/apache-nutch-2.2/
> > >> ant job
> > >> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
> > >>
> > >> bin/nutch inject urls -crawlId crawl1
> > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at
> > >> 2013-06-27 17:12:01
> > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting
> > >> urlDir: urls
> > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
> > >> org.apache.gora.memory.store.MemStore as the Gora storage class.
> > >>
> > >> It still shows me MemStore.
> > >>
> > >> In the jobtracker I see that the [crawl1]inject urls job does not
> > >> have a urls_injected counter.
> > >> I have *db.score.injected* 1.0, but I don't think that says anything
> > >> about the urls injected.
> > >>
> > >> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
> > >> lewis.mcgibb...@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>> Please re-read my mail.
> > >>> If you are using the deploy directory, e.g. running on a Hadoop
> > >>> cluster, then make sure to edit nutch-site.xml from within the top
> > >>> level conf directory _not_ the conf directory in runtime/local.
> > >>> If you look at the ant runtime target in the build script you will
> > >>> see the code which generates the runtime directory structure.
> > >>> Make changes to conf/nutch-site.xml, build the job jar, navigate to
> > >>> runtime/deploy, run the code.
> > >>> It's easier to make the job jar and scripts in deploy available to
> > >>> the job tracker.
> > >>> You also didn't comment on the counters for the inject job. Do you
> > >>> see any?
> > >>> Best
> > >>> Lewis
> > >>>
> > >>> On Wednesday, June 26, 2013, h b <hb6...@gmail.com> wrote:
> > >>> > Here is an example of what I am saying about the config changes
> > >>> > not taking effect.
> > >>> >
> > >>> > cd runtime/deploy
> > >>> > cat ../local/conf/nutch-site.xml
> > >>> > ......
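When inject keeps reporting MemStore after an `ant job`, one quick sanity check is to read the config back out of the job jar itself, since a .job file is a zip archive. A sketch; the jar file name is an assumption that varies by Nutch version, so `ls runtime/deploy` to find yours:

```shell
# Sketch: confirm the rebuilt job jar actually packs the edited
# nutch-site.xml. The .job file name is an assumption.
JOB_JAR="runtime/deploy/apache-nutch-2.2.job"
# Print the packed nutch-site.xml and look for the storage class
# you set; if AvroStore does not appear, the jar is stale.
unzip -p "$JOB_JAR" nutch-site.xml | grep -A1 storage.data.store.class
```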
> > >>> >
> > >>> > <property>
> > >>> >   <name>storage.data.store.class</name>
> > >>> >   <value>org.apache.gora.avro.store.AvroStore</value>
> > >>> > </property>
> > >>> > .....
> > >>> >
> > >>> > cd ../..
> > >>> > ant job
> > >>> > cd runtime/deploy
> > >>> > bin/nutch inject urls -crawlId crawl1
> > >>> > .....
> > >>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
> > >>> > org.apache.gora.memory.store.MemStore as the Gora storage class.
> > >>> > .....
> > >>> >
> > >>> > So nutch-site.xml was changed to use AvroStore as the storage
> > >>> > class, the job was rebuilt, and I reran inject, the output of
> > >>> > which still shows that it is trying to use MemStore.
> > >>> >
> > >>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
> > >>> > lewis.mcgibb...@gmail.com> wrote:
> > >>> >
> > >>> >> The Gora MemStore was introduced to deal predominantly with test
> > >>> >> scenarios. This is justified as the 2.x code is pulled nightly
> > >>> >> and after every commit and tested.
> > >>> >> It is not thread safe and should not be used (until we fix some
> > >>> >> issues) for any kind of serious deployment.
> > >>> >> From your inject task on the job tracker, you will be able to see
> > >>> >> 'urls_injected' counters which represent the number of urls
> > >>> >> actually persisted through Gora into the datastore.
> > >>> >> I understand that HBase is not an option. Gora should also
> > >>> >> support writing the output into Avro sequence files... which can
> > >>> >> be pumped into HDFS. We have done some work on this so I suppose
> > >>> >> that right now is as good a time as any for you to try it out.
> > >>> >> Use the default datastore as org.apache.gora.avro.store.AvroStore,
> > >>> >> I think.
> > >>> >> You can double check by looking into gora.properties.
> > >>> >> As a note, you should use nutch-site.xml within the top level
> > >>> >> conf directory for all your Nutch configuration. You should then
> > >>> >> create a new job jar for use in hadoop by calling 'ant job' after
> > >>> >> the changes are made.
> > >>> >> hth
> > >>> >> Lewis
> > >>> >>
> > >>> >> On Wednesday, June 26, 2013, h b <hb6...@gmail.com> wrote:
> > >>> >> > The quick responses flowing are very encouraging. Thanks Tejas.
> > >>> >> > Tejas, as I mentioned earlier, I actually ran it step by step.
> > >>> >> >
> > >>> >> > So first I ran the inject command and then readdb with the dump
> > >>> >> > option, and did not see anything in the dump files; that leads
> > >>> >> > me to say that the inject did not work. I verified
> > >>> >> > regex-urlfilter and made sure that my url is not getting
> > >>> >> > filtered.
> > >>> >> >
> > >>> >> > I agree that the second link is about configuring HBase as a
> > >>> >> > storage DB. However, I do not have HBase installed and don't
> > >>> >> > foresee getting it installed any sooner, hence using HBase for
> > >>> >> > storage is not an option, so I am going to have to stick to
> > >>> >> > Gora with the memory store.
> > >>> >> >
> > >>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <
> > >>> >> > tejas.patil...@gmail.com> wrote:
> > >>> >> >
> > >>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <hb6...@gmail.com> wrote:
> > >>> >> >>
> > >>> >> >> > Thanks for the response Lewis.
> > >>> >> >> > I did read these links; I mostly followed the first link and
> > >>> >> >> > tried both the 3.2 and 3.3 sections.
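Lewis's pointer to gora.properties above refers to the default datastore key. A minimal conf/gora.properties for the AvroStore setup he suggests would look roughly like the sketch below; the key name is the standard Gora one, but it is worth verifying against the Gora version your Nutch bundles:

```properties
# conf/gora.properties -- select AvroStore as the default Gora datastore
# (overrides whatever storage.data.store.class would otherwise fall
# back to; key name to be verified against your Gora version)
gora.datastore.default=org.apache.gora.avro.store.AvroStore
```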
> > >>> >> >> > Using bin/crawl gave me a null pointer exception on solr,
> > >>> >> >> > so I figured that I should first deal with getting the
> > >>> >> >> > crawl part to work and then deal with solr indexing. Hence
> > >>> >> >> > I went back to trying it stepwise.
> > >>> >> >>
> > >>> >> >> You should try running the crawl using individual commands
> > >>> >> >> and see where the problem is. The nutch tutorial which Lewis
> > >>> >> >> pointed you to had those commands. Even peeking into the
> > >>> >> >> bin/crawl script would also help, as it calls the nutch
> > >>> >> >> commands.
> > >>> >> >>
> > >>> >> >> > As for the second link, it is more about using HBase as the
> > >>> >> >> > store instead of gora. This is not really an option for me
> > >>> >> >> > yet, because my grid does not have hbase installed yet.
> > >>> >> >> > Getting it done is not much under my control.
> > >>> >> >>
> > >>> >> >> HBase is one of the datastores supported by Apache Gora. That
> > >>> >> >> tutorial speaks about how to configure Nutch (actually Gora)
> > >>> >> >> to use HBase as a backend. So, it's wrong to say that the
> > >>> >> >> tutorial was about HBase and not Gora.
> > >>> >> >>
> > >>> >> >> > the FAQ link is the one I had not gone through until I
> > >>> >> >> > checked your response, but I do not find answers to any of
> > >>> >> >> > my questions (directly/indirectly) in it.
> > >>> >> >>
> > >>> >> >> Ok
> > >>> >> >>
> > >>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
> > >>> >> >>
> > >>> >> > *Lewis*
> > >>>
> > >>> --
> > >>> *Lewis*