Wait, ignore my last email. The issue is on the Solr side!
On Thu, Apr 10, 2014 at 1:13 PM, Xavier Morera <xav...@familiamorera.com> wrote:

> Thanks Julien and Sebastian. Tried that and got the exception below. Is
> there a way of getting more detail on the exception so that I can
> continue troubleshooting? I am getting really, really close! I have also
> attached the full output.
>
> This is the exception, but there is no additional info:
>
> Indexer: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>
> I also found this, which means that something is actually happening:
>
> Indexing 20140410124128 on SOLR index -> http://localhost:8983/solr
> cygpath: can't convert empty path
> Indexer: starting at 2014-04-10 12:41:42
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> SOLRIndexWriter
>         solr.server.url : URL of the SOLR instance (mandatory)
>         solr.commit.size : buffer size when sending to SOLR (default 1000)
>         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>         solr.auth : use authentication (default false)
>         solr.auth.username : username for authentication
>         solr.auth.password : password for authentication
>
> My full nutch-site.xml is:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file.
> -->
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>nutch-solr-integration</value>
>   </property>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
>   <property>
>     <name>fs.file.impl</name>
>     <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
>     <description>Enables patch for issue HADOOP-7682 on Windows</description>
>   </property>
> </configuration>
>
> And for urls/site.txt I have:
>
> http://www.trenurbano.co.cr
>
> And in regex-urlfilter.txt I have:
>
> +^http://([a-z0-9]*\.)*trenurbano.co.cr/
>
> Thanks in advance,
> Xavier
>
>
> On Thu, Apr 10, 2014 at 12:35 PM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
>
>> Hi Xavier
>>
>> Your config file looks a bit outdated. Here are the values set by default
>> (see http://svn.apache.org/repos/asf/nutch/trunk/conf/nutch-default.xml):
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|*indexer-solr*|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> Your problem comes from the fact that you are missing indexer-solr.
>>
>> You should not need query-(basic|site|url)|response-(json|xml)|summary-basic,
>> as they date back to times immemorial when we used to manage the indexing
>> and search ourselves.
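[Editor's note] Julien's diagnosis can be sanity-checked mechanically: parse nutch-site.xml and confirm that `indexer-solr` appears in the `plugin.includes` value. A minimal sketch, assuming the file has the same layout as the config quoted above (in practice you would read `conf/nutch-site.xml` from disk rather than a string):

```python
# Sketch: verify that plugin.includes in a nutch-site.xml lists indexer-solr.
# The XML string mirrors the config quoted in this thread.
import xml.etree.ElementTree as ET

NUTCH_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>"""

def plugin_includes(xml_text: str) -> str:
    """Return the value of the plugin.includes property, or '' if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "plugin.includes":
            return prop.findtext("value", default="")
    return ""

value = plugin_includes(NUTCH_SITE_XML)
# Without indexer-solr in this list, Nutch 1.8 reports
# "No IndexWriters activated" and nothing reaches Solr.
print("indexer-solr" in value)
```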
>>
>> HTH
>>
>> Julien
>>
>>
>> On 10 April 2014 18:05, Xavier Morera <xav...@familiamorera.com> wrote:
>>
>>> Hi,
>>>
>>> I have followed several Nutch tutorials - including the main one,
>>> http://wiki.apache.org/nutch/NutchTutorial - to crawl sites (which
>>> works; I can see in the console as the pages get crawled and the
>>> directories get built with the data), but for the life of me I can't get
>>> anything posted to Solr. The Solr console doesn't even squint, therefore
>>> Nutch is not sending anything.
>>>
>>> This is the command I run, which crawls and in theory should also post:
>>>
>>> bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr 2
>>>
>>> I found that I could also use this one once the crawl is already done:
>>>
>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
>>>
>>> But no luck.
>>>
>>> This is the only thing that caught my attention; I read that adding the
>>> property below would fix it, but it doesn't:
>>>
>>> No IndexWriters activated - check your configuration
>>>
>>> This is the property:
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>>
>>> Any idea? Apache Nutch 1.8 running on Java 1.6 via Cygwin on Windows.
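[Editor's note] One other thing worth ruling out with a crawl that produces no documents is the URL filter. Nutch's urlfilter-regex applies each rule's pattern, with the leading `+`/`-` deciding accept or reject, and drops URLs no rule matches. The sketch below re-implements just enough of that semantics (a simplification, not Nutch's actual RegexURLFilter) to confirm the seed URL passes the rule quoted earlier in the thread. Note the unescaped dots in the rule also match any character; harmless here, but worth escaping:

```python
# Simplified sketch of Nutch's regex-urlfilter semantics: the first rule whose
# pattern matches decides; '+' accepts, '-' rejects. This is NOT the real
# org.apache.nutch.urlfilter.regex implementation, just enough to test a rule.
import re

RULES = [
    # The rule from the regex-urlfilter.txt quoted earlier in the thread.
    r"+^http://([a-z0-9]*\.)*trenurbano.co.cr/",
]

def passes_filter(url: str, rules=RULES) -> bool:
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # Nutch rejects URLs that no rule matches

print(passes_filter("http://www.trenurbano.co.cr/"))  # seed URL: accepted
print(passes_filter("http://example.com/"))           # no rule matches: rejected
```

If the seed URL failed this check, the crawl would produce empty segments and the indexer would have nothing to send to Solr.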
>>>
>>> --
>>> *Xavier Morera*
>>> email: xav...@familiamorera.com
>>> CR: +(506) 8849 8866
>>> US: +1 (305) 600 4919
>>> skype: xmorera
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>
>
> --
> *Xavier Morera*
> email: xav...@familiamorera.com
> CR: +(506) 8849 8866
> US: +1 (305) 600 4919
> skype: xmorera

--
*Xavier Morera*
email: xav...@familiamorera.com
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera