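Your regex-urlfilter.txt only accepts nutch.apache.org and rejects everything else with the final '-.' rule, which is why the injector reports your new URLs as rejected by filters. Start by changing the line '-.' in regex-urlfilter.txt to '+.' (accept everything), or add a '+' rule with a regex that matches your domain.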
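
To see why the seeds are rejected, here is a rough Python model of how the filter file is evaluated. It is only a sketch, not Nutch's actual code: it assumes rules are tried top to bottom, the first rule whose regex is found in the URL decides ('+' accepts, '-' rejects), and a URL that matches no rule is rejected. The function names and the example.org URLs are made up for illustration, and the nutch.apache.org rule is written without the space that appears in the quoted message.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter.txt-style text into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule whose regex is found in the URL is a '+' rule."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


# The filter from the quoted message: one accepted domain, then reject all.
RESTRICTED = r"""
# accept anything else
#+.
+^http://([a-z0-9]*\.)*nutch.apache.org/
-.
"""

# Option 1 from the reply: accept everything.
ACCEPT_ALL = "+."

# Option 2 from the reply: a rule for your own domain (example.org is hypothetical).
OWN_DOMAIN = r"""
+^https?://([a-z0-9-]+\.)*example\.org/
-.
"""

if __name__ == "__main__":
    for name, text in [("restricted", RESTRICTED),
                       ("accept all", ACCEPT_ALL),
                       ("own domain", OWN_DOMAIN)]:
        rules = load_rules(text)
        for url in ["http://nutch.apache.org/", "http://www.example.org/page"]:
            print(name, url, accepts(rules, url))
```

With the restricted rules, any seed outside nutch.apache.org falls through to '-.' and is dropped, which matches the "rejected by filters" count in your log.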
On Fri, Jan 17, 2014 at 11:54 PM, Maria <[email protected]> wrote:

> Hello and good evening.
> I'm new to Nutch. I am using version 2.2.1 with MySQL as the datastore. I
> followed this tutorial: http://nlp.solutions.asia/?p=362#more-362. The
> first time I ran a crawl, it was a success. I started with this URL:
> nutch.apache.org. I could see the result in my database in Workbench. But
> when I tried a different URL, the crawls began to fail one after another.
>
> I have this in regex-urlfilter.txt:
>
> # accept anything else
> #+.
>
> +^http://([a-z0-9]*\.)* nutch.apache.org/
>
> #
> -.
>
> And in nutch-site.xml, I have:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
> <property>
>   <name>http.agent.name</name>
>   <value>Maria</value>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>Maria,*</value> ....
>   </description>
> </property>
>
> <property>
>   <name>http.accept.language</name>
>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>   <description>Value of the “Accept-Language” request header field.
>   This allows selecting non-English language as default one to retrieve.
>   It is a useful setting for search engines build for certain national group.
>   </description>
> </property>
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>utf-8</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
>
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.sql.store.SqlStore</value>
>   <description>The Gora DataStore class for storing and retrieving data.
>   Currently the following stores are available: ….
>   </description>
> </property>
>
> </configuration>
>
> First I started to look for a solution to this problem:
>
> InjectorJob: total number of urls rejected by filters: 2
> InjectorJob: total number of urls injected after normalization: 0
>
> But I didn't find anything that solves it. So now I'm searching for a
> solution to this exception in hadoop.log:
>
> java.lang.NullPointerException
>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>     at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I haven't found anything that solves this yet.
> I appreciate any suggestion.
> And thanks for reading.
>
> Maria *.*
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/InjectorJob-total-number-of-urls-injected-after-normalization-and-filtering-0-looking-for-solutions-tp4111993.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

