Hello and good evening.
I'm new to Nutch. I am using version 2.2.1 with MySQL as the datastore. I
followed this tutorial: http://nlp.solutions.asia/?p=362#more-362. The
first crawl I ran was a success. I started with this URL:
nutch.apache.org. I could see the results in my database in Workbench. But
when I tried a different URL, the crawls began to fail one after another.
I have this in regex-urlfilter.txt:
# accept anything else
#+.
+^http://([a-z0-9]*\.)* nutch.apache.org/
#
-.
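
For reference, here is a minimal Python sketch of how a filter file like the one above would treat a URL. It is not Nutch code. It assumes the rules are tried top to bottom, that the first rule whose pattern is found in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The names parse_rules and accepts, and the example.com URL, are made up for illustration. Under those assumptions, any seed outside nutch.apache.org falls through to the final "-." rule, which would be consistent with the "rejected by filters" count. Note also that the accept rule as posted has a space before nutch.apache.org; if that space is really in the file and is not just a line-wrap artifact of this message, even http://nutch.apache.org/ would be rejected.

```python
import re

# Filter rules as posted, with the space before "nutch.apache.org" removed.
RULES = r"""
# accept anything else
#+.
+^http://([a-z0-9]*\.)*nutch.apache.org/
#
-.
"""

# Same rules, but keeping the space exactly as it appears in the message.
RULES_WITH_SPACE = RULES.replace(")*nutch", ")* nutch")


def parse_rules(text):
    """Turn filter lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed: no matching rule means the URL is rejected


if __name__ == "__main__":
    for label, text in (("no space", RULES), ("with space", RULES_WITH_SPACE)):
        rules = parse_rules(text)
        for url in ("http://nutch.apache.org/", "http://www.example.com/"):
            print(label, url, "accepted" if accepts(rules, url) else "rejected")
```

To crawl a different site under these assumptions, the accept line would need to name that site's host (or the "+." line would need to be uncommented in place of the host-specific rule).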
And in nutch-site.xml, I have:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>Maria</value>
</property>
<property>
<name>http.robots.agents</name>
<value>Maria,*</value> ....
</description>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other
information is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
</configuration>
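
As a quick sanity check on a file like this, the following Python sketch parses a Hadoop-style configuration and lists its name/value pairs. It is only an illustration: read_properties and SAMPLE are hypothetical names, and SAMPLE contains just two of the properties shown above, not the full file. A file with an unbalanced tag (for example a closing description tag without an opening one) makes the parser raise an error instead of returning properties.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample with two of the properties from the message.
SAMPLE = """<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>Maria</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the text."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


if __name__ == "__main__":
    for name, value in read_properties(SAMPLE).items():
        print(name, "=", value)
```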
First I looked for a solution to this problem:
InjectorJob: total number of urls rejected by filters: 2
InjectorJob: total number of urls injected after normalization: 0
But I didn't find anything that solved it. So now I'm searching for a
solution to this exception, which appears in hadoop.log:
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
I haven't found anything that solves this one yet either.
I would appreciate any suggestions.
Thanks for reading.
Maria *.*