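Start by changing the line '-.' in regex-urlfilter.txt to '+.', or provide a
regex that matches your domain. As quoted, your filter accepts only
nutch.apache.org and the final '-.' rejects every other URL, which is why the
injector reports your seed URLs as rejected by filters.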
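
To make the filter behaviour concrete, here is a minimal sketch in Python (not
Nutch's own code). It assumes the usual regex-urlfilter semantics: rules are
tried top to bottom, the first pattern found anywhere in the URL decides, '+'
accepts, '-' rejects, and a URL that matches no rule is rejected. The names
load_rules and url_filter and the example.com rule are hypothetical, and
Python's re only approximates Java regex syntax.

```python
import re

def load_rules(text):
    """Parse regex-urlfilter.txt style lines into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def url_filter(url, rules):
    """Return the URL if the first matching rule accepts it, else None."""
    for accept, pattern in rules:
        if pattern.search(url):  # unanchored search, assumed to mirror Nutch
            return url if accept else None
    return None  # no rule matched: treated as rejected

# Rules as quoted in the original message (only nutch.apache.org is allowed).
QUOTED_RULES = r"""
# accept anything else
#+.
+^http://([a-z0-9]*\.)* nutch.apache.org/
-.
"""

# Option 1: accept everything.
ACCEPT_ALL = "+.\n"

# Option 2: accept one domain (example.com is a placeholder), reject the rest.
DOMAIN_RULES = r"""
+^https?://([a-z0-9-]+\.)*example\.com/
-.
"""

for name, text in [("quoted", QUOTED_RULES), ("accept all", ACCEPT_ALL),
                   ("domain", DOMAIN_RULES)]:
    result = url_filter("http://www.example.com/page", load_rules(text))
    print(name, "->", "accepted" if result else "rejected")
```

With the quoted rules, any seed outside nutch.apache.org falls through to
'-.' and is dropped before injection. Either replacement rule set lets the
placeholder domain through.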


On Fri, Jan 17, 2014 at 11:54 PM, Maria <[email protected]> wrote:

> Hello and good evening.
> I'm new to Nutch. I am using version 2.2.1 with MySQL as the datastore. I
> followed this tutorial: http://nlp.solutions.asia/?p=362#more-362. The
> first time I ran a crawl, it succeeded. I started with this URL:
> nutch.apache.org. I could see the results in my database in Workbench. But
> when I tried a different URL, the crawls began to fail one after another.
>
> I have this in regex-urlfilter.txt:
>
> # accept anything else
> #+.
>
> +^http://([a-z0-9]*\.)* nutch.apache.org/
>
> #
> -.
>
> And in nutch-site.xml, I have:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
>
>
> <configuration>
> <property>
> <name>http.agent.name</name>
> <value>Maria</value>
> </property>
>
> <property>
> <name>http.robots.agents</name>
> <value>Maria,*</value> ....
> </description>
> </property>
>
> <property>
> <name>http.accept.language</name>
> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> <description>Value of the “Accept-Language” request header field.
> This allows selecting non-English language as default one to retrieve.
> It is a useful setting for search engines build for certain national group.
> </description>
> </property>
>
> <property>
> <name>parser.character.encoding.default</name>
> <value>utf-8</value>
> <description>The character encoding to fall back to when no other
> information is available</description>
> </property>
>
> <property>
> <name>storage.data.store.class</name>
> <value>org.apache.gora.sql.store.SqlStore</value>
> <description>The Gora DataStore class for storing and retrieving data.
> Currently the following stores are available: ….
> </description>
> </property>
>
> </configuration>
>
> First I looked for a solution to this problem:
>
> InjectorJob: total number of urls rejected by filters: 2
> InjectorJob: total number of urls injected after normalization: 0
>
> But I didn't find anything that solved it. So now I'm looking for a solution
> to this exception, which appears in hadoop.log:
>
> java.lang.NullPointerException
>         at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>         at
> org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>         at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I haven't found anything that solves this yet.
> I would appreciate any suggestions.
> Thanks for reading.
>
> Maria *.*
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/InjectorJob-total-number-of-urls-injected-after-normalization-and-filtering-0-looking-for-solutions-tp4111993.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
