Hi Markus,

When I run this command:
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

I got an error. Here is the log:

2014-11-03 17:55:04,602 INFO indexer.IndexingJob - Indexer: starting at 2014-11-03 17:55:04
2014-11-03 17:55:04,652 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
2014-11-03 17:55:04,652 INFO indexer.IndexingJob - Indexer: URL filtering: false
2014-11-03 17:55:04,652 INFO indexer.IndexingJob - Indexer: URL normalizing: false
2014-11-03 17:55:04,860 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-11-03 17:55:04,861 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication
2014-11-03 17:55:04,865 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/indexes
2014-11-03 17:55:04,865 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/crawldb
2014-11-03 17:55:04,978 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/linkdb
2014-11-03 17:55:04,979 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103163424
2014-11-03 17:55:04,980 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103175027
2014-11-03 17:55:04,981 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103175109
2014-11-03 17:55:05,033 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
2014-11-03 17:55:05,110 ERROR security.UserGroupInformation - PriviledgedActionException as:me cause:org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
	at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
	at
org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

Please advise.

On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis <tru3....@gmail.com> wrote:

> Like this?
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>  <name>http.agent.name</name>
>  <value>My Nutch Spider</value>
> </property>
>
> <property>
>  <name>solr.server.url</name>
>  <value>http://localhost:8983/solr/</value>
> </property>
>
> </configuration>
>
>
> On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> You can set solr.server.url in your nutch-site.xml or pass it via command
>> line as -Dsolr.server.url=<URL>
>>
>>
>> -----Original message-----
>> > From:Muhamad Muchlis <tru3....@gmail.com>
>> > Sent: Monday 3rd November 2014 11:37
>> > To: user@nutch.apache.org
>> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
>> >
>> > Hi Markus,
>> >
>> > Where can I find the setting for the Solr URL (-D)?
>> >
>> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> >
>> > > Well, here it is:
>> > > java.lang.RuntimeException: Missing SOLR URL.
Should be set via -Dsolr.server.url
>> > >
>> > >
>> > > -----Original message-----
>> > > > From:Muhamad Muchlis <tru3....@gmail.com>
>> > > > Sent: Monday 3rd November 2014 10:58
>> > > > To: user@nutch.apache.org
>> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
>> > > >
>> > > > 2014-11-03 16:56:06,530 INFO indexer.IndexingJob - Indexer: starting at 2014-11-03 16:56:06
>> > > > 2014-11-03 16:56:06,582 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>> > > > 2014-11-03 16:56:06,582 INFO indexer.IndexingJob - Indexer: URL filtering: false
>> > > > 2014-11-03 16:56:06,582 INFO indexer.IndexingJob - Indexer: URL normalizing: false
>> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR URL. Should be set via -D solr.server.url
>> > > > SOLRIndexWriter
>> > > > solr.server.url : URL of the SOLR instance (mandatory)
>> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
>> > > > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>> > > > solr.auth : use authentication (default false)
>> > > > solr.auth.username : use authentication (default false)
>> > > > solr.auth : username for authentication
>> > > > solr.auth.password : password for authentication
>> > > >
>> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer: java.lang.RuntimeException: Missing SOLR URL.
Should be set via -D solr.server.url
>> > > > SOLRIndexWriter
>> > > > solr.server.url : URL of the SOLR instance (mandatory)
>> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
>> > > > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>> > > > solr.auth : use authentication (default false)
>> > > > solr.auth.username : use authentication (default false)
>> > > > solr.auth : username for authentication
>> > > > solr.auth.password : password for authentication
>> > > >
>> > > > at org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
>> > > > at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
>> > > > at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
>> > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
>> > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>> > > >
>> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> > > >
>> > > > > Hi - see the logs for more details.
>> > > > > Markus
>> > > > >
>> > > > > -----Original message-----
>> > > > > > From:Muhamad Muchlis <tru3....@gmail.com>
>> > > > > > Sent: Monday 3rd November 2014 9:15
>> > > > > > To: user@nutch.apache.org
>> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
>> > > > > >
>> > > > > > Hello.
>> > > > > >
>> > > > > > I get an error message when I run the command:
>> > > > > >
>> > > > > > crawl seed/seed.txt crawl -depth 3 -topN 5
>> > > > > >
>> > > > > > Error message:
>> > > > > >
>> > > > > > SOLRIndexWriter
>> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
>> > > > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
>> > > > > > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>> > > > > > solr.auth : use authentication (default false)
>> > > > > > solr.auth.username : use authentication (default false)
>> > > > > > solr.auth : username for authentication
>> > > > > > solr.auth.password : password for authentication
>> > > > > >
>> > > > > > Indexer: java.io.IOException: Job failed!
>> > > > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>> > > > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>> > > > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> > > > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>> > > > > >
>> > > > > > Can anyone explain why this happened?
>> > > > > >
>> > > > > > Best regards,
>> > > > > >
>> > > > > > M.Muchlis
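For reference, the argument mix-up the top log complains about can be sketched as follows. This is only a sketch, assuming the Nutch 1.9 `nutch index` usage (`index <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>)`), the `crawl/` layout shown in the thread, and the Solr URL from the quoted nutch-site.xml:

```shell
# Sketch, not verified against this installation. The log line
# "IndexerMapReduce: crawldb: crawl/indexes" shows the first positional
# argument is taken as the crawldb, so crawl/indexes was read as the
# crawldb and crawl/crawldb and crawl/linkdb were read as segments --
# hence the "Input path does not exist" errors for crawl_fetch,
# parse_data, etc. under those directories. Assuming the usage
#   index <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>)
# the invocation would instead look like:
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ \
  crawl/crawldb \
  -linkdb crawl/linkdb \
  -dir crawl/segments
```

With solr.server.url already set in nutch-site.xml, as in the quoted config, the `-D` override should not be needed; note that `-D` options are parsed before the positional arguments, so they go first.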