Oh - if you need to index multiple segments, don't use segments/* but -dir segments/
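
Judging from the log, the remaining arguments are also shifted by one: the indexer is reading crawl/indexes as the crawldb and treating crawl/crawldb and crawl/linkdb as segments, which is why it complains about crawl/crawldb/crawl_fetch and friends. As a rough sketch (assuming the directory layout from your log and solr.server.url set in nutch-site.xml), the call would look like:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/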
 
 
-----Original message-----
> From:Muhamad Muchlis <tru3....@gmail.com>
> Sent: Monday 3rd November 2014 12:00
> To: user@nutch.apache.org
> Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> 
> Hi Markus,
> 
> When I run this command:
> 
> nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> 
> 
> 
> I got an error. Here is the log:
> 
> 2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
> 2014-11-03 17:55:04
> 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting gone
> documents: false
> 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL filtering:
> false
> 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
> normalizing: false
> 2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
> SOLRIndexWriter
> solr.server.url : URL of the SOLR instance (mandatory)
> solr.commit.size : buffer size when sending to SOLR (default 1000)
> solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
> solr.auth : use authentication (default false)
> solr.auth.username : use authentication (default false)
> solr.auth : username for authentication
> solr.auth.password : password for authentication
> 
> 
> 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: crawl/indexes
> 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/crawldb
> 2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/linkdb
> 2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20141103163424
> 2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20141103175027
> 2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20141103175109
> 2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
> PriviledgedActionException as:me
> cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> 2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> Input path does not exist:
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> 
> Please advise.
> 
> 
> On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis <tru3....@gmail.com> wrote:
> 
> > Like this?
> >
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <!-- Put site-specific property overrides in this file. -->
> >
> > <configuration>
> >
> > <property>
> >  <name>http.agent.name</name>
> >  <value>My Nutch Spider</value>
> > </property>
> >
> > <property>
> >  <name>solr.server.url</name>
> >  <value>http://localhost:8983/solr/</value>
> > </property>
> >
> >
> > </configuration>
> >
> >
> > On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> You can set solr.server.url in your nutch-site.xml or pass it via command
> >> line as -Dsolr.server.url=<URL>
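> >>
> >> For example (just a sketch - assuming Solr runs at the default location, adjust the URL to your instance):
> >>
> >> bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ <crawldb> -dir <segments_dir>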
> >>
> >>
> >>
> >> -----Original message-----
> >> > From:Muhamad Muchlis <tru3....@gmail.com>
> >> > Sent: Monday 3rd November 2014 11:37
> >> > To: user@nutch.apache.org
> >> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> >> >
> >> > Hi Markus,
> >> >
> >> > Where can I find the solr url setting (-D)?
> >> >
> >> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <
> >> markus.jel...@openindex.io>
> >> > wrote:
> >> >
> >> > > Well, here it is:
> >> > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> >> > > -Dsolr.server.url
> >> > >
> >> > >
> >> > >
> >> > > -----Original message-----
> >> > > > From:Muhamad Muchlis <tru3....@gmail.com>
> >> > > > Sent: Monday 3rd November 2014 10:58
> >> > > > To: user@nutch.apache.org
> >> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> >> > > >
> >> > > > 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
> >> starting at
> >> > > > 2014-11-03 16:56:06
> >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer:
> >> deleting
> >> > > gone
> >> > > > documents: false
> >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> >> > > filtering:
> >> > > > false
> >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> >> > > > normalizing: false
> >> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR
> >> URL.
> >> > > > Should be set via -D solr.server.url
> >> > > > SOLRIndexWriter
> >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> >> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> >> > > > solr.mapping.file : name of the mapping file for fields (default
> >> > > > solrindex-mapping.xml)
> >> > > > solr.auth : use authentication (default false)
> >> > > > solr.auth.username : use authentication (default false)
> >> > > > solr.auth : username for authentication
> >> > > > solr.auth.password : password for authentication
> >> > > >
> >> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
> >> > > > java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
> >> > > > solr.server.url
> >> > > > SOLRIndexWriter
> >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> >> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> >> > > > solr.mapping.file : name of the mapping file for fields (default
> >> > > > solrindex-mapping.xml)
> >> > > > solr.auth : use authentication (default false)
> >> > > > solr.auth.username : use authentication (default false)
> >> > > > solr.auth : username for authentication
> >> > > > solr.auth.password : password for authentication
> >> > > >
> >> > > > at
> >> > > >
> >> > >
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
> >> > > > at
> >> > > >
> >> > >
> >> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
> >> > > > at
> >> org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
> >> > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
> >> > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> >> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> >> > > >
> >> > > >
> >> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <
> >> > > markus.jel...@openindex.io>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi - see the logs for more details.
> >> > > > > Markus
> >> > > > >
> >> > > > > -----Original message-----
> >> > > > > > From:Muhamad Muchlis <tru3....@gmail.com>
> >> > > > > > Sent: Monday 3rd November 2014 9:15
> >> > > > > > To: user@nutch.apache.org
> >> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
> >> > > > > >
> >> > > > > > Hello.
> >> > > > > >
> >> > > > > > I get an error message when I run the command:
> >> > > > > >
> >> > > > > > crawl seed/seed.txt crawl -depth 3 -topN 5
> >> > > > > >
> >> > > > > >
> >> > > > > > Error Message :
> >> > > > > >
> >> > > > > > SOLRIndexWriter
> >> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
> >> > > > > > solr.commit.size : buffer size when sending to SOLR (default
> >> 1000)
> >> > > > > > solr.mapping.file : name of the mapping file for fields (default
> >> > > > > > solrindex-mapping.xml)
> >> > > > > > solr.auth : use authentication (default false)
> >> > > > > > solr.auth.username : use authentication (default false)
> >> > > > > > solr.auth : username for authentication
> >> > > > > > solr.auth.password : password for authentication
> >> > > > > >
> >> > > > > >
> >> > > > > > Indexer: java.io.IOException: Job failed!
> >> > > > > > at
> >> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> >> > > > > > at
> >> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> >> > > > > > at
> >> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> >> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> > > > > > at
> >> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> >> > > > > >
> >> > > > > >
> >> > > > > > Can anyone explain why this happened?
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > Best regards,
> >> > > > > >
> >> > > > > > M.Muchlis
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >>
> >
> >
> 
