Hi,

I'm trying to crawl NSF ACADIS with nutch-selenium, and I've run into a
problem: *linkdb/current/part-00000/data does not exist.* I checked the
crawl directory while the crawl was running, and this file sometimes exists
and sometimes disappears, which is quite strange.

Another problem: when we crawl NSIDC ADE, we get a 403 Forbidden error.
Does this mean NSIDC ADE is blocking us?

The log for the first error is at the bottom of this email. Any help would
be appreciated.

Regards,
Shuo Li





LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
