> Hi,
> 
> These days I have been following the Nutch tutorial:
> http://wiki.apache.org/nutch/NutchTutorial, but I always get the following
> error message:
> 
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt 
> TestCrawl http://localhost:8983/solr/ 2
> Injector: starting at 2013-04-23 22:00:46
> Injector: crawlDb: TestCrawl/crawldb
> Injector: urlDir: urls/seed.txt
> Injector: Converting injected urls to crawl db entries.
> 2013-04-23 22:00:46.562 java[1047:1903] Unable to load realm info from 
> SCDynamicStore
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2013-04-23 22:01:01, elapsed: 00:00:14
> Tue Apr 23 22:01:01 CST 2013 : Iteration 1 of 2
> Generating a new segment
> 2013-04-23 22:01:01.888 java[1055:1903] Unable to load realm info from 
> SCDynamicStore
> Generator: starting at 2013-04-23 22:01:02
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: Partitioning selected urls for politeness.
> Generator: segment: TestCrawl/segments/20130423220110
> Generator: finished at 2013-04-23 22:01:17, elapsed: 00:00:15
> Operating on segment : 
> drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetching : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetcher: Your 'http.agent.name' value should be listed first in 
> 'http.robots.agents' property.
> Fetcher: starting at 2013-04-23 22:01:18
> Fetcher: segment: 
> TestCrawl/segments/drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
> Fetcher Timelimit set for : 1366736478177
> 2013-04-23 22:01:18.308 java[1068:1903] Unable to load realm info from 
> SCDynamicStore
> Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: 
> drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
>       at org.apache.hadoop.fs.Path.initialize(Path.java:148)
>       at org.apache.hadoop.fs.Path.<init>(Path.java:126)
>       at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
>       at 
> org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
>       at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
>       at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
>       at 
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>       at 
> org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
>       at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
>       at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
>       at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
>       at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
>       at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>       at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
>       at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>       at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
>       at java.net.URI.checkPath(URI.java:1788)
>       at java.net.URI.<init>(URI.java:734)
>       at org.apache.hadoop.fs.Path.initialize(Path.java:145)
>       ... 30 more
> 
> All I did was follow the tutorial, as follows:
> 1. Download the Nutch binary from:
> http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip
> 2. Unzip it and change into the directory: apache-nutch-1.6
> 3. In my home directory I set JAVA_HOME in .bash_profile like this:
> JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
> export JAVA_HOME
> 4. Change the content of conf/nutch-site.xml to the following:
> <configuration>
>     <property>
>         <name>http.agent.name</name>
>         <value>NutchSpider</value>
>     </property>
> </configuration>
> 
> 5. In the directory apache-nutch-1.6, execute:
> mkdir -p urls
> cd urls
> touch seed.txt
> 6. Edit seed.txt so that it contains:
> http://nutch.apache.org/
> 7. Then edit the file conf/regex-urlfilter.txt and replace
> # accept anything else
> +.
> with
> +^http://([a-z0-9]*\.)*nutch.apache.org/
> 8. Finally, I run the command in the directory apache-nutch-1.6 as follows:
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt 
> TestCrawl http://localhost:8983/solr/ 2
> 
> 9. At the end, the error message shown above appears.
> 
> 
> Please help me solve this problem. Thank you very much.
> 
> My Java version:
> MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version
> java version "1.6.0_43"
> Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
> Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
> 
> Mac OS X version 10.7.5
> 
> 
> 
> Best Regards.
> --------------------------------------
> Maohua Liu
> Email: [email protected]
> 
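
One observation from the log, offered as a guess rather than a confirmed diagnosis: the segment name passed to the Fetcher is not a directory name. It reads like a whole line of `ls -l` output in which every space has become the letter n ("drwxr-xr-x  3 carya  staff ..." turning into "drwxr-xr-xnn3ncaryann..."). One way this can happen is a sed replacement that uses \n for a newline on a sed that does not treat it that way; whether the 1.6 bin/crawl script does exactly this is an assumption I have not verified. The sketch below shows a selection of the newest segment that avoids `ls -l` and sed altogether. The function names latest_segment and garbled_name are hypothetical helpers for illustration, not part of Nutch.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Nutch): print the name of the newest
# segment directory under <crawl_dir>/segments. Segment names are timestamps
# such as 20130423220110, so a numeric sort puts the newest one last.
latest_segment() {
  local crawl_dir="$1"
  # Plain `ls` prints one bare name per line when piped, so there is no
  # long-listing line to split and no sed newline escape to depend on.
  ls "$crawl_dir/segments" | grep -E '^[0-9]+$' | sort -n | tail -n 1
}

# Illustration of the suspected failure: turn every space of an `ls -l`
# style line into the letter n, which reproduces the name seen in the log.
garbled_name() {
  printf '%s\n' "$1" | tr ' ' 'n'
}
```

With the TestCrawl directory from the message above, `latest_segment TestCrawl` would be expected to print just the timestamp directory name, which is the form the Fetcher needs after "TestCrawl/segments/".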
