In local mode, the Hadoop jars are on the classpath of Nutch jobs (see runtime/local/lib/hadoop-core-1.2.0.jar). The default value of 'fs.default.name' is picked up from Hadoop's FileSystem class (see line 132 in [0]).
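To see why a scheme-less path ends up on the local file system, here is a minimal sketch of the fallback behaviour: if a path carries no scheme (no "hdfs://" or "file://" prefix), it is resolved against whatever 'fs.default.name' says. This is not Hadoop's actual code; the class name and constant below are made up for illustration, with the constant set to Hadoop's documented default of "file:///".

```java
import java.net.URI;

// Illustrative sketch only -- not Hadoop's real FileSystem code.
// Shows how a path without a scheme falls back to fs.default.name.
public class DefaultFsSketch {

    // Hadoop's documented default for fs.default.name (local mode).
    static final String FS_DEFAULT_NAME = "file:///";

    static String resolve(String path) {
        URI uri = URI.create(path);
        if (uri.getScheme() == null) {
            // No scheme given: resolve against the configured default filesystem.
            return URI.create(FS_DEFAULT_NAME).resolve(uri).toString();
        }
        // Fully-qualified path: the scheme wins, config is ignored.
        return uri.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("/crawl/segments"));         // file:///crawl/segments
        System.out.println(resolve("hdfs://namenode:9000/x"));  // hdfs://namenode:9000/x
    }
}
```

So the same `new Path(segment, ...)` call lands on the local disk or in HDFS depending only on the configured default filesystem, which is exactly the difference between Nutch's local runtime and a job submitted through a cluster's `hadoop` command.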
[0] : http://svn.apache.org/viewvc/hadoop/common/branches/branch-1.2/src/core/org/apache/hadoop/fs/FileSystem.java?view=markup

On Fri, Jan 3, 2014 at 10:22 AM, Bin Wang <binwang...@gmail.com> wrote:

> Hi Tejas,
>
> Thanks a lot for your response. Now I completely understand how the
> WordCount example reads the path as an HDFS path: the `hadoop` command is
> used to run WordCount.jar, and the `hadoop` configuration says:
>
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost:9000</value>
>   </property>
> </configuration>
> ...
>
> However, Nutch 1.7 can be installed without Hadoop preinstalled. Where
> does Nutch read the filesystem configuration? There is no core-site.xml
> for Nutch, is there? So does it default to local?
>
> /usr/bin
>
>
> On Thu, Jan 2, 2014 at 10:02 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:
>
>> The config 'fs.default.name' in core-site.xml is what makes this happen.
>> Its default value is "file:///", which corresponds to the local mode of
>> Hadoop. In local mode, Hadoop looks for paths on the local file system.
>> In distributed mode, 'fs.default.name' would be "hdfs://IP_OF_NAMENODE/"
>> and Hadoop would look for those paths in HDFS.
>>
>> Thanks,
>> Tejas
>>
>>
>> On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang <binwang...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> While going through the source code of Nutch, I looked at the
>>> ParseSegment class, whose job is to "parse content in a segment". Here
>>> is its MapReduce job configuration:
>>>
>>> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>>> (Lines 199 - 213)
>>>
>>> 199  JobConf job = new NutchJob(getConf());
>>> 200  job.setJobName("parse " + segment);
>>> 201
>>> 202  FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
>>> 203  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
>>> 204  job.setInputFormat(SequenceFileInputFormat.class);
>>> 205  job.setMapperClass(ParseSegment.class);
>>> 206  job.setReducerClass(ParseSegment.class);
>>> 207
>>> 208  FileOutputFormat.setOutputPath(job, segment);
>>> 209  job.setOutputFormat(ParseOutputFormat.class);
>>> 210  job.setOutputKeyClass(Text.class);
>>> 211  job.setOutputValueClass(ParseImpl.class);
>>> 212
>>> 213  JobClient.runJob(job);
>>>
>>> Here, in lines 202 and 208, the MapReduce input/output paths are
>>> configured by calling FileInputFormat.addInputPath and
>>> FileOutputFormat.setOutputPath, and the path used is an absolute path
>>> on the Linux file system rather than an HDFS path.
>>>
>>> On the other hand, consider the WordCount example from the Hadoop
>>> tutorial:
>>> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
>>> (Lines 39 - 55)
>>>
>>> 39.  JobConf conf = new JobConf(WordCount.class);
>>> 40.  conf.setJobName("wordcount");
>>> 41.
>>> 42.  conf.setOutputKeyClass(Text.class);
>>> 43.  conf.setOutputValueClass(IntWritable.class);
>>> 44.
>>> 45.  conf.setMapperClass(Map.class);
>>> 46.  conf.setCombinerClass(Reduce.class);
>>> 47.  conf.setReducerClass(Reduce.class);
>>> 48.
>>> 49.  conf.setInputFormat(TextInputFormat.class);
>>> 50.  conf.setOutputFormat(TextOutputFormat.class);
>>> 51.
>>> 52.  FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>> 53.  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>> 54.
>>> 55.  JobClient.runJob(conf);
>>>
>>> Here, the input/output paths are configured in the same way as in
>>> Nutch, but the paths come from the command-line arguments:
>>>
>>> bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
>>>
>>> And the paths passed to the program are actually HDFS paths, not Linux
>>> OS paths. I am confused: is there some other configuration I missed
>>> that leads to the different run environments? In which cases should I
>>> pass an absolute local path, and in which an HDFS path?
>>>
>>> Thanks a lot!
>>>
>>> /usr/bin
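For reference, the same job code would read and write those paths in HDFS if the configuration pointed the default filesystem at a namenode. A minimal core-site.xml fragment would look like the following (the hostname and port are placeholders, not values from this thread):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- placeholder namenode address; adjust to your cluster -->
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
```

With this in place, `/usr/joe/wordcount/input` resolves inside HDFS; with the default "file:///" it resolves on the local disk, which is why Nutch's local runtime needs no HDFS at all.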