Re: MapReduce with multi-languages
Mr. Taeho Kang,

I need to analyze text in different character encodings too, and I have suggested supporting encoding configuration in TextInputFormat:
https://issues.apache.org/jira/browse/HADOOP-3481

For now, though, I think you should convert the text file encoding to UTF-8.

Regards,

Taeho Kang:
Dear Hadoop User Group,

What are elegant ways to do mapred jobs on text-based data encoded with something other than UTF-8? It looks like Hadoop assumes the text data is always in UTF-8 and handles it that way - encoding with UTF-8 and decoding with UTF-8 - and whenever the data is not UTF-8 encoded, problems arise.

Here is what I'm thinking of to clear up the situation; correct and advise me if my approaches look bad!

(1) Re-encode the original data with UTF-8?
(2) Replace the parts of the source code where the UTF-8 encoder and decoder are used?

Or have any of you had trouble with running a map-red job on data in multiple languages? Any suggestions/advice are welcome and appreciated!

Regards,
Taeho

--
NOMURA Yoshihide:
    Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
    Tel: 044-754-2675 (Ext: 7106-6916)
    Fax: 044-754-2570 (Ext: 7108-7060)
    E-Mail: [EMAIL PROTECTED]
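Option (1), re-encoding the data to UTF-8 before the job runs, needs only plain JDK classes. A minimal sketch (the Recode class name and the MS932 default are assumptions for illustration, not part of Hadoop):

```java
import java.io.*;
import java.nio.charset.Charset;

// Re-encode a text file from a legacy charset (e.g. MS932) to UTF-8,
// so that TextInputFormat's UTF-8 assumption holds afterwards.
public class Recode {
    public static void recode(File in, File out, String fromCharset) throws IOException {
        try (Reader r = new InputStreamReader(new FileInputStream(in), Charset.forName(fromCharset));
             Writer w = new OutputStreamWriter(new FileOutputStream(out), Charset.forName("UTF-8"))) {
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf)) != -1) {
                w.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: Recode <input> <output> [sourceCharset]
        recode(new File(args[0]), new File(args[1]), args.length > 2 ? args[2] : "MS932");
    }
}
```

Running this over the input files once, before loading them into HDFS, sidesteps the encoding problem entirely.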
Re: Hadoop example in vista
Hello,

I think it is because of Vista's User Access Control (UAC), so you should start the command prompt as administrator:
http://www.mydigitallife.info/2007/02/17/how-to-open-elevated-command-prompt-with-administrator-privileges-in-windows-vista/

Regards,

Eason.Lee:
I'm running the Hadoop example in Vista with Cygwin. Everything seems OK in setup, but when I run the example the error below happens:

08/07/01 13:41:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of `D:\\tmp\\hadoop-Eason\\mapred\\system\\job_local_1': Permission denied
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:195)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:317)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:522)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:267)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:273)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:549)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:700)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at cyz.WordCount.run(WordCount.java:84)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at cyz.WordCount.main(WordCount.java:124)

Any suggestion will help. Thanks
Re: MapWritable as output value of Reducer
Hello Taran,

If you want to use MapWritable as the reducer's output value, like this class:

public class ReduceA implements Reducer<LongWritable, MapWritable, LongWritable, MapWritable>

then you can't use TextOutputFormat, because MapWritable doesn't override toString(). I think SequenceFileOutputFormat is more suitable.

If you want to chain the jobs, you should use SequenceFileInputFormat and SequenceFileOutputFormat like this:

JobConf confA = new JobConf(A.class);
confA.setJobName("A");
confA.setOutputKeyClass(LongWritable.class);
confA.setOutputValueClass(MapWritable.class);
confA.setMapperClass(MapA.class);
confA.setReducerClass(ReduceA.class);
confA.setInputFormat(TextInputFormat.class);
confA.setOutputFormat(SequenceFileOutputFormat.class);
confA.setInputPath(new Path("/inputA"));
confA.setOutputPath(new Path("/outputA"));
JobClient.runJob(confA);

JobConf confB = new JobConf(B.class);
confB.setJobName("B");
confB.setOutputKeyClass(LongWritable.class);
confB.setOutputValueClass(MapWritable.class);
confB.setMapperClass(MapB.class);
confB.setReducerClass(ReduceB.class);
confB.setInputFormat(SequenceFileInputFormat.class);
confB.setOutputFormat(SequenceFileOutputFormat.class);
confB.setInputPath(new Path("/outputA"));
confB.setOutputPath(new Path("/outputB"));
JobClient.runJob(confB);

Regards,

Tarandeep Singh:
hi,
Can I use MapWritable as an output value of a Reducer? If yes, how will the (key, value) pairs in the MapWritable object be written to the file? What output format should I use in this case?

Further, I want to chain the output of the first map-reduce job to another map-reduce job, so in the second job, what input format should I specify? Can I reconstruct the MapWritable objects in the mapper of the second job?

Thanks,
Taran
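On the reconstruction question: SequenceFileOutputFormat stores each value in binary form via its Writable write()/readFields() methods, which is why the MapWritable survives the hop between jobs intact and arrives in the second mapper already deserialized. A JDK-only sketch of that round trip (SimpleMapWritable is a hypothetical stand-in for illustration, not Hadoop's class):

```java
import java.io.*;
import java.util.*;

// Hypothetical stand-in for Hadoop's MapWritable, illustrating the
// write()/readFields() round trip that SequenceFile relies on.
public class SimpleMapWritable {
    final Map<Long, String> entries = new LinkedHashMap<>();

    // Serialize: entry count, then each (key, value) pair.
    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<Long, String> e : entries.entrySet()) {
            out.writeLong(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Deserialize: rebuild the map from the same byte layout.
    public void readFields(DataInput in) throws IOException {
        entries.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            entries.put(in.readLong(), in.readUTF());
        }
    }
}
```

Because the framework performs the readFields() call for you, the second job's mapper can simply iterate over the map's entries.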
Text file character encoding
Hello,

I'm using Hadoop 0.17.0 to analyze a large number of CSV files, and I need to read such files in character encodings other than UTF-8, but I think TextInputFormat doesn't support that. I guess the LineRecordReader class or the Text class should support an encoding setting like this:

conf.set("io.file.defaultEncoding", "MS932");

Is there any plan to support different character encodings in TextInputFormat?

Regards,
--
NOMURA Yoshihide:
    Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
    Tel: 044-754-2675 (Ext: 7112-6358)
    Fax: 044-754-2570 (Ext: 7112-3834)
    E-Mail: [EMAIL PROTECTED]
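To see why the built-in UTF-8 assumption breaks on such data, here is a small JDK-only demo decoding the same bytes with the wrong and the right charset (the byte values are "テスト" encoded in MS932; the Ms932Demo class name is illustrative):

```java
import java.nio.charset.Charset;

// Decoding identical raw bytes with different charsets.
public class Ms932Demo {
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "テスト" ("test") encoded with MS932 / Shift_JIS
        byte[] ms932 = {(byte)0x83, (byte)0x65, (byte)0x83, (byte)0x58, (byte)0x83, (byte)0x67};
        System.out.println(decode(ms932, "MS932"));  // correct text
        System.out.println(decode(ms932, "UTF-8"));  // mojibake: invalid sequences become replacement chars
    }
}
```

This is essentially the decode step a charset-aware LineRecordReader would have to perform on the raw line bytes before handing the value to the mapper.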