Re: MapReduce with multi-languages

2008-07-10 Thread NOMURA Yoshihide

Mr. Taeho Kang,

I need to analyze text in different character encodings too,
and I have suggested supporting an encoding setting in TextInputFormat:

https://issues.apache.org/jira/browse/HADOOP-3481

But for now, I think you should convert the text files to UTF-8 before
running your jobs.
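
For example, here is a minimal conversion sketch in Java (the class
name ToUtf8 and the "MS932" source encoding are just examples for
illustration):

 import java.io.*;

 public class ToUtf8 {
     // Usage: java ToUtf8 <src> <dest> <srcEncoding>   e.g. MS932
     public static void main(String[] args) throws IOException {
         BufferedReader in = new BufferedReader(
             new InputStreamReader(new FileInputStream(args[0]), args[2]));
         Writer out = new BufferedWriter(
             new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8"));
         String line;
         while ((line = in.readLine()) != null) {
             out.write(line);
             out.write('\n'); // normalizes line endings to \n
         }
         in.close();
         out.close();
     }
 }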

Regards,

Taeho Kang:

Dear Hadoop User Group,

What are elegant ways to do mapred jobs on text-based data encoded with
something other than UTF-8?

It looks like Hadoop assumes text data is always in UTF-8 and handles
it that way, encoding and decoding everything with UTF-8. Whenever the
data is not UTF-8 encoded, problems arise.

Here is what I'm thinking of doing to resolve the situation; please
correct and advise me if my approaches look bad!

(1) Re-encode the original data as UTF-8?
(2) Replace the parts of the source code where the UTF-8 encoder and
decoder are used?

Or have any of you had trouble running map-red jobs on data in
multiple languages?

Any suggestions or advice are welcome and appreciated!

Regards,

Taeho



--
NOMURA Yoshihide:
Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
Tel: 044-754-2675 (Ext: 7106-6916)
Fax: 044-754-2570 (Ext: 7108-7060)
E-Mail: [EMAIL PROTECTED]



Re: Hadoop example in vista

2008-07-01 Thread NOMURA Yoshihide

Hello,

I think it is because of Vista's User Account Control (UAC), so you
should start the command prompt as an administrator.


http://www.mydigitallife.info/2007/02/17/how-to-open-elevated-command-prompt-with-administrator-privileges-in-windows-vista/

Regards,

Eason.Lee:

I'm running the Hadoop example on Vista with Cygwin.
Everything seemed fine during setup, but when I run the example,
the error below occurs.

08/07/01 13:41:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of `D:\\tmp\\hadoop-Eason\\mapred\\system\\job_local_1': Permission denied
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:195)
 at org.apache.hadoop.util.Shell.run(Shell.java:134)
 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
 at org.apache.hadoop.util.Shell.execCommand(Shell.java:317)
 at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:522)
 at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
 at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:267)
 at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:273)
 at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:549)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:700)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
 at cyz.WordCount.run(WordCount.java:84)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at cyz.WordCount.main(WordCount.java:124)
Any suggestion will help.
Thanks



--
NOMURA Yoshihide:
Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
Tel: 044-754-2675 (Ext: 7112-6358)
Fax: 044-754-2570 (Ext: 7112-3834)
E-Mail: [EMAIL PROTECTED]



Re: MapWritable as output value of Reducer

2008-06-05 Thread NOMURA Yoshihide

Hello Taran,

If you want to use MapWritable as the reducer's output value, with a
reducer like this,

  public class ReduceA implements
      Reducer<LongWritable, MapWritable, LongWritable, MapWritable>

you can't use TextOutputFormat, because MapWritable doesn't override
toString(), so the values would not be written in a readable form.

I think SequenceFileOutputFormat is more suitable.

If you want to chain the jobs, you should use SequenceFileInputFormat
and SequenceFileOutputFormat, like this:


 // Job A: reads text input, writes <LongWritable, MapWritable>
 // pairs to a SequenceFile under /outputA.
 JobConf confA = new JobConf(A.class);
 confA.setJobName("A");
 confA.setOutputKeyClass(LongWritable.class);
 confA.setOutputValueClass(MapWritable.class);
 confA.setMapperClass(MapA.class);
 confA.setReducerClass(ReduceA.class);
 confA.setInputFormat(TextInputFormat.class);
 confA.setOutputFormat(SequenceFileOutputFormat.class);
 confA.setInputPath(new Path("/inputA"));
 confA.setOutputPath(new Path("/outputA"));
 JobClient.runJob(confA);

 // Job B: reads Job A's SequenceFile output, so its mapper receives
 // the MapWritable values directly.
 JobConf confB = new JobConf(B.class);
 confB.setJobName("B");
 confB.setOutputKeyClass(LongWritable.class);
 confB.setOutputValueClass(MapWritable.class);
 confB.setMapperClass(MapB.class);
 confB.setReducerClass(ReduceB.class);
 confB.setInputFormat(SequenceFileInputFormat.class);
 confB.setOutputFormat(SequenceFileOutputFormat.class);
 confB.setInputPath(new Path("/outputA"));
 confB.setOutputPath(new Path("/outputB"));
 JobClient.runJob(confB);
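
And to answer your last question: yes, you can. With
SequenceFileInputFormat the mapper of the second job receives the
MapWritable values already reconstructed. A minimal sketch of what
MapB could look like (here it just passes the pairs through
unchanged; adapt as needed):

 import java.io.IOException;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.MapWritable;
 import org.apache.hadoop.mapred.*;

 public class MapB extends MapReduceBase
     implements Mapper<LongWritable, MapWritable, LongWritable, MapWritable> {
   public void map(LongWritable key, MapWritable value,
       OutputCollector<LongWritable, MapWritable> output, Reporter reporter)
       throws IOException {
     // "value" is already a fully deserialized MapWritable; read its
     // entries with value.get(someKey), value.entrySet(), etc.
     output.collect(key, value);
   }
 }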

Regards,

Tarandeep Singh:

hi,

Can I use MapWritable as an output value of a Reducer?

If yes, how will the (key, value) pairs in the MapWritable object be
written to the file? What output format should I use in this case?

Further, I want to chain the output of the first map-reduce job into a
second map-reduce job; what input format should I specify in the
second job?

Can I reconstruct the MapWritable objects in the mapper of the second job?

Thanks,
Taran



--
NOMURA Yoshihide:
Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
Tel: 044-754-2675 (Ext: 7112-6358)
Fax: 044-754-2570 (Ext: 7112-3834)
E-Mail: [EMAIL PROTECTED]



Text file character encoding

2008-06-01 Thread NOMURA Yoshihide
Hello,
I'm using Hadoop 0.17.0 to analyze a large number of CSV files.

And I need to read such files in character encodings other than UTF-8,
but I think TextInputFormat doesn't support that.

I guess the LineRecordReader class or the Text class should support an
encoding setting like this:
 conf.set("io.file.defaultEncoding", "MS932");

Is there any plan to support different character encodings in
TextInputFormat?
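
In the meantime, a possible workaround is to re-decode the raw bytes
of each Text value in the mapper. I believe LineRecordReader passes
the line bytes through unmodified, and MS932 multi-byte characters
never contain the newline byte 0x0A, so splitting on lines should
still be safe. A rough sketch (the class name Ms932LineMapper is just
an example):

 import java.io.IOException;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.*;

 public class Ms932LineMapper extends MapReduceBase
     implements Mapper<LongWritable, Text, LongWritable, Text> {
   public void map(LongWritable key, Text value,
       OutputCollector<LongWritable, Text> output, Reporter reporter)
       throws IOException {
     // Decode the raw line bytes with the real source encoding
     // instead of assuming UTF-8.
     String line = new String(value.getBytes(), 0, value.getLength(), "MS932");
     output.collect(key, new Text(line)); // Text re-encodes as UTF-8
   }
 }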

Regards,
-- 
NOMURA Yoshihide:
Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
Tel: 044-754-2675 (Ext: 7112-6358)
Fax: 044-754-2570 (Ext: 7112-3834)
E-Mail: [EMAIL PROTECTED]