RE: MapReduce with multi-languages

2008-07-11 Thread Koji Noguchi
Hi.

Asked Runping about this.
Here's his reply.

Koji 


=
On 7/10/08 11:16 PM, Koji Noguchi [EMAIL PROTECTED] wrote:
  Runping,
  
  Can they use the Buffer class?
  
  Koji

Yes, use Buffer or BytesWritable for the key/value classes.
But the critical point is to implement their own RecordReader/InputFormat
classes.
Runping

=
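Runping's advice above can be sketched in plain Java: if records arrive as raw bytes (e.g. BytesWritable) rather than Text, the custom RecordReader or the map function can decode them with whatever charset the input is known to be in. This is only an illustration of the charset-handling step using the JDK's `java.nio.charset` API; the Hadoop InputFormat/RecordReader wiring is omitted, and the "EUC-KR" charset is an assumed example, not something from the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch of the decode step a custom RecordReader (or the map function)
// could apply when values are passed around as raw bytes rather than Text.
// Only the charset handling is shown; the Hadoop wiring itself is omitted.
public class RawRecordDecoder {

    // Decode one raw record using the charset the input data is known
    // to be in (in a real job, read the name from the job configuration).
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Re-encode the decoded record as UTF-8 so downstream stages can
    // keep using UTF-8-based Text values.
    public static byte[] toUtf8(byte[] raw, String charsetName) {
        return decode(raw, charsetName).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Example: bytes that arrived in EUC-KR, decoded correctly.
        byte[] eucKr = "한국어 텍스트".getBytes(Charset.forName("EUC-KR"));
        System.out.println(decode(eucKr, "EUC-KR"));
    }
}
```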

-Original Message-
From: NOMURA Yoshihide [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 10, 2008 10:36 PM
To: core-user@hadoop.apache.org
Subject: Re: MapReduce with multi-languages

Mr. Taeho Kang,

I need to analyze text in different character encodings too, and I have
suggested supporting an encoding configuration option in TextInputFormat:

https://issues.apache.org/jira/browse/HADOOP-3481

For now, though, I think you should convert your text files to UTF-8
before running the job.
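The convert-first approach is a small one-off pass outside Hadoop. A minimal sketch using only the JDK, assuming the input is Shift_JIS (the source charset and the file paths are placeholders, not details from this thread):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// One-off pre-processing pass: re-encode an input file to UTF-8 so the
// stock TextInputFormat can read it. Fine for files that fit in memory;
// stream the conversion for anything larger.
public class ReencodeToUtf8 {

    // Core conversion: decode with the source charset, encode as UTF-8.
    public static byte[] reencode(byte[] raw, Charset source) {
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void convertFile(Path in, Path out, Charset source) throws IOException {
        Files.write(out, reencode(Files.readAllBytes(in), source));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ReencodeToUtf8 <in> <out>
        // Source charset is assumed to be Shift_JIS for this sketch.
        convertFile(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("Shift_JIS"));
    }
}
```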

Regards,

Taeho Kang:
 Dear Hadoop User Group,
 
 What are elegant ways to run map-red jobs on text data encoded in
 something other than UTF-8?
 
 It looks like Hadoop assumes text data is always in UTF-8 and handles it
 that way - encoding with UTF-8 and decoding with UTF-8. Whenever the
 data is not UTF-8 encoded, problems arise.
 
 Here is what I'm thinking of to clear up the situation; correct and
 advise me if my approaches look bad!
 
 (1) Re-encode the original data as UTF-8?
 (2) Replace the parts of the source code where the UTF-8 encoder and
 decoder are used?
 
 Or has anyone here had trouble running map-red jobs on multi-language
 data?
 
 Any suggestions/advice are welcome and appreciated!
 
 Regards,
 
 Taeho
 
 

-- 
NOMURA Yoshihide:
 Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
 Tel: 044-754-2675 (Ext: 7106-6916)
 Fax: 044-754-2570 (Ext: 7108-7060)
 E-Mail: [EMAIL PROTECTED]


