Re: MeCab UTF-8 Decoding Problem

Terry Reedy Sat, 29 Jun 2013 08:36:55 -0700

On 6/29/2013 10:02 AM, Dave Angel wrote:

On 06/29/2013 07:29 AM, [email protected] wrote:

Hi,


Using Python 2.7 on Linux, presumably?  It'd be better to be explicit.


I am trying to use a program called MeCab, which does syntax analysis
on Japanese text.

It is generally nice to give a link when asking about 3rd party software. https://code.google.com/p/mecab/

In this case, nearly all the non-boilerplate text is Japanese ;-(.

>> The problem I am having is that it returns a byte string

and the problem with bytes is that they can have any encoding.

In Python 2 (indicated by your print *statements*), a byte string is just a string.
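To illustrate the point about bytes having any encoding, here is a minimal sketch. The sample string and the three codecs are my choice for illustration, not something taken from MeCab. The same text becomes different bytes under each encoding, and decoding with the wrong codec fails or gives wrong characters.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u'日本語'  # sample text, chosen only for illustration

# Encode the same text with three codecs commonly used for Japanese.
encoded = {}
for name in ('utf-8', 'euc_jp', 'shift_jis'):
    encoded[name] = text.encode(name)
    print(name, len(encoded[name]), repr(encoded[name]))

# Bytes produced by one codec are generally not valid input for another.
try:
    encoded['euc_jp'].decode('utf-8')
except UnicodeDecodeError as exc:
    print('euc_jp bytes are not valid utf-8:', exc)
```

The bytes alone do not say which codec produced them; that information has to come from the library's documentation or configuration.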

and if I try to print it, it prints question marks for almost
all characters. However, if I try to use .decode, it throws an error.
Here is my code:


What do the MeCab docs say the tagger.parse byte string represents?
Maybe it's not text at all.  But surely it's not utf-8.


https://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
followed by Japanese.

#!/usr/bin/python
# -*- coding:utf-8 -*-

import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう！'


Parts of this appear in the output, as indicated by spaces.
'MeCabで遊 んで みよ う！'

result = tagger.parse(text)
print result

result = result.decode('utf-8')
print result

And here is the output:

MeCab �� �� ��んで�� �� ��う！

Python normally prints bytes with ascii chars representing either themselves or other values with hex escapes. This looks more like unicode sent to a terminal with a limited character set. I would add


print type(result)

to be sure.
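A slightly fuller version of that check, as a sketch. The value of `result` below is a stand-in I made up so the snippet runs without MeCab installed; it is not real MeCab output. repr() shows the raw bytes as escapes no matter what the terminal can display.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Stand-in for tagger.parse(text); NOT real MeCab output.
result = u'MeCab で 遊ん で'.encode('utf-8')

print(type(result))  # is it a byte string or unicode?
print(repr(result))  # non-ASCII bytes show up as \x.. escapes
print(len(result))   # counts bytes, not characters
```

Comparing the escapes against the bytes of the input text should show whether the library changed, split, or re-encoded anything.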

Traceback (most recent call last):
   File "test.py", line 11, in <module>
     result = result.decode('utf-8')
   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
invalid continuation byte


------------------
(program exited with code: 1)
Press return to continue
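For what it is worth, an error at exactly that position can be reproduced without MeCab. This is an illustration under an assumption, not a diagnosis: I assume here that something inserted a space in the middle of a multi-byte UTF-8 character. The name `damaged` is mine.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

raw = u'MeCabで'.encode('utf-8')  # 5 ASCII bytes + one 3-byte character

# Assumed damage: a space after "MeCab" and another after the first
# two bytes of the 3-byte character, splitting it in the middle.
damaged = raw[:5] + b' ' + raw[5:7] + b' ' + raw[7:]
print(repr(damaged))

error = None
try:
    damaged.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

If repr() of the real result shows spaces between the bytes of a single character, the segmenter is probably working with a different character encoding than the input.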

Also my terminal is able to display Japanese characters properly. For
example print '日本語' works perfectly fine.



--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list
