On Sat, 29 Jun 2013 04:29:23 -0700, fobos3 wrote:

> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'

I see from below you are using Python 2.7. Here you are using a
byte-string rather than Unicode. The actual bytes that you get *may* be
indeterminate. I don't think that Python guarantees that just because the
source file is declared as UTF-8, that *implicit* encoding into bytes will
necessarily use UTF-8.

Even if it does, it is still better to use an explicit Unicode string, and
explicitly encode into bytes using whatever encoding MeCab expects you to
use, say:

    text = u'MeCabで遊んでみよう!'.encode('utf-8')

By the way, what makes you think that MeCab expects, and returns, text
encoded using UTF-8?


> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う!

MeCab has returned a bunch of bytes, representing some text in some
encoding. When you print those bytes, your terminal uses whatever its
default encoding is (probably UTF-8, on a Linux system) and tries to make
sense of the bytes, using � for any byte it cannot make sense of. This is
good evidence that MeCab is *not* actually using UTF-8.

And sure enough, when you try to decode it manually:

> Traceback (most recent call last):
>   File "test.py", line 11, in <module>
>     result = result.decode('utf-8')
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte

Assuming that the bytes being returned are *supposed* to be encoded in
UTF-8, it's possible that MeCab is simply buggy and cannot produce proper
UTF-8 encoded byte strings. This wouldn't surprise me -- after all, using
*byte strings* as non-ASCII text strongly suggests that the author doesn't
understand Unicode very well.

But perhaps more likely, MeCab isn't using UTF-8 at all. What does the
documentation say?

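One way to test the "not UTF-8" theory before digging through the
documentation is to try decoding the returned bytes with a few likely
encodings and see which ones survive. This is only a sketch:
candidate_encodings is a hypothetical helper, and the list of encodings
(UTF-8, EUC-JP, Shift_JIS) is my assumption about what a Japanese text tool
might plausibly use, not something taken from MeCab's documentation. A
successful decode does not prove an encoding is the right one, only that it
has not been ruled out.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def candidate_encodings(raw, encodings=('utf-8', 'euc_jp', 'shift_jis')):
    """Return the encodings from `encodings` that decode `raw` without error.

    `raw` must be a byte string. This only rules encodings out; it cannot
    prove which one the producer really used.
    """
    survivors = []
    for name in encodings:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


# Stand-in for the bytes a tool might return: the same text, encoded as EUC-JP.
sample = u'MeCabで遊んでみよう!'.encode('euc_jp')

print(repr(sample))                 # look at the raw bytes, not the glyphs
print(candidate_encodings(sample))  # UTF-8 should not be in this list
```

With real output you would pass the value returned by tagger.parse() in
place of sample. Printing repr() of the bytes is also worth doing on its
own, since it bypasses whatever the terminal does with them.
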
A third possibility is that the string you feed to MeCab is simply mangled
beyond recognition due to the way you create it using the implicit
encoding from chars to bytes. Change the line

    text = 'MeCab ...'

to use an explicit Unicode string and encode, as above, and maybe the
error will go away.


-- 
Steven
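
P.S. The explicit approach can be sketched as a small wrapper, so that the
encode and decode steps live in one place. parse_unicode is a hypothetical
helper, not part of MeCab. It assumes only that the parser takes a byte
string and returns a byte string in the same encoding, and the default of
UTF-8 is a guess that should be checked against how MeCab and its
dictionary were actually built.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def parse_unicode(parse, text, encoding='utf-8'):
    """Encode `text`, hand the bytes to `parse`, and decode what comes back.

    `parse` is any callable taking and returning a byte string, for
    example tagger.parse from the original post (an assumption about its
    behaviour, not a documented guarantee).
    """
    raw_in = text.encode(encoding)    # explicit: Unicode -> bytes
    raw_out = parse(raw_in)
    return raw_out.decode(encoding)   # explicit: bytes -> Unicode


def fake_parse(raw):
    """Stand-in for a real tagger: returns the bytes with a newline appended."""
    return raw + b'\n'


result = parse_unicode(fake_parse, u'MeCabで遊んでみよう!')
print(repr(result))
```

With the real thing you would pass tagger.parse in place of fake_parse. If
the decode step still raises UnicodeDecodeError, that points back at the
encoding MeCab is really using rather than at how the input was built.
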