On Wed, 27 Dec 2006, Yura Smolsky wrote:

s = u"some unicode"
summary = s.encode("utf8")

That's what I do for all strings. I have indexed strings this way without
problems for a long time, but for the particular data below it raises
InvalidArgsError.

Why not pass unicode to PyLucene directly?
If you pass a regular, utf-8 encoded string, PyLucene has to convert it back to unicode anyway, since unicode strings are all Java uses.
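
A minimal sketch of that suggestion, assuming the summary arrives as a UTF-8 byte string: decode it once at the boundary and hand the resulting unicode object to Field. The helper name to_text is hypothetical, and the Field call appears only as a comment (copied from the traceback below), since running it needs a PyLucene install.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings if needed.

    to_text is a hypothetical helper, not part of PyLucene.
    """
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the bytes are not valid in `encoding`.
        return value.decode(encoding)
    return value


# First six bytes of the summary from the log, decoded before indexing.
summary = to_text(b"\xe7\xb4\xa0\xe6\x95\xb5")

# With PyLucene available, the unicode object would then be passed directly:
# doc.add(Field("summary", summary, Field.Store.YES, Field.Index.TOKENIZED))
```

If the decode itself fails, the data was not UTF-8 to begin with, and that is easier to diagnose in Python than behind an InvalidArgsError.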

Andi..


Sometimes I get a weird exception for unicode data, in this case
Japanese text. Here is an entry from the log:

2006-12-27 11:01:18,541 ERROR
Traceback (most recent call last):
 File "/home/search/lib/index/Index.py", line 91, in indexDocument
   doc.add(Field("summary", fields['summary'], Field.Store.YES, 
Field.Index.TOKENIZED))
InvalidArgsError: (<type 'PyLucene.Field'>, '__init__', ('summary', 
'\xe7\xb4\xa0\xe6\x95\xb5\xe3\x81\xaa\xe3\x82\xaf\xe3\x83\
xaa\xe3\x82\xb9\xe3\x83\x9e\xe3\x82\xb9\xe3\x83\x97\xe3\x83\xac\xe3\x82\xbc\xe3\x83\xb3\xe3\x83\x88\xe3\x81\x8c\xe5\xb1\x8a\xe
3\x81\x8d\xe3\x81\xbe\xe3\x81\x97\xe3\x81\x9f\xef\xa3\xa6 
\xe3\x83\xab\xe3\x82\xa4\xe3\x82\xb5\xe3\x83\xb3\xe3\x82\xbf\xe3\x81
\x95\xe3\x82\x93\xe3\x81\x8b\xe3\x82\x89\xef\xa6\xa8 
\xe3\x81\x84\xe3\x81\x88\xe3\x81\x84\xe3\x81\x88\xe3\x80\x82\xef\xbc\x91\
xe5\xb9\xb4\xe9\xa0\x91\xe5\xbc\xb5\xe3\x81\xa3\xe3\x81\x9f\xe3\x80\x8c\xe8\x87\xaa\xe5\x88\x86\xe3\x80\x8d\xe3\x81\x8b\xe3\x8
2\x89\xe3\x80\x8c\xe8\x87\xaa\xe5\x88\x86\xe3\x80\x8d\xe3\x81\xab\xe3\x80\x82\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xef\xbc\x88\
xe7\xac\x91\xef\xbc\x89 
\xe3\x81\x9d\xe3\x82\x8c\xe3\x82\x82\xe3\x80\x8e\xe8\xa6\xaa\xe3\x81\xb0\xe3\x81\x8b\xe3\x82\xb0\xe3\x
83\x83\xe3\x82\xba\xe3\x80\x8f\xf0\x95\xbe\xb9', <Field_Store: YES>, <Field_Index: 
TOKENIZED>))

What is actually wrong with the parameters?

AV> Dunno, it could be a problem with converting to Unicode ?

AV> It looks like the argument is a regular python string instance, not a
AV> unicode string instance. Because Java uses only unicode strings, regular
AV> python strings are converted to Unicode by assuming they're utf-8 encoded.
AV> Is that the case with this string ?
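
One way to answer that question before the string reaches PyLucene is to try the UTF-8 decode yourself. The sketch below is written for a current Python 3 interpreter, and the helper name inspect_utf8 is made up for illustration. It reports whether the bytes are valid UTF-8 and lists any code points above U+FFFF. The logged string ends in the four-byte sequence \xf0\x95\xbe\xb9, which falls in that range, but whether that is what triggers InvalidArgsError is only a guess.

```python
def inspect_utf8(data):
    """Return (is_valid_utf8, code points above U+FFFF as hex strings).

    inspect_utf8 is a hypothetical diagnostic helper, not a PyLucene call.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False, []
    # Code points above U+FFFF need four bytes in UTF-8.
    beyond_bmp = [hex(ord(ch)) for ch in text if ord(ch) > 0xFFFF]
    return True, beyond_bmp


# Last seven bytes of the summary string from the log entry.
tail = b"\xe3\x80\x8f\xf0\x95\xbe\xb9"
valid, beyond_bmp = inspect_utf8(tail)
print(valid, beyond_bmp)
```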

AV> A way around the problem is to convert the string to Unicode yourself before
AV> passing it to PyLucene.

AV> If you send in a piece of code that reproduces the problem, I can be more
AV> helpful.

AV> Andi..




--
Yura Smolsky



