OK, thank you Andi.
I'll use the workaround of passing bytes for the moment.
I hope it gets fixed soon, though.
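
For anyone else hitting this, here is a minimal sketch of that workaround, assuming the observation from this thread holds (UTF-8 encoded bytes pass through JCC correctly while str does not). The helper name as_java_text is hypothetical and not part of PyLucene or JCC; the actual setStringValue call is shown only as a comment so the snippet runs without a JVM.

```python
def as_java_text(value):
    """Encode a str to UTF-8 bytes; leave anything else untouched.

    Hypothetical helper for illustration only, not a PyLucene/JCC API.
    """
    if isinstance(value, str):
        return value.encode('utf-8')
    return value  # already bytes (or another type): pass through unchanged


value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
encoded = as_java_text(value)

# With the JVM started via lucene.initVM() and an existing Field object,
# the call from this thread would then be:
#     field.setStringValue(encoded)
```

Funnelling every value through one helper keeps the workaround in a single place, so it is easy to remove once the conversion is fixed.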


Kind regards,
Marc Jeurissen

Bibliotheek UAntwerpen
Stadscampus – Ve35.303
Venusstraat 35 – 2000 Antwerpen
marc.jeuris...@uantwerpen.be
T +32 3 265 49 71



From: Andi Vajda
Sent: Wednesday, 9 October 2019 23:33
To: Andi Vajda
Cc: pylucene-dev@lucene.apache.org
Subject: Re: Field.setStringValue


On Wed, 9 Oct 2019, Andi Vajda wrote:

>
> On Wed, 9 Oct 2019, Marc Jeurissen wrote:
>
>> Good day to you,
>> 
>> I have the following issue when setting the value of a field, value 
>> containing a character > 160 (Pylucene 8.1.1, Python 3.7.2)
>> 
>> ...
>> (Pdb) field
>> <Field: stored,indexed,tokenized,omitNorms indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:>>
>> (Pdb) value = '«Volgende facturen werden verstuurd aan de financiële dienst.»'
>> (Pdb) type(value)
>> <class 'str'>
>> (Pdb) field.setStringValue(value)
>> (Pdb) field
>> <Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende facturen werden verstuurd aan de financiële dienst>>
>> 
>> The field value has lost 2 characters.
>> 
>> But when I encode value:
>> 
>> (Pdb) value = value.encode('utf-8')
>> (Pdb) value
>> b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xable dienst.\xc2\xbb'
>> 
>> (Pdb) field.setStringValue(value)
>> (Pdb) field
>> <Field: stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende facturen werden verstuurd aan de financiële dienst.»>>
>> 
>> The field value is correct.
>> 
>> So what does field.setStringValue expect: a string (as says the Lucene 
>> documentation) or a byte sequence?
>
> Indeed, there is a problem. I was able to reproduce it with just 
> StringBuffer, no lucene involved at all:
>
> >>> from lucene import initVM
> >>> initVM()
> >>> b = b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
> >>> a = b.decode('utf-8')
> >>> from java.lang import StringBuffer
> >>> StringBuffer(b)
> <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst.»>
> >>> StringBuffer(a)
> <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst>
> >>> StringBuffer(a).length()
> 59
> >>> StringBuffer(b).length()
> 61
> >>> type(a)
> <class 'str'>
> >>> type(b)
> <class 'bytes'>
>
> There must be a bug in the Python 'str' -> Java 'String' conversion code.
> Any Java API such as field.setStringValue() that expects a java.lang.String() 
> can be passed a 'str' or 'bytes', JCC auto-converts as needed. This is very 
> likely where the bug is.

Digging a bit further, the problem doesn't seem to occur with Python 2.
I'm not implying this is a Python bug; strings are just very different
between Python 2 and 3.
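
As a sanity check on the lengths reported in the session above, here is a small sketch in plain Python 3, with no JCC or Lucene involved. It relies only on the fact that java.lang.String.length() counts UTF-16 code units; the variable names are just for illustration.

```python
# Same test data as in the StringBuffer session above.
b = (b'\xc2\xabVolgende facturen werden verstuurd aan de '
     b'financi\xc3\xabledienst.\xc2\xbb')
a = b.decode('utf-8')

code_points = len(a)                            # length of the Python 3 str
utf16_units = len(a.encode('utf-16-le')) // 2   # what Java's length() should report
utf8_bytes = len(b)                             # length of the UTF-8 byte string

print(code_points, utf16_units, utf8_bytes)     # prints: 61 61 64
```

So a correct conversion should give a Java length of 61 for both 'a' and 'b'; the 59 reported for the 'str' case is two characters short.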

Andi..

>
> Andi..
>
>> 
>> Thank you very much.
>> 
>> 
>> Kind regards,
>> Marc Jeurissen
>> 
>> Bibliotheek UAntwerpen
>> Stadscampus – Ve35.303
>> Venusstraat 35 – 2000 Antwerpen
>> marc.jeuris...@uantwerpen.be
>> T +32 3 265 49 71
>> 
>> 
>> 
>
