OK, thank you Andi. I'll use the workaround with the bytes for the moment. I hope it gets solved soon, though.
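
For reference, a minimal sketch of that bytes workaround, assuming Python 3. The helper name `as_utf8_bytes` is hypothetical (it is not part of PyLucene or JCC). The sketch only prepares the value; the actual `field.setStringValue()` call is left as a comment because it needs an initialised JVM and the `field` object from the session quoted below.

```python
def as_utf8_bytes(value):
    """Hypothetical helper: encode str as UTF-8, pass bytes through unchanged."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


# Sample value from the thread; it contains three non-ASCII characters.
value = "«Volgende facturen werden verstuurd aan de financiële dienst.»"
encoded = as_utf8_bytes(value)

# The encoded form is longer than the str because each of the three
# non-ASCII characters takes two bytes in UTF-8.
print(len(value), len(encoded))

# With PyLucene initialised, the call from the thread would then be:
#     field.setStringValue(encoded)
```

Passing bytes through unchanged means the helper can be applied to values that are already encoded without encoding them twice.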

Kind regards,
Marc Jeurissen

Bibliotheek UAntwerpen
Stadscampus – Ve35.303
Venusstraat 35 – 2000 Antwerpen
marc.jeuris...@uantwerpen.be
T +32 3 265 49 71

From: Andi Vajda
Sent: Wednesday, 9 October 2019 23:33
To: Andi Vajda
Cc: pylucene-dev@lucene.apache.org
Subject: Re: Field.setStringValue

On Wed, 9 Oct 2019, Andi Vajda wrote:

>
> On Wed, 9 Oct 2019, Marc Jeurissen wrote:
>
>> Good day to you,
>>
>> I have the following issue when setting the value of a field, value
>> containing a character > 160 (Pylucene 8.1.1, Python 3.7.2)
>>
>> ...
>> (Pdb) field
>> <Field: stored,indexed,tokenized,omitNorms
>> indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:>>
>> (Pdb) value = '«Volgende facturen werden verstuurd aan de financiële
>> dienst.»'
>> (Pdb) type(value)
>> <class 'str'>
>> (Pdb) field.setStringValue(value)
>> (Pdb) field
>> <Field:
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende
>> facturen werden verstuurd aan de financiële dienst>>
>>
>> The field value has lost 2 characters.
>>
>> But when I encode value:
>>
>> (Pdb) value = value.encode('utf-8')
>> (Pdb) value
>> b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xable
>> dienst.\xc2\xbb'
>>
>> (Pdb) field.setStringValue(value)
>> (Pdb) field
>> <Field:
>> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS<text:«Volgende
>> facturen werden verstuurd aan de financiële dienst.»>>
>>
>> The field value is correct.
>>
>> So what does field.setStringValue expect: a string (as says the Lucene
>> documentation) or a byte sequence?
>
> Indeed, there is a problem. I was able to reproduce it with just
> StringBuffer, no lucene involved at all:
>
> >>> from lucene import initVM
> >>> initVM()
> >>> b=b'\xc2\xabVolgende facturen werden verstuurd aan de financi\xc3\xabledienst.\xc2\xbb'
> >>> a=b.decode('utf-8')
> >>> from java.lang import StringBuffer
> >>> StringBuffer(b)
> <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst.»>
> >>> StringBuffer(a)
> <StringBuffer: «Volgende facturen werden verstuurd aan de financiëledienst>
> >>> StringBuffer(a).length()
> 59
> >>> StringBuffer(b).length()
> 61
> >>> type(a)
> <class 'str'>
> >>> type(b)
> <class 'bytes'>
>
> There must be a bug in the Python 'str' -> Java 'String' conversion code.
> Any Java API such as field.setStringValue() that expects a java.lang.String()
> can be passed a 'str' or 'bytes', JCC auto-converts as needed. This is very
> likely where the bug is.

Digging a bit further, it doesn't seem to be a problem when using Python 2.
I'm not implying this is a Python bug; strings are just very different between
Python 2 and 3.

Andi..

>
> Andi..
>
>>
>> Thank you very much.
>>
>>
>> Kind regards,
>> Marc Jeurissen
>>
>> Bibliotheek UAntwerpen
>> Stadscampus – Ve35.303
>> Venusstraat 35 – 2000 Antwerpen
>> marc.jeuris...@uantwerpen.be
>> T +32 3 265 49 71