From: Patrick Oliver Glauner [patrick.oliver.glau...@cern.ch]
Sent: Friday, September 28, 2012 10:36 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing in Solr: invalid UTF-8
Thank you. I will check our textification process and see how to improve it.
Patrick
On 9 October 2012 17:42, Patrick Oliver Glauner
patrick.oliver.glau...@cern.ch wrote:
Hello everybody
Meanwhile, I checked this issue in detail: we use pdftotext to extract text
from our PDFs (http://cds.cern.ch/). Some generated text files contain
\u and \uD835.
unicode(text,
From: Michael McCandless [luc...@mikemccandless.com]
Sent: Wednesday, September 26, 2012 5:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing in Solr: invalid UTF-8
Python's unicode function takes an optional (keyword) errors
argument, which tells it what to do when an invalid UTF-8 byte
sequence is seen.
The default (errors='strict') is to raise the exceptions you're
seeing. But you can also pass errors='replace' or errors='ignore'.
See
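
To make the three error modes concrete, here is a minimal sketch. It is written for Python 3, where bytes.decode plays the role of the Python 2 call unicode(text, 'utf-8', errors=...); the sample bytes and the helper name decode_with are made up for illustration.

```python
# Python 3 sketch; in Python 2 the equivalent call would be
# unicode(raw, 'utf-8', errors=...).

# Made-up sample: a valid two-byte "e acute" followed by the byte 0xFF,
# which can never occur in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff end"


def decode_with(raw_bytes, errors):
    """Decode UTF-8 bytes using the given error-handling mode."""
    return raw_bytes.decode("utf-8", errors=errors)


# 'strict' (the default) raises on the invalid byte.
try:
    decode_with(raw, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD for the invalid byte.
replaced = decode_with(raw, "replace")

# 'ignore' silently drops the invalid byte.
ignored = decode_with(raw, "ignore")
```

Whether 'replace' or 'ignore' is preferable depends on the use case: 'replace' leaves a visible marker where data was damaged, while 'ignore' hides the damage.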
Hello
We use Solr 3.1 and Jetty to index full texts previously extracted from
PDF, DOC and similar files. Our indexing script is written in Python 2.4
using solrpy:
[...]
text = remove_control_characters(text) # except \r, \t, \n
utext = unicode(text, 'utf-8')
SOLR_CONNECTION.add(id=recid, fulltext=utext)
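
The thread never shows the body of remove_control_characters, so the following is only a guess at what such a helper might look like. It is a Python 3 sketch that assumes the helper receives already-decoded text and that "control characters" means Unicode general category Cc; the original script is Python 2.4 and runs the helper before decoding, so it may well work differently.

```python
import unicodedata

# Characters the original comment says are kept: \r, \t, \n.
KEEP = {"\r", "\t", "\n"}


def remove_control_characters(text):
    """Drop control characters (category Cc) except CR, TAB and LF.

    Hypothetical implementation; the real helper is not shown in the thread.
    """
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


cleaned = remove_control_characters("a\x00b\tc\x1f\n\x7fd")
```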
Subject: Indexing in Solr: invalid UTF-8
[markus.jel...@openindex.io]
Sent: Tuesday, September 25, 2012 7:24 PM
To: solr-user@lucene.apache.org; Patrick Oliver Glauner
Subject: RE: Indexing in Solr: invalid UTF-8
Hi - you need to get rid of all non-character code points.
http://unicode.org/cldr/utility/list-unicodeset.jsp
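
One possible way to do this in Python is sketched below. It assumes the text has already been decoded to a Python 3 str, and the function names are hypothetical. Besides the noncharacters (U+FDD0..U+FDEF and the last two code points of every plane), it also drops lone surrogates (U+D800..U+DFFF), because a value such as 0xD835 is a surrogate rather than a noncharacter and cannot be encoded as valid UTF-8 either.

```python
def is_invalid_code_point(cp):
    """Return True for surrogates and Unicode noncharacters."""
    if 0xD800 <= cp <= 0xDFFF:        # surrogate range, includes U+D835
        return True
    if 0xFDD0 <= cp <= 0xFDEF:        # contiguous noncharacter block
        return True
    # U+FFFE/U+FFFF and their counterparts in every other plane
    return (cp & 0xFFFE) == 0xFFFE


def strip_invalid_code_points(text):
    """Remove surrogates and noncharacters from a decoded string."""
    return "".join(ch for ch in text if not is_invalid_code_point(ord(ch)))


# Made-up sample containing a lone surrogate and two noncharacters.
sample = "ab\ud835cd\uffffef\ufdd0"
cleaned = strip_invalid_code_points(sample)
```

On a narrow Python 2 build, characters outside the Basic Multilingual Plane are stored as surrogate pairs, so a per-character filter like this one would also remove valid pairs there.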
On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
patrick.oliver.glau...@cern.ch wrote:
Hi
Thanks. But I see that 0xd835 is missing from this list (see my exceptions).
What's the best way to get rid of all of them in Python? I am new to Unicode
in Python but I am sure that this use case