Re: problems getting data into solr index
Hi Mike, Brian

Thanks for helping with this, and for clearing up my misunderstanding. solr.py the Python module and Solr the package being two different things, I've got you.

The issues I have are compounded by the fact that we're hovering between using the Unicode branch of Django and the older branch that has newforms, both of which have an impact on what I'm trying to do.

It's getting closer to being resolved, and it's down to your advice, so thanks again.

-- View this message in context: http://www.nabble.com/problems-getting-data-into-solr-index-tf3915542.html#a11230922 Sent from the Solr - User mailing list archive at Nabble.com.
Re: problems getting data into solr index
Mike is talking about solr.py, the Python script; I'm talking about Solr itself. I think your problem is in the former.

You should play around with unicode in Python for a while. Remember that your terminal itself probably doesn't support UTF-8; the biggest problem I run into is doing

    print utf8string

Python forces you to be good about this stuff, but it's a steep climb. Google for "python unicode" and read the various tutorials to get a handle on it.

-b

On Jun 20, 2007, at 9:38 AM, vanderkerkoff wrote:
> Hello Mike, Brian
>
> My brain is approaching saturation point and I'm reading these two opinions as opposing each other. I'm sure I'm reading it incorrectly, but they seem to contradict each other. Are they?
>
> Brian Whitman wrote:
>> Solr has no problems with proper utf8 and you don't need to do anything special to get it to work. Check out the newer solr.py in JIRA.
>
> Mike Klaas wrote:
>> Perhaps this is why: solr.py expects unicode. You can pass it ascii, and it will transparently convert to unicode fine because that is the default codec. If you end up with utf-8, it will try to convert to unicode using the ascii codec and fail.

--
http://variogr.am/
Re: problems getting data into solr index
On 20-Jun-07, at 6:38 AM, vanderkerkoff wrote:
> Hello Mike, Brian
>
> My brain is approaching saturation point and I'm reading these two opinions as opposing each other. I'm sure I'm reading it incorrectly, but they seem to contradict each other. Are they?

solr.py takes unicode and encodes it as utf-8 to send to Solr.

-Mike
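As an illustration of what "takes unicode and encodes it as utf-8" means in practice, here is a minimal sketch. It is written for Python 3 (where `str` is the unicode type), it is not the actual solr.py source, and `build_add_xml` and the field names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(fields):
    """Build a Solr <add> message from a dict of text values.

    The XML is assembled as text (unicode) and encoded to UTF-8 bytes
    only once, at the very end, just before it would be sent.
    """
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # quoteattr() returns the attribute value including its quotes.
        parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    parts.append("</doc></add>")
    return "".join(parts).encode("utf-8")


# U+2019 is the curly apostrophe that Word likes to insert.
body = build_add_xml({"id": "1", "title": "It\u2019s here"})
```

The point of the sketch is the ordering: keep everything as text inside the program and encode exactly once at the boundary.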
Re: problems getting data into solr index
Cheers Mike, read the page, it's starting to get into my brain now.

Django was giving me unicode strings, so I did some encoding and decoding and now the data is getting into Solr, and it's simply not passing the characters that are causing problems, which is great.

I'm going to follow the same sort of principle in my Python code when I'm adding the items, so I can keep my Solr index up to date as and when things are entered.

Here's the code I'm using to enter the data.

http://pastie.textmate.org/71367

Two little things:

1. I'm getting an error when it's trying to optimise the index:

    AttributeError: SolrConnection instance has no attribute 'optimise'

You don't know what that is about, do you? I'm still on solr1.1 as we were having trouble getting this sort of interaction to work with 1.2, not sure if it's related.

2. I've used your suggestions to force the output into ascii, but if I try to force it into utf8, which I thought Solr would accept, it fails. I'm not sure why though.

Mike Klaas wrote:
> Hi,
>
> To diagnose this properly, you're going to have to figure out if you're dealing with encoded bytes or unicode, and what django does. See http://www.joelonsoftware.com/articles/Unicode.html.
>
> As a short-term solution, you can force things to ascii using:
>
>     str(s.decode('ascii', 'ignore'))  # assuming s is a bytestring
>     u.encode('ascii', 'ignore')       # assuming u is a unicode string
>
> -Mike
Re: problems getting data into solr index
I think I've resolved this.

I've edited that solr.py file to optimize=True on commit and moved the commit outside of the loop.

http://pastie.textmate.org/71392

The data is going in and it's optimizing once, but it's showing as commit = 0 in the stats page of my Solr. There are no errors that I can see, and the data is definitely in the index as I can now search for it.

vanderkerkoff wrote:
> 2 little things, I'm getting an error when it's trying to optimise the index
>
>     AttributeError: SolrConnection instance has no attribute 'optimise'
>
> You don't know what that is about, do you? I'm still on solr1.1 as we were having trouble getting this sort of interaction to work with 1.2, not sure if it's related.
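For readers without access to the pastie link, the "commit once, outside the loop" shape looks roughly like this. `RecordingConnection` is a hypothetical stand-in that only records calls; its `add(**fields)` signature is an assumption, and only `commit(optimize=True)` is taken from this thread.

```python
class RecordingConnection:
    """Hypothetical stand-in for a solr.py connection; it only records calls."""

    def __init__(self):
        self.added = []
        self.commits = []

    def add(self, **fields):
        # Assumed signature, used here purely for illustration.
        self.added.append(fields)

    def commit(self, optimize=False):
        self.commits.append(optimize)


def index_all(conn, rows):
    """Add every row, then commit (and optimize) a single time at the end."""
    for row in rows:
        conn.add(**row)
    # One commit for the whole batch instead of one per document.
    conn.commit(optimize=True)


conn = RecordingConnection()
index_all(conn, [{"id": "1", "title": "first"}, {"id": "2", "title": "second"}])
```

Committing per document forces Solr to do far more work than committing once per batch, which is presumably why moving it out of the loop helped.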
Re: problems getting data into solr index
On 18-Jun-07, at 6:27 AM, vanderkerkoff wrote:
> Cheers Mike, read the page, it's starting to get into my brain now.
>
> Django was giving me unicode strings, so I did some encoding and decoding and now the data is getting into Solr, and it's simply not passing the characters that are causing problems, which is great.

Glad to hear that it is working.

> 2 little things, I'm getting an error when it's trying to optimise the index
>
>     AttributeError: SolrConnection instance has no attribute 'optimise'
>
> You don't know what that is about, do you?

Er, it means that SolrConnection has no optimise command. Instead do

    conn.commit(optimize=True)

> I'm still on solr1.1 as we were having trouble getting this sort of interaction to work with 1.2, not sure if it's related.
>
> 2. I've used your suggestions to force the output into ascii, but if I try to force it into utf8, which I thought Solr would accept, it fails. I'm not sure why though.

Perhaps this is why: solr.py expects unicode. You can pass it ascii, and it will transparently convert to unicode fine because that is the default codec. If you end up with utf-8, it will try to convert to unicode using the ascii codec and fail.

So, you could completely skip the .encode('ascii', 'ignore') line. Of course, you'd have the characters in the text. I'm not quite sure what you're after, since leaving it in utf-8 would leave the funny characters that you wanted to strip.

-Mike
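The failure Mike describes is an implicit ASCII decode, which Python 2 performs when a byte string meets unicode. Python 3 has no implicit conversion, so the sketch below reproduces the same step explicitly; treat it as an illustration of the mechanism rather than of solr.py's internals.

```python
# "It's" with a curly apostrophe (U+2019), as UTF-8 encoded bytes.
utf8_bytes = "It\u2019s".encode("utf-8")   # b'It\xe2\x80\x99s'
ascii_bytes = b"plain text"

# Pure ASCII bytes decode cleanly with the ascii codec.
assert ascii_bytes.decode("ascii") == "plain text"

# UTF-8 bytes for a non-ASCII character do not.
try:
    utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    # The first byte the ascii codec could not handle.
    bad_byte = utf8_bytes[exc.start]

# Decoding with the codec the bytes were actually written in works.
text = utf8_bytes.decode("utf-8")
```

Here `bad_byte` is 0xe2, the same byte named in the original error message, since 0xe2 is the lead byte of the UTF-8 sequence for curly quotes and apostrophes.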
Re: problems getting data into solr index
Hi,

To diagnose this properly, you're going to have to figure out if you're dealing with encoded bytes or unicode, and what django does. See http://www.joelonsoftware.com/articles/Unicode.html.

As a short-term solution, you can force things to ascii using:

    str(s.decode('ascii', 'ignore'))  # assuming s is a bytestring
    u.encode('ascii', 'ignore')       # assuming u is a unicode string

-Mike

On 15-Jun-07, at 2:45 AM, vanderkerkoff wrote:
> Hi Mike
>
> The characters that are giving us problems are the old favourites of apostrophes and quotes pasted from Microsoft Word into a Django web site.
>
> I'm not using django's newforms yet, but still using the old ones.
>
> Any help or tips, or sending me off to sites to read stuff, Mike, and I'll be grateful.
>
> I'm coming round to the idea that I might have to strip these odd characters out with python before they get sent into the database; that would be the most sensible option I think.
>
> Mike Klaas wrote:
>> I've dealt with tons of issues with python and unicode, but I need more information before proceeding with tips.
>>
>> Specifically, what is the format of the shit being copied and pasted into your app, and what python datatype is handling it? I suspect it is encoded somehow, which could be problematic. Is it going through a web browser? How is it getting into mysql?
>>
>> -Mike
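For anyone reading this on Python 3, where the two snippets above behave slightly differently (`bytes` and `str` are distinct types), a rough equivalent of the "force to ascii" workaround might look like this. `to_ascii` is a hypothetical helper name.

```python
def to_ascii(value):
    """Return `value` as an ASCII-only str, silently dropping anything else."""
    if isinstance(value, bytes):
        # Byte string: non-ASCII bytes are discarded during decoding.
        return value.decode("ascii", "ignore")
    # Text: non-ASCII characters are discarded during encoding.
    return value.encode("ascii", "ignore").decode("ascii")


cleaned_text = to_ascii("caf\u00e9 \u201cquoted\u201d")
cleaned_bytes = to_ascii(b"caf\xc3\xa9")
```

As in the original suggestion, this loses data: accented letters and curly quotes vanish rather than being converted, so it is a stopgap and not a fix.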
Re: problems getting data into solr index
Hello Hoss

Thanks for replying. I tried what you suggested as the initial step of my troubleshooting and it outputs it fine. It was what I suspected initially as well, but thanks for the advice.

hossman_lucene wrote:
>> I'm running solr1.2 and Jetty, I'm having problems looping through a mysql database with python and putting the data into the solr index.
>>
>> Here's the error
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369: ordinal not in range(128)
>
> I may be missing something here, but I don't think that error is coming from Solr ... UnicodeDecodeError appears to be a python error message, so I suspect the problem is between MySQL and your python script .. I bet if you change your script to comment out the lines where you talk to Solr, and just read the data from mysql and throw it to /dev/null, you'd still see that message.
>
> http://wiki.wxpython.org/UnicodeDecodeError
>
> -Hoss
Re: problems getting data into solr index
Hi Yonik

Here's the output from netcat

    POST /solr/update HTTP/1.1
    Host: localhost:8983
    Accept-Encoding: identity
    Content-Length: 83
    Content-Type: text/xml; charset=utf-8

That looks OK to me, but I am a bit twp you see. :-)

Yonik Seeley wrote:
> On 6/13/07, vanderkerkoff wrote:
>> I'm running solr1.2 and Jetty, I'm having problems looping through a mysql database with python and putting the data into the solr index.
>>
>> Here's the error
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369: ordinal not in range(128)
>
> There are two issues... what char encoding you tell solr to use, via Content-type in the HTTP headers (solr defaults to UTF-8), and then if what you send matches that coding.
>
> If you can get the complete message (including HTTP headers) that is being sent to Solr, that would help people debug the problem.
>
> One easy way is to use netcat to pretend to be solr:
> 1) shut down solr
> 2) start up netcat on solr's port: nc -l -p 8983
> 3) send your update message from the client as you normally would
>
> -Yonik
Re: problems getting data into solr index
Is it OK?

2007/6/14, vanderkerkoff:
> Hi Yonik
>
> Here's the output from netcat
>
>     POST /solr/update HTTP/1.1
>     Host: localhost:8983
>     Accept-Encoding: identity
>     Content-Length: 83
>     Content-Type: text/xml; charset=utf-8
>
> That looks OK to me, but I am a bit twp you see. :-)
>
> Yonik Seeley wrote:
>> There are two issues... what char encoding you tell solr to use, via Content-type in the HTTP headers (solr defaults to UTF-8), and then if what you send matches that coding.
>>
>> If you can get the complete message (including HTTP headers) that is being sent to Solr, that would help people debug the problem.
>>
>> One easy way is to use netcat to pretend to be solr:
>> 1) shut down solr
>> 2) start up netcat on solr's port: nc -l -p 8983
>> 3) send your update message from the client as you normally would
>>
>> -Yonik

--
regards
jl
Re: problems getting data into solr index
Hi Brian

I've now set the MySQL database default charset to utf8, and everything else is utf8: collation, etc.

I think I know what the problem is, and it's a really old one and I feel foolish now for not realising it earlier.

Our content people are copying and pasting sh*t from Word into the content. :-)

Now that the database is utf8, I'd like to write something to change the crap from Word into a readable value before it gets into the database. I'm using python, so I suppose this is more of a python question than a Solr one. Anyone got any tips anyway?

Brian Whitman wrote:
> Post the line of code this is breaking on. Are you pulling the data from mysql as utf8? Are you setting the encoding of MySQLdb?
>
> Solr has no problems with proper utf8 and you don't need to do anything special to get it to work. Check out the newer solr.py in JIRA.
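One possible approach to the question above, sketched for Python 3 and assuming the pasted content has already been decoded to text: translate Word's typographic punctuation to plain ASCII equivalents instead of dropping it. The mapping is illustrative and deliberately small, and `plain_punctuation` is a hypothetical helper name.

```python
# Typographic characters Word commonly substitutes, mapped to ASCII stand-ins.
WORD_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(WORD_PUNCTUATION)


def plain_punctuation(text):
    """Replace Word-style punctuation with readable ASCII equivalents."""
    return text.translate(_TABLE)


cleaned = plain_punctuation("\u201cIt\u2019s fine\u201d \u2013 really\u2026")
```

This only works on text, not on undecoded bytes, so the bytes-versus-unicode question raised elsewhere in the thread still has to be settled first.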
Re: problems getting data into solr index
On 14-Jun-07, at 4:30 AM, vanderkerkoff wrote:
> Hi Brian
>
> I've now set the MySQL database default charset to utf8, and everything else is utf8: collation, etc.
>
> I think I know what the problem is, and it's a really old one and I feel foolish now for not realising it earlier.
>
> Our content people are copying and pasting sh*t from Word into the content. :-)
>
> Now that the database is utf8, I'd like to write something to change the crap from Word into a readable value before it gets into the database. I'm using python, so I suppose this is more of a python question than a Solr one. Anyone got any tips anyway?

I've dealt with tons of issues with python and unicode, but I need more information before proceeding with tips.

Specifically, what is the format of the shit being copied and pasted into your app, and what python datatype is handling it? I suspect it is encoded somehow, which could be problematic. Is it going through a web browser? How is it getting into mysql?

-Mike
Re: problems getting data into solr index
On 6/13/07, vanderkerkoff wrote:
> I'm running solr1.2 and Jetty, I'm having problems looping through a mysql database with python and putting the data into the solr index.
>
> Here's the error
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369: ordinal not in range(128)

There are two issues... what char encoding you tell solr to use, via Content-type in the HTTP headers (solr defaults to UTF-8), and then if what you send matches that coding.

If you can get the complete message (including HTTP headers) that is being sent to Solr, that would help people debug the problem.

One easy way is to use netcat to pretend to be solr:
1) shut down solr
2) start up netcat on solr's port: nc -l -p 8983
3) send your update message from the client as you normally would

-Yonik
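The "two issues" can also be checked on the client side before anything goes over the wire. The sketch below (Python 3) assumes the declared charset is UTF-8, as in the headers seen elsewhere in this thread; `build_update_request` and `body_matches_headers` are hypothetical helper names, not part of solr.py.

```python
def build_update_request(xml_text):
    """Encode the XML once and derive the headers from the encoded bytes."""
    body = xml_text.encode("utf-8")
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


def body_matches_headers(headers, body):
    """Return True if the body has the declared length and is valid UTF-8."""
    if int(headers["Content-Length"]) != len(body):
        return False
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


headers, body = build_update_request(
    '<add><doc><field name="id">It\u2019s</field></doc></add>'
)
ok = body_matches_headers(headers, body)
```

A body that was written as Windows-1252 but declared as UTF-8 would fail this check, which is the kind of mismatch the netcat capture is meant to expose.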
Re: problems getting data into solr index
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369: ordinal not in range(128)

What character is at position 369? Make sure it is valid unicode...

> Is there a simple way to tell solr to accept UTF8 characters?

Solr can accept UTF8 characters... check the utf8-example.xml example in exampledocs. If you can put the character at position 369 into utf8-example.xml and post it successfully (using post.sh or post.jar), then I suspect however you are posting the XML is not encoding the stream properly.
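To answer "what character is at position 369?" without guessing, something along these lines could be run against the raw bytes read from the database. It is a Python 3 sketch with made-up sample data, and `first_non_ascii` is a hypothetical helper.

```python
def first_non_ascii(data, context=10):
    """Return (position, byte value, surrounding bytes) for the first
    non-ASCII byte in `data`, or None if the data is pure ASCII."""
    for position, byte in enumerate(data):
        if byte > 127:
            start = max(0, position - context)
            return position, byte, data[start:position + context]
    return None


# Made-up sample: 369 ASCII bytes followed by a UTF-8 curly apostrophe.
sample = b"a" * 369 + "\u2019".encode("utf-8") + b"s here"
hit = first_non_ascii(sample)
```

If the surrounding bytes decode cleanly as UTF-8, the data itself is probably fine and the problem lies in how the script converts it before posting.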