Re: problems getting data into solr index

2007-06-21 Thread vanderkerkoff

Hi Mike, Brian

Thanks for helping with this, and for clearing up my misunderstanding.
solr.py the Python module and Solr the package are two different things,
I've got you.

The issues I have are compounded by the fact that we're hovering between
using the Unicode branch of Django and the older branch that has newforms,
both of which have an impact on what I'm trying to do.

It's getting closer to being resolved, and it's down to your advice, so
thanks again.






-- 
View this message in context: 
http://www.nabble.com/problems-getting-data-into-solr-index-tf3915542.html#a11230922
Sent from the Solr - User mailing list archive at Nabble.com.



Re: problems getting data into solr index

2007-06-20 Thread Brian Whitman
Mike is talking about solr.py, the Python script; I'm talking about
Solr itself.
I think your problem is in the former. You should play around with
unicode in Python for a while. Remember that your terminal itself
probably doesn't support UTF-8; the biggest problem I run into is doing


 print utf8string

Python forces you to be good about this stuff, but it's a steep  
climb. Google for python unicode and read the various tutorials to  
get a handle on it.
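
A minimal sketch of the usual discipline, written for Python 3 (this thread predates it, and Python 2 syntax differs): decode bytes once at the boundary with an explicit codec, and when printing to a terminal that may not handle UTF-8, encode with a replacement policy so the print cannot fail. The helper name safe_for_terminal and the sample bytes are made up for illustration.

```python
import sys


def safe_for_terminal(text, encoding=None):
    """Return text with characters the terminal cannot show replaced by '?'."""
    # Fall back to ASCII if the output stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# Bytes as they might arrive from a database or an HTTP request (UTF-8 encoded).
raw = b"it\xe2\x80\x99s"

# Decode once, at the boundary, naming the codec explicitly.
text = raw.decode("utf-8")

# Pretend the terminal only understands ASCII.
print(safe_for_terminal(text, "ascii"))
```

Inside the program everything stays text; only the edges (input and output) deal with bytes and codecs.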


-b


On Jun 20, 2007, at 9:38 AM, vanderkerkoff wrote:



Hello Mike, Brian

My brain is approaching saturation point and I'm reading these two
opinions as opposing each other.

I'm sure I'm reading it incorrectly, but they seem to contradict
each other.

Are they?


Brian Whitman wrote:


Solr has no problems with proper utf8 and you don't need to do
anything special to get it to work. Check out the newer solr.py in  
JIRA.





Mike Klaas wrote:


Perhaps this is why: solr.py expects unicode.  You can pass it ascii,
and it will transparently convert to unicode fine because that is the
default codec.  If you end up with utf-8, it will try to convert to
unicode using the ascii codec and fail.






--
http://variogr.am/
[EMAIL PROTECTED]





Re: problems getting data into solr index

2007-06-20 Thread Mike Klaas



On 20-Jun-07, at 6:38 AM, vanderkerkoff wrote:



Hello Mike, Brian

My brain is approaching saturation point and I'm reading these two
opinions as opposing each other.

I'm sure I'm reading it incorrectly, but they seem to contradict
each other.

Are they?


solr.py takes unicode and encodes it as utf-8 to send to Solr.

-Mike


Re: problems getting data into solr index

2007-06-18 Thread vanderkerkoff

Cheers Mike, read the page, it's starting to get into my brain now.

Django was giving me unicode strings, so I did some encoding and decoding and
now the data is getting into solr, and it's simply not passing the
characters that are causing problems, which is great.

I'm going to follow the same sort of principle in my python code when I'm
adding the items, so I can keep my solr index up to date as and when things
are entered.

Here's the code I'm using to enter the data.

http://pastie.textmate.org/71367

Two little things:

1.  I'm getting an error when it's trying to optimise the index

AttributeError: SolrConnection instance has no attribute 'optimise'

You don't know what that is about do you?

I'm still on solr1.1 as we were having trouble getting this sort of
interaction to work with 1.2, not sure if it's related.

2.  I've used your suggestions to force the output into ascii, but if I try
to force it into utf8, which I thought solr would accept, it fails.  I'm not
sure why though.

 



Mike Klaas wrote:
 
 Hi,
 
 To diagnose this properly, you're going to have to figure out if  
 you're dealing with encoded bytes or unicode, and what django does.   
 See http://www.joelonsoftware.com/articles/Unicode.html.
 
 As a short-term solution, you can force things to ascii using:
 
 str(s.decode('ascii', 'ignore')) # assuming s is a bytestring
 u.encode('ascii', 'ignore') # assuming u is a unicode string
 
 -Mike
 




Re: problems getting data into solr index

2007-06-18 Thread vanderkerkoff

I think I've resolved this.

I've edited that solr.py file to use optimize=True on commit and moved the
commit outside of the loop.

http://pastie.textmate.org/71392

The data is going in and it's optimizing once, but it's showing as commit = 0 in
the stats page of my solr.

There are no errors that I can see, and the data is definitely in the index as
I can now search for it.



vanderkerkoff wrote:
 
 
 2 little things, I'm getting an error when it's trying to optimise the
 index
 
 AttributeError: SolrConnection instance has no attribute 'optimise'
 
 You don't know what that is about do you?
 
 I'm still on solr1.1 as we were having trouble getting this sort of
 interaction to work with 1.2, not sure if it's related.
 
 




Re: problems getting data into solr index

2007-06-18 Thread Mike Klaas

On 18-Jun-07, at 6:27 AM, vanderkerkoff wrote:



Cheers Mike, read the page, it's starting to get into my brain now.

Django was giving me unicode strings, so I did some encoding and
decoding and now the data is getting into solr, and it's simply not
passing the characters that are causing problems, which is great.


Glad to hear that it is working.

2 little things, I'm getting an error when it's trying to optimise  
the index


AttributeError: SolrConnection instance has no attribute 'optimise'

You don't know what that is about do you?


Er, it means that SolrConnection has no optimise command.  Instead do

conn.commit(optimize=True)
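
A rough sketch of how that call fits into an indexing loop: add every document first, then commit (with optimize) once at the end rather than once per document. FakeSolrConnection below is a hypothetical stand-in that only records calls so the example runs without a Solr server; it is not the real solr.py class, and the add(**fields) signature and the field names are assumptions.

```python
class FakeSolrConnection:
    """Hypothetical stand-in that records calls; not the real solr.py class."""

    def __init__(self):
        self.calls = []

    def add(self, **fields):
        # Assumed signature: one keyword argument per document field.
        self.calls.append(("add", fields))

    def commit(self, optimize=False):
        self.calls.append(("commit", {"optimize": optimize}))


def index_all(conn, rows):
    """Add every row, then commit and optimize a single time."""
    for row in rows:
        conn.add(**row)
    # One commit after the loop instead of one per document.
    conn.commit(optimize=True)


conn = FakeSolrConnection()
index_all(conn, [{"id": "1", "title": "first"}, {"id": "2", "title": "second"}])
print([name for name, _ in conn.calls])
```

Committing once keeps the expensive work (flushing and optimizing the index) out of the per-document path.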


I'm still on solr1.1 as we were having trouble getting this sort of
interaction to work with 1.2, not sure if it's related.

2.  I've used your suggestions to force the output into ascii, but
if I try to force it into utf8, which I thought solr would accept, it
fails.  I'm not sure why though.


Perhaps this is why: solr.py expects unicode.  You can pass it ascii,  
and it will transparently convert to unicode fine because that is the  
default codec.  If you end up with utf-8, it will try to convert to  
unicode using the ascii codec and fail.
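
The failure described above can be reproduced in isolation. This is only a sketch in Python 3, where the decode has to be written out explicitly (the implicit conversion is Python 2 behaviour); the sample bytes are an assumed example, the UTF-8 encoding of a Word-style apostrophe (U+2019), whose first byte is the 0xe2 from the error message.

```python
# UTF-8 bytes for a right single quotation mark (U+2019), e.g. pasted from Word.
utf8_bytes = b"\xe2\x80\x99"


def decode_with(codec, data):
    """Try to decode data with the named codec; return (text, error_message)."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# The ASCII codec rejects byte 0xe2 ...
ascii_text, ascii_error = decode_with("ascii", utf8_bytes)
# ... while naming the real encoding succeeds.
utf8_text, utf8_error = decode_with("utf-8", utf8_bytes)

print(ascii_error)
print(repr(utf8_text))
```

So the fix is to hand over text that was decoded with the correct codec, not bytes that something later has to guess about.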


So, you could completely skip the .encode('ascii', 'ignore') line.
Of course, you'd then have the characters in the text.  I'm not quite sure
what you're after, since leaving it in utf-8 would leave the funny
characters that you wanted to strip.


-Mike


Re: problems getting data into solr index

2007-06-16 Thread Mike Klaas

Hi,

To diagnose this properly, you're going to have to figure out if  
you're dealing with encoded bytes or unicode, and what django does.   
See http://www.joelonsoftware.com/articles/Unicode.html.


As a short-term solution, you can force things to ascii using:

str(s.decode('ascii', 'ignore')) # assuming s is a bytestring
u.encode('ascii', 'ignore') # assuming u is a unicode string
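
For readers on Python 3, a hedged equivalent of the two lines above (there, a bytestring is bytes and a unicode string is str). The function names and sample values are invented for illustration. Note that 'ignore' silently drops every non-ASCII character, so this loses data.

```python
def force_ascii_from_bytes(s):
    """Bytes in, ASCII-only text out; non-ASCII bytes are dropped."""
    return s.decode("ascii", "ignore")


def force_ascii_from_text(u):
    """Text in, ASCII-only bytes out; non-ASCII characters are dropped."""
    return u.encode("ascii", "ignore")


# The curly apostrophe disappears in both directions.
print(force_ascii_from_bytes(b"it\xe2\x80\x99s"))
print(force_ascii_from_text("it\u2019s"))
```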

-Mike

On 15-Jun-07, at 2:45 AM, vanderkerkoff wrote:



Hi Mike
The characters that are giving us problems are the old favourites of
apostrophes and quotes pasted from Microsoft Word into a Django web site.

I'm not using django's newforms yet, but still using the old ones.

Any help or tips or sending me off to sites to read stuff Mike I'll be
grateful.

I'm coming round to the idea that I might have to strip these odd
characters out with python before they get sent into the database, that
would be the most sensible option I think.



Mike Klaas wrote:


I've dealt with tons of issues with python and unicode, but I need
more information before proceeding with tips.

Specifically, what is the format of the shit being copied and
pasted into your app, and what python datatype is handling it?  I
suspect it is encoded somehow, which could be problematic.  Is it
going through a web browser?  How is it getting into mysql?

-Mike











Re: problems getting data into solr index

2007-06-14 Thread vanderkerkoff

Hello Hoss

Thanks for replying. I tried what you suggested as the initial step of my
troubleshooting and it outputs it fine.

It was what I suspected initially as well, but thanks for the advice.



hossman_lucene wrote:
 
 
 : I'm running solr1.2 and Jetty, I'm having problems looping through a mysql
 : database with python and putting the data into the solr index.
 :
 : Here's the error
 :
 : UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
 : ordinal not in range(128)
 
 I may be missing something here, but I don't think that error is coming
 from Solr ... UnicodeDecodeError appears to be a Python error message,
 so I suspect the problem is between MySQL and your Python script .. I bet
 if you change your script to comment out the lines where you talk to Solr,
 and just read the data from MySQL and throw it to /dev/null you'd still
 see that message.
 
 http://wiki.wxpython.org/UnicodeDecodeError
 
 
 -Hoss
 
 
 




Re: problems getting data into solr index

2007-06-14 Thread vanderkerkoff

Hi Yonik

Here's the output from netcat

POST /solr/update HTTP/1.1
Host: localhost:8983
Accept-Encoding: identity
Content-Length: 83
Content-Type: text/xml; charset=utf-8

that looks Ok to me, but I am a bit twp you see.

:-)

Yonik Seeley wrote:
 
 On 6/13/07, vanderkerkoff [EMAIL PROTECTED] wrote:
 I'm running solr1.2 and Jetty, I'm having problems looping through a mysql
 database with python and putting the data into the solr index.

 Here's the error

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
 ordinal not in range(128)
 
 There are two issues... what char encoding you tell solr to use, via
 Content-type in the HTTP headers (solr defaults to UTF-8), and then if
 what you send matches that coding.
 
 If you can get the complete message (including HTTP headers) that is
 being sent to Solr, that would help people debug the problem.
 
 One easy way is to use netcat to pretend to be solr:
 1) shut down solr
 2) start up netcat on solr's port
   nc -l -p 8983
 3) send your update message from the client as you normally would
 
 -Yonik
 
 




Re: problems getting data into solr index

2007-06-14 Thread James liu

is it ok?

2007/6/14, vanderkerkoff [EMAIL PROTECTED]:



Hi Yonik

Here's the output from netcat

POST /solr/update HTTP/1.1
Host: localhost:8983
Accept-Encoding: identity
Content-Length: 83
Content-Type: text/xml; charset=utf-8

that looks Ok to me, but I am a bit twp you see.

:-)

Yonik Seeley wrote:

 On 6/13/07, vanderkerkoff [EMAIL PROTECTED] wrote:
 I'm running solr1.2 and Jetty, I'm having problems looping through a mysql
 database with python and putting the data into the solr index.

 Here's the error

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
 ordinal not in range(128)

 There are two issues... what char encoding you tell solr to use, via
 Content-type in the HTTP headers (solr defaults to UTF-8), and then if
 what you send matches that coding.

 If you can get the complete message (including HTTP headers) that is
 being sent to Solr, that would help people debug the problem.

 One easy way is to use netcat to pretend to be solr:
 1) shut down solr
 2) start up netcat on solr's port
   nc -l -p 8983
 3) send your update message from the client as you normally would

 -Yonik








--
regards
jl


Re: problems getting data into solr index

2007-06-14 Thread vanderkerkoff

Hi Brian

I've now set the mysqldb to default charset utf8, and everything else is
utf8 as well, collation etc.

I think I know what the problem is, and it's a really old one and I feel
foolish now for not realising it earlier.

Our content people are copying and pasting sh*t from Word into the content.

:-)

Now that the database is utf8, I'd like to write something to change the
crap from Word into a readable value before it gets into the database.
Using python, so I suppose this is more of a python question than a solr
one.

Anyone got any tips anyway? 
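
One possible shape for such a cleanup step, as a sketch only (Python 3): map the typographic punctuation Word tends to insert onto plain ASCII before saving. The table is an assumption about which characters are involved, not a complete list, and the names are invented. Several of these characters start with byte 0xe2 when encoded as UTF-8, which matches the error reported in this thread.

```python
# Code point -> ASCII replacement. An assumed, incomplete table.
WORD_PUNCTUATION = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark (curly apostrophe)
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2026: "...",  # horizontal ellipsis
}


def normalize_word_punctuation(text):
    """Replace Word-style punctuation in already-decoded text."""
    return text.translate(WORD_PUNCTUATION)


print(normalize_word_punctuation("\u201cIt\u2019s fine\u201d \u2013 really\u2026"))
```

This works on decoded text, so it has to run after the bytes have been decoded with the right codec, not instead of that step.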



Brian Whitman wrote:
 
 Post the line of code this is breaking on. Are you pulling the data  
 from mysql as utf8? Are you setting the encoding of Mysqldb?
 
 Solr has no problems with proper utf8 and you don't need to do  
 anything special to get it to work. Check out the newer solr.py in JIRA.
 




Re: problems getting data into solr index

2007-06-14 Thread Mike Klaas

On 14-Jun-07, at 4:30 AM, vanderkerkoff wrote:



Hi Brian

I've now set the mysqldb to default charset utf8, and everything
else is utf8 as well, collation etc.

I think I know what the problem is, and it's a really old one and I
feel foolish now for not realising it earlier.

Our content people are copying and pasting sh*t from Word into the
content.

:-)

Now that the database is utf8, I'd like to write something to
change the crap from Word into a readable value before it gets into the
database.
Using python, so I suppose this is more of a python question than a
solr one.

Anyone got any tips anyway?


I've dealt with tons of issues with python and unicode, but I need  
more information before proceeding with tips.


Specifically, what is the format of the shit being copied and  
pasted into your app, and what python datatype is handling it?  I  
suspect it is encoded somehow, which could be problematic.  Is it  
going through a web browser?  How is it getting into mysql?


-Mike




Re: problems getting data into solr index

2007-06-13 Thread Yonik Seeley

On 6/13/07, vanderkerkoff [EMAIL PROTECTED] wrote:

I'm running solr1.2 and Jetty, I'm having problems looping through a mysql
database with python and putting the data into the solr index.

Here's the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
ordinal not in range(128)


There are two issues: which character encoding you tell Solr to use, via
Content-Type in the HTTP headers (Solr defaults to UTF-8), and whether
what you send actually matches that encoding.

If you can get the complete message (including HTTP headers) that is
being sent to Solr, that would help people debug the problem.

One easy way is to use netcat to pretend to be solr:
1) shut down solr
2) start up netcat on solr's port
 nc -l -p 8983
3) send your update message from the client as you normally would
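
If netcat is not available, the same trick can be sketched in Python 3: listen on the port, accept one connection, and keep whatever the client sends. The function names, the loopback host, and the two-second timeout are assumptions made for this sketch; it never answers the client, so the client will eventually report an error, which is expected.

```python
import socket


def open_listener(host="127.0.0.1", port=8983):
    """Bind a listening socket where the client expects to find Solr."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def capture_one_request(srv, timeout=2.0):
    """Accept one connection and return the raw bytes the client sent."""
    conn, _addr = srv.accept()
    conn.settimeout(timeout)
    chunks = []
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break  # client closed the connection
            chunks.append(data)
    except socket.timeout:
        pass  # client went quiet, probably waiting for a response
    finally:
        conn.close()
    return b"".join(chunks)


# Usage (with Solr shut down):
#   srv = open_listener()
#   print(capture_one_request(srv).decode("iso-8859-1"))
#   srv.close()
```

Decoding the capture as ISO-8859-1 for display never fails, so the headers can be read even when the body's encoding is exactly what is in doubt.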

-Yonik


Re: problems getting data into solr index

2007-06-13 Thread Ryan McKinley


 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 
 369: ordinal not in range(128)



What character is at position 369?  make sure it is valid unicode...
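
One way to answer that question from Python, sketched for Python 3: catch the UnicodeDecodeError, read the failing offset from it, and decode from that offset with the codec the data is suspected to be in. The helper name, the choice of UTF-8 as the likely codec, and the sample data (369 filler bytes followed by a curly apostrophe) are all assumptions for illustration.

```python
import unicodedata


def describe_decode_failure(data, wrong_codec="ascii", likely_codec="utf-8"):
    """Locate the first byte wrong_codec rejects and name the character there."""
    try:
        data.decode(wrong_codec)
        return None  # decodes cleanly, nothing to report
    except UnicodeDecodeError as exc:
        position = exc.start
    # Decode the tail with the codec we suspect the data is really in.
    tail = data[position:].decode(likely_codec, errors="replace")
    return position, hex(data[position]), unicodedata.name(tail[0], "UNKNOWN")


# Made-up sample: the offending character sits at byte offset 369.
sample = b"x" * 369 + b"\xe2\x80\x99s"
print(describe_decode_failure(sample))
```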




Is there a simple way to tell solr to accept UTF8 characters?



Solr can accept UTF8 characters... check the utf8-example.xml example in 
exampledocs.


If you can put the character at position 369 into utf8-example.xml and
post it successfully (using post.sh or post.jar), then I suspect that
however you are posting the XML, it is not encoding the stream properly.
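
A small sketch (Python 3) of one way to test that suspicion on a captured request: read the charset declared in the Content-Type header and check that the body really decodes with it. The function names are invented, the header parsing is deliberately simplistic, and the UTF-8 default follows the statement elsewhere in this thread that Solr defaults to UTF-8.

```python
def declared_charset(header_block, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header, if any."""
    for line in header_block.split("\r\n"):
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            for param in value.split(";")[1:]:
                key, _, val = param.partition("=")
                if key.strip().lower() == "charset":
                    return val.strip().strip('"')
    return default


def body_matches_declared_charset(raw_request):
    """True if the request body decodes cleanly with the charset it declares."""
    head, _, body = raw_request.partition(b"\r\n\r\n")
    charset = declared_charset(head.decode("iso-8859-1"))
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


headers = b"POST /solr/update HTTP/1.1\r\nContent-Type: text/xml; charset=utf-8\r\n\r\n"
doc = "<add><doc><field name=\"title\">it\u2019s</field></doc></add>"

# Body encoded as declared, versus a body encoded as Windows-1252.
print(body_matches_declared_charset(headers + doc.encode("utf-8")))
print(body_matches_declared_charset(headers + doc.encode("cp1252")))
```

A clean decode does not prove the text is what was intended, but a failed one shows that the header and the bytes disagree.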