Re: To make sure XML is UTF-8
Hi Jeffrey,

With your database connection in latin-1, how did you manage to get your data properly into UTF-8? And how did you manage stemming and everything else?

Thanks a lot.

Tiong Jeffrey wrote:
: Hi Ajanta, thanks! Since I used PHP, I managed to use the PHP decode function to change it to UTF-8. But just a question: even if we change the MySQL default char-set to UTF-8, if the input is originally in another format, the MySQL engine won't help to convert it to UTF-8, right? I think my question is, what is the use of defining the char-set in MySQL other than for labeling purposes?
:
: Thanks!
: Jeffrey

--
View this message in context: http://www.nabble.com/To-make-sure-XML-is-UTF-8-tp11031646p20093197.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: To make sure XML is UTF-8
Hi,

Not sure if you've found a solution for your problem yet, but I dealt with a similar issue to the one mentioned below, and hopefully this will help you too. Of course, this assumes that your original data is in UTF-8.

The default character set for MySQL is latin1 and our display format was UTF-8, and that was the problem. These are the steps I performed to get the search data in UTF-8.

Changed my.cnf as follows (though we can avoid this by executing commands on every new connection if we don't want the whole db in UTF-8):

Under [mysqld], added:

    # setting default charset to utf-8
    collation_server=utf8_unicode_ci
    character_set_server=utf8
    default-character-set=utf8

Under [client], added:

    default-character-set=utf8

After changing this, I restarted mysqld, re-created the db, re-inserted all the data using my data insert code (a Java program), and re-created the Solr index.

The key is to change the settings in both the [mysqld] and [client] sections of my.cnf: the mysqld setting makes sure that MySQL doesn't convert the data to latin1 while storing it, and the client setting ensures that the data is not converted on its way into or out of the server.

Ajanta.

Tiong Jeffrey wrote:
: Ya you are right! After I change it to UTF-8 the error is still there... I looked at the log, and this is what appears:
:
: 127.0.0.1 - - [10/06/2007:03:52:06 +] POST /solr/update HTTP/1.1 500 4022
:
: I tried to search but couldn't understand what error this is. Does anybody have any idea? Thanks!!!
:
: On 6/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:
: : I *strongly* encourage you to look at the body of the response and/or the error log of your Servlet container and find out *exactly* what the cause of the error is ... you could spend a lot of time working on this and discover it's not your real problem.
: :
: : -Hoss
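To make the mismatch Ajanta describes concrete, here is a minimal Python sketch. It involves no MySQL at all: it only models what happens when UTF-8 bytes are read as if they were Latin-1, and how that can be undone if nothing else touched the data. The function names are made up for this illustration, and Python's `latin-1` codec is only a stand-in for whatever MySQL actually does with its `latin1` character set.

```python
# Illustrative model of a charset mismatch; no database is involved.
# Function names are hypothetical, chosen for this sketch only.


def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Latin-1 (the mismatch)."""
    return text.encode("utf-8").decode("latin-1")


def repair_misread(garbled: str) -> str:
    """Reverse misread_as_latin1, assuming the text was not altered afterwards."""
    return garbled.encode("latin-1").decode("utf-8")


original = "café"
garbled = misread_as_latin1(original)
print(garbled)                  # cafÃ©
print(repair_misread(garbled))  # café
```

Pure ASCII text survives the mismatch unchanged, which is one reason this kind of problem tends to show up only partway through a data set.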
Re: To make sure XML is UTF-8
Hi Ajanta, thanks!

Since I used PHP, I managed to use the PHP decode function to change it to UTF-8. But just a question: even if we change the MySQL default char-set to UTF-8, if the input is originally in another format, the MySQL engine won't help to convert it to UTF-8, right? I think my question is, what is the use of defining the char-set in MySQL other than for labeling purposes?

Thanks!
Jeffrey

On 6/13/07, Ajanta Phatak [EMAIL PROTECTED] wrote:
: [...]
: The key is to change the settings for both the mysqld and client sections in my.cnf - the mysqld setting is to make sure that mysql doesn't convert it to latin1 while storing the data and the client setting is to ensure that the data is not converted while accessing - going in or coming out from the server.
: [...]
Re: To make sure XML is UTF-8
: Ya you are right! After I change it to UTF-8 the error still there... I
: looked at the log, this is what it appears,
:
: 127.0.0.1 - - [10/06/2007:03:52:06 +] POST /solr/update HTTP/1.1 500
: 4022

That looks like the access log, not the error log. Solr is logging the details of what went wrong (and it should be putting those details in the body of the response as well). Either find where your servlet container is logging Solr's messages, or change your client to tell you what the body of the 500 error says. (The log file is the most useful, because there may be other errors in it, not specific to this request, that you should be aware of as well.)

-Hoss
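As a rough illustration of "change your client to tell you what the body of the 500 error says", here is a Python sketch using only the standard library. It is not the client from this thread (that one appears to be Java); `post_update` and `read_error_body` are hypothetical names, and the Content-Type header is an assumption about what the update handler expects.

```python
import urllib.error
import urllib.request
from typing import Tuple


def read_error_body(err: urllib.error.HTTPError) -> str:
    """Return the response body carried by an HTTP error, as text."""
    return err.read().decode("utf-8", errors="replace")


def post_update(url: str, xml_bytes: bytes) -> Tuple[int, str]:
    """POST xml_bytes to url and return (status code, response body)."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,  # supplying data makes this a POST
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
    )
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        # A 4xx/5xx reply raises HTTPError, which still carries the body
        # the server sent; that body is where the real cause should be.
        return err.code, read_error_body(err)


# Example call (not executed here; the URL is the one from the thread):
# status, body = post_update("http://localhost/solr/update", b"<add>...</add>")
# print(status, body)
```

This only surfaces the response body; as Hoss says, the servlet container's error log is still the better place to look.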
Re: To make sure XML is UTF-8
This is what the whole process looks like:

1. I have a web page that I want to index. So I first copy that web page, break it down into different sections, and store them in MySQL in different columns.
2. I then wrote a small PHP script that draws all the values from all the fields in MySQL and writes them into an XML file.
3. I then use Solr to index this XML file, and the error that appears halfway through indexing is:
   FATAL: Connection error (is Solr running at http://localhost/solr/update ?): java.io.IOException: Server returned HTTP Response code: 500 for URL: http://local/solr/update;
4. Although the error code doesn't say that it is an XML UTF-8 encoding error, I did a bit of research and looked at the XML file that I have, and it is not valid UTF-8.

I have been trying these for a couple of hours, but still to no avail. I would like to find out:

1. How do I know what encoding the web page that I copy into MySQL uses?
2. At what point in this whole process should I convert it to UTF-8?

I tried changing the collation in MySQL for all the columns from latin1-swedish to UTF-8, but it still doesn't work.

Thanks

On 6/9/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
: How do you generate XML output? Output itself is usually a raw byte array, it uses Transport and Encoding. If you save it in a file system and forget about transport-layer-encoding you will get some new problems...
:
: during indexing the XML output is not working - what exactly happens, which kind of error messages?
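One way to check whether a generated XML file really is valid UTF-8 (step 4 above), and where it first goes wrong, is to decode its bytes strictly and look at the failure position. The following is a minimal Python sketch; `first_invalid_utf8` is a hypothetical helper name, and the file name in the usage comment is a placeholder.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (byte offset, surrounding bytes) of the first invalid UTF-8
    sequence in data, or None if all of it decodes cleanly."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as err:
        context = data[max(0, err.start - 10): err.end + 10]
        return err.start, context
    return None


# Example call (not executed here; "export.xml" is a placeholder name):
# with open("export.xml", "rb") as handle:
#     print(first_invalid_utf8(handle.read()))
```

The surrounding bytes usually make it obvious which database column the bad value came from.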
Re: To make sure XML is UTF-8
: I would like to find out
: 1. How to know the webpage that I copy into my mysql is what coding?

The charset can be in the response header, and/or the meta tags for the page. See http://krugle.com/kse/files/svn/svn.apache.org/lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java for code used by Nutch for this.

Or it could be missing from both. Or it could be wrong for either/both. The issue of determining the right charset for an arbitrary web page isn't an easy one. If you have some way of doing analysis in advance such that you know for sure it's always X, that's going to simplify things for you.

: 2. at what point of this whole process should I convert it to UTF-8?

As soon as possible - which means right when you're processing the page.

: I tried change the collation in mysql for all the columns to UTF-8 from
: latin1-swedish, but it still doesnt work

Collation settings in the DB change how the DB interprets the data, but they don't change the data itself.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
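Ken's two points (look at the response header and the meta tags, then convert as early as possible) could be sketched in Python roughly as below. This is not the Nutch code he links to. All function names are hypothetical, the regular expressions are deliberately simple, and the Latin-1 fallback is an assumption. As Ken warns, a declared charset can be missing or wrong, and this sketch does nothing to detect a wrong one.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older http-equiv form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def charset_from_content_type(header: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not header:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', header, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html: bytes) -> Optional[str]:
    """Look for a charset declaration in the first few KB of the page."""
    match = _META_CHARSET.search(html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def to_utf8(html: bytes, content_type: Optional[str],
            fallback: str = "latin-1") -> bytes:
    """Convert a fetched page to UTF-8 bytes: header first, then meta tag,
    then the (assumed) fallback charset."""
    charset = (charset_from_content_type(content_type)
               or charset_from_meta(html)
               or fallback)
    try:
        text = html.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or the declaration did not match the bytes.
        text = html.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

Running the conversion at fetch time means everything stored afterwards, including the database rows and the XML export, only ever has to deal with one encoding.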
Re: To make sure XML is UTF-8
: 2. I then wrote a small PHP script that draw all the value from all the
: fields from mysql and then write it into an xml file

You might find the utf8_encode and utf8_decode PHP functions useful:

http://nz2.php.net/utf8_encode
http://nz2.php.net/utf8_decode

    $utf8string = utf8_encode($row['column']);

-Nick

On 6/10/07, Ken Krugler [EMAIL PROTECTED] wrote:
: : 2. at what point of this whole process should I convert it to UTF-8?
:
: As soon as possible - which means right when you're processing the page.
: [...]
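One caveat with Nick's suggestion: PHP's utf8_encode treats its input as ISO-8859-1, so running it over data that is already UTF-8 encodes it a second time. The Python sketch below mirrors that behavior and adds a guard. Both function names are hypothetical, and the guard assumes that anything which is not valid UTF-8 is Latin-1, which may not hold for a given data set.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Rough analogue of PHP's utf8_encode: treat every byte as ISO-8859-1."""
    return raw.decode("latin-1").encode("utf-8")


def ensure_utf8(raw: bytes) -> bytes:
    """Convert only when the bytes are not already valid UTF-8.
    Assumes non-UTF-8 input is Latin-1."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        return latin1_to_utf8(raw)


print(latin1_to_utf8(b"caf\xe9"))      # b'caf\xc3\xa9'
print(latin1_to_utf8(b"caf\xc3\xa9"))  # b'caf\xc3\x83\xc2\xa9' (double-encoded)
print(ensure_utf8(b"caf\xc3\xa9"))     # b'caf\xc3\xa9'
```

The guard is a heuristic, not a proof: some Latin-1 byte sequences happen to be valid UTF-8 too, so knowing the source encoding up front remains the more reliable approach.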
Re: To make sure XML is UTF-8
: way during indexing is - FATAL: Connection error (is Solr running at
: http://localhost/solr/update
: ?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
: http://local/solr/update;
: 4.Although the error code doesnt specify is XML utf-8 code error, but I did
: a bit research, and look at the XML file that i have, it doesn't fulfill the
: utf-8 encoding

I *strongly* encourage you to look at the body of the response and/or the error log of your Servlet container and find out *exactly* what the cause of the error is ... you could spend a lot of time working on this and discover it's not your real problem.

-Hoss
Re: To make sure XML is UTF-8
Tiong Jeffrey wrote:
: Though this is not directly related to Solr, I have an XML output from a MySQL database, but during indexing the XML output is not working. The problem is that part of the XML output is not in UTF-8 encoding. How can I convert it to UTF-8, and how do I know what kind of encoding it uses in the first place (the data I export from the MySQL database)? Thanks!

You won't have any problem with standard JAXP and java.util.* etc. classes, even with complex MySQL data (one column is LATIN1, another is LATIN2, another is ASCII, ...).

In Java, use standard classes: String, Long, Date. And use JAXP.
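The advice above is about Java and JAXP. The same idea, letting an XML library do the serialization and encoding instead of concatenating strings, looks roughly like this in Python with the standard library. `build_add_xml` is a hypothetical name, the `<add>`/`<doc>`/`<field>` layout follows Solr's XML update format, and the field names in the usage lines are invented. The values are assumed to have been decoded into proper text already.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def build_add_xml(rows: Iterable[Dict[str, str]]) -> bytes:
    """Serialize rows (field name -> text value) as an <add> document,
    encoded as UTF-8, with markup characters escaped by the library."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    return ET.tostring(add, encoding="utf-8")


# Field names "id" and "title" are placeholders for this illustration.
xml_bytes = build_add_xml([{"id": "1", "title": "café & crème"}])
print(xml_bytes.decode("utf-8"))
```

The library escapes characters such as `&` and `<` and produces UTF-8 bytes, but it cannot repair values that were decoded with the wrong charset before they reached it.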
Re: To make sure XML is UTF-8
: Though this is not directly related to Solr, I have an XML output from a MySQL database, but during indexing the XML output is not working. The problem is that part of the XML output is not in UTF-8 encoding. How can I convert it to UTF-8, and how do I know what kind of encoding it uses in the first place (the data I export from the MySQL database)? Thanks!

How do you generate the XML output? The output itself is usually a raw byte array; it uses Transport and Encoding. If you save it in a file system and forget about transport-layer encoding, you will get some new problems...

"during indexing the XML output is not working" - what exactly happens, and which kind of error messages do you get?