Re: To make sure XML is UTF-8

2008-10-21 Thread sunnyfr

Hi Jeffrey,

How did you manage, with your database connection in latin-1, to get your
information properly into utf-8, and to manage stemming and everything?

Thanks a lot,

How did you manage if 


-- 
View this message in context: 
http://www.nabble.com/To-make-sure-XML-is-UTF-8-tp11031646p20093197.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: To make sure XML is UTF-8

2007-06-12 Thread Ajanta Phatak

Hi

Not sure if you've had a solution for your problem yet, but I had dealt 
with a similar issue that is mentioned below and hopefully it'll help 
you too. Of course, this assumes that your original data is in utf-8 format.


The default charset encoding for MySQL is latin1 and our display format 
was utf-8, and that was the problem. These are the steps I performed to 
get the search data in utf-8 format.


Changed the my.cnf as follows (though we can avoid this by executing commands 
on every new connection if we don't want the whole db in utf-8):


Under: [mysqld] added:
# setting default charset to utf-8
collation_server=utf8_unicode_ci
character_set_server=utf8
default-character-set=utf8

Under: [client]
default-character-set=utf8

After changing, restarted mysqld, re-created the db, re-inserted all the 
data again in the db using my data insert code (java program) and 
re-created the Solr index. The key is to change the settings for both 
the mysqld and client sections in my.cnf - the mysqld setting is to make 
sure that mysql doesn't convert it to latin1 while storing the data and 
the client setting is to ensure that the data is not converted while 
accessing - going in or coming out from the server.
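
The conversion described above can be reproduced outside MySQL. This is a minimal sketch in Python (not the Java insert program mentioned here) with a made-up sample value. Python's latin-1 codec stands in for MySQL's latin1, which differs from it for a few byte values. It shows what happens when UTF-8 bytes pass through a connection that believes they are latin1:

```python
# Hypothetical sample value; any non-ASCII text shows the same effect.
original = "café"

# Bytes as a UTF-8 client would send them.
utf8_bytes = original.encode("utf-8")

# A latin1 connection reads every byte as one character ...
misread = utf8_bytes.decode("latin-1")

# ... and a later conversion to UTF-8 encodes those wrong characters again.
double_encoded = misread.encode("utf-8")

# The damage is reversible as long as no byte was dropped or replaced.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(repr(misread))
print(double_encoded)
print(repaired)
```

When the server and the client settings agree, neither of the two conversions in the middle takes place.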


Ajanta.


Tiong Jeffrey wrote:

Ya you are right! After I change it to UTF-8 the error still there... I
looked at the log, this is what it appears,

127.0.0.1 -  -  [10/06/2007:03:52:06 +] POST /solr/update HTTP/1.1 500 4022

I tried to search but couldn't understand what error is this, anybody has
any idea on this?

Thanks!!!

On 6/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: way during indexing is - FATAL: Connection error (is Solr running at
: http://localhost/solr/update
: ?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
: http://local/solr/update;
: 4.Although the error code doesnt specify is XML utf-8 code error, but I did
: a bit research, and look at the XML file that i have, it doesn't fulfill the
: utf-8 encoding

I *strongly* encourage you to look at the body of the response and/or the
error log of your Servlet container and find out *exactly* what the cause
of the error is ... you could spend a lot of time working on this and
discover it's not your real problem.



-Hoss





Re: To make sure XML is UTF-8

2007-06-12 Thread Tiong Jeffrey

Hi Ajanta,

thanks! Since I used PHP, I managed to use the PHP decode function to change
it to UTF-8.

But just a question: even if we change the mysql default char-set to UTF-8, if
the input is originally in another format, the mysql engine won't convert it
to UTF-8, right? I think my question is, what is the use of defining the
char-set in mysql other than for labeling purposes?

Thanks!

Jeffrey






Re: To make sure XML is UTF-8

2007-06-10 Thread Chris Hostetter
: Ya you are right! After I change it to UTF-8 the error still there... I
: looked at the log, this is what it appears,
:
: 127.0.0.1 -  -  [10/06/2007:03:52:06 +] POST /solr/update HTTP/1.1 500
: 4022

That looks like the access log, not the error log. Solr is logging the
details of what went wrong (and it should be putting those details in the
body of the response as well). Either find where your servlet container
is logging Solr's messages or change your client to tell you what the body
of the 500 error says.

(The log file is the most useful because there may be other errors in it
that you should be aware of as well, not specific to this request.)
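
For the second option, a client only has to read the response body when the server answers with an error status. This is a minimal sketch in Python, assuming a plain HTTP POST of the XML. The helper name `post_xml` is hypothetical and not part of any Solr client:

```python
import urllib.error
import urllib.request


def post_xml(url, xml_bytes):
    """POST XML to url and return (status, body text), even on HTTP errors."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,  # supplying data makes this a POST request
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # HTTPError doubles as a response object: its body carries the
        # server's own explanation of what went wrong.
        return error.code, error.read().decode("utf-8", "replace")
```

Calling `post_xml("http://localhost/solr/update", xml_bytes)` and printing the returned body should show whatever message the server attached to the 500 response.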



-Hoss



Re: To make sure XML is UTF-8

2007-06-09 Thread Tiong Jeffrey

This is what the whole process looks like -

1. I have a web page that I want to index. So I first copy that web page,
break it down into different sections, and store them in mysql in different
columns
2. I then wrote a small PHP script that pulls all the values from all the
fields in mysql and writes them into an xml file
3. I then use solr to index this xml file, and the error that appears half
way through indexing is - FATAL: Connection error (is Solr running at
http://localhost/solr/update
?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
http://local/solr/update;
4. Although the error code doesn't say it is an XML utf-8 encoding error, I did
a bit of research and looked at the XML file that I have, and it isn't valid
utf-8

I have been trying this for a couple of hours, but still to no avail. I would
like to find out
1. How do I know what encoding the web page that I copy into mysql uses?
2. At what point in this whole process should I convert it to UTF-8? I tried
changing the collation in mysql for all the columns from latin1-swedish to
UTF-8, but it still doesn't work

Thanks

On 6/9/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:


 Thought this is not directly related to Solr, but I have a XML output from
 mysql database, but during indexing the XML output is not working. And the
 problem is part of the XML output is not in UTF-8 encoding, how can I
 convert it to UTF-8 and how do I know what kind of coding it uses in the
 first place (the data I export from the mysql database). Thanks!

How do you generate XML output? Output itself is usually a raw byte
array, it uses Transport and Encoding. If you save it in a file
system and forget about transport-layer-encoding you will get some
new problems...

 during indexing the XML output is not working
- what exactly happens, which kind of error messages?





Re: To make sure XML is UTF-8

2007-06-09 Thread Ken Krugler

This is how the whole process looks like -

1. I have a web page that I want to index. So I first copy that web page,
breaking it down to different section, and store it in mysql into different
column
2. I then wrote a small PHP script that draw all the value from all the
fields from mysql and then write it into an xml file
3. I then use solr to index this xml file, and the error that appears half
way during indexing is - FATAL: Connection error (is Solr running at
http://localhost/solr/update
?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
http://local/solr/update;
4.Although the error code doesnt specify is XML utf-8 code error, but I did
a bit research, and look at the XML file that i have, it doesn't fulfill the
utf-8 encoding

I have been trying these for couple of hours, but still to no avail. I would
like to find out
1. How to know the webpage that I copy into my mysql is what coding?


The charset can be in the response header, and/or the meta tags for 
the page. See 
http://krugle.com/kse/files/svn/svn.apache.org/lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 
for code used by Nutch for this.


Or it could be missing from both. Or it could be wrong for either/both.

The issue of determining the right charset for an arbitrary web page 
isn't an easy one. If you have some way of doing analysis in advance 
such that you know for sure it's always X, that's going to simplify 
things for you.
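
A much simplified sketch of that lookup, in Python, is below. It is not the Nutch code linked above. The function name `declared_charset` and the 2048-byte scan limit are assumptions made for illustration, and the result is only what the page claims, which may be missing or wrong as noted:

```python
import re

# Matches charset=... in a Content-Type value or inside a <meta> tag.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def declared_charset(content_type, html_bytes, scan_limit=2048):
    """Return the charset a page declares, or None if it declares none.

    The HTTP header is checked first, then <meta> tags found in the
    first scan_limit bytes of the document.
    """
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            return match.group(1).lower()

    # Charset names are ASCII, so for ASCII-compatible encodings the head
    # can be scanned as ASCII with all other bytes ignored.
    head = html_bytes[:scan_limit].decode("ascii", "ignore")
    for tag in re.findall(r"<meta[^>]*>", head, re.IGNORECASE):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None
```

A `None` result means a default has to be chosen or the encoding guessed from the bytes themselves.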



2. at what point of this whole process should I convert it to UTF-8?


As soon as possible - which means right when you're processing the page.


I tried
change the collation in mysql for all the columns to UTF-8 from
latin1-swedish, but it still doesnt work


Collation settings in the DB change how the DB interprets the data, 
but it doesn't change the data itself.
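
The difference between reinterpreting bytes and converting them can be shown with a single byte. This Python sketch illustrates the general point only, not MySQL's internal behaviour, and the sample byte is made up:

```python
# One byte, 0xE9: the letter "é" in latin1.
stored = b"\xe9"

# Relabeling: the bytes stay the same and are merely read as UTF-8.
try:
    stored.decode("utf-8")
    relabel_ok = True
except UnicodeDecodeError:
    relabel_ok = False  # 0xE9 on its own is not a valid UTF-8 sequence

# Converting: decode with the encoding the data really has, then re-encode.
converted = stored.decode("latin-1").encode("utf-8")

print(relabel_ok)
print(converted)
```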


-- Ken




--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it


Re: To make sure XML is UTF-8

2007-06-09 Thread Nick Jenkin

2. I then wrote a small PHP script that draw all the value from all the
fields from mysql and then write it into an xml file


You might find the utf8_encode and utf8_decode PHP functions useful:
http://nz2.php.net/utf8_encode
http://nz2.php.net/utf8_decode

$utf8string = utf8_encode($row['column']);
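
Note that utf8_encode assumes its input is ISO-8859-1, so applying it to text that is already UTF-8 encodes it a second time. A guarded conversion is sketched here in Python rather than PHP. The helper name `ensure_utf8` is hypothetical, and the latin-1 fallback is an assumption about the data:

```python
def ensure_utf8(raw, fallback="latin-1"):
    """Return raw as UTF-8 bytes, converting only when it is not valid UTF-8."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Assumption: anything that is not UTF-8 is in the fallback encoding.
        return raw.decode(fallback).encode("utf-8")
```

This is a heuristic: a few latin-1 byte sequences also happen to be valid UTF-8, so knowing the source encoding for certain is still the safer route.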

-Nick




Re: To make sure XML is UTF-8

2007-06-09 Thread Chris Hostetter
: way during indexing is - FATAL: Connection error (is Solr running at
: http://localhost/solr/update
: ?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
: http://local/solr/update;
: 4.Although the error code doesnt specify is XML utf-8 code error, but I did
: a bit research, and look at the XML file that i have, it doesn't fulfill the
: utf-8 encoding

I *strongly* encourage you to look at the body of the response and/or the
error log of your Servlet container and find out *exactly* what the cause
of the error is ... you could spend a lot of time working on this and
discover it's not your real problem.



-Hoss


Re: To make sure XML is UTF-8

2007-06-08 Thread Funtick


Tiong Jeffrey wrote:
 
 Thought this is not directly related to Solr, but I have a XML output from
 mysql database, but during indexing the XML output is not working. And the
 problem is part of the XML output is not in UTF-8 encoding, how can I
 convert it to UTF-8 and how do I know what kind of coding it uses in the
 first place (the data I export from the mysql database). Thanks!
 

You won't have any problem with standard JAXP and java.util.* etc. classes,
even with complex MySQL data (one column is LATIN1, another is LATIN2,
another is ASCII, ...)

In Java, use standard classes: String, Long, Date. And use JAXP.
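
The same idea is to let an XML library handle escaping and encoding rather than concatenating strings by hand. It is sketched here in Python rather than with Java and JAXP. The field names `id` and `title` and the sample row are hypothetical; the add/doc/field layout is Solr's XML update format:

```python
import io
import xml.etree.ElementTree as ET


def rows_to_solr_xml(rows):
    """Serialize dicts of already-decoded text as a UTF-8 Solr <add> document."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value  # str, not bytes: decode before this point
    buffer = io.BytesIO()
    # The library escapes markup characters and writes bytes that match
    # the encoding named in the XML declaration.
    ET.ElementTree(add).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()


xml_bytes = rows_to_solr_xml([{"id": "1", "title": "Café & bar"}])
print(xml_bytes.decode("utf-8"))
```

Because the values are text strings by the time they reach the serializer, columns stored in different character sets only need to be decoded correctly when they are read.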



Re: To make sure XML is UTF-8

2007-06-08 Thread funtick

Thought this is not directly related to Solr, but I have a XML output from
mysql database, but during indexing the XML output is not working. And the
problem is part of the XML output is not in UTF-8 encoding, how can I
convert it to UTF-8 and how do I know what kind of coding it uses in the
first place (the data I export from the mysql database). Thanks!


How do you generate XML output? Output itself is usually a raw byte  
array, it uses Transport and Encoding. If you save it in a file  
system and forget about transport-layer-encoding you will get some  
new problems...



during indexing the XML output is not working

- what exactly happens, which kind of error messages?