This is what the whole process looks like:

1. I have a web page that I want to index. I first copy that web page,
break it down into different sections, and store them in MySQL in
different columns.
2. I then wrote a small PHP script that reads the values from all the
fields in MySQL and writes them into an XML file.
3. I then use Solr to index this XML file, and the error that appears
halfway through indexing is: "FATAL: Connection error (is Solr running at
http://localhost/solr/update ?): java.io.IOException: Server returned
HTTP Response code: 500 for URL: http://local/solr/update"
4. Although the error doesn't say that it is an XML UTF-8 encoding error,
I did a bit of research and looked at the XML file I have, and it isn't
valid UTF-8.

I have been trying this for a couple of hours, but to no avail. I would
like to find out:
1. How do I know what encoding the web page that I copy into MySQL uses?

The charset can be in the response header, and/or the meta tags for the page. See http://krugle.com/kse/files/svn/svn.apache.org/lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java for code used by Nutch for this.

Or it could be missing from both. Or it could be wrong for either/both.

The issue of determining the right charset for an arbitrary web page isn't an easy one. If you have some way of doing analysis in advance such that you know for sure it's always X, that's going to simplify things for you.
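
As a rough illustration of the header-then-meta lookup described above, here is a minimal Python sketch. It is not the Nutch code; the function names and the `iso-8859-1` default are assumptions made for the example, and it does nothing clever when the two sources disagree or are wrong.

```python
import re

# Matches charset=... in either a Content-Type header value or a <meta> tag.
_CHARSET_RE = r'charset\s*=\s*["\']?([\w.:-]+)'


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    if not header_value:
        return None
    match = re.search(_CHARSET_RE, header_value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    # Only the first few KB are scanned; markup is ASCII-compatible in most encodings.
    head = html_bytes[:4096].decode("ascii", errors="replace")
    match = re.search(r"<meta[^>]+" + _CHARSET_RE, head, re.IGNORECASE)
    return match.group(1).lower() if match else None


def guess_charset(header_value, html_bytes, default="iso-8859-1"):
    """Prefer the response header, then the meta tag, then an assumed default."""
    return (
        charset_from_content_type(header_value)
        or charset_from_meta(html_bytes)
        or default
    )
```

If both are missing, the default is only a guess, which is why knowing the charset in advance simplifies things so much.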

2. At what point in this whole process should I convert it to UTF-8?

As soon as possible - which means right when you're processing the page.
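
A minimal sketch of converting at processing time, in Python: decode the raw page bytes with the charset you determined, then keep only UTF-8 from that point on. The function name, the order of attempts, and the `windows-1252` fallback are assumptions for illustration, not a recommendation for every site.

```python
def to_utf8(raw_bytes, declared_charset=None, fallback="windows-1252"):
    """Decode raw page bytes and return the same text encoded as UTF-8."""
    # Try the declared charset first (if any), then UTF-8 itself.
    candidates = [name for name in (declared_charset, "utf-8") if name]
    for name in candidates:
        try:
            text = raw_bytes.decode(name)
            break
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset name: move on to the next candidate.
            continue
    else:
        # Last resort (assumed fallback); undecodable bytes become U+FFFD.
        text = raw_bytes.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

Everything stored in the database and written to the XML file would then come from the return value of `to_utf8`, never from the original bytes.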

I tried changing the collation in MySQL for all the columns from
latin1-swedish to UTF-8, but it still doesn't work.

Collation settings in the DB change how the DB interprets the data, but they don't change the data itself.
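
Since the stored bytes are what matter, one way to see the problem is to check them directly. A small Python sketch (the function name is made up for this example) that reports where a byte string stops being valid UTF-8:

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None
```

Running this over the contents of the exported XML file, read in binary mode, shows whether the file is really UTF-8 and, if not, roughly where to look.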

-- Ken


On 6/9/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

 Though this is not directly related to Solr: I have XML output from a
 MySQL database, but indexing that XML output is not working. The
 problem is that part of the XML output is not in UTF-8 encoding. How can I
 convert it to UTF-8, and how do I know what encoding it uses in the
 first place (the data I export from the MySQL database)? Thanks!

How do you generate the XML output? "Output" itself is usually a raw byte
array; it uses a "Transport" and an "Encoding". If you save it to a file
system and forget about the "transport-layer encoding", you will get some
new problems...
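
To make the point about bytes and encoding concrete, here is a hedged Python sketch that builds a Solr-style `<add>` document and serializes it to bytes with an explicit UTF-8 encoding and XML declaration. The function name and the field names used in the usage lines are hypothetical; the original script is PHP, so this only illustrates the idea.

```python
import io
import xml.etree.ElementTree as ET


def build_solr_add_xml(rows):
    """Serialize rows (dicts of field name -> text) as UTF-8 encoded <add> XML bytes."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    buffer = io.BytesIO()
    # The encoding is stated explicitly and also recorded in the XML declaration.
    ET.ElementTree(add).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()


# Hypothetical row; the values must already be decoded text, not raw bytes.
xml_bytes = build_solr_add_xml([{"id": "1", "title": "caf\u00e9"}])
```

Writing `xml_bytes` to a file opened in binary mode keeps the declared encoding and the actual bytes in agreement.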

 during indexing the XML output is not working
- what exactly happens, which kind of error messages?


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
