#50139 [Opn->Fbk]: text in UTF-8 encoded xml cut off by xml parser with German umlauts

2009-11-11 Thread jani
 ID:   50139
 Updated by:   j...@php.net
 Reported By:  gros at mpdl dot mpg dot de
-Status:   Open
+Status:   Feedback
 Bug Type: XML Reader
 Operating System: Mac OS-X 10.6.2
 PHP Version:  5.3.0
 New Comment:

And please provide the complete script you used. It works fine for me
with a very crude script.


Previous Comments:


[2009-11-11 12:46:47] gros at mpdl dot mpg dot de

Thanks, but the file does declare its encoding, actually. Both in the
header (application/xml) and in the file:



And also using 

$xml_parser = xml_parser_create("UTF-8");

does not help!



[2009-11-11 12:42:16] j...@php.net

Duh, I missed the very first line in your XML file. :)
So what you're actually reporting is that the input encoding isn't
detected properly?



[2009-11-11 12:41:12] j...@php.net

It might work better if your XML file declared its encoding OR if you
passed the input encoding to xml_parser_create().



[2009-11-10 18:02:57] gros at mpdl dot mpg dot de

Just to add:
I also used curl for fetching this piece of xml and the result was the
same.



[2009-11-10 17:59:10] gros at mpdl dot mpg dot de

Description:

When parsing an xml file with UTF-8 encoding (like this one:
http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
Kaiser Wilhelm Institut für
Züchtungsforschung

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
Societäts-Verlag

results in "äts-Verlag"


Reproduce code:
---
$snippet = file_get_contents("http://bit.ly/3PSi44");

if (!($xml_parser = xml_parser_create("")))
    die("Couldn't create parser.");

xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// The three handler functions named below are not included in the report.
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet))
{
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);
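
The report does not show the handler functions, so the cause of the
cut-off text cannot be confirmed from this thread. One possibility worth
checking: the PHP manual notes that the character data handler may be
called several times for a single piece of text (for example around
non-ASCII characters), so a handler that assigns instead of appending
would keep only one fragment. The following is a minimal sketch under
that assumption. The handler bodies, the $text and $results variables,
and the inline sample document are hypothetical and are not taken from
the reporter's script.

```php
<?php
// Hypothetical handlers; the originals are not shown in the report.
$text = "";          // buffer for the character data of the current element
$results = array();  // collected text, one entry per closed element

function startElementHandler($parser, $name, $attrs)
{
    global $text;
    $text = "";      // start a fresh buffer at each opening tag
}

function characterDataHandler($parser, $data)
{
    global $text;
    $text .= $data;  // append: the parser may deliver one text node in pieces
}

function endElementHandler($parser, $name)
{
    global $text, $results;
    $results[] = $text;  // the text is complete only once the element closes
    $text = "";
}

// Inline sample standing in for the remote file (assumed structure).
$snippet = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<publisher>Societäts-Verlag</publisher>';

$xml_parser = xml_parser_create("UTF-8");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// Third argument true marks this as the final (and only) chunk of input.
if (!xml_parse($xml_parser, $snippet, true)) {
    die(sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser)));
}

echo $results[0], "\n";
```

The sketch resets the buffer on every opening tag, which is enough for a
flat element like the sample but would need a stack of buffers for text
mixed with nested elements.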




Expected result:

I expect properly imported text, as outlined in the description:

parsing this:
Kaiser Wilhelm Institut für
Züchtungsforschung

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
Societäts-Verlag

should result in "Societäts-Verlag"

Actual result:
--
I get cut-off pieces of text when the text contains German umlauts (see
two examples in the description).

parsing this:
Kaiser Wilhelm Institut für
Züchtungsforschung

results in:
"Kaiser Wilhelm Institut f"

or parsing this
Societäts-Verlag

results in "äts-Verlag"





-- 
Edit this bug report at http://bugs.php.net/?id=50139&edit=1


