Re: Malformed XML with exotic characters

2011-02-03 Thread Markus Jelsma
Hi

I've seen almost all funky charsets, but Gothic is always trouble. I'm also 
unsure if it's really a bug in Solr. It could well be Xerces being unable 
to cope. Besides, most systems indeed don't cope well with Gothic. This mail 
client does, but my terminal can't find its cursor after (properly) displaying 
such text.
 
http://got.wikipedia.org/wiki/%F0%90%8C%B7%F0%90%8C%B0%F0%90%8C%BF%F0%90%8C%B1%F0%90%8C%B9%F0%90%8C%B3%F0%90%8C%B0%F0%90%8C%B1%F0%90%8C%B0%F0%90%8C%BF%F0%90%8D%82%F0%90%8C%B2%F0%90%8D%83/Haubidabaurgs

Thanks for the input.

Cheers,

On Tuesday 01 February 2011 19:59:33 Robert Muir wrote:
 Hi, it might only be a problem with your XML tools (e.g. Firefox).
 The problem here is characters outside of the Basic Multilingual Plane
 (in this case Gothic).
 XML tools typically fall apart on these portions of Unicode (in Lucene
 we recently reverted to a patched/hacked copy of Xerces specifically
 for this reason).
 
 If you care about characters outside of the Basic Multilingual Plane
 actually working, unfortunately you have to be very, very
 particular about what software you use... you can assume most
 software/setups WON'T work.
 For example, if you were to use MySQL's utf8 character set you would
 find it doesn't actually support all of UTF-8! In this case you would
 need to use the recent 'utf8mb4' or something instead, which is
 actually UTF-8!
 That's just one example of a well-used piece of software that suffers
 from issues like this; there are others.
 
 It's for reasons like these that, if support for these languages is
 important to you, I would stick with the most simple/textual methods
 for input and output, e.g. using things like CSV and JSON if you can.
 I would also fully test every component/jar in your application
 individually and, once you get it working, don't ever upgrade.
 
 In any case, if you are having problems with characters outside of the
 Basic Multilingual Plane, and you suspect it's actually a bug in Solr,
 please open a JIRA issue, especially if you can provide some way to
 reproduce it.
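
To make the point about the Basic Multilingual Plane concrete, here is a minimal Python sketch (not taken from the thread). It reports whether a character lies above U+FFFF and how many UTF-8 bytes and UTF-16 code units it needs. The helper names are made up for illustration, and the three-byte check models the limit of MySQL's legacy utf8 character set mentioned above.

```python
def is_outside_bmp(ch: str) -> bool:
    """True if the code point lies above U+FFFF (outside the BMP)."""
    return ord(ch) > 0xFFFF


def utf8_length(ch: str) -> int:
    """Number of bytes the character needs in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units the character needs in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


def fits_three_byte_utf8(text: str) -> bool:
    """True if every character encodes to at most three UTF-8 bytes,
    which is all that a three-byte-only charset can store."""
    return all(utf8_length(ch) <= 3 for ch in text)


gothic = "\U00010332"  # a letter from the Gothic block
for ch in ("a", "\u00e4", "\u65e5", gothic):
    print(
        f"U+{ord(ch):04X} outside_bmp={is_outside_bmp(ch)} "
        f"utf8_bytes={utf8_length(ch)} utf16_units={utf16_units(ch)}"
    )
```

Characters outside the BMP need four UTF-8 bytes and a surrogate pair in UTF-16, which is where software that assumes "one character is one 16-bit unit" or "at most three bytes" tends to break.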
 


Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
There is an issue with the XML response writer. It cannot cope with some very 
exotic characters or possibly the right-to-left writing systems. The issue can 
be reproduced by indexing the content of the Wikipedia home page, as it 
contains a lot of exotic matter. The problem does not affect the JSON response 
writer.

The problem is, I am unsure whether this is a bug in Solr or whether 
Firefox itself trips over the response.


Here's the output of the JSONResponseWriter for a query returning the home 
page:
{
 "responseHeader":{
  "status":0,
  "QTime":1,
  "params":{
    "fl":"url,content",
    "indent":"true",
    "wt":"json",
    "q":"*:*",
    "rows":"1"}},
 "response":{"numFound":6744,"start":0,"docs":[
    {
     "url":"http://www.wikipedia.org/",
     "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles 
日本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel 
Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie libre 
1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей Italiano 
L’enciclopedia libera 768 000+ voci Português A enciclopédia livre 669 000+ 
artigos Polski Wolna encyklopedia 769 000+ haseł Nederlands De vrije 
encyclopedie 668 000+ artikelen Search  • Suchen  • Rechercher  • Szukaj  • 
Ricerca  • 検索  • Buscar  • Busca  • Zoeken  • Поиск  • Sök  • 搜尋  • Cerca  • 
Søk  • Haku  • Пошук  • Hledání  • Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara  
• Cari  • Søg  • بحث  • Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو  
• חיפוש  • Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky 
Dansk Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia 
Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk (bokmål) 
Polski Português Română Русский Slovenčina Slovenščina Српски / Srpski Suomi 
Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文   100 000+   العربية  
• Български  • Català  • Česky  • Dansk  • Deutsch  • English  • Español  • 
Esperanto  • فارسی  • Français  • 한국어  • Bahasa Indonesia  • Italiano  • עברית  
• Lietuvių  • Magyar  • Bahasa Melayu  • Nederlands  • 日本語  • Norsk (bokmål)  
• Polski  • Português  • Русский  • Română  • Slovenčina  • Slovenščina  • 
Српски / Srpski  • Suomi  • Svenska  • Türkçe  • Українська  • Tiếng Việt  • 
Volapük  • Winaray  • 中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  • 
Asturianu  • Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • 
Беларуская 
( Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  • 
Brezhoneg  • Чăваш  
• Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  • Gaeilge  • Galego  • 
ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  • Íslenska  • Basa Jawa  • 
ಕನ್ನಡ  • 
ქართული  • Kurdî / كوردی  • Latina  • Latviešu  • Lëtzebuergesch  • Lumbaart  
• Македонски  • മലയാളം  • मराठी  • नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • 
Nnapulitano  
• Occitan  • Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی 
پنجابی  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  • 
Srpskohrvatski / Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ்  
• తెలుగు  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa 
Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru  • 
Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  • Башҡорт  • भोजपुरी  • Bikol 
Central  • Boarisch  • བོད་ཡིག  • Chavacano de Zamboanga  • Corsu  • Deitsch  • 
ދިވެހި  • Diné Bizaad  • Eald Englisc  • Emigliàn–Rumagnòl  • Эрзянь  • 
Estremeñu  
• Fiji Hindi  • Føroyskt  • Furlan  • Gaelg  • Gàidhlig  • 贛語  • گیلکی  • Hak-
kâ-fa / 客家話  • Хальмг  • ʻŌlelo Hawaiʻi  • Hornjoserbsce  • Ilokano  • 
Interlingua  • Interlingue  • Ирон Æвзаг  • Kapampangan  • Kaszëbsczi  • 
Kernewek  • ភាសាខ្មែរ  • Kinyarwanda  • Коми  • Кыргызча  • Ladino / לאדינו  • 
Ligure  • Limburgs  • Lingála  • lojban  • Malagasy  • Malti  • 文言  • Māori  • 
مصرى  • مازِرونی / Mäzeruni  • Монгол  • မြန်မာဘာသာ  • Nāhuatlahtōlli  • 
Nedersaksisch  • Nouormand  • Novial  • Нохчийн  • Олык Марий  • O‘zbek  • पाऴि 
 
• Pangasinán  • ਪੰਜਾਬੀ / پنجابی  • Papiamentu  • پښتو  • Picard  • Къарачай–
Малкъар  • Қазақша  • Qırımtatarca  • Rumantsch  • Русиньскый Язык  • संस्कृतम् 
 • 
Sámegiella  • Sardu  • Саха Тыла  • Scots  • Seeltersk  • සිංහල  • Ślůnski  • 
Af 
Soomaali  • کوردی  • Tarandíne  • Татарча / Tatarça  • Тоҷикӣ  • Lea faka-
Tonga  • Türkmen  • Удмурт  • ᨅᨔ ᨕᨙᨁᨗ  • Uyghur / ئۇيغۇرچه  • Vèneto  • Võro  • 
West-Vlams  • Wolof  • 吴语  • ייִדיש  • Zazaki   100+   Akan  • Аҧсуа  • Авар  • 
Bamanankan  • Bislama  • Буряад  • Chamoru  • Chichewa  • Cuengh  • 
Dolnoserbski  • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
  • Hausa / هَوُسَا  • Igbo  • ᐃᓄᒃᑎᑐᑦ / Inuktitut  • Iñupiak  • 
Kalaallisut  • कश्मीरी / كشميري  • Kongo  • Кырык Мары  • ພາສາລາວ  • Лакку  • 
Luganda  • Mìng-dĕ̤ng-ngṳ̄  • Mirandés  • Мокшень  • Молдовеняскэ  • Na Vosa 
Vaka-Viti  • Dorerin Naoero  • Nēhiyawēwin / ᓀᐦᐃᔭᐍᐏᐣ  • Norfuk / Pitkern  • 

Re: Malformed XML with exotic characters

2011-02-01 Thread Stefan Matheis
Hi Markus,

To verify that it's not a Firefox issue, try running xmllint in your shell to
check the given XML.

Regards
Stefan
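
For anyone without xmllint at hand, the same kind of well-formedness check can be sketched with Python's standard-library parser. The two sample documents below are assumptions for illustration, not the actual Solr response: one writes a Gothic letter as a single character reference, the other writes it as two surrogate references, which XML does not allow.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical inputs: the same letter written two different ways.
good = "<doc>&#x10332;</doc>"         # one reference to the full code point
bad = "<doc>&#xD800;&#xDF32;</doc>"   # two references to UTF-16 surrogates

print("single code point reference:", is_well_formed(good))
print("surrogate pair references:  ", is_well_formed(bad))
```

If a saved response fails a check like this outside the browser, Firefox can be ruled out as the cause.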

On Tue, Feb 1, 2011 at 4:43 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 There is an issue with the XML response writer. It cannot cope with some very
 exotic characters or possibly the right-to-left writing systems. The issue can
 be reproduced by indexing the content of the home page of wikipedia as it
 contains a lot of exotic matter. The problem does not affect the JSON response
 writer.

 The problem is, i am unsure whether this is a bug in Solr or that perhaps
 Firefox itself trips over.



Re: Malformed XML with exotic characters

2011-02-01 Thread François Schiettecatte
Markus 

A few things to check. First, make sure whatever Solr is hosted on is outputting 
UTF-8 (URIEncoding="UTF-8" in the Connector section of server.xml on Tomcat, for 
example), which it appears to be here. Second, make sure the HTTP header tells 
Firefox that it is getting UTF-8 (otherwise it defaults to ISO-8859-1/Latin-1). 
Finally, make sure the font you use in Firefox has the 'exotic' characters you 
are expecting. There might also be some issues on your platform with mixing 
script direction, but that is not likely.
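
The header check can be sketched in Python with the standard library. This is only an illustration under assumptions: the header values below are examples, not what the Solr server in question actually sends, and the function name is made up.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value in lower case,
    or None if the header does not declare one."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Example header values (assumed, for illustration only).
for value in ("text/xml; charset=utf-8", "text/xml"):
    print(repr(value), "->", declared_charset(value))
```

A response whose header yields None here leaves the client to guess the encoding, which is the situation described above.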

Cheers

François

On Feb 1, 2011, at 10:43 AM, Markus Jelsma wrote:

 There is an issue with the XML response writer. It cannot cope with some very 
 exotic characters or possibly the right-to-left writing systems. The issue 
 can 
 be reproduced by indexing the content of the home page of wikipedia as it 
 contains a lot of exotic matter. The problem does not affect the JSON 
 response 
 writer.
 
 The problem is, i am unsure whether this is a bug in Solr or that perhaps 
 Firefox itself trips over.
 
 

Re: Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
It's throwing out a lot of disturbing messages:

select.xml:17: parser error : Char 0xD800 out of allowed range
ki  • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
ki  • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : Char 0xDF32 out of allowed range
 • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : PCDATA invalid Char value 57138
 • Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
�� Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
�� Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : Char 0xDF3F out of allowed range
Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : PCDATA invalid Char value 57151
Eʋegbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
egbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
egbe  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : Char 0xDF44 out of allowed range
e  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : PCDATA invalid Char value 57156
e  • Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
�• Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
�• Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : Char 0xDF39 out of allowed range
� Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : PCDATA invalid Char value 57145
� Frasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
rasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
rasch  • Fulfulde  • Gagauz  • Gĩkũyũ  • 
   ^
select.xml:17: parser error : Char 0xDF43 out of allowed range
ch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : PCDATA invalid Char value 57155
ch  • Fulfulde  • Gagauz  • Gĩkũyũ  • �
   ^
select.xml:17: parser error : Char 0xD800 out of allowed range
 • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : PCDATA invalid Char value 55296
 • Fulfulde  • Gagauz  • Gĩkũyũ  • ��
   ^
select.xml:17: parser error : Char 0xDF3A out of allowed range
�� Fulfulde  • Gagauz  • Gĩkũyũ  • ���
   ^
select.xml:17: parser error : PCDATA invalid Char value 57146
�� Fulfulde  • Gagauz  • Gĩkũyũ  • ���
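
The values in the xmllint output come in pairs: 0xD800 (55296) is a UTF-16 high surrogate, and values such as 0xDF32 (57138) are low surrogates. A minimal Python sketch, assuming each reported pair belongs together, combines them into the code points they stand for. The function name is made up for illustration.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Low surrogates in the order xmllint reported them, each preceded by 0xD800.
lows = [0xDF32, 0xDF3F, 0xDF44, 0xDF39, 0xDF43, 0xDF3A]
code_points = [combine_surrogates(0xD800, low) for low in lows]

print([f"U+{cp:05X}" for cp in code_points])
print("".join(chr(cp) for cp in code_points))
```

All six results fall in the Gothic block (U+10330 to U+1034F), so the parser is most likely complaining about a Gothic language name whose characters reached it as separate surrogates rather than as whole code points.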


On Tuesday 01 February 2011 17:00:19 Stefan Matheis wrote:
 Hi Markus,
 
 to verify that it's not an Firefox-Issue, try xmllint on your shell to
 check the given xml?
 
 Regards
 Stefan
 

Re: Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
Hi,

There are no typical encoding issues on my system. I can index, query and 
display English, German, Chinese, Vietnamese, etc.

Cheers

On Tuesday 01 February 2011 17:23:49 François Schiettecatte wrote:
 Markus
 
 A few things to check, make sure whatever SOLR is hosted on is outputting
 utf-8 ( URIEncoding=UTF-8 in the Connector section in server.xml on
 Tomcat for example), which it looks like here, also make sure that
 whatever http header there is tells firefox that it is getting utf-8
 (otherwise it defaults to iso-8859-1/latin-1), finally make sure that
 whatever font you use in firefox has the 'exotic' characters you are
 expecting. There might also be some issues on your platform with mixing
 script direction but that is probably not likely.
 
 Cheers
 
 François
 
 On Feb 1, 2011, at 10:43 AM, Markus Jelsma wrote:
  There is an issue with the XML response writer. It cannot cope with some
  very exotic characters or possibly the right-to-left writing systems.
  The issue can be reproduced by indexing the content of the home page of
  wikipedia as it contains a lot of exotic matter. The problem does not
  affect the JSON response writer.
  
  The problem is, i am unsure whether this is a bug in Solr or that perhaps
  Firefox itself trips over.
  
  

Re: Malformed XML with exotic characters

2011-02-01 Thread Sascha Szott

Hi folks,

I've made the same observation when working with Solr's 
ExtractingRequestHandler on the command line (no browser interaction).


When issuing the following curl command

curl 'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&wt=xml&resource.name=foo.pdf' \
  --data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' > foo.xml


Solr's XML response writer returns malformed XML, e.g., xmllint gives me:

foo.xml:21: parser error : Char 0xD835 out of allowed range
foo.xml:21: parser error : PCDATA invalid Char value 55349

I'm not totally sure whether this is a Tika/PDFBox issue. However, I would 
expect in every case that the XML output produced by Solr is well-formed, 
even if the libraries used under the hood return garbage.
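
One way a writer could guarantee this is to filter text against the XML 1.0 Char production before escaping it. The following Python sketch is an assumption about how such a safeguard might look, not a description of Solr's XML response writer. The function names are hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml_safe(text: str, replacement: str = "\uFFFD") -> str:
    """Replace every character XML 1.0 forbids (lone surrogates, most
    control characters) with a replacement character."""
    return "".join(ch if is_xml_char(ord(ch)) else replacement for ch in text)


def escape_text(text: str) -> str:
    """Clean the text, then escape the markup-significant characters."""
    cleaned = xml_safe(text)
    return cleaned.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


# A lone high surrogate (as in the error above), a valid character
# outside the BMP, and a NUL byte.
sample = "lone \ud835, valid \U0001D400, nul \x00 & done"
print(escape_text(sample).encode("utf-8"))
```

With a filter like this, garbage from an upstream library shows up as replacement characters in the output instead of making the whole response unparseable.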



-Sascha

P.S. I can provide the PDF file in question, if anybody would like to 
see it in action.



On 01.02.2011 16:43, Markus Jelsma wrote:

There is an issue with the XML response writer. It cannot cope with some very
exotic characters or possibly the right-to-left writing systems. The issue can
be reproduced by indexing the content of the home page of wikipedia as it
contains a lot of exotic matter. The problem does not affect the JSON response
writer.

The problem is, i am unsure whether this is a bug in Solr or that perhaps
Firefox itself trips over.



Re: Malformed XML with exotic characters

2011-02-01 Thread Markus Jelsma
You can rule out the input's involvement by checking whether other response 
writers work. For me, the JSONResponseWriter works perfectly with the same 
returned data in an AJAX environment.
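
One possible reason JSON is more forgiving here: JSON string escapes are defined in terms of UTF-16 code units, so a character outside the BMP may legally be written as two \u escapes, whereas an XML character reference must name the whole code point. The Python sketch below illustrates this with the standard json module. Whether Solr's JSON writer escapes such characters or emits them as raw UTF-8 is not stated in this thread.

```python
import json

gothic = "\U00010332"  # a Gothic letter, outside the BMP

# By default json.dumps produces ASCII only and writes the character
# as an escaped surrogate pair.
escaped = json.dumps(gothic)
# With ensure_ascii=False the character is written as-is.
raw = json.dumps(gothic, ensure_ascii=False)

print(escaped)
print(json.loads(escaped) == gothic)
print(json.loads(raw) == gothic)
```

Both forms decode back to the same single character, so a JSON consumer never sees an isolated surrogate.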

On Tuesday 01 February 2011 18:29:06 Sascha Szott wrote:
 Hi folks,
 
 I've made the same observation when working with Solr's
 ExtractingRequestHandler on the command line (no browser interaction).
 
 When issuing the following curl command
 
 curl 'http://mysolrhost/solr/update/extract?extractOnly=true&extractFormat=text&wt=xml&resource.name=foo.pdf' \
   --data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' > foo.xml
 
 Solr's XML response writer returns malformed xml, e.g., xmllint gives me:
 
 foo.xml:21: parser error : Char 0xD835 out of allowed range
 foo.xml:21: parser error : PCDATA invalid Char value 55349
 
 I'm not totally sure, if this is an Tika/PDFBox issue. However, I would
 expect in every case that the XML output produced by Solr is well-formed
 even if the libraries used under the hood return garbage.
 
 
 -Sascha
 
 p.s. I can provide the pdf file in question, if anybody would like to
 see it in action.
 
 On 01.02.2011 16:43, Markus Jelsma wrote:
  There is an issue with the XML response writer. It cannot cope with some
  very exotic characters or possibly the right-to-left writing systems.
  The issue can be reproduced by indexing the content of the home page of
  wikipedia as it contains a lot of exotic matter. The problem does not
  affect the JSON response writer.
  
  The problem is, I am unsure whether this is a bug in Solr or whether
  Firefox itself trips over the output.
  
  
  Here's the output of the JSONResponseWriter for a query returning the home
  page:
  {
    "responseHeader":{
      "status":0,
      "QTime":1,
      "params":{
        "fl":"url,content",
        "indent":"true",
        "wt":"json",
        "q":"*:*",
        "rows":"1"}},
    "response":{"numFound":6744,"start":0,"docs":[
      {
        "url":"http://www.wikipedia.org/",
        "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles
  日本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
  Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie
  libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей
  Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia
  livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł
  Nederlands De vrije encyclopedie 668 000+ artikelen Search  • Suchen  •
  Rechercher  • Szukaj  • Ricerca  • 検索  • Buscar  • Busca  • Zoeken  •
  Поиск  • Sök  • 搜尋  • Cerca  • Søk  • Haku  • Пошук  • Hledání  •
  Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara • Cari  • Søg  • بحث  •
  Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو • חיפוש  •
  Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky Dansk
  Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
  Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk
  (bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски /
  Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文  
  100 000+   العربية • Български  • Català  • Česky  • Dansk  • Deutsch  •
  English  • Español  • Esperanto  • فارسی  • Français  • 한국어  • Bahasa
  Indonesia  • Italiano  • עברית • Lietuvių  • Magyar  • Bahasa Melayu  •
  Nederlands  • 日本語  • Norsk (bokmål) • Polski  • Português  • Русский  •
  Română  • Slovenčina  • Slovenščina  • Српски / Srpski  • Suomi  •
  Svenska  • Türkçe  • Українська  • Tiếng Việt  • Volapük  • Winaray  •
  中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  • Asturianu  •
  Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • Беларуская (
  Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  •
  Brezhoneg  • Чăваш • Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  •
  Gaeilge  • Galego  • ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  •
  Íslenska  • Basa Jawa  • ಕನ್ನಡ  • ქართული  • Kurdî / كوردی  • Latina  •
  Latviešu  • Lëtzebuergesch  • Lumbaart • Македонски  • മലയാളം  • मराठी 
  • नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • Nnapulitano • Occitan  •
  Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی پنجابی
   • Shqip  • Sicilianu  • Simple English  • Sinugboanon  • Srpskohrvatski
  / Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ் • తెలుగు
   • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   Bahsa
  Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru
   • Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  • Башҡорт  • भोजपुरी 
  • Bikol Central  • Boarisch  • བོད་ཡིག  • Chavacano de Zamboanga  •
  Corsu  • Deitsch  • ދިވެހި  • Diné Bizaad  • Eald Englisc  •
  Emigliàn–Rumagnòl  • Эрзянь  • Estremeñu • Fiji Hindi  • Føroyskt  •
  Furlan  • Gaelg  • Gàidhlig  • 贛語  • گیلکی  • Hak- kâ-fa / 客家話  • Хальмг
   • ʻŌlelo Hawaiʻi  • Hornjoserbsce  • Ilokano  • Interlingua  •
  Interlingue  • Ирон Æвзаг  • 

Re: Malformed XML with exotic characters

2011-02-01 Thread Sascha Szott

Hi Markus,

in my case the JSON response writer returns valid JSON. The same holds 
for the PHP response writer.


-Sascha
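
For reference, the value in the xmllint errors reported in this thread
(0xD835, decimal 55349) is a UTF-16 high surrogate, which is not a legal XML
character on its own. The sketch below checks a code point against the Char
production of XML 1.0 and combines a surrogate pair into the single code point
a writer should emit. The low surrogate 0xDC00 is an assumption for
illustration only; the thread shows just the high half.

```python
def is_xml_char(cp: int) -> bool:
    """True if cp is allowed by the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high = 0xD835   # the value xmllint complained about (55349)
low = 0xDC00    # hypothetical low surrogate, not taken from the thread

cp = combine_surrogates(high, low)
print(is_xml_char(high))   # the lone surrogate
print(is_xml_char(cp))     # the combined code point
print(f"&#x{cp:X};")       # one character reference for the whole pair
```

A writer that escapes or encodes each half separately produces exactly the
kind of output xmllint rejects, while JSON's \uXXXX escapes are allowed to
carry surrogate pairs, which would be consistent with the JSON writer working.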

On 01.02.2011 18:44, Markus Jelsma wrote:

You can exclude the input's involvement by checking if other response writers
do work. For me, the JSONResponseWriter works perfectly with the same returned
data in some AJAX environment.


Re: Malformed XML with exotic characters

2011-02-01 Thread Robert Muir
Hi, it might only be a problem with your XML tools (e.g. Firefox).
The problem here is characters outside of the Basic Multilingual Plane
(in this case Gothic).
XML tools typically fall apart on these portions of Unicode (in Lucene
we recently reverted to a patched/hacked copy of Xerces specifically
for this reason).

If you care about characters outside of the Basic Multilingual Plane
actually working, unfortunately you have to start being very very very
particular about what software you use... you can assume most
software/setups WON'T work.
For example, if you were to use MySQL's utf8 character set you would
find it doesn't actually support all of UTF-8! In this case you would
need to use the recent 'utf8mb4' or something instead, which is
actually UTF-8!
That's just one example of a well-used piece of software that suffers
from issues like this; there are others.

It's for reasons like these that, if support for these languages is
important to you, I would stick with the most simple/textual methods
for input and output: e.g. using things like CSV and JSON if you can.
I would also fully test every component/jar in your application
individually and, once you get it working, don't ever upgrade.

In any case, if you are having problems with characters outside of the
Basic Multilingual Plane, and you suspect it's actually a bug in Solr,
please open a JIRA issue, especially if you can provide some way to
reproduce it.
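
To find out whether a piece of text contains such characters at all, a small
scan like the following can help. This is only a sketch: non_bmp_report is a
hypothetical helper, and the sample is the first Gothic letter of the
Wikipedia URL quoted elsewhere in this thread (U+10337).

```python
def non_bmp_report(text: str) -> list:
    """List every character above U+FFFF with its UTF-16 and UTF-8 forms."""
    report = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            utf16 = ch.encode("utf-16-be")  # exactly two 16-bit code units
            units = [int.from_bytes(utf16[i:i + 2], "big") for i in (0, 2)]
            report.append({
                "code_point": f"U+{cp:04X}",
                "utf16_units": [f"0x{u:04X}" for u in units],
                "utf8_bytes": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
            })
    return report


sample = "a\U00010337"  # 'a' followed by a Gothic letter
for entry in non_bmp_report(sample):
    print(entry)
```

Each reported character takes four bytes in UTF-8 and two code units in
UTF-16. Those are the two places where software tends to break: storage
limited to three bytes per character, and code that treats each UTF-16 unit
as a character of its own.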

On Tue, Feb 1, 2011 at 10:43 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 There is an issue with the XML response writer. It cannot cope with some very
 exotic characters or possibly the right-to-left writing systems. The issue can
 be reproduced by indexing the content of the home page of wikipedia as it
 contains a lot of exotic matter. The problem does not affect the JSON response
 writer.

 The problem is, I am unsure whether this is a bug in Solr or whether
 Firefox itself trips over the output.


 Here's the output of the JSONResponseWriter for a query returning the home
 page:
 {
  "responseHeader":{
   "status":0,
   "QTime":1,
   "params":{
         "fl":"url,content",
         "indent":"true",
         "wt":"json",
         "q":"*:*",
         "rows":"1"}},
  "response":{"numFound":6744,"start":0,"docs":[
         {
          "url":"http://www.wikipedia.org/",
          "content":"Wikipedia English The Free Encyclopedia 3 543 000+ articles
 日本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel
 Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie libre
 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей Italiano
 L’enciclopedia libera 768 000+ voci Português A enciclopédia livre 669 000+
 artigos Polski Wolna encyklopedia 769 000+ haseł Nederlands De vrije
 encyclopedie 668 000+ artikelen Search  • Suchen  • Rechercher  • Szukaj  •
 Ricerca  • 検索  • Buscar  • Busca  • Zoeken  • Поиск  • Sök  • 搜尋  • Cerca  •
 Søk  • Haku  • Пошук  • Hledání  • Keresés  • Căutare  • 찾기  • Tìm kiếm  • Ara
 • Cari  • Søg  • بحث  • Serĉu  • Претрага  • Paieška  • Hľadať  • Suk  • جستجو
 • חיפוש  • Търсене  • Poišči  • Cari  • Bilnga العربية Български Català Česky
 Dansk Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia
 Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk (bokmål)
 Polski Português Română Русский Slovenčina Slovenščina Српски / Srpski Suomi
 Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文   100 000+   العربية
 • Български  • Català  • Česky  • Dansk  • Deutsch  • English  • Español  •
 Esperanto  • فارسی  • Français  • 한국어  • Bahasa Indonesia  • Italiano  • עברית
 • Lietuvių  • Magyar  • Bahasa Melayu  • Nederlands  • 日本語  • Norsk (bokmål)
 • Polski  • Português  • Русский  • Română  • Slovenčina  • Slovenščina  •
 Српски / Srpski  • Suomi  • Svenska  • Türkçe  • Українська  • Tiếng Việt  •
 Volapük  • Winaray  • 中文   10 000+   Afrikaans  • Aragonés  • Armãneashce  •
 Asturianu  • Kreyòl Ayisyen  • Azərbaycan / آذربايجان ديلی  • বাংলা  • 
 Беларуская
 ( Акадэмічная  • Тарашкевiца )  • বিষ্ণুপ্রিযা় মণিপুরী  • Bosanski  • 
 Brezhoneg  • Чăваш
 • Cymraeg  • Eesti  • Ελληνικά  • Euskara  • Frysk  • Gaeilge  • Galego  •
 ગુજરાતી  • Հայերեն  • हिन्दी  • Hrvatski  • Ido  • Íslenska  • Basa Jawa  • 
 ಕನ್ನಡ  •
 ქართული  • Kurdî / كوردی  • Latina  • Latviešu  • Lëtzebuergesch  • Lumbaart
 • Македонски  • മലയാളം  • मराठी  • नेपाल भाषा  • नेपाली  • Norsk (nynorsk)  • 
 Nnapulitano
 • Occitan  • Piemontèis  • Plattdüütsch  • Ripoarisch  • Runa Simi  • شاہ مکھی
 پنجابی  • Shqip  • Sicilianu  • Simple English  • Sinugboanon  •
 Srpskohrvatski / Српскохрватски  • Basa Sunda  • Kiswahili  • Tagalog  • தமிழ்
 • తెలుగు  • ไทย  • اردو  • Walon  • Yorùbá  • 粵語  • Žemaitėška   1 000+   
 Bahsa
 Acèh  • Alemannisch  • አማርኛ  • Arpitan  • ܐܬܘܪܝܐ  • Avañe’ẽ  • Aymar Aru  •
 Bân-lâm-gú  • Bahasa Banjar  • Basa Banyumasan  • Башҡорт  • भोजपुरी  • Bikol
 Central  • Boarisch  • བོད་ཡིག  • Chavacano de Zamboanga  • Corsu  • Deitsch  
 •
 ދިވެހި  • Diné Bizaad  • Eald Englisc