[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Nemo federicol...@tiscali.it changed:

    What     | Removed                | Added
    Assignee | agarr...@wikimedia.org | wikibugs-l@lists.wikimedia.org

--
You are receiving this mail because:
You are the assignee for the bug.
You are watching all bug changes.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #7 from Marcin Cieślak marcin.cies...@gmail.com ---
1. I just checked the current dump and it does not appear to be truncated after the above-mentioned page, but I currently can't find page ID 803931 in it. I'll double-check that, but a simple pywikipedia loop:

    Python 2.7.3 (default, Sep 17 2012, 21:25:11)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import xmlreader
    >>> z = xmlreader.XmlDump("huwiki-20121021-pages-articles.xml.bz2")
    >>> for i in z.parse():
    ...     if i.id == 803931:
    ...         print repr(i)
    ...
    Reading XML dump...

does not seem to give any results.

2. To fix this entry in the database I would simply remove the last byte of the thread_signature field. Alternatively, the whole Greek text could be removed and this:

    [[User:Gubbubu|font color=green face=Lucida calligraphyΓουββος ΘιλοÎ

changed to

    [[User:Gubbubu|Gubbubu]]

or something like that.
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #8 from Marcin Cieślak marcin.cies...@gmail.com ---
Sorry, I used the wrong dump above. I have now tried the following, with 0 results:

    import xmlreader
    z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
    for i in z.parse():
        if i.id in [803931, 803932]:
            print repr(i)
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #9 from Marcin Cieślak marcin.cies...@gmail.com ---
Created attachment 11679
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=11679&action=edit
Dump of the text node of page 803932

Attached please find the result of running:

    import xmlreader
    out = open("803932.txt", "w")
    z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
    for i in z.parse():
        if i.id in [803932]:
            out.write(i.text.encode("utf-8"))
            break
    out.close()

Interestingly, this body looks more complete than what is actually displayed under the URL of this bug. Is the output prepared for export of better quality than the rendered wiki page?
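The same lookup can be sketched without pywikipedia, using only the Python 3 standard library. This is a minimal sketch, not the tool used in the comment above: the function names `local` and `find_pages` are hypothetical, the sample document is made up, and the namespace URI in it is an assumption (the code ignores namespaces, so the exact export version should not matter).

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_pages(stream, wanted_ids):
    """Return {page_id: text} for the wanted page IDs found in an export stream."""
    wanted = {str(i) for i in wanted_ids}  # IDs are plain text in the XML
    found = {}
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id, text = None, None
        for child in elem:
            name = local(child.tag)
            if name == "id":
                page_id = child.text
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        text = sub.text or ""
        if page_id in wanted:
            found[int(page_id)] = text
        elem.clear()  # drop the finished <page> contents to limit memory use
    return found


# Tiny made-up sample standing in for a real dump file.
sample = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b"<page><title>A</title><id>803932</id>"
    b"<revision><id>1</id><text>hello</text></revision></page>"
    b"<page><title>B</title><id>5</id>"
    b"<revision><id>2</id><text>other</text></revision></page>"
    b"</mediawiki>"
)
with bz2.open(io.BytesIO(bz2.compress(sample)), "rb") as stream:
    result = find_pages(stream, [803931, 803932])
print(result)
```

For a real dump, the in-memory sample would be replaced by `bz2.open("huwiki-20130120-pages-meta-current.xml.bz2", "rb")`. Note that the page ID is compared as a string here, because that is how it appears in the XML.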
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #10 from Marcin Cieślak marcin.cies...@gmail.com ---
Created attachment 11680
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=11680&action=edit
XML dump of page id=803932/

This is the node taken from the uncompressed dump. It seems that the ThreadSignature part looks correct now:

    0380  62 75 7c 26 6c 74 3b 66 6f 6e 74 20 63 6f 6c 6f  |bu|&lt;font colo|
    0390  72 3d 26 71 75 6f 74 3b 67 72 65 65 6e 26 71 75  |r=&quot;green&qu|
    03a0  6f 74 3b 20 66 61 63 65 3d 26 71 75 6f 74 3b 4c  |ot; face=&quot;L|
    03b0  75 63 69 64 61 20 63 61 6c 6c 69 67 72 61 70 68  |ucida calligraph|
    03c0  79 26 71 75 6f 74 3b 26 67 74 3b ce 93 ce bf cf  |y&quot;&gt;.....|
    03d0  85 ce b2 ce b2 ce bf cf 82 20 ce 98 ce b9 ce bb  |......... ......|
    03e0  ce bf ef bf bd 3c 2f 54 68 72 65 61 64 53 69 67  |.....</ThreadSig|
    03f0  6e 61 74 75 72 65 3e 0a 3c 2f 44 69 73 63 75 73  |nature>.</Discus|

We have a few more bytes from the signature available, and XML tools no longer complain about UTF-8.
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #11 from Marcin Cieślak marcin.cies...@gmail.com ---
To sum up:

1) The dump looks okay.

2) I am confused about the actual information in the database: the toolserver replica still shows truncated bytes in the database, and both the web page itself and [[Special:Export]] show truncated wikitext.
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Andre Klapper aklap...@wikimedia.org changed:

    What       | Removed | Added
    Whiteboard |         | aklapper-moreinfo

--- Comment #6 from Andre Klapper aklap...@wikimedia.org ---
Marcin: Could you answer comment 5, please?
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #5 from Ariel T. Glenn ar...@wikimedia.org 2012-10-31 09:25:28 UTC ---
Are the current dumps still missing a bunch of pages (as described in the original report)?

What content should go into the thread_signature field for thread_id 1288 in order to fix this manually for the one row?
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Krenair kren...@gmail.com changed:

    What | Removed | Added
    URL  | https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3) | https://hu.wikipedia.org/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)
    CC   |         | kren...@gmail.com
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Mark A. Hershberger m...@everybody.org changed:

    What     | Removed       | Added
    Priority | Unprioritized | Normal
    CC       |               | m...@everybody.org
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Mark A. Hershberger m...@everybody.org changed:

    What     | Removed | Added
    Priority | Normal  | Low
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Brion Vibber br...@wikimedia.org changed:

    What   | Removed | Added
    Blocks |         | 29821
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Brion Vibber br...@wikimedia.org changed:

    What   | Removed | Added
    Blocks |         | 29818
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Brion Vibber br...@wikimedia.org changed:

    What   | Removed | Added
    Blocks | 29818   |
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Reedy s...@reedyboy.net changed:

    What     | Removed | Added
    Keywords | shell   |
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Marcin Cieślak marcin.cies...@gmail.com changed:

    What        | Removed | Added
    URL         |         | https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)
    Web browser | ---     | Opera
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Marcin Cieślak marcin.cies...@gmail.com changed:

    What     | Removed | Added
    Keywords |         | shell
    Severity | major   | critical

--- Comment #1 from Marcin Cieślak marcin.cies...@gmail.com 2011-06-24 13:49:28 UTC ---
It looks like the database entries got truncated at the 256th byte:

    select thread_signature from thread where thread_root=803932 \G
    *** 1. row ***
    thread_signature: span title=bétaverzió !--font style=text-decoration: blink;--font color=red♥/fontfont color=white♥/fontfont color=green♥/font /font [[User:Gubbubu|font color=green face=Lucida calligraphyΓουββος ΘιλοÎ

The thread_signature field is a TINYBLOB (http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/LiquidThreads/lqt.sql?revision=72707&view=markup), but evidently no attempt is made to truncate UTF-8 content sensibly.

This means that the database entries need to be fixed first; adding the shell keyword and bumping the priority.
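What "truncating UTF-8 sensibly" could look like is sketched below in Python 3. This is an illustration only, not the LiquidThreads code: the function name `truncate_utf8` is hypothetical, and the 255-byte default is taken from the TINYBLOB size limit. The idea is to cut at the byte limit and then drop any incomplete character left at the end.

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> bytes:
    """Encode text as UTF-8, cut to at most max_bytes, never splitting a character."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    # The byte-level cut may leave a partial multi-byte sequence at the end;
    # decoding with errors="ignore" drops exactly those dangling bytes.
    return raw[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")


# A made-up signature whose 255th and 256th bytes form one Greek letter.
signature = "a" * 254 + "Θ"
stored = truncate_utf8(signature)
print(len(signature.encode("utf-8")), len(stored))
```

With this approach the stored value can be one to three bytes shorter than the limit, but it always decodes cleanly.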
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #2 from Brion Vibber br...@wikimedia.org 2011-06-24 18:00:41 UTC ---
So we can split this into a few separate parts:

* saving data into thread_signature fails to properly truncate long strings
* LQT's extension to XML export fails to run UTF-8 validation cleanup on output
* old db entries potentially ought to get cleaned up (shell issue, but probably mostly irrelevant if the above is fixed)
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #3 from Brion Vibber br...@wikimedia.org 2011-06-24 18:13:47 UTC ---
r90723 fixes the XML export on trunk; the one-line fix will be easy to merge to deployment.

It applies UtfNormal::cleanUp() to the XML chunk that LQT adds into the output stream; this is already applied to the rest of the export data via WikiExporter's xmlsafe() escaping wrapper etc.
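The effect of such a cleanup pass on the truncated signature can be illustrated in Python 3. This is a rough stand-in, not the PHP implementation: `clean_utf8` is a hypothetical name, and the sketch assumes that the cleanup replaces invalid sequences with U+FFFD and normalizes to NFC, which is my understanding of what UtfNormal::cleanUp() does.

```python
import unicodedata


def clean_utf8(raw: bytes) -> str:
    """Replace invalid UTF-8 sequences with U+FFFD, then normalize to NFC."""
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)


# Tail of the stored signature: valid Greek text followed by a lone 0xCE lead byte.
broken = "Γουββος Θιλο".encode("utf-8") + b"\xce"
cleaned = clean_utf8(broken)
print(cleaned.encode("utf-8")[-5:].hex(" "))  # ce bf ef bf bd
```

The trailing `ef bf bd` is the UTF-8 encoding of U+FFFD, the same three bytes that appear just before `</ThreadSignature>` in the hex dump from comment #10.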
[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #4 from Marcin Cieślak marcin.cies...@gmail.com 2011-06-24 18:32:51 UTC ---
Thanks for looking at this quickly.

I just went through the LQT wikis using the toolserver databases, issuing this query:

    select thread_id, thread_signature from thread where length(thread_signature)=255;

    149 sql enwikinews_p < problem.sql
    150 sql enwiktionary_p < problem.sql
    151 sql mediawikiwiki_p < problem.sql
    153 sql ptwikibooks_p < problem.sql
    154 sql strategywiki_p < problem.sql
    155 sql sewikimedia_p < problem.sql
    156 sql svwikisource_p < problem.sql
    157 sql wikimania2010wiki_p < problem.sql
    158 sql wikimania2011wiki_p < problem.sql

officewiki_p couldn't be checked because we don't have this one :)

A few wikis have signatures that long stored, but the above case in huwiki is the only one that ends with a broken UTF-8 sequence. Many signatures in other databases ended up encoded as HTML entities, so they have no chance of breaking UTF-8 this way.

So it seems that only the one row with thread_id = 1288 needs to be updated in the huwiki_p database.
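The manual check described above (take the rows that hit the 255-byte limit and see which ones are no longer valid UTF-8) could be automated roughly as follows in Python 3. This is a sketch under assumptions: `broken_signatures` is a hypothetical helper, the rows are made-up stand-ins for query results, and only the thread_id 1288 is taken from the thread (its bytes here are invented).

```python
def broken_signatures(rows, limit=255):
    """Return thread IDs whose signature reached the limit and is not valid UTF-8."""
    bad = []
    for thread_id, signature in rows:
        if len(signature) < limit:
            continue  # shorter values were never truncated
        try:
            signature.decode("utf-8")  # strict decoding rejects a cut-off character
        except UnicodeDecodeError:
            bad.append(thread_id)
    return bad


# Made-up rows in the shape (thread_id, thread_signature bytes).
rows = [
    (1288, b"x" * 254 + b"\xce"),       # cut inside a two-byte character
    (2001, b"&#915;" * 42 + b"abc"),    # 255 bytes of entity-encoded ASCII
    (2002, "Γουββος".encode("utf-8")),  # short, never truncated
]
print(broken_signatures(rows))
```

Entity-encoded signatures pass the check because HTML entities are pure ASCII, so a cut at any byte still leaves valid UTF-8, as the comment notes.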