[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-02-17 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Nemo federicol...@tiscali.it changed:

   What|Removed |Added

   Assignee|agarr...@wikimedia.org  |wikibugs-l@lists.wikimedia.org

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are watching all bug changes.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-01-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #7 from Marcin Cieślak marcin.cies...@gmail.com ---
1. I just checked the current dump, and it looks like it is not truncated
after the above-mentioned page, but currently I can't find page ID 803931
there. I'll double-check that, but a simple pywikipedia loop:


Python 2.7.3 (default, Sep 17 2012, 21:25:11)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xmlreader
>>> z = xmlreader.XmlDump("huwiki-20121021-pages-articles.xml.bz2")
>>> for i in z.parse():
...     if i.id == 803931:
...         print repr(i)
...
Reading XML dump...

does not seem to give any results.

2. To fix this entry in the database I would simply remove the last byte of the
thread_signature field. Or maybe the whole Greek text can be removed and
this:

[[User:Gubbubu|<font color="green" face="Lucida
calligraphy">Γουββος ΘιλοÎ

changed to

[[User:Gubbubu|Gubbubu]]

or something like that.
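
The "remove the last byte" repair suggested above can be sketched in Python. This is only an illustration under stated assumptions: `drop_dangling_tail` is a hypothetical helper name, and the sample value is a shortened stand-in for the stored signature, not the real 255-byte row.

```python
def drop_dangling_tail(raw: bytes) -> bytes:
    """Return raw without a trailing incomplete UTF-8 sequence, if there is one."""
    try:
        raw.decode("utf-8")
        return raw  # already valid, nothing to repair
    except UnicodeDecodeError as err:
        # Only repair the case where the undecodable part runs to the very
        # end of the data (a character cut off by truncation).
        if err.end == len(raw):
            return raw[:err.start]
        raise  # damage in the middle needs a closer look


# Shortened stand-in for the stored value: valid Greek text plus one lead byte.
broken = "Γουββος Θιλο".encode("utf-8") + b"\xce"
fixed = drop_dangling_tail(broken)
```

Here `fixed` decodes cleanly and is one byte shorter than `broken`, which matches the idea of deleting only the dangling byte.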



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-01-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #8 from Marcin Cieślak marcin.cies...@gmail.com ---
Sorry, I used the wrong dump above; I have now tried this, with 0 results:

# Python 2 / pywikipedia: look for the two page IDs in the current dump
import xmlreader
z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
for i in z.parse():
    if i.id in [803931, 803932]:
        print repr(i)



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-01-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #9 from Marcin Cieślak marcin.cies...@gmail.com ---
Created attachment 11679
  -- https://bugzilla.wikimedia.org/attachment.cgi?id=11679&action=edit
Dump of the text node of page 803932

Attached please find the result of running:

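# Python 2 / pywikipedia: write the text of page 803932 to a file
import xmlreader
out = open("803932.txt", "w")
z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
for i in z.parse():
    if i.id in [803932]:
        out.write(i.text.encode("utf-8"))
        break  # stop at the first match
out.close()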

Interestingly, this body looks more complete than what is actually
displayed under the URL of this bug. Is the output prepared for export of
better quality than the rendered wiki page?



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-01-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #10 from Marcin Cieślak marcin.cies...@gmail.com ---
Created attachment 11680
  -- https://bugzilla.wikimedia.org/attachment.cgi?id=11680&action=edit
XML dump of page id=803932/

This is the node taken from the uncompressed dump.

It seems that the ThreadSignature part looks correct now:

0380  62 75 7c 26 6c 74 3b 66  6f 6e 74 20 63 6f 6c 6f  |bu|&lt;font colo|
0390  72 3d 26 71 75 6f 74 3b  67 72 65 65 6e 26 71 75  |r=&quot;green&qu|
03a0  6f 74 3b 20 66 61 63 65  3d 26 71 75 6f 74 3b 4c  |ot; face=&quot;L|
03b0  75 63 69 64 61 20 63 61  6c 6c 69 67 72 61 70 68  |ucida calligraph|
03c0  79 26 71 75 6f 74 3b 26  67 74 3b ce 93 ce bf cf  |y&quot;&gt;.....|
03d0  85 ce b2 ce b2 ce bf cf  82 20 ce 98 ce b9 ce bb  |......... ......|
03e0  ce bf ef bf bd 3c 2f 54  68 72 65 61 64 53 69 67  |.....</ThreadSig|
03f0  6e 61 74 75 72 65 3e 0a  3c 2f 44 69 73 63 75 73  |nature>.</Discus|

We have a few more bytes from the signature available and XML tools do not
complain about UTF-8 anymore.
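
As a cross-check of the hex dump above, the non-ASCII bytes from offsets 03cb to 03e4 can be decoded with a short Python sketch. The byte values are copied from the dump; the variable names are only illustrative.

```python
# Bytes copied from the dump: the Greek text up to the closing tag.
tail = bytes.fromhex(
    "ce 93 ce bf cf 85 ce b2 ce b2 ce bf cf 82 20 "
    "ce 98 ce b9 ce bb ce bf ef bf bd"
)

# A strict decode succeeds, so this part of the dump is valid UTF-8.
decoded = tail.decode("utf-8")

# The last three bytes (ef bf bd) encode U+FFFD, the replacement character.
last_char = decoded[-1]
```

The decoded text is the visible part of the signature followed by U+FFFD, which would explain why XML tools accept the dump even though the original character after "Θιλο" is lost.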



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-01-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #11 from Marcin Cieślak marcin.cies...@gmail.com ---
To sum up:

1) The dump looks okay.

2) I am confused about the actual information in the database: the toolserver
replica still shows truncated bytes in the database, and both the web page
itself and [[Special:Export]] show truncated wikitext.



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2013-01-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Andre Klapper aklap...@wikimedia.org changed:

   What|Removed |Added

 Whiteboard||aklapper-moreinfo

--- Comment #6 from Andre Klapper aklap...@wikimedia.org ---
Marcin: Could you answer comment 5, please?



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2012-10-31 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #5 from Ariel T. Glenn ar...@wikimedia.org 2012-10-31 09:25:28 UTC ---
Are the current dumps still missing a bunch of pages (as described in the
original report)?

What content should go into the thread_signature field for thread_id 1288 in
order to fix this manually for the one row?

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2012-10-30 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Krenair kren...@gmail.com changed:

   What|Removed |Added

URL|https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3) |https://hu.wikipedia.org/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)
 CC||kren...@gmail.com



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-08-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Mark A. Hershberger m...@everybody.org changed:

   What|Removed |Added

   Priority|Unprioritized   |Normal
 CC||m...@everybody.org



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-08-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Mark A. Hershberger m...@everybody.org changed:

   What|Removed |Added

   Priority|Normal  |Low



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-07-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Brion Vibber br...@wikimedia.org changed:

   What|Removed |Added

 Blocks||29821



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-07-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Brion Vibber br...@wikimedia.org changed:

   What|Removed |Added

 Blocks||29818



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-07-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Brion Vibber br...@wikimedia.org changed:

   What|Removed |Added

 Blocks|29818   |



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-07-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Reedy s...@reedyboy.net changed:

   What|Removed |Added

   Keywords|shell   |



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-06-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Marcin Cieślak marcin.cies...@gmail.com changed:

   What|Removed |Added

URL||https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)
Web browser|--- |Opera



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-06-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

Marcin Cieślak marcin.cies...@gmail.com changed:

   What|Removed |Added

   Keywords||shell
   Severity|major   |critical

--- Comment #1 from Marcin Cieślak marcin.cies...@gmail.com 2011-06-24 13:49:28 UTC ---
It looks like the database entries got truncated at the 256th byte:

 select thread_signature  from thread where thread_root=803932 \G
*** 1. row ***
thread_signature: span title=bétaverzió !--font style=text-decoration:
blink;--font color=red♥/fontfont color=white♥/fontfont
color=green♥/font /font [[User:Gubbubu|font color=green face=Lucida
calligraphyΓουββος ΘιλοÎ

The thread_signature field is a TINYBLOB
(http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/LiquidThreads/lqt.sql?revision=72707&view=markup),
but evidently no attempt is made to truncate UTF-8 content sensibly.

This means that the database entries need to be fixed first, so I am adding the
shell keyword and bumping the priority.
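
To illustrate what truncating UTF-8 "sensibly" could mean, here is a minimal Python sketch that cuts an encoded string at a byte limit without splitting a multi-byte character. It assumes the 255-byte capacity of a TINYBLOB; `truncate_utf8` is a hypothetical helper and is not taken from LiquidThreads.

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> bytes:
    """Encode text as UTF-8 and cut it to at most max_bytes on a character boundary."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # character, so move the cut back to the start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# 254 ASCII bytes followed by a two-byte Greek letter: 256 bytes in total.
sample = "a" * 254 + "ο"
stored = truncate_utf8(sample)
```

A plain byte slice (`sample.encode("utf-8")[:255]`) would keep the lead byte 0xCE of the final letter and produce exactly the kind of dangling byte seen in the huwiki row; the sketch drops the whole character instead.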



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-06-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #2 from Brion Vibber br...@wikimedia.org 2011-06-24 18:00:41 UTC ---
So we can split this into a few separate parts:
* saving data into thread_signature fails to properly truncate long strings
* LQT's extension to XML export fails to run UTF-8 validation & cleanup on
output
* old db entries potentially ought to get cleaned up (shell issue, but probably
mostly irrelevant if the above is fixed)



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-06-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #3 from Brion Vibber br...@wikimedia.org 2011-06-24 18:13:47 UTC ---
r90723 fixes the XML export on trunk; one-line fix will be easy to merge to
deployment.

Applies UtfNormal::cleanUp() on the XML chunk that LQT adds into the output
stream; this is already applied on the rest of the export data via
WikiExporter's xmlsafe() escaping wrapper etc.
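
The effect of such a cleanup step can be approximated in Python. This is a rough analogue only, not the PHP implementation of UtfNormal::cleanUp(): it assumes that replacing invalid sequences with U+FFFD and normalizing to NFC captures the relevant behaviour, and `clean_up` is a hypothetical name.

```python
import unicodedata


def clean_up(raw: bytes) -> str:
    """Decode raw as UTF-8, replacing invalid sequences, then normalize to NFC."""
    # errors="replace" turns each undecodable sequence into U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)


# Shortened stand-in for the truncated signature: valid text plus a lone lead byte.
broken = "Θιλο".encode("utf-8") + b"\xce"
cleaned = clean_up(broken)
output_bytes = cleaned.encode("utf-8")
```

After this step `output_bytes` ends in `ef bf bd` instead of the dangling `ce`, so the XML output is well-formed UTF-8 even though the stored row is still damaged.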



[Bug 29564] Bad UTF-8 in ThreadSignature breaks huwiki XML dumps and Special:Export

2011-06-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=29564

--- Comment #4 from Marcin Cieślak marcin.cies...@gmail.com 2011-06-24 18:32:51 UTC ---
Thanks for looking at this quickly.

I just went through the LQT wikis using the toolserver databases, issuing a
query:

select thread_id, thread_signature from thread where
length(thread_signature)=255;

149 sql enwikinews_p < problem.sql
150 sql enwiktionary_p < problem.sql
151 sql mediawikiwiki_p < problem.sql
153 sql ptwikibooks_p < problem.sql
154 sql strategywiki_p < problem.sql
155 sql sewikimedia_p < problem.sql
156 sql svwikisource_p < problem.sql
157 sql wikimania2010wiki_p < problem.sql
158 sql wikimania2011wiki_p < problem.sql

officewiki_p couldn't be checked because we don't have this one :)

A few wikis have signatures that long stored, but the above case in huwiki
is the only one that ends with a broken UTF-8 sequence. Many signatures in
other databases ended up encoded as HTML entities, so they have no chance of
breaking UTF-8 this way.

So it seems that only one row, with thread_id = 1288, needs to be updated
in the huwiki_p database.
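
The manual check described above (find 255-byte signatures, then see which ones are broken) could be automated along these lines. This is a sketch under assumptions: the rows are taken to be already fetched as `(thread_id, thread_signature)` pairs of `int` and `bytes`, `suspect_rows` is a hypothetical name, and the sample data is invented for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def suspect_rows(rows, limit=255):
    """Return thread_ids whose signature fills the field and is not valid UTF-8."""
    return [
        thread_id
        for thread_id, signature in rows
        if len(signature) == limit and not is_valid_utf8(signature)
    ]


# Invented sample rows, not real database content.
rows = [
    (1, b"a" * 255),                  # full length, plain ASCII
    (2, b"&#915;" * 42 + b"abc"),     # full length, HTML entities only
    (1288, b"a" * 254 + b"\xce"),     # full length, ends in a lone lead byte
    (3, "Γουββος".encode("utf-8")),   # short, valid
]
flagged = suspect_rows(rows)
```

Signatures stored as HTML entities are pure ASCII, so they pass the check regardless of where they were cut, which is consistent with the observation above.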
