Re: [Wikitech-l] Wikitech-l Digest, Vol 90, Issue 33

Edmund Fisher Mon, 17 Jan 2011 10:19:40 -0800

Huh???


www.englishfreeroam.co.cc

On 17 Jan 2011, at 17:41, wikitech-l-requ...@lists.wikimedia.org wrote:

> Send Wikitech-l mailing list submissions to
>    wikitech-l@lists.wikimedia.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>    https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> or, via email, send a message with subject or body 'help' to
>    wikitech-l-requ...@lists.wikimedia.org
> 
> You can reach the person managing the list at
>    wikitech-l-ow...@lists.wikimedia.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Wikitech-l digest..."
> 
> 
> Today's Topics:
> 
>   1. Category sorting and first letters (Tim Starling)
>   2. Re: From page history to sentence history (Bryan Tong Minh)
>   3. Re: From page history to sentence history (Alex Brollo)
>   4. WMDE Developer Meetup moved to May (Daniel Kinzler)
>   5. Re: WYSIFTW status (Aryeh Gregor)
>   6. Re: [Toolserver-l] WMDE Developer Meetup moved to May
>      (Daniel Kinzler)
>   7. Re: June 8th 2011, World IPv6 Day (Aryeh Gregor)
>   8. Re: WMDE Developer Meetup moved to May (Chad)
>   9. Re: From page history to sentence history (Aryeh Gregor)
>  10. Re: From page history to sentence history (Anthony)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 18 Jan 2011 02:00:09 +1100
> From: Tim Starling <tstarl...@wikimedia.org>
> Subject: [Wikitech-l] Category sorting and first letters
> To: wikitech-l@lists.wikimedia.org
> Message-ID: <ih1lhs$pmn$1...@dough.gmane.org>
> Content-Type: text/plain; charset=UTF-8
> 
> In r80443 I added a feature allowing categories to be sorted using the
> Unicode Collation Algorithm (UCA). I wanted to briefly talk about the
> potential user impact, the design choices and the caveats.
> 
> Sorting was the easy part. The hard part was providing a "first
> letter" concept which would be reasonably sane. The idea I came up
> with was to compile a list of first letters, themselves sorted using
> the UCA. Then the "first letter" of a given string is the nearest
> letter in the list which sorts above the string.
> 
> For instance if you have letters A, B, C, and a string Aardvark, if
> you sort them you get:
> 
> A
> Aardvark
> B
> C
> 
> So we know that A is the first letter of Aardvark because Aardvark
> sorts immediately below A. This algorithm gives us a number of nice
> properties:
> 
> * It automatically drops accents, since accented letters sort the same
> as unaccented letters (at the primary level). Same with case
> differences, hiragana/katakana, etc.
> 
> * You can work out the initial Jamo of a Hangul syllable character by
> just omitting the composed syllables from the "first letter" list.
> Previously this was done with a special-case hack in
> Language::firstChar().
> 
> * Vowel reordering in Thai and Lao is automatically supported.
> So "??" sorts under heading "?" and "??" sorts under heading "?".
> 
> * The collation can be expanded to support all sorts of other crazy
> features, and the first letter feature will keep working in a sane
> way. For instance, you could have an English collation which removed
> "the" from the start of a title.
> 
> I compiled a list of 14,742 suitable header characters, identified by
> processing various Unicode data files. That list probably still needs
> lots of tweaks.
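
A minimal sketch of the "first letter" lookup described above, in Python. It is not the code from r80443. The primary_key function is only a stand-in for a real UCA primary-level sort key (an assumption: it drops accents and case and nothing else), and the three-letter header list and all names are made up for illustration.

```python
import bisect
import unicodedata


def primary_key(s):
    # Stand-in for a UCA primary-level key (assumption): decompose,
    # drop combining marks (accents), and fold case.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


# Hypothetical header list; the real one is said to hold 14,742 characters.
HEADERS = sorted(["A", "B", "C"], key=primary_key)
HEADER_KEYS = [primary_key(h) for h in HEADERS]


def first_letter(title):
    # The heading is the nearest header that sorts at or before the title.
    i = bisect.bisect_right(HEADER_KEYS, primary_key(title)) - 1
    return HEADERS[i] if i >= 0 else None


print(first_letter("Aardvark"))
```

With this stand-in key, an accented title such as "Ärger" also lands under "A", and a title that sorts before every header gets no heading at all.
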
> 
> There is a down side to this scheme. The default UCA table gives all
> characters with a similar logical function to the digits 0-9 the same
> primary sort order as the corresponding ASCII digits. So a page like
> [[????]] on the Bihari Wikipedia will sort under a heading of "1"
> instead of "?". There may be other instances of accidental cultural
> imperialism. However, this can be fixed by compiling
> language-dependent lists of header characters.
> 
> The UCA default table is not meant to sort any language correctly,
> it's just a compromise collation. Support for language-specific
> collations can easily be added. Whether we get language-specific
> collations or not, I'd like to think about enabling this feature on
> Wikimedia.
> 
> The most glaring omission from the UCA default tables is sensible
> sorting of the unified Han.
> 
> In a Chinese context, there's an obvious way to sort characters, and
> that's by their order in the KangXi dictionary. The Unihan database
> gives such an ordering, and it's used within code blocks. But it's not
> used between code blocks. So if you sort by code point, all the Han
> characters that aren't in the U+4E00 to U+9FFF block will sort
> incorrectly. That's what the default UCA does, with a few minor
> exceptions.
> 
> In a Japanese context, the way to sort ideographic characters is to
> convert them to phonetic hiragana and then to sort the resulting
> string. I don't know if there is any free software for doing this. On
> the Japanese Wikipedia, they achieve the same result by manually
> setting the sort key of every page to be the hiragana version of the
> title.
> 
> There's lots of room here for other people to get involved, especially
> if you know a language other than English.
> 
> -- Tim Starling
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 17 Jan 2011 16:29:58 +0100
> From: Bryan Tong Minh <bryan.tongm...@gmail.com>
> Subject: Re: [Wikitech-l] From page history to sentence history
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Message-ID:
>    <AANLkTi=w=6we2xngmmnikuffmth8krtivzxrsibju...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On Mon, Jan 17, 2011 at 3:49 PM, Anthony <wikim...@inbox.org> wrote:
>> How would you define a particular sentence, paragraph or section of an
>> article? The difficulty of the solution lies in answering that
>> question.
>> 
> 
> Difficult, but doable. Jan-Paul's sentence-level editing tool is able
> to make the distinction. It would perhaps be possible to use that as a
> framework for sentence-level diffs.
> 
> 
> Bryan
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Mon, 17 Jan 2011 16:40:28 +0100
> From: Alex Brollo <alex.bro...@gmail.com>
> Subject: Re: [Wikitech-l] From page history to sentence history
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Message-ID:
>    <AANLkTi=whaz1d5ty9hbkdd-7lkfsd_fy0vtevjxad...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> 2011/1/17 Bryan Tong Minh <bryan.tongm...@gmail.com>
> 
>> 
>> Difficult, but doable. Jan-Paul's sentence-level editing tool is able
>> to make the distinction. It would perhaps be possible to use that as a
>> framework for sentence-level diffs.
>> 
> 
> Difficult, but the diff between versions of a page already does it. Looking
> at diffs between pages, I was firmly convinced that only the changed
> paragraphs were stored, so that the page was built from updated diff
> segments. I had no idea how this could be done, but it all seemed "magic"!
> 
> Alex
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Mon, 17 Jan 2011 17:11:12 +0100
> From: Daniel Kinzler <dan...@brightbyte.de>
> Subject: [Wikitech-l] WMDE Developer Meetup moved to May
> To: wikitech-l@lists.wikimedia.org, toolserve...@lists.wikimedia.org,
>    MediaWiki announcements and site admin list
>    <mediawik...@lists.wikimedia.org>
> Cc: Nicole Ebber <nicole.eb...@wikimedia.de>,    Pavel Richter
>    <pavel.rich...@wikimedia.de>
> Message-ID: <4d346a20....@brightbyte.de>
> Content-Type: text/plain; charset=UTF-8
> 
> Hi all
> 
> after some discussion, Wikimedia Germany decided not to hold a developer's
> meet-up around the Chapter's conference in March. We just couldn't fit this in
> nicely with the venue and the overall organization. Don't despair though:
> 
> This is what we will do instead:
> 
> * There will be a hackathon hosted by Wikimedia Germany in (late) May,
> probably in Berlin, but that's not decided yet. This will mostly be about
> hacking, with a strong focus on GLAM related stuff. There will be little in
> terms of presentations.
> 
> * There will be the hacking days attached to Wikimania in Haifa, August 3./4.
> I'm in charge of setting up the program for that, and I'll try to make it a
> nice mix of discussing technology and actually hacking. I would also like to
> have a get-together with techies and chapter folks at some point during
> Wikimania.
> 
> I hope that this way, we can give the hacking events the attention they
> deserve. Let me know what you think.
> 
> -- daniel
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Mon, 17 Jan 2011 11:31:27 -0500
> From: Aryeh Gregor <simetrical+wikil...@gmail.com>
> Subject: Re: [Wikitech-l] WYSIFTW status
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Message-ID:
>    <aanlktikudzhxbhndkehewsuqhvcqbz2vestkm7xoz...@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
> 
> On Sun, Jan 16, 2011 at 7:16 PM, Magnus Manske
> <magnusman...@googlemail.com> wrote:
>> There is the question of what browsers/versions to test for. Should I
>> invest large amounts of time optimising performance in Firefox 3, when
>> FF4 will probably be released before WYSIFTW, and everyone and their
>> cousin upgrades?
> 
> Design for only the fastest browsers.  Other browsers could always
> just be dropped back to the old-fashioned editor.
> 
> 
> 
> ------------------------------
> 
> Message: 6
> Date: Mon, 17 Jan 2011 17:39:31 +0100
> From: Daniel Kinzler <dan...@brightbyte.de>
> Subject: Re: [Wikitech-l] [Toolserver-l] WMDE Developer Meetup moved
>    to May
> To: toolserve...@lists.wikimedia.org
> Cc: MediaWiki announcements and site admin list
>    <mediawik...@lists.wikimedia.org>, wikitech-l@lists.wikimedia.org,
>    Asaf Bartov <asaf.bar...@gmail.com>,    Pavel Richter
>    <pavel.rich...@wikimedia.de>,    Nicole Ebber <nicole.eb...@wikimedia.de>
> Message-ID: <4d3470c3.4040...@brightbyte.de>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On 17.01.2011 17:14, Asaf Bartov wrote:
>> Correction: Haifa Hacking Days are to be held August 2nd-3rd.
>> Wikimania itself will be Aug 4th-6th.
> 
> Gah! Thanks Asaf.
> 
> There I went and looked it up, and then wrote the wrong thing into the email.
> Curses.
> 
> -- daniel
> 
> 
> 
> ------------------------------
> 
> Message: 7
> Date: Mon, 17 Jan 2011 11:44:28 -0500
> From: Aryeh Gregor <simetrical+wikil...@gmail.com>
> Subject: Re: [Wikitech-l] June 8th 2011, World IPv6 Day
> To: Happy-melon <happy-me...@live.com>,    Wikimedia developers
>    <wikitech-l@lists.wikimedia.org>
> Message-ID:
>    <AANLkTikk20OAKv-vreinxD-oBmfnzLbo97=xroqeb...@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
> 
> On Sun, Jan 16, 2011 at 7:12 PM, Happy-melon <happy-me...@live.com> wrote:
>> I don't entirely understand the point of this. The plan seems to be """get
>> a large enough fraction of 'the internet' to make a change which breaks for
>> some people all at the same time, so that those people get angry with the
>> ISPs that haven't got off their arses to fix said breakage, rather than
>> angry with the broken sites""", which is fair enough.
> 
> No, the point is to test what happens if IPv6 is supported on a large
> scale.  It's known from small-scale testing that this will break
> things for some small percentage of users, but no one's sure what the
> consequences are of switching this on fully for everyone.
> 
>> But AFAICT, the
>> breakage won't occur if your connection can't 'do' IPv6, but only if your
>> connection can't 'do' both IPv4 *and* IPv6 on the same site at the same
>> time. Surely that's not actually the problem that we need to solve if we're
>> to be able to migrate smoothly onto IPv6? When the IPv4 addresses run out,
>> we need to be able to start setting up websites which are *only* v6, surely?
> 
> There are many more clients in the world than servers, and servers
> have always been able to get dedicated IPv4 addresses much more easily
> than clients.  A server Internet connection in America will typically
> come with as many IPv4 addresses as you need, while you usually can't
> get a dedicated residential IP address unless you pay extra.  (And
> America has more IP addresses allocated per capita than anywhere else
> in the world, since it originally developed the Internet.)
> 
> So as IPv4 addresses become scarcer, the pressure to use IPv6 only
> will fall mostly on residential users.  Clients with only an IPv6
> address will only be able to get direct connections to IPv6-enabled
> servers.  The way servers are supposed to do this is serve both A and
> AAAA records for the same domain, so IPv4 clients use the A record and
> IPv6 clients use the AAAA record.
> 
> Unfortunately, someone at some point decided that if the client
> supports both IPv4 and IPv6, and the server publishes both A and AAAA
> records, the client should connect via IPv6.  In practice, almost no
> sites use IPv6, so the infrastructure is much less well-tested.
> Clients that think they have IPv6 connections might actually have the
> connection eaten by a middlebox, or just be slower or less reliable.
> So sites don't turn on the AAAA records in practice because it
> degrades service for clients with IPv6 connections, which means the
> servers aren't accessible to IPv6-only clients without workarounds.
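
A rough Python sketch of the selection rule described above: a client that has both kinds of connectivity picks the AAAA record whenever the server publishes one. The function name, its parameters and the record lists are hypothetical, the addresses come from the reserved documentation ranges, and real resolvers follow more detailed address-selection rules than this.

```python
import ipaddress


def pick_address(a_records, aaaa_records, has_v4, has_v6):
    # Simplified rule (assumption): prefer IPv6 whenever the client has
    # IPv6 connectivity and the server publishes an AAAA record.
    if has_v6 and aaaa_records:
        return ipaddress.ip_address(aaaa_records[0])
    if has_v4 and a_records:
        return ipaddress.ip_address(a_records[0])
    # No usable record: an IPv6-only client facing an A-only server.
    return None


A = ["192.0.2.1"]        # documentation-range IPv4 address
AAAA = ["2001:db8::1"]   # documentation-range IPv6 address

print(pick_address(A, AAAA, has_v4=True, has_v6=True))
print(pick_address(A, [], has_v4=False, has_v6=True))
```

The first call is the dual-stack case that sites worry about: the client goes to IPv6 even if its IPv6 path is broken. The second is the IPv6-only client that cannot reach a site publishing only an A record.
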
> 
> IPv6 day is an attempt to see what happens if major sites publish AAAA
> records for a while.  Stuff will break, but hopefully not too
> horribly, and it will give both site operators and ISPs the chance to
> analyze what's wrong with their IPv6 support and what they can do to
> fix it.  This is a step toward major sites publishing AAAA records all
> the time, which is necessary to support IPv6-only clients.
> 
> Something like that, anyway.  I'm hardly an expert on these things.
> 
> 
> 
> ------------------------------
> 
> Message: 8
> Date: Mon, 17 Jan 2011 11:45:33 -0500
> From: Chad <innocentkil...@gmail.com>
> Subject: Re: [Wikitech-l] WMDE Developer Meetup moved to May
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Cc: toolserve...@lists.wikimedia.org,    MediaWiki announcements and site
>    admin list    <mediawik...@lists.wikimedia.org>
> Message-ID:
>    <AANLkTim3Q5CS20O=crvo0a2z7nnbqftrhauffgvbq...@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
> 
> On Mon, Jan 17, 2011 at 11:11 AM, Daniel Kinzler <dan...@brightbyte.de> wrote:
>> * There will be a hackathon hosted by Wikimedia Germany in (late) May,
>> probably in Berlin, but that's not decided yet. This will mostly be about
>> hacking, with a strong focus on GLAM related stuff. There will be little in
>> terms of presentations.
>> 
> 
> Late May? That's actually *really* awesome. Now I don't have
> to miss school to come :D
> 
> -Chad
> 
> 
> 
> ------------------------------
> 
> Message: 9
> Date: Mon, 17 Jan 2011 11:47:35 -0500
> From: Aryeh Gregor <simetrical+wikil...@gmail.com>
> Subject: Re: [Wikitech-l] From page history to sentence history
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Message-ID:
>    <AANLkTinBdUX_v4d0gvxzm=bf_le+1aqrmmjhk8xsv...@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
> 
> On Mon, Jan 17, 2011 at 5:55 AM, Alex Brollo <alex.bro...@gmail.com> wrote:
>> Before I dug a little deeper into wiki mysteries, I was absolutely sure that
>> wiki articles were stored in small pieces (paragraphs?), so that a small
>> edit to a very long page would take exactly the same disk space as a
>> small edit to a short page. But I soon discovered that things are
>> different. :-)
> 
> Wikimedia stores diffs using delta compression, so actually this is
> basically what happens.  The size of the edit is what determines the
> size of the stored diff, not the size of the page.  (I don't know how
> this works in detail, though.)  IIRC, default MediaWiki doesn't work
> this way.
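
To illustrate the general idea of delta storage (not Wikimedia's actual storage format, which the message above does not describe), here is a small Python sketch using the standard difflib module. Each new revision is stored as "copy these old lines" plus the newly inserted lines, so the stored data grows with the edit and not with the page. The make_delta and apply_delta names and the tuple layout are made up for illustration.

```python
import difflib


def make_delta(old_lines, new_lines):
    # Describe new_lines as copies from old_lines plus inserted lines.
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" and "insert" both carry new text; "delete" has none.
            delta.append(("insert", new_lines[j1:j2]))
    return delta


def apply_delta(old_lines, delta):
    # Rebuild the new revision from the old one and the stored delta.
    result = []
    for op in delta:
        if op[0] == "copy":
            result.extend(old_lines[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


old = [f"line {i}\n" for i in range(200)]
new = list(old)
new[100] = "edited\n"
delta = make_delta(old, new)
print(len(delta), apply_delta(old, delta) == new)
```

In this example only the single edited line is stored as text; everything else is a pair of line numbers pointing into the previous revision.
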
> 
> 
> 
> ------------------------------
> 
> Message: 10
> Date: Mon, 17 Jan 2011 12:41:22 -0500
> From: Anthony <wikim...@inbox.org>
> Subject: Re: [Wikitech-l] From page history to sentence history
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Message-ID:
>    <aanlktinfd+peoawn1t4xyzaecwpo1_nexm0eodglj...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo <alex.bro...@gmail.com> wrote:
>> 2011/1/17 Bryan Tong Minh <bryan.tongm...@gmail.com>
>> 
>>> 
>>> Difficult, but doable. Jan-Paul's sentence-level editing tool is able
>>> to make the distinction. It would perhaps be possible to use that as a
>>> framework for sentence-level diffs.
>>> 
>> 
>> Difficult, but the diff between versions of a page already does it. Looking
>> at diffs between pages, I was firmly convinced that only the changed
>> paragraphs were stored, so that the page was built from updated diff
>> segments. I had no idea how this could be done, but it all seemed "magic"!
> 
> Paragraphs are much easier to recognize than sentences, as wikitext
> has a paragraph delimiter - a blank line.  To truly recognize
> sentences, you basically have to engage in natural language
> processing, though you can probably get it right 90% of the time
> without too much effort.
> 
> And to recognize what's going on when a sentence changes *and* is
> moved from one paragraph to another, requires an even greater level of
> natural language understanding.  Again though, you can probably get it
> right most of the time without too much effort.
> 
> Wikitext actually makes it easier for the most part, as you can use
> tricks such as the fact that the periods in [[I.M. Someone]] don't
> represent sentence delimiters, since they are contained in square
> brackets.  But not all periods which occur in the middle of a sentence
> are contained in square brackets, and not all sentences end with a
> period.
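
The points above can be sketched in a few lines of Python. This is a naive splitter written for illustration, not an existing tool: paragraphs are split on blank lines, and sentences are split on ".", "!" or "?" followed by whitespace, except inside [[...]]. The function names are made up.

```python
import re


def split_paragraphs(wikitext):
    # A blank line is the wikitext paragraph delimiter.
    return [p for p in re.split(r"\n\s*\n", wikitext.strip()) if p]


def split_sentences(paragraph):
    sentences = []
    start = 0
    depth = 0  # nesting depth of [[ ... ]]
    i = 0
    n = len(paragraph)
    while i < n:
        if paragraph.startswith("[[", i):
            depth += 1
            i += 2
            continue
        if paragraph.startswith("]]", i):
            depth = max(depth - 1, 0)
            i += 2
            continue
        at_end = i + 1 == n or paragraph[i + 1].isspace()
        if paragraph[i] in ".!?" and depth == 0 and at_end:
            sentences.append(paragraph[start:i + 1].strip())
            start = i + 1
        i += 1
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("[[I.M. Someone]] wrote this. It was short."))
print(split_sentences("Dr. Smith agreed."))
```

The first call keeps the link intact and finds two sentences. The second shows the limitation mentioned above: the period in "Dr." is not inside brackets, so this splitter wrongly cuts the sentence in two.
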
> 
> I'd say "difficult but doable" is quite accurate, although with the
> caveat that even the state of the art tools available today are
> probably going to make mistakes that would be obvious to a human.  I'm
> sure there are tools for this, and there are probably some decent ones
> that are open source.  But it's not as simple as just adding an index.
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> 
> 
> End of Wikitech-l Digest, Vol 90, Issue 33
> ******************************************

