Re: [Wikitech-l] Proposal: slight change to the XML dump format

2014-10-29 Thread Andrew Dunbar
I noticed that the dump format version number went from 0.9 to 0.10.

I wonder if this format is documented somewhere, and whether any code
might expect 1.0 to follow 0.9 rather than 0.10?

Andrew Dunbar (hippietrail)

On 28 October 2014 20:45, Daniel Kinzler dan...@brightbyte.de wrote:

 Am 27.10.2014 21:58, schrieb Ariel T. Glenn:
  Thank you Google for hiding the start of this thread in my spam folder
  _
 
  I'm going to have to change my import tools for the new format, but
  that's the way it goes; it's a reasonable change.  Have you checked with
  folks on the xml data dumps list to see who might be affected?

 Not yet, shall do that now.

 Thanks!
 -- daniel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Distinguishing disambiguation pages

2012-12-26 Thread Andrew Dunbar
It would be great if these pages were marked in the dump files too.

It could be done in exactly the same way that redirect pages are marked.
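
For comparison, redirects are already flagged in the dumps with an empty
<redirect /> element inside each <page>. A minimal Python sketch of reading
that flag (the filename is only an example; a disambiguation marker could be
consumed the same way if one were ever added):

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # example filename

    def localname(tag):
        # The export schema namespaces every tag; keep only the local part.
        return tag.rsplit("}", 1)[-1]

    with bz2.open(DUMP, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if localname(elem.tag) != "page":
                continue
            title = next(c.text for c in elem if localname(c.tag) == "title")
            if any(localname(c.tag) == "redirect" for c in elem):
                print("redirect:", title)
            elem.clear()  # keep memory bounded on large dumps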


On 27 December 2012 01:41, Brad Jorsch bjor...@wikimedia.org wrote:

 On Tue, Dec 25, 2012 at 6:00 AM, Liangent liang...@gmail.com wrote:
  Is this enough?
 
  api.php?action=query&prop=pageprops&ppprop=disambiguation&titles=

 One thing that would be nice would be the ability to go the other way.
 Consider for example this similar query that tests if the specified
 pages are in a category:


 api.php?action=query&prop=categories&clcategories=Category:All_disambiguation_pages&titles=

 We can do the opposite, getting a list of pages in the category,
 something like this:


 api.php?action=query&list=categorymembers&cmtitle=Category:All_disambiguation_pages

 It would be nice to have a corresponding
 api.php?action=query&list=pageswithprop&pwpprop=disambiguation. At a
 glance, it looks like we could do it easily enough if someone adds an
 index on page_props (pp_propname,pp_page).
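
For reference, a rough sketch of consuming the existing pageprops query with
Python's `requests` library (an assumption; any HTTP client works), on a wiki
that sets the disambiguation page prop:

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    titles = ["Mercury", "Steven Tyler"]  # example titles

    r = requests.get(API, params={
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": "|".join(titles),
        "format": "json",
    })
    for page in r.json()["query"]["pages"].values():
        is_dab = "disambiguation" in page.get("pageprops", {})
        print(page["title"], "is a disambiguation page" if is_dab else "is not")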


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] HTML wikipedia dumps: Could you please provide them, or make public the code for interpreting templates?

2012-09-09 Thread Andrew Dunbar
 of the code into another programming language.

Bug 25984 - Isolate parser from database dependencies
https://bugzilla.wikimedia.org/show_bug.cgi?id=25984

Nobody at Wikimedia is working on this, but there are some patches from
other people that will certainly get you on your way.

The developers at Wikimedia are, however, very busy making a whole new
parser and a WYSIWYG editor to go with it.

Hopefully this will clean up the code to the point that making your
own parser becomes a lot easier.

Good luck and sympathy (-:
Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Order of execution JavaScript extensions

2012-06-06 Thread Andrew Dunbar
I'm having trouble getting a simple one-line User JS working on Wiktionary.

  $('#p-navigation').removeClass('first persistent').addClass('collapsed');

It works fine from Google Chrome's dev console. It makes the
navigation portal collapsible like the other portals in the sidebar.

But when I add it to my User:XXX/vector.js the result is not the same.
The class I add is there but the ones I remove are also still there
and the result is the standard navigation portal.

I suspect there is some other js executed after the user's vector.js
but I'm not sure how to check that.

I have tried setting a breakpoint on the node in Google Chrome's dev
tools and reloading the page, but it is never triggered.

Apologies if this is not the right mailing list. None of the lists
seemed to fit according to http://www.mediawiki.org/wiki/Mailing_lists

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Order of execution JavaScript extensions

2012-06-06 Thread Andrew Dunbar
On 6 June 2012 13:57, Bergi a.d.be...@web.de wrote:
 Andrew Dunbar schrieb:

 I'm having trouble getting a simple one-line User JS working on
 Wiktionary.


 Apologies if this is not the right mailing list. None of the lists
 seemed fit according to http://www.mediawiki.org/wiki/Mailing_lists


 I think the http://en.wikipedia.org/wiki/Wikipedia:WikiProject_User_scripts
 would be a better place to discuss. Even though it's not Wiktionary, you
 should find the (user-)JS gurus there :-)

 Apart from that, I guess your code interferes with the
 ext.vector.collapsibleNav.js module. Waiting for it (with mw.loader.using)
 before executing your snippet should work.

Thanks for both parts of your answer. Your tip worked perfectly and I
know where to ask next time.

Andrew Dunbar (hippietrail)

 regards,
  Bergi



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Visual watchlist

2012-05-18 Thread Andrew Dunbar
On 9 May 2012 12:17, Arun Ganesh arun.plane...@gmail.com wrote:
 I thought of studying my watchlist for a moment to understand why it was
 the way it was, and I noticed the following:


   1. My watchlist begins half the page down, because of the watchlist
   options box, which btw I have never used or peered into.
   2. The first link in each item is that of the current article. I have
   never clicked this because I might as well go through the changes by using
   (diff)
   3. I have never clicked (hist) on the watchlist, I would first see the
   (diff) and only then browse the history

These days I most often click (hist), less often (diff), and
practically never anything else.

(hist) is more useful for me on the English Wiktionary because I
mostly add translation requests, and several bots watch the recent
changes feed and make minor changes to most pages I alter, which
are of no interest to me. It also seems that people monitoring the
activity often add other translations. By clicking (hist) I can
see:

1) Whether only bots have changed the page since my edit, in which case
I don't need to see a diff.

2) When there were several human edits, which ones were in languages I
am interested in.

3) The history page gives me a way to get a diff of all changes since
my last edit, rather than just the most recent change.

Andrew Dunbar (hippietrail)

   4. 0 is colored grey making it disappear from the list. But that does
   not mean the article never changed, it could be +400 -400 words but the net
   is 0. The edit calculation can be highly misleading. I would rather want to
   know how many characters were added and how many deletions. Articles which
   have only additions are low on my priority list to patrol.
   5.  Before contacting any user or checking his (contribs), I would
   always see what his edit was. I open the (diff) and (contribs) in new tabs.
   This could have become integrated because its part of the same task. Same
   goes for talk and the user page links littered all over my watchlist
   6. Knowing whether a user/ip has a talk page or not is important for me
   to identify a newbie or vandal
   7. Reading each edit summary is really slow. Identifying where it begins
   on a line is tough of all the information that precedes it.
   8. I can jump to the specific section directly by clicking the tiny →
   but not the section name itself. I have never used this link either as i
   would rather see the (diff)
   9. The (diff) gives me the diff with the entire article and image loaded
   below. In most cases, all the info I need while patrolling is just in the
   diff. I only need the article if i want to check if tables/images are
   broken.


  With that in mind I made this, which would solve most of my issues:
 http://commons.wikimedia.org/wiki/File:Mw-ux-visual_watchlist.png
 Let me know if it would work for you as well? I hope to put some more
 thought on it and improving the idea.
 --
 Arun Ganesh
 User:planemad

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] search=steven+tyler gets Steven_tyler

2011-05-14 Thread Andrew Dunbar
On 14 May 2011 20:37, K. Peachey p858sn...@gmail.com wrote:
 On Sat, May 14, 2011 at 8:33 PM,  jida...@jidanni.org wrote:
 OK, then why can't
 http://en.wikipedia.org/wiki/Steven_tyler
 just do a browser redirect to
 http://en.wikipedia.org/wiki/Steven_Tyler
 Because then we can't show the (Redirected from X) bar that
 accompanies the redirects

The JavaScript we use on the English Wiktionary also shows a slightly
different (Automatically redirected from X) bar, or something very
similar.

Andrew Dunbar (hippietrail)



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] search=steven+tyler gets Steven_tyler

2011-05-13 Thread Andrew Dunbar
On 13 May 2011 14:34, Carl (CBM) cbm.wikipe...@gmail.com wrote:
 On Fri, May 13, 2011 at 12:25 AM, Jay Ashworth j...@baylink.com wrote:
 They're not the same page.  Wikipedia page titles are case sensitive -- 
 except
 that the first character is forced to upper case by the engine.

 Does that search not return both?  Why would we have both?

 Like you said, the system is case sensitive. These redirects are
 created because the software doesn't handle case changes correctly
 otherwise. For example the following link leads to a no such page
 error because the appropriate redirect does not exist:
 http://en.wikipedia.org/wiki/Sterling_heights,_Michigan .

 It would be possible to code around this, so that the redirects would
 be simulated if they don't exist, but it hasn't happened.  In
 practice, people like me like to type a title in all lower case, and
 so we have redirects to make it work.

Indeed, on the English Wiktionary we do have some JavaScript which runs
when you land on a page which would be a redlink. It checks these casing
variants: all lowercase, all uppercase, and first letter uppercase with
the rest lowercase. If one of those exists it automatically
redirects after a couple of seconds.

With the different nature of Wikipedia titles you would probably want
to check sentence case and title case, but you would still miss quite a
few titles where only the proper nouns within them are capitalized.

And some people would probably hate such a feature too (-:
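
The gadget itself is JavaScript, but the lookup logic is roughly this,
sketched in Python against the API (`requests` is an assumption):

    import requests

    API = "https://en.wiktionary.org/w/api.php"

    def existing_case_variants(title):
        # The three variants the gadget tries: all lower, all upper,
        # and first letter upper with the rest lower.
        variants = {title.lower(), title.upper(),
                    title[:1].upper() + title[1:].lower()}
        variants.discard(title)
        if not variants:
            return []
        r = requests.get(API, params={"action": "query",
                                      "titles": "|".join(variants),
                                      "format": "json"})
        pages = r.json()["query"]["pages"].values()
        # Pages that exist have no "missing" flag.
        return [p["title"] for p in pages if "missing" not in p]

    print(existing_case_variants("PARIS"))  # e.g. ['paris', 'Paris']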

Andrew Dunbar (hippietrail)


 - Carl



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] search=steven+tyler gets Steven_tyler

2011-05-13 Thread Andrew Dunbar
On 13 May 2011 17:31, M. Williamson node...@gmail.com wrote:
 I still don't think page titles should be case sensitive. Last time I asked
 how useful this really was, back in 2005 or so, I got a tersely-worded
 response that we need it to disambiguate certain pages. OK, but how many
 cases does that actually apply to? I would think that the increased
 usability from removing case sensitivity would far outweigh the benefit of
 natural disambiguation that only applies to a tiny minority of pages, and
 which could easily be replaced with disambiguation pages.

There has been talk from time to time over the years of adding full case
folding, whereby page titles preserve the case of each letter
but ignore that information for internal operations, a lot like the
filesystem on Microsoft Windows. It would be a third setup option in
MediaWiki alongside case-sensitive and first-letter. But there has
never been enough interest, it has never been important enough, and no
developer has ever stepped up. It would take a fair bit of work to
implement.

Andrew Dunbar (hippietrail)

 2011/5/12 Carl (CBM) cbm.wikipe...@gmail.com

 On Fri, May 13, 2011 at 12:25 AM, Jay Ashworth j...@baylink.com wrote:
  They're not the same page.  Wikipedia page titles are case sensitive --
 except
  that the first character is forced to upper case by the engine.
 
  Does that search not return both?  Why would we have both?

 Like you said, the system is case sensitive. These redirects are
 created because the software doesn't handle case changes correctly
 otherwise. For example the following link leads to a no such page
 error because the appropriate redirect does not exist:
 http://en.wikipedia.org/wiki/Sterling_heights,_Michigan .

 It would be possible to code around this, so that the redirects would
 be simulated if they don't exist, but it hasn't happened.  In
 practice, people like me like to type a title in all lower case, and
 so we have redirects to make it work.

 - Carl



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] search=steven+tyler gets Steven_tyler

2011-05-13 Thread Andrew Dunbar
On 14 May 2011 01:48, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
 On Fri, May 13, 2011 at 3:31 AM, M. Williamson node...@gmail.com wrote:
 I still don't think page titles should be case sensitive. Last time I asked
 how useful this really was, back in 2005 or so, I got a tersely-worded
 response that we need it to disambiguate certain pages. OK, but how many
 cases does that actually apply to? I would think that the increased
 usability from removing case sensitivity would far outweigh the benefit of
 natural disambiguation that only applies to a tiny minority of pages, and
 which could easily be replaced with disambiguation pages.

 From a software perspective, the way to do this would be to store a
 canonicalized version of each page's title, and require that to be
 unique instead of the title itself.  This would be nice because we
 could allow underscores in page titles, for instance, in addition to
 being able to do case-folding.

 Note that Unicode capitalization is locale-dependent, but case-folding
 is not.  Thus we could use the same case-folding on all projects,
 including international projects like Commons.  There's only one
 exception -- Turkish, with its dotless and dotted i's.  But that's
 minor enough that we should be able to work around it without too much
 pain.

I'm almost positive Azeri has the same dotless-i issue, and perhaps
some of the other Turkic languages of Central Asia do too. One solution is
to do accent/diacritic normalization as well, as part of the canonicalization.
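
Something like this, sketched in Python (str.casefold() plus stripping the
combining marks left by NFD decomposition; exactly which steps a wiki would
want is an open question):

    import unicodedata

    def canonicalize(title):
        # Locale-independent case folding, then drop combining marks.
        folded = title.casefold()
        decomposed = unicodedata.normalize("NFD", folded)
        stripped = "".join(c for c in decomposed
                           if not unicodedata.combining(c))
        return unicodedata.normalize("NFC", stripped)

    print(canonicalize("Ştefan cel Mare"))  # stefan cel mare
    print(canonicalize("İstanbul"))         # istanbul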

Andrew Dunbar (hippietrail)

 Some projects, like probably all Wiktionaries, would doubtless not
 want case-folding at all, so we should support different
 canonicalization algorithms.  Even the ones that don't want
 case-folding could still benefit from allowing underscores in titles.

 But all this would require a very intrusive rewrite.  Assumptions like
 replace spaces by underscores to get dbkey are hardwired into
 MediaWiki all over the place, unfortunately.  It's not clear that it's
 worth it, since there are downsides to case-folding too.  It might
 make more sense to auto-generate redirects instead, which would be a
 much easier project that wouldn't have the downsides.



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-03 Thread Andrew Dunbar
On 4 May 2011 06:33, Trevor Parscal tpars...@wikimedia.org wrote:
 I think the idea that we might break the existing PHP parser out into a
 library for general use is rather silly.

 The parser is not a parser, it's a macro expander with a pile of
 regular-expressions used to convert short-hand HTML into actual HTML. The

Oh, don't be silly. It may not be an LALR(1) parser or an LL parser or
even a recursive-descent parser, but last I checked parsing was the act
of breaking down a text into its elements, which the parser does. It
just does it in a pretty clunky way. Whether it stores the results in
an AST or in bunches of random state all over the place doesn't mean
it's doing something other than parsing.

A more accurate argument is that it's not just a parser, since it goes
directly on to transforming the input into HTML, which is the
equivalent of code generation.

 code that it outputs is highly dependent on the state of the wiki's
 configuration and database content at the moment of parsing. It also is
 useless to anyone wanting to do anything other than render a page into HTML,
 because the output is completely opaque as to where any of it
 was derived. Dividing the parser off into a library would require an
 substantial amount of MediaWiki code to be ported too just to get it
 working. On it's own, it would be essentially useless.

It seems we're getting bogged down in semantics because in MediaWiki we
use the word parser in two incompatible ways: 1) the PHP classes
which convert wikitext to HTML; 2) a hypothetical or postulated part of
MediaWiki, which does not yet exist, that would generate an intermediate
form (AST) between wikitext and HTML.

So the first thing we need to do is decide which of these two concepts
of parser we're talking about.

Would it be useful to have a library that can convert wikitext to HTML? Yes.
Would it be useful to have a library that can convert wikitext to an
AST? Unclear.
Would it be useful to have a library that can convert such an AST to
HTML? Because of the semantic soup nobody has even brought this up yet.

 So, it's probably not an issue what license this hypothetical code would be
 released under.

 - Trevor

I'm pretty sure the offline wikitext parsing community would care
about the licensing as a separate issue from what kind of parser
technology it uses internally.

Andrew Dunbar (hippietrail)

 On Tue, May 3, 2011 at 1:25 PM, David Gerard dger...@gmail.com wrote:

 On 3 May 2011 21:15, Domas Mituzas midom.li...@gmail.com wrote:

  Thoughts? Also, for re-licensing, what level of approval do we need?
  All authors of the parser, or the current people in an svn blame?

  Current people are doing 'derivative work' on previous authors work. I
 think all are needed. Pain oh pain.


 This is the other reason to reduce it to mathematics, which can then
 be freely reimplemented.


 - d.



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-03 Thread Andrew Dunbar
On 4 May 2011 08:19, Krinkle krinklem...@gmail.com wrote:
 Op 3 mei 2011, om 22:56 heeft Ryan Lane het volgende geschreven:

 On Tue, May 3, 2011 at 1:33 PM, Trevor Parscal
 tpars...@wikimedia.org wrote:
 On it's own, it would be essentially useless.


 The parser has a configuration state, takes wikitext in, and gives
 back html. It pulls additional data from the database in these steps
 as well, yes. However, I don't see how this would be different than
 any other implementation of the parser. All implementations will
 require configuration state, and will need to deal with things like
 templates and extensions.

 Though I prefer the concept of alternative parsers (for all the
 reasons mentioned in the other threads), I do think having our
 reference implementation available as a library is a good concept. I
 feel that making it available in a suitable license is ideal.

 - Ryan


 Afaik parser does not need a database or extension hooks for minimum but
 fully operational use.

 {{unknown templates}} default to redlinks,
 {{int:messages}} default to unknown,
 tags and {{#functions}} default to literals,
 {{MAGICWORDS}} to red links,
 etc...

 If a user of the parser would not have any of these (either none
 existing or no
 registry / database configured at all). It would fallback to the
 behaviour as if
 they are inexistant, not a problem ?

I agree a parser would not need a database, but it would need a
standard interface or abstraction that, in the full MediaWiki, would
call through to the database. Offline readers would implement this
interface to extract the wikitext from their compressed format or
directly from an XML dump file.

Some data-mining tools might just stub this interface and deal with the
bare minimum.

Extension hooks are more interesting. I would assume offline readers
want results as close to the official sites' as possible, so they will
want to implement the same hooks.

Other non-wikitext or non-page data from the database would also go
through the same interface/abstraction, or a separate one.
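
A rough sketch of the kind of interface I have in mind (the names are made
up for illustration, not an actual MediaWiki API):

    from abc import ABC, abstractmethod

    class WikiEnvironment(ABC):
        """What a standalone parser would ask its host for."""

        @abstractmethod
        def get_template_wikitext(self, title):
            ...

        @abstractmethod
        def page_exists(self, title):
            ...

    class DumpEnvironment(WikiEnvironment):
        """An offline reader backed by a title -> wikitext index of a dump."""

        def __init__(self, index):
            self.index = index  # e.g. a dict, or an offset index into the dump

        def get_template_wikitext(self, title):
            return self.index.get("Template:" + title)

        def page_exists(self, title):
            return title in self.index

    class StubEnvironment(WikiEnvironment):
        """Bare minimum for data-mining tools: everything is missing."""

        def get_template_wikitext(self, title):
            return None

        def page_exists(self, title):
            return False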

Andrew Dunbar (hippietrail)

 By having this available as a parser sites that host blogs and forums
 could potentially use wikitext to format their comments and forum
 threads
 (to avoid visitors from having to for example learn Wikitext for their
 wiki,
 WYSIWYM WYMeditor for WordPress and BBCode for a forum).

 Instead they could all be the same syntax. And within wiki context
 magic words, extensions, int messages etc. would be fed from the wiki
 database,
 outside just static.

 --
 Krinkle








___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Andrew Dunbar
On 4 May 2011 15:16, Tim Starling tstarl...@wikimedia.org wrote:
 On 04/05/11 14:07, Daniel Friesen wrote:
 I'm fairly certain myself that his intention was With HipHop support
 since the C that HipHop compiles PHP to can be extracted and re-used we
 can turn that compiled C into a C library that can be used anywhere by
 abstracting the database calls and what not out of the php version of
 the parser. And because HipHop has better performance we will no longer
 have to worry about parser abstractions slowing down the parser and as a
 result increasing the load on large websites like Wikipedia where they
 are noticeable. So that won't be in the way of adding those abstractions
 anymore.

 Yes that's right, more or less. HipHop generates C++ rather than C
 though.

 Basically you would split the parser into several objects:

 * A parser in the traditional sense.
 * An output callback object, which would handle generation of HTML or
 PDF or syntax trees or whatever.
 * A wiki environment interface object, which would handle link
 existence checks, template fetching, etc.

 Then you would use HipHop to compile:

 * The new parser class.
 * A few useful output classes, such as HTML.
 * A stub environment class which has no dependencies on the rest of
 MediaWiki.

 Then to top it off, you would add:

 * A HipHop extension which provides output and environment classes
 which pass their calls through to C-style function pointers.
 * A stable C ABI interface to the C++ library.
 * Interfaces between various high level languages and the new C
 library, such as Python, Ruby and Zend PHP.

 Doing this would leverage the MediaWiki development community and the
 existing PHP codebase to provide a well-maintained, reusable reference
 parser for MediaWiki wikitext.

+1

This is the single most exciting news on the MediaWiki front since I started
contributing to Wiktionary nine years ago (-:

Andrew Dunbar (hippietrail)

 -- Tim Starling




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Moving the Dump Process to another language

2011-03-25 Thread Andrew Dunbar
 could
 be used for building the dump as well?

 In general, I'm interested in pitching in some effort on anything
 related to the dump/import processes.

 Glad to hear it!  Drop by irc please, I'm in the usual channels. :-)

Just a thought: wouldn't it be easier to generate dumps in parallel if
we did away with the assumption that the dump would be in database
order? The metadata in the dump provides the ordering info for the
people who require it.
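
A toy sketch of the idea, with a hypothetical dump_page() standing in for
the real per-page work; each worker writes its own file, and the page ids
in the output let anyone who needs database order re-sort later:

    from multiprocessing import Pool

    def dump_page(page_id):
        # Hypothetical: fetch and serialize one <page> element.
        return "<page><id>%d</id>...</page>\n" % page_id

    def dump_chunk(args):
        chunk_no, page_ids = args
        with open("dump-part-%03d.xml" % chunk_no, "w") as out:
            for page_id in page_ids:
                out.write(dump_page(page_id))

    if __name__ == "__main__":
        all_ids = list(range(1, 100001))            # any convenient order
        chunks = [all_ids[i::8] for i in range(8)]  # 8 workers, interleaved
        with Pool(8) as pool:
            pool.map(dump_chunk, list(enumerate(chunks)))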

Andrew Dunbar (hippietrail)

 Ariel
 --
 James Linden
 kodekr...@gmail.com
 --


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Foundation-l] Data Summit Streaming

2011-02-11 Thread Andrew Dunbar
It doesn't work for me )-:

Your input can't be opened:
VLC is unable to open the MRL 'http://transcode1.wikimedia.org:8080'.
Check the log for details.

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Foundation-l] Data Summit Streaming

2011-02-11 Thread Andrew Dunbar
On 11 February 2011 22:18, Chad innocentkil...@gmail.com wrote:
 On Fri, Feb 11, 2011 at 5:57 AM, Andrew Dunbar hippytr...@gmail.com wrote:
 It doesn't work for me )-:

 Your input can't be opened:
 VLC is unable to open the MRL 'http://transcode1.wikimedia.org:8080'.
 Check the log for details.


 It was a stream. It's not streaming anything right now. Dunno
 if videos will be posted somewhere.

oh (-:

 -Chad



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Matching main namespace articles with associated talk page

2011-01-09 Thread Andrew Dunbar
On 9 January 2011 02:05, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
 On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere dvanli...@gmail.com 
 wrote:
 Yes, manually matching is fairly simple but in the worst case you need
 to iterate over n-1 talk pages (where n is the total number of talk
 pages of a Wikipedia) to find the talk page that belongs to a user
 page when using the dump files. Hence, if the dump file would contain
 for each article a tag with talk page id then it would significantly
 reduce the processing time.

 You're expected to build indexes for things like this.  If you import
 the data into MySQL, for instance, you can just do a join (since
 MediaWiki has good indexes by default).  If you're writing data
 analysis code manually for some reason, load the data into an on-disk
 B-tree, and then your worst case is logarithmic.  Without indexes,
 pretty much any operation on the data is going to take linear time.
 (In fact, so is lookup by page id, unless you're just doing a binary
 search on the dump file and assuming it's in id order . . .)

 If you don't want to set up a database yourself, you might want to
 look into getting a toolserver account, if you don't have one.  This
 would allow you read access to a live replica of Wikipedia's database,
 which of course has all these indexes.

You don't even have to use a B-tree if that's beyond you. I just sort
the titles and then use a binary search on them. That's plenty fast even
in Perl and JavaScript.
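
For example, with Python's bisect module over a sorted list of
(title, offset) pairs (the offsets here are invented):

    import bisect

    # (title, byte offset into the uncompressed dump), sorted by title.
    index = [("apple", 1024), ("banana", 5231), ("cherry", 9876)]
    titles = [t for t, _ in index]

    def lookup(title):
        i = bisect.bisect_left(titles, title)
        if i < len(titles) and titles[i] == title:
            return index[i][1]
        return None

    print(lookup("banana"))  # 5231
    print(lookup("durian"))  # None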

Andrew Dunbar (hippietrail)



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Big problem to solve: good WYSIWYG on WMF wikis

2011-01-03 Thread Andrew Dunbar
On 3 January 2011 21:54, Andreas Jonsson andreas.jons...@kreablo.se wrote:
 2010-12-29 08:33, Andrew Dunbar skrev:
 I've thought a lot about this too. It certainly is not any type of
 standard grammar. But on the other hand it is a pretty common kind of
 nonstandard grammar. I call it a recursive text replacement grammar.

 Perhaps this type of grammar has some useful characteristics we can
 discover and document. It may be possible to follow the code flow and
 document each text replacement in sequence as a kind of parser spec
 rather than trying and failing again to shoehorn it into a standard
 LALR grammar.

 If it is possible to extract such a spec it would then be possible to
 implement it in other languages.

 Some research may even find that it is possible to transform such a
 grammar deterministically into an LALR grammar...

 But even if not, I'm certain it would demystify what happens in the
 parser so that problems and edge cases would be easier to locate.

 From my experience of implementing a wikitext parser, I would say that
 it might be possible to transform wikitext to a token stream that is
 possible to parse with a LALR parser.  My implementation
 (http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser)
 uses Antlr (which is an LL parser generator) and only rely on context
 sensitive parsing (Antlr's semantic predicates) for parsing
 apostrophes (bold and italics), and this might be possible to solve in
 a different way.  The rest of the complex cases are handled by the
 lexical analyser that produce a well behaving token stream that can be
 relatively straightforwardly parsed.

 My implementation is not 100% compatible, but I think that a 100%
 compatible parser is not desirable since the most exotic border cases
 would probably be characterized as bugs anyway (e.g. [[Link|table
 class=]]).  But I think that the basic idea can be used to produce
 a sufficiently compatible parser.

In that case what is needed is to hook your parser into our current code
and get it to create output, if you have not done that already. Then you
will want to run the existing parser tests on it. Then you will want to
run both parsers over a large sample of existing Wikipedia articles (make
sure you use the same revisions with both parsers!) and run the output
through diff. Then we'll have a decent idea of whether there are any edge
cases you didn't spot or whether any of them are exploited in template
magic.

Let us know the results!
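
Something along these lines, where parse_old() and parse_new() are just
placeholders for the two parsers, not real functions:

    import difflib

    def parse_old(wikitext):
        return "<p>%s</p>" % wikitext  # placeholder for the current parser

    def parse_new(wikitext):
        return "<p>%s</p>" % wikitext  # placeholder for the new parser

    def compare(samples):
        # samples: (title, wikitext) pairs, the same revisions for both parsers
        for title, wikitext in samples:
            old_html = parse_old(wikitext).splitlines(keepends=True)
            new_html = parse_new(wikitext).splitlines(keepends=True)
            diff = list(difflib.unified_diff(old_html, new_html,
                                             fromfile=title + " (old)",
                                             tofile=title + " (new)"))
            if diff:
                print("".join(diff))

    compare([("Example", "'''bold''' and [[link]]")])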

Andrew Dunbar (hippietrail)


 Best Regards,

 /Andreas




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Big problem to solve: good WYSIWYG on WMF wikis

2010-12-28 Thread Andrew Dunbar
On 29 December 2010 02:07, Happy-melon happy-me...@live.com wrote:
 There are some things that we know:

 1) as Brion says, MediaWiki currently only presents content in one way: as
 wikitext run through the parser.  He may well be right that there is a
 bigger fish which could be caught than WYSIWYG editing by saying that MW
 should present data in other new and exciting ways, but that's actually a
 separate question.  *If* you wish to solve WYSIWYG editing, your baseline is
 wikitext and the parser.

Specifically, it only presents content as HTML. It's not really a
parser because it doesn't create an AST (Abstract Syntax Tree). It's a
wikitext to HTML converter. The flavour of the HTML can be somewhat
modulated by the skin but it could never output directly to something
totally different like RTF or PDF.

 2) guacamole is one of the more unusual descriptors I've heard for the
 parser, but it's far from the worst.  We all agree that it's horribly messy
 and most developers treat it like either a sleeping dragon or a *very*
 grumpy neighbour.  I'd say that the two biggest problems with it are that a)
 it's buried so deep in the codebase that literally the only way to get your
 wikitext parsed is to fire up the whole of the rest of MediaWiki around it
 to give it somewhere comfy to live in,

I have started to advocate isolating the parser from the rest
of the innards of MediaWiki for just this reason:
https://bugzilla.wikimedia.org/show_bug.cgi?id=25984

Free it up so that anybody can embed it in their code and get exactly
the same rendering that Wikipedia et al get, guaranteed.

We have to find all the edges where the parser calls other parts of
MediaWiki and all the edges where other parts of MediaWiki call the
parser. We then define these edges as interfaces so that we can drop
an alternative parser into MediaWiki and drop the current parser into,
say, an offline viewer or whatever.

With a freed up parser more people will hack on it, more people will
come to grok it and come up with strategies to address some of its
problems. It should also be a boon for unit testing.

(I have a very rough prototype working by the way with lots of stub classes)

 and b) there is as David says no way
 of explaining what it's supposed to be doing except saying follow the code;
 whatever it does is what it's supposed to do.  It seems to be generally
 accepted that it is *impossible* to represent everything the parser does in
 any standard grammar.

I've thought a lot about this too. It certainly is not any type of
standard grammar. But on the other hand it is a pretty common kind of
nonstandard grammar. I call it a recursive text replacement grammar.

Perhaps this type of grammar has some useful characteristics we can
discover and document. It may be possible to follow the code flow and
document each text replacement in sequence as a kind of parser spec
rather than trying and failing again to shoehorn it into a standard
LALR grammar.

If it is possible to extract such a spec it would then be possible to
implement it in other languages.

Some research may even find that it is possible to transform such a
grammar deterministically into an LALR grammar...

But even if not, I'm certain it would demystify what happens in the
parser so that problems and edge cases would be easier to locate.

Andrew Dunbar (hippietrail)

 Those are all standard gripes, and nothing new or exciting.  There are also,
 to quote a much-abused former world leader, some known unknowns:

 1) we don't know how to explain What You See when you parse wikitext except
 by prodding an exceedingly grumpy hundred thousand lines of PHP and *asking
 What it thinks* You Get.

 2) We don't know how to create a WYSIWYG editor for wikitext.

 Now, I'd say we have some unknown unknowns.

 1) *is* it because of wikitext's idiosyncracies that WYSIWYG is so
 difficult?  Is wikitext *by its nature* not amenable to WYSIWYG editing?

 2) would a wikitext which *was* representable in a standard grammar be
 amenable to WYSIWYG editing?

 3) would a wikitext which had an alternative parser, one that was not buried
 in the depths of MW (perhaps a full JS library that could be called in
 real-time on the client), be amenable to WYSIWYG editing?

 4) are questions 2 and 3 synonymous?

 --HM


 David Gerard dger...@gmail.com wrote in
 message news:aanlktimthux-undo1ctnexcrqbpp89t2m-pvha6fk...@mail.gmail.com...
 [crossposted to foundation-l and wikitech-l]


 There has to be a vision though, of something better. Maybe something
 that is an actual wiki, quick and easy, rather than the template
 coding hell Wikipedia's turned into. - something Fred Bauder just
 said on wikien-l.


 Our current markup is one of our biggest barriers to participation.

 AIUI, edit rates are about half what they were in 2005, even as our
 fame has gone from popular through famous to part of the
 structure of the world. I submit that this is not a good or healthy
 thing in any way and needs fixing

[Wikitech-l] Offline wiki tools

2010-12-15 Thread Andrew Dunbar
I've long been interested in offline tools that make use of Wikimedia
information, particularly the English Wiktionary.

I've recently come across a tool which can provide random access to a
bzip2 archive without decompressing it and I would like to make use of
it in my tools but I can't get it to compile and/or function with any
free Windows compiler I have access to. It works fine on the *nix
boxes I have tried but my personal machine is a Windows XP netbook.

The tool is seek-bzip2 by James Taylor and is available here:
http://bitbucket.org/james_taylor/seek-bzip2

* The free Borland compiler won't compile it due to missing (Unix?) header files.
* lcc compiles it, but it always fails with an "unexpected EOF" error.
* MinGW compiles it if the -m64 option is removed from the Makefile,
but it then has the same behaviour as the lcc build.

My C experience is now quite stale and my 64-bit programming
experience negligible.

(I'm also interested in hearing from other people working on offline
tools for dump files, wikitext parsing, or Wiktionary)

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Offline wiki tools

2010-12-15 Thread Andrew Dunbar
2010/12/16 Ángel González keis...@gmail.com:
 On 15/12/10 16:21, Andrew Dunbar wrote:
 I've long been interested in offline tools that make use of WikiMedia
 information, particularly the English Wiktionary.

 I've recently come across a tool which can provide random access to a
 bzip2 archive without decompressing it and I would like to make use of
 it in my tools but I can't get it to compile and/or function with any
 free Windows compiler I have access to. It works fine on the *nix
 boxes I have tried but my personal machine is a Windows XP netbook.

 The tool is seek-bzip2 by James Taylor and is available here:
 http://bitbucket.org/james_taylor/seek-bzip2

 * The free Borland compiler won't compile it due to missing (Unix?) header 
 files
 * lcc compiles it but it always fails with error unexpected EOF
 * mingw compiles it if the -m64 option is removed from the Makefile
 but it then has the same behaviour as the lcc build.

 My C experience is now quite stale and my 64-bit programming
 experience negligible.

 (I'm also interested in hearing from other people working on offline
 tools for dump files, wikitext parsing, or Wiktionary)

 Andrew Dunbar (hippietrail)

 Your problem are Windows text streams. The attached patch fixes it.

 Thank you for the link. I was completely unaware of it when I basically
 did the same thing for mediawiki a couple years ago.
 http://www.wiki-web.es/mediawiki-offline-reader/


Thanks Ángel! I feel like a fool for not realizing this. It's the same
problem I've worked around many times in the past, but not recently. I
just got a similar answer on stackoverflow.com.

By the way, I'm keen to find something similar for .7z.

It would be incredibly useful if these indices could be created as
part of the dump creation process. Should I file a feature request?

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Offline wiki tools

2010-12-15 Thread Andrew Dunbar
On 15 December 2010 20:41, Anthony wikim...@inbox.org wrote:
 On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar hippytr...@gmail.com wrote:
 By the way I'm keen to find something similar for .7z

 I've written something similar for .xz, which uses LZMA2 same as .7z.
 It creates a virtual read-only filesystem using FUSE (the FUSE part is
 in perl, which uses pipes to dd and xzcat).  Only real problem is that
 it doesn't use a stock .xz file, it uses a specially created one which
 concatenates lots of smaller .xz files (currently I concatenate
 between 5 and 20 or so 900K bz2 blocks into one .xz stream - between 5
 and 20 because there's a preference to split on /pagepage
 boundaries).

At the moment I'm interested in .bz2 and .7z because those are the
formats Wikimedia currently publishes data in. Some files are
also in .gz, though, so I would like to find a solution for those too.

I thought about the concatenation solution, splitting at page
boundaries, for .bz2 until I found out there was already a solution
that worked with the vanilla dump files as is.
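
For what it's worth, the concatenation approach is easy to consume from
Python's standard library: if each stream starts at a known byte offset
(recorded in a small index built at dump time), you can seek straight to it
and decompress only that stream. A sketch with invented file names and
offsets:

    import bz2

    def read_stream(dump_path, offset):
        """Decompress one bz2 stream starting at `offset` in a
        multistream-style dump, without touching the rest of the file."""
        decomp = bz2.BZ2Decompressor()
        out = []
        with open(dump_path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                out.append(decomp.decompress(chunk))
        return b"".join(out)

    xml_fragment = read_stream("pages-articles-multistream.xml.bz2", 123456789)
    print(xml_fragment[:200])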

 Apparently the folks at openzim have done something similar, using LZMA2.

 If anyone is interesting in working with me to make a package capable
 of being released to the public, I'd be willing to share my code.  But
 it sounds like I'm just reinventing a wheel already invented by
 opensim.

I'm interested in what everybody else is doing regarding offline
Wikimedia content. I'm also mainly using Perl, though I just ran into a
problem with 64-bit values when indexing huge dump files.

 It would be incredibly useful if these indices could be created as
 part of the dump creation process. Should I file a feature request?

 With concatenated .xz files, creating the index is *much* faster,
 because the .xz format puts the stream size at the end of each stream.
  Plus with .xz all streams are broken on 4-byte boundaries, whereas
 with .bz2 blocks can end at any *bit* (which means you have to do
 painful bit shifting to create the index).

 The file is also *much* smaller, on the order of 5-10% of bzip2 for a
 full history dump.

Have we made the case for this format to the Wikimedia people? I think
they use .bz2 because it is pretty fast for very good compression
ratios, but they use .7z for the full-history dumps, where the extremely
good compression ratios warrant the slower compression times, since
these files can be gigantic.

How is .xz for compression times? Would we have to worry about patent
issues for LZMA?

Andrew Dunbar (hippietrail)



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Offline wiki tools

2010-12-15 Thread Andrew Dunbar
On 15 December 2010 20:24, Manuel Schneider
manuel.schnei...@wikimedia.ch wrote:
 Hi Andrew,

 maybe you'd like to check out ZIM: This is an standardized file format
 for compressed HTML dumps, focused on Wikimedia content at the moment.

 There is some C++ code around to read and write ZIM files and there are
 several projects using that, eg. the WP1.0 project, the Israeli and
 Kenyan Wikipedia Offline initiatives and more. Also the Wikimedia
 Foundation is currently in progress to adopt the format to provide ZIM
 files from Wikimedia wikis in the future.

This is very interesting and I'll be watching it. Where do the HTML
dumps come from? I'm pretty sure I've only seen static HTML dumps for
Wikipedia and not for Wiktionary, for example. I am also looking at
adapting the parser for offline use, to generate HTML from the dump
file wikitext.

Andrew Dunbar (hippietrail)

 http://openzim.org/

 /Manuel

 Am 15.12.2010 16:21, schrieb Andrew Dunbar:
 I've long been interested in offline tools that make use of WikiMedia
 information, particularly the English Wiktionary.

 I've recently come across a tool which can provide random access to a
 bzip2 archive without decompressing it and I would like to make use of
 it in my tools but I can't get it to compile and/or function with any
 free Windows compiler I have access to. It works fine on the *nix
 boxes I have tried but my personal machine is a Windows XP netbook.

 The tool is seek-bzip2 by James Taylor and is available here:
 http://bitbucket.org/james_taylor/seek-bzip2

 * The free Borland compiler won't compile it due to missing (Unix?) header 
 files
 * lcc compiles it but it always fails with error unexpected EOF
 * mingw compiles it if the -m64 option is removed from the Makefile
 but it then has the same behaviour as the lcc build.

 My C experience is now quite stale and my 64-bit programming
 experience negligible.

 (I'm also interested in hearing from other people working on offline
 tools for dump files, wikitext parsing, or Wiktionary)

 Andrew Dunbar (hippietrail)



 --
 Regards
 Manuel Schneider

 Wikimedia CH - Verein zur Förderung Freien Wissens
 Wikimedia CH - Association for the advancement of free knowledge
 www.wikimedia.ch



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] require language dump for developing words and corresponding frequency

2010-12-14 Thread Andrew Dunbar
The dump site (http://download.wikimedia.org/) is still broken at the
moment but another way to build some word frequency data is by
randomly sampling the wikis for the languages you are interested in.
At least these Indic languages have Wikipedias of varying sizes:

Assamese http://as.wikipedia.org
Bihari http://bh.wikipedia.org
Bengali http://bn.wikipedia.org
Bishnupriya Manipuri http://bpy.wikipedia.org
Gujarati http://gu.wikipedia.org
Hindi http://hi.wikipedia.org
Kannada http://kn.wikipedia.org
Kashmiri http://ks.wikipedia.org
Marathi http://mr.wikipedia.org
Nepali http://ne.wikipedia.org
Nepal Bhasa http://new.wikipedia.org
Oriya http://or.wikipedia.org
Eastern Punjabi http://pa.wikipedia.org
Western Punjabi http://pnb.wikipedia.org
Sanskrit http://sa.wikipedia.org
Sindhi  http://sd.wikipedia.org
Tamil http://ta.wikipedia.org
Telugu http://te.wikipedia.org
Urdu http://ur.wikipedia.org

If you'd like to use it, I have a tool that downloads random samples of
wiki pages and strips the HTML for purposes such as this.
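
It's nothing fancy; the idea is roughly this (not my actual tool, just a
sketch using the API and the `requests` library, with very crude markup
stripping):

    import collections
    import re
    import requests

    API = "https://hi.wikipedia.org/w/api.php"  # Hindi, as an example

    def random_titles(n):
        r = requests.get(API, params={"action": "query", "list": "random",
                                      "rnnamespace": 0, "rnlimit": n,
                                      "format": "json"})
        return [p["title"] for p in r.json()["query"]["random"]]

    def page_wikitext(title):
        r = requests.get(API, params={"action": "query", "prop": "revisions",
                                      "rvprop": "content", "titles": title,
                                      "format": "json"})
        page = next(iter(r.json()["query"]["pages"].values()))
        return page["revisions"][0]["*"] if "revisions" in page else ""

    freq = collections.Counter()
    for title in random_titles(10):
        text = re.sub(r"\{\{.*?\}\}|\[\[|\]\]|<[^>]+>|=+|'''?", " ",
                      page_wikitext(title))
        freq.update(text.split())

    for word, count in freq.most_common(20):
        print(count, word)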

Good luck!

Andrew Dunbar (hippietrail)

On 14 December 2010 18:36, pravin@gmail.com pravin@gmail.com wrote:
 Hi All,

  I am Pravin Satpute, I am working on language technology and for building
 words and it frequency, i required some webpages in indic language.

 Can i get the most recent dump without en.wiki

 Thanks,
 Pravin s


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] How to find the version of a dump

2010-12-14 Thread Andrew Dunbar
On 14 December 2010 01:57, Monica shu monicashu...@gmail.com wrote:
 Thanks Diederik and Waksman,

 It seems that I need to do parse the dump for article data to get this piece
 of information...
 Yes, this will be the last choice, but I think there maybe some easier
 way...

 I just got home and checked the dump I've downloaded.
 It's downloaded on June, 10, 2010, the size is 6117881141 in bz2.
 I remember when I download, it's the latest version at that moment.
 As the dumps are generated every N months, and the one I have is bigger that
 the version 2010-01-30 as Waksman said, my version should be between Feb to
 June.

A Google search hints that enwiki-20100312-pages-articles.xml.bz2
might be the one with size 6117881141.

Andrew Dunbar (hippietrail)


 Does anybody remember the version between this period, or happened to
 download the same version with me?

 Thanks very much to tell me any related information again!


 Best regards!
 Monica




 On Mon, Dec 13, 2010 at 3:24 PM, Shaun Waksman shaunwaks...@gmail.comwrote:

 Hi Monica,

 The file sizes of the EN pages dumps that are available today are:

 5204823166  enwiki-20100312-pages-articles.xml.7z
 5983814213  enwiki-20100130-pages-articles.xml.bz2

 Note that the former is in 7z and the later is in bz2

 Does this help?

 Shaun


 On Mon, Dec 13, 2010 at 8:45 AM, Monica shu monicashu...@gmail.com
 wrote:

  Hi all,
 
  I have downloaded a dump several month ago.
  By accidentally, I lost the version info of this dump, so I don't know
 when
  this dump was generated.
  Is there any place that list out info about the past dumps(such as
  size...)?
 
  Thanks!
 
  Monica


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] How to find the version of a dump

2010-12-14 Thread Andrew Dunbar
On 14 December 2010 20:04, Andrew Dunbar hippytr...@gmail.com wrote:
 On 14 December 2010 01:57, Monica shu monicashu...@gmail.com wrote:
 Thanks Diederik and Waksman,

 It seems that I need to do parse the dump for article data to get this piece
 of information...
 Yes, this will be the last choice, but I think there maybe some easier
 way...

 I just got home and checked the dump I've downloaded.
 It's downloaded on June, 10, 2010, the size is 6117881141 in bz2.
 I remember when I download, it's the latest version at that moment.
 As the dumps are generated every N months, and the one I have is bigger that
 the version 2010-01-30 as Waksman said, my version should be between Feb to
 June.

 A Google search hints that enwiki-20100312-pages-articles.xml.bz2
 might be the one with size 6117881141.

 Andrew Dunbar (hippietrail)


 Does anybody remember the version between this period, or happened to
 download the same version with me?

 Thanks very much to tell me any related information again!


 Best regards!
 Monica




 On Mon, Dec 13, 2010 at 3:24 PM, Shaun Waksman shaunwaks...@gmail.comwrote:

 Hi Monica,

 The file sizes of the EN pages dumps that are available today are:

 5204823166  enwiki-20100312-pages-articles.xml.7z
 5983814213  enwiki-20100130-pages-articles.xml.bz2

 Note that the former is in 7z and the later is in bz2

 Does this help?

 Shaun


 On Mon, Dec 13, 2010 at 8:45 AM, Monica shu monicashu...@gmail.com
 wrote:

  Hi all,
 
  I have downloaded a dump several month ago.
  By accidentally, I lost the version info of this dump, so I don't know
 when
  this dump was generated.
  Is there any place that list out info about the past dumps(such as
  size...)?
 
  Thanks!
 
  Monica



It should be trivial to add the dump date to the header of each dump
file. Since the date field of the filename is often replaced by
"latest", this could be very useful. It could also be useful to include
the revision ID and timestamp of the latest revision, but I assume this
would be a little more difficult. Should I file a feature request?

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Looking for a mediawiki.org dump

2010-12-05 Thread Andrew Dunbar
Could anybody help me locate a dump of mediawiki.org while the dump
server is broken please? I only need current revisions.

Thanks in advance.

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] alternative way to get wikipedia dump while server is down

2010-11-27 Thread Andrew Dunbar
On 28 November 2010 02:42, Jeff Kubina jeff.kub...@gmail.com wrote:
 I have a copy of the 20091009 enwiki dumps if that would do:

 http://jeffkubina.org/data/download.wikimedia.org/enwiki/20091009/

 Jeff
 --
 Jeff Kubina http://google.com/profiles/jeff.kubina
 410-988-4436
 8am-10pm EST

 On Thu, Nov 25, 2010 at 12:30 PM, Oliver Schmidt 
 schmidt...@email.ulster.ac.uk wrote:

 Hello alltogether,

 is there any alternative way to get hands on a wikipedia dump?
 Preferably the last complete one.
 Which was supposed to be found at this address:
 http://download.wikimedia.org/enwiki/20100130/

 I would need that dump asap for my research.
 Thank you for any help!

 Best regards


 —

 Oliver Schmidt
 PhD student
 Nano Systems Biology Research Group

 University of Ulster, School of Biomedical Sciences
 Cromore Road, Coleraine BT52 1SA, Northern Ireland

 T: +44 / (0)28 / 7032 3367
 F: +44 / (0)28 / 7032 4375
 E: schmidt...@email.ulster.ac.ukmailto:schmidt...@email.ulster.ac.uk

 —



I don't suppose anybody has a copy of any Romanian or Georgian
Wiktionary dump from any time? (-:

Andrew Dunbar (hippietrail)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Invoking maintenance scripts return nothing at all

2010-11-16 Thread Andrew Dunbar
I wish to do some MediaWiki hacking which uses the codebase,
specifically the parser, but not the database or web server.
I'm running on Windows XP on an offline machine with PHP installed but
no MySQL or web server.
I've unarchived the source and grabbed a copy of somebody's
LocalSettings.php but not attempted to install MediaWiki beyond
this.

Obviously I don't expect to be able to do much, but when I try to run
any of the maintenance scripts I get no output whatsoever, not even
errors.

I was hoping to let the error messages guide me as to what is
essential, what needs to be stubbed, wrapped etc.

Am I missing something obvious or do these scripts return no errors by design?

Andrew Dunbar (hippietrail)

-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Invoking maintenance scripts return nothing at all

2010-11-16 Thread Andrew Dunbar
On 17 November 2010 02:37, Dmitriy Sintsov ques...@rambler.ru wrote:
 * Andrew Dunbar hippytr...@gmail.com [Tue, 16 Nov 2010 23:01:33
 +1100]:
 I wish to do some MediaWiki hacking which uses the codebase,
 specifically the parser, but not the database or web server.
 I'm running on Windows XP on an offline machine with PHP installed but
 no MySql or web server.
 I've unarchived the source and grabbed a copy of somebody's
 LocalSettings.php but not attempted to to install MediaWiki beyond
 this.

 Obviously I don't expect to be able to do much, but when I try to run
 any of the maintenance scripts I get no output whatsoever, not even
 errors.

 I was hoping to let the error messages guide me as to what is
 essential, what needs to be stubbed, wrapped etc.

 Am I missing something obvious or do these scripts return no errors by
 design?

 Andrew Dunbar (hippietrail)

 In the web environment, error messages may expose vulnerabilities to
 potential attacker. The errors might be written to php's error log,
 which is set up by

 error_log=path

 directive in php.ini. You may find the actual location of php.ini by
 executing

 php --ini

 Look also at the whole Error handling and logging section

 Does php work at all? Is there an configuration output

 php -r phpinfo();

 when issued from cmd.exe ?

 Does

 php dumpBackup.php --help

 being issued from /maintenance directory, produces the command line
 help?
 Dmitriy

Thanks Dmitriy. PHP does work. The --help options always work. It
turned out the LocalSettings.php somebody on #mediawiki pointed me to
require_once()'d several extensions I didn't have, and require_once()
seems to fail silently. I'll try to acquaint myself better with the
Error handling and logging section as you suggest.

Is there somewhere an official blank or example LocalSettings.php
file that would be better to use for people like me to avoid such
problems? Rolling my own from scratch doesn't seem ideal either.
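
In the meantime, something like this around the extension includes would at
least report what's missing rather than dying silently. The extension names
here are just placeholders, not the ones from the file I was given, and it
assumes $IP is set at the top of LocalSettings.php as usual:

  # Wrap optional extension includes so missing extensions are reported
  # instead of silently breaking the maintenance scripts.
  foreach ( array( 'ParserFunctions', 'Cite' ) as $ext ) {
      $path = "$IP/extensions/$ext/$ext.php";
      if ( file_exists( $path ) ) {
          require_once( $path );
      } else {
          print "Skipping missing extension: $ext\n";
      }
  }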

Andrew Dunbar (hippietrail)

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] API vs data dumps

2010-11-07 Thread Andrew Dunbar
On 14 October 2010 09:37, Alex Brollo alex.bro...@gmail.com wrote:
 2010/10/13 Paul Houle p...@ontology2.com


     Don't be intimidated by working with the data dumps.  If you've got
 an XML API that does streaming processing (I used .NET's XmlReader) and
 use the old unix trick of piping the output of bunzip2 into your
 program,  it's really pretty easy.


 When I worked on it.source (a small dump! something like 300 MB unzipped),
 I used a simple do-it-yourself Python string-search routine and found it
 much faster than the Python XML routines. I presume my scripts are really
 too rough to deserve sharing, but I encourage programmers to write a simple
 dump reader that exploits the speed of string search. My personal trick was to build an
 index, i.e. a list of pointers to the articles and their names in the XML
 file, so that it was simple and fast to recover their content. I used it
 mainly because I didn't understand the API at all. ;-)

 Alex


Hi Alex. I have been doing something similar in Perl for a few years
for the English Wiktionary. I've never been sure of the best way to
store all the index files I create, especially in code I'd like to
share with other people. If you, or anyone else for that matter, would
like to collaborate, that would be pretty cool.

You'll find my stuff on the Toolserver:
https://fisheye.toolserver.org/browse/enwikt
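
To give an idea of what I mean by an index file, here is a stripped-down
sketch in PHP (my real code is Perl and messier); the file names are made up
and there's no error handling:

  // Record the byte offset and title of every <page> in an uncompressed dump
  // so articles can be pulled out later with fseek() instead of rescanning.
  $in  = fopen( 'enwiktionary-pages-articles.xml', 'rb' );
  $out = fopen( 'page-offsets.txt', 'wb' );
  $offset = 0;
  $pageStart = 0;
  while ( ( $line = fgets( $in ) ) !== false ) {
      if ( strpos( $line, '<page>' ) !== false ) {
          $pageStart = $offset;
      } elseif ( preg_match( '!<title>(.*?)</title>!', $line, $m ) ) {
          fwrite( $out, "$pageStart\t$m[1]\n" );
      }
      $offset += strlen( $line );
  }
  fclose( $in );
  fclose( $out );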

Andrew Dunbar (hippietrail)


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Datamining infoboxes

2009-10-25 Thread Andrew Dunbar
2009/10/23 Aryeh Gregor simetrical+wikil...@gmail.com:
 On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar hippytr...@gmail.com wrote:
 Yes I didn't specify tl_namespace

 In MySQL that will usually make it impossible to effectively use an
 index on (tl_namespace, tl_title), so it's essential that you specify
 the NS.  (Which you should anyway to avoid hitting things like
 [[Template talk:Infobox language]].)  Some DBMSes (including sometimes
 MySQL >= 5.0, although apparently not here) are smart enough to use
 this kind of index pretty well even if you don't specify the
 namespace, but it would still be somewhat more efficient to specify it
 -- the DB would have to do O(1/n) times as many index lookups, where n
 is the number of namespaces.

 and when I check for which columns
 have keys I could see none:
 mysql> describe templatelinks;
 +--+-+--+-+-+---+
 | Field        | Type            | Null | Key | Default | Extra |
 +--+-+--+-+-+---+
 | tl_from      | int(8) unsigned | NO   |     | 0       |       |
 | tl_namespace | int(11)         | NO   |     | 0       |       |
 | tl_title     | varchar(255)    | NO   |     |         |       |
 +--+-+--+-+-+---+
 3 rows in set (0.01 sec)

 The toolserver database uses views.  In MySQL, views can't have
 indexes themselves, but your query is rewritten to run against the
 real table -- which you can't access directly, but which does have
 indexes.  EXPLAIN is your best bet here:

 mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_title IN
 ('Infobox_Language', 'Infobox_language');
 +----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
 | id | select_type | table         | type  | possible_keys | key     | key_len | ref  | rows      | Extra                    |
 +----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
 |  1 | SIMPLE      | templatelinks | index | NULL          | tl_from | 265     | NULL | 149740990 | Using where; Using index |
 +----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
 1 row in set (0.00 sec)

 mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_namespace=10
 AND tl_title IN ('Infobox_Language', 'Infobox_language');
 +----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
 | id | select_type | table         | type  | possible_keys | key          | key_len | ref  | rows | Extra                    |
 +----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
 |  1 | SIMPLE      | templatelinks | range | tl_namespace  | tl_namespace | 261     | NULL | 6949 | Using where; Using index |
 +----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
 1 row in set (0.00 sec)

 Note the number of rows scanned in each case.  Your query was scanning
 all of templatelinks, the other is retrieving the exact rows needed
 and not looking at any others (type = index vs. range).  The
 reason for this is given in the possible_keys column: MySQL can find
 no keys that are usable for lookup, if you omit tl_namespace.

Thanks for the very informative reply. I already knew most of this
stuff passively, except for database/SQL views. Now I've just got to
put it into practice.

Andrew Dunbar (hippietrail)

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Datamining infoboxes

2009-10-23 Thread Andrew Dunbar
2009/10/23 Robert Ullmann rlullm...@gmail.com:
 I've been spending hours on the parsing now and don't find it simple
 at all due to the fact that templates can be nested. Just extracting
 the Infobox as one big lump is hard due to the need to match nested {{
 and }}

 Andrew Dunbar (hippietrail)

 Hi,

 Come now, you are over-thinking it. Find {{Infobox [Ll]anguage in
 the text, then count braces. Start at depth=2, count up and down 'till
 you reach 0, and you are at the end of the template. (you can be picky
 about only counting them if paired if you like ;-)

Actually you have to find {{[Ii]nfobox[ _][Ll]anguage, and I wanted to
be robust. It's perfectly legal for single unmatched braces to appear
anywhere, and I didn't want them to break my code. As it happens there
don't currently seem to be any in the language infoboxes.
I couldn't be sure whether there would be any cases where a {{{ or }}}
might show up either. And there are a few other edge cases such as HTML
comments, nowiki and friends, template invocations in values, and
even possibly template invocations in names?
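
For what it's worth, this is roughly the brace-counting part in PHP, ignoring
the nowiki/comment/triple-brace edge cases above; just a sketch, not
production code:

  // Return the first {{Infobox language ...}} template, balancing nested {{ }}.
  function extractInfobox( $wikitext ) {
      if ( !preg_match( '/\{\{\s*[Ii]nfobox[ _][Ll]anguage/', $wikitext, $m, PREG_OFFSET_CAPTURE ) ) {
          return null;
      }
      $start = $m[0][1];
      $depth = 0;
      for ( $i = $start; $i < strlen( $wikitext ) - 1; $i++ ) {
          $pair = substr( $wikitext, $i, 2 );
          if ( $pair === '{{' ) {
              $depth++;
              $i++;
          } elseif ( $pair === '}}' ) {
              $depth--;
              $i++;
              if ( $depth === 0 ) {
                  return substr( $wikitext, $start, $i - $start + 1 );
              }
          }
      }
      return null; // unbalanced braces
  }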

 Then just regex match the lines/parameters you want.

 However, if you are pulling the wikitext with the API, the XML parse
 tree option sounds good; then you can just use elementTree (or the
 like) and pull out the parameters directly

I've finally got it extracting the name/value pairs from the XML, but
parsing XML is always a pain. And it still misses Norwegian, Bokmål,
and Nynorsk, which wrap the infobox in another template...

Andrew Dunbar (hippietrail)

 Robert

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Datamining infoboxes

2009-10-22 Thread Andrew Dunbar
Infoboxes in Wikipedia often contain information which is quite useful
outside Wikipedia but can be surprisingly difficult to data-mine.

I would like to find all Wikipedia pages that use
Template:Infobox_Language and parse the parameters iso3 and
fam1...fam15

But my attempts to find such pages using either the Toolserver's
Wikipedia database or the Mediawiki API have not been fruitful. In
particular, SQL queries on the templatelinks table are intractably
slow. Why are there no keys on tl_from or tl_title?
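
For anyone who hits the same wall: if I remember right, the API's
list=embeddedin module will at least enumerate the transcluding pages without
touching templatelinks on the Toolserver. A rough sketch, with query
continuation and the fam1...fam15 parsing left out:

  // List (the first 500) pages that transclude Template:Infobox language.
  $url = 'http://en.wikipedia.org/w/api.php?action=query&list=embeddedin'
       . '&eititle=' . rawurlencode( 'Template:Infobox language' )
       . '&einamespace=0&eilimit=500&format=php';
  $data = unserialize( file_get_contents( $url ) );
  foreach ( $data['query']['embeddedin'] as $page ) {
      print $page['title'] . "\n";
  }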

Andrew Dunbar (hippietrail)

--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] sharing an article on Facebook

2009-10-17 Thread Andrew Dunbar
2009/10/16 Amir E. Aharoni amir.ahar...@mail.huji.ac.il

 On Fri, Oct 16, 2009 at 19:03, Platonides platoni...@gmail.com wrote:
  It's a problem at facebook.
  Did you try with % encoded urls, (eg.
  http://he.wikipedia.org/wiki/%D7%97%D7%99%D7%A4%D7%94 ) or only hebrew
   ones? (eg. http://he.wikipedia.org/wiki/חיפה ).

 Actually, most of the time i use a bookmark that runs some JS code
 that shares the current site. In any case, the result is the same -
 whether i use the bookmark or manually enter the URL with %'s or with
 Hebrew chars.

I think it's a case of modern browsers behaving differently from older
browsers. Older browsers only supported %-encoded URLs, as the URL/URI
standard defines. But these are very user-unfriendly, so modern browsers
now convert these URLs into something readable for people whose native
language does not use Latin script. I think Facebook only accepts URLs which
comply with the standard and not the user-friendly, human-readable ones
supported by modern browsers.

So it's not a bug in MediaWiki, and not really a bug in Facebook, but it would be
a user-friendly improvement for Facebook to interpret non-Latin URLs just as
modern browsers do.
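
A workaround in the meantime is to percent-encode just the title part before
handing the URL to Facebook, e.g. in PHP (sketch only):

  // Turn a human-readable title into a standards-compliant URL.
  $title = 'חיפה';
  $url = 'http://he.wikipedia.org/wiki/' . rawurlencode( $title );
  // $url is now http://he.wikipedia.org/wiki/%D7%97%D7%99%D7%A4%D7%94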

Andrew Dunbar (hippietrail)


 --
 אמיר אלישע אהרוני
 Amir Elisha Aharoni

 http://aharoni.wordpress.com

 We're living in pieces,
  I want to live in peace. - T. Moore

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Wiktionary API acceptable use policy

2009-09-01 Thread Andrew Dunbar
2009/9/1 James Richard james.richard...@gmail.com:
 Hello,
 I am interested in using the Wiktionary API (located at
 en.wiktionary.org/w/api.php) and was having trouble finding any information
 on what is acceptable commercial use.  If there are any controlling
 documents on the subject, can you please direct me to them?  In particular,
 I would like to know if there are any restrictions on the number of requests
 allowed in a given time period, and if there are any other restrictions on
 volume or frequency of use that I should keep in mind.

 If determining acceptable use of the API remains a subjective exercise, let
 me explain how I would like to use it and perhaps you can tell me if my
 intended use is acceptable.

 I am starting a new language translation service bureau that will use online
 tools to make the translation process more accurate and less expensive for
 the end customer.  We also intend to offer free access to our tools to any
 open source project or non-profit organization (in such a case, they would
 be free to use our project management, version control, and translator tools
 free of charge, but they would have to find their own volunteer translators
 to do the actual translation work).

 As part of our translation tool set, we would like to provide access to
 monolingual and bilingual dictionaries.  Wiktionary appears to be the
 perfect choice for this.  We would like to use the Wiktionary API to fetch
 words that are requested by our users (translators) and then render them on
 our own servers for viewing in our translator tool.  We would keep a local
 cache of fetched documents to minimize the number of API calls that we need
 to make.  We will, of course, give proper attribution, etc., but it is
 possible that we will eventually be making quite a large number of requests,
 so we thought we should check with you first.

 Is that acceptable use?

Another approach is to download the Wiktionary dump archive to parse offline:
http://download.wikipedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2

Andrew Dunbar (hippietrail)

 Cheers,
 James
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Extensions in Bugzilla

2009-07-31 Thread Andrew Dunbar
2009/7/31 Chad innocentkil...@gmail.com:
 Hey all,

 I've compiled a list[1] of extensions in Bugzilla that don't have a default
 assignee. If you want to be (or should already be) the assignee for any
 of these, please let me know. Would like to really cut that list down
 so bugs are getting triaged to someone who cares. Right now, they're
 all being assigned to wikibugs-l, and we know how many bugs he
 resolves :p

 -Chad

 [1] http://www.mediawiki.org/wiki/User:^demon/Unloved_extensions

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l


DidYouMean is mine

Andrew Dunbar (hippietrail)


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] URLs that aren't cool...

2009-07-29 Thread Andrew Dunbar
2009/7/29 Nikola Smolenski smole...@eunet.yu:
 On Tuesday 28 July 2009 19:16:22, Brion Vibber wrote:
 On 7/28/09 10:04 AM, Aryeh Gregor wrote:
  On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamsonnode...@gmail.com
 wrote:
  Case insensitivity shouldn't be a problem for any language, as long as
  you do it properly.
 
  Turkish and other languages using dotless i, for example, will need a
  special rule - Turkish lowercase dotted i capitalizes to a capital
  dotted İ while lowercase undotted ı capitalizes to regular undotted I.
 
  And so what if a wiki is multilingual and you don't know what language
  the page name is in?  What if a Turkish wiki contains some English
  page names as loan words, for instance?

 Indeed, good handling of case-insensitive matchings would be a big win
 for human usability, but it's not easy to get right in all cases.

 The main problems are:

 1) Conflicts when we really do consider something separate, but the case
 folding rules match them together

 2) Language-specific case folding rules in a multilingual environment

 Turkish I with/without dot and German ß not always matching to SS are
 the primary examples off the top of my head. Also, some languages tend
 to drop accent markers in capital form (eg, Spanish). What can or should
 we do here?

 Similar to an automatic redirect, we could build an automatic disambiguation
 page. For example, someone on srwiki going to [[Dj]] would get:

 Did you mean:

 * [[Đ]]
 * [[DJ]]
 * [[D.J.]]

 A nearer-term help would be to go ahead and implement what we talked
 about a billion years ago but never got around to -- a decent did you
 mean X? message to display when you go to an empty page but there's
 something similar nearby.

 I've been thinking a lot about this. The best solution I thought of would be to
 add a column page_title_canonical to the page table. When an article is
 created/moved, this canonical title is built from the real title. When an
 article is looked up, if there is no match in page_title, build the canonical
 title from the URL and see if there is a match in page_title_canonical; if
 yes, display "did you mean X", or even go there automatically as if from a
 redirect (if there is only one match), or "did you mean *X, *X1" if there are
 multiple matches.

 This canonical title would be made like this:
 * Remove disambiguator from the title if it exists
 * Remove punctuation and the like
 * Transliterate the title to Latin alphabet
 * Transliterate to pure ASCII
 * Lowercase
 * Order the words alphabetically

 What could possibly go wrong?

 Note that this would also be very helpful for non-Latin wikis - people often
 want Latin-only URLs since non-Latin URLs are too long. I also recall a
 recent discussion about a wiki in a language with nonstandard spelling (nds?)
 where they use bots to create dozens or even hundreds of redirects to an
 article title - this would also make that unneeded.

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

I actually did make this extension a couple of years ago, intended for the
English Wiktionary, where we manually add an {{also}} template to the
top of pages to link to other pages whose titles differ in minor ways
such as capitalization, hyphenation, apostrophes, accents, and periods. I
think I had it working with Hebrew and Arabic and a few other exotic
languages besides.

It was running on Brion's test box for some time but got little
interest. It's been offline and unmaintained since Brion moved and I
did a couple of overseas trips.

http://www.mediawiki.org/wiki/Extension:DidYouMean
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DidYouMean/
https://bugzilla.wikimedia.org/show_bug.cgi?id=8648

It hooked all the ways to create, delete, or move a page in order to
maintain a separate table of normalized page titles, which it consulted
when displaying a page.
The code for display was designed for compatibility with the
then-current Wiktionary templates and would need to be implemented in
a more general way.
A core version would probably just add a field to the existing table.
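
From memory, the normalization was something along these lines; this is a
simplified sketch, not the extension's actual code, and the transliteration
step depends on iconv being available:

  // Fold a page title for near-miss matching.
  function normalizeTitle( $title ) {
      $t = mb_strtolower( $title, 'UTF-8' );
      // drop apostrophes, hyphens, periods, underscores and spaces
      $t = preg_replace( "/['\\-. _]+/u", '', $t );
      // best-effort accent stripping / transliteration to ASCII
      $ascii = @iconv( 'UTF-8', 'ASCII//TRANSLIT//IGNORE', $t );
      return ( $ascii !== false && $ascii !== '' ) ? $ascii : $t;
  }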

Andrew Dunbar (hippietrail)


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Bugzilla Weekly Report

2009-07-27 Thread Andrew Dunbar
2009/7/27 Nikola Smolenski smole...@eunet.yu:
 repor...@isidore.wikimedia.org wrote:
 MediaWiki Bugzilla Report for July 20, 2009 - July 27, 2009

 Bugs NEW               :  165
 Bugs RESOLVED          :  174

 I think this is the first time for quite a while that more bugs have
 been resolved than created. Congratulations to everyone responsible! :)

Could it be due to the new known to fail logic?

Andrew Dunbar (hippietrail)

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Minify

2009-06-26 Thread Andrew Dunbar
2009/6/26 Robert Rohde raro...@gmail.com:
 I'm going to mention this here, because it might be of interest on the
 Wikimedia cluster (or it might not).

 Last night I deposited Extension:Minify which is essentially a
 lightweight wrapper for the YUI CSS compressor and JSMin JavaScript
 compressor.  If installed it automatically captures all content
 exported through action=raw and precompresses it by removing comments,
 formatting, and other human readable elements.  All of the helpful
 elements still remain on the Mediawiki: pages, but they just don't get
 sent to users.

 Currently each page served to anons references 6 CSS/JS pages
 dynamically prepared by Mediawiki, of which 4 would be needed in the
 most common situation of viewing content online (i.e. assuming
 media=print and media=handheld are not downloaded in the typical
 case).

 These 4 pages, Mediawiki:Common.css, Mediawiki:Monobook.css, gen=css,
 and gen=js comprise about 60 kB on the English Wikipedia.  (I'm using
 enwiki as a benchmark, but Commons and dewiki also have similar
 numbers to those discussed below.)

 After gzip compression, which I assume is available on most HTTP
 transactions these days, they total 17039 bytes.  The comparable
 numbers if Minify is applied are 35 kB raw and 9980 after gzip, for a
 savings of 7 kB or about 40% of the total file size.

 Now in practical terms 7 kB could shave ~1.5s off a 36 kbps dialup
 connection.  Or given Erik Zachte's observation that action=raw is
 called 500 million times per day, and assuming up to 7 kB / 4 savings
 per call, could shave up to 900 GB off of Wikimedia's daily traffic.
 (In practice, it would probably be somewhat less.  900 GB seems to be
 slightly under 2% of Wikimedia's total daily traffic if I am reading
 the charts correctly.)


 Anyway, that's the use case (such as it is): slightly faster initial
 downloads and a small but probably measurable impact on total
 bandwidth.  The trade-off of course being that users receive CSS and
 JS pages from action=raw that are largely unreadable.  The extension
 exists if Wikimedia is interested, though to be honest I primarily
 created it for use with my own more tightly bandwidth constrained
 sites.

This sounds great, but I have a problem with making action=raw return
something that is not raw. For MediaWiki I think it would be better to
add a new action=minify.

What would the pluses and minuses of that be?
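
Either way, the stripping itself is simple enough. Very roughly, for the CSS
half (nothing like as careful as the real YUI compressor):

  // Crude CSS minification: drop comments and collapse whitespace.
  function minifyCss( $css ) {
      $css = preg_replace( '!/\*.*?\*/!s', '', $css );        // comments
      $css = preg_replace( '/\s+/', ' ', $css );               // whitespace runs
      $css = preg_replace( '/\s*([{};:,])\s*/', '$1', $css );  // around punctuation
      return trim( $css );
  }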

Andrew Dunbar (hippietrail)


 -Robert Rohde

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search

2009-06-23 Thread Andrew Dunbar
2009/6/23 Brion Vibber br...@wikimedia.org:
 Steve Bennett wrote:
 So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
 The other way around...probably not, unless that U+2019 exists on any 
 keyboards.

 Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
 search for clock-work, you probably don't want to match a sentence
 like He was building a clock—work that is never easy—at the time.
 (contrived, sure)

 Just saying you probably don't want the full range of lookalikes -
 the left side of each mapping should be a keyboard character, and the
 right side should be semantically equivalent, unless commonly used
 incorrectly.

 Unless you cut and paste a term containing a fancy character from
 another window, but the page uses the plain character...

Indeed, keyboards are not the only place characters come from. Word
processors often upgrade apostrophes, hyphens and other characters;
this is the general phenomenon of which smart quotes is a specific case.
Also, input methods can insert characters not directly on the
keyboard. And there is cutting and pasting from web pages where the author
tried to choose specific characters with HTML entities and such.

I have definitely seen edits on Wikipedia where people were
correcting various kinds of hyphens and dashes. And of course, while
the English Wikipedia forbids curved quotes, each other wiki may well
have its own policy.

Andrew Dunbar (hippietrail)



 -- brion

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Extending wikilinks syntax

2009-06-20 Thread Andrew Dunbar
2009/6/20 Kalan kalan@gmail.com:
 Recent studies by Håkon Wium Lie
 http://www.princexml.com/howcome/2009/wikipedia/ clearly show that
 XHTML markup generated by widespread templates such as {{coord}} is
 overcomplexified. This is mostly what we are able to fix, but
 sometimes we aren’t due to the limitations we have created for
 ourselves (Håkon simply pointed it out as he couldn’t know the reasons
 behind it; this is why having a fresh look is useful). Perhaps the
 most serious limitation is:

 We don’t allow attributes for wikilinks.

 This limitations results in several disadvantages, for example:
 * Each time someone wants to style a link, they have to create a
 span or something else somewhere inside or outside the link text. In
 most cases, this is against the semantics and clarity.
 * We can’t give ids to links so that we can use them in CSS and JS.
 * Implementations of certain microformats (such as XFN, “url” property
 in hCard/hCalendar, etc) inside templates is impossible.

 I propose to extend wikilinks syntax by making links being parsed the
 same way as we parse file-links.

 That is, [[Special:Userlogout|log
 out|id=logoutlink|style=color:red|title=This will log you out]] will
 be a wikilink with style, title and id attributes. The current syntax
 is a subset of my proposal, so nothing should break.

 As the syntax for external links leaves us no opportunity to clearly
 extend it in the same spirit, I currently think of merging it with
 external links’ syntax, leaving the current single-brackets for
 backward compatibility. Besides these advantages, it will make our
 syntax even friendlier (have you seen newbies trying to insert http://
 into the double brackets?), and it will make us implicitly prohibit
 Protocol://-like titles (they all are erroneous creations by newbies
 anyway).

I have to agree that we need ways to specify CSS ids and classes for
links from within templates, to avoid ugly HTML bloat.
When adding support for those, it shouldn't be much extra work to
add support for the other attributes, as long as everyone can agree
on a decent syntax.

Andrew Dunbar (hippietrail)

 — Kalan

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search

2009-06-20 Thread Andrew Dunbar
2009/6/20 Neil Harris use...@tonal.clara.co.uk:
 Neil Harris wrote:
 Andrew Dunbar wrote:

 2009/6/20 Jaska Zedlik jz5...@gmail.com:


 Hello,
 On Fri, Jun 19, 2009 at 20:31, Rolf Lampa rolf.la...@rilnet.com wrote:



 Jaska Zedlik skrev:
 ...


 The code of the override function is the following:

 function stripForSearch( $string ) {
   $s = $string;
   $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
   return parent::stripForSearch( $s );
 }


 I'm not a PHP programmer, but why use the extra assignment of $s
 instead of using $string directly in the parent call, like so:

 function stripForSearch( $string ) {
     $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
     return parent::stripForSearch( $s );
 }



 Really, you are right; for the real function all these redundant assignments
 should be stripped for performance reasons. I just used a framework
 from the Japanese language class which does some Japanese-specific
 reduction, but I agree with your point.
 reduction, but I agree with your notice.


 The username anti-spoofing code already knows about a lot of similar-looking
 characters, which may be of some help.

 Andrew Dunbar (hippietrail)




 Of itself, the username anti-spoofing code table -- which I originally
 wrote -- is rather too thorough for this purpose, since it deliberately
 errs on the side of mapping even vaguely similar-looking characters to
 one another, regardless of character type and script system,and this,
 combined with case-folding and transitivity, leads to some apparently
 bizarre mappings that are of no practical use for any other application.

 If you're interested, I can take a look at producing a more limited
 punctuation-only version.

 -- Neil


 http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
 single best source for information about visual confusables.

 Staying entirely within the Latin punctuation repertoire, and avoiding
 combining characters and other exotica such as math characters and
 dingbats, you might want to consider the following characters as
 possible unintentional lookalikes for the apostrophe:

 U+0027 APOSTROPHE
 U+2019 RIGHT SINGLE QUOTATION MARK
 U+2018 LEFT SINGLE QUOTATION MARK
 U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
 U+2032 PRIME
 U+00B4 ACUTE ACCENT
 U+0060 GRAVE ACCENT
 U+FF40 FULLWIDTH GRAVE ACCENT
 U+FF07 FULLWIDTH APOSTROPHE

 There are also lots of other characters that look like these from other
 languages, and various combining character combinations which could also
 look the same, but I doubt whether they would be generated in Latin text
 by accident.

I would add:
U+02BB MODIFIER LETTER TURNED COMMA (Hawaiian 'okina)
U+02C8 MODIFIER LETTER VERTICAL LINE (IPA primary stress mark)

It might be worthwhile folding some dashes and hyphens too.
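
To make that concrete, this is the sort of folding I have in mind, written
stripForSearch-style; the exact character list would of course need checking
against the tables, as Neil says:

  // Fold apostrophe- and dash-lookalikes to plain ASCII before searching.
  $foldMap = array(
      "\xE2\x80\x98" => "'",  // U+2018 LEFT SINGLE QUOTATION MARK
      "\xE2\x80\x99" => "'",  // U+2019 RIGHT SINGLE QUOTATION MARK
      "\xE2\x80\x9B" => "'",  // U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
      "\xE2\x80\xB2" => "'",  // U+2032 PRIME
      "\xCA\xBB"     => "'",  // U+02BB MODIFIER LETTER TURNED COMMA
      "\xCB\x88"     => "'",  // U+02C8 MODIFIER LETTER VERTICAL LINE
      "\xE2\x80\x93" => "-",  // U+2013 EN DASH
      "\xE2\x80\x94" => "-",  // U+2014 EM DASH
  );
  $s = strtr( $s, $foldMap );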

Andrew Dunbar (hippietrail)

 Please check these against the actual code tables for reasonableness and
 accuracy before putting them in any code.

 -- Neil


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l