Re: [Wikitech-l] Proposal: slight change to the XML dump format
I noticed that the dump format version number went from 0.9 to 0.10. I wonder if this format is documented somewhere or if some code might expect 1.0? Andrew Dunbar (hippietrail) On 28 October 2014 20:45, Daniel Kinzler dan...@brightbyte.de wrote: On 27.10.2014 21:58, Ariel T. Glenn wrote: Thank you Google for hiding the start of this thread in my spam folder. I'm going to have to change my import tools for the new format, but that's the way it goes; it's a reasonable change. Have you checked with folks on the xml data dumps list to see who might be affected? Not yet, shall do that now. Thanks! -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
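The 0.9 to 0.10 jump is exactly the kind of version string that naive numeric comparison gets wrong: read as a float, "0.10" looks *older* than "0.9". A minimal sketch (in Python, purely for illustration; the helper name is mine) of why tuple comparison is the safe approach for import tools checking the dump version:

```python
# Comparing dump format versions as floats conflates "0.10" with 0.1,
# making it sort before 0.9; integer-tuple comparison orders correctly.
def parse_version(v):
    """Split a dotted version string like '0.10' into an integer tuple."""
    return tuple(int(part) for part in v.split("."))

assert float("0.10") < float("0.9")                   # the naive trap
assert parse_version("0.10") > parse_version("0.9")   # correct ordering
assert parse_version("0.10") != parse_version("1.0")  # 0.10 is not 1.0
```

This also shows why code expecting "1.0" would misfire: as strings or tuples, "0.10" and "1.0" are distinct versions.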
Re: [Wikitech-l] Distinguishing disambiguation pages
It would also be great if these pages were marked in the dump files too. It could be done in exactly the same way that redirect pages are marked. On 27 December 2012 01:41, Brad Jorsch bjor...@wikimedia.org wrote: On Tue, Dec 25, 2012 at 6:00 AM, Liangent liang...@gmail.com wrote: Is this enough? api.php?action=query&prop=pageprops&ppprop=disambiguation&titles= One thing that would be nice would be the ability to go the other way. Consider for example this similar query that tests if the specified pages are in a category: api.php?action=query&prop=categories&clcategories=Category:All_disambiguation_pages&titles= We can do the opposite, getting a list of pages in the category, something like this: api.php?action=query&list=categorymembers&cmtitle=Category:All_disambiguation_pages It would be nice to have a corresponding api.php?action=query&list=pageswithprop&pwpprop=disambiguation. At a glance, it looks like we could do it easily enough if someone adds an index on page_props (pp_propname,pp_page).
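For anyone scripting against these endpoints, the query strings above are just URL-encoded parameter sets. A small sketch (Python, stdlib only; the title "Mercury" is an arbitrary example) of assembling the pageprops query discussed in the thread:

```python
from urllib.parse import urlencode

def build_query(params):
    """Assemble a MediaWiki api.php query string from a parameter dict."""
    return "api.php?" + urlencode(params)

url = build_query({
    "action": "query",
    "prop": "pageprops",
    "ppprop": "disambiguation",
    "titles": "Mercury",  # any page title; disambig pages get the page prop
})
# Produces: api.php?action=query&prop=pageprops&ppprop=disambiguation&titles=Mercury
```

Using urlencode rather than string concatenation also handles titles containing spaces or non-ASCII characters correctly.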
Re: [Wikitech-l] HTML wikipedia dumps: Could you please provide them, or make public the code for interpreting templates?
of the code into another programming language. Bug 25984 - Isolate parser from database dependencies https://bugzilla.wikimedia.org/show_bug.cgi?id=25984 Nobody at Wikimedia is working on this, but there are some patches from other people that will certainly get you on your way. But the developers at Wikimedia are very busy making a whole new parser and a WYSIWYG editor to go with it. Hopefully this will clean up the code to the point that making your own parser becomes a lot easier. Good luck and sympathy (-: Andrew Dunbar (hippietrail)
[Wikitech-l] Order of execution JavaScript extensions
I'm having trouble getting a simple one-line user JS snippet working on Wiktionary. $('#p-navigation').removeClass('first persistent').addClass('collapsed'); It works fine from Google Chrome's dev console. It makes the navigation portal collapsible like the other portals in the sidebar. But when I add it to my User:XXX/vector.js the result is not the same. The class I add is there, but the ones I remove are also still there, and the result is the standard navigation portal. I suspect there is some other JS executed after the user's vector.js but I'm not sure how to check that. I have tried setting a breakpoint on the node in Google Chrome's dev tools and reloading the page, but it is never triggered. Apologies if this is not the right mailing list. None of the lists seemed to fit according to http://www.mediawiki.org/wiki/Mailing_lists Andrew Dunbar (hippietrail)
Re: [Wikitech-l] Order of execution JavaScript extensions
On 6 June 2012 13:57, Bergi a.d.be...@web.de wrote: Andrew Dunbar wrote: I'm having trouble getting a simple one-line User JS working on Wiktionary. Apologies if this is not the right mailing list. None of the lists seemed to fit according to http://www.mediawiki.org/wiki/Mailing_lists I think http://en.wikipedia.org/wiki/Wikipedia:WikiProject_User_scripts would be a better place to discuss this. Even though it's not Wiktionary, you should find the (user-)JS gurus there :-) Apart from that, I guess your code interferes with the ext.vector.collapsibleNav.js module. Waiting for it (with mw.loader.using) before executing your snippet should work. Thanks for both parts of your answer. Your tip worked perfectly and I know where to ask next time. Andrew Dunbar (hippietrail) regards, Bergi
Re: [Wikitech-l] Visual watchlist
On 9 May 2012 12:17, Arun Ganesh arun.plane...@gmail.com wrote: I thought of studying my watchlist for a moment to understand why it was the way it was, and I noticed the following: 1. My watchlist begins half the page down, because of the watchlist options box, which btw I have never used or peered into. 2. The first link in each item is that of the current article. I have never clicked this because I might as well go through the changes by using (diff) 3. I have never clicked (hist) on the watchlist; I would first see the (diff) and only then browse the history These days I most often click (hist), less often (diff), and practically never anything else. (hist) is more useful for me on the English Wiktionary because I mostly add translation requests, and several bots watch the recent changes feed, which results in minor changes to most pages later, which are of no interest to me. Also it seems that people monitoring the activity often add other translations. By clicking (hist) I can see: 1) If only bots have changed the page since me, in which case I don't need to see a diff. 2) When there were several human edits, which ones were in languages I am interested in. 3) The history page gives me a way to get a diff of all changes since my last edit, rather than just the most recent change. Andrew Dunbar (hippietrail) 4. 0 is colored grey making it disappear from the list. But that does not mean the article never changed; it could be +400 -400 words but the net is 0. The edit calculation can be highly misleading. I would rather want to know how many characters were added and how many deleted. Articles which have only additions are low on my priority list to patrol. 5. Before contacting any user or checking his (contribs), I would always see what his edit was. I open the (diff) and (contribs) in new tabs. This could have become integrated because it's part of the same task. Same goes for the talk and user page links littered all over my watchlist 6. 
Knowing whether a user/IP has a talk page or not is important for me to identify a newbie or vandal 7. Reading each edit summary is really slow. Identifying where it begins on a line is tough because of all the information that precedes it. 8. I can jump to the specific section directly by clicking the tiny → but not the section name itself. I have never used this link either as I would rather see the (diff) 9. The (diff) gives me the diff with the entire article and images loaded below. In most cases, all the info I need while patrolling is just in the diff. I only need the article if I want to check if tables/images are broken. With that in mind I made this, which would solve most of my issues: http://commons.wikimedia.org/wiki/File:Mw-ux-visual_watchlist.png Let me know if it would work for you as well? I hope to put some more thought into it and improve the idea. -- Arun Ganesh User:planemad
Re: [Wikitech-l] search=steven+tyler gets Steven_tyler
On 14 May 2011 20:37, K. Peachey p858sn...@gmail.com wrote: On Sat, May 14, 2011 at 8:33 PM, jida...@jidanni.org wrote: OK, then why can't http://en.wikipedia.org/wiki/Steven_tyler just do a browser redirect to http://en.wikipedia.org/wiki/Steven_Tyler Because then we can't show the (Redirected from X) bar that accompanies the redirects The JavaScript we use on the English Wiktionary also makes a slightly different (Automatically redirected from X) bar, or something very similar. Andrew Dunbar (hippietrail)
Re: [Wikitech-l] search=steven+tyler gets Steven_tyler
On 13 May 2011 14:34, Carl (CBM) cbm.wikipe...@gmail.com wrote: On Fri, May 13, 2011 at 12:25 AM, Jay Ashworth j...@baylink.com wrote: They're not the same page. Wikipedia page titles are case sensitive -- except that the first character is forced to upper case by the engine. Does that search not return both? Why would we have both? Like you said, the system is case sensitive. These redirects are created because the software doesn't handle case changes correctly otherwise. For example the following link leads to a no-such-page error because the appropriate redirect does not exist: http://en.wikipedia.org/wiki/Sterling_heights,_Michigan . It would be possible to code around this, so that the redirects would be simulated if they don't exist, but it hasn't happened. In practice, people like me like to type a title in all lower case, and so we have redirects to make it work. Indeed on the English Wiktionary we do have some JavaScript which runs when you land on a page which would be a redlink. It checks all casing combinations of: all lowercase, all uppercase, first letter uppercase and the rest lowercase. If one of those exists it automatically redirects after a couple of seconds. With the different nature of Wikipedia titles you would probably want to check sentence case and title case but would still miss quite a few where only proper nouns within the title are capitalized. And some people would probably hate such a feature too (-: Andrew Dunbar (hippietrail) - Carl
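The Wiktionary gadget's fallback logic is simple to state precisely. A sketch (in Python rather than the gadget's JavaScript; the helper names and the stand-in page set are mine) of trying the three casing combinations in order:

```python
def casing_variants(title):
    """The three combinations the Wiktionary gadget tries, in order:
    all lowercase, all uppercase, first letter uppercase + rest lowercase."""
    return [title.lower(), title.upper(), title[:1].upper() + title[1:].lower()]

def find_existing(title, existing_pages):
    """Return the first casing variant that exists, or None.
    existing_pages stands in for a real page-existence check."""
    for variant in casing_variants(title):
        if variant in existing_pages:
            return variant
    return None

pages = {"steven tyler", "NASA"}
assert find_existing("STEVEN TYLER", pages) == "steven tyler"
assert find_existing("Pear", pages) is None
```

As the message notes, this misses titles like "Sterling Heights, Michigan" where only an interior proper noun is capitalized; covering those would need per-word variants.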
Re: [Wikitech-l] search=steven+tyler gets Steven_tyler
On 13 May 2011 17:31, M. Williamson node...@gmail.com wrote: I still don't think page titles should be case sensitive. Last time I asked how useful this really was, back in 2005 or so, I got a tersely-worded response that we need it to disambiguate certain pages. OK, but how many cases does that actually apply to? I would think that the increased usability from removing case sensitivity would far outweigh the benefit of natural disambiguation that only applies to a tiny minority of pages, and which could easily be replaced with disambiguation pages. There has been talk from time to time over the years of adding full case folding, whereby page titles preserve a certain case for each letter but ignore that info for internal operations. A lot like the filesystem on Microsoft Windows. It would be a third setup option in MediaWiki alongside case-sensitive and first-letter. But there's never been enough interest, it's never been important enough, and no developer has ever stepped up. It would take a bit of work to implement. Andrew Dunbar (hippietrail) 2011/5/12 Carl (CBM) cbm.wikipe...@gmail.com On Fri, May 13, 2011 at 12:25 AM, Jay Ashworth j...@baylink.com wrote: They're not the same page. Wikipedia page titles are case sensitive -- except that the first character is forced to upper case by the engine. Does that search not return both? Why would we have both? Like you said, the system is case sensitive. These redirects are created because the software doesn't handle case changes correctly otherwise. For example the following link leads to a no-such-page error because the appropriate redirect does not exist: http://en.wikipedia.org/wiki/Sterling_heights,_Michigan . It would be possible to code around this, so that the redirects would be simulated if they don't exist, but it hasn't happened. In practice, people like me like to type a title in all lower case, and so we have redirects to make it work. 
- Carl
Re: [Wikitech-l] search=steven+tyler gets Steven_tyler
On 14 May 2011 01:48, Aryeh Gregor simetrical+wikil...@gmail.com wrote: On Fri, May 13, 2011 at 3:31 AM, M. Williamson node...@gmail.com wrote: I still don't think page titles should be case sensitive. Last time I asked how useful this really was, back in 2005 or so, I got a tersely-worded response that we need it to disambiguate certain pages. OK, but how many cases does that actually apply to? I would think that the increased usability from removing case sensitivity would far outweigh the benefit of natural disambiguation that only applies to a tiny minority of pages, and which could easily be replaced with disambiguation pages. From a software perspective, the way to do this would be to store a canonicalized version of each page's title, and require that to be unique instead of the title itself. This would be nice because we could allow underscores in page titles, for instance, in addition to being able to do case-folding. Note that Unicode capitalization is locale-dependent, but case-folding is not. Thus we could use the same case-folding on all projects, including international projects like Commons. There's only one exception -- Turkish, with its dotless and dotted i's. But that's minor enough that we should be able to work around it without too much pain. I'm almost positive Azeri has the same dotless i issue, and perhaps some of the other Turkic languages of Central Asia do too. One solution is to do accent/diacritic normalization too as part of the canonicalization. Andrew Dunbar (hippietrail) Some projects, like probably all Wiktionaries, would doubtless not want case-folding at all, so we should support different canonicalization algorithms. Even the ones that don't want case-folding could still benefit from allowing underscores in titles. But all this would require a very intrusive rewrite. Assumptions like replace spaces by underscores to get dbkey are hardwired into MediaWiki all over the place, unfortunately. 
It's not clear that it's worth it, since there are downsides to case-folding too. It might make more sense to auto-generate redirects instead, which would be a much easier project that wouldn't have the downsides.
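Aryeh's distinction between locale-dependent capitalization and locale-independent case folding can be demonstrated concretely. A sketch (Python; the canonical_key helper is my illustration of the "store a canonicalized title" idea, not MediaWiki's actual code) using Unicode full case folding:

```python
# str.casefold() implements Unicode full case folding, which is
# locale-independent: the German sharp s folds to "ss" everywhere,
# so "Straße" and "STRASSE" get the same canonical key.
# Caveat from the thread: folding maps 'I' to 'i' unconditionally,
# which is wrong for Turkish/Azeri dotless-i and needs a workaround.
def canonical_key(title):
    """Hypothetical canonical form: underscores to spaces, then case-folded."""
    return title.replace("_", " ").casefold()

assert canonical_key("Straße") == canonical_key("STRASSE")
assert "Straße".lower() != "strasse"   # plain lowercasing keeps the sharp s
assert canonical_key("Steven_Tyler") == canonical_key("steven tyler")
```

Requiring uniqueness on canonical_key rather than on the raw title is what would let "Steven_Tyler" and "steven tyler" resolve to one page while still displaying the preferred casing.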
Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)
On 4 May 2011 06:33, Trevor Parscal tpars...@wikimedia.org wrote: I think the idea that we might break the existing PHP parser out into a library for general use is rather silly. The parser is not a parser, it's a macro expander with a pile of regular expressions used to convert short-hand HTML into actual HTML. The Oh don't be silly. It may not be an LALR(1) parser or an LL parser or even a recursive descent parser, but last I checked parsing was the act of breaking down a text into its elements, which the parser does. It just does it in a pretty clunky way. Whether it stores the results in an AST or in bunches of random state all over the place doesn't mean it's doing something other than parsing. A more accurate argument is that it's not just a parser, since it goes directly on to transforming the input into HTML, which is the equivalent of code generation. code that it outputs is highly dependent on the state of the wiki's configuration and database content at the moment of parsing. It also is useless to anyone wanting to do anything other than render a page into HTML, because the output is completely opaque as to where any of it was derived. Dividing the parser off into a library would require a substantial amount of MediaWiki code to be ported too just to get it working. On its own, it would be essentially useless. It seems we're getting bogged down in semantics because in MediaWiki we use the word parser in two incompatible ways. 1) The PHP classes which convert wikitext to HTML 2) A hypothetical part of MediaWiki, which does not yet exist, that would generate an intermediate form (AST) between wikitext and HTML. So the first thing we need to do is decide which of these two concepts of parser we're talking about. Would it be useful to have a library that can convert wikitext to HTML? Yes. Would it be useful to have a library that can convert wikitext to an AST? Unclear. Would it be useful to have a library that can convert such an AST to HTML? 
Because of the semantic soup nobody has even brought this up yet. So, it's probably not an issue what license this hypothetical code would be released under. - Trevor I'm pretty sure the offline wikitext parsing community would care about the licensing as a separate issue to what kind of parser technology it uses internally. Andrew Dunbar (hippietrail) On Tue, May 3, 2011 at 1:25 PM, David Gerard dger...@gmail.com wrote: On 3 May 2011 21:15, Domas Mituzas midom.li...@gmail.com wrote: Thoughts? Also, for re-licensing, what level of approval do we need? All authors of the parser, or the current people in an svn blame? Current people are doing 'derivative work' on previous authors' work. I think all are needed. Pain oh pain. This is the other reason to reduce it to mathematics, which can then be freely reimplemented. - d.
Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)
On 4 May 2011 08:19, Krinkle krinklem...@gmail.com wrote: Op 3 mei 2011, om 22:56, Ryan Lane wrote: On Tue, May 3, 2011 at 1:33 PM, Trevor Parscal tpars...@wikimedia.org wrote: On its own, it would be essentially useless. The parser has a configuration state, takes wikitext in, and gives back html. It pulls additional data from the database in these steps as well, yes. However, I don't see how this would be different than any other implementation of the parser. All implementations will require configuration state, and will need to deal with things like templates and extensions. Though I prefer the concept of alternative parsers (for all the reasons mentioned in the other threads), I do think having our reference implementation available as a library is a good concept. I feel that making it available under a suitable license is ideal. - Ryan Afaik the parser does not need a database or extension hooks for minimal but fully operational use. {{unknown templates}} default to red links, {{int:messages}} default to unknown, tags and {{#functions}} default to literals, {{MAGICWORDS}} to red links, etc... If a user of the parser doesn't have any of these (either none existing or no registry/database configured at all), it would fall back to behaving as if they are nonexistent. Not a problem? I agree a parser would not need a database, but it would need a standard interface or abstraction that in the full MediaWiki would call to the database. Offline readers would implement this interface to extract the wikitext from their compressed format or direct from an XML dump file. Some datamining tools might just stub this interface and deal with the bare minimum. Extension hooks are more interesting. I might assume offline readers want results as close to the official sites as possible, so they will want to implement the same hooks. Other non-wikitext or non-page data from the database would also go into the same interface/abstraction, or a separate one. 
Andrew Dunbar (hippietrail) By having this available as a parser library, sites that host blogs and forums could potentially use wikitext to format their comments and forum threads (to avoid visitors having to, for example, learn wikitext for their wiki, WYMeditor for WordPress and BBCode for a forum). Instead they could all use the same syntax. And within a wiki context magic words, extensions, int messages etc. would be fed from the wiki database; outside, just static. -- Krinkle
Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
On 4 May 2011 15:16, Tim Starling tstarl...@wikimedia.org wrote: On 04/05/11 14:07, Daniel Friesen wrote: I'm fairly certain myself that his intention was With HipHop support since the C that HipHop compiles PHP to can be extracted and re-used we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and what not out of the php version of the parser. And because HipHop has better performance we will no longer have to worry about parser abstractions slowing down the parser and as a result increasing the load on large websites like Wikipedia where they are noticeable. So that won't be in the way of adding those abstractions anymore. Yes that's right, more or less. HipHop generates C++ rather than C though. Basically you would split the parser into several objects: * A parser in the traditional sense. * An output callback object, which would handle generation of HTML or PDF or syntax trees or whatever. * A wiki environment interface object, which would handle link existence checks, template fetching, etc. Then you would use HipHop to compile: * The new parser class. * A few useful output classes, such as HTML. * A stub environment class which has no dependencies on the rest of MediaWiki. Then to top it off, you would add: * A HipHop extension which provides output and environment classes which pass their calls through to C-style function pointers. * A stable C ABI interface to the C++ library. * Interfaces between various high level languages and the new C library, such as Python, Ruby and Zend PHP. Doing this would leverage the MediaWiki development community and the existing PHP codebase to provide a well-maintained, reusable reference parser for MediaWiki wikitext. 
+1 This is the single most exciting news on the MediaWiki front since I started contributing to Wiktionary nine years ago (-: Andrew Dunbar (hippietrail) -- Tim Starling
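Tim's proposed split into a parser core, an output callback object, and a wiki environment interface can be sketched in miniature. This toy version (Python rather than the proposed C++/PHP; all class names are illustrative, and only [[link]] syntax is handled) shows how the core never touches a database, only the two injected objects:

```python
import re

class StubEnvironment:
    """Environment with no wiki behind it: links never exist, templates are empty.
    This mirrors the 'stub environment class with no MediaWiki dependencies'."""
    def page_exists(self, title):
        return False
    def fetch_template(self, name):
        return ""

class HtmlOutput:
    """Output callback that accumulates HTML fragments; a PDF or syntax-tree
    output class would implement the same two methods."""
    def __init__(self):
        self.parts = []
    def link(self, title, exists):
        cls = "" if exists else ' class="new"'
        self.parts.append('<a%s>%s</a>' % (cls, title))
    def text(self, s):
        self.parts.append(s)
    def result(self):
        return "".join(self.parts)

class MiniParser:
    """Parser core: asks the environment about link existence, emits via the
    output callback, and knows nothing about databases or skins."""
    def __init__(self, env):
        self.env = env
    def parse(self, wikitext, out):
        pos = 0
        for m in re.finditer(r"\[\[(.+?)\]\]", wikitext):
            out.text(wikitext[pos:m.start()])
            out.link(m.group(1), self.env.page_exists(m.group(1)))
            pos = m.end()
        out.text(wikitext[pos:])
        return out.result()

html = MiniParser(StubEnvironment()).parse("See [[Foo]] now", HtmlOutput())
# → 'See <a class="new">Foo</a> now'
```

An offline reader would swap StubEnvironment for one backed by its dump file; that substitution point is the whole value of the interface split.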
Re: [Wikitech-l] Moving the Dump Process to another language
could be used for building the dump as well? In general, I'm interested in pitching in some effort on anything related to the dump/import processes. Glad to hear it! Drop by irc please, I'm in the usual channels. :-) Just a thought: wouldn't it be easier to generate dumps in parallel if we did away with the assumption that the dump would be in database order? The metadata in the dump provides the ordering info for the people that require it. Andrew Dunbar (hippietrail) Ariel -- James Linden kodekr...@gmail.com --
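The suggestion above amounts to: let parallel workers emit pages in whatever order they finish, and let consumers who need database order restore it from the per-page metadata. A sketch (Python stdlib; the chunk data is invented) of the cheap post-hoc merge, assuming each worker's own chunk is sorted by page id:

```python
import heapq

def merge_chunks(*chunks):
    """Merge per-worker chunks (each sorted by page id) into one
    id-ordered stream; consumers that don't care can skip this step."""
    return list(heapq.merge(*chunks, key=lambda pair: pair[0]))

chunk_a = [(1, "Main Page"), (7, "Foo")]   # worker A's output
chunk_b = [(3, "Bar"), (9, "Baz")]         # worker B's output
merged = merge_chunks(chunk_a, chunk_b)
assert [pid for pid, _ in merged] == [1, 3, 7, 9]
```

heapq.merge streams lazily, so the re-ordering pass never needs the whole dump in memory, which is the property that makes dropping the ordering requirement from the writers attractive.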
Re: [Wikitech-l] [Foundation-l] Data Summit Streaming
It doesn't work for me )-: Your input can't be opened: VLC is unable to open the MRL 'http://transcode1.wikimedia.org:8080'. Check the log for details. Andrew Dunbar (hippietrail)
Re: [Wikitech-l] [Foundation-l] Data Summit Streaming
On 11 February 2011 22:18, Chad innocentkil...@gmail.com wrote: On Fri, Feb 11, 2011 at 5:57 AM, Andrew Dunbar hippytr...@gmail.com wrote: It doesn't work for me )-: Your input can't be opened: VLC is unable to open the MRL 'http://transcode1.wikimedia.org:8080'. Check the log for details. It was a stream. It's not streaming anything right now. Dunno if videos will be posted somewhere. oh (-: -Chad
Re: [Wikitech-l] Matching main namespace articles with associated talk page
On 9 January 2011 02:05, Aryeh Gregor simetrical+wikil...@gmail.com wrote: On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere dvanli...@gmail.com wrote: Yes, manually matching is fairly simple, but in the worst case you need to iterate over n-1 talk pages (where n is the total number of talk pages of a Wikipedia) to find the talk page that belongs to a user page when using the dump files. Hence, if the dump file would contain for each article a tag with the talk page id then it would significantly reduce the processing time. You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .) If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would allow you read access to a live replica of Wikipedia's database, which of course has all these indexes. You don't even have to use a B-tree if that's beyond you. I just sort the titles and then use a binary search on them. Plenty fast even in Perl and JavaScript. Andrew Dunbar (hippietrail)
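The sort-then-binary-search approach described above is a few lines in any language. A sketch (Python's bisect module; the three-title list is a stand-in for the full dump's title list) of matching an article to its talk page:

```python
import bisect

# Sort the titles once, then each lookup is a binary search:
# O(log n) per query with no B-tree or database index needed.
titles = sorted(["Talk:Apple", "Talk:Mango", "Talk:Zebra"])

def has_talk_page(title, sorted_titles):
    """Binary-search a sorted title list for 'Talk:<title>'."""
    key = "Talk:" + title
    i = bisect.bisect_left(sorted_titles, key)
    return i < len(sorted_titles) and sorted_titles[i] == key

assert has_talk_page("Mango", titles)
assert not has_talk_page("Pear", titles)
```

For a full English Wikipedia title list this is roughly 25 comparisons per lookup, which is why it stays "plenty fast" even in interpreted languages.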
Re: [Wikitech-l] Big problem to solve: good WYSIWYG on WMF wikis
On 3 January 2011 21:54, Andreas Jonsson andreas.jons...@kreablo.se wrote: 2010-12-29 08:33, Andrew Dunbar wrote: I've thought a lot about this too. It certainly is not any type of standard grammar. But on the other hand it is a pretty common kind of nonstandard grammar. I call it a recursive text replacement grammar. Perhaps this type of grammar has some useful characteristics we can discover and document. It may be possible to follow the code flow and document each text replacement in sequence as a kind of parser spec rather than trying and failing again to shoehorn it into a standard LALR grammar. If it is possible to extract such a spec it would then be possible to implement it in other languages. Some research may even find that it is possible to transform such a grammar deterministically into an LALR grammar... But even if not I'm certain it would demystify what happens in the parser so that problems and edge cases would be easier to locate. From my experience of implementing a wikitext parser, I would say that it might be possible to transform wikitext to a token stream that is possible to parse with an LALR parser. My implementation (http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser) uses Antlr (which is an LL parser generator) and only relies on context-sensitive parsing (Antlr's semantic predicates) for parsing apostrophes (bold and italics), and this might be possible to solve in a different way. The rest of the complex cases are handled by the lexical analyser, which produces a well-behaved token stream that can be relatively straightforwardly parsed. My implementation is not 100% compatible, but I think that a 100% compatible parser is not desirable since the most exotic border cases would probably be characterized as bugs anyway (e.g. [[Link|table class=]]). But I think that the basic idea can be used to produce a sufficiently compatible parser. 
In that case what is needed is to hook your parser into our current code and get it to create output, if you have not done that already. Then you will want to run the existing parser tests on it. Then you will want to run both parsers over a large sample of existing Wikipedia articles (make sure you use the same revisions on both parsers!) and run the results through diff. Then we'll have a decent idea of whether there are any edge cases you didn't spot or whether any of them are exploited in template magic. Let us know the results! Andrew Dunbar (hippietrail) Best Regards, /Andreas
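The comparison harness suggested above, rendering the same revisions with both parsers and diffing the output, is straightforward to sketch. In this illustration (Python; the two one-line renderers are stand-ins for the existing parser and the candidate replacement, not real parsers):

```python
import difflib

def compare_parsers(revisions, parse_old, parse_new):
    """Render every revision with both parsers; return {rev_id: unified diff}
    for the revisions where the output differs."""
    mismatches = {}
    for rev_id, wikitext in revisions.items():
        old_html, new_html = parse_old(wikitext), parse_new(wikitext)
        if old_html != new_html:
            mismatches[rev_id] = "\n".join(difflib.unified_diff(
                old_html.splitlines(), new_html.splitlines(), lineterm=""))
    return mismatches

revs = {1: "''italic''", 2: "plain"}
old = lambda t: t.replace("''", "<i>", 1).replace("''", "</i>", 1)
new = lambda t: t  # a candidate that ignores apostrophes: should be flagged
assert set(compare_parsers(revs, old, new)) == {1}
```

Running this over a large revision sample and eyeballing only the mismatching diffs is exactly how edge cases exploited in template magic would surface.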
Re: [Wikitech-l] Big problem to solve: good WYSIWYG on WMF wikis
On 29 December 2010 02:07, Happy-melon happy-me...@live.com wrote: There are some things that we know: 1) as Brion says, MediaWiki currently only presents content in one way: as wikitext run through the parser. He may well be right that there is a bigger fish which could be caught than WYSIWYG editing by saying that MW should present data in other new and exciting ways, but that's actually a separate question. *If* you wish to solve WYSIWYG editing, your baseline is wikitext and the parser. Specifically, it only presents content as HTML. It's not really a parser because it doesn't create an AST (Abstract Syntax Tree). It's a wikitext-to-HTML converter. The flavour of the HTML can be somewhat modulated by the skin but it could never output directly to something totally different like RTF or PDF. 2) guacamole is one of the more unusual descriptors I've heard for the parser, but it's far from the worst. We all agree that it's horribly messy and most developers treat it like either a sleeping dragon or a *very* grumpy neighbour. I'd say that the two biggest problems with it are that a) it's buried so deep in the codebase that literally the only way to get your wikitext parsed is to fire up the whole of the rest of MediaWiki around it to give it somewhere comfy to live in, I have started to advocate the isolation of the parser from the rest of the innards of MediaWiki for just this reason: https://bugzilla.wikimedia.org/show_bug.cgi?id=25984 Free it up so that anybody can embed it in their code and get exactly the same rendering that Wikipedia et al get, guaranteed. We have to find all the edges where the parser calls other parts of MediaWiki and all the edges where other parts of MediaWiki call the parser. We then define these edges as interfaces so that we can drop an alternative parser into MediaWiki and drop the current parser into, say, an offline viewer or whatever. 
With a freed up parser more people will hack on it, more people will come to grok it and come up with strategies to address some of its problems. It should also be a boon for unit testing. (I have a very rough prototype working by the way with lots of stub classes) and b) there is as David says no way of explaining what it's supposed to be doing except saying follow the code; whatever it does is what it's supposed to do. It seems to be generally accepted that it is *impossible* to represent everything the parser does in any standard grammar. I've thought a lot about this too. It certainly is not any type of standard grammar. But on the other hand it is a pretty common kind of nonstandard grammar. I call it a recursive text replacement grammar. Perhaps this type of grammar has some useful characteristics we can discover and document. It may be possible to follow the code flow and document each text replacement in sequence as a kind of parser spec rather than trying and failing again to shoehorn it into a standard LALR grammar. If it is possible to extract such a spec it would then be possible to implement it in other languages. Some research may even find that it is possible to transform such a grammar deterministically into an LALR grammar... But even if not I'm certain it would demystify what happens in the parser so that problems and edge cases would be easier to locate. Andrew Dunbar (hippietrail) Those are all standard gripes, and nothing new or exciting. There are also, to quote a much-abused former world leader, some known unknowns: 1) we don't know how to explain What You See when you parse wikitext except by prodding an exceedingly grumpy hundred thousand lines of PHP and *asking What it thinks* You Get. 2) We don't know how to create a WYSIWYG editor for wikitext. Now, I'd say we have some unknown unknowns. 1) *is* it because of wikitext's idiosyncrasies that WYSIWYG is so difficult? Is wikitext *by its nature* not amenable to WYSIWYG editing? 
2) would a wikitext which *was* representable in a standard grammar be amenable to WYSIWYG editing? 3) would a wikitext which had an alternative parser, one that was not buried in the depths of MW (perhaps a full JS library that could be called in real-time on the client), be amenable to WYSIWYG editing? 4) are questions 2 and 3 synonymous? --HM David Gerard dger...@gmail.com wrote in message news:aanlktimthux-undo1ctnexcrqbpp89t2m-pvha6fk...@mail.gmail.com... [crossposted to foundation-l and wikitech-l] There has to be a vision though, of something better. Maybe something that is an actual wiki, quick and easy, rather than the template coding hell Wikipedia's turned into. - something Fred Bauder just said on wikien-l. Our current markup is one of our biggest barriers to participation. AIUI, edit rates are about half what they were in 2005, even as our fame has gone from popular through famous to part of the structure of the world. I submit that this is not a good or healthy thing in any way and needs fixing
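The "recursive text replacement grammar" idea discussed above can be pictured with a toy sketch. This is not MediaWiki's actual code; the rule set and the template store are invented purely for illustration. The point is that the input is rewritten by successive text-replacement passes (with template expansion applied recursively to a fixpoint) rather than parsed into a syntax tree:

```python
import re

# Toy illustration of a "recursive text replacement grammar": the input is
# rewritten by successive passes rather than parsed into an AST.
# The rules and template store here are invented for illustration only.
TEMPLATES = {"smile": ":-)"}

def expand_templates(text):
    # Innermost-first expansion: keep replacing {{name}} until none remain.
    pattern = re.compile(r"\{\{([^{}]+)\}\}")
    while True:
        text, n = pattern.subn(lambda m: TEMPLATES.get(m.group(1), ""), text)
        if n == 0:
            return text

def to_html(text):
    text = expand_templates(text)
    # Each subsequent pass is a plain text replacement, applied in a fixed
    # order (bold before italics, since ''' contains '').
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text

print(to_html("'''Hi''' {{smile}}"))  # <b>Hi</b> :-)
```

Documenting each such pass in sequence, as the post suggests, would itself be a kind of spec, even though the whole is not an LALR grammar.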
[Wikitech-l] Offline wiki tools
I've long been interested in offline tools that make use of WikiMedia information, particularly the English Wiktionary. I've recently come across a tool which can provide random access to a bzip2 archive without decompressing it and I would like to make use of it in my tools but I can't get it to compile and/or function with any free Windows compiler I have access to. It works fine on the *nix boxes I have tried but my personal machine is a Windows XP netbook. The tool is seek-bzip2 by James Taylor and is available here: http://bitbucket.org/james_taylor/seek-bzip2 * The free Borland compiler won't compile it due to missing (Unix?) header files * lcc compiles it but it always fails with error unexpected EOF * mingw compiles it if the -m64 option is removed from the Makefile but it then has the same behaviour as the lcc build. My C experience is now quite stale and my 64-bit programming experience negligible. (I'm also interested in hearing from other people working on offline tools for dump files, wikitext parsing, or Wiktionary) Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Offline wiki tools
2010/12/16 Ángel González keis...@gmail.com: On 15/12/10 16:21, Andrew Dunbar wrote: I've long been interested in offline tools that make use of WikiMedia information, particularly the English Wiktionary. I've recently come across a tool which can provide random access to a bzip2 archive without decompressing it and I would like to make use of it in my tools but I can't get it to compile and/or function with any free Windows compiler I have access to. It works fine on the *nix boxes I have tried but my personal machine is a Windows XP netbook. The tool is seek-bzip2 by James Taylor and is available here: http://bitbucket.org/james_taylor/seek-bzip2 * The free Borland compiler won't compile it due to missing (Unix?) header files * lcc compiles it but it always fails with error unexpected EOF * mingw compiles it if the -m64 option is removed from the Makefile but it then has the same behaviour as the lcc build. My C experience is now quite stale and my 64-bit programming experience negligible. (I'm also interested in hearing from other people working on offline tools for dump files, wikitext parsing, or Wiktionary) Andrew Dunbar (hippietrail) Your problem is Windows text streams. The attached patch fixes it. Thank you for the link. I was completely unaware of it when I basically did the same thing for mediawiki a couple years ago. http://www.wiki-web.es/mediawiki-offline-reader/ Thanks Ángel! I feel like a fool for not realizing this. It's the same problem I've worked around many times in the past but not recently. I just got a similar answer on stackoverflow.com By the way I'm keen to find something similar for .7z It would be incredibly useful if these indices could be created as part of the dump creation process. Should I file a feature request? Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
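The "Windows text streams" problem Ángel points to explains the symptoms exactly: in text mode, Windows C runtimes translate \r\n and treat byte 0x1A (Ctrl-Z) as end-of-file, so a compressed archive read in text mode stops early with a premature EOF. The rule carries over to any language; a small Python sketch (file name and payload invented for illustration) showing why compressed data must always be opened in binary mode:

```python
import bz2, os, tempfile

# Byte 0x1A (Ctrl-Z) is treated as EOF by Windows text-mode streams, so a
# compressed dump read in text mode stops early; binary mode reads it all.
payload = b"before\x1aafter"

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bz2.compress(payload))
    path = f.name

with open(path, "rb") as f:   # "rb", never "r", for compressed data
    data = bz2.decompress(f.read())

os.remove(path)
assert data == payload
print(len(data))  # all 12 bytes survive, including the one after 0x1A
```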
Re: [Wikitech-l] Offline wiki tools
On 15 December 2010 20:41, Anthony wikim...@inbox.org wrote: On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar hippytr...@gmail.com wrote: By the way I'm keen to find something similar for .7z I've written something similar for .xz, which uses LZMA2 same as .7z. It creates a virtual read-only filesystem using FUSE (the FUSE part is in perl, which uses pipes to dd and xzcat). Only real problem is that it doesn't use a stock .xz file, it uses a specially created one which concatenates lots of smaller .xz files (currently I concatenate between 5 and 20 or so 900K bz2 blocks into one .xz stream - between 5 and 20 because there's a preference to split on </page><page> boundaries). At the moment I'm interested in .bz2 and .7z because those are the formats WikiMedia currently publishes data in. Though some files are also in .gz so I would also like to find a solution for those. I thought about the concatenation solution splitting at page boundaries for .bz2 until I found out there was already a solution that worked with the vanilla dump files as is. Apparently the folks at openzim have done something similar, using LZMA2. If anyone is interested in working with me to make a package capable of being released to the public, I'd be willing to share my code. But it sounds like I'm just reinventing a wheel already invented by openzim. I'm interested in what everybody else is doing regarding offline WikiMedia content. I'm also mainly using Perl though I just ran into a problem with 64-bit values when indexing huge dump files. It would be incredibly useful if these indices could be created as part of the dump creation process. Should I file a feature request? With concatenated .xz files, creating the index is *much* faster, because the .xz format puts the stream size at the end of each stream. Plus with .xz all streams are broken on 4-byte boundaries, whereas with .bz2 blocks can end at any *bit* (which means you have to do painful bit shifting to create the index). 
The file is also *much* smaller, on the order of 5-10% of bzip2 for a full history dump. Have we made the case for this format to the WikiMedia people? I think they use .bz2 because it is pretty fast for very good compression ratios but they use .7z for the full history dumps where the extremely good compression ratios warrant the slower compression times since these files can be gigantic. How is .xz for compression times? Would we have to worry about patent issues for LZMA? Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
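The concatenated-streams scheme described in this thread can be sketched with Python's bz2 module (the same idea applies to .xz via the lzma module). This is illustrative, not the poster's actual Perl/FUSE code: each group of pages becomes its own compressed stream, byte offsets are recorded as the index, and random access then decompresses only the stream that holds the wanted chunk:

```python
import bz2

# Each "chunk" (e.g. a run of <page> elements) becomes its own bz2 stream;
# standard tools still decompress the concatenation as one file.
chunks = [b"<page>one</page>", b"<page>two</page>", b"<page>three</page>"]

archive = b""
index = []                     # (offset, length) of each compressed stream
for chunk in chunks:
    comp = bz2.compress(chunk)
    index.append((len(archive), len(comp)))
    archive += comp

# Random access: decompress only the stream that holds the wanted chunk.
off, length = index[1]
print(bz2.decompress(archive[off:off + length]))  # b'<page>two</page>'

# The whole concatenation still decompresses as one unit:
assert bz2.decompress(archive) == b"".join(chunks)
```

This also shows why having the dump process emit the index (as proposed above) would be cheap: the offsets are known at compression time for free.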
Re: [Wikitech-l] Offline wiki tools
On 15 December 2010 20:24, Manuel Schneider manuel.schnei...@wikimedia.ch wrote: Hi Andrew, maybe you'd like to check out ZIM: This is a standardized file format for compressed HTML dumps, focused on Wikimedia content at the moment. There is some C++ code around to read and write ZIM files and there are several projects using that, eg. the WP1.0 project, the Israeli and Kenyan Wikipedia Offline initiatives and more. Also the Wikimedia Foundation is currently in the process of adopting the format to provide ZIM files from Wikimedia wikis in the future. This is very interesting and I'll be watching it. Where do the HTML dumps come from? I'm pretty sure I've only seen static HTML dumps for Wikipedia and not for Wiktionary for example. I am also looking at adapting the parser for offline use to generate HTML from the dump file wikitext. Andrew Dunbar (hippietrail) http://openzim.org/ /Manuel Am 15.12.2010 16:21, schrieb Andrew Dunbar: I've long been interested in offline tools that make use of WikiMedia information, particularly the English Wiktionary. I've recently come across a tool which can provide random access to a bzip2 archive without decompressing it and I would like to make use of it in my tools but I can't get it to compile and/or function with any free Windows compiler I have access to. It works fine on the *nix boxes I have tried but my personal machine is a Windows XP netbook. The tool is seek-bzip2 by James Taylor and is available here: http://bitbucket.org/james_taylor/seek-bzip2 * The free Borland compiler won't compile it due to missing (Unix?) header files * lcc compiles it but it always fails with error unexpected EOF * mingw compiles it if the -m64 option is removed from the Makefile but it then has the same behaviour as the lcc build. My C experience is now quite stale and my 64-bit programming experience negligible. 
(I'm also interested in hearing from other people working on offline tools for dump files, wikitext parsing, or Wiktionary) Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l -- Regards Manuel Schneider Wikimedia CH - Verein zur Förderung Freien Wissens Wikimedia CH - Association for the advancement of free knowledge www.wikimedia.ch ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] require language dump for developing words and corresponding frequency
The dump site (http://download.wikimedia.org/) is still broken at the moment but another way to build some word frequency data is by randomly sampling the wikis for the languages you are interested in. At least these Indic languages have Wikipedias of varying sizes: Assamese http://as.wikipedia.org Bihari http://bh.wikipedia.org Bengali http://bn.wikipedia.org Bishnupriya Manipuri http://bpy.wikipedia.org Gujarati http://gu.wikipedia.org Hindi http://hi.wikipedia.org Kannada http://kn.wikipedia.org Kashmiri http://ks.wikipedia.org Marathi http://mr.wikipedia.org Nepali http://ne.wikipedia.org Nepal Bhasa http://new.wikipedia.org Oriya http://or.wikipedia.org/wiki Eastern Punjabi http://pa.wikipedia.org Western Punjabi http://pnb.wikipedia.org Sanskrit http://sa.wikipedia.org Sindhi http://sd.wikipedia.org Tamil http://ta.wikipedia.org Telugu http://te.wikipedia.org Urdu http://ur.wikipedia.org If you'd like to use it I have a tool that downloads random samples of wiki pages and strips the HTML for purposes such as this. Good luck! Andrew Dunbar (hippietrail) On 14 December 2010 18:36, pravin@gmail.com pravin@gmail.com wrote: Hi All, I am Pravin Satpute, I am working on language technology and for building words and it frequency, i required some webpages in indic language. Can i get the most recent dump without en.wiki Thanks, Pravin s ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
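Once sampled pages are in hand, building the frequency list itself is straightforward. A sketch (the HTML-stripping regex is deliberately crude, and the sample text below stands in for fetched pages):

```python
import re
from collections import Counter

def word_frequencies(html):
    # Crude tag stripper -- fine for frequency counting, not for display.
    text = re.sub(r"<[^>]+>", " ", html)
    # \w is Unicode-aware in Python 3, so Indic scripts are matched too.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

sample = "<p>the cat sat on the mat</p> <p>the dog</p>"
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('the', 3), ('cat', 1)]
```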
Re: [Wikitech-l] How to find the version of a dump
On 14 December 2010 01:57, Monica shu monicashu...@gmail.com wrote: Thanks Diederik and Waksman, It seems that I need to do parse the dump for article data to get this piece of information... Yes, this will be the last choice, but I think there maybe some easier way... I just got home and checked the dump I've downloaded. It's downloaded on June, 10, 2010, the size is 6117881141 in bz2. I remember when I download, it's the latest version at that moment. As the dumps are generated every N months, and the one I have is bigger that the version 2010-01-30 as Waksman said, my version should be between Feb to June. A Google search hints that enwiki-20100312-pages-articles.xml.bz2 might be the one with size 6117881141. Andrew Dunbar (hippietrail) Does anybody remember the version between this period, or happened to download the same version with me? Thanks very much to tell me any related information again! Best regards! Monica On Mon, Dec 13, 2010 at 3:24 PM, Shaun Waksman shaunwaks...@gmail.comwrote: Hi Monica, The file sizes of the EN pages dumps that are available today are: 5204823166 enwiki-20100312-pages-articles.xml.7z 5983814213 enwiki-20100130-pages-articles.xml.bz2 Note that the former is in 7z and the later is in bz2 Does this help? Shaun On Mon, Dec 13, 2010 at 8:45 AM, Monica shu monicashu...@gmail.com wrote: Hi all, I have downloaded a dump several month ago. By accidentally, I lost the version info of this dump, so I don't know when this dump was generated. Is there any place that list out info about the past dumps(such as size...)? Thanks! 
Monica ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] How to find the version of a dump
On 14 December 2010 20:04, Andrew Dunbar hippytr...@gmail.com wrote: On 14 December 2010 01:57, Monica shu monicashu...@gmail.com wrote: Thanks Diederik and Waksman, It seems that I need to do parse the dump for article data to get this piece of information... Yes, this will be the last choice, but I think there maybe some easier way... I just got home and checked the dump I've downloaded. It's downloaded on June, 10, 2010, the size is 6117881141 in bz2. I remember when I download, it's the latest version at that moment. As the dumps are generated every N months, and the one I have is bigger that the version 2010-01-30 as Waksman said, my version should be between Feb to June. A Google search hints that enwiki-20100312-pages-articles.xml.bz2 might be the one with size 6117881141. Andrew Dunbar (hippietrail) Does anybody remember the version between this period, or happened to download the same version with me? Thanks very much to tell me any related information again! Best regards! Monica On Mon, Dec 13, 2010 at 3:24 PM, Shaun Waksman shaunwaks...@gmail.comwrote: Hi Monica, The file sizes of the EN pages dumps that are available today are: 5204823166 enwiki-20100312-pages-articles.xml.7z 5983814213 enwiki-20100130-pages-articles.xml.bz2 Note that the former is in 7z and the later is in bz2 Does this help? Shaun On Mon, Dec 13, 2010 at 8:45 AM, Monica shu monicashu...@gmail.com wrote: Hi all, I have downloaded a dump several month ago. By accidentally, I lost the version info of this dump, so I don't know when this dump was generated. Is there any place that list out info about the past dumps(such as size...)? Thanks! 
Monica ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l It should be trivial to add the dump date to the header of each dump file. Since the date field of the filename is often replaced by latest, this could be very useful. It could also be useful to include the revision ID and timestamp of the latest revision but I assume this would be a little more difficult. Should I file a feature request? Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Looking for a mediawiki.org dump
Could anybody help me locate a dump of mediawiki.org while the dump server is broken please? I only need current revisions. Thanks in advance. Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] alternative way to get wikipedia dump while server is down
On 28 November 2010 02:42, Jeff Kubina jeff.kub...@gmail.com wrote: I have a copy of the 20091009 enwiki dumps if that would do: http://jeffkubina.org/data/download.wikimedia.org/enwiki/20091009/ Jeff -- Jeff Kubina http://google.com/profiles/jeff.kubina 410-988-4436 8am-10pm EST On Thu, Nov 25, 2010 at 12:30 PM, Oliver Schmidt schmidt...@email.ulster.ac.uk wrote: Hello alltogether, is there any alternative way to get hands on a wikipedia dump? Preferably the last complete one. Which was supposed to be found at this address: http://download.wikimedia.org/enwiki/20100130/ I would need that dump asap for my research. Thank you for any help! Best regards — Oliver Schmidt PhD student Nano Systems Biology Research Group University of Ulster, School of Biomedical Sciences Cromore Road, Coleraine BT52 1SA, Northern Ireland T: +44 / (0)28 / 7032 3367 F: +44 / (0)28 / 7032 4375 E: schmidt...@email.ulster.ac.uk — ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l I don't suppose anybody has a copy of any Romanian or Georgian Wiktionary from any time? (-: Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Invoking maintenance scripts return nothing at all
I wish to do some MediaWiki hacking which uses the codebase, specifically the parser, but not the database or web server. I'm running on Windows XP on an offline machine with PHP installed but no MySql or web server. I've unarchived the source and grabbed a copy of somebody's LocalSettings.php but not attempted to install MediaWiki beyond this. Obviously I don't expect to be able to do much, but when I try to run any of the maintenance scripts I get no output whatsoever, not even errors. I was hoping to let the error messages guide me as to what is essential, what needs to be stubbed, wrapped etc. Am I missing something obvious or do these scripts return no errors by design? Andrew Dunbar (hippietrail) -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Invoking maintenance scripts return nothing at all
On 17 November 2010 02:37, Dmitriy Sintsov ques...@rambler.ru wrote: * Andrew Dunbar hippytr...@gmail.com [Tue, 16 Nov 2010 23:01:33 +1100]: I wish to do some MediaWiki hacking which uses the codebase, specifically the parser, but not the database or web server. I'm running on Windows XP on an offline machine with PHP installed but no MySql or web server. I've unarchived the source and grabbed a copy of somebody's LocalSettings.php but not attempted to install MediaWiki beyond this. Obviously I don't expect to be able to do much, but when I try to run any of the maintenance scripts I get no output whatsoever, not even errors. I was hoping to let the error messages guide me as to what is essential, what needs to be stubbed, wrapped etc. Am I missing something obvious or do these scripts return no errors by design? Andrew Dunbar (hippietrail) In the web environment, error messages may expose vulnerabilities to a potential attacker. The errors might be written to php's error log, which is set up by the error_log=path directive in php.ini. You may find the actual location of php.ini by executing php --ini Look also at the whole Error handling and logging section Does php work at all? Is there a configuration output from php -r "phpinfo();" when issued from cmd.exe? Does php dumpBackup.php --help, issued from the /maintenance directory, produce the command line help? Dmitriy Thanks Dmitriy. PHP does work. The --help options always work. It turned out the LocalSettings.php somebody on #mediawiki pointed me to require_once()'d several extensions I didn't have and require_once() seems to fail silently. I'll try to acquaint myself better with the Error handling and logging section as you suggest. Is there somewhere an official blank or example LocalSettings.php file that would be better to use for people like me to avoid such problems? Rolling my own from scratch doesn't seem ideal either. 
Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] API vs data dumps
On 14 October 2010 09:37, Alex Brollo alex.bro...@gmail.com wrote: 2010/10/13 Paul Houle p...@ontology2.com Don't be intimidated by working with the data dumps. If you've got an XML API that does streaming processing (I used .NET's XmlReader) and use the old unix trick of piping the output of bunzip2 into your program, it's really pretty easy. When I worked into it.source (a small dump! something like 300Mby unzipped), I used a simple do-it-yourself string python search routine and I found it really faster then python xml routines. I presume that my scripts are really too rough to deserve sharing, but I encourage programmers to write a simple dump reader using speed of string search. My personal trick was to build an index, t.i. a list of pointers to articles and name of articles into xml file, so that it was simple and fast to recover their content. I used it mainly because I didn't understand API at all. ;-) Alex Hi Alex. I have been doing something similar in Perl for a few years for the English Wiktionary. I've never been sure on the best way to store all the index files I create especially in code to share with other people like I would like to happen. If you'd like to collaborate or anyone else for that matter it would be pretty cool. You'll find my stuff on the Toolserver: https://fisheye.toolserver.org/browse/enwikt Andrew Dunbar (hippietrail) -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
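The pointer-index approach Alex describes (a list of byte offsets to articles and their names within the XML file) can be sketched like this. The minimal XML below is invented for illustration; the point is one sequential scan building the index, after which lookups seek straight to a page instead of rescanning the dump:

```python
import io, re

# A tiny stand-in for a pages-articles XML dump (invented for illustration).
dump = (b"<mediawiki>"
        b"<page><title>Alpha</title><text>aaa</text></page>"
        b"<page><title>Beta</title><text>bbb</text></page>"
        b"</mediawiki>")

# One pass over the dump: record the byte offset of every <page> by title.
index = {}
for m in re.finditer(rb"<page><title>(.*?)</title>", dump):
    index[m.group(1).decode()] = m.start()

# Later lookups seek directly to the page instead of rescanning the file.
f = io.BytesIO(dump)
start = index["Beta"]
end = dump.find(b"</page>", start) + len(b"</page>")
f.seek(start)
page = f.read(end - start)
print(page.decode())  # <page><title>Beta</title><text>bbb</text></page>
```

For sharing such indexes between tools, the open question in the thread (what on-disk format to store them in) remains a design choice; a flat sorted file of title/offset pairs is one common answer.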
Re: [Wikitech-l] Datamining infoboxes
2009/10/23 Aryeh Gregor simetrical+wikil...@gmail.com: On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar hippytr...@gmail.com wrote: Yes I didn't specify tl_namespace In MySQL that will usually make it impossible to effectively use an index on (tl_namespace, tl_title), so it's essential that you specify the NS. (Which you should anyway to avoid hitting things like [[Template talk:Infobox language]].) Some DBMSes (including sometimes MySQL >= 5.0, although apparently not here) are smart enough to use this kind of index pretty well even if you don't specify the namespace, but it would still be somewhat more efficient to specify it -- the DB would have to do O(1/n) times as many index lookups, where n is the number of namespaces. and when I check for which columns have keys I could see none:

mysql> describe templatelinks;
+--------------+-----------------+------+-----+---------+-------+
| Field        | Type            | Null | Key | Default | Extra |
+--------------+-----------------+------+-----+---------+-------+
| tl_from      | int(8) unsigned | NO   |     | 0       |       |
| tl_namespace | int(11)         | NO   |     | 0       |       |
| tl_title     | varchar(255)    | NO   |     |         |       |
+--------------+-----------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

The toolserver database uses views. In MySQL, views can't have indexes themselves, but your query is rewritten to run against the real table -- which you can't access directly, but which does have indexes. 
EXPLAIN is your best bet here:

mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_title IN ('Infobox_Language', 'Infobox_language');
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
| id | select_type | table         | type  | possible_keys | key     | key_len | ref  | rows      | Extra                    |
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
|  1 | SIMPLE      | templatelinks | index | NULL          | tl_from | 265     | NULL | 149740990 | Using where; Using index |
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND tl_title IN ('Infobox_Language', 'Infobox_language');
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
| id | select_type | table         | type  | possible_keys | key          | key_len | ref  | rows | Extra                    |
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | templatelinks | range | tl_namespace  | tl_namespace | 261     | NULL | 6949 | Using where; Using index |
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

Note the number of rows scanned in each case. Your query was scanning all of templatelinks, the other is retrieving the exact rows needed and not looking at any others (type = index vs. range). The reason for this is given in the possible_keys column: MySQL can find no keys that are usable for lookup, if you omit tl_namespace. Thanks for the very informative reply. I already knew most of this stuff passively except database/SQL views. Now I've just got to put it into more practice. Andrew Dunbar (hippietrail) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
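The leading-column rule Aryeh describes is easy to reproduce with any B-tree-indexed engine. A sketch using Python's built-in sqlite3 (not the Toolserver's MySQL, and the table and index names here are illustrative): an index on (tl_namespace, tl_title) is only seekable when the leading column is constrained, otherwise the engine falls back to a full scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE templatelinks "
            "(tl_from INT, tl_namespace INT, tl_title TEXT)")
# Composite index: tl_namespace is the leading column.
con.execute("CREATE INDEX tl_ns ON templatelinks (tl_namespace, tl_title)")

def plan(sql):
    # The query-plan detail text is the last column of each row.
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column omitted: the composite index cannot be used for lookup.
print(plan("SELECT tl_from FROM templatelinks "
           "WHERE tl_title = 'Infobox_language'"))
# Leading column constrained: a direct index search.
print(plan("SELECT tl_from FROM templatelinks "
           "WHERE tl_namespace = 10 AND tl_title = 'Infobox_language'"))
```

The first plan reports a scan, the second a search using the index, mirroring the MySQL EXPLAIN output quoted above.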
Re: [Wikitech-l] Datamining infoboxes
2009/10/23 Robert Ullmann rlullm...@gmail.com: I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }} Andrew Dunbar (hippietrail) Hi, Come now, you are over-thinking it. Find {{Infobox [Ll]anguage in the text, then count braces. Start at depth=2, count up and down 'till you reach 0, and you are at the end of the template. (you can be picky about only counting them if paired if you like ;-) Actually you have to find {{[Ii]nfobox[ _][Ll]anguage And I wanted to be robust. It's perfectly legal for single unmatched braces to appear anywhere and I didn't want them to break my code. As it happens there don't seem to currently be any in the language infoboxes. I couldn't be sure whether there would be any cases where a {{{ or }}} might show up either. And a few other edge cases such as HTML comments, nowiki and friends, template invocations in values, and even possibly template invocations in names? Then just regex match the lines/parameters you want. However, if you are pulling the wikitext with the API, the XML parse tree option sounds good; then you can just use elementTree (or the like) and pull out the parameters directly I've got it extracting the name/value pairs from the XML finally but parsing XML is always a pain. And it still misses Norwegian, Bokmal, and Nynorsk which wrap the infobox in another template... Andrew Dunbar (hippietrail) Robert ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
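Robert's count-braces recipe looks like this as a sketch. It is the naive version: as the thread notes, it deliberately ignores nowiki, HTML comments, and stray single braces, and the sample page text is invented for illustration:

```python
import re

def extract_template(text, name_pattern=r"\{\{[Ii]nfobox[ _][Ll]anguage"):
    # Find the opening of the template, then balance {{ against }}.
    m = re.search(name_pattern, text)
    if not m:
        return None
    depth, i = 2, m.end()          # already inside two open braces
    while i < len(text) and depth > 0:
        if text.startswith("{{", i):
            depth += 2; i += 2
        elif text.startswith("}}", i):
            depth -= 2; i += 2
        else:
            i += 1
    return text[m.start():i]

page = "intro {{Infobox language|name=Foo|fam1={{nested|x}}|iso3=foo}} rest"
print(extract_template(page))
```

This handles the nested-template case that makes a single regex insufficient; the remaining edge cases (nowiki, comments, {{{...}}} parameters) would each need their own pre-pass.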
[Wikitech-l] Datamining infoboxes
Infoboxes in Wikipedia often contain information which is quite useful outside Wikipedia but can be surprisingly difficult to data-mine. I would like to find all Wikipedia pages that use Template:Infobox_Language and parse the parameters iso3 and fam1...fam15 But my attempts to find such pages using either the Toolserver's Wikipedia database or the Mediawiki API have not been fruitful. In particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title? Andrew Dunbar (hippietrail) -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] sharing an article on Facebook
2009/10/16 Amir E. Aharoni amir.ahar...@mail.huji.ac.il On Fri, Oct 16, 2009 at 19:03, Platonides platoni...@gmail.com wrote: It's a problem at facebook. Did you try with % encoded urls, (eg. http://he.wikipedia.org/wiki/%D7%97%D7%99%D7%A4%D7%94 ) or only hebrew ones? (eg. http://he.wikipedia.org/wiki/חיפהhttp://he.wikipedia.org/wiki/%D7%97%D7%99%D7%A4%D7%94). Actually, most of the time i use a bookmark that runs some JS code that shares the current site. In any case, the result is the same - whether i use the bookmark or manually enter the URL with %'s or with Hebrew chars. I think it's a case of modern browsers behaving differently to older browsers. Older browsers only supported encoded URLs with % as the URL/URI standard is defined. But these are very user-unfriendly so modern browsers now convert these URLs into something readable for people whose native language does not use Latin script. I think Facebook only accepts URLs which comply to the standard and not the userfriendly human readable ones supported by modern browsers. So it's no bug in Mediawiki and not really a bug in Facebook but it would be a userfriendly improvement for Facebook to interpret non-Latin URLs just as the modern browsers do. Andrew Dunbar (hippietrail) -- אמיר אלישע אהרוני Amir Elisha Aharoni http://aharoni.wordpress.com We're living in pieces, I want to live in peace. - T. Moore ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
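The two URL forms in the example are the same title: the percent-encoded form is just the UTF-8 bytes of the Hebrew one, as the standard requires. Python's urllib.parse shows the round trip for the example above:

```python
from urllib.parse import quote, unquote

# The human-readable and percent-encoded forms of the same title.
title = "חיפה"                      # Haifa, as in the example URL
encoded = quote(title)

print(encoded)                      # %D7%97%D7%99%D7%A4%D7%94
assert encoded == "%D7%97%D7%99%D7%A4%D7%94"
assert unquote(encoded) == title    # decoding restores the original
```

A site that only accepts the encoded form can therefore support the readable form by applying exactly this normalization, which is what the modern browsers described above do.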
Re: [Wikitech-l] Wiktionary API acceptable use policy
2009/9/1 James Richard james.richard...@gmail.com: Hello, I am interested in using the Wiktionary API (located at en.wiktionary.org/w/api.php) and was having trouble finding any information on what is acceptable commercial use. If there are any controlling documents on the subject, can you please direct me to them? In particular, I would like to know if there are any restrictions on the number of requests allowed in a given time period, and if there are any other restrictions on volume or frequency of use that I should keep in mind. If determining acceptable use of the API remains a subjective exercise, let me explain how I would like to use it and perhaps you can tell me if my intended use is acceptable. I am starting a new language translation service bureau that will use online tools to make the translation process more accurate and less expensive for the end customer. We also intend to offer free access to our tools to any open source project or non-profit organization (in such a case, they would be free to use our project management, version control, and translator tools free of charge, but they would have to find their own volunteer translators to do the actual translation work). As part of our translation tool set, we would like to provide access to monolingual and bilingual dictionaries. Wiktionary appears to be the perfect choice for this. We would like to use the Wiktionary API to fetch words that are requested by our users (translators) and then render them on our own servers for viewing in our translator tool. We would keep a local cache of fetched documents to minimize the number of API calls that we need to make. We will, of course, give proper attribution, etc., but it is possible that we will eventually be making quite a large number of requests, so we thought we should check with you first. Is that acceptable use? 
Another approach is to download the Wiktionary dump archive to parse offline: http://download.wikipedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 Andrew Dunbar (hippietrail) Cheers, James ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
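For anyone in James's position, the local-cache approach he describes could be sketched like this in Python. The cache directory and helper names are hypothetical; only the api.php parameters are real MediaWiki API usage:

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wiktionary.org/w/api.php"
CACHE = Path("wiktionary-cache")  # hypothetical local cache directory

def api_url(title):
    # action=parse renders a page via the MediaWiki API and returns JSON
    return API + "?" + urlencode({"action": "parse", "page": title, "format": "json"})

def fetch_page(title):
    """Return the parsed page for a title, consulting the local cache first
    so repeated lookups don't hit the Wikimedia servers again."""
    CACHE.mkdir(exist_ok=True)
    cached = CACHE / (title.replace("/", "_") + ".json")
    if cached.exists():
        return json.loads(cached.read_text())
    with urlopen(api_url(title)) as resp:
        data = json.load(resp)
    cached.write_text(json.dumps(data))
    return data
```

Caching fetched documents, as James proposes, is exactly what keeps this kind of use polite: each title costs at most one API call.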
Re: [Wikitech-l] Extensions in Bugzilla
2009/7/31 Chad innocentkil...@gmail.com: Hey all, I've compiled a list[1] of extensions in Bugzilla that don't have a default assignee. If you want to be (or should already be) the assignee for any of these, please let me know. Would like to really cut that list down so bugs are getting triaged to someone who cares. Right now, they're all being assigned to wikibugs-l, and we know how many bugs he resolves :p -Chad [1] http://www.mediawiki.org/wiki/User:^demon/Unloved_extensions DidYouMean is mine. Andrew Dunbar (hippietrail) -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] URLs that aren't cool...
2009/7/29 Nikola Smolenski smole...@eunet.yu: On Tuesday 28 July 2009 19:16:22, Brion Vibber wrote: On 7/28/09 10:04 AM, Aryeh Gregor wrote: On Tue, Jul 28, 2009 at 12:52 PM, Mark Williamson node...@gmail.com wrote: Case insensitivity shouldn't be a problem for any language, as long as you do it properly. Turkish and other languages using dotless i, for example, will need a special rule - Turkish lowercase dotted i capitalizes to a capital dotted İ, while lowercase undotted ı capitalizes to regular undotted I. And so what if a wiki is multilingual and you don't know what language the page name is in? What if a Turkish wiki contains some English page names as loan words, for instance? Indeed, good handling of case-insensitive matching would be a big win for human usability, but it's not easy to get right in all cases. The main problems are: 1) Conflicts when we really do consider something separate, but the case folding rules match them together 2) Language-specific case folding rules in a multilingual environment Turkish I with/without dot and German ß not always matching to SS are the primary examples off the top of my head. Also, some languages tend to drop accent markers in capital form (e.g. Spanish). What can or should we do here? Similar to an automatic redirect, we could build an automatic disambiguation page. For example, someone on srwiki going to [[Dj]] would get: Did you mean: * [[Đ]] * [[DJ]] * [[D.J.]] A nearer-term help would be to go ahead and implement what we talked about a billion years ago but never got around to -- a decent "did you mean X?" message to display when you go to an empty page but there's something similar nearby. I was thinking a lot about this. The best solution I thought of would be to add a column to the page table, page_title_canonical. When an article is created/moved, this canonical title is built from the real title.
When an article is looked up, if there is no match in page_title, build the canonical title from the URL and see if there is a match in page_title_canonical; if yes, display "did you mean X" or even go there automatically as if from a redirect (if there is only one match), or "did you mean *X, *X1" if there are multiple matches. This canonical title would be made like this: * Remove the disambiguator from the title if it exists * Remove punctuation and the like * Transliterate the title to the Latin alphabet * Transliterate to pure ASCII * Lowercase * Order the words alphabetically What could possibly go wrong? Note that this would also be very helpful for non-Latin wikis - people often want Latin-only URLs since non-Latin URLs are too long. I also recall a recent discussion about a wiki in a language with nonstandard spelling (nds?) where they use bots to create dozens or even hundreds of redirects to an article title - this would also make that unneeded. I actually did make this extension a couple of years ago, intended for the English Wiktionary, where we manually add an {{also}} template to the top of pages to link to other pages whose titles differ in minor ways such as capitalization, hyphenation, apostrophes, accents, and periods. I think I had it working with Hebrew and Arabic and a few other exotic languages besides. It was running on Brion's test box for some time but getting little interest. It's been offline and unmaintained since Brion moved and I did a couple of overseas trips. http://www.mediawiki.org/wiki/Extension:DidYouMean http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/DidYouMean/ https://bugzilla.wikimedia.org/show_bug.cgi?id=8648 It hooked all ways to create, delete, or move a page to maintain a separate table of normalized page titles, which it consulted when displaying a page.
The code for display was designed for compatibility with the then-current Wiktionary templates and would need to be implemented in a more general way. A core version would probably just add a field to the existing table. Andrew Dunbar (hippietrail) -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
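Nikola's canonical-title recipe above can be sketched in a few lines of Python. This is only a rough illustration, not the proposed MediaWiki implementation: real transliteration to Latin and then to pure ASCII needs a proper transliteration library, so here NFKD decomposition plus stripping combining marks stands in for those two steps:

```python
import re
import unicodedata

def canonical_title(title):
    """Sketch of the proposed page_title_canonical normalization:
    drop disambiguator, drop punctuation, fold to ASCII, lowercase,
    sort the words. Accent-stripping via NFKD only approximates
    transliteration for Latin-script input."""
    t = re.sub(r"\s*\([^)]*\)$", "", title)        # remove trailing (disambiguator)
    t = unicodedata.normalize("NFKD", t)
    t = "".join(c for c in t if not unicodedata.combining(c))  # strip accents
    t = re.sub(r"[^\w\s]", " ", t)                 # punctuation -> space
    return " ".join(sorted(t.lower().split()))     # order words alphabetically
```

Two titles that differ only in case, accents, punctuation, or word order then collide on the same canonical form, which is exactly what drives the "did you mean X?" lookup.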
Re: [Wikitech-l] Bugzilla Weekly Report
2009/7/27 Nikola Smolenski smole...@eunet.yu: repor...@isidore.wikimedia.org wrote: MediaWiki Bugzilla Report for July 20, 2009 - July 27, 2009 Bugs NEW : 165 Bugs RESOLVED : 174 I think this is the first time for quite a while that more bugs have been resolved than created. Congratulations to everyone responsible! :) Could it be due to the new "known to fail" logic? Andrew Dunbar (hippietrail) -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Minify
2009/6/26 Robert Rohde raro...@gmail.com: I'm going to mention this here, because it might be of interest on the Wikimedia cluster (or it might not). Last night I deposited Extension:Minify, which is essentially a lightweight wrapper for the YUI CSS compressor and the JSMin JavaScript compressor. If installed, it automatically captures all content exported through action=raw and precompresses it by removing comments, formatting, and other human-readable elements. All of the helpful elements still remain on the MediaWiki: pages, but they just don't get sent to users. Currently each page served to anons references 6 CSS/JS pages dynamically prepared by MediaWiki, of which 4 would be needed in the most common situation of viewing content online (i.e. assuming media=print and media=handheld are not downloaded in the typical case). These 4 pages, MediaWiki:Common.css, MediaWiki:Monobook.css, gen=css, and gen=js, comprise about 60 kB on the English Wikipedia. (I'm using enwiki as a benchmark, but Commons and dewiki also have similar numbers to those discussed below.) After gzip compression, which I assume is available on most HTTP transactions these days, they total 17039 bytes. The comparable numbers if Minify is applied are 35 kB raw and 9980 bytes after gzip, for a savings of 7 kB or about 40% of the total file size. Now in practical terms 7 kB could shave ~1.5 s off a 36 kbps dialup connection. Or, given Erik Zachte's observation that action=raw is called 500 million times per day, and assuming up to 7 kB / 4 savings per call, it could shave up to 900 GB off of Wikimedia's daily traffic. (In practice, it would probably be somewhat less. 900 GB seems to be slightly under 2% of Wikimedia's total daily traffic, if I am reading the charts correctly.) Anyway, that's the use case (such as it is): slightly faster initial downloads and a small but probably measurable impact on total bandwidth.
The trade-off of course being that users receive CSS and JS pages from action=raw that are largely unreadable. The extension exists if Wikimedia is interested, though to be honest I primarily created it for use with my own more tightly bandwidth-constrained sites. This sounds great, but I have a problem with making action=raw return something that is not raw. For MediaWiki I think it would be better to add a new action=minify. What would the pluses and minuses of that be? Andrew Dunbar (hippietrail) -Robert Rohde -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search
2009/6/23 Brion Vibber br...@wikimedia.org: Steve Bennett wrote: So, apostrophe (U+0027) - curved right single quote (U+2019): yes, probably. The other way around... probably not, unless that U+2019 exists on any keyboards. Hyphen-minus (U+002D) - em dash (U+2014): I would say no. If you search for clock-work, you probably don't want to match a sentence like He was building a clock—work that is never easy—at the time. (contrived, sure) Just saying you probably don't want the full range of lookalikes - the left side of each mapping should be a keyboard character, and the right side should be semantically equivalent, unless commonly used incorrectly. Unless you cut and paste a term containing a fancy character from another window, but the page uses the plain character... Indeed, keyboards are not the only place characters come from. Word processors often upgrade apostrophes, hyphens, and other characters; this is the general phenomenon of which smart quotes are a specific case. Input methods can also insert characters not directly on the keyboard. And there is cutting and pasting from web pages where the author tried to choose specific characters with HTML entities and such. I have definitely seen edits on Wikipedia where people were correcting various kinds of hyphens and dashes. And of course, while the English Wikipedia forbids curved quotes, each other wiki may well have its own policy. Andrew Dunbar (hippietrail) -- brion -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
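The folding Steve and Brion converge on (map the word-processor "upgrades" back to their keyboard equivalents, but leave dashes alone) could be sketched as a small one-way translation table. The mapping choices here are illustrative, not MediaWiki's actual search normalization:

```python
# One-way fold: fancy characters back to their keyboard equivalents.
# Hyphen-minus vs. em dash is deliberately NOT folded, per the
# clock-work / clock--work example in the thread.
FOLD = {
    "\u2019": "'",  # right single quotation mark -> apostrophe
    "\u2018": "'",  # left single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}

def fold_for_search(text):
    """Normalize text before indexing/matching for search."""
    return text.translate(str.maketrans(FOLD))
```

Applying the same fold to both the indexed text and the query means a searcher typing a plain apostrophe still matches pages that use the curly one.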
Re: [Wikitech-l] Extending wikilinks syntax
2009/6/20 Kalan kalan@gmail.com: Recent studies by Håkon Wium Lie http://www.princexml.com/howcome/2009/wikipedia/ clearly show that the XHTML markup generated by widespread templates such as {{coord}} is overly complex. This is mostly something we are able to fix, but sometimes we aren't, due to the limitations we have created for ourselves (Håkon simply pointed it out, as he couldn't know the reasons behind it; this is why having a fresh look is useful). Perhaps the most serious limitation is: we don't allow attributes for wikilinks. This limitation results in several disadvantages, for example: * Each time someone wants to style a link, they have to create a span or something else somewhere inside or outside the link text. In most cases, this is against semantics and clarity. * We can't give ids to links so that we can use them in CSS and JS. * Implementation of certain microformats (such as XFN, the "url" property in hCard/hCalendar, etc.) inside templates is impossible. I propose to extend wikilink syntax by making links be parsed the same way as we parse file links. That is, [[Special:Userlogout|log out|id=logoutlink|style=color:red|title=This will log you out]] will be a wikilink with style, title and id attributes. The current syntax is a subset of my proposal, so nothing should break. As the syntax for external links leaves us no opportunity to cleanly extend it in the same spirit, I currently think of merging it with external links' syntax, leaving the current single brackets for backward compatibility. Besides these advantages, it will make our syntax even friendlier (have you seen newbies trying to insert http:// into the double brackets?), and it will make us implicitly prohibit Protocol://-like titles (they are all erroneous creations by newbies anyway). I have to agree we need ways to specify CSS ids and classes for links from within templates, to avoid ugly HTML bloat.
When adding support for them, it shouldn't be any extra work to add support for the other attributes, as long as everyone can agree on a decent syntax. Andrew Dunbar (hippietrail) — Kalan -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Different apostrophe signs and MediaWiki internal search
2009/6/20 Neil Harris use...@tonal.clara.co.uk: Neil Harris wrote: Andrew Dunbar wrote: 2009/6/20 Jaska Zedlik jz5...@gmail.com: Hello, On Fri, Jun 19, 2009 at 20:31, Rolf Lampa rolf.la...@rilnet.com wrote: Jaska Zedlik wrote: ... The code of the override function is the following: function stripForSearch( $string ) { $s = $string; $s = preg_replace( '/\xe2\x80\x99/', '\'', $s ); return parent::stripForSearch( $s ); } I'm not a PHP programmer, but why use the extra assignment of $s instead of using $string directly in the parent call, like so: function stripForSearch( $string ) { $s = preg_replace( '/\xe2\x80\x99/', '\'', $string ); return parent::stripForSearch( $s ); } You are right; for the real function all these redundant assignments should be stripped for efficiency. I just used a framework from the Japanese language class, which does some Japanese-specific reduction, but I agree with your point. The username anti-spoofing code already knows about a lot of "looks similar" characters, which may be of some help. Andrew Dunbar (hippietrail) Of itself, the username anti-spoofing code table -- which I originally wrote -- is rather too thorough for this purpose, since it deliberately errs on the side of mapping even vaguely similar-looking characters to one another, regardless of character type and script system, and this, combined with case-folding and transitivity, leads to some apparently bizarre mappings that are of no practical use for any other application. If you're interested, I can take a look at producing a more limited punctuation-only version. -- Neil http://www.unicode.org/reports/tr39/data/confusables.txt is probably the single best source for information about visual confusables.
Staying entirely within the Latin punctuation repertoire, and avoiding combining characters and other exotica such as math characters and dingbats, you might want to consider the following characters as possible unintentional lookalikes for the apostrophe: U+0027 APOSTROPHE U+2019 RIGHT SINGLE QUOTATION MARK U+2018 LEFT SINGLE QUOTATION MARK U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK U+2032 PRIME U+00B4 ACUTE ACCENT U+0060 GRAVE ACCENT U+FF40 FULLWIDTH GRAVE ACCENT U+FF07 FULLWIDTH APOSTROPHE There are also lots of other characters from other languages that look like these, and various combining character combinations which could also look the same, but I doubt whether they would be generated in Latin text by accident. I would add U+02BB MODIFIER LETTER TURNED COMMA (the Hawaiian 'okina) and U+02C8 MODIFIER LETTER VERTICAL LINE (the IPA primary stress mark). It might be worthwhile folding some dashes and hyphens too. Andrew Dunbar (hippietrail) Please check these against the actual code tables for reasonableness and accuracy before putting them in any code. -- Neil -- http://wiktionarydev.leuksman.com http://linguaphile.sf.net ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
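Neil's lookalike list, plus the two additions above, amounts to a simple fold table. The sketch below is illustrative only (it is neither the AntiSpoof table nor Jaska's stripForSearch override, and per Neil's caveat the set should be checked against the code tables before real use):

```python
# Characters that may stand in for an apostrophe unintentionally,
# from the list in the thread. All fold to U+0027 for search.
APOSTROPHE_LOOKALIKES = [
    "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "\u2018",  # LEFT SINGLE QUOTATION MARK
    "\u201b",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    "\u2032",  # PRIME
    "\u00b4",  # ACUTE ACCENT
    "\u0060",  # GRAVE ACCENT
    "\uff40",  # FULLWIDTH GRAVE ACCENT
    "\uff07",  # FULLWIDTH APOSTROPHE
    "\u02bb",  # MODIFIER LETTER TURNED COMMA (Hawaiian 'okina)
    "\u02c8",  # MODIFIER LETTER VERTICAL LINE (IPA primary stress)
]

def fold_apostrophes(s):
    """Map every listed lookalike to a plain ASCII apostrophe."""
    table = {ord(c): "'" for c in APOSTROPHE_LOOKALIKES}
    return s.translate(table)
```

Unlike the AntiSpoof table, this fold is punctuation-only and non-transitive, so it avoids the bizarre cross-script mappings Neil describes.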