Re: [Wikitech-l] Is the $_SESSION secure?

2010-09-24 Thread Dmitriy Sintsov
* Robert Leverington rob...@rhl.me.uk [Fri, 24 Sep 2010 06:57:03 
+0100]:
 On 2010-09-24, Dmitriy Sintsov wrote:
 One probably can rename it to another temporary name? Then move it to the
 final location during the next request, according to a previously passed
 cookie?

 Speaking of cookies, there are millions of ways of looking at them; FF's
 WebDeveloper extension, the HTTP headers extension, and the Wireshark
 application, to name just a few. Absolutely non-secure, when unencrypted.

 Session data is not stored in cookies; only a unique session identifier
 is passed to the client.

I think the question wasn't about the session data (part of which, the 
username and id, is passed via cookies; but you're right, only a hash), but 
about uploading the file in a few stages.
Dmitriy



Re: [Wikitech-l] Is the $_SESSION secure?

2010-09-24 Thread Ashar Voultoiz
On 24/09/10 01:36, Neil Kandalgaonkar wrote:
 Good point, but in this case I'm just storing the path to a temporary file.

 The file isn't even sensitive data; it's just a user-uploaded media file
 for which the user has not yet selected a license, although we
 anticipate they will in a few minutes.

Hello Neil,

The file path might be sensitive; you do not want to potentially expose 
your path hierarchy. At least, I would not do it :)

About your issue, assuming the media file has been entered in the 
image/media database table:

- When the user is redirected to a new page upon upload, you might just 
pass the file ID by parameter / session.

- When the user is allowed to upload several files and then is prompted 
for licences, you might just look in the database for files owned by the 
user for which the licence is null (a rough sketch follows below).
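
As a rough illustration of the second option (wfGetDB() and the query builder
do exist in MediaWiki, but the staging table and its columns below are
invented for the sketch, since core has no licence column):

global $wgUser;
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
	'upload_staging',                    // hypothetical extension table
	array( 'us_file_id', 'us_file_name' ),
	array(
		'us_user'    => $wgUser->getId(),
		'us_licence' => null,            // rendered as "us_licence IS NULL"
	),
	__METHOD__
);
foreach ( $res as $row ) {
	// prompt the user to pick a licence for $row->us_file_name
}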



-- 
Ashar Voultoiz




Re: [Wikitech-l] Balancing MediaWiki Core/Extensions

2010-09-24 Thread Marcin Cieslak
 Roan Kattouw roan.katt...@gmail.com wrote:
 2010/9/22 Trevor Parscal tpars...@wikimedia.org:
 Modular feature development being unique to extensions points out a
 significant flaw in the design of MediaWiki core. There's no reason we
 can't convert existing core features to discreet components, much like
 how extensions are written, while leaving them in stock MediaWiki. This
 sort of design would also make light work of adding extensions to core.

 Making MediaWiki more modular won't magically make it possible (or
 even desirable) to write any old feature as an extension. Some things
 will still have to go in core and some things we'll simply /want/ to
 put in core because making them extensions would be ridiculous for
 some reason.

I'd rather have MediaWiki built on classes and object instances
with clear responsibilities, Inversion of Control, and the possibility
to test each _object_ separately without causing interference
with other components. The discussion of what is core vs. extension is
secondary to me. Maybe at some point hooks as we know them will
not be needed; we could use interfaces provided
by the programming language instead of hooks,
possibly (yes, I'm dreaming) tied together by something like
PicoContainer.
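
A toy sketch of what that could look like (all names below are invented for
the example; this is not existing MediaWiki code):

interface PageSaveListener {
	public function onPageSave( $title, $text );
}

class SpamFilter implements PageSaveListener {
	public function onPageSave( $title, $text ) {
		// inspect $text here instead of registering a save hook
	}
}

class ListenerContainer {
	private $listeners = array();
	public function register( PageSaveListener $listener ) {
		$this->listeners[] = $listener;
	}
	public function notifySave( $title, $text ) {
		foreach ( $this->listeners as $listener ) {
			$listener->onPageSave( $title, $text );
		}
	}
}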

Even so, those interfaces *will* change, and it would be 
good if developers could refactor the code in core
and extensions at the same time.

I don't know why we duplicate so many functions even
within core (the API vs. the standard special pages, for example).
But that's probably an issue to be solved in phase4 :)

So before asking how much to add to core, maybe we should
first clean up some of it, and then possibly add. Or sometimes
adding something (like proper multi-wiki configuration 
management, $wgConf++) may clean up and simplify some things
inside core.

//Marcin




Re: [Wikitech-l] Acceptable use of API

2010-09-24 Thread Max Semenik
On 24.09.2010, 14:32 Robin wrote:

 I would like to collect data on interlanguage links for academic research
 purposes. I really do not want to use the dumps, since I would need to
 download dumps of all language Wikipedias, which would be huge.
 I have written a script which goes through the API, but I am wondering how
 often it is acceptable for me to query the API. Assuming I do not run
 parallel queries, do I need to wait between each query? If so, how long?

Crawling all the Wikipedias is not an easy task either. Probably,
toolserver.org would be more suitable. What data do you need, exactly?

-- 
Best regards,
  Max Semenik ([[User:MaxSem]])




Re: [Wikitech-l] Acceptable use of API

2010-09-24 Thread Robert Ullmann
Hi,

You don't need the full dumps. Look at (for example) the tr.wp dump
that is running at the moment:

http://download.wikimedia.org/trwiki/20100924/

you'll see the text dumps and also dumps of various SQL tables. Look
at the one that is labelled "Wiki interlanguage link records".

You ought to be able to download those reasonably easily for all of the
'pedias that you are interested in; it will certainly be better than
trawling with the API. They have (if I understand correctly what you
are asking) just the data you want.
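
For instance, a quick PHP sketch for fetching those tables in bulk (the dump
dates below are illustrative; each wiki has its own latest date, so check the
directory listing first):

$wikis = array( 'trwiki' => '20100924', 'frwiki' => '20100915' );
foreach ( $wikis as $wiki => $date ) {
	$url = "http://download.wikimedia.org/$wiki/$date/$wiki-$date-langlinks.sql.gz";
	// copy() needs allow_url_fopen; use curl or wget otherwise
	copy( $url, "$wiki-langlinks.sql.gz" );
}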

Cheers,
Robert

On 9/24/10, Max Semenik maxsem.w...@gmail.com wrote:
 On 24.09.2010, 14:32 Robin wrote:

 I would like to collect data on interlanguage links for academic research
 purposes. I really do not want to use the dumps, since I would need to
 download dumps of all language Wikipedias, which would be huge.
 I have written a script which goes through the API, but I am wondering how
 often it is acceptable for me to query the API. Assuming I do not run
 parallel queries, do I need to wait between each query? If so, how long?

 Crawling all the Wikipedias is not an easy task either. Probably,
 toolserver.org would be more suitable. What data do you need, exactly?

 --
 Best regards,
   Max Semenik ([[User:MaxSem]])






Re: [Wikitech-l] Acceptable use of API

2010-09-24 Thread Nicolas Vervelle
On Fri, Sep 24, 2010 at 1:19 PM, Max Semenik maxsem.w...@gmail.com wrote:

 On 24.09.2010, 14:32 Robin wrote:

  I would like to collect data on interlanguage links for academic research
  purposes. I really do not want to use the dumps, since I would need to
  download dumps of all language Wikipedias, which would be huge.
  I have written a script which goes through the API, but I am wondering
 how
  often it is acceptable for me to query the API. Assuming I do not run
  parallel queries, do I need to wait between each query? If so, how long?

 Crawling all the Wikipedias is not an easy task either. Probably,
 toolserver.org would be more suitable. What data do you need, exactly?


Full dumps are not required for retrieving interlanguage links.
For example, the last fr dump contains a dedicated file for them:
http://download.wikimedia.org/frwiki/20100915/frwiki-20100915-langlinks.sql.gz

It will be a lot faster to download this file (only 75 MB) than to make more
than 1 million calls to the API for the fr wiki.

Nico


Re: [Wikitech-l] Balancing MediaWiki Core/Extensions

2010-09-24 Thread Chad
On Fri, Sep 24, 2010 at 3:30 AM, Marcin Cieslak sa...@saper.info wrote:
 So before asking how much to add into core, maybe we should
 first clean up some, and then possibly add. Or sometimes
 adding something (like a proper multi-wiki configuration
 management $wgConf++) may clean up and simplify some things
 inside core.


$wgConf sucks ;-)

But proper configuration management is on the horizon.

-Chad



Re: [Wikitech-l] Acceptable use of API

2010-09-24 Thread Robin Ryder
Hi,

Thanks for the quick answers, and for the useful link.

My previous e-mail was not detailed enough; sorry about that. Let me
clarify:
- I don't need to crawl the entire Wikipedia, only (for example) articles in
a category. ~1,000 articles would be a good start, and I definitely won't be
going above ~40,000 articles.
- For every article in the data set, I need to follow every interlanguage
link, and get the article creation date (i.e. creation date of [[en:Brad
Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc). As far as I can tell, this
means that I need one query for every language link.

The data are reasonably easy to get through the API. If my queries risk
overloading the server, I am obviously happy to go through the toolserver
(once my account gets approved!).


Robin Ryder

Postdoctoral researcher
CEREMADE - Paris Dauphine and CREST - INSEE

 On 24.09.2010, 14:32 Robin wrote:

 I would like to collect data on interlanguage links for academic research
 purposes. I really do not want to use the dumps, since I would need to
 download dumps of all language Wikipedias, which would be huge.
 I have written a script which goes through the API, but I am wondering
how
 often it is acceptable for me to query the API. Assuming I do not run
 parallel queries, do I need to wait between each query? If so, how long?

 Crawling all the Wikipedias is not an easy task either. Probably,
 toolserver.org would be more suitable. What data do you need, exactly?

 --
 Best regards,
   Max Semenik ([[User:MaxSem]])





Re: [Wikitech-l] using parserTests code for selenium test framework

2010-09-24 Thread Roan Kattouw
2010/9/23 Brion Vibber br...@pobox.com:
 (If using memcached, be sure to clear those out, reinitialize, or otherwise
 do something that forces old values to be cleared or ignored.)

$wgCacheEpoch is a good one for this. The easiest way to change it is
to touch LocalSettings.php.
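
For example, instead of touching the file you can also bump the epoch
explicitly in LocalSettings.php (any timestamp newer than the cached entries
will do; the value below is just an example):

$wgCacheEpoch = '20100924000000'; // TS_MW format: YYYYMMDDHHMMSS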

Roan Kattouw (Catrope)



Re: [Wikitech-l] Acceptable use of API

2010-09-24 Thread Roan Kattouw
2010/9/24 Robin Ryder robin.ry...@ensae.fr:
 - I don't need to crawl the entire Wikipedia, only (for example) articles in
 a category. ~1,000 articles would be a good start, and I definitely won't be
 going above ~40,000 articles.
 - For every article in the data set, I need to follow every interlanguage
 link, and get the article creation date (i.e. creation date of [[en:Brad
 Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc). As far as I can tell, this
 means that I need one query for every language link.

Unfortunately, this is true. You can't use a generator because those
don't work with interwiki titles, and you can't query multiple titles
in one request because prop=revisions only allows that in
get-only-the-latest-revision mode (and you want the earliest
revision).
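
In other words, each language link costs one request along these lines (a
sketch only; rvdir=newer with rvlimit=1 returns the earliest revision, whose
timestamp is the creation date):

$lang  = 'fr';
$title = 'Brad Pitt';
$url = "http://$lang.wikipedia.org/w/api.php?action=query&prop=revisions"
	. "&rvprop=timestamp&rvlimit=1&rvdir=newer&format=xml"
	. "&titles=" . urlencode( $title );
// file_get_contents() over HTTP needs allow_url_fopen
$xml = simplexml_load_string( file_get_contents( $url ) );
// the creation date is the timestamp attribute of the single <rev> element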

Hitting the API repeatedly without waiting between requests and
without making parallel requests is considered acceptable usage AFAIK,
but I do think that the Toolserver would better suit your needs.

Roan Kattouw (Catrope)



Re: [Wikitech-l] Acceptable use of API

2010-09-24 Thread Paul Houle
  On 9/24/2010 8:49 AM, Robin Ryder wrote:
 Hi,

 Thanks for the quick answers, and for the useful link.

 My previous e-mail was not detailed enough; sorry about that. Let me
 clarify:
 - I don't need to crawl the entire Wikipedia, only (for example) articles in
 a category. ~1,000 articles would be a good start, and I definitely won't be
 going above ~40,000 articles.
 - For every article in the data set, I need to follow every interlanguage
 link, and get the article creation date (i.e. creation date of [[en:Brad
 Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc). As far as I can tell, this
 means that I need one query for every language link.

 The data are reasonably easy to get through the API. If my queries risk
 overloading the server, I am obviously happy to go through the toolserver
 (once my account gets approved!).


The first part is easy to do if accuracy doesn't matter. Precision and
recall are often around 50% for categories in Wikipedia, so if you
really care about being right you have to construct your own
categories, and it helps to have a synoptic view. Often you can get
that view from Freebase and DBpedia, but I'm increasingly coming around
to indexing Wikipedia directly because, for the things I care about, I
can do better than DBpedia... Freebase does add some special value
because they do gardening, data cleaning, data mining, hand edits and
other things that clean up the mess.

Secondly, it's not hard at all to run, say, 200k requests against the
API over the span of a few days. I think you could get your creation
dates from the history records.




Re: [Wikitech-l] using parserTests code for selenium test framework

2010-09-24 Thread Brion Vibber
Dan, I think you're overestimating the difficulty of the basic wiki
family method.

Here is all that is required:
* a single wildcard entry in the Apache configuration
* one or two lines in LocalSettings.php to pull a DB name from the
hostname/path/CLI parameters (a minimal sketch follows below)
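
For the LocalSettings.php side, something along these lines should do (a
sketch only; it assumes test wikis are reached as e.g.
run123.tests.example.org, and the hostname, variable names and DB prefix are
illustrative, with the Apache side being just a wildcard ServerAlias on the
same document root):

if ( isset( $_SERVER['SERVER_NAME'] ) ) {
	// Web request: use the first hostname label as the run identifier.
	$runId = strtok( $_SERVER['SERVER_NAME'], '.' );
} else {
	// CLI / maintenance scripts: fall back to an environment variable.
	$runId = getenv( 'MW_TEST_RUN' );
	if ( $runId === false ) {
		$runId = 'default';
	}
}
$wgDBname = 'testwiki_' . preg_replace( '/[^A-Za-z0-9_]/', '', $runId );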

As for cleaning up resources to keep the machine from getting clogged,
it's very unlikely that your test wikis will fill up a
multi-hundred-gigabyte drive in the middle of a run. If you find that
they do, there's still no need to tie cleanup of any particular run to
any particular other run.

All you need to know is which runs have completed and can now be cleaned up.

-- brion



Re: [Wikitech-l] using parserTests code for selenium test framework

2010-09-24 Thread Ryan Lane
 Here is all that is required:
 * a single wildcard entry in Apache configuration
 * one or two lines in LocalSettings.php to pull a DB name from the
 hostname/path/CLI parameters.

 As for cleaning up resources to keep the machine from getting clogged,
 it's very unlikely that your test wikis will fill up a
 multi-hundred-gigabyte drive in the middle of a run. If you find that
 they do, there's still no need to tie cleanup of any particular run to
 any particular other run.

 All you need to know is which runs have completed and can now be cleaned up.


I'd like to add some ideas to this thread that were discussed in the
Selenium meeting this morning. The basic plan we discussed (and I'm
sure I'll be corrected some on this) is as follows:

When a run begins, it registers itself with the wiki and gets a
session back. The wiki software, on creating the session, makes a new
run-specific wiki using the wiki family method. The test will pass
both the session cookie and a test-type cookie, which will
dynamically configure the wiki as the tests run. When the run is
complete, it should notify the wiki that the test run is complete. The
wiki software will then destroy the session and the dynamically
created resources. If a run doesn't complete for some reason, a cron
job can clean up resources that haven't been used in some appropriate
amount of time.
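
A hypothetical shape for that cleanup cron (the registry table and column
names are invented here; the real bookkeeping will depend on how the
session/run data ends up being stored):

$dbw = wfGetDB( DB_MASTER );
$cutoff = $dbw->timestamp( time() - 6 * 3600 ); // e.g. runs idle for six hours
$res = $dbw->select( 'selenium_runs', array( 'run_db' ),
	array( 'run_last_used < ' . $dbw->addQuotes( $cutoff ) ), __METHOD__ );
foreach ( $res as $row ) {
	// run_db was generated by the framework itself, so it is assumed
	// to be safe to interpolate into the statement
	$dbw->query( 'DROP DATABASE `' . $row->run_db . '`', __METHOD__ );
	$dbw->delete( 'selenium_runs', array( 'run_db' => $row->run_db ), __METHOD__ );
}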

Respectfully,

Ryan Lane



[Wikitech-l] Unforeseen linking on LocalFile->upload()

2010-09-24 Thread David Raison

Good day,

This is my first post to this list and having read the last couple of
posts, my question might be a bit too low-level. But I still hope that
you might point me in the right direction.


I've been working on this extension that generates qrcode bitmaps and
displays them on a wiki page [0].

In that extension, I'm using the upload() method made available by the
LocalFile object, as documented on [1]. In my specific case, the
relevant code looks like this:


$ft = Title::makeTitleSafe( NS_FILE, $this->_dstFileName );
$localfile = wfLocalFile( $ft );
$saveName = $localfile->getName();
$pageText = 'QrCode [...]';
$status = $localfile->upload( $tmpName, $this->_label, $pageText,
File::DELETE_SOURCE, false, false, $this->_getBot() );


The extension is implemented as a parser function hooked into
ParserFirstCallInit.

Now, I haven't found any other explanation, so I suppose this use of the
upload() method leads to a peculiar behaviour on my wiki installation,
exhibited by these things:

1. QrCodes are generated for pages that do not have or transclude a
{{#qrcode:}} function call, in this case properties [2,3,4].

2. These uploaded files have properties [5] and they belong to a
category, which means they get linked in the categorylinks table. A
common result of this is that qrcode images turn up in e.g. semantic
queries [9,10].

3. Qrcodes are even generated for existing qrcodes [6,7]. One way to
trigger that behaviour is to visit a File's page and click on the Delete
link, without actually deleting the file. This leads to situations such
as [8].

4. The files get linked from several pages, as this example shows [11].
None of the pages said to link to the file actually include it, and the
set of pages varies (2 days ago, 14 pages linked to it; today only 7 do).

5. Browsing the properties of the above file [12], you can see that it
got somehow mixed up with a completely different event.

6. Looking at the database, the mixup hypothesis is confirmed:

SELECT page_id,page_title,cl_sortkey  FROM `page` INNER JOIN
`categorylinks` FORCE INDEX (cl_sortkey) ON ((cl_from = page_id)) LEFT
JOIN `category` ON ((cat_title = page_title AND page_namespace = 14))
WHERE (1 = 1) AND cl_to = 'Project'  ORDER BY cl_sortkey

gives (among other data):

page_id  page_title   cl_sortkey
1403     SMS2Space    File:QR-Ask.png
1244     Syn2Sat      File:QR-LetzHack.png
1251     ChillyChill  File:QR-Syn2cat-radio-ara.png.png


This behaviour occurs in both mw 1.15.5 and 1.16. I would be very
grateful if someone more experienced could have a look at this
situation. Maybe I'm using the upload() method in a way I should not.

sincerely,
David Raison

[0] http://www.mediawiki.org/wiki/Extension:QrCode
[1]
http://svn.wikimedia.org/doc/classLocalFile.html#4b626952ae0390a7fa453a4bfece8252
[2] https://www.hackerspace.lu/wiki/File:QR-Is_U19.png
[3] https://www.hackerspace.lu/wiki/File:QR-Has_SingleIssuePrice.png
[4] https://www.hackerspace.lu/wiki/File:QR-Has_Issues.png
[5] https://www.hackerspace.lu/wiki/Property:Has_SingleIssuePrice
[6] https://www.hackerspace.lu/wiki/File:QR-QR-Location.png.png
[7]
https://www.hackerspace.lu/w/index.php?title=Special:RecentChangeshidebots=0
[8]
https://www.hackerspace.lu/wiki/File:QR-QR-QR-QR-Location.png.png.png.png
[9] https://www.hackerspace.lu/wiki/Projects#Concluded_Projects
[10] https://www.hackerspace.lu/wiki/Special:BrowseData#Q
[11] https://www.hackerspace.lu/wiki/File:QR-Syn2cat.png
[12] https://www.hackerspace.lu/wiki/Special:Browse/File:QR-2DSyn2cat.png


- -- 
The Hackerspace in Luxembourg!
syn2cat a.s.b.l. - Promoting social and technical innovations
11, rue du cimetière | Pavillon Am Hueflach
L-8018 Strassen | Luxembourg
http://www.hackerspace.lu
- 
mailto:da...@hackerspace.lu
xmpp:kwis...@jabber.hackerspaces.org
mobile: +43 650 73 63 834 | +352 691 44 23 24

Wear your geek: http://syn2cat.spreadshirt.net




Re: [Wikitech-l] Unforeseen linking on LocalFile->upload()

2010-09-24 Thread Brion Vibber
On Fri, Sep 24, 2010 at 2:12 PM, David Raison wrote:

 I've been working on this extension that generates qrcode bitmaps and
 displays them on a wiki page [0].


Hi David! I've actually been peeking at this extension as I'd like to use
something like this to generate scannable QR codes with Android software
download links for other projects. :)


 1. QrCodes are generated for pages that do not have or transclude a
 {{#qrcode:}} function call, in this case properties [2,3,4].


I haven't fully traced out the execution, but I do notice a few things in
the code that look suspicious.

It looks like you're naming the destination file based on the wiki page that
has the {{#qrcode}} in it by pulling $wgTitle:

// Use this page's title as part of the filename (also regenerates
// qrcodes when the label changes).
$this->_dstFileName = 'QR-' . $wgTitle->getDBKey() . $append . '.png';

This might be the cause of some of your problems here... background jobs may
run re-parses of other, seemingly unconnected wiki pages during a request,
and other fun things happen where $wgTitle isn't what you expect; that might
be one cause of it triggering with an unexpected title. You may find that it's
more reliable to use $parser->getTitle(), which should definitely return the
title of the page being actively parsed.

More generally, using the calling page's title means that you can't easily
put multiple codes on a single page, and the same code used on different
pages will get copied unnecessarily.

I'd recommend naming the file using a hash of the properties used to
generate the image, instead of naming it for the using page. This will make
your code a bit more independent of where it gets called from, and will let
you both put multiple code images on one page and let common images be
shared among multiple pages.
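
A rough sketch of that combination (the function and variable names are made
up; only $parser->getTitle() and the hashing idea come from the discussion
above):

function renderQrCode( $parser, $data = '', $label = '' ) {
	// Name the file after what it encodes, not after the calling page,
	// so identical codes are shared and one page can hold several codes.
	$dstFileName = 'QR-' . md5( $data . '|' . $label ) . '.png';
	// If the page title is still needed (e.g. for the description text),
	// take it from the parser rather than from $wgTitle.
	$pageTitle = $parser->getTitle();
	// ... generate the bitmap and upload it under $dstFileName ...
}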

One potential problem is garbage collection: a code that gets generated and
used, then removed and not used again will still have been loaded into the
system. This is an existing problem with things like the texvc math system,
but is a bit more visible here because the images appear in the local
uploads area within the wiki. (However they'll be deletable by admins, so
not too awful!)


6. Looking at the database, the mixup hypothesis is confirmed:

 SELECT page_id,page_title,cl_sortkey  FROM `page` INNER JOIN
 `categorylinks` FORCE INDEX (cl_sortkey) ON ((cl_from = page_id)) LEFT
 JOIN `category` ON ((cat_title = page_title AND page_namespace = 14))
 WHERE (1 = 1) AND cl_to = 'Project'  ORDER BY cl_sortkey

 gives (among other data):

 page_id  page_title   cl_sortkey
 1403     SMS2Space    File:QR-Ask.png
 1244     Syn2Sat      File:QR-LetzHack.png
 1251     ChillyChill  File:QR-Syn2cat-radio-ara.png.png


It's possible that the internal uploading process interferes with global
parsing state when it generates and saves the description page for the wiki;
if so, fixing that may require jumping through some interesting hoops. :)

-- brion


Re: [Wikitech-l] Is the $_SESSION secure?

2010-09-24 Thread Platonides
Neil: Yes.


Tim Starling wrote:
 On 24/09/10 10:00, Marco Schuster wrote:
 If it's user-uploaded, take care of garbage collection; actually, how
 does PHP handle it if you upload a file and then don't touch it during
 the script's runtime? Will it automatically be deleted after the
 script is finished or after a specific time?
 
 It's deleted on request shutdown.
 
 -- Tim Starling

If the file is not moved away, there's no point in storing its path in
$_SESSION, as it won't be available on the next request (it could be used for
parameter passing in globals, but that's not proper style).

If the file is moved somewhere else, then you need to garbage-collect it
in case the upload is never finished.
A find -delete from cron, removing files older than a couple of days,
could be enough.
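
For instance, a cron'd PHP script along these lines (the directory here is
hypothetical; point it wherever the staged uploads get moved):

$dir = '/var/www/wiki/images/staged';   // hypothetical staging directory
$maxAge = 2 * 24 * 3600;                // a couple of days
foreach ( glob( "$dir/*" ) as $file ) {
	if ( is_file( $file ) && time() - filemtime( $file ) > $maxAge ) {
		unlink( $file );
	}
}
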
It would be nice to be able to attach delete handlers to memcached keys
for the cases where there's something more that needs deleting (this is
the same problem we also had with the temp DBs for the Selenium tests).




Re: [Wikitech-l] [Toolserver-l] Static dump of German Wikipedia

2010-09-24 Thread Platonides
Ariel T. Glenn wrote:
 On Thu, 23-09-2010, at 21:27 -0500, Q wrote:
 Given the fact that static dumps have been broken for *years* now,
 static dumps are on the bottom of WMFs priority list; I thought it
 would be the best if I just went ahead and built something that can be
 used (and, of course, improved).

 Marco

 That's what I just said. Work with them to fix it, IE: volunteer. IE:
 you fix it.

 
 Actually it's not so much that they are on the bottom of the list as
 that there are two people potentially looking at them, and they are
 Tomasz (who is also doing mobile) and me (and I am doing the XML dumps
 rather than the HTML ones, until they are reliable and happy).
 
 However if you are interested in working on these, I am *very* happy to
 help with suggestions, testing, feedback, etc., even while I am still
 working on the XML dumps. Do you have time and interest?
 
 Ariel

Most (all?) articles should already be parsed in memcached. I think the
bottleneck would be the compression.
Note however that the ParserOutput would still need postprocessing, as
would ?action=render. The first thing that comes to my mind is removing
the edit links (this use case alone seems reason enough to implement
editsection stripping). Sadly, we can't (easily) add the edit sections
back after the rendering.


Re: [Wikitech-l] [Toolserver-l] Static dump of German Wikipedia

2010-09-24 Thread Marco Schuster
On Sat, Sep 25, 2010 at 12:56 AM, Platonides platoni...@gmail.com wrote:
 Ariel T. Glenn wrote:
 On Thu, 23-09-2010, at 21:27 -0500, Q wrote:
 Given the fact that static dumps have been broken for *years* now,
 static dumps are on the bottom of WMFs priority list; I thought it
 would be the best if I just went ahead and built something that can be
 used (and, of course, improved).

 Marco

 That's what I just said. Work with them to fix it, IE: volunteer. IE:
 you fix it.


 Actually it's not so much that they are on the bottom of the list as
 that there are two people potentially looking at them, and they are
 Tomasz (who is also doing mobile) and me (and I am doing the XML dumps
 rather than the HTML ones, until they are reliable and happy).

 However if you are interested in working on these, I am *very* happy to
 help with suggestions, testing, feedback, etc., even while I am still
 working on the XML dumps. Do you have time and interest?

 Ariel

 Most (all?) articles should be already parsed in memcached. I think the
 bottleneck would be the compression.
 Note however that the ParserOutput would still need postprocessing, as
 would ?action=render. The first thing that comes to my mind is to remove
 the edit links (this use case alone seems enough for implementing
 editsection stripping). Sadly, we can't (easily) add the edit sections
 after the rendering.
This should be doable using a simple regex which plainly goes for
<span class="editsection">.
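
Something like the following (a sketch, assuming the 2010-era skin markup
where the section edit link sits in a span with that class):

$html = preg_replace( '!<span class="editsection">.*?</span>!s', '', $html );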

Marco

-- 
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de


Re: [Wikitech-l] Unforeseen linking on LocalFile->upload()

2010-09-24 Thread David Raison

Hi Brion,

thanks for your quick answer!

On 25/09/10 00:27, Brion Vibber wrote:

 Hi David! I've actually been peeking at this extension as I'd like to use
 something like this to generate scannable QR codes with Android software
 download links for other projects. :)

Yeah, I was really astonished that there wasn't already a qrcode extension.


 1. QrCodes are generated for pages that do not have or transclude a
 {{#qrcode:}} function call, in this case properties [2,3,4].
 
 I haven't fully traced out the execution, but I do notice a few things in
 the code that look suspicious.
 
 It looks like you're naming the destination file based on the wiki page that
 has the {{#qrcode}} in it by pulling $wgTitle:
 
 // Use this page's title as part of the filename (also regenerates
 // qrcodes when the label changes).
 $this->_dstFileName = 'QR-' . $wgTitle->getDBKey() . $append . '.png';
 
 This might be the cause of some of your problems here... background jobs may
 run re-parses of other seemingly unconnected wiki pages during a request,
 and other fun things where $wgTitle isn't what you expect, and that might be
 one cause of it triggering with an unexpected title. You may find that it's
 more reliable to use $parser->getTitle(), which should definitely return the
 title for the page being actively parsed.

Ok, I'll try that one then, or not, as you suggest below.


 More generally, using the calling page's title means that you can't easily
 put multiple codes on a single page, and the same code used on different
 pages will get copied unnecessarily.

Well you can use multiple codes on a single page, as demonstrated on the
Sandbox [a] and made possible by the $append variable, but you're
certainly right about the latter part.


 I'd recommend naming the file using a hash of the properties used to
 generate the image, instead of naming it for the using page. This will make
 your code a bit more independent of where it gets called from, and will let
 you both put multiple code images on one page and let common images be
 shared among multiple pages.

Will do that too then.

 One potential problem is garbage collection: a code that gets generated and
 used, then removed and not used again will still have been loaded into the
 system. This is an existing problem with things like the texvc math system,
 but is a bit more visible here because the images appear in the local
 uploads area within the wiki. (However they'll be deletable by admins, so
 not too awful!)

Having them uploaded was one of the main reasons I saved the images and
don't just return a URL for the src attribute of an image tag. But I
guess you could have a bot run over it, or I suppose there's a hook
triggered on deleting a page which would allow me to also delete qrcodes
embedded into / linked from it.


 6. Looking at the database, the mixup hypothesis is confirmed:
 It's possible that the internal uploading process interferes with global
 parsing state when it generates and saves the description page for the wiki;
 if so, fixing that may require jumping through some interesting hoops. :)

Well then let's hope that the $parser->getTitle() alternative solves the
problem.

David





Re: [Wikitech-l] Unforeseen linking on LocalFile->upload()

2010-09-24 Thread David Raison

On 25/09/10 01:40, David Raison wrote:
 I'd recommend naming the file using a hash of the properties used to
 generate the image, instead of naming it for the using page. This will make
 your code a bit more independent of where it gets called from, and will let
 you both put multiple code images on one page and let common images be
 shared among multiple pages.
 
 Will do that too then.

Hmm... this is interesting though...
If you check out the source, you'll see that I have replaced every call
to the global $wgTitle with the object returned by $parser->getTitle().

And though I don't set a variable (but only read them), when you first
refresh a page with a QrCode on it, it replaces the page's title with
the qrcode's.

 6. Looking at the database, the mixup hypothesis is confirmed:
 It's possible that the internal uploading process interferes with global
 parsing state when it generates and saves the description page for the wiki;

Maybe it is this problem after all. Come to think of it: when you
upload an image using the Special:Upload page, the resulting page's
title exhibits exactly the behaviour mentioned above; it turns into

File:<name of uploaded file>

for example:

File:QR-ee09b666b60225368736dfaef75c62ea.png


 if so, fixing that may require jumping through some interesting hoops. :)

Can I just use another method then, for example publish() in
combination with some other methods that enter the upload into the database?

David


