[Wikitech-l] Testing recitation-bot

2017-03-19 Thread Anthony Di Franco
Hi all,
 I'd like to plead for advice on testing Recitation-bot to demonstrate we
have resolved a bug that resulted in a ban for splitting image pages into
two separate incomplete pages last month.
 The log of attempts to test the fix is in this bug
<https://github.com/wpoa/recitation-bot/issues/69>.
 Here are the key points / open questions:

   - Each wiki I have successfully tried to test on, including the
   production wikis, test.wikipedia.org, and testwiki.wiki, seems to
   redirect the bot to a page with information about permissions / blocks,
   something I had never seen prior to the block.
   - What is the appropriate place to test? Seems to be www.thetestwiki.org
   - Could we appeal the block on the strength of the apparent correctness
   of the edit to fix the bug, at least temporarily so as to be able to
   demonstrate the fix on a wiki we were running successfully on in the recent
   past? Who would be best to approach with such a request?

Thanks any and all for any advice you can offer.
Anthony
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] pywikibot troubleshooting in recitation-bot

2017-01-05 Thread Anthony Di Franco
Actually I've made a bit of progress diagnosing this. It has nothing to do
with the login or looping in my code but may have to do with the
composition of my use of multiprocessing and pywikibot's use of
multithreading and may have to do with this bug:
https://phabricator.wikimedia.org/T135986
Any advice on managing this?

On Sat, Dec 24, 2016 at 1:36 PM Legoktm  wrote:

Hi,

+cc pywiki...@lists.wikimedia.org

On 12/22/2016 03:55 PM, Anthony Di Franco wrote:
> Hi all,
>  I'm doing some renovations on recitation-bot and running into trouble
when
> the time comes for pywikibot to upload article data to wikisource and
> commons. The thread doing so hangs without any sort of informative error.
I
> made sure that the unix user under which the web service that is using
> pywikibot is running is logged into each wiki per Max's advice but I still
> have the problem. I'm going to try to get more information about what's
> going on but would also appreciate pointers about what might be going
> wrong. Particularly, the web service is now running under Kubernetes
rather
> than sun grid engine, so I suspect that the login state might not be
making
> it into the container - can anyone advise on where the login state is
> maintained and whether this will be transferred into the kubernetes
> container?

Pywikibot stores all of its state in the same directory that your
user-config.py file is in.

In my experience python hanging is typically an accidental infinite
loop. Adding debug logging can help pinpoint where it starts hanging and
narrow down the problematic code.

-- Legoktm

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] pywikibot troubleshooting in recitation-bot

2016-12-22 Thread Anthony Di Franco
Hi all,
 I'm doing some renovations on recitation-bot and running into trouble when
the time comes for pywikibot to upload article data to wikisource and
commons. The thread doing so hangs without any sort of informative error. I
made sure that the unix user under which the web service that is using
pywikibot is running is logged into each wiki per Max's advice but I still
have the problem. I'm going to try to get more information about what's
going on but would also appreciate pointers about what might be going
wrong. Particularly, the web service is now running under Kubernetes rather
than sun grid engine, so I suspect that the login state might not be making
it into the container - can anyone advise on where the login state is
maintained and whether this will be transferred into the kubernetes
container?
Thanks,
Anthony
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Technical advice on expert review?

2016-04-16 Thread Anthony Cole
BMJ, the publishers of the *British Medical Journal* and other top-tier
biomedical journals, have kindly recruited the best minds they can get to
review the en.Wikipedia's article, "Parkinson's disease".

We began the review by passing the article, in a Word document, from one
reviewer to the next by email. Each made proposed changes to the article
text and left comments in the document, using Word's "Review" and "Track
changes" features.

At that point we needed to start a discussion, and Word isn't ideal for
that. So I pasted the relevant paragraphs from the Word document into the
left column of a wiki table, and the reviewers' comments into the right
column, where the discussion could happen. [1] I manually applied
background colours to distinguish deletions from additions in the left
column, using .

That discussion has now begun but one of the many things I've learned
during all this is, the top researchers and theorists spend a lot of time
in the air (travelling to conferences, lectures, meetings), and it is then,
free from the demands of job and family, when they do their reviewing.

So, I have pasted that wiki table into Word and have made it available to
the reviewers here: [2]. Now they can download a copy before they get on a
flight, and email it back to me with their comments when they're online
again, and I'll transcribe their comments into the wiki table for
discussion.

This may be as simple as it gets but I just thought I'd put this before
you, in case you may have thoughts on a better technical approach for next
time. (BMJ have offered to do more of these.) I'm finding the construction
of the wiki table tedious (particularly highlighting the deletions and
additions) though I'm getting faster, and transcribing offline comments
from the Word document into the wiki table will be a small chore. The wiki
table pastes easily into Word with highlighting and formatting intact, but
not vice versa. (I've also asked at Village pump (technical).)

Any thoughts on making this easier or smarter would be much appreciated.

Anthony Cole

1. https://en.wikipedia.org/wiki/User:Anthonyhcole/sandbox
2.
https://onedrive.live.com/view.aspx?resid=C1FF29217E209194!2141&ithint=file%2cdocx&app=Word&authkey=!AFGj7fd2K4v7N5o
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Release candidate for 1.24.0

2014-11-23 Thread Anthony Cole
Ignore my last post - I appended it to the wrong thread.

Anthony Cole <http://en.wikipedia.org/wiki/User_talk:Anthonyhcole>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] changing edit summaries

2014-11-23 Thread Anthony Cole
(Sorry, I posted this in the wrong thread a few minutes ago.)

On Thu, Nov 13, 2014 at 9:57 PM, Yusuke Matsubara https://lists.wikimedia.org/mailman/listinfo/wikitech-l>> wrote:
>** On Thu, Nov 13, 2014 at 9:15 PM, Amir E. Aharoni
**>* https://lists.wikimedia.org/mailman/listinfo/wikitech-l>> wrote:
*>>* I tried looking for it in Bugzilla; I expected to find a two-digit bug for
*>>* it, but I couldn't find any at all. Of course it's possible that I didn't
*>>* look well enough.
*>>* A bit different, but there is an extension that enables
*>* "supplementing" additional non-modifiable edit summaries:
*>* https://www.mediawiki.org/wiki/Extension:RevisionCommentSupplement
<https://www.mediawiki.org/wiki/Extension:RevisionCommentSupplement>
*>>* It was contributed (without a Bugzilla request) by Burthsceh, a
*>* volunteer at Japanese Wikipedia, prompted by the necessity to fix
*>* attributions made in edit summaries (for reused texts). [1]  I don't
*>* think it has been extensively reviewed, though.
*>>* With that approach, you could effectively modify an edit summary by
*>* appending a modified one and rev-deleting the original one.
*>>* [1] 
https://ja.wikipedia.org/wiki/Wikipedia:%E4%BA%95%E6%88%B8%E7%AB%AF/subj/%E5%B1%A5%E6%AD%B4%E3%83%9A%E3%83%BC%E3%82%B8%E3%81%AE%E5%80%8B%E3%80%85%E3%81%AE%E7%89%88%E3%81%AB%E5%AF%BE%E3%81%97%E3%81%A6%E8%BF%BD%E5%8A%A0%E3%81%AE%E3%82%B3%E3%83%A1%E3%83%B3%E3%83%88%E3%82%92%E8%A1%A8%E7%A4%BA%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95%E3%81%AE%E5%B0%8E%E5%85%A5%E3%81%AE%E6%8F%90%E6%A1%88
<https://ja.wikipedia.org/wiki/Wikipedia:%E4%BA%95%E6%88%B8%E7%AB%AF/subj/%E5%B1%A5%E6%AD%B4%E3%83%9A%E3%83%BC%E3%82%B8%E3%81%AE%E5%80%8B%E3%80%85%E3%81%AE%E7%89%88%E3%81%AB%E5%AF%BE%E3%81%97%E3%81%A6%E8%BF%BD%E5%8A%A0%E3%81%AE%E3%82%B3%E3%83%A1%E3%83%B3%E3%83%88%E3%82%92%E8%A1%A8%E7%A4%BA%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95%E3%81%AE%E5%B0%8E%E5%85%A5%E3%81%AE%E6%8F%90%E6%A1%88>
*


On Wed Nov 12 01:05:26 UTC 2014 I asked this list if the technical team
could help the patrollers of recent changes to Wikipedia's medical articles
“...tag the log entry of revisions ... as having been reviewed for
policy/guideline compliance by a trusted editor.” [1]

I am quite technically illiterate and may have misunderstood, but judging
by Yusuke Matsubara's description, the extension he mentions above seems
like it might fit our needs. Will it enable patrollers to add a comment to
the edit summary? Does anyone know if it works on en.Wikipedia?

1.https://lists.wikimedia.org/pipermail/wikitech-l/2014-November/079418.html


Anthony Cole <http://en.wikipedia.org/wiki/User_talk:Anthonyhcole>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Release candidate for 1.24.0

2014-11-22 Thread Anthony Cole
On Thu, Nov 13, 2014 at 9:57 PM, Yusuke Matsubara https://lists.wikimedia.org/mailman/listinfo/wikitech-l>> wrote:
>* On Thu, Nov 13, 2014 at 9:15 PM, Amir E. Aharoni
*>* https://lists.wikimedia.org/mailman/listinfo/wikitech-l>> wrote:
*>>* I tried looking for it in Bugzilla; I expected to find a two-digit bug for
*>>* it, but I couldn't find any at all. Of course it's possible that I didn't
*>>* look well enough.
*>>* A bit different, but there is an extension that enables
*>* "supplementing" additional non-modifiable edit summaries:
*>* https://www.mediawiki.org/wiki/Extension:RevisionCommentSupplement

*>>* It was contributed (without a Bugzilla request) by Burthsceh, a
*>* volunteer at Japanese Wikipedia, prompted by the necessity to fix
*>* attributions made in edit summaries (for reused texts). [1]  I don't
*>* think it has been extensively reviewed, though.
*>>* With that approach, you could effectively modify an edit summary by
*>* appending a modified one and rev-deleting the original one.
*>>* [1] 
https://ja.wikipedia.org/wiki/Wikipedia:%E4%BA%95%E6%88%B8%E7%AB%AF/subj/%E5%B1%A5%E6%AD%B4%E3%83%9A%E3%83%BC%E3%82%B8%E3%81%AE%E5%80%8B%E3%80%85%E3%81%AE%E7%89%88%E3%81%AB%E5%AF%BE%E3%81%97%E3%81%A6%E8%BF%BD%E5%8A%A0%E3%81%AE%E3%82%B3%E3%83%A1%E3%83%B3%E3%83%88%E3%82%92%E8%A1%A8%E7%A4%BA%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95%E3%81%AE%E5%B0%8E%E5%85%A5%E3%81%AE%E6%8F%90%E6%A1%88

*


On Wed Nov 12 01:05:26 UTC 2014 I asked this list if the technical team
could help the patrollers of recent changes to Wikipedia's medical articles
“...tag the log entry of revisions ... as having been reviewed for
policy/guideline compliance by a trusted editor.” [1]

I am quite technically illiterate and may have misunderstood, but judging
by Yusuke Matsubara's description, the extension he mentions above seems
like it might fit our needs. Will it enable patrollers to add a comment to
the edit summary? Does anyone know if it works on en.Wikipedia?

1.
https://lists.wikimedia.org/pipermail/wikitech-l/2014-November/079418.html
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Revision metadata as a service?

2014-11-12 Thread Anthony Cole
Thanks Bartosz. Just to clarify:

If we apply FlaggedRevs to all medical articles (articles that have the
WP:MED template on their talk page), configured to display the latest
article version, can we create a permission (say, Medicine Reviewer) that
allows one to tag the revision log entry with a comment? Would it interfere
in any way with the normal practice of other editors who don't have that
permission?

Anthony Cole <http://en.wikipedia.org/wiki/User_talk:Anthonyhcole>


On Wed, Nov 12, 2014 at 3:05 AM, Bartosz Dziewoński 
wrote:

> W dniu środa, 12 listopada 2014 Anthony Cole 
> napisał(a):
> >
> > Allow us to tag the log entry of normal revisions (not pending
> > changes/flagged revisions - no medical articles presently have flagged
> > revisions, and none are likely to in the near future) as having been
> > reviewed for policy/guideline compliance by a trusted editor.
>
>
> This is one of the things the FlaggedRevs  extension (the same one that
> powers the "pending changes" system on the English Wikipedia) allows you to
> do. It can be configured to provide arbitrary flags (not just binary
> "okay"/"not okay"), and it can be configured to display the latest version
> of the article (rather than the "flagged" one) to visitors by default, and
> it can be configred to work on all articles on a wiki.
>
>
> --
> -- Matma Rex
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Revision metadata as a service?

2014-11-11 Thread Anthony Cole
As someone who patrols recent changes to our 33,000 (and rising) health
sciences-related articles with a diminishing number of colleagues, one
thing developers could do to help us keep that content safe would be this:

Allow us to tag the log entry of normal revisions (not pending
changes/flagged revisions - no medical articles presently have flagged
revisions, and none are likely to in the near future) as having been
reviewed for policy/guideline compliance by a trusted editor.

There are maybe a dozen regular/semi-regular med patrollers (down from
about twice that number three years ago), and I'm very conscious that we
aren't keeping up. If I see a revision has been reviewed by one of that
dozen whom I trust, I'll (not always, but often) skip checking that
revision and move on to the next unreviewed revision.

This will

a) save me and the others a lot of time, allowing us to cover much more
ground and
b) give us a handle on how thoroughly we're vetting changes to this
sensitive content.

Ideally, each of us that reviews a revision should be able to tag its log
entry - so we can see the depth of review each revision has undergone.

The board of Wiki Project Med Foundation are discussing this at the moment,
and we see it as a very effective step toward safeguarding and improving
our medical offering. If you could do this for us, it would be very much
appreciated.

Anthony Cole <http://en.wikipedia.org/wiki/User_talk:Anthonyhcole>


On Mon, Nov 10, 2014 at 11:25 PM, Federico Leva (Nemo) 
wrote:

> Yes, a failed piece of rotting [configuration] code on en.wiki is called
> "Pending changes"; this doesn't mean that the extension, used by over 200
> wikis, has been affected in any way.
>
> Nemo
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Top Level Design, LLC greedy registrar!

2014-04-05 Thread Anthony
The Chinese word for "wiki" is not "wiki".

As far as sharing my own personal certainties,
https://en.wikipedia.org/wiki/Generic_trademark would be a start, but for
the most part certainties aren't something that can easily be shared.


On Fri, Apr 4, 2014 at 5:12 PM, Federico Leva (Nemo) wrote:

> "wiki" is a generic term, not a trademarked one.
>>
>
> You sure? Please share your certainties!
>  Foundation_trademarking_.E7.B6.AD.E5.9F.BA.2C_the_Chinese_
> word_for_.22wiki.22.3F>
>
> Nemo
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] .wiki gTLD

2014-02-22 Thread Anthony
I wouldn't think any of those other than perhaps "media.wiki" would
implicate a WMF trademark. As far as MediaWiki, WMF does claim a trademark
on that.


On Sat, Feb 22, 2014 at 5:17 AM, addshorewiki wrote:

> en.wiki
> data.wiki
> meta.wiki
> media.wiki
> en.books.wiki
> en.voyage.wiki
>
> Most of them sound rather plausible.
>
> Addshore
>  That is a tough one, since most project names /start/ with "wiki".
> "pedia.wiki" just sounds awkward, as do many others.
>
> On Fri, 21 Feb 2014, at 5:56, Derric Atzrott wrote:
> > ICANN just delegated the gTLD .WIKI yesterday.  It's being managed by Top
> Level
> > Design, LLC.  I'm not entirely sure what that means for all of us
> exactly, but I
> > suspect that the WMF is going to want to at least register Wikipedia.wiki
> and
> > Wikimedia.wiki once the gTLD is open for registration.
> >
> > Some of the new gTLDs are already opening up for registration.  .sexy and
> > .tattoo will be opening for registration on 25 February.
> >
> > It looks like if we want to get .wiki domains we will be getting them
> sometime
> > in May or June during the "sunrise" period.[1]
> >
> > ICANN also has a full list of new gTLDs that they have approved.[2]
> >
> > Thank you,
> > Derric Atzrott
> > Computer Specialist
> > Alizee Pathology
> >
> > [1]: http://www.namejet.com/Pages/newtlds/tld/WIKI
> > [2]: http://newgtlds.icann.org/en/program-status/delegated-strings
> >
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

2014-01-21 Thread Anthony
If you're going to use xz then you wouldn't even have to recompress the
blocks that haven't changed and are already well compressed.


On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer  wrote:

> Ack, sorry for the (no subject); again in the right thread:
>
> > For external uses like XML dumps integrating the compression
> > strategy into LZMA would however be very attractive. This would also
> > benefit other users of LZMA compression like HBase.
>
> For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
>
> That has a 4 MB buffer, compression ratios within 15-25% of
> current 7zip (or histzip), and goes at 30MB/s on my box,
> which is still 8x faster than the status quo (going by a 1GB
> benchmark).
>
> Trying to get quick-and-dirty long-range matching into LZMA isn't
> feasible for me personally and there may be inherent technical
> difficulties. Still, I left a note on the 7-Zip boards as folks
> suggested; feel free to add anything there:
> https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
>
> Thanks for the reply,
> Randall
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Fwd: Participation in an Aaron Swartz Hackathon event

2013-10-11 Thread Anthony
It wasn't really a joke.
On Oct 11, 2013 5:34 PM, "Petr Bena"  wrote:

> That's not a funny joke...
>
> On Fri, Oct 11, 2013 at 7:38 PM, Anthony  wrote:
> > Which websites are you planning on hacking into?
> >
> >
> > On Fri, Oct 11, 2013 at 12:17 PM, Quim Gil  wrote:
> >
> >> There is a plan for a worldwide round of Aaron Hackathons, on the
> upcoming
> >> Nov 8-10 weekend.
> >>
> >> http://aaronswartzhackathon.**org/ <http://aaronswartzhackathon.org/>
> >>
> >> Coordination:
> >> https://www.noisebridge.net/**wiki/Worldwide_Aaron_Swartz_**
> >> Memorial_Hackathon_Series<
> https://www.noisebridge.net/wiki/Worldwide_Aaron_Swartz_Memorial_Hackathon_Series
> >
> >>
> >> We have been invited to run a hackathon. Can we organize it? We would
> need
> >> to find a project and a critical mass of contributors willing to
> document
> >> and coordinate the hackathon.
> >>
> >> One possibility could be to kick-off the hackathon on Friday 8 Nov in a
> >> physical location (San Francisco), and focus initially on the
> distribution
> >> of tasks. Then remote participants could also participate taking tasks,
> >> participating with the rest of the group on some IRC channel and
> occasional
> >> videoconferences.
> >>
> >> About the project, I personally think that it should have a link with
> the
> >> motivation of the hackathon:
> >>
> >> "We were part of an inchoate, ad-hoc community of collaborators who
> helped
> >> each other learn how to code. No, not how to write code - how to write
> code
> >> for the purpose of changing the world." - Zooko, on memories of Aaron
> >>
> >> See also https://en.wikipedia.org/wiki/**Aaron_Swartz#Life_and_works<
> https://en.wikipedia.org/wiki/Aaron_Swartz#Life_and_works>
> >>
> >>
> >> On Sat, 5 Oct 2013, Noah Swartz wrote:
> >>
> >>> Hey assorted Wikimedia people,
> >>> As I may have mentioned to some of you previously, we're running
> another
> >>> round of Aaron Hackathons, on the upcoming Nov 8-10 weekend. I was
> >>> wondering if WMF would be interested in providing a project for people
> to
> >>> work on. For each event we're hoping to have one well structured
> project
> >>> that people - both technical and non - can work on that can have some
> >>> support from people who have worked on it or related projects
> previously,
> >>> so that participants can jump right in.
> >>> Would you be willing to structure something for people to work on? If
> not
> >>> are there other WMF related things that people can do? Maybe go
> through a
> >>> list of open bugs or feature requests? Or maybe just writing
> documents, or
> >>> doing outreach, any project is welcome.
> >>> Currently we have two tentative events in SF and ~5 more confirmed
> >>> locations elsewhere around the world. We have a very basic landing
> page up
> >>> at http://aaronswartzhackathon.**org/ <
> http://aaronswartzhackathon.org/>which might give you more of a sense of
> >>> what's going on. I assume that SF is the location that works best for
> you
> >>> but let me know if you think somewhere else would be good. I'd really
> love
> >>> to see you all participate so let me know if there's anything I can do
> to
> >>> help.
> >>> As always feel free to pass this along to anyone else who you think
> might
> >>> be interested, and I'm happy to answer any and all questions you have.
> >>> Looking forward to hearing back soon!
> >>> Noah
> >>>
> >>>
> >>
> >>
> >> __**_
> >> Wikitech-l mailing list
> >> Wikitech-l@lists.wikimedia.org
> >> https://lists.wikimedia.org/**mailman/listinfo/wikitech-l<
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l>
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Fwd: Participation in an Aaron Swartz Hackathon event

2013-10-11 Thread Anthony
Which websites are you planning on hacking into?


On Fri, Oct 11, 2013 at 12:17 PM, Quim Gil  wrote:

> There is a plan for a worldwide round of Aaron Hackathons, on the upcoming
> Nov 8-10 weekend.
>
> http://aaronswartzhackathon.**org/ 
>
> Coordination:
> https://www.noisebridge.net/**wiki/Worldwide_Aaron_Swartz_**
> Memorial_Hackathon_Series
>
> We have been invited to run a hackathon. Can we organize it? We would need
> to find a project and a critical mass of contributors willing to document
> and coordinate the hackathon.
>
> One possibility could be to kick-off the hackathon on Friday 8 Nov in a
> physical location (San Francisco), and focus initially on the distribution
> of tasks. Then remote participants could also participate taking tasks,
> participating with the rest of the group on some IRC channel and occasional
> videoconferences.
>
> About the project, I personally think that it should have a link with the
> motivation of the hackathon:
>
> "We were part of an inchoate, ad-hoc community of collaborators who helped
> each other learn how to code. No, not how to write code - how to write code
> for the purpose of changing the world." - Zooko, on memories of Aaron
>
> See also 
> https://en.wikipedia.org/wiki/**Aaron_Swartz#Life_and_works
>
>
> On Sat, 5 Oct 2013, Noah Swartz wrote:
>
>> Hey assorted Wikimedia people,
>> As I may have mentioned to some of you previously, we're running another
>> round of Aaron Hackathons, on the upcoming Nov 8-10 weekend. I was
>> wondering if WMF would be interested in providing a project for people to
>> work on. For each event we're hoping to have one well structured project
>> that people - both technical and non - can work on that can have some
>> support from people who have worked on it or related projects previously,
>> so that participants can jump right in.
>> Would you be willing to structure something for people to work on? If not
>> are there other WMF related things that people can do? Maybe go through a
>> list of open bugs or feature requests? Or maybe just writing documents, or
>> doing outreach, any project is welcome.
>> Currently we have two tentative events in SF and ~5 more confirmed
>> locations elsewhere around the world. We have a very basic landing page up
>> at http://aaronswartzhackathon.**org/ 
>> which might give you more of a sense of
>> what's going on. I assume that SF is the location that works best for you
>> but let me know if you think somewhere else would be good. I'd really love
>> to see you all participate so let me know if there's anything I can do to
>> help.
>> As always feel free to pass this along to anyone else who you think might
>> be interested, and I'm happy to answer any and all questions you have.
>> Looking forward to hearing back soon!
>> Noah
>>
>>
>
>
> __**_
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/**mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] How's the SSL thing going?

2013-08-01 Thread Anthony
On Thu, Aug 1, 2013 at 12:52 AM, Jeremy Baron  wrote:

> On Thu, Aug 1, 2013 at 4:28 AM, Anthony  wrote:
> > Does rapid key rotation in any way make a MITM attack less detectable?
> > Presumably the NSA would have no problem getting a fraudulent certificate
> > signed by DigiCert.
>
> I'm not seeing the relevance. And we have the SSL observatory (EFF) fwiw.
>

I fully admit that I don't understand exactly how SSL observatory works.  I
thought it detected when the key changes, so I was wondering whether
rapidly rotating keys might thwart that.  But again, I don't really
understand how it works.  So it wasn't a rhetorical question.


We (society, standards making bodies, etc.) need to do more to reform
> the current SSL mafia system. (i.e. it should be easier for a vendor
> to remove a CA from a root store and we shouldn't have a situation
> where many dozens of orgs all have the ability to sign certs valid for
> any domain.)
>

In order to not be easily detected, the cert used by the MITM would need to
be from the same CA as the usual one (DigiCert?).  Or at least from someone
who had obtained DigiCert's key.  Or is my cluelessness about how SSL
observatory works showing once again?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] How's the SSL thing going?

2013-07-31 Thread Anthony
On Wed, Jul 31, 2013 at 5:59 PM, George Herbert wrote:

> The second is site key security (ensuring the NSA never gets your private
> keys).


Who theoretically has access to the private keys (and/or the signing key)
right now?

The third is perfect forward security with rapid key rotation.
>

Does rapid key rotation in any way make a MITM attack less detectable?
Presumably the NSA would have no problem getting a fraudulent certificate
signed by DigiCert.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Git for idiots

2013-05-08 Thread Anthony
I guess the viewpoint and perspective from the more experienced users may
be different. The veterans may start to take some knowledge for granted, as
a given knowledge that they may thing people would already know.

For example, the underlying concept of git commit is something that I now
take for granted. But it's something that I have to really visit and talk
about when I'm teaching a friend to use git.

Perhaps we need contributions from people who have only just learned how to
properly use git, rather than only having veterans write the guide up.
People who still have the learning experience fresh in their minds.

And perhaps we also need contributions from people who routinely teach
their friends about git.


On Thu, May 9, 2013 at 12:54 AM, Chad  wrote:

> Well, that was the point of [[Git/Getting started]] because the Workflow
> document sucks.
>
> -Chad
>
> On Wed, May 8, 2013 at 12:41 PM, Petr Bena  wrote:
> > I was using these tutorials in past, and they were pretty complicated
> > for me to understand git. I don't say it should be considered some
> > "official" documentation, rather something what desperate people could
> > use.
> >
> > Git/Workflow and such are written by people who already understand git
> > - they don't see what seems complicated to newbies or what is really
> > hard and makes git evil to new users. I think some far, far simpler
> > guide like "git for dummies" or whatever similar, could be useful for
> > many new contributors who have no idea about git.
> >
> > On Wed, May 8, 2013 at 6:37 PM, Chad  wrote:
> >> On Wed, May 8, 2013 at 12:34 PM, Petr Bena  wrote:
> >>> Hi,
> >>>
> >>> Long time ago when I started learning with git I decided to create a
> >>> simple guide (basically I was just taking some notes of what is
> >>> needed). I never thought that it could be useful to anyone so I never
> >>> announced it anywhere. However I got some feedback to it, so I decided
> >>> to inform you too.
> >>>
> >>> The basic idea is to create a TOTALLY SIMPLE guide that git
> >>> illiterates like me can understand and thanks to which they would find
> >>> out how to do stuff in wikimedia git / gerrit.
> >>>
> >>> Link is here: www.mediawiki.org/wiki/User:Petrb/Git_for_idiots
> >>>
> >>> It doesn't contain so much and there are some mistakes / feel free to
> fix them.
> >>>
> >>> Since wikimedia switched to gerrit from svn I have yet met a tons of
> >>> people who had problems adapting to it, so this could eventually help
> >>> some.
> >>>
> >>
> >> We've got [[Git/Workflow]], [[Git/Tutorial]] and [[Git/Getting
> >> started]] (in decreasing
> >> order of complexity/depth), so I would be hesitant to add yet another
> >> howto page.
> >> Considering this is supposed to be quick-and-easy docs, I'd suggest
> folding any
> >> unique content into the getting started doc.
> >>
> >> -Chad
> >>
> >> ___
> >> Wikitech-l mailing list
> >> Wikitech-l@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Prototyping Wiki Inline Comments

2013-05-04 Thread Anthony
Hi Daniel,

I hope to clarify the question you asked. Let's say userA highlighted the
text "sunflower stalk" and wrote an inline starting comment.

Now userB decides to give an inline reply to the inline comment.

Are you asking about how the system would handle a case in which userA
shifts the highlighting from "sunflower stalk" to "green leaves" at the
same time when userB is adding the inline reply?
(and you have userB who thinks his inline reply would be appearing next to
the "sunflower stalk" text.)


On Sat, May 4, 2013 at 4:15 AM, Daniel Mietchen <
daniel.mietc...@googlemail.com> wrote:

> Hi Anthony,
>
> interesting feature. How would the system handle cases in which the
> content originally pointed at when making the initial inline comment
> has been changed?
>
> Daniel
>
>
> On Fri, May 3, 2013 at 10:09 PM, Anthony  wrote:
> > Dear all,
> >
> > I have applied for the Prototyping Inline Comments for the Google Summer
> of
> > Code.
> >
> > Essentially, the project is an extension that allows any wiki user to
> > select text and then make an inline comment or a reply to an existing
> > inline comment. Imagine: a user lands in a Wikipedia article, selects one
> > sentence and leaves an inline comment that others can optionally read and
> > reply to.
> >
> > Users can make useful comments regarding specific part of articles, which
> > will be a part of collaborative work. The key benefit is to users to
> > collaborate easily - because this actually allows you to point to
> something
> > and comment in direct reference to it. It's like pointing your finger to
> a
> > piece of paper and telling your friend sitting next to you, which can
> only
> > be done in person and is currently impossible over the Internet. So it's
> a
> > really powerful feature for collaborations since it makes one of the
> > Internet-impossibles into a possible action.
> >
> > That was for the insertion of a new comment. For the replying part, it
> will
> > be a format will likely be similar to how threads are like in a forum,
> for
> > the prototype.
> >
> > As I go along the project, I will be posting more technical details and
> > updates. From now til the end of the project, I do hope to get everyone's
> > feedback along the way :)
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Prototyping Wiki Inline Comments

2013-05-03 Thread Anthony
Dear all,

I have applied for the Prototyping Inline Comments for the Google Summer of
Code.

Essentially, the project is an extension that allows any wiki user to
select text and then make an inline comment or a reply to an existing
inline comment. Imagine: a user lands in a Wikipedia article, selects one
sentence and leaves an inline comment that others can optionally read and
reply to.

Users can make useful comments regarding specific part of articles, which
will be a part of collaborative work. The key benefit is to users to
collaborate easily - because this actually allows you to point to something
and comment in direct reference to it. It's like pointing your finger to a
piece of paper and telling your friend sitting next to you, which can only
be done in person and is currently impossible over the Internet. So it's a
really powerful feature for collaborations since it makes one of the
Internet-impossibles into a possible action.

That was for the insertion of a new comment. For the replying part, it will
be a format will likely be similar to how threads are like in a forum, for
the prototype.

As I go along the project, I will be posting more technical details and
updates. From now til the end of the project, I do hope to get everyone's
feedback along the way :)
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Why are we still using captchas on WMF sites?

2013-01-21 Thread Anthony
On Mon, Jan 21, 2013 at 3:00 AM, David Gerard  wrote:
> I mean, you could redefine "something that doesn't block all spambots
> but does hamper a significant proportion of humans" as "successful",
> but it would be a redefinition.

It's not a definition, it's a judgment.

And whether or not it's a correct judgment depends on how many
spambots are blocked, and how many productive individuals are
"hampered", among other things.

After all, reverting spam hampers people too.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-16 Thread Anthony
On Thu, Jun 14, 2012 at 12:54 AM, Daniel Friesen
 wrote:
> On Tue, 12 Jun 2012 05:14:07 -0700, Anthony  wrote:
>
>> On Sun, Jun 10, 2012 at 2:03 PM, Marcin Cieslak  wrote:
>>>
>>> You *DON'T* want to
>>> renumber your whole home network every time your ISP changes your IPv6
>>> prefix.
>>
>>
>> If only they had some service which converted easy to remember names
>> into IPv6 addresses.
>
>
> You don't want to put DNS names inside of firewall rules. Some won't let
> you, and for others it's risky... ever read a manual?

That comment was uncalled for.

> IPv6 uses global addresses not internal ones (and for good reason).

IPv6 supports unique local addresses in addition to global addresses.

> Forcing local networks using local addresses to host local data remotely is
> also ridiculous.

Well, I think I misunderstood what Marcin was saying.  So, sorry about that.

But, on the other hand, there is nothing that prohibits people from
hosting DNS only locally.  If, for some reason, you want to use an IP
address which is assigned by your ISP, for a host which is only
accessible via the local network, then you might even want to do this.
 I'm not sure why you'd want to use an IP address which is assigned by
your ISP for a host which is only supposed to be accessible via the
local network, though.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-12 Thread Anthony
On Sun, Jun 10, 2012 at 2:28 PM, David Gerard  wrote:
> On 9 June 2012 21:51, Anthony  wrote:
>
>> Here at BestISP, we assign you a unique number that you can never
>> change!  We attach this unique number to all your Internet
>> communications, so that every time you go back to a website, that
>> website knows they're dealing with the same person.
>> Switch to BestISP!  1% faster communications, and the increased
>> ability for websites to track you!
>
>
> Whereas in the real world, IPv4 static IPs are considered a cost-extra
> feature, and one source of IPv6 resistance at consumer ISPs has been
> that they couldn't sell said cost-extra feature with it.

Sure they can.  There's nothing stopping them from charging more for
static IPv6 vs. dynamic IPv6.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-12 Thread Anthony
On Sun, Jun 10, 2012 at 2:03 PM, Marcin Cieslak  wrote:
> You *DON'T* want to
> renumber your whole home network every time your ISP changes your IPv6
> prefix.

If only they had some service which converted easy to remember names
into IPv6 addresses.

> Just because some people got away with the stuff they do on the Internet
> because their ISP changes their IPv4 address every so and then does
> not mean that dynamic IPv4 address provides *any* privacy.

A dynamic address (IPv4 or IPv6) generally provides *some* privacy
above a static one.  Not a lot, especially not without taking other
measures, but some.

> The whole point of IPv6 is to give the choice not to use external
> providers - you become part of the "cloud", not just a dumb consumer.

I didn't realize that was the whole point of IPv6.

In any case, I'd say most Internet users *want* to be treated as a
dumb consumer, and not become part of the cloud.

Yes, there's a small portion of the population that wants to run their
own webserver and own email server and maintain an always on computer,
constantly updated with the latest security fixes, sitting in their
DMZ.  But not more than 4,294,967,296 of them.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-09 Thread Anthony
On Sat, Jun 9, 2012 at 4:29 PM, Anthony  wrote:
> On Fri, Jun 8, 2012 at 9:59 AM, Strainu  wrote:
>> 2012/6/8 Anthony :
>>> No one has to break the loop.  The loop will break itself.  Either
>>> enough people will get sick of NAT to cause demand for IPv6, or they
>>> won't.
>>
>> That one way of seeing things, but I fear it's a bit simplistic and
>> naive. People won't "get sick of NAT", since most of them don't know
>> what NAT is anyway. They'll just notice that "the speed sucks" or that
>> they can't edit Wikipedia because their public IP was blocked. But
>> they won't know IPv6 is (part of) the solution unless someone tells
>> them to, by events like the IPv6 day.
>
> Or by the ISP which provides IPv6 advertising those faster speeds or
> decreased privacy.

Here at BestISP, we assign you a unique number that you can never
change!  We attach this unique number to all your Internet
communications, so that every time you go back to a website, that
website knows they're dealing with the same person.

Switch to BestISP!  1% faster communications, and the increased
ability for websites to track you!

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-09 Thread Anthony
On Fri, Jun 8, 2012 at 9:59 AM, Strainu  wrote:
> 2012/6/8 Anthony :
>> No one has to break the loop.  The loop will break itself.  Either
>> enough people will get sick of NAT to cause demand for IPv6, or they
>> won't.
>
> That one way of seeing things, but I fear it's a bit simplistic and
> naive. People won't "get sick of NAT", since most of them don't know
> what NAT is anyway. They'll just notice that "the speed sucks" or that
> they can't edit Wikipedia because their public IP was blocked. But
> they won't know IPv6 is (part of) the solution unless someone tells
> them to, by events like the IPv6 day.

Or by the ISP which provides IPv6 advertising those faster speeds or
decreased privacy.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-09 Thread Anthony
On Sat, Jun 9, 2012 at 7:51 AM, Daniel Friesen
 wrote:
> On Fri, 08 Jun 2012 03:49:01 -0700, Risker  wrote:
>> Do this now, please.  Even I can see how easy it ought to be to replace
>> the last
>> three digits of an IPv4 address with XXX in publicly viewable lists
>> and logsand reduce the publicly visible IPv6 string to its first three
>> segments.
>
>
> It's not. This is not something simple to do technically.

When someone edits without being logged in, automatically log them in
under a newly created username "X.Y.Z.xxx #N", where N is just the
lowest number which creates a unique name.

(Of course, if you're going to do that, just abandon the whole IP
address thing altogether.  When a user who is not logged in gets to
the edit screen, there are two extra fields: username and password.
Username is pre-filled with "Random User #", where  is
a random not-yet-used number.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Update on IPv6

2012-06-08 Thread Anthony
On Fri, Jun 8, 2012 at 4:08 AM, Strainu  wrote:
> Risker, I think you're over-reacting here. Yes, there are risks
> associated with IPv6. No, they haven't been addressed completely
> before IPv6 day (apparently because of the very late moment the
> decision to participate was taken). But it hasn't destroyed the
> projects so far and chances are, by the time IPv6 vandalism will have
> any significant effect, they will be solved (estimates are that 50% of
> the Internet users will have IPv6 only in 6 years [1]).

You seem to be assuming that vandals will switch to IPv6 at the same
rate as non-vandals.

An analogous assumption, which has proven to be false, would be that
vandals would use anonymizing proxies at the same rate as non-vandals.

> If there is little content available on IPv6, people will
> not even be aware it exists and they will not demand it from their
> ISP, which means there will be no users for IPv6 content making it
> useless and the loop will continue. Someone had to break this loop and
> the content providers were the easiest place this could happen.

No one has to break the loop.  The loop will break itself.  Either
enough people will get sick of NAT to cause demand for IPv6, or they
won't.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Wikimedia-l] Update on IPv6

2012-06-02 Thread Anthony
On Sat, Jun 2, 2012 at 9:59 AM, Leslie Carr  wrote:
> On Sat, Jun 2, 2012 at 6:13 AM, Anthony  wrote:
>> On Sat, Jun 2, 2012 at 8:49 AM, Thomas Dalton  
>> wrote:
>>> On 2 June 2012 13:44, Anthony  wrote:
>>>> On Fri, Jun 1, 2012 at 7:27 PM, John Du Hart  wrote:
>>>>> What personal information do you think is contained in an IPv6 address?
>>>>
>>>> Don't they sometimes contain MAC address information?
>>>
>>> I don't know, but I wouldn't consider my MAC address to be personal
>>> information... you might be able to work out what brand of computer
>>> I'm using, but I can live with that.
>
> I think that having a problem with the implementation of IPv6 is about
> 10 years too late now ;)

The problem isn't with IPv6.  The problem is with the way WMF uses IP addresses.

Of course, it's about 10 years too late for that too.  :)

> If someone cares about their mac address information, they can use
> privacy extensions - http://en.wikipedia.org/wiki/Ipv6#Privacy .

I agree.  Though it would probably be a good idea to warn people about
the problem, before publishing their address for the world to see.  A
sentence or two added to the IP address warning which already appears
would probably put things on par with IPv4 addresses.

Personally II think WMF is far too loose about IP addresses in the
first place.  But as I said above, it's about 10 years too late for
that.

---

http://csrc.nist.gov/publications/nistpubs/800-122/sp800-122.pdf

Page 2-2

"The following list contains examples of information that may be
considered PII."

"Asset information, such as Internet Protocol (IP) or Media Access
Control (MAC) address or other host-specific persistent static
identifier that consistently links to a particular person or small,
well-defined group of people"

"Information identifying personally owned property, such as vehicle
registration number or title
number and related information"

Granted, it only says "may be considered" PII.  Certainly seems
definitive to me, though.

And note, of course, that IPv4 addresses also may be considered PII.
IPv6 addresses are just sometimes more likely to be
persistent/static/consistent, and often link to a smaller, more
well-defined group of people.  But then, see above, as IPv6 addresses
sometimes are more anonymous than IPv4 addresses.  It all depends on
the implementation.

Anyway, I do think MAC addresses are certainly (in the vast majority
of cases), PII.  That IPv6 addresses are often PII.  And that IPv4
addresses are often PII.  I don't think IPv6 addresses are
particularly more likely to be PII than IPv4 addresses.  So,
basically, I think the privacy concern specifically about IPv6 is
mostly misplaced.  But it would be nice to readdress the privacy
concerns over IP addresses in general.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-20 Thread Anthony
Thanks for the explanation.  I guess I see what you're getting at now.
 Sorry I didn't see it sooner.

On Tue, Sep 20, 2011 at 8:50 PM, Brion Vibber  wrote:
> On Tue, Sep 20, 2011 at 5:36 PM, Anthony  wrote:
>
>> On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon  wrote:
>> > It may or may not be an architecturally-better design to have it as a
>> > separate table, although considering how rapidly MW's 'architecture'
>> changes
>> > I'd say keeping things as simple as possible is probably a virtue.  But
>> that
>> > is the basis on which we should be deciding it.
>>
>> It's an intentional denormalization of the database done apparently
>> for performance reasons (although, I still can't figure out exactly
>> *why* it's being done as it still seems to be useful only for the dump
>> system, and therefore should be part of the dump system, not part of
>> mediawiki proper).  It doesn't even seem to apply to "normal", i.e.
>> non-Wikimedia, installations.
>>
>
> 1) Those dumps are generated by MediaWiki from MediaWiki's database -- try
> Special:Export on the web UI, some API methods, and the dumpBackup.php maint
> script family.
>
> 2) Checksums would be of fairly obvious benefit to verifying text storage
> integrity within MediaWiki's own databases (though perhaps best sitting on
> or keyed to the text table...?) Default installs tend to use simple
> plain-text or gzipped storage, but big installs like Wikimedia's sites (and
> not necessarily just us!) optimize storage space by batch-compressing
> multiple text nodes into a local or remote blobs table.
>
>
>> On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon 
>> wrote:
>> > This is a big project which still retains enthusiasm because we recognise
>> > that it has equally big potential to provide interesting new features far
>> > beyond the immediate usecases we can construct now (dump validation and
>> > 'something to do with reversions').
>>
>> Can you explain how it's going to help with dump validation?  It seems
>> to me that further denormalizing the database is only going to
>> *increase* these sorts of problems.
>>
>
> You'd be able to confirm that the text in an XML dump, or accessible through
> the wiki directly, matches what the database thinks it contains -- and that
> a given revision hasn't been corrupted by some funky series of accidents in
> XML dump recycling or External Storage recompression.
>
> IMO that's about the only thing it's really useful for; detecting
> non-obviously-performed reversions seems like an edge case that's not worth
> optimizing for, since it would fail to handle lots of cases like reverting
> partial edits (say an "undo" of a section edit where there are other
> intermediary edits -- since the other parts of the page text are not
> identical, you won't get a match on the checksum).
>
> -- brion
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-20 Thread Anthony
On Tue, Sep 20, 2011 at 3:37 PM, Happy Melon  wrote:
> It may or may not be an architecturally-better design to have it as a
> separate table, although considering how rapidly MW's 'architecture' changes
> I'd say keeping things as simple as possible is probably a virtue.  But that
> is the basis on which we should be deciding it.

It's an intentional denormalization of the database done apparently
for performance reasons (although, I still can't figure out exactly
*why* it's being done as it still seems to be useful only for the dump
system, and therefore should be part of the dump system, not part of
mediawiki proper).  It doesn't even seem to apply to "normal", i.e.
non-Wikimedia, installations.

On Tue, Sep 20, 2011 at 4:45 PM, Happy Melon  wrote:
> This is a big project which still retains enthusiasm because we recognise
> that it has equally big potential to provide interesting new features far
> beyond the immediate usecases we can construct now (dump validation and
> 'something to do with reversions').

Can you explain how it's going to help with dump validation?  It seems
to me that further denormalizing the database is only going to
*increase* these sorts of problems.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-20 Thread Anthony
On Tue, Sep 20, 2011 at 9:34 AM, Domas Mituzas  wrote:
>>
>> Ah, okay.  I remember that's what happened in MyISAM but I figured
>> they had that fixed in InnoDB.
>
> InnoDB has optimized path for index builds, not for schema changes.

No support for built-in function-based indexes, right?  (I searched a
bit and couldn't find any.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-20 Thread Anthony
On Mon, Sep 19, 2011 at 10:39 PM, Daniel Friesen
 wrote:
> On 11-09-19 06:39 PM, Anthony wrote:
>> On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber  wrote:
>>> That's probably the simplest solution; adding a new empty table will be very
>>> quick. It may make it slower to use the field though, depending on what all
>>> uses/exposes it.
>> Isn't adding a new column with all NULL values quick too?
> Apparently in InnoDB a table ALTER requires an entire copy of the table
> to do. In other words to do a table alter every box doing it needs to be
> able to hold the entire Wikipedia revision table twice to add a new column.

Ah, okay.  I remember that's what happened in MyISAM but I figured
they had that fixed in InnoDB.

On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber  wrote:
> During stub dump generation for instance this would need to add a left outer
> join on the other table, and add things to the dump output (and also needs
> an update to the XML schema for the dump format). This would then need to be
> preserved through subsequent dump passes as well.

Doesn't the stub dump generation computer have its own database?  I
still don't see the point of putting all this extra work on the master
database in order to maintain a function-based index which is only
being used for dumps.

The dump generation computer should have its own database.  For most
of the tables/dumps (probably all but the full-text ones), you could
even use sqlite and offer that as a download option for those who
aren't stuck in 1998.

On Tue, Sep 20, 2011 at 3:23 AM, Daniel Friesen
 wrote:
> On 11-09-19 11:43 PM, Domas Mituzas wrote:
>> so, what are the use cases and how does one index for them? is it global 
>> hash check, per page? etc

> One use case I know if is this bug:
> https://bugzilla.wikimedia.org/show_bug.cgi?id=2939

Calculating and storing checksums on every revision in the database is
way overkill for that.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-19 Thread Anthony
On Mon, Sep 19, 2011 at 3:57 PM, Brion Vibber  wrote:
> That's probably the simplest solution; adding a new empty table will be very
> quick. It may make it slower to use the field though, depending on what all
> uses/exposes it.

Isn't adding a new column with all NULL values quick too?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 7:20 PM, Anthony  wrote:
> On Sun, Sep 18, 2011 at 7:07 PM, bawolff  wrote:
>> Anthony wrote:
>> The pages you link to seem to indicate he's nothing more than a
>> willy-on-wheels type vandal, who at worst tricked an admin into doing
>> a delete of a page with a very high number of revisions making the
>> server kittens cry for a moment. There's no indication he has "mad
>> hacker skillz" in any way or form (and given the tone of that
>> Encyclopedia Dramatica page, I assume they'd be bragging about it if
>> he did).
>
> As I said, I couldn't find a page which described it in detail.  Maybe
> if you look at archive.org?

By the way, my comment about "mad hacker skillz" was meant to be
sarcastic.  The term "script kiddie" is probably more accurate.

I don't know how the person did it.  I don't know whether they were
*the* Grawp or just a copycat.  I don't know if they found a parsing
bug, or they found a backdoor through a default password, or if they
hacked my account password (*).  I even don't know if it was
javascript or style sheets or gabagool or whatever the hell.  All I
know is that he fucked up my site so bad I didn't know how to fix it
(other than restoring the database, which I didn't feel like doing).
I asked someone to take a look at the site, and he said I was attacked
by Grawp and I needed to upgrade my Mediawiki.  At that point I said
"fuck it, I'm just going to host a few pages at Knol, and just take
down the rest".

(*) I believe it was the former, though, because when I looked at the
database the page edits were made by a regular user, not by me, and
not by a special account.

And all of this is irrelevant.  Generating an MD5 collision does not
in any way involve "mad hacker skillz".

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 7:07 PM, bawolff  wrote:
> Anthony wrote:
>> It does not involve generating hash collisions, but it involves
>> finding various bugs in mediawiki and using them to vandalise, often
>> by injecting javascript.  The best description I could find was at
>> Encyclopedia Dramatica, which seems to be taken down (there's a cache
>> if you do a google search for "grawp wikipedia").  There's also a
>> description at http://en.wikipedia.org/wiki/User:Grawp , which does
>> not do justice to the "mad hacker skillz" of this individual and his
>> intent on finding bugs in mediawiki and exploiting them.
>>
>
> Say what? Being able to inject js is a very serious vulnerability. If
> he's doing this, why haven't I seen any security releases triggered by
> a vandal finding an XSS? has no one reported it?

I have no idea.  How long have you been reading the release notes?
This was a few years ago that this happened to me, and the software I
was using was probably a year or two old.

I didn't investigate into the details of the bug.  I didn't have the
time to do that, which is why I just took the site down rather than
bother.

> The pages you link to seem to indicate he's nothing more than a
> willy-on-wheels type vandal, who at worst tricked an admin into doing
> a delete of a page with a very high number of revisions making the
> server kittens cry for a moment. There's no indication he has "mad
> hacker skillz" in any way or form (and given the tone of that
> Encyclopedia Dramatica page, I assume they'd be bragging about it if
> he did).

As I said, I couldn't find a page which described it in detail.  Maybe
if you look at archive.org?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Fwd: Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 6:00 PM, Roan Kattouw  wrote:
> On Sun, Sep 18, 2011 at 11:00 PM, Anthony  wrote:
>> Now I don't know how important the CPU differences in calculating the
>> two versions would be.  If they're significant enough, then fine, use
>> MD5, but make sure there are warnings all over the place about its
>> use.
>>
> I ran some benchmarks on one of the WMF machines. The input I used is
> a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
> upload to Commons recently. For each benchmark, I hashed the file 25
> times and computed the average running time.
>
> MD5: 393 ms
> SHA-1: 404 ms
> SHA-256: 1281 ms

Did you try any of the non-secure hash functions?  If you're going to
go with MD5, might as well go with the significantly faster CRC-64.

If you're just using it to detect reverts, then you can run the CRC-64
check first, and then confirm with a check of the entire message.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 6:01 PM, Anthony  wrote:
> There's also a
> description at http://en.wikipedia.org/wiki/User:Grawp , which does
> not do justice to the "mad hacker skillz" of this individual and his
> intent on finding bugs in mediawiki and exploiting them.

(and/or the Grawp copycats - personally I don't know if it was "Grawp"
himself or a copycat that attacked my wiki)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 5:50 PM, Chad  wrote:
> On Sun, Sep 18, 2011 at 5:47 PM, Anthony  wrote:
>> On Sun, Sep 18, 2011 at 5:30 PM, Chad  wrote:
>>> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
>>>  wrote:
>>>> It is meaningless to talk about cryptography without a threat model, just 
>>>> as Robert says. Is anybody actually attacking us? Or are we worried about 
>>>> accidental collisions?
>>>>
>>>
>>> I believe it began as accidental collisions, then everyone promptly
>>> put on their tinfoil hats and started talking about a hypothetical
>>> vandal who has the time and desire to generate hash collisions.
>>
>> Having run a wiki which I eventually abandoned due to various "Grawp
>> attacks", I can assure you that there's nothing hypothetical about it.
>>
>
> For those of us who do not know...what the heck is a Grawp attack?
> Does it involve generating hash collisions?

It does not involve generating hash collisions, but it involves
finding various bugs in mediawiki and using them to vandalise, often
by injecting javascript.  The best description I could find was at
Encyclopedia Dramatica, which seems to be taken down (there's a cache
if you do a google search for "grawp wikipedia").  There's also a
description at http://en.wikipedia.org/wiki/User:Grawp , which does
not do justice to the "mad hacker skillz" of this individual and his
intent on finding bugs in mediawiki and exploiting them.

If you did something as lame as relying on no one generating an MD5
collision (*), it would happen.  If you use SHA-1, it may or may not
happen, depending on how quickly computers get faster, and how many
further attacks are made on the algorithm.  If you use SHA-256 (**),
it's significantly less likely to happen, and you'll probably have a
warning in the form of an announcement on Slashdot that SHA-256 has
been broken, before it happens.

(*) Something which I have done myself on my home computer in a couple
minutes, and apparently now can be done in a couple seconds.

(**) Which, incidentally, is possibly the single most secure hash for
Wikimedia to use at the current time.  SHA-512 is significantly more
"broken" than SHA-256, and the more theoretically secure hashes have
received much less scrutiny than SHA-256.  If you want to be more
secure than SHA-256, you should combine SHA-256 with some other
hashing algorithm.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 5:30 PM, Chad  wrote:
> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
>  wrote:
>> It is meaningless to talk about cryptography without a threat model, just as 
>> Robert says. Is anybody actually attacking us? Or are we worried about 
>> accidental collisions?
>>
>
> I believe it began as accidental collisions, then everyone promptly
> put on their tinfoil hats and started talking about a hypothetical
> vandal who has the time and desire to generate hash collisions.

Having run a wiki which I eventually abandoned due to various "Grawp
attacks", I can assure you that there's nothing hypothetical about it.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Fwd: Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 1:55 AM, Robert Rohde  wrote:
> If collision attacks really matter we should use SHA-1.

If collision attacks really matter you should use, at least, SHA-256, no?

> However, do
> any of the proposed use cases care about whether someone might
> intentionally inject a collision?  In the proposed uses I've looked at
> it, it seems irrelevant.  The intentional collision will get flagged
> as a revert and the text leading to that collision would be discarded.
>  How is that a bad thing?

Well, what if the checksum of the initial page hasn't been calculated
yet?  Then some miscreant sets the page to spam which collides, and
then the spam gets reverted.  The good page would be the one that gets
thrown out.

Maybe that's not feasible.  Maybe it is.  Either way, I'd feel very
uncomfortable about the fact that someday someone might decide to use
the checksums in some way in which collisions would matter.

Now I don't know how important the CPU differences in calculating the
two versions would be.  If they're significant enough, then fine, use
MD5, but make sure there are warnings all over the place about its
use.

(As another possibility, what if someone writes a bot to detect
certain reverts?  I can see spammers/vandals having a field day with
this sort of thing.)

>> For offline analyses, there's no need to change the online database tables.
>
> Need?  That's debatable, but one of the major motivators is the desire
> to have hash values in database dumps (both for revert checks and for
> checksums on correct data import / export).  Both of those are
> "offline" uses, but it is beneficial to have that information
> precomputed and stored rather than frequently regenerated.

Why not in a separate file?  There's no need to get permission from
anyone or mess with the schema to generate a file with revision ids
and checksums.  If WMF won't host it at the regular dump location
(which I can't see why they wouldn't), you could host it at
archive.org.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 2:33 AM, Ariel T. Glenn  wrote:
> Στις 17-09-2011, ημέρα Σαβ, και ώρα 22:55 -0700, ο/η Robert Rohde
> έγραψε:
>> On Sat, Sep 17, 2011 at 4:56 PM, Anthony  wrote:
>
> 
>
>> > For offline analyses, there's no need to change the online database tables.
>>
>> Need?  That's debatable, but one of the major motivators is the desire
>> to have hash values in database dumps (both for revert checks and for
>> checksums on correct data import / export).  Both of those are
>> "offline" uses, but it is beneficial to have that information
>> precomputed and stored rather than frequently regenerated.
>
> If we don't have it in the online database tables, this defeats the
> purpose of having the value in there at all, for the purpose of
> generating the XML dumps.
>
> Recall that the dumps are generated in two passes; in the first pass we
> retrieve from the db and record all of the metadata about revisions, and
> in the second (time-comsuming) pass we re-use the text of the revisions
> from a previous dump file if the text is in there.  We want to compare
> the has of that text against what the online database says the hash is;
> if they don't match, we want to fetch the live copy.

Well, this is exactly the type of use in which collisions do matter.
Do you really want the dump to not record the correct data when some
miscreant creates an intentional collision?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-18 Thread Anthony
On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
 wrote:
> It is meaningless to talk about cryptography without a threat model, just as 
> Robert says. Is
> anybody actually attacking us?

You mean, like Grawp?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table (discussing r94289)

2011-09-17 Thread Anthony
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde  wrote:
> Is there a good reason to prefer SHA-1?
>
> Both have weaknesses allowing one to construct a collision (with
> considerable effort)

Considerable effort?  I can create an MD5 collision in a few minutes
on my home computer.  Is there anything even remotely like this for
SHA-1?

> MD5 is shorter and in my experience about 25% faster to compute.
>
> Personally I've tended to view MD5 as more than good enough in offline 
> analyses.

For offline analyses, there's no need to change the online database tables.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-06 Thread Anthony
On Fri, May 6, 2011 at 2:37 PM, Chad  wrote:
> On Fri, May 6, 2011 at 2:14 PM, Brion Vibber  wrote:
>> I'd like to respectfully ask that this thread be taken offlist, perhaps to a
>> wiki page or a private thread among those who are interested.
>>
>> There's no active intent to change any licensing right now, and general
>> discussion of software licenses and edge cases is pretty far off topic.
>>
>> Thanks!
>
> I'm going to have to agree with Brion and Bryan here. Please can
> the interested parties take this offlist?

What are we taking offlist, and where offlist are we taking it?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-06 Thread Anthony
On Fri, May 6, 2011 at 2:24 PM, Trevor Parscal  wrote:
> "Dynamic linking" implies we have something to dynamically link in the first
> place. A parser library consisting of compiled PHP in this particular case.
>
> Let's just cross this hypothetical bridge when we come to it, shall we?

I guess, but I'm not sure it'll ever come up.

"Would it be useful to have a library that can convert wikitext to HTML? Yes."

Would it be even more useful to have a standalone program, with
minimal dependencies, that can convert wikitext to HTML?  Hell yeah.

Granted, that's only half the problem.  The other (and much more
difficult) problem is how to convert a *set* of pages (templates and
whatnot) into a single chunk of wikitext, which can then be fed into
the wikitext to HTML parser.  But even without that part it would
still be quite useful to have that standalone wikitext to HTML
converter.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-06 Thread Anthony
On Fri, May 6, 2011 at 10:55 AM, Bryan Tong Minh
 wrote:
> Can we stop discussing this issue? I believe that most MediaWiki
> developers are in fact not interested in changing the status quo with
> regards to licensing, so there is no point in discussing it.

That there isn't going to be a license change is exactly *why* it
needs to be discussed.  If dynamic linking is fine, then a dynamically
linked library is appropriate.  On the other hand, if it isn't (or, at
least, if it's not clear that it is), then something more like gzip
would be better.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-06 Thread Anthony
On Fri, May 6, 2011 at 10:35 AM, Jay Ashworth  wrote:
> Feist v Rural; header files are *factual data*; no creativity there.

I disagree that there is no creativity in a header file.  It's
certainly not an open and shut case.

> None of our opinions matter until there's caselaw, of course, and there
> isn't.

Right.  That's the main point I was making.  The safer option would be
to convert the library into a standalone app (which in this case makes
a lot of a sense anyway), and then just pipe data to/from it (or use
files, or whatever).  Wikitext in, html out.  This would be *much*
simpler than converting Mediawiki in its entirety into a standalone
parser which you could pipe to.  At least, it was last time I tried to
do it, which admittedly was several years ago.  And I really don't see
how you could argue that this creates a derivative work.  It's no
different than piping to/from gzip, and I don't think anyone argues
that *that* creates a derivative work.


On Fri, May 6, 2011 at 10:38 AM, Jay Ashworth  wrote:
>> From: "Anthony" 
>> On Tue, May 3, 2011 at 6:56 PM, Aryeh Gregor
>> > You can always *use* GPLd code however you like.
>>
>> Does "use" include "prepare a derivative work"?
>
> As long as you don't distribute it, sure.  The GPL was, is, and will
> always be *a license to distribute*.  GPL doesn't even forbid you to
> modify and make available as a web app; you need to release under AGPL
> if you want to restrict that.

It doesn't need to forbid it.  It only needs to fail to permit it.
"You may not propagate or modify a covered work except as expressly
provided under this License."  "or modify".

So where does the GPL expressly provide for modifying the program
without licensing the derivative under the GPL?

Yes, there's nothing in the GPL which requires you to release that GPL
derivative to the public.  That's what the AGPL does.  But if one of
your employees or volunteers gets a hold of it and puts it up on a P2P
hosting site, then it's out there, and you're going to have a hell of
a time suing people for copyright infringement for copying your work
which is a derivative of a GPL work and not getting sued yourself for
violating the GPL.

>> > If you want to *distribute* proprietary
>> > (or otherwise GPL-incompatible) code that depends on my volunteer
>> > contributions, I'm happy to tell you to go jump off a bridge.
>>
>> Copyright law gives the author an exclusive right to *prepare*
>> derivative works, not just to *distribute* derivative works. What in
>> the GPL gives you permission to prepare a proprietary derivative work
>> which you do not distribute?
>
> Citation?  Note that such a citation must take into account whether
> there's any *use* in so doing in a non-computer-code environment.

A citation for what?  The fact that copyright law recognizes an
exclusive right of an author to *prepare* derivative works?  Title 17,
Section 106(2) of the US Code
(http://www.law.cornell.edu/uscode/html/uscode17/usc_sec_17_0106000-.html).

The other half of my statement was a question.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-06 Thread Anthony
On Tue, May 3, 2011 at 6:56 PM, Aryeh Gregor
 wrote:
> You can still link it with proprietary code as long as you don't
> distribute the result, so it would be fine for research projects or
> similar that rely on proprietary components.

What happens if one of your employees or volunteers distributes the result?

> You can always *use* GPLd code however you like.

Does "use" include "prepare a derivative work"?

> If you want to *distribute* proprietary
> (or otherwise GPL-incompatible) code that depends on my volunteer
> contributions, I'm happy to tell you to go jump off a bridge.

Copyright law gives the author an exclusive right to *prepare*
derivative works, not just to *distribute* derivative works.  What in
the GPL gives you permission to prepare a proprietary derivative work
which you do not distribute?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-06 Thread Anthony
On Fri, May 6, 2011 at 9:41 AM, Jay Ashworth  wrote:
> - Original Message -
>> From: "Anthony" 
>
>> On Wed, May 4, 2011 at 6:57 PM, Aryeh Gregor
>>  wrote:
>> > Linking has no special status in the GPL -- it's just a question of
>> > what legally constitutes a derivative work. If a C program that
>> > dynamically links to a library is legally a derivative work of that
>> > library,
>>
>> It isn't. A C program which *contains* a library is legally a
>> derivative work of that library.
>
> Static linking fits that description.  Dynamic linking -- through the
> FSF would really like it to -- does not.

I'm not sure if that's true or not.  There's certainly an argument to
be made that dynamic linking creates a derivative work *at the time it
is linked*.  Also, there's an even stronger argument that using the
GPL header files to compile the unlinked program creates a derivative
work.  (If you want to reverse engineer the header files then you can
get around that problem, but that's a lot of extra work, and in most
cases, such as this one, you might as well convert the library into a
standalone program that can be used via a pipe.)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-05 Thread Anthony
On Tue, May 3, 2011 at 11:55 PM, Jay Ashworth  wrote:
> The reasons why many programmers prefer GPL to BSD -- to keep the work
> they've invested long hours in for free from being submerged in someone's
> commercial project with no recompense to them -- which GPL forbids and
> BSD does not -- is widely understoood.

GPL forbids use in a commercial project?  Huh?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-05 Thread Anthony
On Wed, May 4, 2011 at 6:57 PM, Aryeh Gregor
 wrote:
> Linking has no special status in the GPL -- it's just a question of
> what legally constitutes a derivative work.  If a C program that
> dynamically links to a library is legally a derivative work of that
> library,

It isn't.  A C program which *contains* a library is legally a
derivative work of that library.

> a PHP program that dynamically calls functions from another
> PHP program is almost surely a derivative work too.  The decision
> would be made by a judge, who wouldn't have the faintest idea of the
> technical details and therefore would only care about the general
> effect.

Galoob v. Nintendo:  "the infringing work must incorporate a portion
of the copyrighted work in some form."

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] strange page id numbering

2011-02-15 Thread Anthony Ventresque (Dr)

> On Tue, Feb 15, 2011 at 12:58 PM, Q  wrote:
> > On 2/15/2011 11:34 AM, Anthony Ventresque (Dr) wrote:
> >> Wikipedia... is that a relevant answer to your remark?
> >
> > There's about 284 of those, you'll have to be a bit more specific.
>
> Anyone who says "Wikipedia", in English, in a context that makes it
> clear they're referring to a specific site, is referring to the
> English Wikipedia.  Many English Wikipedia users are probably only
> vaguely aware that Wikipedias in other languages exist, and most
> probably haven't looked at them.  Insisting on precise terminology
> when dealing with users is not helpful or reasonable, because you
> can't expect them to know the terminology.

It's indeed English Wikipedia.


CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] strange page id numbering

2011-02-15 Thread Anthony Ventresque (Dr)

> On Tue, Feb 15, 2011 at 9:29 AM, Anthony Ventresque (Dr)
>  wrote:
> > I was indeed suspecting something like that, but the difference in number 
> > of pages is large while we are talking about a relatively short delay 
> > (minutes?).
>
> Depending on what site you're talking about.

Wikipedia... is that a relevant answer to your remark?

Anthony

CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] strange page id numbering

2011-02-15 Thread Anthony Ventresque (Dr)
> Anthony Ventresque (Dr) wrote:
> > Hi,
> >
> >
> > I've found something strange in some files. The maximum ids for a page are:
> > latest
> >
> > pages-articles.xml: 29189922
> > page.sql:   28707562
> > categorylinks.sql:  28705949
> > (15,684 categories and 135,521 articles are missing)
> >
> > 2011-01-15
> > pages-articles.xml: 30492297
> > page.sql:   30480288
> > categorylinks.sql:  30479519
> >
> > Any idea why these numbers are different?
> >
> > Thanks for your help,
> > Anthony
>
> The pages-articles dump will have started a bit after page.sql has been
> dumped.
> The different files on the downloads page are not synchronised.
>

Thanks for the quick reply.

I was indeed suspecting something like that, but the difference in number of 
pages is large while we are talking about a relatively short delay (minutes?).

Anthony

CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] strange page id numbering

2011-02-14 Thread Anthony Ventresque (Dr)
Hi,


I've found something strange in some files. The maximum ids for a page are:

latest

pages-articles.xml: 29189922

page.sql:   28707562

categorylinks.sql:  28705949
(15,684 categories and 135,521 articles are missing)

2011-01-15

pages-articles.xml: 30492297

page.sql:   30480288

categorylinks.sql:  30479519

Any idea why these numbers are different?

Thanks for your help,
Anthony

CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] categorisation issues in dumps

2011-02-14 Thread Anthony Ventresque (Dr)
Thanks for your help, it indeed works.

From: wikitech-l-boun...@lists.wikimedia.org 
[wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Platonides 
[platoni...@gmail.com]
Sent: 09 February 2011 05:32
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] categorisation issues in dumps

Anthony Ventresque (Dr) wrote:
> Hi,
>
> I am trying to build an offline version of the wikipedia categorisation tree. 
> As usual with projects on wikipedia, I've downloaded dumps (actually the 
> interesting one here is pages-articles.xml). And I found that none of the 
> dumps has the relation between  "Category:1960_works" and "Category:1960" 
> which is present on the web page. And it is the same for a lot of categories 
> I tried: many links are missing in the dump, but are present in the web. Any 
> idea why is that so?
>
> Thanks for your help,
> Anthony

Using page.sql.gz and categorylinks.sql.gz would be more efficient for
your task.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] categorisation issues in dumps

2011-02-07 Thread Anthony Ventresque (Dr)
Hi,

I am trying to build an offline version of the wikipedia categorisation tree. 
As usual with projects on wikipedia, I've downloaded dumps (actually the 
interesting one here is pages-articles.xml). And I found that none of the dumps 
has the relation between  "Category:1960_works" and "Category:1960" which is 
present on the web page. And it is the same for a lot of categories I tried: 
many links are missing in the dump, but are present in the web. Any idea why is 
that so?

Thanks for your help,
Anthony

CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WMF and IPv6

2011-02-03 Thread Anthony
On Thu, Feb 3, 2011 at 5:29 PM, Anthony  wrote:
> But, "supports IPv6" could be as simple as having an http proxy server
> which sends (fake) IPv6 XFF headers.
>
> By fake, I mean that there's not even a need for the client to
> actually use that IPv6 address, so long as each user/session gets a
> different IP within a block controlled by that ISP.

And as an added bonus by using these proxies they can be more easily
tracked for corporate marketing and government surveillance purposes!

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WMF and IPv6

2011-02-03 Thread Anthony
On Thu, Feb 3, 2011 at 5:20 PM, River Tarnell  wrote:
> In article ,
> Martijn Hoekstra   wrote:
>>So what are exactly the implications for blocking and related issues
>>when we will start to see ISP level NATing?
>
> Users will either need to move to an ISP that supports IPv6, or accept
> that they will be frequently blocked on Wikipedia for no reason.

But, "supports IPv6" could be as simple as having an http proxy server
which sends (fake) IPv6 XFF headers.

By fake, I mean that there's not even a need for the client to
actually use that IPv6 address, so long as each user/session gets a
different IP within a block controlled by that ISP.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WMF and IPv6

2011-02-03 Thread Anthony
On Thu, Feb 3, 2011 at 5:10 PM, River Tarnell  wrote:
> In article ,
> Anthony   wrote:
>>Is there a standard for using IPv6 inside X-Forwarded-For headers?
>
> There is no standard for X-Forwarded-For at all.

Not even a de-facto one?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WMF and IPv6

2011-02-03 Thread Anthony
On Thu, Feb 3, 2011 at 4:45 PM, Brion Vibber  wrote:
> Front-end proxies need to speak IPv6 to the outside world so they can accept
> connections from IPv6 clients, add the clients' IPv6 addresses to the HTTP
> X-Forwarded-For header which gets passed to the Apaches, and then return the
> response body back to the client.

Interesting.  Is there a standard for using IPv6 inside
X-Forwarded-For headers?  I would think you'd need a new header
altogether.

(Yes, this is just used internally so it doesn't matter, but I'm still curious.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-21 Thread Anthony
On Fri, Jan 21, 2011 at 6:48 AM, Aryeh Gregor
 wrote:
> Not to mention, the text table is immutable,
> so creating and publishing text table dumps incrementally should be
> trivial.

The problem there is deletion and oversight.  The best solution if you
didn't have to worry about that would be to have a database on the
dump servers with only public data, which accesses a live feed (over
the LAN).  Then creating a dump would be as simple as pg_dump, and
fancier incremental dumps could be made relatively simply as well.

Then again, if your live feed tells you which revisions to
delete/oversight, that's still a viable solution.

> On Thu, Jan 20, 2011 at 4:04 AM, Anthony  wrote:
>> It wouldn't be trivial, but it wouldn't be particularly hard either.
>> Most of the work is already being done.  It's just being done
>> inefficiently.
>
> I'm glad to see you know what you're talking about here.  Presumably
> you've examined the relevant code closely and determined exactly how
> you'd implement the necessary changes in order to evaluate the
> difficulty.  Needless to say, patches are welcome.

Access to the servers is welcome.  I can't possibly test and improve
performance without it.

Alternatively, give me a free live feed, and I'll make a decent dump
system here at home, and provide the source code when I'm done.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-19 Thread Anthony
On Wed, Jan 19, 2011 at 7:49 PM, Happy-melon  wrote:
> "Anthony"  wrote in message
> news:AANLkTi=uk+uf3y_b+zld57wcfuef_7rf-bt8tnvtg...@mail.gmail.com...
>> No, that's not the question.  The question is why are you
>> uncompressing and undiffing (from DiffHistoryBlobs) only to recompress
>> (to bz2) and then uncompress and recompress (to 7z) when you can get
>> roughly the same compression by just extracting the blobs and removing
>> any non-public data.
>
> That's probably not nearly as straightforward as it sounds.

I have no idea how straightforward it sounds, so I won't argue with that.

> RevDel'd and
> suppressed revisions are not removed from the text storage; even Oversighted
> revisions are left there, only the entry in the revision table is removed or
> altered.  I don't know OTTOMH how regularly the DiffHistoryBlob system
> stores a 'key frame', and how easy it would be to break diff chains in order
> to snip out non-public data from them, but I'd guess a) not very, and b)
> that the current code doesn't give any consideration to doing so because
> there's no reason for it to do so.  So refactoring it to incorporate that,
> while not impossible, is a non-trivial amount of work.

It wouldn't be trivial, but it wouldn't be particularly hard either.
Most of the work is already being done.  It's just being done
inefficiently.

On Wed, Jan 19, 2011 at 7:49 PM, Happy-melon  wrote:
>> And there are lots of lower-priority things that are being done.  And
>> lots of dollars sitting on the sidelines doing nothing.
>
> Low-priority interesting things tend to get done when you have volunteers
> doing them.  While the value of some of the Foundation's expenditure is
> commonly debated, I think you'd struggle to argue that many of the WMF's
> dollars are not doing *anything*.

Last I checked there were millions of them sitting in the bank.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-19 Thread Anthony
On Wed, Jan 19, 2011 at 3:33 AM, Aryeh Gregor
 wrote:
> On Wed, Jan 19, 2011 at 3:59 AM, Anthony  wrote:
>> Why isn't this being used for the dumps?
>
> Well, the relevant code is totally unrelated, so the question is sort
> of a non sequitur.

No, the question is why the relevant code is totally unrelated.
Specifically, I'm talking about the full history dumps.

> If you mean "Why don't we have incremental dumps?"

No, that's not the question.  The question is why are you
uncompressing and undiffing (from DiffHistoryBlobs) only to recompress
(to bz2) and then uncompress and recompress (to 7z) when you can get
roughly the same compression by just extracting the blobs and removing
any non-public data.  Or, if it's easier, continue to uncompress (in
gz) and undiff then rediff and recompress (in gz), as that will be
much much faster than compressing in bz2.

You'll also wind up with a full history dump which is *much* easier to
work with.  Yes, you'll break backward compatibility, but considering
that the English full history dump never finishes, even if you just
implemented it for that one it'd be better than the present, which is
to have nothing.

> I'm assuming the answer
> is (as usual in software development) that there are higher-priority
> things to do right now.

And there are lots of lower-priority things that are being done.  And
lots of dollars sitting on the sidelines doing nothing.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-18 Thread Anthony
On Tue, Jan 18, 2011 at 7:21 PM, Aryeh Gregor
 wrote:
> On Mon, Jan 17, 2011 at 9:12 PM, Roan Kattouw  wrote:
>> Wikimedia doesn't technically use delta compression. It concatenates a
>> couple dozen adjacent revisions of the same page and compresses that
>> (with gzip?), achieving very good compression ratios because there is
>> a huge amount of duplication in, say, 20 adjacent revisions of
>> [[Barack Obama]] (small changes to a large page, probably a few
>> identical versions to due vandalism reverts, etc.).
>
> We used to do this, but the problem was that many articles are much
> larger than the compression window of typical compression algorithms,
> so the redundancy between adjacent revisions wasn't helping
> compression except for short articles.  Tim wrote a diff-based history
> storage method (see DiffHistoryBlob in includes/HistoryBlob.php) and
> deployed it on Wikimedia, for 93% space savings:
>
> http://lists.wikimedia.org/pipermail/wikitech-l/2010-March/047231.html

Why isn't this being used for the dumps?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-17 Thread Anthony
On Mon, Jan 17, 2011 at 12:41 PM, Anthony  wrote:
> And to recognize what's going on when a sentence changes *and* is
> moved from one paragraph to another, requires an even greater level of
> natural language understanding.  Again though, you can probably get it
> right most of the time without too much effort.

Or at the paragraph level, when two paragraphs are combined into one
(vs. one paragraph being deleted), or one paragraph is split into two
(vs. one paragraph being added), or any of the various other, more
complicated changes that take place.

If you want a high level of accuracy when trying to determine who
added a particular fact (such as "Overall, the city is relatively
flat", which may have started out as "Paris, in general, contains very
few changes in elevation"), you really need to combine automated tools
with human understanding.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-17 Thread Anthony
On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo  wrote:
> 2011/1/17 Bryan Tong Minh 
>
>>
>> Difficult, but doable. Jan-Paul's sentence-level editing tool is able
>> to make the distinction. It would perhaps be possible to use that as a
>> framework for sentence-level diffs.
>>
>
> Difficult, but diff between versions of a page does it. Looking at diff
> between pages, I simply thought firmly that only diff paragraphs were
> stored, so that the page was built as updated diff segments. I had no idea
> how this could be done, but  all was "magic"!

Paragraphs are much easier to recognize than sentences, as wikitext
has a paragraph delimiter - a blank line.  To truly recognize
sentences, you basically have to engage in natural language
processing, though you can probably get it right 90% of the time
without too much effort.

And to recognize what's going on when a sentence changes *and* is
moved from one paragraph to another, requires an even greater level of
natural language understanding.  Again though, you can probably get it
right most of the time without too much effort.

Wikitext actually makes it easier for the most part, as you can use
tricks such as the fact that the periods in [[I.M. Someone]] don't
represent sentence delimiters, since they are contained in square
brackets.  But not all periods which occur in the middle of a sentence
are contained in square brackets, and not all sentences end with a
period.

I'd say "difficult but doable" is quite accurate, although with the
caveat that even the state of the art tools available today are
probably going to make mistakes that would be obvious to a human.  I'm
sure there are tools for this, and there are probably some decent ones
that are open source.  But it's not as simple as just adding an index.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] From page history to sentence history

2011-01-17 Thread Anthony
On Sun, Jan 16, 2011 at 7:34 PM, Lars Aronsson  wrote:
> Many articles are soo long, and have been edited so many
> times, that the history view is almost useless. If I want
> to find out when and how the sentence "Overall, the city
> is relatively flat"in the article [[en:Paris]] has changed
> over time, I can sit all day and analyze individual diffs.
>
> I think it would be very useful if I could highlight a
> sentence, paragraph or section of an article and get a
> reduced history view with only those edits that changed
> that part of the page. What sorts of indexes would be needed
> to facilitate such a search? Has anybody already implemented
> this as a separate tool?

How would you define a particular sentence, paragraph or section of an
article?  The difficulty of the solution lies in answering that
question.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WYSIFTW status

2011-01-16 Thread Anthony
On Sun, Jan 16, 2011 at 8:24 PM, Krinkle  wrote:
> Op 17 jan 2011, om 02:12 heeft Anthony het volgende geschreven:
>
>> On Sun, Jan 16, 2011 at 7:16 PM, Magnus Manske
>>  wrote:
>>> A quick update on WYSIFTW, my "augmented wikitext" editor. (Please
>>> see
>>> http://meta.wikimedia.org/wiki/WYSIFTW for details.)
>>
>> Shouldn't it be WYSIFWT?
>
> No. The name was WYSI WTF. Now it's WYSI FTW[0].
>
> How is fwt related ?[1]

I thought it stood for wikitext formatted and formatted wikitext.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WYSIFTW status

2011-01-16 Thread Anthony
On Sun, Jan 16, 2011 at 7:16 PM, Magnus Manske
 wrote:
> A quick update on WYSIFTW, my "augmented wikitext" editor. (Please see
> http://meta.wikimedia.org/wiki/WYSIFTW for details.)

Shouldn't it be WYSIFWT?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2011-01-04 Thread Anthony
On Sat, Jan 1, 2011 at 11:46 AM, Ariel T. Glenn  wrote:
> Στις 01-01-2011, ημέρα Σαβ, και ώρα 16:42 +, ο/η David Gerard
> έγραψε:
>> On 31 December 2010 17:09, Ariel T. Glenn  wrote:
>>
>> > I'd like all the dumps from all the projects to be on line.  Being
>> > realistic I think we would wind up keeping offline copies of all of it,
>> > and copies from every 6 months online, with the last several months of
>> > consecutive runs = around 20 or 30 of them also online.
>>
>>
>> Has anyone found anyone at the Internet Archive who answers their
>> email and would be interested in making these available to the world?
>> Sounds just their thing. Unless there's some reason it isn't.
>
> Yes, we know some people at the Archive, I am not sure what they would
> need to arrange however. It's just a matter of having someone upload the
> dumps up there, as someno has done for a few of them in the past...
> unless you are talking about having them grab the dumps every couple
> weeks and put them someplace organized.

Yes, I've asked them before about something similar for another
project, and they told me to just upload it and only contact them once
it was over 100 files or something (I forget the number).  As I recall
they're a pain to upload to, though, unless there was some rsync
access that I was missing or something.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-31 Thread Anthony
File transfer is done.  Thanks for helping with the transfer.

Anthony

On Fri, Dec 31, 2010 at 8:28 AM, Huib Laurens  wrote:
> If it fails i can give you access on others ways, its a dedicated
> server that doesn't have a job right now...
>
> 2010/12/31, Anthony :
>> On Fri, Dec 31, 2010 at 1:47 AM, Huib Laurens  wrote:
>>> Okay, I emailed to Anthony how he can upload it.
>>
>> Transfer is in progress.  ETA about 10 hours.  md5sum is
>> 30c9b48de3ede527289bcdb810126723
>>
>> Hopefully there aren't any problems as I'm not quite sure how to
>> resume upload with ftp.
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
> --
> Verzonden vanaf mijn mobiele apparaat
>
> Regards,
> Huib "Abigor" Laurens
>
>
>
> Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-31 Thread Anthony
On Fri, Dec 31, 2010 at 10:54 AM, Platonides  wrote:
> Would be nice having an additional md5sum file for the uncompressed dumps.

Yes.

Here's what I found on my SATA and USB drives.  I haven't had a chance
to go through my IDE drives - that would take a while as I don't yet
have a decent removable drive bay for IDE.  I'm sure I have a bunch of
old dumps on there, though.

If someone wants to host any/all of this publicly and permanently, I'd
love to help them get copies of the files.  If it's just to store
offline under the control of the WMF, I wouldn't.

20050309_old_table.sql.bz2
en20050309.7z
enwiki-20060125-pages-meta-history.xml.7z
enwiki-20060125-pages-meta-history.xml.7z.note
enwiki-20070402_files.xml
enwiki-20070402_meta.xml
enwiki-20070402-pages.csv
enwiki-20070402-pages-meta-history.xml.7z
enwiki-20070402-pages-meta-history.xml.bz2
enwiki-20070402-pages-meta-history.xml.bz2
enwiki-20070402-pages-meta-history.xml.bz2
enwiki-20070402-pages-meta-history.xml.bz2.info
enwiki-20070402-pages-meta-history.xml.tree.xz
enwiki-20070402-revisions.csv
enwiki-20070908-pages-meta-current.xml.bz2
enwiki-20070908-pages-meta-current.xml.bz2.filepart
enwiki-20070908-pages-meta-history.xml.bz2
enwiki-20070908-pages-meta-history.xml.bz2.filepart
enwiki-20071018-md5sums.txt
enwiki-20071018-pages-meta-history.xml.bz2
enwiki-20080103-abstract.xml.7z
enwiki-20080103-abstract.xml.xz
enwiki-20080103-all-titles-in-ns0.7z
enwiki-20080103-all-titles-in-ns0.gz
enwiki-20080103-categorylinks.sql.7z
enwiki-20080103-categorylinks.sql.gz
enwiki-20080103-externallinks.sql.7z
enwiki-20080103-externallinks.sql.gz
enwiki-20080103_files.xml
enwiki-20080103-imagelinks.sql.7z
enwiki-20080103-imagelinks.sql.gz
enwiki-20080103-image.sql.7z
enwiki-20080103-image.sql.gz
enwiki-20080103-interwiki.sql.7z
enwiki-20080103-interwiki.sql.gz
enwiki-20080103-langlinks.sql.7z
enwiki-20080103-langlinks.sql.gz
enwiki-20080103-logging.sql.7z
enwiki-20080103-logging.sql.gz
enwiki-20080103-logging.sql.gz
enwiki-20080103-md5sums.txt
enwiki-20080103-md5sums.txt
enwiki-20080103_meta.xml
enwiki-20080103-oldimage.sql.7z
enwiki-20080103-oldimage.sql.gz
enwiki-20080312-imagelinks.sql.gz
enwiki-20080312-image.sql.gz
enwiki-20080312-interwiki.sql.gz
enwiki-20080312-langlinks.sql.gz
enwiki-20080312-logging.sql.gz
enwiki-20080312-md5sums.txt
enwiki-20080312-oldimage.sql.gz
enwiki-20080312-pagelinks.sql.gz
enwiki-20080312-page_restrictions.sql.gz
enwiki-20080312-pages-articles.xml.bz2
enwiki-20080312-pages-meta-current.xml.bz2
enwiki-20080312-pages-meta-current.xml.bz2.crcs.bz2
enwiki-20080312-pages-meta-current.xml.bz2.scan.bz2
enwiki-20080312-pages-meta-current.xml.sizes.bz2
enwiki-20080312-pages-meta-history.xml.7z
enwiki-20080312-pages-meta-history.xml.bz2
enwiki-20080312-page.sql.gz
enwiki-20080312-redirect.sql.gz
enwiki-20080312-site_stats.sql.gz
enwiki-20080312-stub-articles.xml.gz
enwiki-20080312-stub-articles.xml.gz
enwiki-20080312-stub-meta-current.xml.gz
enwiki-20080312-stub-meta-history.xml.gz
enwiki-20080312-stub-meta-history.xml.gz
enwiki-20080312-templatelinks.sql.gz
enwiki-20080312-user_groups.sql.gz
enwiki-20080524-abstract.xml
enwiki-20080524-abstract.xml.xz
enwiki-20080524-all-titles-in-ns0.gz
enwiki-20080524-all-titles-in-ns0.gz
enwiki-20080524-categorylinks.sql.gz
enwiki-20080524-categorylinks.sql.gz
enwiki-20080524-externallinks.sql.gz
enwiki-20080524-externallinks.sql.gz
enwiki-20080524-imagelinks.sql.gz
enwiki-20080524-imagelinks.sql.gz
enwiki-20080524-image.sql.gz
enwiki-20080524-image.sql.gz
enwiki-20080524-interwiki.sql.gz
enwiki-20080524-interwiki.sql.gz
enwiki-20080524-langlinks.sql.gz
enwiki-20080524-langlinks.sql.gz
enwiki-20080524-logging.sql.gz
enwiki-20080524-logging.sql.gz
enwiki-20080524-md5sums.txt
enwiki-20080524-md5sums.txt
enwiki-20080524-oldimage.sql.gz
enwiki-20080524-oldimage.sql.gz
enwiki-20080524-pagelinks.sql.gz
enwiki-20080524-pagelinks.sql.gz
enwiki-20080524-page_restrictions.sql.gz
enwiki-20080524-page_restrictions.sql.gz
enwiki-20080524-pages-articles.xml.bz2
enwiki-20080524-pages-articles.xml.bz2
enwiki-20080524-pages-meta-current.xml.bz2
enwiki-20080524-pages-meta-current.xml.bz2
enwiki-20080524-pages-meta-history.xml.bz2
enwiki-20080524-page.sql.gz
enwiki-20080524-page.sql.gz
enwiki-20080524-redirect.sql.gz
enwiki-20080524-redirect.sql.gz
enwiki-20080524-site_stats.sql.gz
enwiki-20080524-site_stats.sql.gz
enwiki-20080524-stub-articles.xml.gz
enwiki-20080524-stub-articles.xml.gz
enwiki-20080524-stub-meta-current.xml.gz
enwiki-20080524-stub-meta-current.xml.gz
enwiki-20080524-stub-meta-history.xml.gz
enwiki-20080524-stub-meta-history.xml.gz
enwiki-20080524-templatelinks.sql.gz
enwiki-20080524-templatelinks.sql.gz
enwiki-20080524-user_groups.sql.gz
enwiki-20080524-user_groups.sql.gz
enwiki-20080714-abstract.xml
enwiki-20080714-abstract.xml.xz
enwiki-20080714-all-titles-in-ns0.gz
enwiki-20080714-all-titles-in-ns0.gz
enwiki-20080714-categorylinks.sql.gz
enwiki-20080714-categorylinks.sql.gz
enwiki-20080714-category.sql

Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-31 Thread Anthony
On Fri, Dec 31, 2010 at 4:08 AM, Ariel T. Glenn  wrote:
> Anthony:
>
> We would like to get copies of any of these dumps as well.  This
> includes any of the other files: stubs, tables, the lot.
>
> If you have them for other languages or other time periods, that would
> be great to know too.  I think we could ship you a disk, or two if
> needed. Contact me off list if you like.
>
> Ariel (looking to fill up dataset2's new arrays ;-) )

I'll work on a list.  Are these going to be hosted somewhere?  It
would be nice for me to have an offsite backup.  Then I'd feel more
comfortable tossing the bz2 files once I've recompressed them to xz.

I should mention that these were collected from all over the Internet,
and while I've made some effort to make sure they are the same files
originally distributed, it's possible that some of them are corrupted
or even fraudulent.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-31 Thread Anthony
On Fri, Dec 31, 2010 at 1:47 AM, Huib Laurens  wrote:
> Okay, I emailed to Anthony how he can upload it.

Transfer is in progress.  ETA about 10 hours.  md5sum is
30c9b48de3ede527289bcdb810126723

Hopefully there aren't any problems as I'm not quite sure how to
resume upload with ftp.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-30 Thread Anthony
On Thu, Dec 30, 2010 at 8:38 AM, Anthony  wrote:
> I just asked Dreamhost if they would give me permission to violate
> their TOS for this one time one file.

And the person who responded just told me that he's not authorized to
give me permission to do that.

So, any volunteers to host this would still be appreciated.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-30 Thread Anthony
I just asked Dreamhost if they would give me permission to violate
their TOS for this one time one file.  Barring that, I'd need
somewhere to upload or scp it to (any volunteers?), or if you're in
the US either an 8 gig (5 gig or more) USB drive and self-addressed
stamped envelope, or about $12 or so if you want me to buy an 8 gig
USB drive and ship it to you, with the data, uninsured.

If all goes smoothly, the upload should take about 19 hours with my cablemodem.

Is there any sort of extract of the data that would suffice instead?

On Wed, Dec 29, 2010 at 8:59 PM, Monica shu  wrote:
> Yes, I think they are the same!
>
> Is there any method to download it?
>
> Thanks very much!!
>
>
>
> On Wed, Dec 29, 2010 at 10:06 PM, Anthony  wrote:
>
>> You talking about enwiki?
>>
>> I have enwiki-20080724-pages-articles.xml.bz2.  Nothing for 20080726.
>>
>> On Wed, Dec 29, 2010 at 2:54 AM, Monica shu 
>> wrote:
>> > @_...@...
>> >
>> > Thanks any way:)
>> >
>> > Anyone else hands  up?
>> >
>> > On Wed, Dec 29, 2010 at 3:18 PM, Chad  wrote:
>> >
>> >> On Wed, Dec 29, 2010 at 12:16 AM, Monica shu 
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have looked through the web for the 20080726 version of the dump
>> file
>> >> > "pages-articles.xml.bz2".
>> >> > But I can't find any result.
>> >> > Can anybody provide me a download link? Thank a lot!
>> >> >
>> >>
>> >> True story: I used to have a copy of the 20080726 dump. I
>> >> deleted it like a year ago because I didn't need it anymore
>> >> and I didn't know it had gone missing at the time.
>> >>
>> >> I should ask next time :(
>> >>
>> >> -Chad
>> >>
>> >> ___
>> >> Wikitech-l mailing list
>> >> Wikitech-l@lists.wikimedia.org
>> >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>> >>
>> > ___
>> > Wikitech-l mailing list
>> > Wikitech-l@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>> >
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Does anybody have the 20080726 dump version?

2010-12-29 Thread Anthony
You talking about enwiki?

I have enwiki-20080724-pages-articles.xml.bz2.  Nothing for 20080726.

On Wed, Dec 29, 2010 at 2:54 AM, Monica shu  wrote:
> @_...@...
>
> Thanks any way:)
>
> Anyone else hands  up?
>
> On Wed, Dec 29, 2010 at 3:18 PM, Chad  wrote:
>
>> On Wed, Dec 29, 2010 at 12:16 AM, Monica shu 
>> wrote:
>> > Hi all,
>> >
>> > I have looked through the web for the 20080726 version of the dump file
>> > "pages-articles.xml.bz2".
>> > But I can't find any result.
>> > Can anybody provide me a download link? Thank a lot!
>> >
>>
>> True story: I used to have a copy of the 20080726 dump. I
>> deleted it like a year ago because I didn't need it anymore
>> and I didn't know it had gone missing at the time.
>>
>> I should ask next time :(
>>
>> -Chad
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Anthony
On Fri, Dec 24, 2010 at 4:08 AM, Domas Mituzas  wrote:
> Hi!
>
> A:
>> It's easy to get fast results if you don't care about your reads being
>> atomic (*), and I find it hard to believe they've managed to get
>> atomic reads without going through MySQL.
>
> MySQL upper layers know nothing much about transactions, it is all 
> engine-specific - BEGIN and COMMIT processing is deferred to table handlers.
> It would incredibly easy for them to implement repeatable read snapshots :) 
> (if thats what you mean by atomic read)

I suppose it's possible in theory, but in any case, it's not what
they're doing.  They *are* going through MySQL, via the HandlerSocket
plugin.

I wonder if they'd get much different performance by just using
prepared statements and read committed isolation, with the
transactions spanning multiple requests.  The tables would only get
locked once per transaction, right?

Or do I just have no idea what I'm talking about?

>> (*) Among other possibilities, just use MyISAM.
>
> How is that applicable to any discussion?

It was an example of a way to get fast results if you don't care about
your reads being atomic.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-23 Thread Anthony
On Thu, Dec 23, 2010 at 9:34 AM, Nikola Smolenski  wrote:
> I have recently encountered this text in which the author claims very
> high MySQL speedups for simple queries (7.5 times faster than MySQL,
> twice faster than memcached) by reading the data directly from InnoDB
> where possible (MySQL is still used for writing and for complex
> queries.) Knowing that faster DB is always good, I thought this would be
> an interesting thing to consider :)
>
> http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

It's easy to get fast results if you don't care about your reads being
atomic (*), and I find it hard to believe they've managed to get
atomic reads without going through MySQL.

(*) Among other possibilities, just use MyISAM.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Offline wiki tools

2010-12-15 Thread Anthony
On Thu, Dec 16, 2010 at 12:47 AM, Andrew Dunbar  wrote:
> At the moment I'm interested in .bz2 and .7z because those are the
> formats WikiMedia currently publishes data in.

I'm fairly certain the specific 7z format which Wikimedia uses doesn't
allow for random access, because the dictionary is never reset.

> Have we made the case for this format to the WikiMedia people?

No, there's no off-the-shelf tool to create these files - the standard
.xz file created by xz utils puts everything in one stream, which is
basically equivalent to the .7z files already being made.  I'm sure
"patches are welcome", but I don't have the time to create the patch.

> How is .xz for compression times?

At the default settings, it's quite slow.  I believe it's pretty much
the same as 7zip with its default settings.  The main reason I was
using xz instead of 7zip is that xz handles pipes better -
specifically, 7zip doesn't allow you to pipe from stdin to stdout.
(See https://bugs.launchpad.net/ubuntu/+source/p7zip/+bug/383667 and
the response - "You should use lzma." - well, lzma utils has been
replaced by xz utils.)

For decompression, .xz is generally faster than .bz2, slower than .gz

> Would we have to worry about patent issues for LZMA?

No, it uses LZMA2.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Anthony
On Wed, Dec 15, 2010 at 3:30 PM, Ariel T. Glenn  wrote:
> We are interested in other mirrors of the dumps; see
>
> http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

On the talk page, it says "torrents are useful to save bandwidth,
which is not our problem".  If bandwidth is not the problem, then what
*is* the problem?

If the problem is just to get someone to store the data on hard
drives, then it's a much easier problem than actually *hosting* that
data.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Offline wiki tools

2010-12-15 Thread Anthony
On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar  wrote:
> By the way I'm keen to find something similar for .7z

I've written something similar for .xz, which uses LZMA2 same as .7z.
It creates a virtual read-only filesystem using FUSE (the FUSE part is
in perl, which uses pipes to dd and xzcat).  Only real problem is that
it doesn't use a stock .xz file, it uses a specially created one which
concatenates lots of smaller .xz files (currently I concatenate
between 5 and 20 or so 900K bz2 blocks into one .xz stream - between 5
and 20 because there's a preference to split on 
boundaries).

Apparently the folks at openzim have done something similar, using LZMA2.

If anyone is interesting in working with me to make a package capable
of being released to the public, I'd be willing to share my code.  But
it sounds like I'm just reinventing a wheel already invented by
opensim.

> It would be incredibly useful if these indices could be created as
> part of the dump creation process. Should I file a feature request?

With concatenated .xz files, creating the index is *much* faster,
because the .xz format puts the stream size at the end of each stream.
 Plus with .xz all streams are broken on 4-byte boundaries, whereas
with .bz2 blocks can end at any *bit* (which means you have to do
painful bit shifting to create the index).

The file is also *much* smaller, on the order of 5-10% of bzip2 for a
full history dump.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-04-08 Thread Anthony
On Thu, Apr 8, 2010 at 7:34 PM, Q  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> On 4/8/2010 4:28 PM, Anthony wrote:
> > I'd like to add that the md5 of the *uncompressed* file is
> > cd4eee6d3d745ce716db2931c160ee35 .  That's what I got from both the
> > uncompressed 7z and the uncompressed bz2.  They matched, whew.
> > Uncompressing and md5ing the bz2 took well over a week.  Uncompressing
> and
> > md5ing the 7z took less than a day.
> >
>
> Dumping and parsing large XML files came up at work today which made me
> think of this, how big exactly is the uncompressed file?
>

5.34 terabytes was the figure I got.

"7z l enwiki-20100130-pages-meta-history.xml.7z" gives an uncompressed size
of 5873134833455. I assume that's bytes, and googling "5873134833455 bytes
to terabytes" gives me "5.34158501 terabytes".
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-04-08 Thread Anthony
I'd like to add that the md5 of the *uncompressed* file is
cd4eee6d3d745ce716db2931c160ee35 .  That's what I got from both the
uncompressed 7z and the uncompressed bz2.  They matched, whew.
Uncompressing and md5ing the bz2 took well over a week.  Uncompressing and
md5ing the 7z took less than a day.

On Mon, Mar 29, 2010 at 8:16 PM, Tomasz Finc  wrote:

> You can find all the md5sums at
>
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-md5sums.txt
>
> --tomasz
>
> Anthony wrote:
>
>> Got an md5sum?
>>
>>
>> On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc > tf...@wikimedia.org>> wrote:
>>
>>I love lzma compression.
>>
>>enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB
>>
>>enwiki-20100130-pages-meta-history.xml.7z 31.9 GB
>>
>>Download at http://tinyurl.com/yeelbse
>>
>>Enjoy!
>>
>>--tomasz
>>
>>Tomasz Finc wrote:
>> > Tomasz Finc wrote:
>> >> New full history en wiki snapshot is hot off the presses!
>> >>
>> >> It's currently being checksummed which will take a while for
>>280GB+ of
>> >> compressed data but for those brave souls willing to test please
>>grab it
>> >> from
>> >>
>> >>
>>
>> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
>> >>
>> >>
>> >> and give us feedback about its quality. This run took just over
>>a month
>> >> and gained a huge speed up after Tims work on re-compressing ES.
>>If we
>> >> see no hiccups with this data snapshot, I'll start mirroring it
>>to other
>> >> locations (internet archive, amazon public data sets, etc).
>> >>
>> >> For those not familiar, the last successful run that we've seen
>>of this
>> >> data goes all the way back to 2008-10-03. That's over 1.5 years of
>> >> people waiting to get access to these data bits.
>> >>
>> >> I'm excited to say that we seem to have it :)
>> >>
>> >> --tomasz
>> >
>> > We now have an md5sum for
>> enwiki-20100130-pages-meta-history.xml.bz2.
>> >
>> > "65677bc275442c7579857cc26b355ded"
>> >
>> > Please verify against it before filing issues.
>> >
>> > --tomasz
>> >
>> >
>> > ___
>> > Wikitech-l mailing list
>> > Wikitech-l@lists.wikimedia.org
>><mailto:Wikitech-l@lists.wikimedia.org>
>>
>> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>>
>>___
>>Xmldatadumps-admin-l mailing list
>>xmldatadumps-admi...@lists.wikimedia.org
>><mailto:xmldatadumps-admi...@lists.wikimedia.org>
>>
>>https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
>>
>>
>>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D

2010-03-29 Thread Anthony
Got an md5sum?

On Mon, Mar 29, 2010 at 5:46 PM, Tomasz Finc  wrote:

> I love lzma compression.
>
> enwiki-20100130-pages-meta-history.xml.bz2 280.3 GB
>
> enwiki-20100130-pages-meta-history.xml.7z 31.9 GB
>
> Download at http://tinyurl.com/yeelbse
>
> Enjoy!
>
> --tomasz
>
> Tomasz Finc wrote:
> > Tomasz Finc wrote:
> >> New full history en wiki snapshot is hot off the presses!
> >>
> >> It's currently being checksummed which will take a while for 280GB+ of
> >> compressed data but for those brave souls willing to test please grab it
> >> from
> >>
> >>
> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-history.xml.bz2
> >>
> >>
> >> and give us feedback about its quality. This run took just over a month
> >> and gained a huge speed up after Tims work on re-compressing ES. If we
> >> see no hiccups with this data snapshot, I'll start mirroring it to other
> >> locations (internet archive, amazon public data sets, etc).
> >>
> >> For those not familiar, the last successful run that we've seen of this
> >> data goes all the way back to 2008-10-03. That's over 1.5 years of
> >> people waiting to get access to these data bits.
> >>
> >> I'm excited to say that we seem to have it :)
> >>
> >> --tomasz
> >
> > We now have an md5sum for enwiki-20100130-pages-meta-history.xml.bz2.
> >
> > "65677bc275442c7579857cc26b355ded"
> >
> > Please verify against it before filing issues.
> >
> > --tomasz
> >
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> ___
> Xmldatadumps-admin-l mailing list
> xmldatadumps-admi...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hiphop! :)

2010-02-28 Thread Anthony
On Sun, Feb 28, 2010 at 4:33 PM, Domas Mituzas wrote:

> >
> > Nevertheless - a process isn't the same process when it's going at 10x
> > the speed. This'll be interesting.
>
> not 10x. I did concurrent benchmarks for API requests (e.g. opensearch) on
> modern boxes, and saw:
>
> HipHop: Requests per second:1975.39 [#/sec] (mean)
> Zend: Requests per second:371.29 [#/sec] (mean)
>
> these numbers seriously kick ass. I still can't believe I observe 2000
> mediawiki requests/s from a single box ;-)
>

Great job Domas.  It'll be exciting to see the final product.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-17 Thread Anthony
On Wed, Feb 17, 2010 at 8:51 AM, Anthony  wrote:

> On Wed, Feb 17, 2010 at 6:54 AM, Domas Mituzas wrote:
>
>> >
>> > It showed that there was quite a bit of bathwater thrown out.  And at
>> least
>> > one very large baby (Google translation), which was temporarily
>> > resurrected.  We still don't know how many other, smaller, babies were
>> > thrown out, and likely never will.
>>
>> I'm pretty sure, that at least 99.9% of the drop in those graphs was
>> exactly the activity I was going after.
>>
>
> I know you are.  That's why I started out by trying to figure out who your
> boss is.
>

However, with that said, I'm sure you have the best of intentions, Domas,
and I assume this is an isolated misjudgment in a sea of positive and useful
contributions.  I'm sorry if I underestimated your valuable contributions to
Wikimedia.  The fact is, sitting where I'm sitting (which is out of my
difficulty in getting along with rather than my lack of interest in
helping), I don't get to see all the work you do behind the scenes.  So
please don't take offense to my lack of praise for them.  Sorry.

Anthony
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-17 Thread Anthony
On Wed, Feb 17, 2010 at 6:54 AM, Domas Mituzas wrote:

> >
> > It showed that there was quite a bit of bathwater thrown out.  And at
> least
> > one very large baby (Google translation), which was temporarily
> > resurrected.  We still don't know how many other, smaller, babies were
> > thrown out, and likely never will.
>
> I'm pretty sure, that at least 99.9% of the drop in those graphs was
> exactly the activity I was going after.
>

I know you are.  That's why I started out by trying to figure out who your
boss is.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 11:32 PM, Tim Starling wrote:

> I think it's common knowledge among people who have been reading these
> lists for a long time, that Anthony has a serious deficit in his
> sarcasm detection department, and often gives inappropriate responses
> to sarcastic comments.
>

Hah, I need to put that in my tagline or something.

Well, thanks for the defense, I guess.  Or were you being sarcastic?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 11:18 PM, John Vandenberg  wrote:

> With this solution, it is now possible to determine how much of the
> traffic was from valid services.  i.e. google translate and other
> useful services will identify themselves


And what separates google translate from other useful services which hotload
Wikipedia (other than the $2 million, which is not to say that $2 million is
a bad reason to separate it, but let's at least be honest if that's the
reason)?


> I am even less in favour of Domas retiring to an armchair, and think
> that anyone suggesting that is deluding themselves about Wikimedia's
> need of Domas, and Domas' reason for volunteering.
>

Well, I never did say I was in favor of it.  I merely pointed out his
hypocrisy in claiming that he would love to have it be so.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 9:47 PM, John Vandenberg  wrote:

> On Wed, Feb 17, 2010 at 1:00 PM, Anthony  wrote:
> > On Wed, Feb 17, 2010 at 11:57 AM, Domas Mituzas 
> wrote:
> >> Probably everything looks easier from your armchair. I'd love to have
> that
> >> view! :)
> >>
> >
> > Then stop volunteering.
>
> Did you miss the point?
>

I don't think so.  I believe his point was to complain about the position
that he is in.  My response was that he was in that position by choice, and
that if he'd love to be in my position, it's a really easy for thing for him
to accomplish.

The graphs provided in this thread clearly show that the solution had
> a positive & desired effect.
>

It showed that there was quite a bit of bathwater thrown out.  And at least
one very large baby (Google translation), which was temporarily
resurrected.  We still don't know how many other, smaller, babies were
thrown out, and likely never will.

In any case, I don't see how your comment follows from mine.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
> Anyway, you probably are missing one important point.
> We're trying to make Wikipedia's service better.
>

I'm sure you are.  But that doesn't mean I agree with your methods.

Probably everything looks easier from your armchair. I'd love to have that
> view! :)
>

Then stop volunteering.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
> > And "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT
> > 5.1)", is pretty much
> > useless, unless you've already identified the spammer through some other
> > process.
>
> It isn't useless. It clearly shows that the user is acting malicious by
> having automated software that disguises under common user agent.
>

1) Only if you've already identified the spammer through some other process
(otherwise, you don't even know if they're using automated software).
2) It doesn't really show that the user is acting malicious even if you can
determine that they're using automated software.  They might be using
software written by someone else.  Or they might have read the error message
which says "please supply a user agent" and followed it by supplying a user
agent.  It might be malicious, or it might be an error in judgment.
Regardless, what are you going to do about it?  Block the IP?  For how
long?  Even if it's dynamic?  Even if it's shared by many others?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 2:31 PM, Domas Mituzas wrote:

> > Presumably some percentage of that 20-50% will come back as the
> > spammers realize they have to supply the string.  Presumably we
> > then start playing whack-a-mole.
>
> Yes, we will ban all IPs participating in this.
>

Guess it's just a matter of time until *reading* Wikipedia is unavailable to
large portions of the world.

> Presumably there's a plan for what to do when the spammers begin
> > supplying a new, random string every time.
>
> Random strings are easy to identify, fixed strings are easy to verify.
>

And "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT
5.1)", is pretty much
useless, unless you've already identified the spammer through some other
process.

> (I do worry about where this is going, though.)
>
> Going where it always goes, proper operations of the website. Been there,
> done that.
>

Do any of the other major websites completely block traffic when they see
blank user agents?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 11:32 AM, Ariel T. Glenn wrote:

> In fact some WMF paid employees (including me) were in the channel at
> that time and agreed with the decision.  It seemed then and still seems
> to me a reasonable course of action given the circumstances. I
> understand it's aggravating to people who didn't get notice; let's look
> forward.  PLease just add the UA header and your tools / bots/ etc.
> will be back to working.  Thanks.
>

It's not a big deal for anyone aware of the problem.   Adding "User-Agent:
Janna" to my scripts is no big deal, nor is adding a randomized UA from a
list of common UAs in privoxy (I'm pretty sure there's a plugin for that).

I do wonder how many people are going to wind up getting strange errors that
they don't know how to fix due to this, though.  Is it at all feasible to
throttle such traffic rather than blocking it completely?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 11:04 AM, Domas Mituzas wrote:

> > No idea.  For ages you've been able to just go onto the Wikimedia servers
> > and change whatever you feel like, and answer to nobody?  You must be
> > misunderstanding my question or something.
>
> Kind of. Isn't that a good enough motivation? :-)
>
> Though of course, I tend to consult with tech team members, and they're
> free to overturn anything I change, especially if they come up with better
> solutions (and they usually do!).
> And indeed, I guess WMF owns the ultimate power of terminating my access :)
>

In all honesty, I find that fascinating.  If someone manages to write a book
about how that system works, I'll probably buy it.

On the other hand, I guess it's off topic.  So enough about that.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 10:39 AM, Domas Mituzas wrote:

> > Cool.  Who's your boss, and who's your boss's boss?  Sorry, I couldn't
> find
> > you in the org chart or I'd just have looked that up myself.
>
> Nobody?


Really?  Were you doing this work as a contractor, or as a volunteer?
Someone's gotta be in charge of the contractors and/or the volunteers, no?


> Been like that for ages, haven't it?
>

No idea.  For ages you've been able to just go onto the Wikimedia servers
and change whatever you feel like, and answer to nobody?  You must be
misunderstanding my question or something.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Tue, Feb 16, 2010 at 10:31 AM, Domas Mituzas wrote:

> Hi!
>
> > Whose decision was this?
>
> Mine.
>
> >  Were Erik, Sue, or Danese involved?
>
> No.
>

Cool.  Who's your boss, and who's your boss's boss?  Sorry, I couldn't find
you in the org chart or I'd just have looked that up myself.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Anthony
On Mon, Feb 15, 2010 at 8:54 PM, Domas Mituzas wrote:

> Hi!
>
> from now on specific per-bot/per-software/per-client User-Agent header is
> mandatory for contacting Wikimedia sites.
>
> Domas
>

Hi,

Whose decision was this?  Were Erik, Sue, or Danese involved?
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


  1   2   >