Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered

2010-12-16 Thread Joseph Reagle
On Wednesday, December 15, 2010, Tim Starling wrote:
> There were some changes made to the page text that weren't represented
> in diff_log, specifically changing certain camel-case links to free
> links.

It appears my problems were related to some CR/LF issues not round-tripping 
between diff and patch, but I hope to be able to address that. And yes, in 
addition to some of the CamelCase issues, I expect another problem is that if a 
page is blanked, "Describe the new page here." will reappear outside of the 
diff_log.
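
A minimal sketch of the round-trip concern, not the actual tooling used here: it swaps in Python's difflib for the diff and patch commands, and the function name round_trip is hypothetical. The assumption it illustrates is that line endings survive only if lines are split with their terminators kept, so CR/LF and bare LF lines come back byte for byte.

```python
import difflib


def round_trip(old: str, new: str) -> str:
    """Diff old against new, then rebuild new from the delta alone."""
    # keepends=True keeps "\r\n" or "\n" attached to each line, so the
    # delta records the exact terminator instead of normalizing it.
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    delta = difflib.ndiff(old_lines, new_lines)
    # which=2 selects the lines belonging to the second sequence (new).
    return "".join(difflib.restore(delta, 2))


old_text = "line one\r\nline two\r\n"
new_text = "line one\r\nline 2\r\nline three\n"
rebuilt = round_trip(old_text, new_text)
```

If the terminators were dropped before diffing, the rebuilt text could no longer tell a CR/LF line from an LF line, which is one way a mismatch like the one described could arise.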


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered

2010-12-16 Thread Tim Starling
On 16/12/10 23:10, Joseph Reagle wrote:
> On Wednesday, December 15, 2010, Tim Starling wrote:
>> There were some changes made to the page text that weren't represented
>> in diff_log, specifically changing certain camel-case links to free
>> links.
> It appears my problems were related to some CR/LF issues not round-tripping
> between diff and patch, but I hope to be able to address that. And yes, in
> addition to some of the CamelCase issues, I expect another problem is that if
> a page is blanked, "Describe the new page here." will reappear outside of the
> diff_log.

I don't think that will be a problem. But there are other problems
that I've encountered.

UseMod had a deletion feature. It turns out to be easy enough to skip
deleted pages, since they don't have a corresponding entry in rclog.
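
A rough sketch of that filter, under assumptions: the thread does not give the record layout, so the "title" field and the function name skip_deleted are hypothetical, and the match is on title only.

```python
def skip_deleted(page_titles, rclog_entries):
    """Keep only the pages that have at least one rclog entry.

    page_titles:   iterable of page names found in the dump (assumed shape)
    rclog_entries: iterable of dicts with a "title" key (assumed shape)
    """
    logged = {entry["title"] for entry in rclog_entries}
    # Pages with no rclog entry are treated as deleted and dropped.
    return [title for title in page_titles if title in logged]


pages = ["HomePage", "DeletedPage", "PhilosophY"]
rclog = [{"title": "HomePage", "ts": 1}, {"title": "PhilosophY", "ts": 2}]
kept = skip_deleted(pages, rclog)
```

Matching on title alone would misfire for renamed pages, so a real import would presumably need a different join key.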

It also had an admin-only rename feature, which optionally fixed links
in all pages. This accounts for the free link changes I was seeing
earlier. And it had a link replacement feature which could be invoked
without a page move. These features were rarely used because of the
arcane interface; usually people just moved pages by copying and
pasting. But during the free-link conversion, a lot of pages were
renamed using the admin-only feature.

All these admin-only features were unlogged, but it turns out to be
possible to reconstruct page moves, because when a page was moved, its
name was updated in rclog but not in diff_log. By finding the first
diff_log entry with the new name, you can roughly work out when the
page moves were done.
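
The idea can be sketched as follows. This is an illustration under stated assumptions, not the import script itself: both logs are reduced to (timestamp, title) pairs, the two logs are assumed to be joinable on a unique timestamp, and the name estimate_moves is hypothetical.

```python
def estimate_moves(rclog, diff_log):
    """Map (old_title, new_title) to an approximate move time.

    rclog, diff_log: iterables of (timestamp, title) pairs (assumed shape).
    The move time is taken to be the first diff_log timestamp at which the
    new title appears, or None if it never appears there.
    """
    first_seen = {}
    for ts, title in diff_log:
        if title not in first_seen or ts < first_seen[title]:
            first_seen[title] = ts

    # Assumption: a timestamp identifies the same edit in both logs.
    diff_title_at = {ts: title for ts, title in diff_log}

    moves = {}
    for ts, new_title in rclog:
        old_title = diff_title_at.get(ts)
        if old_title is None or old_title == new_title:
            continue  # no matching edit, or the name never changed
        # rclog shows the new name but diff_log kept the old one,
        # so this edit predates the move.
        moves[(old_title, new_title)] = first_seen.get(new_title)
    return moves


rclog = [(100, "Free Link"), (200, "Free Link"), (300, "Free Link")]
diff_log = [(100, "FreeLink"), (200, "FreeLink"), (300, "Free Link")]
moves = estimate_moves(rclog, diff_log)
```

The estimate is only an upper bound on the move time: the move happened somewhere between the last diff_log entry under the old name and the first one under the new name.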

Anyway, I'm developing a script which will import the dump into a
modified MediaWiki instance, the idea being that I can then export XML
from it. Once it works, I'll upload the XML somewhere. I'm not sure
when that will be.

-- Tim Starling



Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered

2010-12-16 Thread Joseph Reagle

I have the first 10K edits up reconstructed in their various pages at:
  http://cyber.law.harvard.edu/~reagle/wp-redux/



Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered

2010-12-16 Thread lior gimel
This is amazing!
Thanks for the work and effort, this reconstruction is a priceless resource
for researchers.
Lior

On Thu, Dec 16, 2010 at 8:53 PM, Joseph Reagle joseph.2...@reagle.org wrote:


> I have the first 10K edits up reconstructed in their various pages at:
>   http://cyber.law.harvard.edu/~reagle/wp-redux/



Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered

2010-12-16 Thread Joseph Reagle
On Thursday, December 16, 2010, lior gimel wrote:
> This is amazing!

And buggy! :-)

> Thanks for the work and effort, this reconstruction is a priceless resource
> for researchers.

Thanks to Tim for providing the data, and for working on a much better version 
that I look forward to!



[Wiki-research-l] Google ngrams

2010-12-16 Thread emijrp
Hi all;

I leave this link here... http://ngrams.googlelabs.com/datasets

An example
http://ngrams.googlelabs.com/graph?content=collaborative&year_start=1920&year_end=&corpus=0&smoothing=3

Regards,
emijrp


Re: [Wiki-research-l] Google ngrams

2010-12-16 Thread Samuel Klein
I was just playing with this... remarkable.   Someone should do the
same with Wikipedia's text over time, which would provide even crisper
comparisons [as within categories].

http://ngrams.googlelabs.com/graph?content=art,technology,www&year_start=1950&year_end=2008&corpus=5&smoothing=4

On Thu, Dec 16, 2010 at 5:28 PM, emijrp emi...@gmail.com wrote:
> Hi all;
>
> I leave this link here... http://ngrams.googlelabs.com/datasets
>
> An example
> http://ngrams.googlelabs.com/graph?content=collaborative&year_start=1920&year_end=&corpus=0&smoothing=3
>
> Regards,
> emijrp





-- 
Samuel Klein          identi.ca:sj           w:user:sj          +1 617 529 4266



Re: [Wiki-research-l] Google ngrams

2010-12-16 Thread emijrp
Look at this one ; )
http://ngrams.googlelabs.com/graph?content=security%2Cfreedom&year_start=1950&year_end=2008&corpus=5&smoothing=4

2010/12/17 Samuel Klein meta...@gmail.com

> I was just playing with this... remarkable.   Someone should do the
> same with Wikipedia's text over time, which would provide even crisper
> comparisons [as within categories].
>
> http://ngrams.googlelabs.com/graph?content=art,technology,www&year_start=1950&year_end=2008&corpus=5&smoothing=4
>
> On Thu, Dec 16, 2010 at 5:28 PM, emijrp emi...@gmail.com wrote:
>> Hi all;
>>
>> I leave this link here... http://ngrams.googlelabs.com/datasets
>>
>> An example
>> http://ngrams.googlelabs.com/graph?content=collaborative&year_start=1920&year_end=&corpus=0&smoothing=3
>>
>> Regards,
>> emijrp
>
> --
> Samuel Klein  identi.ca:sj   w:user:sj  +1 617 529 4266
