Thanks for all the suggestions you shared! @Aaron, it would be great if you could share the dataset you have with me; I think 20150602 is recent enough. In the meantime, I will explore the utilities you mentioned. They look like good tools to learn and practice with. Thanks!
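For reference, a minimal sketch of the SegmentMatcher usage Aaron describes below, following the tokenize-then-diff pattern from the deltas documentation; the sample strings are invented for illustration:

    from deltas import segment_matcher, text_split

    # Tokenize two revisions, then diff the token sequences.
    a = text_split.tokenize("This is some text. This is some other text.")
    b = text_split.tokenize("This is some other text. This is some text.")

    # diff() yields operations (equal/insert/delete) with token offsets
    # into the two sequences.
    for op in segment_matcher.diff(a, b):
        print(op.name,
              repr("".join(a[op.a1:op.a2])),
              repr("".join(b[op.b1:op.b2])))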
On Wed, Jan 20, 2016 at 9:20 AM, Aaron Halfaker <aaron.halfa...@gmail.com> wrote:

> The deltas library implements the rough WikiWho strategy in a difflib sort of way as "SegmentMatcher".
>
> Re. diffs, I have some datasets that I have generated and can share. Would enwiki-20150602 be recent enough for your uses?
>
> If not, then I'd also like to point you to http://pythonhosted.org/mwdiffs/ which provides some nice utilities for parallel processing of diffs from MediaWiki dumps using the `deltas` library. See http://pythonhosted.org/mwdiffs/utilities.html. Those utilities natively parallelize computation, so you can divide the total runtime (100 days) by however many CPUs you have to run with, e.g. 100 days / 16 CPUs = 6.3 days. On a Hadoop Streaming setup (Altiscale), I've been able to get the whole English Wikipedia history processed in 48 hours, so it's not a massive benefit -- yet.
>
> -Aaron
>
> On Wed, Jan 20, 2016 at 8:49 AM, Flöck, Fabian <fabian.flo...@gesis.org> wrote:
>
>> Hi, you can also look at our WikiWho code; we have tested it to extract the changes between revisions considerably faster than a simple diff. See here: https://github.com/maribelacosta/wikiwho . You would have to adapt the code a bit to give you the pure diffs, though. Let me know if you need help.
>>
>> Best,
>> Fabian
>>
>> On 20.01.2016, at 13:15, Scott Hale <computermacgy...@gmail.com> wrote:
>>
>> Hi Bowen,
>>
>> You might compare the performance of Aaron Halfaker's deltas library: https://github.com/halfak/deltas
>> (You might have already done so, I guess, but just in case.)
>>
>> In either case, I suspect the tasks will need to be parallelized to finish on a reasonable time scale. How many editions are you working with?
>>
>> Cheers,
>> Scott
>>
>> On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <yuxxx...@umn.edu> wrote:
>>
>>> Hello all,
>>>
>>> I am a second-year PhD student in the GroupLens Research group at the University of Minnesota - Twin Cities. I am currently working on a project studying how identity-based and bond-based theories can help us understand editors' behavior in WikiProjects within the group context, but I have run into a technical problem and would appreciate help and advice.
>>>
>>> I am trying to parse each editor's revision content from the XML dumps - the content they added or deleted in each revision. I used the compare function in difflib to obtain the added or deleted content by comparing two string objects, but it runs extremely slowly when the strings are huge, as they are for Wikipedia revision content. Without any parallel processing techniques, the expected runtime to download and parse the 201 dumps would be 100+ days. I was pointed to Altiscale, but I am not yet sure exactly how to use it for my problem.
>>>
>>> It would be really great if anyone could give me some suggestions to help me make more progress. Thanks in advance!
>>>
>>> Sincerely,
>>> Bowen
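A minimal sketch of the slow baseline Bowen describes, assuming mwxml for dump iteration (he does not say which parser he uses) and difflib.SequenceMatcher for the comparison; the dump filename is a hypothetical placeholder:

    import difflib
    import mwxml

    # Iterate over pages and revisions in one history dump file.
    dump = mwxml.Dump.from_file(open("enwiki-pages-meta-history.xml"))

    for page in dump:
        previous = ""
        for revision in page:
            current = revision.text or ""
            # SequenceMatcher does heavy work on long, dissimilar strings,
            # which is why this approach is slow on full revision texts.
            matcher = difflib.SequenceMatcher(None, previous, current)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag in ("insert", "replace"):
                    added = current[j1:j2]      # content added in this revision
                if tag in ("delete", "replace"):
                    removed = previous[i1:i2]   # content removed
            previous = current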
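And a sketch of the divide-by-CPUs arithmetic Aaron mentions, using the standard library's multiprocessing to fan one dump file out per worker; process_dump and the paths are hypothetical stand-ins for whatever per-file extraction is used (the mwdiffs utilities wrap this pattern up natively):

    from multiprocessing import Pool

    def process_dump(path):
        # Hypothetical stand-in: run the per-file diff extraction here
        # (e.g. the mwxml/difflib loop above, or a deltas-based diff)
        # and write the results out.
        ...

    # Hypothetical list of dump files to process.
    dump_paths = ["dump-part-001.xml", "dump-part-002.xml"]

    if __name__ == "__main__":
        # With N worker processes, wall-clock time divides roughly by N,
        # e.g. 100 days / 16 CPUs ~= 6.3 days, as Aaron notes.
        with Pool(processes=16) as pool:
            pool.map(process_dump, dump_paths)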
>>
>> --
>> Dr Scott Hale
>> Data Scientist
>> Oxford Internet Institute
>> University of Oxford
>> http://www.scotthale.net/
>> scott.h...@oii.ox.ac.uk
>>
>> Regards,
>> Fabian
>>
>> --
>> Fabian Flöck
>> Research Associate
>> Computational Social Science department @GESIS
>> Unter Sachsenhausen 6-8, 50667 Cologne, Germany
>> Tel: +49 (0) 221-47694-208
>> fabian.flo...@gesis.org
>> www.gesis.org
>> www.facebook.com/gesis.org
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l