Thanks for all the suggestions you shared!

@Aaron, it would be great if you could share the dataset you have; I
think 20150602 is fairly recent. In the meantime, I will explore the
utilities you mentioned. They seem like good tools to learn and
practice with. Thanks!

On Wed, Jan 20, 2016 at 9:20 AM, Aaron Halfaker <aaron.halfa...@gmail.com>
wrote:

> The deltas library implements a rough version of the WikiWho strategy
> behind a difflib-style interface called "SegmentMatcher".
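>
> Roughly, usage looks something like the following (adapted from the
> deltas README; treat the tokenizer name and operation attributes as
> approximate and check the docs for the exact API):
>
> from deltas import segment_matcher, text_split
>
> a = text_split.tokenize("This is some text.  This is some other text.")
> b = text_split.tokenize("This is some other text.  This is some text.")
>
> # Each operation is an equal/insert/delete segment described by token
> # index ranges into the two token lists.
> for op in segment_matcher.diff(a, b):
>     print(op.name,
>           repr("".join(a[op.a1:op.a2])),
>           repr("".join(b[op.b1:op.b2])))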
>
> Re. diffs, I have some datasets that I have generated and can share.
> Would enwiki-20150602 be recent enough for your uses?
>
> If not, then I'd also like to point you to
> http://pythonhosted.org/mwdiffs/ which provides some nice utilities for
> parallel processing diffs from MediaWiki dumps using the `deltas` library.
> See http://pythonhosted.org/mwdiffs/utilities.html.  Those utilities will
> natively parallelize computation, so you can divide the total runtime
> (100 days) by the number of CPUs you have to run with, e.g. 100 days / 16
> CPUs = ~6.3 days.  On a Hadoop streaming setup (Altiscale), I've been able
> to get the whole English Wikipedia history processed in 48 hours, so
> Hadoop isn't a massive benefit over a single multi-core machine -- yet.
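>
> If you'd rather roll your own parallelization in plain Python, the
> divide-by-CPUs idea is just a process pool over dump files. Here's a
> minimal sketch (process_dump is a hypothetical stand-in for whatever
> per-file parsing/diffing you do, and the glob pattern is only an
> example):
>
> from multiprocessing import Pool
> import glob
>
> def process_dump(path):
>     # Hypothetical: parse one dump file, diff consecutive revisions
>     # (e.g. with deltas), and return or write out the results.
>     ...
>
> if __name__ == "__main__":
>     dump_files = glob.glob("dumps/enwiki-*-pages-meta-history*.xml.bz2")
>     with Pool(processes=16) as pool:  # ~100 days / 16 CPUs ~= 6.3 days
>         for _ in pool.imap_unordered(process_dump, dump_files):
>             pass  # aggregate results here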
>
> -Aaron
>
> On Wed, Jan 20, 2016 at 8:49 AM, Flöck, Fabian <fabian.flo...@gesis.org>
> wrote:
>
>> Hi, you can also look at our WikiWho code; we have tested it and it
>> extracts the changes between revisions considerably faster than a simple
>> diff. See here: https://github.com/maribelacosta/wikiwho . You would have
>> to adapt the code a bit to get the pure diffs, though. Let me know if you
>> need help.
>>
>> best,
>> fabian
>>
>>
>>
>> On 20.01.2016, at 13:15, Scott Hale <computermacgy...@gmail.com> wrote:
>>
>> Hi Bowen,
>>
>> You might compare the performance of Aaron Halfaker's deltas library:
>> https://github.com/halfak/deltas
>> (you may well have done so already, but just in case)
>>
>> In either case, I suspect the task will need to be parallelized to
>> complete in a reasonable amount of time. How many editions are you
>> working with?
>>
>> Cheers,
>> Scott
>>
>>
>> On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <yuxxx...@umn.edu> wrote:
>>
>>> Hello all,
>>>
>>> I am a second-year PhD student in the GroupLens Research group at the
>>> University of Minnesota - Twin Cities. I am currently working on a
>>> project studying how identity-based and bond-based theories can help
>>> explain editors' behavior in WikiProjects within a group context, but I
>>> am running into a technical problem and could use some help and advice.
>>>
>>> I am trying to extract each editor's revision content from the XML
>>> dumps - the text they added or deleted in each revision. I have been
>>> using the compare function in difflib to obtain the added or deleted
>>> content by comparing two string objects, but it runs extremely slowly
>>> when the strings are large, which is often the case for Wikipedia
>>> revision content. Without any parallel processing, the expected runtime
>>> to download and parse the 201 dumps would be 100+ days. I was pointed to
>>> Altiscale, but I am not yet sure exactly how to use it for my problem.
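>>>
>>> For reference, what I am doing is roughly the following (simplified;
>>> rev_a and rev_b stand for the wikitext of two consecutive revisions):
>>>
>>> import difflib
>>>
>>> def added_and_deleted(rev_a, rev_b):
>>>     # Compare the two revisions line by line and collect the lines
>>>     # that were added to or deleted from the newer revision.
>>>     added, deleted = [], []
>>>     for line in difflib.Differ().compare(rev_a.splitlines(),
>>>                                          rev_b.splitlines()):
>>>         if line.startswith("+ "):
>>>             added.append(line[2:])
>>>         elif line.startswith("- "):
>>>             deleted.append(line[2:])
>>>     return added, deleted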
>>>
>>> It would be really great if anyone could give me some suggestions to
>>> help me make progress. Thanks in advance!
>>>
>>> Sincerely,
>>> Bowen
>>>
>>
>>
>> --
>> Dr Scott Hale
>> Data Scientist
>> Oxford Internet Institute
>> University of Oxford
>> http://www.scotthale.net/
>> scott.h...@oii.ox.ac.uk
>>
>> Regards,
>> Fabian
>>
>> --
>> Fabian Flöck
>> Research Associate
>> Computational Social Science department @GESIS
>> Unter Sachsenhausen 6-8, 50667 Cologne, Germany
>> Tel: + 49 (0) 221-47694-208
>> fabian.flo...@gesis.org
>>
>> www.gesis.org
>> www.facebook.com/gesis.org
>
>
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
