Re: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
Dear Mohamad,

thanks for compiling this comprehensive list. You might want to add JWPL: http://code.google.com/p/jwpl/ and WikipediaMiner: http://wikipedia-miner.sourceforge.net/

-Torsten

From: wiki-research-l-boun...@lists.wikimedia.org [mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of mohamad mehdi
Sent: Monday, April 18, 2011 3:20 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets

Hi everyone,

This is a follow-up on a previous thread (Wikipedia data sets) related to the Wikipedia literature review (Chitu Okoli). As I mentioned in my previous email, part of our study is to identify the data collection methods and data sets used in Wikipedia studies. We therefore searched for online tools for extracting Wikipedia articles and for pre-compiled data sets of Wikipedia articles, and were able to identify the following list. Please let us know of any other sources you know about. Also, we would like to know whether there is an existing Wikipedia page that includes such a list, so that we can add to it. Otherwise, where would you suggest posting this list so that it is noticeable and useful for the community?

http://download.wikimedia.org/ /* official Wikipedia database dumps */
http://datamob.org/datasets/tag/wikipedia /* multiple data sets (English Wikipedia articles transformed into XML) */
http://wiki.dbpedia.org/Datasets /* structured information extracted from Wikipedia */
http://labs.systemone.at/wikipedia3 /* Wikipedia³, a conversion of the English Wikipedia into RDF; a monthly updated dataset containing around 47 million triples */
http://www.scribd.com/doc/9582/integrating-wikipediawordnet /* article on integrating WordNet and Wikipedia with YAGO */
http://www.infochimps.com/datasets/taxobox-wikipedia-infoboxes-with-taxonomic-information-on-animal/
http://www.infochimps.com/link_frame?dataset=11043 /* Wikipedia Datasets for the Hadoop Hack | Cloudera */
http://www.infochimps.com/link_frame?dataset=11166 /* Wikipedia: Lists of common misspellings/For machines */
http://www.infochimps.com/link_frame?dataset=11028 /* Building a (fast) Wikipedia offline reader */
http://www.infochimps.com/link_frame?dataset=11004 /* Using the Wikipedia page-to-page link database */
http://www.infochimps.com/link_frame?dataset=11285 /* List of films */
http://www.infochimps.com/link_frame?dataset=11598 /* MusicBrainz Database */
http://dammit.lt/wikistats/ /* Wikitech-l page counters */
http://snap.stanford.edu/data/wiki-meta.html /* complete Wikipedia edit history (up to January 2008) */
http://aws.amazon.com/datasets/2596?_encoding=UTF8jiveRedirect=1 /* Wikipedia Page Traffic Statistics */
http://aws.amazon.com/datasets/2506 /* Wikipedia XML Data */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets?q=Wikipedia+ /* list of Wikipedia data sets */

Examples:

http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/top-1000-accessed-wikipedia-articl/versions/1 /* Top 1000 Accessed Wikipedia Articles */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/wikipedia-hits/versions/1 /* Wikipedia Hits */

Tools to extract data from Wikipedia:

http://www.evanjones.ca/software/wikipedia2text.html /* extracting text from Wikipedia */
http://www.infochimps.com/link_frame?dataset=11121 /* Wikipedia article traffic statistics */
http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/ /* Generating a Plain Text Corpus from Wikipedia */
http://www.infochimps.com/datasets/wikipedia-articles-title-autocomplete

Thank you,
Mohamad Mehdi

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
And maybe the Wiktionary parser and visual interface :) http://code.google.com/p/wikokit/

Best regards,
Andrew Krizhanovsky

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] best practices for recruiting study participants?
Dear Jodi and all!

I hope that you are fine. Here is a wiki page listing suggestions on how to conduct research in a way that respects Wikimedia community principles: http://meta.wikimedia.org/wiki/Notes_on_good_practices_on_Wikipedia_research

Hoping it is useful! Have a nice day,
Mayo

Mayo Fuster Morell
Research on Digital Commons Governance: http://www.onlinecreation.info
Ph.D., European University Institute
Postdoctoral Researcher, Institute of Government and Public Policies, Autonomous University of Barcelona
Visiting scholar, Internet Interdisciplinary Institute, Open University of Catalonia (UOC)
Visiting researcher (2008), School of Information, University of California, Berkeley
Member, Research Committee, Wikimedia Foundation
E-mail: mayo.fus...@eui.eu
Skype: mayoneti
Phone (Spain): 0034-648877748

From: wiki-research-l-boun...@lists.wikimedia.org [wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of Jodi Schneider [jodi.schnei...@deri.org]
Sent: 20 April 2011 01:18
To: Research into Wikimedia content and communities
Subject: [Wiki-research-l] best practices for recruiting study participants?

What are the recommended ways to recruit Wikipedians for a research study? My thoughts are:

Specific recruitment (i.e. to particular populations/randomized samples):
- email?
- Talk page messages?

Generic recruitment:
- post to the Village Pump
- post to the appropriate project mailing list(s)

Does that seem right? Anybody willing to share successful email/Talk page messages (off-list is fine)? I'm particularly concerned about giving sufficient info, striking the right tone, and not being spammy (perhaps a hard balance to hit!).

-Jodi

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
We wrote a bunch of Python scripts for parsing Wikipedia dumps, with different goals. You can get them at https://github.com/phauly/wiki-network/

We also released some datasets of networks extracted from User Talk pages. See http://www.gnuband.org/2011/04/19/wikipedia_datasets_released/

Enjoy! ;)

P.

--
Paolo Massa
Email: paolo AT gnuband DOT org
Blog: http://gnuband.org

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
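As a toy sketch of what User Talk network extraction like the above produces: writing on another user's Talk page becomes a directed, weighted edge. The (writer, owner) pairs would come from parsing the dump; the ones below are made up for illustration, and none of this code is from the wiki-network repository itself.

```python
from collections import Counter

def talk_network(edits):
    """Build a weighted, directed user-talk network.

    Each edit is a (writer, talk_page_owner) pair: posting on someone's
    User Talk page adds a directed edge writer -> owner; repeated
    messages increase the edge weight.
    """
    weights = Counter()
    for writer, owner in edits:
        if writer != owner:  # skip users posting on their own talk page
            weights[(writer, owner)] += 1
    return weights

# Hypothetical edit events for demonstration.
edits = [("Alice", "Bob"), ("Alice", "Bob"), ("Bob", "Alice"), ("Carol", "Carol")]
net = talk_network(edits)
print(net[("Alice", "Bob")])  # -> 2
```

The resulting Counter is effectively a weighted edge list, which loads directly into graph libraries or spreadsheet tools for further analysis.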
Re: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
Not directly related to Wikipedia, but about wikis in general: WikiTeam [1] and their dumps [2] of wikis. Thanks to these dumps, you can compare your research results on the Wikipedia community with other wiki communities around the world.

[1] http://code.google.com/p/wikiteam/
[2] http://code.google.com/p/wikiteam/downloads/list?can=1

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
Hi everyone,

Thank you all for your replies; we really appreciate your cooperation. Below is a summary of the tools and data sets recommended by Torsten, Andrew, Paolo, and emijrp. We would also like to know whether there is an existing Wikipedia page that includes such a list, so that we can add to it. Otherwise, where would you suggest posting this list so that it is noticeable and useful for the community?

http://code.google.com/p/jwpl/ /* JWPL */
http://wikipedia-miner.sourceforge.net/ /* WikipediaMiner */
http://code.google.com/p/wikokit/ /* Wiktionary parser and visual interface */
https://github.com/phauly/wiki-network/ /* Python scripts for parsing Wikipedia dumps, with different goals */
http://www.gnuband.org/2011/04/19/wikipedia_datasets_released/ /* datasets of networks extracted from User Talk pages */
http://code.google.com/p/wikiteam/ /* WikiTeam */
http://code.google.com/p/wikiteam/downloads/list?can=1 /* WikiTeam wiki dumps */
http://www.research.ibm.com/visual/projects/history_flow/
http://meta.wikimedia.org/wiki/WikiXRay
http://statmediawiki.forja.rediris.es/index_en.html

Best regards,
Mohamad Mehdi

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l