Re: [Wikitech-l] require language dump for developing words and corresponding frequency
The dump site (http://download.wikimedia.org/) is still broken at the moment, but another way to build some word frequency data is by randomly sampling the wikis for the languages you are interested in. At least these Indic languages have Wikipedias of varying sizes:

Assamese: http://as.wikipedia.org
Bihari: http://bh.wikipedia.org
Bengali: http://bn.wikipedia.org
Bishnupriya Manipuri: http://bpy.wikipedia.org
Gujarati: http://gu.wikipedia.org
Hindi: http://hi.wikipedia.org
Kannada: http://kn.wikipedia.org
Kashmiri: http://ks.wikipedia.org
Marathi: http://mr.wikipedia.org
Nepali: http://ne.wikipedia.org
Nepal Bhasa: http://new.wikipedia.org
Oriya: http://or.wikipedia.org
Eastern Punjabi: http://pa.wikipedia.org
Western Punjabi: http://pnb.wikipedia.org
Sanskrit: http://sa.wikipedia.org
Sindhi: http://sd.wikipedia.org
Tamil: http://ta.wikipedia.org
Telugu: http://te.wikipedia.org
Urdu: http://ur.wikipedia.org

I have a tool that downloads random samples of wiki pages and strips the HTML for purposes such as this; let me know if you'd like to use it.

Good luck!

Andrew Dunbar (hippietrail)

On 14 December 2010 18:36, pravin@gmail.com wrote:
> Hi All,
>
> I am Pravin Satpute. I am working on language technology, and to build a word
> list with corresponding frequencies I need some web pages in Indic languages.
> Can I get the most recent dump, without en.wiki?
>
> Thanks,
> Pravin S
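[Editor's note: Andrew's tool itself is not shown in the thread. What follows is a minimal sketch of the same idea under the same assumptions — fetch Special:Random pages from one of the wikis listed above, strip the HTML, and tally word frequencies. Standard library only; the host and sample count are illustrative.]

    # A minimal sketch (not the tool mentioned above): sample random pages
    # from a wiki, strip the HTML, and tally word frequencies.
    import re
    import urllib.request
    from collections import Counter
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect text content, skipping script and style elements."""
        def __init__(self):
            super().__init__()
            self.parts = []
            self.skip = 0
        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip += 1
        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip:
                self.skip -= 1
        def handle_data(self, data):
            if not self.skip:
                self.parts.append(data)

    def word_frequencies(host, samples=50):
        counts = Counter()
        for _ in range(samples):
            req = urllib.request.Request(
                "http://%s/wiki/Special:Random" % host,
                headers={"User-Agent": "wordfreq-sampler/0.1 (demo)"})
            with urllib.request.urlopen(req) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            extractor = TextExtractor()
            extractor.feed(html)
            # \w+ matches letters in Indic scripts too (Unicode by default).
            counts.update(re.findall(r"\w+", " ".join(extractor.parts)))
        return counts

    for word, n in word_frequencies("hi.wikipedia.org", samples=10).most_common(20):
        print(n, word)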
Re: [Wikitech-l] How to find the version of a dump
On 14 December 2010 01:57, Monica shu monicashu...@gmail.com wrote:
> Thanks Diederik and Waksman,
>
> It seems that I need to parse the dump's article data to get this piece of
> information... Yes, that will be the last resort, but I think there may be
> some easier way...
>
> I just got home and checked the dump I downloaded. It was downloaded on June
> 10, 2010, and its size is 6117881141 bytes in bz2. I remember that when I
> downloaded it, it was the latest version at that moment. As the dumps are
> generated every N months, and the one I have is bigger than the 2010-01-30
> version that Waksman mentioned, my version should be from between February
> and June.

A Google search hints that enwiki-20100312-pages-articles.xml.bz2 might be the one with size 6117881141.

Andrew Dunbar (hippietrail)

> Does anybody remember the versions from this period, or did anyone happen to
> download the same version as me? Thanks very much for any related information!
>
> Best regards!
> Monica
>
> On Mon, Dec 13, 2010 at 3:24 PM, Shaun Waksman shaunwaks...@gmail.com wrote:
>> Hi Monica,
>>
>> The file sizes of the EN pages dumps that are available today are:
>>
>> 5204823166 enwiki-20100312-pages-articles.xml.7z
>> 5983814213 enwiki-20100130-pages-articles.xml.bz2
>>
>> Note that the former is in 7z and the latter is in bz2.
>>
>> Does this help?
>> Shaun
>>
>> On Mon, Dec 13, 2010 at 8:45 AM, Monica shu monicashu...@gmail.com wrote:
>>> Hi all,
>>>
>>> I downloaded a dump several months ago. By accident, I lost the version
>>> info of this dump, so I don't know when it was generated. Is there any
>>> place that lists info about the past dumps (such as size)?
>>>
>>> Thanks!
>>> Monica
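[Editor's note: for anyone trying to identify a mystery dump the same way, a minimal sketch. The size-to-filename table is illustrative, seeded only from the figures quoted in this thread; the 6117881141 entry is Andrew's unconfirmed Google hint, and a real table should be filled in from the published dump listings.]

    # A minimal sketch: match a local dump file against known byte sizes.
    import os

    KNOWN_SIZES = {
        5204823166: "enwiki-20100312-pages-articles.xml.7z",
        5983814213: "enwiki-20100130-pages-articles.xml.bz2",
        6117881141: "enwiki-20100312-pages-articles.xml.bz2",  # unconfirmed
    }

    def identify_dump(path):
        """Return the known dump filename matching this file's size, if any."""
        size = os.path.getsize(path)
        return KNOWN_SIZES.get(size, "unknown (size %d)" % size)

    print(identify_dump("enwiki-latest-pages-articles.xml.bz2"))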
Re: [Wikitech-l] How to find the version of a dump
On 14 December 2010 20:04, Andrew Dunbar hippytr...@gmail.com wrote:
> A Google search hints that enwiki-20100312-pages-articles.xml.bz2 might be
> the one with size 6117881141.
> [quoted text trimmed]

It should be trivial to add the dump date to the header of each dump file. Since in the files themselves the date field of the filename is often replaced by "latest", this could be very useful. It could also be useful to include the revision ID and timestamp of the latest revision, but I assume this would be a little more difficult. Should I file a feature request?

Andrew Dunbar (hippietrail)
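[Editor's note: for context, a pages-articles dump currently opens with a <mediawiki> element carrying <siteinfo> (site name, MediaWiki generator version) but no dump date, which is what makes the suggestion above attractive. A minimal sketch that stream-decompresses just enough of a .bz2 dump to show its header:]

    # A minimal sketch: print the opening lines of a .bz2 XML dump without
    # decompressing the whole multi-gigabyte file.
    import bz2

    def print_dump_header(path, max_lines=20):
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for i, line in enumerate(f):
                print(line.rstrip())
                # Stop after <siteinfo> closes; today that's all the
                # metadata the header holds.
                if "</siteinfo>" in line or i + 1 >= max_lines:
                    break

    print_dump_header("enwiki-latest-pages-articles.xml.bz2")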
[Wikitech-l] dataset1, xml dumps
For folks who have not been following the saga at http://wikitech.wikimedia.org/view/Dataset1: we were able to get the RAID array back in service last night on the XML data dumps server, and we are now busily copying data off of it to another host. There's about 11T of dumps to copy over; once that's done, we will start serving these dumps read-only to the public again. Because the state of the server hardware is still uncertain, we don't want to do anything that might put the data at risk until that copy has been made.

The replacement server is on order and we are watching that closely. We have also been working on deploying a server to run one round of dumps in the interim.

Thanks for your patience (which is a way of saying: I know you are all out of patience, as am I, but hang on just a little longer).

Ariel
Re: [Wikitech-l] require language dump for developing words and corresponding frequency
On 14 December 2010 14:28, Andrew Dunbar hippytr...@gmail.com wrote:
> The dump site (http://download.wikimedia.org/) is still broken at the moment,
> but another way to build some word frequency data is by randomly sampling
> the wikis for the languages you are interested in.
> [list of Indic-language Wikipedias trimmed]
> I have a tool that downloads random samples of wiki pages and strips the
> HTML for purposes such as this; let me know if you'd like to use it.

Yes, let me know — that would be very useful.

Thanks,
Pravin S
Re: [Wikitech-l] dataset1, xml dumps
Great news! Thanks for the update and thanks for all you guys' work getting it beaten back into shape. Keeping fingers crossed for all going well on the transfer...

-- brion

On Dec 14, 2010 1:12 AM, Ariel T. Glenn ar...@wikimedia.org wrote:
> For folks who have not been following the saga at
> http://wikitech.wikimedia.org/view/Dataset1: we were able to get the RAID
> array back in service last night on the XML data dumps server, and we are
> now busily copying data off of it to another host.
> [quoted text trimmed]
Re: [Wikitech-l] dataset1, xml dumps
+1

Diederik

On 2010-12-14, at 12:02, Brion Vibber br...@pobox.com wrote:
> Great news! Thanks for the update and thanks for all you guys' work getting
> it beaten back into shape. Keeping fingers crossed for all going well on the
> transfer...
>
> -- brion
> [quoted text trimmed]
Re: [Wikitech-l] How to find the version of a dump
On Mon, Dec 13, 2010 at 7:09 PM, Michael Gurlitz michael.gurl...@gmail.com wrote:
> I grabbed the following files in the days before the server broke, and I can
> set up a torrent file if anyone's interested, or I could FTP them to a
> server. 2010-10-11 was the last full Wikipedia dump that was completed.
>
> 6652983189 (6.2GB) enwiki-20101011-pages-articles.xml.bz2

I would very much like to get a copy of enwiki-20101011-pages-articles.xml.bz2, if that's possible. If you need a server to upload to, message me off-list and I can provide it.

-- James
Re: [Wikitech-l] dataset1, xml dumps
Thanks. Double good news:
http://lists.wikimedia.org/pipermail/foundation-l/2010-December/063088.html

2010/12/14 Ariel T. Glenn ar...@wikimedia.org:
> For folks who have not been following the saga at
> http://wikitech.wikimedia.org/view/Dataset1: we were able to get the RAID
> array back in service last night on the XML data dumps server, and we are
> now busily copying data off of it to another host.
> [quoted text trimmed]
Re: [Wikitech-l] How to find the version of a dump
Monica shu wrote:
> Hi emijrp,
>
> Here is my dump's info:
>
> enwiki-latest-pages-articles.xml.bz2
> a3a5ee062abc16a79d111273d4a1a99a
>
> Thanks~

I can't find that md5 for any dump. Here are the md5s of the latest enwiki pages-articles dumps:

a9506e8aedd3b830e059b7c8a3c0dbcd enwiki-20100904-pages-articles.xml.bz2
09ae0db25ae95af53296e812bc67554b enwiki-20100916-pages-articles.xml.bz2
7a4805475bba1599933b3acd5150bd4d enwiki-20101011-pages-articles.xml.bz2
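[Editor's note: for anyone reproducing this check, a minimal sketch that computes the md5 of a multi-gigabyte dump in streaming fashion rather than reading the whole file into memory; the filename is just an example.]

    # A minimal sketch: streaming md5 of a large dump, for comparison
    # against the published checksum lists.
    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks until EOF.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(md5_of("enwiki-latest-pages-articles.xml.bz2"))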
[Wikitech-l] A tool or web form for creating new pages
Hi all,

I'd like to make it easier for novice users to create Sign Language definition pages with videos for en.wiktionary's new Sign gloss: namespace. It's already possible to create such pages, but it requires a large number of steps, which can deter potential contributors.

I'd like to make a command-line tool or web form. The user would provide:

1. Their name and password*
2. The name of the page
3. The text contents of the page (definition, etymology, etc. as plain text fields)
4. A video of the sign (and maybe also a video of it in use)

The tool would then automatically:

0a. Check whether the page already exists (and stop if it does)
0b. Convert the video to a format appropriate for Commons, if needed
1. Log in as the user
2. Upload the video to Commons
3. Create the page with the desired contents

Is this a good idea? Is there something like this already that I could use as a basis? If this were a web form, how would I handle username+password securely?

Thanks,
Ben

*: Ideally I'd like to be able to help users who don't yet have accounts to make accounts, and also somehow automatically handle the account-linking business between Commons, Wiktionary, etc.
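[Editor's note: step 0a maps directly onto the MediaWiki web API — querying a title returns a "missing" flag for pages that don't exist. A minimal sketch, standard library only; the title and User-Agent string are placeholders.]

    # A minimal sketch of step 0a: ask the MediaWiki API whether a page exists.
    import json
    import urllib.parse
    import urllib.request

    def page_exists(api_url, title):
        params = urllib.parse.urlencode({
            "action": "query",
            "titles": title,
            "format": "json",
        })
        req = urllib.request.Request(
            api_url + "?" + params,
            headers={"User-Agent": "signgloss-tool/0.1 (demo)"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        # Missing pages come back with a "missing" key in the page record.
        return all("missing" not in page
                   for page in data["query"]["pages"].values())

    print(page_exists("https://en.wiktionary.org/w/api.php",
                      "Sign gloss:EXAMPLE"))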
Re: [Wikitech-l] A tool or web form for creating new pages
Ben Schwartz wrote:
> I'd like to make it easier for novice users to create Sign Language
> definition pages with videos for en.wiktionary's new Sign gloss: namespace.
> [quoted text trimmed]
> Is this a good idea? Is there something like this already that I could use
> as a basis?

I don't think that's appropriate as a single web form. You would want to make it a wizard with several steps, so you don't make someone record a video just to then tell them "We already have that page."

Neil may be interested in this. He recently made the UploadWizard.

> *: Ideally I'd like to be able to help users who don't yet have accounts to
> make accounts, and also somehow automatically handle the account-linking
> business between Commons, Wiktionary, etc.

That's not really a problem. All of these steps could be handled quite easily.

> If this were a web form, how would I handle username+password securely?

This is best done in an extension.
Re: [Wikitech-l] A tool or web form for creating new pages
On 12/14/2010 05:40 PM, Platonides wrote:
> Ben Schwartz wrote:
>> I'd like to make a command-line tool or web form. ... Is there something
>> like this already that I could use as a basis?
>
> I don't think that's appropriate as a single web form. You would want to
> make it a wizard with several steps, so you don't make someone record a
> video just to then tell them "We already have that page."

Oh, good idea... although an AJAXy web form could also serve that purpose.

> Neil may be interested in this. He recently made the UploadWizard.

I hadn't heard of this extension, but it looks interesting. I presume it's not yet active on the actual Commons?

>> *: Ideally I'd like to be able to help users who don't yet have accounts
>> to make accounts, and also somehow automatically handle the account-linking
>> business between Commons, Wiktionary, etc.
>
> That's not really a problem. All of these steps could be handled quite
> easily.
>
>> If this were a web form, how would I handle username+password securely?
>
> This is best done in an extension.

Can a single extension span Commons and Wiktionary? Would I have to convince both of them to install my extension before I could use it? I had planned to prototype this on a third-party server, precisely to avoid interfering with the real infrastructure.

I'm trying to minimize the number of required clicks, so I'd hate to push people through a multi-step upload wizard on one site and then a separate definition-page-creation wizard on another site.

Thanks,
Ben
Re: [Wikitech-l] A tool or web form for creating new pages
@2010-12-15 00:12, Ben Schwartz:
> On 12/14/2010 05:40 PM, Platonides wrote:
>> Neil may be interested in this. He recently made the UploadWizard.
>
> I hadn't heard of this extension, but it looks interesting. I presume it's
> not yet active on the actual Commons?

Recently activated:
http://commons.wikimedia.org/wiki/Special:UploadWizard

Regards,
Nux.
Re: [Wikitech-l] A tool or web form for creating new pages
On 12/14/10 3:12 PM, Ben Schwartz wrote:
> I hadn't heard of this extension, but it looks interesting. I presume it's
> not yet active on the actual Commons?

Nope, we deployed it (mostly just to see if we could), and it works for many people, but it's still buggy, so it's not widely promoted.

http://commons.wikimedia.org/wiki/Special:UploadWizard
bugzilla - http://bit.ly/UploadWizardBugs

> Can a single extension span Commons and Wiktionary?

With a minimum of cooperation between the two, we can put an extension on Wiktionary that uploads to Commons, and then you can configure Wiktionary to get some of its media from Commons via the InstantCommons extension.

> Would I have to convince both of them to install my extension before I can
> use it?

No.

> I'd hate to push people through a multi-step upload wizard on one site, and
> then a separate definition-page-creation wizard on another site.

The usability project has a working system with the Add Media Wizard, where you can drop in a media file (and even upload it to Commons) while editing an article. I don't know if that meets your needs.

I hadn't really thought about special-purpose upload wizards, but it could certainly be done. Maybe the page could be invoked in special ways for slightly altered flows. At the moment my main goal is to get the number of crucial bugs down, but this is a cool idea. I happen to have an interest (although not a talent) in ASL, since I have a deaf friend, so this might be something I could work on in my spare time.

-- Neil Kandalgaonkar |) ne...@wikimedia.org