Re: [Wikitech-l] require language dump for developing words and corresponding frequency

2010-12-14 Thread Andrew Dunbar
The dump site (http://download.wikimedia.org/) is still broken at the
moment but another way to build some word frequency data is by
randomly sampling the wikis for the languages you are interested in.
At least these Indic languages have Wikipedias of varying sizes:

Assamese http://as.wikipedia.org
Bihari http://bh.wikipedia.org
Bengali http://bn.wikipedia.org
Bishnupriya Manipuri http://bpy.wikipedia.org
Gujarati http://gu.wikipedia.org
Hindi http://hi.wikipedia.org
Kannada http://kn.wikipedia.org
Kashmiri http://ks.wikipedia.org
Marathi http://mr.wikipedia.org
Nepali http://ne.wikipedia.org
Nepal Bhasa http://new.wikipedia.org
Oriya http://or.wikipedia.org
Eastern Punjabi http://pa.wikipedia.org
Western Punjabi http://pnb.wikipedia.org
Sanskrit http://sa.wikipedia.org
Sindhi  http://sd.wikipedia.org
Tamil http://ta.wikipedia.org
Telugu http://te.wikipedia.org
Urdu http://ur.wikipedia.org

If you'd like to use it, I have a tool that downloads random samples of
wiki pages and strips the HTML for purposes such as this.
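
(Not Andrew's tool, but a rough sketch of the sampling approach he
describes, using the standard MediaWiki action API from Python with the
requests library; the wiki URL, sample size, and tokenizer are all
illustrative.)

import collections
import re
import requests

def sample_word_frequencies(api_url, pages=30):
    """Sample random articles from a wiki and tally word frequencies."""
    freq = collections.Counter()
    session = requests.Session()
    # Ask the API for random main-namespace titles.
    random_pages = session.get(api_url, params={
        "action": "query", "list": "random", "rnnamespace": 0,
        "rnlimit": pages, "format": "json",
    }).json()["query"]["random"]
    for entry in random_pages:
        # Fetch the rendered HTML of each page and strip tags and entities.
        html = session.get(api_url, params={
            "action": "parse", "page": entry["title"],
            "prop": "text", "format": "json",
        }).json()["parse"]["text"]["*"]
        text = re.sub(r"<[^>]+>|&[#\w]+;", " ", html)
        freq.update(re.findall(r"\w+", text))
    return freq

if __name__ == "__main__":
    counts = sample_word_frequencies("https://hi.wikipedia.org/w/api.php")
    for word, count in counts.most_common(25):
        print(count, word)

Pointing api_url at any of the wikis listed above gives a rough frequency
list; for anything serious you would want a much larger sample and a
language-aware tokenizer.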

Good luck!

Andrew Dunbar (hippietrail)

On 14 December 2010 18:36, pravin@gmail.com wrote:
 Hi All,

 I am Pravin Satpute. I am working on language technology, and to build a
 word list with corresponding frequencies I need some web pages in Indic
 languages.

 Can I get the most recent dump without en.wiki?

 Thanks,
 Pravin s




Re: [Wikitech-l] How to find the version of a dump

2010-12-14 Thread Andrew Dunbar
On 14 December 2010 01:57, Monica shu monicashu...@gmail.com wrote:
 Thanks Diederik and Waksman,

 It seems that I need to parse the dump's article data to get this piece
 of information...
 Yes, that will be the last resort, but I think there may be some easier
 way...

 I just got home and checked the dump I've downloaded.
 I downloaded it on June 10, 2010; the size is 6117881141 bytes, in bz2.
 I remember that when I downloaded it, it was the latest version at that moment.
 As the dumps are generated every few months, and the one I have is bigger than
 the 2010-01-30 version Waksman mentioned, my version should be from between
 February and June.

A Google search hints that enwiki-20100312-pages-articles.xml.bz2
might be the one with size 6117881141.
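
For anyone else trying to identify a mystery download, the size comparison
is easy to script; the local filename below is hypothetical, and the
20100312 entry is only the guess from the Google search, so treat it as
unconfirmed.

import os

# Byte sizes mentioned in this thread (the 20100312 one is unconfirmed).
KNOWN_SIZES = {
    5983814213: "enwiki-20100130-pages-articles.xml.bz2",
    6117881141: "enwiki-20100312-pages-articles.xml.bz2",
}

size = os.path.getsize("enwiki-latest-pages-articles.xml.bz2")
print(size, "->", KNOWN_SIZES.get(size, "no match in this thread"))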

Andrew Dunbar (hippietrail)


 Does anybody remember the versions from this period, or did anyone happen to
 download the same version as me?

 Thanks very much for any related information!


 Best regards!
 Monica




 On Mon, Dec 13, 2010 at 3:24 PM, Shaun Waksman shaunwaks...@gmail.com wrote:

 Hi Monica,

 The file sizes of the EN pages dumps that are available today are:

 5204823166  enwiki-20100312-pages-articles.xml.7z
 5983814213  enwiki-20100130-pages-articles.xml.bz2

 Note that the former is in 7z and the latter is in bz2.

 Does this help?

 Shaun


 On Mon, Dec 13, 2010 at 8:45 AM, Monica shu monicashu...@gmail.com
 wrote:

  Hi all,
 
  I downloaded a dump several months ago.
  By accident, I lost the version info of this dump, so I don't know when
  this dump was generated.
  Is there any place that lists info about the past dumps (such as
  size...)?
 
  Thanks!
 
  Monica




Re: [Wikitech-l] How to find the version of a dump

2010-12-14 Thread Andrew Dunbar
On 14 December 2010 20:04, Andrew Dunbar hippytr...@gmail.com wrote:
 A Google search hints that enwiki-20100312-pages-articles.xml.bz2
 might be the one with size 6117881141.

 Andrew Dunbar (hippietrail)





It should be trivial to add the dump date to the header of each dump
file. Since in the files themselves the date field of the filename is
often replaced by "latest", this could be very useful. It could also be
useful to include the revision ID and timestamp of the latest revision,
but I assume this would be a little more difficult. Should I file a
feature request?
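
For reference, a rough sketch of what the dump header currently exposes:
the <siteinfo> block at the top of a pages-articles dump carries the site
name and generator version, but no dump date, which is why adding one
would help. The filename below is illustrative.

import bz2

def read_siteinfo(path, max_bytes=64 * 1024):
    """Return the <siteinfo> block from the top of a dump, if present."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        head = f.read(max_bytes)
    start = head.find("<siteinfo>")
    end = head.find("</siteinfo>")
    if start == -1 or end == -1:
        return None
    return head[start:end + len("</siteinfo>")]

if __name__ == "__main__":
    print(read_siteinfo("enwiki-latest-pages-articles.xml.bz2"))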

Andrew Dunbar (hippietrail)



[Wikitech-l] dataset1, xml dumps

2010-12-14 Thread Ariel T. Glenn
For folks who have not been following the saga on 
http://wikitech.wikimedia.org/view/Dataset1
we were able to get the raid array back in service last night on the XML
data dumps server, and we are now busily copying data off of it to
another host.  There's about 11T of dumps to copy over; once that's done
we will start serving these dumps read-only to the public again.
Because the state of the server hardware is still uncertain, we don't
want to do anything that might put the data at risk until that copy has
been made.

The replacement server is on order and we are watching that closely. 

We have also been working on deploying a server to run one round of
dumps in the interim.

Thanks for your patience (which is a way of saying, I know you are all
out of patience, as am I, but hang on just a little longer).

Ariel





Re: [Wikitech-l] require language dump for developing words and corresponding frequency

2010-12-14 Thread pravin....@gmail.com
On 14 December 2010 14:28, Andrew Dunbar hippytr...@gmail.com wrote:

 If you'd like to use it I have a tool that downloads random samples of
 wiki pages and strips the HTML for purposes such as this.



Yeah, let me know; that would be very useful.

Thanks,
Pravin s


Re: [Wikitech-l] dataset1, xml dumps

2010-12-14 Thread Brion Vibber
Great news! Thanks for the update and thanks for all you guys' work getting
it beaten back into shape. Keeping fingers crossed for all going well on the
transfer...

-- brion


Re: [Wikitech-l] dataset1, xml dumps

2010-12-14 Thread Diederik van Liere
+1
Diederik 



Re: [Wikitech-l] How to find the version of a dump

2010-12-14 Thread James Linden
On Mon, Dec 13, 2010 at 7:09 PM, Michael Gurlitz
michael.gurl...@gmail.com wrote:
 I grabbed the following files in the days before the server broke, and
 I can set up a torrent file if anyone's interested, or I could FTP
 them to a server. 2010-10-11 was the last full Wikipedia dump that was
 completed.
 6652983189 (6.2GB) enwiki-20101011-pages-articles.xml.bz2

I would very much like to get a copy of
enwiki-20101011-pages-articles.xml.bz2, if that's possible.

If you need a server to upload to, message me off-list and I can provide it.

-- James



Re: [Wikitech-l] dataset1, xml dumps

2010-12-14 Thread emijrp
Thanks.

Double good news:
http://lists.wikimedia.org/pipermail/foundation-l/2010-December/063088.html



Re: [Wikitech-l] How to find the version of a dump

2010-12-14 Thread Platonides
Monica shu wrote:
 Hi emijrp,
 
 Here is my dump's info:
 
 enwiki-latest-pages-articles.xml.bz2
 a3a5ee062abc16a79d111273d4a1a99a
 
 Thanks~

I can't find that md5 for any dump.

Here are the md5s of the latest enwiki pages-articles:
a9506e8aedd3b830e059b7c8a3c0dbcd  enwiki-20100904-pages-articles.xml.bz2
09ae0db25ae95af53296e812bc67554b  enwiki-20100916-pages-articles.xml.bz2
7a4805475bba1599933b3acd5150bd4d  enwiki-20101011-pages-articles.xml.bz2
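
Computing the checksum of a local copy for comparison is straightforward,
if slow on a ~6 GB file; the filename below is illustrative.

import hashlib

def md5sum(path, chunk_size=1024 * 1024):
    """md5 of a large file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5sum("enwiki-latest-pages-articles.xml.bz2"))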




[Wikitech-l] A tool or web form for creating new pages

2010-12-14 Thread Ben Schwartz
Hi all,

I'd like to make it easier for novice users to create Sign Language
definition pages with videos for en.wiktionary's new Sign gloss:
namespace.  It's already possible to create such pages, but it requires a
large number of steps, which can deter potential contributors.

I'd like to make a command-line tool or web form.  The user would provide:
1.  Their name and password*
2.  The name of the page
3.  The text contents of the page (definition, etymology, etc. as plain
text fields)
4.  A video of the sign (and maybe also a video of it in use)

The tool would then automatically (see the sketch after this list):
0a.  Check if the page already exists (and stop if it does)
0b.  Convert the video to a format appropriate for Commons, if needed
1.  Log in as the user
2.  Upload the video to Commons
3.  Create the page with the desired contents.
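
Here is a rough sketch of steps 0a, 1 and 3 against the standard MediaWiki
action API (via Python's requests library); it is not an existing tool,
the page title and credential handling are illustrative, and a real
version should use a bot password or OAuth rather than the user's main
password.

import requests

WIKT_API = "https://en.wiktionary.org/w/api.php"

def page_exists(session, title):
    """Step 0a: stop early if the gloss page already exists."""
    result = session.get(WIKT_API, params={
        "action": "query", "titles": title, "format": "json"}).json()
    # The API reports missing pages under negative page IDs.
    return "-1" not in result["query"]["pages"]

def log_in(session, username, password):
    """Step 1: log in and return a CSRF token for subsequent edits."""
    login_token = session.get(WIKT_API, params={
        "action": "query", "meta": "tokens", "type": "login",
        "format": "json"}).json()["query"]["tokens"]["logintoken"]
    session.post(WIKT_API, data={
        "action": "login", "lgname": username, "lgpassword": password,
        "lgtoken": login_token, "format": "json"})
    return session.get(WIKT_API, params={
        "action": "query", "meta": "tokens", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

def create_page(session, csrf_token, title, text):
    """Step 3: create the page, failing rather than overwriting."""
    return session.post(WIKT_API, data={
        "action": "edit", "title": title, "text": text, "createonly": 1,
        "token": csrf_token, "format": "json"}).json()

Step 2, the Commons upload, would be a similar action=upload POST against
the Commons API; step 0b (video conversion) is outside the API entirely.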

Is this a good idea?

Is there something like this already that I could use as a basis?

If this were a web form, how would I handle username+password securely?

Thanks,
Ben

*:  Ideally I'd like to be able to help users who don't yet have accounts
to make accounts, and also somehow automatically handle the
account-linking business between Commons, Wiktionary, etc.




Re: [Wikitech-l] A tool or web form for creating new pages

2010-12-14 Thread Platonides
Ben Schwartz wrote:
 Hi all,
 
 I'd like to make it easier for novice users to create Sign Language
 definition pages with videos for en.wiktionary's new Sign gloss:
 namespace.  It's already possible to create such pages, but it requires a
 large number of steps, which can deter potential contributors.
 
 I'd like to make a command-line tool or web-form.  The user would provide
 1.  Their name and password*
 2.  The name of the page
 3.  The text contents of the page (definition, etymology, etc. as plain
 text fields)
 4.  A video of the sign (and maybe also a video of it in use)
 
 The tool would then automatically
 0a.  Check if the page already exists (and stop if it does)
 0b.  Convert the video to a format appropriate for Commons, if needed
 1.  Log in as the user
 2.  Upload the video to Commons
 3.  Create the page with the desired contents.
 
 Is this a good idea?
 
 Is there something like this already that I could use as a basis?

I don't think that's appropriate as a single web form. You would want to
make it a wizard with several steps, so you don't make the user record a
video only to then be told "We already have that page."
Neil may be interested in this. He recently made the UploadWizard.

 *:  Ideally I'd like to be able to help users who don't yet have accounts
 to make accounts, and also somehow automatically handle the
 account-linking business between Commons, Wiktionary, etc.

That's not really a problem. All of these steps could be handled quite
easily.


 If this were a web form, how would I handle username+password securely?

This is best done in an extension.





Re: [Wikitech-l] A tool or web form for creating new pages

2010-12-14 Thread Ben Schwartz
On 12/14/2010 05:40 PM, Platonides wrote:
 Ben Schwartz wrote:
 I'd like to make a command-line tool or web-form.
...
 Is there something like this already that I could use as a basis?
 
 I don't think that's appropiate as a single web form. You would want to
 make it a Wizard with several steps, so you don't make him record a
 video just to then tell We already have that page.

Oh, good idea... although an AJAXy webform could also serve that purpose.

 Neil may be interested on this. He recently made the UploadWizard.

I hadn't heard of this extension, but it looks interesting.  I presume
it's not yet active on the actual Commons?

 *:  Ideally I'd like to be able to help users who don't yet have accounts
 to make accounts, and also somehow automatically handle the
 account-linking business between Commons, Wiktionary, etc.
 
 That's not really a problem. All of these steps could be handled quite
 easily.
 
 
 If this were a web form, how would I handle username+password securely?
 
 This is best done in an extension.

Can a single extension span Commons and Wiktionary?  Would I have to
convince both of them to install my extension before I can use it?  I had
planned to prototype this on a third-party server, precisely to avoid
interfering with the real infrastructure.  I'm trying to minimize the
number of required clicks, so I'd hate to push people through a multi-step
upload wizard on one site, and then a separate definition-page-creation
wizard on another site.

Thanks,
Ben




Re: [Wikitech-l] A tool or web form for creating new pages

2010-12-14 Thread Maciej Jaros
@2010-12-15 00:12, Ben Schwartz:
 On 12/14/2010 05:40 PM, Platonides wrote:
 Neil may be interested on this. He recently made the UploadWizard.
 I hadn't heard of this extension, but it looks interesting.  I presume
 it's not yet active on the actual Commons?

Recently activated:
http://commons.wikimedia.org/wiki/Special:UploadWizard

Regards,
Nux.



Re: [Wikitech-l] A tool or web form for creating new pages

2010-12-14 Thread Neil Kandalgaonkar
On 12/14/10 3:12 PM, Ben Schwartz wrote:
 I hadn't heard of this extension, but it looks interesting.  I presume
 it's not yet active on the actual Commons?

Nope, we deployed it (mostly just to see if we could), and it works for 
many people, but it's still buggy, so it's not widely promoted.

   http://commons.wikimedia.org/wiki/Special:UploadWizard

   bugzilla - http://bit.ly/UploadWizardBugs


 Can a single extension span Commons and Wiktionary?

With a minimum of cooperation between the two, we can put an extension 
on Wiktionary that uploads to Commons, and then you can configure 
Wiktionary to get some of its media from Commons via the InstantCommons 
extension.
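
From the client side (not the extension approach itself), the cross-wiki
upload step could look roughly like the sketch below; the filenames and
token handling are illustrative, with the CSRF token coming from a login
against the Commons API as in the earlier sketch.

import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def upload_video(session, csrf_token, local_path, dest_name, description):
    """Upload a local video to Commons; the wiki page then refers to it by name."""
    with open(local_path, "rb") as video:
        return session.post(COMMONS_API, data={
            "action": "upload", "filename": dest_name,
            "comment": description, "text": description,
            "token": csrf_token, "format": "json",
        }, files={"file": video}).json()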


 Would I have to
 convince both of them to install my extension before I can use it?

No.

 I'd hate to push people through a multi-step
 upload wizard on one site, and then a separate definition-page-creation
 wizard on another site.

The usability project has a working system with Add Media Wizard where 
you can drop in a media file (and even upload it to Commons) while 
editing an article. I don't know if that meets your needs.

I hadn't really thought about special-purpose upload wizards, but it 
could certainly be done. Maybe the page could be invoked in special ways 
for slightly altered flows.

At the moment my main goal is get the number of crucial bugs down, but 
this is a cool idea. I happen to have an interest (although not a 
talent) in ASL since I have a deaf friend, so this might be something I 
could work on in my spare time.

-- 
Neil Kandalgaonkar  |) ne...@wikimedia.org
