The dump site (http://download.wikimedia.org/) is still broken at the
moment, but another way to build word frequency data is to randomly
sample the wikis for the languages you are interested in. At least
these Indic languages have Wikipedias, of varying sizes:

Assamese http://as.wikipedia.org
Bihari http://bh.wikipedia.org
Bengali http://bn.wikipedia.org
Bishnupriya Manipuri http://bpy.wikipedia.org
Gujarati http://gu.wikipedia.org
Hindi http://hi.wikipedia.org
Kannada http://kn.wikipedia.org
Kashmiri http://ks.wikipedia.org
Marathi http://mr.wikipedia.org
Nepali http://ne.wikipedia.org
Nepal Bhasa http://new.wikipedia.org
Oriya http://or.wikipedia.org
Eastern Punjabi http://pa.wikipedia.org
Western Punjabi http://pnb.wikipedia.org
Sanskrit http://sa.wikipedia.org
Sindhi http://sd.wikipedia.org
Tamil http://ta.wikipedia.org
Telugu http://te.wikipedia.org
Urdu http://ur.wikipedia.org

I have a tool that downloads random samples of wiki pages and strips
the HTML for purposes such as this. You are welcome to use it if you'd
like.
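
For illustration only, here is a minimal Python sketch of the sampling
idea described above. It is not the tool mentioned in this message. It
assumes each wiki serves a random article at /wiki/Special:Random, and
the names USER_AGENT, strip_html, tokenize, count_words and
sample_word_counts are made up for this example. Words are split on
Unicode letter and mark categories rather than with a \w regex, because
Indic vowel signs and viramas are combining marks and would otherwise
break words apart.

```python
import time
import unicodedata
import urllib.request
from collections import Counter
from html.parser import HTMLParser

# Placeholder value (an assumption for this sketch): identify your own client.
USER_AGENT = "wordfreq-sample-sketch/0.1 (replace with your contact details)"


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Split text into words made of letters and combining marks.

    ZWNJ and ZWJ (U+200C, U+200D) are kept because several Indic
    orthographies use them inside words.
    """
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LM" or ch in "\u200c\u200d":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def count_words(texts):
    """Count word occurrences across an iterable of plain-text strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def fetch_random_page(lang):
    """Download the HTML of one random page from <lang>.wikipedia.org."""
    url = "https://%s.wikipedia.org/wiki/Special:Random" % lang
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def sample_word_counts(lang, n_pages, delay_seconds=1.0):
    """Fetch n_pages random pages and return their combined word counts."""
    texts = []
    for _ in range(n_pages):
        texts.append(strip_html(fetch_random_page(lang)))
        time.sleep(delay_seconds)  # pause between requests to go easy on the servers
    return count_words(texts)
```

Something like sample_word_counts("hi", 50).most_common(100) would then
give a rough list for Hindi. Note that the stripped text still includes
the site's navigation and interface strings, so those words will be
over-represented unless you filter them out.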

Good luck!

Andrew Dunbar (hippietrail)

On 14 December 2010 18:36, pravin....@gmail.com <pravin....@gmail.com> wrote:
> Hi All,
>
> I am Pravin Satpute. I am working on language technology, and to build a
> list of words and their frequencies I need some web pages in Indic languages.
>
> Can I get the most recent dump without en.wiki?
>
> Thanks,
> Pravin s
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
