Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
Thank you for your response, Hèctor.

I read the proposal for the Hindi-Bengali translator. There are no open-source dictionaries for Bhojpuri (though there are resources for building a Bhojpuri corpus), so I have been using a hard copy of a BHO-HIN dictionary to add the pairs manually. I did some rough calculations, and I should be able to add at least 8,000 words to the monodix. Based on my experience with Apertium, adding words to the bidix at the same time makes the work easier, so I expect roughly the same number of words in the bidix too.

However, I don't think I will be able to achieve a WER below 20% with 8,000 words. Should I aim for a WER of around 30% instead? Since the time for GSoC has been reduced, I am planning to modify my proposal, and input from the mentors would be extremely helpful.

On Wed, 7 Apr 2021 at 20:24, Hèctor Alòs i Font wrote:

> As for the project. I would advise you to look at Gourab Chakraborty's
> proposal for a Hindi-Bengali translator and the comments on it. Most of the
> comments apply to your proposal as well. The following message would be
> useful to you, for instance:
> https://sourceforge.net/p/apertium/mailman/message/37251899/
>
> Your proposal seems to me unrealistic. 10,000 words in the monodix (and
> how many in the bidix?) are not enough for a WER below 20%, I think (maybe
> for two extremely close related languages).

___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
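For readers less familiar with the metric discussed above: WER (word error rate) is commonly computed as the word-level edit distance between the system output and a reference translation, divided by the number of words in the reference. The following is only a minimal sketch under that assumption; the function name `word_error_rate` and the placeholder sentences are hypothetical, and it is not the evaluation script Apertium itself uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real Bhojpuri or Hindi: one wrong word out of five.
print(word_error_rate("w1 w2 w3 w4 w5", "w1 w2 xx w4 w5"))
```

Under this definition, a 20% target means roughly one word in five of the output may need to be changed to match the reference, and a 30% target roughly three words in ten.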
Re: [Apertium-stuff] GSoC proposal draft: Developing a Morphological Analyzer for Torwali Language
Hi Naeem,

Thanks a lot for your very good and interesting draft application. Torwali is an excellent language for Apertium. You know the challenges it presents and the work on it, and you have shown that you are committed to the language and the project.

I am not a specialist in lexc-twol, but I see a few general things that would improve your application:

* The coding challenge is very important. It proves that you understand how Apertium works (not only theoretically) and that you can do the job. So do it as well as you can now. Don't leave it until after the application period.
* Your commitment of 30 hours per week is welcome, but bear in mind that it is much more than what Google is asking for this year.
* You want to enter 50,000+ words in the morphological analyser. That's a huge amount, but your work plan doesn't say when you are going to do it. You would need to show how many words, and of which grammatical categories, you would add in each time slot (two weeks in your case). Usually we start with the closed categories. Once you detail these numbers in your proposal, we will see how many words you will be able to reach.
* I have no idea how it is in the case of Dardic languages, but the assignment of words to categories is usually not trivial in Indo-European languages. Do existing works already have lists of words assigned to paradigms? For example, lists of verbs following one model or another. If not, the time needed for assignment increases. We need to know this in order to judge the feasibility of introducing 50,000, 30,000 or 20,000 words.
* Are there extensive word lists available in electronic format, with grammatical categories, that you could use for your work? They should be free. If they are copyrighted, they cannot be (semi-)automatically imported into Apertium.
* Given the very limited time we have this year for GSoC projects, building a complete morphological analyser from scratch is very likely a perfectly reasonable goal. Still, before putting so many words into it (especially if you have to add them manually), I think it would be reasonable to spend a couple of weeks training a morphological disambiguator.

Hèctor

Message from Naeemuddin Hadi on Thu, 8 Apr 2021 at 1:46:

> I am Naeem, a student of UET Peshawar. I want to participate in GSoC
> 2021. I am working to create a morphological analyzer for an endangered
> language of northern Pakistan called Torwali.
> I have prepared a draft proposal and will appreciate feedbacks before
> final submission.
[Apertium-stuff] GSoC proposal draft: Developing a Morphological Analyzer for Torwali Language
Hello everyone,

I am Naeem, a student at UET Peshawar. I want to participate in GSoC 2021. I am working on a morphological analyzer for Torwali, an endangered language of northern Pakistan.

I have prepared a draft proposal and would appreciate feedback before the final submission. Links related to the coding challenge are included in the draft.

Link (draft):
https://drive.google.com/file/d/1hnu6gRWVN3LjjxOj0BvimvJ56AIKfe6q/view?usp=sharing

Regards,
Naeem
Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
Hi, Anuradha.

Thanks for your proposal draft. First, I would like to tell you that if Apertium is a rule-based translation system, it is because this paradigm still makes sense for many languages (indeed, for the vast majority of them). If Bhojpuri has extensive electronic language resources and, in particular, bilingual corpora, then Apertium is probably not the best approach. But this is probably not the case. If it were, Bhojpuri would probably already be on Google Translate.

As for the project, I would advise you to look at Gourab Chakraborty's proposal for a Hindi-Bengali translator and the comments on it. Most of the comments apply to your proposal as well. The following message would be useful to you, for instance:
https://sourceforge.net/p/apertium/mailman/message/37251899/

Your proposal seems unrealistic to me. I think 10,000 words in the monodix (and how many in the bidix?) are not enough for a WER below 20% (except maybe for two extremely closely related languages).

To evaluate your proposal better, I'd like answers to some basic questions:

* What is the current state of the Bhojpuri language and, possibly, of the Bhojpuri-Hindi language pair in Apertium?
* Would you have to write a whole Bhojpuri morphological analyser from scratch and then add some 10,000 words, manually assigning each to a paradigm? How much time will you need for this?
* Where would you get the bilingual dictionary from? Would you have to create it yourself? Are there freely available bilingual electronic dictionaries (e.g. Wiktionary)?
* Would you work on a Bhojpuri-to-Hindi translator or on a Hindi-to-Bhojpuri one? In either case there will be quite a lot of work on morphological disambiguation. But for one side you'll only have to do it once. If both Hindi-to-Bhojpuri and Hindi-to-Bengali are chosen (which is entirely possible), this work can be divided between the two projects.

There is nothing wrong with doing all this work by hand, if needed. It depends on the state of the language resources for the given language. But we need to know to what extent you will have to do this time-consuming work.

Even when we had twice the time, most projects did not manage to produce a working translator for a new language pair. In the current conditions, it is even more difficult.

Hèctor

Message from Anuradha Pandey on Wed, 7 Apr 2021 at 16:28:

> I have prepared a rough draft for the same and I am planning to build
> Bhojpuri(BHO)-Hindi(HIN) MT pair. I am improving my translation system for
> the coding challenge and I will update my work on the GitHub repository
> mentioned in the draft. It would be really helpful if I could get some
> feedback before I make the final submission.
Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
Rajarshi Roychoudhury wrote:

> Bhojpuri and Hindi are very closely related language pairs
> As far as I know(correct me if I am wrong) , apart from some minor
> phoenetical changes they can be considered identical pairs .

Seems like a good fit for Apertium then :) considering one of the most popular pairs in Apertium is Nynorsk–Bokmål.

Here's a sentence in Nynorsk:

- Dette språkparet er kjempepopulært, veldig rart når det er så likt.

And here's the same sentence translated into Bokmål:

- Dette språkparet er kjempepopulært, veldig rart når det er så likt.

I could give a tree structure, but I think you get the point. If people write or want to write things in Bhojpuri, then it would be useful to have an MT system, and if it doesn't differ much from Hindi, then it's more likely to succeed in a (short) Apertium GSoC project.
Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
# in the grammar

On Wed, Apr 7, 2021, 19:34 Rajarshi Roychoudhury wrote:

> Please give an example where CFG vary significantly in the 2 languages
Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
Please give an example where the CFGs vary significantly between the two languages.

On Wed, Apr 7, 2021, 19:25 Anuradha Pandey wrote:

> Yes, I did look into the constraint grammar and the two languages vary
> significantly though lemmas in Bhojpuri are mostly an extension to those in
> Hindi.
Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
Yes, I did look into the constraint grammar, and the two languages vary significantly, though lemmas in Bhojpuri are mostly an extension of those in Hindi. So what would you suggest? Should I translate to Marathi instead? I ask because I am proficient in Hindi, English, Marathi, and Bhojpuri.

On Wed, 7 Apr 2021 at 19:11, Rajarshi Roychoudhury wrote:

> Bhojpuri and Hindi are very closely related language pairs
> As far as I know(correct me if I am wrong) , apart from some minor
> phoenetical changes they can be considered identical pairs . Have you tried
> building disambiguation rules? What are their structures?
Re: [Apertium-stuff] GSOC proposal draft - building a prototype MT system
Bhojpuri and Hindi are very closely related languages. As far as I know (correct me if I am wrong), apart from some minor phonetic changes they can be considered identical. Have you tried building disambiguation rules? What are their structures?

On Wed, Apr 7, 2021, 18:57 Anuradha Pandey wrote:

> I have prepared a rough draft for the same and I am planning to build
> Bhojpuri(BHO)-Hindi(HIN) MT pair.
[Apertium-stuff] GSOC proposal draft - building a prototype MT system
Hello everyone,

I am Anuradha Pandey, a sophomore student at BITS Pilani. I am interested in participating in GSoC 2021, on the project "*Develop a prototype MT system for a strategic language pair*".

I have prepared a rough draft for it, and I am planning to build a Bhojpuri (BHO)-Hindi (HIN) MT pair. I am improving my translation system for the coding challenge, and I will update my work in the GitHub repository mentioned in the draft. It would be really helpful if I could get some feedback before I make the final submission.

Link to the draft:
https://docs.google.com/document/d/1U19gJ3TMKYkYsp-FRthrvXkCRJUnNYSYKi46XhvZGOE/edit?usp=sharing

Thanks & Regards,
Anuradha Pandey
IRC: Anuradha_Pandey
[Apertium-stuff] GSoC proposal draft: User friendly lexical training
Hello everyone,

I am Vivek Vardhan Adepu, an undergraduate at IIT Kharagpur. I am interested in participating in GSoC this year. I would like to work on the project "User-friendly lexical training" and have written a draft proposal for it (link below).

It would be really helpful if someone could give feedback on my proposal so that I can improve it before the final submission.

https://docs.google.com/document/d/1YAw5M0-wSqVxfntJTdWLRutvpQfFlS2BscFjD5yMmZg/edit?usp=sharing

Regards,
Vivek
IRC: naan_dhaan/vivekvelda