Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining (Oren Bochman)
Dear Oren Bochman,

I am very pleased to hear from you. My familiarity with the requirements, on a scale of 5, is as follows:

1. Java and other programming languages :: *4.5*. I have taken courses on Java, C, and C++, and I have used Python extensively in my projects. I am very comfortable with the syntax and semantics, and understanding different libraries won't be difficult.
2. PHP :: *3.5*. I have used PHP in a project and am currently taking a course on it at my university.
3. Apache Lucene :: *2*. I was not very familiar with this library until recently. However, I am very willing to learn it as soon as possible and to be comfortable with it before the coding period starts.
4. Natural Language Processing :: *4*. Language processing and data are my major interests, and I have done all my projects on NLP. I have taken up the NLP course being offered at coursera.org, and NLP is what I discuss with my professors at my university too.
5. Computational Linguistics and WordNet :: *4*. I am using the principles of computational linguistics and WordNet in my current project, an automatic essay grader. I have also chosen Data Mining as an elective and am comfortable with the field.

I was looking for some clarifications regarding the proposed ideas:

1. Regarding the first project, "a framework for handling different languages": how exactly should we be looking at 'handling' languages? What kind of framework is expected?
2. Regarding the second project, "Make a Lucene filter which uses such a wordnet to expand search terms": does this project aim at building everything from scratch, or at revamping the existing code?

My understanding of proposed idea 1 is: "To extract the corpus from Wikipedia and to apply the deliverables on it." Please correct me if I am missing something. Also, I was wondering whether you have a specific approach in mind, or whether it would be OK if I come up with an approach and propose it in my proposal.
Some more details regarding my Essay Grader project: the grader does take care of essay coherence. Spelling and grammar are, as you pointed out, important, but not very informative when it comes to the "relatedness" of the essay. The essays are also graded on their structure; we tried to analyse the statistics of the essay to come up with a measure for grading structure.

I am very excited about this and am eagerly looking forward to hearing from you.

Thank you.

Best Regards,
Karthik

> Date: Mon, 2 Apr 2012 11:46:21 +0200
> From: "Oren Bochman"
> Subject: Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining
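The Lucene/WordNet query-expansion idea raised in this thread can be illustrated with a toy sketch (an editor's Python illustration; a real implementation would be a Java Lucene token filter, and the hand-picked synonym table here is only a stand-in for WordNet synsets):

```python
# Toy sketch of WordNet-style query expansion: each query term is
# expanded with its synonyms so a search can match documents that use
# different wording. A plain dict stands in for WordNet here; the
# entries are hypothetical examples, not real project data.

SYNONYMS = {
    "car": ["automobile", "auto"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms):
    """Return the original terms plus any synonyms, preserving order."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["fast", "red", "car"]))
# -> ['fast', 'quick', 'rapid', 'red', 'car', 'automobile', 'auto']
```

In a real Lucene analysis chain this expansion would happen token by token inside a filter rather than over a term list, but the effect on the query is the same.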
Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining
Dear Karthik Prasad and other GSOC candidates,

I was not getting this list, but I am now.

The GSOC proposal should be specified by the student. I can expand the details on these projects, and I can answer specific questions you have about expectations.

To optimally match you with a suitable high-impact project, to what extent are you familiar with:

* Java and other programming languages?
* PHP?
* Apache Lucene?
* Natural Language Processing?
* Corpus Linguistics?
* WordNet?

The listed projects would be either wrapped as services, consumed by downstream projects, or both.

The corpus is the simplest, but it requires lots of attention to detail. When successful, it would be picked up by many researchers and companies who do not have the resources for such CPU-intensive tasks. For WMF it would provide us with a standardized body for future NLP work. A part-of-speech-tagged corpus would be immediately useful for 80%-accurate word sense disambiguation in the search engine.

Automatic summaries are not a strategic priority, AFAIK:

1. most articles provide a kind of abstract in their intro;
2. something like this is already provided in the dumps for Yahoo;
3. I have been using a great pop-up preview widget on Wiktionary for a year or so.

I do think it would be a great project for learning how to become a MediaWiki developer, but it is small for a GSOC. However, I cannot speak for Jeblad and other mentors on cellular and other teams who might be interested in this.

If your essay grader is working, it could be the basis of another very exciting GSOC project aimed at article quality. An NLP-savvy "smart" article quality assessment service could improve and expand the current bots grading articles. Grammar and spelling are two good indicator features; however, a full assessment of Wikipedia articles would require more detail, both stylistic and information-based.
Once you have covered sufficient features, building discriminators based on samples of graded articles would require some data-mining ability. However, since there is an existing bot undergoing upgrades, we would have to check with its small dev team what it is currently doing, and it would be subject to community oversight.

Yours sincerely,

Oren Bochman
MediaWiki Search Developer

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
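The "features plus discriminators" approach described above can be sketched as follows (an editor's toy Python illustration: the two shallow features and the nearest-centroid classifier are assumptions for demonstration, not the actual bot's feature set or model):

```python
# Sketch of feature-based quality grading: extract a few shallow
# features from a text, then assign the grade of the nearest centroid
# among hand-graded sample texts. Real systems would use many more
# features and a properly trained classifier.
import math

def features(text):
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    return (
        len(words) / max(len(sentences), 1),                       # average sentence length
        len(set(w.lower() for w in words)) / max(len(words), 1),   # type-token ratio
    )

def centroid(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def classify(text, graded_samples):
    """graded_samples: {grade: [sample texts]}. Returns the nearest grade."""
    x = features(text)
    best, best_dist = None, float("inf")
    for grade, samples in graded_samples.items():
        d = math.dist(x, centroid([features(s) for s in samples]))
        if d < best_dist:
            best, best_dist = grade, d
    return best
```

The discriminator is only as good as the features; this is where the stylistic and information-based detail mentioned above would come in.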
Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining (John Erling Blad)
Thank you very much for your feedback, Jeblad. I will immediately look into how this can best be implemented by extending the MediaWiki API.

Do kindly let me know about my other ideas so that I can shape my proposal well. The mentor for the ideas I am interested in is Oren Bochman, but I couldn't track him down on IRC. I would love to interact with him or any other mentor and discuss my ideas in detail.

I am reachable at:
Email: karthikprasad...@gmail.com
SkypeID: prasadkarthik
Facebook: facebook.com/prasadkarthik
Google+: gplus.to/karthikprasad
Twitter: twitter.com/_karthikprasad

> Date: Sat, 31 Mar 2012 12:05:00 +0200
> From: John Erling Blad
> To: Wikimedia developers
> Subject: Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining
Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining
Your point (a), "Implementing a wikiSumarizer widget which will give the summary of the page being read by the user", could be extremely useful for hover/help-bubble functionality, where bubbles with small explanations are created within external articles. Such functionality implies creating an extension to the MediaWiki API.

Jeblad
[Wikitech-l] GSOC 2012 - Text Processing and Data Mining
Hello,

I am Karthik from India, currently pursuing the third year of a Bachelor's in Computer Science and Engineering at PESIT, Bangalore.

I am interested in some of the projects proposed for Google SOC 2012 and would love to work on them and contribute to the open-source world.

I am very attracted to text processing and data mining. I have taken a course in Natural Language Processing, and I am currently working on a project, "Automatic Essay Grader": a system that automatically grades English essays using spelling, grammar and structure, coherence, frequent phrases, and vocabulary as weighted parameters, realized by implementing a self-designed algorithm that studies the 'relation graph' of the words of the essay.

I have also worked on "Sentiment Analysis on the Web": extraction of reviews about a gadget from tech-review forums, analysis of the sentiment of those reviews to predict the opinion associated with the gadget, and generation of an appropriate rating on a scale of 10.

The following projects mentioned on the MediaWiki ideas page caught my eye:
1) Wikipedia Corpus Tools
2) Lucene Lemma Analyzers based on Morphology Extraction from Wikipedia Text
3) Lucene Automatic Query Expansion from Wikipedia Text
4) Translation spellchecking

Apart from the above projects, I also had the following ideas, which I feel would be of great help if implemented:
a) a wikiSumarizer widget which will give a summary of the page being read by the user;
b) an automatic coherence analyser which would make it easy to find out if the article on a given page talks about a single topic;
c) a details aggregator for a page.

I would be grateful if you could let me know the specific requirements of the projects and your thoughts on my ideas so that I can write a suitable proposal.

Eagerly waiting for your response. Thank you.

Best Regards,
Karthik.
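The wikiSumarizer idea in (a) can be sketched as a simple frequency-based extractive summarizer (an editor's toy Python illustration, not the proposed implementation; the stopword list is an arbitrary placeholder):

```python
# Toy extractive summarizer: score each sentence by the corpus
# frequency of its content words, then keep the top-scoring sentences
# in their original order. Counter returns 0 for unseen words, so
# stopwords (never counted) contribute nothing to a sentence's score.
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}

def summarize(text, n_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(
        w.lower().strip(".,") for w in text.split()
        if w.lower().strip(".,") not in STOPWORDS
    )
    def score(sentence):
        return sum(freq[w.lower().strip(".,")] for w in sentence.split())
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return ". ".join(s for s in sentences if s in top) + "."

text = ("Wikipedia is a free encyclopedia. Cats sleep a lot. "
        "The free encyclopedia anyone can edit.")
print(summarize(text, 1))
# -> The free encyclopedia anyone can edit.
```

A production version would need real sentence splitting and would presumably sit behind the MediaWiki API, as suggested elsewhere in this thread.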