Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools (Oren Bochman) (Amir E. Aharoni)(Gregory Varnum)
Dear Sirs,

I am grateful for your valuable feedback and suggestions. I have updated my proposal based on the inputs given by you. The split-up of the deliverables on the ideas page indeed helped me understand the requirements more clearly.

The link to my updated proposal is https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

I request you and everyone to kindly skim through my proposal once again and suggest changes/additions. I am very excited about this project and working with you; and truth be told, 23rd April seems like ages ahead.

Thanking you,
Yours sincerely,
Karthik

Date: Wed, 4 Apr 2012 11:49:41 +0200
From: Oren Bochman orenboch...@gmail.com
To: 'Wikimedia developers' wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
Message-ID: 007f01cd1248$42ee6f40$c8cb4dc0$@com
Content-Type: text/plain; charset=utf-8

You do understand correctly!

The main idea about NLP components is, with a POS tagger as an example:

1. a fallback system that does unsupervised POS tagging.
2. the ability to plug in an existing POS tagger as these become available for specific languages.

As supervisor, I would recommend working with three languages: English, Hebrew, and the GSoC student's native language. If we could get QA from other native speakers we would incorporate them into the workflow.

I think that by using a deletion/reversion-based heuristic we may also be able to build a spam corpus to boost the accuracy of the corpora.

Operation Manager
E-mail: o...@romai-horizon.com
Mobil: +36 30 866 6706
Római Horizon Kft.
H-1039 Budapest Királyok útja 291. D. ép. fszt. 2.
Tel: +36 1 492 1492
Fax: +36 1 266 5529

-----Original Message-----
From: wikitech-l-boun...@lists.wikimedia.org [mailto: wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Amir E. Aharoni
Sent: Tuesday, April 03, 2012 10:19 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools

2012/4/3 karthik prasad karthikprasad...@gmail.com:

Hello, I am a GSoC aspirant and have compiled a proposal for one of the project ideas - Wikipedia Corpus Tools. [Mentor: Oren Bochman] I would sincerely appreciate it if you could kindly go through it and suggest corrections/additions so that I can settle on a coherent proposal. Link to my proposal: https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

Nice, but why only English? If I understand the proposal correctly, this project is supposed to be able to work with almost any language with very little effort.

--
Amir Elisha Aharoni
http://aharoni.wordpress.com
"We're living in pieces, I want to live in peace." - T. Moore

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--

Date: Wed, 4 Apr 2012 12:58:11 +0300
From: Amir E. Aharoni amir.ahar...@mail.huji.ac.il
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
Message-ID: CACtNa8tS-PifzJS1JsF02k3qW_-7=uk-wdqnvsflglufhxn...@mail.gmail.com
Content-Type: text/plain; charset=UTF-8

2012/4/4 Oren Bochman orenboch...@gmail.com:

You do understand correctly! The main idea about NLP components is with POS tagger as an example:

Just to make sure, POS = part of speech, isn't it? It's one of the most confusing TLAs in computing :)

If we could get QA from other native speakers we would incorporate them into the workflow.

Good. As long as there is a way to plug in other languages and a way for speakers of other languages to contribute QA, I'm very happy.

--
Amir Elisha Aharoni
http://aharoni.wordpress.com
"We're living in pieces, I want to live in peace." - T. Moore
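
For readers following the thread, the plug-in design discussed above (an unsupervised fallback plus language-specific POS taggers that can be registered as they become available) could look roughly like the Python sketch below. This is only an illustration under stated assumptions: the names `register_tagger`, `fallback_tagger`, and `tag` are hypothetical and do not come from the proposal, and the fallback here is a crude word-shape stand-in, not a real unsupervised POS tagger.

```python
from typing import Callable, Dict, List, Tuple

# A tagger maps a list of tokens to (token, tag) pairs.
Tagger = Callable[[List[str]], List[Tuple[str, str]]]

# Hypothetical registry: language code -> language-specific tagger.
_TAGGERS: Dict[str, Tagger] = {}


def register_tagger(lang: str, tagger: Tagger) -> None:
    """Plug in an existing tagger for one language (e.g. "en", "he")."""
    _TAGGERS[lang] = tagger


def fallback_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Stand-in for the unsupervised fallback.

    Assumption: it only looks at the surface shape of each token.
    A real fallback would induce word classes from the corpus itself.
    """
    tagged = []
    for token in tokens:
        if token.isdigit():
            label = "NUM"
        elif not any(ch.isalnum() for ch in token):
            label = "PUNCT"
        else:
            label = "WORD"
        tagged.append((token, label))
    return tagged


def tag(lang: str, tokens: List[str]) -> List[Tuple[str, str]]:
    """Use the registered tagger for `lang`, else the fallback."""
    tagger = _TAGGERS.get(lang, fallback_tagger)
    return tagger(tokens)
```

With this shape, supporting a new language only requires one `register_tagger` call, and any language without a registered tagger still gets some output from the fallback, which is what would let native speakers contribute QA before a dedicated tagger exists.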
Date: Wed, 4 Apr 2012 00:28:29 -0400
From: Gregory Varnum gregory.var...@gmail.com
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
Message-ID: ac4c429f-a839-4911-be9b-c8928aa2d...@gmail.com
Content-Type: text/plain; charset=utf-8

Whoops - I meant that email to be directed to Karthik - although Amir you're welcome to read it as well. :)

-greg

On Apr 3, 2012, at 11:24 PM, Gregory Varnum gregory.var...@gmail.com wrote:

Amir,

Thank you for your GSOC proposal! :)

Between now and Google's submission deadline on April 6th - you are invited to further modify your proposals. The GSOC page on MW.org - https://www.mediawiki.org/wiki/GSOC - and our IRC rooms - https://www.mediawiki.org/wiki/MediaWiki_on_IRC

Looking over your proposal - I think you've got good background information on yourself. However, I think you should flesh out more details on the proposed project. Without more familiarity with corpus (and with no links to find that
Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools (Oren Bochman) (Amir E. Aharoni)(Gregory Varnum)
This looks much more in-depth and helpful. I think your best next step is to, if you haven't already, connect with potential mentors and indicate who those folks are within your proposal.

-Greg

___
Sent from my iPad. Apologies for any typos. A more detailed response may be sent later.

On Apr 4, 2012, at 10:31 AM, karthik prasad karthikprasad...@gmail.com wrote:

Dear Sirs,

I am grateful for your valuable feedback and suggestions. I have updated my proposal based on the inputs given by you. The split-up of the deliverables on the ideas page indeed helped me understand the requirements more clearly.

The link to my updated proposal is https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

I request you and everyone to kindly skim through my proposal once again and suggest changes/additions. I am very excited about this project and working with you; and truth be told, 23rd April seems like ages ahead.

Thanking you,
Yours sincerely,
Karthik