Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools (Oren Bochman) (Amir E. Aharoni)(Gregory Varnum)

2012-04-04 Thread karthik prasad
Dear Sirs,
I am grateful for your valuable feedback and suggestions.

I have updated my proposal based on the inputs given by you. The split-up
of the deliverables on the ideas page indeed helped me understand the
requirements more clearly.

The link to my updated proposal is
https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

I request you and everyone to kindly skim through my proposal once again
and suggest changes/additions.
I am very excited about this project and working with you; and truth be
told, 23rd April seems like ages ahead.

Thanking you,
Yours sincerely,
Karthik


 Date: Wed, 4 Apr 2012 11:49:41 +0200
 From: Oren Bochman orenboch...@gmail.com
 To: 'Wikimedia developers' wikitech-l@lists.wikimedia.org
 Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
 Message-ID: 007f01cd1248$42ee6f40$c8cb4dc0$@com
 Content-Type: text/plain;   charset=utf-8

 You do understand correctly!

 The main idea about NLP components is with POS tagger as an example:

 1. a fall back system that does unsupervised POS tagging.
 2. the ability to plug in an existing POS tagger as these become
  available for specific languages.

 As supervisor, I would recommend working with 3 languages:
 English, Hebrew, and the GSOC student's native language.

 If we could get QA from other native speakers we would incorporate them
 into the workflow.

 I think that by using a deletion/reversion based heuristic we may also be
 able to build a spam corpus to boost the accuracy of the corpora.
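 One possible reading of such a heuristic, as a rough sketch only: the
 Revision fields, the spam_candidates name and the one-hour threshold are
 all assumptions made for illustration, not part of the proposal or of any
 MediaWiki API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Revision:
    """Hypothetical, simplified view of one page revision."""
    rev_id: int
    text: str
    seconds_until_revert: Optional[int]  # None if the edit was never reverted
    page_deleted: bool


def spam_candidates(revisions: Iterable[Revision],
                    max_revert_seconds: int = 3600) -> List[Revision]:
    """Keep revisions whose page was deleted or that were reverted quickly.

    The threshold is an arbitrary assumption; a real heuristic would need
    tuning and human QA before its output is treated as a spam corpus.
    """
    selected = []
    for rev in revisions:
        quickly_reverted = (
            rev.seconds_until_revert is not None
            and rev.seconds_until_revert <= max_revert_seconds
        )
        if rev.page_deleted or quickly_reverted:
            selected.append(rev)
    return selected
```

 Text selected this way could then be excluded from the main corpora or
 kept as a separate spam corpus.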


 Operation Manager
 E-mail: o...@romai-horizon.com
 Mobil: +36 30 866 6706



 Római Horizon Kft.
 H-1039 Budapest
 Királyok útja  291. D. ép. fszt. 2.
 Tel:   +36 1 492 1492
 Fax:  +36 1 266 5529

 -Original Message-
 From: wikitech-l-boun...@lists.wikimedia.org [mailto:
 wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Amir E. Aharoni
 Sent: Tuesday, April 03, 2012 10:19 PM
 To: Wikimedia developers
 Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools

 2012/4/3 karthik prasad karthikprasad...@gmail.com:
  Hello,
  I am a GSoC aspirant and have compiled a proposal for one of the
  project ideas - Wikipedia Corpus Tools. [Mentor : Oren Bochman] I
  would sincerely appreciate if you could kindly go through it and
  suggest corrections/additions so that I can settle with a coherent
 proposal.
 
  Link to my proposal :
  https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal

 Nice, but why only English?

 If I understand the proposal correctly, this project is supposed to be
 able to work with almost any language with very little effort.

 --
 Amir Elisha Aharoni
 http://aharoni.wordpress.com
 "We're living in pieces, I want to live in peace." - T. Moore

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




 --


 Date: Wed, 4 Apr 2012 12:58:11 +0300
 From: Amir E. Aharoni amir.ahar...@mail.huji.ac.il
 To: Wikimedia developers wikitech-l@lists.wikimedia.org
 Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
 Message-ID:
CACtNa8tS-PifzJS1JsF02k3qW_-7=uk-wdqnvsflglufhxn...@mail.gmail.com
 
 Content-Type: text/plain; charset=UTF-8

 2012/4/4 Oren Bochman orenboch...@gmail.com:
  You do understand correctly!
 
  The main idea about NLP components is with POS tagger as an example:

 Just to make sure, POS = part of speech, isn't it?

 It's one of the most confusing TLAs in computing :)

  If we could get QA from other native speakers we would incorporate them
 into the workflow.

 Good. As long as there is a way to plug in other languages and a way for
 speakers of other languages to contribute QA, I'm very happy.

 --
 Amir Elisha Aharoni
 http://aharoni.wordpress.com
 "We're living in pieces,
 I want to live in peace." - T. Moore



Date: Wed, 4 Apr 2012 00:28:29 -0400
From: Gregory Varnum gregory.var...@gmail.com
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
Message-ID: ac4c429f-a839-4911-be9b-c8928aa2d...@gmail.com
Content-Type: text/plain; charset=utf-8

Whoops - I meant that email to be directed to Karthik - although Amir
you're welcome to read it as well.  :)

-greg


On Apr 3, 2012, at 11:24 PM, Gregory Varnum gregory.var...@gmail.com
wrote:

 Amir,

 Thank you for your GSOC proposal!  :)

 Between now and Google's submission deadline on April 6th - you are
invited to further modify your proposals.  The GSOC page on MW.org -
https://www.mediawiki.org/wiki/GSOC - and our IRC rooms -
https://www.mediawiki.org/wiki/MediaWiki_on_IRC

 Looking over your proposal - I think you've got good background
information on yourself.  However, I think you should flesh out more
details on the proposed project.  Without more familiarity with corpus (and
with no links to find that 


2012-04-04 Thread Gregory Varnum
This looks much more in-depth and helpful. I think your best next step is to, 
if you haven't already, connect with potential mentors and indicate who those 
folks are within your proposal.

-Greg
___
Sent from my iPad. Apologies for any typos. A more detailed response may be 
sent later.
