Hello, I am a student who saw this issue listed as an idea for Google Summer of Code 2014. I am very interested in working on this issue this summer, and I have some questions about your needs so that I can write a good Google Summer of Code proposal that addresses your concerns. I have a strong background in soft computing for linguistics (ANNs, statistical methods, computational linguistics) and a good (and growing) background in databases and in matching entities to labels. Please provide some guidelines on what your needs are so that I can submit a winning proposal.
I have posted on the Apache Phonetic Matching Project site, https://issues.apache.org/jira/browse/STANBOL-1291, where I hoped 'lurking' would provide some insight into your needs. This is what I understand so far.

From your website (STANBOL-1291, Phonetic Linking): "The main question to be answers is if the phonetic matching (step 4) can correctly link Entities even if the writings in the text transcript are incorrect."

Perhaps 'soft computing methods' are the best way to answer this question: neural networks, Bayesian methods, fuzzy sets or rough sets, because these methods would still score well even if the 'writings in the text transcript are incorrect'. I can address this question on many levels given my experience:

- Computational Linguistics - I have experience coding artificial neural networks that learn phonetic speech. This also applies to text recognition and to the generation of grammatical rules from the language input. I saw that the speech-to-text engine in Stanbol uses Sphinx, which is built using Bayesian approaches (now you have got me really excited!). I would be very interested in working with the Stanbol engine to produce tests or measures of how well it links entities, based on its performance as an NLP engine along the lines of pattern matching. My experience with these kinds of networks comes from a neural net simulator (T Learn) and from coding neural nets in MATLAB.
- Text Quality - This would require some kind of comparison between a trusted sample of the original data and the output text. Experimental statistical methods would provide measures, and empirical computational methods may provide means of improvement. However, you know the needs, and if I may have your insight or advice regarding the parameters, I am sure that I can produce an excellent proposal.
- I am PASSIONATE about coding neural nets and Bayesian nets for language processing, and I have been a student member of academic labs that focus on human cognition and language processing (psycholinguistics). I also have a very strong interest in becoming active in semantic web development. I currently study Human-Computer Interaction at Laurentian, so speech interfaces are really exciting to me. I would really love to have a chance to code with you for the summer (and afterwards too!) because it would bring me the kind of experience that I can't get here at the university.

I am most interested in answers to these kinds of questions (which are specific to my application):

- a list of deliverables, quantifiable results for the Apache community (I am not sure what deliverables will meet your needs; can you suggest some?)
- a detailed description / design document (I am interested in following the standards put forward by Apache for design documentation; perhaps I could see guidelines to give me an idea of what to submit)
- an approach (I would want to meet expectations in the approach)
- an approximate schedule
- something of a background text (does this mean a literature search, citations? I can provide these, or whatever else is needed)

Thanks very much for your advice! I will submit a proposal as soon as I receive your reply!

AJ Boulay, MSc