[
https://issues.apache.org/jira/browse/STANBOL-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351756#comment-14351756
]
aj_boulay commented on STANBOL-1291:
------------------------------------
Hello,
Thank you for emailing me about your mentorship role in Google Summer of
Code. I have looked at the dates, and it appears that the student
application period starts on 16 March. I will be ready to submit a
proposal this year.
I am doing a lot of preparation for this proposal, and may even be able to
provide some example code snippets. I have found some books on Solr and
ordered them; they should arrive in a few weeks, so I will have some good
Solr code examples to work from. I have also looked at the design of CMU
Sphinx and noticed that Carnegie Mellon has put a "Caveat Emptor" on its
webpage regarding phoneme-grapheme transformations. I have therefore been
trying to learn how many layers the CMU Sphinx HMM model has, to see whether
different algorithms can be run in addition to its linking algorithm to
improve accuracy and the ability to deal with large datasets. If you have
any advice or knowledge about CMU Sphinx, especially its architecture,
please let me know before the application date so that I can research your
suggestions and include possible solutions in my submission.
Please provide any advice that you think may be important for my
application. I am looking forward to having a chance to work with you!!
Danke, Gracias, Merci, Thank you very much!!
AJ Boulay
> Phonetic Linking
> ----------------
>
> Key: STANBOL-1291
> URL: https://issues.apache.org/jira/browse/STANBOL-1291
> Project: Stanbol
> Issue Type: New Feature
> Components: Enhancement Engines
> Reporter: Rupert Westenthaler
> Labels: gsoc2015, mentoring
>
> Add phonetic-based EntityLinking support to Apache Stanbol.
> The idea is to
> 1. start off with a sound file
> 2. use a speech-to-text engine like STANBOL-1007 to get the transcript
> 3. use NLP processing
> 4. use the FST Linking Engine (STANBOL-1128) to link against a SolrIndex
> configured for phonetic linking [1].
> 5. correct the text transcript based on labels of linked entities.
> The main question to be answered is whether the phonetic matching (step 4)
> can correctly link entities even if the spellings in the text transcript
> are incorrect.
> Additional things to validate are
> * is the quality of the text transcript good enough?
> * does NLP processing still work sufficiently well on text transcripts?
> This will definitely also require adaptations to the FST Linking Engine, as
> the score is currently calculated based on the Levenshtein distance of the
> mention to the best matching label of an entity, which does not make sense
> for this specific use case.
> [1]
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory
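The scoring concern raised in the issue description above can be illustrated with a minimal Python sketch. It is not Stanbol or Solr code: the `soundex`, `levenshtein` and `phonetic_match` helpers are hypothetical names written for this example, and a hand-written American Soundex stands in for whichever encoder the PhoneticFilterFactory would actually be configured with. The point is only that a mention misspelled by a speech-to-text engine can share a phonetic code with an entity label while still being several character edits away from it.

```python
# Map each consonant to its Soundex digit; vowels and y get no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def phonetic_match(mention, label):
    """True if mention and label have the same sequence of Soundex codes."""
    return ([soundex(t) for t in mention.split()]
            == [soundex(t) for t in label.split()])


if __name__ == "__main__":
    mention, label = "Smith", "Schmidt"  # transcript spelling vs. entity label
    print(soundex(mention), soundex(label))
    print("phonetic match:", phonetic_match(mention, label))
    print("edit distance:", levenshtein(mention.lower(), label.lower()))
```

In this sketch the two spellings produce the same phonetic code, yet a score derived from the raw edit distance would penalise the link. One possible adaptation, stated here only as an assumption, would be to score on the phonetic codes rather than on the surface strings.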