Here are some published papers on how character embeddings are used for classification.
https://arxiv.org/abs/1810.03595
https://lsm.media.mit.edu/papers/tweet2vec_vvr.pdf
https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf

We have just finished writing a paper on this and have obtained better results than the ones in the papers mentioned above. The dataset is collected from SentiWordNet, as I mentioned earlier.

I am not on the IRC; I will join it then.

Best,
Rajarshi

On Fri, Feb 28, 2020, 01:01 Tanmai Khanna <khanna.tan...@gmail.com> wrote:

> How exactly can characters predict sentiment? Don’t you still need some
> training data for pairs? English, Hindi, and Bangla aren’t really
> low-resource languages.
>
> Anyway, we can continue this discussion on the IRC so that it’ll be
> easier and more people can contribute to the discussion.
>
> Tanmai
>
> Sent from my iPhone
>
> On 28-Feb-2020, at 00:52, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>
> To answer the question of how to analyse sentiment in a low-resource
> language, I think character embeddings would be the best option. The
> words in a corpus are not exhaustive, but the number of unique characters
> is certainly fixed and well defined. We can learn an embedding weight for
> each character and apply it to a number of NLP tasks, not just sentiment
> analysis. The downside of a low-resource language can be somewhat
> mitigated that way.
>
> On Fri, Feb 28, 2020, 00:46 Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>
>> As I mentioned earlier, I would like to work on English-Hindi or
>> English-Bengali translation. The dataset can be obtained from
>> SentiWordNet for Indian languages,
>> https://amitavadas.com/sentiwordnet.php
>> which is by far the most resourceful dataset available for sentiment
>> analysis. It contains data for both Hindi and Bengali.
>>
>> I cannot give an example specific to Apertium because whenever I try to
>> translate a word from English in the interface, the available languages
>> for translation are beyond my knowledge.
>> I am not sure if I am right, but Hindi/Bengali is probably not among
>> the languages into which an English word can be translated. Correct me
>> if I am wrong.
>>
>> On Fri, Feb 28, 2020, 00:31 Tanmai Khanna <khanna.tan...@gmail.com> wrote:
>>
>>> Hi, I have a few questions about this:
>>> 1. How would you analyse the sentiment of the source text, considering
>>> that the language pairs Apertium deals with are low-resource languages?
>>> 2. As Tino mentions, is there a problem of sentiment loss in Apertium?
>>> Any examples of this?
>>> 3. Doesn't sentiment analysis of a language require a decent amount of
>>> training data? Where would this data be found for low-resource
>>> languages?
>>>
>>> Tanmai
>>>
>>> On Fri, Feb 28, 2020 at 12:02 AM Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>
>>>> The effect won't be very evident on simple sentences; I think it
>>>> would be more evident on sentences where the choice of words can
>>>> decide the quality of the translation. It's not about whether "Watch
>>>> out" could become "be careful"; it's about choosing words that can
>>>> retain the urgency of "watch out". Sentiment information about the
>>>> original sentence can help with that.
>>>>
>>>> On Thu, Feb 27, 2020, 23:47 Scoop Gracie <scoopgra...@gmail.com> wrote:
>>>>
>>>>> So, "Watch out!" could become "Be careful"?
>>>>>
>>>>> On Thu, Feb 27, 2020, 10:13 Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>>
>>>>>> It is not just about minimizing the loss of sentiment; it is about
>>>>>> using that information for better translation. A trivial example
>>>>>> would be that in some situations sentences can project a strong
>>>>>> sentiment, and a plain translation may not always yield the best
>>>>>> result. However, if we can use knowledge of the sentiment to choose
>>>>>> the words, it might give a better result.
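The character-embedding idea raised earlier in the thread can be illustrated with a deliberately tiny sketch: one learned weight per character, a word scored by summing the weights of its characters, trained as a plain logistic regression. Everything below (the two-word training set, the alphabet, the labels) is invented for illustration; this is not the model from the linked papers, which use convolutional or recurrent architectures.

```python
import math

def char_features(word, alphabet):
    """Bag-of-characters vector: how often each alphabet character occurs."""
    return [word.count(c) for c in alphabet]

def train(examples, alphabet, epochs=500, lr=0.5):
    """Logistic regression over character counts. examples: (word, label) pairs."""
    w = [0.0] * len(alphabet)
    b = 0.0
    for _ in range(epochs):
        for word, label in examples:
            x = char_features(word, alphabet)
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(positive)
            err = p - label                  # gradient of log-loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(word, w, b, alphabet):
    """1 = positive sentiment, 0 = negative."""
    x = char_features(word, alphabet)
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Toy, invented training data: one positive and one negative word.
alphabet = "abdgo"
examples = [("good", 1), ("bad", 0)]
w, b = train(examples, alphabet)
print(predict("good", w, b, alphabet), predict("bad", w, b, alphabet))  # -> 1 0
```

The point of the sketch is only that the parameter set scales with the alphabet, not with the vocabulary, which is the property being claimed for low-resource settings.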
>>>>>>
>>>>>> As far as the code is concerned, I need to study the source code,
>>>>>> or detailed documentation, before proposing a feasible solution.
>>>>>>
>>>>>> Best,
>>>>>> Rajarshi
>>>>>>
>>>>>> On Thu, Feb 27, 2020, 23:21 Tino Didriksen <m...@tinodidriksen.com> wrote:
>>>>>>
>>>>>>> My first question would be: is this actually a problem for
>>>>>>> rule-based machine translation? I am not a linguist, but given how
>>>>>>> RBMT works I can't really see where sentiment would be lost in the
>>>>>>> process, especially because Apertium is designed for related
>>>>>>> languages, where sentiment is mostly the same. But even for less
>>>>>>> related languages, it would come down to the quality of the
>>>>>>> source-language analysis.
>>>>>>>
>>>>>>> Beyond that, please learn how Apertium specifically works, not
>>>>>>> just RBMT in general. http://wiki.apertium.org/wiki/Documentation
>>>>>>> is a good start, but our IRC channel is the best place to ask
>>>>>>> technical questions.
>>>>>>>
>>>>>>> One major issue specific to Apertium is that the source
>>>>>>> information is no longer available in the target-generation step.
>>>>>>>
>>>>>>> E.g., since you mention English-Hindi, you could install
>>>>>>> apertium-eng-hin and see how each part of the pipe works. We have
>>>>>>> precompiled binaries for common platforms. Again, see the wiki and
>>>>>>> IRC.
>>>>>>>
>>>>>>> -- Tino Didriksen
>>>>>>>
>>>>>>> On Thu, 27 Feb 2020 at 08:16, Rajarshi Roychoudhury <rroychoudhu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Formally, I present my idea in this form.
>>>>>>>>
>>>>>>>> From my understanding, an RBMT system contains:
>>>>>>>>
>>>>>>>> - an *SL morphological analyser* - analyses a source-language
>>>>>>>> word and provides its morphological information;
>>>>>>>> - an *SL parser* - a syntax analyser which analyses
>>>>>>>> source-language sentences;
>>>>>>>> - a *translator* - translates a source-language word into the
>>>>>>>> target language;
>>>>>>>> - a *TL morphological generator* - generates appropriate
>>>>>>>> target-language words for the given grammatical information;
>>>>>>>> - a *TL parser* - composes suitable target-language sentences.
>>>>>>>>
>>>>>>>> I propose a sixth component of the RBMT system: a
>>>>>>>> *sentiment-based TL morphological generator*.
>>>>>>>>
>>>>>>>> I propose that we do word-level sentiment analysis of the source
>>>>>>>> and target languages. For the time being I want to work on
>>>>>>>> English-Hindi translation. We do not need neural-network-based
>>>>>>>> translation; however, to get the sentiment associated with each
>>>>>>>> word we might use NLTK, or develop a character-level embedding
>>>>>>>> just to find the sentiment associated with each word, and form a
>>>>>>>> dictionary out of it. I have written a paper on this and obtained
>>>>>>>> good results. So, during the final application development we
>>>>>>>> will just have the dictionary, with no neural-network
>>>>>>>> dependencies. This can easily be done with Python. I just need a
>>>>>>>> good corpus of English and Hindi words (the sentiment datasets
>>>>>>>> are available online).
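The "dictionary with no neural-network dependencies" deployment described in the proposal could be sketched as follows: whatever model produces the per-word sentiment scores runs only offline, and the shipped artifact is a plain word-to-score mapping. All words and scores below are invented placeholders, not values from any real lexicon.

```python
import json

# Offline step (NLTK, a character-level model, etc.): produce word scores once.
# These entries are invented for illustration.
scores = {"good": 0.7, "bad": -0.7, "okay": 0.1}

# Ship the dictionary as plain data (e.g. a JSON file)...
blob = json.dumps(scores)

# ...and at application time just load it and look words up,
# defaulting to neutral (0.0) for out-of-vocabulary words.
lexicon = json.loads(blob)

def word_sentiment(word):
    return lexicon.get(word.lower(), 0.0)

print(word_sentiment("Good"), word_sentiment("unseen"))  # -> 0.7 0.0
```

The design point is that the runtime cost is a hash lookup, so the component adds essentially nothing to the pipeline's footprint.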
>>>>>>>>
>>>>>>>> The *sentiment-based TL morphological generator* will generate
>>>>>>>> the list of possible words, and we will take the word whose
>>>>>>>> sentiment is closest to that of the source-language word.
>>>>>>>> This is a novel method that has probably not been applied before,
>>>>>>>> and it might generate better results.
>>>>>>>>
>>>>>>>> Please provide your valuable feedback and suggest any necessary
>>>>>>>> changes that need to be made.
>>>>>>>> Best,
>>>>>>>> Rajarshi
>>>
>>> --
>>> *Khanna, Tanmai*
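The selection step in the proposal above, picking from the generator's candidate list the target word whose sentiment score is closest to the source word's, reduces to a nearest-score search. The candidate words and all scores below are invented for illustration; in practice they would come from a lexicon such as SentiWordNet.

```python
def pick_closest_sentiment(source_score, candidates):
    """candidates: dict mapping target-language word -> sentiment score.
    Returns the word minimising |score - source_score|."""
    return min(candidates, key=lambda word: abs(candidates[word] - source_score))

# Hypothetical example: translating an urgent "watch out!" where several
# near-synonymous candidates differ in intensity (scores are invented).
candidates = {
    "dekhna": 0.0,       # neutral "look"
    "savdhan": -0.6,     # strong warning
    "dhyan dena": -0.2,  # mild "pay attention"
}
source_score = -0.7  # invented score for the urgency of "watch out!"
print(pick_closest_sentiment(source_score, candidates))  # -> savdhan
```

With ties broken by dictionary order, this makes the generator's choice deterministic given the two lexicons.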
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff