Trung,

Since you need one too...

I have a Vietnamese corpus: words used on a particular forum. (In the
future we might need to consult a better source, but I know this is
more than sufficient for the time being.) We can also extract bigrams
and trigrams from this corpus.
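
To make the bigram/trigram idea concrete, here is a rough sketch of
counting n-grams over a tokenized post. The function name and the
whitespace tokenization are assumptions for illustration only, not the
tool that produced the existing frequency dictionary.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Assumption: one forum post, already lowercased and split on whitespace.
post = "toi di hoc toi di lam".split()
bigrams = ngram_counts(post, 2)
trigrams = ngram_counts(post, 3)
```

Counters from separate posts can be added together (c1 + c2) to build
corpus-wide counts.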

I can share with you the tools that I used to construct that
particular dictionary of frequencies. The problem is that my
dictionary is pretty noisy, and I don't think I'm desperate enough to
construct a quality one.

I need your help in making a tool that will filter out all the noise
in the corpus.

- We need to filter out misspelled words that are not English
(e.g. Hok, dek, etc.).
- We need to filter out names.
- We need to "merge" words that have different spellings (e.g. Hoa` vs Ho`a).
- We need to mark certain vulgar words.
- We need to filter out certain usage patterns that are particular to
that forum.
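
For the "merge" item above, one possible approach (a sketch only, under
assumptions) is to key each syllable by its letters with the tone mark
removed plus the tone itself, so that the two placements of the same
tone collapse into one entry. The names tone_key and merge_variants are
made up for illustration, and the code assumes each token is a single
Unicode-encoded syllable, not the ASCII notation used in this mail.

```python
import unicodedata
from collections import Counter

# The five Vietnamese tone marks as Unicode combining characters.
TONE_MARKS = {
    "\u0300",  # grave
    "\u0301",  # acute
    "\u0303",  # tilde
    "\u0309",  # hook above
    "\u0323",  # dot below
}

def tone_key(syllable):
    """Return (letters without tone mark, tone mark) for one syllable."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    base = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    tones = "".join(sorted(ch for ch in decomposed if ch in TONE_MARKS))
    # Recompose so vowel marks (circumflex, breve, horn) stay attached.
    return unicodedata.normalize("NFC", base), tones

def merge_variants(freqs):
    """Fold spelling variants into the most frequent spelling of each key."""
    merged = Counter()
    canonical = {}
    for word, count in sorted(freqs.items(), key=lambda kv: -kv[1]):
        key = tone_key(word)
        canonical.setdefault(key, word)  # first seen = most frequent
        merged[canonical[key]] += count
    return merged

# "ho\u00e0" and "h\u00f2a" are the two placements of the grave tone.
merged = merge_variants({"ho\u00e0": 3, "h\u00f2a": 5, "hoa": 2})
```

Keeping the most frequent spelling as the canonical one is an arbitrary
choice; a fixed placement rule would work just as well.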

Contact me and we can discuss.


Cheers,
- H.

PS: I know the team over at CocCoc must have a corpus of that kind,
since their browser guesses tone marks. However, they might want to
keep it for themselves... I wonder if anyone here has contacts at
CocCoc and could find out whether they are willing to share?

On Tue, Feb 11, 2014 at 4:18 AM, Trung Ngo <ndtrung4...@gmail.com> wrote:
> 2014-02-10 8:40 GMT+07:00 Huan T [ML] <m...@tnhh.net>:
>> Hi all,
>>
>> I'm looking for a contemporary conversational word corpus in
>> Vietnamese (e.g. Facebook status updates) to mine some text to release
>> the next version of my open-source keyboard software.
>>
>> Wondering if anyone has such a corpus?
>
> I'm looking for one, too. I need a word frequency list for the
> Vietnamese input method on Firefox OS [1]. Last November, I even wrote
> a letter to the Institute of Linguistics [2] but still got no answer
> so far.
>
> [1]: https://bugzilla.mozilla.org/show_bug.cgi?id=934198
> [2]: http://vienngonnguhoc.gov.vn/
>
> --
> Best regards,
> Trung "Chin" Ngo
>
> Developer, Linux/Unix specialist
>
> http://ngochin.com - ndtrung4...@gmail.com - +84 168 713 4338
> _______________________________________________
> POST RULES : http://wiki.hanoilug.org/hanoilug:mailing_list_guidelines
> _______________________________________________
> HanoiLUG mailing lists: http://lists.hanoilug.org/
> HanoiLUG wiki: http://wiki.hanoilug.org/
> HanoiLUG blog: http://blog.hanoilug.org/



-- 

Eccentric Graduate Student
Google Talk/Jabber hu...@tnhh.net - Website tnhh.net
