On 28 February 2014 18:21, Alex Aruj <alex.a...@gmail.com> wrote:
> Hi group,
>

Hi.

One part of GSoC is that you will learn how to engage with an open
source community; you've taken your first step. Good job!

A necessary part of interacting with open source communities is to
communicate via mailing lists, and there is a certain amount of
etiquette involved.

In this particular instance, what you have done is usually called
"thread hijacking" -- you've sent a mail as a reply to another, but on
a completely different topic. A normal thread on a mailing list is
essentially a single conversation, and interjecting an email on an
unrelated topic interrupts the usual flow of conversation. This is bad
for us, as the result can be a confusing mix of two separate
conversations, under the same heading. It's bad for you, as a GSoC
applicant, because there may come a time when one of the mentors will
need to refer to an earlier part of the communication with you, and
will find it difficult to find your email.

In future, please write a new email when writing on a new subject,
rather than using 'reply'. It's a minor inconvenience to copy and
paste the mailing list address, but it's more than outweighed by the
later inconvenience involved if you need to refer back to an earlier
email.

To help you, I've changed the subject to one more appropriate to your proposal.

> I am considering tackling the 'restoration of diacritic marks' task. I am in
> the middle of my second semester of C++ and winding down my full-time job in
> a translation company in order to study computational issues related to
> language and work freelance in my pair ES>EN, and possibly to develop more
> in PT>EN. Anyway, back to GSOC:
>
> Is the priority to make the charlifter case-sensitive and for it to respect
> superblanks exactly as in the example in the box laid out here
> http://wiki.apertium.org/wiki/Superblanks?
>

Respecting superblanks is a must: diacritic restoration must not be
applied to them.
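
As a rough illustration (a sketch, not Charlifter or Apertium code),
one way to keep superblanks untouched is to split the input on them
and only pass the text in between to the restorer. The names
`restore_outside_superblanks` and `toy_restore` are made up for this
example, and the pattern assumes a superblank is any unnested run of
text in square brackets, with no escaped brackets:

```python
import re

# Simplifying assumption: a superblank is "[...]" with no nesting and
# no escaped brackets inside it.
SUPERBLANK = re.compile(r"(\[[^\]]*\])")


def restore_outside_superblanks(text, restore):
    """Apply `restore` only to the text between superblanks."""
    parts = SUPERBLANK.split(text)
    # With one capturing group, re.split() returns the superblanks at
    # the odd indices and the ordinary text at the even indices.
    return "".join(
        part if i % 2 == 1 else restore(part)
        for i, part in enumerate(parts)
    )


def toy_restore(segment):
    # Hypothetical stand-in for the real model: one fixed replacement.
    return segment.replace("cafe", "café")


print(restore_outside_superblanks('[<a href="cafe.html">]cafe[</a>]', toy_restore))
# prints: [<a href="cafe.html">]café[</a>]
```

The "cafe" inside the superblank is left alone, while the one in the
running text is restored.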

Case should definitely be _respected_: the output needs to match the
input in terms of case.
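
To make "respected" concrete, here is a minimal sketch that assumes
the model is trained on lowercase and that restoration only swaps
characters one for one (so lengths never change). `apply_case` and
`restore_respecting_case` are hypothetical names, and `restore_lower`
stands in for whatever the real restorer turns out to be:

```python
def apply_case(original, restored):
    """Copy the case pattern of `original` onto `restored`.

    Assumes both strings have the same length, i.e. restoration is a
    one-to-one character replacement.
    """
    if len(original) != len(restored):
        raise ValueError("expected a one-to-one character replacement")
    return "".join(
        r.upper() if o.isupper() else r
        for o, r in zip(original, restored)
    )


def restore_respecting_case(text, restore_lower):
    # Run the lowercase-trained restorer, then put the input's case back.
    return apply_case(text, restore_lower(text.lower()))


def toy_restore_lower(segment):
    # Hypothetical stand-in for a model trained on lowercase text.
    return segment.replace("cafe", "café")


print(restore_respecting_case("Cafe CAFE cafe", toy_restore_lower))
# prints: Café CAFÉ café
```

Characters whose lowercase or uppercase form has a different length
would break the length assumption and need separate handling.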

As for case sensitivity, Kevin Scannell is the person to ask for a
definitive answer. My feeling is that case sensitivity can
potentially be more accurate, but in the absence of sufficient data,
case-insensitive (trained on lowercase) should be the default.

> Should the tasks be done in this order or according to applicant interest?
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Accent_and_diacritic_restoration
>

The task itself is to port Charlifter.

Adding rule-based replacements can be done in a number of ways, but
possibly the easiest (and likely most effective) way would be to do so
in a similar manner to apertium-tagger -- by adding non-statistically
derived probabilities (i.e., you insert a high probability for a
rule-based replacement).
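
A minimal sketch of that idea, under stated assumptions: the model is
represented here as a plain dictionary from an ASCII form to candidate
probabilities, which is NOT how Charlifter or apertium-tagger store
their data, and `add_rules`, `best` and the 0.99 value are made up for
illustration:

```python
def add_rules(model, rules, rule_prob=0.99):
    """Return a copy of `model` in which each rule gets a high probability.

    model: dict mapping an ASCII form to {candidate: probability}
    rules: dict mapping an ASCII form to the replacement a rule requires
    """
    merged = {form: dict(cands) for form, cands in model.items()}
    for form, replacement in rules.items():
        cands = merged.get(form, {})
        others = {c: p for c, p in cands.items() if c != replacement}
        total = sum(others.values())
        if total > 0:
            # Rescale the statistical candidates into the remaining mass
            # so the distribution still sums to 1.
            scale = (1.0 - rule_prob) / total
            new_cands = {c: p * scale for c, p in others.items()}
            new_cands[replacement] = rule_prob
        else:
            # No competing candidates: the rule is the only option.
            new_cands = {replacement: 1.0}
        merged[form] = new_cands
    return merged


def best(model, form):
    """Pick the most probable candidate, or the form itself if unknown."""
    cands = model.get(form)
    return max(cands, key=cands.get) if cands else form


stats = {"mas": {"mas": 0.7, "más": 0.3}}   # toy, invented numbers
with_rules = add_rules(stats, {"mas": "más"})
print(best(stats, "mas"), best(with_rules, "mas"))
# prints: mas más
```

The point is only that a rule does not need a separate code path: it
can enter the same table as the trained probabilities.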

Training models is necessary to test the system -- this is a
non-code task, and cannot be a requirement. You will need to train
multiple models, because testing with one will not be sufficient, but
whatever you can manage of the remainder during the wrap-up time
should be sufficient.

"Inform charlifter with target-language information..." -- I think
this is necessary to make this a full GSoC project (that is, I don't
imagine a port of Charlifter will take 3 months by itself). Ideally,
this should be started before midterms, but taking midterms as a
starting point would be fine.

> Are the main coding skills needed for this task boolean operations, loops
> and file input/output knowledge or is something exotic I should be aware of
> (see next question  ; ) )?
>
> Anything to help understand finite state automata in this process? Are the
> different nodes basically functions that are called as the diacritic mark,
> word, structure is analyzed?

The port is the important part. There may be some 'exotic' stuff,
depending on how much time is left over, but you'll just be calling
functions, not implementing them. Nothing scary :)

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
