Adrián Chaves Fernández
<adriyeticha...@gmail.com> wrote:

> I am working on texts extracted from docbook (XML) files where custom
> HTML entities are often used in place of proper nouns, mainly for the
> name of software applications.
>
> I would like the en-gl pair to be able to handle it, to fix the
> following translation error:
> Source string: The &dolphin; Handbook
> Expected translation: O manual de &dolphin;
> Actual translation: O &dolphin; Manual
>
> So I would like Apertium to interpret custom HTML entities (I can
> write a regular expression to capture any HTML entity, and another one
> to tell standard HTML entities apart from custom HTML entities) as
> proper nouns.

Interesting. It's obvious that the plain html deformatter won't do here:

$ echo 'The &dolphin; Handbook' |apertium-deshtml
The[ &dolphin; ]Handbook.[][
]

I.e. it treats the entity as a blank to be ignored.

> But I do not know where to start. The following questions come to
> mind:
> - Compartmentalization:
> - Should I create a separate mode for KDE documentation (e.g.
> en-gl-kdedoc) to implement this non-standard feature, since it is
> specific to this use case?
> - Should I apply the improvement to the mainstream en-gl mode, since
> the improvement can fix broken translations and it can only break
> translations that are already broken?

Well, one simple solution would be to add

<e><re>&amp;.*;</re><p><l/><r><s n="np"/></r></p></e>

to en-gl, and use the txt formatter. I doubt it'd break too much, since
anyone expecting actual HTML entities should be using apertium-deshtml
anyway (similarly for other formats). I would think this could go in the
main en-gl mode.
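
As a rough illustration of the two regular expressions you mention (one
to capture any entity, one to tell standard from custom), here is a
minimal Python sketch. The names ENTITY_RE, is_custom_entity and
custom_entities are made up for this example and are not part of
Apertium. It also uses a tighter pattern than "&.*;", since in Python at
least ".*" is greedy and would swallow everything between the first "&"
and the last ";" on a line; I have not checked how lttoolbox's <re>
behaves there.

```python
import re
from html.entities import name2codepoint

# Hypothetical pattern: "&", an entity name without spaces, then ";".
# Tighter than "&.*;" so that two entities on one line stay separate.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9._-]*);")


def is_custom_entity(name):
    """True when the name is not one of the standard HTML 4 entities."""
    return name not in name2codepoint


def custom_entities(text):
    """Return the custom entities found in text, in order of appearance."""
    return [m.group(0) for m in ENTITY_RE.finditer(text)
            if is_custom_entity(m.group(1))]


print(custom_entities("The &dolphin; Handbook &amp; the &kde; guide"))
# ['&dolphin;', '&kde;']
```

Only the entities that survive this filter would be candidates for the
np analysis; &amp; and friends would be left to the usual handling.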

> - Implementation:
> - Should I implement an alternative to html-noenv? (I do not think so
> because it is not really part of the format, it is part of the
> sentence)

The "noent" thing is on the *reformatter* side (avoids turning ø into
&oslash; etc.), so shouldn't be relevant here. Of course, you may want
to avoid entitising your ø's too.

But I'm guessing using apertium-destxt won't be good enough if you have
XML in there, so perhaps an alternative *deformatter* is needed, i.e. one
that doesn't treat entities as blanks-to-be-ignored. That should be
fairly simple to add based on deshtml, I think.
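
To make the idea concrete, here is a toy Python sketch of what such a
deformatter would do differently: tags still go into [...] blanks, but
entities stay in the text where the analyser can see them. This is only
an assumption-laden illustration, not how the real deformatters are
built; the real ones also escape reserved characters and add the
trailing ".[]", which this sketch skips. The names TAG_RE and deformat
are hypothetical.

```python
import re

# Hypothetical: anything between "<" and ">" counts as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def deformat(text):
    """Wrap XML tags in [...] blanks; leave entities in the text."""
    return TAG_RE.sub(lambda m: "[" + m.group(0) + "]", text)


print(deformat("<title>The &dolphin; Handbook</title>"))
# [<title>]The &dolphin; Handbook[</title>]
```

With the entity left outside the brackets, a dictionary entry like the
one above would get a chance to analyse it as a proper noun.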


-Kevin


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
