Adrián Chaves Fernández <adriyeticha...@gmail.com> wrote:

> I am working on texts extracted from docbook (XML) files where custom
> HTML entities are often used in place of proper nouns, mainly for the
> name of software applications.
>
> I would like the en-gl pair to be able to handle it, to fix the
> following translation error:
>     Source string: The &dolphin; Handbook
>     Expected translation: O manual de &dolphin;
>     Actual translation: O &dolphin; Manual
>
> So I would like Apertium to interpret custom HTML entities (I can
> write a regular expression to capture any HTML entity, and another one
> to tell standard HTML entities apart from custom HTML entities) as
> proper nouns.

Interesting. It's obvious that the plain HTML deformatter won't do here:

$ echo 'The &dolphin; Handbook' | apertium-deshtml
The[ &dolphin; ]Handbook.[][ ]

I.e. it treats the entity as a blank to be ignored.

> But I do not know where to start. The following questions come to
> mind:
> - Compartmentalization:
>   - Should I create a separate mode for KDE documentation (e.g.
>     en-gl-kdedoc) to implement this non-standard feature, since it is
>     specific to this use case?
>   - Should I apply the improvement to the mainstream en-gl mode, since
>     the improvement can fix broken translations and it can only break
>     translations that are already broken?

Well, one simple solution would be to add

<e><re>&.*;</re><p><l/><r><s n="np"/></r></p></e>

to en-gl, and use the txt formatter; I doubt it'd break too much, since
anyone expecting actual HTML entities should be using apertium-deshtml
anyway (similarly for other formats). I would think that could go in the
main en-gl mode.

> - Implementation:
>   - Should I implement an alternative to html-noenv? (I do not think so
>     because it is not really part of the format, it is part of the
>     sentence)

The "noent" thing is on the *reformatter* side (it avoids turning ø into
&oslash; etc.), so it shouldn't be relevant here. Of course, you may want
to avoid entitising your ø's too.

But I'm guessing using apertium-destxt won't be good enough if you have
XML in there, so perhaps an alternative *deformatter* is needed, i.e. one
that doesn't treat entities as blanks-to-be-ignored. That should be
fairly simple to add based on deshtml, I think.

-Kevin
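
For illustration only, here is a rough Python sketch of the two ideas discussed above: telling standard HTML entities apart from custom ones, and a deformatter-like pass that keeps custom entities in the text instead of hiding them in a blank. It is not how apertium-deshtml is implemented. The names ENTITY, is_custom_entity, custom_entities and toy_deformat are made up for this example, "standard" is assumed to mean "listed in Python's html.entities.html5 table", and wrapping standard entities in [...] is only a stand-in for whatever a real deformatter would do with them. Numeric references such as &#248; are deliberately not matched.

```python
import re
from html.entities import html5  # stdlib table of standard HTML5 entity names

# Hypothetical pattern for named entities such as &dolphin; or &amp;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9._-]*);")


def is_custom_entity(name):
    """True if `name` (without '&' and ';') is not a standard HTML5 entity."""
    return (name + ";") not in html5


def custom_entities(text):
    """Return the custom entities found in `text`, in order of appearance."""
    return [m.group(0) for m in ENTITY.finditer(text)
            if is_custom_entity(m.group(1))]


def toy_deformat(text):
    """Wrap standard entities in [...] and leave custom ones in the text.

    Toy behaviour for this sketch only: custom entities stay visible so
    that a dictionary entry could analyse them as proper nouns.
    """
    def repl(match):
        if is_custom_entity(match.group(1)):
            return match.group(0)
        return "[" + match.group(0) + "]"
    return ENTITY.sub(repl, text)


if __name__ == "__main__":
    print(custom_entities("The &dolphin; Handbook &amp; more"))
    print(toy_deformat("The &dolphin; Handbook &amp; more"))
```

With something along these lines, &dolphin; would reach the analyser as ordinary text, where the regular-expression dictionary entry could tag it as a proper noun.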

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff