Re: [l10n-dev] Imagine :)

2007-07-12 Thread Rafaella Braconi

Hi Jean-Christophe,

thank you once again for sharing your thoughts and experience.

I am trying to reproduce what you describe below and to clarify it with the other engineers.


However, from what I understand here, the issue you see is not necessarily Pootle itself but the format Pootle delivers, which is .po. As already said, Pootle will be able to deliver the content in XLIFF format in the near future. Would you still see a problem with this?


Regards,
Rafaella

Jean-Christophe Helary wrote:

I have no idea where the UI files come from or how they _must_ be processed before they reach the state of l10n source files.


So, let me give a very simplified view of the Help file preparation for l10n, as seen from a pure TMX + TMX-supporting-tool point of view. Since I don't know what the internal processes really are, I can only guess, and I may be mistaken.


• The original Help files are English HTML file sets.
• Each localization has a set of files that corresponds to the English HTML sets.

• The English and localized versions are kept in sync.

To create TMX files:

Use a process that aligns each block-level tag in the English set to the corresponding block-level tag in the localized set. That is called paragraph (or block) segmentation, and it is what Sun does for NetBeans: no intermediary file format, no .sdf, no .po, nothing between the Help sets and the TMX sets.
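
To make that concrete, here is a minimal sketch of what such an alignment step could look like (Python; it assumes both file sets are well formed with matching block structure, it keeps text only, i.e. TMX level 1, and the file names are purely illustrative):

# Rough sketch: align the block-level elements of an English HTML file with
# its localized counterpart and write one TMX translation unit per pair.
# Assumes both files contain the same block elements in the same order.
from bs4 import BeautifulSoup           # any HTML parser would do
import xml.etree.ElementTree as ET

BLOCK_TAGS = ["p", "h1", "h2", "h3", "li", "td"]

def blocks(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    return [el.get_text(" ", strip=True) for el in soup.find_all(BLOCK_TAGS)]

def align_to_tmx(en_path, fr_path, tmx_path):
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "align-sketch", "creationtoolversion": "0.1",
        "segtype": "paragraph", "o-tmf": "none", "adminlang": "en-US",
        "srclang": "en-US", "datatype": "html"})
    body = ET.SubElement(tmx, "body")
    for en, fr in zip(blocks(en_path), blocks(fr_path)):
        tu = ET.SubElement(body, "tu")
        for lang, text in (("en-US", en), ("fr-FR", fr)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(tmx_path, encoding="utf-8", xml_declaration=True)

align_to_tmx("help/en/swriter.html", "help/fr/swriter.html", "swriter.tmx")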


The newly updated English Help files come as sets of files, all HTML.

The process to translate, after the original TMX conversion above (only _ONE_ conversion in the whole process), is the following:


Load the source file sets and the TMX sets in the tool.

The HTML tags are automatically handled by the tool.
The already translated segments are automatically translated by the  
tool.
The translator only needs to focus on what has been updated, using the whole translation memory as a reference.
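
As a toy illustration of that leverage step (a sketch of the principle only, not how any particular tool implements it; the segments and names are made up):

# Segments whose English text is found verbatim in the TMX are pre-filled
# with the legacy translation; everything else is left for the translator.
def pretranslate(source_segments, tmx_pairs):
    """tmx_pairs: dict mapping an English segment to its existing translation."""
    reused, todo = {}, []
    for seg in source_segments:
        if seg in tmx_pairs:   # unchanged paragraph: reuse the legacy translation
            reused[seg] = tmx_pairs[seg]
        else:                  # new or modified paragraph: human work
            todo.append(seg)
    return reused, todo

reused, todo = pretranslate(
    ["Click OK to apply.", "This paragraph was rewritten."],
    {"Click OK to apply.": "Cliquez sur OK pour appliquer."})
# -> one segment reused automatically, one left for the translator to update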


Once the translation is done, the translator delivers the full set, which is integrated into the release after proofreading, etc.


What is required on the source file provider's side? Creating TMX from the HTML paragraph sets.


What is required from the translator? No conversion whatsoever; just work with the files and let the tool automatically update the translation with the legacy data.




Now, what do we have currently?

The source file provider creates a differential of the new vs. the old HTML set.

It converts the result to an intermediate format (.sdf).
It converts that result to yet another intermediate format for the translator (either .po or XLIFF).
It matches the diffed strings to the corresponding old localized strings, thus removing the real context of the old strings.
It creates a false TMX based on an already intermediate format, without hiding the internal codes (no TMX level 2; all the tag info is handled as text data...).


The translator is left to use intermediate files that have been converted twice, losing most of their relation to the original format and increasing the probability of problems with the back conversion.


The translator also has to work with a false TMX that has none of the original context, which produces false matches that must be guessed backward and displays internal codes as text data.



Do you see where the overhead is?



It is very possible that the UI files do require some sort of intermediate conversion to provide the translators with a manageable set of files, but as far as the Help files are concerned (and as far as I understand the process at hand) there is absolutely no need whatsoever to use an intermediate conversion, to remove the original context, or to force the translator to use error-prone source files.



It is important to find ways to simplify the system so that more people can contribute and so that the source file provider has fewer tasks to handle, but clearly using a .po-based process to translate HTML files goes totally in the opposite direction. And translators are (sadly, without being conscious of it) suffering from that, which results in less time spent checking one's translation and a general overhead for checkers and converters.


Don't get me wrong, I am not ranting or anything. I _am_ really trying to convince people here that things could (and should) be drastically simplified, and for those who have some time, I encourage you to look at how NetBeans manages its localization process, because we are losing a _huge_ amount of human resources in the current process.


Cheers,

Jean-Christophe Helary (fr team)



Re: [l10n-dev] Imagine :)

2007-07-12 Thread F Wolff
On Thursday, 2007-07-12, at 10:36 [timezone +0200], Rafaella Braconi wrote:
 ...
 
 However, from what I understand here, the issue you see is not necessarily Pootle itself but the format Pootle delivers, which is .po. As already said, Pootle will be able to deliver the content in XLIFF format in the near future. Would you still see a problem with this?
 

Pootle has had XLIFF functionality since version 1.0. Hopefully we can upgrade the version on the server soon.

F




Re: [l10n-dev] Imagine :)

2007-07-12 Thread Rafaella Braconi



F Wolff wrote:


Pootle has had XLIFF functionality since version 1.0. Hopefully we can upgrade the version on the server soon.
 



That's really great news! Thank you for sharing this with us.

Rafaella





Re: [l10n-dev] Imagine :)

2007-07-12 Thread Jean-Christophe Helary


On 12 July 2007, at 20:29, Jean-Christophe Helary wrote:



On 12 July 2007, at 17:36, Rafaella Braconi wrote:

However, from what I understand here, the issue you see is not necessarily Pootle itself but the format Pootle delivers, which is .po. As already said, Pootle will be able to deliver the content in XLIFF format in the near future. Would you still see a problem with this?


Yes, because the problem is not the delivery format; it is the fact that you have two conversions from the HTML to the final format, and the conversion processes are not clean. Similarly, the TMX files you produce are not real TMX (at least not the one you sent me).


I am not arguing that UI files would benefit from such treatment. I  
am really focusing on the HTML documentation.


To make things even clearer, I am saying that using _any_  
intermediary format for documentation is a waste of resources.


If translators want to use intermediary formats to translate HTML in  
their favorite tool (be it PO, XLIFF or anything else) that is their  
business.


Janice (NetBeans) confirmed to me that NB was considering a Pootle server exclusively for UI files (currently Java properties files), but in the end that would mean overhead anyway, since the current process takes the Java properties as they are for translation in OmegaT.


In NB, the HTML documentation is available in packages corresponding to the modules, and the TMX (a real one...) makes it possible to automatically get only the updated segments. There is no need for a complex infrastructure to produce differentials of the files; all this is managed by the translation tool automatically, and _that_ allows the translator to get _much more_ leverage from the context and to benefit from a much greater choice of correspondences.


I suppose the overhead caused by the addition of an intermediary format for the UI files will be balanced by the management functions offered by the new system, but I wish we did not have to go through translating yet another intermediate format, for the simple reason that, judging from the existing conversion processes (I've tried only the translate-toolkit ones, and they were flawed enough to convince me _not_ to use their output), the conversion is likely to break the existing TMX. If the management system were evolved enough to output the same Java properties files, I am sure everybody would be happy. But, please, no more conversions than necessary.


To go back to the OOo processes, I have no doubt that a powerful management system available to the community is required. But in the end, why is there a need to produce .sdf files? Why can't we simply have HTML sets, like the NB project, that we'd translate with appropriately formed TMX files in appropriate tools?


My understanding from when I worked with Sun Translation Editor (when we were delivered .xlz files and before STE was released as OLT) is that we had to use XLIFF _because_ the .sdf format was obscure. But in the end, the discussion we are having now (after many years of running in circles, apparently) revolves not around how to ease the translator's work but around how to ease the management.


If the purpose of all this is to increase the quality of the translators' output, then it would be _much_ better to consider a similar system that uses the HTML sets directly, because _that_ would allow the translator to spend much more time checking the translation in commonly available tools (a web browser...). How do you check PO/XLIFF/SDF files without resorting to hacks?


Keeping things simple _is_ the way to go.

Jean-Christophe Helary (fr team)




Re: [l10n-dev] Imagine :)

2007-07-12 Thread Uwe Fischer

Hi,

Jean-Christophe Helary wrote:
...
To make things even clearer, I am saying that using _any_ intermediary 
format for documentation is a waste of resources.

...

let me add some words to your message:

The application help files are XML files. See 
http://documentation.openoffice.org/online_help/techdetails.html for 
details.

The Help Viewer converts the XML files to HTML when they are displayed.

Using XML, it should be even easier to use a straightforward translation process without intermediate files.


Today we have a 1:1 correspondence of paragraphs as the smallest units to be translated. Each paragraph has an ID number to ensure the correct mapping of the translated text. This means that no localization with added or removed parts of text is possible. Not 21st-century technology, in my opinion.
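
To give a rough idea of what that mapping looks like (a sketch only; the element and attribute names and the file paths are assumptions made for illustration, the real structure is described on the techdetails page linked above):

# Sketch: pull (paragraph ID -> text) pairs out of an English help XML file
# and out of its translation, then pair them by ID.
import xml.etree.ElementTree as ET

def paragraphs_by_id(path):
    root = ET.parse(path).getroot()
    return {p.get("id"): "".join(p.itertext()).strip()
            for p in root.iter("paragraph") if p.get("id")}

en = paragraphs_by_id("en-US/text/swriter/main0000.xhp")
de = paragraphs_by_id("de/text/swriter/main0000.xhp")
pairs = {pid: (en[pid], de[pid]) for pid in en if pid in de}
# Every translated paragraph is tied to its English source through the ID,
# which is exactly why only a strict 1:1 mapping is possible today.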


We want to add a link from every help page to a corresponding Wiki page, where every user can add comments (or more). This will require some effort to re-sync the source files in CVS with the user content from the Wiki, in all languages. Good ideas are welcome.


Uwe
--
  [EMAIL PROTECTED]  -  Technical Writer
  StarOffice - Sun Microsystems, Inc. - Hamburg, Germany
  http://www.sun.com/staroffice
  http://documentation.openoffice.org/online_help/index.html
  http://wiki.services.openoffice.org/wiki/Category:OnlineHelp
  http://blogs.sun.com/oootnt




Re: [l10n-dev] Imagine :)

2007-07-12 Thread Rafaella Braconi



Jean-Christophe Helary wrote:



On 13 July 2007, at 00:07, Uwe Fischer wrote:


Jean-Christophe Helary wrote:
...

To make things even clearer, I am saying that using _any_  
intermediary format for documentation is a waste of resources.


...

let me add some words to your message:



Uwe,

Thank you so much for your mail!

The application help files are XML files. See http://documentation.openoffice.org/online_help/techdetails.html for details.

The Help Viewer converts the XML files to HTML when they are  displayed.

Using XML, it should be even easier to use a straightforward translation process without intermediate files.



That is very good to know. There are already free generic XML filters that produce valid XLIFF: the Okapi framework, for example, developed by Yves Savourel, also editor of the XLIFF 1.0 spec. Okapi is developed in .NET 2.0, but I keep asking Yves to make it compatible with Mono so that it can be used in other environments. As a side note, OmegaT's XLIFF filter has been made specifically to support Okapi's output.


Today we have a 1:1 correspondence of paragraphs as the smallest units to be translated. Each paragraph has an ID number to ensure the correct mapping of the translated text. This means that no localization with added or removed parts of text is possible. Not 21st-century technology, in my opinion.



No, but that means that correct TMX files are a possibility (even now). By the way, I wonder why Rafaella told me that creating TMXs of the state of the strings before the current updates was impossible?


To clarify: the only possibility I have is to provide you with TMX files in which the translation exactly matches the current English text. If the English source has been changed, I have the following situation:

New English text - old translation (matching the previous text).
In the database I have no way to provide you with files containing both the old English text and the updated English text.

Rafaella




[l10n-dev] TMX/XLIFF output (Re: [l10n-dev] Imagine :))

2007-07-12 Thread Jean-Christophe Helary


On 13 July 2007, at 04:45, Rafaella Braconi wrote:



No, but that means that correct TMX files are a possibility (even now). By the way, I wonder why Rafaella told me that creating TMXs of the state of the strings before the current updates was impossible?


To clarify: the only possibility I have is to provide you with TMX files in which the translation exactly matches the current English text. If the English source has been changed, I have the following situation:


New English text - old translation (matching the previous text).
In the database I have no way to provide you with files containing both the old English text and the updated English text.


Don't you have a snapshot of the doc _before_ it is modified?

I mean, I have the 2.2.1 help files on my machine, so I can use the XML files in, for example, sbasic.jar in the EN folder, align them with the same files in the FR folder, and create a valid TMX of the state of the 2.2.1 version.
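
For instance, something along these lines (a rough sketch; the jar names and paths are illustrative, the .xhp extension is an assumption, and it relies only on the fact that a .jar is an ordinary zip archive):

# Sketch of the "snapshot TMX" idea: read the 2.2.1 help XML files straight
# out of the installed EN and FR jars, pair the files that exist in both,
# then pair their paragraphs by ID as described earlier in the thread.
import zipfile

def xml_members(jar_path):
    with zipfile.ZipFile(jar_path) as jar:
        return {name: jar.read(name).decode("utf-8")
                for name in jar.namelist() if name.endswith(".xhp")}

en_files = xml_members("help/en/sbasic.jar")
fr_files = xml_members("help/fr/sbasic.jar")
common = sorted(set(en_files) & set(fr_files))
# For each name in `common`, extract the EN and FR paragraphs, pair them by
# paragraph ID, and write the pairs into one TMX per module
# (sbasic.tmx, swriter.tmx, ...), as suggested below.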


This is what I suggest you keep somewhere, for each language pair (with EN as the source).


So you have a static set of TMX files, archived by module (sbasic, swriter, etc.) for each language, available from the community web site, and translators just get the TMX they need for their current assignment.


Such files don't need to be dynamically generated; they are valid for the most recent stable release, and once the release is updated the files can be output for the translation of the next version.


So, create the TMX _before_ you modify the database, _or_ from the static files that exist anyway inside any copy of OOo. And create TMX level 2 files, with all the original XML encapsulated, so as not to confuse CAT tools and translators.
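
To show what the difference means in practice, here is a minimal sketch of a level 2 segment (Python; <emph> stands in for whatever inline help markup occurs in the paragraph):

# Level 1 (roughly what the current export does): the help markup sits in the
# segment as ordinary text, so tools display and match it as translatable data:
#   <seg>Click &lt;emph&gt;OK&lt;/emph&gt; to continue.</seg>
# Level 2: the same markup is wrapped in paired inline elements, so a CAT tool
# can protect it while still matching the surrounding text:
import xml.etree.ElementTree as ET

def level2_seg(before, tag, inner, after):
    seg = ET.Element("seg")
    seg.text = before
    bpt = ET.SubElement(seg, "bpt", i="1")
    bpt.text = "<%s>" % tag
    bpt.tail = inner
    ept = ET.SubElement(seg, "ept", i="1")
    ept.text = "</%s>" % tag
    ept.tail = after
    return seg

print(ET.tostring(level2_seg("Click ", "emph", "OK", " to continue."),
                  encoding="unicode"))
# <seg>Click <bpt i="1">&lt;emph&gt;</bpt>OK<ept i="1">&lt;/emph&gt;</ept> to continue.</seg>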




Regarding the output of proper source files, now that we (I...)  
know that the original is in XML, it should be trivial to provide  
them either directly as XML sets (specifically _without_ outputting  
diffs), or as XML diffs, or as XLIFFs.


You may have some technical requirements that make you produce SDF files, but those only add an extra layer of complexity to the translation process, and I am sure you could have a clean XML output that includes all the SDF-contained meta info, so that the source file _is_ some kind of XML and not a hybrid that treats XML as text (which is the major source of confusion).


If you have an XML workflow from the beginning, it should be much safer to keep it XML all the way, hence:


original = XML (the OOo dialect)
diffs = XML (currently SDF, so shift to a dialect that uses the SDF info as attributes in XML diff tags, for example)

source = XML (XLIFF)
reference = XML (TMX, taken from the original)
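
As a sketch of the "source = XML (XLIFF)" step (the paragraph ID, file path and texts are made up; the point is only that each trans-unit keeps the original paragraph ID so the translation can be merged back without any .sdf in between):

# Wrap help paragraphs in XLIFF 1.0 trans-units keyed by their paragraph IDs.
import xml.etree.ElementTree as ET

def trans_unit(par_id, source_text, target_text=""):
    tu = ET.Element("trans-unit", id=par_id)
    ET.SubElement(tu, "source").text = source_text
    ET.SubElement(tu, "target").text = target_text
    return tu

xliff = ET.Element("xliff", version="1.0")
f = ET.SubElement(xliff, "file", {
    "original": "text/sbasic/shared/main0000.xhp",
    "source-language": "en-US", "target-language": "fr-FR",
    "datatype": "xml"})
body = ET.SubElement(f, "body")
body.append(trans_unit("par_id3150771", "Runs the selected Basic macro."))
print(ET.tostring(xliff, encoding="unicode"))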


TMX is not supported by most PO editors anyway, so a clean TMX would  
mostly benefit people who use appropriate translation tools (free  
ones included).


Regarding the XLIFF (or PO, depending on the community, I gather) source output, each community (and even each contributor) could use the output that fits the tools in use.


XLIFF should be 1.0 so as to ensure that OLT can be used (sadly, OLT does not support more recent versions of XLIFF).


And then you have a clean workflow that satisfies everybody, and the  
management (Pootle) system can be put on all that to provide  
communities with the best environment possible.


And of course, this workflow is also valid for UI strings, since I  
suppose they can also be converted to XML (if they are not already).


What about that?

Jean-Christophe Helary (fr team)
