Re: [l10n-dev] Translating .sdf files directly with OmegaT

2007-07-12 Thread Jean-Christophe Helary


On 13 juil. 07, at 02:40, Alessandro Cattelan wrote:


Actually I haven't tried going through the procedure you described; I think I'll give it a try with the next batch of files. We'll have around 4,200 words to translate, and as that is a reasonable volume I think I'll have some time to spend testing a new procedure.

What I fear, though, is that OmegaT would become extremely slow processing a huge SDF file. If I have a bunch of PO files I can import just a few of them into the OmT project at a time, and that makes it possible to translate without too much CPU sweat :o). When I tried loading the whole OLH project on which we worked in June, my computer was almost collapsing: it took me over an hour just to load the project!

I don't have a powerful machine (AMD Athlon XP, 1500 MHz, 700 MB RAM), but I think that if you have a big TM it is not wise to load a project with over a thousand segments.


You are definitely right here: the bigger the TMX, the more memory it takes.


Which is the reason why I just suggested (in the Imagine thread) that we have TMXs by module.


Also, you can assign OmegaT more memory than you actually have on your machine. I use OmegaT like this:


java -server -Xmx2048M -jar OmegaT.jar 

The -server option makes it faster too.

The sdf files we have are not that big, though. So you have to be selective with the TMXs you use.



Maybe we could split the SDF file into smaller ones, but I'm not sure
that would work.


If you try my method, you can translate bit by bit. There is no problem with that. What matters is that the reverse conversion is properly made.


JC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [l10n-dev] Translating .sdf files directly with OmegaT

2007-07-11 Thread Jean-Christophe Helary


On 11 juil. 07, at 15:29, Arthur Buijs wrote:


The overhead of using PO files in the translation process is minimal (except for the initial trying out).


It is not when you have to modify the tagged links to fit the source. In OmegaT that is done automatically, without you even noticing it.


Also, all the <emph> tags, if they need to be moved or edited, require more work in a text-based editor than in OmegaT (if done the way I suggested).


Of course, using the PO files in a PO editor or in OmegaT will not make much difference in terms of editing the matches. The problem _is_ which source file you choose to work with and what relation it has to the original format (here: HTML -> SDF -> PO, with almost no relation left by the time you reach the PO stage).


So I am really talking about not using PO, because _that_ requires handling the files as text, while using the modified .sdf allows them to be handled as HTML (which does considerably reduce the amount of editing).



Of course this is only true if a usable TMX file is available. My advice would be to find a better way to generate TMX files and use PO files for the translation task.


The TMXs provided by Rafaella were similar to the ones produced by the translate-toolkit processes (oo2po -> po2tmx), and neither corresponded to the source PO file in terms of the number of backslash (\) characters in the escape sequences. They corresponded to the original .sdf file, which is what originally prompted me to use the original .sdf file as source. The rest of the hack I proposed on 7/7 comes from that.



The general problem does not only come from the TMX, but from the fact that .sdf is already an intermediate format (which you then convert to yet another intermediate format, PO).


The original conversion requires escapes, and _that_ is what requires the files to be handled as text, when they could just as well be handled as pure and simple HTML, which most translation tools support.


The TMX problem is yet another problem.

Here, we have the following structure for the TMXs:

(new source segment)
(old target translation, if present)

A _real_ TMX should be:

(old source segment)
(old target translation)

So the current process is very confusing and does not allow TMX-supporting tools (like OmegaT or even OLT) to fully leverage the contents of the source, which is the real function of the TMX file.
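To make the difference concrete, here is a minimal sketch of a single translation unit with the structure a _real_ TMX should have, old source paired with old target. The element names follow the TMX specification; the segment text is invented, and the snippet just checks that the pair parses as expected:

```python
# Minimal sketch of one TMX translation unit (TU) with the structure a
# _real_ TMX should have: the OLD source segment paired with its OLD
# translation. Segment text is invented for illustration.
import xml.etree.ElementTree as ET

real_tu = """
<tu>
  <tuv xml:lang="en-US"><seg>old source segment</seg></tuv>
  <tuv xml:lang="fr"><seg>old target translation</seg></tuv>
</tu>
"""

tu = ET.fromstring(real_tu)
segs = [seg.text for seg in tu.iter("seg")]
print(segs)
```

A tool like OmegaT can only leverage such a pair because both halves refer to the same (old) segment; pairing a new source with an old translation breaks that link.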


Plus, the fact that the TMXs do not reflect the structure of the actual source file (PO) makes them yet another problem.



Of course, I am commenting on the process only from the perspective of allowing translation contributors to have access to a translation workflow that supports the use of computer-aided translation tools. Right now, the process suggested by the file formats available for OOo's localization does not facilitate this at all.


Another of Sun's projects, namely NetBeans, manages to fully leverage legacy translations thanks to the use of simple source file formats (the UI files are simple Java properties and the Help files are simply HTML), and the whole source files are matched to the legacy translations and output to TMX for super easy translation (in OmegaT or any other TMX-supporting tool, even though OmegaT is the most used tool there).


As long as OOo sticks to intermediate file formats (.sdf/.po/.xliff) with the current unstable conversion processes, hacks will be necessary to reach the level of efficiency other communities have already reached. And _that_ is really too bad.



Cheers,

Jean-Christophe Helary (fr)




Re: [l10n-dev] Translating .sdf files directly with OmegaT

2007-07-10 Thread Jean-Christophe Helary

Ale,

I was wondering if you eventually had considered this procedure. It works very well and considerably increases productivity thanks to OmegaT's HTML handling features. I think I'm going to investigate the possibility of having an .sdf filter for OmegaT rather than having to go through all the PO loops, which really don't provide much more than yet another intermediate format that is anyway inconvenient to translate.


JC


[l10n-dev] Translating .sdf files directly with OmegaT

2007-07-06 Thread Jean-Christophe Helary
The reason why I tried to do that is because using the .po created with oo2po along with the TMX created with po2tmx does not work well. The po2tmx step removes data from the escape sequences, and that means more things to type in the OmegaT edit window.


So, the idea was to consider the .sdf file as a pseudo-HTML file to benefit from a few automatic goodies offered by OmegaT:
1) tag reduction (so that one needs to type less when tags are inline) and
2) tag protection (for block tags like \<ahelp ...\>...\</ahelp\> when they open and close the segment)


If the TMX could be hacked to show formatting tags similar to the modified source file, it would become trivial to edit the tags and reflect the new contents found in the source.


Problem is, an .sdf file is not an HTML file: there is plenty of meta information and a lot of escaped characters (\<, \> and others). Also, an .sdf file seems to be made of two-line blocks: the source line and the target line.


The first problem will be solved later. For now, to extract the translatable contents, we need to change the two-line blocks into one-line blocks with the source and target data next to each other.


This is done using regexps like the following (they are not exact, I am writing them from memory, and they may change depending on the editor you choose):

search for:
^(.*)(en-US)(.*)\r^(.*)(fr)(.*)
replace with:
\1\2\3\t\4\5\6
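For what it's worth, the same two-line to one-line merge can be sketched in Python; the sample lines and file IDs below are invented, and the pattern is the one above with \n instead of \r:

```python
# Sketch of the two-line -> one-line merge: join each en-US line with
# the fr line that follows it, separated by a tab. The sample .sdf
# lines are invented; real .sdf lines have many more fields.
import re

sdf = (
    "helpcontent2\t01010000.xhp\ten-US\tSome source text\n"
    "helpcontent2\t01010000.xhp\tfr\tDu texte traduit\n"
)

linearized = re.sub(
    r"^(.*)(en-US)(.*)\n^(.*)(fr)(.*)$",
    r"\1\2\3\t\4\5\6",
    sdf,
    flags=re.MULTILINE,
)
print(linearized)
```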

Now that your .sdf is linearized, change its name to .csv and open it in OpenOffice using tab as the field separator and nothing as the text delimiter.


The tabs in the original .sdf create a number of columns, from which you just need to copy the column with the en-US translatable contents.


Paste that into a text file that you'll rename to .html

Now, we need to convert this to pseudo-HTML, the idea being that OmegaT will smoothly handle all the \<ahelp\> etc. tags found there.


First of all, we need to understand that not all the < characters are tag-opening characters; a number of them are simply less-than characters. So we grab those first:


search for:
([^\\])<
replace with:
\1&lt;

The > characters are less of a problem, but let's do them anyway:

search for:
([^\\])>
replace with:
\1&gt;

Now we can safely assume that all the remaining < or > characters are escaped with \, and to correct that (so that the genuine tags can be recognized in OmegaT) do:


search for:
\\<
replace with:
<

search for:
\\>
replace with:
>

Last but not least, to ensure that OmegaT will consider each line as being a segment, we need to add the paragraph mark to each line beginning:

search for:
^
replace with:
<p>

Save, the file should be ready to be processed.
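The search-and-replace passes above can be sketched in Python as follows; the sample line is invented, and the regexps are transcriptions of the (admittedly approximate) ones in the steps:

```python
# Sketch of the .sdf -> pseudo-HTML conversion: turn bare < and > into
# entities, unescape the \< \> tag markers, and prefix each line with
# <p> so OmegaT makes one segment per line. Sample text is invented.
import re

text = r'a \<ahelp hid="HID_TEST"\>tip\</ahelp\> where 1 < 2'

text = re.sub(r"([^\\])<", r"\1&lt;", text)  # bare < -> &lt;
text = re.sub(r"([^\\])>", r"\1&gt;", text)  # bare > -> &gt;
text = text.replace(r"\<", "<").replace(r"\>", ">")  # \<tag\> -> <tag>
text = "<p>" + text  # paragraph mark so each line is one segment
print(text)  # <p>a <ahelp hid="HID_TEST">tip</ahelp> where 1 &lt; 2
```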



Now, we need to get matches from the TMX files that either we have created (oo2po -> po2tmx) or that Rafaella et al. have provided us with.


Problem is that the TMX files reflect the contents of the .sdf as it was before the modifications we have just made.


In the TMX, we are likely to find an ahelp tag written as \<ahelp something\>, which will not be helpful since in OmegaT the ahelp tag will be displayed as <a0> and thus will not match the \<ahelp something\> string.


So, we need to hack the file so that it looks close enough to what  
the source expects...


In the TMX we want to reduce _all_ the escaped tags to a short expression that looks like <a> for a tag starting with a.


So we would do something like this (here again, not 100% exact regexps):

search for:
\\<(.)[^>]*>
replace with:
&lt;\1&gt;

same for the tail tags:

search for:
\\</(.)[^>]*>
replace with:
&lt;/\1&gt;
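As a sketch (with an invented segment), the reduction can be done like this in Python; note that the tail-tag pass has to run first here, otherwise the head-tag pattern would also swallow the \</...\> closers:

```python
# Sketch of the TMX tag reduction: collapse every escaped tag such as
# \<ahelp hid="x"\> to the entity form of its first letter (&lt;a&gt;),
# so it matches the reduced tags OmegaT shows. Sample text is invented.
import re

seg = r'Use \<ahelp hid="HID_TEST"\>this\</ahelp\> now'

seg = re.sub(r"\\</(.)[^>]*>", r"&lt;/\1&gt;", seg)  # tail tags first
seg = re.sub(r"\\<(.)[^>]*>", r"&lt;\1&gt;", seg)    # then head tags
print(seg)  # Use &lt;a&gt;this&lt;/a&gt; now
```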

If I remember everything I did in the last few days correctly, that is about it. Save the TMX, put it in /tm/, load the project and translate...


You can also put the Sun glossaries in /glossary/ after a little bit  
of formatting. But that too is trivial.



When translation is done, it is important to verify the tags (Tools > Validate Tags): click on each segment where the tags don't match the source and correct the target.


Then Project > Create translated files

Get the translated .html file from /target/

And now we need to process the whole thing backwards to revert it to its original .sdf form.


1) remove all the <p> at the beginning of the lines
2) replace all the < with \<, all the > with \>, all the &lt; with < and all the &gt; with >
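Sketched in Python, the reverse conversion of one translated line looks like this (sample line invented; the order matters, since the literal tags must be re-escaped before the entities are turned back into characters):

```python
# Sketch of the back conversion: drop the leading <p>, re-escape the
# real tags as \< \>, then restore literal < and > from the &lt; &gt;
# entities. Sample translated line is invented.
line = '<p>a <ahelp hid="HID_TEST">tip</ahelp> where 1 &lt; 2'

line = line.removeprefix("<p>")  # 1) remove the paragraph mark
line = line.replace("<", r"\<").replace(">", r"\>")  # re-escape tags
line = line.replace("&lt;", "<").replace("&gt;", ">")  # entities back
print(line)  # a \<ahelp hid="HID_TEST"\>tip\</ahelp\> where 1 < 2
```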



This should be enough. Now copy the whole file and paste it into the target contents part of the still-open .csv file.


The .csv file now contains the source part and the target part next  
to each other.


Let's save this (be careful: tab as field separator and nothing  
as text delimiter).


Open the result in the text editor.

The pattern we need to find to revert the one-line blocks to two-line blocks is something like:

(something)(followed by lots of en-US stuff)(a tab)(the same something)(followed by lots of translated stuff)

search for:
^([^\t]+)(.*)\t\1(.*)$
and we need to replace it with:
\1\2\r\1\3
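Here again a Python sketch with invented data; the leading field repeats before the translated half, so a backreference to it finds where to reinsert the line break:

```python
# Sketch of the one-line -> two-line split: the leading field of the
# merged line repeats before the translated half, so backreference it
# and put a line break there. Sample merged line is invented.
import re

merged = "id01\ten-US\tSome source text\tid01\tfr\tDu texte traduit"

two_lines = re.sub(
    r"^([^\t]+)(.*)\t\1(.*)$",
    r"\1\2\n\1\3",
    merged,
    flags=re.MULTILINE,
)
print(two_lines)
```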

Make sure there are no mistakes (if there are any, they are likely to appear right in the first lines).


Now you should have your two-line blocks.

Rename the file to .sdf and here you are.



This