Hi,

> 1. I would like to train on big parallel files, 39 million lines; does
> the training process have limitations for big files? (My hardware:
> quad core 1.8 GHz, 16 GB RAM.)

The one step that may choke on data of this size is GIZA++.
If that step fails, you could break up the corpus for
word alignment and then combine the results.
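
If you do need to split, something along the lines of the sketch below
(plain Python, not part of Moses; the corpus file names and the chunk
size are made-up placeholders) would carve the parallel corpus into
aligned chunks that can be word-aligned separately:

# Minimal sketch: split a parallel corpus into equal-sized, line-aligned
# chunks so each word-alignment run sees less data. File names and
# CHUNK_LINES are assumptions, not Moses defaults.

CHUNK_LINES = 10_000_000  # lines per chunk; adjust to what your machine handles

def split_parallel(src_path, tgt_path, chunk_lines=CHUNK_LINES):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        chunk, out_src, out_tgt = 0, None, None
        for i, (s, t) in enumerate(zip(src, tgt)):
            if i % chunk_lines == 0:
                if out_src:
                    out_src.close()
                    out_tgt.close()
                chunk += 1
                out_src = open(f"{src_path}.{chunk:02d}", "w", encoding="utf-8")
                out_tgt = open(f"{tgt_path}.{chunk:02d}", "w", encoding="utf-8")
            # write the same line to both sides so the chunks stay aligned
            out_src.write(s)
            out_tgt.write(t)
        if out_src:
            out_src.close()
            out_tgt.close()

if __name__ == "__main__":
    split_parallel("corpus.src", "corpus.tgt")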

> 2. Does training on East Asian languages such as English-Korean,
> English-Chinese, and English-Japanese require special settings?

No, the decoder and the training pipeline are pretty agnostic about the
character set and encoding, as long as the text is space-separated. That
does mean, though, that you have to perform word segmentation for
Chinese, Japanese, and Korean.
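
For what it's worth, a rough pre-processing sketch for Chinese follows.
The jieba segmenter is just one example and is not part of Moses, and
the file names are placeholders; the only requirement is that the
output be space-separated tokens:

# Sketch: pre-segment Chinese text into space-separated tokens before
# training. Any CJK segmenter will do; jieba is just one choice.
import jieba

def segment_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            tokens = jieba.cut(line.strip())   # lazy iterator of word tokens
            fout.write(" ".join(tokens) + "\n")

segment_file("corpus.zh", "corpus.seg.zh")  # file names are assumptions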

> 3. Is it possible to extract a parallel dictionary after the training
> process? If yes, what are the steps?

There are files model/lex* created during training that are
word-to-word probabilistic dictionaries. The phrase table is a
phrase-to-phrase dictionary, but it contains many entries that you
would not expect in a human-generated dictionary.
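
If you just want a simple word dictionary, a small script like the
sketch below works on those files. It assumes each line of a lex file
is two words followed by a probability; which side is source and which
is target depends on the direction (lex.e2f vs. lex.f2e), so inspect a
few lines first:

# Sketch: turn one of the model/lex.* files into a best-translation
# word dictionary, keeping the highest-probability pairing per word.

def best_translations(lex_path):
    best = {}
    with open(lex_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            w1, w2, prob = parts[0], parts[1], float(parts[2])
            # keep the highest-probability pairing for w2 (assumed source here)
            if w2 not in best or prob > best[w2][1]:
                best[w2] = (w1, prob)
    return best

dictionary = best_translations("model/lex.f2e")  # path is the usual default
for src, (tgt, p) in list(dictionary.items())[:10]:
    print(f"{src}\t{tgt}\t{p:.4f}")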

> 4. Is it possible to convert a binary phrase table back to a non-binary one?

The phrase table is first generated as a text file, so such a
conversion is not needed.
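
The text phrase table usually stays in the model directory (commonly as
model/phrase-table.gz, though the name can vary with your setup). Its
lines are " ||| "-separated fields, with the source and target phrases
first, as in this rough reading sketch:

# Sketch: print the first few source/target phrase pairs from the
# gzipped text phrase table. The exact number of fields per line varies
# between Moses versions, but source and target come first.
import gzip

def read_phrase_table(path, limit=5):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            fields = [x.strip() for x in line.split("|||")]
            src, tgt = fields[0], fields[1]
            print(f"{src} -> {tgt}")

read_phrase_table("model/phrase-table.gz")  # adjust the path to your setup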

-phi
