Everything looks OK; I'm not sure why it's segfaulting. Is the file model/reordering-table.* empty? If it is, then you should look in the log file steps/*/TRAINING_build-reordering.*.STDERR.
Or is evaluation/*.filtered.*/reordering-table.1.* empty? Is your test set empty?
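If it helps, something along these lines should show all of that at a glance. This is only a sketch: the working directory and corpus paths are taken from the config and the filter log quoted further down this thread, and the exact file names under evaluation/ may differ on your setup.

  # is the lexicalized reordering model empty or missing?
  ls -l /opt/tools/workingEms/model/reordering-table.*
  zcat /opt/tools/workingEms/model/reordering-table.*.gz | head -3

  # any errors while the reordering model was being built?
  tail -n 50 /opt/tools/workingEms/steps/*/TRAINING_build-reordering.*.STDERR

  # are the filtered tables for the test set empty?
  ls -l /opt/tools/workingEms/evaluation/*.filtered.*/reordering-table.* \
        /opt/tools/workingEms/evaluation/*.filtered.*/phrase-table.*

  # is the test set itself empty? (exact file names may differ)
  wc -l /opt/tools/workingEms/evaluation/test.*

  # and, since encoding came up earlier in the thread, check both sides of the corpus
  file /opt/tools/dataset/mizan/M_Tr.En /opt/tools/dataset/mizan/M_Tr.Fa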
On 8 December 2013 09:47, amir haghighi <amir.haghighi...@gmail.com> wrote:

> yes, the parallel data is UTF8 (one is UTF8 and one is ascii).
> all of the pre-processing steps are done with moses scripts.
>
> here is the EMS config file content:
>
> ################################################
> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
> ################################################
>
> [GENERAL]
>
> ### directory in which experiment is run
> #
> working-dir = /opt/tools/workingEms
>
> # specification of the language pair
> input-extension = En
> output-extension = Fa
> pair-extension = En-Fa
>
> ### directories that contain tools and data
> #
> # moses
> moses-src-dir = /opt/tools/mosesdecoder-RELEASE-1.0/mosesdecoder-RELEASE-1.0
> #
> # moses binaries
> moses-bin-dir = $moses-src-dir/bin
> #
> # moses scripts
> moses-script-dir = $moses-src-dir/scripts
> #
> # directory where GIZA++/MGIZA programs resides
> external-bin-dir = $moses-src-dir/tools
> #
> # srilm
> #srilm-dir = $moses-src-dir/srilm/bin/i686
> #
> # irstlm
> irstlm-dir = /opt/tools/irstlm/bin
> #
> # randlm
> #randlm-dir = $moses-src-dir/randlm/bin
> #
> # data
> toy-data = /opt/tools/dataset/mizan
>
> ### basic tools
> #
> # moses decoder
> decoder = $moses-bin-dir/moses
>
> # conversion of phrase table into binary on-disk format
> ttable-binarizer = $moses-bin-dir/processPhraseTable
>
> # conversion of rule table into binary on-disk format
> #ttable-binarizer = "$moses-bin-dir/CreateOnDiskPt 1 1 5 100 2"
>
> # tokenizers - comment out if all your data is already tokenized
> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
>
> # truecasers - comment out if you do not use the truecaser
> input-truecaser = $moses-script-dir/recaser/truecase.perl
> output-truecaser = $moses-script-dir/recaser/truecase.perl
> detruecaser = $moses-script-dir/recaser/detruecase.perl
>
> ### generic parallelizer for cluster and multi-core machines
> # you may specify a script that allows the parallel execution
> # parallizable steps (see meta file). you also need specify
> # the number of jobs (cluster) or cores (multicore)
> #
> #generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
> #generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl
>
> ### cluster settings (if run on a cluster machine)
> # number of jobs to be submitted in parallel
> #
> #jobs = 10
>
> # arguments to qsub when scheduling a job
> #qsub-settings = ""
>
> # project for priviledges and usage accounting
> #qsub-project = iccs_smt
>
> # memory and time
> #qsub-memory = 4
> #qsub-hours = 48
>
> ### multi-core settings
> # when the generic parallelizer is used, the number of cores
> # specified here
> cores = 8
>
> #################################################################
> # PARALLEL CORPUS PREPARATION:
> # create a tokenized, sentence-aligned corpus, ready for training
>
> [CORPUS]
>
> ### long sentences are filtered out, since they slow down GIZA++
> # and are a less reliable source of data.
> # set here the maximum length of a sentence
> #
> max-sentence-length = 80
>
> [CORPUS:toy]
>
> ### command to run to get raw corpus files
> #
> # get-corpus-script =
>
> ### raw corpus files (untokenized, but sentence aligned)
> #
> raw-stem = $toy-data/M_Tr
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-stem =
>
> ### if sentence filtering should be skipped,
> # point to the clean training data
> #
> #clean-stem =
>
> ### if corpus preparation should be skipped,
> # point to the prepared training data
> #
> #lowercased-stem =
>
> #################################################################
> # LANGUAGE MODEL TRAINING
>
> [LM]
>
> ### tool to be used for language model training
> # srilm
> #lm-training = $srilm-dir/ngram-count
> #settings = "-interpolate -kndiscount -unk"
>
> # irstlm training
> # msb = modified kneser ney; p=0 no singleton pruning
> #lm-training = "$moses-script-dir/generic/trainlm-irst2.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/tmp"
> #settings = "-s msb -p 0"
>
> # order of the language model
> order = 5
>
> ### tool to be used for training randomized language model from scratch
> # (more commonly, a SRILM is trained)
> #
> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> #lm-binarizer = $moses-bin-dir/build_binary
> #type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### each language model to be used has its own section here
>
> [LM:toy]
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = ""
>
> ### raw corpus (untokenized)
> #
> raw-corpus = $toy-data/M_Tr.$output-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-corpus =
>
> ### if corpus preparation should be skipped,
> # point to the prepared language model
> #
> lm = /opt/tools/lm2/M_FaforLm.blm.Fa
>
> #################################################################
> # INTERPOLATING LANGUAGE MODELS
>
> [INTERPOLATED-LM]
>
> # if multiple language models are used, these may be combined
> # by optimizing perplexity on a tuning set
> # see, for instance [Koehn and Schwenk, IJCNLP 2008]
>
> ### script to interpolate language models
> # if commented out, no interpolation is performed
> #
> # script = $moses-script-dir/ems/support/interpolate-lm.perl
>
> ### tuning set
> # you may use the same set that is used for mert tuning (reference set)
> #
> #tuning-sgm =
> #raw-tuning =
> #tokenized-tuning =
> #factored-tuning =
> #lowercased-tuning =
> #split-tuning =
>
> ### group language models for hierarchical interpolation
> # (flat interpolation is limited to 10 language models)
> #group = "first,second fourth,fifth"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> #lm-binarizer = $moses-bin-dir/build_binary
> type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> #################################################################
> # MODIFIED MOORE LEWIS FILTERING
>
> [MML] IGNORE
>
> ### specifications for language models to be trained
> #
> #lm-training = $srilm-dir/ngram-count
> #lm-settings = "-interpolate -kndiscount -unk"
> #lm-binarizer = $moses-src-dir/bin/build_binary
> #lm-query = $moses-src-dir/bin/query
> #order = 5
>
> ### in-/out-of-domain source/target corpora to train the 4 language model
> #
> # in-domain: point either to a parallel corpus
> #outdomain-stem = [CORPUS:toy:clean-split-stem]
>
> # ... or to two separate monolingual corpora
> #indomain-target = [LM:toy:lowercased-corpus]
> #raw-indomain-source = $toy-data/M_Tr.$input-extension
>
> # point to out-of-domain parallel corpus
> #outdomain-stem = [CORPUS:giga:clean-split-stem]
>
> # settings: number of lines sampled from the corpora to train each language model on
> # (if used at all, should be small as a percentage of corpus)
> #settings = "--line-count 100000"
>
> #################################################################
> # TRANSLATION MODEL TRAINING
>
> [TRAINING]
>
> ### training script to be used: either a legacy script or
> # current moses training script (default)
> #
> script = $moses-script-dir/training/train-model.perl
>
> ### general options
> # these are options that are passed on to train-model.perl, for instance
> # * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
> # * "-sort-buffer-size 8G -sort-compress gzip" to reduce on-disk sorting
> # * "-sort-parallel 8 -cores 8" to speed up phrase table building
> #
> #training-options = ""
>
> ### factored training: specify here which factors used
> # if none specified, single factor training is assumed
> # (one translation step, surface to surface)
> #
> #input-factors = word lemma pos morph
> #output-factors = word lemma pos
> #alignment-factors = "word -> word"
> #translation-factors = "word -> word"
> #reordering-factors = "word -> word"
> #generation-factors = "word -> pos"
> #decoding-steps = "t0, g0"
>
> ### parallelization of data preparation step
> # the two directions of the data preparation can be run in parallel
> # comment out if not needed
> #
> parallel = yes
>
> ### pre-computation for giza++
> # giza++ has a more efficient data structure that needs to be
> # initialized with snt2cooc. if run in parallel, this may reduces
> # memory requirements.
> # set here the number of parts
> #
> #run-giza-in-parts = 5
>
> ### symmetrization method to obtain word alignments from giza output
> # (commonly used: grow-diag-final-and)
> #
> alignment-symmetrization-method = grow-diag-final-and
>
> ### use of berkeley aligner for word alignment
> #
> #use-berkeley = true
> #alignment-symmetrization-method = berkeley
> #berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
> #berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
> #berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
> #berkeley-java-options = "-server -mx30000m -ea"
> #berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
> #berkeley-process-options = "-EMWordAligner.numThreads 8"
> #berkeley-posterior = 0.5
>
> ### use of baseline alignment model (incremental training)
> #
> #baseline = 68
> #baseline-alignment-model = "$working-dir/training/prepared.$baseline/$input-extension.vcb \
> #  $working-dir/training/prepared.$baseline/$output-extension.vcb \
> #  $working-dir/training/giza.$baseline/${output-extension}-$input-extension.cooc \
> #  $working-dir/training/giza-inverse.$baseline/${input-extension}-$output-extension.cooc \
> #  $working-dir/training/giza.$baseline/${output-extension}-$input-extension.thmm.5 \
> #  $working-dir/training/giza.$baseline/${output-extension}-$input-extension.hhmm.5 \
> #  $working-dir/training/giza-inverse.$baseline/${input-extension}-$output-extension.thmm.5 \
> #  $working-dir/training/giza-inverse.$baseline/${input-extension}-$output-extension.hhmm.5"
>
> ### if word alignment should be skipped,
> # point to word alignment files
> #
> #word-alignment = $working-dir/model/aligned.1
>
> ### filtering some corpora with modified Moore-Lewis
> # specify corpora to be filtered and ratio to be kept, either before or after word alignment
> #mml-filter-corpora = toy
> #mml-before-wa = "-proportion 0.9"
> #mml-after-wa = "-proportion 0.9"
>
> ### create a bilingual concordancer for the model
> #
> #biconcor = $moses-script-dir/ems/biconcor/biconcor
>
> ### lexicalized reordering: specify orientation type
> # (default: only distance-based reordering model)
> #
> lexicalized-reordering = msd-bidirectional-fe
>
> ### hierarchical rule set
> #
> #hierarchical-rule-set = true
>
> ### settings for rule extraction
> #
> #extract-settings = ""
> max-phrase-length = 5
>
> ### add extracted phrases from baseline model
> #
> #baseline-extract = $working-dir/model/extract.$baseline
> #
> # requires aligned parallel corpus for re-estimating lexical translation probabilities
> #baseline-corpus = $working-dir/training/corpus.$baseline
> #baseline-alignment = $working-dir/model/aligned.$baseline.$alignment-symmetrization-method
>
> ### unknown word labels (target syntax only)
> # enables use of unknown word labels during decoding
> # label file is generated during rule extraction
> #
> #use-unknown-word-labels = true
>
> ### if phrase extraction should be skipped,
> # point to stem for extract files
> #
> # extracted-phrases =
>
> ### settings for rule scoring
> #
> score-settings = "--GoodTuring"
>
> ### include word alignment in phrase table
> #
> #include-word-alignment-in-rules = yes
>
> ### sparse lexical features
> #
> #sparse-lexical-features = "target-word-insertion top 50, source-word-deletion top 50, word-translation top 50 50, phrase-length"
>
> ### domain adaptation settings
> # options: sparse, any of: indicator, subset, ratio
> #domain-features = "subset"
>
> ### if phrase table training should be skipped,
> # point to phrase translation table
> #
> # phrase-translation-table =
>
> ### if reordering table training should be skipped,
> # point to reordering table
> #
> # reordering-table =
>
> ### filtering the phrase table based on significance tests
> # Johnson, Martin, Foster and Kuhn. (2007): "Improving Translation Quality by Discarding Most of the Phrasetable"
> # options: -n number of translations; -l 'a+e', 'a-e', or a positive real value -log prob threshold
> #salm-index = /path/to/project/salm/Bin/Linux/Index/IndexSA.O64
> #sigtest-filter = "-l a+e -n 50"
>
> ### if training should be skipped,
> # point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config-with-reused-weights =
>
> #####################################################
> ### TUNING: finding good weights for model components
>
> [TUNING]
>
> ### instead of tuning with this setting, old weights may be recycled
> # specify here an old configuration file with matching weights
> #
> weight-config = $toy-data/weight.ini
>
> ### tuning script to be used
> #
> tuning-script = $moses-script-dir/training/mert-moses.pl
> tuning-settings = "-mertdir $moses-bin-dir"
>
> ### specify the corpus used for tuning
> # it should contain 1000s of sentences
> #
> #input-sgm =
> #raw-input =
> #tokenized-input =
> #factorized-input =
> #input =
> #
> #reference-sgm =
> #raw-reference =
> #tokenized-reference =
> #factorized-reference =
> #reference =
>
> ### size of n-best list used (typically 100)
> #
> nbest = 100
>
> ### ranges for weights for random initialization
> # if not specified, the tuning script will use generic ranges
> # it is not clear, if this matters
> #
> # lambda =
>
> ### additional flags for the filter script
> #
> filter-settings = ""
>
> ### additional flags for the decoder
> #
> decoder-settings = ""
>
> ### if tuning should be skipped, specify this here
> # and also point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config =
>
> #########################################################
> ## RECASER: restore case, this part only trains the model
>
> [RECASING]
>
> #decoder = $moses-bin-dir/moses
>
> ### training data
> # raw input needs to be still tokenized,
> # also also tokenized input may be specified
> #
> #tokenized = [LM:europarl:tokenized-corpus]
>
> # recase-config =
>
> #lm-training = $srilm-dir/ngram-count
>
> #######################################################
> ## TRUECASER: train model to truecase corpora and input
>
> [TRUECASER]
>
> ### script to train truecaser models
> #
> trainer = $moses-script-dir/recaser/train-truecaser.perl
>
> ### training data
> # data on which truecaser is trained
> # if no training data is specified, parallel corpus is used
> #
> # raw-stem =
> # tokenized-stem =
>
> ### trained model
> #
> # truecase-model =
>
> ######################################################################
> ## EVALUATION: translating a test set using the tuned system and score it
>
> [EVALUATION]
>
> ### additional flags for the filter script
> #
> #filter-settings = ""
>
> ### additional decoder settings
> # switches for the Moses decoder
> # common choices:
> # "-threads N" for multi-threading
> # "-mbr" for MBR decoding
> # "-drop-unknown" for dropping unknown source words
> # "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
> #
> decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
>
> ### specify size of n-best list, if produced
> #
> #nbest = 100
>
> ### multiple reference translations
> #
> #multiref = yes
>
> ### prepare system output for scoring
> # this may include detokenization and wrapping output in sgm
> # (needed for nist-bleu, ter, meteor)
> #
> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
> #recaser = $moses-script-dir/recaser/recase.perl
> wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
> #output-sgm =
>
> ### BLEU
> #
> nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
> nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
> #multi-bleu = $moses-script-dir/generic/multi-bleu.perl
> #ibm-bleu =
>
> ### TER: translation error rate (BBN metric) based on edit distance
> # not yet integrated
> #
> # ter =
>
> ### METEOR: gives credit to stem / worknet synonym matches
> # not yet integrated
> #
> # meteor =
>
> ### Analysis: carry out various forms of analysis on the output
> #
> analysis = $moses-script-dir/ems/support/analysis.perl
> #
> # also report on input coverage
> analyze-coverage = yes
> #
> # also report on phrase mappings used
> report-segmentation = yes
> #
> # report precision of translations for each input word, broken down by
> # count of input word in corpus and model
> #report-precision-by-coverage = yes
> #
> # further precision breakdown by factor
> #precision-by-coverage-factor = pos
> #
> # visualization of the search graph in tree-based models
> #analyze-search-graph = yes
>
> [EVALUATION:test]
>
> ### input data
> #
> input-sgm = $toy-data/M_Ts.$input-extension
> # raw-input =
> # tokenized-input =
> # factorized-input =
> # input =
>
> ### reference data
> #
> reference-sgm = $toy-data/M_Ts.$output-extension
> # raw-reference =
> # tokenized-reference =
> # reference =
>
> ### analysis settings
> # may contain any of the general evaluation analysis settings
> # specific setting: base coverage statistics on earlier run
> #
> #precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
>
> ### wrapping frame
> # for nist-bleu and other scoring scripts, the output needs to be wrapped
> # in sgm markup (typically like the input sgm)
> #
> wrapping-frame = $input-sgm
>
> ##########################################
> ### REPORTING: summarize evaluation scores
>
> [REPORTING]
>
> ### currently no parameters for reporting section
>
> On Sat, Dec 7, 2013 at 7:21 PM, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
>
>> are you sure the parallel data is encoded in UTF8? Was it tokenized,
>> cleaned and escaped by the Moses scripts or by another external script?
>>
>> Can you please send me you EMS config file too
>>
>> On 7 December 2013 14:03, amir haghighi <amir.haghighi...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have also the same problem in evaluation step with EMS and I would be
>>> thankful if you could help me.
>>> the lexical reordering file is emtpy and the log of the output in
>>> evaluation_test_filter.2.stderr is:
>>>
>>> Using SCRIPTS_ROOTDIR: /opt/tools/mosesdecoder-RELEASE-1.0/mosesdecoder-RELEASE-1.0/scripts
>>> (9) create moses.ini @ Sat Dec 7 04:50:15 PST 2013
>>> Executing: mkdir -p /opt/tools/workingEms/evaluation/test.filtered.2
>>> Considering factor 0
>>> Considering factor 0
>>> filtering /opt/tools/workingEms/model/phrase-table.2 ->
>>> /opt/tools/workingEms/evaluation/test.filtered.2/phrase-table.0-0.1.1...
>>> 0 of 2197240 phrases pairs used (0.00%) - note: max length 10
>>> binarizing...cat /opt/tools/workingEms/evaluation/test.filtered.2/phrase-table.0-0.1.1 |
>>> LC_ALL=C sort -T /opt/tools/workingEms/evaluation/test.filtered.2 |
>>> /opt/tools/mosesdecoder-RELEASE-1.0/mosesdecoder-RELEASE-1.0/bin/processPhraseTable
>>> -ttable 0 0 - -nscores 5 -out
>>> /opt/tools/workingEms/evaluation/test.filtered.2/phrase-table.0-0.1.1
>>> processing ptree for stdin
>>> Segmentation fault (core dumped)
>>> filtering /opt/tools/workingEms/model/reordering-table.2.wbe-msd-bidirectional-fe.gz ->
>>> /opt/tools/workingEms/evaluation/test.filtered.2/reordering-table.2.wbe-msd-bidirectional-fe...
>>> 0 of 2197240 phrases pairs used (0.00%) - note: max length 10
>>> binarizing.../opt/tools/mosesdecoder-RELEASE-1.0/mosesdecoder-RELEASE-1.0/bin/processLexicalTable
>>> -in /opt/tools/workingEms/evaluation/test.filtered.2/reordering-table.2.wbe-msd-bidirectional-fe
>>> -out /opt/tools/workingEms/evaluation/test.filtered.2/reordering-table.2.wbe-msd-bidirectional-fe
>>> processLexicalTable v0.1 by Konrad Rawlik
>>> processing /opt/tools/workingEms/evaluation/test.filtered.2/reordering-table.2.wbe-msd-bidirectional-fe
>>> to /opt/tools/workingEms/evaluation/test.filtered.2/reordering-table.2.wbe-msd-bidirectional-fe.*
>>> ERROR: empty lexicalised reordering file
>>>
>>> Barry Haddow <bhaddow@...> writes:
>>>
>>> > Hi Irene
>>> >
>>> > > But the output is empty. And the errors are 1. segmentation fault
>>> > 2. error: empty lexicalized
>>> > > reordering file
>>> >
>>> > Is this lexicalised reordering file empty then?
>>> >
>>> > It would be helpful if you could post the full log of the output when
>>> > your run the filter command,
>>> >
>>> > cheers - Barry
>>> >
>>> > On 26/10/12 17:59, Irene Huang wrote:
>>> > > Hi, I have trained and tuned the model, now I am using
>>> > >
>>> > > ~/mosesdecoder/scripts/training/filter-model-given-input.pl
>>> > > <http://filter-model-given-input.pl> filtered-newstest2011
>>> > > mert-work/moses.ini ~/corpus/newstest2011.true.fr
>>> > > <http://newstest2011.true.fr> \
>>> > > -Binarizer ~/mosesdecoder/bin/processPhraseTable
>>> > >
>>> > > to filter the phrase table.
>>> > >
>>> > > But the output is empty. And the errors are 1. segmentation fault
>>> > > 2. error: empty lexicalized reordering file
>>> > >
>>> > > So does this mean it's out of memory error?
>>> > >
>>> > > Thanks
>>> > >
>>> > > _______________________________________________
>>> > > Moses-support mailing list
>>> > > Moses-support@...
>>> > > http://mailman.mit.edu/mailman/listinfo/moses-support
>>> >
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>> --
>> Hieu Hoang
>> Research Associate
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support