[ https://issues.apache.org/jira/browse/JOSHUA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435133#comment-15435133 ]
Lewis John McGibbney commented on JOSHUA-304:
---------------------------------------------

It may help for me to post the options available in the current Berkeley aligner jar, which was built when I installed Joshua:

{code}
lmcgibbn@LMC-032857 /usr/local/incubator-joshua(master) $ java -jar ./lib/berkeleyaligner.jar -help
Usage:
log.maxIndLevel < int> : Maximum indent level. [10]
log.msPerLine < int> : Maximum number of milliseconds between consecutive lines of output. [1000]
log.file < str> : File to write log. []
log.stdout < bool> : Whether to output to the console. [true]
log.note < str> : Dummy placeholder for a comment []
log.forcePrint < bool> : Force printing from logs* [false]
log.maxPrintErrors < int> : Maximum number of errors (via error()) to print [10000]
EMWordAligner.nullProb < dbl> : How to assign null-word probabilities (=1 means 1/n) [1.0E-6]
EMWordAligner.usePosteriorDecoding < bool> : Use posterior decoding (recommended for best performance). [true]
EMWordAligner.posteriorDecodingThreshold < dbl> : Threshold in [0,1] for deciding whether an alignment should exist. [0.5]
EMWordAligner.mergeConsiderNull < bool> : When merging expected sufficient statistics, take into account the NULL (fix). [false]
EMWordAligner.handleUnknownWords < bool> : Don't crash with unknown words (better to train on test set). [false]
EMWordAligner.priorFraction < dbl> : Fraction of a count to add for links in dictionary prior (1 works well). [0.0]
EMWordAligner.numThreads < int> : Number of concurrent threads to use during E-step (set to number of processors). [1]
EMWordAligner.safeConcurrency < bool> : Safe concurrency (gets rid of concurrency warnings at the expense of speed) [false]
EMWordAligner.evaluateDuringTraining < bool> : Whether to evaluate the model after each training iteration (slower, more memory). [false]
TreeWalkModel.usePushProbabilities < bool> : Separate parameters for moving and pushing. [true]
TreeWalkModel.conditionOnTag < bool> : Whether to condition distortion on the tag types. [true]
TreeWalkModel.cacheTreePaths < bool> : Whether to cache paths through trees (uses lots of memory; faster). [false]
Evaluator.searchForThreshold < bool> : Evaluate using line search [false]
Evaluator.thresholdIntervals < int> : Sets the number of intervals for posterior threshold line search [20]
Evaluator.saveAlignmentObjects < bool> : Save object files for proposed alignments (large files) [false]
Main.trainSources < str*> : Directories or files containing training files. [example/train]
Main.testSources < str*> : Directory or file containing testing files. [example/test]
Main.sentences < int> : Maximum number of the training sentences to use [2147483647]
Main.offsetTrainingSentences < int> : Skip this number of the first training sentences [0]
Main.maxTestSentences < int> : Maximum number of the test sentences to use [2147483647]
Main.offsetTestSentences < int> : Skip this number of the first test sentences [0]
Main.foreignSuffix < str> : Foreign language file suffix [f]
Main.englishSuffix < str> : English language file suffix [e]
Main.itgTrainTestSplitPoint < int> : When writing test (ITG) posteriors, where to divide train/test data? [0]
Main.itgInputDir < str> : What directory should we dump ITG test data to? []
Main.reverseAlignments < bool> : Reverse test set alignments (i.e., foreign to english) [false]
Main.oneIndexed < bool> : Are alignments one-indexed (default == no, 0-indexed) [false]
Main.lowercaseWords < bool> : Convert all words to lowercase [false]
Main.leaveTrainingOnDisk < bool> : Don't load and store the training set upfront (slower, but less memory) [false]
Main.saveRejects < bool> : Save rejected sentence pairs [false]
Main.forwardModels <enum*> : Which word alignment model to use in the forward direction. [MODEL1 HMM]
Main.reverseModels <enum*> : Which word alignment model to use in the backward direction. [MODEL1 HMM]
Main.iters < int*> : Number of iterations to run the model. [5 5]
Main.mode <enum*> : Whether to train the two models jointly or independently. [JOINT JOINT]
Main.trainingCacheMaxSize < int> : Max sentence length for caching the HMM trellis (efficiency only). [100]
Main.loadParamsDir < str> : Directory to load parameters from. []
Main.loadLexicalModelOnly < bool> : When true, the lexical model is loaded, but the distortion model is not. [true]
Main.saveParams < bool> : Whether to save parameters. [true]
Main.saveAlignOutput < bool> : Whether to save test alignments produced by the system. [true]
Main.alignTraining < bool> : Produce two GIZA files and a Pharaoh file for translation [false]
Main.writePosteriors < bool> : Produce posterior alignment weight file when aligning training (lots of disk space) [false]
Main.writePosteriorsThreshold < dbl> : In outputting posteriors, where do we threshold them (0.0 == all posteriors) [0.0]
Main.saveLexicalWeights < bool> : Produce two lexical translation tables for lexical weighting (unsupported) [false]
Main.competitiveThresholding < bool> : Use competitive thresholding to eliminate distributed many-to-one alignments [false]
Main.evaluateDirectionalModels < bool> : Evaluate directional models alone [false]
Main.evaluateHardCombination < bool> : Evaluate hard alignment combinations [false]
Main.evaluateSoftCombination < bool> : Evaluate soft alignment combinations [false]
Main.dictionary < str> : Bilingual dictionary file (e.g., en-ch.dict) [example/en-ch.dict]
Main.splitDefinitions < bool> : Breaks up multi-word definitions and enters each word into the dictionary map [false]
Main.rantOutput < bool> : Output a lot of junk (largely unsupported) [false]
exec.create < bool> : Whether to create a directory for this run; if not, don't generate output files [false]
exec.monitor < bool> : Whether to create a thread to monitor the status. [false]
exec.execDir < str> : Directory to put all output files; if blank, use execPoolDir. []
exec.execPoolDir < str> : Directory which contains all the executions (or symlinks). []
exec.actualExecPoolDir < str> : Directory which actually holds the executions. []
exec.overwriteExecDir < bool> : Overwrite the contents of the execDir if it doesn't exist (e.g., when running a thunk). [false]
exec.useStandardExecPoolDirStrategy < bool> : Assume in the run directory, automatically set execPoolDir and actualExecPoolDir [false]
exec.printOptionsAndExit < bool> : Simply print options and exit. [false]
exec.miscOptions < str*> : Miscellaneous options (written to options.map and output.map, displayed in servlet); example: a=3 b=4 []
exec.addToView < str*> : Name of the view to add this execution to in the servlet []
exec.recordPath < str> : Record file to write to []
exec.charEncoding < str> : Character encoding []
exec.jarFiles < str*> : Name of jar files to load prior to execution []
exec.dontInitializeJars < bool> : Skip initialization of jars [false]
exec.initializeJarsAfterDirCreation < bool> : Initialize from jars after copying them to a newly created execDir [false]
exec.makeThunk < bool> : Make a thunk (a delayed computation). [false]
exec.thunkAutoQueue < bool> : A note to the servlet to automatically run the thunk when it sees it [false]
exec.thunkPriority < int> : Priority of the thunk. [0]
exec.thunkMainClassName < str> : Launch this class []
exec.thunkJavaOpts < str> : Java options to pass to Java when later running the thunk []
exec.thunkUseScala < bool> : Use Scala to run rather than Java [false]
exec.thunkReqMemory < int> : Use Scala to run rather than Java (in MB) [1024]
exec.dontCatchExceptions < bool> : Whether to catch exceptions (ignored when making a thunk) [false]
{code}

> word-align.conf alignment template file not compatible with berkeley aligner
> ----------------------------------------------------------------------------
>
> Key: JOSHUA-304
> URL: https://issues.apache.org/jira/browse/JOSHUA-304
> Project: Joshua
> Issue Type: Bug
> Components: alignment, berkeley, templates
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Priority: Blocker
> Fix For: 6.1
>
>
> It took me quite some time to debug what was going on and why pipelines
> were failing when using the Berkeley aligner.
> It turns out that the word-align.conf template provided at
> https://github.com/apache/incubator-joshua/blob/master/scripts/training/templates/alignment/word-align.conf
> is not compatible with the Berkeley aligner.
> In particular, the following lines are incompatible:
> https://github.com/apache/incubator-joshua/blob/master/scripts/training/templates/alignment/word-align.conf#L12-L15
> Evidence of this is provided below:
> {code}
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
> Invalid enum: 'MODEL1 HMM'; valid choices: MODEL1|MODEL2|HMM|SYNTACTIC|NONE
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
> Invalid enum: 'MODEL1, HMM'; valid choices: MODEL1|MODEL2|HMM|SYNTACTIC|NONE
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
> Invalid enum: 'MODEL1 HMM'; valid choices: MODEL1|MODEL2|HMM|SYNTACTIC|NONE
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
> Invalid enum: 'JOINT JOINT'; valid choices: FORWARD|REVERSE|BOTH_INDEP|JOINT
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
> Exception in thread "main" java.lang.NumberFormatException: For input string: "5 5"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>     at java.lang.Integer.parseInt(Integer.java:580)
>     at java.lang.Integer.parseInt(Integer.java:615)
>     at edu.berkeley.nlp.fig.basic.OptInfo.interpretValue(OptionsParser.java:143)
>     at edu.berkeley.nlp.fig.basic.OptInfo.interpretValue(OptionsParser.java:240)
>     at edu.berkeley.nlp.fig.basic.OptInfo.set(OptionsParser.java:294)
>     at edu.berkeley.nlp.fig.basic.OptionsParser.readOptionsFile(OptionsParser.java:555)
>     at edu.berkeley.nlp.fig.basic.OptionsParser.doParse(OptionsParser.java:604)
>     at edu.berkeley.nlp.fig.exec.Execution.init(Execution.java:293)
>     at edu.berkeley.nlp.wordAlignment.Main.main(Main.java:149)
> lmcgibbn@LMC-032857 /usr/local/incubator-joshua/lib(master) $ java -d64 -Xmx10g -jar /usr/local/incubator-joshua/lib/berkeleyaligner.jar ++/usr/local/incubator-joshua/experiments/fisher_callhome_experiment/6/alignments/0/word-align.conf
> Cannot create directory: alignments/0
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)