[ https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573593#comment-15573593 ]
Lewis John McGibbney commented on JOSHUA-312:
---------------------------------------------

OK doke... I managed to reproduce this today. One of my pipelines just failed because I screwed up my paths... however, the failure came after alignment with the Berkeley aligner. When I went to re-run the pipeline as follows, the alignment was not pulled from the cache... it was completely re-run.

{code}
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ ls -al
total 8
drwxr-xr-x   7 lmcgibbn  wheel  238 Oct 13 16:48 .
drwxr-xr-x  22 lmcgibbn  wheel  748 Oct 13 12:09 ..
drwxr-xr-x  29 lmcgibbn  wheel  986 Oct 13 16:48 .cachepipe
-rw-r--r--   1 lmcgibbn  wheel   47 Oct 13 12:24 README
drwxr-xr-x   5 lmcgibbn  wheel  170 Oct 13 16:48 alignments
drwxr-xr-x  12 lmcgibbn  wheel  408 Oct 13 12:23 data
drwxr-xr-x   6 lmcgibbn  wheel  204 Oct 13 12:24 scripts
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ /usr/local/incubator-joshua/bin/pipeline.pl --rundir . --type hiero --corpus /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en --tune /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.tune --test /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.test --source en --target ru --readme "Experiment 1 Run 1 of ru --> en model training" --aligner berkeley
[train-copy-and-filter] cached, skipping...
[train-tokenize-en] cached, skipping...
[train-tokenize-ru] cached, skipping...
[train-trim] cached, skipping...
[train-lowercase-en] cached, skipping...
[train-lowercase-ru] cached, skipping...
[train-vocab-en] cached, skipping...
[train-vocab-ru] cached, skipping...
[tune-copy-and-filter] cached, skipping...
[tune-tokenize-en] cached, skipping...
[tune-tokenize-ru] cached, skipping...
[tune-lowercase-en] cached, skipping...
[tune-lowercase-ru] cached, skipping...
[tune-vocab-en] cached, skipping...
[tune-vocab-ru] cached, skipping...
[test-copy-and-filter] cached, skipping...
[test-tokenize-en] cached, skipping...
[test-tokenize-ru] cached, skipping...
[test-lowercase-en] cached, skipping...
[test-lowercase-ru] cached, skipping...
[test-vocab-en] cached, skipping...
[test-vocab-ru] cached, skipping...
[source-numlines] cached, skipping...
[source-numlines] retrieved cached result => 817962
[berkeley-aligner-chunk-0] rebuilding...
  dep=alignments/0/word-align.conf
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.en.0 [NOT FOUND]
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.ru.0 [NOT FOUND]
  dep=alignments/0/training.align [NOT FOUND]
  cmd=java -d64 -Xmx10g -jar /usr/local/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf
{code}

The aligner log looks as follows:

{code}
lmcgibbn@LMC-056430 /usr/local $ tail -f joshua_resources/russian_experiments/alignments/0/log
main() {
  Execution directory: alignments/0
  Preparing Training Data {
    ERROR: No files found at source /dev/null
  } [23s, cum. 23s]
  817962 training sentences, 0 test sentences
  Training models: 2 stages {
    Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
      Initializing forward model [1m16s, cum. 1m16s]
      Initializing reverse model [1m36s, cum. 2m53s]
      Joint Train: 817962 sentences, jointly {
        Iteration 1/5 {
          Sentence 1/817962
          Sentence 2/817962
          Sentence 3/817962
          Sentence 11/817962
          Sentence 40/817962
          Sentence 146/817962
          ...
{code}

It would therefore appear to me that YES, the pipeline is cached; however, on re-runs the cache is not consulted for the alignment step, and alignment is therefore repeated.

> Even though alignment is cached, it is always re-done in pipeline re-execution
> ------------------------------------------------------------------------------
>
>                 Key: JOSHUA-312
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-312
>             Project: Joshua
>          Issue Type: Improvement
>          Components: alignment
>    Affects Versions: 6.0.5
>            Reporter: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 6.2
>
>
> Say if a pipeline fails after alignment. The alignment result is never cached
> and it becomes necessary to undertake alignment... again!
> We should investigate the process for caching alignments as it would really
> speed up rerunning end-to-end pipelines for large input datasets.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
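
As a rough illustration of the caching behaviour discussed in this comment, here is a minimal sketch of a dependency-signature check. It is not the actual caching code used by pipeline.pl: the function names (file_signature, step_signature, needs_rebuild) and the SHA-1 scheme are assumptions made for the example. The idea it shows is that a step is skipped only when every dependency exists and the combined signature matches the stored one, so a dependency reported as [NOT FOUND] would force a rebuild.

```python
import hashlib
import os


def file_signature(path):
    """Return a SHA-1 hex digest of the file's contents, or None if it is missing."""
    if not os.path.isfile(path):
        return None
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in blocks so large corpus files do not have to fit in memory.
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def step_signature(cmd, deps):
    """Combine the command string with the signature of every dependency.

    Returns (signature, missing). The signature is None when any
    dependency is missing, and missing lists the absent paths.
    """
    missing = [dep for dep in deps if file_signature(dep) is None]
    if missing:
        return None, missing
    digest = hashlib.sha1(cmd.encode("utf-8"))
    for dep in deps:
        digest.update(file_signature(dep).encode("ascii"))
    return digest.hexdigest(), []


def needs_rebuild(cmd, deps, stored_signature):
    """A step is rebuilt if a dependency is missing or the signature changed."""
    signature, _missing = step_signature(cmd, deps)
    return signature is None or signature != stored_signature
```

Under a scheme like this, the three [NOT FOUND] dependencies in the log above would be enough to trigger the rebuild, regardless of what is stored under .cachepipe.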