[jira] [Commented] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution

2016-10-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586111#comment-15586111
 ] 

Lewis John McGibbney commented on JOSHUA-312:
-

boom goes the dynamite :)
Thanks [~post]

> Even though alignment is cached, it is always re-done in pipeline re-execution
> --
>
> Key: JOSHUA-312
> URL: https://issues.apache.org/jira/browse/JOSHUA-312
> Project: Joshua
> Issue Type: Improvement
> Components: alignment
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.1
>
>
> Say if a pipeline fails after alignment. The alignment result is never cached 
> and it becomes necessary to undertake alignment... again!
> We should investigate the process for caching alignments as it would really 
> speed up rerunning end-to-end pipelines for large input datasets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution

2016-10-18 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586063#comment-15586063
 ] 

Matt Post commented on JOSHUA-312:
--

This is fixed with commit 301f301cdcad5ab49c8465506791e5f117e1c944 (just 
pushed). The problem was that I changed the structure of the alignment splits, 
and did not update the paths for Berkeley aligner. Sorry about the trouble!

> Even though alignment is cached, it is always re-done in pipeline re-execution
> --
>
> Key: JOSHUA-312
> URL: https://issues.apache.org/jira/browse/JOSHUA-312
> Project: Joshua
> Issue Type: Improvement
> Components: alignment
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.2
>
>
> Say if a pipeline fails after alignment. The alignment result is never cached 
> and it becomes necessary to undertake alignment... again!
> We should investigate the process for caching alignments as it would really 
> speed up rerunning end-to-end pipelines for large input datasets.





[jira] [Commented] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution

2016-10-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573593#comment-15573593
 ] 

Lewis John McGibbney commented on JOSHUA-312:
-

OK doke... I managed to reproduce this today.
One of my pipelines just failed. The failure itself was down to me screwing up 
my paths; however, it happened after alignment with the Berkeley aligner.
When I re-ran the pipeline as follows, alignment was not pulled from the 
cache... it was completely re-run:
{code}
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ ls -al
total 8
drwxr-xr-x   7 lmcgibbn  wheel  238 Oct 13 16:48 .
drwxr-xr-x  22 lmcgibbn  wheel  748 Oct 13 12:09 ..
drwxr-xr-x  29 lmcgibbn  wheel  986 Oct 13 16:48 .cachepipe
-rw-r--r--   1 lmcgibbn  wheel   47 Oct 13 12:24 README
drwxr-xr-x   5 lmcgibbn  wheel  170 Oct 13 16:48 alignments
drwxr-xr-x  12 lmcgibbn  wheel  408 Oct 13 12:23 data
drwxr-xr-x   6 lmcgibbn  wheel  204 Oct 13 12:24 scripts
lmcgibbn@LMC-056430 /usr/local/joshua_resources/russian_experiments $ /usr/local/incubator-joshua/bin/pipeline.pl \
    --rundir . --type hiero \
    --corpus /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en \
    --tune /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.tune \
    --test /usr/local/joshua_resources/russian_experiments/data/commoncrawl.ru-en.test \
    --source en --target ru \
    --readme "Experiment 1 Run 1 of ru --> en model training" \
    --aligner berkeley
[train-copy-and-filter] cached, skipping...
[train-tokenize-en] cached, skipping...
[train-tokenize-ru] cached, skipping...
[train-trim] cached, skipping...
[train-lowercase-en] cached, skipping...
[train-lowercase-ru] cached, skipping...
[train-vocab-en] cached, skipping...
[train-vocab-ru] cached, skipping...
[tune-copy-and-filter] cached, skipping...
[tune-tokenize-en] cached, skipping...
[tune-tokenize-ru] cached, skipping...
[tune-lowercase-en] cached, skipping...
[tune-lowercase-ru] cached, skipping...
[tune-vocab-en] cached, skipping...
[tune-vocab-ru] cached, skipping...
[test-copy-and-filter] cached, skipping...
[test-tokenize-en] cached, skipping...
[test-tokenize-ru] cached, skipping...
[test-lowercase-en] cached, skipping...
[test-lowercase-ru] cached, skipping...
[test-vocab-en] cached, skipping...
[test-vocab-ru] cached, skipping...
[source-numlines] cached, skipping...
[source-numlines] retrieved cached result =>   817962
[berkeley-aligner-chunk-0] rebuilding...
  dep=alignments/0/word-align.conf
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.en.0 [NOT FOUND]
  dep=/usr/local/joshua_resources/russian_experiments/data/train/splits/corpus.ru.0 [NOT FOUND]
  dep=alignments/0/training.align [NOT FOUND]
  cmd=java -d64 -Xmx10g -jar /usr/local/incubator-joshua/ext/berkeleyaligner/distribution/berkeleyaligner.jar ++alignments/0/word-align.conf
{code}

The aligner log looks as follows:

{code}
lmcgibbn@LMC-056430 /usr/local $ tail -f joshua_resources/russian_experiments/alignments/0/log
main() {
  Execution directory: alignments/0
  Preparing Training Data {
    ERROR: No files found at source /dev/null
  } [23s, cum. 23s]
  817962 training sentences, 0 test sentences
  Training models: 2 stages {
    Training stage 1: MODEL1 and MODEL1 jointly for 5 iterations {
      Initializing forward model [1m16s, cum. 1m16s]
      Initializing reverse model [1m36s, cum. 2m53s]
      Joint Train: 817962 sentences, jointly {
        Iteration 1/5 {
          Sentence 1/817962
          Sentence 2/817962
          Sentence 3/817962
          Sentence 11/817962
          Sentence 40/817962
          Sentence 146/817962
...
{code}

It would therefore appear to me that YES, the earlier pipeline steps are 
cached; however, on re-runs the cached alignment is not reused, and alignment 
is therefore repeated.
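
The `[NOT FOUND]` markers in the pipeline output above hint at why the step was 
rebuilt: a cached step can only be skipped if every file it depends on is still 
present and unchanged. The Python sketch below illustrates that kind of check. 
It is an illustration under stated assumptions, not CachePipe's actual code: 
the function names (`signature`, `needs_rebuild`), the choice of SHA-1 over the 
command plus file contents, and the demo paths are all hypothetical.

```python
import hashlib
import os
import tempfile


def signature(cmd, deps):
    """Hash the command and the contents of every dependency.

    Returns None if any dependency is missing (hypothetical helper).
    """
    h = hashlib.sha1(cmd.encode("utf-8"))
    for path in deps:
        if not os.path.exists(path):
            return None  # a missing dependency can never match a stored signature
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()


def needs_rebuild(cmd, deps, stored_signature):
    """Rebuild when a dependency is missing or the signature has changed."""
    current = signature(cmd, deps)
    return current is None or current != stored_signature


# Demo in a throwaway directory; the paths are placeholders, not Joshua's layout.
with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "corpus.en.0")
    with open(dep, "w") as f:
        f.write("hello world\n")
    cmd = "align corpus.en.0"
    stored = signature(cmd, [dep])

    # Same command, same file: the cached result can be reused.
    unchanged = needs_rebuild(cmd, [dep], stored)

    # The step now looks for the file at a path where it does not exist.
    moved = needs_rebuild(cmd, [os.path.join(tmp, "splits", "corpus.en.0")], stored)
```

In this sketch, a step that looks for its inputs at a path where they do not 
exist is always treated as stale, however recently it ran.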

> Even though alignment is cached, it is always re-done in pipeline re-execution
> --
>
> Key: JOSHUA-312
> URL: https://issues.apache.org/jira/browse/JOSHUA-312
> Project: Joshua
> Issue Type: Improvement
> Components: alignment
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.2
>
>
> Say if a pipeline fails after alignment. The alignment result is never cached 
> and it becomes necessary to undertake alignment... again!
> We should investigate the process for caching alignments as it would really 
> speed up rerunning end-to-end pipelines for large input datasets.





[jira] [Commented] (JOSHUA-312) Even though alignment is cached, it is always re-done in pipeline re-execution

2016-09-28 Thread Matt Post (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15529750#comment-15529750
 ] 

Matt Post commented on JOSHUA-312:
--

I checked on my end, and I see alignments being cached just fine. Please post 
the output of the pipeline script.

> Even though alignment is cached, it is always re-done in pipeline re-execution
> --
>
> Key: JOSHUA-312
> URL: https://issues.apache.org/jira/browse/JOSHUA-312
> Project: Joshua
> Issue Type: Improvement
> Components: alignment
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.2
>
>
> Say if a pipeline fails after alignment. The alignment result is never cached 
> and it becomes necessary to undertake alignment... again!
> We should investigate the process for caching alignments as it would really 
> speed up rerunning end-to-end pipelines for large input datasets.


