Re: [jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring
Having written that, factoring the pipeline would be a good first step to replacing the guts of the pipeline. It's worth noting that many of these are already done:

- alignment is handled by $JOSHUA/scripts/training/paralign.pl
- tuning is handled by $JOSHUA/scripts/training/run_tuner.py
- there is a script for running Thrax ($JOSHUA/scripts/training/run_thrax.py), but it is not pulled into the decoder yet

However, Lewis' basic point stands: the pipeline is a mess, and it would be good to have clean interfaces to each of the subtasks, as an intermediate step to replacing the logic of the pipeline with a more versatile (and readable) tool like ducttape.

matt

> On May 24, 2016, at 7:27 PM, Matt Post (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299141#comment-15299141 ]
>
> Matt Post commented on JOSHUA-270:
> ----------------------------------
>
> The pipeline is a huge mess, probably not worth salvaging. I'm hoping (maybe
> this year?) to rewrite it, perhaps using this:
> https://github.com/jhclark/ducttape/
>
>> pipeline.pl needs major refactoring
>> -----------------------------------
>>
>> Key: JOSHUA-270
>> URL: https://issues.apache.org/jira/browse/JOSHUA-270
>> Project: Joshua
>> Issue Type: Bug
>> Components: pipeline
>> Affects Versions: 6.0.5
>> Reporter: Lewis John McGibbney
>> Fix For: 6.1
>>
>>
>> Right now
>> [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl]
>> is well over 2000 lines long and extremely difficult to navigate.
>> I propose the following
>> * All ENV is refactored into a pipeline_environment file
>> * All command line parsing and definitions are refactored into a pipeline_cli file
>> * Sanity checking is refactored into a pipeline_sanity_check file
>> * Dependent variable checking is refactored into a pipeline_dependent_variable_setting file
>> * Filtering and preprocessing of corpora is refactored into pipeline_filter_preprocess_corpora
>> * pipeline_subsampling becomes a file
>> * pipeline_alignment becomes a file
>> * pipeline_parsing becomes a file
>> * pipeline_thrax becomes a file
>> * pipeline_tuning becomes a file
>> * pipeline_testing becomes a file
>> * pipeline_subreoutines becomes a file
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
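As a rough illustration of the "good interfaces to each of the subtasks" idea, here is a minimal sketch of a thin driver over per-stage units. Everything in it is hypothetical: the stage names are borrowed from the file list proposed in the issue, the ordering is assumed from that list, and `select_stages`, `run_stage` and `run_pipeline` are illustrative names that do not exist in Joshua. The stage runner is a stub that only prints; a real split would have each stage wrap an existing script such as paralign.pl or run_tuner.py.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a thin driver over per-stage units.
# Stage names come from the file list proposed in JOSHUA-270; the order is
# assumed from that list. None of these functions exist in Joshua.

STAGES="subsampling alignment parsing thrax tuning testing"

# Print the stages from $1 through $2, inclusive, one per line.
select_stages() {
  local first="$1" last="$2" active=0 stage
  for stage in $STAGES; do
    if [ "$stage" = "$first" ]; then active=1; fi
    if [ "$active" -eq 1 ]; then echo "$stage"; fi
    if [ "$stage" = "$last" ]; then break; fi
  done
}

# Stub stage runner: only announces the stage. A real split would dispatch
# to one file per stage (pipeline_alignment, pipeline_tuning, ...).
run_stage() {
  echo "STEP: $1"
}

# Run the selected range of stages, stopping at the first failure.
run_pipeline() {
  local stage
  for stage in $(select_stages "$1" "$2"); do
    run_stage "$stage" || { echo "stage $stage failed" >&2; return 1; }
  done
}
```

With these stubs, `run_pipeline alignment thrax` announces the alignment, parsing and thrax stages in that order. Keeping stage selection separate from stage execution is one way to let each stage be tested, or later swapped for a ducttape task, without touching the driver.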
[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring
[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300224#comment-15300224 ]

Lewis John McGibbney commented on JOSHUA-270:
---------------------------------------------

Yeah I agree. It's pretty cumbersome and way too much code.
[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring
[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299141#comment-15299141 ]

Matt Post commented on JOSHUA-270:
----------------------------------

The pipeline is a huge mess, probably not worth salvaging. I'm hoping (maybe this year?) to rewrite it, perhaps using this: https://github.com/jhclark/ducttape/
[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring
[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298856#comment-15298856 ]

Thamme Gowda commented on JOSHUA-270:
-------------------------------------

Yes, I tried to make it work with the Maven build. I saw that pipeline.pl requires many external libraries, so I wrote the script in my previous comment to fetch them and put them in place. I followed http://joshua.incubator.apache.org/6.0/quick-start.html, but the pipeline failed after many steps. I couldn't completely fix it because of my limited Perl knowledge.
[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring
[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298836#comment-15298836 ]

Lewis John McGibbney commented on JOSHUA-270:
---------------------------------------------

Thanks. Have you looked at pipeline.pl? It does not build any software; it runs a Joshua processing pipeline.
[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring
[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298820#comment-15298820 ]

Thamme Gowda commented on JOSHUA-270:
-------------------------------------

Hi [~lewismc],
I made a script to set up the environment for the pipeline.pl script without touching it. It may be helpful for testing and refactoring.

{code}
#!/usr/bin/env bash
# Fetch and build the external tools that pipeline.pl expects under $JOSHUA.
set -euo pipefail

# Abort early if JOSHUA is unset, since every path below depends on it.
: "${JOSHUA:?JOSHUA must point at the Joshua checkout}"

echo "STEP: Going to get berkeleyaligner jar"
wget https://github.com/apache/incubator-joshua/raw/e70677d2eab23daa7082173e6fe337d68aa12230/lib/berkeleyaligner.jar \
    -O "$JOSHUA/lib/berkeleyaligner.jar"

echo "STEP: Going to build GIZA"
cd "$JOSHUA/ext/giza-pp/"
make all
make install

echo "STEP: Going to build symal"
cd "$JOSHUA/ext/symal/"
make

echo "STEP: Going to get Hadoop distribution"
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz \
    -O "$JOSHUA/lib/hadoop-2.5.2.tar.gz"

echo "STEP: Getting thrax"
cd "$JOSHUA"
wget -O /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba.zip https://github.com/joshua-decoder/thrax/archive/e6195e4a1f60edc58448e8922991fe6938c6daba.zip
unzip /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba.zip
# Do not pre-create $JOSHUA/thrax: mv would then nest the unpacked tree inside
# it and ant would find no build file. Drop an empty leftover directory if any.
rmdir "$JOSHUA/thrax" 2>/dev/null || true
mv thrax-e6195e4a1f60edc58448e8922991fe6938c6daba "$JOSHUA/thrax"

echo "STEP: Building Thrax"
cd "$JOSHUA/thrax"
ant
cd "$JOSHUA"
{code}
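Since the setup script above depends on `$JOSHUA` being set and on `wget`, `unzip`, `make` and `ant` being installed, a small preflight check could report missing prerequisites before any download starts. The following is only a sketch under that assumption; `missing_tools` and `preflight` are hypothetical helper names, not part of Joshua.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; not part of Joshua.

# Print each of the given commands that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
  return 0
}

# Check the prerequisites the setup script relies on.
# Returns 0 when everything is in place, 1 otherwise.
preflight() {
  if [ -z "${JOSHUA:-}" ]; then
    echo "JOSHUA is not set" >&2
    return 1
  fi
  local missing
  missing=$(missing_tools wget unzip make ant)
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "missing tools:" $missing >&2
    return 1
  fi
  echo "environment looks ready"
}
```

Calling `preflight || exit 1` at the top of the setup script would make it stop with a clear message instead of failing partway through a build.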