Re: [jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring

2016-05-25 Thread Matt Post
Having written that, factoring the pipeline into separate pieces would be a
good first step toward replacing its guts. It's worth noting that several of
the proposed pieces already exist as standalone scripts:

- alignment is handled by $JOSHUA/scripts/training/paralign.pl
- tuning is handled by $JOSHUA/scripts/training/run_tuner.py
- there is a script for running Thrax ($JOSHUA/scripts/training/run_thrax.py), 
but it is not pulled into the decoder yet

However, Lewis' basic point stands: the pipeline is a mess, and it would be
good to have clean interfaces to each of the subtasks, as an intermediate step
toward replacing the logic of the pipeline with a more versatile (and
readable) tool like ducttape.

matt
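
As a rough illustration of what a uniform interface to the subtasks might look
like, here is a minimal shell sketch. Everything in it is an assumption made
for illustration: run_stage, run_pipeline, the stage_* functions, and the
STATE_DIR marker directory do not exist in Joshua, and the stage bodies are
placeholders rather than real invocations of paralign.pl, run_thrax.py, or
run_tuner.py (their actual arguments are not shown in this thread).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: none of these names exist in Joshua.

# Directory holding one ".done" marker per completed stage (assumed layout).
STATE_DIR="${STATE_DIR:-./pipeline-state}"

# run_stage NAME CMD [ARGS...]
# Runs CMD unless a marker for NAME exists; writes the marker on success.
run_stage() {
  local name="$1"
  shift
  mkdir -p "$STATE_DIR"
  if [ -e "$STATE_DIR/$name.done" ]; then
    echo "SKIP: $name"
    return 0
  fi
  echo "STEP: $name"
  if "$@"; then
    touch "$STATE_DIR/$name.done"
  else
    echo "FAILED: $name" >&2
    return 1
  fi
}

# Placeholder stage bodies. A real version would call the existing scripts
# with their actual arguments, which are not shown in this thread.
stage_alignment() { echo "would run paralign.pl"; }
stage_thrax()     { echo "would run run_thrax.py"; }
stage_tuning()    { echo "would run run_tuner.py"; }

# Runs the stages in order and stops at the first failure.
run_pipeline() {
  run_stage alignment stage_alignment &&
  run_stage thrax stage_thrax &&
  run_stage tuning stage_tuning
}
```

Calling run_pipeline a second time with the same STATE_DIR skips every stage
that already left a marker, so a failed run can be resumed without repeating
finished work.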


> On May 24, 2016, at 7:27 PM, Matt Post (JIRA) wrote:
> 
> 
> [ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299141#comment-15299141 ]
> 
> Matt Post commented on JOSHUA-270:
> --
> 
> The pipeline is a huge mess, probably not worth salvaging. I'm hoping (maybe 
> this year?) to rewrite it, perhaps using this: 
> https://github.com/jhclark/ducttape/
> 
>> pipeline.pl needs major refactoring
>> ---
>> 
>>              Key: JOSHUA-270
>>              URL: https://issues.apache.org/jira/browse/JOSHUA-270
>>          Project: Joshua
>>       Issue Type: Bug
>>       Components: pipeline
>> Affects Versions: 6.0.5
>>         Reporter: Lewis John McGibbney
>>          Fix For: 6.1
>> 
>> 
>> Right now [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl]
>> is well over 2000 lines long and extremely difficult to navigate.
>> I propose the following:
>> * All ENV is refactored into a pipeline_environment file
>> * All command-line parsing and definitions are refactored into a
>>   pipeline_cli file
>> * Sanity checking is refactored into a pipeline_sanity_check file
>> * Dependent variable checking is refactored into a
>>   pipeline_dependent_variable_setting file
>> * Filtering and preprocessing of corpora is refactored into
>>   pipeline_filter_preprocess_corpora
>> * pipeline_subsampling becomes a file
>> * pipeline_alignment becomes a file
>> * pipeline_parsing becomes a file
>> * pipeline_thrax becomes a file
>> * pipeline_tuning becomes a file
>> * pipeline_testing becomes a file
>> * pipeline_subroutines becomes a file
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)



[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring

2016-05-24 Thread Thamme Gowda (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298820#comment-15298820 ]

Thamme Gowda commented on JOSHUA-270:
-

Hi [~lewismc], I made a script to set up the environment for the pipeline.pl
script without touching it. It may be helpful for testing and refactoring.

{code}
#!/usr/bin/env bash
# Fetches and builds the external dependencies that pipeline.pl expects.
# Abort early if JOSHUA is unset, since every path below depends on it.
: "${JOSHUA:?JOSHUA must point to the Joshua checkout}"

echo "STEP: Going to get berkeleyaligner jar"
wget https://github.com/apache/incubator-joshua/raw/e70677d2eab23daa7082173e6fe337d68aa12230/lib/berkeleyaligner.jar \
    -O "$JOSHUA/lib/berkeleyaligner.jar"

echo "STEP: Going to build GIZA"
cd "$JOSHUA/ext/giza-pp/"
make all
make install

echo "STEP: Going to build symal"
cd "$JOSHUA/ext/symal/"
make

cd "$JOSHUA"
echo "STEP: Going to get Hadoop distribution"
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz \
    -O "$JOSHUA/lib/hadoop-2.5.2.tar.gz"

cd "$JOSHUA"
echo "STEP: Getting thrax"
wget -O /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba.zip \
    https://github.com/joshua-decoder/thrax/archive/e6195e4a1f60edc58448e8922991fe6938c6daba.zip
# The archive unpacks to thrax-<commit> in the current directory. The original
# ran "mkdir -p thrax" first, which made mv nest the sources inside
# $JOSHUA/thrax; the directory is no longer pre-created.
unzip /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba.zip
mv thrax-e6195e4a1f60edc58448e8922991fe6938c6daba "$JOSHUA/thrax"
echo "STEP: Building Thrax"
cd "$JOSHUA/thrax"
ant

cd "$JOSHUA"
{code}
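
A quick way to confirm that a setup script like the one above left the
expected artifacts behind is sketched below. The three paths are taken from
the script; the function name check_setup and its ROOT argument (standing in
for $JOSHUA) are made up for this example, and the GIZA and symal build
outputs are not checked because their locations are not given in this thread.

```shell
#!/usr/bin/env bash
# check_setup ROOT
# Reports which artifacts fetched by the setup script are present under ROOT.
# Hypothetical helper for illustration; not part of Joshua.
check_setup() {
  local root="$1"
  local missing=0
  local path
  for path in \
    "lib/berkeleyaligner.jar" \
    "lib/hadoop-2.5.2.tar.gz" \
    "thrax"
  do
    if [ -e "$root/$path" ]; then
      echo "OK:      $path"
    else
      echo "MISSING: $path"
      missing=1
    fi
  done
  # Non-zero exit status if anything was missing.
  return "$missing"
}
```

It could be run as check_setup "$JOSHUA" after the script finishes; a non-zero
exit status means at least one artifact is absent.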

> pipeline.pl needs major refactoring
> ---
>
>              Key: JOSHUA-270
>              URL: https://issues.apache.org/jira/browse/JOSHUA-270
>          Project: Joshua
>       Issue Type: Bug
>       Components: pipeline
> Affects Versions: 6.0.5
>         Reporter: Lewis John McGibbney
>          Fix For: 6.1
>
>
> Right now [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl]
> is well over 2000 lines long and extremely difficult to navigate.
> I propose the following:
>  * All ENV is refactored into a pipeline_environment file
>  * All command-line parsing and definitions are refactored into a
>    pipeline_cli file
>  * Sanity checking is refactored into a pipeline_sanity_check file
>  * Dependent variable checking is refactored into a
>    pipeline_dependent_variable_setting file
>  * Filtering and preprocessing of corpora is refactored into
>    pipeline_filter_preprocess_corpora
>  * pipeline_subsampling becomes a file
>  * pipeline_alignment becomes a file
>  * pipeline_parsing becomes a file
>  * pipeline_thrax becomes a file
>  * pipeline_tuning becomes a file
>  * pipeline_testing becomes a file
>  * pipeline_subroutines becomes a file


