[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring

2016-05-24 Thread Thamme Gowda (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298820#comment-15298820 ]

Thamme Gowda commented on JOSHUA-270:
-

Hi [~lewismc], I made a script that sets up the environment for the pipeline.pl script without touching pipeline.pl itself.
It may be helpful for testing and refactoring.

{code}
#!/usr/bin/env bash
# Sets up the external dependencies that pipeline.pl expects under $JOSHUA.
set -e  # stop at the first failing step
: "${JOSHUA:?JOSHUA must point to the Joshua checkout}"

echo "STEP: Going to get berkeleyaligner jar"
wget https://github.com/apache/incubator-joshua/raw/e70677d2eab23daa7082173e6fe337d68aa12230/lib/berkeleyaligner.jar \
    -O "$JOSHUA/lib/berkeleyaligner.jar"

echo "STEP: Going to build GIZA"
cd "$JOSHUA/ext/giza-pp/"
make all
make install

echo "STEP: Going to build symal"
cd "$JOSHUA/ext/symal/"
make

echo "STEP: Going to get Hadoop distribution"
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz \
    -O "$JOSHUA/lib/hadoop-2.5.2.tar.gz"

echo "STEP: Getting thrax"
wget -O /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba.zip \
    https://github.com/joshua-decoder/thrax/archive/e6195e4a1f60edc58448e8922991fe6938c6daba.zip
# Unpack in /tmp, then move the contents so that the build file sits directly
# in $JOSHUA/thrax rather than in a nested thrax-<commit> directory.
unzip -o /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba.zip -d /tmp
mkdir -p "$JOSHUA/thrax"
mv /tmp/thrax-e6195e4a1f60edc58448e8922991fe6938c6daba/* "$JOSHUA/thrax/"

echo "STEP: Building Thrax"
cd "$JOSHUA/thrax"
ant

cd "$JOSHUA"
{code}
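
The script above assumes that wget, unzip, make and ant are already installed. A minimal preflight sketch for checking that before any step runs could look like the following; the helper name missing_tools is hypothetical and is not part of Joshua or pipeline.pl.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the name of every given tool
# that cannot be found on the PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling `missing_tools wget unzip make ant` prints nothing when everything is installed, so a setup script could abort early whenever the output is non-empty.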

> pipeline.pl needs major refactoring
> ---
>
> Key: JOSHUA-270
> URL: https://issues.apache.org/jira/browse/JOSHUA-270
> Project: Joshua
> Issue Type: Bug
> Components: pipeline
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Fix For: 6.1
>
>
> Right now [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl] is well over 2000 lines long and extremely difficult to navigate.
> I propose the following:
>  * All ENV handling is refactored into a pipeline_environment file
>  * All command-line parsing and definitions are refactored into a pipeline_cli file
>  * Sanity checking is refactored into a pipeline_sanity_check file
>  * Dependent variable checking is refactored into a pipeline_dependent_variable_setting file
>  * Filtering and preprocessing of corpora is refactored into a pipeline_filter_preprocess_corpora file
>  * pipeline_subsampling becomes a file
>  * pipeline_alignment becomes a file
>  * pipeline_parsing becomes a file
>  * pipeline_thrax becomes a file
>  * pipeline_tuning becomes a file
>  * pipeline_testing becomes a file
>  * pipeline_subroutines becomes a file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (JOSHUA-271) Thrax invocation should not rely upon $HADOOP being set

2016-05-24 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created JOSHUA-271:
---

 Summary: Thrax invocation should not rely upon $HADOOP being set
 Key: JOSHUA-271
 URL: https://issues.apache.org/jira/browse/JOSHUA-271
 Project: Joshua
  Issue Type: Bug
  Components: pipeline, thrax
Affects Versions: 6.0.5
Reporter: Lewis John McGibbney
 Fix For: 6.1


Right now one cannot run Thrax unless the $HADOOP environment variable is defined. 
Every invocation of the hadoop script hard-codes the path as 
$HADOOP/bin/hadoop. But what happens if you are using a VM (Vagrant) to 
connect to a cluster for which no $HADOOP environment variable is defined? 
The hadoop script should be on the path and usable from there. The only check 
that needs to be made is whether hadoop is available on the path; if it is 
not, the start_hadoop_cluster subroutine can be called. This reduces code and 
makes more sense.
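
The PATH-based check proposed above could be sketched in shell as follows. This is only an illustration and not the pipeline.pl implementation: the function name resolve_hadoop is hypothetical, and the fallback to $HADOOP/bin/hadoop is an assumption added so that existing setups keep working.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the hadoop executable to use.
# Returns 1 (printing nothing) when no hadoop can be found.
resolve_hadoop() {
    local found
    # Prefer whatever hadoop is already on the PATH.
    if found=$(command -v hadoop 2>/dev/null); then
        printf '%s\n' "$found"
        return 0
    fi
    # Assumed fallback: honor $HADOOP when it is set and usable.
    if [ -n "${HADOOP:-}" ] && [ -x "$HADOOP/bin/hadoop" ]; then
        printf '%s\n' "$HADOOP/bin/hadoop"
        return 0
    fi
    return 1
}
```

A caller would run `resolve_hadoop` and, only on a non-zero exit status, fall through to starting its own cluster (the start_hadoop_cluster step mentioned above).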





[jira] [Created] (JOSHUA-270) pipeline.pl needs major refactoring

2016-05-24 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created JOSHUA-270:
---

 Summary: pipeline.pl needs major refactoring
 Key: JOSHUA-270
 URL: https://issues.apache.org/jira/browse/JOSHUA-270
 Project: Joshua
  Issue Type: Bug
  Components: pipeline
Affects Versions: 6.0.5
Reporter: Lewis John McGibbney
 Fix For: 6.1


Right now [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl] is well over 2000 lines long and extremely difficult to navigate.
I propose the following:
 * All ENV handling is refactored into a pipeline_environment file
 * All command-line parsing and definitions are refactored into a pipeline_cli file
 * Sanity checking is refactored into a pipeline_sanity_check file
 * Dependent variable checking is refactored into a pipeline_dependent_variable_setting file
 * Filtering and preprocessing of corpora is refactored into a pipeline_filter_preprocess_corpora file
 * pipeline_subsampling becomes a file
 * pipeline_alignment becomes a file
 * pipeline_parsing becomes a file
 * pipeline_thrax becomes a file
 * pipeline_tuning becomes a file
 * pipeline_testing becomes a file
 * pipeline_subroutines becomes a file


