Community Review of New Language Pack

2016-10-28 Thread lewis john mcgibbney
Hi Folks,
I managed to generate my first language pack today based on a Hiero model.
It's 4.8GB in size so I have made it available via my home.apache.org
public space at [0]. Right now it is uploading and will take a wee while.
I would like some community review so we can assess the quality of what has
been generated. In addition, there are a number of immediate things I am
struggling with.

Firstly, the following were not present after running run_bundler.py:

   - prepare.sh, which is a baseline requirement for running the tests as
   detailed within the auto-generated README.
   - the entire 'scripts' directory! This means that no utility
   processing can be undertaken at all.

I know that both of the above are essential requirements, so I added
them from a different language pack, increased the default maximum memory
usage, and augmented the README with some details regarding the dataset used
to generate the language pack.

In comparison to the es --> en language pack posted by Matt, due to the fact
that no scripts directory was generated, this language pack does not have
the scripts/release directory either. I am not sure how that directory was
generated.

Over and above what I've detailed so far, there is one blocking issue for
me: when I submit Russian text to the Joshua server, it just spits the same
Russian text back out! I can see the decoder logging to stdout; however,
I can only assume that no decoding is actually taking place.
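
For reviewers who want to check this symptom on a batch of sentences, here is
a minimal sketch of a pass-through detector. It is not part of Joshua; the
function names and the 0.5 threshold are hypothetical choices made for
illustration. One possible explanation (an assumption, not confirmed in this
thread) is that every token is being treated as out-of-vocabulary and copied
to the output unchanged, which would produce exactly this kind of result.

```python
import unicodedata


def cyrillic_ratio(text):
    """Return the fraction of alphabetic characters in `text` that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def looks_untranslated(source, output, threshold=0.5):
    """Heuristic check for ru -> en output that was not really decoded.

    Flags the output if it is identical to the source, or if at least
    `threshold` of its letters are still Cyrillic. The threshold is an
    arbitrary illustrative value, not a tuned one.
    """
    if source.strip() == output.strip():
        return True
    return cyrillic_ratio(output) >= threshold
```

Running this over the source sentences and the corresponding server responses
would show whether the pass-through affects every sentence or only some of
them, which helps narrow down where to look.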

Can you guys please review the language pack and provide feedback on the
configuration, some of the scores which have been generated, and even the
BLEU score? I have everything local and backed up, so I can provide all of
it as well as the exact commands I invoked to generate the entire thing
from start to finish.
Cheers troops.

[0] http://home.apache.org/~lewismc/language-pack-ru-en-2016-10-28.tar.gz

-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


language packs

2016-10-28 Thread Matt Post
Just a quick note: I'm offline today and have been at a conference the past 
two days, but will catch up soon. 

Lewis, I've seen you looking at run_bundler.py for language packs. This is 
deprecated; I'm making a new set of scripts in scripts/language-pack. I am also 
writing a download script that lets you list all LPs and select from 
different options. Each LP will have a no-dependency version based on BerkeleyLM 
and then a Docker version that compiles KenLM, since that is faster and 
better. My colleague has built over 60 language packs, and then we just need to 
pick them up. I'm just letting you know so that you don't waste your time with 
run_bundler.py, since I see you editing that page. 

matt (from my phone)


[jira] [Commented] (JOSHUA-316) run_bundler.py returning JOB FAILED (return code 1) TypeError: memoryview: a bytes-like object is required, not 'str'

2016-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15614495#comment-15614495
 ] 

ASF GitHub Bot commented on JOSHUA-316:
---

Github user lewismc commented on the issue:

https://github.com/apache/incubator-joshua/pull/73
  
A further comment: I just closed off 
https://issues.apache.org/jira/browse/JOSHUA-319 because I am now able to 
run end-to-end pipelines without a hitch. The only barrier is ensuring that 
enough memory is allocated to the processes.
I ran with this PR on my local Joshua code and it ran like a charm.


> run_bundler.py returning JOB FAILED (return code 1) TypeError: memoryview: a 
> bytes-like object is required, not 'str'
> -
>
> Key: JOSHUA-316
> URL: https://issues.apache.org/jira/browse/JOSHUA-316
> Project: Joshua
> Issue Type: Bug
> Components: bundler
> Affects Versions: 6.0.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 6.1
>
>
> {code}
> [glue-tune] rebuilding...
>   
> dep=/usr/local/joshua_resources/russian_experiments/exp2/grammar.packed/slice_0.source
>  [CHANGED]
>   
> dep=/usr/local/joshua_resources/russian_experiments/exp2/data/tune/grammar.glue
>  [NOT FOUND]
>   cmd=/usr/local/incubator-joshua/scripts/support/create_glue_grammar.sh 
> /usr/local/joshua_resources/russian_experiments/exp2/grammar.packed > 
> /usr/local/joshua_resources/russian_experiments/exp2/data/tune/grammar.glue
>   took 1 seconds (1s)
> [tune-bundle] rebuilding...
>   
> dep=/usr/local/incubator-joshua/scripts/training/templates/tune/joshua.config 
> [CHANGED]
>   
> dep=/usr/local/joshua_resources/russian_experiments/exp2/grammar.packed/slice_0.source
>  [CHANGED]
>   
> dep=/usr/local/joshua_resources/russian_experiments/exp2/tune/model/run-joshua.sh
>  [NOT FOUND]
>   cmd=/usr/local/incubator-joshua/scripts/support/run_bundler.py --force 
> --symlink --absolute --verbose -T /usr/local/hadoop-2.5.2/hadoop_tmp_dir 
> /usr/local/incubator-joshua/scripts/training/templates/tune/joshua.config 
> /usr/local/joshua_resources/russian_experiments/exp2/tune/model 
> --copy-config-options '-top-n 300 -output-format "%i ||| %s ||| %f ||| %c" 
> -mark-oovs false -search cky -weights "lm_0 1 tm_pt_0 1 tm_pt_1 1 tm_pt_2 1 
> tm_pt_3 1 tm_pt_4 1 tm_pt_5 1 tm_glue_0 1 " -feature-function 
> "StateMinimizingLanguageModel -lm_order 5 -lm_file 
> /usr/local/joshua_resources/russian_experiments/exp2/lm.kenlm"  -tm0/type 
> hiero -tm0/owner pt -tm0/maxspan 20 -tm1/owner glue' --pack-tm 
> /usr/local/joshua_resources/russian_experiments/exp2/grammar.packed --tm 
> /usr/local/joshua_resources/russian_experiments/exp2/data/tune/grammar.glue
>   JOB FAILED (return code 1)
> * Running the copy-config.pl script with the command: 
> /usr/local/incubator-joshua/scripts/copy-config.pl -top-n 300 -output-format 
> "%i ||| %s ||| %f ||| %c" -mark-oovs false -search cky -weights "lm_0 1 
> tm_pt_0 1 tm_pt_1 1 tm_pt_2 1 tm_pt_3 1 tm_pt_4 1 tm_pt_5 1 tm_glue_0 1 " 
> -feature-function "StateMinimizingLanguageModel -lm_order 5 -lm_file 
> /usr/local/joshua_resources/russian_experiments/exp2/lm.kenlm"  -tm0/type 
> hiero -tm0/owner pt -tm0/maxspan 20 -tm1/owner glue
> Traceback (most recent call last):
>   File "/usr/local/incubator-joshua/scripts/support/run_bundler.py", line 
> 748, in main
> operations = collect_operations(opts)
>   File "/usr/local/incubator-joshua/scripts/support/run_bundler.py", line 
> 637, in collect_operations
> opts.copy_config_options
>   File "/usr/local/incubator-joshua/scripts/support/run_bundler.py", line 
> 202, in filter_through_copy_config_script
> result, err = p.communicate(config_text)
>   File "/Users/lmcgibbn/miniconda3/lib/python3.5/subprocess.py", line 1072, 
> in communicate
> stdout, stderr = self._communicate(input, endtime, timeout)
>   File "/Users/lmcgibbn/miniconda3/lib/python3.5/subprocess.py", line 1700, 
> in _communicate
> input_view = memoryview(self._input)
> TypeError: memoryview: a bytes-like object is required, not 'str'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/incubator-joshua/scripts/support/run_bundler.py", line 
> 760, in <module>
> main(sys.argv)
>   File "/usr/local/incubator-joshua/scripts/support/run_bundler.py", line 
> 751, in main
> error_quit(e.message)
> AttributeError: 'TypeError' object has no attribute 'message'
> * WARNING: no key 'outputformat' found in config file (appending to end)
> * WARNING: no key 'search' found in config file (appending to end)
> * WARNING: no key 'topn' found in config file (appending to end)
> * 
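
The traceback above shows two separate Python 3 incompatibilities: a str is
passed to communicate() on a pipe opened in binary mode, and the error handler
reads e.message, an attribute that exceptions do not have in Python 3. The
following is a minimal sketch of how both can be handled. It is not the actual
patch from the pull request; the function names are hypothetical and UTF-8 is
an assumed encoding.

```python
import subprocess
import sys


def filter_through(cmd, text):
    """Pipe `text` through the command `cmd` and return its stdout as str.

    `cmd` is an argument list as accepted by subprocess.Popen.
    """
    p = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Pipes opened without text mode expect bytes in Python 3, so the
    # input is encoded explicitly (UTF-8 is an assumption here).
    out, err = p.communicate(text.encode("utf-8"))
    if p.returncode != 0:
        raise RuntimeError(err.decode("utf-8", errors="replace"))
    return out.decode("utf-8")


def describe_error(exc):
    """Return a printable message for an exception.

    str(exc) is used instead of exc.message, which does not exist on
    Python 3 exceptions.
    """
    return str(exc)


if __name__ == "__main__":
    # Stand-in filter command: a child Python process that upper-cases stdin.
    demo_cmd = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    try:
        print(filter_through(demo_cmd, "top-n = 300"))
    except Exception as e:
        print(describe_error(e))
```

Encoding at the call site keeps the subprocess pipes in binary mode, so the
same code path treats the config text and the script output uniformly.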

[GitHub] incubator-joshua issue #73: JOSHUA-316 run_bundler.py returning JOB FAILED (...

2016-10-28 Thread lewismc
Github user lewismc commented on the issue:

https://github.com/apache/incubator-joshua/pull/73
  
A further comment: I just closed off 
https://issues.apache.org/jira/browse/JOSHUA-319 because I am now able to 
run end-to-end pipelines without a hitch. The only barrier is ensuring that 
enough memory is allocated to the processes.
I ran with this PR on my local Joshua code and it ran like a charm.

