Hi Martin,

I don't think that will break the pipeline, but for word alignment and grammar 
extraction, separating out the slash character is probably a good idea for 
reducing data sparsity.
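
To make the sparsity point concrete, here is a small Python sketch. The data is a toy example and the helper names (split_slashes, vocab, oov_tokens) are made up for illustration; none of this is code from Moses. It shows that an unseen combination such as 55/200 is an unknown token when left whole, but is fully covered once the slash is split out.

```python
import re

def split_slashes(text):
    """Put spaces around every slash so it becomes its own token (illustrative only)."""
    return re.sub(r"\s*/\s*", " / ", text).strip()

def vocab(sentences):
    """Collect the set of whitespace-separated token types."""
    return {tok for s in sentences for tok in s.split()}

def oov_tokens(sentence, vocabulary):
    """Return the tokens of `sentence` that are not in `vocabulary`."""
    return [tok for tok in sentence.split() if tok not in vocabulary]

# Toy data, invented for this example.
train = ["Resolution 55/100", "Resolution 60/200"]
test_sentence = "Resolution 55/200"

# Without splitting, the unseen combination is a single unknown token.
print(oov_tokens(test_sentence, vocab(train)))  # ['55/200']

# With splitting, every piece has already been seen in training.
split_train = [split_slashes(s) for s in train]
print(oov_tokens(split_slashes(test_sentence), vocab(split_train)))  # []
```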

If the reason you want PTB-style tokenization is to prepare the text for 
parsing then I'd recommend taking a look at the script parse-de-berkeley.perl 
in scripts/training/wrappers.  It's a wrapper script for the Berkeley parser 
(it works for English as well as German).  It takes the tokenized input and, 
if you give it the -split-slash option, joins the tokens back together prior 
to parsing.  After parsing, it splits them again (via a call to 
berkeleyparsed2mosesxml.perl), adapting the parse tree structure in the 
process.  If you're 
using a different parser then it should be reasonably simple to write a wrapper 
along the same lines.
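
If it helps, the join-then-split idea for a wrapper around another parser might look roughly like the Python sketch below. This is my own simplification, not what parse-de-berkeley.perl actually does: the function names are hypothetical, it works on flat token strings only, it assumes every standalone "/" token came from splitting, and it leaves out the parse tree adaptation entirely.

```python
import re

def join_slashes(tokenized):
    """Undo slash splitting before parsing: 'a / b' -> 'a/b' (hypothetical helper)."""
    return re.sub(r"\s+/\s+", "/", tokenized)

def split_slashes(text):
    """Split slashes out again after parsing: 'a/b' -> 'a / b' (hypothetical helper)."""
    return re.sub(r"\s*/\s*", " / ", text).strip()

tokenized = "Resolution 55 / 100"

# Step 1: rejoin so the parser sees the token the way it expects.
parser_input = join_slashes(tokenized)

# Step 2: the parser would be run on parser_input here (omitted in this sketch).

# Step 3: split again so the tokens match the rest of the pipeline.
restored = split_slashes(parser_input)

print(parser_input)
print(restored)
```

A real wrapper would also have to add a node for each slash and its neighbours when splitting the parsed output, which is the harder part.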

Phil

On 17 Dec 2013, at 01:45, Martin Velez <marve...@ucdavis.edu> wrote:

> I would like to tokenize tokens with forward slashes in the same way PTB does 
> it.
> 
> For example:
> Input: "Resolution 55/100"
> Output: "Resolution 55 / 100" (using default options)
> Output: "Resolution 55 %/% 100" (using "-penn" options)
> Desired Output: "Resolution 55/100"
> 
> I skimmed through the code.  I found the relevant commented code at line 400 
> of the tokenizer.perl script.  If I comment it out, will I achieve my goal? 
> Or will I break something?
> 
> Saludos!
> Martin Velez
> UC Davis
> marve...@ucdavis.edu
> http://csiflabs.cs.ucdavis.edu/~marvelez/
> 
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

