Hi Martin,

I don't think that will break the pipeline, but for word alignment and grammar extraction, separating out the slash character is probably a good idea for reducing data sparsity.
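To make the sparsity point concrete, here is a small Python sketch. It is not part of Moses: `split_slashes` and the toy corpus are made up for illustration, and the regex is only a rough stand-in for what a tokenizer does with slashes. It counts how many token types occur exactly once with and without the slash separated.

```python
import re
from collections import Counter

def split_slashes(text):
    """Put spaces around every '/' so the slash becomes its own token."""
    return re.sub(r"\s*/\s*", " / ", text).split()

# Hypothetical toy corpus; real training data would be far larger.
corpus = [
    "Resolution 55/100",
    "Resolution 55/200",
    "Resolution 70/100",
    "Resolution 70/200",
]

# Whitespace-only tokenization: every fraction is its own rare type.
unsplit_counts = Counter(tok for line in corpus for tok in line.split())
# Slash separated: the numbers and "/" are shared across lines.
split_counts = Counter(tok for line in corpus for tok in split_slashes(line))

def singletons(counts):
    """Return the token types that were seen exactly once."""
    return sorted(tok for tok, n in counts.items() if n == 1)

print("seen once, slash kept attached:", singletons(unsplit_counts))
print("seen once, slash separated:   ", singletons(split_counts))
```

In this toy case every joined fraction is seen only once, while after splitting each number recurs, which is the kind of sharing that helps alignment and rule extraction.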
If the reason you want PTB-style tokenization is to prepare the text for parsing, then I'd recommend taking a look at the script parse-de-berkeley.perl in scripts/training/wrappers. It's a wrapper script for the Berkeley parser (it works for English as well as German) that takes the tokenized input and, if you give it the -split-slash option, joins the tokens back together prior to parsing. After parsing, it splits them again (via a call to berkeleyparsed2mosesxml.perl), adapting the parse tree structure in the process. If you're using a different parser, then it should be reasonably simple to write a wrapper along the same lines.

Phil

On 17 Dec 2013, at 01:45, Martin Velez <marve...@ucdavis.edu> wrote:

> I would like to tokenize tokens with forward slashes in the same way PTB does
> it.
>
> For example:
> Input: "Resolution 55/100"
> Output: "Resolution 55 / 100" (using default options)
> Output: "Resolution 55 %/% 100" (using "-penn" options)
> Desired Output: "Resolution 55/100"
>
> I skimmed through the code. I found the relevant commented code at line 400
> of the tokenizer.perl script. If I commented it out, will I achieve my goal?
> Or will I break something?
>
> Saludos!
> Martin Velez
> UC Davis
> marve...@ucdavis.edu
> http://csiflabs.cs.ucdavis.edu/~marvelez/
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
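For anyone writing a wrapper for a different parser, the join-then-split idea described in the reply could be sketched roughly as below. This is only an illustration in Python, not how parse-de-berkeley.perl or berkeleyparsed2mosesxml.perl is implemented: the function names are hypothetical, the parser output is reduced to a flat list of (tag, word) leaves, and reusing the original tag for every split piece is an assumption made to keep the example short.

```python
def join_slashes(tokens):
    """Merge 'a', '/', 'b' back into the single token 'a/b' before parsing."""
    out = []
    i = 0
    while i < len(tokens):
        # Only join when the slash has a token on both sides.
        if tokens[i] == "/" and out and i + 1 < len(tokens):
            out[-1] = out[-1] + "/" + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def split_leaves(leaves):
    """Split 'a/b' leaves again after parsing.

    Assumption: each piece simply inherits the tag of the joined leaf.
    A real wrapper would have to decide how to restructure the tree.
    """
    out = []
    for tag, word in leaves:
        parts = word.split("/")
        if len(parts) > 1 and all(parts):
            for k, part in enumerate(parts):
                if k > 0:
                    out.append((tag, "/"))
                out.append((tag, part))
        else:
            # No slash, or a bare "/" token: keep the leaf unchanged.
            out.append((tag, word))
    return out

tokens = ["Resolution", "55", "/", "100"]
joined = join_slashes(tokens)
# Stand-in for a real parser call: hand-written tags, for illustration only.
parsed_leaves = [("NN", joined[0]), ("CD", joined[1])]
print(joined)
print(split_leaves(parsed_leaves))
```

The parser only ever sees "55/100" as one token, while the rest of the pipeline still gets the slash as a separate token.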