Hi, after some more thinking about this, I'd relabel your proposal as a regular bug report, asking for this particular minor fix:
Whenever moses expects a single factor only (based on the configuration) in input/ttable/generation-table/..., no split should be done at all.

Here are the details in your three-bullet style:

- default is non-factored input (or rather: if "input factors" is set to "0" only, the pipe has no special meaning)

  There is still an open issue with phrase/generation/reordering tables/suffix arrays/whatever. My suggestion (without having looked at the code) is that whenever the given table refers to a single factor only according to its moses.ini line, no split should be performed at all => no pipe could do any harm.

- surely keep the --factorDelimiter (but make it clear whether or not it also applies to the phrase, generation and reordering tables)

- keep the regular ASCII '|' as the default

Cheers, O.

On 11/15/2010 10:51 PM, Lane Schwartz wrote:
> I agree. How's this proposal:
> * Default is non-factored input
> * When using factors, have the optional flag --factorDelimiter to allow
>   user-specified character for factor delimiter (thanks, Chris :)
> * When using factors, use a default delimiter char of Unicode character
>   2759, MEDIUM VERTICAL BAR, if none is specified by the user flag
>
> On Mon, Nov 15, 2010 at 4:37 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>
>     i second this.
>
>     but can I make another suggestion. make the default be *non* factored
>     input. i reckon that most people using Moses don't actually use
>     factors (hands-up if you do).
>     this means, plain input, with absolutely no meta chars in them.
>
>     and if you are going to use meta-chars, why not just have a flag
>     such as:
>
>     --factorDelimiter=|
>
>     etc.
>
>     Miles
>
>     On 15 November 2010 21:30, Hieu Hoang <hieuho...@gmail.com> wrote:
>     > That's a good idea.
>     > In the decoder, there are 4 places that have to be changed
>     > cos it's hardcoded
>     >     ConfusionNet
>     >     GenerationDictionary
>     >     LanguageModelJoint
>     >     Word::createFromString
>     >
>     > However, the train-model.perl is more difficult to change
>     >
>     > Hieu
>     > Sent from my flying horse
>     >
>     > On 15 Nov 2010, at 09:00 PM, Lane Schwartz <dowob...@gmail.com> wrote:
>     >
>     >> I'd like to propose changing the current factor delimiter to
>     >> something other than the single vertical bar |
>     >>
>     >> Looking through the mailing archives, it seems that the failure to
>     >> properly purge your corpus of vertical bars is a frequent source of
>     >> headaches for users. I know I've encountered this problem before, but
>     >> even knowing that I should do this, just today I had to track down
>     >> another vertical bar-related problem.
>     >>
>     >> I don't really care what the replacement character(s) ends up being,
>     >> just so that any corpus munging related to this delimiter gets handled
>     >> internally by moses rather than being the user's responsibility.
>     >>
>     >> If moses could easily be modified to take a multi-character delimiter,
>     >> that would probably be best. My suggestion for a single-character
>     >> delimiter would be something with the following characteristics:
>     >>
>     >> * Character should be printable (ie not a control character)
>     >> * Character should be one that's implemented in most commonly used
>     >>   fonts
>     >> * Character should be highly obscure, and extremely unlikely to appear
>     >>   in a corpus
>     >> * Character should not be confusable with any commonly used character.
>     >>
>     >> Many characters in the Dingbats section of Unicode (block 2700) would
>     >> fit these desiderata.
>     >>
>     >> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>     >> obscure printable character that looks like a thick vertical bar. It's
>     >> obviously a vertical bar, but just as obviously not the same thing as
>     >> the regular vertical bar |.
>     >>
>     >> Cheers,
>     >> Lane
>     >> _______________________________________________
>     >> Moses-support mailing list
>     >> Moses-support@mit.edu
>     >> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>     --
>     The University of Edinburgh is a charitable body, registered in
>     Scotland, with registration number SC005336.
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"

--
Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
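
For illustration only, the no-split rule suggested at the top of this message could look roughly like the sketch below. It is written in Python rather than the decoder's C++ and is not taken from the Moses sources. The names split_token and num_factors are hypothetical; num_factors stands for the number of factors that the relevant moses.ini line declares for the given input or table, and the check on the factor count is an added assumption, not something discussed in the thread.

```python
def split_token(token, num_factors, delimiter="|"):
    """Return the factors of one token as a list.

    Hypothetical helper, not Moses code. `num_factors` is assumed to be
    the number of factors configured for this input or table.
    """
    if num_factors <= 1:
        # Single factor configured: the delimiter has no special meaning,
        # so the token is returned whole, pipes included.
        return [token]
    factors = token.split(delimiter)
    if len(factors) != num_factors:
        # Assumption: a mismatch is reported instead of silently accepted.
        raise ValueError(
            "expected %d factors in %r, found %d"
            % (num_factors, token, len(factors))
        )
    return factors


print(split_token("a|b", 1))       # ['a|b']
print(split_token("house|NN", 2))  # ['house', 'NN']
```

With a rule like this, a corpus containing stray pipes would only matter in setups that really use more than one factor, and a user-specified delimiter would simply be passed as the third argument.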