Interesting excitement around this thread. I support "no change, but if change is necessary, keep the ASCII '|' as the default delimiter."

Changing the delimiter creates a lot of work to "resolve" what is essentially a documentation and training challenge, not a technical problem. By the way, "|" is not the only troublesome character. The Moses for Mere Mortals team documents other troublesome ASCII control characters. Changing Moses to a different delimiter does not "fix" those characters.

By now, many users have trained many tables with the current delimiter. Changing to a new default delimiter involves the work to implement the changes, work to support the existing tables, and regression testing of all the changes. This means adding and testing code to automatically detect the "|" delimiter. Alternatively, all existing users would need to update their systems to use the old default, or they would have to re-train all their tables. That's a lot of unnecessary work when better documentation will suffice. I think the old adage applies: "if it works, don't fix it".

If the goal is to reduce the load on moses-support, how about a different technical approach? I propose modifying clean-corpus-n.perl to remove the troublesome characters, or modifying tokenizer.perl and detokenizer.perl to 'tokenize' the "|" with reserved character(s) and 'detokenize' the reserved character(s) back to "|". A new option would allow users to define the reserved character(s). This solves the problem for new European language users with minimal effect on existing users. Changing tokenization could also address the other ASCII control characters.

RE: "default delimited 0x00" -- bad idea. Many editors (gedit, for example) interpret files with ASCII null as binary files.

Best regards,
Tom

On Tue, 16 Nov 2010 00:10:46 +0100, Ondrej Bojar <bo...@ufal.mff.cuni.cz> wrote:
> Hi,
>
> after some more thinking about this, I'd relabel your proposal to a
> regular bug report, asking for this particular minor fix:
>
> Whenever moses expects a single factor only (based on the
> configuration) in input/ttable/generation-table/..., no split
> should be done at all.
>
> Here are the details in your three bullet style wording:
>
> - default is non-factored input
>   (or rather: if "input factors" is set "0" only, pipe has no special
>   meaning)
>   There is still an open issue with phrase/generation/reordering
>   tables/suffix arrays/whatever. My suggestion is (without having looked
>   at the code) that whenever the given table speaks about a single
>   factor only according to the moses.ini line, no split should be
>   performed at all => no pipe would do any harm.
>
> - surely keep the --factorDelimiter (but make it clear that it
>   does/does not apply also to the phrase, generation and reordering
>   tables)
>
> - keep the regular ASCII '|' as the default
>
> Cheers, O.
>
>
> On 11/15/2010 10:51 PM, Lane Schwartz wrote:
>> I agree. How's this proposal:
>> * Default is non-factored input
>> * When using factors, have the optional flag --factorDelimiter to allow
>> user-specified character for factor delimiter (thanks, Chris :)
>> * When using factors, use a default delimiter char of Unicode character
>> 2759, MEDIUM VERTICAL BAR, if none is specified by the user flag
>>
>> On Mon, Nov 15, 2010 at 4:37 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>
>> i second this.
>>
>> but can I make another suggestion. make the default be *non* factored
>> input. i reckon that most people using Moses don't actually use
>> factors (hands-up if you do).
>> this means, plain input, with absolutely no meta chars in them.
>>
>> and if you are going to use meta-chars, why not just have a flag
>> such as:
>>
>> --factorDelimiter=|
>>
>> etc.
>>
>> Miles
>>
>> On 15 November 2010 21:30, Hieu Hoang <hieuho...@gmail.com> wrote:
>> > That's a good idea.
>> > In the decoder, there are 4 places that have to be changed,
>> > because it's hardcoded:
>> > ConfusionNet
>> > GenerationDictionary
>> > LanguageModelJoint
>> > Word::createFromString
>> >
>> > However, the train-model.perl is more difficult to change
>> >
>> > Hieu
>> > Sent from my flying horse
>> >
>> > On 15 Nov 2010, at 09:00 PM, Lane Schwartz <dowob...@gmail.com> wrote:
>> >
>> >> I'd like to propose changing the current factor delimiter to
>> >> something other than the single vertical bar |
>> >>
>> >> Looking through the mailing archives, it seems that the failure
>> >> to properly purge your corpus of vertical bars is a frequent source
>> >> of headaches for users. I know I've encountered this problem before,
>> >> but even knowing that I should do this, just today I had to track
>> >> down another vertical bar-related problem.
>> >>
>> >> I don't really care what the replacement character(s) ends up
>> >> being, just so that any corpus munging related to this delimiter
>> >> gets handled internally by moses rather than being the user's
>> >> responsibility.
>> >>
>> >> If moses could easily be modified to take a multi-character
>> >> delimiter, that would probably be best. My suggestion for a
>> >> single-character delimiter would be something with the following
>> >> characteristics:
>> >>
>> >> * Character should be printable (ie not a control character)
>> >> * Character should be one that's implemented in most commonly
>> >> used fonts
>> >> * Character should be highly obscure, and extremely unlikely to
>> >> appear in a corpus
>> >> * Character should not be confusable with any commonly used
>> >> character.
>> >>
>> >> Many characters in the Dingbats section of Unicode (block 2700)
>> >> would fit these desiderata.
>> >>
>> >> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
>> >> highly obscure printable character that looks like a thick vertical
>> >> bar.
>> >> It's obviously a vertical bar, but just as obviously not the
>> >> same thing as the regular vertical bar |.
>> >>
>> >> Cheers,
>> >> Lane
>> >> _______________________________________________
>> >> Moses-support mailing list
>> >> Moses-support@mit.edu
>> >> http://mailman.mit.edu/mailman/listinfo/moses-support
>> >
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>> --
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away. It is time to go elsewhere. The best thing about space travel
>> is that it made it possible to go elsewhere.
>> -- R.A. Heinlein, "Time Enough For Love"
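
Two of the ideas in this thread can be sketched roughly in code: Tom's suggestion to have the tokenizer swap "|" for reserved character(s) and the detokenizer swap them back, and Ondrej's rule that no split should happen when only a single factor is configured. What follows is a minimal Python sketch, not code from Moses or its Perl scripts. The function names, the `placeholder` and `num_factors` parameters, and the choice of U+2759 as the default reserved character are all assumptions made up for illustration.

```python
from typing import List

# Hypothetical sketch: none of these names exist in Moses or its scripts.
# Assumed default reserved character (U+2759); meant to be user-overridable.
DEFAULT_PLACEHOLDER = "\u2759"


def escape_pipes(line: str, placeholder: str = DEFAULT_PLACEHOLDER) -> str:
    """Replace every literal '|' so it cannot be read as a factor delimiter."""
    if placeholder in line:
        # The round trip would be ambiguous if the reserved text already occurs.
        raise ValueError("reserved placeholder already occurs in this line")
    return line.replace("|", placeholder)


def unescape_pipes(line: str, placeholder: str = DEFAULT_PLACEHOLDER) -> str:
    """Restore the original '|' characters, e.g. after decoding."""
    return line.replace(placeholder, "|")


def split_factors(token: str, num_factors: int, delimiter: str = "|") -> List[str]:
    """Split a token into factors, leaving single-factor tokens untouched."""
    if num_factors == 1:
        # Only one factor configured: the delimiter has no special meaning.
        return [token]
    return token.split(delimiter)
```

With something like this, a corpus line would go through `escape_pipes` before training or decoding and through `unescape_pipes` on the way out, while `split_factors(token, 1)` returns the token unchanged even if it contains "|". A multi-character reserved string can be passed as `placeholder` in the same way.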