I fully agree with Miles. In my opinion, replacing the pipe with an exotic
Unicode character is bad because

- in a web-crawled corpus, any Unicode character might occur, however exotic
  it is. If it's exotic, it will be even harder to track down the problem
  when it occurs.
- it assumes that everybody is using UTF-8, which I don't think is true. I
  know people working with Latin-1 encoded corpora, and for all I know,
  somebody out there may be using an encoding in which the bytes encoding
  "exotic UTF-8 character of your choice" in fact encode a very common
  letter or sign. Using a character from the ASCII subset reduces dependence
  on particular encodings as far as possible.
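
To make the encoding point concrete, here is a small Python sketch (an illustration only, not Moses code). It takes U+2759, the character suggested in the quoted thread, and shows what its UTF-8 bytes look like to a tool that assumes Latin-1. The helper name `reinterpret` and the sample word are made up for this example.

```python
import unicodedata

# U+2759, the delimiter proposed in the quoted thread.
PROPOSED = "\u2759"


def reinterpret(text, written_as="utf-8", read_as="latin-1"):
    """Encode text in one encoding and decode the bytes in another
    (hypothetical helper, for illustration only)."""
    return text.encode(written_as).decode(read_as)


utf8_bytes = PROPOSED.encode("utf-8")
print(unicodedata.name(PROPOSED), utf8_bytes)

# Latin-1 assigns a character to every byte value, so the misreading
# never raises an error; it silently yields three other characters.
misread = reinterpret(PROPOSED)
print([f"U+{ord(ch):04X}" for ch in misread])
print(unicodedata.name(misread[0]))  # LATIN SMALL LETTER A WITH CIRCUMFLEX

# Conversely, an ordinary Latin-1 word already contains the first
# byte of the "exotic" delimiter.
latin1_word = "château".encode("latin-1")
print(utf8_bytes[:1] in latin1_word)  # True

# The ASCII pipe is the same single byte in all three encodings.
pipe_is_stable = (
    "|".encode("ascii") == "|".encode("latin-1") == "|".encode("utf-8")
)
print(pipe_is_stable)  # True
```

The lead byte 0xE2 is exactly the Latin-1 code for "â", so byte-oriented tools working on a Latin-1 corpus would see fragments of the delimiter inside perfectly normal words.
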
I like Miles's suggestion of not having a factor delimiter at all unless
explicitly turned on. If that's too complicated, I think we should stick to
the current situation, so at least we know the problems and how to fix them,
and, as Christof pointed out, some people may already have tuned their
pipelines to be pipe-proof (I haven't, but if I had, I'd hate to change it).

/Christian

On Mon, 15 Nov 2010, Miles Osborne wrote:

> i second this.
>
> but can I make another suggestion. make the default be *non* factored
> input. i reckon that most people using Moses don't actually use
> factors (hands-up if you do).
> this means, plain input, with absolutely no meta chars in them.
>
> and if you are going to use meta-chars, why not just have a flag such as:
>
> --factorDelimiter=|
>
> etc.
>
> Miles
>
> On 15 November 2010 21:30, Hieu Hoang <hieuho...@gmail.com> wrote:
> > That's a good idea. In the decoder, there's 4 places that has to be
> > changed cos it's hardcoded
> > ConfusionNet
> > GenerationDictionary
> > LanguageModelJoint
> > Word::createFromString
> >
> > However, the train-model.perl is more difficult to change
> >
> > Hieu
> > Sent from my flying horse
> >
> > On 15 Nov 2010, at 09:00 PM, Lane Schwartz <dowob...@gmail.com> wrote:
> >
> >> I'd like to propose changing the current factor delimiter to something
> >> other than the single vertical bar |
> >>
> >> Looking through the mailing archives, it seems that the failure to
> >> properly purge your corpus of vertical bars is a frequent source of
> >> headaches for users. I know I've encountered this problem before, but even
> >> knowing that I should do this, just today I had to track down another
> >> vertical bar-related problem.
> >>
> >> I don't really care what the replacement character(s) ends up being, just
> >> so that any corpus munging related to this delimiter gets handled
> >> internally by moses rather than being the user's responsibility.
> >>
> >> If moses could easily be modified to take a multi-character delimeter,
> >> that would probably be best. My suggestion for a single-character
> >> delimiter would be something with the following characteristics:
> >>
> >> * Character should be printable (ie not a control character)
> >> * Character should be one that's implemented in most commonly used fonts
> >> * Character should be highly obscure, and extremely unlikely to appear in
> >> a corpus
> >> * Character should not be confusable with any commonly used character.
> >>
> >> Many characters in the Dingbats section of Unicode (block 2700) would fit
> >> these desiderata.
> >>
> >> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
> >> obscure printable character that looks like a thick vertical bar. It's
> >> obviously a vertical bar, but just as obviously not the same thing as the
> >> regular vertical bar |.
> >>
> >> Cheers,
> >> Lane
> >> _______________________________________________
> >> Moses-support mailing list
> >> Moses-support@mit.edu
> >> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
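
Miles's suggestion in the quoted thread (plain input by default, factors only when a delimiter is explicitly given) could look roughly like the following. This is a minimal Python sketch under stated assumptions, not Moses code: the flag name `--factorDelimiter` is taken from his mail, while `split_factors`, `parse_line` and `build_parser` are hypothetical names used only for illustration.

```python
import argparse


def split_factors(token, factor_delimiter=None):
    """Return the factors of one token.

    With no delimiter configured, the token is a single opaque factor,
    so characters such as '|' in the corpus need no special treatment.
    """
    if not factor_delimiter:
        return [token]
    return token.split(factor_delimiter)


def parse_line(line, factor_delimiter=None):
    """Split a whitespace-tokenised line into per-token factor lists."""
    return [split_factors(tok, factor_delimiter) for tok in line.split()]


def build_parser():
    """Command-line parser with an opt-in delimiter (hypothetical flag)."""
    parser = argparse.ArgumentParser(description="factored-input sketch")
    parser.add_argument(
        "--factorDelimiter",
        default=None,
        help="enable factored input and split tokens on this string",
    )
    return parser


if __name__ == "__main__":
    sample = "the|DT house|NN a|b"
    # Default: no flag given, so nothing is split.
    plain = build_parser().parse_args([])
    print(parse_line(sample, plain.factorDelimiter))
    # Opt-in: the user explicitly asks for '|' as the delimiter.
    factored = build_parser().parse_args(["--factorDelimiter=|"])
    print(parse_line(sample, factored.factorDelimiter))
```

With this default, only users who actually want factors have to care about which character is reserved, and they choose it themselves.
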