On 11/15/10 2:05 PM, Hieu Hoang wrote:
> Very true, shouldn't make the delimiter another random char otherwise
> it's hard to debug. However, if we make the default delimiter 0x00,
> would that suit people?

I believe that makes it very hard to manually create and inspect any
corpus. You could use any of the ASCII codes

  0x1D (Group Separator)
  0x1E (Record Separator)
  0x1F (Unit Separator)

but none of these is better in concept. You'd still need to check all
your raw input for occurrences of these characters and escape them
accordingly. They might occur less frequently, but they do occur. The
coding effort to prevent accidents is still the same.

best regards
Christof

> Hieu
> Sent from my flying horse
>
> On 15 Nov 2010, at 09:55 PM, Christian Hardmeier<c...@rax.ch> wrote:
>
>> I fully agree with Miles.
>>
>> In my opinion, replacing the pipe with an exotic Unicode character is
>> bad because
>> - in a web-crawled corpus, any Unicode character might occur, however
>>   exotic it is. If it's exotic, it will be even harder to track down
>>   the problem when it occurs.
>> - it assumes that everybody is using UTF-8, which I don't think is true.
>>   I know people working with Latin-1 encoded corpora, and for all I
>>   know, somebody out there may be using an encoding in which the bytes
>>   encoding "exotic UTF-8 character of your choice" in fact encode a
>>   very common letter or sign. Using a character from the ASCII subset
>>   reduces dependence on particular encodings as far as possible.
>>
>> I like Miles's suggestion of not having a factor delimiter at all unless
>> explicitly turned on. If that's too complicated, I think we should stick
>> to the current situation, so at least we know the problems and how to
>> fix them, and, as Christof pointed out, some people may already have
>> tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
>> hate to change it).
>>
>> /Christian
>>
>> On Mon, 15 Nov 2010, Miles Osborne wrote:
>>
>>> I second this.
>>>
>>> But can I make another suggestion: make the default be *non* factored
>>> input. I reckon that most people using Moses don't actually use
>>> factors (hands up if you do).
>>> This means plain input, with absolutely no meta chars in it.
>>>
>>> And if you are going to use meta-chars, why not just have a flag such as:
>>>
>>> --factorDelimiter=|
>>>
>>> etc.
>>>
>>> Miles
>>>
>>> On 15 November 2010 21:30, Hieu Hoang<hieuho...@gmail.com> wrote:
>>>> That's a good idea. In the decoder, there are 4 places that have to be
>>>> changed because it's hardcoded:
>>>>   ConfusionNet
>>>>   GenerationDictionary
>>>>   LanguageModelJoint
>>>>   Word::createFromString
>>>>
>>>> However, train-model.perl is more difficult to change.
>>>>
>>>> Hieu
>>>> Sent from my flying horse
>>>>
>>>> On 15 Nov 2010, at 09:00 PM, Lane Schwartz<dowob...@gmail.com> wrote:
>>>>
>>>>> I'd like to propose changing the current factor delimiter to something
>>>>> other than the single vertical bar |
>>>>>
>>>>> Looking through the mailing list archives, it seems that the failure to
>>>>> properly purge your corpus of vertical bars is a frequent source of
>>>>> headaches for users. I know I've encountered this problem before, but
>>>>> even knowing that I should do this, just today I had to track down
>>>>> another vertical bar-related problem.
>>>>>
>>>>> I don't really care what the replacement character(s) end up being, as
>>>>> long as any corpus munging related to this delimiter gets handled
>>>>> internally by moses rather than being the user's responsibility.
>>>>>
>>>>> If moses could easily be modified to take a multi-character delimiter,
>>>>> that would probably be best. My suggestion for a single-character
>>>>> delimiter would be something with the following characteristics:
>>>>>
>>>>> * Character should be printable (i.e. not a control character)
>>>>> * Character should be one that's implemented in most commonly used fonts
>>>>> * Character should be highly obscure, and extremely unlikely to appear in
>>>>>   a corpus
>>>>> * Character should not be confusable with any commonly used character.
>>>>>
>>>>> Many characters in the Dingbats section of Unicode (block 2700) would fit
>>>>> these desiderata.
>>>>>
>>>>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>>>>> obscure printable character that looks like a thick vertical bar. It's
>>>>> obviously a vertical bar, but just as obviously not the same thing as the
>>>>> regular vertical bar |.
>>>>>
>>>>> Cheers,
>>>>> Lane
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
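
To make two of the points in this thread concrete (any delimiter still has
to be checked for and escaped in raw input, and a non-ASCII delimiter
depends on the encoding in use), here is a minimal Python sketch. The
function names, the sample corpus and the "<PIPE>" escape string are all
made up for this illustration; none of it is Moses code or an existing
Moses convention.

```python
# Hypothetical helpers for illustration only; nothing here is Moses code.

def count_delimiter(lines, delimiter):
    """Count how often a candidate factor delimiter occurs in raw text lines."""
    return sum(line.count(delimiter) for line in lines)


def escape_delimiter(line, delimiter, replacement):
    """Replace every occurrence of the delimiter with a caller-chosen escape.

    The replacement string is an assumption of this sketch; it must not
    itself contain the delimiter.
    """
    if delimiter in replacement:
        raise ValueError("replacement must not contain the delimiter")
    return line.replace(delimiter, replacement)


def bytes_seen_as(char, encoding="latin-1"):
    """Show how the UTF-8 bytes of `char` read when decoded in another encoding."""
    return char.encode("utf-8").decode(encoding)


# Made-up sample lines: one contains a pipe, one a Unit Separator (0x1F).
corpus = ["a | b", "plain text", "unit\x1fseparator"]

# The same check is needed whichever delimiter is chosen.
pipe_hits = count_delimiter(corpus, "|")
unit_sep_hits = count_delimiter(corpus, "\x1f")
escaped = [escape_delimiter(line, "|", "<PIPE>") for line in corpus]

# U+2759 MEDIUM VERTICAL BAR is three bytes in UTF-8 (E2 9D 99).
utf8_bytes = "\u2759".encode("utf-8")
misread = bytes_seen_as("\u2759")

print(pipe_hits, unit_sep_hits)
print(escaped[0])
print(utf8_bytes.hex(), repr(misread))
```

Read as Latin-1, the first of those three bytes is the letter "â" and the
other two are control characters, which is the kind of collision Christian
describes. The scan-and-escape step looks the same for "|" and for 0x1F,
which is Christof's point that the coding effort does not change.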