Hi, all. This is an excellent spark for a flame war ;-)
Here's a summary of my point of view:

- I use factors in all my experiments (except the baseline run for
  comparison).
- I need to type the factor delimiter often, in many places, incl. the
  command line, while experimenting. So copy-paste won't work for me, and
  escape sequences are context dependent (bash/vim/perl with ASCII-only
  source code would all differ).

=> Don't add a further level of obscurity (as Christof correctly points out).

I also have experience with moderately-sized (90 million tokens) parallel
corpora and XML. *By all means* do avoid XML for any training or input data.
In my experience (a specific dialect of XML, but the parser for it was
actually precompiled to C and it just needed to build complex data
structures), morphologically tagging the text from scratch was faster than
reloading the tagged/parsed XML, and parsing it with McDonald's parser took
a comparable time.

Frankly, I think Moses users should be literate enough to cope with '|'. ;-)
However, error reporting should be improved everywhere, and I actually try to
do that whenever I touch the code nearby.

I'm sending this now, before you jump to a conclusion, you quick bastards! ;-)

O.

On 11/15/2010 10:35 PM, Christof Pintaske wrote:
> Hello Lane,
>
> frankly I don't see this as sooo desirable. You just exchange a magic
> character with an even more magic one. Since the proposed character is
> not an ASCII character you'll eventually run into encoding problems. And
> for most people it'd be very difficult to type this character on the
> keyboard and to distinguish it from the regular | symbol. It just gets
> more and more obscure.
>
> To really improve on the ugly "magic file format" issue I'd love to see
> support for XML-based input and configuration files. There is tons of
> tooling out there to handle XML files, there are no limitations in
> respect to the content (even multi-line input would be possible). You
> can easily check conformance (using a DTD) and you can keep them
> backwards compatible if you desire so.
> Of course it's very well understood that this is a major effort that's
> not easy to address.
>
> just my two cents
> Christof
>
> PS: and yes, I spent substantial effort in making my tool chain
> pipe-proof. I'd hate to sift through all that again for no practical gain.
>
> On 11/15/10 12:55 PM, Lane Schwartz wrote:
>> I'd like to propose changing the current factor delimiter to something
>> other than the single vertical bar |
>>
>> Looking through the mailing archives, it seems that the failure to
>> properly purge your corpus of vertical bars is a frequent source of
>> headaches for users. I know I've encountered this problem before, but
>> even knowing that I should do this, just today I had to track down
>> another vertical bar-related problem.
>>
>> I don't really care what the replacement character(s) ends up being,
>> just so that any corpus munging related to this delimiter gets handled
>> internally by moses rather than being the user's responsibility.
>>
>> If moses could easily be modified to take a multi-character delimiter,
>> that would probably be best. My suggestion for a single-character
>> delimiter would be something with the following characteristics:
>>
>> * Character should be printable (i.e. not a control character)
>> * Character should be one that's implemented in most commonly used fonts
>> * Character should be highly obscure, and extremely unlikely to appear
>>   in a corpus
>> * Character should not be confusable with any commonly used character.
>>
>> Many characters in the Dingbats section of Unicode (block 2700) would
>> fit these desiderata.
>>
>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
>> highly obscure printable character that looks like a thick vertical
>> bar. It's obviously a vertical bar, but just as obviously not the same
>> thing as the regular vertical bar |.
>>
>> Cheers,
>> Lane
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support

-- 
Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
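
A minimal sketch of the two failure modes discussed in this thread, assuming a plain `word|factor` token format with whitespace-separated tokens. The helper names (`split_factors`, `find_bad_tokens`) are hypothetical and this is not Moses code. The first part shows how an unescaped vertical bar left in the corpus changes the factor count of a token; the second shows that U+2759 is not ASCII and takes three bytes in UTF-8, which is the encoding concern raised in Christof's reply.

```python
# Illustration only: hypothetical helpers, not part of Moses.
FACTOR_DELIM = "|"


def split_factors(token, delim=FACTOR_DELIM):
    """Split a factored token such as 'house|NN' into its factors."""
    return token.split(delim)


def find_bad_tokens(line, expected_factors, delim=FACTOR_DELIM):
    """Return (position, token) pairs whose factor count differs from expected."""
    bad = []
    for position, token in enumerate(line.split()):
        if len(split_factors(token, delim)) != expected_factors:
            bad.append((position, token))
    return bad


clean_line = "the|DT house|NN"
# The second token is a literal '|' from the corpus, tagged SYM and never escaped.
dirty_line = "the|DT ||SYM house|NN"

clean_report = find_bad_tokens(clean_line, expected_factors=2)
dirty_report = find_bad_tokens(dirty_line, expected_factors=2)

# Encoding side of the argument: U+2759 MEDIUM VERTICAL BAR versus ASCII '|'.
ascii_bar_bytes = "|".encode("utf-8")
medium_bar_bytes = "\u2759".encode("utf-8")

if __name__ == "__main__":
    print(clean_report)
    print(dirty_report)
    print(len(ascii_bar_bytes), len(medium_bar_bytes))
```

A check like this only reports the position and text of the offending token; how a real pipeline should escape or reject such tokens is left open here.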