Hi, all.

This is an excellent spark for a flame war ;-)

Here's a summary of my point of view:

- I use factors in all my experiments (except the baseline run for
   comparison)
- I need to type the factor delimiter often, in many places, incl.
   command line, while experimenting. So copy-paste won't work for me
   and escape sequences are context dependent (bash/vim/perl with
   ASCII-only source code would all differ)

=> don't add a further level of obscurity (as Christof correctly
    points out)

I also have experience with moderately sized (90 million tokens) parallel
corpora and XML. *By all means* do avoid XML for any training or input
data. In my experience (a specific dialect of XML, but the parser for it
was actually precompiled to C and it just needed to build complex data
structures), morphologically tagging the text from scratch was faster than
reloading the tagged XML, and parsing it with McDonald's parser took about
as long as reloading the parsed XML.

Frankly, I think Moses users should be literate enough to cope with '|'. ;-)
However, error reporting should be improved everywhere, and I actually 
try to do that whenever I touch the code nearby.

I'm sending this now, before you jump to a conclusion, you quick 
bastards! ;-)

O.

On 11/15/2010 10:35 PM, Christof Pintaske wrote:
>   Hello Lane,
>
> frankly I don't see this as sooo desirable. You just exchange a magic
> character with an even more magic one. Since the proposed character is
> not an ASCII character you'll eventually run into encoding problems. And
> for most people it'd be very difficult to type this character on the
> keyboard and to distinguish it from the regular | symbol. It just gets
> more and more obscure.
>
> To really improve on the ugly "magic file format" issue I'd love to see
> support for XML-based input and configuration files. There is plenty of
> tooling out there to handle XML files, there are no limitations with
> respect to the content (even multi-line input would be possible). You
> can easily check conformance (using a DTD) and you can keep them
> backwards compatible if you desire so. Of course it's very well
> understood that this is a major effort that's not easy to address.
>
> just my two cents
> Christof
>
> PS: and yes, I spent substantial effort in making my tool chain pipe
> proof. I'd hate to sift through all that again for no practical gain.
>
>
>
>
> On 11/15/10 12:55 PM, Lane Schwartz wrote:
>> I'd like to propose changing the current factor delimiter to something
>> other than the single vertical bar |
>> Looking through the mailing archives, it seems that the failure to
>> properly purge your corpus of vertical bars is a frequent source of
>> headaches for users. I know I've encountered this problem before, but
>> even knowing that I should do this, just today I had to track down
>> another vertical bar-related problem.
>> I don't really care what the replacement character(s) ends up being,
>> just so that any corpus munging related to this delimiter gets handled
>> internally by moses rather than being the user's responsibility.
>> If moses could easily be modified to take a multi-character delimiter,
>> that would probably be best. My suggestion for a single-character
>> delimiter would be something with the following characteristics:
>> * Character should be printable (i.e. not a control character)
>> * Character should be one that's implemented in most commonly used fonts
>> * Character should be highly obscure, and extremely unlikely to appear
>> in a corpus
>> * Character should not be confusable with any commonly used character.
>> Many characters in the Dingbats section of Unicode (block 2700) would
>> fit these desiderata.
>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
>> highly obscure printable character that looks like a thick vertical
>> bar. It's obviously a vertical bar, but just as obviously not the same
>> thing as the regular vertical bar |.
>> Cheers,
>> Lane
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support

-- 
Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
