Interesting excitement around this thread. I support "no change, but if
change is necessary, keep the ascii '|' as the default delimiter."

Changing the delimiter creates a lot of work to "resolve" what is
essentially a documentation and training challenge, not a technical
problem. By the way, the "|" is not the only troublesome character. The
Moses for Mere Mortals team documents other troublesome ascii control
characters. Changing Moses to a different delimiter does not "fix" those
characters.

By now, many users have trained many tables with the current delimiter.
Changing to a new default delimiter involves the work to implement the
changes, work to support the existing tables, and regression testing of all
the changes. This means adding and testing code to automatically detect
the "|" delimiter. Alternatively, all existing users would need to update
their systems to use the old default, or they would have to re-train all
their tables. That's a lot of unnecessary work when better documentation
will suffice. I think the old adage applies: "if it works, don't fix it".

If the goal is to reduce the load on moses-support, how about a different
technical approach? I propose modifying clean-corpus-n.perl to remove the
"|" characters... or modifying tokenizer.perl and detokenizer.perl to
'tokenize' the "|" into reserved character(s) and 'detokenize' the reserved
character(s) back to "|". A new option would allow users to define the
reserved character(s). This solves the problem for new European language
users with minimal effect on existing users. Changing tokenization could
also address the other ASCII control characters.
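
To make the tokenize/detokenize idea concrete, here is a minimal sketch. It
is in Python rather than Perl only for brevity. The function names and the
"@PIPE@" placeholder are hypothetical; nothing like them exists in
tokenizer.perl or detokenizer.perl today, and a real option would let the
user pick the reserved string.

```python
# Hypothetical helpers illustrating the proposal; not part of Moses.
RESERVED = "@PIPE@"  # assumed placeholder, would be user-configurable


def escape_pipes(line: str, reserved: str = RESERVED) -> str:
    """'Tokenize' step: replace every literal '|' with the reserved string."""
    if reserved in line:
        # Refuse to proceed if the placeholder already occurs in the corpus,
        # otherwise the round trip would not be lossless.
        raise ValueError(f"reserved string {reserved!r} already occurs in input")
    return line.replace("|", reserved)


def unescape_pipes(line: str, reserved: str = RESERVED) -> str:
    """'Detokenize' step: turn the reserved string back into '|'."""
    return line.replace(reserved, "|")


if __name__ == "__main__":
    original = "price: 10 | 20 EUR"
    escaped = escape_pipes(original)
    print(escaped)
    print(unescape_pipes(escaped) == original)
```

The collision check is the part that matters: the reserved string only works
if it never occurs in the corpus, which is why it should be an option rather
than hardcoded.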

RE: "default delimiter 0x00" -- bad idea. Many editors (gedit, for example)
treat files containing an ASCII NUL as binary files.
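
For illustration, this is the kind of check I have in mind. It is a sketch
of a common "NUL byte means binary" heuristic, not gedit's actual code
(which I have not read); the function name and the probe size are arbitrary
assumptions.

```python
def looks_binary(data: bytes, probe_size: int = 8000) -> bool:
    """Heuristic: report binary if a NUL byte occurs near the start."""
    return b"\x00" in data[:probe_size]


# Factored input with the current '|' delimiter stays plain text...
with_pipe = "the|DT house|NN".encode("utf-8")
# ...but the same line with 0x00 as the delimiter trips the heuristic.
with_nul = "the\x00DT house\x00NN".encode("utf-8")

print(looks_binary(with_pipe))
print(looks_binary(with_nul))
```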

Best regards
Tom


On Tue, 16 Nov 2010 00:10:46 +0100, Ondrej Bojar <bo...@ufal.mff.cuni.cz>
wrote:
> Hi,
> 
> after some more thinking about this, I'd relabel your proposal to a 
> regular bug report, asking for this particular minor fix:
> 
>       Whenever moses expects a single factor only (based on the
>       configuration) in input/ttable/generation-table/..., no split
>       should be done at all.
> 
> Here are the details in your three bullet style wording:
> 
> - default is non-factored input
>    (or rather: if "input factors" is set "0" only, pipe has no special
>    meaning)
>    There is still an open issue with phrase/generation/reordering
>    tables/suffix arrays/whatever. My suggestion is (without having look
>    at the code) that whenever the given table speaks about a single
>    factor only according to the moses.ini line, no split should be
>    performed at all => no pipe would make any harm.
> 
> - surely keep the --factorDelimiter (but make it clear that it
>    does/does not apply also to the phrase, generation and reordering
>    tables)
> 
> - keep the regular ASCII '|' as the default
> 
> Cheers, O.
> 
> 
> On 11/15/2010 10:51 PM, Lane Schwartz wrote:
>> I agree. How's this proposal:
>> * Default is non-factored input
>> * When using factors, have the optional flag --factorDelimiter to allow
>> user-specified character for factor delimiter (thanks, Chris :)
>> * When using factors, use a default delimiter char of Unicode character
>> 2759, MEDIUM VERTICAL BAR, if none is specified by the user flag
>>
>> On Mon, Nov 15, 2010 at 4:37 PM, Miles Osborne <mi...@inf.ed.ac.uk
>> <mailto:mi...@inf.ed.ac.uk>> wrote:
>>
>>     i second this.
>>
>>     but can I make another suggestion.  make the default be *non*
>>     factored
>>     input.  i reckon that most people using Moses don't actually use
>>     factors (hands-up if you do).
>>     this means, plain input, with absolutely no meta chars in them.
>>
>>     and if you are going to use meta-chars, why not just have a flag
>>     such as:
>>
>>     --factorDelimiter=|
>>
>>     etc.
>>
>>     Miles
>>
>>     On 15 November 2010 21:30, Hieu Hoang <hieuho...@gmail.com
>>     <mailto:hieuho...@gmail.com>> wrote:
>>      > That's a good idea. In the decoder, there's 4 places that has to
>>      > be
>>      > changed cos it's hardcoded
>>      >   ConfusionNet
>>      >    GenerationDictionary
>>      >   LanguageModelJoint
>>      >    Word::createFromString
>>      >
>>      > However, the train-model.perl is more difficult to change
>>      >
>>      > Hieu
>>      > Sent from my flying horse
>>      >
>>      > On 15 Nov 2010, at 09:00 PM, Lane Schwartz <dowob...@gmail.com
>>     <mailto:dowob...@gmail.com>> wrote:
>>      >
>>      >> I'd like to propose changing the current factor delimiter to
>>     something other than the single vertical bar |
>>      >>
>>      >> Looking through the mailing archives, it seems that the failure
>>     to properly purge your corpus of vertical bars is a frequent source
>>     of headaches for users. I know I've encountered this problem before,
>>     but even knowing that I should do this, just today I had to track
>>     down another vertical bar-related problem.
>>      >>
>>      >> I don't really care what the replacement character(s) ends up
>>     being, just so that any corpus munging related to this delimiter
>>     gets handled internally by moses rather than being the user's
>>     responsibility.
>>      >>
>>      >> If moses could easily be modified to take a multi-character
>>     delimiter, that would probably be best. My suggestion for a
>>     single-character delimiter would be something with the following
>>     characteristics:
>>      >>
>>      >> * Character should be printable (ie not a control character)
>>      >> * Character should be one that's implemented in most commonly
>>     used fonts
>>      >> * Character should be highly obscure, and extremely unlikely to
>>     appear in a corpus
>>      >> * Character should not be confusable with any commonly used
>>     character.
>>      >>
>>      >> Many characters in the Dingbats section of Unicode (block 2700)
>>     would fit these desiderata.
>>      >>
>>      >> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
>>     highly obscure printable character that looks like a thick vertical
>>     bar. It's obviously a vertical bar, but just as obviously not the
>>     same thing as the regular vertical bar |.
>>      >>
>>      >> Cheers,
>>      >> Lane
>>      >> _______________________________________________
>>      >> Moses-support mailing list
>>      >> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>      >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>      >
>>
>>
>>
>>     --
>>     The University of Edinburgh is a charitable body, registered in
>>     Scotland, with registration number SC005336.
>>
>>
>>
>>
>> --
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away.  It is time to go elsewhere.  The best thing about space travel
>> is that it made it possible to go elsewhere.
>>                  -- R.A. Heinlein, "Time Enough For Love"
>>
>>
>>
