[Moses-support] MT Internship at the European Parliament, Luxembourg

2010-11-15 Thread DGTRAD Trainees-ITS


Job Description



The Information Technology Support Unit of the Directorate General for
Translation (ITS) invites applications for an internship in its Research
and Development team. Our current projects focus on developing and
adapting language technology tools to assist the work of one of the
largest translation services in the world.

The successful candidate will work on a machine translation project
focusing on one or more of the following areas: 

- Building of statistical MT systems for various language pairs

- Machine Translation evaluation

- Data pre-/post-processing

- Interfacing of MT with other applications

Possibilities of combining your work here with a master/PhD thesis can
be discussed.

Important Information


Deadline for applications: 15 November (midnight)

Please do not reply to this e-mail unless you have a specific question
concerning the application procedure. In order for your application to
be valid and taken into consideration you will have to fill in the
online form provided at
http://www.europarl.europa.eu/parliament/public/staticDisplay.do?id=147&pageRank=5&language=EN
under "Online Application Form". You will
notice that this application form mainly addresses translation trainees.
If you wish to do an internship as a Computational Linguist or Developer
at our service you are kindly asked to check the box "Information
technology (IT)" at the last step of the online form, under Other
Interests.

Education


A degree in Computational Linguistics, Information Science or another
related field, and strong interest in Natural Language Processing that
can be proven by relevant research papers, university assignments or
publications.

Qualifications


- Excellent command of English, French or German and two other official
languages of the European Union

- Excellent knowledge of Statistical Machine Translation and experience
with relevant implementations (e.g. Moses)

- Advanced knowledge of Linux OS

- Good programming skills in Java and Perl

- SQL, XML and PHP knowledge would be an advantage

- Statistical NLP

A few words about us



ITS is the unit that provides technical and logistical support to the
European Parliament's Directorate General of Translation (DGTRAD).

ITS provides its users with standard IT support services by manning
helpdesks, providing user support, installing and trouble-shooting user
configurations, and running file, print and web servers. It caters
specifically for translation needs by providing its users with a palette
of commercial, inter-institutional and in-house tools and by integrating
these tools into a coherent working environment and providing effective
training and support in their use.

Our unit promotes the sharing of information and the adoption of best
practices amongst its users by providing a Translation Service Portal. It
also represents Parliament in a number of inter-institutional bodies
concerned with technical questions related to translation.

Alexandros POULIS
IT Project Administrator
DG TRAD - IT Support Unit
European Parliament, L-2929 Luxembourg


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] mert-moses.pl weights

2010-11-15 Thread John Morgan
Hi Barry,
I'm using mert-moses.pl to build a hierarchical system with no problem in r3682.
I turn off lexical reordering for hierarchical models.
The version I had trouble with was very recent, but I don't have access to
the machine right now.
Thanks,
John


On 11/15/10, Barry Haddow  wrote:
> Hi John
>
> Does this problem just affect the latest mert-moses.pl (r3697)? Are you
> using lexicalised reordering?
>
> regards
> Barry
>
> On Friday 12 November 2010 20:31, John Morgan wrote:
>> Hello,
>> I think there might be a problem with some new code in mert-moses.pl.
>> I got tuning to run by commenting out line 1225 where values for
>> weight-d are pushed into  the used_triples array.
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


-- 
Regards,
John J Morgan


[Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Lane Schwartz
I'd like to propose changing the current factor delimiter to something other
than the single vertical bar |

Looking through the mailing archives, it seems that the failure to properly
purge your corpus of vertical bars is a frequent source of headaches for
users. I know I've encountered this problem before, but even knowing that I
should do this, just today I had to track down another vertical bar-related
problem.

I don't really care what the replacement character(s) ends up being, just so
that any corpus munging related to this delimiter gets handled internally by
moses rather than being the user's responsibility.

If moses could easily be modified to take a multi-character delimiter, that
would probably be best. My suggestion for a single-character delimiter would
be something with the following characteristics:

* Character should be printable (i.e. not a control character)
* Character should be one that's implemented in most commonly used fonts
* Character should be highly obscure, and extremely unlikely to appear in a
corpus
* Character should not be confusable with any commonly used character.

Many characters in the Dingbats section of Unicode (block 2700) would fit
these desiderata.

I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
highly obscure printable character that looks like a thick vertical bar.
It's obviously a vertical bar, but just as obviously not the same thing as
the regular vertical bar |.
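
Some of the criteria above can be checked mechanically. The following is a minimal Python sketch, assuming Python 3.7 or later; the helper name describe is made up for illustration. It only covers printability and ASCII membership. Font coverage and how often a character shows up in real corpora cannot be determined this way.

```python
import unicodedata

CANDIDATE = "\u2759"  # the character proposed in the message above


def describe(ch):
    """Collect basic facts about a single character (illustrative helper)."""
    return {
        "name": unicodedata.name(ch),
        "printable": ch.isprintable(),   # False for control characters
        "ascii": ch.isascii(),           # True only for code points below 128
        "utf8_bytes": ch.encode("utf-8").hex(),
    }


info = describe(CANDIDATE)
pipe_info = describe("|")
print(info)
print(pipe_info)
```

The candidate needs three bytes in UTF-8, whereas the regular vertical bar is a single ASCII byte.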

Cheers,
Lane


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Hieu Hoang
That's a good idea. In the decoder, there are 4 places that have to be
changed because the delimiter is hardcoded:
   ConfusionNet
   GenerationDictionary
   LanguageModelJoint
   Word::createFromString

However, train-model.perl is more difficult to change.
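
To illustrate what making the delimiter configurable could look like, here is a rough Python sketch. It is not the decoder's code; the function names split_factors and parse_sentence and the num_factors check are assumptions made up for this example. It also shows why a stray bar in the corpus causes trouble: it changes the number of factors a token appears to have.

```python
def split_factors(token, delimiter="|"):
    """Split one token into its factors on a configurable delimiter."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return token.split(delimiter)


def parse_sentence(line, delimiter="|", num_factors=None):
    """Parse a whitespace-tokenised line into a list of factor lists.

    If num_factors is given, a token with a different number of factors
    is reported instead of being silently accepted.
    """
    words = []
    for token in line.split():
        factors = split_factors(token, delimiter)
        if num_factors is not None and len(factors) != num_factors:
            raise ValueError(
                f"token {token!r} has {len(factors)} factors, "
                f"expected {num_factors}"
            )
        words.append(factors)
    return words


print(parse_sentence("the|DT house|NN", num_factors=2))
print(parse_sentence("the|||DT house|||NN", delimiter="|||"))
```

Because str.split accepts strings of any length, a multi-character delimiter needs no extra code in this sketch.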

Hieu
Sent from my flying horse

On 15 Nov 2010, at 09:00 PM, Lane Schwartz  wrote:

> I'd like to propose changing the current factor delimiter to something other 
> than the single vertical bar |
>
> Looking through the mailing archives, it seems that the failure to properly 
> purge your corpus of vertical bars is a frequent source of headaches for 
> users. I know I've encountered this problem before, but even knowing that I 
> should do this, just today I had to track down another vertical bar-related 
> problem.
>
> I don't really care what the replacement character(s) ends up being, just so 
> that any corpus munging related to this delimiter gets handled internally by 
> moses rather than being the user's responsibility.
>
> If moses could easily be modified to take a multi-character delimeter, that 
> would probably be best. My suggestion for a single-character delimiter would 
> be something with the following characteristics:
>
> * Character should be printable (ie not a control character)
> * Character should be one that's implemented in most commonly used fonts
> * Character should be highly obscure, and extremely unlikely to appear in a 
> corpus
> * Character should not be confusable with any commonly used character.
>
> Many characters in the Dingbats section of Unicode (block 2700) would fit 
> these desiderata.
>
> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly 
> obscure printable character that looks like a thick vertical bar. It's 
> obviously a vertical bar, but just as obviously not the same thing as the 
> regular vertical bar |.
>
> Cheers,
> Lane


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Christof Pintaske

Hello Lane,

Frankly, I don't see this as so desirable. You just exchange a magic 
character for an even more magic one. Since the proposed character is 
not an ASCII character, you'll eventually run into encoding problems. And 
for most people it'd be very difficult to type this character on the 
keyboard and to distinguish it from the regular | symbol. It just gets 
more and more obscure.


To really improve on the ugly "magic file format" issue, I'd love to see 
support for XML-based input and configuration files. There are tons of 
tools out there to handle XML files, and there are no limitations with 
respect to the content (even multi-line input would be possible). You 
can easily check conformance (using a DTD) and you can keep them 
backwards compatible if you so desire. Of course it's very well 
understood that this is a major effort that's not easy to address.


just my two cents
Christof

PS: and yes, I spent substantial effort making my tool chain 
pipe-proof. I'd hate to sift through all that again for no practical gain.





On 11/15/10 12:55 PM, Lane Schwartz wrote:
I'd like to propose changing the current factor delimiter to something 
other than the single vertical bar |
Looking through the mailing archives, it seems that the failure to 
properly purge your corpus of vertical bars is a frequent source of 
headaches for users. I know I've encountered this problem before, but 
even knowing that I should do this, just today I had to track down 
another vertical bar-related problem.
I don't really care what the replacement character(s) ends up being, 
just so that any corpus munging related to this delimiter gets handled 
internally by moses rather than being the user's responsibility.
If moses could easily be modified to take a multi-character delimeter, 
that would probably be best. My suggestion for a single-character 
delimiter would be something with the following characteristics:

* Character should be printable (ie not a control character)
* Character should be one that's implemented in most commonly used fonts
* Character should be highly obscure, and extremely unlikely to appear 
in a corpus

* Character should not be confusable with any commonly used character.
Many characters in the Dingbats section of Unicode (block 2700) would 
fit these desiderata.
I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a 
highly obscure printable character that looks like a thick vertical 
bar. It's obviously a vertical bar, but just as obviously not the same 
thing as the regular vertical bar |.

Cheers,
Lane




Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Miles Osborne
i second this.

but can I make another suggestion.  make the default be *non* factored
input.  i reckon that most people using Moses don't actually use
factors (hands-up if you do).
this means, plain input, with absolutely no meta chars in them.

and if you are going to use meta-chars, why not just have a flag such as:

--factorDelimiter=|

etc.

Miles

On 15 November 2010 21:30, Hieu Hoang  wrote:
> That's a good idea. In the decoder, there's 4 places that has to be
> changed cos it's hardcoded
>   ConfusionNet
>    GenerationDictionary
>   LanguageModelJoint
>    Word::createFromString
>
> However, the train-model.perl is more difficult to change
>
> Hieu
> Sent from my flying horse


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Chris Dyer
> --factorDelimiter=|
There is such a flag. I implemented this about 4 years ago, but AFAIK
I'm the only one who ever uses it (and I still use it).

-C

>
> etc.
>
> Miles


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Lane Schwartz
I agree. How's this proposal:

* Default is non-factored input

* When using factors, have the optional flag --factorDelimiter to allow
user-specified character for factor delimiter (thanks, Chris :)

* When using factors, use a default delimiter char of Unicode character
2759, MEDIUM VERTICAL BAR, if none is specified by the user flag
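
The three points above can be sketched as option handling in Python. This is only an illustration of the proposal, not Moses's actual command line: the --factored switch and the function names are hypothetical, and only the --factorDelimiter spelling comes from this thread.

```python
import argparse

# Default proposed above for factored input when no delimiter is given.
DEFAULT_FACTOR_DELIMITER = "\u2759"  # MEDIUM VERTICAL BAR


def build_parser():
    parser = argparse.ArgumentParser(description="factor delimiter sketch")
    # Hypothetical switch: the proposal does not say how factors get enabled.
    parser.add_argument("--factored", action="store_true",
                        help="treat input tokens as factored")
    parser.add_argument("--factorDelimiter", default=None,
                        help="delimiter between factors (implies factored input)")
    return parser


def resolve_delimiter(args):
    """Return the delimiter to split on, or None for plain, non-factored input."""
    if args.factorDelimiter is not None:
        return args.factorDelimiter
    if args.factored:
        return DEFAULT_FACTOR_DELIMITER
    return None


parser = build_parser()
print(resolve_delimiter(parser.parse_args([])))                        # plain input
print(resolve_delimiter(parser.parse_args(["--factored"])))            # default delimiter
print(resolve_delimiter(parser.parse_args(["--factorDelimiter=|"])))   # user choice
```

With no options at all, the resolved delimiter is None, so a vertical bar in the input would be treated as an ordinary character.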




On Mon, Nov 15, 2010 at 4:37 PM, Miles Osborne  wrote:

> i second this.
>
> but can I make another suggestion.  make the default be *non* factored
> input.  i reckon that most people using Moses don't actually use
> factors (hands-up if you do).
> this means, plain input, with absolutely no meta chars in them.
>
> and if you are going to use meta-chars, why not just have a flag such as:
>
> --factorDelimiter=|
>
> etc.
>
> Miles



-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Christian Hardmeier
I fully agree with Miles.

In my opinion, replacing the pipe with an exotic Unicode character is
bad because
- in a web-crawled corpus, any Unicode character might occur, however
  exotic it is. If it's exotic, it will be even harder to track down
  the problem when it occurs.
- it assumes that everybody is using UTF-8, which I don't think is true.
  I know people working with Latin-1 encoded corpora, and for all I
  know, somebody out there may be using an encoding in which the bytes
  encoding "exotic UTF-8 character of your choice" in fact encode a
  very common letter or sign. Using a character from the ASCII subset
  reduces dependence on particular encodings as far as possible.
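
The encoding concern can be made concrete with a few lines of Python. This is a small sketch using U+2759 as the example character; the variable names are arbitrary. It shows that the UTF-8 bytes of that character read as something else under Latin-1, while the ASCII vertical bar is the same single byte in all three encodings tried.

```python
CANDIDATE = "\u2759"  # MEDIUM VERTICAL BAR

# The proposed character takes three bytes in UTF-8.
utf8_bytes = CANDIDATE.encode("utf-8")

# The same bytes interpreted as Latin-1: the first byte becomes the letter
# a-circumflex, the other two land in the C1 control range.
as_latin1 = utf8_bytes.decode("latin-1")

# The ASCII vertical bar is byte 0x7C in every ASCII-compatible encoding.
pipe_bytes = {enc: "|".encode(enc) for enc in ("ascii", "latin-1", "utf-8")}

# Latin-1 cannot represent the proposed character at all.
try:
    CANDIDATE.encode("latin-1")
    representable_in_latin1 = True
except UnicodeEncodeError:
    representable_in_latin1 = False

print(utf8_bytes.hex(), repr(as_latin1), pipe_bytes, representable_in_latin1)
```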

I like Miles's suggestion of not having a factor delimiter at all unless
explicitly turned on. If that's too complicated, I think we should stick
to the current situation, so at least we know the problems and how to
fix them, and, as Christof pointed out, some people may already have
tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
hate to change it).

/Christian

On Mon, 15 Nov 2010, Miles Osborne wrote:

> i second this.
> 
> but can I make another suggestion.  make the default be *non* factored
> input.  i reckon that most people using Moses don't actually use
> factors (hands-up if you do).
> this means, plain input, with absolutely no meta chars in them.
> 
> and if you are going to use meta-chars, why not just have a flag such as:
> 
> --factorDelimiter=|
> 
> etc.
> 
> Miles


Re: [Moses-support] Lower scores with Word Lattice

2010-11-15 Thread Amit Abbi
Hi,

Could someone kindly help me out with respect to the following?
(Why should there be a difference in translations produced when I am
basically providing the same input except I am giving a different inputtype
- namely lattice format?)

On Fri, Nov 12, 2010 at 6:49 PM, Amit Abbi  wrote:

> Hi,
>
> I had a query with regard to use of lattice input in moses.
> There is a little difference in the translations generated when I run moses
> using the 'normal' input format and when I run it with 'lattice input'
> format.
> The translations weren't radically different - only a few phrases were
> different.
>
> When running moses without lattice input, each line in my input file looks
> like the following:-
> a b c d e f g h
>
> When running it using word lattices each line in my input file looks like
> the following:-
>
> ((('*EPS*',1.0,1),),(('a',1.0,1),),(('b',1.0,1),),(('c',1.0,1),),(('d',1.0,1),),(('e',1.0,1),),(('f',1.0,1),),(('g',1.0,1),),(('h',1.0,1),),(('*EPS*',1.0,1),),)
>
> Should there be any differences in the translations produced in the two
> cases?
> When calling moses I give the parameters -inputtype 2 -weight-i 0.2.
>
> Also I wished to know, how is the 'weight-i' used here?
> My understanding is that (weight-i)*log(path weights) + lambda1*lm + ...
> determines the final log probability of a hypothesis. (where by path weights
> I mean the product of the arc weights we specify in the lattice input format
> for the path in question). Is it correct? and in that case should one also
> perform some sort of tuning for this weight?
>
> Regards,
> Amit
Regards,
Amit
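
For reference, the linear-chain lattice line quoted above can be generated from a plain sentence with a short Python sketch. The function names are made up for this example, and the code only handles lattices with one arc per node. It also evaluates the path-weight term exactly as it is written in the question, which is the poster's understanding and not a confirmed description of the decoder's scoring.

```python
import ast
import math


def to_plf(tokens, weight=1.0):
    """Write tokens as a linear-chain lattice in the format quoted above."""
    nodes = []
    for tok in tokens:
        escaped = tok.replace("\\", "\\\\").replace("'", "\\'")
        # One node with a single arc: (word, arc weight, distance to next node).
        nodes.append(f"(('{escaped}',{float(weight)!r},1),),")
    return "(" + "".join(nodes) + ")"


def log_path_weight(plf):
    """Sum of log arc weights along the only path of a linear-chain lattice."""
    lattice = ast.literal_eval(plf)  # the format is a valid Python literal
    return sum(math.log(node[0][1]) for node in lattice)


tokens = ["*EPS*"] + "a b c d e f g h".split() + ["*EPS*"]
plf = to_plf(tokens)
weight_i = 0.2
print(plf)
print(weight_i * log_path_weight(plf))
```

With every arc weight set to 1.0, the log path weight is 0, so under the formula as stated in the question the weight-i term would add nothing to the score for this input.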


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Ondrej Bojar
Hi, all.

This is an excellent spark for a flame war ;-)

Here's a summary of my point of view:

- I use factors in all my experiments (except the baseline run for
   comparison)
- I need to type the factor delimiter often, in many places, incl.
   command line, while experimenting. So copy-paste won't work for me
   and escape sequences are context dependent (bash/vim/perl with
   ASCII-only source code would all differ)

=> don't add a further level of obscurity (as Christof correctly
points out)

I also have experience with moderately-sized (90 million tokens) parallel 
corpora and XML. *By all means* do avoid XML for any training or input 
data. In my experience (a specific dialect of XML, but the parser for it 
was actually precompiled to C and it just needed to build complex data 
structures), it was faster to morphologically tag and comparable to 
parse with McDonald's parser than to reload the tagged/parsed XML.

Frankly, I think Moses users should be literate enough to cope with '|'. ;-)
However, error reporting should be improved everywhere, and I actually 
try to do that whenever I touch the code nearby.

I'm sending this now, before you jump to a conclusion, you quick 
bastards! ;-)

O.

On 11/15/2010 10:35 PM, Christof Pintaske wrote:
>   Hello Lane,
>
> frankly I don't see this as sooo desireable. You just exchange a magic
> character with an even more magic one. Since the proposed character is
> not an ASCII character you'll eventually run into encoding problems. And
> for most people it'd be very difficult to type this character on the
> keyboard and to distinguish it from the regular | symbol. It just gets
> more and more obscure.
>
> To really improve on the ugly "magic file format" issue I'd love to see
> support for XML-based input and configuration files. There is tons of
> tooling out there to handle XML files, there are no limitation in
> respect to the content (even multi-line input would be possible). You
> can easily check conformance (using a DTD) and you can keep them
> backwards compatible if you desire so. Of course it's very well
> understood that this is a major effort that's not easy to address.
>
> just my two cents
> Christof
>
> PS: and yes, I spent substantial effort in making my tool chain pipe
> proof. I'd hate to sift through all that again for no practical gain.
>

-- 
Ondrej Bojar (mailto:o...@cuni.cz / bo...@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Suzy Howlett
My impression has always been that the pipe character is a fairly common 
choice across the field for delimiters like this, possibly to the point 
of being a standard, which makes it seem unwise to arbitrarily choose 
some other character. Perhaps it would be more useful to channel the 
effort into generic tools to help pipe-proof input and pipelines for 
arbitrary applications (not that I have any useful suggestions for how 
to do so).
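
One possible shape for such a generic pipe-proofing tool is a reversible escaping step applied before training and undone after decoding. The following Python sketch assumes an entity-style placeholder ("&#124;" for the bar, "&amp;" for the ampersand); that choice and the function names are assumptions for illustration, not an existing convention of the tools discussed here.

```python
def escape_pipes(line):
    """Replace vertical bars with a placeholder so they cannot act as delimiters.

    Ampersands are escaped first so that text which already contains the
    placeholder sequence survives a round trip unchanged.
    """
    return line.replace("&", "&amp;").replace("|", "&#124;")


def unescape_pipes(line):
    """Undo escape_pipes; the replacements run in the reverse order."""
    return line.replace("&#124;", "|").replace("&amp;", "&")


original = "cat file | sort & echo &#124;"
escaped = escape_pipes(original)
print(escaped)
print(unescape_pipes(escaped) == original)
```

The escaped text contains no vertical bar at all, and unescaping restores the original line exactly.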

I also like the suggestion of having to explicitly turn on the factor 
delimiter, and the command-line arg to use a character other than the 
default pipe. I don't know about the XML-based input, so can't comment 
on that.

S.

On 16/11/10 8:54 AM, Christian Hardmeier wrote:
> I fully agree with Miles.
>
> In my opinion, replacing the pipe with an exotic Unicode character is
> bad because
> - in a web-crawled corpus, any Unicode character might occur, however
>exotic it is. If it's exotic, it will be even harder to track down
>the problem when it occurs.
> - it assumes that everybody is using UTF-8, which I don't think is true.
>I know people working with Latin-1 encoded corpora, and for all I
>know, somebody out there may be using an encoding in which the bytes
>encoding "exotic UTF-8 character of your choice" in fact encode a
>very common letter or sign. Using a character from the ASCII subset
>reduces dependence on particular encodings as far as possible.
>
> I like Miles's suggestion of not having a factor delimiter at all unless
> explicitly turned on. If that's too complicated, I think we should stick
> to the current situation, so at least we know the problems and how to
> fix them, and, as Christof pointed out, some people may already have
> tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
> hate to change it).
>
> /Christian
>
> On Mon, 15 Nov 2010, Miles Osborne wrote:
>
>> i second this.
>>
>> but can I make another suggestion.  make the default be *non* factored
>> input.  i reckon that most people using Moses don't actually use
>> factors (hands-up if you do).
>> this means, plain input, with absolutely no meta chars in them.
>>
>> and if you are going to use meta-chars, why not just have a flag such as:
>>
>> --factorDelimiter=|
>>
>> etc.
>>
>> Miles
>>
>> On 15 November 2010 21:30, Hieu Hoang  wrote:
>>> That's a good idea. In the decoder, there's 4 places that has to be
>>> changed cos it's hardcoded
>>>ConfusionNet
>>> GenerationDictionary
>>>LanguageModelJoint
>>> Word::createFromString
>>>
>>> However, the train-model.perl is more difficult to change
>>>
>>> Hieu
>>> Sent from my flying horse
>>>
>>> On 15 Nov 2010, at 09:00 PM, Lane Schwartz  wrote:
>>>
>>>> I'd like to propose changing the current factor delimiter to something
>>>> other than the single vertical bar |
>>>>
>>>> Looking through the mailing archives, it seems that the failure to
>>>> properly purge your corpus of vertical bars is a frequent source of
>>>> headaches for users. I know I've encountered this problem before, but even
>>>> knowing that I should do this, just today I had to track down another
>>>> vertical bar-related problem.
>>>>
>>>> I don't really care what the replacement character(s) ends up being, just
>>>> so that any corpus munging related to this delimiter gets handled
>>>> internally by moses rather than being the user's responsibility.
>>>>
>>>> If moses could easily be modified to take a multi-character delimeter,
>>>> that would probably be best. My suggestion for a single-character
>>>> delimiter would be something with the following characteristics:
>>>>
>>>> * Character should be printable (ie not a control character)
>>>> * Character should be one that's implemented in most commonly used fonts
>>>> * Character should be highly obscure, and extremely unlikely to appear in
>>>> a corpus
>>>> * Character should not be confusable with any commonly used character.
>>>>
>>>> Many characters in the Dingbats section of Unicode (block 2700) would fit
>>>> these desiderata.
>>>>
>>>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>>>> obscure printable character that looks like a thick vertical bar. It's
>>>> obviously a vertical bar, but just as obviously not the same thing as the
>>>> regular vertical bar |.
>>>>
>>>> Cheers,
>>>> Lane
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.

Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Hieu Hoang
Very true, we shouldn't make the delimiter another random char, otherwise
it's hard to debug. However, if we make the default delimiter 0x00,
would that suit people?

Hieu
Sent from my flying horse

On 15 Nov 2010, at 09:55 PM, Christian Hardmeier  wrote:

> I fully agree with Miles.
>
> In my opinion, replacing the pipe with an exotic Unicode character is
> bad because
> - in a web-crawled corpus, any Unicode character might occur, however
>  exotic it is. If it's exotic, it will be even harder to track down
>  the problem when it occurs.
> - it assumes that everybody is using UTF-8, which I don't think is true.
>  I know people working with Latin-1 encoded corpora, and for all I
>  know, somebody out there may be using an encoding in which the bytes
>  encoding "exotic UTF-8 character of your choice" in fact encode a
>  very common letter or sign. Using a character from the ASCII subset
>  reduces dependence on particular encodings as far as possible.
>
> I like Miles's suggestion of not having a factor delimiter at all unless
> explicitly turned on. If that's too complicated, I think we should stick
> to the current situation, so at least we know the problems and how to
> fix them, and, as Christof pointed out, some people may already have
> tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
> hate to change it).
>
> /Christian
>
> On Mon, 15 Nov 2010, Miles Osborne wrote:
>
>> i second this.
>>
>> but can I make another suggestion.  make the default be *non* factored
>> input.  i reckon that most people using Moses don't actually use
>> factors (hands-up if you do).
>> this means, plain input, with absolutely no meta chars in them.
>>
>> and if you are going to use meta-chars, why not just have a flag such as:
>>
>> --factorDelimiter=|
>>
>> etc.
>>
>> Miles
>>
>> On 15 November 2010 21:30, Hieu Hoang  wrote:
>>> That's a good idea. In the decoder, there's 4 places that has to be
>>> changed cos it's hardcoded
>>>   ConfusionNet
>>>GenerationDictionary
>>>   LanguageModelJoint
>>>Word::createFromString
>>>
>>> However, the train-model.perl is more difficult to change
>>>
>>> Hieu
>>> Sent from my flying horse
>>>
>>> On 15 Nov 2010, at 09:00 PM, Lane Schwartz  wrote:
>>>
>>>> I'd like to propose changing the current factor delimiter to something
>>>> other than the single vertical bar |
>>>>
>>>> Looking through the mailing archives, it seems that the failure to
>>>> properly purge your corpus of vertical bars is a frequent source of
>>>> headaches for users. I know I've encountered this problem before, but even
>>>> knowing that I should do this, just today I had to track down another
>>>> vertical bar-related problem.
>>>>
>>>> I don't really care what the replacement character(s) ends up being, just
>>>> so that any corpus munging related to this delimiter gets handled
>>>> internally by moses rather than being the user's responsibility.
>>>>
>>>> If moses could easily be modified to take a multi-character delimeter,
>>>> that would probably be best. My suggestion for a single-character
>>>> delimiter would be something with the following characteristics:
>>>>
>>>> * Character should be printable (ie not a control character)
>>>> * Character should be one that's implemented in most commonly used fonts
>>>> * Character should be highly obscure, and extremely unlikely to appear in
>>>> a corpus
>>>> * Character should not be confusable with any commonly used character.
>>>>
>>>> Many characters in the Dingbats section of Unicode (block 2700) would fit
>>>> these desiderata.
>>>>
>>>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>>>> obscure printable character that looks like a thick vertical bar. It's
>>>> obviously a vertical bar, but just as obviously not the same thing as the
>>>> regular vertical bar |.
>>>>
>>>> Cheers,
>>>> Lane
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.


Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Ondrej Bojar
Hi,

after some more thinking about this, I'd relabel your proposal as a
regular bug report, asking for this particular minor fix:

  Whenever moses expects a single factor only (based on the
  configuration) in input/ttable/generation-table/..., no split
  should be done at all.

Here are the details in your three bullet style wording:

- default is non-factored input
   (or rather: if "input factors" is set to "0" only, the pipe has no
   special meaning)
   There is still an open issue with phrase/generation/reordering
   tables/suffix arrays/whatever. My suggestion is (without having looked
   at the code) that whenever the given table speaks about a single
   factor only according to the moses.ini line, no split should be
   performed at all => no pipe would do any harm.

- surely keep the --factorDelimiter (but make it clear whether or not
   it also applies to the phrase, generation and reordering
   tables)

- keep the regular ASCII '|' as the default
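
The behaviour requested in the bullets above can be sketched roughly as
follows. This is only an illustration in Python, not the decoder's C++
code; the names split_factors and num_factors are made up for the
example, and the real fix would have to be applied wherever moses
currently splits on the delimiter.

```python
def split_factors(token, num_factors, delimiter="|"):
    """Split a token into factors only if more than one factor is expected.

    Hypothetical helper: `num_factors` stands in for whatever the
    configuration says about the input or table being read.
    """
    if num_factors == 1:
        # Single-factor setup: the delimiter has no special meaning,
        # so a literal "|" in the corpus is left untouched.
        return [token]
    factors = token.split(delimiter)
    if len(factors) != num_factors:
        raise ValueError(
            f"expected {num_factors} factors in {token!r}, got {len(factors)}"
        )
    return factors
```

With one factor configured, split_factors("a|b", 1) returns the token
unchanged, so a stray pipe can no longer break the input. With two
factors, "house|NN" is split as before and a malformed token is reported
instead of being silently misread.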

Cheers, O.


On 11/15/2010 10:51 PM, Lane Schwartz wrote:
> I agree. How's this proposal:
> * Default is non-factored input
> * When using factors, have the optional flag --factorDelimiter to allow
> user-specified character for factor delimiter (thanks, Chris :)
> * When using factors, use a default delimiter char of Unicode character
> 2759, MEDIUM VERTICAL BAR, if none is specified by the user flag
>
> On Mon, Nov 15, 2010 at 4:37 PM, Miles Osborne wrote:
>
> i second this.
>
> but can I make another suggestion.  make the default be *non* factored
> input.  i reckon that most people using Moses don't actually use
> factors (hands-up if you do).
> this means, plain input, with absolutely no meta chars in them.
>
> and if you are going to use meta-chars, why not just have a flag
> such as:
>
> --factorDelimiter=|
>
> etc.
>
> Miles
>
> On 15 November 2010 21:30, Hieu Hoang wrote:
>  > That's a good idea. In the decoder, there's 4 places that has to be
>  > changed cos it's hardcoded
>  >   ConfusionNet
>  >GenerationDictionary
>  >   LanguageModelJoint
>  >Word::createFromString
>  >
>  > However, the train-model.perl is more difficult to change
>  >
>  > Hieu
>  > Sent from my flying horse
>  >
>  > On 15 Nov 2010, at 09:00 PM, Lane Schwartz wrote:
>  >
>  >> I'd like to propose changing the current factor delimiter to
> something other than the single vertical bar |
>  >>
>  >> Looking through the mailing archives, it seems that the failure
> to properly purge your corpus of vertical bars is a frequent source
> of headaches for users. I know I've encountered this problem before,
> but even knowing that I should do this, just today I had to track
> down another vertical bar-related problem.
>  >>
>  >> I don't really care what the replacement character(s) ends up
> being, just so that any corpus munging related to this delimiter
> gets handled internally by moses rather than being the user's
> responsibility.
>  >>
>  >> If moses could easily be modified to take a multi-character
> delimeter, that would probably be best. My suggestion for a
> single-character delimiter would be something with the following
> characteristics:
>  >>
>  >> * Character should be printable (ie not a control character)
>  >> * Character should be one that's implemented in most commonly
> used fonts
>  >> * Character should be highly obscure, and extremely unlikely to
> appear in a corpus
>  >> * Character should not be confusable with any commonly used
> character.
>  >>
>  >> Many characters in the Dingbats section of Unicode (block 2700)
> would fit these desiderata.
>  >>
>  >> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a
> highly obscure printable character that looks like a thick vertical
> bar. It's obviously a vertical bar, but just as obviously not the
> same thing as the regular vertical bar |.
>  >>
>  >> Cheers,
>  >> Lane
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made 

Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread Christof Pintaske
On 11/15/10 2:05 PM, Hieu Hoang wrote:
> Very true, shouldn't make the delimited another random char otherwise
> it's hard to debug. However, if we make the default delimited 0x00,
> would that suit people?
I believe that would make it very hard to manually create and inspect any
corpus. You could use any of the ASCII codes

 0x1D (Group Separator)
 0x1E (Record Separator)
 0x1F (Unit Separator)

but none of these is better in concept. You'd still need to check all
your raw input for occurrences of these characters and escape them
accordingly. They might occur less frequently, but they do occur. The
coding effort to prevent accidents is still the same.
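
To make the point concrete: whichever delimiter is chosen, the raw input
has to be scanned for it. Here is a rough Python sketch of such a check.
The function name count_candidates and the candidate table are
assumptions made for this example; nothing like this ships with moses as
far as this thread says.

```python
from collections import Counter

# Delimiter candidates mentioned in this thread (illustrative list only).
CANDIDATES = {
    "|": "vertical bar",
    "\x00": "NUL",
    "\x1d": "Group Separator",
    "\x1e": "Record Separator",
    "\x1f": "Unit Separator",
    "\u2759": "MEDIUM VERTICAL BAR",
}


def count_candidates(lines, candidates=CANDIDATES):
    """Count how often each candidate delimiter occurs in the given lines."""
    counts = Counter()
    for line in lines:
        for ch in candidates:
            n = line.count(ch)
            if n:
                counts[ch] += n
    return counts


if __name__ == "__main__":
    sample = ["a | b", "x\x1fy", "plain text"]
    for ch, n in count_candidates(sample).items():
        print(f"{CANDIDATES[ch]} (U+{ord(ch):04X}): {n}")
```

The same loop is needed for every candidate, which is why a rarer
character saves no coding effort: it only makes a hit less likely, not
impossible.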

best regards
Christof

> Hieu
> Sent from my flying horse
>
> On 15 Nov 2010, at 09:55 PM, Christian Hardmeier  wrote:
>
>> I fully agree with Miles.
>>
>> In my opinion, replacing the pipe with an exotic Unicode character is
>> bad because
>> - in a web-crawled corpus, any Unicode character might occur, however
>>   exotic it is. If it's exotic, it will be even harder to track down
>>   the problem when it occurs.
>> - it assumes that everybody is using UTF-8, which I don't think is true.
>>   I know people working with Latin-1 encoded corpora, and for all I
>>   know, somebody out there may be using an encoding in which the bytes
>>   encoding "exotic UTF-8 character of your choice" in fact encode a
>>   very common letter or sign. Using a character from the ASCII subset
>>   reduces dependence on particular encodings as far as possible.
>>
>> I like Miles's suggestion of not having a factor delimiter at all unless
>> explicitly turned on. If that's too complicated, I think we should stick
>> to the current situation, so at least we know the problems and how to
>> fix them, and, as Christof pointed out, some people may already have
>> tuned their pipelines to be pipe-proof (I haven't, but if I had, I'd
>> hate to change it).
>>
>> /Christian
>>
>> On Mon, 15 Nov 2010, Miles Osborne wrote:
>>
>>> i second this.
>>>
>>> but can I make another suggestion.  make the default be *non* factored
>>> input.  i reckon that most people using Moses don't actually use
>>> factors (hands-up if you do).
>>> this means, plain input, with absolutely no meta chars in them.
>>>
>>> and if you are going to use meta-chars, why not just have a flag such as:
>>>
>>> --factorDelimiter=|
>>>
>>> etc.
>>>
>>> Miles
>>>
>>> On 15 November 2010 21:30, Hieu Hoang  wrote:
>>>> That's a good idea. In the decoder, there's 4 places that has to be
>>>> changed cos it's hardcoded
>>>>    ConfusionNet
>>>>    GenerationDictionary
>>>>    LanguageModelJoint
>>>>    Word::createFromString
>>>>
>>>> However, the train-model.perl is more difficult to change
>>>>
>>>> Hieu
>>>> Sent from my flying horse
>>>>
>>>> On 15 Nov 2010, at 09:00 PM, Lane Schwartz  wrote:
>>>>
>>>>> I'd like to propose changing the current factor delimiter to something
>>>>> other than the single vertical bar |
>>>>>
>>>>> Looking through the mailing archives, it seems that the failure to
>>>>> properly purge your corpus of vertical bars is a frequent source of
>>>>> headaches for users. I know I've encountered this problem before, but
>>>>> even knowing that I should do this, just today I had to track down
>>>>> another vertical bar-related problem.
>>>>>
>>>>> I don't really care what the replacement character(s) ends up being, just
>>>>> so that any corpus munging related to this delimiter gets handled
>>>>> internally by moses rather than being the user's responsibility.
>>>>>
>>>>> If moses could easily be modified to take a multi-character delimeter,
>>>>> that would probably be best. My suggestion for a single-character
>>>>> delimiter would be something with the following characteristics:
>>>>>
>>>>> * Character should be printable (ie not a control character)
>>>>> * Character should be one that's implemented in most commonly used fonts
>>>>> * Character should be highly obscure, and extremely unlikely to appear in
>>>>> a corpus
>>>>> * Character should not be confusable with any commonly used character.
>>>>>
>>>>> Many characters in the Dingbats section of Unicode (block 2700) would fit
>>>>> these desiderata.
>>>>>
>>>>> I suggest Unicode character 2759, MEDIUM VERTICAL BAR. This is a highly
>>>>> obscure printable character that looks like a thick vertical bar. It's
>>>>> obviously a vertical bar, but just as obviously not the same thing as the
>>>>> regular vertical bar |.
>>>>>
>>>>> Cheers,
>>>>> Lane

>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>

Re: [Moses-support] Proposal to replace vertical bar as factor delimeter

2010-11-15 Thread support
Interesting excitement around this thread. I support "no change, but if
change is necessary, keep the ascii '|' as the default delimiter."

Changing the delimiter creates a lot of work to "resolve" what is
essentially a documentation and training challenge, not a technical
problem. By the way, the "|" is not the only troublesome character. The
Moses for Mere Mortals team documents other troublesome ASCII control
characters. Changing Moses to a different delimiter does not "fix" those
characters.

By now, many users have trained many tables with the current delimiter.
Changing to a new default delimiter involves the work to implement the
changes, work to support the existing tables, and regression testing of all
the changes. This means adding and testing code to automatically detect
the "|" delimiter. Alternatively, all existing users would need to update
their systems to use the old default, or they would have to re-train all
their tables. That's a lot of unnecessary work when better documentation
will suffice. I think the old adage applies: "if it works, don't fix it".

If the goal is to reduce the load on moses-support, how about a different
technical approach? I propose modifying clean-corpus-n.perl to remove
vertical bars... or modifying tokenizer.perl and detokenizer.perl to
'tokenize' the "|" as reserved character(s) and 'detokenize' the reserved
character(s) back to "|". A new option would allow users to define the
reserved character(s). This solves the problem for new European language
users with minimal effect on existing users. Changing tokenization could
also address the other ASCII control characters.
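
As a rough illustration of the tokenize/detokenize idea (written in
Python here rather than as a patch to the Perl scripts), the round trip
could look like the sketch below. The reserved sequence "&#124;" and the
function names are arbitrary choices for this example, not existing
options of tokenizer.perl. The sketch also escapes "&" first, on the
assumption that text which already contains the reserved sequence must
survive the round trip.

```python
BAR = "|"
RESERVED = "&#124;"  # hypothetical reserved sequence standing in for "|"


def escape_bars(text):
    """Replace every "|" so the decoder never sees a bare delimiter."""
    # Escape "&" first, so a literal "&#124;" already present in the
    # corpus cannot be mistaken for an escaped bar later on.
    return text.replace("&", "&amp;").replace(BAR, RESERVED)


def unescape_bars(text):
    """Undo escape_bars, applying the replacements in reverse order."""
    return text.replace(RESERVED, BAR).replace("&amp;", "&")


if __name__ == "__main__":
    original = "R&D | a&#124;b"
    escaped = escape_bars(original)
    print(escaped)
    print(unescape_bars(escaped) == original)
```

The order of the replacements matters: escaping "&" before "|" and
undoing them in the opposite order is what keeps the mapping reversible.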

RE: "default delimited 0x00" -- bad idea. Many editors (gedit for example)
interpret files with ascii null as binary files.

Best regards
Tom


On Tue, 16 Nov 2010 00:10:46 +0100, Ondrej Bojar 
wrote:
> Hi,
> 
> after some more thinking about this, I'd relabel your proposal to a 
> regular bug report, asking for this particular minor fix:
> 
>   Whenever moses expects a single factor only (based on the
>   configuration) in input/ttable/generation-table/..., no split
>   should be done at all.
> 
> Here are the details in your three bullet style wording:
> 
> - default is non-factored input
>(or rather: if "input factors" is set "0" only, pipe has no special
>meaning)
>There is still an open issue with phrase/generation/reordering
>tables/suffix arrays/whatever. My suggestion is (without having look
>at the code) that whenever the given table speaks about a single
>factor only according to the moses.ini line, no split should be
>performed at all => no pipe would make any harm.
> 
> - surely keep the --factorDelimiter (but make it clear that it
>does/does not apply also to the phrase, generation and reordering
>tables)
> 
> - keep the regular ASCII '|' as the default
> 
> Cheers, O.
> 
> 
> On 11/15/2010 10:51 PM, Lane Schwartz wrote:
>> I agree. How's this proposal:
>> * Default is non-factored input
>> * When using factors, have the optional flag --factorDelimiter to allow
>> user-specified character for factor delimiter (thanks, Chris :)
>> * When using factors, use a default delimiter char of Unicode character
>> 2759, MEDIUM VERTICAL BAR, if none is specified by the user flag
>>
>> On Mon, Nov 15, 2010 at 4:37 PM, Miles Osborne wrote:
>>
>> i second this.
>>
>> but can I make another suggestion.  make the default be *non*
>> factored
>> input.  i reckon that most people using Moses don't actually use
>> factors (hands-up if you do).
>> this means, plain input, with absolutely no meta chars in them.
>>
>> and if you are going to use meta-chars, why not just have a flag
>> such as:
>>
>> --factorDelimiter=|
>>
>> etc.
>>
>> Miles
>>
>> On 15 November 2010 21:30, Hieu Hoang wrote:
>>  > That's a good idea. In the decoder, there's 4 places that has to
>>  > be
>>  > changed cos it's hardcoded
>>  >   ConfusionNet
>>  >GenerationDictionary
>>  >   LanguageModelJoint
>>  >Word::createFromString
>>  >
>>  > However, the train-model.perl is more difficult to change
>>  >
>>  > Hieu
>>  > Sent from my flying horse
>>  >
>>  > On 15 Nov 2010, at 09:00 PM, Lane Schwartz wrote:
>>  >
>>  >> I'd like to propose changing the current factor delimiter to
>> something other than the single vertical bar |
>>  >>
>>  >> Looking through the mailing archives, it seems that the failure
>> to properly purge your corpus of vertical bars is a frequent source
>> of headaches for users. I know I've encountered this problem before,
>> but even knowing that I should do this, just today I had to track
>> down another vertical bar-related problem.
>>  >>
>>  >> I don't really care what the replacement character(s) ends up
>> being