Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Francis Tyers

Hi!

The problem with this is that there are so many different metadix
formats that it will be impossible to come up with one that covers them
all. For example, if I remember correctly, the way alt works is
different in es-pt and in oc-es. I think it was decided that it was
desirable to have them functioning differently, or at least that it
would require substantial changes in either language pair to get a
unified format -- changes that without some push (and let's face it,
cash) are not going to get made.

On the other hand, implementing compound words gives us the chance to
strike while the iron is hot! We can make a fairly innocuous change
(any language pair that does not have compounding will be unaffected)
before we get a plethora of different options, and thus avoid the
metadix problem for another set of issues.

Btw, thinking about metadix I have some probably unpopular ideas,
that would preclude any standardisation. I think that maybe we should not
have one format, but rather many _codified_ formats depending on the
language(group). For example how to include a verb would be different in
Tajik and Dutch, because different things are important. Unnecessary
examples:

<e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>

Giving:

<e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
<e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
<e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>

Or in Tajik:

<e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>

Which would give (after transformation) something like:

<e lm="харидан"><p><l>харид</l><r>харидан<s n="vblex"/></r></p><par n="кард/ан__vblex"/></e>
<e lm="харидан"><p><l>хар</l><r>харидан<s n="vblex"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
<e lm="харидан"><p><l>нахар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
<e lm="харидан"><p><l>нахарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
<e lm="харидан"><p><l>мехарид</l><r>харидан<s n="vblex"/></r></p><par n="ме.кард/ан__vblex"/></e>
<e lm="харидан"><p><l>мехар</l><r>харидан<s n="vblex"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>
<e lm="харидан"><p><l>намехарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
<e lm="харидан"><p><l>намехар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>

Fran

PS. Wasn't the election to be organised by Unhammer, Pasquale and Nic ?

El dt 21 de 12 de 2010 a les 06:31 +0100, en/na Mikel Forcada va
escriure:
 Hi Apertiumers,
 
 Before any more patches to the dictionary format are made, a general 
 agreement should be reached. Remember that we have different dialects of 
 metadix and an unification would be desirable before fiddling anymore 
 with dictionary formats
 
 Mikel L. Forcada
 
 P.S. By the way, as the mandate of the current Project Management 
 Committee has long expired and we haven't been able to run a proper 
 election, I understand I could stage a coup d'etat, put on my BDFL cap, 
 and word the above as a command instead of as an opinion. I'm tempted... 
 anyone interested in
 
 On 12/19/2010 09:57 PM, Francis Tyers wrote:
  El dg 19 de 12 de 2010 a les 18:28 +, en/na Jimmy O'Regan va
  escriure:
  On 19 Dec 2010, at 17:39, Francis Tyersfty...@prompsit.com  wrote:
 
 
  It would be nice to get this done before Christmas, are there any
  comments ?
  It would probably be best to use a character other than '+'. In the
  event of the final part of the compound being analysed as a multiword
  with inner inflection, the queue will be attached to the first part of
  the compound. As you're talking about a syntax change anyway, is there
  any reason to not insert the break directly?
  I guess we could use '~' and then change pretransfer.cc to output $^ for
  '~' instead of '$ ^' for '+'...
 
  Anything else ?
 
  Fran
 
 
  --
  Lotusphere 2011
  Register now for Lotusphere 2011 and learn how
  to connect the dots, take your collaborative environment
  to the next level, and enter the era of Social Business.
  http://p.sf.net/sfu/lotusphere-d2d
  ___
  Apertium-stuff mailing list
  Apertium-stuff@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/apertium-stuff
 
 




Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Francis Tyers

El dt 21 de 12 de 2010 a les 10:42 +, en/na Francis Tyers va
escriure:
 Hi!

...

 Btw, thinking about metadix I have some probably unpopular ideas,
 that would preclude any standardisation. I think that maybe we should not
 have one format, but rather many _codified_ formats depending on the
 language(group). For example how to include a verb would be different in
 Tajik and Dutch, because different things are important. Unnecessary
 examples:

I was also looking for this which makes interesting reading:

http://www.stanford.edu/~laurik/fsmbook/clarifications/xmldowntrans.html

Fran



Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Jacob Nordfalk

2010/12/21 Mikel Forcada m...@dlsi.ua.es

 Hi Apertiumers,

 Before any more patches to the dictionary format are made, a general
 agreement should be reached. Remember that we have different dialects of
 metadix and an unification would be desirable before fiddling anymore
 with dictionary formats


I think this unification has been tried before, but not with much success.

I propose that you reserve an hour or two during
http://www.uoc.edu/freerbmt11/ to do a last try at unification of metadix,
but if this is unsuccessful I think it would be wiser to support
compounding than to keep things stalled.

The way I see things, the problem here is not that different languages choose
different ways of expressing the linguistic content.
Fran's ideas about Tajik and Dutch are a fine illustration of why this is
necessary.
The problem is that these differences are mostly expressed in XSLT, a
language which is incomprehensible to most people (me included). Take a look
at, for example, incubator/apertium-en-fr/alt.xsl and you will understand
what I mean.

It all boils down to the fact that a .dix is TWO things:
A) A way of expressing linguistic content which should be relatively easily
maintainable
B) Raw input to a finite state machine processor (lt-comp)

Of course this leaves a gap. We have been closing that gap using XSLT but I
think we should start looking at other, easier to understand, ways of
expressing the gap between A and B.
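
As a rough illustration of closing the gap between A and B without XSLT,
here is a minimal Python sketch that expands the compact Dutch entry from
earlier in the thread into three plain entries. The function name
expand_separable and its arguments are hypothetical, and the participle
paradigm is passed in explicitly because the thread does not say where it
would come from. This is not an existing Apertium tool.

```python
import xml.etree.ElementTree as ET


def expand_separable(lemma, par, prefix, pp, pp_paradigm):
    """Expand one compact separable-verb entry into three plain <e> entries.

    Hypothetical helper: the naming scheme for the generated paradigms is
    copied from the Dutch example in the thread, not from any specification.
    """
    head, tags = par.split("__", 1)          # "z/itten", "vblex"
    suffix = head.split("/", 1)[1]           # "itten"
    if not (lemma.startswith(prefix) and lemma.endswith(suffix)):
        raise ValueError("lemma does not match prefix/paradigm")
    full_stem = lemma[: len(lemma) - len(suffix)]   # "aanz"
    bare_stem = full_stem[len(prefix):]             # "z"

    def entry():
        return ET.Element("e", {"lm": lemma})

    def pair(parent, left, right):
        p = ET.SubElement(parent, "p")
        ET.SubElement(p, "l").text = left
        ET.SubElement(p, "r").text = right
        return p

    # 1. Prefix attached: identity stem plus the joined paradigm.
    joined = entry()
    ET.SubElement(joined, "i").text = full_stem
    ET.SubElement(joined, "par", {"n": f"{prefix}{head}__{tags}_sep"})

    # 2. Prefix separated: bare stem, split paradigm, then blank + prefix.
    split = entry()
    pair(split, bare_stem, full_stem)
    ET.SubElement(split, "par", {"n": f"{head}#_{prefix}__{tags}_sep"})
    tail = pair(split, None, None)
    blank = ET.SubElement(tail.find("l"), "b")
    blank.tail = prefix

    # 3. Irregular past participle, with an explicitly supplied paradigm.
    participle = entry()
    pair(participle, pp, lemma)
    ET.SubElement(participle, "par", {"n": pp_paradigm})

    return [joined, split, participle]


entries = expand_separable(
    "aanzitten", "z/itten__vblex", "aan", "aangezeten", "gesproken__vblex_sep"
)
for e in entries:
    print(ET.tostring(e, encoding="unicode"))
```

The point is only that the same expansion can be written as ordinary,
readable code; whether such logic belongs in lt-comp or in a separate
preprocessing step is a separate question.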

Jacob


-- 
Jacob Nordfalk
http://javabog.dk
Underviser i Android på http://ihk.dk

Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Jacob Nordfalk

2010/12/21 Francis Tyers fty...@prompsit.com

 El dt 21 de 12 de 2010 a les 10:42 +, en/na Francis Tyers va
 escriure:
  Hi!

 ...

  Btw, thinking about metadix I have some probably unpopular ideas,
  that would preclude any standardisation. I think that maybe we should not
  have one format, but rather many _codified_ formats depending on the
  language(group). For example how to include a verb would be different in
  Tajik and Dutch, because different things are important. Unnecessary
  examples:

 I was also looking for this which makes interesting reading:

 http://www.stanford.edu/~laurik/fsmbook/clarifications/xmldowntrans.html


Page 2:

However, I myself don't like XSLT, and I'm not alone. For one thing, it is
based on XSL, intended originally as a stylesheet language, and so has a
strong bias toward reformatting; the underlying assumption is that the
original XML text contains the data you want in the output, and that the
problem is basically just to reformat that data in the output. In practice,
my own XML-to-SomethingElse downtranslations often involve non-trivial
conversions that cannot be handled in XSLT, or which are extremely awkward
in XSLT. I find XSLT too limiting; it always seems to be preventing exactly
what I want to do. I want the power of a real programming language like Perl
or Python while doing downtranslation. Finally, XSLT files are themselves in
an XML format, which some people think is a great advantage. I disagree. XML
is a Good Thing, but it's possible to take any Good Thing too far. Writing
XSLT is a kind of programming, and I dislike programming in XML: it's
verbose and not easy for human beings to read.


Exactly my words!!
Let's find an alternative to XSLT that suits our needs and build support for
that into lt-comp.


-- 
Jacob Nordfalk
http://javabog.dk
Underviser i Android på http://ihk.dk

Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Jacob Nordfalk

2010/12/21 Jacob Nordfalk jacob.nordf...@gmail.com



 2010/12/21 Francis Tyers fty...@prompsit.com

 El dt 21 de 12 de 2010 a les 10:42 +, en/na Francis Tyers va
 escriure:
  Hi!

 ...

  Btw, thinking about metadix I have some probably unpopular ideas,
  that would preclude any standardisation. I think that maybe we should
 not
  have one format, but rather many _codified_ formats depending on the
  language(group). For example how to include a verb would be different in
  Tajik and Dutch, because different things are important. Unnecessary
  examples:

 I was also looking for this which makes interesting reading:

 http://www.stanford.edu/~laurik/fsmbook/clarifications/xmldowntrans.html


 Page 2:


Sorry, page 3



 However, I myself don't like XSLT, and I'm not alone. For one thing, it is
 based on XSL, intended originally as a stylesheet language, and so has a
 strong bias toward
 reformatting; the underlying assumption is that the original XML text
 contains the data
 you want in the output, and that the problem is basically just to reformat
 that data
 in the output. In practice, my own XML-to-SomethingElse downtranslations
 often
 involve non-trivial conversions that cannot be handled in XSLT, or which
 are extremely
 awkward in XSLT. I find XSLT too limiting; it always seems to be preventing
 exactly
 what I want to do. I want the power of a real programming language like
 Perl or Python
 while doing downtranslation. Finally, XSLT files are themselves in an XML
 format, which some people think is a great advantage. I disagree. XML is a
 Good Thing, but it's possible to take any Good
 Thing too far. Writing XSLT is a kind of programming, and I dislike
 programming in
 XML: it's verbose and not easy for human beings to read.


 Exactly my words!!
 Lets find an alternative to XSLT that suits our needs and build support for
 that into lt-comp.


Why not have a GSoC project on this?

The student doesn't have to understand much linguistic stuff. He just needs
to get the same output as the XSL transformation, but with a metadix rule
file with a much simpler syntax.


-- 
Jacob Nordfalk
http://javabog.dk
Underviser i Android på http://ihk.dk

Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Kevin Brubeck Unhammer

Francis Tyers fty...@prompsit.com writes:

 Hi!

 The problem with this is that there are so many different metadix
 formats that it will be impossible to come up with one that covers them
 all. For example if I remember correctly how the alt works is
 different in es-pt and in oc-es. I think it was decided that it was
 desirable to have them functioning differently, or at least would
 require substantial changes in either language pair to get a unified
 format -- changes that without some push (and let's face it, cash) are
 not going to get made. 

 On the other hand, implementing compound words gives us the chance to
 strike while the iron is hot! We can make a (fairly innocuous change --
 any language pair that does not have compounding will be unaffected)
 before getting a plethora of different options and thus avoiding the
 metadix problem for another set of issues.

 Btw, thinking about metadix I have some probably unpopular ideas,
 that would preclude any standardisation. I think that maybe we should not
 have one format, but rather many _codified_ formats depending on the
 language(group). For example how to include a verb would be different in
 Tajik and Dutch, because different things are important. Unnecessary
 examples:

 <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>

 Giving:

 <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
 <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
 <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>

 Or in Tajik:

 <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>

In the unification proposal from

http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal

the calls would look like

<e lm="aanzitten"><par n="z/itten__vblex" prms="prefix='aan' pp='aangezeten'"/></e>

and

<e lm="харидан"><par n="кард/ан__vblex" prms="stem1='харид' stem2='хар'"/></e>


Are there good reasons not to go with that kind of syntax?


-- 
Kevin Brubeck Unhammer


Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Francis Tyers

El dt 21 de 12 de 2010 a les 13:04 +0100, en/na Kevin Brubeck Unhammer
va escriure:
 Francis Tyers fty...@prompsit.com writes:
 
  Hi!
 
  The problem with this is that there are so many different metadix
  formats that it will be impossible to come up with one that covers them
  all. For example if I remember correctly how the alt works is
  different in es-pt and in oc-es. I think it was decided that it was
  desirable to have them functioning differently, or at least would
  require substantial changes in either language pair to get a unified
  format -- changes that without some push (and let's face it, cash) are
  not going to get made. 
 
  On the other hand, implementing compound words gives us the chance to
  strike while the iron is hot! We can make a (fairly innocuous change --
  any language pair that does not have compounding will be unaffected)
  before getting a plethora of different options and thus avoiding the
  metadix problem for another set of issues.
 
  Btw, thinking about metadix I have some probably unpopular ideas,
  that would preclude any standardisation. I think that maybe we should not
  have one format, but rather many _codified_ formats depending on the
  language(group). For example how to include a verb would be different in
  Tajik and Dutch, because different things are important. Unnecessary
  examples:
 
  <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>
 
  Giving:
 
  <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
  <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
  <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>
 
  Or in Tajik:
 
  <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>
 
 In the unification proposal from
 
 http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal
 
 the calls would look like
 
 <e lm="aanzitten"><par n="z/itten__vblex" prms="prefix='aan' pp='aangezeten'"/></e>
 
 and
 
 <e lm="харидан"><par n="кард/ан__vblex" prms="stem1='харид' stem2='хар'"/></e>
 
 
 Are there good reasons not to go with that kind of syntax?

The problem is that what happens after that would be different depending
on the language pair. I think one of the points of the unification
proposal was to have a single xsl file to do the transformations(?),
whereas in this case there would be two.

Fran



Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Jimmy O'Regan

On 21 Dec 2010, at 10:42, Francis Tyers fty...@prompsit.com wrote:

 Hi!

 The problem with this is that there are so many different metadix
 formats that it will be impossible to come up with one that covers  
 them
 all. For example if I remember correctly how the alt works is
 different in es-pt and in oc-es. I think it was decided that it was
 desirable to have them functioning differently, or at least would
 require substantial changes in either language pair to get a unified
 format -- changes that without some push (and let's face it, cash) are
 not going to get made.

Nah. Majority rules. There's one dominant flavour of 'alt', that one  
wins; the other can still be transformed in exactly the same way as it  
is currently without anything being lost, so where's the problem?

'alt' and 'v' are relatively easy - they're not much different to  
'r' (though the 'r'-ness would have to be taken into account, which  
can get confusing). That much could be done by a GCI student.

The slightly difficult parts are prm and sa, and I'd really rather see
those be extended to a templating mechanism, but I haven't been able to
find a way of specifying the parameters that isn't worse to use than
just copying and modifying pardefs manually.



 On the other hand, implementing compound words gives us the chance to
 strike while the iron is hot! We can make a (fairly innocuous change  
 --
 any language pair that does not have compounding will be unaffected)
 before getting a plethora of different options and thus avoiding the
 metadix problem for another set of issues.

 Btw, thinking about metadix I have some probably unpopular ideas,
 that would preclude any standardisation. I think that maybe we should  
 not
 have one format, but rather many _codified_ formats depending on the
 language(group). For example how to include a verb would be  
 different in
 Tajik and Dutch, because different things are important. Unnecessary
 examples:


See, the difference is that metadix is a generic set of extensions,  
while you describe a set of language specific extensions. Not that  
there's anything wrong with that (other than that it creates a  
situation where existing knowledge of dix files is less transferable,  
which is not a major obstacle) - you're just not talking about  
metadix, don't mix up the issues.

Both of your examples seem to be predicated on magic happening -  
pulling secondary paradigm names from the ether. You'd either need to  
have hard coded values in the transformation process, or a set of  
forward declarations that define the mappings. Go for the latter, that  
would involve less name calling on my part.
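
To make the "forward declarations" option concrete, here is a minimal
Python sketch in which the Tajik expansion is driven by a declared table
rather than by values buried in the transformation code. The table name,
its row layout and the expand function are all hypothetical; the rows
simply restate the eight expanded entries from Fran's example. Values are
assumed not to need XML escaping.

```python
# Hypothetical forward declarations for a Tajik verb called with
# stem1/stem2. Each row reads:
# (surface prefix, which stem parameter to use, extra tags, target paradigm)
TAJIK_VERB_ROWS = [
    ("",     "stem1", [],             "кард/ан__vblex"),
    ("",     "stem2", ["prs"],        "к/ард.ан__vblex"),
    ("на",   "stem2", ["neg", "prs"], "к/ард.ан__vblex"),
    ("на",   "stem1", ["neg"],        "кард/ан__vblex"),
    ("ме",   "stem1", [],             "ме.кард/ан__vblex"),
    ("ме",   "stem2", ["pri"],        "к/ард.ан__vblex"),
    ("наме", "stem1", ["neg"],        "кард/ан__vblex"),
    ("наме", "stem2", ["neg", "pri"], "к/ард.ан__vblex"),
]


def expand(lemma, params, rows, pos="vblex"):
    """Return one plain dix entry (as a string) per declared row."""
    entries = []
    for prefix, stem_key, tags, paradigm in rows:
        left = prefix + params[stem_key]
        symbols = "".join(f'<s n="{tag}"/>' for tag in [pos, *tags])
        entries.append(
            f'<e lm="{lemma}"><p><l>{left}</l><r>{lemma}{symbols}</r></p>'
            f'<par n="{paradigm}"/></e>'
        )
    return entries


for line in expand("харидан", {"stem1": "харид", "stem2": "хар"},
                   TAJIK_VERB_ROWS):
    print(line)
```

With the mappings declared as data, a different language (group) would
supply a different table while the expansion code stays the same.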


 <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>

 Giving:

 <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
 <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
 <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>

 Or in Tajik:

 <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>

 Which would give (after transformation) something like:

 <e lm="харидан"><p><l>харид</l><r>харидан<s n="vblex"/></r></p><par n="кард/ан__vblex"/></e>
 <e lm="харидан"><p><l>хар</l><r>харидан<s n="vblex"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
 <e lm="харидан"><p><l>нахар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
 <e lm="харидан"><p><l>нахарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
 <e lm="харидан"><p><l>мехарид</l><r>харидан<s n="vblex"/></r></p><par n="ме.кард/ан__vblex"/></e>
 <e lm="харидан"><p><l>мехар</l><r>харидан<s n="vblex"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>
 <e lm="харидан"><p><l>намехарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
 <e lm="харидан"><p><l>намехар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>

 Fran

 PS. Wasn't the election to be organised by Unhammer, Pasquale and  
 Nic ?

 El dt 21 de 12 de 2010 a les 06:31 +0100, en/na Mikel Forcada va
 escriure:
 Hi Apertiumers,

 Before any more patches to the dictionary format are made, a general
 agreement should be reached. Remember that we have different  
 dialects of
 metadix and an unification would be desirable before fiddling anymore
 with dictionary formats

 Mikel L. Forcada

 P.S. By the way, as the mandate of the current Project Management
 Committee has long expired and we haven't been able to run a proper
 election, I understand I could stage a coup d'etat, put on my BDFL  
 cap,
 and word the above as a command instead of as an opinion. I'm  
 tempted...
 anyone interested in

 On 12/19/2010 09:57 PM, Francis Tyers wrote:
 El dg 19 de 12 de 2010 a les 18:28 +, en/na Jimmy O'Regan va
 escriure:
 On 19 Dec 2010, at 17:39, Francis Tyersfty...@prompsit.com   
 wrote:


 It would be nice to get this done before Christmas, are there any
 comments ?
 It would probably be best to use a character other than '+'. In the
 event of the final part of the compound being analysed as a  
 multword
 with inner inflection, the queue will be attached to the first  
 part of

Re: [Apertium-stuff] Compound words and dix format

2010-12-21 Thread Mikel Forcada

Hi all,
a couple of comments:

(1) I am not advocating the use of XSLT. In fact, I think it is not a
good idea to bundle code (even if it is XSLT) with a language pair. I
would rather have a compiler that understands directly the higher level
descriptions collectively called metadixes; it wouldn't be difficult to
actually parametrize the different dialects of metadix and pass on those
parameters to a compiler that would then do whatever is needed for the
dictionary to compile (there is a single transfer compiler for .t1x,
.t2x and .t3x, for instance).

(2) I am sure there are common features of all metadixes that can be
unified so that transformations embedded into the language pair are
minimal, and, if possible, declarative and not procedural. I did have, I
think, a unifying proposal for par's and sa's. I would encourage
Apertiumers to look for those common features to try to minimize the
current anarchy (the other day I found it quite difficult to explain to
a student why one used metadixes for en in en-es and didn't use them
for es).

I think we should make an effort in both directions. Currently, Apertium
does not look good with so many dialects and ad-hoc embedded XSLTs, etc.
And also, format proliferation makes it very hard to use automatic tools
like apertium-dixtools, etc.

Just my 0.02€ worth!

Mikel

12/21/2010 01:43 PM, Jimmy O'Regan wrote:

On 21 Dec 2010, at 12:04, Kevin Brubeck Unhammerunham...@fsfe.org
wrote:

Francis Tyersfty...@prompsit.com writes:

Hi!

The problem with this is that there are so many different metadix
formats that it will be impossible to come up with one that covers
them
all. For example if I remember correctly how the alt works is
different in es-pt and in oc-es. I think it was decided that it was
desirable to have them functioning differently, or at least would
require substantial changes in either language pair to get a unified
format -- changes that without some push (and let's face it, cash)
are
not going to get made.

On the other hand, implementing compound words gives us the chance to
strike while the iron is hot! We can make a (fairly innocuous
change --
any language pair that does not have compounding will be unaffected)
before getting a plethora of different options and thus avoiding the
metadix problem for another set of issues.

Btw, thinking about metadix I have some probably unpopular ideas,
that would preclude any standardisation. I think that maybe we
should not
have one format, but rather many _codified_ formats depending on the
language(group). For example how to include a verb would be
different in
Tajik and Dutch, because different things are important. Unnecessary
examples:

<e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>

Giving:

<e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
<e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
<e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>

Or in Tajik:

<e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>

In the unification proposal from

http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal

the calls would look like

<e lm="aanzitten"><par n="z/itten__vblex" prms="prefix='aan' pp='aangezeten'"/></e>

and

<e lm="харидан"><par n="кард/ан__vblex" prms="stem1='харид' stem2='хар'"/></e>

Are there good reasons not to go with that kind of syntax?

The use of the apostrophe, for one thing. Makes it unworkable for
several languages. The key/value pairs ought really to be expressed in
an XML structure.
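
A small Python sketch of the apostrophe problem follows. The parser for
the quoted prms syntax is a naive guess at how such a string might be
read, and the prm child element with n/v attributes is a hypothetical
XML alternative; neither is taken from the wiki proposal. The value
c'hoar is just an example of a string containing an apostrophe.

```python
import re
import xml.etree.ElementTree as ET


def parse_prms(prms):
    """Naive reader for the key='value' syntax inside a prms attribute."""
    return dict(re.findall(r"(\w+)='([^']*)'", prms))


def parse_prm_elements(xml_text):
    """Reader for a hypothetical <prm n="..." v="..."/> child structure."""
    par = ET.fromstring(xml_text)
    return {prm.get("n"): prm.get("v") for prm in par.findall("prm")}


# Works while no value contains an apostrophe.
print(parse_prms("stem1='харид' stem2='хар'"))

# The value is cut at the inner apostrophe: {'stem': 'c'}
print(parse_prms("stem='c'hoar'"))

# With the pairs expressed as XML, the parser handles the quoting.
print(parse_prm_elements('<par n="x__n"><prm n="stem" v="c\'hoar"/></par>'))
```

An escaping convention inside prms could also work, but it would be one
more ad-hoc syntax to document, which is what the XML structure avoids.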

--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326


Re: [Apertium-stuff] Compound words and dix format

2010-12-20 Thread Jacob Nordfalk

2010/12/19 Kevin Brubeck Unhammer unham...@fsfe.org

 Francis Tyers fty...@prompsit.com writes:

  Now we have the java compound word implementation ported to C++ we can
  probably consider this 'de facto' how we are going to do compounds in
  lttoolbox -- it is _in use_ and there have been _no alternatives_.


Hey, happy to hear that the Java code is ported to C++ :-)

I would have liked it to be test driven in more than a single language pair
(nn-nb) before going 'final' but well, it has been a year now and I'm just
beginning to look at stuff that has been piling up for the last 2 months, so
no chance I'll get progress on any Esperanto or Danish compounding stuff
anyway. Probably finishing lttoolbox-java up and making deployable language
pairs as JAR-files (a la .exe) and an Android port would come first in line
when/if I get the time.

The original reason for having this difference was that we so far have
 no examples of forms that can be compound-R but not words on their own,
 so having those extra identical lines means longer dix files.

 However, lttoolbox has this wonderful feature called pardefs :) So what
 the line for kortet really looks like is this:

 <e>   <p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p><par n="cp-R"/></e>

 where

 <pardef n="cp-R">
   <!-- can appear in compounds: -->
   <e>   <p><l></l>  <r><c r="R"/></r></p></e>
   <!-- can appear as a word on its own: -->
   <e>   <p><l></l>  <r></r></p></e>
 </pardef>


 So, if we're deciding on specifications, that's the only thing I'd like
 to see changed.


Seems fair.
Just note that the code requires a non-trivial change. It currently
depends on stuff like compoundOnlyLSymbol and compoundRSymbol.
Also, please try to keep the Java code in sync if possible, and create some
well-documented tests of the new stuff, like the ones in lttoolbox-java.

Jacob


-- 
Jacob Nordfalk
http://javabog.dk
Underviser i Android på http://ihk.dk

Re: [Apertium-stuff] Compound words and dix format

2010-12-20 Thread Mikel Forcada

Hi Apertiumers,

Before any more patches to the dictionary format are made, a general
agreement should be reached. Remember that we have different dialects of
metadix and a unification would be desirable before fiddling any more
with dictionary formats.

Mikel L. Forcada

P.S. By the way, as the mandate of the current Project Management 
Committee has long expired and we haven't been able to run a proper 
election, I understand I could stage a coup d'etat, put on my BDFL cap, 
and word the above as a command instead of as an opinion. I'm tempted... 
anyone interested in

On 12/19/2010 09:57 PM, Francis Tyers wrote:
 El dg 19 de 12 de 2010 a les 18:28 +, en/na Jimmy O'Regan va
 escriure:
 On 19 Dec 2010, at 17:39, Francis Tyersfty...@prompsit.com  wrote:


 It would be nice to get this done before Christmas, are there any
 comments ?
 It would probably be best to use a character other than '+'. In the
 event of the final part of the compound being analysed as a multiword
 with inner inflection, the queue will be attached to the first part of
 the compound. As you're talking about a syntax change anyway, is there
 any reason to not insert the break directly?
 I guess we could use '~' and then change pretransfer.cc to output $^ for
 '~' instead of '$ ^' for '+'...

 Anything else ?

 Fran




-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326



Re: [Apertium-stuff] Compound words and dix format

2010-12-19 Thread Kevin Brubeck Unhammer

Francis Tyers <fty...@prompsit.com> writes:

> Now we have the Java compound word implementation ported to C++ we can
> probably consider this 'de facto' how we are going to do compounds in
> lttoolbox -- it is _in use_ and there have been _no alternatives_.
>
> So it is probably worth looking at how we are going to represent this
> nicely in the .dix format. At the moment we use two 'special' symbols:
>
> <sdef n="compound-only-L" c="for a form that can only appear on the L"/>
> <sdef n="compound-R" c="for a form that can only appear on the R, or
> as a word on its own"/>
>
> I propose making a new element <c> for compound, and having one
> attribute r for restriction.
>
> <s n="compound-only-L"/> would be replaced with <c r="L"/> and
> <s n="compound-R"/> would be replaced with <c r="R"/>

I think it would be better if elements with <c r="R"/> are, like
<c r="L"/>, compound-only. As the examples below show, an element
marked <s n="compound-R"/> now both allows use in compounds and out
of compounds, while <s n="compound-only-L"/> marks a path that's only
reachable in compounds. I think new users would find it less confusing
if they mean the same thing, even though it requires a slightly more
explicit dix file. So instead of

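   <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/><c r="L"/></r></p></e>
   <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/></r></p></e>
   <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/><c r="R"/></r></p></e>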

you would have to have

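   <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/><c r="L"/></r></p></e>
   <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/></r></p></e>
   <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/><c r="R"/></r></p></e>
   <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p></e>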

(Note the beautiful symmetry.)


The original reason for having this difference was that we so far have
no examples of forms that can be compound-R but not words on their own,
so having those extra identical lines means longer dix files. 

However, lttoolbox has this wonderful feature called pardefs :) So what
the line for kortet really looks like is this:

  <e>   <p><l>kortet</l><r>kort<s n="n"/><s n="nt"/>
  <s n="sg"/><s n="def"/></r></p><par n="cp-R"/></e>

where 

<pardef n="cp-R">
   <!-- can appear in compounds: -->
   <e>   <p><l></l>  <r><c r="R"/></r></p></e>
   <!-- can appear as a word on its own: -->
   <e>   <p><l></l>  <r></r></p></e>
</pardef>
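
For illustration only, here is roughly what that paradigm expands to,
sketched in Python. This is a toy model and not lttoolbox code: the
PARDEFS table, the expand() function and the plain-string representation
of tags are all assumptions made up for the example.

```python
# Toy model of paradigm expansion (an assumption for illustration;
# lttoolbox compiles dictionaries into transducers and does not work on strings).
# Each paradigm is a list of (left suffix, right suffix) pairs.
PARDEFS = {
    "cp-R": [
        ("", '<c r="R"/>'),  # can appear in compounds
        ("", ""),            # can appear as a word on its own
    ],
}


def expand(left, right, par):
    """Append every (left, right) pair of paradigm `par` to the entry's own pair."""
    return [(left + pl, right + pr) for pl, pr in PARDEFS[par]]


entries = expand(
    "kortet",
    'kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/>',
    "cp-R",
)
for left, right in entries:
    print(left, "->", right)
```

The single kortet line yields two paths, one carrying the compound
restriction and one without it, which is the same as writing the two
explicit entries by hand.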


So, if we're deciding on specifications, that's the only thing I'd like
to see changed. 


-Kevin


-- 

Sent from my Emacs



Re: [Apertium-stuff] Compound words and dix format

2010-12-19 Thread Francis Tyers

On Sun 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
> On 19 Dec 2010, at 17:39, Francis Tyers <fty...@prompsit.com> wrote:
>
>> It would be nice to get this done before Christmas, are there any
>> comments ?
>
> It would probably be best to use a character other than '+'. In the
> event of the final part of the compound being analysed as a multiword
> with inner inflection, the queue will be attached to the first part of
> the compound. As you're talking about a syntax change anyway, is there
> any reason to not insert the break directly?

I guess we could use '~' and then change pretransfer.cc to output $^ for
'~' instead of '$ ^' for '+'...
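
A rough sketch of that idea, in Python rather than the C++ of
pretransfer.cc. The function name split_unit and the example lexical
units are made up for illustration, and the sketch ignores the queue
handling for multiwords with inner inflection, so it only shows the
difference between the two join characters.

```python
def split_unit(unit: str) -> str:
    """Split a joined lexical unit into separate units.

    '+' marks a join whose parts are separated by a space on output,
    '~' (the marker proposed here for compounds) a join with no space.
    Simplified: assumes neither character occurs anywhere else in the unit.
    """
    return unit.replace("+", "$ ^").replace("~", "$^")


# Hypothetical lexical units, written in the usual ^...$ stream format.
print(split_unit("^a<n>+b<n>$"))
print(split_unit("^plast<n><m><sg><ind>~kort<n><nt><sg><def>$"))
```

With '~' the two parts of the compound stay adjacent in the stream, so
later stages can tell them apart from a '+' join.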

Anything else ? 

Fran



Re: [Apertium-stuff] Compound words and dix format

2010-12-19 Thread Jimmy O'Regan

On 19 Dec 2010, at 20:57, Francis Tyers <fty...@prompsit.com> wrote:

> On Sun 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
>> On 19 Dec 2010, at 17:39, Francis Tyers <fty...@prompsit.com> wrote:
>>
>>> It would be nice to get this done before Christmas, are there any
>>> comments ?
>>
>> It would probably be best to use a character other than '+'. In the
>> event of the final part of the compound being analysed as a multiword
>> with inner inflection, the queue will be attached to the first part of
>> the compound. As you're talking about a syntax change anyway, is there
>> any reason to not insert the break directly?
>
> I guess we could use '~' and then change pretransfer.cc to output $^
> for '~' instead of '$ ^' for '+'...
>
> Anything else ?

Well, having the differentiation would be generally useful, so it  
would be worth factoring it in aside from compounds.


This would need a small change in the tagger too. Can anyone think of  
a situation where it would need to be treated differently to normal  
def-mults?

