Re: [Apertium-stuff] Compound words and dix format
Hi!

The problem with this is that there are so many different metadix formats that it will be impossible to come up with one that covers them all. For example, if I remember correctly, how 'alt' works is different in es-pt and in oc-es. I think it was decided that it was desirable to have them functioning differently, or at least that it would require substantial changes in either language pair to get a unified format -- changes that without some push (and let's face it, cash) are not going to get made.

On the other hand, implementing compound words gives us the chance to strike while the iron is hot! We can make a fairly innocuous change (any language pair that does not have compounding will be unaffected) before we get a plethora of different options, and thus avoid the metadix problem for another set of issues.

Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
Unnecessary examples:

    <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>

Giving:

    <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
    <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
    <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>

Or in Tajik:

    <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>

Which would give (after transformation) something like:

    <e lm="харидан"><p><l>харид</l><r>харидан<s n="vblex"/></r></p><par n="кард/ан__vblex"/></e>
    <e lm="харидан"><p><l>хар</l><r>харидан<s n="vblex"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
    <e lm="харидан"><p><l>нахар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
    <e lm="харидан"><p><l>нахарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
    <e lm="харидан"><p><l>мехарид</l><r>харидан<s n="vblex"/></r></p><par n="ме.кард/ан__vblex"/></e>
    <e lm="харидан"><p><l>мехар</l><r>харидан<s n="vblex"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>
    <e lm="харидан"><p><l>намехарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
    <e lm="харидан"><p><l>намехар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>

Fran

PS. Wasn't the election to be organised by Unhammer, Pasquale and Nic?

On Tue, 21 Dec 2010 at 06:31 +0100, Mikel Forcada wrote:
> Hi Apertiumers,
>
> Before any more patches to the dictionary format are made, a general agreement should be reached. Remember that we have different dialects of metadix, and a unification would be desirable before fiddling any more with dictionary formats.
>
> Mikel L. Forcada
>
> P.S. By the way, as the mandate of the current Project Management Committee has long expired and we haven't been able to run a proper election, I understand I could stage a coup d'etat, put on my BDFL cap, and word the above as a command instead of as an opinion. I'm tempted... anyone interested in
>
> On 12/19/2010 09:57 PM, Francis Tyers wrote:
>> On Sun, 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
>>> On 19 Dec 2010, at 17:39, Francis Tyers fty...@prompsit.com wrote:
>>>> It would be nice to get this done before Christmas, are there any comments?
>>>
>>> It would probably be best to use a character other than '+'. In the event of the final part of the compound being analysed as a multiword with inner inflection, the queue will be attached to the first part of the compound. As you're talking about a syntax change anyway, is there any reason not to insert the break directly?
>>
>> I guess we could use '~' and then change pretransfer.cc to output $^ for '~' instead of '$ ^' for '+'... Anything else?
>>
>> Fran
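The Dutch expansion in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `expand_separable` is made up, the paradigm-name patterns are read off the single `aanzitten` example, and the participle paradigm (`gesproken__vblex_sep`) is passed in explicitly because nothing in the short entry says where it comes from.

```python
def expand_separable(lm, par, prefix, pp, pp_par):
    """Expand one parametrized separable-verb entry into three plain dix entries.

    Hypothetical helper: the naming patterns below are inferred from the
    'aanzitten' example in the thread, not taken from any existing tool.
    """
    base, tags = par.split("__", 1)          # "z/itten", "vblex"
    ending = base.split("/", 1)[1]           # "itten"
    if not (lm.endswith(ending) and lm.startswith(prefix)):
        raise ValueError("lemma does not match paradigm or prefix")
    stem = lm[: len(lm) - len(ending)]       # "aanz"
    bare = stem[len(prefix):]                # "z"
    return [
        # Prefix attached to the verb
        f'<e lm="{lm}"><i>{stem}</i><par n="{prefix}{base}__{tags}_sep"/></e>',
        # Prefix separated from the verb
        f'<e lm="{lm}"><p><l>{bare}</l><r>{stem}</r></p>'
        f'<par n="{base}#_{prefix}__{tags}_sep"/>'
        f'<p><l><b/>{prefix}</l><r></r></p></e>',
        # Past participle; pp_par is an explicit assumption supplied by the caller
        f'<e lm="{lm}"><p><l>{pp}</l><r>{lm}</r></p><par n="{pp_par}"/></e>',
    ]


for entry in expand_separable("aanzitten", "z/itten__vblex", "aan",
                              "aangezeten", "gesproken__vblex_sep"):
    print(entry)
```

Having to pass the participle paradigm by hand is the point of the sketch: the short entry alone does not carry enough information to produce the third line.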
Re: [Apertium-stuff] Compound words and dix format
On Tue, 21 Dec 2010 at 10:42 +, Francis Tyers wrote:
> Hi!
> ...
> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>
> Unnecessary examples:

I was also looking for this, which makes interesting reading:

http://www.stanford.edu/~laurik/fsmbook/clarifications/xmldowntrans.html

Fran
Re: [Apertium-stuff] Compound words and dix format
2010/12/21 Mikel Forcada m...@dlsi.ua.es
> Hi Apertiumers,
>
> Before any more patches to the dictionary format are made, a general agreement should be reached. Remember that we have different dialects of metadix, and a unification would be desirable before fiddling any more with dictionary formats.

I think this unification has been tried before, but without much success. I propose that you reserve an hour or two during http://www.uoc.edu/freerbmt11/ for a last try at unifying metadix, but if this is unsuccessful I think it would be wiser to support compounding than to keep things stalled.

The way I see things, the problem here is not that different languages choose different ways of expressing the linguistic content. Fran's ideas about Tajik and Dutch are a fine illustration of why this is necessary. The problem is that these differences are mostly expressed in XSLT, a language which is incomprehensible to most people (me among them). Take a look at, for example, incubator/apertium-en-fr/alt.xsl and you will understand what I mean.

It all boils down to the fact that a .dix is TWO things:

A) A way of expressing linguistic content, which should be relatively easy to maintain
B) Raw input to a finite-state machine processor (lt-comp)

Of course this leaves a gap. We have been closing that gap using XSLT, but I think we should start looking at other, easier-to-understand ways of bridging the gap between A and B.

Jacob

--
Jacob Nordfalk
http://javabog.dk
Teaches Android at http://ihk.dk
Re: [Apertium-stuff] Compound words and dix format
2010/12/21 Francis Tyers fty...@prompsit.com
> On Tue, 21 Dec 2010 at 10:42 +, Francis Tyers wrote:
>> Hi!
>> ...
>> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>>
>> Unnecessary examples:
>
> I was also looking for this, which makes interesting reading:
>
> http://www.stanford.edu/~laurik/fsmbook/clarifications/xmldowntrans.html

Page 2:

    However, I myself don't like XSLT, and I'm not alone. For one thing, it is based on XSL, intended originally as a stylesheet language, and so has a strong bias toward reformatting; the underlying assumption is that the original XML text contains the data you want in the output, and that the problem is basically just to reformat that data in the output. In practice, my own XML-to-SomethingElse downtranslations often involve non-trivial conversions that cannot be handled in XSLT, or which are extremely awkward in XSLT. I find XSLT too limiting; it always seems to be preventing exactly what I want to do. I want the power of a real programming language like Perl or Python while doing downtranslation. Finally, XSLT files are themselves in an XML format, which some people think is a great advantage. I disagree. XML is a Good Thing, but it's possible to take any Good Thing too far. Writing XSLT is a kind of programming, and I dislike programming in XML: it's verbose and not easy for human beings to read.

Exactly my words!! Let's find an alternative to XSLT that suits our needs and build support for that into lt-comp.

--
Jacob Nordfalk
http://javabog.dk
Teaches Android at http://ihk.dk
Re: [Apertium-stuff] Compound words and dix format
2010/12/21 Jacob Nordfalk jacob.nordf...@gmail.com
> 2010/12/21 Francis Tyers fty...@prompsit.com
>> On Tue, 21 Dec 2010 at 10:42 +, Francis Tyers wrote:
>>> Hi!
>>> ...
>>> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>>>
>>> Unnecessary examples:
>>
>> I was also looking for this, which makes interesting reading:
>>
>> http://www.stanford.edu/~laurik/fsmbook/clarifications/xmldowntrans.html
>
> Page 2:

Sorry, page 3

>     However, I myself don't like XSLT, and I'm not alone. For one thing, it is based on XSL, intended originally as a stylesheet language, and so has a strong bias toward reformatting; the underlying assumption is that the original XML text contains the data you want in the output, and that the problem is basically just to reformat that data in the output. In practice, my own XML-to-SomethingElse downtranslations often involve non-trivial conversions that cannot be handled in XSLT, or which are extremely awkward in XSLT. I find XSLT too limiting; it always seems to be preventing exactly what I want to do. I want the power of a real programming language like Perl or Python while doing downtranslation. Finally, XSLT files are themselves in an XML format, which some people think is a great advantage. I disagree. XML is a Good Thing, but it's possible to take any Good Thing too far. Writing XSLT is a kind of programming, and I dislike programming in XML: it's verbose and not easy for human beings to read.
>
> Exactly my words!! Let's find an alternative to XSLT that suits our needs and build support for that into lt-comp.

Why not have a GSoC project on this? The student doesn't have to understand much of the linguistic side. They just need to get the same output as the XSL transformation, but from a metadix rule file with a much simpler syntax.

--
Jacob Nordfalk
http://javabog.dk
Teaches Android at http://ihk.dk
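The "same output as the XSL transformation" requirement could be checked mechanically. The following is a minimal sketch, assuming both the XSLT pipeline and a replacement tool emit dix XML as strings; the helper name `same_dix` is hypothetical. It uses `xml.etree.ElementTree.canonicalize` (Python 3.8 or later) so that attribute order, `<r/>` versus `<r></r>`, and indentation do not count as differences.

```python
from xml.etree.ElementTree import canonicalize


def same_dix(a: str, b: str) -> bool:
    """Return True if two dix fragments are equal after XML canonicalization.

    strip_text=True drops leading/trailing whitespace in text nodes, so
    pretty-printing differences are ignored. Assumption: significant blanks
    in the dictionary are written as <b/>, not as literal spaces.
    """
    return canonicalize(a, strip_text=True) == canonicalize(b, strip_text=True)


xslt_output = '<e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>'
new_output = """<e lm="aanzitten">
  <i>aanz</i>
  <par n="aanz/itten__vblex_sep"></par>
</e>"""
print(same_dix(xslt_output, new_output))
```

A comparison like this would let a student work against existing language pairs as regression tests without having to judge the linguistic content.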
Re: [Apertium-stuff] Compound words and dix format
Francis Tyers fty...@prompsit.com writes:
> Hi!
>
> The problem with this is that there are so many different metadix formats that it will be impossible to come up with one that covers them all. For example, if I remember correctly, how 'alt' works is different in es-pt and in oc-es. I think it was decided that it was desirable to have them functioning differently, or at least that it would require substantial changes in either language pair to get a unified format -- changes that without some push (and let's face it, cash) are not going to get made.
>
> On the other hand, implementing compound words gives us the chance to strike while the iron is hot! We can make a fairly innocuous change (any language pair that does not have compounding will be unaffected) before we get a plethora of different options, and thus avoid the metadix problem for another set of issues.
>
> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>
> Unnecessary examples:
>
>     <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>
>
> Giving:
>
>     <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
>     <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
>     <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>
>
> Or in Tajik:
>
>     <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>

In the unification proposal from http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal the calls would look like

    <e lm="aanzitten"><par n="z/itten__vblex" prms="prefix='aan' pp='aangezeten'"/></e>

and

    <e lm="харидан"><par n="кард/ан__vblex" prms="stem1='харид' stem2='хар'"/></e>

Are there good reasons not to go with that kind of syntax?

--
Kevin Brubeck Unhammer
Re: [Apertium-stuff] Compound words and dix format
On Tue, 21 Dec 2010 at 13:04 +0100, Kevin Brubeck Unhammer wrote:
> Francis Tyers fty...@prompsit.com writes:
>> Hi!
>>
>> The problem with this is that there are so many different metadix formats that it will be impossible to come up with one that covers them all. For example, if I remember correctly, how 'alt' works is different in es-pt and in oc-es. I think it was decided that it was desirable to have them functioning differently, or at least that it would require substantial changes in either language pair to get a unified format -- changes that without some push (and let's face it, cash) are not going to get made.
>>
>> On the other hand, implementing compound words gives us the chance to strike while the iron is hot! We can make a fairly innocuous change (any language pair that does not have compounding will be unaffected) before we get a plethora of different options, and thus avoid the metadix problem for another set of issues.
>>
>> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>>
>> Unnecessary examples:
>>
>>     <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>
>>
>> Giving:
>>
>>     <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
>>     <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
>>     <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>
>>
>> Or in Tajik:
>>
>>     <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>
>
> In the unification proposal from http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal the calls would look like
>
>     <e lm="aanzitten"><par n="z/itten__vblex" prms="prefix='aan' pp='aangezeten'"/></e>
>
> and
>
>     <e lm="харидан"><par n="кард/ан__vblex" prms="stem1='харид' stem2='хар'"/></e>
>
> Are there good reasons not to go with that kind of syntax?

The problem is that what happens after that would be different depending on the language pair. I think one of the points of the unification proposal was to have a single XSL file to do the transformations(?), whereas in this case there would be two.

Fran
Re: [Apertium-stuff] Compound words and dix format
On 21 Dec 2010, at 10:42, Francis Tyers fty...@prompsit.com wrote:
> Hi!
>
> The problem with this is that there are so many different metadix formats that it will be impossible to come up with one that covers them all. For example, if I remember correctly, how 'alt' works is different in es-pt and in oc-es. I think it was decided that it was desirable to have them functioning differently, or at least that it would require substantial changes in either language pair to get a unified format -- changes that without some push (and let's face it, cash) are not going to get made.

Nah. Majority rules. There's one dominant flavour of 'alt'; that one wins. The other can still be transformed in exactly the same way as it is currently without anything being lost, so where's the problem?

'alt' and 'v' are relatively easy - they're not much different to 'r' (though the 'r'-ness would have to be taken into account, which can get confusing). That much could be done by a GCI student. The slightly difficult parts are prm and sa, and I'd really rather see those extended into a templating mechanism, but I haven't been able to find a way of specifying the parameters that isn't worse to use than just copying and modifying pardefs manually.

> On the other hand, implementing compound words gives us the chance to strike while the iron is hot! We can make a fairly innocuous change (any language pair that does not have compounding will be unaffected) before we get a plethora of different options, and thus avoid the metadix problem for another set of issues.
>
> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>
> Unnecessary examples:

See, the difference is that metadix is a generic set of extensions, while you describe a set of language-specific extensions. Not that there's anything wrong with that (other than that it creates a situation where existing knowledge of dix files is less transferable, which is not a major obstacle) - you're just not talking about metadix, so don't mix up the issues.

Both of your examples seem to be predicated on magic happening - pulling secondary paradigm names from the ether. You'd either need to have hard-coded values in the transformation process, or a set of forward declarations that define the mappings. Go for the latter; that would involve less name calling on my part.

>     <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>
>
> Giving:
>
>     <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
>     <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
>     <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>
>
> Or in Tajik:
>
>     <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>
>
> Which would give (after transformation) something like:
>
>     <e lm="харидан"><p><l>харид</l><r>харидан<s n="vblex"/></r></p><par n="кард/ан__vblex"/></e>
>     <e lm="харидан"><p><l>хар</l><r>харидан<s n="vblex"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
>     <e lm="харидан"><p><l>нахар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="prs"/></r></p><par n="к/ард.ан__vblex"/></e>
>     <e lm="харидан"><p><l>нахарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
>     <e lm="харидан"><p><l>мехарид</l><r>харидан<s n="vblex"/></r></p><par n="ме.кард/ан__vblex"/></e>
>     <e lm="харидан"><p><l>мехар</l><r>харидан<s n="vblex"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>
>     <e lm="харидан"><p><l>намехарид</l><r>харидан<s n="vblex"/><s n="neg"/></r></p><par n="кард/ан__vblex"/></e>
>     <e lm="харидан"><p><l>намехар</l><r>харидан<s n="vblex"/><s n="neg"/><s n="pri"/></r></p><par n="к/ард.ан__vblex"/></e>
>
> Fran
>
> PS. Wasn't the election to be organised by Unhammer, Pasquale and Nic?
>
> On Tue, 21 Dec 2010 at 06:31 +0100, Mikel Forcada wrote:
>> Hi Apertiumers,
>>
>> Before any more patches to the dictionary format are made, a general agreement should be reached. Remember that we have different dialects of metadix, and a unification would be desirable before fiddling any more with dictionary formats.
>>
>> Mikel L. Forcada
>>
>> P.S. By the way, as the mandate of the current Project Management Committee has long expired and we haven't been able to run a proper election, I understand I could stage a coup d'etat, put on my BDFL cap, and word the above as a command instead of as an opinion. I'm tempted... anyone interested in
>>
>> On 12/19/2010 09:57 PM, Francis Tyers wrote:
>>> On Sun, 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
>>>> On 19 Dec 2010, at 17:39, Francis Tyers fty...@prompsit.com wrote:
>>>>> It would be nice to get this done before Christmas, are there any comments?
>>>>
>>>> It would probably be best to use a character other than '+'. In the event of the final part of the compound being analysed as a multiword with inner inflection, the queue will be attached to the first part of
Re: [Apertium-stuff] Compound words and dix format
Hi all,

a couple of comments:

(1) I am not advocating the use of XSLT. In fact, I think it is not a good idea to bundle code (even if it is XSLT) with a language pair. I would rather have a compiler that directly understands the higher-level descriptions collectively called metadixes; it wouldn't be difficult to parametrize the different dialects of metadix and pass those parameters on to a compiler that would then do whatever is needed for the dictionary to compile (there is a single transfer compiler for .t1x, .t2x and .t3x, for instance).

(2) I am sure there are common features of all metadixes that can be unified so that transformations embedded into the language pair are minimal and, if possible, declarative rather than procedural. I did have, I think, a unifying proposal for par's and sa's. I would encourage Apertiumers to look for those common features to try to reduce the current anarchy (the other day I found it quite difficult to explain to a student why one used metadixes for en in en-es and didn't use them for es).

I think we should make an effort in both directions. Currently, Apertium does not look good with so many dialects, ad-hoc embedded XSLTs, etc. Also, format proliferation makes it very hard to use automatic tools like apertium-dixtools.

Just my 0.02€ worth!

Mikel

On 12/21/2010 01:43 PM, Jimmy O'Regan wrote:
> On 21 Dec 2010, at 12:04, Kevin Brubeck Unhammer unham...@fsfe.org wrote:
>> Francis Tyers fty...@prompsit.com writes:
>>> Hi!
>>>
>>> The problem with this is that there are so many different metadix formats that it will be impossible to come up with one that covers them all. For example, if I remember correctly, how 'alt' works is different in es-pt and in oc-es. I think it was decided that it was desirable to have them functioning differently, or at least that it would require substantial changes in either language pair to get a unified format -- changes that without some push (and let's face it, cash) are not going to get made.
>>>
>>> On the other hand, implementing compound words gives us the chance to strike while the iron is hot! We can make a fairly innocuous change (any language pair that does not have compounding will be unaffected) before we get a plethora of different options, and thus avoid the metadix problem for another set of issues.
>>>
>>> Btw, thinking about metadix I have some probably unpopular ideas that would preclude any standardisation. I think that maybe we should not have one format, but rather many _codified_ formats depending on the language (group). For example, how to include a verb would be different in Tajik and Dutch, because different things are important.
>>>
>>> Unnecessary examples:
>>>
>>>     <e lm="aanzitten"><par n="z/itten__vblex" prefix="aan" pp="aangezeten"/></e>
>>>
>>> Giving:
>>>
>>>     <e lm="aanzitten"><i>aanz</i><par n="aanz/itten__vblex_sep"/></e>
>>>     <e lm="aanzitten"><p><l>z</l><r>aanz</r></p><par n="z/itten#_aan__vblex_sep"/><p><l><b/>aan</l><r></r></p></e>
>>>     <e lm="aanzitten"><p><l>aangezeten</l><r>aanzitten</r></p><par n="gesproken__vblex_sep"/></e>
>>>
>>> Or in Tajik:
>>>
>>>     <e lm="харидан"><par n="кард/ан__vblex" stem1="харид" stem2="хар"/></e>
>>
>> In the unification proposal from http://wiki.apertium.org/wiki/Unification_of_metadix_and_parametrized_dictionaries#A_unifying_proposal the calls would look like
>>
>>     <e lm="aanzitten"><par n="z/itten__vblex" prms="prefix='aan' pp='aangezeten'"/></e>
>>
>> and
>>
>>     <e lm="харидан"><par n="кард/ан__vblex" prms="stem1='харид' stem2='хар'"/></e>
>>
>> Are there good reasons not to go with that kind of syntax?
>
> The use of the apostrophe, for one thing. It makes the syntax unworkable for several languages. The key/value pairs ought really to be expressed in an XML structure.

--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
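The apostrophe objection quoted above can be made concrete with a small Python sketch. Everything here is an assumption for illustration: `parse_prms` is a naive reader for the `prms="key='value' ..."` syntax, and the `<prm n="..." v="..."/>` child elements are a hypothetical XML structure, not something the unification proposal defines.

```python
import re
import xml.etree.ElementTree as ET

# Naive reader for prms="key='value' key='value'" (illustrative only)
PRM = re.compile(r"(\w+)='([^']*)'")


def parse_prms(prms: str) -> dict:
    """Parse apostrophe-delimited key/value pairs from a prms attribute."""
    return dict(PRM.findall(prms))


def parse_prm_elements(par_xml: str) -> dict:
    """Read parameters from hypothetical <prm n="..." v="..."/> children."""
    par = ET.fromstring(par_xml)
    return {prm.get("n"): prm.get("v") for prm in par.findall("prm")}


# Works for the Dutch example from the thread
print(parse_prms("prefix='aan' pp='aangezeten'"))

# A value that itself contains an apostrophe is cut short
print(parse_prms("stem='l'home'"))

# The same value survives when carried in an XML structure
print(parse_prm_elements('<par n="x"><prm n="stem" v="l\'home"/></par>'))
```

An escaping convention inside `prms` would also work, but it would be one more mini-syntax to document, whereas the XML parser already handles quoting.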
Re: [Apertium-stuff] Compound words and dix format
2010/12/19 Kevin Brubeck Unhammer unham...@fsfe.org
> Francis Tyers fty...@prompsit.com writes:
>> Now we have the java compound word implementation ported to C++ we can probably consider this 'de facto' how we are going to do compounds in lttoolbox -- it is _in use_ and there have been _no alternatives_.

Hey, happy to hear that the Java code is ported to C++ :-)

I would have liked it to be test-driven in more than a single language pair (nn-nb) before going 'final', but well, it has been a year now and I'm just beginning to look at stuff that has been piling up for the last 2 months, so there is no chance I'll make progress on any Esperanto or Danish compounding stuff anyway. Finishing up lttoolbox-java, making language pairs deployable as JAR files (a la .exe), and an Android port would probably come first in line when/if I get the time.

> The original reason for having this difference was that we so far have no examples of forms that can be compound-R but not words on their own, so having those extra identical lines means longer dix files.
>
> However, lttoolbox has this wonderful feature called pardefs :) So what the line for kortet really looks like is this:
>
>     <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p><par n="cp-R"/></e>
>
> where
>
>     <pardef n="cp-R">
>       <!-- can appear in compounds: -->
>       <e><p><l></l><r><c r="R"/></r></p></e>
>       <!-- can appear as a word on its own: -->
>       <e><p><l></l><r></r></p></e>
>     </pardef>
>
> So, if we're deciding on specifications, that's the only thing I'd like to see changed.

Seems fair. Just note that the code requires a non-trivial change. It currently depends on stuff like compoundOnlyLSymbol and compoundRSymbol.

Also, please try to keep the Java code in sync if possible, and create some well-documented tests of the new stuff, like the ones in lttoolbox-java.

Jacob

--
Jacob Nordfalk
http://javabog.dk
Teaches Android at http://ihk.dk
Re: [Apertium-stuff] Compound words and dix format
Hi Apertiumers,

Before any more patches to the dictionary format are made, a general agreement should be reached. Remember that we have different dialects of metadix, and a unification would be desirable before fiddling any more with dictionary formats.

Mikel L. Forcada

P.S. By the way, as the mandate of the current Project Management Committee has long expired and we haven't been able to run a proper election, I understand I could stage a coup d'etat, put on my BDFL cap, and word the above as a command instead of as an opinion. I'm tempted... anyone interested in

On 12/19/2010 09:57 PM, Francis Tyers wrote:
> On Sun, 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
>> On 19 Dec 2010, at 17:39, Francis Tyers fty...@prompsit.com wrote:
>>> It would be nice to get this done before Christmas, are there any comments?
>>
>> It would probably be best to use a character other than '+'. In the event of the final part of the compound being analysed as a multiword with inner inflection, the queue will be attached to the first part of the compound. As you're talking about a syntax change anyway, is there any reason not to insert the break directly?
>
> I guess we could use '~' and then change pretransfer.cc to output $^ for '~' instead of '$ ^' for '+'... Anything else?
>
> Fran

--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
Re: [Apertium-stuff] Compound words and dix format
Francis Tyers fty...@prompsit.com writes:
> Now we have the java compound word implementation ported to C++ we can probably consider this 'de facto' how we are going to do compounds in lttoolbox -- it is _in use_ and there have been _no alternatives_.
>
> So it is probably worth looking at how we are going to represent this nicely in the .dix format. At the moment we use two 'special' symbols:
>
>     <sdef n="compound-only-L" c="for a form that can only appear on the L"/>
>     <sdef n="compound-R" c="for a form that can only appear on the R, or as a word on its own"/>
>
> I propose making a new element <c> for compound, and having one attribute r for restriction.
>
> <s n="compound-only-L"/> would be replaced with <c r="L"/> and <s n="compound-R"/> would be replaced with <c r="R"/>.

I think it would be better if elements with <c r="R"/> are, like <c r="L"/>, compound-only. As the examples below show, an element marked <s n="compound-R"/> currently allows use both in compounds and outside them, while <s n="compound-only-L"/> marks a path that's only reachable in compounds. I think new users would find it less confusing if the two mean the same thing, even though it requires a slightly more explicit dix file.

So instead of

    <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/><c r="L"/></r></p></e>
    <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/></r></p></e>
    <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/><c r="R"/></r></p></e>

you would have to have

    <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/><c r="L"/></r></p></e>
    <e><p><l>plast</l><r>plast<s n="n"/><s n="m"/><s n="sg"/><s n="ind"/></r></p></e>
    <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/><c r="R"/></r></p></e>
    <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p></e>

(Note the beautiful symmetry.)

The original reason for having this difference was that we so far have no examples of forms that can be compound-R but not words on their own, so having those extra identical lines means longer dix files.

However, lttoolbox has this wonderful feature called pardefs :) So what the line for kortet really looks like is this:

    <e><p><l>kortet</l><r>kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/></r></p><par n="cp-R"/></e>

where

    <pardef n="cp-R">
      <!-- can appear in compounds: -->
      <e><p><l></l><r><c r="R"/></r></p></e>
      <!-- can appear as a word on its own: -->
      <e><p><l></l><r></r></p></e>
    </pardef>

So, if we're deciding on specifications, that's the only thing I'd like to see changed.

-Kevin

--
Sent from my Emacs
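To see why the cp-R pardef keeps the dictionary short, here is a rough Python sketch that flattens one entry plus the pardef into the two explicit entries from the symmetric listing. This is an illustration only: lttoolbox compiles pardefs as sub-transducers rather than expanding text, and `CP_R` and `expand_entry` are names made up for this example.

```python
# Right-side suffixes contributed by the hypothetical cp-R pardef:
# one path marked compound-R, one path for the word on its own.
CP_R = ['<c r="R"/>', ""]


def expand_entry(left: str, right: str, pardef: list) -> list:
    """Flatten <e><p>...</p><par n="cp-R"/></e> into explicit entries."""
    return [
        f"<e><p><l>{left}</l><r>{right}{extra}</r></p></e>"
        for extra in pardef
    ]


kortet = expand_entry(
    "kortet",
    'kort<s n="n"/><s n="nt"/><s n="sg"/><s n="def"/>',
    CP_R,
)
for entry in kortet:
    print(entry)
```

One dictionary line with `<par n="cp-R"/>` therefore stands for both kortet lines, so making `<c r="R"/>` compound-only need not double the number of hand-written entries.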
Re: [Apertium-stuff] Compound words and dix format
On Sun, 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
> On 19 Dec 2010, at 17:39, Francis Tyers fty...@prompsit.com wrote:
>> It would be nice to get this done before Christmas, are there any comments?
>
> It would probably be best to use a character other than '+'. In the event of the final part of the compound being analysed as a multiword with inner inflection, the queue will be attached to the first part of the compound. As you're talking about a syntax change anyway, is there any reason not to insert the break directly?

I guess we could use '~' and then change pretransfer.cc to output $^ for '~' instead of '$ ^' for '+'... Anything else?

Fran
Re: [Apertium-stuff] Compound words and dix format
On 19 Dec 2010, at 20:57, Francis Tyers fty...@prompsit.com wrote:
> On Sun, 19 Dec 2010 at 18:28 +, Jimmy O'Regan wrote:
>> On 19 Dec 2010, at 17:39, Francis Tyers fty...@prompsit.com wrote:
>>> It would be nice to get this done before Christmas, are there any comments?
>>
>> It would probably be best to use a character other than '+'. In the event of the final part of the compound being analysed as a multiword with inner inflection, the queue will be attached to the first part of the compound. As you're talking about a syntax change anyway, is there any reason not to insert the break directly?
>
> I guess we could use '~' and then change pretransfer.cc to output $^ for '~' instead of '$ ^' for '+'... Anything else?

Well, having the differentiation would be generally useful, so it would be worth factoring it in aside from compounds. This would need a small change in the tagger too. Can anyone think of a situation where it would need to be treated differently to normal def-mults?
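The '+' versus '~' distinction discussed above can be sketched as a toy splitter in Python. This is not the code in pretransfer.cc: it ignores escaping and the multiword queue entirely, `split_unit` is a made-up name, the '~' behaviour is only the proposal from this thread, and the example lexical units are invented.

```python
def split_unit(unit: str) -> str:
    """Split one ^...$ lexical unit at its join characters.

    '+' (existing multiword join) becomes '$ ^', with a blank between units.
    '~' (proposed compound join)  becomes '$^',  with no blank between units.
    Toy version: escaped characters and the queue are not handled.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a single ^...$ lexical unit")
    return unit.replace("~", "$^").replace("+", "$ ^")


# Invented examples in Apertium stream format
print(split_unit("^de<pr>+el<det><def><m><sg>$"))
print(split_unit("^plast<n><m><sg><ind>~kort<n><nt><sg><def>$"))
```

Keeping the two characters distinct lets later stages tell a compound boundary (no space in the surface form) from an ordinary multiword join.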