Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
Hi, Petr, Thank you for the pointer to the site. Indeed a treasure trove of ideas on stemmer algorithms. Tuba On Thu, Aug 11, 2011 at 8:45 AM, Petr Gladkikh wrote: > On Mon, Aug 8, 2011 at 1:46 PM, Tuba Lambanog > wrote: > > Hello, > > > > I’m doing a word stemmer for a non-English language. A stemmer parses > > a word into its word parts: prefixes, roots, suffixes. The input word > > is at least a root word (English example would be ‘cloud’), but can be > > any combination of prefix(es) and a root (e.g., 'pre-nuptial'), or a > > root and suffix(es) (‘cloudy’), or all three ('unidirection'). A > > sequence of more than one prefix in a word is considered one > > occurrence of a prefix, and similarly for complex prefixes, thus, > > ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The > > prefixes, roots, and suffixes are in their own set data structure. > > > > The approach I am pursuing is to create a set of potential suffixes > > that the input word contains. Asssume, for simplicity, that the suffix > > set consists of #{-or, -er, -al, -ion, -ional, able}. The input > > ‘directional’ would have the candidate suffix set #{-al –ional}. Now, > > drop the longest suffix (‘ional’) from the input then check the > > remaining string (‘direct’) if it is a root; if it is, done. If not, > > try the next suffix (‘-al’) in the potential suffix set. Prefixes > > will be similarly processed. Input words with both prefixes and > > affixes will be fun to do ;) > > > > I’m having a hard time thinking through the process of generating the > > candidate suffix set using set forms, and I’m beginning to think I > > have selected an arduous path (for me). > > > > Thoughts? > > > > Somehow offtopic maybe, but have you looked at Snowball > http://snowball.tartarus.org/ ? > Algorithm is different but language that is used to describe stemmers > there is almost lisp and may be useful at least as reference. > > -- > Petr Gladkikh > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
On Mon, Aug 8, 2011 at 1:46 PM, Tuba Lambanog wrote: > Hello, > > I’m doing a word stemmer for a non-English language. A stemmer parses > a word into its word parts: prefixes, roots, suffixes. The input word > is at least a root word (English example would be ‘cloud’), but can be > any combination of prefix(es) and a root (e.g., 'pre-nuptial'), or a > root and suffix(es) (‘cloudy’), or all three ('unidirection'). A > sequence of more than one prefix in a word is considered one > occurrence of a prefix, and similarly for complex prefixes, thus, > ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The > prefixes, roots, and suffixes are in their own set data structure. > > The approach I am pursuing is to create a set of potential suffixes > that the input word contains. Asssume, for simplicity, that the suffix > set consists of #{-or, -er, -al, -ion, -ional, able}. The input > ‘directional’ would have the candidate suffix set #{-al –ional}. Now, > drop the longest suffix (‘ional’) from the input then check the > remaining string (‘direct’) if it is a root; if it is, done. If not, > try the next suffix (‘-al’) in the potential suffix set. Prefixes > will be similarly processed. Input words with both prefixes and > affixes will be fun to do ;) > > I’m having a hard time thinking through the process of generating the > candidate suffix set using set forms, and I’m beginning to think I > have selected an arduous path (for me). > > Thoughts? > Somehow offtopic maybe, but have you looked at Snowball http://snowball.tartarus.org/ ? Algorithm is different but language that is used to describe stemmers there is almost lisp and may be useful at least as reference. -- Petr Gladkikh -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
Hi, Ken, Thanks for the suggestion. As I was looking at a suffix tree, it suddenly struck me that the following strategy may do just as well: 1. Use rest and next to generate the tentative suffix sets, thus for "directional", it will give the set of #{irectional rectional ectional ctional tional ional onal nal al l}. 2. Intersection this with the set of suffixes. 3. Select the longest item in the result set. I'm trying this one out now. Tuba On Mon, Aug 8, 2011 at 11:23 AM, Ken Wesson wrote: > On Mon, Aug 8, 2011 at 11:41 AM, Tuba Lambanog > wrote: > > Hi, > > Thank you for the tip. It does look like the Patricia tree -- or suffix > tree > > -- is made-to-order for this kind of task. I'm reading up on it. > > You're welcome. > > > Would there be a Clojure implementation of this technology, I wonder. > > Even if not, it's probably trivial to slap one together, and test it, > in less than a day in Clojure. > > As for generating your candidate seqs of prefixes and suffixes, just > cons onto an initial nil in your reduction* and you'll end up with a > seq that, traversed forwards, goes from longest candidate to shortest. > For suffixes you'll want to (map #(apply str (reverse %)) the-seq), > though, to get the suffixes the right way around (since they'll need > to be stored reversed in their tree). > > -- > Protege: What is this seething mass of parentheses?! > Master: Your father's Lisp REPL. This is the language of a true > hacker. Not as clumsy or random as C++; a language for a more > civilized age. > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
On Mon, Aug 8, 2011 at 11:41 AM, Tuba Lambanog wrote: > Hi, > Thank you for the tip. It does look like the Patricia tree -- or suffix tree > -- is made-to-order for this kind of task. I'm reading up on it. You're welcome. > Would there be a Clojure implementation of this technology, I wonder. Even if not, it's probably trivial to slap one together, and test it, in less than a day in Clojure. As for generating your candidate seqs of prefixes and suffixes, just cons onto an initial nil in your reduction* and you'll end up with a seq that, traversed forwards, goes from longest candidate to shortest. For suffixes you'll want to (map #(apply str (reverse %)) the-seq), though, to get the suffixes the right way around (since they'll need to be stored reversed in their tree). -- Protege: What is this seething mass of parentheses?! Master: Your father's Lisp REPL. This is the language of a true hacker. Not as clumsy or random as C++; a language for a more civilized age. -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
Hi, Thank you for the tip. It does look like the Patricia tree -- or suffix tree -- is made-to-order for this kind of task. I'm reading up on it. Would there be a Clojure implementation of this technology, I wonder. Tuba On Mon, Aug 8, 2011 at 1:40 AM, Ken Wesson wrote: > On Mon, Aug 8, 2011 at 2:46 AM, Tuba Lambanog > wrote: > > I’m having a hard time thinking through the process of generating the > > candidate suffix set using set forms, and I’m beginning to think I > > have selected an arduous path (for me). > > > > Thoughts? > > Store the prefixes in a patricia tree, and the reversed suffixes in > another patricia tree. For suffixes, start at the end of the word and > walk backward while traversing the suffix tree until hitting a leaf. > Each node traversed (including the root, which is the empty string) is > a potential suffix and you traverse them in short-to-long order, so > reverse that to get them in long-to-short order. The case for prefixes > is analogous except you start at the start of the word and walk > forward while traversing the prefix tree. "No suffix" and "No prefix" > needn't be handled as special cases; they are just the empty string as > suffix or prefix, of length zero. > > -- > Protege: What is this seething mass of parentheses?! > Master: Your father's Lisp REPL. This is the language of a true > hacker. Not as clumsy or random as C++; a language for a more > civilized age. > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
Hi, Andreas, << I don't quite understand what you mean by "I’m having a hard time thinking through the process of generating the candidate suffix set using set forms" >> It is my usual roundabout way of saying "I don't know how to do this." ;) I'm looking at your code as we speak. Thanks, Tuba On Mon, Aug 8, 2011 at 1:13 AM, Andreas Kostler < andreas.koest...@leica-geosystems.com> wrote: > Hi Tuba, > I don't quite understand what you mean by "I’m having a hard time > thinking through the process of generating the > candidate suffix set using set forms" but I have created a porter > stemmer for English in the past. > I understand that's not what you're looking for but it is moreso a > framwork for building stemmers: > > You specify rules of the like: > {:c? condition :s1 "abc" :s2 "efg" :a action} > reading if condition is met, replace s1 with s2 and execute action. > Where s1 could be a suffix etc. All you need to do is specify these rules. > Have a browse > https://github.com/AndreasKostler/Stout > > Cheers > Andreas > > > On 8 August 2011 16:16, Tuba Lambanog wrote: > > > > Hello, > > > > I’m doing a word stemmer for a non-English language. A stemmer parses > > a word into its word parts: prefixes, roots, suffixes. The input word > > is at least a root word (English example would be ‘cloud’), but can be > > any combination of prefix(es) and a root (e.g., 'pre-nuptial'), or a > > root and suffix(es) (‘cloudy’), or all three ('unidirection'). A > > sequence of more than one prefix in a word is considered one > > occurrence of a prefix, and similarly for complex prefixes, thus, > > ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The > > prefixes, roots, and suffixes are in their own set data structure. > > > > The approach I am pursuing is to create a set of potential suffixes > > that the input word contains. Asssume, for simplicity, that the suffix > > set consists of #{-or, -er, -al, -ion, -ional, able}. The input > > ‘directional’ would have the candidate suffix set #{-al –ional}. Now, > > drop the longest suffix (‘ional’) from the input then check the > > remaining string (‘direct’) if it is a root; if it is, done. If not, > > try the next suffix (‘-al’) in the potential suffix set. Prefixes > > will be similarly processed. Input words with both prefixes and > > affixes will be fun to do ;) > > > > I’m having a hard time thinking through the process of generating the > > candidate suffix set using set forms, and I’m beginning to think I > > have selected an arduous path (for me). > > > > Thoughts? > > > > Thanks. > > Tuba > > > > -- > > You received this message because you are subscribed to the Google > > Groups "Clojure" group. > > To post to this group, send email to clojure@googlegroups.com > > Note that posts from new members are moderated - please be patient with > your first post. > > To unsubscribe from this group, send email to > > clojure+unsubscr...@googlegroups.com > > For more options, visit this group at > > http://groups.google.com/group/clojure?hl=en > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
On Mon, Aug 8, 2011 at 2:46 AM, Tuba Lambanog wrote: > I’m having a hard time thinking through the process of generating the > candidate suffix set using set forms, and I’m beginning to think I > have selected an arduous path (for me). > > Thoughts? Store the prefixes in a patricia tree, and the reversed suffixes in another patricia tree. For suffixes, start at the end of the word and walk backward while traversing the suffix tree until hitting a leaf. Each node traversed (including the root, which is the empty string) is a potential suffix and you traverse them in short-to-long order, so reverse that to get them in long-to-short order. The case for prefixes is analogous except you start at the start of the word and walk forward while traversing the prefix tree. "No suffix" and "No prefix" needn't be handled as special cases; they are just the empty string as suffix or prefix, of length zero. -- Protege: What is this seething mass of parentheses?! Master: Your father's Lisp REPL. This is the language of a true hacker. Not as clumsy or random as C++; a language for a more civilized age. -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: "Elegant tools deserve elegant solutions." -- L. E. Gant
Hi Tuba, I don't quite understand what you mean by "I’m having a hard time thinking through the process of generating the candidate suffix set using set forms" but I have created a porter stemmer for English in the past. I understand that's not what you're looking for but it is moreso a framwork for building stemmers: You specify rules of the like: {:c? condition :s1 "abc" :s2 "efg" :a action} reading if condition is met, replace s1 with s2 and execute action. Where s1 could be a suffix etc. All you need to do is specify these rules. Have a browse https://github.com/AndreasKostler/Stout Cheers Andreas On 8 August 2011 16:16, Tuba Lambanog wrote: > > Hello, > > I’m doing a word stemmer for a non-English language. A stemmer parses > a word into its word parts: prefixes, roots, suffixes. The input word > is at least a root word (English example would be ‘cloud’), but can be > any combination of prefix(es) and a root (e.g., 'pre-nuptial'), or a > root and suffix(es) (‘cloudy’), or all three ('unidirection'). A > sequence of more than one prefix in a word is considered one > occurrence of a prefix, and similarly for complex prefixes, thus, > ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The > prefixes, roots, and suffixes are in their own set data structure. > > The approach I am pursuing is to create a set of potential suffixes > that the input word contains. Asssume, for simplicity, that the suffix > set consists of #{-or, -er, -al, -ion, -ional, able}. The input > ‘directional’ would have the candidate suffix set #{-al –ional}. Now, > drop the longest suffix (‘ional’) from the input then check the > remaining string (‘direct’) if it is a root; if it is, done. If not, > try the next suffix (‘-al’) in the potential suffix set. Prefixes > will be similarly processed. Input words with both prefixes and > affixes will be fun to do ;) > > I’m having a hard time thinking through the process of generating the > candidate suffix set using set forms, and I’m beginning to think I > have selected an arduous path (for me). > > Thoughts? > > Thanks. > Tuba > > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with your > first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
"Elegant tools deserve elegant solutions." -- L. E. Gant
Hello, I’m doing a word stemmer for a non-English language. A stemmer parses a word into its word parts: prefixes, roots, suffixes. The input word is at least a root word (English example would be ‘cloud’), but can be any combination of prefix(es) and a root (e.g., 'pre-nuptial'), or a root and suffix(es) (‘cloudy’), or all three ('unidirection'). A sequence of more than one prefix in a word is considered one occurrence of a prefix, and similarly for complex prefixes, thus, ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The prefixes, roots, and suffixes are in their own set data structure. The approach I am pursuing is to create a set of potential suffixes that the input word contains. Asssume, for simplicity, that the suffix set consists of #{-or, -er, -al, -ion, -ional, able}. The input ‘directional’ would have the candidate suffix set #{-al –ional}. Now, drop the longest suffix (‘ional’) from the input then check the remaining string (‘direct’) if it is a root; if it is, done. If not, try the next suffix (‘-al’) in the potential suffix set. Prefixes will be similarly processed. Input words with both prefixes and affixes will be fun to do ;) I’m having a hard time thinking through the process of generating the candidate suffix set using set forms, and I’m beginning to think I have selected an arduous path (for me). Thoughts? Thanks. Tuba -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en