Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-11 Thread Tuba Lambanog
Hi, Petr,
Thank you for the pointer to the site. Indeed a treasure trove of ideas on
stemmer algorithms.
Tuba

On Thu, Aug 11, 2011 at 8:45 AM, Petr Gladkikh  wrote:

> On Mon, Aug 8, 2011 at 1:46 PM, Tuba Lambanog 
> wrote:
> > Hello,
> >
> > I’m doing a word stemmer for a non-English language. A stemmer parses
> > a word into its word parts: prefixes, roots, suffixes. The input word
> > is at least a root word (English example would be ‘cloud’), but can be
> > any combination of  prefix(es) and a root (e.g., 'pre-nuptial'), or a
> > root and suffix(es) (‘cloudy’), or all three ('unidirection'). A
> > sequence of more than one prefix in a word is considered one
> > occurrence of a prefix, and similarly for complex prefixes, thus,
> > ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
> > prefixes, roots, and suffixes are in their own set data structure.
> >
> > The approach I am pursuing is to create a set of potential suffixes
> > that the input word contains. Asssume, for simplicity, that the suffix
> > set consists of #{-or, -er, -al, -ion, -ional, able}. The input
> > ‘directional’ would have the candidate suffix set #{-al –ional}. Now,
> > drop the longest suffix (‘ional’) from the input then check the
> > remaining string (‘direct’) if it is a root; if it is, done. If not,
> > try the next suffix (‘-al’) in the potential suffix set.  Prefixes
> > will be similarly processed. Input words with both prefixes and
> > affixes will be fun to do ;)
> >
> > I’m having a hard time thinking through the process of generating the
> > candidate suffix set using set forms, and I’m beginning to think I
> > have selected an arduous path (for me).
> >
> > Thoughts?
> >
>
> Somehow offtopic maybe, but have you looked at Snowball
> http://snowball.tartarus.org/ ?
> Algorithm is different but language that is used to describe stemmers
> there is almost lisp and may be useful at least as reference.
>
> --
> Petr Gladkikh
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-11 Thread Petr Gladkikh
On Mon, Aug 8, 2011 at 1:46 PM, Tuba Lambanog  wrote:
> Hello,
>
> I’m doing a word stemmer for a non-English language. A stemmer parses
> a word into its word parts: prefixes, roots, suffixes. The input word
> is at least a root word (English example would be ‘cloud’), but can be
> any combination of  prefix(es) and a root (e.g., 'pre-nuptial'), or a
> root and suffix(es) (‘cloudy’), or all three ('unidirection'). A
> sequence of more than one prefix in a word is considered one
> occurrence of a prefix, and similarly for complex prefixes, thus,
> ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
> prefixes, roots, and suffixes are in their own set data structure.
>
> The approach I am pursuing is to create a set of potential suffixes
> that the input word contains. Asssume, for simplicity, that the suffix
> set consists of #{-or, -er, -al, -ion, -ional, able}. The input
> ‘directional’ would have the candidate suffix set #{-al –ional}. Now,
> drop the longest suffix (‘ional’) from the input then check the
> remaining string (‘direct’) if it is a root; if it is, done. If not,
> try the next suffix (‘-al’) in the potential suffix set.  Prefixes
> will be similarly processed. Input words with both prefixes and
> affixes will be fun to do ;)
>
> I’m having a hard time thinking through the process of generating the
> candidate suffix set using set forms, and I’m beginning to think I
> have selected an arduous path (for me).
>
> Thoughts?
>

Somehow offtopic maybe, but have you looked at Snowball
http://snowball.tartarus.org/ ?
Algorithm is different but language that is used to describe stemmers
there is almost lisp and may be useful at least as reference.

-- 
Petr Gladkikh

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-08 Thread Resty Cena
Hi, Ken,

Thanks for the suggestion.

As I was looking at a suffix tree, it suddenly struck me that the following
strategy may do just as well:

1. Use rest and next to generate the tentative suffix sets, thus for
"directional", it will give the set of #{irectional rectional ectional
ctional  tional ional onal nal al l}.

2. Intersection this with the set of suffixes.

3. Select the longest item in the result set.

I'm trying this one out now.

Tuba


On Mon, Aug 8, 2011 at 11:23 AM, Ken Wesson  wrote:

> On Mon, Aug 8, 2011 at 11:41 AM, Tuba Lambanog 
> wrote:
> > Hi,
> > Thank you for the tip. It does look like the Patricia tree -- or suffix
> tree
> > -- is made-to-order for this kind of task. I'm reading up on it.
>
> You're welcome.
>
> > Would there be a Clojure implementation of this technology, I wonder.
>
> Even if not, it's probably trivial to slap one together, and test it,
> in less than a day in Clojure.
>
> As for generating your candidate seqs of prefixes and suffixes, just
> cons onto an initial nil in your reduction* and you'll end up with a
> seq that, traversed forwards, goes from longest candidate to shortest.
> For suffixes you'll want to (map #(apply str (reverse %)) the-seq),
> though, to get the suffixes the right way around (since they'll need
> to be stored reversed in their tree).
>
> --
> Protege: What is this seething mass of parentheses?!
> Master: Your father's Lisp REPL. This is the language of a true
> hacker. Not as clumsy or random as C++; a language for a more
> civilized age.
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-08 Thread Ken Wesson
On Mon, Aug 8, 2011 at 11:41 AM, Tuba Lambanog  wrote:
> Hi,
> Thank you for the tip. It does look like the Patricia tree -- or suffix tree
> -- is made-to-order for this kind of task. I'm reading up on it.

You're welcome.

> Would there be a Clojure implementation of this technology, I wonder.

Even if not, it's probably trivial to slap one together, and test it,
in less than a day in Clojure.

As for generating your candidate seqs of prefixes and suffixes, just
cons onto an initial nil in your reduction* and you'll end up with a
seq that, traversed forwards, goes from longest candidate to shortest.
For suffixes you'll want to (map #(apply str (reverse %)) the-seq),
though, to get the suffixes the right way around (since they'll need
to be stored reversed in their tree).

-- 
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-08 Thread Tuba Lambanog
Hi,
Thank you for the tip. It does look like the Patricia tree -- or suffix tree
-- is made-to-order for this kind of task. I'm reading up on it. Would there
be a Clojure implementation of this technology, I wonder.
Tuba

On Mon, Aug 8, 2011 at 1:40 AM, Ken Wesson  wrote:

> On Mon, Aug 8, 2011 at 2:46 AM, Tuba Lambanog 
> wrote:
> > I’m having a hard time thinking through the process of generating the
> > candidate suffix set using set forms, and I’m beginning to think I
> > have selected an arduous path (for me).
> >
> > Thoughts?
>
> Store the prefixes in a patricia tree, and the reversed suffixes in
> another patricia tree. For suffixes, start at the end of the word and
> walk backward while traversing the suffix tree until hitting a leaf.
> Each node traversed (including the root, which is the empty string) is
> a potential suffix and you traverse them in short-to-long order, so
> reverse that to get them in long-to-short order. The case for prefixes
> is analogous except you start at the start of the word and walk
> forward while traversing the prefix tree. "No suffix" and "No prefix"
> needn't be handled as special cases; they are just the empty string as
> suffix or prefix, of length zero.
>
> --
> Protege: What is this seething mass of parentheses?!
> Master: Your father's Lisp REPL. This is the language of a true
> hacker. Not as clumsy or random as C++; a language for a more
> civilized age.
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-08 Thread Resty Cena
Hi, Andreas,

<< I don't quite understand what you mean by "I’m having a hard time
thinking through the process of generating the
candidate suffix set using set forms" >>

It is my usual roundabout way of saying "I don't know how to do this." ;)

I'm looking at your code as we speak.

Thanks,
Tuba

On Mon, Aug 8, 2011 at 1:13 AM, Andreas Kostler <
andreas.koest...@leica-geosystems.com> wrote:

> Hi Tuba,
> I don't quite understand what you mean by "I’m having a hard time
> thinking through the process of generating the
> candidate suffix set using set forms" but I have created a porter
> stemmer for English in the past.
> I understand that's not what you're looking for but it is moreso a
> framwork for building stemmers:
>
> You specify rules of the like:
> {:c? condition :s1 "abc" :s2 "efg" :a action}
> reading if condition is met, replace s1 with s2 and execute action.
> Where s1 could be a suffix etc. All you need to do is specify these rules.
> Have a browse
> https://github.com/AndreasKostler/Stout
>
> Cheers
> Andreas
>
>
> On 8 August 2011 16:16, Tuba Lambanog  wrote:
> >
> > Hello,
> >
> > I’m doing a word stemmer for a non-English language. A stemmer parses
> > a word into its word parts: prefixes, roots, suffixes. The input word
> > is at least a root word (English example would be ‘cloud’), but can be
> > any combination of  prefix(es) and a root (e.g., 'pre-nuptial'), or a
> > root and suffix(es) (‘cloudy’), or all three ('unidirection'). A
> > sequence of more than one prefix in a word is considered one
> > occurrence of a prefix, and similarly for complex prefixes, thus,
> > ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
> > prefixes, roots, and suffixes are in their own set data structure.
> >
> > The approach I am pursuing is to create a set of potential suffixes
> > that the input word contains. Asssume, for simplicity, that the suffix
> > set consists of #{-or, -er, -al, -ion, -ional, able}. The input
> > ‘directional’ would have the candidate suffix set #{-al –ional}. Now,
> > drop the longest suffix (‘ional’) from the input then check the
> > remaining string (‘direct’) if it is a root; if it is, done. If not,
> > try the next suffix (‘-al’) in the potential suffix set.  Prefixes
> > will be similarly processed. Input words with both prefixes and
> > affixes will be fun to do ;)
> >
> > I’m having a hard time thinking through the process of generating the
> > candidate suffix set using set forms, and I’m beginning to think I
> > have selected an arduous path (for me).
> >
> > Thoughts?
> >
> > Thanks.
> > Tuba
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "Clojure" group.
> > To post to this group, send email to clojure@googlegroups.com
> > Note that posts from new members are moderated - please be patient with
> your first post.
> > To unsubscribe from this group, send email to
> > clojure+unsubscr...@googlegroups.com
> > For more options, visit this group at
> > http://groups.google.com/group/clojure?hl=en
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-08 Thread Ken Wesson
On Mon, Aug 8, 2011 at 2:46 AM, Tuba Lambanog  wrote:
> I’m having a hard time thinking through the process of generating the
> candidate suffix set using set forms, and I’m beginning to think I
> have selected an arduous path (for me).
>
> Thoughts?

Store the prefixes in a patricia tree, and the reversed suffixes in
another patricia tree. For suffixes, start at the end of the word and
walk backward while traversing the suffix tree until hitting a leaf.
Each node traversed (including the root, which is the empty string) is
a potential suffix and you traverse them in short-to-long order, so
reverse that to get them in long-to-short order. The case for prefixes
is analogous except you start at the start of the word and walk
forward while traversing the prefix tree. "No suffix" and "No prefix"
needn't be handled as special cases; they are just the empty string as
suffix or prefix, of length zero.

-- 
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: "Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-08 Thread Andreas Kostler
Hi Tuba,
I don't quite understand what you mean by "I’m having a hard time
thinking through the process of generating the
candidate suffix set using set forms" but I have created a porter
stemmer for English in the past.
I understand that's not what you're looking for but it is moreso a
framwork for building stemmers:

You specify rules of the like:
{:c? condition :s1 "abc" :s2 "efg" :a action}
reading if condition is met, replace s1 with s2 and execute action.
Where s1 could be a suffix etc. All you need to do is specify these rules.
Have a browse
https://github.com/AndreasKostler/Stout

Cheers
Andreas


On 8 August 2011 16:16, Tuba Lambanog  wrote:
>
> Hello,
>
> I’m doing a word stemmer for a non-English language. A stemmer parses
> a word into its word parts: prefixes, roots, suffixes. The input word
> is at least a root word (English example would be ‘cloud’), but can be
> any combination of  prefix(es) and a root (e.g., 'pre-nuptial'), or a
> root and suffix(es) (‘cloudy’), or all three ('unidirection'). A
> sequence of more than one prefix in a word is considered one
> occurrence of a prefix, and similarly for complex prefixes, thus,
> ‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
> prefixes, roots, and suffixes are in their own set data structure.
>
> The approach I am pursuing is to create a set of potential suffixes
> that the input word contains. Asssume, for simplicity, that the suffix
> set consists of #{-or, -er, -al, -ion, -ional, able}. The input
> ‘directional’ would have the candidate suffix set #{-al –ional}. Now,
> drop the longest suffix (‘ional’) from the input then check the
> remaining string (‘direct’) if it is a root; if it is, done. If not,
> try the next suffix (‘-al’) in the potential suffix set.  Prefixes
> will be similarly processed. Input words with both prefixes and
> affixes will be fun to do ;)
>
> I’m having a hard time thinking through the process of generating the
> candidate suffix set using set forms, and I’m beginning to think I
> have selected an arduous path (for me).
>
> Thoughts?
>
> Thanks.
> Tuba
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


"Elegant tools deserve elegant solutions." -- L. E. Gant

2011-08-07 Thread Tuba Lambanog
Hello,

I’m doing a word stemmer for a non-English language. A stemmer parses
a word into its word parts: prefixes, roots, suffixes. The input word
is at least a root word (English example would be ‘cloud’), but can be
any combination of  prefix(es) and a root (e.g., 'pre-nuptial'), or a
root and suffix(es) (‘cloudy’), or all three ('unidirection'). A
sequence of more than one prefix in a word is considered one
occurrence of a prefix, and similarly for complex prefixes, thus,
‘directional’ is considered to have the ‘single’ suffix ‘ional’. The
prefixes, roots, and suffixes are in their own set data structure.

The approach I am pursuing is to create a set of potential suffixes
that the input word contains. Asssume, for simplicity, that the suffix
set consists of #{-or, -er, -al, -ion, -ional, able}. The input
‘directional’ would have the candidate suffix set #{-al –ional}. Now,
drop the longest suffix (‘ional’) from the input then check the
remaining string (‘direct’) if it is a root; if it is, done. If not,
try the next suffix (‘-al’) in the potential suffix set.  Prefixes
will be similarly processed. Input words with both prefixes and
affixes will be fun to do ;)

I’m having a hard time thinking through the process of generating the
candidate suffix set using set forms, and I’m beginning to think I
have selected an arduous path (for me).

Thoughts?

Thanks.
Tuba

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en