On Sat, Feb 11, 2012 at 01:10:27PM +0100, Xavier Noria wrote:
> Hi gents,
> 
> I am playing around with an idea to improve the performance of singularize
> and pluralize for Rails 4.0. In my proof of concept I see some 5x boost,
> but it relies an assumption that I'd like to consult with you all. Let me
> explain.
> 
> As you know, inflection rules have a lhs which is a string or regexp, and a
> replacement string as rhs.
> 
> The current implementation collects the rules in an array, and to apply
> them to a particular word the array is iterated. The first pattern that
> matches is the one applied. In particular, the most common rule (eg append
> "s" to form a plural), is the *last* one because most specific rules come
> first. By default we have +30 rules for singulars and +30 for plurals.
> 
> My idea is to build a single regexp with an alternation, detect which
> segment matches, and apply its replacement. That is, let the regexp engine
> itself do the loop. Much faster.
> 
> I have this working with a quick hack that has shown there's a potential
> speedup here. To be able to know which regexp is the one that matched I use
> named captures. For example (?<_25>(ax)is) would be the alternation
> corresponding to the regexp /(ax)is/, if that's the 26th pattern. It won't
> win me an elegance award, but it is a hack that works (I could workaround
> name clashes easily if the user happens to use _25 himself, that's not
> important).
> 
> Named captures are the only way I've seen to be able to build the
> alternation and at the same time know which part matches. Because existing
> inflection regexps have captures.

It's possible to count the number of captures in a given regexp:

    def count_captures regexp
      Regexp.union([regexp, //]).match('').captures.length
    end
    
    p count_captures /(ax)is/     # => 1
    p count_captures /((ax)is)/   # => 2
    p count_captures /(?:(ax)is)/ # => 1

With that information, you can calculate offsets.  Calculating offsets
would still result in a linear scan (since you'd have to keep a list of
the offsets, then find the one that matched), and since we can use named
captures on 1.9, I'm not sure an offset based solution would be great.

> This is the proof-of-concept: https://gist.github.com/1798985.

Seems good, but you don't need a *args to Regexp.union. :-)

> I believe this is correct *as long as* the user regexps have no
> backreferences, because if you have a = /(..)\1/ and b = /(.)\1/ then "xx"
> matches b, but does not match the union a|b because the backreference \1 in
> b now refers to the group in a.
> 
> OK, so this is the question: do you guys use backreferences in custom
> inflections? If you didn't we could consider ruling them out for 4.0 to be
> able to implement this.

If you can rule out backreferences, it seems possible you could build a
state machine that would perform in linear time to the string length.
It seems though, with your algorithm, as the string gets longer, the
original implementation is faster:

  https://gist.github.com/1805359

Do we know what kind of data we're dealing with?  Somewhere between
strings of length 10 and strings of length 100, the original code
outperforms this version.  If we're only dealing with shorter strings,
maybe this newer code would be better.

One last comment, the line:

  name = md.names.detect {|n| md[n]}

The length of `names` increases as the number of rules is added to the
system.  Which means in the case of a match, this algorithm will still
be O(n), where n is the number of rules defined.  We can see this by
testing strings that match first and strings that match last:

  https://gist.github.com/1805536

I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
will be worst.

If we're willing to limit the type of expressions provided to the
inflector, it's possible to construct something that matches the string
regardless of the number of rules (entries in "inflections.rb"), and
proportional to the string length.

MEOW MEOW MEOW MEOW

<3 <3 <3 <3 <3 <3

-- 
Aaron Patterson
http://tenderlovemaking.com/

Attachment: pgpDwQJt9cYVv.pgp
Description: PGP signature

Reply via email to