[jira] [Commented] (CODEC-125) Implement a Beider-Morse phonetic matching codec

Matthew Pocock (JIRA) Mon, 08 Aug 2011 06:46:53 -0700

    [ 
https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080943#comment-13080943
 ]


Matthew Pocock commented on CODEC-125:
--------------------------------------

Hi,

* Wouldn't it make sense to add surnames with accented chars to the 
PhoneticEngineTest class? For example: Schäffer (German), Győrössy 
(Hungarian), Mészáros (Hungarian).

Yes. I'd love to see more names. I'm not a linguist of any kind, so I can 
only work with the names people suggest.

* I know it won't increase the code coverage, but it would probably increase 
the "resource coverage", if you know what I mean.

I know exactly what you're getting at. However, there are a great many rules. 
It will be significant work to test each one of them.

* Something is still wrong with the performance.
* An interesting issue I see is that the current speed test uses almost 30MB 
of memory creating 1.9m Rule anonymous inner class instances (see attached). 
GC'ing these objects might explain the wild swings in performance.
* Wow. This must be due to lots of objects being generated. The #1 object 
generated is String and #2 is our AppendableCharSequence.

My performance rewrite traded a lot of string creation for 
AppendableCharSequence. This is because at each step, a processed prefix may 
get applied to a rule that 'forks' it into a number of new alternatives. These 
alternatives themselves may be 'forked' and so on. I can't think of a way to 
reduce the number of these AppendableCharSequence objects. However, it may be 
possible to reduce the per-instance cost and also to look at where all the 
strings are coming from. Most of these objects should be very short-lived, and 
I'd hope that on Java 7 some of them would be stack-allocated or optimised 
away entirely by escape analysis.
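
To make the forking concrete, here is a minimal sketch in Java. It is not the 
codec's actual code: the class ForkSketch, the method fork and the sample 
alternatives are all hypothetical, and plain String stands in for 
AppendableCharSequence. It only shows why the number of intermediate objects 
grows with every rule that offers more than one alternative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ForkSketch {

    /**
     * Hypothetical stand-in for applying one rule: every processed prefix is
     * combined with every alternative the rule offers.
     */
    static List<String> fork(List<String> prefixes, List<String> alternatives) {
        List<String> out = new ArrayList<>(prefixes.size() * alternatives.size());
        for (String prefix : prefixes) {
            for (String alt : alternatives) {
                // One new object per (prefix, alternative) pair.
                out.add(prefix + alt);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up alternatives for three successive rule applications.
        List<List<String>> steps = Arrays.asList(
                Arrays.asList("s", "S"),
                Arrays.asList("e", "a", "i"),
                Arrays.asList("f", "v"));

        List<String> prefixes = Collections.singletonList("");
        int created = 0;
        for (List<String> alternatives : steps) {
            prefixes = fork(prefixes, alternatives);
            created += prefixes.size();
            System.out.println(prefixes.size() + " live prefixes, "
                    + created + " objects created so far");
        }
    }
}
```

Only the final list of prefixes is needed at the end; everything created in 
earlier steps is garbage as soon as the next step has consumed it, which is 
the short-lived allocation pattern described above.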

I'm firing up my profiler in 'memory' mode. I'll get back to you if I make 
progress.


> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
>                 Key: CODEC-125
>                 URL: https://issues.apache.org/jira/browse/CODEC-125
>             Project: Commons Codec
>          Issue Type: New Feature
>            Reporter: Matthew Pocock
>            Priority: Minor
>         Attachments: Rule$4$1-All_Objects.html, acz.patch, bm-gg.diff, 
> bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, 
> bmpm.patch, bmpm.patch, fixmeInvariant.patch, handleH.patch, majorFix.patch, 
> performanceAndBugs.patch, testEncodeGna.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the 
> commons-codec svn trunk. I would like to contribute this to commons-codec.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
