Re: [LANG] Add alphabet conversion API

Rob Tompkins Tue, 13 Sep 2016 06:48:46 -0700

> On Sep 13, 2016, at 4:39 AM, Eyal Allweil <eyal_allw...@yahoo.com.INVALID> 
> wrote:
> 
> I've created a JIRA issue, https://issues.apache.org/jira/browse/LANG-1266, 
> and a pull request for this: https://github.com/apache/commons-lang/pull/188
> Regards,
> Eyal
> 
> 
> 
> 
>    On Wednesday, September 7, 2016 5:27 PM, Eyal Allweil 
> <eyal_allw...@yahoo.com> wrote:
> 
> 
> Hi Simo,
> I'm not sure I understood how BitSets would be used in this case. For 
> instance, an example with chars might look like this:
> AlphabetConverter ac = new AlphabetConverter(['a','b','c','d'], 
> ['a','e','f','g'],['a']) // 'a' is not encoded

Hello Eyal,

The first thing that springs to mind here is: are we naming this class 
appropriately? I should preface my naming argument by saying that I come from 
a mathematical background (combinatorics on words). Traditionally in the 
literature, a mapping 

        f: {Kleene Closure A} -> {Kleene Closure B} 

with the property f(StringConcatenate(x,y)) = StringConcatenate(f(x),f(y)) for 
all strings x,y from {Kleene Closure A} is called a “morphism” [1, pg. 8][2]. 
That name is quite terse from an application development point of view, so I’m 
not sure that going with the theoretical name is appropriate here. That said, I 
at least wanted to bring it up so that we can have an open discussion about 
naming.
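
To make the property concrete, here is a minimal sketch (not code from the pull 
request). The class name MorphismSketch and the method apply are hypothetical, 
and the letter mapping is the a -> a, b -> e, c -> f, d -> g example quoted in 
this thread. It applies the mapping char by char and checks that mapping a 
concatenation gives the same result as concatenating the mapped parts.

```java
import java.util.HashMap;
import java.util.Map;

final class MorphismSketch {
    // Hypothetical letter-to-letter mapping from the example in this thread.
    static final Map<Character, Character> LETTERS = new HashMap<>();
    static {
        LETTERS.put('a', 'a');
        LETTERS.put('b', 'e');
        LETTERS.put('c', 'f');
        LETTERS.put('d', 'g');
    }

    // Applies the letter mapping to every char; chars without an entry are kept as-is.
    static String apply(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char mapped = LETTERS.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String x = "ab";
        String y = "cd";
        // Morphism property: f(xy) = f(x)f(y).
        System.out.println(apply(x + y).equals(apply(x) + apply(y))); // prints "true"
    }
}
```

Any mapping that is applied letter by letter satisfies this property, which is 
why the theoretical name fits the proposed class.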

After looking at the code some, the following points come to mind (note: I’m 
not tied to any of the ideas here, just stating thoughts that ran through my 
head):
- There are some stylistic differences that stand out (e.g. "methodName 
  (signature)" as opposed to "methodName(signature)").
- More javadoc?
- Do we need the “doNotEncodeMap”?
- The “.equals” method could use a null check.
- Do we want to accommodate non-invertible or non-decodable encodings (e.g. new 
  AlphabetConverter(['a','b','c','d'],['a','e','f','e'],['a']))?
- Do we want to accommodate alphabets over concatenated chars (e.g. new 
  AlphabetConverter(['ab','c','d','e'],['a','k','hi','z'],[]))?
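
As a rough illustration of the non-invertible case, the sketch below checks 
whether a letter mapping can be decoded unambiguously. The names 
InvertibilityCheck and isInvertible are hypothetical, and pairing original[i] 
with encoding[i] is an assumption made for the example; the proposed 
constructor takes Sets, so the real pairing may differ.

```java
import java.util.HashMap;
import java.util.Map;

final class InvertibilityCheck {
    // Hypothetical helper: pairs original[i] with encoding[i] and reports
    // whether every encoded letter comes from exactly one original letter.
    static boolean isInvertible(char[] original, char[] encoding) {
        if (original.length != encoding.length) {
            return false;
        }
        Map<Character, Character> inverse = new HashMap<>();
        for (int i = 0; i < original.length; i++) {
            Character previous = inverse.put(encoding[i], original[i]);
            // Two different originals sharing one encoded letter cannot be decoded.
            if (previous != null && previous != original[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] original = {'a', 'b', 'c', 'd'};
        System.out.println(isInvertible(original, new char[] {'a', 'e', 'f', 'g'})); // prints "true"
        System.out.println(isInvertible(original, new char[] {'a', 'e', 'f', 'e'})); // prints "false"
    }
}
```

For single-char letters this letter-level check is enough. For alphabets over 
concatenated chars it would not be, since two different letter sequences could 
still produce the same encoded string.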

Personally, I like the idea of generalizing the input/output alphabets, but 
that would seem to require a superclass holding the general implementation, 
with an invertible AlphabetConverter as an extension.
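
One possible shape for that split, purely as a sketch: the class names 
CharMorphism and InvertibleCharMorphism are hypothetical and do not appear in 
the pull request, and the example is restricted to char-to-char mappings. 
Letters that should not be encoded are assumed to map to themselves, and 
letters outside the alphabet are rejected so that decode really undoes encode.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical general case: any letter-to-letter mapping, encode only.
class CharMorphism {
    protected final Map<Character, Character> forward;

    CharMorphism(Map<Character, Character> forward) {
        this.forward = new HashMap<>(forward);
    }

    String encode(String input) {
        return translate(input, forward);
    }

    // Maps each char through the table and rejects chars that have no entry.
    protected static String translate(String input, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            Character mapped = table.get(input.charAt(i));
            if (mapped == null) {
                throw new IllegalArgumentException("Letter outside the alphabet: " + input.charAt(i));
            }
            out.append(mapped.charValue());
        }
        return out.toString();
    }
}

// Hypothetical invertible case: rejects mappings where two letters share an encoding.
final class InvertibleCharMorphism extends CharMorphism {
    private final Map<Character, Character> backward = new HashMap<>();

    InvertibleCharMorphism(Map<Character, Character> forward) {
        super(forward);
        for (Map.Entry<Character, Character> e : forward.entrySet()) {
            if (backward.put(e.getValue(), e.getKey()) != null) {
                throw new IllegalArgumentException("Not invertible: '" + e.getValue() + "' is used twice");
            }
        }
    }

    String decode(String encoded) {
        return translate(encoded, backward);
    }
}
```

With this layout, a mapping such as b -> e, d -> e could still be used through 
CharMorphism for one-way encoding, while InvertibleCharMorphism would refuse it 
at construction time.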

All that said, I’m not particularly tied to any of the ideas, and aside from 
the stylistic changes and the .equals bit, the changes seem quite reasonable. I 
would love to hear other folks’ thoughts on the proposed functionality.
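
For the .equals point, this is the kind of null check I have in mind. It is 
only a sketch: ConverterEqualsSketch and its originalToEncoded field are 
hypothetical stand-ins, not the actual class or fields from the pull request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class ConverterEqualsSketch {
    // Hypothetical field standing in for the converter's internal mapping.
    private final Map<Integer, String> originalToEncoded;

    ConverterEqualsSketch(Map<Integer, String> originalToEncoded) {
        this.originalToEncoded = new HashMap<>(originalToEncoded);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        // Handles null and unrelated types, so equals(null) returns false instead of throwing.
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        ConverterEqualsSketch other = (ConverterEqualsSketch) obj;
        return originalToEncoded.equals(other.originalToEncoded);
    }

    @Override
    public int hashCode() {
        return Objects.hash(originalToEncoded);
    }
}
```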

Cheers,
-Rob

Biblio.
[1] Jean-Paul Allouche and Jeffrey Shallit. Automatic Sequences: Theory, 
Applications, Generalizations. Cambridge University Press, Cambridge, 2003.

[2] https://en.wikipedia.org/wiki/Free_monoid#Morphisms

> 
> and the mapping would become a -> a, b -> e, c -> f, d -> g
> so encoding encode("abc") would become "aef".
> Ints can be used instead of chars to support unicode code points that don't 
> fit in a single char (which was our case, but if that seems overkill, the 
> chars implementation is much more direct).
> How did you mean the BitSet to be used?
> Regards,
> Eyal
> 
> 
> 
>    On Thursday, September 1, 2016 12:26 PM, Simone Tripodi 
> <simonetrip...@apache.org> wrote:
> 
> 
> Hi,
> I personally think it would be a very "nice to have" feature. I had to face 
> similar issues in the past and, if that feature had been available, it would 
> have saved me development time.
> I just have a small request/suggestion: since int/char can be cast to each 
> other, I would use BitSets rather than Sets.
> Good luck!
> -Simo
> 
> http://people.apache.org/~simonetripodi/
> http://twitter.com/simonetripodi
> On Thu, Sep 1, 2016 at 10:53 AM, Eyal Allweil 
> <eyal_allw...@yahoo.com.invalid> wrote:
> 
> Hi guys,
> Would you be interested in adding a utility class that creates alphabet 
> converters, perhaps using a helper method available from StringUtils? It 
> doesn't have to stay the way it is now, but the API for the class - 
> AlphabetConverter - is currently:
> /**
>  * The input is integers representing code points, but we can make it accept
>  * chars as well.
>  *
>  * doNotEncode represents chars we want to leave in the original state (not to
>  * encode them using the chars in encoding).
>  */
> public AlphabetConverter(Set<Integer> original, Set<Integer> encoding, 
> Set<Integer> doNotEncode);
> public String encode (String original);
> 
> public String decode (String encoded);
> In StringUtils, we could add
> 
> public AlphabetConverter getAlphabetConverter (Set<Integer> original, 
> Set<Integer> encoding, Set<Integer> doNotEncode);
> I used it to convert from unicode to latin letters, without using any chars I 
> wanted as delimiters, and preserving the English alphabet as is for 
> readability. If you'd like to add it, I'll clean up the code and prepare it 
> for a pull request so you can review it.
> 
> It makes sense to me to add a method that returns the HashMaps used 
> internally for the mappings so they can be serialized (and deserialized) for 
> preserving the mapping.
> Regards,
> Eyal Allweil (PayPal)
> 
> 
> 
> 
> 
> 
> 
> 
>
