Re: Prepare String

Emmanuel Lécharny Fri, 01 Apr 2016 03:10:56 -0700

Continuing my investigation on PrepareString, here is where I am, and
some thoughts about where we should go.


After having re-read RFC 4517 and RFC 4518, which are quite complex, I
came to realize that it's not at all a good idea to have one single
method to prepare the Strings that have to be prepared. There are many
different use cases, and differentiating between them in the prepare()
method would be stupid.

The EQUALITY, ORDERING and SUBSTRING matchingRules have a reference to a
Normalizer which should be the class that is responsible for the String
preparation (if needed). This Normalizer will depend on the
PrepareString static methods to prepare the String, so the PrepareString
class will expose those methods (Transcode, Map, etc).
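
To make that split concrete, here is a minimal sketch, not the actual API code. The class name PrepareStringSketch, the method signatures and the caseIgnoreNormalize() helper are assumptions made for the example. The Map and space handling steps are heavily simplified, and Prohibit and Check bidi are left out. Only StandardCharsets and java.text.Normalizer are real JDK classes.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical names throughout: this is not the real PrepareString class.
final class PrepareStringSketch {

    // Transcode: decode the bytes received on the wire as UTF-8.
    public static String transcode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Map (heavily simplified): the real step handles a long table of
    // special chars; here we only lower-case when case folding is requested.
    public static String map(String value, boolean caseFold) {
        return caseFold ? value.toLowerCase(Locale.ROOT) : value;
    }

    // Normalize: NFKC, using the JDK's java.text.Normalizer.
    public static String normalize(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFKC);
    }

    // Insignificant Space Handling (simplified): drop leading/trailing
    // whitespace and collapse inner runs of spaces to a single space.
    // RFC 4518 defines the exact output form; this is only an approximation.
    public static String insignificantSpaceHandling(String value) {
        return value.trim().replaceAll(" +", " ");
    }

    // What a Normalizer attached to a caseIgnoreMatch-like rule could do:
    // pick the steps and options that apply to that rule, in order.
    public static String caseIgnoreNormalize(byte[] raw) {
        String s = transcode(raw);
        s = map(s, true);
        s = normalize(s);
        return insignificantSpaceHandling(s);
    }

    public static void main(String[] args) {
        byte[] raw = "  Hello   World ".getBytes(StandardCharsets.UTF_8);
        System.out.println("[" + caseIgnoreNormalize(raw) + "]");
    }
}
```

The point of the design is that each Normalizer only decides which steps to call and with which options, while the steps themselves live in one place.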

Btw, here are the MatchingRules that are to use the String Preparation
algorithm :


EQUALITY MR :

      caseExactMatch : no case fold in Map, only Insignificant Space
        Handling is applied in the Insignificant Character Handling step
      caseExactIA5Match : no case fold in Map, only Insignificant Space
        Handling is applied in the Insignificant Character Handling step
      caseIgnoreMatch : case fold in Map, only Insignificant Space
        Handling is applied in the Insignificant Character Handling step
      caseIgnoreIA5Match : case fold in Map, only Insignificant Space
        Handling is applied in the Insignificant Character Handling step
      caseIgnoreListMatch : apply caseIgnoreMatch on each element
      directoryStringFirstComponentMatch : apply the associated element
        MatchingRule
      numericStringMatch : no case fold in Map, only numericString
        Insignificant Character Handling is applied in the Insignificant
        Character Handling step
      telephoneNumberMatch : case fold in Map, only telephoneNumber
        Insignificant Character Handling is applied in the Insignificant
        Character Handling step
      wordMatch : apply caseIgnoreMatch on each element

ORDERING MR :

      caseExactOrderingMatch : no case fold in Map, only Insignificant
        Space Handling is applied in the Insignificant Character Handling
        step
      caseIgnoreOrderingMatch : case fold in Map, only Insignificant
        Space Handling is applied in the Insignificant Character Handling
        step
      numericStringOrderingMatch : no case fold in Map, only numericString
        Insignificant Character Handling is applied in the Insignificant
        Character Handling step

SUBSTRING MR : 

      caseExactSubstringsMatch : no case fold in Map, only Insignificant
        Space Handling is applied in the Insignificant Character Handling
        step
      caseIgnoreIA5SubstringsMatch : case fold in Map, only Insignificant
        Space Handling is applied in the Insignificant Character Handling
        step
      caseIgnoreListSubstringsMatch : apply caseIgnoreMatch on each element
      caseIgnoreSubstringsMatch : case fold in Map, only Insignificant
        Space Handling is applied in the Insignificant Character Handling
        step
      numericStringSubstringsMatch : no case fold in Map, only
        numericString Insignificant Character Handling is applied in the
        Insignificant Character Handling step
      telephoneNumberSubstringsMatch : case fold in Map, only
        telephoneNumber Insignificant Character Handling is applied in the
        Insignificant Character Handling step
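
The lists above boil down to two options per rule: whether Map folds case, and which flavour of Insignificant Character Handling applies. A small sketch of how that could be captured as data follows. The type names MatchingRuleProfiles, Handling and Profile are hypothetical, and only three of the rules are shown.

```java
// Hypothetical types: a per-rule description of the preparation options.
final class MatchingRuleProfiles {

    // Which Insignificant Character Handling variant a rule uses.
    enum Handling { INSIGNIFICANT_SPACE, NUMERIC_STRING }

    enum Profile {
        CASE_EXACT_MATCH("caseExactMatch", false, Handling.INSIGNIFICANT_SPACE),
        CASE_IGNORE_MATCH("caseIgnoreMatch", true, Handling.INSIGNIFICANT_SPACE),
        NUMERIC_STRING_MATCH("numericStringMatch", false, Handling.NUMERIC_STRING);

        final String ruleName;   // name of the MatchingRule
        final boolean caseFold;  // does the Map step fold case?
        final Handling handling; // last step variant

        Profile(String ruleName, boolean caseFold, Handling handling) {
            this.ruleName = ruleName;
            this.caseFold = caseFold;
            this.handling = handling;
        }

        // Look a profile up by MatchingRule name (exact match on the name).
        static Profile forRule(String ruleName) {
            for (Profile p : values()) {
                if (p.ruleName.equals(ruleName)) {
                    return p;
                }
            }
            throw new IllegalArgumentException("No profile for " + ruleName);
        }
    }

    public static void main(String[] args) {
        Profile p = Profile.forRule("caseIgnoreMatch");
        System.out.println(p.ruleName + " caseFold=" + p.caseFold + " handling=" + p.handling);
    }
}
```

With something like this, the ORDERING and SUBSTRING variants of a rule could share the profile of the matching EQUALITY rule instead of duplicating the logic.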

Atm, those are just thoughts and analysis, but I do think that this is the
way to go. For the record, it will have a huge impact on the Value class : I
anticipate some more simplification. Typically, there is no need to keep the
hashcode within the Value, because we can't easily compute it. In order to
know if a value is already present in an Attribute, we will have to use the
equals() method, which works with the prepared String anyway. We don't need
to keep the normalized bytes either. I think we should also remove the
getString() method, we already have a getValue() method that returns a String.

I will most certainly try to work on those changes this weekend.

Thanks !

On 30/03/16 13:23, Emmanuel Lécharny wrote:
> On 28/03/16 12:23, Emmanuel Lécharny wrote:
>> Hi guys,
>>
>> I'm now working on the PrepareString part. It needs a bit of work, as we
>> don't correctly handle spaces. We also have to remove the escaping we do
>> there.
>>
>> That is what I'm working on atm.
> A bit more of what's going on...
>
> The String Preparation is specified in RFC 4518. It's a process that
> involves 6 steps :
>
>
>       1) Transcode
>       2) Map
>       3) Normalize
>       4) Prohibit
>       5) Check bidi
>       6) Insignificant Character Handling
>
> The first phase is just a transformation of a byte[] to a String, which
> is done through a call to Strings.utf8ToString( bytes ). The good thing
> is that Java stores the String using Unicode.
>
> The Map phase is a bit more complex, as we have to go through all the
> chars and, depending on whether the Syntax is case sensitive or
> not, transform each char to some others so that they can be
> compared safely. There is a long list of special chars to handle (around
> 1000).
>
> The Normalize phase consists of a transformation of the String to a
> String respecting the NFKC form, described here :
> http://www.unicode.org/reports/tr15/tr15-22.html#Specification. This is
> also implemented in Java, so we use the Normalizer.normalize( mapped,
> Normalizer.Form.NFKC ) method, if necessary.
>
> The Prohibit phase is about checking every char to make sure they are all
> valid. There are a few hundred prohibited chars.
>
> The Check Bidi phase is about dealing with bi-directional characters
> (Arabic, for instance). "Bidirectional characters are ignored." says the
> RFC, so be it :-)
>
> The Insignificant Character Handling phase is the last one, where we
> remove useless spaces or some other specific chars, in various types of
> values.
>
>
> In order to speed up the process, which is quite expensive, the idea is
> to assume the value to be ASCII first. In this case, the Normalize,
> Prohibit and most of the Map phases can be skipped. We can safely design
> a simpler method that will work fast for all those phases, throwing an
> exception when we meet a non-ASCII char. If so, we fall back to the more
> complex process that involves all the phases and the various String
> creations. Somehow, this is the same approach as what we have for DNs :
> FastDnParser and ComplexDnParser.
>
>
> One thing that will be completely removed from the prepareString
> implementation is the escaping we currently (wrongly) do. It is not
> the place to do that.
>
>
> Bottom line, this String preparation will completely replace the
> Normalizers we are using. They are useless parts of our schema.
>
>
> Last but not least, as this is a COSTLY operation, this function will only
> be called when needed (ie for AT we know are used in an Index, or in the
> DN's RDN, or when a Filter uses it). That will save a hell of a lot of CPU.
> The consequence is that most of the values we receive or send will
> *not* be converted to String, we will just keep the byte[] value. That
> is the main source of CPU savings.
>
> Expect the server and the API to be kind of impacted :-)
>
>
>
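
The ASCII-first idea from the quoted message can be sketched as follows. This is only an illustration under stated assumptions: FastPathPrepareSketch, NonAsciiException and the three method names are made up for the example, the Map step is reduced to case folding, and the Prohibit, Check bidi and Insignificant Character Handling steps are left out of both paths.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of "try the cheap ASCII path, fall back to the full one".
final class FastPathPrepareSketch {

    // Signals that the fast path met a char outside the ASCII range.
    static final class NonAsciiException extends Exception {
        NonAsciiException(int position) {
            super("Non-ASCII char at position " + position);
        }
    }

    // Fast path: a single pass, no NFKC normalization. For pure ASCII input
    // the only mapping done in this sketch is the optional case folding.
    static String prepareAscii(String value, boolean caseFold) throws NonAsciiException {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c > 0x7F) {
                throw new NonAsciiException(i);
            }
            if (caseFold && c >= 'A' && c <= 'Z') {
                c = (char) (c + ('a' - 'A'));
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Complex path: simplified Map (lower-casing only) followed by NFKC.
    static String prepareFull(String value, boolean caseFold) {
        String mapped = caseFold ? value.toLowerCase(Locale.ROOT) : value;
        return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
    }

    // Entry point: assume ASCII first, fall back when the assumption fails.
    static String prepare(String value, boolean caseFold) {
        try {
            return prepareAscii(value, caseFold);
        } catch (NonAsciiException e) {
            return prepareFull(value, caseFold);
        }
    }

    public static void main(String[] args) {
        System.out.println(prepare("Hello", true));   // handled by the fast path
        System.out.println(prepare("\uFB01", false)); // falls back to NFKC
    }
}
```

The fallback restarts from the original value, so the fast path must not have side effects. That is the same trade-off as the FastDnParser / ComplexDnParser pair.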
