Continuing my investigation of PrepareString, here is where I am, and
some thoughts about where we should go.
After having re-read RFC 4517 and RFC 4518, which are quite complex, I
came to realize that it's not at all a good idea to have one single
method to prepare the Strings that have to be prepared. There are many
different use cases, and differentiating between them in the prepare()
method would be stupid.
The EQUALITY, ORDERING and SUBSTRING matchingRules have a reference to a
Normalizer which should be the class that is responsible for the String
preparation (if needed). This Normalizer will depend on the
PrepareString static methods to prepare the String, so the PrepareString
class will expose those methods (Transcode, Map, etc).
Btw, here are the MatchingRules that are to use the String Preparation
algorithm :

EQUALITY MR :

  caseExactMatch : no case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseExactIA5Match : no case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseIgnoreMatch : case fold in Map, only Insignificant Space Handling
    is applied in the Insignificant Character Handling step
  caseIgnoreIA5Match : case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseIgnoreListMatch : apply caseIgnoreMatch on each element
  directoryStringFirstComponentMatch : apply the associated element
    MatchingRule
  numericStringMatch : no case fold in Map, only numericString
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step
  telephoneNumberMatch : no case fold in Map, only telephoneNumber
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step
  wordMatch : apply caseIgnoreMatch on each element

ORDERING MR :

  caseExactOrderingMatch : no case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseIgnoreOrderingMatch : case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  numericStringOrderingMatch : no case fold in Map, only numericString
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step

SUBSTRING MR :

  caseExactSubstringsMatch : no case fold in Map, only Insignificant
    Space Handling is applied in the Insignificant Character Handling
    step
  caseIgnoreIA5SubstringsMatch : case fold in Map, only Insignificant
    Space Handling is applied in the Insignificant Character Handling
    step
  caseIgnoreListSubstringsMatch : apply caseIgnoreMatch on each element
  caseIgnoreSubstringsMatch : case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  numericStringSubstringsMatch : no case fold in Map, only numericString
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step
  telephoneNumberSubstringsMatch : no case fold in Map, only
    telephoneNumber Insignificant Character Handling is applied in the
    Insignificant Character Handling step

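
The list above could be captured as a small lookup table, so that each
Normalizer only has to ask which options apply to its MatchingRule. This is
just a sketch to illustrate the idea: the class, enum and method names are
made up and do not exist in the API, and only the EQUALITY rules that use the
preparation directly are filled in (the ORDERING and SUBSTRING ones would
follow the same pattern).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the existing code base.
final class MatchingRulePreparationSketch {
    /** Which flavour of Insignificant Character Handling applies. */
    enum Handling { SPACE, NUMERIC_STRING, TELEPHONE_NUMBER }

    /** Options a Normalizer would pass to the PrepareString steps. */
    static final class Options {
        final boolean caseFold;   // case fold in the Map step ?
        final Handling handling;  // last step of the preparation

        Options(boolean caseFold, Handling handling) {
            this.caseFold = caseFold;
            this.handling = handling;
        }
    }

    private static final Map<String, Options> RULES = new HashMap<>();

    static {
        RULES.put("caseExactMatch", new Options(false, Handling.SPACE));
        RULES.put("caseExactIA5Match", new Options(false, Handling.SPACE));
        RULES.put("caseIgnoreMatch", new Options(true, Handling.SPACE));
        RULES.put("caseIgnoreIA5Match", new Options(true, Handling.SPACE));
        RULES.put("numericStringMatch", new Options(false, Handling.NUMERIC_STRING));
        RULES.put("telephoneNumberMatch", new Options(false, Handling.TELEPHONE_NUMBER));
    }

    private MatchingRulePreparationSketch() {
    }

    /** Returns the options for a rule, or fails if the rule is not in the table. */
    static Options optionsFor(String ruleName) {
        Options options = RULES.get(ruleName);

        if (options == null) {
            throw new IllegalArgumentException("No String preparation registered for " + ruleName);
        }

        return options;
    }
}
```

With something like this, optionsFor( "caseIgnoreMatch" ) tells the caller to
case fold and to apply the Insignificant Space Handling, without any
if/else cascade in a single prepare() method.
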
Atm, those are just thoughts and analysis, but I do think that this is the way
to go. For the record, it will have a huge impact on the Value class : I
anticipate some more simplification. Typically, there is no need to keep the
hashcode within the Value, because we can't easily compute it. In order to
know if a value is already present in an Attribute, we will have to use the
Equals() method, which works with the prepared String anyway. We don't need
to keep the normalized bytes either. I think we should also remove the
getString() method, as we already have a getValue() method that returns a
String.
I will most certainly try to work on those changes this week-end.
Thanks !
On 30/03/16 13:23, Emmanuel Lécharny wrote:
> On 28/03/16 12:23, Emmanuel Lécharny wrote:
>> Hi guys,
>>
>> I'm now working on the PrepareString part. It needs a bit of work, as we
>> don't correctly handle spaces. We also have to remove the escaping we do
>> there.
>>
>> That is what I'm working on atm.
> A bit more of what's going on...
>
> The String Preparation is specified in RFC 4518. It's a process that
> involves 6 steps :
>
>
> 1) Transcode
> 2) Map
> 3) Normalize
> 4) Prohibit
> 5) Check bidi
> 6) Insignificant Character Handling
>
> The first phase is just a transformation of a byte[] to a String, which
> is done through a call to Strings.utf8ToString( bytes ). The good thing
> is that Java stores the String using Unicode.
>
> The Map phase is a bit more complex, as we have to go through all the
> chars and, depending on whether the Syntax is case sensitive or
> not, transform each char into some others so that they can be
> compared safely. There is a long list of special chars to handle (around
> 1000).
>
> The Normalize phase consists of a transformation of the String to a
> String respecting the NFKC form, described here :
> http://www.unicode.org/reports/tr15/tr15-22.html#Specification. This is
> also implemented in Java, so we use the Normalizer.normalize( mapped,
> Normalizer.Form.NFKC ) method, if necessary.
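
To make the "if necessary" part of the Normalize phase concrete, here is a
minimal sketch. java.text.Normalizer is the real JDK class, but the wrapper
class and method names are only mine, for illustration. It first checks
whether the mapped String is already in NFKC form and only builds a new
String when it is not.

```java
import java.text.Normalizer;

// Hypothetical wrapper, only here to illustrate the Normalize step.
final class NormalizeStepSketch {
    private NormalizeStepSketch() {
    }

    /** Returns the NFKC form of the mapped String, reusing it when already normalized. */
    static String normalize(String mapped) {
        if (Normalizer.isNormalized(mapped, Normalizer.Form.NFKC)) {
            // Nothing to do, avoid creating a new String
            return mapped;
        }

        return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
    }
}
```

For instance, the 'fi' ligature (U+FB01) comes out as the two chars "fi", and
an 'e' followed by a combining acute accent comes out as the single char
U+00E9, while a plain ASCII String is returned untouched.
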
>
> The Prohibit phase is about checking every char to make sure they are all
> valid. There are a few hundred prohibited chars.
>
> The Check Bidi phase is about dealing with bi-directional characters
> (Arabic, for instance). "Bidirectional characters are ignored." says the
> RFC, so be it :-)
>
> The insignificant character handling phase is the last one, where we
> remove useless spaces or some other specific chars, in various types of
> values.
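
As an illustration of that last phase, here is a sketch of the Insignificant
Space Handling for attribute values, as I read section 2.6.1 of RFC 4518 :
the result starts and ends with exactly one space, every inner run of spaces
becomes exactly two spaces, and a value with no non-space char becomes two
spaces. The class and method names are made up, substring assertions are not
covered, and the RFC's special case for a space followed by combining marks
is ignored here.

```java
// Hypothetical helper, only here to illustrate the space handling.
final class InsignificantSpaceSketch {
    private InsignificantSpaceSketch() {
    }

    /** Applies the space handling to a whole attribute value (not a substring). */
    static String handleSpaces(String input) {
        StringBuilder out = new StringBuilder(input.length() + 2);
        out.append(' ');                // exactly one leading space

        boolean seenNonSpace = false;   // true once a non-space char was copied
        boolean pendingSpaces = false;  // true when spaces follow a non-space char

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);

            if (c == ' ') {
                // Leading spaces are dropped, inner ones are remembered
                pendingSpaces = seenNonSpace;
            } else {
                if (pendingSpaces) {
                    out.append("  ");   // an inner run becomes two spaces
                    pendingSpaces = false;
                }

                out.append(c);
                seenNonSpace = true;
            }
        }

        out.append(' ');                // exactly one trailing space

        return out.toString();
    }
}
```

The numericString and telephoneNumber variants would be separate methods, as
they remove different chars.
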
>
>
> In order to speed up the process, which is quite expensive, the idea is
> to assume the value to be ASCII first. In this case, the Normalize,
> Prohibit and most of the Map phases can be skipped. We can safely design
> a simpler method that will work fast for all those phases, throwing an
> exception when we meet a non-ASCII char. If so, we fall back to the more
> complex process that involves all the phases and the various String
> creations. Somehow, this is the same process as what we have for DNs :
> FastDnParser and ComplexDnParser.
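
Here is a rough sketch of that fast path / complex path split. All the names
are made up for illustration. The complex path is heavily simplified :
toLowerCase( Locale.ROOT ) only stands in for the real RFC 4518 mapping
tables, the Prohibit and Bidi checks are left out, and the handling of ASCII
control chars in the Map step is not shown either.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of the ASCII-first strategy, not the real implementation.
final class FastPrepareSketch {
    /** Signals that the ASCII-only path can't handle the value. */
    static final class NonAsciiException extends Exception {
        NonAsciiException() {
            // No stack trace and no suppression : this is a cheap control-flow signal
            super("non-ASCII char found", null, false, false);
        }
    }

    private FastPrepareSketch() {
    }

    /** ASCII-only path : one pass over the chars, no Normalize, no Prohibit. */
    static String fastPrepare(String value, boolean caseFold) throws NonAsciiException {
        char[] chars = value.toCharArray();

        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];

            if (c > 0x7F) {
                throw new NonAsciiException();
            }

            if (caseFold && c >= 'A' && c <= 'Z') {
                chars[i] = (char) (c + ('a' - 'A'));
            }
        }

        return new String(chars);
    }

    /** Simplified stand-in for the full process (Map then Normalize only). */
    static String complexPrepare(String value, boolean caseFold) {
        String mapped = caseFold ? value.toLowerCase(Locale.ROOT) : value;

        return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
    }

    /** Tries the fast path first, and falls back to the complex one if needed. */
    static String prepare(String value, boolean caseFold) {
        try {
            return fastPrepare(value, caseFold);
        } catch (NonAsciiException e) {
            return complexPrepare(value, caseFold);
        }
    }
}
```

The Insignificant Character Handling would then be applied on the result of
prepare(), whatever path was taken.
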
>
>
> One thing that will be completely removed from the prepareString
> implementation is the escaping we currently (wrongly) do. It is not
> the place to do that.
>
>
> Bottom line, this String preparation will completely replace the
> Normalizers we are using. They are useless parts of our schema.
>
>
> Last but not least, as this is a COSTLY operation, this function will only
> be called when needed (ie for AT we know are used in an Index, or in the
> DN's RDN, or when a Filter uses it). That will save a hell of a lot of CPU.
> The consequence is that most of the values we receive or send will
> *not* be converted to String, we will just keep the byte[] value. That
> is the main source of CPU savings.
>
> Expect the server and the API to be kind of impacted :-)
>
>
>