Continuing my investigation on PrepareString, here is where I am, and some thoughts about where we should go.

After having re-read RFC 4517 and RFC 4518, which are quite complex, I
came to realize that it's not at all a good idea to have one single
method to prepare the Strings that have to be prepared. There are many
different use cases, and differentiating between them in the prepare()
method would be stupid.

The EQUALITY, ORDERING and SUBSTRING matchingRules have a reference to a
Normalizer, which should be the class responsible for the String
preparation (if needed). This Normalizer will depend on the
PrepareString static methods to prepare the String, so the PrepareString
class will expose those methods (Transcode, Map, etc).

Btw, here are the MatchingRules that are to use the String Preparation
algorithm :

EQUALITY MR :
  caseExactMatch : no case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseExactIA5Match : no case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseIgnoreMatch : case fold in Map, only Insignificant Space Handling
    is applied in the Insignificant Character Handling step
  caseIgnoreIA5Match : case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseIgnoreListMatch : apply caseIgnoreMatch on each element
  directoryStringFirstComponentMatch : apply the associated element
    MatchingRule
  numericStringMatch : no case fold in Map, only numericString
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step
  telephoneNumberMatch : no case fold in Map, only telephoneNumber
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step
  wordMatch : apply caseIgnoreMatch on each element

ORDERING MR :
  caseExactOrderingMatch : no case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  caseIgnoreOrderingMatch : case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  numericStringOrderingMatch : no case fold in Map, only numericString
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step

SUBSTRING MR :
  caseExactSubstringsMatch : no case fold in Map, only Insignificant
    Space Handling is applied in the Insignificant Character Handling
    step
  caseIgnoreIA5SubstringsMatch : case fold in Map, only Insignificant
    Space Handling is applied in the Insignificant Character Handling
    step
  caseIgnoreListSubstringsMatch : apply caseIgnoreMatch on each element
  caseIgnoreSubstringsMatch : case fold in Map, only Insignificant Space
    Handling is applied in the Insignificant Character Handling step
  numericStringSubstringsMatch : no case fold in Map, only numericString
    Insignificant Character Handling is applied in the Insignificant
    Character Handling step
  telephoneNumberSubstringsMatch : no case fold in Map, only
    telephoneNumber Insignificant Character Handling is applied in the
    Insignificant Character Handling step

Atm, those are just thoughts and analysis, but I do think that this is
the way to go.

For the record, it will have a huge impact on the Value class : I
anticipate some more simplification. Typically, there is no need to keep
the hashcode within the Value, because we can't easily compute it. In
order to know if a value is already present in an Attribute, we will
have to use the Equals() method, which works with the prepared String
anyway. We don't need to keep the normalized bytes either.

I think we should also remove the getString() method, since we already
have a getValue() method that returns a String.

I will most certainly try to work on those changes this weekend.

Thanks !

On 30/03/16 13:23, Emmanuel Lécharny wrote:
> On 28/03/16 12:23, Emmanuel Lécharny wrote:
>> Hi guys,
>>
>> I'm now working on the PrepareString part. It needs a bit of work, as
>> we don't correctly handle spaces. We also have to remove the escaping
>> we do there.
>>
>> That is what I'm working on atm.
> A bit more of what's going on...
>
> The String Preparation is specified in RFC 4518. It's a process that
> involves 6 steps :
>
> 1) Transcode
> 2) Map
> 3) Normalize
> 4) Prohibit
> 5) Check bidi
> 6) Insignificant Character Handling
>
> The first phase is just a transformation of a byte[] to a String, which
> is done through a call to Strings.utf8ToString( bytes ). The good thing
> is that Java stores the String using Unicode.
>
> The Map phase is a bit more complex, as we have to go through all the
> chars and, depending on whether the Syntax is case sensitive or not,
> transform each char into some others so that they can be compared
> safely. There is a long list of special chars to handle (around 1000).
>
> The Normalize phase consists of a transformation of the String to a
> String respecting the NFKC form, described here :
> http://www.unicode.org/reports/tr15/tr15-22.html#Specification. This is
> also implemented in Java, so we use the Normalizer.normalize( mapped,
> Normalizer.Form.NFKC ) method, if necessary.
>
> The Prohibit phase is about checking every char to see if they are all
> valid. There are a few hundred prohibited chars.
>
> The Check Bidi phase is about dealing with bi-directional characters
> (arabic, for instance). "Bidirectional characters are ignored." says
> the RFC, so be it :-)
>
> The Insignificant Character Handling phase is the last one, where we
> remove useless spaces or some other specific chars, in various types of
> values.
>
> In order to speed up the process, which is quite expensive, the idea is
> to assume the value to be ASCII first. In this case, the Normalize,
> Prohibit and most of the Map phases can be skipped. We can safely
> design a simpler method that will work fast for all those phases,
> throwing an exception when we meet a non-ASCII char. If so, we fail
> over to the more complex process that involves all the phases and the
> various String creations. Somehow, this is the same process as what we
> have for DNs : FastDnParser and ComplexDnParser.
>
> One thing that will be completely removed from the prepareString
> implementation is the escaping we currently (wrongly) do. It is not the
> place to do that.
>
> Bottom line, this String preparation will completely replace the
> Normalizers we are using. They are useless parts of our schema.
>
> Last but not least, as this is a COSTLY operation, this function will
> only be called when needed (ie for AT we know are used in an Index, or
> in the DN's RDN, or when a Filter uses it). That will save a hell of a
> lot of CPU. The consequence is that most of the values we receive or
> send will *not* be converted to String, we will just keep the byte[]
> value. That is the main source of CPU savings.
>
> Expect the server and the API to be kind of impacted :-)
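
To make the ASCII fast path from the quoted message a bit more concrete, here is a rough sketch of the "try ASCII first, fail over to the full process" idea. Everything in it is an assumption for illustration only: the class and method names (PrepareStringSketch, prepareAscii, prepareFull, NonAsciiException) are hypothetical and are not the existing PrepareString API. The Map step is reduced to plain lowercasing, the Prohibit and Check bidi steps are left out, and the space handling only trims and collapses runs of spaces, which is simpler than the exact rules of RFC 4518.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical sketch of the fast/complex split; not the real PrepareString class. */
final class PrepareStringSketch {

    /** Raised by the fast path as soon as a non-ASCII byte is met. */
    static final class NonAsciiException extends Exception {
        NonAsciiException(int position) {
            super("Non-ASCII byte at position " + position);
        }
    }

    /** Fast path: one pass over the bytes, no intermediate String, ASCII only. */
    static String prepareAscii(byte[] bytes, boolean caseFold) throws NonAsciiException {
        StringBuilder out = new StringBuilder(bytes.length);
        boolean pendingSpace = false;

        for (int i = 0; i < bytes.length; i++) {
            byte b = bytes[i];

            if (b < 0) {
                // High bit set: part of a multi-byte UTF-8 sequence
                throw new NonAsciiException(i);
            }

            char c = (char) b;

            if (c == ' ') {
                // Simplified space handling: drop leading spaces, collapse inner runs
                pendingSpace = out.length() > 0;
                continue;
            }

            if (pendingSpace) {
                out.append(' ');
                pendingSpace = false;
            }

            if (caseFold && c >= 'A' && c <= 'Z') {
                c = (char) (c + ('a' - 'A'));
            }

            out.append(c);
        }

        // A trailing run of spaces is never flushed, so it is dropped
        return out.toString();
    }

    /** Complex path: Transcode, simplified Map, Normalize, simplified space handling. */
    static String prepareFull(byte[] bytes, boolean caseFold) {
        // 1) Transcode
        String s = new String(bytes, StandardCharsets.UTF_8);

        // 2) Map: lowercasing is only a stand-in for the real mapping tables
        if (caseFold) {
            s = s.toLowerCase(Locale.ROOT);
        }

        // 3) Normalize
        s = Normalizer.normalize(s, Normalizer.Form.NFKC);

        // 4) Prohibit and 5) Check bidi are omitted in this sketch

        // 6) Same simplified space handling as the fast path
        return s.replaceAll("^ +| +$", "").replaceAll(" +", " ");
    }

    /** Assume ASCII first, and fail over to the complex path if that assumption breaks. */
    static String prepare(byte[] bytes, boolean caseFold) {
        try {
            return prepareAscii(bytes, caseFold);
        } catch (NonAsciiException e) {
            return prepareFull(bytes, caseFold);
        }
    }
}
```

With these simplifications, calling prepare() on the bytes of "  Hello   World " with case folding gives "hello world", and a value containing a non-ASCII char goes through prepareFull() instead. Using the exception as the fail-over signal follows the description in the quoted message; returning a sentinel value from the fast path would be another option.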