Note: you can use the regular expression \R to match line terminators. It's a bit faster than your pattern. I found this out when testing String::lines() (which was faster still).
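For what it's worth, here is a rough sketch of the two approaches I was comparing (illustrative only, not the webrev code; the class and helper names are made up):

    import java.util.stream.Collectors;

    class NewlineSketch {
        // Regex approach: \R matches any line terminator (\n, \r, \r\n, ...);
        // the indexOf check avoids the regex machinery when there is no \r.
        static String normalizeWithRegex(String s) {
            return s.indexOf('\r') == -1 ? s : s.replaceAll("\\R", "\n");
        }

        // String::lines approach: split on line terminators and rejoin with \n.
        // Note that this drops a trailing line terminator, so it is not a
        // drop-in replacement if the trailing newline matters.
        static String normalizeWithLines(String s) {
            return s.indexOf('\r') == -1 ? s
                    : s.lines().collect(Collectors.joining("\n"));
        }
    }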
Cheers,
-- Jim
On the road.

> On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons <[email protected]> wrote:
>
> Updated webrev. This version uses a regular expression and .replaceAll to replace \r\n? with \n in RawHtml. No other changes.
>
> http://cr.openjdk.java.net/~jjg/8236048/webrev.01/
>
> -- Jon
>
> On 12/17/2019 10:44 AM, Jonathan Gibbons wrote:
>> Inline...
>>
>> On 12/17/2019 05:51 AM, Pavel Rappo wrote:
>>> Jon,
>>>
>>> Could you please explain why "generated files should only contain \n"?
>>
>> I don't have a robust spec-based answer, so the answer is more along the lines of ...
>>
>> * because we've always done it this way ... related, back in the day, Apache was the gold-standard webserver, and was typically running on Linux systems, so favoring that config seemed a good idea
>> * for consistency in the output between user-provided newlines and generated newlines ... I did find a spec reference that said to not be inconsistent
>> * for self-defence in our self-tests, maybe dating back to before Mercurial and before we were so rigorous about enforcing Unix-newlines only in our source code
>>
>>> Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization check first thing in normalizeNewlines?
>>
>> Are you suggesting it would be better to always call normalizeNewlines and so move the check into the method? I guess I was thinking it was worth checking to detect whether it was necessary to call the method in the first place.
>>
>>> The new mechanics of string normalization are more readable, albeit probably slower in a typical case. The old mechanics used bulk append, whereas the new one goes char by char. Not sure if it is of any practical significance, though.
>>
>> I don't see why it will be slower in any case. In all previous cases, we were unconditionally scanning the string character-by-character in Utils.normalizeNewlines and building up a new StringBuilder, which was then converted to a String.
>>
>> Now, in StringContent, the scan is merged with the scan to escape the HTML characters. In RawHtml content, the scan/copy is only done if a \r is found. Yes, we have to do a read-scan, but we only copy/update the string if needed, which is going to be faster than the always-copy that was done before.
>>
>> One possible suggestion/improvement ... but I don't know the performance cost:
>>
>> * remove RawHtml.normalizeNewlines completely
>> * change the RawHtml constructor to use a regex, as in
>>       rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml : rawHtml.replaceAll("\\R", "\n");
>>
>> I must admit, I sorta like this, even if it might be a bit slower, to use a regex.
>>
>> Note, as an aside, we use RawHtml way more than we should. The design intent was that it should be a Content-of-last-resort. I have ideas on how to significantly reduce the usage, which is the general direction and motivation for a lot of this round of cleanup work.
>>
>>> I wonder if normalization should be deferred until the latest possible moment. The reason is that this operation doesn't seem to be additive, i.e.
>>>
>>> normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + B).
>>
>> My initial thought had been to do the normalization on the fly while writing out the characters, but that would mean writing the file a character at a time, instead of a bulk write. It would also mean a bigger semantic change to leave un-normalized newlines in the data members ... note that there is code to check if items "end with newline", which is done by checking if the last character is DocletConstants.NL. It also seemed inconsistent to be scanning for HTML characters while leaving newlines alone. If we were to defer normalization until writing out Content objects, I would consider deferring the <>& translation as well. That might be a good idea but could be done separately as a later, more-localized changeset.
>>
>> Note that there is a hidden assumption that newlines are completely represented in each call to a method or constructor for any subtype of Content. Even if we deferred normalization to the very last possible moment (i.e. writing out), we would not be able to correctly handle a \r\n pair split across a pair of consecutive Content nodes. I don't think this is a case worth worrying about.
>>
>>> -Pavel
>>>
>>>> On 17 Dec 2019, at 02:30, Jonathan Gibbons <[email protected]> wrote:
>>>>
>>>> Please review a moderately simple change for javadoc, to improve the way that newlines in user input (doc comments and values for command-line options) are handled, reducing the amount of overall string copying. The underlying issue is that user newlines may be \n, \r or \r\n, but the generated files should only contain \n.
>>>>
>>>> Instead of using Utils.normalizeNewlines (which does an unconditional copy) at a number of sites, the normalization is now done in StringContent and RawHtml. In the case of StringContent, the normalization is done while checking and handling the escape sequences for < > &; in the case of RawHtml, it is done when constructing the object, but only if the argument contains \r.
>>>>
>>>> There are no good pre-existing tests for newline behavior, and so a new test is provided that runs javadoc on input containing the different kinds of newline. The test revealed a previously unknown bug, that non-standard newlines in command-line option values were not handled correctly.
>>>>
>>>> The output is also the same (before-and-after) for the generated JDK API docs.
>>>>
>>>> -- Jon
>>>>
>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8236048
>>>> Webrev: http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html
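P.S. For anyone following the StringContent discussion above, a rough sketch of the kind of merged escape-and-normalize scan being described (illustrative only, not the actual StringContent code; the method name is made up):

    // Escapes <, > and & for HTML and normalizes \r and \r\n to \n in a
    // single pass over the input, building the result in a StringBuilder.
    static String escapeAndNormalize(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '<':  sb.append("&lt;");  break;
                case '>':  sb.append("&gt;");  break;
                case '&':  sb.append("&amp;"); break;
                case '\r':
                    sb.append('\n');
                    if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                        i++;    // fold \r\n into a single \n
                    }
                    break;
                default:   sb.append(ch);
            }
        }
        return sb.toString();
    }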
