Note: you can use the regular expression \R to match line terminators. It's a bit faster than your pattern. I found this out when testing String::lines() (which was faster still).
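For what it's worth, here is a rough sketch of the two approaches I was comparing (illustrative only, not the webrev code; the class and helper names are made up):

    import java.util.stream.Collectors;

    class NewlineSketch {
        // Regex approach: \R matches any line terminator (\n, \r, \r\n, ...);
        // the indexOf check avoids the regex machinery when there is no \r.
        static String normalizeWithRegex(String s) {
            return s.indexOf('\r') == -1 ? s : s.replaceAll("\\R", "\n");
        }

        // String::lines approach: split on line terminators and rejoin with \n.
        // Note that this drops a trailing line terminator, so it is not a
        // drop-in replacement if the trailing newline matters.
        static String normalizeWithLines(String s) {
            return s.indexOf('\r') == -1 ? s
                    : s.lines().collect(Collectors.joining("\n"));
        }
    }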
Cheers,
-- Jim
On the road.

> On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons <[email protected]> wrote:
>
> Updated webrev. This version uses a regular expression and .replaceAll to replace \r\n? with \n in RawHtml. No other changes.
>
> http://cr.openjdk.java.net/~jjg/8236048/webrev.01/
>
> -- Jon
>
> On 12/17/2019 10:44 AM, Jonathan Gibbons wrote:
>> Inline...
>>
>> On 12/17/2019 05:51 AM, Pavel Rappo wrote:
>>> Jon,
>>>
>>> Could you please explain why "generated files should only contain \n"?
>>
>> I don't have a robust spec-based answer, so the answer is more along the lines of ...
>>
>> * because we've always done it this way ... related, back in the day, Apache was the gold-standard webserver, and was typically running on Linux systems, so favoring that config seemed a good idea
>> * for consistency in the output between user-provided newlines and generated newlines ... I did find a spec reference that said to not be inconsistent
>> * for self-defence in our self-tests, maybe dating back to before Mercurial and before we were so rigorous about enforcing Unix-newlines only in our source code
>>
>>> Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization check first thing in normalizeNewlines?
>>
>> Are you suggesting it would be better to always call normalizeNewlines and so move the check into the method? I guess I was thinking it was worth checking to detect whether it was necessary to call the method in the first place.
>>
>>> The new mechanics of string normalization are more readable, albeit probably slower in a typical case. The old mechanics used bulk append, whereas the new one goes char by char. Not sure if it is of any practical significance, though.
>>
>> I don't see why it will be slower in any case. In all previous cases, we were unconditionally scanning the string character-by-character in Utils.normalizeNewlines and building up a new StringBuilder, which was then converted to a String.
>>
>> Now, in StringContent, the scan is merged with the scan to escape the HTML characters. In RawHtml content, the scan/copy is only done if a \r is found. Yes, we have to do a read-scan, but we only copy/update the string if needed, which is going to be faster than the always-copy that was done before.
>>
>> One possible suggestion/improvement ... but I don't know the performance cost:
>>
>> * remove RawHtml.normalizeNewlines completely
>> * change the RawHtml constructor to use a regex, as in
>>       rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml : rawHtml.replaceAll("\\R", "\n");
>>
>> I must admit, I sorta like this, even if it might be a bit slower, to use a regex.
>>
>> Note, as an aside, we use RawHtml way more than we should. The design intent was that it should be a Content-of-last-resort. I have ideas on how to significantly reduce the usage, which is the general direction and motivation for a lot of this round of cleanup work.
>>
>>> I wonder if normalization should be deferred until the latest possible moment. The reason is that this operation doesn't seem to be additive, i.e.
>>>
>>> normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + B).
>>
>> My initial thought had been to do the normalization on the fly while writing out the characters, but that would mean writing the file a character at a time, instead of a bulk write. It would also mean a bigger semantic change to leave un-normalized newlines in the data members ... note that there is code to check if items "end with newline", which is done by checking if the last character is DocletConstants.NL. It also seemed inconsistent to be scanning for HTML characters while leaving newlines alone. If we were to defer normalization until writing out Content objects, I would consider deferring the <>& translation as well. That might be a good idea but could be done separately as a later, more-localized changeset.
>>
>> Note that there is a hidden assumption that newlines are completely represented in each call to a method or constructor for any subtype of Content. Even if we deferred normalization to the very last possible moment (i.e. writing out), we would not be able to correctly handle a \r\n pair split across a pair of consecutive Content nodes. I don't think this is a case worth worrying about.
>>
>>> -Pavel
>>>
>>>> On 17 Dec 2019, at 02:30, Jonathan Gibbons <[email protected]> wrote:
>>>>
>>>> Please review a moderately simple change for javadoc, to improve the way that newlines in user input (doc comments and values for command-line options) are handled, reducing the amount of overall string copying. The underlying issue is that user newlines may be \n, \r or \r\n, but the generated files should only contain \n.
>>>>
>>>> Instead of using Utils.normalizeNewlines (which does an unconditional copy) at a number of sites, the normalization is now done in StringContent and RawHtml. In the case of StringContent, the normalization is done while checking and handling the escape sequences for < > &; in the case of RawHtml, it is done when constructing the object, but only if the argument contains \r.
>>>>
>>>> There are no good pre-existing tests for newline behavior, and so a new test is provided that runs javadoc on input containing the different kinds of newline. The test revealed a previously unknown bug, that non-standard newlines in command-line option values were not handled correctly.
>>>>
>>>> The output is also the same (before-and-after) for the generated JDK API docs.
>>>>
>>>> -- Jon
>>>>
>>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8236048
>>>> Webrev: http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html
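P.S. For anyone following the StringContent discussion above, a rough sketch of the kind of merged escape-and-normalize scan being described (illustrative only, not the actual StringContent code; the method name is made up):

    // Escapes <, > and & for HTML and normalizes \r and \r\n to \n in a
    // single pass over the input, building the result in a StringBuilder.
    static String escapeAndNormalize(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '<':  sb.append("&lt;");  break;
                case '>':  sb.append("&gt;");  break;
                case '&':  sb.append("&amp;"); break;
                case '\r':
                    sb.append('\n');
                    if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                        i++;    // fold \r\n into a single \n
                    }
                    break;
                default:   sb.append(ch);
            }
        }
        return sb.toString();
    }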
