Ah, ByteArrayMap, I was trying ByteArray. Not sure what the next step is here: should I add it and send you mczs, or will you do it yourself? (it is a simple change).
Peter On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote: > Peter, > > > On 22 Jul 2017, at 22:27, Peter Uhnak <i.uh...@gmail.com> wrote: > > > > Hi Sven, > > > > my use case was hand-edited CSVs (therefore the quotes are extra clutter), > > which imples that I would be hand-viewing/editing only small CSVs (no > > performance issues). > > > > I agree that we should err on the safe side with CR & LF (e.g. tools may > > sometimes autoconvert line endings). > > > > Regarding performance: > > > > #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, > > or I don't know how to use), so I've trried with inCharacterSet: > > Yes, it is a bitch to use, the the version you used does not use the > primitive. > > Try playing with this: > > s10 := 'a' repeat: 10. > s100 := 'a' repeat: 100. > > searchSet := ByteArray new: 256 withAll: 0. > ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ]. > > searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap. > > [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench. > "15,160,161 per second" > [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench. > "8,219,081 per second" > > ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: > 1. > > Sven > > > Tested on worst-case scenario - strings don't contain tested symbols. > > > > s10 := 'a' repeat: 10. > > s100 := 'a' repeat: 100. > > > > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 > > includesSubstring: each ] ] bench. "'1,200,046 per second'" > > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 > > includesSubstring: each ] ] bench. "'495,482 per second'" > > > > [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. > > "'2,819,416 per second'" > > [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. > > "'2,200,668 per second'" > > > > [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf > > startingAt: 1 ] bench. "'1,187,324 per second'" > > [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf > > startingAt: 1 ] bench. "'165,526 per second'" > > > > > > #includesAny: seems to be the best by far. > > > > Storing the tested characters didn't improve it by much. > > > > Peter > > > > On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: > >> Hi Peter, > >> > >>> On 22 Jul 2017, at 14:12, Peter Uhnak <i.uh...@gmail.com> wrote: > >>> > >>> Hi, > >>> > >>> this is a continuation of an older thread about quoting fields only when > >>> necessary. ( > >>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > >>> > >>> I've attached fileouts of NeoCSV packages with the addition (I don't know > >>> if Metacello can file-out only changesets). > >>> > >>> The change doesn't affect the default behavior, only when explicitly > >>> requested. > >>> > >>> Peter > >> > >> I accepted your changes as such, the .mcz's were copied over to the main > >> repositories. This is a pure & clean extension, so that is good. Thank you. > >> > >> This option is always going to be slower, but the current implementation > >> might be improved, I think. > >> > >> The key test in #writeOptionalQuotedField: > >> > >> { > >> lineEnd asString. > >> separator asString. > >> '"' } anySatisfy: [ :each | string includesSubstring: each ] > >> > >> will be quite slow. > >> > >> If we accept a little bit of (over safe) error on EOL and use any > >> occurrence of CR or LF as needing a quote, we could switch to characters > >> to test for. There exists a fast (primitive) test, > >> #findFirstInString:inSet:startingAt: that can do all the testing in one > >> go. If your version turns out to be slow, we could try that, if > >> measurements show a difference. > >> > >> Regards, > >> > >> Sven > >> > >> > > > >