Re: [Pharo-users] NeoCSV change proposal
Thank you Sven! Peter On Sun, Jul 23, 2017 at 01:59:41PM +0200, Sven Van Caekenberghe wrote: > > > On 23 Jul 2017, at 09:55, Peter Uhnak wrote: > > > > Ah, ByteArrayMap, I was trying ByteArray. > > ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as > an inclusion map for characters that fit a byte. > > BTW, using CharacterSet as an abstraction makes the trick easier. > > I committed: > > === > Name: Neo-CSV-Core-SvenVanCaekenberghe.26 > Author: SvenVanCaekenberghe > Time: 23 July 2017, 1:54:22.095185 pm > UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c > Ancestors: Neo-CSV-Core-PeterUhnak.25 > > make the core test in #writeOptionalQuotedField faster by using a primitive > when possible > > add #testWideOptionalQuoted > === > Name: Neo-CSV-Tests-SvenVanCaekenberghe.22 > Author: SvenVanCaekenberghe > Time: 23 July 2017, 1:54:52.133003 pm > UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c > Ancestors: Neo-CSV-Tests-PeterUhnak.21 > > make the core test in #writeOptionalQuotedField faster by using a primitive > when possible > > add #testWideOptionalQuoted > === > > Extra caution was needed for WideStrings. > > Sven > > > Not sure what the next step is here: should I add it and send you mczs, or > > will you do it yourself? (it is a simple change). > > > > Peter > > > > On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote: > >> Peter, > >> > >>> On 22 Jul 2017, at 22:27, Peter Uhnak wrote: > >>> > >>> Hi Sven, > >>> > >>> my use case was hand-edited CSVs (therefore the quotes are extra > >>> clutter), which imples that I would be hand-viewing/editing only small > >>> CSVs (no performance issues). > >>> > >>> I agree that we should err on the safe side with CR & LF (e.g. tools may > >>> sometimes autoconvert line endings). > >>> > >>> Regarding performance: > >>> > >>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, > >>> or I don't know how to use), so I've trried with inCharacterSet: > >> > >> Yes, it is a bitch to use, the the version you used does not use the > >> primitive. > >> > >> Try playing with this: > >> > >> s10 := 'a' repeat: 10. > >> s100 := 'a' repeat: 100. > >> > >> searchSet := ByteArray new: 256 withAll: 0. > >> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ]. > >> > >> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap. > >> > >> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench. > >> "15,160,161 per second" > >> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] > >> bench. > >> "8,219,081 per second" > >> > >> ByteString findFirstInString: ',"', String crlf inSet: searchSet > >> startingAt: 1. > >> > >> Sven > >> > >>> Tested on worst-case scenario - strings don't contain tested symbols. > >>> > >>> s10 := 'a' repeat: 10. > >>> s100 := 'a' repeat: 100. > >>> > >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 > >>> includesSubstring: each ] ] bench. "'1,200,046 per second'" > >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 > >>> includesSubstring: each ] ] bench. "'495,482 per second'" > >>> > >>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. > >>> "'2,819,416 per second'" > >>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. > >>> "'2,200,668 per second'" > >>> > >>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf > >>> startingAt: 1 ] bench. "'1,187,324 per second'" > >>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf > >>> startingAt: 1 ] bench. "'165,526 per second'" > >>> > >>> > >>> #includesAny: seems to be the best by far. > >>> > >>> Storing the tested characters didn't improve it by much. > >>> > >>> Peter > >>> > >>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: > Hi Peter, > > > On 22 Jul 2017, at 14:12, Peter Uhnak wrote: > > > > Hi, > > > > this is a continuation of an older thread about quoting fields only > > when necessary. ( > > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > > > > I've attached fileouts of NeoCSV packages with the addition (I don't > > know if Metacello can file-out only changesets). > > > > The change doesn't affect the default behavior, only when explicitly > > requested. > > > > Peter > > I accepted your changes as such, the .mcz's were copied over to the main > repositories. This is a pure & clean extension, so that is good. Thank > you. > > This option is always going to be slower, but the current implementation > might be improved, I think. > > The key test in #writeOptionalQuotedField: > > { > lineEnd asString. > separator asString. > '"' } anySatisfy: [ :each | string includesSubstring: each ] > > will be quite s
Re: [Pharo-users] NeoCSV change proposal
> On 23 Jul 2017, at 09:55, Peter Uhnak wrote: > > Ah, ByteArrayMap, I was trying ByteArray. ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as an inclusion map for characters that fit a byte. BTW, using CharacterSet as an abstraction makes the trick easier. I committed: === Name: Neo-CSV-Core-SvenVanCaekenberghe.26 Author: SvenVanCaekenberghe Time: 23 July 2017, 1:54:22.095185 pm UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c Ancestors: Neo-CSV-Core-PeterUhnak.25 make the core test in #writeOptionalQuotedField faster by using a primitive when possible add #testWideOptionalQuoted === Name: Neo-CSV-Tests-SvenVanCaekenberghe.22 Author: SvenVanCaekenberghe Time: 23 July 2017, 1:54:52.133003 pm UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c Ancestors: Neo-CSV-Tests-PeterUhnak.21 make the core test in #writeOptionalQuotedField faster by using a primitive when possible add #testWideOptionalQuoted === Extra caution was needed for WideStrings. Sven > Not sure what the next step is here: should I add it and send you mczs, or > will you do it yourself? (it is a simple change). > > Peter > > On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote: >> Peter, >> >>> On 22 Jul 2017, at 22:27, Peter Uhnak wrote: >>> >>> Hi Sven, >>> >>> my use case was hand-edited CSVs (therefore the quotes are extra clutter), >>> which imples that I would be hand-viewing/editing only small CSVs (no >>> performance issues). >>> >>> I agree that we should err on the safe side with CR & LF (e.g. tools may >>> sometimes autoconvert line endings). >>> >>> Regarding performance: >>> >>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, >>> or I don't know how to use), so I've trried with inCharacterSet: >> >> Yes, it is a bitch to use, the the version you used does not use the >> primitive. >> >> Try playing with this: >> >> s10 := 'a' repeat: 10. >> s100 := 'a' repeat: 100. >> >> searchSet := ByteArray new: 256 withAll: 0. >> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ]. >> >> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap. >> >> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench. >> "15,160,161 per second" >> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench. >> "8,219,081 per second" >> >> ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: >> 1. >> >> Sven >> >>> Tested on worst-case scenario - strings don't contain tested symbols. >>> >>> s10 := 'a' repeat: 10. >>> s100 := 'a' repeat: 100. >>> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 >>> includesSubstring: each ] ] bench. "'1,200,046 per second'" >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 >>> includesSubstring: each ] ] bench. "'495,482 per second'" >>> >>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. >>> "'2,819,416 per second'" >>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. >>> "'2,200,668 per second'" >>> >>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf >>> startingAt: 1 ] bench. "'1,187,324 per second'" >>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf >>> startingAt: 1 ] bench. "'165,526 per second'" >>> >>> >>> #includesAny: seems to be the best by far. >>> >>> Storing the tested characters didn't improve it by much. >>> >>> Peter >>> >>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: Hi Peter, > On 22 Jul 2017, at 14:12, Peter Uhnak wrote: > > Hi, > > this is a continuation of an older thread about quoting fields only when > necessary. ( > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > > I've attached fileouts of NeoCSV packages with the addition (I don't know > if Metacello can file-out only changesets). > > The change doesn't affect the default behavior, only when explicitly > requested. > > Peter I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you. This option is always going to be slower, but the current implementation might be improved, I think. The key test in #writeOptionalQuotedField: { lineEnd asString. separator asString. '"' } anySatisfy: [ :each | string includesSubstring: each ] will be quite slow. If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference
Re: [Pharo-users] NeoCSV change proposal
thanks for the explanation. this is cool. On Sat, Jul 22, 2017 at 10:06 PM, Sven Van Caekenberghe wrote: > >> On 22 Jul 2017, at 21:51, Stephane Ducasse wrote: >> >> Hi peter >> Do you have an example that I understand? > > In its most simple form, CSV looks like this: > > a,b,c,d > > However, if a field is text, it might contain a comma itself, or other > problematic characters like line endings, hence there is a provision for > quoting fields, as follows: > > "a","b","c",d > > Now, "a" could be "a1,a2" and still be one field. Embedded quotes are doubled. > > Not all fields need quoting, normally only text or string fields. If you know > something is an integer, you don't need quotes, like d above. > > Since CSV is normally a collection of identical records, you set these > options for all of them. > > What Peter wanted, and now implemented himself as an option, is to make the > quoting optional per field, per record, by first checking if a field contains > dangerous characters and only then adding the quotes. I guess this makes most > sense in the context of generating a CSV file for humans to edit. > >> @Svn Once the change is integrated could you add a little paragraph to >> the chapter? >> >> Stef >> >> On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe wrote: >>> Hi Peter, >>> On 22 Jul 2017, at 14:12, Peter Uhnak wrote: Hi, this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets). The change doesn't affect the default behavior, only when explicitly requested. Peter >>> >>> I accepted your changes as such, the .mcz's were copied over to the main >>> repositories. This is a pure & clean extension, so that is good. Thank you. >>> >>> This option is always going to be slower, but the current implementation >>> might be improved, I think. >>> >>> The key test in #writeOptionalQuotedField: >>> >>> { >>> lineEnd asString. >>> separator asString. >>> '"' } anySatisfy: [ :each | string includesSubstring: each ] >>> >>> will be quite slow. >>> >>> If we accept a little bit of (over safe) error on EOL and use any >>> occurrence of CR or LF as needing a quote, we could switch to characters to >>> test for. There exists a fast (primitive) test, >>> #findFirstInString:inSet:startingAt: that can do all the testing in one go. >>> If your version turns out to be slow, we could try that, if measurements >>> show a difference. >>> >>> Regards, >>> >>> Sven >>> >>> >> > >
Re: [Pharo-users] NeoCSV change proposal
Ah, ByteArrayMap, I was trying ByteArray. Not sure what the next step is here: should I add it and send you mczs, or will you do it yourself? (it is a simple change). Peter On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote: > Peter, > > > On 22 Jul 2017, at 22:27, Peter Uhnak wrote: > > > > Hi Sven, > > > > my use case was hand-edited CSVs (therefore the quotes are extra clutter), > > which imples that I would be hand-viewing/editing only small CSVs (no > > performance issues). > > > > I agree that we should err on the safe side with CR & LF (e.g. tools may > > sometimes autoconvert line endings). > > > > Regarding performance: > > > > #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, > > or I don't know how to use), so I've trried with inCharacterSet: > > Yes, it is a bitch to use, the the version you used does not use the > primitive. > > Try playing with this: > > s10 := 'a' repeat: 10. > s100 := 'a' repeat: 100. > > searchSet := ByteArray new: 256 withAll: 0. > ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ]. > > searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap. > > [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench. > "15,160,161 per second" > [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench. > "8,219,081 per second" > > ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: > 1. > > Sven > > > Tested on worst-case scenario - strings don't contain tested symbols. > > > > s10 := 'a' repeat: 10. > > s100 := 'a' repeat: 100. > > > > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 > > includesSubstring: each ] ] bench. "'1,200,046 per second'" > > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 > > includesSubstring: each ] ] bench. "'495,482 per second'" > > > > [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. > > "'2,819,416 per second'" > > [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. > > "'2,200,668 per second'" > > > > [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf > > startingAt: 1 ] bench. "'1,187,324 per second'" > > [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf > > startingAt: 1 ] bench. "'165,526 per second'" > > > > > > #includesAny: seems to be the best by far. > > > > Storing the tested characters didn't improve it by much. > > > > Peter > > > > On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: > >> Hi Peter, > >> > >>> On 22 Jul 2017, at 14:12, Peter Uhnak wrote: > >>> > >>> Hi, > >>> > >>> this is a continuation of an older thread about quoting fields only when > >>> necessary. ( > >>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > >>> > >>> I've attached fileouts of NeoCSV packages with the addition (I don't know > >>> if Metacello can file-out only changesets). > >>> > >>> The change doesn't affect the default behavior, only when explicitly > >>> requested. > >>> > >>> Peter > >> > >> I accepted your changes as such, the .mcz's were copied over to the main > >> repositories. This is a pure & clean extension, so that is good. Thank you. > >> > >> This option is always going to be slower, but the current implementation > >> might be improved, I think. > >> > >> The key test in #writeOptionalQuotedField: > >> > >> { > >> lineEnd asString. > >> separator asString. > >> '"' } anySatisfy: [ :each | string includesSubstring: each ] > >> > >> will be quite slow. > >> > >> If we accept a little bit of (over safe) error on EOL and use any > >> occurrence of CR or LF as needing a quote, we could switch to characters > >> to test for. There exists a fast (primitive) test, > >> #findFirstInString:inSet:startingAt: that can do all the testing in one > >> go. If your version turns out to be slow, we could try that, if > >> measurements show a difference. > >> > >> Regards, > >> > >> Sven > >> > >> > > > >
Re: [Pharo-users] NeoCSV change proposal
Peter, > On 22 Jul 2017, at 22:27, Peter Uhnak wrote: > > Hi Sven, > > my use case was hand-edited CSVs (therefore the quotes are extra clutter), > which imples that I would be hand-viewing/editing only small CSVs (no > performance issues). > > I agree that we should err on the safe side with CR & LF (e.g. tools may > sometimes autoconvert line endings). > > Regarding performance: > > #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or > I don't know how to use), so I've trried with inCharacterSet: Yes, it is a bitch to use, the the version you used does not use the primitive. Try playing with this: s10 := 'a' repeat: 10. s100 := 'a' repeat: 100. searchSet := ByteArray new: 256 withAll: 0. ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ]. searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap. [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench. "15,160,161 per second" [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench. "8,219,081 per second" ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 1. Sven > Tested on worst-case scenario - strings don't contain tested symbols. > > s10 := 'a' repeat: 10. > s100 := 'a' repeat: 100. > > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 > includesSubstring: each ] ] bench. "'1,200,046 per second'" > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 > includesSubstring: each ] ] bench. "'495,482 per second'" > > [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. > "'2,819,416 per second'" > [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. > "'2,200,668 per second'" > > [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf > startingAt: 1 ] bench. "'1,187,324 per second'" > [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf > startingAt: 1 ] bench. "'165,526 per second'" > > > #includesAny: seems to be the best by far. > > Storing the tested characters didn't improve it by much. > > Peter > > On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: >> Hi Peter, >> >>> On 22 Jul 2017, at 14:12, Peter Uhnak wrote: >>> >>> Hi, >>> >>> this is a continuation of an older thread about quoting fields only when >>> necessary. ( >>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) >>> >>> I've attached fileouts of NeoCSV packages with the addition (I don't know >>> if Metacello can file-out only changesets). >>> >>> The change doesn't affect the default behavior, only when explicitly >>> requested. >>> >>> Peter >> >> I accepted your changes as such, the .mcz's were copied over to the main >> repositories. This is a pure & clean extension, so that is good. Thank you. >> >> This option is always going to be slower, but the current implementation >> might be improved, I think. >> >> The key test in #writeOptionalQuotedField: >> >> { >> lineEnd asString. >> separator asString. >> '"' } anySatisfy: [ :each | string includesSubstring: each ] >> >> will be quite slow. >> >> If we accept a little bit of (over safe) error on EOL and use any occurrence >> of CR or LF as needing a quote, we could switch to characters to test for. >> There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: >> that can do all the testing in one go. If your version turns out to be slow, >> we could try that, if measurements show a difference. >> >> Regards, >> >> Sven >> >> >
Re: [Pharo-users] NeoCSV change proposal
Hi Sven, my use case was hand-edited CSVs (therefore the quotes are extra clutter), which imples that I would be hand-viewing/editing only small CSVs (no performance issues). I agree that we should err on the safe side with CR & LF (e.g. tools may sometimes autoconvert line endings). Regarding performance: #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I don't know how to use), so I've trried with inCharacterSet: Tested on worst-case scenario - strings don't contain tested symbols. s10 := 'a' repeat: 10. s100 := 'a' repeat: 100. [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 includesSubstring: each ] ] bench. "'1,200,046 per second'" [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 includesSubstring: each ] ] bench. "'495,482 per second'" [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 per second'" [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 per second'" [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'1,187,324 per second'" [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf startingAt: 1 ] bench. "'165,526 per second'" #includesAny: seems to be the best by far. Storing the tested characters didn't improve it by much. Peter On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote: > Hi Peter, > > > On 22 Jul 2017, at 14:12, Peter Uhnak wrote: > > > > Hi, > > > > this is a continuation of an older thread about quoting fields only when > > necessary. ( > > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > > > > I've attached fileouts of NeoCSV packages with the addition (I don't know > > if Metacello can file-out only changesets). > > > > The change doesn't affect the default behavior, only when explicitly > > requested. > > > > Peter > > I accepted your changes as such, the .mcz's were copied over to the main > repositories. This is a pure & clean extension, so that is good. Thank you. > > This option is always going to be slower, but the current implementation > might be improved, I think. > > The key test in #writeOptionalQuotedField: > > { > lineEnd asString. > separator asString. > '"' } anySatisfy: [ :each | string includesSubstring: each ] > > will be quite slow. > > If we accept a little bit of (over safe) error on EOL and use any occurrence > of CR or LF as needing a quote, we could switch to characters to test for. > There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: > that can do all the testing in one go. If your version turns out to be slow, > we could try that, if measurements show a difference. > > Regards, > > Sven > >
Re: [Pharo-users] NeoCSV change proposal
> On 22 Jul 2017, at 21:51, Stephane Ducasse wrote: > > Hi peter > Do you have an example that I understand? In its most simple form, CSV looks like this: a,b,c,d However, if a field is text, it might contain a comma itself, or other problematic characters like line endings, hence there is a provision for quoting fields, as follows: "a","b","c",d Now, "a" could be "a1,a2" and still be one field. Embedded quotes are doubled. Not all fields need quoting, normally only text or string fields. If you know something is an integer, you don't need quotes, like d above. Since CSV is normally a collection of identical records, you set these options for all of them. What Peter wanted, and now implemented himself as an option, is to make the quoting optional per field, per record, by first checking if a field contains dangerous characters and only then adding the quotes. I guess this makes most sense in the context of generating a CSV file for humans to edit. > @Svn Once the change is integrated could you add a little paragraph to > the chapter? > > Stef > > On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe wrote: >> Hi Peter, >> >>> On 22 Jul 2017, at 14:12, Peter Uhnak wrote: >>> >>> Hi, >>> >>> this is a continuation of an older thread about quoting fields only when >>> necessary. ( >>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) >>> >>> I've attached fileouts of NeoCSV packages with the addition (I don't know >>> if Metacello can file-out only changesets). >>> >>> The change doesn't affect the default behavior, only when explicitly >>> requested. >>> >>> Peter >> >> I accepted your changes as such, the .mcz's were copied over to the main >> repositories. This is a pure & clean extension, so that is good. Thank you. >> >> This option is always going to be slower, but the current implementation >> might be improved, I think. >> >> The key test in #writeOptionalQuotedField: >> >> { >> lineEnd asString. >> separator asString. >> '"' } anySatisfy: [ :each | string includesSubstring: each ] >> >> will be quite slow. >> >> If we accept a little bit of (over safe) error on EOL and use any occurrence >> of CR or LF as needing a quote, we could switch to characters to test for. >> There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: >> that can do all the testing in one go. If your version turns out to be slow, >> we could try that, if measurements show a difference. >> >> Regards, >> >> Sven >> >> >
Re: [Pharo-users] NeoCSV change proposal
Hi peter Do you have an example that I understand? @Svn Once the change is integrated could you add a little paragraph to the chapter? Stef On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe wrote: > Hi Peter, > >> On 22 Jul 2017, at 14:12, Peter Uhnak wrote: >> >> Hi, >> >> this is a continuation of an older thread about quoting fields only when >> necessary. ( >> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) >> >> I've attached fileouts of NeoCSV packages with the addition (I don't know if >> Metacello can file-out only changesets). >> >> The change doesn't affect the default behavior, only when explicitly >> requested. >> >> Peter > > I accepted your changes as such, the .mcz's were copied over to the main > repositories. This is a pure & clean extension, so that is good. Thank you. > > This option is always going to be slower, but the current implementation > might be improved, I think. > > The key test in #writeOptionalQuotedField: > > { > lineEnd asString. > separator asString. > '"' } anySatisfy: [ :each | string includesSubstring: each ] > > will be quite slow. > > If we accept a little bit of (over safe) error on EOL and use any occurrence > of CR or LF as needing a quote, we could switch to characters to test for. > There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: > that can do all the testing in one go. If your version turns out to be slow, > we could try that, if measurements show a difference. > > Regards, > > Sven > >
Re: [Pharo-users] NeoCSV change proposal
Hi Peter, > On 22 Jul 2017, at 14:12, Peter Uhnak wrote: > > Hi, > > this is a continuation of an older thread about quoting fields only when > necessary. ( > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > > I've attached fileouts of NeoCSV packages with the addition (I don't know if > Metacello can file-out only changesets). > > The change doesn't affect the default behavior, only when explicitly > requested. > > Peter I accepted your changes as such, the .mcz's were copied over to the main repositories. This is a pure & clean extension, so that is good. Thank you. This option is always going to be slower, but the current implementation might be improved, I think. The key test in #writeOptionalQuotedField: { lineEnd asString. separator asString. '"' } anySatisfy: [ :each | string includesSubstring: each ] will be quite slow. If we accept a little bit of (over safe) error on EOL and use any occurrence of CR or LF as needing a quote, we could switch to characters to test for. There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can do all the testing in one go. If your version turns out to be slow, we could try that, if measurements show a difference. Regards, Sven
Re: [Pharo-users] NeoCSV change proposal
Peter, > On 22 Jul 2017, at 14:12, Peter Uhnak wrote: > > Hi, > > this is a continuation of an older thread about quoting fields only when > necessary. ( > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > > I've attached fileouts of NeoCSV packages with the addition (I don't know if > Metacello can file-out only changesets). > > The change doesn't affect the default behavior, only when explicitly > requested. > > Peter Could you email me your .mcz packages (saved in the cache) ? That is much easier to work with. Thx, Sven
Re: [Pharo-users] NeoCSV change proposal
attached On Sat, Jul 22, 2017 at 02:12:10PM +0200, Peter Uhnak wrote: > Hi, > > this is a continuation of an older thread about quoting fields only when > necessary. ( > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) > > I've attached fileouts of NeoCSV packages with the addition (I don't know if > Metacello can file-out only changesets). > > The change doesn't affect the default behavior, only when explicitly > requested. > > Peter Object subclass: #NeoCSVReader instanceVariableNames: 'readStream charBuffer separator stringStream fieldCount recordClass recordClassIsIndexable fieldAccessors emptyFieldValue' classVariableNames: '' poolDictionaries: '' category: 'Neo-CSV-Core'! !NeoCSVReader commentStamp: '' prior: 0! I am NeoCSVReader. I read a format that - is text based (ASCII, Latin1, Unicode) - consists of records, 1 per line (any line ending convention) - where records consist of fields separated by a delimiter (comma, tab, semicolon) - where every record has the same number of fields - where fields can be quoted should they contain separators or line endings Without further configuration, records will become Arrays of Strings. By specifiying a recordClass and fields with optional converters most objects can be read and instanciated correctly. MIT License. ! !NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 5/13/2014 15:34'! readHeader "Read a record, presumably a header and return the header field names. This should normally be called only at the beginning and only once. This sets the fieldCount (but fieldAccessors overrides fieldCount)." | names | names := Array streamContents: [ :out | [ self atEnd or: [ self readEndOfLine ] ] whileFalse: [ out nextPut: self readField. (self readSeparator and: [ self atEnd or: [ self peekEndOfLine ] ]) ifTrue: [ out nextPut: emptyFieldValue ] ] ]. fieldCount := names size. ^ names! ! !NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 10/6/2014 17:34'! select: filter "Read and collect records that satisfy filter into an Array until there are none left. Return the array." ^ Array streamContents: [ :stream | self select: filter thenDo: [ :each | stream nextPut: each ] ]! ! !NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 6/21/2012 22:30'! next "Read the next record. I will return an instance of recordClass." ^ recordClassIsIndexable ifTrue: [ self readNextRecordAsArray ] ifFalse: [ self readNextRecordAsObject ]! ! !NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 10/6/2014 17:33'! select: filter thenDo: block "Execute block for each record that satisfies filter until I am at end." [ self atEnd ] whileFalse: [ | record | record := self next. (filter value: record) ifTrue: [ block value: record ] ]! ! !NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 6/25/2012 14:45'! upToEnd "Read and collect records into an Array until there are none left. Return the array." ^ Array streamContents: [ :stream | self do: [ :each | stream nextPut: each ] ]! ! !NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 6/25/2012 14:45'! do: block "Execute block for each record until I am at end." [ self atEnd ] whileFalse: [ block value: self next ]! ! !NeoCSVReader methodsFor: 'private - reading' stamp: 'SvenVanCaekenberghe 6/14/2012 17:24'! readField ^ self peekQuote ifTrue: [ self readQuotedField ] ifFalse: [ self readUnquotedField ]! ! !NeoCSVReader methodsFor: 'private - reading' stamp: 'SvenVanCaekenberghe 1/15/2014 09:55'! readNextRecord | record | record := recordClass new: fieldCount. fieldAccessors ifNil: [ self readNextRecordWithoutFieldAccessors: record ] ifNotNil: [ self readNextRecordWithFieldAccessors: record ]. self readAtEndOrEndOfLine. ^ record! ! !NeoCSVReader methodsFor: 'private - reading' stamp: 'SvenVanCaekenberghe 5/13/2014 15:35'! readNextRecordAsObject | object | object := recordClass new. fieldAccessors do: [ :each | | rawValue | rawValue := self readFieldAndSeparator. "Note that empty/missing fields are not set" (rawValue = emptyFieldValue or: [ each isNil
[Pharo-users] NeoCSV change proposal
Hi, this is a continuation of an older thread about quoting fields only when necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html ) I've attached fileouts of NeoCSV packages with the addition (I don't know if Metacello can file-out only changesets). The change doesn't affect the default behavior, only when explicitly requested. Peter