Re: [Pharo-users] NeoCSV change proposal

2017-07-23 Thread Peter Uhnak
Thank you Sven!

Peter

On Sun, Jul 23, 2017 at 01:59:41PM +0200, Sven Van Caekenberghe wrote:
> 
> > On 23 Jul 2017, at 09:55, Peter Uhnak  wrote:
> > 
> > Ah, ByteArrayMap, I was trying ByteArray.
> 
> ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as 
> an inclusion map for characters that fit a byte.
> 
> BTW, using CharacterSet as an abstraction makes the trick easier.
> 
> I committed:
> 
> ===
> Name: Neo-CSV-Core-SvenVanCaekenberghe.26
> Author: SvenVanCaekenberghe
> Time: 23 July 2017, 1:54:22.095185 pm
> UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c
> Ancestors: Neo-CSV-Core-PeterUhnak.25
> 
> make the core test in #writeOptionalQuotedField faster by using a primitive 
> when possible
> 
> add #testWideOptionalQuoted
> ===
> Name: Neo-CSV-Tests-SvenVanCaekenberghe.22
> Author: SvenVanCaekenberghe
> Time: 23 July 2017, 1:54:52.133003 pm
> UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c
> Ancestors: Neo-CSV-Tests-PeterUhnak.21
> 
> make the core test in #writeOptionalQuotedField faster by using a primitive 
> when possible
> 
> add #testWideOptionalQuoted
> ===
> 
> Extra caution was needed for WideStrings.
> 
> Sven
> 
> > Not sure what the next step is here: should I add it and send you mczs, or 
> > will you do it yourself? (it is a simple change).
> > 
> > Peter
> > 
> > On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:
> >> Peter,
> >> 
> >>> On 22 Jul 2017, at 22:27, Peter Uhnak  wrote:
> >>> 
> >>> Hi Sven,
> >>> 
> >>> my use case was hand-edited CSVs (therefore the quotes are extra 
> >>> clutter), which imples that I would be hand-viewing/editing only small 
> >>> CSVs (no performance issues).
> >>> 
> >>> I agree that we should err on the safe side with CR & LF (e.g. tools may 
> >>> sometimes autoconvert line endings).
> >>> 
> >>> Regarding performance:
> >>> 
> >>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, 
> >>> or I don't know how to use), so I've trried with inCharacterSet:
> >> 
> >> Yes, it is a bitch to use, the the version you used does not use the 
> >> primitive.
> >> 
> >> Try playing with this:
> >> 
> >> s10 := 'a' repeat: 10.
> >> s100 := 'a' repeat: 100.
> >> 
> >> searchSet := ByteArray new: 256 withAll: 0.
> >> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
> >> 
> >> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
> >> 
> >> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
> >> "15,160,161 per second"
> >> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] 
> >> bench.
> >> "8,219,081 per second"
> >> 
> >> ByteString findFirstInString: ',"', String crlf inSet: searchSet 
> >> startingAt: 1.
> >> 
> >> Sven
> >> 
> >>> Tested on worst-case scenario - strings don't contain tested symbols.
> >>> 
> >>> s10 := 'a' repeat: 10.
> >>> s100 := 'a' repeat: 100.
> >>> 
> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 
> >>> includesSubstring: each ] ] bench. "'1,200,046 per second'"
> >>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 
> >>> includesSubstring: each ] ] bench. "'495,482 per second'"
> >>> 
> >>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> >>> "'2,819,416 per second'"
> >>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> >>> "'2,200,668 per second'"
> >>> 
> >>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf 
> >>> startingAt: 1 ] bench. "'1,187,324 per second'"
> >>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf 
> >>> startingAt: 1 ] bench. "'165,526 per second'"
> >>> 
> >>> 
> >>> #includesAny: seems to be the best by far.
> >>> 
> >>> Storing the tested characters didn't improve it by much.
> >>> 
> >>> Peter
> >>> 
> >>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
>  Hi Peter,
>  
> > On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
> > 
> > Hi,
> > 
> > this is a continuation of an older thread about quoting fields only 
> > when necessary. ( 
> > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> > 
> > I've attached fileouts of NeoCSV packages with the addition (I don't 
> > know if Metacello can file-out only changesets).
> > 
> > The change doesn't affect the default behavior, only when explicitly 
> > requested.
> > 
> > Peter
>  
>  I accepted your changes as such, the .mcz's were copied over to the main 
>  repositories. This is a pure & clean extension, so that is good. Thank 
>  you.
>  
>  This option is always going to be slower, but the current implementation 
>  might be improved, I think.
>  
>  The key test in #writeOptionalQuotedField:
>  
>  {
>  lineEnd asString.
>  separator asString.
>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>  
>  will be quite s

Re: [Pharo-users] NeoCSV change proposal

2017-07-23 Thread Sven Van Caekenberghe

> On 23 Jul 2017, at 09:55, Peter Uhnak  wrote:
> 
> Ah, ByteArrayMap, I was trying ByteArray.

ByteArrayMap is not a class, it is still a ByteArray, but of size 256 used as 
an inclusion map for characters that fit a byte.

BTW, using CharacterSet as an abstraction makes the trick easier.

I committed:

===
Name: Neo-CSV-Core-SvenVanCaekenberghe.26
Author: SvenVanCaekenberghe
Time: 23 July 2017, 1:54:22.095185 pm
UUID: e429f87e-5c11-0d00-a72d-ae5d00db2a8c
Ancestors: Neo-CSV-Core-PeterUhnak.25

make the core test in #writeOptionalQuotedField faster by using a primitive 
when possible

add #testWideOptionalQuoted
===
Name: Neo-CSV-Tests-SvenVanCaekenberghe.22
Author: SvenVanCaekenberghe
Time: 23 July 2017, 1:54:52.133003 pm
UUID: 5f81c280-5c11-0d00-a72e-7cd500db2a8c
Ancestors: Neo-CSV-Tests-PeterUhnak.21

make the core test in #writeOptionalQuotedField faster by using a primitive 
when possible

add #testWideOptionalQuoted
===

Extra caution was needed for WideStrings.

Sven

> Not sure what the next step is here: should I add it and send you mczs, or 
> will you do it yourself? (it is a simple change).
> 
> Peter
> 
> On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:
>> Peter,
>> 
>>> On 22 Jul 2017, at 22:27, Peter Uhnak  wrote:
>>> 
>>> Hi Sven,
>>> 
>>> my use case was hand-edited CSVs (therefore the quotes are extra clutter), 
>>> which imples that I would be hand-viewing/editing only small CSVs (no 
>>> performance issues).
>>> 
>>> I agree that we should err on the safe side with CR & LF (e.g. tools may 
>>> sometimes autoconvert line endings).
>>> 
>>> Regarding performance:
>>> 
>>> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, 
>>> or I don't know how to use), so I've trried with inCharacterSet:
>> 
>> Yes, it is a bitch to use, the the version you used does not use the 
>> primitive.
>> 
>> Try playing with this:
>> 
>> s10 := 'a' repeat: 10.
>> s100 := 'a' repeat: 100.
>> 
>> searchSet := ByteArray new: 256 withAll: 0.
>> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
>> 
>> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
>> 
>> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
>> "15,160,161 per second"
>> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
>> "8,219,081 per second"
>> 
>> ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 
>> 1.
>> 
>> Sven
>> 
>>> Tested on worst-case scenario - strings don't contain tested symbols.
>>> 
>>> s10 := 'a' repeat: 10.
>>> s100 := 'a' repeat: 100.
>>> 
>>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 
>>> includesSubstring: each ] ] bench. "'1,200,046 per second'"
>>> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 
>>> includesSubstring: each ] ] bench. "'495,482 per second'"
>>> 
>>> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. 
>>> "'2,819,416 per second'"
>>> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. 
>>> "'2,200,668 per second'"
>>> 
>>> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf 
>>> startingAt: 1 ] bench. "'1,187,324 per second'"
>>> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf 
>>> startingAt: 1 ] bench. "'165,526 per second'"
>>> 
>>> 
>>> #includesAny: seems to be the best by far.
>>> 
>>> Storing the tested characters didn't improve it by much.
>>> 
>>> Peter
>>> 
>>> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
 Hi Peter,
 
> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
> 
> Hi,
> 
> this is a continuation of an older thread about quoting fields only when 
> necessary. ( 
> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> 
> I've attached fileouts of NeoCSV packages with the addition (I don't know 
> if Metacello can file-out only changesets).
> 
> The change doesn't affect the default behavior, only when explicitly 
> requested.
> 
> Peter
 
 I accepted your changes as such, the .mcz's were copied over to the main 
 repositories. This is a pure & clean extension, so that is good. Thank you.
 
 This option is always going to be slower, but the current implementation 
 might be improved, I think.
 
 The key test in #writeOptionalQuotedField:
 
 {
 lineEnd asString.
 separator asString.
 '"' } anySatisfy: [ :each | string includesSubstring: each ]
 
 will be quite slow. 
 
 If we accept a little bit of (over safe) error on EOL and use any 
 occurrence of CR or LF as needing a quote, we could switch to characters 
 to test for. There exists a fast (primitive) test, 
 #findFirstInString:inSet:startingAt: that can do all the testing in one 
 go. If your version turns out to be slow, we could try that, if 
 measurements show a difference

Re: [Pharo-users] NeoCSV change proposal

2017-07-23 Thread Stephane Ducasse
thanks for the explanation.
this is cool.


On Sat, Jul 22, 2017 at 10:06 PM, Sven Van Caekenberghe  wrote:
>
>> On 22 Jul 2017, at 21:51, Stephane Ducasse  wrote:
>>
>> Hi peter
>> Do you have an example that I understand?
>
> In its most simple form, CSV looks like this:
>
> a,b,c,d
>
> However, if a field is text, it might contain a comma itself, or other 
> problematic characters like line endings, hence there is a provision for 
> quoting fields, as follows:
>
> "a","b","c",d
>
> Now, "a" could be "a1,a2" and still be one field. Embedded quotes are doubled.
>
> Not all fields need quoting, normally only text or string fields. If you know 
> something is an integer, you don't need quotes, like d above.
>
> Since CSV is normally a collection of identical records, you set these 
> options for all of them.
>
> What Peter wanted, and now implemented himself as an option, is to make the 
> quoting optional per field, per record, by first checking if a field contains 
> dangerous characters and only then adding the quotes. I guess this makes most 
> sense in the context of generating a CSV file for humans to edit.
>
>> @Svn Once the change is integrated could you add a little paragraph to
>> the chapter?
>>
>> Stef
>>
>> On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe  wrote:
>>> Hi Peter,
>>>
 On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:

 Hi,

 this is a continuation of an older thread about quoting fields only when 
 necessary. ( 
 http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )

 I've attached fileouts of NeoCSV packages with the addition (I don't know 
 if Metacello can file-out only changesets).

 The change doesn't affect the default behavior, only when explicitly 
 requested.

 Peter
>>>
>>> I accepted your changes as such, the .mcz's were copied over to the main 
>>> repositories. This is a pure & clean extension, so that is good. Thank you.
>>>
>>> This option is always going to be slower, but the current implementation 
>>> might be improved, I think.
>>>
>>> The key test in #writeOptionalQuotedField:
>>>
>>> {
>>>  lineEnd asString.
>>>  separator asString.
>>>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>>>
>>> will be quite slow.
>>>
>>> If we accept a little bit of (over safe) error on EOL and use any 
>>> occurrence of CR or LF as needing a quote, we could switch to characters to 
>>> test for. There exists a fast (primitive) test, 
>>> #findFirstInString:inSet:startingAt: that can do all the testing in one go. 
>>> If your version turns out to be slow, we could try that, if measurements 
>>> show a difference.
>>>
>>> Regards,
>>>
>>> Sven
>>>
>>>
>>
>
>



Re: [Pharo-users] NeoCSV change proposal

2017-07-23 Thread Peter Uhnak
Ah, ByteArrayMap, I was trying ByteArray.

Not sure what the next step is here: should I add it and send you mczs, or will 
you do it yourself? (it is a simple change).

Peter

On Sat, Jul 22, 2017 at 10:51:42PM +0200, Sven Van Caekenberghe wrote:
> Peter,
> 
> > On 22 Jul 2017, at 22:27, Peter Uhnak  wrote:
> > 
> > Hi Sven,
> > 
> > my use case was hand-edited CSVs (therefore the quotes are extra clutter), 
> > which imples that I would be hand-viewing/editing only small CSVs (no 
> > performance issues).
> > 
> > I agree that we should err on the safe side with CR & LF (e.g. tools may 
> > sometimes autoconvert line endings).
> > 
> > Regarding performance:
> > 
> > #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, 
> > or I don't know how to use), so I've trried with inCharacterSet:
> 
> Yes, it is a bitch to use, the the version you used does not use the 
> primitive.
> 
> Try playing with this:
> 
> s10 := 'a' repeat: 10.
> s100 := 'a' repeat: 100.
> 
> searchSet := ByteArray new: 256 withAll: 0.
> ',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].
> 
> searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.
> 
> [ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
>  "15,160,161 per second"
> [ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
>  "8,219,081 per second"
> 
> ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 
> 1.
> 
> Sven
> 
> > Tested on worst-case scenario - strings don't contain tested symbols.
> > 
> > s10 := 'a' repeat: 10.
> > s100 := 'a' repeat: 100.
> > 
> > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 
> > includesSubstring: each ] ] bench. "'1,200,046 per second'"
> > [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 
> > includesSubstring: each ] ] bench. "'495,482 per second'"
> > 
> > [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> > "'2,819,416 per second'"
> > [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> > "'2,200,668 per second'"
> > 
> > [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf 
> > startingAt: 1 ] bench. "'1,187,324 per second'"
> > [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf 
> > startingAt: 1 ] bench. "'165,526 per second'"
> > 
> > 
> > #includesAny: seems to be the best by far.
> > 
> > Storing the tested characters didn't improve it by much.
> > 
> > Peter
> > 
> > On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
> >> Hi Peter,
> >> 
> >>> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
> >>> 
> >>> Hi,
> >>> 
> >>> this is a continuation of an older thread about quoting fields only when 
> >>> necessary. ( 
> >>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> >>> 
> >>> I've attached fileouts of NeoCSV packages with the addition (I don't know 
> >>> if Metacello can file-out only changesets).
> >>> 
> >>> The change doesn't affect the default behavior, only when explicitly 
> >>> requested.
> >>> 
> >>> Peter
> >> 
> >> I accepted your changes as such, the .mcz's were copied over to the main 
> >> repositories. This is a pure & clean extension, so that is good. Thank you.
> >> 
> >> This option is always going to be slower, but the current implementation 
> >> might be improved, I think.
> >> 
> >> The key test in #writeOptionalQuotedField:
> >> 
> >> {
> >>  lineEnd asString.
> >>  separator asString.
> >>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
> >> 
> >> will be quite slow. 
> >> 
> >> If we accept a little bit of (over safe) error on EOL and use any 
> >> occurrence of CR or LF as needing a quote, we could switch to characters 
> >> to test for. There exists a fast (primitive) test, 
> >> #findFirstInString:inSet:startingAt: that can do all the testing in one 
> >> go. If your version turns out to be slow, we could try that, if 
> >> measurements show a difference.
> >> 
> >> Regards,
> >> 
> >> Sven 
> >> 
> >> 
> > 
> 
> 



Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Sven Van Caekenberghe
Peter,

> On 22 Jul 2017, at 22:27, Peter Uhnak  wrote:
> 
> Hi Sven,
> 
> my use case was hand-edited CSVs (therefore the quotes are extra clutter), 
> which imples that I would be hand-viewing/editing only small CSVs (no 
> performance issues).
> 
> I agree that we should err on the safe side with CR & LF (e.g. tools may 
> sometimes autoconvert line endings).
> 
> Regarding performance:
> 
> #findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or 
> I don't know how to use), so I've trried with inCharacterSet:

Yes, it is a bitch to use, the the version you used does not use the primitive.

Try playing with this:

s10 := 'a' repeat: 10.
s100 := 'a' repeat: 100.

searchSet := ByteArray new: 256 withAll: 0.
',"', String crlf do: [ :each | searchSet at: each asInteger + 1 put: 1 ].

searchSet := (CharacterSet newFrom: ',"', String crlf) byteArrayMap.

[ ByteString findFirstInString: s10 inSet: searchSet startingAt: 1 ] bench.
 "15,160,161 per second"
[ ByteString findFirstInString: s100 inSet: searchSet startingAt: 1 ] bench.
 "8,219,081 per second"

ByteString findFirstInString: ',"', String crlf inSet: searchSet startingAt: 1.

Sven

> Tested on worst-case scenario - strings don't contain tested symbols.
> 
> s10 := 'a' repeat: 10.
> s100 := 'a' repeat: 100.
> 
> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 
> includesSubstring: each ] ] bench. "'1,200,046 per second'"
> [ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 
> includesSubstring: each ] ] bench. "'495,482 per second'"
> 
> [ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> "'2,819,416 per second'"
> [ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. 
> "'2,200,668 per second'"
> 
> [ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf 
> startingAt: 1 ] bench. "'1,187,324 per second'"
> [ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf 
> startingAt: 1 ] bench. "'165,526 per second'"
> 
> 
> #includesAny: seems to be the best by far.
> 
> Storing the tested characters didn't improve it by much.
> 
> Peter
> 
> On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
>> Hi Peter,
>> 
>>> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
>>> 
>>> Hi,
>>> 
>>> this is a continuation of an older thread about quoting fields only when 
>>> necessary. ( 
>>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>> 
>>> I've attached fileouts of NeoCSV packages with the addition (I don't know 
>>> if Metacello can file-out only changesets).
>>> 
>>> The change doesn't affect the default behavior, only when explicitly 
>>> requested.
>>> 
>>> Peter
>> 
>> I accepted your changes as such, the .mcz's were copied over to the main 
>> repositories. This is a pure & clean extension, so that is good. Thank you.
>> 
>> This option is always going to be slower, but the current implementation 
>> might be improved, I think.
>> 
>> The key test in #writeOptionalQuotedField:
>> 
>> {
>>  lineEnd asString.
>>  separator asString.
>>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>> 
>> will be quite slow. 
>> 
>> If we accept a little bit of (over safe) error on EOL and use any occurrence 
>> of CR or LF as needing a quote, we could switch to characters to test for. 
>> There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: 
>> that can do all the testing in one go. If your version turns out to be slow, 
>> we could try that, if measurements show a difference.
>> 
>> Regards,
>> 
>> Sven 
>> 
>> 
> 




Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Peter Uhnak
Hi Sven,

my use case was hand-edited CSVs (therefore the quotes are extra clutter), 
which imples that I would be hand-viewing/editing only small CSVs (no 
performance issues).

I agree that we should err on the safe side with CR & LF (e.g. tools may 
sometimes autoconvert line endings).

Regarding performance:

#findFirstInString:inSet:startingAt: didn't work for me (not sure if bug, or I 
don't know how to use), so I've trried with inCharacterSet:

Tested on worst-case scenario - strings don't contain tested symbols.

s10 := 'a' repeat: 10.
s100 := 'a' repeat: 100.

[ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s10 
includesSubstring: each ] ] bench. "'1,200,046 per second'"
[ {'"'. ','. String cr. String lf } anySatisfy: [ :each | s100 
includesSubstring: each ] ] bench. "'495,482 per second'"

[ s10 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,819,416 
per second'"
[ s100 includesAny: { $,. $". Character cr. Character lf } ] bench. "'2,200,668 
per second'"

[ ByteString findFirstInString: s10 inCharacterSet: ',"', String crlf 
startingAt: 1 ] bench. "'1,187,324 per second'"
[ ByteString findFirstInString: s100 inCharacterSet: ',"', String crlf 
startingAt: 1 ] bench. "'165,526 per second'"


#includesAny: seems to be the best by far.

Storing the tested characters didn't improve it by much.

Peter

On Sat, Jul 22, 2017 at 06:51:31PM +0200, Sven Van Caekenberghe wrote:
> Hi Peter,
> 
> > On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
> > 
> > Hi,
> > 
> > this is a continuation of an older thread about quoting fields only when 
> > necessary. ( 
> > http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> > 
> > I've attached fileouts of NeoCSV packages with the addition (I don't know 
> > if Metacello can file-out only changesets).
> > 
> > The change doesn't affect the default behavior, only when explicitly 
> > requested.
> > 
> > Peter
> 
> I accepted your changes as such, the .mcz's were copied over to the main 
> repositories. This is a pure & clean extension, so that is good. Thank you.
> 
> This option is always going to be slower, but the current implementation 
> might be improved, I think.
> 
> The key test in #writeOptionalQuotedField:
> 
> {
>   lineEnd asString.
>   separator asString.
>   '"' } anySatisfy: [ :each | string includesSubstring: each ]
> 
> will be quite slow. 
> 
> If we accept a little bit of (over safe) error on EOL and use any occurrence 
> of CR or LF as needing a quote, we could switch to characters to test for. 
> There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: 
> that can do all the testing in one go. If your version turns out to be slow, 
> we could try that, if measurements show a difference.
> 
> Regards,
> 
> Sven 
> 
> 



Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Sven Van Caekenberghe

> On 22 Jul 2017, at 21:51, Stephane Ducasse  wrote:
> 
> Hi peter
> Do you have an example that I understand?

In its most simple form, CSV looks like this:

a,b,c,d

However, if a field is text, it might contain a comma itself, or other 
problematic characters like line endings, hence there is a provision for 
quoting fields, as follows:

"a","b","c",d

Now, "a" could be "a1,a2" and still be one field. Embedded quotes are doubled.

Not all fields need quoting, normally only text or string fields. If you know 
something is an integer, you don't need quotes, like d above.

Since CSV is normally a collection of identical records, you set these options 
for all of them.

What Peter wanted, and now implemented himself as an option, is to make the 
quoting optional per field, per record, by first checking if a field contains 
dangerous characters and only then adding the quotes. I guess this makes most 
sense in the context of generating a CSV file for humans to edit.

> @Svn Once the change is integrated could you add a little paragraph to
> the chapter?
> 
> Stef
> 
> On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe  wrote:
>> Hi Peter,
>> 
>>> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
>>> 
>>> Hi,
>>> 
>>> this is a continuation of an older thread about quoting fields only when 
>>> necessary. ( 
>>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>> 
>>> I've attached fileouts of NeoCSV packages with the addition (I don't know 
>>> if Metacello can file-out only changesets).
>>> 
>>> The change doesn't affect the default behavior, only when explicitly 
>>> requested.
>>> 
>>> Peter
>> 
>> I accepted your changes as such, the .mcz's were copied over to the main 
>> repositories. This is a pure & clean extension, so that is good. Thank you.
>> 
>> This option is always going to be slower, but the current implementation 
>> might be improved, I think.
>> 
>> The key test in #writeOptionalQuotedField:
>> 
>> {
>>  lineEnd asString.
>>  separator asString.
>>  '"' } anySatisfy: [ :each | string includesSubstring: each ]
>> 
>> will be quite slow.
>> 
>> If we accept a little bit of (over safe) error on EOL and use any occurrence 
>> of CR or LF as needing a quote, we could switch to characters to test for. 
>> There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: 
>> that can do all the testing in one go. If your version turns out to be slow, 
>> we could try that, if measurements show a difference.
>> 
>> Regards,
>> 
>> Sven
>> 
>> 
> 




Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Stephane Ducasse
Hi peter
Do you have an example that I understand?

@Svn Once the change is integrated could you add a little paragraph to
the chapter?

Stef

On Sat, Jul 22, 2017 at 6:51 PM, Sven Van Caekenberghe  wrote:
> Hi Peter,
>
>> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
>>
>> Hi,
>>
>> this is a continuation of an older thread about quoting fields only when 
>> necessary. ( 
>> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
>>
>> I've attached fileouts of NeoCSV packages with the addition (I don't know if 
>> Metacello can file-out only changesets).
>>
>> The change doesn't affect the default behavior, only when explicitly 
>> requested.
>>
>> Peter
>
> I accepted your changes as such, the .mcz's were copied over to the main 
> repositories. This is a pure & clean extension, so that is good. Thank you.
>
> This option is always going to be slower, but the current implementation 
> might be improved, I think.
>
> The key test in #writeOptionalQuotedField:
>
> {
>   lineEnd asString.
>   separator asString.
>   '"' } anySatisfy: [ :each | string includesSubstring: each ]
>
> will be quite slow.
>
> If we accept a little bit of (over safe) error on EOL and use any occurrence 
> of CR or LF as needing a quote, we could switch to characters to test for. 
> There exists a fast (primitive) test, #findFirstInString:inSet:startingAt: 
> that can do all the testing in one go. If your version turns out to be slow, 
> we could try that, if measurements show a difference.
>
> Regards,
>
> Sven
>
>



Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Sven Van Caekenberghe
Hi Peter,

> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
> 
> Hi,
> 
> this is a continuation of an older thread about quoting fields only when 
> necessary. ( 
> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> 
> I've attached fileouts of NeoCSV packages with the addition (I don't know if 
> Metacello can file-out only changesets).
> 
> The change doesn't affect the default behavior, only when explicitly 
> requested.
> 
> Peter

I accepted your changes as such, the .mcz's were copied over to the main 
repositories. This is a pure & clean extension, so that is good. Thank you.

This option is always going to be slower, but the current implementation might 
be improved, I think.

The key test in #writeOptionalQuotedField:

{
  lineEnd asString.
  separator asString.
  '"' } anySatisfy: [ :each | string includesSubstring: each ]

will be quite slow. 

If we accept a little bit of (over safe) error on EOL and use any occurrence of 
CR or LF as needing a quote, we could switch to characters to test for. There 
exists a fast (primitive) test, #findFirstInString:inSet:startingAt: that can 
do all the testing in one go. If your version turns out to be slow, we could 
try that, if measurements show a difference.

Regards,

Sven 




Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Sven Van Caekenberghe
Peter,

> On 22 Jul 2017, at 14:12, Peter Uhnak  wrote:
> 
> Hi,
> 
> this is a continuation of an older thread about quoting fields only when 
> necessary. ( 
> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> 
> I've attached fileouts of NeoCSV packages with the addition (I don't know if 
> Metacello can file-out only changesets).
> 
> The change doesn't affect the default behavior, only when explicitly 
> requested.
> 
> Peter

Could you email me your .mcz packages (saved in the cache) ? That is much 
easier to work with. 

Thx, Sven




Re: [Pharo-users] NeoCSV change proposal

2017-07-22 Thread Peter Uhnak
attached

On Sat, Jul 22, 2017 at 02:12:10PM +0200, Peter Uhnak wrote:
> Hi,
> 
> this is a continuation of an older thread about quoting fields only when 
> necessary. ( 
> http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html )
> 
> I've attached fileouts of NeoCSV packages with the addition (I don't know if 
> Metacello can file-out only changesets).
> 
> The change doesn't affect the default behavior, only when explicitly 
> requested.
> 
> Peter
Object subclass: #NeoCSVReader
instanceVariableNames: 'readStream charBuffer separator stringStream 
fieldCount recordClass recordClassIsIndexable fieldAccessors emptyFieldValue'
classVariableNames: ''
poolDictionaries: ''
category: 'Neo-CSV-Core'!
!NeoCSVReader commentStamp: '' prior: 0!
I am NeoCSVReader.

I read a format that
- is text based (ASCII, Latin1, Unicode)
- consists of records, 1 per line (any line ending convention)
- where records consist of fields separated by a delimiter (comma, tab, 
semicolon)
- where every record has the same number of fields
- where fields can be quoted should they contain separators or line endings

Without further configuration, records will become Arrays of Strings.

By specifiying a recordClass and fields with optional converters most objects 
can be read and instanciated correctly.

MIT License.
!


!NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 5/13/2014 
15:34'!
readHeader
"Read a record, presumably a header and return the header field names.
This should normally be called only at the beginning and only once.
This sets the fieldCount (but fieldAccessors overrides fieldCount)."

| names |
names := Array streamContents: [ :out |
 [ self atEnd or: [ self readEndOfLine ] ]
whileFalse: [ 
out nextPut: self readField.
(self readSeparator and: [ self atEnd or: [ 
self peekEndOfLine ] ])
ifTrue: [ out nextPut: emptyFieldValue 
] ] ].
fieldCount := names size.
^ names! !

!NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 10/6/2014 
17:34'!
select: filter
"Read and collect records that satisfy filter into an Array until there 
are none left.
Return the array."

^ Array streamContents: [ :stream | 
self 
select: filter 
thenDo: [ :each | stream nextPut: each ] ]! !

!NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 6/21/2012 
22:30'!
next
"Read the next record.
I will return an instance of recordClass."

^ recordClassIsIndexable
ifTrue: [ self readNextRecordAsArray ] 
ifFalse: [ self readNextRecordAsObject ]! !

!NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 10/6/2014 
17:33'!
select: filter thenDo: block
"Execute block for each record that satisfies filter until I am at end."

[ self atEnd ]
whileFalse: [ 
| record |
record := self next.
(filter value: record)
ifTrue: [ block value: record ] ]! !

!NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 6/25/2012 
14:45'!
upToEnd 
"Read and collect records into an Array until there are none left.
Return the array."

^ Array streamContents: [ :stream |
self do: [ :each | stream nextPut: each ] ]! !

!NeoCSVReader methodsFor: 'accessing' stamp: 'SvenVanCaekenberghe 6/25/2012 
14:45'!
do: block
"Execute block for each record until I am at end."

[ self atEnd ]
whileFalse: [ 
block value: self next ]! !


!NeoCSVReader methodsFor: 'private - reading' stamp: 'SvenVanCaekenberghe 
6/14/2012 17:24'!
readField
^ self peekQuote
ifTrue: [
self readQuotedField ]
ifFalse: [
self readUnquotedField ]! !

!NeoCSVReader methodsFor: 'private - reading' stamp: 'SvenVanCaekenberghe 
1/15/2014 09:55'!
readNextRecord
| record |
record := recordClass new: fieldCount.
fieldAccessors
ifNil: [ self readNextRecordWithoutFieldAccessors: record ]
ifNotNil: [ self readNextRecordWithFieldAccessors: record ].
self readAtEndOrEndOfLine.
^ record! !

!NeoCSVReader methodsFor: 'private - reading' stamp: 'SvenVanCaekenberghe 
5/13/2014 15:35'!
readNextRecordAsObject
| object |
object := recordClass new.
fieldAccessors do: [ :each | | rawValue |
rawValue := self readFieldAndSeparator.
"Note that empty/missing fields are not set"
(rawValue = emptyFieldValue or: [ each isNil

[Pharo-users] NeoCSV change proposal

2017-07-22 Thread Peter Uhnak
Hi,

this is a continuation of an older thread about quoting fields only when 
necessary. ( http://forum.world.st/NeoCSVWriter-automatic-quotes-td4924781.html 
)

I've attached fileouts of NeoCSV packages with the addition (I don't know if 
Metacello can file-out only changesets).

The change doesn't affect the default behavior, only when explicitly requested.

Peter