Alex Tweedly wrote:

At 00:25 01/07/2004 -0700, Richard Gaskin wrote:
My post from 14 June 2002 with my own CSV2Tab function is at
<http://lists.runrev.com/pipermail/metacard/2002-June/001767.html>.

Hats off to anyone who can improve it's speed, and a bottle of
12-year-old single malt to anyone who can come up with an algorithm I
can use which is at least twice as fast.

Now there's a challenge I can relate to :-)

BUT - the speed of the conversion depends on the data ...
Enclosed below is a version of the script which is between 10% and 90%
faster - and probably has potential to go even faster than that.

It uses the same set up as you did - so there are no quotes left
except those around fields.

Then instead of walking through the data char by char, it use
"split()" to divide into an array; the array elements must then
alternate between in-quotes and not-in-quotes.

Each array element has only the relevant processing applied.

Great stuff. Using split is a ingenious way to reduce the load.

If CSV were consistently implemented CSV2TabNew would work excellently right out of the box, but since some CSVs escape quotes by doubling them I needed to add one line (see below) to also substitute doubled quote chars with the quote placeholder.

Bonus: since the added line reduces the number of quote characters, the function is now even faster.

I ran 1000 iterations of all three algorithms on a small test file which uses the Excel escaping format of doubled quotes (I believe it's MS Access that uses slash-quote to escape, if memory serves). Average times on my machine (G4 PowerBook) are roughly:

CSV2Tab:    438ms
CSV2TabNew: 369ms
CSV2Tab3:   195ms

On a slightly more complex example the times were:

CSV2Tab:    892ms
CSV2TabNew: 333ms
CSV2Tab3:   153ms


Here's CSV2Tab3:


function CSV2Tab3 pData local tNuData -- contains tabbed copy of data local tReturnPlaceholder -- replaces cr in field data to avoid line -- breaks which would be misread as records; -- replaced later during dislay local tEscapedQuotePlaceholder -- used for keeping track of quotes -- in data local tInQuotedText -- flag set while reading data between quotes -- put numtochar(11) into tReturnPlaceholder -- vertical tab as -- placeholder put numtochar(2) into tEscapedQuotePlaceholder -- used to simplify -- distinction between quotes in data and those -- used in delimiters -- -- Normalize line endings: replace crlf with cr in pData -- Win to UNIX replace numtochar(13) with cr in pData -- Mac to UNIX -- -- Put placeholder in escaped quote (non-delimiter) chars: replace ("\"&quote) with tEscapedQuotePlaceholder in pData replace quote&quote with tEscapedQuotePlaceholder in pData --<NEW -- put space before pData -- to avoid ambiguity of starting context split pData by quote put False into tInsideQuoted repeat for each element k in pData if (tInsideQuoted) then replace cr with tReturnPlaceholder in k put k after tNuData put False into tInsideQuoted else replace comma with tab in k put k after tNuData put true into tInsideQuoted end if end repeat -- delete char 1 of tNuData -- remove the leading space replace tEscapedQuotePlaceholder with quote in tNuData return tNuData end CSV2Tab3


Please send me a private email and we'll make arrangements for the scotch delivery. Thanks for the assist.


--
 Richard Gaskin
 Fourth World Media Corporation
 ___________________________________________________
 Rev tools and more:  http://www.fourthworld.com/rev




_______________________________________________ use-revolution mailing list [EMAIL PROTECTED] http://lists.runrev.com/mailman/listinfo/use-revolution

Reply via email to