On 19/10/2015 02:52, Mike Kerner wrote:
Well, there goes that idea.  There are tutorials right on Git, but it might
be easier if you (and anyone else so not-inclined to Git) post here and
those of us who are at least inclined to try will make do with doing that
work for you.


OK, OK, I know I need to learn Git / github - and I will soon - but just not today. I looked at some of the tutorials, and decided they would take a small amount of time. But, I have between 1/2 and one hour or so to work on my favourite hobby - Livecode - and I'd rather spend it updating my CSV script than learning a tutorial that I probably won't have time to complete.

Yes - your change for the "no trailing CR' case is better than mine - there's no need to test, just change it.

However, I later decided that that wasn't the best approach .... in conjunction with another change.

The various version of the script all have some initial replacements, like

      -- Normalize line endings:
      replace crlf with cr in pData          -- Win to UNIX
      replace numtochar(13) with cr in pData -- Mac to UNIX

I put these in initially because I didn't fully understand how Runtime Revolution handled these (what can I say, I'd only been a RR user for a couple of weeks at the time :-) :-). I now believe that, so long as the data came from sensible place (i.e. a file, or a web site, or a database) and was pulled in in some sensible way (i.e. put URL "file:....") or equivalent, then this is a non-issue. Otherwise, every real script that handled data would have this kind of thing in it - and they don't. So - I think the 'replace' statements can be removed.

Once they are out, then we see that "pData" is a read-only parameter, until we add this extra CR. Since a large part of the initial purpose was to be efficient (in CPU and in memory usage) so we can handle *large* datasets, it would be desirable to keep pData as read-only, hence avoiding both a memory copy and the additional memory used. So instead of adding a CR, we can instead do that by checking just after the loop whether or not the situation exists, and handling it there.

So - given those two ideas, plus the need to parameterize, I upgraded the code to
 - not do the initial replacements
 - be fully parameterized for input delimiters
 - be fully parameterized for TAB or CR characters within quoted cells
 - and do all the quote- replacement, etc.
( see code below )

I then tested three versions of code
 - the earlier csvToTab2 (i.e. adding the CR at the end)
 - this new version (called csvToTab3
 - Peter's csvToArray
against 3 input datasets - the two from Richard's article, plus one testing the case of no trailing CR.

Fortunately, all 3 produce equivalent output (not identical, since Peter's produces an array, doesn't remove quotes in cells and doesn't do the same things with enclosed CR and TABs - but equivalent.

I also added to my test script the ability to choose how many copies of the input data file to put into the variable before calling each function - to allow simple benchmarking. (All the code for the functions and the test button is below)

With that we get (remember - equivalent results)

1 copy of data (sample 1 from Richard - 7 lines, 370 chars)

csvToTab2         0 msecs
csvToTab3         0 msecs
csvToArray   6539 msecs

20,000 copies of the data ( - 140,000 lines, 7.4Mb)
csvToTab2       690 msecs
csvToTab3       566 msecs
csvToArray       not tested


-- Alex.

Code for the test button

on mouseUp
   local tChosenFile

   put empty into msg
   answer file "CSV file to process"
   if the result is not "Cancel" then
      put it into tChosenFile
   else
      exit mouseUp
   end if

   local tmp, t1, tmp1
   put URL ("file:" & tChosenFile) into tmp1
   put the number of chars in tmp1 & CR & tmp1 & CR after msg

   local tTimes
   ask "How many multiples" with 1
   put it into tTimes
   repeat tTimes
      put tmp1 after tmp
   end repeat

   local time1
   put the millisecs into time1
   put csvToTab2(tmp) into t1
   put "Version 2 took" &&  the millisecs - time1 &CR after msg
   if tTimes = 1 then
      replace numtochar(29) with "<GS>" in t1
      replace numtochar(11) with "<VT>" in t1
      replace TAB with "<TAB>" in t1
      put "[" & t1 & "]" & CR & CR  after msg
   end if

   put the millisecs into time1
   put csvToTab3(tmp) into t1
   put "Version 3 took" &&  the millisecs - time1 &CR after msg
   if tTimes = 1 then
      replace numtochar(29) with "<GS>" in t1
      replace numtochar(11) with "<VT>" in t1
      replace TAB with "<TAB>" in t1
      put "[" & t1 & "]" & CR & CR  after msg
   end if


   put empty into tA
   put the millisecs into time1
   if tTimes = 1 then
      put csvToArray(tmp) into tA
      put "Version Array took" &&  the millisecs - time1 &CR after msg
      repeat for each key K in tA
         repeat for each key KK in tA[K]
            put K && KK && tA[K][KK] &CR after msg
         end repeat
      end repeat
   end if
end mouseUp





Code for the functions
function CSVToTab3 pData, pOldLineDelim, pOldItemDelim, pNewCR, pNewTAB
   -- fill in defaults
   if pOldLineDelim is empty then put CR into pOldLineDelim
   if pOldItemDelim is empty then put COMMA into pOldItemDelim
if pNewCR is empty then put numtochar(11) into pNewCR -- Use <VT> for quoted CRs if pNewTAB is empty then put numtochar(29) into pNewTAB -- Use <GS> (group separator) for quoted TABs

   local tNuData                         -- contains tabbed copy of data

   local tStatus, theInsideStringSoFar

   -- Normalize line endings: REMOVED
-- Will normally be correct already, only binfile: or similar should make this necessary
   -- and that exceptional case should be the caller's responsibility

   put "outside" into tStatus
   set the itemdel to quote
   repeat for each item k in pData
      -- put tStatus && k & CR after msg
      switch tStatus

         case "inside"
            put k after theInsideStringSoFar
            put "passedquote" into tStatus
            next repeat

         case "passedquote"
-- decide if it was a duplicated escapedQuote or a closing quote
            if k is empty then   -- it's a duplicated quote
               put quote after theInsideStringSoFar
               put "inside" into tStatus
               next repeat
            end if
-- not empty - so we remain inside the cell, though we have left the quoted section -- NB this allows for quoted sub-strings within the cell content !!
            replace pOldLineDelim with pNewCR in theInsideStringSoFar
            -- replace pOldItemDelim with pNewTAB in theInsideStringSoFar
            put theInsideStringSoFar after tNuData

         case "outside"
            replace pOldItemDelim with TAB in k
            -- and deal with the "empty trailing item" issue in Livecode
replace (pNewTAB & pOldLineDelim) with pNewTAB & pNewTAB & CR in k
            put k after tNuData
            put "inside" into tStatus
            put empty into theInsideStringSoFar
            next repeat
         default
            put "defaulted"
            break
      end switch
   end repeat

   -- and finally deal with the trailing item isse in input data
-- i.e. the very last char is a quote, so there is no trigger to flush the
   --      last item
   if the last char of pData = quote then
      put theInsideStringSoFar after tNuData
   end if

   return tNuData
end CSVToTab3



function CSVToTab2 pData, pcoldelim
   local tNuData                         -- contains tabbed copy of data
local tReturnPlaceholder -- replaces cr in field data to avoid line
   --      breaks which would be misread as records;
local tNuDelim -- new character to replace the delimiter local tNeedToRemoveFinalCR -- boolean to remove CR if added to ensure final token is processed
   local tStatus, theInsideStringSoFar
   --
   put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
   put numtochar(29) into tNuDelim               -- use <GS> internally
   --
   if pcoldelim is empty then put comma into pcoldelim
   -- Normalize line endings:
   replace crlf with cr in pData                   -- Win to UNIX
   replace numtochar(13) with cr in pData  -- Mac to UNIX

   if the last char of pData <> CR then
      put CR after pData
      put TRUE into tNeedToRemoveFinalCR
   end if

   put "outside" into tStatus
   set the itemdel to quote
   repeat for each item k in pData
      -- put tStatus && k & CR after msg
      switch tStatus

         case "inside"
            put k after theInsideStringSoFar
            put "passedquote" into tStatus
            next repeat

         case "passedquote"
-- decide if it was a duplicated escapedQuote or a closing quote
            if k is empty then   -- it's a duplicated quote
               put quote after theInsideStringSoFar
               put "inside" into tStatus
               next repeat
            end if
-- not empty - so we remain inside the cell, though we have left the quoted section -- NB this allows for quoted sub-strings within the cell content !!
            replace cr with tReturnPlaceholder in theInsideStringSoFar
            put theInsideStringSoFar after tNuData

         case "outside"
            replace pcoldelim with tNuDelim in k
            -- and deal with the "empty trailing item" issue in Livecode
            replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
            put k after tNuData
            put "inside" into tStatus
            put empty into theInsideStringSoFar
            next repeat
         default
            put "defaulted"
            break
      end switch
   end repeat

   if tNeedToRemoveFinalCR and the last char of tNuData = CR then
      delete the last char of tNuData
   end if

   -- OPTIONAL - if the data may contain enclosed TABs,
   --     remove this next line to keep <GS> as separator
   replace numtochar(29) with TAB in tNuData

   return tNuData
end CSVToTab2



function offsets str, pContainer
-- returns a comma-delimited list of all the offsets of str in pContainer
   -- returns 0 if not found
   -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
   --     ie, overlapping offsets are not counted
-- note: to get the last occurrence of a string in a container (often useful)
   --     use "item -1 of offsets(...)"
   -- by Peter M. Brigham, pmb...@gmail.com — freeware

   if str is not in pContainer then return 0
   put 0 into startPoint
   repeat
      put offset(str,pContainer,startPoint) into thisOffset
      if thisOffset = 0 then exit repeat
      add thisOffset to startPoint
      put startPoint & comma after offsetList
      add length(str)-1 to startPoint
   end repeat
   return item 1 to -1 of offsetList -- delete trailing comma
end offsets

function CSVtoArray pData
   -- by Peter M. Brigham, pmb...@gmail.com
   -- requires getDelimiters(), howmany()
   put getDelimiters(pData,5) into tDelims
   put line 1 of tDelims into crChar
   put line 2 of tDelims into tabChar
   put line 3 of tDelims into commaChar
   put line 4 of tDelims into openQuoteChar
   put line 5 of tDelims into closeQuoteChar

   replace crlf with cr in pData          -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX

   if howmany(quote,pData) mod 2 = 1 then
      return "This CSV data is not parsable (unclosed quotes in data)."
   end if

   put offsets(quote,pData) into qOffsets
   if qOffsets > 0 then
      put 1 into counter
      repeat for each item q in qOffsets
         if counter mod 2 = 1 then put openQuoteChar into char q of pData
         else put closeQuoteChar into char q of pData
         add 1 to counter
      end repeat
   end if

   put offsets(cr,pData) into crOffsets
   repeat for each item r in crOffsets
      put char 1 to r of pData into upToHere
if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the cr is within a quoted string
         put crChar into char r of pData
      end if
   end repeat
   put offsets(tab,pData) into tabOffsets
   repeat for each item t in tabOffsets
      put char 1 to t of pData into upToHere
if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the tab is within a quoted string
         put tabChar into char t of pData
      end if
   end repeat
   put offsets(comma,pData) into commaOffsets
   repeat for each item c in commaOffsets
      put char 1 to c of pData into upToHere
if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
         -- the comma is within a quoted string
         put commaChar into char c of pData
      end if
   end repeat
   put 0 into lineCounter
   repeat for each line L in pData
      add 1 to lineCounter
      put 0 into itemCounter
      repeat for each item i in L
         add 1 to itemCounter
         put i into thisItem
         if howmany(quote,thisItem) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in item)."
         end if
         replace crChar with cr in thisItem
         replace tabChar with tab in thisItem
         replace commaChar with comma in thisItem
         replace openQuoteChar with quote in thisItem
         replace closeQuoteChar with quote in thisItem
         put thisItem into A[lineCounter][itemCounter]
      end repeat
   end repeat
   return A
end CSVtoArray

function getDelimiters pText, nbr
   -- returns a cr-delimited list of <nbr> characters
   --    not found in the variable pText
-- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
   -- usage: put getDelimiters(pText,2) into tDelims
   --        if tDelims begins with "Error" then exit to top -- or whatever
   --        put line 1 of tDelims into lineDivider
   --        put line 2 of tDelims into itemDivider
   --             etc.
   -- by Peter M. Brigham, pmb...@gmail.com — freeware

   if pText = empty then return "Error: no text specified."
   if nbr = empty then put 1 into nbr -- default 1 delimiter
   put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
   -- low ASCII values, excluding CR, LF, tab, etc.
   put the number of items of baseList into maxNbr
   if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
   repeat with tCount = 1 to nbr
      put true into failed
      repeat with i = 1 to the number of items of baseList
         put item i of baseList into testNbr
         put numtochar(testNbr) into testChar
         if testChar is not in pText then
            -- found one, store and get next delim
            put false into failed
            put testChar into line tCount of delimList
            exit repeat
         end if
      end repeat
      if failed then
         if tCount = 0 then
            return "Error: cannot get any delimiters."
         else if tCount = 1 then
            return "Error: can only get one delimiter."
         else
            return "Error: can only get" && tCount && "delimiters."
         end if
      end if
      delete item i of baseList
   end repeat
   return delimList
end getDelimiters

function howmany pStr, pContainer, pCaseSens
   -- how many times pStr occurs in pContainer
   -- note that howmany("00","000000") returns 3, not 5
   --     ie,  overlapping matches are not counted
   -- by Peter M. Brigham, pmb...@gmail.com — freeware

   if pCaseSens = empty then put false into pCaseSens
   set the casesensitive to pCaseSens
   if pStr is not in pContainer then return 0
   put len(pContainer) into origLength
   replace pStr with char 2 to -1 of pStr in pContainer
   return origLength - len(pContainer)
end howmany

end of code


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Reply via email to