Rich Shepard wrote:
On Mon, 13 Sep 2010, Mithun Nair wrote:

When i try importing it into a SQLite table, i get some errors like
"expected 2 columns of data but found 1".  Later i found that importing a
comma separated file into SQlite wont work because a comma is considered
as a delimiter by the the SQLite engine even within a "". Is this true? My
csv file source is a webservice, so i have to try adjusting with the comma
delimiter. They are not gonna change it for me :). Should i write some
simple string parser of my own? Or is there any better, optimal method?

Mithun,

   I've gone through this a couple of times (because I did not save the
solution from the first time), most recently this past weekend. Here's what
you need to do to clean up your .csv file for import into SQLite.

   Use the text editor of your choice (emacs, vim, joe, whatever).

   First, you are correct that commas embedded in text strings are seen as
column delimiters. Therefore, convert all commas to the default SQLite
separator |. In my data I run this series:
   - replace all "," with "|"
   - replace all ,,, with |||
   - repeat above for all NULL columns in your source file with varying
numbers of commans.

   Second, replace all " (double qoutes) used to delimit text attributes with
' (single quotes). That's also a global search and replace. I've not had a
problem with apostrophes within a quoted text string as long as the column
separators were the vertical bars.

HTH,

Rich

Mithun --

Attached is a( bash + gawk ) pair that can
convert a CSV File to a Pipe-Delimited file.

The script will run on a system having bash
and gawk( including MSys on Win32 ).

HTH.

-- kjh

[kon...@kjhlt4 ~]$ /home/konrad/bin/qcq2pipe -h

   usage:  qcq2pipe [ options ] InputFile.CSV

      qcq2pipe processes an RFC 4180 compliant file to produce by default,
      a pipe-delimited file.

      Typical Usage is

         qcq2pipe Input.CSV > Output.TXT

      Tricks inside !  RFC 4180 allows embedded newlines.  qcq2pipe converts
      embedded newlines (0x0a) to literal "\n" strings because pipe-delimited
      files cannot embed newlines in a row.

      It is up to the downstream handler to do something with the literal "\n"
      strings in each field.

      Options Include:

         -f NumCols - force each row to have exactly NumCol Fields.

                      Truncates or appends Null Fields to each row as necessary
                      to assure each Row has NumCol Fields

         -d Delim   - Override the default "|" Delimiter

#!/bin/sh

PrgNam="`basename $0`"
DirNam="`dirname $0`"
AwkNam=${DirNam}/${PrgNam}.awk

TmpDir="/tmp"

Delim="|"
Force=0

PrmNam="${TmpDir}/${PrgNam}-prm.$$"

# figure out what version of echo works on this OpSys

Echo="/usr/5bin/echo"  #      SunOS
if [ ! -x "${Echo}" ]
then
   Echo="`echo -e`"    #       Linux       SysvR4
   [ "${Echo}" = "" ] && Echo="echo -e" || Echo="echo"
fi

Usage ()
{
   ErrNum=$1
   shift

   [ $# -gt 0 ] && ${Echo} "\n$*" >&2

   ${Echo} "\nusage:  $PrgNam [ options ] InputFile.CSV" >&2
   cat  <<Usage_EOF >&2

   $PrgNam processes an RFC 4180 compliant file to produce by default, 
   a pipe-delimited file.
   
   Typical Usage is
   
      $PrgNam Input.CSV > Output.TXT
   
   Tricks inside !  RFC 4180 allows embedded newlines.  $PrgNam converts 
   embedded newlines (0x0a) to literal "\\n" strings because pipe-delimited 
   files cannot embed newlines in a row.
   
   It is up to the downstream handler to do something with the literal "\\n" 
   strings in each field.
   
   Options Include:
   
      -f NumCols - force each row to have exactly NumCol Fields.
   
                   Truncates or appends Null Fields to each row as necessary 
                   to assure each Row has NumCol Fields
   
      -d Delim   - Override the default "|" Delimiter
Usage_EOF
 
   exit $ErrNum

}
while  getopts hd:f:n: junk 2>/dev/null
do
   case $junk in
      d)    Delim="$OPTARG"      ;;
      f)    Force="$OPTARG"      ;;
      n)    Force="$OPTARG"      ;;
      h)    Usage 0             ;;
      *)    Usage 1             ;;
   esac
done

shift `expr $OPTIND - 1`

echo "delim    $Delim"  > ${PrmNam}
echo "force    $Force" >> ${PrmNam}
echo "done"            >> ${PrmNam}

gawk -f ${AwkNam} ${PrmNam} "$@" |
if [ $Force -eq 0 ]
then
   cat
else
   gawk '
   BEGIN {

      Force = '"$Force"' + 0
      Delim = "'"$Delim"'"

      if ( Delim == "" ) Delim = "|"

      FS = Delim
   }
   function Trim( InStr )
   {
      gsub ( "^[\t\n\r ]*", "", InStr )
      gsub ( "[\t\n\r ]*$", "", InStr )
      return ( InStr )
   }
   {
      if ( NF < 1 )
         print
      else if ( NF == Force )
         print Trim( $0 )
      else
      {
         printf ( "%s", Trim( $1 ))

         for ( i = 2 ; i <= Force ; i ++ )
            printf ( "%s%s", Delim, Trim( $i ))
              
         print ""
      }
   }'
fi

rm -f $PrmNam

exit 0
BEGIN {
   
   Quote = "\""
   Comma = ","
   Delim = "|"
   NewLn = "\n"
   Force = 0

   ProcessParms()

   InQuote = 0
}
function Trim( InStr )
{
   gsub ( "^[\t\n\r ]*", "", InStr )
   gsub ( "[\t\n\r ]*$", "", InStr )
   return ( InStr )
}
function ProcessParms()
{
   while ( getline > 0 )
   {
      if ( $1 == "done" )
         break

      if ( $1 == "delim" )
         Delim = $2
      else if ( $1 == "force" )
         Force = $2+0
   }
}
{
   OutLine = ""
   InpLine = Trim( $0 )

   for ( i = 1 ; i <= ( L = length ( InpLine )) ; i ++ )
   {
      if (( c = substr ( InpLine, i, 1 )) == Quote )
      {
         if ( InQuote )
         {
            if (( i < L ) && ( substr( InpLine, i+1, 1 ) == Quote ))
            {
               OutLine = OutLine Quote
               i ++
            }
            else
            {
               InQuote = 0
            }
         }
         else
         {
            InQuote = 1
         }
      }
      else
      {
         if ( c == Comma )
         {
            if ( ! InQuote ) 
               c = Delim
         }
         OutLine = OutLine c
      }
      if (( i == L ) && ( InQuote ))
      {
           OutLine = OutLine "\\n"    # fake NewLine

           getline
           InpLine = Trim( $0 )
           L = length( InpLine )
           i = 0
      }
   }
   gsub ( / *\| */, "|", OutLine ) # trim spaces around the "|"'s
   gsub ( /^ */, "", OutLine )     # trim leading spaces on OutLine
   print OutLine
}

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to