Rich Shepard wrote:
On Mon, 13 Sep 2010, Mithun Nair wrote:
When i try importing it into a SQLite table, i get some errors like
"expected 2 columns of data but found 1". Later i found that importing a
comma separated file into SQlite wont work because a comma is considered
as a delimiter by the the SQLite engine even within a "". Is this true? My
csv file source is a webservice, so i have to try adjusting with the comma
delimiter. They are not gonna change it for me :). Should i write some
simple string parser of my own? Or is there any better, optimal method?
Mithun,
I've gone through this a couple of times (because I did not save the
solution from the first time), most recently this past weekend. Here's what
you need to do to clean up your .csv file for import into SQLite.
Use the text editor of your choice (emacs, vim, joe, whatever).
First, you are correct that commas embedded in text strings are seen as
column delimiters. Therefore, convert all commas to the default SQLite
separator |. In my data I run this series:
- replace all "," with "|"
- replace all ,,, with |||
- repeat above for all NULL columns in your source file with varying
numbers of commans.
Second, replace all " (double qoutes) used to delimit text attributes with
' (single quotes). That's also a global search and replace. I've not had a
problem with apostrophes within a quoted text string as long as the column
separators were the vertical bars.
HTH,
Rich
Mithun --
Attached is a( bash + gawk ) pair that can
convert a CSV File to a Pipe-Delimited file.
The script will run on a system having bash
and gawk( including MSys on Win32 ).
HTH.
-- kjh
[kon...@kjhlt4 ~]$ /home/konrad/bin/qcq2pipe -h
usage: qcq2pipe [ options ] InputFile.CSV
qcq2pipe processes an RFC 4180 compliant file to produce by default,
a pipe-delimited file.
Typical Usage is
qcq2pipe Input.CSV > Output.TXT
Tricks inside ! RFC 4180 allows embedded newlines. qcq2pipe converts
embedded newlines (0x0a) to literal "\n" strings because pipe-delimited
files cannot embed newlines in a row.
It is up to the downstream handler to do something with the literal "\n"
strings in each field.
Options Include:
-f NumCols - force each row to have exactly NumCol Fields.
Truncates or appends Null Fields to each row as necessary
to assure each Row has NumCol Fields
-d Delim - Override the default "|" Delimiter
#!/bin/sh
PrgNam="`basename $0`"
DirNam="`dirname $0`"
AwkNam=${DirNam}/${PrgNam}.awk
TmpDir="/tmp"
Delim="|"
Force=0
PrmNam="${TmpDir}/${PrgNam}-prm.$$"
# figure out what version of echo works on this OpSys
Echo="/usr/5bin/echo" # SunOS
if [ ! -x "${Echo}" ]
then
Echo="`echo -e`" # Linux SysvR4
[ "${Echo}" = "" ] && Echo="echo -e" || Echo="echo"
fi
Usage ()
{
ErrNum=$1
shift
[ $# -gt 0 ] && ${Echo} "\n$*" >&2
${Echo} "\nusage: $PrgNam [ options ] InputFile.CSV" >&2
cat <<Usage_EOF >&2
$PrgNam processes an RFC 4180 compliant file to produce by default,
a pipe-delimited file.
Typical Usage is
$PrgNam Input.CSV > Output.TXT
Tricks inside ! RFC 4180 allows embedded newlines. $PrgNam converts
embedded newlines (0x0a) to literal "\\n" strings because pipe-delimited
files cannot embed newlines in a row.
It is up to the downstream handler to do something with the literal "\\n"
strings in each field.
Options Include:
-f NumCols - force each row to have exactly NumCol Fields.
Truncates or appends Null Fields to each row as necessary
to assure each Row has NumCol Fields
-d Delim - Override the default "|" Delimiter
Usage_EOF
exit $ErrNum
}
while getopts hd:f:n: junk 2>/dev/null
do
case $junk in
d) Delim="$OPTARG" ;;
f) Force="$OPTARG" ;;
n) Force="$OPTARG" ;;
h) Usage 0 ;;
*) Usage 1 ;;
esac
done
shift `expr $OPTIND - 1`
echo "delim $Delim" > ${PrmNam}
echo "force $Force" >> ${PrmNam}
echo "done" >> ${PrmNam}
gawk -f ${AwkNam} ${PrmNam} "$@" |
if [ $Force -eq 0 ]
then
cat
else
gawk '
BEGIN {
Force = '"$Force"' + 0
Delim = "'"$Delim"'"
if ( Delim == "" ) Delim = "|"
FS = Delim
}
function Trim( InStr )
{
gsub ( "^[\t\n\r ]*", "", InStr )
gsub ( "[\t\n\r ]*$", "", InStr )
return ( InStr )
}
{
if ( NF < 1 )
print
else if ( NF == Force )
print Trim( $0 )
else
{
printf ( "%s", Trim( $1 ))
for ( i = 2 ; i <= Force ; i ++ )
printf ( "%s%s", Delim, Trim( $i ))
print ""
}
}'
fi
rm -f $PrmNam
exit 0
BEGIN {
Quote = "\""
Comma = ","
Delim = "|"
NewLn = "\n"
Force = 0
ProcessParms()
InQuote = 0
}
function Trim( InStr )
{
gsub ( "^[\t\n\r ]*", "", InStr )
gsub ( "[\t\n\r ]*$", "", InStr )
return ( InStr )
}
function ProcessParms()
{
while ( getline > 0 )
{
if ( $1 == "done" )
break
if ( $1 == "delim" )
Delim = $2
else if ( $1 == "force" )
Force = $2+0
}
}
{
OutLine = ""
InpLine = Trim( $0 )
for ( i = 1 ; i <= ( L = length ( InpLine )) ; i ++ )
{
if (( c = substr ( InpLine, i, 1 )) == Quote )
{
if ( InQuote )
{
if (( i < L ) && ( substr( InpLine, i+1, 1 ) == Quote ))
{
OutLine = OutLine Quote
i ++
}
else
{
InQuote = 0
}
}
else
{
InQuote = 1
}
}
else
{
if ( c == Comma )
{
if ( ! InQuote )
c = Delim
}
OutLine = OutLine c
}
if (( i == L ) && ( InQuote ))
{
OutLine = OutLine "\\n" # fake NewLine
getline
InpLine = Trim( $0 )
L = length( InpLine )
i = 0
}
}
gsub ( / *\| */, "|", OutLine ) # trim spaces around the "|"'s
gsub ( /^ */, "", OutLine ) # trim leading spaces on OutLine
print OutLine
}
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users