Here's a utility to import a comma-separated file (it does not work for
quoted strings, strings containing commas, or empty fields, since fields
are split with strtok()).



Figures out the # of columns automagically from the first line of the CSV file.



All wrapped in a single transaction.

Shows progress every 100,000 inserts.



Usage: csvimport filename databasename tablename



#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "sqlite3.h"

#define BUFSIZE 1000000
#define MOD 100000

char *sep=",\r\n";
// Add a comma-delimited file to an existing database/table.
// Fields cannot contain commas, and quote characters are retained verbatim.
int main(int argc, char *argv[]) {
  sqlite3 *db;
  sqlite3_stmt *stmt;
  int rc;
  int ncol=0;
  int nline=0;
  char *buf=malloc(BUFSIZE);
  char sql[8192];
  FILE *fp;
  char *filename;
  char *databasename;
  char *tablename;
  if (argc != 4) {
    fprintf(stderr,"Usage: %s filename databasename tablename\n",argv[0]);
    exit(1);
  }
  if (buf == NULL) {
    fprintf(stderr,"malloc(%d) failed\n",BUFSIZE);
    exit(1);
  }
  filename = argv[1];
  databasename = argv[2];
  tablename = argv[3];
  rc = sqlite3_open_v2(databasename,&db,SQLITE_OPEN_READWRITE,NULL);
  if (rc) {
    fprintf(stderr,"Error opening database '%s': %s\n",databasename,sqlite3_errmsg(db));
    exit(1);
  }
  sprintf(sql,"insert into %s values (",tablename);
  fp=fopen(filename,"r");
  if (fp == NULL) {
    perror(filename);
    exit(1);
  }
  buf[BUFSIZE-1] = '*';  // sentinel: fgets() only overwrites it if the line fills the buffer
  if (fgets(buf,BUFSIZE,fp) == NULL) {
    fprintf(stderr,"%s is empty...aborting\n",filename);
    exit(1);
  }
  if (buf[BUFSIZE-1] != '*') {
    fprintf(stderr,"BUFSIZE not big enough...aborting\n");
    exit(1);
  }
  // count the columns
  char *p=strtok(buf,sep);
  ncol=0;
  while(p) {
    ++ncol;
    strcat(sql,ncol==1?"":",");
    strcat(sql,"?");
    p=strtok(NULL,sep);
  }
  printf("%d columns detected\n",ncol);
  strcat(sql,")");
  puts(sql);
  rewind(fp);
  // Let's wrap things in a transaction
  rc = sqlite3_exec(db,"BEGIN",NULL,NULL,NULL);
  if (rc) {
    fprintf(stderr,"BEGIN failed on '%s': %s\n",sql,sqlite3_errmsg(db));
    exit(1);
  }
  // prepare our statement
  rc = sqlite3_prepare_v2(db,sql,-1,&stmt,NULL);
  if (rc) {
    fprintf(stderr,"Prepare failed on '%s': %s\n",sql,sqlite3_errmsg(db));
    exit(1);
  }
  // Loop over the rest of the file (over-long data lines are not re-checked here)
  while(fgets(buf,BUFSIZE,fp)) {
    char *p=strtok(buf,sep);
    int i=1;
    ++nline;
    if ((nline % MOD)==0) {
      printf("%d\r",nline);
      fflush(stdout);
    }
    while(p) { // bind the columns as text; the table will take care of conversion to column types
      rc=sqlite3_bind_text(stmt,i,p,-1,SQLITE_TRANSIENT);
      if (rc) {
        fprintf(stderr,"bind_text failed on '%s': %s\n",sql,sqlite3_errmsg(db));
        exit(1);
      }
      ++i;
      p=strtok(NULL,sep);
    }
    if (--i != ncol) {
      fprintf(stderr,"expected %d cols, got %d cols on line#%d\n",ncol,i,nline);
    } else {
      rc = sqlite3_step(stmt);
      if (rc != SQLITE_DONE) {
        fprintf(stderr,"Insert failed on '%s': %s\n",sql,sqlite3_errmsg(db));
        exit(1);
      }
      rc = sqlite3_reset(stmt);
      if (rc) {
        fprintf(stderr,"Reset failed on '%s': %s\n",sql,sqlite3_errmsg(db));
        exit(1);
      }
    }
  }
  rc=sqlite3_finalize(stmt);
  if (rc) {
    fprintf(stderr,"Finalize failed: %s\n",sqlite3_errmsg(db));
    exit(1);
  }
  printf("%d inserts, committing...\n",nline);
  rc = sqlite3_exec(db,"COMMIT",NULL,NULL,NULL);
  if (rc) {
    fprintf(stderr,"COMMIT failed on '%s': %s\n",sql,sqlite3_errmsg(db));
    exit(1);
  }
  rc=sqlite3_close(db);
  if (rc) {
    fprintf(stderr,"Close failed on '%s': %s\n",argv[2],sqlite3_errmsg(db));
    exit(1);
  }
  fclose(fp);
  return 0;
}
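
To build it, one option (assuming the SQLite amalgamation sqlite3.c /
sqlite3.h sits next to csvimport.c; on Linux the amalgamation also wants
-ldl and -lpthread) is:

  gcc -O2 -o csvimport csvimport.c sqlite3.c -ldl -lpthread

or link against an installed library with -lsqlite3 instead of sqlite3.c.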



Michael D. Black

Senior Scientist

Advanced Analytics Directorate

Advanced GEOINT Solutions Operating Unit

Northrop Grumman Information Systems

________________________________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of peter korinis [kori...@earthlink.net]
Sent: Friday, May 04, 2012 10:23 AM
To: 'General Discussion of SQLite Database'
Subject: EXT :Re: [sqlite] is SQLite the right tool to analyze a 44GB file

I have scaled down the attributes of interest to 46 columns (discarding the
other 550). No columns are calculated. No updates to this file ... one user
... only query, sort, etc. type transactions.
So I want to load two 22GB csv files into an empty 46-column table. (I
intend to test the load with a 999-record by 46-column file.) Initially I
have only one index, on a record # ... I am not certain that several other
fields I want to index aren't missing data in some records (I assume that
would error out if I make those an index).
After I get the data loaded and inspect for nulls in prospective index
attributes, can I add indices?

I was planning to load using the sqlite3 CLI ".import" command. Is there a
way I can monitor the progress of the load with only minimal impact on
performance? I've started several loads only to find out hours later that
nothing had been loaded.

Thanks for helpful info.

peter


-----Original Message-----
From: sqlite-users-boun...@sqlite.org
[mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Valentin Davydov
Sent: Friday, May 04, 2012 9:43 AM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] is SQLite the right tool to analyze a 44GB file

On Tue, May 01, 2012 at 04:06:01PM -0400, peter korinis wrote:
> I'm new to SQLite ... not a programmer ... not a DBA ... just an end-user
> with no dev support for a pilot project (single user, no updates, just
> queries).
>
>
>
> I want to analyze the data contained in a 44GB csv file with 44M rows
> x 600 columns (fields all <15 char). Seems like a DBMS will allow me
> to query it in a variety of ways to analyze the data.

Yes, SQLite is quite capable of doing simple analysis of such amounts of
data, especially selecting small subsets based on simple criteria. However,
before trying to do some real work you have to understand the structure of
your data, work out your possible queries and carefully design the database
schema (tables and, equally important, indices). Perhaps putting all the
data in a single 600-column table is not a good idea (though technically
allowed), especially if your columns are all similar in physical nature:
it is not so easy to select arbitrarily calculated columns, only rows.
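
For illustration, a hypothetical "tall" layout along those lines (table
and column names invented here) would turn each of the ~600 attributes
into its own row, so queries select rows by field rather than fixed columns:

  CREATE TABLE tall (
    rec   INTEGER,           -- original record number
    field TEXT,              -- attribute name
    value TEXT,              -- stored as text, as in the loader above
    PRIMARY KEY (rec, field)
  );
  CREATE INDEX tall_by_field ON tall(field, value);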

> I have the data files and SQLite on my laptop: a 64-bit Win7 Intel
> dual-proc with 4GB RAM + 200GB free disk space.

A well-indexed database of small data pieces usually takes up several times
more disk space than the raw data. Probably 200GB would not be enough,
depending mostly on the number of indices. Consider dedicating a separate
disk (or even a RAID array) to it.
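
(For the numbers in this thread: 44GB of raw CSV times a factor of, say,
3-5x comes to roughly 130-220GB, before counting the temporary space used
while building indices, so 200GB of free disk is marginal at best.)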

> End-user tools like Excel & Access failed due to lack of memory. I
> downloaded SQLite ver.3.6.23.1. I tried to use Firefox Data Manager
> add-on but it would not load the csv files - 'csv worker failed'. So I
> tried Database Master from Nucleon but it failed after loading (it
> took 100
> minutes) ~57,000 rows with error message = 'database or disk is full".
> I tried to create another table in the same db but could not with same
> error message. The DB size shows as 10,000KB (that looks suspiciously
> like a size setting?).

Try the bare sqlite3 shell instead of those external tools. It should take
at least several hours to fill up your database.

> From what I've read SQLite can handle this size DB.

Surely it can. In one of my projects the database takes up almost 3
terabytes of disk space, contains more than 10^10 records and still provides
small selects of indexed data in real time.

> 1.       Is SQLite the wrong tool for this project? (I don't want the
> overkill and admin overhead of a large MySQL or SQL Server, etc.)

It depends on the data structure, semantics and what you are going to find
there. SQLite isn't very good at calculating complex aggregate functions,
but works fine for simple selecting and sorting.

> 2.       If SQLite will work, are there configuration settings in SQLite
> or Win7 that will permit the load ... or is there a better tool for this
> project?

Increasing PAGE_SIZE to match the filesystem block (cluster) size, and
perhaps CACHE_SIZE to use most of the available RAM, would help a bit.
Also, don't forget to turn off journaling and wrap everything in a single
transaction when creating the database for the first time.
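
For a first-time bulk load that might look something like this (the
cache_size value is illustrative only, and page_size must be set before
the first table is created):

  PRAGMA page_size = 4096;       -- match the filesystem cluster size
  PRAGMA cache_size = -1000000;  -- negative means KiB, so about 1GB of cache
  PRAGMA journal_mode = OFF;     -- no rollback journal during the load
  BEGIN;
  -- ... bulk INSERTs (or the shell's .import) go here ...
  COMMIT;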

Valentin Davydov.