[NOTE: Moved to [EMAIL PROTECTED] from a private email thread]

Allow me to interrupt with some background:

We have thousands of tables and views across dozens of databases on a score
of servers.  We have many hundreds of gigabytes of data in those databases.
For some reason, our databases are painfully slow (Sybase, and I don't know
why they're slow - maybe our DBA isn't on top of things ;), so we try to do
as much work outside the database as possible.  In the past, people have
written dozens of custom scripts that pull data from dozens of tables and
dump it to random locations across our filesystem, using random
naming conventions, in custom undocumented formats.  Everyone who wanted to
pull data from these "table dumps" instead of the database had to figure out
if one even existed, ask around to find out where it is, `head` the file,
guess the format rules, and write custom code in their script to read it.

Needless to say, this is all very, very ugly.  Being the new guy, I quickly
drank a glass of water to overcome my gag reflex and set about eliminating this
mess and solving the problem once and for all.  That's when I started poking
around CPAN (I'm a long-time Perl programmer, but somehow never got around
to exploring CPAN till just recently - good god there's some amazing stuff
there) and found DBD::CSV and Text::CSV_XS.

I thought: perfect!  I'll use Text::CSV_XS to dump tables, pick a standard
location and filename scheme
(/netapp/dbcache/<server>/<database>/<table|view>), and whip up a quick
module with two functions: cache_table and query_cached_table, the second
of which would take a SQL query (using DBD::CSV), so that people could
painlessly change code to use either the live tables (DBI + DBD::Sybase) or
the cached table (DBI + DBD::CSV), as the situation dictates.
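
To make this concrete, here's roughly what I have in mind for those two
functions.  This is a stripped-down sketch only (no error handling, locking,
or staleness checks), not the final code:

    use strict;
    use warnings;
    use DBI;
    use Text::CSV_XS;

    # Dump a live table to the cache area as CSV, header row first.
    sub cache_table {
        my ($dbh, $server, $database, $table) = @_;
        my $file = "/netapp/dbcache/$server/$database/$table";

        my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
        my $sth = $dbh->prepare("SELECT * FROM $table");
        $sth->execute;

        open my $fh, '>', $file or die "can't write $file: $!";
        $csv->print($fh, $sth->{NAME});             # column names as the header
        while (my $row = $sth->fetchrow_arrayref) {
            $csv->print($fh, $row);
        }
        close $fh or die "can't close $file: $!";
    }

    # Run an arbitrary SQL query against the cached CSV files via DBD::CSV.
    sub query_cached_table {
        my ($server, $database, $sql) = @_;
        my $dbh = DBI->connect(
            "dbi:CSV:f_dir=/netapp/dbcache/$server/$database",
            undef, undef, { RaiseError => 1 },
        );
        my $sth = $dbh->prepare($sql);
        $sth->execute;
        return $sth;    # caller uses the usual DBI fetch methods
    }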

However, most of the time the user simply wants to read the whole file in
(or a subset of columns) and do something special (non-SQLish) with the
data, and in those cases, loading the data myself with Text::CSV_XS is far
faster than going through DBD::CSV.  I even wrote a little iterator object
that implements a simple subset of DBI::st, like fetchrow_array and
fetchrow_arrayref (trivial, but cute), which I return if I decide to use
Text::CSV_XS directly rather than DBD::CSV.
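
The iterator really is trivial; a simplified sketch of it looks something
like this (the package name is just a placeholder, and column selection is
done by name against the header row):

    package CachedTable::Iterator;
    use strict;
    use warnings;
    use Text::CSV_XS;

    sub new {
        my ($class, $file, @want_cols) = @_;
        open my $fh, '<', $file or die "can't read $file: $!";
        my $csv    = Text::CSV_XS->new({ binary => 1 });
        my $header = $csv->getline($fh);            # first row holds column names
        my %pos; @pos{@$header} = 0 .. $#$header;
        my @idx = @want_cols ? @pos{@want_cols} : (0 .. $#$header);
        return bless { fh => $fh, csv => $csv, idx => \@idx }, $class;
    }

    sub fetchrow_arrayref {
        my $self = shift;
        my $row = $self->{csv}->getline($self->{fh}) or return;
        return [ @{$row}[ @{ $self->{idx} } ] ];    # slice out requested columns
    }

    sub fetchrow_array {
        my $self = shift;
        my $row = $self->fetchrow_arrayref or return;
        return @$row;
    }

    1;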

But right now I am deciding which strategy to use behind the scenes with a
regex that checks whether the query meets my definition of "simple", and
that's kinda ugly.  I will probably mess around with SQL::Parser and use that
instead when I get a moment, but I wanted to bring this whole situation to
your attention in case you were willing/able to optimize DBD::CSV for the
special case of "just give me some columns from this one table".
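
The check is nothing fancier than something along these lines (a simplified
sketch, not my exact regex; the function name is just for illustration):

    # Returns true only for queries like "SELECT * FROM foo" or
    # "SELECT a, b FROM foo" -- no WHERE, JOIN, ORDER BY, etc.
    sub is_simple_select {
        my $sql = shift;
        return $sql =~ /
            \A \s* SELECT \s+
            ( \* | \w+ (?: \s* , \s* \w+ )* ) \s+   # '*' or a bare column list
            FROM \s+ \w+ \s* ;? \s* \z              # exactly one table, nothing else
        /xi;
    }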

Of course, maybe this whole situation has already been solved by an existing
generic table caching module, and I just haven't had the good fortune to
come across it.  If not, however, perhaps it would be worthwhile polishing
up what I've worked on and putting it on CPAN.

For completeness, responses to individual points from the previous email can
be found below:

> > I recently ran a test of Text::CSV_XS vs. DBD::CSV in which I simply
> > wanted to load all data (or a subset of columns) from a CSV file
> > (essentially a SELECT * FROM <table> or SELECT <fieldlist> FROM <table>
> > query).  The raw Text::CSV_XS code ran significantly faster, sometimes
> > twice as fast.
>
> If you don't have a WHERE clause, or only a single WHERE comparison, try
> using DBI::SQL::Nano as the DBD::CSV parser instead of SQL::Statement.
> Just put BEGIN { $ENV{DBI_SQL_NANO}=1 } at the top of your script and
> the DBD will use Nano rather than the usual SQL::Statement.  This can
> significantly speed things up for simple cases.

I can't know ahead of time how complex the query will be.  My goal is to
seamlessly support as much of SQL as I can on top of our database caching
system, while using shortcuts when possible if the performance difference
justifies it (as it does for straight table slurps).
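
To spell out what I mean by a shortcut, the dispatch boils down to something
like this (a sketch that reuses is_simple_select and the iterator from the
earlier sketches; fetch_cached is just an illustrative name):

    # Route simple queries to the raw Text::CSV_XS iterator, and
    # everything else to the full DBD::CSV path.
    sub fetch_cached {
        my ($server, $database, $sql) = @_;
        if (is_simple_select($sql)) {
            my ($cols, $table) =
                $sql =~ /\A\s*SELECT\s+(.+?)\s+FROM\s+(\w+)/i;
            my @want = $cols eq '*' ? () : split /\s*,\s*/, $cols;
            return CachedTable::Iterator->new(
                "/netapp/dbcache/$server/$database/$table", @want);
        }
        return query_cached_table($server, $database, $sql);
    }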

> > That's not surprising, since DBD::CSV provides an entire SQL
> > parsing/execution layer that enables it to do far cooler things than a
> > simple SELECT, but that means that if I want to optimize the table
> > caching system I'm building for my company, I need to either:
>
> If your goal is an in-memory table, try DBD::AnyData which handles CSV
> and creates in-memory tables by default.

I'm only casually familiar with DBD::AnyData (I skimmed the docs - looks
neat), but an in-memory table isn't exactly what I'm shooting for.  The working
assumption here is that the table was already dumped to disk in CSV format,
say once a day at 3am or something.

> > 2) parse the query myself (perhaps with SQL::Parser), and use
> > Text::CSV_XS directly if it's a simple select, or DBD::CSV for anything
> > else
>
> Um, but that is just what DBD::CSV already does - it uses SQL::Parser to
> parse the query and Text::CSV_XS to parse the CSV.

I know DBD::CSV uses SQL::Parser (in fact, the docs imply that's what it was
written for! :), but I intend to parse the query myself beforehand to
determine whether or not I even need to pass it along to DBD::CSV.  If it
does get passed along, of course, then the query will end up being parsed
twice, but the cost of that is negligible next to reading the data from disk.
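
Roughly what I have in mind for that pre-parse step, assuming SQL::Parser
exposes the parsed query through parse() and structure() the way I think it
does (I still need to verify the exact structure keys, and this ignores
things like ORDER BY for now):

    use SQL::Parser;

    # Decide whether a query can take the raw Text::CSV_XS fast path.
    sub wants_fast_path {
        my $sql    = shift;
        my $parser = SQL::Parser->new;
        $parser->parse($sql) or return 0;    # unparsable: let DBD::CSV complain
        my $s = $parser->structure;
        return 0 unless uc($s->{command} || '') eq 'SELECT';
        return 0 if $s->{where_clause};      # any WHERE means full DBD::CSV
        return 0 unless @{ $s->{table_names} || [] } == 1;
        return 1;                            # single-table, unrestricted SELECT
    }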

> > PS-If I was supposed to have sent this to some mailing list instead of
> > directly to you, then I apologize.  Let me know which one, and I'll
> > subscribe and restart the thread there.
>
> [EMAIL PROTECTED] is the best place for general discussion, but it's
> also fine to email me directly, especially for suggestions like this one.

Ok, I will move the thread to dbi-users, since people there might have some
creative input on my situation that didn't occur to me.

-ofer
