[NOTE: Moved to [EMAIL PROTECTED] from a private email thread] Allow me to interrupt with some background:
We have thousands of tables and views across dozens of databases on a score of servers, holding many hundreds of gigabytes of data. For some reason, our databases are painfully slow (Sybase, and I don't know why they're slow - maybe our DBA isn't on top of things ;), so we try to do as much work outside the database as possible. In the past, people have written dozens of custom scripts that pull data from dozens of tables and dump it to random locations across our filesystem, using random naming conventions, in custom undocumented formats. Everyone who wanted to pull data from these "table dumps" instead of the database had to figure out if one even existed, ask around to find out where it was, `head` the file, guess the format rules, and write custom code in their script to read it. Needless to say, this is all very, very ugly.

Being the new guy, I quickly drank a glass of water to overcome my gag reflex and set about eliminating this mess and solving the problem once and for all. That's when I started poking around CPAN (I'm a long-time perl programmer, but somehow never got around to exploring CPAN until just recently - good god, there's some amazing stuff there) and found DBD::CSV and Text::CSV_XS. I thought: perfect! I'll use Text::CSV_XS to dump tables, pick a standard location and filename scheme (/netapp/dbcache/<server>/<database>/<table|view>), and whip up a quick module with two functions, cache_table and query_cached_table, the second of which would take a SQL query (using DBD::CSV), so that people could painlessly change code to use either the live tables (DBI + DBD::Sybase) or the cached table (DBI + DBD::CSV), as the situation dictates. However, most of the time the user simply wants to read the whole file (or a subset of columns) and do something special (non-SQLish) with the data, and in those cases, loading the data myself with Text::CSV_XS is far faster than going through DBD::CSV.
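To make that concrete, here's roughly what the module looks like - a simplified sketch only (no error handling, no view support, and the package/variable names here are just illustrative):

```perl
package DBCache;
use strict;
use warnings;
use DBI;
use File::Path qw(mkpath);
use Text::CSV_XS;

# Root of the standard cache location scheme described above.
our $CACHE_ROOT = '/netapp/dbcache';

# Dump a live table to /<root>/<server>/<database>/<table> as CSV,
# with the column names as the first row.
sub cache_table {
    my ($server, $db, $table, $live_dbh) = @_;
    my $dir = "$CACHE_ROOT/$server/$db";
    mkpath($dir);
    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
    open my $fh, '>', "$dir/$table" or die "open $dir/$table: $!";
    my $sth = $live_dbh->prepare("SELECT * FROM $table");
    $sth->execute;
    $csv->print($fh, $sth->{NAME});              # header row
    while (my $row = $sth->fetchrow_arrayref) {
        $csv->print($fh, $row);
    }
    close $fh;
}

# Run an arbitrary SQL query against the cached files via DBD::CSV;
# returns a statement handle, so callers use the normal DBI fetch methods.
sub query_cached_table {
    my ($server, $db, $sql) = @_;
    my $dbh = DBI->connect("dbi:CSV:f_dir=$CACHE_ROOT/$server/$db",
                           undef, undef, { RaiseError => 1 });
    my $sth = $dbh->prepare($sql);
    $sth->execute;
    return $sth;
}

1;
```

The point of returning a statement handle is that code written against the live database (DBI + DBD::Sybase) needs almost no changes to run against the cache.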
I even wrote a little iterator object that implements a simple subset of DBI::st, like fetchrow_array and fetchrow_arrayref (trivial, but cute), which I return if I decide to use Text::CSV_XS directly instead of DBD::CSV. At the moment, though, I choose which strategy to use behind the scenes with a regex that decides whether the query meets my definition of "simple", and that's kinda ugly. I will probably mess around with SQL::Parser and use that instead when I get a moment, but I wanted to bring this whole situation to your attention in case you were willing/able to optimize DBD::CSV for the special case of "just give me some columns from this one table". Of course, maybe this whole situation has already been solved by an existing generic table-caching module, and I just haven't had the good fortune to come across it. If not, however, perhaps it would be worthwhile to polish up what I've worked on and put it on CPAN.

For completeness, responses to individual points from the previous email can be found below:

> > I recently ran a test of Text::CSV_XS vs. DBD::CSV in which I simply wanted
> > to load all data (or a subset of columns) from a CSV file (essentially a
> > SELECT * FROM <table> or SELECT <fieldlist> FROM <table> query). The raw
> > Text::CSV_XS code ran significantly faster, sometimes twice as fast.
>
> If you don't have a WHERE clause, or only a single WHERE comparison, try
> using DBI::SQL::Nano as the DBD::CSV parser instead of SQL::Statement.
> Just put BEGIN { $ENV{DBI_SQL_NANO}=1 } at the top of your script and
> the DBD will use Nano rather than the usual SQL::Statement. This can
> significantly speed things up for simple cases.

I can't know ahead of time how complex the query will be. My goal is to seamlessly support as much of SQL as I can on top of our database caching system, while using shortcuts when possible if the performance difference justifies it (as it does for straight table slurps).
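For the curious, the iterator object I mentioned is roughly the following - a sketch with illustrative names, stripped of error handling. It wraps a Text::CSV_XS handle and implements just enough of the DBI::st fetch interface to be a drop-in for simple loops:

```perl
package CachedTable::Iterator;
use strict;
use warnings;
use Text::CSV_XS;

# new($file, @columns) - open a cached CSV table (header row = column
# names) and iterate over the requested columns, or all of them.
sub new {
    my ($class, $file, @columns) = @_;
    my $csv = Text::CSV_XS->new({ binary => 1 });
    open my $fh, '<', $file or die "open $file: $!";
    my $header = $csv->getline($fh);                     # first row
    my %idx = map { $header->[$_] => $_ } 0 .. $#$header;
    @columns = @$header unless @columns;                 # default: all
    my @want = map { $idx{$_} } @columns;
    return bless { csv => $csv, fh => $fh, want => \@want }, $class;
}

# Same calling convention as DBI's fetchrow_arrayref: an arrayref per
# row, undef at end of file.
sub fetchrow_arrayref {
    my $self = shift;
    my $row = $self->{csv}->getline($self->{fh}) or return undef;
    return [ @{$row}[ @{ $self->{want} } ] ];
}

# And the list-context flavor, like DBI's fetchrow_array.
sub fetchrow_array {
    my $self = shift;
    my $row = $self->fetchrow_arrayref or return ();
    return @$row;
}

1;
```

Since it answers to the same fetch methods as a real statement handle, calling code doesn't care whether it got this object or a genuine $sth back from DBD::CSV.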
> > That's not surprising, since DBD::CSV provides an entire SQL parsing/execution
> > layer that enables it to do far cooler things than a simple SELECT, but that
> > means that if I want to optimize the table caching system I'm building for my
> > company, I need to either:
>
> If your goal is an in-memory table, try DBD::AnyData which handles CSV
> and creates in-memory tables by default.

I'm only casually familiar with DBD::AnyData (I skimmed the docs - looks neat), but an in-memory table isn't exactly what I'm shooting for. The working assumption here is that the table was already dumped to disk in CSV format, say once a day at 3am or something.

> > 2) parse the query myself (perhaps with SQL::Parser), and use Text::CSV_XS
> > directly if it's a simple select, or DBD::CSV for anything else
>
> Um, but that is just what DBD::CSV already does - it uses SQL::Parser to
> parse the query and Text::CSV_XS to parse the CSV.

I know DBD::CSV uses SQL::Parser (in fact, the docs imply that's what it was written for! :), but I intend to parse the query myself beforehand to determine whether or not I even need to pass it along to DBD::CSV. If it does get passed along, of course, the query will end up being parsed twice, but the cost of that is negligible next to reading the data from disk.

> > PS - If I was supposed to have sent this to some mailing list instead of
> > directly to you, then I apologize. Let me know which one, and I'll
> > subscribe and restart the thread there.
>
> [EMAIL PROTECTED] is the best place for general discussion, but it's
> also fine to email me directly, especially for suggestions like this one.

OK, I will move the thread to dbi-users, since people there might have some creative input on my situation that didn't occur to me.

-ofer