Hi Doug,

It does not seem to make things much better. The trouble with pg_getcopydata seems to be that it receives data in text format (in fact CSV lines), and parsing and converting CSV data is more expensive than getting numeric values from the binary Pg protocol. You may also have to download more megabytes (which is not free), as CSV text data can be longer than binary-encoded query results.

I have tested CSV parsing on a 200 MB CSV file with approximately 3 million lines/records:
- loading with rcols(..) takes about 63 s (a rough sketch of such a call is below the list)
- my optimized approach (I might share the code), similar to what I do in my DB module, takes 34 s
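
For reference, a minimal rcols-based CSV load (not necessarily the exact call I timed; the file name is a placeholder) looks roughly like this:

  use PDL;
  use PDL::IO::Misc;

  # read comma-separated numeric columns into 1-D piddles
  my @cols = rcols 'data.csv', { COLSEP => ',' };
  # glue them into one 2-D piddle (rows along dim 0, columns along dim 1)
  my $pdl = cat(@cols);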

To sum up: loading 3 million records from the DB in ~30 s is not that bad.

The bottleneck is probably the way a piddle is created from a stream of large data: you do not have all the input data in memory, and you do not even know the number of records until you have read them all. The best I was able to achieve is in my gist.
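
For the record, the general approach (not necessarily exactly what the gist does) is to preallocate a buffer piddle, fill it row by row, double it when it runs out of space, and trim it at the end. A rough, self-contained sketch with a dummy row source:

  use strict;
  use warnings;
  use PDL;

  # dummy row source standing in for a DB / stream fetch; yields arrayrefs
  my @fake_rows = map { [ $_, 2 * $_, 3 * $_ ] } 1 .. 10_000;
  my $fetch_row = sub { shift @fake_rows };

  my $ncols = 3;
  my $alloc = 4096;                                  # rows allocated so far
  my $buf   = zeroes(double, $ncols, $alloc);
  my $nrows = 0;

  while (my $row = $fetch_row->()) {
      if ($nrows >= $alloc) {                        # out of space -> double the buffer
          $alloc *= 2;
          my $bigger = zeroes(double, $ncols, $alloc);
          $bigger->slice(",0:" . ($nrows - 1)) .= $buf->slice(",0:" . ($nrows - 1));
          $buf = $bigger;
      }
      $buf->slice(",($nrows)") .= pdl(@$row);        # write one row
      $nrows++;
  }
  my $pdl = $buf->slice(",0:" . ($nrows - 1))->copy; # trim unused rows
  print $pdl->info, "\n";                            # PDL: Double D [3,10000]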

--
kmx


On 14.11.2014 17:10, Doug Hunt wrote:
Hi kmx: What if you make a temporary table by selecting the subset of the table you want and then use pg_getcopydata to dump this entire temp table?
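
Something roughly like this, I mean (table/column names, the WHERE clause and the connection string are just placeholders):

  use DBI;

  my $dbh = DBI->connect("dbi:Pg:dbname=mydb", undef, undef, { RaiseError => 1 });
  $dbh->do("CREATE TEMP TABLE subset AS SELECT a, b, c FROM mytable WHERE a > 0");
  $dbh->do("COPY subset TO STDOUT");
  # ...then read the rows back with pg_getcopydata as usual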

Just a thought...

Regards,

  Doug Hunt

[email protected]
Software Engineer
UCAR - COSMIC, Tel. (303) 497-2611

On Fri, 14 Nov 2014, kmx wrote:

I have tried pg_getcopydata, but I was not able to make it better than my old approach. After many tries it was still 15-20% slower.

My guess is that pg_getcopydata(..) might be significantly faster when dumping a whole table (which I was not able to test, as the table in question was too big). When dumping the result of an SQL query, there seems to be no advantage.

I have also slightly updated my "maybe module" at https://gist.github.com/kmx/6f1234478828e7960fbd

--
kmx

On 12.11.2014 23:54, kmx wrote:
      Thanks, pg_getcopydata sounds very promising.

I'll try to implement an alternative solution based on pg_getcopydata and compare it with my current approach.

      --
      kmx

      On 12.11.2014 16:48, Vikas N Kumar wrote:
      On 11/12/2014 07:43 AM, kmx wrote:
      my $dbh = DBI->connect($dsn);
      my $pdl = pdl($dbh->selectall_arrayref($sql_query));

But it does not scale well for very large data (millions of rows).


      Hi KMX

If you're using PostgreSQL, you should use DBD::Pg's pg_getcopydata with the "COPY mytable TO STDOUT" functionality for accessing millions of rows. You can do this in async or sync mode. This will get you there faster than using selectall_arrayref(), and it lets you get the rows without having to redesign your DB.
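
The synchronous version is basically this (an untested sketch; the connection string and table name are placeholders):

      use DBI;

      my $dbh = DBI->connect("dbi:Pg:dbname=mydb", undef, undef, { RaiseError => 1 });
      $dbh->do("COPY mytable TO STDOUT");

      my ($line, @rows);
      while ($dbh->pg_getcopydata($line) >= 0) {
          chomp $line;
          push @rows, [ split /\t/, $line ];   # default COPY text format is tab-separated
      }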

      SQLite has a stream API but I am not familiar with it.

      --Vikas



_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
