Re: [R] R, PostgresSQL and poor performance

2011-12-13 Thread James Cloos
> "BD" == Berry, David  writes:

BD> All variables are reals other than id which is varchar(10) and date
BD> which is a timestamp, approximately 1.5 million rows are returned by
BD> the query and it takes order 10 second to execute using psql (the
BD> command line client for Postgres) and a similar time using pgAdmin
BD> 3. In R it takes several minutes to run and I'm unsure where the
BD> bottleneck is occurring.

You may want to test progressively smaller chunks of the data to see how
quickly R slows down as compared to psql on that query.

My first guess is that something allocating and re-allocating ram in a
quadratic (or worse) fashion.

I don't know whether OSX has anything equivilent, but you could test on
the linux box using oprofile (http://oprofile.sourceforge.net; SuSE
should have an rpm for it and kernel support compiled in) to confirm
where the time is spent.

It is /possible/ that the (sql)NULL->(r)NA logic in RS-PostgreSQL.c may
be slow (relatively speaking), but it is necessary.  Nothing else jumps
out as a possible choke point.

Oprofile (or the equivilent) would best answer the question.

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R, PostgresSQL and poor performance

2011-12-02 Thread Joe Conway
On 12/02/2011 09:46 PM, Berry, David I. wrote:
> Thanks for the reply and suggestions. I've tried the RpgSQL drivers and
> the results are pretty similar in terms of performance.
> 
> The ~1.5M records I'm trying to read into R are being extracted from a
> table with ~300M rows (and ~60 columns) that has been indexed on the
> relevant columns and horizontally partitioned (with constraint checking
> on). I do need to try and optimize the database a bit more but I don¹t
> think this is the cause of the performance issues.

With that much data you might want to consider PL/R:
  http://www.joeconway.com/plr/

HTH,

Joe


-- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R, PostgresSQL and poor performance

2011-12-02 Thread Berry, David I.
On 01/12/2011 17:01, "Gabor Grothendieck"  wrote:

>On Thu, Dec 1, 2011 at 10:02 AM, Berry, David I.  wrote:
>> Hi List
>>
>> Apologies if this isn't the correct place for this query (I've tried a
>>search of the mail archives but not had much joy).
>>
>> I'm running R (2.14.0)  on a Mac (OSX v 10.5.8, 2.66GHz, 4GB memory)
>>and am having a few performance issues with reading data in from a
>>Postres database (using RPostgreSQL). My query / code are as below
>>
>> # -
>> library('RPostgreSQL')
>>
>> drv <- dbDriver("PostgreSQL")
>>
>> dbh <- dbConnect(drv,user="Š",password="Š",dbname="Š",host="Š")
>>
>> sql <- "select id, date, lon, lat, date_trunc('day' , date) as jday,
>>extract('hour' from date) as hour, extract('year' from date) as year
>>from observations where pt = 6 and date >= '1990-01-01' and date <
>>'1995-01-01' and lon > 180 and lon < 290 and lat > -30 and lat < 30 and
>>sst is not null"
>>
>> dataIn <- dbGetQuery(dbh,sql)
>
>If this is a large table of which the desired rows are a small
>fraction of all rows then be sure there indexes on the variables in
>your where clause.
>
>You can also try it with the RpgSQL driver although there is no reason
>to think that that would be faster.
>
>-- 
>Statistics & Software Consulting
>GKX Group, GKX Associates Inc.
>tel: 1-877-GKX-GROUP
>email: ggrothendieck at gmail.com

Thanks for the reply and suggestions. I've tried the RpgSQL drivers and
the results are pretty similar in terms of performance.

The ~1.5M records I'm trying to read into R are being extracted from a
table with ~300M rows (and ~60 columns) that has been indexed on the
relevant columns and horizontally partitioned (with constraint checking
on). I do need to try and optimize the database a bit more but I don¹t
think this is the cause of the performance issues.

As an example, when I run the query purely in R it takes 273s to run
(using system.time() to time it). When I extract the data via psql and
system() and then import it into R using read.table() it takes 32s. The
code I've used for both are below. The second way of doing it (psql and
read.table()) is less than ideal but does seem to have a big performance
advantage at the moment ­ the only difference in the results is that the
date variables are stored as strings in the second example.

# Query purely in R
# 
dbh <- dbConnect(drv,user="Š",password="Š", dbname="Š",host="Š")

sql <- "select id, date, lon, lat, date_trunc('day' , date) as jday,
extract('hour' from date) as hour, extract('year' from date) as year from
observations where pt = 6 and date >= '1990-01-01' and date < '1995-01-01'
and lon > 180 and lon < 290 and lat > -30 and lat < 30 and sst is not
null;"

dataIn <- dbGetQuery(dbh,sql)



# Query via command line
# --
system('psql ­h myhost ­d mydb ­U myuid ­f getData.sql')

system('cat tmp.csv | sed 's/^,/""&/g;s/^[0-9a-zA-Z]\+/"&"/g' > tmp2.csv')
# This just ensures the first column is quoted

dataIn <- read.table('tmp2.csv',sep=',' ,col.names=c(
"id","date","lon","lat","jday","hour","year") )


# Contents of getData.sql
# -
\o ./tmp.csv
\pset format unaligned
\pset fieldsep ','
\pset tuples_only
select 
id, date, lon, lat, date_trunc('day' , date) as jday, extract('hour' 
from
date) as hour, extract('year' from date) as year
from 
observations 
where 
pt = 6 and date >= '1990-01-01' and date < '1995-01-01' and lon > 180 
and
lon < 290 and lat > -30 and lat < 30 and sst is not null;
\q


--
David Berry
National Oceanography Centre, UK



-- 
This message (and any attachments) is for the recipient only. NERC
is subject to the Freedom of Information Act 2000 and the contents
of this email and any reply you make may be disclosed by NERC unless
it is exempt from release under the Act. Any material supplied to
NERC may be stored in an electronic records management system.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R, PostgresSQL and poor performance

2011-12-01 Thread Gabor Grothendieck
On Thu, Dec 1, 2011 at 10:02 AM, Berry, David I.  wrote:
> Hi List
>
> Apologies if this isn't the correct place for this query (I've tried a search 
> of the mail archives but not had much joy).
>
> I'm running R (2.14.0)  on a Mac (OSX v 10.5.8, 2.66GHz, 4GB memory) and am 
> having a few performance issues with reading data in from a Postres database 
> (using RPostgreSQL). My query / code are as below
>
> # -
> library('RPostgreSQL')
>
> drv <- dbDriver("PostgreSQL")
>
> dbh <- dbConnect(drv,user="…",password="…",dbname="…",host="…")
>
> sql <- "select id, date, lon, lat, date_trunc('day' , date) as jday, 
> extract('hour' from date) as hour, extract('year' from date) as year from 
> observations where pt = 6 and date >= '1990-01-01' and date < '1995-01-01' 
> and lon > 180 and lon < 290 and lat > -30 and lat < 30 and sst is not null"
>
> dataIn <- dbGetQuery(dbh,sql)

If this is a large table of which the desired rows are a small
fraction of all rows then be sure there indexes on the variables in
your where clause.

You can also try it with the RpgSQL driver although there is no reason
to think that that would be faster.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] R, PostgresSQL and poor performance

2011-12-01 Thread Berry, David I.
Hi List

Apologies if this isn't the correct place for this query (I've tried a search 
of the mail archives but not had much joy).

I'm running R (2.14.0)  on a Mac (OSX v 10.5.8, 2.66GHz, 4GB memory) and am 
having a few performance issues with reading data in from a Postres database 
(using RPostgreSQL). My query / code are as below

# -
library('RPostgreSQL')

drv <- dbDriver("PostgreSQL")

dbh <- dbConnect(drv,user="…",password="…",dbname="…",host="…")

sql <- "select id, date, lon, lat, date_trunc('day' , date) as jday, 
extract('hour' from date) as hour, extract('year' from date) as year from 
observations where pt = 6 and date >= '1990-01-01' and date < '1995-01-01' and 
lon > 180 and lon < 290 and lat > -30 and lat < 30 and sst is not null"

dataIn <- dbGetQuery(dbh,sql)
# -


All variables are reals other than id which is varchar(10) and date which is a 
timestamp, approximately 1.5 million rows are returned by the query and it 
takes order 10 second to execute using psql (the command line client for 
Postgres) and a similar time using pgAdmin 3. In R it takes several minutes to 
run and I'm unsure where the bottleneck is occurring.  I don't believe it's in 
the database access as the query disappears from the server status (of pgAdmin) 
quite quickly and where it only takes ~10 seconds from psql.  I've only been 
using R for the last month or two so any tips / advice on how to speed up the 
query or how to track down the source of the poor performance would be 
appreciated.

For info, I get similar poor performance running R (2.14.0) on a linux 
workstation (Linux 2.6.16.60-0.91.1-smp x86_64, 42GB memory, 2 6 core 
processors) running SUSE.

Thanks in advance

Dave Berry.
-- 
This message (and any attachments) is for the recipient ...{{dropped:8}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.