Re: JDBC Very Slow

2016-09-16 Thread Takeshi Yamamuro
Hi,

It'd be better to set `predicates` in the jdbc arguments so that the load
runs in parallel.
See:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L200
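
For example, a rough sketch of a predicate-partitioned read (the column
name, the ranges, and the connection details below are placeholders, not
values from this thread):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")          // placeholder credentials
props.setProperty("password", "password")

// Each predicate becomes one partition, so the table is scanned over
// four JDBC connections in parallel instead of a single one.
// "id" and the ranges are hypothetical; use a real, indexed column.
val predicates = Array(
  "id >= 0 AND id < 1000000",
  "id >= 1000000 AND id < 2000000",
  "id >= 2000000 AND id < 3000000",
  "id >= 3000000")

val jdbcDF = sqlContext.read.jdbc(
  "jdbc:postgresql://dbserver:port/database",
  "schema.table",
  predicates,
  props)

jdbcDF.show()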

// maropu

On Sat, Sep 17, 2016 at 7:46 AM, Benjamin Kim  wrote:

> I am testing this in spark-shell. I am following the Spark documentation
> by simply adding the PostgreSQL driver to the Spark Classpath.
>
> SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell
>
>
> Then, I run the code below to connect to the PostgreSQL database to query.
> This is when I have problems.
>
> Thanks,
> Ben
>
>
> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
>
> Hi! Can you separate the initialization code from the command you run? I
> think that is the main problem in your code.
> On Sep 16, 2016 8:26 PM, "Benjamin Kim" 
> wrote:
>
>> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
>> data from PostgreSQL using JDBC? I can get to the table and see the schema,
>> but when I do a show, it takes very long or keeps timing out.
>>
>> The code is simple.
>>
>> val jdbcDF = sqlContext.read.format("jdbc").options(
>>   Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>>       "dbtable" -> "schema.table")).load()
>>
>> jdbcDF.show
>>
>>
>> If anyone can help, please let me know.
>>
>> Thanks,
>> Ben
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim
I am testing this in spark-shell. I am following the Spark documentation by 
simply adding the PostgreSQL driver to the Spark Classpath.

SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell

Then, I run the code below to connect to the PostgreSQL database to query. This 
is when I have problems.

Thanks,
Ben


> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
> 
> Hi! Can you separate the initialization code from the command you run? I think
> that is the main problem in your code.
> 
> On Sep 16, 2016 8:26 PM, "Benjamin Kim"  wrote:
> Has anyone using Spark 1.6.2 encountered very slow responses from pulling 
> data from PostgreSQL using JDBC? I can get to the table and see the schema, 
> but when I do a show, it takes very long or keeps timing out.
> 
> The code is simple.
> 
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>       "dbtable" -> "schema.table")).load()
> 
> jdbcDF.show
> 
> If anyone can help, please let me know.
> 
> Thanks,
> Ben
> 



Re: JDBC Very Slow

2016-09-16 Thread Nikolay Zhebet
Hi! Can you separate the initialization code from the command you run? I
think that is the main problem in your code.
On Sep 16, 2016 8:26 PM, "Benjamin Kim" 
wrote:

> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
> data from PostgreSQL using JDBC? I can get to the table and see the schema,
> but when I do a show, it takes very long or keeps timing out.
>
> The code is simple.
>
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>       "dbtable" -> "schema.table")).load()
>
> jdbcDF.show
>
>
> If anyone can help, please let me know.
>
> Thanks,
> Ben
>
>


JDBC Very Slow

2016-09-16 Thread Benjamin Kim
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data 
from PostgreSQL using JDBC? I can get to the table and see the schema, but when 
I do a show, it takes very long or keeps timing out.

The code is simple.

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
      "dbtable" -> "schema.table")).load()

jdbcDF.show

If anyone can help, please let me know.

Thanks,
Ben



Re: DataFrame/JDBC very slow performance

2015-08-26 Thread Dhaval Patel
Thanks Michael, much appreciated!

Nothing should be held in memory for a query like this (other than a single
count per partition), so I don't think that is the problem.  There is
likely an error buried somewhere.

Regarding your comments above - I don't get any error, just NULL as the
return value. I have tried digging deeper in the logs etc. but couldn't spot
anything. Are there any other suggestions for spotting such buried errors?

Thanks,
Dhaval

On Mon, Aug 24, 2015 at 6:38 PM, Michael Armbrust mich...@databricks.com
wrote:

 Much appreciated! I am not comparing with select count(*) for
 performance, but it was one simple thing I tried to check the performance
 :). I think it now makes sense since Spark tries to extract all records
 before doing the count. I thought having an aggregated function query
 submitted over JDBC/Teradata would let Teradata do the heavy lifting.


 We currently only push down filters, since there is a lot of variability in
 what types of aggregations various databases support.  You can manually
 push down whatever you want by replacing the table name with a subquery
 (i.e. (SELECT ... FROM ...)).

- How come my second query for (5B) records didn't return anything
 even after long processing? If I understood correctly, Spark would try to
 fit it in memory and, if not, might use disk space, which I have
 available?


 Nothing should be held in memory for a query like this (other than a
 single count per partition), so I don't think that is the problem.  There
 is likely an error buried somewhere.


  - Am I supposed to do any Spark-related tuning to make it work?

 My main need is to access data from these large table(s) on demand and
 provide aggregated and calculated results much more quickly; for that I was
 trying out Spark. As a next step I am thinking of exporting the data to
 Parquet files and giving it a try. Do you have any suggestions for dealing
 with the problem?


 Exporting to Parquet will likely be a faster option than trying to query
 through JDBC, since we have many more opportunities for parallelism here.



DataFrame/JDBC very slow performance

2015-08-24 Thread Dhaval Patel
I am trying to access a mid-size Teradata table (~100 million rows) via
JDBC in standalone mode on a single node (local[*]). When I tried with a BIG
table (5B records), no results were returned upon completion of the query.

I am using Spark 1.4.1, and it is set up on a very powerful machine (2 CPUs,
24 cores, 126 GB RAM).

I have tried several memory setup and tuning options to make it work
faster, but none of them made a huge impact.

I am sure there is something I am missing. Below is my final try, which
took about 11 minutes to get this simple count, versus only 40 seconds
to get the count using a JDBC connection through R.


bin/pyspark --driver-memory 40g --executor-memory 40g

df = sqlContext.read.jdbc("jdbc:teradata://..")
df.count()




Re: DataFrame/JDBC very slow performance

2015-08-24 Thread Michael Armbrust

 Much appreciated! I am not comparing with select count(*) for
 performance, but it was one simple thing I tried to check the performance
 :). I think it now makes sense since Spark tries to extract all records
 before doing the count. I thought having an aggregated function query
 submitted over JDBC/Teradata would let Teradata do the heavy lifting.


We currently only push down filters, since there is a lot of variability in
what types of aggregations various databases support.  You can manually
push down whatever you want by replacing the table name with a subquery
(i.e. (SELECT ... FROM ...)).
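
As a hedged sketch of that subquery trick (the connection URL, column, and
table names below are placeholders; note that most databases require an
alias on the derived table):

val countsDF = sqlContext.read.format("jdbc").options(Map(
  // Placeholder URL; substitute your real JDBC URL and credentials.
  "url" -> "jdbc:teradata://dbserver/database",
  // The aggregation runs inside the database because the "table" is a subquery.
  // "t" is the alias for the derived table.
  "dbtable" -> "(SELECT count(*) AS cnt FROM schema.big_table) t"
)).load()

countsDF.show()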

   - How come my second query for (5B) records didn't return anything
 even after long processing? If I understood correctly, Spark would try to
 fit it in memory and, if not, might use disk space, which I have
 available?


Nothing should be held in memory for a query like this (other than a single
count per partition), so I don't think that is the problem.  There is
likely an error buried somewhere.


  - Am I supposed to do any Spark-related tuning to make it work?

 My main need is to access data from these large table(s) on demand and
 provide aggregated and calculated results much more quickly; for that I was
 trying out Spark. As a next step I am thinking of exporting the data to
 Parquet files and giving it a try. Do you have any suggestions for dealing
 with the problem?


Exporting to Parquet will likely be a faster option than trying to query
through JDBC, since we have many more opportunities for parallelism here.
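
For what it's worth, a rough sketch of that export-once approach (the path
and the DataFrame name are placeholders, not from the original thread):

// One-time export: pull the table over JDBC and persist it as Parquet.
// "jdbcDF" is the JDBC-backed DataFrame from the earlier read; the path is a placeholder.
jdbcDF.write.parquet("hdfs:///data/big_table.parquet")

// Later queries hit the Parquet copy instead of the database, so scans,
// counts, and aggregations are split across many parallel tasks.
val parquetDF = sqlContext.read.parquet("hdfs:///data/big_table.parquet")
parquetDF.count()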