Re: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread shenyan zhen
Hi Saif,

Are you using JdbcRDD directly from Spark?
If yes, then the poor distribution could be due to the bound key you used.

See the JdbcRDD Scala doc at
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD:

sql: the text of the query. The query must contain two ? placeholders for
parameters used to partition the results, e.g. select title, author from
books where ? <= id and id <= ?

lowerBound: the minimum value of the first placeholder

upperBound: the maximum value of the second placeholder. The lower and upper
bounds are inclusive.

numPartitions: the number of partitions. Given a lowerBound of 1, an upperBound
of 20, and a numPartitions of 2, the query would be executed twice, once with
(1, 10) and once with (11, 20).
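
To make that concrete, here is a minimal sketch of a JdbcRDD built with those
parameters (the connection URL, credentials, table, and the id range below are
placeholders, not taken from your actual setup):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Hypothetical example: URL, credentials, table and id range are placeholders.
def booksRdd(sc: SparkContext) = {
  val url = "jdbc:postgresql://dbhost/mydb"
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(url, "user", "secret"),
    "SELECT title, author FROM books WHERE ? <= id AND id <= ?",
    lowerBound = 1L,        // smallest id the query should cover
    upperBound = 1000000L,  // largest id the query should cover (inclusive)
    numPartitions = 32,     // [lowerBound, upperBound] is split evenly into 32 ranges
    mapRow = (rs: ResultSet) => (rs.getString(1), rs.getString(2)))
}

The 32 partitions only come out balanced if the values of the bound column are
spread roughly evenly between lowerBound and upperBound; a skewed column
produces exactly the kind of imbalance you describe.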

Shenyan


On Tue, Jul 28, 2015 at 2:41 PM, saif.a.ell...@wellsfargo.com wrote:

  Hi all,

 I am experimenting with and learning about performance on big tasks locally, on a
 32-core node with more than 64 GB of RAM. Data is loaded from a database
 through a JDBC driver, and I launch heavy computations against it. I am
 presented with two questions:


1. My RDD is poorly distributed. I am partitioning it into 32 pieces, but the
first 31 pieces are extremely lightweight compared to piece 32:


 15/07/28 13:37:48 INFO Executor: Finished task 30.0 in stage 0.0 (TID 30).
 1419 bytes result sent to driver
 15/07/28 13:37:48 INFO TaskSetManager: Starting task 31.0 in stage 0.0
 (TID 31, localhost, PROCESS_LOCAL, 1539 bytes)
 15/07/28 13:37:48 INFO Executor: Running task 31.0 in stage 0.0 (TID 31)
 15/07/28 13:37:48 INFO TaskSetManager: Finished task 30.0 in stage 0.0
 (TID 30) in 2798 ms on localhost (31/32)
 15/07/28 13:37:48 INFO CacheManager: Partition rdd_2_31 not found,
 computing it
 ...All the pieces take 3 seconds, while the last one takes around 15 minutes to
 compute...

 Is there anything I can do about this? Preferably without reshuffling,
 i.e. via the DataFrameReader JDBC options (lowerBound, upperBound, partition
 column).


2. After a long time of processing, I sometimes get OOMs, and I fail to find
a how-to for falling back and retrying from already persisted data to avoid
losing time.


 Thanks,
 Saif




RE: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread Saif.A.Ellafi
Thank you for your response, Zhen.

I am using some vendor-specific JDBC driver JAR file (honestly, I don't know
where it came from). Its API is NOT like JdbcRDD; instead, it is more like the
jdbc method of DataFrameReader:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

So I ask two questions now:

1.   Will running a query using JdbcRDD prove better than bringing in an
entire table as a DataFrame? Later on, I convert back to RDDs.

2.   I lack proper criteria for deciding on a suitable column for
distribution. My table has more than 400 columns.

Saif





Re: Fighting against performance: JDBC RDD badly distributed

2015-07-28 Thread shenyan zhen
Saif,

I am guessing at, but not sure of, your use case. Are you retrieving the entire
table into Spark? If yes, do you have a primary key on your table?
If you do, then JdbcRDD should be efficient. DataFrameReader.jdbc gives
you more options; again, it depends on your use case.
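
For what it is worth, here is a rough sketch of the partitioned
DataFrameReader.jdbc call (the URL, credentials, table and column names below
are placeholders; the bounds would come from the actual min/max of whichever
numeric column you pick):

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical example: URL, credentials, table and column names are placeholders.
def loadTable(sqlContext: SQLContext): DataFrame = {
  val props = new Properties()
  props.setProperty("user", "user")
  props.setProperty("password", "secret")
  sqlContext.read.jdbc(
    "jdbc:oracle:thin:@//dbhost:1521/service", // vendor-specific JDBC URL
    "my_table",                                // table to load
    "id",                                      // partition column: numeric, evenly spread, ideally indexed
    1L,                                        // lowerBound: roughly MIN(id)
    1000000L,                                  // upperBound: roughly MAX(id)
    32,                                        // numPartitions
    props)
}

A quick SELECT MIN(col), MAX(col) against the table is usually enough to fill
in the bounds so that the 32 slices cover the data evenly.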

Would it be possible for you to describe your objective and show a code snippet?

Shenyan

