Hi Mich,

Thank you for the detailed explanation. One more question:
In my cluster, I have one master and 4 workers. In this case, will 4 connections be opened to Oracle?

Regards,
Rajesh

On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> It happens that the number of parallel processes open from Spark to the
> RDBMS is determined by the number of executors.
>
> I just tested this. With YARN client mode using two executors, I see two
> connections to the RDBMS:
>
>  EXECUTIONS USERNAME   SID SERIAL# USERS_EXECUTING SQL_TEXT
>  ---------- ---------- --- ------- --------------- --------
>           1 SCRATCHPAD 443   62565               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC"
>                                                    FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>                                                          to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>                                                          RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>                                                    WHERE ID >= 23000001 AND ID < 24000001
>           1 SCRATCHPAD 406   46793               1 SELECT "RANDOMISED","RANDOM_STRING","PADDING","CLUSTERED","ID","SCATTERED","SMALL_VC"
>                                                    FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>                                                          to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>                                                          RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>                                                    WHERE ID >= 24000001 AND ID < 25000001
>
> So it sounds like (can someone else independently confirm this?) that,
> regardless of what one specifies in "numPartitions", one ends up with one
> connection per Spark executor to the RDBMS.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
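One way to check both numbers under discussion from a spark-shell session -- a
minimal sketch only, reusing the _ORACLEserver, _username, _password and
scratchpad.dummy placeholders from the thread, with HiveContext as defined
there:

    val s = HiveContext.read.format("jdbc").options(Map(
      "url"             -> _ORACLEserver,
      "dbtable"         -> "(SELECT to_char(ID) AS ID, RANDOM_STRING, SMALL_VC FROM scratchpad.dummy)",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> "100000000",
      "numPartitions"   -> "100",
      "user"            -> _username,
      "password"        -> _password)).load

    // How the key range was split: one partition (and, eventually, one JDBC
    // connection) per slice of the ID range.
    println(s.rdd.getNumPartitions)      // 100

    // How many of those slices can be read at the same time: roughly the total
    // number of executor task slots, which is what caps the number of Oracle
    // sessions visible at any instant.
    println(sc.defaultParallelism)

With numPartitions=100 but only a handful of task slots, seeing only a few
Oracle sessions at a time would be the expected behaviour rather than a limit
imposed by the JDBC source itself.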
> On 15 August 2016 at 09:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi.
>>
>> This is a very good question. I did some tests on this.
>>
>> If you are joining two tables, then you are creating a result set based
>> on some conditions. In that case what I normally do is specify an ID
>> column from either table and base my partitioning on that column. This
>> is pretty straightforward: bring back your ID column and base your lower
>> and upper limits on that ID value:
>>
>>   "partitionColumn" -> "ID",
>>   "lowerBound"      -> "1",
>>   "upperBound"      -> "100000000",
>>   "numPartitions"   -> "100",
>>
>> Also I have noticed that, regardless of the number of partitions you
>> specify in the JDBC call, the number of parallel connections will be
>> limited and the result set will be partitioned accordingly. For example,
>> with numPartitions=100 I see only 8 connections in Oracle coming from
>> Spark.
>>
>> scala> val s = HiveContext.read.format("jdbc").options(
>>      |   Map("url" -> _ORACLEserver,
>>      |   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>      |   "partitionColumn" -> "ID",
>>      |   "lowerBound" -> "1",
>>      |   "upperBound" -> "100000000",
>>      |   "numPartitions" -> "100",
>>      |   "user" -> _username,
>>      |   "password" -> _password)).load
>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5 more fields]
>>
>> scala> s.toJavaRDD.partitions.size()
>> res1: Int = 100
>>
>> This also seems to set the number of partitions. I still think that the
>> emphasis has to be on getting the data out of the RDBMS as quickly as
>> possible. The partitioning does work: below, the login SCRATCHPAD has
>> multiple connections to Oracle and each does its range selection OK.
>>
>>           1 SCRATCHPAD  45   43048               1 SELECT "SMALL_VC","CLUSTERED","PADDING","RANDOM_STRING","ID","SCATTERED","RANDOMISED"
>>                                                     FROM (SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
>>                                                           to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
>>                                                           RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)
>>                                                     *WHERE ID >= 16000001 AND ID < 17000001*
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 15 August 2016 at 08:18, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I would suggest you look at Sqoop as well. Essentially, you can provide
>>> a split-by/partition column, using which the data will be distributed
>>> among your stated number of mappers.
>>>
>>> On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> I have a question below.
>>>>
>>>> I want to join two tables and return the result based on an input
>>>> value. In this case, how do we specify the lower bound and upper bound
>>>> values?
>>>>
>>>>   select t1.id, t1.name, t2.course, t2.qualification
>>>>   from t1, t2
>>>>   where t1.transactionid = *11111* and t1.id = t2.id
>>>>
>>>> *11111 => dynamic input value.*
>>>>
>>>> Regards,
>>>> Rajesh
>>>>
>>>> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> If you have your RDBMS table partitioned, then you need to consider
>>>>> how much data you want to extract, in other words the result set
>>>>> returned by the JDBC call.
>>>>>
>>>>> If you want all the data, then the number of partitions specified in
>>>>> the JDBC call should be equal to the number of partitions in your
>>>>> RDBMS table.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
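For Rajesh's join with a dynamic transaction id, one possible pattern -- a
sketch only, with the t1/t2 table and column names taken from his query and
everything else (bounds, numPartitions, placeholders) purely illustrative --
is to push the filter and the join into the dbtable subquery, then derive the
lowerBound/upperBound for the ID partition column from the filtered key range:

    // The dynamic input value.
    val txnId = 11111

    // 1) A small, unpartitioned query to find the ID range of the filtered
    //    rows (assumes the filter matches at least one row).
    val bounds = HiveContext.read.format("jdbc").options(Map(
      "url"      -> _ORACLEserver,
      "dbtable"  -> s"(SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM t1 WHERE transactionid = $txnId)",
      "user"     -> _username,
      "password" -> _password)).load.collect()(0)

    // Oracle NUMBER columns typically arrive as decimals.
    val lower = bounds.getDecimal(0).longValue
    val upper = bounds.getDecimal(1).longValue

    // 2) The partitioned read of the join result, split on t1.id within that range.
    val joined = HiveContext.read.format("jdbc").options(Map(
      "url"             -> _ORACLEserver,
      "dbtable"         -> s"(SELECT t1.id, t1.name, t2.course, t2.qualification FROM t1, t2 WHERE t1.transactionid = $txnId AND t1.id = t2.id)",
      "partitionColumn" -> "id",
      "lowerBound"      -> lower.toString,
      "upperBound"      -> upper.toString,
      "numPartitions"   -> "10",
      "user"            -> _username,
      "password"        -> _password)).load

The extra round trip for the bounds is cheap compared with the main read, and
it keeps the partition splits inside the range that the dynamic filter
actually selects.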
>>>>> On 14 August 2016 at 21:44, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>>
>>>>>> Thank you very much, sir.
>>>>>>
>>>>>> I forgot to mention that two of these Oracle tables are range
>>>>>> partitioned. In that case, what would be the optimum number of
>>>>>> partitions, if you can share?
>>>>>>
>>>>>> Warmest
>>>>>>
>>>>>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> If you have primary keys on these tables, then you can parallelise
>>>>>> the process of reading the data.
>>>>>>
>>>>>> You have to be careful not to set the number of partitions too high.
>>>>>> There is certainly a balance between the number of partitions
>>>>>> supplied to JDBC and the load on the network and the source DB.
>>>>>>
>>>>>> Assuming that your underlying table has primary key ID, this will
>>>>>> create 20 parallel processes against the Oracle DB:
>>>>>>
>>>>>>   val d = HiveContext.read.format("jdbc").options(
>>>>>>     Map("url" -> _ORACLEserver,
>>>>>>     "dbtable" -> "(SELECT <COL1>, <COL2>, .... FROM <TABLE>)",
>>>>>>     "partitionColumn" -> "ID",
>>>>>>     "lowerBound" -> "1",
>>>>>>     "upperBound" -> "maxID",
>>>>>>     "numPartitions" -> "20",
>>>>>>     "user" -> _username,
>>>>>>     "password" -> _password)).load
>>>>>>
>>>>>> assuming your upper bound on ID is maxID.
>>>>>>
>>>>>> This will open multiple connections to the RDBMS, each getting a
>>>>>> subset of the data that you want. You need to test it to ensure that
>>>>>> numPartitions is optimal and you don't overload any component.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 14 August 2016 at 21:15, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> There are 4 tables ranging from 10 million to 100 million rows, but
>>>>>> they all have primary keys.
>>>>>>
>>>>>> The network is fine, but our Oracle is RAC and we can only connect to
>>>>>> a designated Oracle node (where we have a DQ account only).
>>>>>>
>>>>>> We have a limited time window of a few hours to get the required data
>>>>>> out.
>>>>>>
>>>>>> Thanks
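The "maxID" upper bound in the example above has to be supplied as an actual
number. A small sketch of one way to obtain it at run time (the <TABLE> and
column placeholders are the same ones used in that example and must be
substituted before running):

    // One cheap query to pick up the current maximum of the primary key.
    // Oracle NUMBER columns typically arrive as decimals.
    val maxRow = HiveContext.read.format("jdbc").options(Map(
      "url"      -> _ORACLEserver,
      "dbtable"  -> "(SELECT MAX(ID) AS MAX_ID FROM <TABLE>)",
      "user"     -> _username,
      "password" -> _password)).load.collect()(0)

    val maxID = maxRow.getDecimal(0).longValue

    // Then plug it into the partitioned read, as in the example above.
    val d = HiveContext.read.format("jdbc").options(Map(
      "url"             -> _ORACLEserver,
      "dbtable"         -> "(SELECT <COL1>, <COL2> FROM <TABLE>)",
      "partitionColumn" -> "ID",
      "lowerBound"      -> "1",
      "upperBound"      -> maxID.toString,
      "numPartitions"   -> "20",
      "user"            -> _username,
      "password"        -> _password)).load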
>>>>>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> How big are your tables, and is there any issue with the network
>>>>>> between your Spark nodes and your Oracle DB that adds to the problem?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 14 August 2016 at 20:50, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>> Hi Gurus,
>>>>>>
>>>>>> I have a few large tables in an RDBMS (ours is Oracle). We want to
>>>>>> access these tables through Spark JDBC.
>>>>>>
>>>>>> What is the quickest way of getting the data into a Spark DataFrame,
>>>>>> say with multiple connections from Spark?
>>>>>>
>>>>>> Thanking you
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
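Since two of the Oracle tables are range partitioned, one further option worth
noting -- a sketch only; the schema/table name and the ID ranges below are
purely illustrative -- is the predicates form of the JDBC reader, where each
WHERE-clause fragment becomes one Spark partition (and one connection), so the
split can be lined up with the Oracle range partitions:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", _username)
    props.setProperty("password", _password)

    // One WHERE-clause fragment per Oracle range partition (illustrative bounds).
    val predicates = Array(
      "ID >=        1 AND ID < 25000001",
      "ID >= 25000001 AND ID < 50000001",
      "ID >= 50000001 AND ID < 75000001",
      "ID >= 75000001")

    val df = HiveContext.read.jdbc(_ORACLEserver, "<SCHEMA>.<TABLE>", predicates, props)
    // df ends up with predicates.length partitions, each read over its own
    // JDBC connection.

This form does not need a numeric partition column, and if the tables are
range partitioned on ID, each generated query can be pruned by Oracle to its
matching partition.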